In helping its customers deal with the growing variety and volatility of data sources, CRIF has encountered several recurring concerns that arise when planning a new ML classification project. Here are some of them:
This is the most frequently asked question, and the answer is not straightforward because it depends on many factors. The fact that the data used in the categorization process is highly regulated makes things even more difficult.
In general, to create a categorization engine, the data sample must be “representative”:
To verify that the available data is suitable, an initial statistical check is recommended to confirm that these requirements are met, so that the engine performs as expected in every scenario.
The evaluation of performance is essential for the continuous improvement of the categorization engine. CRIF has therefore put considerable effort into studying and defining state-of-the-art metrics that inspect every part of the system’s algorithm. It has presented a summary of the most important metrics for multiclass classification problems to the scientific community (for more information, see the CRIF paper Metrics for Multiclass Classification: an Overview) and has developed accountability tools to study the algorithm.
Among all the metrics, the two most important KPIs from a business perspective are Coverage and Accuracy:
Transaction data, by its very nature, is constantly evolving: new merchants enter the market every day, and spending habits can change dramatically (think of the impact of the pandemic on food deliveries and, more generally, online shopping). For the same reason, the categorization engine should not be thought of as a static model, but as a product that needs to be constantly tuned and maintained to keep a high level of performance. CRIF models are frequently monitored and fine-tuned, and this constant evolution keeps the algorithms used by the categorization engine at the cutting edge of technology.
At first glance, rule-based classification systems may seem more effective: you have absolute certainty of the results and full explainability. In practice, defining these rules and their hierarchy is not an easy task. For example, a rule that maps the keyword “tax” to the “taxes” category could incorrectly categorize “taxi” as a tax instead of transportation. A rule-based system also raises performance issues, since rules must be processed one by one until a match is found, so the more rules there are, the greater the computation time.
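The “tax” versus “taxi” pitfall can be illustrated with a minimal sketch. The rule list, the function names, and the sample descriptions are hypothetical and are not taken from the CRIF engine; the sketch only assumes that rules are checked in order and that the first match wins.

```python
import re

# Hypothetical ordered rule list (keyword, category); the first match wins.
RULES = [
    ("tax", "taxes"),
    ("taxi", "transportation"),
]


def categorize_substring(description):
    """Naive rule engine: plain substring match, rules checked one by one."""
    text = description.lower()
    for keyword, category in RULES:
        if keyword in text:
            return category
    return None  # no rule matched


def categorize_whole_word(description):
    """Same rules, but a keyword must match a whole word."""
    text = description.lower()
    for keyword, category in RULES:
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return category
    return None  # no rule matched


# "taxi" contains "tax", so the substring rule fires first and mislabels it.
print(categorize_substring("TAXI RIDE MILAN"))
# Requiring whole-word matches avoids this particular collision.
print(categorize_whole_word("TAXI RIDE MILAN"))
```

Whole-word matching fixes this one case, but every such refinement adds another rule or exception to maintain, and each transaction still walks the rule list sequentially until something matches.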
A machine learning model can differentiate between ambiguous cases by using the other elements of the transaction, such as the description and the amount. It can therefore achieve better classification results when the available structured data is limited. In addition, artificial intelligence allows the solution to be automated and scaled, with continuous learning over hundreds of millions of transactions, which would be impossible with human-defined rules alone.
Finally, CRIF’s experience over the past few years suggests that the most effective approach is a hybrid one: rules are more effective when rich metadata is available and can be used to uniquely associate a category with a specific value of a variable, while machine learning excels when only limited, unstructured information is available.
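One possible shape for such a hybrid is sketched below, under stated assumptions: the `mcc` field (a merchant category code), the lookup table, and the `toy_model` stand-in are all hypothetical and only illustrate the idea of trying a metadata rule first and falling back to a model when no rule applies.

```python
from typing import Callable, Optional, Tuple

# Hypothetical lookup: a structured metadata value that uniquely
# identifies a category (here a merchant category code).
MCC_TO_CATEGORY = {
    "5411": "groceries",
    "4121": "transportation",
}


def rule_category(transaction: dict) -> Optional[str]:
    """Return a category only when the metadata determines it uniquely."""
    return MCC_TO_CATEGORY.get(transaction.get("mcc"))


def hybrid_categorize(
    transaction: dict, model: Callable[[str], str]
) -> Tuple[str, str]:
    """Try the metadata rule first; otherwise ask the model."""
    category = rule_category(transaction)
    if category is not None:
        return category, "rule"
    return model(transaction["description"]), "model"


def toy_model(description: str) -> str:
    """Stand-in for a trained classifier (assumption, not a real model)."""
    return "transportation" if "taxi" in description.lower() else "other"


print(hybrid_categorize({"mcc": "5411", "description": "POS 0042"}, toy_model))
print(hybrid_categorize({"description": "Taxi ride Milan"}, toy_model))
```

Returning the source of the decision alongside the category keeps the rule-based path fully explainable while still letting the model handle transactions that carry no usable metadata.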
The CRIF Categorization Engine is made up of two separate components:
The Categorization Trainer is a web application where the user can manually assign a category to a set of banking transactions. The labeled transactions are used by a supervised learning algorithm to create a prediction model.
Since labeling is the most time-consuming task, the Categorization Trainer provides a series of automated processes to reduce the labeling effort as much as possible. Banking transactions are usually similar to each other except for just a few fields, e.g., the transaction date. The first step in the training process is to identify similar transactions and group them.
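The grouping step can be sketched as follows. This is a minimal illustration and not the Categorization Trainer's actual method: it assumes that transactions differ mainly in numeric fields such as dates and reference numbers, and it masks every run of digits before grouping. The function names and sample descriptions are hypothetical.

```python
import re
from collections import defaultdict


def grouping_key(description):
    """Normalize a description so that near-identical transactions collide."""
    text = description.lower()
    text = re.sub(r"\d+", "#", text)          # mask dates, amounts, reference numbers
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text


def group_transactions(descriptions):
    """Group raw descriptions by their normalized key."""
    groups = defaultdict(list)
    for description in descriptions:
        groups[grouping_key(description)].append(description)
    return dict(groups)


sample = [
    "CARD PAYMENT 12/03/2023 ACME STORE",
    "CARD PAYMENT 15/04/2023 ACME STORE",
    "SALARY MARCH",
]
for key, members in group_transactions(sample).items():
    print(key, len(members))
```

With groups like these, a single manual label can be applied to every transaction in the group, which is where the reduction in labeling effort comes from.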
Once the transactions have been grouped, a set of groups is selected by means of predicted labels, if available, or similar characteristics, and is passed to the users to be manually categorized. Once the groups have been manually categorized, the anomaly detection system analyzes the consistency between different transactions with similar features. Any anomalies it highlights are sent back for an additional check. Each time a selection process runs, a new model is generated from the categorized transactions. The model is then run against the uncategorized documents, and the ones categorized with the lowest confidence are sent back for the manual categorization step.
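Two steps of this loop, the consistency check and the low-confidence selection, can be sketched as below. Both functions are assumptions made for illustration: the source does not say how anomalies are detected, so a simple majority vote within each group is used here, and the record layout (`id`, `label`, `confidence`) is hypothetical.

```python
from collections import Counter, defaultdict


def flag_inconsistent(labeled):
    """Flag transactions whose label disagrees with their group's majority.

    `labeled` is a list of (group_key, transaction_id, label) tuples.
    """
    by_group = defaultdict(list)
    for group_key, tx_id, label in labeled:
        by_group[group_key].append((tx_id, label))

    flagged = []
    for members in by_group.values():
        majority, _ = Counter(label for _, label in members).most_common(1)[0]
        flagged.extend(tx_id for tx_id, label in members if label != majority)
    return flagged


def select_lowest_confidence(predictions, batch_size):
    """Pick the predictions the model is least sure about for manual review."""
    return sorted(predictions, key=lambda p: p["confidence"])[:batch_size]


labeled = [
    ("g1", 1, "groceries"),
    ("g1", 2, "groceries"),
    ("g1", 3, "taxes"),
    ("g2", 4, "transportation"),
]
print(flag_inconsistent(labeled))

predictions = [
    {"id": 10, "label": "groceries", "confidence": 0.9},
    {"id": 11, "label": "taxes", "confidence": 0.4},
    {"id": 12, "label": "other", "confidence": 0.7},
]
print([p["id"] for p in select_lowest_confidence(predictions, 2)])
```

Sending the least confident predictions back to the users concentrates manual effort on the transactions where a new label is most likely to change the next model.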