Big Data, driven by the rapid growth of data sources both structured and unstructured, is becoming a main driver of innovation in the banking sector.
Investments in large-scale data analysis in the banking sector totaled $20.8 billion in 2016. This makes the domain one of the main consumers of Big Data services and a growing market for Big Data architects, solutions and custom tools.
Machine learning thus makes it possible to exploit these large amounts of data in banking application fields such as credit risk.
Applying Machine Learning techniques to individual customer data, combined with more traditional data, now allows banks to identify strategies for managing credit risk and to assess the different types of risk that threaten them.
AI is applied to better assess risk at the individual client level; this assessment can then feed the estimation of more complex risks, such as the credit risk of an entire economy.
The trend of Big Data
Data growth is accelerating: the amount of data generated per minute was projected to increase by 700% by 2020. Banking data will be a cornerstone of this data deluge, and whoever is able to process it will have a strong competitive advantage.
Let us take a look at a concrete example of the use of Machine Learning in the banking sector, more specifically for credit risk modeling.
The germancredit data set (bank data in Germany) contains 1000 rows with 20 features.
In this data set, each entry represents a person who receives a credit and is classified as a good or bad payer via the “creditability” column.
Our goal in this project is, therefore, to implement a score that models creditability.
- Default Frequency
As we can see, our dataset contains more than 30% default, which is quite high.
Credit scoring allows a lender to grant or deny a credit request based on the customer’s score. Anyone who has ever borrowed money, whether to apply for a credit card or to buy a car, a house or another personal loan, has a credit report.
Lenders use the scores to determine who is eligible for a loan, at what interest rate and what credit amount limits. The higher the score, the more confidence a lender can have in the customer’s creditworthiness. However, a score is not part of a regular credit report. There is a mathematical formula that translates the credit report data into a three-digit number that lenders use to make credit decisions.
The aim here is to use credit assessment techniques to assess the risk associated with loans granted to a particular client and to develop a scoring model.
Credit evaluation is the application of a statistical model to assign a risk rating to a credit application and is a form of artificial intelligence, based on predictive modeling, that evaluates the probability of a customer defaulting on a payment.
Over the years, a number of different modeling techniques for the implementation of credit rating have evolved. Despite the diversity, the Scorecard scoring model stands out and is used by nearly 90% of companies. The use of new machine learning methods makes it possible to have hybrid probabilistic models for estimating customer risk.
Before building the model, two steps are necessary: first, calculate the Weight of Evidence (WoE); second, calculate the Information Value (IV) based on the WoE values.
For the verification of the results, we use the WoE values. After binning the continuous and discrete variables into categories, we compute the WoE of each category; the categorical variables are then replaced by their WoE values, which are later used to build the regression model. The WoE formula is as follows: WoE = ln( % of good payers in the category / % of bad payers in the category ).
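As a minimal sketch of this calculation (the function and variable names below are illustrative, not the article’s code), the WoE of each category can be computed from the per-category shares of good and bad payers:

```r
# Minimal WoE sketch: 'target' is 1 for good payers, 0 for bad payers,
# and 'x' is a categorical variable (illustrative data, not germancredit).
woe_table <- function(x, target) {
  tab <- table(x, target)
  dist_good <- tab[, "1"] / sum(tab[, "1"])  # share of good payers per category
  dist_bad  <- tab[, "0"] / sum(tab[, "0"])  # share of bad payers per category
  data.frame(category = rownames(tab),
             dist_good = as.numeric(dist_good),
             dist_bad  = as.numeric(dist_bad),
             woe       = log(as.numeric(dist_good) / as.numeric(dist_bad)))
}

woe_table(x      = c("A", "A", "B", "B", "B", "C", "C"),
          target = c(1, 0, 1, 1, 0, 1, 0))
```

A positive WoE means the category concentrates more good payers than bad ones; a negative WoE means the opposite.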
Computing the WoE of each category lets us check that the variable follows a coherent overall trend and that there are no anomalies in the data. The logical relationship between the WoE values and default makes the score weights fully interpretable, since the points reflect the logic of the model.
Information Value (IV)
The Information Value comes from information theory and is used to evaluate the overall predictive power of a variable. It is measured using the following formula: IV = Σ over categories of (DistrGood – DistrBad) × WoE.
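As a small sketch of this formula (illustrative names, not the article’s code), where dist_good and dist_bad are the per-category shares of good and bad payers from the WoE step:

```r
# Information Value: sum over categories of (dist_good - dist_bad) * WoE
iv_value <- function(dist_good, dist_bad) {
  sum((dist_good - dist_bad) * log(dist_good / dist_bad))
}

iv_value(dist_good = c(0.25, 0.50, 0.25),
         dist_bad  = c(1, 1, 1) / 3)   # ~ 0.116, medium predictive power
```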
| IV | Predictive power |
|---|---|
| < 0.02 | useless for prediction |
| 0.02 – 0.1 | weak |
| 0.1 – 0.3 | medium |
| 0.3 – 0.5 | strong |
| > 0.5 | suspicious, too good to be true |
- IV results
iv = iv(data, y = 'creditability') %>%
  as_tibble() %>%
  mutate(info_value = round(info_value, 3)) %>%
  arrange(desc(info_value))

iv %>% knitr::kable()
Logistic Regression on transformed variables:
We now apply logistic regression to our WoE-transformed data.
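The regression step can be sketched on synthetic WoE-encoded features (the variable names and coefficients below are illustrative, not the germancredit columns):

```r
# Logistic regression on WoE-transformed predictors (synthetic sketch).
set.seed(42)
n <- 500
woe_duration <- rnorm(n)   # pretend WoE-encoded features
woe_amount   <- rnorm(n)
p_default <- plogis(-1 + 0.8 * woe_duration + 0.5 * woe_amount)
default   <- rbinom(n, 1, p_default)

model <- glm(default ~ woe_duration + woe_amount, family = binomial())
coef(model)   # intercept and WoE weights: these feed the point table later
```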
- Calculation of points from logit & Odds
Recall that the logit can be written as: logit(p) = ln( p / (1 − p) ).
The logit is therefore the logarithm of the ratio between the probability of defaulting on a payment and the probability of not defaulting.
Logistic regression models are linear models, in that the probability of prediction transformed into logit is a linear function of the values of the predictor variables.
A scoring model derived this way has the desirable property that credit risk is a linear function of the predictors; with some additional transformations of the model parameters, it becomes a simple linear function of the WoE values associated with each class.
The score is therefore a simple sum of the point values of each variable that can be derived from the resulting point table.
The total score of a requesting client is then proportional to the logarithm of the bad_customer/good_customer odds ratio.
We choose to scale the points so that a total score of 600 corresponds to bad_client/good_client odds of 20 to 1, and a 50-point increase corresponds to a doubling of the odds.
The total score is then: Score = Offset + Factor * ln(Odds).
Scaling – The choice of the base score (600 points) does not affect the modeling.
pdo = 50 is the number of points that doubles the odds p/(1 − p).
From the two conditions
600 = Offset + Factor * ln(20)
650 = Offset + Factor * ln(40)
subtracting gives Factor = pdo / ln(2) = 50 / ln(2), and then
Offset = Score – [Factor * ln(Odds)] = 600 – Factor * ln(20)
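The scaling above can be sketched in a few lines (values taken from the text; the direction of the odds ratio is assumed such that higher odds yield a higher score):

```r
# Scaling sketch: 600 points at odds of 20:1, pdo = 50 points per doubling.
pdo            <- 50
scaling_factor <- pdo / log(2)                    # points per doubling of the odds
offset         <- 600 - scaling_factor * log(20)  # anchors the scale at 600 points

score_from_odds <- function(odds) offset + scaling_factor * log(odds)

score_from_odds(20)   # 600 by construction
score_from_odds(40)   # 650: odds doubled, +50 points
```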
Kolmogorov-Smirnov test for validation
In general, the distribution of scores of good customers differs significantly from that of bad customers when the KS statistic (here 0.5071) exceeds the critical threshold value. Given the test results on both the train and test data, we reject the null hypothesis that good and bad customers have the same score distribution.
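The validation step can be sketched with base R’s two-sample KS test on synthetic score distributions (illustrative numbers, not the article’s actual scores):

```r
# Two-sample Kolmogorov-Smirnov test on synthetic good/bad score samples.
set.seed(1)
scores_good <- rnorm(700, mean = 620, sd = 40)
scores_bad  <- rnorm(300, mean = 560, sd = 40)

ks <- ks.test(scores_good, scores_bad)
ks$statistic   # D: largest gap between the two empirical CDFs
ks$p.value     # a tiny p-value -> the two distributions differ
```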
In general, the threshold value at which to accept a loan varies from one type of loan to another and from one lender to another. Some loans require a minimum rating of 520, while others may accept ratings below 520. Therefore, after obtaining the limit score, we can then decide whether or not to approve the loan.
Overall, predictive models are based on the use of a client’s historical data to predict the probability that the client will exhibit defined behaviour in the future. They not only identify the “good” and “bad” requests on an individual basis, but they also predict the probability that an application with a given rating will be “good” or “bad”. These probabilities or scores, as well as other business considerations, such as expected approval rates, profit, churn rate and losses, then serve as the basis for decision making.
That’s all for this scoring project. If you have any questions or comments, please feel free to contact me or leave a comment below.