Credit Risk Analysis Using Machine Learning
Approving loans without proper scientific evaluation increases the risk of default. This can lead to the bankruptcy of lending agencies and, consequently, to the destabilization of the banking system. This is what happened in the 2008 financial crisis, which adversely affected the world economy. Three components determine the amount of loss that a firm faces as a result of a loan default:
- Probability of Default (PD)
- Exposure at Default (EAD)
- Loss given Default (LGD)
The expected loss (ELoss) is the simple product of these three quantities:
ELoss = PD ⋅ EAD ⋅ LGD
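For illustration, with made-up numbers: a loan with PD = 5 %, EAD = 10,000 € and LGD = 60 % would carry an expected loss of ELoss = 0.05 ⋅ 10,000 € ⋅ 0.6 = 300 €.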
Our focus here is on the Probability of Default (PD). As an example, we will look at the German Credit data set taken from the Kaggle database.
Data Exploration
As a first step, we look at the data. The NumPy and pandas libraries in Python are excellent tools for data exploration. For data visualization we mainly use the matplotlib and seaborn libraries. We import these libraries into our workspace.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```
Now we import the data file and inspect it.
```python
df = pd.read_csv('data/german_credit_data.csv', index_col=False)
df.head()
```
| Age | Sex | Job | Housing | Saving accounts | Checking accounts | Duration | Purpose | Risk |
|-----|-----|-----|---------|-----------------|-------------------|----------|----------|------|
| 67 | male | 2 | own | NaN | little | 6 | radio/TV | good |
| 22 | female | 2 | own | little | moderate | 48 | radio/TV | bad |
| 49 | male | 1 | own | little | NaN | 12 | education | good |
| 45 | male | 2 | free | little | little | 42 | furniture | good |
| 53 | male | 2 | free | little | little | 24 | car | bad |
All columns except the last contain the feature variables; the last column (Risk) is the target variable, which we want to classify as "good" or "bad". The purpose of the Machine Learning model is to capture the relations between the features and the target variable and to predict the credit risk of future applicants.
Most Machine Learning models cannot handle missing values in the feature space, or their presence diminishes the predictive power of the model. Therefore, we need to check for them:
```python
df.isnull().sum()
```
| Feature | Missing values |
|------------------|-----|
| Age | 0 |
| Sex | 0 |
| Job | 0 |
| Housing | 0 |
| Saving accounts | 183 |
| Checking accounts | 394 |
| Credit amount | 0 |
| Duration | 0 |
| Purpose | 0 |
| Risk | 0 |
We will handle these missing values in a later section. Before we actively change the dataset, however, we continue our exploration and try to figure out which of the features affect the risk.
Data visualization
Are females more likely to default, or is it less risky to lend money to rich people? These kinds of questions can be answered qualitatively by visualizing the data. We create a cross table for each feature variable in question.
```python
# cross table for the 'Sex' feature: percentage of good/bad risk within each sex
cross_sex = pd.crosstab(df['Risk'], df['Sex']).apply(lambda x: x / x.sum() * 100)
cross_sex = cross_sex.round(2)
cross_sex_transposed = cross_sex.T
cross_sex_transposed
```
| Sex | bad (%) | good (%) |
|--------|---------|----------|
| female | 35.16 | 64.83 |
| male | 27.68 | 72.31 |
The values presented here are percentages. It seems that the feature "Sex" contains valuable information for the classification: in this data set, females are slightly more likely to default (although this should not be taken as a general conclusion). The graph below makes this easier to see.
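For instance, the bar chart below could be produced directly from the transposed cross table (a minimal sketch, not necessarily the exact plotting code used for the figure):

```python
# grouped bar chart of the good/bad percentages per sex
cross_sex_transposed.plot(kind='bar')
plt.ylabel('Percentage')
plt.title('Risk by sex')
plt.show()
```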
[Figure: percentage of good and bad risk for female and male applicants]
A similar analysis can be performed based on the wealth in the checking account, which is categorized as 'little', 'moderate' or 'rich'.
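A sketch along the same lines might look like this (the column name 'Checking account' is assumed from the data preview and may need to be adjusted to match df.columns):

```python
# cross table for the checking account feature: percentage of good/bad risk
# within each wealth category ('little', 'moderate', 'rich')
cross_check = pd.crosstab(df['Checking account'], df['Risk'], normalize='index') * 100
cross_check = cross_check.round(2)

# bar chart of the percentages per wealth category
cross_check.plot(kind='bar')
plt.ylabel('Percentage')
plt.title('Risk by checking account balance')
plt.show()
```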
[Figure: percentage of good and bad risk by checking account balance (little, moderate, rich)]
The graph tells us that wealthy people are less likely to default. Now that we have seen the importance of the features, it is time to employ our Machine Learning model which can quantitatively capture the feature patterns and subsequently predict the risk.
Building models
From the raw data we saw that the target variable Risk is categorical ('good' or 'bad'); this is therefore a classification problem. There are many classification algorithms in the literature, and the Random Forest classifier is considered one of the standard choices.
As we noticed before, several columns contain missing entries as well as non-numerical data. Machine Learning algorithms cannot handle these values directly, so we need to clean the data and transform it into numerical form.
It is possible to drop a row entirely if it contains any missing value. However, if we did that, we would lose a large amount of data, as we have seen in the Data Exploration section. Instead, we replace the missing values with a separate category 'None' (even though this is not strictly accurate). We can then encode the categorical values into numerical values, for example with the OneHotEncoder from the scikit-learn library. The 'good' and 'bad' risk labels can be transformed into '1' and '0'. The Housing column can be transformed into three columns named 'own', 'rent' and 'free': if an applicant owns their house, the 'own' column has entry '1' and the other two columns have entry '0'. We then drop one of the dummy columns to avoid redundancy. After processing, the dataset looks like this (a sketch of the preprocessing code follows the table):
| Age | Sex | Job | own | rent | .. | radio/TV | repairs | vacation | risk |
|-----|-----|-----|-----|------|----|----------|---------|----------|------|
| 67 | 1 | 2 | 1 | 0 | .. | 1 | 0 | 0 | 1 |
| 22 | 0 | 2 | 1 | 0 | .. | 1 | 0 | 0 | 0 |
| 49 | 1 | 1 | 1 | 0 | .. | 0 | 0 | 0 | 1 |
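A minimal sketch of this preprocessing is given below. It uses pandas' get_dummies as a convenient stand-in for scikit-learn's OneHotEncoder; the column names are assumed from the data preview above, and the processed frame is stored in df_, the name used in the train/test split further down.

```python
# possible preprocessing sketch (illustrative, not necessarily the exact code used)
df_ = df.copy()

# replace missing values in the two account columns with a separate 'None' category
for col in ['Saving accounts', 'Checking account']:
    df_[col] = df_[col].fillna('None')

# encode the target and the binary 'Sex' feature: 'good'/'male' -> 1, 'bad'/'female' -> 0
df_['Risk'] = (df_['Risk'] == 'good').astype(int)
df_['Sex'] = (df_['Sex'] == 'male').astype(int)

# one-hot encode the remaining categorical features and drop one dummy column per
# feature to avoid redundancy (get_dummies prefixes the new columns, e.g. 'Housing_own')
df_ = pd.get_dummies(df_,
                     columns=['Housing', 'Saving accounts', 'Checking account', 'Purpose'],
                     drop_first=True)
```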
Now that all the features and the target variable are in numerical form, we can finally train and fit our model. For that we first split our data into a train set and a test set. The train set is used to capture the relations between the features and the target variable, while the test set is used to verify the performance of the model.
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df_[df_.columns.difference(['Risk'])], df_['Risk'],
    test_size=0.25, random_state=0)
```
Hyperparameters are the parameters which have to be set by the user before training; by contrast, normal parameters are tuned while the model is trained. This model has several hyperparameters, so the number of possible combinations grows very quickly. We tune the hyperparameters by automating the search with a so-called parameter grid (ParameterGrid in scikit-learn) while training the model.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# dictionary of hyperparameters
grid = {'n_estimators': [50, 60, 70, 75, 80, 90],
        'max_depth': [5, 7, 8, 9, 10, 12, 15],
        'max_features': [2, 3, 4, 5],
        'random_state': [42],
        'min_samples_leaf': [1, 2, 3, 4],
        'min_samples_split': [2, 4, 5, 6, 7, 8],
        'criterion': ["gini", "entropy"]}

f1_scores = []
accuracy = []

# initializing the model
random_forest = RandomForestClassifier()

# train and evaluate the model for every hyperparameter combination
for g in ParameterGrid(grid):
    random_forest.set_params(**g)  # "unpacking" the dictionary
    random_forest.fit(x_train, y_train)
    f1_scores.append(f1_score(y_test, random_forest.predict(x_test)))
    accuracy.append(accuracy_score(y_test, random_forest.predict(x_test)))

# index of the best hyperparameter combination (by F1 score and by accuracy)
best_idx = np.argmax(f1_scores)
best_accuracy_idx = np.argmax(accuracy)

# refit the model with the best hyperparameter combination before evaluating it
best_params = list(ParameterGrid(grid))[best_idx]
random_forest.set_params(**best_params)
random_forest.fit(x_train, y_train)
```
Once the hyperparameters are tuned, the metrics of the model are calculated. The results can be evaluated with a confusion matrix, which is a useful tool for measuring the quality of a classification. The confusion matrix is computed as follows:
```python
confusion_matrix(y_test, random_forest.predict(x_test))
```
| actual \ predicted | bad (0) | good (1) |
|--------------------|---------|----------|
| bad (0) | 34 | 40 |
| good (1) | 12 | 164 |
The accuracy and the F1 score of the model are 0.79 and 0.86, respectively. Despite the limited amount of data and the many missing values, the model performance is quite good.
Feature Selection
From the hyperparameter-tuned model we can now see which features are most relevant. The plot below shows that the most relevant features for the classification are 'Sex', 'Job', 'Age' and 'Checking account'. Sometimes the dataset is huge and consists of a multitude of features; in those cases it can be wise to drop the least relevant features to reduce the computation time.
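The importances plotted below can be read off the fitted model; a minimal sketch (assuming the tuned random_forest and the x_train frame from the split above) could look like this:

```python
# feature importances of the tuned Random Forest, sorted for plotting
importances = pd.Series(random_forest.feature_importances_, index=x_train.columns)
importances.sort_values().plot(kind='barh')
plt.xlabel('Feature importance')
plt.show()
```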
[Figure: feature importances of the tuned Random Forest model]
Acceptance rate and bad rate
The acceptance rate is the percentage of new loans that are accepted. The goal is to keep the number of defaults as low as possible while maximizing profit by handing out loans. Based on our model we can calculate a threshold for a given acceptance rate. If the predicted probability of default for a new credit is lower than the threshold, the loan is accepted; otherwise it is rejected:
| Loan | prob_default | threshold | accept or reject |
|------|--------------|-----------|------------------|
| 1 | 0.70 | 0.81 | Accept |
| 2 | 0.83 | 0.81 | Reject |
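The predicted default probabilities themselves can be taken from the tuned model; a minimal sketch (assuming class 0 encodes 'bad' risk, as in our encoding) is:

```python
# probability of the 'bad' (default) class for each loan in the test set;
# the column order follows random_forest.classes_, where class 0 = 'bad'
prob_default = random_forest.predict_proba(x_test)[:, 0]
```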
Assume the acceptance rate of a bank is 85%. What should the threshold be? For this data set and with our model the threshold is calculated to be 0.55. The NumPy package can be used to calculate it:
```python
import numpy as np

# threshold for an 85% acceptance rate: the 85th percentile
# of the predicted default probabilities
threshold = np.quantile(prob_default, 0.85)
```
The figure below shows the threshold as the black vertical line:
[Figure: distribution of predicted default probabilities, with the acceptance threshold marked as a black vertical line]
For different acceptance rates the threshold values vary. Even with the calculated threshold, some accepted loans will still default. The bad rate is the percentage of accepted loans that actually default; in our model it is calculated to be 23%.
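As a rough sketch (assuming prob_default and threshold from above, and y_test encoded with 1 = good and 0 = bad), the bad rate could be computed like this:

```python
# loans whose predicted default probability lies below the threshold are accepted
accepted = prob_default <= threshold

# share of the accepted loans that actually defaulted (y_test == 0 means 'bad')
bad_rate = (y_test.to_numpy()[accepted] == 0).mean()
print(f"Bad rate: {bad_rate:.2%}")
```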
Conclusion
Machine Learning based models are very pragmatic when it comes to minimizing risk. An increasing number of banks are choosing such models for their risk analysis in order to minimize losses by preventing defaults. There is a multitude of Machine Learning techniques for regression, classification, clustering, etc., and we have only scratched the surface with our case. However, certain things should be considered when applying Machine Learning techniques:
- The model should be able to explain its result. If an algorithm denies an applicant a loan, it is important for the bank to know the reason; otherwise it could even face legal challenges from the applicant. Therefore, it is sometimes reasonable to use relatively simple models (like logistic regression, gradient boosted trees, or a Random Forest as in our case) instead of complicated models like Neural Networks.
- Apple's credit card was recently accused of discriminating against women. Models can be socially prejudiced because they draw their insights from historical data. In our case too, the model predicted that females are more likely to default. This is partly because we have built a discrete-time hazard model in which the probability of default is a point-in-time event. A good model should instead incorporate how the impact of the features on the risk evolves over time. Such models are called Through-the-Cycle (TTC) models; they take the macroeconomic situation, social evolution and other factors into account.
Dibyajyoti holds a degree from HBNI University in India and a PhD in Physics from TU Kaiserslautern. His work has been published in numerous international journals and conference proceedings. He specialises in the financial markets, conducting mathematical modeling and data analysis.