Loan Applicant Risk Classification


2020, Oct 18    

Deciding whether someone is eligible for a loan is one of the most important processes in a loan application. It is crucial because most financial companies rely on lending as their main source of income. By lending, companies are also exposed to the risk that debtors stop repaying their loans (default), causing losses to the lender. It is often a time-intensive process during which analysts gather large amounts of credit and income information about the loan applicants.

To help lenders make this decision, I built a binary classification model that classifies a loan applicant, based on their credit-related history, as either high risk or low risk of late payment. This is an end-to-end supervised machine learning problem solved primarily in Python. The datasets I use are courtesy of Cermati, and the model I use is logistic regression.

Libraries I used

To create the model, I used the following libraries:

  1. NumPy for mathematical processing
  2. Pandas for data processing
  3. Seaborn & Plotly for data visualization
  4. Scikit-learn for data preprocessing, feature selection, model training, cross-validation, and evaluation metrics
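
For reference, a minimal import block matching this stack might look like the sketch below; the post does not show its actual imports, so the exact submodules are my assumption:

```python
import numpy as np                   # mathematical processing
import pandas as pd                  # data processing
import seaborn as sns                # statistical plots
import plotly.express as px         # interactive plots
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
```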

Understanding the data

Before going deeper into the features, I explore the data at a superficial level. Doing so greatly helps in interpreting the data and in choosing the proper tools to build a proper model. I have three datasets, illustrated by the image below.

[Figure: overview of the three datasets]

There is a lot going on in the data. To simplify the model, I kept several of the more significant features and dropped the unwanted ones. The following is the list of the significant features.
loanapptrain.csv / loanapptest.csv

  • LN_ID: Loan ID
  • TARGET: Target variable (1 - client with late payment more than X days, 0 - all other cases)
  • GENDER: Gender of the client
  • INCOME: Monthly income of the client
  • APPROVED_CREDIT: Approved credit amount of the loan
  • ANNUITY: Loan annuity (amount that must be paid monthly)
  • PRICE: For consumer loans it is the price of the goods for which the loan is given
  • INCOME_TYPE: Client's income type (businessman, working, maternity leave,…)
  • EDUCATION: Highest education attained
  • FAMILY_STATUS: Married/Not Married
  • HOUSING_TYPE: Renting/Living with parents/house
  • DAYS_AGE: Client age in days
  • EXT_SCORE_1, EXT_SCORE_2, EXT_SCORE_3: External credit score acquired from a third party

prevloanapp.csv

  • SK_ID_PREV: ID of the previous loan (one current loan can have 0, 1, 2, or more previous loan applications)
  • LN_ID: Loan ID
  • ANNUITY: Loan annuity (amount that must be paid monthly) of previous application
  • APPLICATION: How much credit the client asked for on the previous application
  • APPROVED_CREDIT: Final approved credit amount of the previous application. This differs from APPLICATION: APPLICATION is the amount the client initially applied for, but during the approval process they may have been granted a different amount
  • PRICE: For consumer loans it is the price of the goods for which the loan is given
  • CONTRACT_STATUS: Contract status (approved, cancelled, …) of previous application
  • DAYS_DECISION: When the decision about the previous application was made, relative to the current application
  • TERM_PAYMENT: Payment term of the previous credit at the time of the previous application
  • YIELD_GROUP: Interest rate of the previous application, grouped into small, medium, and high

Add new feature: Debt-to-income ratio

The debt-to-income ratio (DIR) is the ratio of an individual's monthly debt payment to their monthly gross income. A low DIR indicates that a person has sufficient income relative to their debt servicing, which makes them more attractive to grant credit to. I decided to include DIR because it is commonly used in the finance industry to evaluate the risk of a borrower. In our case, DIR equals ANNUITY divided by INCOME (computed for both the previous and the current annuity).
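
As a sketch, assuming the current and previous application tables are merged on LN_ID (so pandas suffixes the overlapping columns with _x for current and _y for previous, which is where the DIR_x name used later comes from; the dataframe names are my assumption):

```python
import pandas as pd

# Assumed file names from the post
df_app = pd.read_csv("loanapptrain.csv")
df_prev = pd.read_csv("prevloanapp.csv")

# Merging on LN_ID suffixes overlapping columns with _x (current) / _y (previous)
df = df_app.merge(df_prev, on="LN_ID", how="left", suffixes=("_x", "_y"))

# Debt-to-income ratio: annuity divided by income
df["DIR_x"] = df["ANNUITY_x"] / df["INCOME"]
df["DIR_y"] = df["ANNUITY_y"] / df["INCOME"]
```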

Pre-processing

There are two preprocessing steps: feature scaling and a train/test split.

Feature Scaling

Feature scaling reduces the chance of an unstable training process caused by highly varying value ranges in the data. Only the numerical features are normalized.
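
A minimal sketch of this step, assuming min-max normalization to [0, 1] (the post does not name the exact scaler) and an assumed list of numerical columns:

```python
from sklearn.preprocessing import MinMaxScaler

# Assumed set of numerical columns from the data dictionary above
num_cols = ["INCOME", "APPROVED_CREDIT", "ANNUITY_x", "PRICE", "DIR_x",
            "EXT_SCORE_1", "EXT_SCORE_2", "EXT_SCORE_3"]

# Rescale each numerical column to the [0, 1] range
scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```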

Split dataset into 80% train and 20% test

From this point on, processing focuses on the training set (df_train). The test set (df_test) is kept isolated from the whole process until the model is ready, to prevent information leakage.
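
The split itself, as a sketch; the random seed and stratification are my additions for reproducibility and class balance, since the post only specifies the 80/20 ratio:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows, preserving the TARGET class ratio in both halves
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["TARGET"]
)
```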

Feature Selection

To reduce the complexity of the model, it is important to reduce the number of input features. After examining the dataset, I found that the features consist of numerical and categorical (discrete) variables. I evaluate the relationship between these features and the target variable using qualitative and statistical methods, and select the features with the stronger relationships.

Numerical Feature Selection with ExtraTreesClassifier

I filter out all non-numerical features in the main dataframe and select the best numerical features available. To achieve this, I use an extra trees classifier and keep the five best features, as illustrated by the chart below. A larger value means the feature is more important.

[Figure 1: Feature importance scores for numerical features from the extra trees classifier]

EXT_SCORE_2 is defined as a normalized score from an external data source. From Figure 1 and the EDA process that produced Figure 2, I infer that EXT_SCORE_2 describes some kind of credit or loan score with high variance for troubled borrowers (TARGET = 1). The visually distinct violin plot and the high feature score mean that this feature contributes a lot to the target variable, so EXT_SCORE_2 will be used as one of our main features.

[Figure 2: Violin plot of EXT_SCORE_2 by TARGET]

The feature with the second largest importance score is ANNUITY_x, but instead of it I will use DIR_x as my next selected feature. DIR_x is the current debt-to-income ratio, i.e. the ratio of ANNUITY_x to INCOME.

[Figure 3: Violin plot of DIR_x by TARGET]

The reason is that DIR_x is more representative, since it is derived from two other features, and it is commonly used by financial companies to assess an individual's credit risk. As Figure 3 shows, troubled borrowers have a slightly higher DIR. INCOME on its own has a fairly low importance score, but the dataset does not fully reflect real conditions, so I prefer the more meaningful, best-practice feature, DIR_x.
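
A sketch of the ranking step under the dataframe assumptions above; the hyperparameters and the median imputation of missing values are my choices, as the post does not show them:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Keep only numerical columns, excluding the ID and the target itself
X_num = (df_train.select_dtypes(include="number")
                 .drop(columns=["LN_ID", "TARGET"]))
y = df_train["TARGET"]

forest = ExtraTreesClassifier(n_estimators=100, random_state=42)
forest.fit(X_num.fillna(X_num.median()), y)

# Impurity-based importance: larger means more important
importances = pd.Series(forest.feature_importances_, index=X_num.columns)
print(importances.sort_values(ascending=False).head(5))
```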

During this step, I also examined installment_payment.csv to see whether it contains any meaningful features. It turned out it does not, so I do not use that dataset anywhere in the process.

Categorical Feature Selection with One-hot Encoding and Pearson Chi-square

I use the Pearson chi-square test in scikit-learn's SelectKBest function to select the k best categorical features. Categorical features are tricky to process, since they often have no ordinal relationship. One-hot encoding converts such features into dummy 0/1 variables. The five best features found are illustrated by the figure below.

[Figure 4: Results for categorical feature selection using chi-squared SelectKBest]
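
A sketch of this step; the list of categorical columns is my assumption, taken from the data dictionary above:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Assumed categorical columns
cat_cols = ["GENDER", "INCOME_TYPE", "EDUCATION", "FAMILY_STATUS",
            "HOUSING_TYPE", "CONTRACT_STATUS"]

# One-hot encode: each category becomes a 0/1 dummy column
X_cat = pd.get_dummies(df_train[cat_cols].astype(str))

# Score every dummy column against the target with Pearson chi-square
selector = SelectKBest(score_func=chi2, k=5).fit(X_cat, df_train["TARGET"])
scores = pd.Series(selector.scores_, index=X_cat.columns)
print(scores.sort_values(ascending=False).head(5))
```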

X5_Refused is a binary variable derived from one of the CONTRACT_STATUS categories (for convenience, x5 = CONTRACT_STATUS). If an applicant's previous credit application(s) have been refused, the value is 1. This feature appears important to the target variable because it relates to the risk a loan applicant carries.

Selected Features

To summarize, the selected features are:

  • EXT_SCORE_2
  • DIR_x (Current Debt-to-income ratio, not the historical one)
  • CONTRACT_STATUS_Refused

Training

Since the features are now selected, the unused ones can be safely dropped from the datasets. Rows with null values are removed as well. The data is now ready for training. I use logistic regression as the algorithm for the following reasons (a minimal training sketch follows the list):

  • It is a very common algorithm for solving binary classification problems.
  • Logistic regression performs better with normalized data, which we already have.
  • There are no high correlations between the selected features.
  • Only three features are used; too many features may introduce variance and cause the model to overfit.
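
The sketch below continues the assumed dataframe names from earlier and assumes the one-hot CONTRACT_STATUS dummies have been merged back into df_train and df_test; hyperparameters are left at scikit-learn defaults, since the post does not list them:

```python
from sklearn.linear_model import LogisticRegression

features = ["EXT_SCORE_2", "DIR_x", "CONTRACT_STATUS_Refused"]

# Drop rows with nulls in the selected columns, as described above
train = df_train.dropna(subset=features + ["TARGET"])
test = df_test.dropna(subset=features + ["TARGET"])

model = LogisticRegression()
model.fit(train[features], train["TARGET"])
pred = model.predict(test[features])
```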

From the lender's perspective, the model predicts whether an applicant has a high chance of delaying their payments, based on the collected applicant data: mainly the external score, the debt-to-income ratio, and the history of refused applications. This guides the lender's decision about whether to grant the applicant a loan.

Results

The figure below depicts the confusion matrix of the model.

[Figure 5: Confusion matrix]

From the matrix we have 5938 true positives and 70133 true negatives. To further assess the performance, I use accuracy and recall.

[Figure 6: Accuracy and recall metrics]

The model accuracy is 65%, which means 65% of the inputs are classified correctly. For class 1 (high risk), the recall is 55%, which means the model misses almost half of the actual high-risk applicants (false negatives). This matters because, in this case, a false positive is more tolerable than a false negative: denying a loan to an eligible applicant is more acceptable than classifying a high-risk borrower as low risk.
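
These numbers can be computed from the held-out test set with scikit-learn's metric helpers, continuing the training sketch above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_test = test["TARGET"]

print(confusion_matrix(y_test, pred))                     # rows: true class, cols: predicted
print("accuracy:", accuracy_score(y_test, pred))          # ~0.65 reported in the post
print("recall (high risk):", recall_score(y_test, pred))  # ~0.55 reported in the post
```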

Conclusion

We have now successfully trained a binary model to classify whether an applicant is eligible for a loan. Screening applicants manually does not scale; it would be exhausting and resource-demanding. Hopefully, this also helps companies reduce the time needed to make lending decisions. However, there is still room to improve this model. I will keep the following notes in case I have time to develop it in the future:

  • Tweak the hyperparameters to increase recall and accuracy
  • Explore other features from the datasets that may have been overlooked
  • Try alternative methods/algorithms that still yield the same kind of output