Random Forest on Credit Card Approval Classification

This is an excerpt from my Kaggle notebook where I used a Random Forest classifier on credit card data (https://www.kaggle.com/datasets/rohitudageri/credit-card-details/data).

Random Forest was extremely easy to use and offered great insight into the data relatively quickly.

Exploring the Data

The CSV has 1548 rows in total. The table below shows summary statistics for the numeric columns, including an additional column “Approved” that is 1 or 0 (a rough sketch of how the data was loaded follows the table).

|       | CHILDREN | Annual_income | Birthday_count | Employed_days | Mobile_phone | Work_Phone | Phone | EMAIL_ID | Family_Members | Approved |
|-------|----------|---------------|----------------|---------------|--------------|------------|-------|----------|----------------|----------|
| count | 1548.000000 | 1.525000e+03 | 1526.000000 | 1548.000000 | 1548.0 | 1548.000000 | 1548.000000 | 1548.000000 | 1548.000000 | 1548.000000 |
| mean  | 0.412791 | 1.913993e+05 | -16040.342071 | 59364.689922 | 1.0 | 0.208010 | 0.309432 | 0.092377 | 2.161499 | 0.113049 |
| std   | 0.776691 | 1.132530e+05 | 4229.503202 | 137808.062701 | 0.0 | 0.406015 | 0.462409 | 0.289651 | 0.947772 | 0.316755 |
| min   | 0.000000 | 3.375000e+04 | -24946.000000 | -14887.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25%   | 0.000000 | 1.215000e+05 | -19553.000000 | -3174.500000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 |
| 50%   | 0.000000 | 1.665000e+05 | -15661.500000 | -1565.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 |
| 75%   | 1.000000 | 2.250000e+05 | -12417.000000 | -431.750000 | 1.0 | 0.000000 | 1.000000 | 0.000000 | 3.000000 | 0.000000 |
| max   | 14.000000 | 1.575000e+06 | -7705.000000 | 365243.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 15.000000 | 1.000000 |
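As a rough sketch of how this summary can be produced (the file, key, and label column names below are assumptions about the Kaggle dataset, not code taken from the notebook):

```python
import pandas as pd

# Assumed filenames: the Kaggle dataset ships the application data and the
# approval labels as separate CSVs.
apps = pd.read_csv("Credit_card.csv")
labels = pd.read_csv("Credit_card_label.csv")

# Assumed key/label column names; the 0/1 label becomes the "Approved" column.
df = apps.merge(labels, on="Ind_ID").rename(columns={"label": "Approved"})

# Summary statistics for the numeric columns (the table above).
print(df.describe())
```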

Modes

The table below shows the modes for all columns

| Field | Value |
|-------|-------|
| GENDER | F |
| Car_Owner | N |
| Propert_Owner | Y |
| CHILDREN | 0.0 |
| Annual_income | 135000.0 |
| Type_Income | Working |
| EDUCATION | Secondary / secondary special |
| Marital_status | Married |
| Housing_type | House / apartment |
| Employed_days | 365243.0 |
| Mobile_phone | 1.0 |
| Work_Phone | 0.0 |
| Phone | 0.0 |
| Type_Occupation | Laborers |
| Family_Members | 2.0 |
| Approved | 0.0 |
| Status | Declined |
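A quick way to build a table like this is pandas’ mode() (a minimal sketch, assuming the df from the loading step above):

```python
# The first row of mode() holds the most frequent value of each column;
# stacking it gives the Field/Value table above.
modes = df.mode().iloc[0].rename_axis("Field").rename("Value")
print(modes)
```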

Number of Approved vs Declined

Note that the data is heavily skewed toward “Declined”; this will be important later when fitting the model.

*(Figure: Approved vs Declined counts)*
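A minimal sketch of the check behind this plot:

```python
import matplotlib.pyplot as plt

# Count of Declined (0) vs Approved (1) applications.
counts = df["Approved"].value_counts()
print(counts)

counts.plot(kind="bar", title="Approved vs Declined")
plt.show()
```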

Education Distribution

Mostly high school and junior college.

*(Figure: Education distribution)*

Occupation Type Distribution

*(Figure: Occupation type distribution)*

Income Type Distribution

*(Figure: Income type distribution)*

Marital Status Distribution

*(Figure: Marital status distribution)*

Employed Days Distribution

This looks odd, but Employed_days counts backwards from the day of data collection (day 0): negative values mark the start day of the current job, and a positive value means the person was unemployed at the time of collection. Most applicants have around 5-7 years of employment.

*(Figure: Employed days distribution)*
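For reference, a sketch of how employment length in years can be read off this column (treating the positive sentinel values as “not employed”, as described above):

```python
# Negative Employed_days = days since the current job started.
employed = df.loc[df["Employed_days"] < 0, "Employed_days"]
years_employed = -employed / 365.25
print(years_employed.describe())
```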

Housing Type Distribution

*(Figure: Housing type distribution)*

Preprocess Data

We’ll first separate the categorical and continuous fields.

| Categorical | Continuous |
|-------------|------------|
| GENDER | CHILDREN |
| Car_Owner | Family_Members |
| Propert_Owner | Annual_income |
| Type_Income | Age |
| EDUCATION | EmployedDaysOnly |
| Marital_status | UnemployedDaysOnly |
| Housing_type | |
| Mobile_phone | |
| Work_Phone | |
| Phone | |
| Type_Occupation | |
| EMAIL_ID | |
  • Age is calculated in years from the Birthday_count field.
  • Two new fields, EmployedDaysOnly and UnemployedDaysOnly, count the number of employed and unemployed days for each person (a sketch of these derived fields follows the list).
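A minimal sketch of how these derived fields could be built (the exact notebook code may differ; handling of the missing Annual_income/Birthday_count values is left out here):

```python
# Age in years; Birthday_count counts days backwards from day 0.
df["Age"] = -df["Birthday_count"] / 365.25

# Split Employed_days into two non-negative day counts.
df["EmployedDaysOnly"] = df["Employed_days"].clip(upper=0).abs()   # days employed
df["UnemployedDaysOnly"] = df["Employed_days"].clip(lower=0)       # positive "unemployed" values

cats = ["GENDER", "Car_Owner", "Propert_Owner", "Type_Income", "EDUCATION",
        "Marital_status", "Housing_type", "Mobile_phone", "Work_Phone",
        "Phone", "Type_Occupation", "EMAIL_ID"]
conts = ["CHILDREN", "Family_Members", "Annual_income", "Age",
         "EmployedDaysOnly", "UnemployedDaysOnly"]
dep = "Approved"

# Categorical columns as pandas categories so .cat.codes works below.
df[cats] = df[cats].astype("category")
```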

Random Forest Classifier

  • Given how skewed the classes are, oversampling is needed.
  • Given how small the dataset is, undersampling won’t be used.
```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = df[cats + conts].copy(), df[dep]

# Oversample the minority "Approved" class so both classes are balanced.
X_over, y_over = RandomOverSampler().fit_resample(X, y)

X_train, X_val, y_train, y_val = train_test_split(X_over, y_over, test_size=0.25)

# Encode the categorical columns as integer codes for the forest.
X_train[cats] = X_train[cats].apply(lambda x: x.cat.codes)
X_val[cats] = X_val[cats].apply(lambda x: x.cat.codes)

rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(X_train, y_train)
```
| Metric | Value |
|--------|-------|
| MSE | 0.011644832605531296 |
| OOB | 0.013598834385624037 |
| Accuracy | 0.9883551673944687 |
| F1 Score | 0.988235294117647 |
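A sketch of how these metrics can be reproduced (reading “OOB” as the out-of-bag error, i.e. 1 - oob_score_, which is an assumption):

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

preds = rf.predict(X_val)

print("MSE:     ", mean_squared_error(y_val, preds))
print("OOB:     ", 1 - rf.oob_score_)   # assumed to be the OOB error rate
print("Accuracy:", accuracy_score(y_val, preds))
print("F1 Score:", f1_score(y_val, preds))
```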

Feature Importance

And now my favorite part! This plot shows the most influential fields in the data: the features whose splits reduce impurity the most across the forest. No surprise that length of employment and age are the two dominant factors, with annual income a close third. So much data analysis can be condensed into the feature importance plot!

*(Figure: Feature importance plot)*
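A sketch of how such a plot can be drawn from the fitted forest:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Impurity-based importance of each input column.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh", title="Feature importance")
plt.tight_layout()
plt.show()
```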

Confusion Matrix

There were no cases where the model predicted “Declined” when the actual status was “Approved”, so whenever the model declines an application, it does so correctly. It is a little too lenient, though: it predicted “Approved” for 8 applications that were actually declined.

```python
from sklearn.metrics import confusion_matrix

# preds are the validation-set predictions computed above
confusion = confusion_matrix(y_val, preds)
```

*(Figure: Confusion matrix)*
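One way to render the matrix above (a sketch using scikit-learn’s display helper; the 0/1 label names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay(confusion, display_labels=["Declined", "Approved"]).plot()
plt.show()
```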

It follows that the ROC curve is nearly perfect, with an AUC of 0.988.

*(Figure: ROC curve, AUC 0.988)*
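A sketch of the ROC computation, using the predicted probability of the positive (“Approved”) class:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

probs = rf.predict_proba(X_val)[:, 1]   # probability of class 1 ("Approved")
print("ROC AUC:", roc_auc_score(y_val, probs))

RocCurveDisplay.from_predictions(y_val, probs)
plt.show()
```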

Results

This was a non-starter without oversampling. Without it, accuracy was a deceptively decent ~91%, but the confusion matrix and abysmal F1 score showed the model was awful.

I split the Employed_days column into “unemployed” and “employed” day counts. Looking at the feature importance plot, it’s not surprising that unemployed days, age, and income are the top contributors.

Final Results:

| Score | Value |
|-------|-------|
| ROC AUC | 0.9958 |
| MSE | 0.0044 |
| OOB | 0.0141 |
| Accuracy | 0.9956 |
| F1 Score | 0.9955 |