Using Fannie Mae mortgage data, I investigated which types of loans were at greatest risk of default (i.e., being foreclosed upon) at the beginning of the 2008 financial crisis. This is made difficult because loan default, even at the height of the financial crisis, was a "rare event": it occurs at such a low frequency that a model becomes far better at predicting what DOES NOT cause a default than what DOES cause a default.
The goal of this project was to test how downsampling and upsampling techniques can improve predictions of rare events.
Downsampling techniques create balanced data by removing observations from the majority class until its frequency matches that of the minority class. Upsampling techniques create balanced data by resampling duplicates of the minority class until its frequency matches that of the majority class.
I train the model three times: once with the original balance of defaults to non-defaults, once after downsampling the training sample, and once after upsampling the training sample. I show that downsampling and upsampling the training sample both improve predictions.
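To make the two techniques concrete, here is a minimal, self-contained sketch using sklearn.utils.resample on a toy DataFrame. It is only an illustration; the analysis below implements the same idea by sampling NumPy index arrays inside the train/test split function.
# Toy illustration of downsampling and upsampling (not the implementation used below)
import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({'x': range(10), 'y': [0]*8 + [1]*2})   # 8 majority (0), 2 minority (1)
majority, minority = toy[toy['y'] == 0], toy[toy['y'] == 1]

# Downsampling: draw from the majority class WITHOUT replacement until it matches the minority count
downsampled = pd.concat([resample(majority, replace=False, n_samples=len(minority), random_state=0), minority])

# Upsampling: draw from the minority class WITH replacement until it matches the majority count
upsampled = pd.concat([majority, resample(minority, replace=True, n_samples=len(majority), random_state=0)])

print(downsampled['y'].value_counts())   # 2 observations of each class
print(upsampled['y'].value_counts())     # 8 observations of each class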
# Core Packages
import pandas as pd
import numpy as np
import random
# Convert Time Features
from datetime import datetime as dt
# Data Visualizations
import matplotlib.pyplot as plt
from jupyterthemes import jtplot
jtplot.style(theme='chesterish', grid=False)
# Splitting Data
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Machine Learning Packages
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Save Runtime
import time
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
# Output Statistics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
In this post, I set up all data wrangling and data analysis as a series of functions, which will enable me to reuse them on data from subsequent years (future projects) and with various analysis techniques (this project). The Fannie Mae data come in two files:
1) The acquisition data includes one observation for each loan with each feature representing knowledge Fannie Mae has when acquiring the loan (e.g., balance, primary lender, credit score, etc.).
2) The performance data includes observations for each month each loan is held and information on the payment of the loan.
I use the acquisition data as predictors of a dichotomous categorization of whether the homeowner defaulted on their loan, a target variable I create from the performance data and merge onto the acquisition data.
# Load the data
def load_data(acq, per):
df_acq = pd.read_csv(acq, sep='|', header=None)
df_per = pd.read_csv(per, sep='|', header=None)
# Specify the name of the columns
df_acq.columns = ['LoanID','Channel','SellerName','OrInterestRate','OrUnpaidPrinc','OrLoanTerm',
'OrDate','FirstPayment','OrLTV','OrCLTV','NumBorrow','DTIRat','CreditScore',
'FTHomeBuyer','LoanPurpose','PropertyType','NumUnits','OccStatus','PropertyState',
'Zip','MortInsPerc','ProductType','CoCreditScore','MortInsType','RelMortInd']
df_per.columns = ['LoanID','MonthRep','Servicer','CurrInterestRate','CAUPB','LoanAge','MonthsToMaturity',
'AdMonthsToMaturity','MaturityDate','MSA','CLDS','ModFlag','ZeroBalCode','ZeroBalDate',
'LastInstallDate','ForeclosureDate','DispositionDate','FCCCost','PPRC','AssetRecCost','MHRC',
'ATFHP','NetSaleProceeds','CreditEnhProceeds','RPMWP','OFP','NIBUPB','PFUPB','RMWPF',
'FPWA','ServicingIndicator']
return df_acq, df_per
The performance data is much larger as it is transaction based, while the acquisition data has the loan as its unit of analysis.
I retain only the most recent performance record for each loan, drop all variables except Loan ID (the primary key) and the foreclosure date, and merge the performance data onto the acquisition data.
I then recode the foreclosure date into a dichotomous categorization of whether the loan was foreclosed upon.
# Merge foreclosures from performance data to acquisition data
def merge_df(df_acq, df_per):
# Columns to maintain from Performance Data
per_ColKeep = ['LoanID','ForeclosureDate']
    df_per = df_per[per_ColKeep].copy()
df_per.drop_duplicates(subset='LoanID', keep='last', inplace=True)
df = pd.merge(df_acq, df_per, on='LoanID', how='inner')
# Set Foreclosed to binary
df.loc[df['ForeclosureDate'].isnull(), 'Foreclosed'] = 0
df.loc[df['ForeclosureDate'].notnull(),'Foreclosed'] = 1
df = df.drop(columns='ForeclosureDate')
df['Foreclosed'] = df['Foreclosed'].astype(int)
# Drop mergeID column
df = df.drop(df.columns[[0]], axis=1)
return df
I remove features with high missingness (10% or more) or no variation in the data, and then impute values for features with low missingness.
def missing_treat(df):
# Find features with 10% missing or more
condition = ( df.isnull().sum(axis=0)/df.shape[0]*100 )
df_HighMissing = condition > 10
# Save features that contain missing data
df_HighMissing = df_HighMissing.index[df_HighMissing.values == True]
# remove high missing features
df = df.drop(labels=df_HighMissing, axis=1)
# remove ProductType, it only had one value
df = df.drop(labels=['ProductType'], axis=1)
# remove Zip and in favor of Property State
df = df.drop(labels=['Zip'], axis=1)
# impute on the mean for low missing features that are continuous
df_cont = df.select_dtypes(include=['float64', 'int64'])
df[df_cont.columns] = df_cont.apply(lambda x: x.fillna(x.mean()),axis=0)
# impute on the mode for low missing features that are categorical
df_cat = df.select_dtypes(include=['object'])
    df[df_cat.columns] = df_cat.apply(lambda x: x.fillna(x.mode()[0]), axis=0)
return df
Changing date features to numeric
# Changing date features to ordinal variables using the function toordinal()
def change_date(df):
# Origination Date
df['OrDate'] = df['OrDate'].apply(lambda x: dt.strptime(x, '%m/%Y').toordinal())
# Date of First Payment
df['FirstPayment'] = df['FirstPayment'].apply(lambda x: dt.strptime(x, '%m/%Y').toordinal())
return df
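For reference, toordinal() turns each month into a day count (the proleptic Gregorian ordinal, where January 1 of year 1 is day 1), so the dates become ordered numeric features. A quick, hypothetical check using the 'MM/YYYY' format found in these files:
# Example: what the ordinal conversion produces for an origination month
from datetime import datetime as dt   # same import as above
print(dt.strptime('03/2008', '%m/%Y').toordinal())   # 733102 days since 0001-01-01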
One Hot Encoding
# Converting categorical variables to dummy variables
def onehotencoding(df):
columns = df.columns[df.isnull().any()]
nan_cols = df[columns]
df.drop(nan_cols.columns, axis=1, inplace=True)
df_cat = df.select_dtypes(include=['object'])
onehot = pd.get_dummies(df_cat)
df_cont = df.drop(df_cat.columns, axis=1)
df = pd.concat([df_cont,onehot,nan_cols], axis=1).reset_index(drop=True)
return df
Function to plot target variable
# Check the percentage and frequency of target variable
def target_values(df_depvar, data=False, prediction=False):
# save target frequencies
target_frequency = df_depvar.value_counts()
# save target percentage
target_percentage = round((df_depvar.value_counts()/df_depvar.count())*100).astype(int)
# graphing target variable
jtplot.style(ticks=True, grid=False)
plt.figure(figsize=(14,4))
target_percentage.plot.barh(stacked=True, color='#ca2c92').invert_yaxis()
if data:
plt.suptitle('Bar Chart of Target Variable', fontsize=18)
elif prediction:
plt.suptitle('Bar Chart of Predictions', fontsize=18)
else:
        plt.suptitle('Percent of Mortgage Defaults', fontsize=18)
plt.ylabel('Foreclosed')
plt.xlabel('Percentage')
plt.xlim([0,100])
# plt.yticks([0, 1], ['Did not Foreclose', 'Foreclosed'])
plt.show()
# display frequency of foreclosures
print('Frequency of Foreclosures\n', target_frequency, '\n', sep='')
# display percentage of foreclosures
print('Percentage of Foreclosures\n', target_percentage, '\n', sep='')
# Load 2008 Q1 data
df_acq, df_per = load_data("Acquisition_2008Q1.txt", "Performance_2008Q1.txt")
# Data Cleaning
df = merge_df(df_acq, df_per)
df = missing_treat(df)
df = change_date(df)
df = onehotencoding(df)
print('\nThe number of features is:\n', df.shape[1], sep='')
print('\nThe number of observations is:\n', df.shape[0], sep='')
target_values(df['Foreclosed'], data=True)
King, G., and Zeng, L. (1999). Logistic regression in rare events data. Department of Government, Harvard University. Online: http://GKing.Harvard.Edu.
Next, I split the data, with options to downsample (i.e., create balanced data by removing observations from the majority class until its frequency matches the minority class) or upsample (i.e., resample duplicates of the minority class until its frequency matches the majority class). Downsampling and upsampling are performed only on the training data in order to maintain the original integrity of the test data. The split is stratified on the target variable, ensuring that the test data maintains the 7% default rate.
# Modeling training and testing data
def split_data(df, depvar, downsamp=False, upsamp=False):
# Get the feature vector
X = df.drop(labels=depvar, axis=1)
# Get the target vector
y = df[depvar]
# Encoding target variable
le = LabelEncoder()
y = le.fit_transform(y)
    # Splitting training and testing datasets 75/25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
# Define classes for downsampling/upsampling
## Indicies of each class' observations
index0 = np.where(y_train == 0)[0]
index1 = np.where(y_train == 1)[0]
## Number of observations in each class
class0 = len(index0)
class1 = len(index1)
# Down sample
if downsamp:
        # Randomly sample class 0 (majority) indices without replacement, down to the number of class 1 observations
downsampled_index = np.random.choice(index0, size=class1, replace=False)
        # Join the downsampled class 0 target values with all of class 1's target values
y_train = np.concatenate((y_train[downsampled_index], y_train[index1]))
X_train = pd.concat([X_train.iloc[downsampled_index], X_train.iloc[index1]], axis=0)
# Up sample
if upsamp:
        # Randomly sample class 1 (minority) indices with replacement, up to the number of class 0 observations
upsampled_index = np.random.choice(index1, size=class0, replace=True)
        # Join all of class 0's target values with the upsampled class 1 target values
y_train = np.concatenate((y_train[index0], y_train[upsampled_index]))
X_train = pd.concat([X_train.iloc[index0], X_train.iloc[upsampled_index]], axis=0)
print('\nNumber of observations for training data:', y_train.shape[0], \
'\nNumber of observations for testing data:', y_test.shape[0])
return X_train, X_test, y_train, y_test
Pipeline Function:
One classifier was attempted: a multilayer perceptron. The neural network contains three hidden layers with 100 neurons each. I tune only the batch size (64 or 128) and the initial learning rate (1e-4, 5e-5, or 1e-5). Placing the StandardScaler inside the Pipeline ensures that scaling is fit only on the training folds during cross-validation, avoiding leakage into the validation folds. Given the computation time involved on a laptop with only an Intel Core i7-8550U processor, I keep the neural network simple with little tuning. A complete analysis should attempt a more complex algorithm given the size and difficulties associated with this data.
def pipeline():
    # Creating dictionary of neural network classifier
clfs = {'mlp': MLPClassifier(random_state=0, learning_rate='adaptive', max_iter=1000,
hidden_layer_sizes = (100,100,100), alpha=0.01)}
# Initiating dictionary for pipeline
pipe_clfs = {}
# Pipeline will standardize data and run classifiers
for name, clf in clfs.items():
pipe_clfs[name] = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])
# Initiate grid
param_grids = {}
# The parameter grid for multi-layer perceptron
param_grid = [{'clf__batch_size': [64, 128],
'clf__learning_rate_init': [0.0001, 0.00005, 0.00001]}]
param_grids['mlp'] = param_grid
return pipe_clfs, param_grids
Run Model Function:
The model was tuned based on F1 scores from a three-fold cross validation.
def run_models(pipe_clfs, param_grids):
# The start time
start = time.time()
# The list of [best_score_, best_params_, best_estimator_]
best_score_param_estimators = []
# GridSearchCV
gs = GridSearchCV(estimator=pipe_clfs['mlp'],
param_grid=param_grids['mlp'],
scoring='f1',
n_jobs=-1,
cv=StratifiedKFold(n_splits=3,
shuffle=True,
random_state=0))
# Fit the pipeline
gs = gs.fit(X_train, y_train)
# Update best_score_param_estimators
best_score_param_estimators.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
# Save best model
best_model = gs.best_estimator_
# The end time
end = time.time()
# Print the Run time
print('Run time: ' + str(end - start))
return best_model
In the first runs, I ran the neural network on the raw data to set baseline results.
X_train, X_test, y_train, y_test= split_data(df, 'Foreclosed')
target_values(pd.Series(y_train), data=True)
pipe_clfs, param_grids = pipeline()
best_model = run_models(pipe_clfs, param_grids)
# Best multi-layer perceptron
best_model.fit(X_train, y_train)
# Table of predictions versus actuals
y_pred = best_model.predict(X_test)
target_values(pd.Series(y_pred), prediction=True)
# Overall Scores
print('Accuracy:', end=' ')
print(accuracy_score(y_test, y_pred).round(2))
print('Precision:', end=' ')
print(precision_score(y_test, y_pred).round(2))
print('F1:', end=' ')
print(f1_score(y_test, y_pred).round(2))
# Confusion Matrix
print("\nConfusion matrix:")
PredictTable = pd.crosstab(y_test, y_pred)
PredictTable.columns = ['Predicted False', 'Predicted True']
PredictTable.index = ['Actual False', 'Actual True']
print(PredictTable)
# Precision Table
print('\nFinal Precision Percentages:')
PrecisionTable = ( (PredictTable/(PredictTable.sum(0)))*100 ).round(1)
print(PrecisionTable)
# Recall Table
print('\nFinal Recall Percentages:')
RecallTable = ( (PredictTable.div(PredictTable.sum(axis=1), axis=0))*100 ).round(1)
print(RecallTable)
On the test data, the final F1 score was a pitiful 0.16. As expected, the accuracy was high at 0.92. This is because the model mostly learned what does not lead to a default rather than what does. It therefore predicted 98% of cases to be non-defaults, with 93.9% of those predictions being correct.
The false negative rate is poor: of the actual defaults, 88.9% were falsely predicted as non-defaults.
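As a quick sanity check on that number, the false negative rate is simply one minus recall. A hypothetical snippet, assuming y_test and y_pred from the baseline run above:
# False negative rate = share of actual defaults predicted as non-defaults
from sklearn.metrics import recall_score
fn_rate = 1 - recall_score(y_test, y_pred)
print(round(fn_rate * 100, 1))   # roughly 88.9 for the baseline run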
In the second model runs, I downsampled the data, randomly removing the non-default data to match the frequency of the default data in the training sample. This caused the training data to fall from 285,633 observations to 38,476.
Because I artificially balanced the data, I increased the threshold defining a default to a predicted probability of 0.75. Downsampling helps the model find the variation associated with mortgage defaults. However, because the model is trained on data that no longer represents the true balance of mortgage defaults, it overpredicts defaults, requiring a stricter threshold for labeling a default.
X_train, X_test, y_train, y_test= split_data(df, 'Foreclosed', downsamp=True)
target_values(pd.Series(y_train), data=True)
pipe_clfs, param_grids = pipeline()
best_model = run_models(pipe_clfs, param_grids)
# Best multi-layer perceptron
best_model.fit(X_train, y_train)
# Table of predictions versus actuals
y_proba = pd.DataFrame(best_model.predict_proba(X_test), columns=["Did not Default", "Default"])
y_pred = y_proba["Default"].map(lambda x: 1 if x >= 0.75 else 0)
target_values(pd.Series(y_pred), prediction=True)
# Overall Scores
print('Accuracy:', end=' ')
print(accuracy_score(y_test, y_pred).round(2))
print('Precision:', end=' ')
print(precision_score(y_test, y_pred).round(2))
print('F1:', end=' ')
print(f1_score(y_test, y_pred).round(2))
# Confusion Matrix
print("\nConfusion matrix:")
PredictTable = pd.crosstab(y_test, y_pred)
PredictTable.columns = ['Predicted False', 'Predicted True']
PredictTable.index = ['Actual False', 'Actual True']
print(PredictTable)
# Precision Table
print('\nFinal Precision Percentages:')
PrecisionTable = ( (PredictTable/(PredictTable.sum(0)))*100 ).round(1)
print(PrecisionTable)
# Recall Table
print('\nFinal Recall Percentages:')
RecallTable = ( (PredictTable.div(PredictTable.sum(axis=1), axis=0))*100 ).round(1)
print(RecallTable)
On the test data, the final F1 score almost doubled, from the baseline's 0.16 to 0.29. When the model predicted a non-default, it was wrong only 3.8% of the time. The false negative rate improved from the baseline's 88.9% to 46.3%.
However, the model greatly overpredicted defaults, predicting 19% defaults when the true value was 7%. This caused the model's precision to fall from the baseline's 30% to 19.5%.
In the final model runs, I upsampled the data, randomly duplicating the default data to match the frequency of the non-default data in the training sample. This caused the training data to increase from 285,633 observations to 532,790.
Because I artificially balanced the data, I again increased the threshold defining a default to a predicted probability of 0.75. Like downsampling, upsampling helps the model find the variation associated with mortgage defaults. However, because the model is trained on data that no longer represents the true balance of mortgage defaults, it overpredicts defaults, requiring a stricter threshold for labeling a default.
X_train, X_test, y_train, y_test= split_data(df, 'Foreclosed', upsamp=True)
target_values(pd.Series(y_train), data=True)
pipe_clfs, param_grids = pipeline()
best_model = run_models(pipe_clfs, param_grids)
# Best multi-layer perceptron
best_model.fit(X_train, y_train)
# Table of predictions versus actuals
y_proba = pd.DataFrame(best_model.predict_proba(X_test), columns=["Did not Default", "Default"])
y_pred = y_proba["Default"].map(lambda x: 1 if x >= 0.75 else 0)
target_values(pd.Series(y_pred), prediction=True)
# Overall Scores
print('Accuracy:', end=' ')
print(accuracy_score(y_test, y_pred).round(2))
print('Precision:', end=' ')
print(precision_score(y_test, y_pred).round(2))
print('F1:', end=' ')
print(f1_score(y_test, y_pred).round(2))
# Confusion Matrix
print("\nConfusion matrix:")
PredictTable = pd.crosstab(y_test, y_pred)
PredictTable.columns = ['Predicted False', 'Predicted True']
PredictTable.index = ['Actual False', 'Actual True']
print(PredictTable)
# Precision Table
print('\nFinal Precision Percentages:')
PrecisionTable = ( (PredictTable/(PredictTable.sum(0)))*100 ).round(1)
print(PrecisionTable)
# Recall Table
print('\nFinal Recall Percentages:')
RecallTable = ( (PredictTable.div(PredictTable.sum(axis=1), axis=0))*100 ).round(1)
print(RecallTable)
On the test data, the final F1 score improved from the baseline's 0.16 to 0.24. The model predicted a reasonable number of defaults, 9% when the true value was 7%. It produced a higher precision score than the downsampling run (21.2% compared to 19.5%).
However, its false negative rate is higher than the downsampling run's, at 73.2% (it was 46.3% in the downsampling run and 88.9% in the baseline run). In other words, this model has the best precision but lower recall than the downsampling run.
The downsampling and upsampling runs fared much better than the baseline. The downsampling run produced the best balance between precision and recall, leading to the most confident results. Perhaps most importantly, it had the lowest false negative rate: of the defaults in the test data, it correctly identified slightly over half (53.7%).
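As a quick arithmetic check of that balance, the F1 score is the harmonic mean of precision and recall. Using the precision and false negative rate reported above for the downsampling run:
# F1 as the harmonic mean of precision and recall (downsampling run)
precision = 0.195        # reported precision
recall = 1 - 0.463       # recall = 1 - false negative rate
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))      # ~0.29, matching the downsampling run's F1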
The upsampling run also performed much better than the baseline. However, the large size of its training dataset (532,790 observations) may have caused the model to underperform, given the limitations of the hardware used for the analysis. A more complex neural network run on a more powerful PC or, ideally, on a server with GPU resources might have enabled the upsampling run to outperform the downsampling run.
Overall, the runs were not sufficient to produce a meaningful model: the best run classified only slightly over half of the defaults correctly. This blog post examined only resampling of the training data. Better computing power would help given the data's complexity (99 features and 380,845 observations). The model failed to converge in some runs, particularly with upsampling, which could be resolved by increasing computing power and the number of training iterations. Beyond computing power, feature selection and feature engineering techniques should be deployed. In particular, an advanced technique known as feature embedding would likely improve the results in very substantive ways.