Research by
Michael Siebel

Table of Contents

Bad Banking Behavior

Analyzing Bank Mortgages during the 2008 Housing Bubble

Fannie Mae

Objectives

It has been over a decade since a mass sell-off of single-family mortgage loans and housing foreclosures popped the housing bubble and plunged the global economy into a financial crisis. To create greater transparency, Fannie Mae began releasing single-family loan performance data in 2015. This data has always been analyzed at the data's level of analysis – individuals. In other words, prior analysis often observed which individuals were at greatest risk of housing foreclosure.

I investigated this data by changing the level of analysis to that of the bank to observe banking behavior during the housing bubble and how reforms in their behavior could have lessened the risk of defaults. This research analyzes the behavior of the nine largest banks during the housing bubble and alternative scenarios in which better behavior could have led to fewer foreclosures.

Data

Most of the data comes from Fannie Mae's Loan Acquisition and Performance Data. This contains the target variable: mortgage loan foreclosures. Each observation represents a loan originated or refinanced between 2006 and 2008. The housing bubble and subsequent financial crisis are often dated to 2007–2009; most loans that ended in foreclosure originated or were refinanced only one to three years before the foreclosure, producing a peak in loans destined for foreclosure in 2007. Refinanced loans are limited to loans originating in fiscal year 2000 or later. The data is further limited to loans made by the nine largest banks at the peak of risky lending (2007). Features modeled to predict foreclosures are limited to information known at the time of loan origination, except for the date of last reporting (i.e., the window in which a foreclosure could have occurred and appear in the dataset).
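As a rough sketch, these inclusion rules could be applied in pandas as follows. The column names, and the idea that a single seller field identifies the bank, are my assumptions for illustration; the actual Fannie Mae file layout differs.

```python
import pandas as pd

def filter_loans(df: pd.DataFrame, top9_sellers: set) -> pd.DataFrame:
    """Apply the study's inclusion rules (column names are hypothetical)."""
    mask = (
        df["origination_year"].between(2006, 2008)    # bubble-era originations
        & (df["first_origination_year"] >= 2000)      # refinances of FY2000+ loans only
        & df["seller"].isin(top9_sellers)             # nine largest banks
    )
    return df[mask]
```

In practice the seller set would hold the nine banks' names exactly as they appear in the acquisition file.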

Other data was used to model loan foreclosures. See below for a list of data sources and see Data Wrangling for information on the ETL process used to gather this data:


Foreclosures

Mortgages made between 2006–2008 foreclosed at a rate of 9.7% among the top 9 banks. These foreclosures peaked in 2007 at 10.2%. Note, the research paper and Data Mining scripts show that these 9 banks issued loans more likely to be foreclosed upon than smaller banks, making this rate higher than the overall U.S. rate.

Importantly, the average mortgage amount not only grew dramatically after 2001, but mortgage amounts on loans that were eventually foreclosed upon increased at a higher rate, matching (and briefly exceeding) the mortgage amounts on loans not foreclosed upon. This implies that large banks may have been willing to make riskier loans for inflated housing prices.

Foreclosure Bar
Foreclosure Line
Mortgage Line


Michigan, Arizona, Southern California, and the New York area contain some of the highest rates of loans destined for foreclosure. The worst areas are Las Vegas and most of Florida.

State Foreclosures
Zip Foreclosures


Banks

This research investigates the differences between the 9 largest banks. Bank of America is the largest as defined by the sheer quantity of loans it made, while PNC is the smallest of the nine (note, this does not factor in the dollar amount of those loans).

Loans destined to be foreclosed upon begin to notably increase after 2004. As these foreclosures increase, the differences between the banks widen.

Overall, Flagstar Bank had the highest foreclosure rate (11.7%) between 2006 and 2008, with Bank of America immediately following it (11.6%). Wells Fargo Bank and JPMorgan Chase had the lowest foreclosure rates (7.6%), with Wells Fargo Bank holding the lowest before rounding to the first decimal place. CitiMortgage immediately followed (7.7%).

Bank Foreclosures
Best-Worst Actors Table Bank Sizes Bar


Methodology

The main challenge of modelling mortgage foreclosures is that they represent a rare event – defined as a dichotomous variable in which an occurrence of the event is low relative to cases in which that event could occur. Consequently, counts in the numerator of the rate are expected to be small by comparison to the denominator. Such small numerators in a target variable are known to create higher bias in estimation and, when forming a proportion, higher variance (King & Zeng, 2001). In layman's terms, classification models are likely to become better at predicting non-occurrences than occurrences simply because there are more examples of non-occurrences to learn from. This results in models over-predicting non-occurrences. In very imbalanced datasets, this can result in zero predictions of an occurrence.
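A small synthetic illustration of this failure mode (entirely made-up data, not the Fannie Mae loans): with a ~3% event rate and only a weak signal, a standard classifier learns to predict the majority class almost exclusively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.03).astype(int)            # rare event: ~3% occurrences
X = rng.normal(size=(n, 1)) + 0.3 * y[:, None]    # weakly informative feature

model = LogisticRegression().fit(X, y)
pred = model.predict(X)
# With the default 0.5 threshold, the model predicts far fewer
# occurrences than actually exist -- often none at all.
```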

Further, the features in the Fannie Mae data do not correlate well with foreclosures. To this end, I linked in many features from other data sources. Still, it remains difficult to predict foreclosures without overfitting the training data.

To solve these two problems, I use two primary ensembling techniques. The first, algorithm-based ensembling, rebalances the data to improve foreclosure predictions. The second uses vote ensembling to increase the generalization of each model's results.

Balancing Predictions

The chart below displays a hypothetical dataset in which the target variable occurs at a rate of 20%. In order to ensure a predictive model remains balanced in its learning from occurrences and non-occurrences, an analyst can use an artificial intervention to balance the classes. One approach is to draw small samples with balanced classes. These samples produce relatively weak and simple models, which are then ensembled into stronger predictions.

Moving from Step 1 to Step 2 displays how this example data could be sampled using an ensembling technique. The analyst converts the training data into four balanced datasets in which the target variable occurs at a rate of 50%. These samples can encompass the full training data, and yet each model learns from the same number of occurrences as non-occurrences. Ensembling these simpler models involves averaging their predicted probabilities and then converting those averages into a dichotomous result using a classification threshold (i.e., the hard limit). Finally, that prediction model is used to predict on the validation data for model evaluation. The validation data remains in its raw form (i.e., without balanced classes) so that predictions are evaluated against natural data (i.e., no analyst interventions).
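Steps 1–2 can be sketched as follows — a minimal illustration with shallow decision trees, not the project's actual models:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_ensemble(X, y, n_models=4, seed=0):
    """Fit several weak models, each on a 1:1 resample of the two classes."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    models = []
    for _ in range(n_models):
        # Keep every occurrence; sample an equal number of non-occurrences.
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))
    return models

def predict_ensemble(models, X, threshold=0.5):
    """Average predicted probabilities, then apply the classification threshold."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= threshold).astype(int)
```

Each weak model sees balanced classes, while the averaged probabilities and the threshold are what get evaluated on the untouched validation data.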

While this approach overcomes the issue of over-predicting non-occurrences, it is prone to over-predicting occurrences. The ensembled model learns which situations cause or prevent a mortgage foreclosure, but it is biased towards predicting mortgage foreclosures at too high a rate. Fortunately, this bias can be minimized by optimizing the classification threshold.

As shown in Step 3, the testing data remains imbalanced, and the predicted probabilities are equally likely to be above or below 0.5 – the standard classification threshold. Because the analyst intervened and forced balanced classes in the training data, the analyst should compensate by increasing the threshold to an appropriate level. In this example, the analyst could choose a 0.8 threshold, which would provide perfect predictions on the test set. Note, this decision is made on validation data, separate from the testing data which makes the final model evaluations.

In modeling mortgage foreclosures, the classification threshold that maximizes F1 score is selected, as this score best balances recall and precision. Many models were run on different cohorts and using different algorithms; more on this in the next section. The results are quite poor: one of the first three models on all nine banks achieved only 0.24 precision and, despite the stricter threshold, predicted 19% foreclosures when the testing data contained only 10% foreclosures. (Note, while the full data contained 9.7% foreclosures, the testing data contained 10%.)
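The threshold selection itself is simple: score a grid of candidate thresholds on the validation data and keep the one with the best F1 (a sketch; the project's actual search may differ):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_val, proba_val):
    """Return the classification threshold that maximizes F1 on validation data."""
    grid = np.linspace(0.05, 0.95, 91)
    scores = [f1_score(y_val, (proba_val >= t).astype(int), zero_division=0)
              for t in grid]
    return float(grid[int(np.argmax(scores))])
```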

Model_Ensembling_Balancing
Thresholds
First_Model_Table
Voting Architecture

Three models were run on each of the 9 banks plus on an all-banks dataset (for a total of 30 models). As described above, the three models resampled the data to better balance foreclosures against non-foreclosures. However, while one uses an even balance (1:1 between the two outcomes), one uses a 3:1 balance, and the other a 5:1 balance. This is done to ensure greater diversity between models. The two that favor non-foreclosures still oversample foreclosures, as a true dataset (such as the validation and testing data) favors non-foreclosures at 10:1.

The 1:1 model is a random forest that uses principal component analysis (PCA) to reduce the features to 10. The 3:1 model is an AdaBoost decision tree algorithm called SAMME.R, which uses all 42 variables. The 5:1 model is another random forest that selects a different square-root-sized random subset of the 42 variables per tree. Each of the three models receives one vote per bank.
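In scikit-learn terms, the three per-bank models might look like this. Hyperparameters are illustrative guesses, and note that recent scikit-learn releases have removed the SAMME.R option, so the AdaBoost model below uses the library default rather than the variant named above:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.pipeline import make_pipeline

# 1:1 model: random forest on 10 principal components
model_1to1 = make_pipeline(
    PCA(n_components=10),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# 3:1 model: boosted decision stumps on all 42 variables
# (the project used the SAMME.R variant of AdaBoost)
model_3to1 = AdaBoostClassifier(n_estimators=100, random_state=0)

# 5:1 model: random forest drawing a sqrt-sized feature subset per tree
model_5to1 = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)
```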

The figure below illustrates the architecture in which these models were ensembled—simplified to show only three banks. The bottom layer displays the models, and the arrows represent a vote of foreclosure or not being sent to another layer. The all-banks dataset is a special case, which receives a middle layer for the first round of voting. The all-banks dataset attempts to capture cases of foreclosure that are likely to occur regardless of bank specifics. As a result, I imposed a strict criterion of a unanimous vote, meaning the three models must all agree that a prediction is a foreclosure for it to be marked as a foreclosure. The top layer receives three votes from the three models per bank plus a vote from the middle layer's all-banks result. The top layer requires a majority vote—three out of the four votes must agree that a prediction is a foreclosure for it to be marked as a foreclosure.
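The two voting rules can be written down directly — a sketch under the assumptions above, where votes are 0/1 arrays with one row per model and one column per loan:

```python
import numpy as np

def final_votes(bank_votes, all_banks_votes):
    """Top-layer decision for one bank's loans.

    bank_votes:      (3, n) array -- one 0/1 vote per bank-level model
    all_banks_votes: (3, n) array -- votes from the three all-banks models
    """
    # Middle layer: the all-banks vote requires unanimity.
    all_banks = np.all(all_banks_votes, axis=0).astype(int)
    # Top layer: 3 of 4 votes (3 bank models + 1 all-banks vote) must agree.
    total = bank_votes.sum(axis=0) + all_banks
    return (total >= 3).astype(int)
```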

This voting architecture is designed to use the algorithm-based ensembling to focus model attention on what causes a foreclosure, and vote ensembling to keep predicted foreclosures rare.

Model_Ensembling_Architecture
Modeling Performance

After all votes were cast, the final model, which stacks each of the final bank results from the top layer on one another, correctly predicted a foreclosure rate of 10%. The overall accuracy rate appears high at 0.88, but it masks issues that can be understood through the low F1 score (0.38). The F1 score is a better measure of accuracy as it accounts for the low prevalence of foreclosures.
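To see why, consider an illustrative confusion (made-up counts chosen to echo the reported scores, not the model's actual predictions): out of 1,000 loans with 100 true foreclosures, a model that catches only 38 of them while raising 62 false alarms still looks accurate overall.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 100 + [0] * 900            # 10% true foreclosures
y_pred = ([1] * 38 + [0] * 62             # 38 caught, 62 missed
          + [1] * 62 + [0] * 838)         # 62 false alarms

print(accuracy_score(y_true, y_pred))     # 0.876 -- looks strong
print(f1_score(y_true, y_pred))           # 0.38  -- reveals the weakness
```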

Certainly, the voting architecture improved the results of the individual models. It was low on bias but high in variation; each prediction was prone to error (i.e., high variation), but the errors were about as likely to be a false positive (0.37 precision) as a false negative (0.39 recall).

A similar pattern held across banks, with all but one bank's predicted foreclosures falling within a percentage point of actual foreclosures. SunTrust Mortgage performed the worst at two percentage points off actual foreclosures, predicting 8% foreclosures instead of 10%. It contained a 0.09 point gap between precision (0.41) and recall (0.32), causing more false negatives (i.e., a bias towards non-foreclosures). All other banks contained between a 0–0.05 point gap between precision and recall. F1 scores ranged from 0.31 (AmTrust Bank) to 0.43 (Bank of America).

Therefore, these predictions appear valuable in the aggregate (e.g., bank predictions), but not as individual predictions (e.g., mortgage recipient predictions). In other words, this model seems reasonably equipped to predict foreclosure rates for bank lenders, but not equipped to predict whether a loanee will eventually have their home foreclosed upon.

Final_Model_Table
BoA_Model_Table JPMorgan_Model_Table AmTrust_Model_Table
Wells_Fargo_Model_Table GMAC_Model_Table PNC_Model_Table
Citi_Model_Table SunTrust_Model_Table Flagstar_Model_Table


Analysis

Analysis was conducted on the full dataset, combining the training, validation, and testing data. Below contains information on five features: credit score, debt-to-income ratio, loan-to-value ratio, median household income at the 3-digit zip code level, and the dollar change in mortgage loans made over the prior 1 year and 5 years. The latter features were created by taking the total loan amount during a fiscal year quarter for each bank within a 3-digit zip code.

Using data mining techniques, each feature was examined at each bank. Then, the feature was replaced by an improved and a weakened assumption based on the inter-quartile range (25th–75th percentiles) of the feature across all banks, and predicted probabilities were generated to see what the expected foreclosure rate would be if each bank's behavior were different. For example, a high credit score is associated with fewer foreclosures. Among all banks, the average credit score was 719 (on a scale of 300 to 850). I modified the credit score at each bank to the 75th percentile—an improved assumption of a 770 credit score—and to the 25th percentile—a weakened assumption of a 675 credit score. I left all other feature values unchanged. I ran these values through the saved model detailed in the section above and analyzed the change in foreclosure rates. One can interpret the findings as: “If GMAC Mortgage only lent to those with a credit score of 770, with all other considerations staying the same, its foreclosure rate is predicted to fall from 9.7% to 1%.”
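The substitution step can be sketched generically (a hypothetical helper; `model` stands for the saved ensemble from the previous section, and the column index is an assumption):

```python
import numpy as np

def scenario_rate(model, X, feature_idx, value):
    """Predicted foreclosure rate after forcing one feature to a fixed value
    (e.g. the cross-bank 75th percentile) while leaving all others unchanged."""
    X_scenario = X.copy()
    X_scenario[:, feature_idx] = value
    return model.predict(X_scenario).mean()

# e.g. improved assumption for credit score (column index hypothetical):
# scenario_rate(model, X_bank, CREDIT_IDX, np.percentile(credit_all, 75))
```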



Credit Score

Data Mining

Credit score is perhaps the feature most directly associated with the performance of a mortgage loan, as it is a measure of one's past ability to pay off debt. It showed a large difference between loans foreclosed (with an average credit score of 668) and loans not destined to be foreclosed upon (with an average credit score of 722). However, there is little difference between banks, ranging from the best actor, PNC Bank (728), to the worst actor, GMAC Mortgage (710). Notably, Wells Fargo Bank, boasting the lowest foreclosure rate, only possesses the third highest credit score, and Flagstar Bank, holding the highest foreclosure rate, only possesses the third lowest credit score.

This seems to indicate that a change in targeting individuals with certain credit scores should highly influence foreclosure rates, but that banks appeared unwilling to change their credit score standards if other banks did not follow suit.

Credit_Score_Best_Density Credit_Score_Worst_Density
Credit_Score_Bar
Model Predictions

Overall, foreclosure rates were predicted to improve from 9.7% to 1.6% if all mortgage holders had a credit score of 770 and to weaken to 14.1% if all mortgage holders had a credit score of 675. Bank of America had the largest change, improving from 11.6% to 1.9%. Meanwhile, the bank with the highest credit score, PNC Bank, is predicted to have the least improvement, from 8.8% to 3.1%. However, Bank of America is expected to have the largest increase in foreclosures if credit scores were weakened to 675, from 11.6% to 17.3%, compared to GMAC Mortgage's increase from 9.7% to 12.2%.

Predicted_Credit_Score_Table
Predicted_Credit_Score_Best_Bar Predicted_Credit_Score_Worst_Bar
Main Point

Credit score displayed a substantial relationship with foreclosures, but average credit scores differ little between banks. Bank of America had the most impact by changes in credit score. Most notably, the highest predicted foreclosure rate among all features analyzed in this section was when Bank of America’s credit score assumptions weakened.



Debt-to-Income (DTI) Ratio

Data Mining

The average DTI ratio for loans destined for foreclosure is 41.4%, versus 37.8% for those not destined for foreclosure. Similar to credit score, banks did not vary greatly, with Flagstar Bank the highest (40.9%) and CitiMortgage the lowest (35.3%). Even the best actor, CitiMortgage, had roughly as many loan recipients with DTI ratios at 60% as it did at 20%. A general rule of thumb in personal finance is to keep a DTI ratio of 35% or less, implying that loan recipients between 2006–2008 were exceeding conventional standards for DTI.

DTI_Best_Density DTI_Worst_Density
DTI_Bar
Model Predictions

Overall, foreclosure rates were predicted to improve from 9.7% to 6.5% if all mortgage holders had an improved DTI ratio of 29% and to worsen to 11.8% if all mortgage holders had a weakened DTI ratio of 47%. Most banks were predicted to reduce foreclosures to 5%–5.5%, with Wells Fargo Bank improving to 3.5% and Bank of America improving to only 8.9% as outliers. Under improved assumptions, Flagstar Bank was predicted to have the largest decrease in foreclosures, from 11.7% to 5.3%. Under weakened assumptions, PNC Bank was predicted to have the largest increase in foreclosures, from 8.8% to 12.5%. Again, Bank of America was predicted to have the highest foreclosure rate, 14.8%, under a DTI ratio of 47%. However, one bank, SunTrust Mortgage, decreased its foreclosure rate under weakened assumptions, showing that not all potential recipients with high DTI ratios were risky.

Predicted_DTI_Table
Predicted_DTI_Best_Bar Predicted_DTI_Worst_Bar
Main Point

All banks were approving loans to recipients with high DTI ratios. In general, foreclosures decreased substantially with improved DTI assumptions, but did not increase as substantially when assumptions were weakened.



Loan-to-Value (LTV) Ratio

Data Mining

LTV data displays 3 spikes, around 95, 90, and 80. There is a long tail in which presumably wealthy buyers were putting down 40% (an LTV of 60) or more for their down payment. Loans destined to foreclose averaged a down payment of 20.5% (an LTV of 79.5), and loans not eventually foreclosed upon averaged a down payment of 28.7% (an LTV of 71.3). Loans with LTVs of 90 and 95 displayed particularly high levels of foreclosure. Bank LTVs ranged only from 69.8 (GMAC Mortgage) to 74.6 (AmTrust Bank). AmTrust Bank contained few outliers: people who made large down payments. Interestingly, Wells Fargo Bank, the bank with the lowest foreclosure rate, had the second highest LTV.

LTV_Best_Density LTV_Worst_Density
LTV_Bar
Model Predictions

Overall, foreclosure rates were predicted to improve from 9.7% to 5.3% if all mortgage holders had an LTV of 63 and to weaken to 13.2% if all mortgage holders had an LTV of 84. Under improved assumptions, SunTrust Mortgage changed the most, from 10.3% to 3.4%. Under weakened assumptions, Bank of America changed the most, from 11.6% to 16.4% – again, the worst foreclosure rate. SunTrust Mortgage was the only bank to improve (slightly) as LTV increased.

Predicted_LTV_Table
Predicted_LTV_Best_Bar Predicted_LTV_Worst_Bar
Main Point

LTV was the second most impactful feature analyzed in this document after credit score. Bank of America increased in foreclosures more than any other bank to an extremely high rate under weakened assumptions.



Median Household Income

Data Mining

Median household income at the 3-digit zip code level does not necessarily reflect the income of the loan recipient but does indicate whether banks are focusing their loan efforts in poorer or wealthier areas. The average income for loans destined for foreclosure ($48,204) was very close to that of loans that did not foreclose ($48,797). Between banks, income only ranged from $47,264 (SunTrust Mortgage) to $49,491 (CitiMortgage).

MHI_Best_Density MHI_Worst_Density
MHI_Bar
Model Predictions

Due to low data variation, median household income was adjusted only slightly, from an improved assumption of $53,615 to a weakened assumption of $43,298. Perhaps because a $10,000 income difference is too small, banks overall did not reduce predicted foreclosures under improved assumptions or increase foreclosures under weakened assumptions.

Predicted_MHI_Table
Predicted_MHI_Best_Bar Predicted_MHI_Worst_Bar
Main Point

Foreclosure predictions seem random: they change only slightly and often in the wrong direction. This could be related to two issues: 1) there is little correlation between median household income at the 3-digit zip code level and foreclosures, and/or 2) there is little data variation in this feature. Given the latter potential explanation, I adjusted the assumptions to the maximum (an income of $101,651) and the minimum ($28,832).

Best/Worst-Case Scenarios

After adjusting assumptions to the richest region ($101,651) and the poorest region ($28,832), the predictions begin looking more like one would expect. Emphasizing the richest region is predicted to decrease foreclosures from 9.7% to 4.9%, although emphasizing the poorest region is not predicted to change foreclosures greatly (from 9.7% to 9.9%).

Predicted_MHI_Table_Strongest
Predicted_MHI_Best_Bar_Strongest Predicted_MHI_Worst_Bar_Strongest


Loan Change

Data Mining

Banks increasing the total value of their loans over 1 year did not correspond with lower foreclosures. However, there appears to be a minor effect among banks increasing their loans over 5 years; loans destined for foreclosure on average increased $63,530 over 5 years, compared to a $61,193 increase in non-foreclosed loans. There are dramatic bank differences in loan changes, with AmTrust increasing only $34,664 over 5 years and Wells Fargo increasing $96,423.

LC1_Best_Density LC1_Worst_Density
LC1_Bar
LC5_Best_Density LC5_Worst_Density
LC5_Bar
Model Predictions

Under improved assumptions, 1-year loan changes increased $2,292 and 5-year loan changes increased $37,980, while overall foreclosure rates were predicted to improve from 9.7% to 7.9%. Under weakened assumptions, 1-year loan changes increased $27,661 and 5-year loan changes increased $86,358, while overall foreclosure rates were predicted to worsen from 9.7% to 12.9%. Wells Fargo Bank decreased in foreclosures the most, from 7.6% to 3.6%, while Bank of America increased in foreclosures the most, from 11.6% to 17%.

Predicted_LC1_Table
Predicted_LC1_Best_Bar Predicted_LC1_Worst_Bar Predicted_LC5_Best_Bar Predicted_LC5_Worst_Bar
Main Point

Changes in loans vary across banks more than any other feature examined in this document. Likewise, predicted changes in foreclosures varied greatly between banks. Wells Fargo Bank substantially decreased foreclosures under improved assumptions but also slightly decreased foreclosures under weakened assumptions. Overall, changes in loans over time impacted foreclosures more than median household income at 3-digit zip codes but less than the other features examined in this document.



Conclusion

After 2005, newly issued mortgage loans grew more likely to foreclose, and differences in bank foreclosure rates widened. Wells Fargo Bank and JPMorgan Chase held the lowest foreclosure rates between 2006–2008, while Flagstar Bank and Bank of America held the highest foreclosure rates.

Of the five features focused on in this analysis, credit score had the highest impact on foreclosures but the smallest discrepancy between banks. With improved assumptions, a high credit score of 770 is predicted to reduce foreclosures to between 1% and 3.1%. With weakened assumptions, a low credit score of 675 is predicted to increase foreclosures to between 10.9% and 17.3%.

Of the five features focused on in this analysis, loan change had the lowest impact on foreclosures but the highest discrepancy between banks. With improved assumptions, a low 1-year ($2,292) and 5-year ($37,980) increase in mortgage loans made is predicted to reduce foreclosures to between 3.6% and 13.6%. With weakened assumptions, a high 1-year ($27,661) and 5-year ($86,358) increase in mortgage loans is predicted to increase foreclosures to between 6.4% and 17%.

Bank of America always displayed the highest foreclosure rate when assumptions were weakened, even though it possesses a slightly lower rate than Flagstar Bank without any assumptions made. Similarly, Wells Fargo usually (but not always) had the lowest foreclosure rate when assumptions were improved, even though it possesses the same rate as JPMorgan Chase without any assumptions made.