Bad Banking Behavior
Analyzing Bank Mortgages during the 2008 Housing Bubble

Objectives
It has been over a decade since a mass sell-off of single-family mortgage loans and housing foreclosures popped the housing bubble and plunged the global economy into a financial crisis. To create greater transparency, Fannie Mae began releasing single-family loan performance data in 2015. This data has typically been analyzed at its native level of analysis: the individual. In other words, prior analyses observed which individuals were at greatest risk of housing foreclosure.
I investigated this data by changing the level of analysis to that of the bank, to observe banking behavior during the housing bubble and how reforms in that behavior could have lessened the risk of defaults. This research analyzes the behavior of the nine largest banks during the housing bubble and explores alternative scenarios in which better behavior could have led to fewer foreclosures.
Data
Most of the data comes from Fannie Mae’s Loan Acquisition and Performance Data, which contains the target variable: mortgage loan foreclosures. Each observation represents a loan originated or refinanced between 2006 and 2008. The housing bubble and subsequent financial crisis are often dated to 2007–2009; most loans that led to housing foreclosures were originated or refinanced only one to three years before foreclosure, producing a peak in loans destined for foreclosure in 2007. Refinanced loans are limited to loans originating in fiscal year 2000 or later. The data is further limited to loans made by the nine largest banks at the peak of risky lending (2007). Features modeled to predict foreclosures are limited to information known at the time of loan origination, except for the date of last reporting (i.e., the window in which a foreclosure could have occurred and appeared in the dataset).
Other data sources were used to model loan foreclosures. See below for a list of data sources (a sketch of how they can be joined follows the list) and see Data Wrangling for information on the ETL process used to gather this data:
1) Fannie Mae Loan Acquisition and Performance Data [Individual Mortgage Loans],
2) Past Fannie Mae Loan Acquisition Data [Changes in Loan Amounts],
3) U.S. Census Bureau, Small Area Estimates Branch [Median Household Income by County],
4) Federal Reserve Economic Data (FRED) [Macroeconomic Data related to the Housing Market],
5) Federal Deposit Insurance Corporation (FDIC) Data [Information on FDIC-backed Banks]
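To make the linkage concrete, below is a minimal sketch of how these sources could be joined, assuming pandas; the file names and key columns (zip3, quarter) are hypothetical stand-ins, and the actual keys are documented in Data Wrangling.

    import pandas as pd

    # Hypothetical extracts of sources 1, 3, and 4 above.
    loans = pd.read_csv("fannie_mae_acquisitions.csv")   # loan-level records
    income = pd.read_csv("census_saipe_income.csv")      # median household income by area
    fred = pd.read_csv("fred_housing_macro.csv")         # quarterly macroeconomic series

    # Attach area income and quarter-level macro conditions to each loan.
    merged = (
        loans.merge(income, on="zip3", how="left")
             .merge(fred, on="quarter", how="left")
    )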
Foreclosures
Mortgages made between 2006 and 2008 foreclosed at a rate of 9.7% among the top 9 banks. These foreclosures peaked in 2007 at 10.2%. Note, the research paper and Data Mining scripts show that these 9 banks issued loans more likely to be foreclosed upon than those of smaller banks, raising this rate above the overall U.S. rate.
Importantly, the average mortgage amount not only grew dramatically after 2001, but mortgage amounts on loans that were eventually foreclosed upon grew at a higher rate, matching (and briefly exceeding) the mortgage amounts on loans not foreclosed upon. This implies that large banks may have been willing to make riskier loans against inflated housing prices.



Michigan, Arizona, Southern California, and the New York area contain some of the highest rates of loans destined for foreclosure. The worst areas are Las Vegas and most of Florida.


Banks
This research investigates the differences between the 9 largest banks.
Loans destined to be foreclosed upon begin to increase notably after 2004. As these foreclosure-bound loans increase, the differences between the banks widen.
Overall,



Methodology
The main challenge of modeling mortgage foreclosures is that they represent a rare event – defined as a dichotomous variable in which occurrences of the event are few relative to cases in which the event could occur. Consequently, counts in the numerator of the rate are expected to be small by comparison to the denominator. Such small numerators in a target variable are known to create higher bias in estimation and, when forming a proportion, higher variance (King & Zeng, 2001). In layman’s terms, classification models are likely to become better at predicting non-occurrences than occurrences simply because there are more examples of non-occurrences to learn from. This results in models over-predicting non-occurrences. In very imbalanced datasets, this can result in zero predictions of an occurrence.
Further, the features in the Fannie Mae data do not correlate well with foreclosures. To this end, I linked in many features from other data sources. Still, it remains difficult to predict foreclosures without overfitting the training data.
To solve these two problems, I use two primary ensembling techniques. The first, algorithm-based ensembling, rebalances the data to strengthen foreclosure predictions. The second uses vote ensembling to increase the generalization of each model’s results.
Balancing Predictions
The chart below displays a hypothetical dataset in which the target variable occurs at a rate of 20%. To ensure a predictive model remains balanced in its learning from occurrences and non-occurrences, an analyst can artificially intervene to balance the classes. One approach is to draw small samples with balanced classes. These samples produce relatively weak and simple models, which are then ensembled into stronger predictions.
Moving from Step 1 to Step 2 shows how this example data could be sampled using an ensembling technique. The analyst converts the training data into four balanced datasets in which the target variable occurs at a rate of 50%. Together, these samples can encompass the full training data, and yet each model learns from the same number of occurrences as non-occurrences. Ensembling these simpler models involves averaging their predicted probabilities and then converting those averaged probabilities into a dichotomous result using a classification threshold (i.e., the hard limit). Finally, that ensembled model predicts on the validation data for model evaluation. The validation data remains in its raw form (i.e., without balanced classes) so that the predictions are evaluated against natural data (i.e., with no analyst interventions). A sketch of this sampling-and-averaging procedure follows.
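Below is a minimal sketch of this approach, assuming scikit-learn and pandas DataFrames train and validation with a binary foreclosed target; the variable and column names are hypothetical, not the project’s actual code.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def balanced_samples(df, target, n_models, rng):
        """Yield datasets in which the target occurs at a 50% rate."""
        pos = df[df[target] == 1]
        neg = df[df[target] == 0]
        for _ in range(n_models):
            # Pair every occurrence with an equal-sized draw of non-occurrences.
            yield pd.concat([pos, neg.sample(len(pos), random_state=rng)])

    rng = np.random.RandomState(0)
    models = []
    for sample in balanced_samples(train, "foreclosed", n_models=4, rng=rng):
        X, y = sample.drop(columns="foreclosed"), sample["foreclosed"]
        models.append(DecisionTreeClassifier(max_depth=5).fit(X, y))

    # Ensemble by averaging predicted probabilities on the raw (unbalanced) validation set.
    X_valid = validation.drop(columns="foreclosed")
    avg_proba = np.mean([m.predict_proba(X_valid)[:, 1] for m in models], axis=0)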
While this approach overcomes the issue of over-predicting non-occurrences, it is prone to over-predicting occurrences. By over-predicting the rare event, the ensembled model shows that it has learned which situations prevent or cause a mortgage foreclosure, although it is biased toward predicting mortgage foreclosures at higher rates. Fortunately, this bias can be minimized by optimizing the classification threshold.
As shown in Step 3, the testing data remains imbalanced, and the predicted probabilities are equally likely to be above or below 0.5 – the standard classification threshold. Because the analyst intervened and forced balanced classes in the training data, the analyst should compensate by increasing the threshold to an appropriate level. In this example, the analyst could choose a 0.8 threshold, which would produce perfect predictions on the test set. Note, this decision is made on validation data, separate from the testing data used for the final model evaluations.
In modeling mortgage foreclosures, the classification threshold that maximizes the F1 score is selected, as this best balances recall and precision (a sketch of this threshold search follows). Many models were run on different cohorts and using different algorithms; more on this in the next section. The results are quite poor: one of the first three models on all nine banks achieved only 0.24 precision and, despite the stricter threshold, predicted 19% foreclosures when the testing data contained only 10% foreclosures. (Note, while the full data contained 9.7% foreclosures, the testing data contained 10%.)
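A hedged sketch of that threshold search, continuing the avg_proba example above (y_valid is the assumed array of raw validation labels):

    import numpy as np
    from sklearn.metrics import f1_score

    # Evaluate candidate thresholds against the raw validation labels.
    thresholds = np.linspace(0.05, 0.95, 19)
    f1_scores = [f1_score(y_valid, (avg_proba >= t).astype(int)) for t in thresholds]
    best_threshold = thresholds[int(np.argmax(f1_scores))]
    # best_threshold is then applied, unchanged, to the testing data.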



Voting Architecture
Three models were run on each of the 9 banks plus an all-banks dataset (for a total of 30 models). As described above, the three models resampled the data to better balance foreclosures against non-foreclosures; however, while one substitutes an even balance (1:1 between the two outcomes), one substitutes a 3:1 balance, and the other substitutes a 5:1 balance. This is done to ensure greater diversity between models. The two that favor non-foreclosures still oversample foreclosures, as a true dataset (such as the validation and testing data) favors non-foreclosures at roughly 10:1.
The 1:1 model is a random forest that uses principal component analysis (PCA) to reduce the features to 10. The 3:1 model is an AdaBoost decision tree algorithm called SAMME.R, which uses all 42 variables. The 5:1 model is another random forest that selects a different square-root-sized subset of the 42 variables for each tree. Each of the three models receives one vote per bank. A sketch of the three models follows.
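A sketch of the three models under these descriptions, assuming scikit-learn; hyperparameters not stated in the text (tree depth, estimator counts) are illustrative guesses, and newer scikit-learn releases have deprecated the SAMME.R option.

    from sklearn.decomposition import PCA
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # 1:1 resample: random forest on the first 10 principal components.
    model_1to1 = make_pipeline(PCA(n_components=10), RandomForestClassifier())

    # 3:1 resample: AdaBoost (SAMME.R) over shallow decision trees, all 42 features.
    model_3to1 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                    algorithm="SAMME.R")

    # 5:1 resample: random forest drawing sqrt(42) features for each tree.
    model_5to1 = RandomForestClassifier(max_features="sqrt")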
The figure below illustrates the architecture in which these models were ensembled, simplified to show only three banks. The bottom layer displays the models, and each arrow represents a vote of foreclosure or non-foreclosure sent to another layer. The all-banks dataset is a special case, which receives a middle layer for the first round of voting. The all-banks dataset attempts to capture cases of foreclosure that are likely to occur regardless of bank specifics. As a result, I imposed a strict criterion of a unanimous vote: the three models must all agree that a prediction is a foreclosure for it to be marked as a foreclosure. The top layer receives three votes from the three models per bank plus a vote from the all-banks middle layer. The top layer requires a majority vote: three out of the four votes must agree that a prediction is a foreclosure for it to be marked as a foreclosure.
This voting architecture is designed to use algorithm-based ensembling to focus model attention on which cases foreclose, and vote ensembling to keep predicted foreclosures rare. A sketch of the two voting rules follows.
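A minimal sketch of the two voting rules, assuming 0/1 vote arrays (one row per model, one column per loan); the function names are hypothetical.

    import numpy as np

    def all_banks_vote(votes):
        """Middle layer: unanimous rule - all three all-banks models must vote foreclosure."""
        return np.all(votes, axis=0).astype(int)

    def top_layer_vote(bank_votes, all_banks):
        """Top layer: majority rule - at least 3 of the 4 votes must be foreclosure."""
        total = np.sum(bank_votes, axis=0) + all_banks
        return (total >= 3).astype(int)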

Modeling Performance
After all votes were cast, the final model, which stacks each of the final bank results from the top layer on one another, correctly predicted a foreclosure rate of 10%. The overall accuracy appears high at 0.88, but it masks issues that are revealed by the low F1 score (0.38). The F1 score is a better measure of accuracy here as it accounts for the low prevalence of foreclosures.
Certainly, the voting architecture improved the results of the individual models. It was low in bias but high in variance; each prediction was prone to error (i.e., high variance), but the errors were about as likely to be a false positive (0.37 precision) as a false negative (0.39 recall).
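These figures can be reproduced with standard scikit-learn metrics, assuming arrays y_test and y_pred holding the true labels and the final voted predictions:

    from sklearn.metrics import (accuracy_score, f1_score,
                                 precision_score, recall_score)

    print("accuracy :", accuracy_score(y_test, y_pred))   # ~0.88 reported above
    print("f1       :", f1_score(y_test, y_pred))         # ~0.38
    print("precision:", precision_score(y_test, y_pred))  # ~0.37
    print("recall   :", recall_score(y_test, y_pred))     # ~0.39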
A similar pattern held across banks, with all but one bank’s predicted foreclosure rate falling within a percentage point of its actual rate.
Therefore, these predictions appear valuable in the aggregate (e.g., bank-level predictions), but not as individual predictions (e.g., mortgage-recipient predictions). In other words, this model seems reasonably equipped to predict foreclosure rates for bank lenders, but not to predict whether an individual borrower will eventually have their home foreclosed upon.


Analysis
Analysis was conducted on the full dataset, combining the training, validation, and testing data. Below is information on five features: credit score, debt-to-income ratio, loan-to-value ratio, median household income at the 3-digit zip code level, and the dollar change in mortgage loans made 1 year ago and 5 years ago. The latter feature was created by taking the total loan amount during each fiscal-year quarter for each bank within a 3-digit zip code (a sketch of this calculation follows).
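A hedged sketch of that calculation, assuming a pandas DataFrame loans with hypothetical columns bank, zip3, quarter, and loan_amount:

    # Total loan amount per bank, 3-digit zip, and fiscal-year quarter.
    totals = loans.groupby(["bank", "zip3", "quarter"])["loan_amount"].sum()

    # Dollar change versus 4 and 20 quarters earlier (assumes each bank/zip3
    # group has consecutive, chronologically sorted quarters).
    change_1yr = totals - totals.groupby(level=["bank", "zip3"]).shift(4)
    change_5yr = totals - totals.groupby(level=["bank", "zip3"]).shift(20)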
Using data mining techniques, each feature was examined at each bank. Then, the feature was replaced by an improved and a weakened assumption based on the inter-quartile range (25th–75th percentiles) of the feature across all banks, and predicted probabilities were generated to see what the expected foreclosure rate would be if each bank’s behavior were different. For example, a high credit score is associated with fewer foreclosures. Among all banks, the average credit score was 719 (on a scale of 300 to 850). I modified the credit score at each bank to the 75th percentile (an improved assumption of a 770 credit score) and to the 25th percentile (a weakened assumption of a 675 credit score). I left all other feature values unchanged. I ran these values through the saved model detailed in the section above and analyzed the change in foreclosure rates. One can interpret the findings as: “If all mortgage holders had this feature value, the foreclosure rate would have been this much better or worse.” A sketch of this scenario analysis follows.
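A sketch of this scenario analysis for credit score, where X_full (all features for every loan) and ensemble_predict (the saved voting pipeline) are hypothetical stand-ins:

    # Replace one feature with its cross-bank percentile, leaving the rest unchanged.
    improved = X_full.assign(credit_score=770)   # 75th percentile
    weakened = X_full.assign(credit_score=675)   # 25th percentile

    # Predicted foreclosure rates under each counterfactual.
    rate_improved = ensemble_predict(improved).mean()
    rate_weakened = ensemble_predict(weakened).mean()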
Credit Score
Data Mining
Credit score is perhaps the feature most directly associated with the performance of a mortgage loan, as it is a measure of one’s past ability to pay off debt. It contained a large difference between loans foreclosed upon (with an average credit score of 668) and loans not destined to be foreclosed upon (with an average credit score of 722). However, there is little difference between banks, ranging from the best actor
This seems to indicate that a change in targeting individuals with certain credit scores should strongly influence foreclosure rates, but that banks appeared unwilling to change their credit score standards if other banks did not follow suit.



Model Predictions
Overall, foreclosure rates were predicted to improve from 9.7% to 1.6% if all mortgage holders had a credit score of 770 and to worsen to 14.1% if all mortgage holders had a credit score of 675.



Main Point
Credit score displayed a substantial relationship with foreclosures, but average credit scores differed little between banks.
Debt-to-Income (DTI) Ratio
Data Mining
The average DTI ratio is 41.4% for loans destined for foreclosure and 37.8% for those not destined for foreclosure. Similar to credit score, banks did not range greatly with



Model Predictions
Overall, foreclosure rates were predicted to improve from 9.7% to 6.5% if all mortgage holders had an improved DTI ratio of 29% and to worsen to 11.8% if all mortgage holders had a weakened DTI ratio of 47%. Most banks were predicted to reduce foreclosures to 5%–5.5%, with



Main Point
All banks were approving loans to recipients with high DTI ratios. In general, foreclosures decreased substantially under improved DTI assumptions but did not increase as substantially when the assumptions were weakened.
Loan-to-Value (LTV) Ratio
Data Mining
LTV data displays three spikes, around 95, 90, and 80. There is a long tail in which presumably wealthy buyers put down 40% (an LTV of 60) or more as their down payment. Loans destined to foreclose averaged a down payment of 20.5% (an LTV of 79.5), and loans not eventually foreclosed upon averaged a down payment of 28.7% (an LTV of 71.3). Loans with LTVs of 90 and 95 displayed particularly high levels of foreclosure. Bank LTVs ranged from only 69.8 (



Model Predictions
Overall, foreclosure rates were predicted to improve from 9.7% to 5.3% if all mortgage holders had an LTV of 63 and to worsen to 13.2% if all mortgage holders had an LTV of 84. Under improved assumptions,



Main Point
LTV was the second most impactful feature analyzed in this document after credit score.
Median Household Income
Data Mining
Median household income at the 3-digit zip code level does not necessarily reflect the income of the loan recipient, but it does indicate whether banks were focusing their loan efforts in poorer or wealthier areas. The average income for loans destined for foreclosure ($48,204) was very close to that for loans that did not foreclose ($48,797). Between banks, income ranged only from $47,264 (



Model Predictions
Due to low data variation, median household income was adjusted only slightly, from an improved assumption of $53,615 to a weakened assumption of $43,298. Perhaps because a roughly $10,000 income difference is too small, banks overall did not reduce predicted foreclosures under the improved assumption or increase foreclosures under the weakened assumption.



Main Point
Foreclosure predictions seem random: they change only slightly and often in the wrong direction. This could be related to two issues: 1) there is little correlation between median household income at the 3-digit zip code level and foreclosures, and/or 2) there is little data variation in this feature. Given the latter potential explanation, I adjusted the assumptions to the maximum (an income of $101,651) and the minimum ($28,832).
Best/Worst-Case Scenarios
After adjusting the assumptions to the richest region ($101,651) and the poorest region ($28,832), the predictions begin to look more like one would expect. Emphasizing the richest region is predicted to decrease foreclosures from 9.7% to 4.9%, although emphasizing the poorest region is not predicted to change foreclosures greatly (from 9.7% to 9.9%).



Loan Change
Data Mining
Banks increasing the total value of their loans over 1 year did not correspond with lower foreclosures. However, there appears to be a minor effect among banks increasing their loans over 5 years: loans destined for foreclosure increased $63,530 on average over 5 years, compared to a $61,193 increase for non-foreclosed loans. There are dramatic differences between banks in loan changes, with AmTrust increasing only $34,664 over 5 years and Wells Fargo increasing $96,423.






Model Predictions
Under improved assumptions, 1-year loan changes increased $2,292 and 5-year loan changes increased $37,980, and overall foreclosure rates were predicted to improve from 9.7% to 7.9%. Under weakened assumptions, 1-year loan changes increased $27,661 and 5-year loan changes increased $86,358, and overall foreclosure rates were predicted to worsen from 9.7% to 12.9%.





Main Point
Changes in loans varied across banks more than any other feature examined in this document. Likewise, predicted changes in foreclosures varied greatly between banks.
Conclusion
After 2005, newly issued mortgage loans grew more likely to foreclose, and differences in foreclosure rates between banks widened.
Of the five features focused on in this analysis, credit score had the highest impact on foreclosures but the smallest discrepancy between banks. With improved assumptions, a high credit score of 770 is predicted to reduce foreclosures to between 1% and 3.1%. With weakened assumptions, a low credit score of 675 is predicted to increase foreclosures to between 10.9% and 17.3%.
Of the five features focused on in this analysis, loan change had the lowest impact on foreclosures but the highest discrepancy between banks. With improved assumptions, low 1-year ($2,292) and 5-year ($37,980) increases in mortgage loans made are predicted to reduce foreclosures to between 3.6% and 13.6%. With weakened assumptions, high 1-year ($27,661) and 5-year ($86,358) increases in mortgage loans are predicted to increase foreclosures to between 6.4% and 17%.