# Remove warnings
import warnings
warnings.filterwarnings('ignore')
Lately, I have been recreating Stata estimates in Python. Below, I reproduce the robust standard errors that Stata (and R's survey package) uses in its prop and mean commands, using the population data from Python's vega_datasets.
# Core Packages
import numpy as np
import pandas as pd
from vega_datasets import data
# Load data
df = data.population()
df['female'] = df['sex']-1   # recode sex (coded 1/2) into a 0/1 indicator
df = df[['year', 'female', 'age', 'people']]
df.head()
Statsmodels is a common Python library for producing weighted statistics. Its DescrStatsW class uses the conventional formula for the weighted standard deviation (note: the standard error is simply the standard deviation divided by the square root of the sample size). Unfortunately, this is not how Stata or R calculates weighted standard errors, because they apply a robust correction. Below, Python calculates a standard error of 0.021, while Stata calculates 0.028.
from statsmodels.stats.weightstats import DescrStatsW
# Weighted descriptive statistics (conventional, non-robust)
stat = DescrStatsW(df['female'], weights=df['people'])
print('% Female')
print('mean:', stat.mean)
print('se: ', stat.std/np.sqrt(df.shape[0]))   # weighted SD divided by sqrt(n)
Stata command and output:
# . prop female [pw=people]
# Proportion estimation                  Number of obs = 570
# --------------------------------------------------------------
#              |                                         Logit
#              | Proportion   Std. err.     [95% conf. interval]
# -------------+------------------------------------------------
# female       |
#            0 |   .4953639    .0282502     .4401598    .5506813
#            1 |   .5046361    .0282502     .4493187    .5598402
# --------------------------------------------------------------
Fortunately, statsmodels can calculate the same robust weighted standard errors by running a weighted linear regression on a constant with a heteroscedasticity-robust (HC1) covariance.
import statsmodels.api as sm
y = df['female']
x = np.ones(df.shape[0])   # regress on a constant only (the intercept is the weighted mean)
w = df['people']
fpc = 1                    # finite population correction (1 = none)
# Weighted least squares with an HC1 robust covariance
stat = sm.WLS(y, x, weights=w).fit(cov_type='HC1')
# Stats
mean = stat.params[0]
se = stat.bse[0]*fpc
print('% Female')
print('mean:', mean)
print('se: ', se)
So I dug into the statsmodels source code for weighted linear regressions so that you don't have to. Below is how we can calculate the same robust standard error manually.
n = df.shape[0]
y = df['female']
x = df['people']   # note: x holds the weights here
fpc = 1            # finite population correction (1 = none)
# Prop/Mean
mean = (y*x).sum() / (x).sum()
# SE
het_scale = n/(n-1) * ( ((y-mean) * np.sqrt(x))**2 )   # heteroscedasticity-robust scale used for the HC1 covariance
wx = np.array([ x*(1/np.sqrt(x)) ]).T                  # whitened design matrix (cholsigmainv), i.e. sqrt(weights)
pinv_wx = np.linalg.pinv(wx)                           # Moore-Penrose pseudo-inverse of the whitened design
H = np.dot(pinv_wx, np.asarray(het_scale)[:, None] * pinv_wx.T).item()   # heteroscedasticity-corrected ("White-Huber") variance
se = np.sqrt(H)*fpc
print('% Female')
print('mean:', mean)
print('se: ', se)
We can use this code as the basis for rewriting Stata's weighted prop and mean functions. I wrote stata_func() to allow y_var to be a binary "0, 1" proportion or a continuous mean--but not a multinomial proportion. The over variable (i.e., subpopulation) can be binary, multinomial, or omitted. Weights and an FPC can be added, but strata cannot (that would greatly enlarge the function).
# Recreate Stata stats
def stata_func(y_var, df, over=None, weight=None, fpc=1):
    # Libraries
    import numpy as np
    import pandas as pd
    from category_encoders.one_hot import OneHotEncoder
    # Default to a weight of 1 if weight==None
    if weight is None:
        df['weight'] = np.ones(df.shape[0])
        weight = 'weight'
    # Default to a single group if over==None
    if over is None:
        df['over'] = np.ones(df.shape[0])
        over = 'over'
    # Prepare over variable to loop
    one_hot = OneHotEncoder(handle_missing='return_nan', use_cat_names=True).fit_transform(df[over].astype('O'))
    x_vars = sorted(one_hot.columns)
    df = pd.concat([df, one_hot], axis=1)
    # Calculate Stats
    stats = {}
    stats['stats'] = ('mean', 'se', 'n')
    for x_var in x_vars:
        # Text output
        if over == 'over':
            list_name = y_var
        else:
            list_name = y_var + " when " + x_var
        # Params
        cw = df[[y_var, x_var, weight]].dropna()
        n = cw.shape[0]
        subpop = cw[cw[x_var]==1]
        y = subpop[y_var]
        x = subpop[weight]
        # Prop/Mean
        mean = (y*x).sum() / (x).sum()
        # SE
        ## Without weights
        if len(np.unique(x)) == 1:
            het_scale = (y-mean)**2                               # heteroscedasticity-robust scale (HC0)
        ## With weights
        else:
            het_scale = n/(n-1) * ( ((y-mean) * np.sqrt(x))**2 )  # heteroscedasticity-robust scale (HC1)
        ## Some linear algebra on the robust scale
        wx = np.array([ x*(1/np.sqrt(x)) ]).T                     # whitened design matrix (cholsigmainv)
        pinv_wx = np.linalg.pinv(wx)                              # Moore-Penrose pseudo-inverse
        H = np.dot(pinv_wx, np.asarray(het_scale)[:, None] * pinv_wx.T).item()   # heteroscedasticity-corrected ("White-Huber") variance
        se = np.sqrt(H)*fpc
        stats[list_name] = (mean, se, str(n))
    stats = pd.DataFrame(stats)
    return(stats)
# Adding subpopulations
cats = [-np.inf, 1900, 1950, 1990, np.inf]
cat_labs = ["1850-1900", "1901-1950", "1951-1990", "1991-2000"]
df['period'] = pd.cut(df['year'], bins=cats, labels=cat_labs)
stata_func('female', df, over='period', weight='people')
stata_func('age', df, over='period', weight='people')
stata_func('age', df, over=['period', 'female'], weight='people')
Why are the standard errors different?
Let's start by discussing standard deviations. A traditional standard deviation takes the overall variance (the sum of squared deviations from the mean) relative to the sample size (divide by the sample size, then take the square root). The simplest way to add weights is to multiply each squared deviation by its weight and replace the sample size with the sum of the weights.
However, the traditional formula only considers total variance relative to the n-value (the sample size or the sum of the weights), not how that variance is distributed across observations. Robust standard deviations do consider the distribution of the variance. Mathematically, the robust formula does not sum the variance before relating it to the n-value. Instead, it relates each observation's variance to the n-value, adds penalties for uneven variance, and uses some fancy linear algebra to collapse the result into a single standard deviation value.
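Written out for a weighted mean (a regression on nothing but a constant), the linear-algebra code above reduces to the two formulas below, where $w_i$ are the weights, $\bar{y}_w$ is the weighted mean, and $n$ is the number of observations:
$$ SE_{traditional} = \frac{1}{\sqrt{n}}\sqrt{\frac{\sum_i w_i (y_i - \bar{y}_w)^2}{\sum_i w_i}} $$
$$ SE_{robust} = \sqrt{\frac{n}{n-1}\,\frac{\sum_i w_i^2 (y_i - \bar{y}_w)^2}{\left(\sum_i w_i\right)^2}} $$
The traditional formula pools all the weighted squared deviations before scaling by the n-value; the robust formula squares the weight on each deviation, so observations whose weighted deviations are unusually large pull the standard error up.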
What does this do (in plain English)?
Robust standard deviations will generally be larger than the traditional formula's; let's call the increase a "robust correction". The robust correction is larger when the difference between each value and the mean is very inconsistent (sometimes large, sometimes small). If the difference between each value and the mean is roughly constant, the robust correction barely increases the standard deviation. This is the same concept as heteroskedasticity in linear regressions. The formula for robust standard deviations can be found here:
MacKinnon, J. G., & White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3), 305-325.
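To see the correction in action, here is a small illustrative sketch (the values and weights are made up, and weighted_ses is just a helper wrapping the two formulas above). Both toy samples share the same values and the same total weight, so the traditional weighted standard error is identical; but the second sample spreads its weight far more unevenly, so its weighted deviations are inconsistent and the robust standard error grows.
import numpy as np
def weighted_ses(y, w):
    # Traditional (pooled) and robust (HC1-style) standard errors of a weighted mean
    y, w = np.asarray(y, float), np.asarray(w, float)
    n = len(y)
    mean = np.sum(w*y) / np.sum(w)
    trad_se = np.sqrt(np.sum(w*(y-mean)**2) / np.sum(w)) / np.sqrt(n)
    robust_se = np.sqrt(n/(n-1) * np.sum(w**2 * (y-mean)**2) / np.sum(w)**2)
    return trad_se, robust_se
y = [0, 0, 0, 1, 1, 1]
even_w = [10, 10, 10, 10, 10, 10]   # weighted deviations all alike
uneven_w = [1, 1, 28, 1, 1, 28]     # same total weight, very uneven weighted deviations
print('even weights:  ', weighted_ses(y, even_w))     # robust SE barely above the traditional SE
print('uneven weights:', weighted_ses(y, uneven_w))   # robust correction is much larger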
But aren't we concerned with errors, not deviations?
Standard errors are just scaled standard deviations: divide the standard deviation by the square root of the n-value and you get the standard error. This holds for traditional and robust standard errors alike.
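As a quick back-of-envelope check with the numbers reported above (the rounding here is mine): a roughly 50/50 proportion has a weighted standard deviation of about 0.5, and the sample has 570 observations.
import numpy as np
n = 570                  # observations in the example above
trad_sd = 0.5            # approximate weighted SD for a ~50% proportion (what DescrStatsW computes)
robust_se = 0.0282502    # Stata's robust SE from the prop output above
print(trad_sd / np.sqrt(n))     # ~0.021 -- the traditional SE reported earlier
print(robust_se * np.sqrt(n))   # ~0.674 -- the implied robust standard deviation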