Heckman Selection
This demonstration of the Heckman selection model is based on Bleven’s example here, but which is more or less the ‘classic’ example regarding women’s wages, variations of which you’ll find all over.
Data Setup
Description of the data:
- Draw 10,000 obs at random
- educ uniform over [0,16]
- age uniform over [18,64]
- wearnl = 4.49 + 0.08 * educ + 0.012 * age + ε
Generate missing data for wearnl drawn z from standard normal [0,1]. z is actually never explained in the slides, I think it’s left out on slide 3 and just represents an additional covariate.
- d*=-1.5+0.15*educ+0.01*age+0.15*z+v
- wearnl missing if d*≤0 wearn reported if d*>0
- wearnl_all = wearnl with non-missing obs
library(tidyverse)
set.seed(123456)
= 10000
N = sample(1:16, N, replace = TRUE)
educ = sample(18:64, N, replace = TRUE)
age
= matrix(c(.46^2, .25*.46, .25*.46, 1), ncol = 2)
covmat = mvtnorm::rmvnorm(N, sigma = covmat)
errors = rnorm(N)
z = errors[, 1]
e = errors[, 2]
v
= 4.49 + .08 * educ + .012 * age + e
wearnl
= -1.5 + 0.15 * educ + 0.01 * age + 0.15 * z + v
d_star
= d_star > 0
observed_index
= data.frame(wearnl, educ, age, z, observed_index) d
Examine linear regression approaches if desired.
# lm based on full data
= lm(wearnl ~ educ + age, data=d)
lm_all
# lm based on observed data
= lm(wearnl ~ educ + age, data=d[observed_index,])
lm_obs
summary(lm_all)
Call:
lm(formula = wearnl ~ educ + age, data = d)
Residuals:
Min 1Q Median 3Q Max
-2.03286 -0.31240 0.00248 0.31578 1.50828
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4691266 0.0171413 260.72 <2e-16 ***
educ 0.0798814 0.0010005 79.84 <2e-16 ***
age 0.0124381 0.0003398 36.60 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4611 on 9997 degrees of freedom
Multiple R-squared: 0.4331, Adjusted R-squared: 0.433
F-statistic: 3820 on 2 and 9997 DF, p-value: < 2.2e-16
summary(lm_obs) # smaller coefs, resid standard error
Call:
lm(formula = wearnl ~ educ + age, data = d[observed_index, ])
Residuals:
Min 1Q Median 3Q Max
-1.75741 -0.30289 -0.00133 0.30918 1.50032
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6713823 0.0258865 180.46 <2e-16 ***
educ 0.0705849 0.0014760 47.82 <2e-16 ***
age 0.0114758 0.0004496 25.52 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4507 on 5517 degrees of freedom
Multiple R-squared: 0.3374, Adjusted R-squared: 0.3372
F-statistic: 1405 on 2 and 5517 DF, p-value: < 2.2e-16
Two step approach
The two-step approach first conducts a probit model regarding whether the individual is observed or not, in order to calculate the inverse mills ratio, or ‘nonselection hazard’. The second step is a standard linear model.
Step 1: Probit Model
= glm(observed_index ~ educ + age + z,
probit data = d,
family = binomial(link = 'probit'))
summary(probit)
Call:
glm(formula = observed_index ~ educ + age + z, family = binomial(link = "probit"),
data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4674 -0.9062 0.4628 0.8800 2.2674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.519248 0.052819 -28.763 <2e-16 ***
educ 0.150027 0.003220 46.588 <2e-16 ***
age 0.010072 0.001015 9.926 <2e-16 ***
z 0.159292 0.013889 11.469 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13755 on 9999 degrees of freedom
Residual deviance: 11119 on 9996 degrees of freedom
AIC: 11127
Number of Fisher Scoring iterations: 4
# http://www.stata.com/support/faqs/statistics/inverse-mills-ratio/
= predict(probit)
probit_lp = dnorm(probit_lp)/pnorm(probit_lp)
mills0
summary(mills0)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.07588 0.38632 0.70027 0.75664 1.09246 1.96602
# identical formulation
# probit_lp = -predict(probit)
# imr = dnorm(probit_lp)/(1-pnorm(probit_lp))
= mills0[observed_index]
imr
summary(imr)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.07588 0.28739 0.48466 0.57015 0.77617 1.87858
Take a look at the distribution.
::qplot(imr, geom = 'histogram') ggplot2
Step 2: Estimate via Linear Regression
Standard regression model using the inverse mills ratio as covariate
= lm(wearnl ~ educ + age + imr, data = d[observed_index, ])
lm_select
summary(lm_select)
Call:
lm(formula = wearnl ~ educ + age + imr, data = d[observed_index,
])
Residuals:
Min 1Q Median 3Q Max
-1.75994 -0.30293 -0.00186 0.31049 1.48179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5159161 0.1063144 42.477 <2e-16 ***
educ 0.0782580 0.0052989 14.769 <2e-16 ***
age 0.0119700 0.0005564 21.513 <2e-16 ***
imr 0.0955209 0.0633557 1.508 0.132
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4506 on 5516 degrees of freedom
Multiple R-squared: 0.3377, Adjusted R-squared: 0.3373
F-statistic: 937.4 on 3 and 5516 DF, p-value: < 2.2e-16
Compare to sampleSelection package.
library(sampleSelection)
= selection(observed_index ~ educ + age + z, wearnl ~ educ + age,
selection_2step method = '2step')
summary(selection_2step)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
10000 observations (4480 censored and 5520 observed)
10 free parameters (df = 9991)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.519248 0.052725 -28.815 <2e-16 ***
educ 0.150027 0.003221 46.577 <2e-16 ***
age 0.010072 0.001014 9.934 <2e-16 ***
z 0.159292 0.013937 11.430 <2e-16 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5159161 0.1066914 42.33 <2e-16 ***
educ 0.0782580 0.0053181 14.71 <2e-16 ***
age 0.0119700 0.0005592 21.41 <2e-16 ***
Multiple R-Squared:0.3377, Adjusted R-Squared:0.3373
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio 0.09552 0.06354 1.503 0.133
sigma 0.45550 NA NA NA
rho 0.20970 NA NA NA
--------------------------------------------
coef(lm_select)['imr'] / summary(lm_select)$sigma # slightly off
imr
0.2119813
coef(lm_select)['imr'] / summary(selection_2step)$estimate['sigma', 'Estimate']
imr
0.2097041
Maximum Likelihood
The following likelihood function takes arguments as follows:
- par: the regression coefficients pertaining to the two models, the residual standard error
- sigma and rho for the correlation estimate
- X: observed data model matrix for the linear regression model
- Z: complete data model matrix for the probit model
- y: the target variable
- observed_index: an index denoting whether y is observed
<- function(par, X, Z, y, observed_index) {
select_ll = par[1:4]
gamma = Z %*% gamma
lp_probit
= par[5:7]
beta = X %*% beta
lp_lm
= par[8]
sigma = par[9]
rho
= sum(log(1-pnorm(lp_probit[!observed_index]))) +
ll - log(sigma) +
sum(dnorm(y, mean = lp_lm, sd = sigma, log = TRUE)) +
sum(
pnorm((lp_probit[observed_index] + rho/sigma * (y-lp_lm)) / sqrt(1-rho^2),
log.p = TRUE)
)
-ll
}
= model.matrix(lm_select)
X = model.matrix(probit)
Z
# initial values
= c(coef(probit), coef(lm_select)[-4], 1, 0) init
Estimate via optim. Without bounds for sigma and rho you’ll get warnings, but does fine anyway
= optim(
fit_unbounded
init,
select_ll,X = X[, -4],
Z = Z,
y = wearnl[observed_index],
observed_index = observed_index,
method = 'BFGS',
control = list(maxit = 1000, reltol = 1e-15),
hessian = T
)
= optim(
fit_bounded
init,
select_ll,X = X[, -4],
Z = Z,
y = wearnl[observed_index],
observed_index = observed_index,
method = 'L-BFGS',
lower = c(rep(-Inf, 7), 1e-10,-1),
upper = c(rep(Inf, 8), 1),
control = list(maxit = 1000, factr = 1e-15),
hessian = T
)
Comparison
Comparison model.
= selection(observed_index ~ educ + age + z, wearnl ~ educ + age,
selection_ml method = 'ml')
# summary(selection_ml)
We now compare the results of the different estimation approaches.
model | par | sampselpack_ml | unbounded_ml | bounded_ml | explicit_twostep | sampselpack_2step |
---|---|---|---|---|---|---|
probit | (Intercept) | -1.52026540 | -1.521 | -1.523 | -1.519 | -1.519 |
probit | educ | 0.15020205 | 0.150 | 0.150 | 0.150 | 0.150 |
probit | age | 0.01006608 | 0.010 | 0.010 | 0.010 | 0.010 |
probit | z | 0.15747033 | 0.158 | 0.158 | 0.159 | 0.159 |
lm | (Intercept) | 4.47798502 | 4.480 | 4.481 | 4.516 | 4.516 |
lm | educ | 0.08012367 | 0.080 | 0.080 | 0.078 | 0.078 |
lm | age | 0.01209129 | 0.012 | 0.012 | 0.012 | 0.012 |
lm | sigma | 0.45820605 | 0.458 | 0.458 | 0.451 | 0.456 |
both | rho | 0.25945397 | 0.257 | 0.255 | 0.212 | 0.210 |
model | par | sampselpack_ml | unbounded_ml | bounded_ml | explicit_twostep | sampselpack_2step |
---|---|---|---|---|---|---|
probit | (Intercept) | 0.053 | 0.053 | 0.053 | 0.053 | 0.053 |
probit | educ | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 |
probit | age | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
probit | z | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 |
lm | (Intercept) | 0.090 | 0.090 | 0.090 | 0.106 | 0.107 |
lm | educ | 0.004 | 0.004 | 0.005 | 0.005 | 0.005 |
lm | age | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
lm | sigma | 0.008 | 0.008 | 0.008 | NA | NA |
both | rho | 0.112 | 0.112 | 0.112 | NA | NA |
Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/heckman_selection.R