Heckman Selection

This demonstration of the Heckman selection model is based on Bleven’s example here, but which is more or less the ‘classic’ example regarding women’s wages, variations of which you’ll find all over.

Data Setup

Description of the data:

  • Draw 10,000 obs at random
  • educ uniform over [0,16]
  • age uniform over [18,64]
  • wearnl = 4.49 + 0.08 * educ + 0.012 * age + ε

Generate missing data for wearnl drawn z from standard normal [0,1]. z is actually never explained in the slides, I think it’s left out on slide 3 and just represents an additional covariate.

  • d*=-1.5+0.15*educ+0.01*age+0.15*z+v
  • wearnl missing if d*≤0 wearn reported if d*>0
  • wearnl_all = wearnl with non-missing obs
library(tidyverse)

set.seed(123456)

N = 10000
educ = sample(1:16, N, replace = TRUE)
age  = sample(18:64, N, replace = TRUE)

covmat = matrix(c(.46^2, .25*.46, .25*.46, 1), ncol = 2)
errors = mvtnorm::rmvnorm(N, sigma = covmat)
z = rnorm(N)
e = errors[, 1]
v = errors[, 2]

wearnl = 4.49 + .08 * educ + .012 * age + e

d_star = -1.5 + 0.15 * educ + 0.01 * age + 0.15 * z + v

observed_index  = d_star > 0

d = data.frame(wearnl, educ, age, z, observed_index)

Examine linear regression approaches if desired.

# lm based on full data
lm_all = lm(wearnl ~ educ + age, data=d)

# lm based on observed data
lm_obs = lm(wearnl ~ educ + age, data=d[observed_index,])

summary(lm_all)

Call:
lm(formula = wearnl ~ educ + age, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.03286 -0.31240  0.00248  0.31578  1.50828 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.4691266  0.0171413  260.72   <2e-16 ***
educ        0.0798814  0.0010005   79.84   <2e-16 ***
age         0.0124381  0.0003398   36.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4611 on 9997 degrees of freedom
Multiple R-squared:  0.4331,    Adjusted R-squared:  0.433 
F-statistic:  3820 on 2 and 9997 DF,  p-value: < 2.2e-16
summary(lm_obs) # smaller coefs, resid standard error

Call:
lm(formula = wearnl ~ educ + age, data = d[observed_index, ])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75741 -0.30289 -0.00133  0.30918  1.50032 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.6713823  0.0258865  180.46   <2e-16 ***
educ        0.0705849  0.0014760   47.82   <2e-16 ***
age         0.0114758  0.0004496   25.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4507 on 5517 degrees of freedom
Multiple R-squared:  0.3374,    Adjusted R-squared:  0.3372 
F-statistic:  1405 on 2 and 5517 DF,  p-value: < 2.2e-16

Two step approach

The two-step approach first conducts a probit model regarding whether the individual is observed or not, in order to calculate the inverse mills ratio, or ‘nonselection hazard’. The second step is a standard linear model.

Step 1: Probit Model

probit = glm(observed_index ~ educ + age + z,
             data   = d,
             family = binomial(link = 'probit'))

summary(probit)

Call:
glm(formula = observed_index ~ educ + age + z, family = binomial(link = "probit"), 
    data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4674  -0.9062   0.4628   0.8800   2.2674  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.519248   0.052819 -28.763   <2e-16 ***
educ         0.150027   0.003220  46.588   <2e-16 ***
age          0.010072   0.001015   9.926   <2e-16 ***
z            0.159292   0.013889  11.469   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13755  on 9999  degrees of freedom
Residual deviance: 11119  on 9996  degrees of freedom
AIC: 11127

Number of Fisher Scoring iterations: 4
# http://www.stata.com/support/faqs/statistics/inverse-mills-ratio/
probit_lp = predict(probit)
mills0 = dnorm(probit_lp)/pnorm(probit_lp)

summary(mills0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.07588 0.38632 0.70027 0.75664 1.09246 1.96602 
# identical formulation 
# probit_lp = -predict(probit)
# imr = dnorm(probit_lp)/(1-pnorm(probit_lp))
imr = mills0[observed_index]

summary(imr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.07588 0.28739 0.48466 0.57015 0.77617 1.87858 

Take a look at the distribution.

ggplot2::qplot(imr, geom = 'histogram')

Step 2: Estimate via Linear Regression

Standard regression model using the inverse mills ratio as covariate

lm_select = lm(wearnl ~ educ + age + imr, data = d[observed_index, ])

summary(lm_select)

Call:
lm(formula = wearnl ~ educ + age + imr, data = d[observed_index, 
    ])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75994 -0.30293 -0.00186  0.31049  1.48179 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.5159161  0.1063144  42.477   <2e-16 ***
educ        0.0782580  0.0052989  14.769   <2e-16 ***
age         0.0119700  0.0005564  21.513   <2e-16 ***
imr         0.0955209  0.0633557   1.508    0.132    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4506 on 5516 degrees of freedom
Multiple R-squared:  0.3377,    Adjusted R-squared:  0.3373 
F-statistic: 937.4 on 3 and 5516 DF,  p-value: < 2.2e-16

Compare to sampleSelection package.

library(sampleSelection)

selection_2step = selection(observed_index ~ educ + age + z, wearnl ~ educ + age, 
                            method = '2step')

summary(selection_2step)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
10000 observations (4480 censored and 5520 observed)
10 free parameters (df = 9991)
Probit selection equation:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.519248   0.052725 -28.815   <2e-16 ***
educ         0.150027   0.003221  46.577   <2e-16 ***
age          0.010072   0.001014   9.934   <2e-16 ***
z            0.159292   0.013937  11.430   <2e-16 ***
Outcome equation:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.5159161  0.1066914   42.33   <2e-16 ***
educ        0.0782580  0.0053181   14.71   <2e-16 ***
age         0.0119700  0.0005592   21.41   <2e-16 ***
Multiple R-Squared:0.3377,  Adjusted R-Squared:0.3373
   Error terms:
              Estimate Std. Error t value Pr(>|t|)
invMillsRatio  0.09552    0.06354   1.503    0.133
sigma          0.45550         NA      NA       NA
rho            0.20970         NA      NA       NA
--------------------------------------------
coef(lm_select)['imr'] / summary(lm_select)$sigma         # slightly off
      imr 
0.2119813 
coef(lm_select)['imr'] / summary(selection_2step)$estimate['sigma', 'Estimate']
      imr 
0.2097041 

Maximum Likelihood

The following likelihood function takes arguments as follows:

  • par: the regression coefficients pertaining to the two models, the residual standard error
  • sigma and rho for the correlation estimate
  • X: observed data model matrix for the linear regression model
  • Z: complete data model matrix for the probit model
  • y: the target variable
  • observed_index: an index denoting whether y is observed
select_ll <- function(par, X, Z, y, observed_index) {
  gamma     = par[1:4]
  lp_probit = Z %*% gamma
  
  beta  = par[5:7]
  lp_lm = X %*% beta
  
  sigma = par[8]
  rho   = par[9]
  
  ll = sum(log(1-pnorm(lp_probit[!observed_index]))) +
    - log(sigma) +
    sum(dnorm(y, mean = lp_lm, sd = sigma, log = TRUE)) +
    sum(
      pnorm((lp_probit[observed_index] + rho/sigma * (y-lp_lm)) / sqrt(1-rho^2), 
            log.p = TRUE)
    )
  
  -ll
}
X = model.matrix(lm_select)
Z = model.matrix(probit)

# initial values
init = c(coef(probit), coef(lm_select)[-4],  1, 0)

Estimate via optim. Without bounds for sigma and rho you’ll get warnings, but does fine anyway

fit_unbounded = optim(
  init,
  select_ll,
  X = X[, -4],
  Z = Z,
  y = wearnl[observed_index],
  observed_index = observed_index,
  method  = 'BFGS',
  control = list(maxit = 1000, reltol = 1e-15),
  hessian = T
)

fit_bounded = optim(
  init,
  select_ll,
  X = X[, -4],
  Z = Z,
  y = wearnl[observed_index],
  observed_index = observed_index,
  method  = 'L-BFGS',
  lower   = c(rep(-Inf, 7), 1e-10,-1),
  upper   = c(rep(Inf, 8), 1),
  control = list(maxit = 1000, factr = 1e-15),
  hessian = T
)

Comparison

Comparison model.

selection_ml = selection(observed_index ~ educ + age + z, wearnl ~ educ + age, 
                         method = 'ml')
# summary(selection_ml)

We now compare the results of the different estimation approaches.

model par sampselpack_ml unbounded_ml bounded_ml explicit_twostep sampselpack_2step
probit (Intercept) -1.52026540 -1.521 -1.523 -1.519 -1.519
probit educ 0.15020205 0.150 0.150 0.150 0.150
probit age 0.01006608 0.010 0.010 0.010 0.010
probit z 0.15747033 0.158 0.158 0.159 0.159
lm (Intercept) 4.47798502 4.480 4.481 4.516 4.516
lm educ 0.08012367 0.080 0.080 0.078 0.078
lm age 0.01209129 0.012 0.012 0.012 0.012
lm sigma 0.45820605 0.458 0.458 0.451 0.456
both rho 0.25945397 0.257 0.255 0.212 0.210
model par sampselpack_ml unbounded_ml bounded_ml explicit_twostep sampselpack_2step
probit (Intercept) 0.053 0.053 0.053 0.053 0.053
probit educ 0.003 0.003 0.003 0.003 0.003
probit age 0.001 0.001 0.001 0.001 0.001
probit z 0.014 0.014 0.014 0.014 0.014
lm (Intercept) 0.090 0.090 0.090 0.106 0.107
lm educ 0.004 0.004 0.005 0.005 0.005
lm age 0.001 0.001 0.001 0.001 0.001
lm sigma 0.008 0.008 0.008 NA NA
both rho 0.112 0.112 0.112 NA NA