Probit & Bivariate Probit
Stata users seem to be the primary audience concerned with probit models, but I thought I’d play around with one even though I’ve never had reason to use it. Stata examples come from the UCLA ATS website and the Stata manual, so one can investigate the Stata result for comparison.
Standard Probit
The standard probit model is identical to the logistic model but using a different link function.
Function
<- function(beta, X, y) {
probit_ll = X %*% beta
mu
# these produce identical results, but the second is the typical depiction
= sum(dbinom(
ll
y,size = 1,
prob = pnorm(mu),
log = T
))
# ll = sum(y * pnorm(mu, log = T) + (1 - y) * log(1 - pnorm(mu)))
-ll
}
Examples
Example 1 detail available here.
library(tidyverse)
= haven::read_dta('https://stats.idre.ucla.edu/stat/stata/dae/binary.dta')
admit
head(admit)
# A tibble: 6 x 4
admit gre gpa rank
<dbl> <dbl> <dbl> <dbl>
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3 2
= model.matrix(admit~ gre + gpa + factor(rank), admit)
X = admit$admit
y
= rep(0, ncol(X))
init
= optim(
fit fn = probit_ll,
par = init,
X = X,
y = y,
method = 'BFGS'
)
fit
$par
[1] -2.2518260555 0.0007915268 0.5453357468 -0.4211611887 -0.8285356685 -0.9457812896
$value
[1] 229.6227
$counts
function gradient
140 11
$convergence
[1] 0
$message
NULL
Example 2 from Stata manual on standard probit.
We have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. We wish to fit a probit model explaining whether a car is foreign based on its weight and mileage."
= haven::read_dta('http://www.stata-press.com/data/r13/auto.dta')
auto
head(auto)
# A tibble: 6 x 12
make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>
1 AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0 [Domestic]
2 AMC Pacer 4749 17 3 3 11 3350 173 40 258 2.53 0 [Domestic]
3 AMC Spirit 3799 22 NA 3 12 2640 168 35 121 3.08 0 [Domestic]
4 Buick Century 4816 20 3 4.5 16 3250 196 40 196 2.93 0 [Domestic]
5 Buick Electra 7827 15 4 4 20 4080 222 43 350 2.41 0 [Domestic]
6 Buick LeSabre 5788 18 3 4 21 3670 218 43 231 2.73 0 [Domestic]
= model.matrix(foreign~ weight + mpg, auto)
X = auto$foreign
y
= rep(0, ncol(X))
init
= optim(
fit fn = probit_ll,
par = init,
X = X,
y = y
)
fit
$par
[1] 8.277015097 -0.002335939 -0.103973147
$value
[1] 26.84419
$counts
function gradient
380 NA
$convergence
[1] 0
$message
NULL
Bivariate Probit
For the bivariate model, we are dealing with two binary outcomes and their correlation.
Here is the main function.
<- function(pars, X, y1, y2) {
bivariate_probit_ll = pars[1]
rho = X %*% pars[2:(ncol(X) + 1)]
mu1 = X %*% pars[(ncol(X) + 2):length(pars)]
mu2 = ifelse(y1 == 1, 1,-1)
q1 = ifelse(y2 == 1, 1,-1)
q2
require(mnormt)
= q1 * mu1
eta1 = q2 * mu2
eta2
= matrix(NA, nrow = nrow(X))
ll
for (i in 1:length(ll)) {
= q1[i] * q2[i] * rho
corr = matrix(c(1, corr, corr, 1), 2)
corr = log(
ll[i] pmnorm(
x = c(eta1[i], eta2[i]),
mean = c(0, 0),
varcov = corr
)
)
}
# the loop is probably clearer, and there is no difference in time, but here's
# a oneliner ll = mapply(function(e1, e2, q1, q2) log(pmnorm(x = c(e1, e2),
# varcov = matrix(c(1,q1*q2*rho,q1*q2*rho,1),2))), eta1, eta2, q1, q2)
-sum(ll)
}
Example
From the Stata manual on bivariate probit:
We wish to model the bivariate outcomes of whether children attend private school and whether the head of the household voted for an increase in property tax based on the other covariates.
= haven::read_dta('http://www.stata-press.com/data/r13/school.dta')
school
head(school)
# A tibble: 6 x 11
obs pub12 pub34 pub5 private years school loginc logptax vote logeduc
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 10 1 9.77 7.05 1 7.21
2 2 0 1 0 0 8 0 10.0 7.05 0 7.61
3 3 1 0 0 0 4 0 10.0 7.05 0 8.28
4 4 0 1 0 0 13 0 9.43 6.40 0 6.82
5 5 0 1 0 0 3 1 10.0 7.28 1 7.69
6 6 1 0 0 0 5 0 10.5 7.05 0 6.91
= model.matrix(private ~ years + logptax + loginc, school)
X = school$private
y1 = school$vote
y2
= c(0, rep(0, ncol(X)*2))
init
# you'll probably get a warning or two, ignore; takes a couple seconds
= optim(
fit fn = bivariate_probit_ll,
par = init,
X = X,
y1 = y1,
y2 = y2,
method = 'BFGS'
)
= fit$value
loglik = fit$par[1]
rho = fit$par[2:(ncol(X) + 1)]
coefs_private = fit$par[(ncol(X) + 2):length(init)]
coefs_vote names(coefs_private) = names(coefs_vote) = c('Int', 'years', 'logptax', 'loginc')
list(
loglik = loglik,
rho = rho,
Private = coefs_private,
Vote = coefs_vote
)
$loglik
[1] 89.25407
$rho
[1] -0.2695802
$Private
Int years logptax loginc
-4.14327955 -0.01193699 -0.11030513 0.37459892
$Vote
Int years logptax loginc
-0.52933721 -0.01686685 -1.28983223 0.99840976
Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/bivariateProbit.R