Preface

This work will continue to be added to and corrected, but it has enough content for now to make it available.

Introduction

This document attempts to provide clarification for the uninitiated. Many in applied disciplines are exposed to standard linear regression and little more. Or they may get a cursory glance at logistic regression, presented in a manner that can leave one feeling it is a wholly different type of analysis, even though some versions are generalized linear models just as standard regression is. Furthermore, there is a bit of a nomenclature issue that might again make some think they are dealing with wholly different models, when in fact they are dealing with extensions of simpler ones they’ve already been exposed to. This document hopes to provide a bit of clarity in this realm.

Two categories

Data

The most common use of logistic regression is the case where the target variable we wish to understand is a binary variable, e.g. yes/no, buy/sell, dead/alive etc. Aside from the standard linear model for a continuous outcome, this is probably the most common statistical modeling situation.

Binomial Distribution

As the outcome is binary, we can potentially understand the data generating process as a binomial distribution. The binomial is typically one of the first probability distributions introduced in a statistics class. The classic example involves a coin flipping situation.

Going with that, let’s say we flip the coin 10 times. The chance it comes up heads or tails is 50/50, i.e. the probability = .5. If we do this once, who knows how many times heads will come up.

sample(c('heads','tails'), 10, prob=c(.5, .5), replace=T)
 [1] "tails" "tails" "tails" "heads" "tails" "heads" "tails" "heads" "tails" "heads"

However, if we repeat it over and over, we can see how often we might expect each possible outcome of 10 coin flips. The binomial has a parameter \(\pi\), the probability that the event in question occurs. We set the size, or number of ‘trials’, accordingly.

output = rbinom(5000, size=10, prob=.5)

Since the probability is .5, we would expect 5 heads out of 10 more often than any other outcome.
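
A quick tally of the simulated outcomes makes this concrete, using the output object created above.

table(output) / 5000        # proportion of simulations yielding 0 through 10 heads
# barplot(table(output))    # or visualize the distribution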

Now what if we have a situation where there is only one coin flip? In this case the size or number of trials is 1, and the distribution would look like the following.

output = rbinom(1000, size=1, prob=.5)

Now we have come to the most common form of logistic regression. The target variable we wish to understand is binary, and the number of times in which it is observed is once, i.e. once per individual, tweet, firm, country, or whatever our data regards.

Logistic Regression Model

We’ll start by writing the model out formally. For reference I’ll start with the standard linear model for regression just to get our bearings. We depict it as follows:

\[ \mu = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p \] \[ \mu = X\beta \] \[ y \sim \mathcal{N}(\mu, \sigma^2)\]

In the above \(\mu\) is the linear predictor, the weighted combination of \(p\) covariates \(x\), written two ways, one explicit and one using matrix notation, where \(X\) is the model matrix and \(\beta\) the vector of coefficients. The former is for those who are not familiar with matrix notation, as the latter can just be considered shorthand. The coefficients we wish to estimate are \(\beta\), and for the normal distribution we also need to estimate the variance \(\sigma^2\).

For binary target variables we do not assume the data generating process is a normal distribution, but instead often consider a binomial as above. The Bernoulli distribution is a special case of the binomial for the situation where size=1, and may be preferable for some modeling approaches (e.g. Stan). With logistic regression, the linear predictor is the logit, i.e. the log of the ratio of the probability of the label of interest to 1 minus that probability. Note that which label we refer to is arbitrary, e.g. whether you want the probability to regard a ‘yes’ outcome or a ‘no’ is entirely up to you.

The logit is the name for the (natural) log of the odds, \(\pi/(1-\pi)\), i.e. the ratio of the probability of the event of interest, \(\pi\), to the probability of its non-occurrence. It theoretically ranges from \(-\infty\) to \(\infty\) and is centered at zero, which is akin to a probability of .5. The logit is assumed to be some function of the covariates.

\[\begin{align} \textrm{Logit} &:= \ln\left(\frac{\pi}{1-\pi}\right) \\ \textrm{Logit} &= X\beta \end{align}\]

The transformation function, or link function, of interest is the logistic link, hence the name logistic regression. Probabilities are inherently nonlinear, e.g. the change from .05 to .10 is a doubling of the probability, while that from .50 to .55 is only a 10% increase. To engage in a linear model of the sort we use in other modeling approaches, the logistic link transforms the probability response to the logit. Converting back to the probability scale requires the inverse logistic function, which might be depicted in different ways.

\[\begin{align} \pi &= \textrm{Logit}^{-1}(X\beta) \\ \pi &= \frac{1}{1+e^{-X\beta}}, \textrm{ or} \\ \pi &= \frac{e^{X\beta}}{1+e^{X\beta}} \end{align}\]

And finally, given the probability, we can get the likelihood of the response given that probability.

\[\begin{align} y &\sim \mathrm{Bin}(\pi, \mathrm{size}=1), \textrm{ or} \\ y &\sim \mathrm{Bern}(\pi) \end{align}\]

To make this clearer, let’s convert a value presumed to be on the logit scale to a probability. We’ll demonstrate the logit and inverse logit as well as the alternate ‘by-hand’ ways it might be depicted in various sources.

plogis(0)                  # convert to probability
[1] 0.5
log(.5/(1-.5))             # logit
[1] 0
qlogis(.5)
[1] 0
plogis(-1)                 
[1] 0.2689414
plogis(1)                  # 1 minus plogis(-1)
[1] 0.7310586
1/(1+exp(-1))
[1] 0.7310586
exp(1)/(1+exp(1))
[1] 0.7310586

Such a model is often called a logit model. However, calling a model by its link function seems odd to me for several reasons: there are several link functions one might use, the logit link is used for other models (e.g. ordinal models, neural networks), calling it a logit model doesn’t tell you how many categories are present, and we don’t do this with other models. Case in point, a very common alternative is the probit link, which uses the cumulative normal distribution to map between the linear predictor and the probability scale. Practically anything that converts the linear predictor to \((0,1)\) will technically work. However, naming the model after its link does not change what one assumes about the underlying response distribution in this setting.

\[ y \sim \mathrm{Bern}(\pi) \quad \scriptsize{\textrm{(a probit model, same as the logistic model)}} \]

An example of different link functions is shown in the next figure. There are at times reasons to prefer one over another, e.g. considering what one thinks about the tails of the distribution, and some can be seen as special cases of others (e.g. the complementary log-log is a special case of generalized extreme value). The choice becomes more important in multinomial and ordinal models.
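
To get a feel for how the links differ, the following quick sketch overlays a few inverse link functions over the same linear predictor values; a probit fit itself would just use family=binomial(link='probit') in glm.

lp = seq(-4, 4, length.out=100)
probs = cbind(logit   = plogis(lp),              # inverse logit
              probit  = pnorm(lp),               # inverse probit
              cloglog = 1 - exp(-exp(lp)))       # inverse complementary log-log
matplot(lp, probs, type='l', lty=1, xlab='linear predictor', ylab='probability')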

Example

We’ll use the example from the UCLA ATS website, in case one wants a bit more detail or wants to see it done with languages other than R. The hypothetical data regard graduate school admission, and we have undergraduate GPA, GRE scores, and school prestige, a variable with values 1 through 4, where institutions with a rank of 1 have the highest prestige and those with 4 the lowest.

admission = read.csv("data/admit.csv")

In this case our binary target is whether a prospective candidate is admitted or not (1 = admitted, 0 = not), and we wish to predict it with the other variables.

mod = glm(admit ~ gre + gpa + rank, data=admission, family=binomial)
summary(mod)

Call:
glm(formula = admit ~ gre + gpa + rank, family = binomial, data = admission)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5802  -0.8848  -0.6382   1.1575   2.1732  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.449548   1.132846  -3.045  0.00233 ** 
gre          0.002294   0.001092   2.101  0.03564 *  
gpa          0.777014   0.327484   2.373  0.01766 *  
rank        -0.560031   0.127137  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.44

Number of Fisher Scoring iterations: 4

Interpretation

The effects seen are in the directions expected, i.e. higher GRE, higher GPA, and more prestige suggest more likely admittance (recall that a lower prestige rank score means higher prestige).

Odds ratios

The coefficients tell us what happens on the logit scale, which is perhaps not that interpretable, except in the usual regression sense: if I move up one unit on GPA, the logit increases by .77. People typically convert the coefficients to odds ratios, which we obtain by exponentiating the coefficients.

exp(coef(mod))
(Intercept)         gre         gpa        rank 
 0.03175998  1.00229659  2.17496718  0.57119114 

Now if I move up one on GPA, the odds of being admitted increase by a factor of 2.2, i.e. more than double. If my prestige rank increases by one (i.e. less prestige), the odds decrease by 43%.

Predicted probabilities

Unless you do a lot of gambling, odds ratios probably aren’t all that interpretable either. One can get the estimated probabilities at key values of the covariates, which are easier to understand. The following looks at the predicted probabilities for the extremes of prestige, holding GPA and GRE scores at their mean.

prediction_data = data.frame(gre=mean(admission$gre), gpa=mean(admission$gpa), rank=c(1,4))
predict(mod, newdata=prediction_data)                    # logit scale
          1           2 
-0.02742147 -1.70751563 
predict(mod, newdata=prediction_data, type='response')   # probability scale
        1         2 
0.4931451 0.1534862 

Thus even at average GRE and GPA, one has a ~50% chance of being admitted to graduate school if they attended a highly prestigious school, versus roughly 15% for one that is not.

To make things perfectly clear, let’s do it by hand.

coefs = coef(mod)
prediction = coefs[1] + coefs[2]*mean(admission$gre)  + coefs[3]*mean(admission$gpa) + coefs[4]*c(1,4)
prediction
[1] -0.02742147 -1.70751563
plogis(prediction)
[1] 0.4931451 0.1534862

The nonlinear nature of the model becomes clear if we visualize the relationships of covariates on the logit versus probability scales. The plot below depicts the GPA effect.

prediction_data = data.frame(gre = mean(admission$gre),
                              gpa = rep(seq(from = 0, to = 4, length.out = 100), 4), rank = 1)

preds_logit = predict(mod, newdata = prediction_data, type="link")
preds_prob = predict(mod, newdata = prediction_data, type="response")
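
A minimal base-graphics version of such a plot could be produced as follows; since the gpa grid in prediction_data is repeated, one pass through it suffices.

idx = 1:100                                      # one pass through the repeated gpa grid
par(mfrow=c(1,2))
plot(prediction_data$gpa[idx], preds_logit[idx], type='l', xlab='GPA', ylab='Logit')
plot(prediction_data$gpa[idx], preds_prob[idx],  type='l', xlab='GPA', ylab='Probability')
par(mfrow=c(1,1))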

Summary of Standard Logistic Regression

So there you have it. The standard logistic regression is the simplest setting for a categorical outcome, one in which there are two possibilities, and only one of which can occur. It is a special case of multinomial, ordinal, and conditional logistic regression, and so can serve as a starting point for moving toward those models.

Extensions

Counts

As noted, the binomial distribution refers to situations in which an event occurs x number of times out of some number of trials/observations. In the binary logistic regression model the number of trials is 1, but it certainly doesn’t have to be. If it is more than 1, we can then model the proportion, and at least in R the glm function is still used with family = binomial just as before. We just specify the target variable differently, in terms of the number of times the event occurred vs. the number of times it did not.

glm(cbind(occurrences, non_occurrences) ~ x + z, data=mydata, family=binomial)
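
As a runnable sketch with simulated data (variable names here are made up purely for illustration):

set.seed(123)
n = 100
x = rnorm(n)
trials = sample(5:20, n, replace=TRUE)                        # differing numbers of trials
occurrences = rbinom(n, size=trials, prob=plogis(-.5 + .8*x)) # true coefficients -.5 and .8
summary(glm(cbind(occurrences, trials - occurrences) ~ x, family=binomial))$coef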

Conditional Logistic

Situations arise when there are alternative specific covariates, such that the value a covariate takes can be different for the two outcomes. This is the first step toward discrete choice models (a.k.a. McFadden choice model), in which there are often more than two choices (as we will see with multinomial models), and values vary with choice. The key idea is that we have strata or groups which contain both positive and negative target values. However, the model is similar.

\[ \textrm{Logit} \propto X\beta \]

That is, within a stratum the logit is determined by the linear combination of the covariates only up to an additive constant; equivalently, the odds are proportional to \(e^{X\beta}\). This generalizes the logistic regression model depicted previously, which can be seen as a special case.

Example

The following example regards infertility after spontaneous and induced abortion. It is a matched case-control study, such that we have exactly one ‘case’, i.e. person with infertility, per two control observations. We will model this with covariates that regard whether or not they previously had a spontaneous or an induced abortion.

infert = infert %>%
  mutate(spontaneous = factor(spontaneous >0, labels=c('No','Yes')),
         induced = factor(induced>0, labels=c('No','Yes')))
# ?infert
mod_logreg = glm(case ~ spontaneous + induced, data = infert, family = binomial)
summary(mod_logreg)

Call:
glm(formula = case ~ spontaneous + induced, family = binomial, 
    data = infert)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3541  -0.7320  -0.5871   1.2135   1.9200  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -1.6709     0.2802  -5.964 2.46e-09 ***
spontaneousYes   1.5863     0.3037   5.223 1.76e-07 ***
inducedYes       0.4909     0.3059   1.605    0.109    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 316.17  on 247  degrees of freedom
Residual deviance: 286.19  on 245  degrees of freedom
AIC: 292.19

Number of Fisher Scoring iterations: 4
library(survival)
model_condlogreg = clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summary(model_condlogreg)
Call:
coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous + induced + 
    strata(stratum), data = infert, method = "exact")

  n= 248, number of events= 83 

                 coef exp(coef) se(coef)     z Pr(>|z|)    
spontaneousYes 2.1344    8.4522   0.4050 5.270 1.36e-07 ***
inducedYes     1.0926    2.9821   0.4058 2.693  0.00709 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

               exp(coef) exp(-coef) lower .95 upper .95
spontaneousYes     8.452     0.1183     3.821    18.695
inducedYes         2.982     0.3353     1.346     6.606

Rsquare= 0.141   (max possible= 0.519 )
Likelihood ratio test= 37.82  on 2 df,   p=6.118e-09
Wald test            = 27.82  on 2 df,   p=9.11e-07
Score (logrank) test = 35.79  on 2 df,   p=1.694e-08

Note that the intercept cancels out in conditional logistic regression. It could vary by group and it would still cancel (similar to so-called fixed-effects models). Also, any variable that is constant within group will similarly cancel out. This may be better understood when the (log) likelihood is expressed as follows for the case where the strata are balanced (i.e. 1:1 matching).

\[ \mathrm{L} = (X_{y=1} - X_{y=0})\beta\] \[ \mathcal{L} = \ln(\frac{e^L}{1+e^L})\]

As such, anything constant within a stratum would simply be 0.

You might have noticed the call in the clogit function output, where it says coxph(...). That isn’t an error; the conditional logit is equivalent to a stratified Cox proportional hazards model in which the survival time is simply 1 if the event is observed, or censored (1+) if not.

coxph(Surv(rep(1, nrow(infert)), case) ~ spontaneous + induced + strata(stratum), data=infert)
Call:
coxph(formula = Surv(rep(1, nrow(infert)), case) ~ spontaneous + 
    induced + strata(stratum), data = infert)

                coef exp(coef) se(coef)    z       p
spontaneousYes 2.134     8.452    0.405 5.27 1.4e-07
inducedYes     1.093     2.982    0.406 2.69  0.0071

Likelihood ratio test=37.8  on 2 df, p=6.12e-09
n= 248, number of events= 83 

In the case of 1:N matching, the denominator is based on the sum over the case and the N non-event outcomes. If X represents the covariate values for the observation in which the event occurs, and Z those for the observations in which it does not, then for each stratum:

\[ \mathcal{L} = \ln\left(\frac{e^{X_{y=1}\beta}}{e^{X_{y=1}\beta} + \sum_{k=1}^N e^{Z_k\beta}}\right)\]

Bradley-Terry Model

The Bradley-Terry model (BT henceforth) is one in which we look to model pairwise rankings. For example, if one were to choose among various brands of some product they might select between two products at a time. In the simplest case, the BT model posits the probability product \(i\) is chosen over product \(j\) as:

\[\pi_{i>j} = \frac{\exp(\beta_i)}{\exp(\beta_i)+\exp(\beta_j)}\] \[\pi_{i>j} = \frac{\exp(\beta_i-\beta_j)}{1+\exp(\beta_i-\beta_j)}\] \[\mathrm{Logit}(\pi_{i>j}) = \log(\frac{\pi_{i>j}}{1 - \pi_{i>j}}) = \log(\frac{\pi_{i>j}}{\pi_{j>i}}) = \beta_i - \beta_j\]
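
A quick numeric check of the equivalence above, with made-up coefficient values:

b_i = 1.2; b_j = 0.4                     # hypothetical item 'worth' parameters
exp(b_i) / (exp(b_i) + exp(b_j))         # probability item i is chosen over item j
plogis(b_i - b_j)                        # same value via the inverse logit of the difference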

Thus it turns out the BT model has a connection to the standard logistic model, though we’ll have to set up the data in a specific manner for comparison. We start by creating a model matrix where, for each comparison, the column of the chosen item is 1, that of the item not chosen is -1, and items not under consideration are 0. Our response in this case is simply all positive (1) values.

# glm output for comparison; no intercept; matrix columns are reordered for easier
# comparison; item1 is the reference group and so won't have a coefficient
glmmod = glm(y ~ -1 + ., data=BTdat[,c(2,3,1,4)], family=binomial)  
coef(glmmod)
    item2     item3     item1 
-2.838944 -1.419472        NA 
# using logreg function in appendix;
out = optim(rep(0,3), logreg, X=as.matrix(BTdat[,-4]), y=BTdat$y, method='BFGS')
out$par                  # default chooses item 3 as reference
[1]  1.419472 -1.419472  0.000000
out$par[2:3]-out$par[1]  # now same as glm with item 1 as reference
[1] -2.838944 -1.419472

Once constructed, we can run the standard logistic model, which I do both with a custom function and with the glm function. For identification, one of the parameters must be set to zero, and glm simply drops that column. By default the optim output settles on the third item as the reference (as glm would, had we not reordered the columns).

For the BT model we need a different data structure, with a binary response where 1 indicates that the first item in the comparison was chosen. I show the first few rows here.

The following uses the BradleyTerry2 package and compares the results to the standard logistic both with our custom function and the glm result.

library(BradleyTerry2)
btmod = BTm(y, comp1, comp2, data = BTdat2)

# coefficients in the BT model are a glm binomial regression with appropriate
# coding of the model matrix, and represent the difference in the coefficient
# from the reference group coefficient.
         optim      glm       BT
item2 -2.83894 -2.83894 -2.83894
item3 -1.41947 -1.41947 -1.41947

We can see that all results are the same. Once again, the output parameters from the model tell us the coefficient difference from some reference group coefficient. Using the inverse logit transform on a coefficient tells us the probability of selecting that item relative to the reference group, and as noted previously, that transformation applied to any difference of the coefficients gives the corresponding probability. For example, with Item 1’s parameter set to zero, the probability of choosing Item 1 vs. Item 2 is plogis(0 - (-2.83894)) = 0.945, which you can confirm by applying the fitted function to btmod.

Note also that the BT model is generalizable to counts, just as binary logistic regression is a special case of binomial regression more generally. The BT model can also handle ties and choice specific covariates, but at that point we’re in the realm of multinomial regression, so we’ll turn to that now.

More than two

In many situations the variables of interest have more than two categories, and as such the standard binary approach will not suffice. In the simplest case we might have the same interpretation as before, but as we will see, things can be notably extended from there.

Multinomial Logistic Regression

Multinomial logistic regression will allow for modeling situations in which the target variable has more than two nominal categories.

A starting point

Let’s first compare the multinomial distribution to the binomial. The multinomial distribution can be thought of similarly, just with more category labels. Before, we only saw the probability associated with one category (whichever equals 1), because the probability of the other is just 1 minus that. Now we examine the probabilities for each category in the case of more than two categories.

rmultinom(10, size=1, prob=c(.25, .25, .5))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    1    0    0    0    0    1     0
[2,]    1    1    1    0    0    1    1    0    0     0
[3,]    0    0    0    0    1    0    0    1    0     1
rowMeans(rmultinom(1000, size=1, prob=c(.25, .25, .5))) # roughly .25, .25, .5
[1] 0.278 0.265 0.457


In the above, we have multiple categories (3), but only one of them can occur for any given observation. With a larger size, however, they could all occur, with different frequencies. In the following we show the case where the size is 8.


rmultinom(10, size=8, prob=c(.25, .25, .5))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    2    2    0    2    2    3    2    1    4     3
[2,]    4    3    3    1    0    1    2    2    0     2
[3,]    2    3    5    5    6    4    4    5    4     3
rowMeans(rmultinom(1000, size=8, prob=c(.25, .25, .5))) # roughly 2, 2, 4
[1] 2.000 1.953 4.047


Keeping the probability of each category the same, but bumping the total number of possible occurrences to 8, the mean counts become 2, 2, and 4 for each category respectively. Now we let the size vary.


lapply(1:10, rmultinom, n=1000, prob=c(.25, .25, .5)) %>% map(rowMeans)
[[1]]
[1] 0.246 0.254 0.500

[[2]]
[1] 0.508 0.478 1.014

[[3]]
[1] 0.734 0.731 1.535

[[4]]
[1] 0.942 1.001 2.057

[[5]]
[1] 1.245 1.305 2.450

[[6]]
[1] 1.462 1.531 3.007

[[7]]
[1] 1.762 1.799 3.439

[[8]]
[1] 1.95 1.94 4.11

[[9]]
[1] 2.242 2.287 4.471

[[10]]
[1] 2.543 2.576 4.881


In this last instance, the first set of results (size=1) and the eighth (size=8) match what we saw previously. In the demonstration that follows, we will stay in size=1 territory, but just be aware that modeling with multinomial counts is very common, e.g. with compositional data, text analysis, etc.

Example

We’ll start with another example from the UCLA ATS website (the example there is done with Stata for those interested). 200 entering high school students make program choices: general program, vocational program and academic program. We will model their choice using their writing score as a proxy for scholastic ability and their socioeconomic status, a categorical variable of low, middle, and high values.

program = foreign::read.dta('data/hsbdemo.dta')

As there are three categories, we can no longer use the glm function, which does not have the multinomial as a possible value for the ‘family’ argument. There are a few packages available we could use, and here I’ll use the mlogit package. I should also mention that it has a very extensive vignette that is worth perusal, at least part of which has been incorporated into the later discussion.

Let’s run the model to predict program choice. Because of mlogit’s modeling flexibility, the data either has to have a specific structure or the function has to be told what that structure is. This oddly means one apparently cannot use it directly for what some might consider the standard multinomial model without pre-processing, and the mlogit.data function is not the most intuitive way to get it into the proper format. The gist is that all choices must be represented for each individual, along with a binary variable that indicates the actual choice made. Those familiar with mixed models, e.g. for longitudinal data, will recognize this as long format.
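
A minimal sketch of that conversion, assuming the (older) mlogit.data interface rather than the newer dfidx approach; a peek at both formats shows the difference.

library(mlogit)
head(program[, c('prog', 'ses', 'write')])               # original wide format: one row per student
programLong = mlogit.data(program, choice='prog', shape='wide')
head(programLong)                                        # long format: one row per student-alternative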

The program variable is now just a logical variable that notes which of the alternatives, i.e. the alt variable, was chosen. The reason for this data format is that it would allow for alternative specific variables to come into the model as well, something we will see later.

Keeping with our previous formulation, we can see the model as a set of logistic regression models for the probability of some category vs. a reference category. Here we’ll choose the reference category as academic.

\[\ln\left(\frac{\pi_{\textrm{prog=general}}}{\pi_{\textrm{prog=academic}}}\right) = b_{0\_\mathrm{gen}} + b_{1\_\mathrm{gen}}\mathrm{write} + b_{2\_\mathrm{gen}}\textrm{ses}_\textrm{middle} + b_{3\_\mathrm{gen}}\textrm{ses}_\textrm{high}\] \[\ln\left(\frac{\pi_{\textrm{prog=vocation}}}{\pi_{\textrm{prog=academic}}}\right) = b_{0\_\mathrm{voc}} + b_{1\_\mathrm{voc}}\mathrm{write} + b_{2\_\mathrm{voc}}\textrm{ses}_\textrm{middle} + b_{3\_\mathrm{voc}}\textrm{ses}_\textrm{high}\]

So our interpretation will be as before, it’s just that we’ll have to interpret K-1 sets of parameters for K total categories. Let’s go ahead and run the model.

multi_mod = mlogit(prog ~ 1|write + ses, data=programLong)
summary(multi_mod)

Call:
mlogit(formula = prog ~ 1 | write + ses, data = programLong, 
    method = "nr", print.level = 0)

Frequencies of alternatives:
academic  general vocation 
   0.525    0.225    0.250 

nr method
4 iterations, 0h:0m:0s 
g'(-H)^-1g = 7.89E-07 
gradient close to zero 

Coefficients :
                      Estimate Std. Error t-value  Pr(>|t|)    
general:(intercept)   2.852186   1.166439  2.4452  0.014477 *  
vocation:(intercept)  5.218200   1.163549  4.4847 7.301e-06 ***
general:write        -0.057928   0.021411 -2.7056  0.006819 ** 
vocation:write       -0.113603   0.022220 -5.1127 3.177e-07 ***
general:sesmiddle    -0.533291   0.443732 -1.2018  0.229429    
vocation:sesmiddle    0.291393   0.476374  0.6117  0.540743    
general:seshigh      -1.162832   0.514219 -2.2614  0.023737 *  
vocation:seshigh     -0.982670   0.595567 -1.6500  0.098948 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -179.98
McFadden R^2:  0.11815 
Likelihood ratio test : chisq = 48.23 (p.value = 1.063e-08)

Note that we get two sets of output. You can think of it as two separate binary logistic regressions: one where academic is the reference group and general the positive value, and another where academic is again the reference group and vocation the positive value. We see statistically notable effects of writing score and high ses (low ses is the reference category). The general interpretation is similar to before, but here the exponentiated values are sometimes referred to as relative risks rather than odds ratios. The quantity is no longer \(\pi/(1-\pi)\), but a ratio of two probabilities (one category vs. an arbitrary reference category), hence the name relative risk (the ‘risk’ comes from biostatistics, where the event of interest is usually something like death, disease, etc.).

As we move up one point on write, we would see a decrease of 0.058 in the log of the probability of choosing the general program relative to the academic program. In other words, higher write scores are associated with a greater likelihood of choosing the academic program over the general one. The same goes for the vocation category, but the effect is stronger, i.e. we’re even more likely to select the academic program over the vocational one as the write score increases.

It is important to note that, due to the relative nature of the model, the output doesn’t tell us the overall effect of write. Let’s plot the predictions so that we can understand it better. In the following, ses is fixed at ‘middle’.
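
One way to compute such predictions directly from the coefficients (names as they appear in the summary output), with ses at ‘middle’:

cf = coef(multi_mod)
write_grid = seq(30, 70, length.out=100)
V_gen = cf['general:(intercept)']  + cf['general:sesmiddle']  + cf['general:write']*write_grid
V_voc = cf['vocation:(intercept)'] + cf['vocation:sesmiddle'] + cf['vocation:write']*write_grid
p_acad = 1/(1 + exp(V_gen) + exp(V_voc))                 # academic is the reference category
p_gen  = exp(V_gen)*p_acad
p_voc  = exp(V_voc)*p_acad
matplot(write_grid, cbind(p_acad, p_gen, p_voc), type='l', lty=1,
        xlab='write', ylab='predicted probability')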

So the general interpretation of the effect of write would be that higher scores increase the probability of choosing the academic program, decrease the probability of choosing the vocational program, and have little effect on choosing the general program. Here are predictions for all three ses levels.

Utility Maximization

In general we can consider the bi/multinomial logistic model from the perspective of utility maximization, in which a person (or whatever the unit of observation is) selects the choice that provides the most utility to them. The probability a person will choose alternative \(i\), \(\pi_i\), given its utility relative to all other alternatives \(j\), can be formulated as:

\[ \pi_i = \textrm{Prob}(U_i > U_j \quad \forall\, j\neq i)\]

Note that the actual value of utility doesn’t matter, just the difference. So comparing choice \(i\) vs. some specific category \(j\):

\[ \pi_i = \textrm{Prob}(U_i - U_j > 0)\]

The fact that only differences matter has implications for how we estimate the model. Consider the utilities for categories 1 and 2, where we have the typical linear combination we’ve seen before. Here we’ll assume X is an individual level covariate, say age. There might be other covariates as well, but we’ll just focus on this one.

\[U_1 = \dots + b^1_1\mathrm{Age} + \epsilon_1\] \[U_2 = \dots + b^1_2\mathrm{Age} + \epsilon_2\]

If we try to estimate the probability that \(U_2 - U_1 > 0\), the age variable does not vary across the alternatives for a given individual, so only the difference in its coefficients is identified. We can therefore set \(b^1_1\) to zero and we end up with the following:

\[\begin{align} U_1 &= \dots + \epsilon_1 \\ U_2 &= \dots + b_2\mathrm{Age} + \epsilon_2 \end{align}\]

And \(b_2\) now represents the difference \(b^1_2 - b^1_1\). The interpretation of the coefficient isn’t the effect of age on the utility of choice 2, but rather on the utility of choice 2 relative to choice 1. In addition, it turns out that if we assume an extreme value distribution for the \(\epsilon\), the logistic distribution naturally arises when considering their difference. In terms of the predicted probabilities, recall our previous depiction for the binary outcome case. Now the probability of selecting choice \(i\) is:

\[ \frac{e^{X\beta_i}}{\Sigma^J_{j=1}e^{X\beta_j}}\]

If the first set of coefficients is zero, then it becomes:

\[ \pi_1 = \frac{1}{1 + \Sigma^J_{j=2}e^{X\beta}} \quad \tiny\textrm{(choice 1)}\] \[ \pi_{i \neq 1} = \frac{e^{X\beta_i}}{1 + \Sigma^J_{j=2}e^{X\beta_j}} \quad \tiny\textrm{(others)}\]

Alternative specific covariates

Consider the following situation in which we have several possible outcomes chosen by individuals.

  • There are alternative specific variables \(x_{ij}\) with constant coefficient \(\beta\)
  • There are individual specific variables \(z_i\) with alternative specific coefficients \(\gamma_j\)
  • There are alternative specific variables \(w_{ij}\) with alternative specific coefficients \(\delta_j\)

So the conceptual model for the utility for alternative \(j\) (with intercepts \(\alpha\)) is:

\[U_{ij} = \alpha_j + \beta x_{ij} + \gamma_jz_i + \delta_jw_{ij}\]

Common names for some models pertaining to such data are:

  • multinomial logit: we have only individual specific covariates
  • conditional logit: we have only alternative specific covariates
  • mixed logit: has both kinds of variables

We can see our model in terms of differences as follows. Here we’ll keep the individual subscript as it is particularly informative. So for individual \(i\) and choice \(j\) versus choice \(k\)

\[ U_{ij} - U_{ik} = (\alpha_j-\alpha_k) + \beta (x_{ij} -x_{ik}) + (\gamma_j-\gamma_k)z_i + (\delta_jw_{ij}-\delta_kw_{ik})\]

Again, the \(x\) and \(w\) represent covariates that vary by choice; the former has a constant effect \(\beta\), while the latter has alternative specific effects. Variables specific to the individual are \(z\), and will have \(K-1\) effects.

As a demonstration, we have the following data on travel mode choice for travel between Sydney and Melbourne, Australia, to help further clarify the nature of these types of variables. We have a data frame containing 840 observations on 4 modes for 210 individuals.

  • individual: Factor indicating individual with levels 1 to 210.
  • mode: Factor indicating travel mode with levels “car”, “air”, “train”, or “bus”.
  • choice: Factor indicating choice with levels “no” and “yes”.
  • wait: Terminal waiting time, 0 for car.
  • vcost: Vehicle cost component.
  • travel: Travel time in the vehicle.
  • gcost: Generalized cost measure.
  • income: Household income.
  • size: Party size.

It looks like the following:
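
The data can be loaded from the AER package (assuming that source):

data('TravelMode', package='AER')
head(TravelMode)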

We’ve seen the conditional logit model previously for the two category case. Unfortunately, conditional logit might also refer to a type of model for longitudinal data, and mixed logit might be short for a mixed effects model with random effects. The point is you need to be careful and make sure the model is clear. We’ll use mlogit to model each type of covariate, and in this example we’ll allow travel time to have a choice constant effect, income will be an individual specific effect, and vehicle cost will have choice specific effects.

library(mlogit)
travel_mod = mlogit(choice ~ travel | income | vcost, data=TravelMode, 
                    shape='long', alt.var='mode',  chid.var = "individual")
summary(travel_mod)

Call:
mlogit(formula = choice ~ travel | income | vcost, data = TravelMode, 
    shape = "long", alt.var = "mode", chid.var = "individual", 
    method = "nr", print.level = 0)

Frequencies of alternatives:
    air   train     bus     car 
0.27619 0.30000 0.14286 0.28095 

nr method
4 iterations, 0h:0m:0s 
g'(-H)^-1g = 6.81E-08 
gradient close to zero 

Coefficients :
                     Estimate  Std. Error t-value  Pr(>|t|)    
train:(intercept)  4.11059735  0.79413671  5.1762 2.265e-07 ***
bus:(intercept)    2.26495436  0.91877943  2.4652  0.013695 *  
car:(intercept)    2.17778777  0.79937581  2.7244  0.006443 ** 
travel            -0.00155711  0.00093165 -1.6713  0.094654 .  
train:income      -0.04865653  0.01237218 -3.9327 8.398e-05 ***
bus:income        -0.03659445  0.01319756 -2.7728  0.005557 ** 
car:income        -0.00249103  0.01070559 -0.2327  0.816006    
air:vcost          0.00588820  0.00804384  0.7320  0.464161    
train:vcost       -0.02285593  0.00858932 -2.6610  0.007792 ** 
bus:vcost         -0.00608481  0.01763113 -0.3451  0.730006    
car:vcost         -0.03980148  0.01622363 -2.4533  0.014155 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -243.55
McFadden R^2:  0.14169 
Likelihood ratio test : chisq = 80.412 (p.value = 4.0391e-14)

Air travel is the reference group in the above model. We can see that increased travel time decreases the probability of choosing a given mode. Increased income means one is more likely to travel by air. Vehicle cost seems mostly to negatively impact the choice of car and train. Note that the covariates in this case represent the difference in vehicle cost for that mode relative to air travel. For example, as the cost of train travel increases relative to air travel, we’d expect the individual to be less likely to select train travel.

It can be instructive to see the model matrix for the model.
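
One way to inspect it, assuming mlogit provides a model.matrix method for fitted models:

head(model.matrix(travel_mod))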


Alternative constant variables remain unchanged, while alternative specific variables get coded in a pseudo-indicator style, and individual specific covariates in a pseudo-dummy-code style. In both cases the 1 that would exist in standard coding is replaced by the actual value.

Extensions

Nested logit

One of the more common extensions of the above is the nested logit model, or hierarchical logit, in which one choice may lead to a specific subset of others. This is in fact very common in survey design in which there is branching, such that an individual may not encounter some questions unless they answer a previous one in a certain way. As a specific example, one might first ask whether a respondent has purchased beer in the past week, and if so, whether they bought a big brew, a fake craft brew, a craft brew, or a local craft brew. I won’t provide a specific example here (yet), but it’s worth knowing about as a possible option.

Bivariate probit

Another extension one may come across is the bivariate probit model, in which one has two binary responses to investigate simultaneously while acknowledging their correlation. I’ve honestly not seen it outside of the binary outcome setting in econometrics or among those trained in that fashion (e.g. the Stata crowd). However, out of curiosity I looked into it at one time, and provide some code for maximum likelihood estimation in the appendix.

Ordered categories

In the case of ordered categories, we can still use the latent variable conceptualization as before. A key idea to keep in mind is that we are treating the target variable as categorical, not numeric; treating it as numeric is often done merely as a convenience. I think this is mostly because people are not taught ordinal regression, but the output is not much more complex than what the standard linear model would provide, and the predicted probabilities for each category might be more desirable.

The following represents the ordered logistic model, otherwise known as the proportional odds model.

\[U = X\beta + \epsilon, \quad \epsilon \sim \textrm{Logistic}(0,1)\]

\[\begin{align} y &= 1 \quad \textrm{if}\quad U \leq \theta_1 \\ y &= 2 \quad \textrm{if}\quad \theta_1 < U \leq \theta_2 \\ y &= 3 \quad \textrm{if}\quad \theta_2 < U \leq \theta_3 \\ \vdots \\ y &= K \quad \textrm{if}\quad \theta_{k-1} < U \leq \theta_k \\ \end{align}\]

The \(\theta_j\) represent cutoffs, or thresholds, which when transformed give the baseline probability of being in category \(j\) or below. When running a standard ordinal model such as this, one gets the regular coefficients from a logistic model along with thresholds for K-1 categories, as the last is effectively \(\infty\). Note that if we only had two categories, this would be equivalent to the binary logistic regression model.

As an example, we can look at data regarding wine bitterness via the ordinal package. See ?wine for details, but essentially we are looking at factors in the pressing process that might contribute to bitterness. These are the temperature (warm vs. cold) and contact between juice and skins (yes-no).

library(ordinal)
wine_mod = clm(rating ~ contact + temp, data=wine)
summary(wine_mod)
formula: rating ~ contact + temp
data:    wine

 link  threshold nobs logLik AIC    niter max.grad cond.H 
 logit flexible  72   -86.49 184.98 6(0)  4.01e-12 2.7e+01

Coefficients:
           Estimate Std. Error z value Pr(>|z|)    
contactyes   1.5278     0.4766   3.205  0.00135 ** 
tempwarm     2.5031     0.5287   4.735 2.19e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
1|2  -1.3444     0.5171  -2.600
2|3   1.2508     0.4379   2.857
3|4   3.4669     0.5978   5.800
4|5   5.0064     0.7309   6.850
# lm(as.integer(rating) ~ contact + temp, data=wine) # for comparison
# lm(response ~ contact + temp, data=wine)           # for comparison via raw response scale

The coefficients speak to the probability of being in a higher category relative to a lower one. Thus warmer temperatures and contact lead to a higher bitterness rating. The following shows how to get the probabilities for each class.

# get fitted probs
mm = model.matrix(wine_mod)$X[,-1]
thresh = coef(wine_mod)[1:4]
beta = coef(wine_mod)[-(1:4)]
fits = sapply(thresh, function(int) plogis(int - mm %*% beta))
probs14 = t(apply(cbind(0, fits), 1, diff)); colnames(probs14) = paste0('cat',1:4)
round(cbind(probs14, 
            cat5 = 1-rowSums(probs14), 
            ordpackfit = fitted(wine_mod), 
            rating=wine$rating), 2) %>% 
  head
     cat1 cat2 cat3 cat4 cat5 ordpackfit rating
[1,] 0.21 0.57 0.19 0.02 0.01       0.57      2
[2,] 0.21 0.57 0.19 0.02 0.01       0.19      3
[3,] 0.05 0.38 0.44 0.10 0.03       0.44      3
[4,] 0.05 0.38 0.44 0.10 0.03       0.10      4
[5,] 0.02 0.20 0.50 0.20 0.08       0.20      4
[6,] 0.02 0.20 0.50 0.20 0.08       0.20      4

Extensions

Generalized

We noted that the standard model is a proportional odds model, in which the effects are the same whether we are talking about moving from category one to any above, or from category (one or) two to any above, etc. The generalized ordinal logistic model relaxes this assumption, allowing any set of coefficients to vary depending on the category. Thus if there are \(i\) \(\beta\) coefficients, we could potentially have \(i \cdot (K-1)\) parameters plus the thresholds.
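
For instance, the ordinal package can relax the proportional odds assumption for particular covariates via its nominal argument; a sketch with the wine data from before, letting the contact effect vary by threshold:

wine_mod_gen = clm(rating ~ temp, nominal = ~ contact, data=wine)
summary(wine_mod_gen)    # contact now gets a separate effect at each threshold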

Other models

Other models can be used based on substantive considerations. As an example, an adjacent category approach would posit models for categories 1 vs. 2, 2 vs. 3, etc. The continuation ratio approach compares previous categories to the current one, e.g. 1 vs. 2, 1 and 2 vs. 3, etc. The following graphic demonstrates these models visually.
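
If one wanted to fit these, the VGAM package provides acat and cratio families; a sketch, again assuming the wine data:

library(VGAM)
vglm(rating ~ contact + temp, family=acat(parallel=TRUE),   data=wine)  # adjacent category
vglm(rating ~ contact + temp, family=cratio(parallel=TRUE), data=wine)  # continuation ratio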

Other

Response Distributions

Categorical

As Bernoulli is to the binomial, so is the categorical distribution, commonly referred to as such in computer science, to the multinomial distribution. In other words, if only one choice among several is made (i.e. we’re not dealing with counts), then we have the categorical distribution.

Beta-binomial

The beta binomial distribution allows one to incorporate over-dispersion in the binomial model in cases where the number of trials is greater than 1. For similar reasons one employs the negative binomial instead of the Poisson distribution for count data.
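
A quick simulation illustrates the idea: letting the probability itself vary across observations (here via a beta draw) inflates the variance relative to a plain binomial with the same mean.

set.seed(1)
n = 10000; size = 10
y_binom   = rbinom(n, size, prob=.5)
y_betabin = rbinom(n, size, prob=rbeta(n, 2, 2))   # probabilities vary, but the mean is still .5
c(var_binomial = var(y_binom), var_betabinomial = var(y_betabin))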

Beta regression

Beta regression allows one to model a target variable that falls in \((0,1)\), such as probabilities/proportions or other ratios. The beta distribution is highly flexible: it is symmetric if the mean is .5, and skews toward 0 or 1 as the mean drifts toward those boundaries. This would be useful for targets that do not follow the binomial distribution, or perhaps as a more flexible alternative. Zero/one-inflated versions are available, as seen with count distributions. A few useful R packages are betareg, mgcv, and VGAM, and for the multivariate analog of the beta, the Dirichlet distribution, the DirichletReg package. Note that the beta distribution can serve as a prior for \(\pi\) in the binomial, as the Dirichlet can for the vector of probabilities in the multinomial distribution. I have examples of such models in both standard and Bayesian form, including mixed models, here.
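
A minimal simulated sketch, assuming the betareg package (the beta distribution here is parameterized through a mean and a precision of 10):

library(betareg)
set.seed(2)
x  = rnorm(500)
mu = plogis(-.5 + .7*x)                            # mean on the (0,1) scale via the inverse logit
y  = rbeta(500, shape1=mu*10, shape2=(1-mu)*10)
coef(betareg(y ~ x))                               # recovers roughly -.5, .7, plus the precision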

Performance metrics

In typical classification situations we are interested in overall accuracy. However, there are situations, not uncommon, in which simple accuracy isn’t a good measure of performance. As an example, consider the prediction of the occurrence of a rare disease: guessing a non-event every time might result in 99.9% accuracy, but that isn’t how we would prefer to assess a classifier’s performance.

To demonstrate other sources of classification information, we will use the following 2x2 table, which cross-classifies the values of a binary outcome (0 = non-event, 1 = event occurs) with the predictions made by some (arbitrary) model for that response. Both a table of actual values, often called a confusion matrix, and an abstract version are provided.

            Actual 1  Actual 0
Predicted 1       41        21
Predicted 0       16        13

            Actual 1  Actual 0
Predicted 1        A         B
Predicted 0        C         D
  • True Positive, False Positive, True Negative, False Negative In the table above, these are A, B, D, and C respectively.
  • Accuracy Number of correct classifications out of all predictions ((A+D)/Total). In the above example this would be (41+13)/91, about 59%.
  • Error Rate 1 - Accuracy.
  • Sensitivity is the proportion of correctly predicted positives to all true positive events: A/(A+C). In the above example this would be 41/57, about 72%. High sensitivity would suggest a low type II error rate (see below), or high statistical power. Also known as true positive rate.
  • Specificity is the proportion of correctly predicted negatives to all true negative events: D/(B+D). In the above example this would be 13/34, about 38%. High specificity would suggest a low type I error rate (see below). Also known as true negative rate.
  • Positive Predictive Value (PPV) proportion of true positives of those that are predicted positive: A/(A+B). In the above example this would be 41/62, about 66%.
  • Negative Predictive Value (NPV) proportion of true negatives of those that are predicted negative: D/(C+D). In the above example this would be 13/29, about 45%.
  • Precision See PPV.
  • Recall See sensitivity.
  • Lift Ratio of positive predictions given actual positives to the proportion of positive predictions out of the total: (A/(A+C))/((A+B)/Total). In the above example this would be (41/(41+16))/((41+21)/(91)), or 1.06.
  • F Score (F1 score) Harmonic mean of precision and recall: 2*(Precision*Recall)/(Precision+Recall). In the above example this would be 2*(.66*.72)/(.66+.72), about .69.
  • Type I Error Rate (false positive rate) proportion of true negatives that are incorrectly predicted positive: B/(B+D). In the above example this would be 21/34, about 62%. Also known as alpha.
  • Type II Error Rate (false negative rate) proportion of true positives that are incorrectly predicted negative: C/(C+A). In the above example this would be 16/57, about 28%. Also known as beta.
  • False Discovery Rate proportion of false positives among all positive predictions: B/(A+B). In the above example this would be 21/62, about 34%. Often used in multiple comparison testing in the context of ANOVA.
  • Phi coefficient A measure of association: (AD - BC)/sqrt((A+C)(B+D)(A+B)(C+D)). In the above example this would be 0.11.
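
A small sketch computing several of these measures directly from the table above:

A = 41; B = 21; C = 16; D = 13
total = A + B + C + D
precision = A/(A+B); recall = A/(A+C)
round(c(accuracy    = (A+D)/total,
        sensitivity = recall,
        specificity = D/(B+D),
        ppv         = precision,
        npv         = D/(C+D),
        f1          = 2*precision*recall/(precision + recall)), 2)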

Loss functions

In the above we have assumed a standard likelihood approach, but in many applications prediction is of utmost importance, and we may prefer alternative methods. In the following we denote the prediction as \(f\), and for convenience we assume a \(\{-1,1\}\) response instead of a \(\{0,1\}\) response:

Binomial log-likelihood

We’ll start by depicting the likelihood approach in this manner.

\[L(Y, f(X)) = \sum \log(1 + e^{-2yf})\]

The above is in deviance form, but is equivalent to binomial log likelihood if \(y\) is on the 0-1 scale.

Misclassification

Probably the most straightforward loss function is misclassification, or 0-1 loss.

\[L(Y, f(X)) = \sum I(y\neq \mathrm{sign}(f))\]

In the above, \(I\) is the indicator function and so we are merely summing misclassifications.

Exponential

Exponential loss is yet another loss function at our disposal.

\[L(Y, f(X)) = \sum e^{-yf}\]

Hinge Loss

A final loss function to consider, typically used with support vector machines, is the hinge loss function.

\[L(Y, f(X)) = \sum \max(1-yf, 0)\]

Here negative values of \(yf\) are misclassifications, and so correct classifications do not contribute to the loss. We could also note it as \(\sum (1-yf)_+\) , i.e. summing only those positive values of \(1-yf\).

The following is a visual display of a couple of those loss functions plus some others, taken from the freely available Elements of Statistical Learning:

Which of these might work best may be specific to the situation, but the gist is that they penalize negative values (misclassifications) more heavily, and increasingly so the worse the misclassification (except for misclassification error, which penalizes all misclassifications equally), with their primary difference being how heavy that penalty is.
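
The comparison can be sketched in a few lines of base R, plotting each loss as a function of the margin \(yf\):

yf = seq(-2, 2, length.out=200)
losses = cbind(misclassification = as.numeric(yf < 0),
               binomial_deviance = log(1 + exp(-2*yf)),
               exponential       = exp(-yf),
               hinge             = pmax(1 - yf, 0))
matplot(yf, losses, type='l', lty=1, xlab='yf', ylab='loss')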

Machine Learning

There are a great many machine learning techniques for both binary and multiclass target variables. Many academic papers have been written using logistic regression models, where, if one had actually presented the classification performance, they would have found the model does no better than guessing.

Machine learning approaches, on the other hand, put almost all of the focus on predictive performance. As such, for most typical data scenarios they will perform better than logistic regression almost by default. As an initial step, introducing a regularized approach to logistic regression, e.g. using a lasso or ridge penalty, would typically allow for better prediction while keeping all the interpretability of the usual approach.
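
For example, a lasso-penalized version of the earlier admissions model might look like the following (a sketch, assuming the glmnet package):

library(glmnet)
X = model.matrix(admit ~ gre + gpa + rank, data=admission)[,-1]
cvfit = cv.glmnet(X, admission$admit, family='binomial', alpha=1)   # alpha=1 gives the lasso
coef(cvfit, s='lambda.min')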

Beyond that there are random forests, neural nets, support vector machines, topic modeling (in the case of counts), etc. For a demonstration of some of these in the binary target setting, see this.

Item response theory

Item response theory models spawned from the testing scenario, in which multiple items can be answered correctly or not, leading to a binary response we wish to model. The model comes with specific parameters that relate to item difficulty, individual ability, random guessing and so forth. They have extensions to the multinomial and ordered case as well, and have a close tie to random effects and latent variable models. See more on that in my graphical and latent variable modeling text.

Logistic growth curves

Logistic growth curve models have been commonly employed to model things like population growth over time. Consider the following where \(P(t)\) is the current population at time \(t\), \(P_0\) initial population size, \(r\) defines the growth rate and \(K\) is the carrying capacity.

\[P(t) = \frac{K P_0 e^{rt}}{K + P_0 \left( e^{rt} - 1\right)}\]
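
As a small illustration of the formula (the parameter values here are arbitrary):

logistic_growth = function(t, K, P0, r) K*P0*exp(r*t) / (K + P0*(exp(r*t) - 1))
curve(logistic_growth(x, K=1000, P0=10, r=.5), from=0, to=20, xlab='t', ylab='P(t)')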

I know little about these models, but I will say that unless some known physical law determines a functional form, you’re better off not assuming one. As an alternative approach, see this document on additive models.

Appendix

Maximum likelihood estimation

Binomial logistic model

logreg = function(par, X, y){
  L = X %*% par
  -sum(dbinom(y, size=1, prob=plogis(L), log=T))
}

mm = model.matrix(mod)
resp = admission$admit
inits = rep(0, 4)
out = optim(inits, logreg, X=mm, y=resp)
cbind(out$par, coefs) %>% round(2)
                  coefs
(Intercept) -3.45 -3.45
gre          0.00  0.00
gpa          0.78  0.78
rank        -0.56 -0.56

Conditional logistic model

condlogreg = function(par, X, y){
  idx = y==1
  L1 = X[idx,] %*% par
  L2 = X[!idx,] %*% par
  L3 = (X[idx,] - X[!idx,]) %*% par
  
  # the following are all identical
  -sum(log(exp(L1)/(exp(L1)+exp(L2))))
  -sum(log(exp(L3)/(1+exp(L3))))
  -sum(dbinom(y[idx], size=1, prob=plogis(L3), log=T))
}


# create 1:1 matching
infert2 = infert %>% group_by(stratum) %>% slice(1:2)
mm = model.matrix(glm(case ~ spontaneous + induced, data = infert2, family = binomial))
mm = mm[,-1]   # no intercept
resp = infert2$case
inits = rep(0, 2)     # initial values
out = optim(inits, condlogreg, X=mm, y=resp)
coefs = coef(clogit(case ~ spontaneous + induced + strata(stratum), data = infert2))
cbind(out$par, coefs) %>% round(2)
                    coefs
spontaneousYes 1.96  1.96
inducedYes     0.99  0.99

Multinomial logistic model

Individual varying only

multinomregML = function(par, X, y) {
  levs = levels(y)
  ref = levs[1]               # reference level (category label 1)
  
  y0 = y==ref
  y1 = y==levs[2]             # category 2
  y2 = y==levs[3]             # category 3
  
  beta = matrix(par, ncol=2)
  
  # more like mnlogit function
  # V1 = X %*% beta[,1]
  # V2 = X %*% beta[,2]
  # ll = sum(-log(1 + exp(V1)+exp(V2))) + sum(V1[y1],V2[y2])
  
  V = X %*% beta               #  a vectorized approach
  baseProbVec = 1/(1 + rowSums(exp(V)))  # reference group probabilities

  loglik = sum(log(baseProbVec))  + crossprod(c(V), c(y1, y2))
  loglik
}

out = optim(runif(8, -.1, .1), multinomregML, X=model.matrix(prog ~ ses + write, data = program),
             y=program$prog, control=list(maxit=1000, reltol=1e-12, ndeps=rep(1e-8, 8),
                                      trace=T, fnscale=-1, type=3),
             method='BFGS')
initial  value 453.915760 
iter  10 value 181.566223
final  value 179.981726 
converged
out$par
[1]  2.85210719 -0.53330216 -1.16285179 -0.05792682  5.21818316  0.29140402 -0.98265328 -0.11360255
cbind(out$par, coef(multi_mod)[c(1,5,7,3,2,6,8,4)])
                            [,1]        [,2]
general:(intercept)   2.85210719  2.85218622
general:sesmiddle    -0.53330216 -0.53329101
general:seshigh      -1.16285179 -1.16283199
general:write        -0.05792682 -0.05792841
vocation:(intercept)  5.21818316  5.21820011
vocation:sesmiddle    0.29140402  0.29139311
vocation:seshigh     -0.98265328 -0.98267029
vocation:write       -0.11360255 -0.11360264
X=model.matrix(prog ~ ses + write, data = program)
y=program$prog
y1 = y=='general'
y2 = y=='vocation'
pars = matrix(out$par, ncol=2)
V = X %*% pars
acadprob = 1/(1+rowSums(exp(V)))
fitnonacad = exp(V) * matrix(rep(acadprob, 2), ncol = 2)
fits = cbind(acadprob, fitnonacad)
yind = model.matrix(~-1+prog, data=program)


# because dmultinom can't take matrix for prob
ll = 0
for (i in 1:200){
  ll = ll+dmultinom(yind[i,], size=1, prob=fits[i,], log=T)
}
ll
[1] -179.9817
out$value
[1] -179.9817

Alternative specific example

# Alternative specific and constant variables -----------------------------

# in this example, price is alternative invariant (Z) income is
# individual/alternative specific (X), and catch is alternative specific (Y)
library(mnlogit)
data(Fish)
head(Fish)
           mode   income     alt   price  catch chid
1.beach   FALSE 7083.332   beach 157.930 0.0678    1
1.boat    FALSE 7083.332    boat 157.930 0.2601    1
1.charter  TRUE 7083.332 charter 182.930 0.5391    1
1.pier    FALSE 7083.332    pier 157.930 0.0503    1
2.beach   FALSE 1250.000   beach  15.114 0.1049    2
2.boat    FALSE 1250.000    boat  10.534 0.1574    2
fm = formula(mode ~ price | income | catch)
fit = mnlogit(fm, Fish)
# fit = mlogit(fm, Fish)
summary(fit)

Call:
mnlogit(formula = fm, data = Fish)

Frequencies of alternatives in input data:
  beach    boat charter    pier 
0.11337 0.35364 0.38240 0.15059 

Number of observations in data = 1182
Number of alternatives = 4
Intercept turned: ON
Number of parameters in model = 11
  # individual specific variables = 2
  # choice specific coeff variables = 1
  # individual independent variables = 1

-------------------------------------------------------------
Maximum likelihood estimation using the Newton-Raphson method
-------------------------------------------------------------
  Number of iterations: 7
  Number of linesearch iterations: 10
At termination: 
  Gradient norm = 2.09e-06
  Diff between last 2 loglik values = 0
  Stopping reason: Succesive loglik difference < ftol (1e-06).
Total estimation time (sec): 0.01
Time for Hessian calculations (sec): 0.01 using 1 processors.

Coefficients : 
                       Estimate   Std.Error  t-value  Pr(>|t|)    
(Intercept):boat     8.4184e-01  2.9996e-01   2.8065 0.0050080 ** 
(Intercept):charter  2.1549e+00  2.9746e-01   7.2443 4.348e-13 ***
(Intercept):pier     1.0430e+00  2.9535e-01   3.5315 0.0004132 ***
income:boat          5.5428e-05  5.2130e-05   1.0633 0.2876611    
income:charter      -7.2337e-05  5.2557e-05  -1.3764 0.1687088    
income:pier         -1.3550e-04  5.1172e-05  -2.6480 0.0080977 ** 
catch:beach          3.1177e+00  7.1305e-01   4.3724 1.229e-05 ***
catch:boat           2.5425e+00  5.2274e-01   4.8638 1.152e-06 ***
catch:charter        7.5949e-01  1.5420e-01   4.9254 8.417e-07 ***
catch:pier           2.8512e+00  7.7464e-01   3.6807 0.0002326 ***
price               -2.5281e-02  1.7551e-03 -14.4046 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1199.1, df = 11
AIC:  2420.3 
# X dim nrow(Fish)/K x p + 1 (intercept)
# Z, Y nrow(N); Y has alt specific coefs; then for Z ref group dropped so nrow = nrow*(K-1)/K
# for ll everything through previous X the same
# then calc probmat for Y and Z, add to X probmat, and add to base

multinomregML2 <- function(par, X, Y, Z, respVec, choice) {

  N = sum(choice)
  K = length(unique(respVec))
  levs = levels(respVec)

  xpar = matrix(par[1:6], ncol=K-1)
  ypar = matrix(par[7:10], ncol=K)
  zpar = matrix(par[length(par)], ncol=1)

  # Calc X
  Vx  = X %*% xpar

  # Calc Y (mnlogit finds N x 1 results by going through 1:N, N+1:N*2 etc; then
  # makes 1 vector, then subtracts the first 1:N from whole vector, then makes
  # Nxk-1 matrix with N+1:end values (as 1:N are just zero)); creating the
  # vector and rebuilding the matrix is unnecessary though
  Vy = sapply(1:K, function(alt) Y[respVec == levs[alt],, drop=F] %*% ypar[alt])
  Vy = Vy[,-1] - Vy[,1]

  # Calc Z
  Vz = Z %*% zpar
  Vz = matrix(Vz, ncol=3)

  # all Vs must fit into N x K -1 matrix where N is nobs (i.e. individuals)
  V = Vx + Vy + Vz

  ll0 = crossprod(c(V), choice[-(1:N)])
  baseProbVec <- 1/(1 + rowSums(exp(V)))
  loglik = sum(log(baseProbVec)) + ll0
  loglik

  # note fitted values via
  # fitnonref = exp(V) * matrix(rep(baseProbVec, K-1), ncol = K-1)
  # fitref = 1-rowSums(fitnonref)
  # fits = cbind(fitref, fitnonref)
}


inits = runif(11, -.1, .1)
mdat = mnlogit(fm, Fish)$model  # this data already ordered!

# X holds individual-specific variables (constant across alternatives); the coefficients regard selection of an alternative relative to the reference
X = cbind(1, mdat[mdat$`_Alt_Indx_`=='beach', 'income']); dim(X); head(X)
[1] 1182    2
     [,1]     [,2]
[1,]    1 7083.332
[2,]    1 1250.000
[3,]    1 3750.000
[4,]    1 2083.333
[5,]    1 4583.332
[6,]    1 4583.332
# Y will use the complete data to start; coefficients will be differences from reference alternative coefficient
Y = as.matrix(mdat[,'catch', drop=F]); dim(Y)
[1] 4728    1
# Z are difference scores from reference group
Z = as.matrix(mdat[mdat$`_Alt_Indx_`!='beach','price', drop=F])
Z = Z-mdat[mdat$`_Alt_Indx_`=='beach','price']; dim(Z);
[1] 3546    1
respVec = mdat$`_Alt_Indx_` # first 10 should be 0 0 1 0 1 0 0 0 1 1 after beach dropped

# debugonce(multinomregML2)
multinomregML2(inits, X, Y, Z, respVec, choice=mdat$mode)
     [,1]
[1,] -Inf
out = optim(par=rep(0,11), multinomregML2, X=X, Y=Y, Z=Z, respVec=respVec, choice=mdat$mode,
            control=list(maxit=1000, reltol=1e-12, ndeps=rep(1e-8, 11),
                         trace=T, fnscale=-1, type=3),
            method='BFGS')
initial  value 1638.599935 
iter  10 value 1253.603448
iter  20 value 1199.143447
final  value 1199.143445 
converged
# out
# round(out$par, 3)
round(cbind(out$par, coef(fit)), 3)
                      [,1]   [,2]
(Intercept):boat     0.842  0.842
income:boat          0.000  0.000
(Intercept):charter  2.155  2.155
income:charter       0.000  0.000
(Intercept):pier     1.043  1.043
income:pier          0.000  0.000
catch:beach          3.118  3.118
catch:boat           2.542  2.542
catch:charter        0.759  0.759
catch:pier           2.851  2.851
price               -0.025 -0.025
cbind(logLik(fit), out$value)
          [,1]      [,2]
[1,] -1199.143 -1199.143
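
As a check on the commented fitted-value lines in multinomregML2, we can turn the estimated parameters into predicted probabilities for each alternative. The following is a minimal sketch that assumes X, Y, Z, respVec, and out from the chunks above are still in the workspace, and that the alternative levels are ordered beach, boat, charter, pier with beach as the reference.

par = out$par
K = length(unique(respVec))
N = nrow(X)
levs = levels(respVec)

Vx = X %*% matrix(par[1:6], ncol=K-1)            # individual-specific part
ypar = matrix(par[7:10], ncol=K)
Vy = sapply(1:K, function(alt) Y[respVec == levs[alt],, drop=F] %*% ypar[alt])
Vy = Vy[,-1] - Vy[,1]                            # relative to the reference alternative
Vz = matrix(Z %*% par[11], ncol=K-1)             # price (alternative-varying) part

V = Vx + Vy + Vz
baseProbVec = 1/(1 + rowSums(exp(V)))            # probability of the reference alternative
fitnonref = exp(V) * matrix(rep(baseProbVec, K-1), ncol=K-1)
fits = cbind(baseProbVec, fitnonref)
colnames(fits) = levs
head(round(fits, 3))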

Bivariate probit

See the Stata manual for comparison.
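
As implemented in the function below, with \(q_{ki} = 2y_{ki} - 1\) converting the binary outcomes to \(\pm 1\), each observation contributes

\[ \ell_i = \log \Phi_2\big(q_{1i}\mu_{1i},\; q_{2i}\mu_{2i};\; q_{1i}q_{2i}\rho\big) \]

where \(\Phi_2\) is the bivariate standard normal CDF and \(\rho\) is the correlation of the two latent errors. The function returns the negative of the summed contributions so that optim can minimize it.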

bivariateProbitLL = function(pars, X, y1, y2){
  rho = pars[1]                                 # correlation of the latent errors
  mu1 = X %*% pars[2:(ncol(X)+1)]               # linear predictor, equation 1
  mu2 = X %*% pars[(ncol(X)+2):length(pars)]    # linear predictor, equation 2
  q1 = ifelse(y1==1, 1, -1)                     # sign indicators, i.e. 2*y - 1
  q2 = ifelse(y2==1, 1, -1)

  require(mnormt)                               # for pmnorm, the multivariate normal CDF
  eta1 = q1*mu1
  eta2 = q2*mu2

  ll = matrix(NA, nrow=nrow(X))
  for (i in 1:length(ll)){
    corr = q1[i]*q2[i]*rho
    corr = matrix(c(1,corr,corr,1),2)
    ll[i] = log(pmnorm(x=c(eta1[i], eta2[i]), mean=c(0,0), varcov=corr))
  }
  
  # the loop is probably clearer, and there is no real difference in timing, but here's a one-liner
  # ll = mapply(function(e1, e2, q1, q2) log(pmnorm(x=c(e1, e2), varcov = matrix(c(1,q1*q2*rho,q1*q2*rho,1),2))),
  #             eta1, eta2, q1, q2)
  
  -sum(ll)
}


### Example 3: from the Stata manual on bivariate probit
# "We wish to model the bivariate outcomes of whether children attend private 
# school and whether the head of the household voted for an increase in property
# tax based on the other covariates."

school = haven::read_dta('http://www.stata-press.com/data/r13/school.dta')
head(school)
# A tibble: 6 × 11
    obs pub12 pub34  pub5 private years school    loginc logptax  vote  logeduc
  <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>     <dbl>   <dbl> <dbl>    <dbl>
1     1     0     1     0       0    10      1  9.770001  7.0475     1 7.206195
2     2     0     1     0       0     8      0 10.021000  7.0475     0 7.609900
3     3     1     0     0       0     4      0 10.021000  7.0475     0 8.278470
4     4     0     1     0       0    13      0  9.433500  6.3969     0 6.821488
5     5     0     1     0       0     3      1 10.021000  7.2792     1 7.686149
6     6     1     0     0       0     5      0 10.463000  7.0475     0 6.913056
X = model.matrix(private ~ years + logptax + loginc, school)
y1 = school$private
y2 = school$vote
init = c(0, rep(0, ncol(X)*2))

# you'll probably get a warning or two, which can be ignored; this takes a couple of seconds
optimResult = optim(fn=bivariateProbitLL, par=init, X=X, y1=y1, y2=y2, method='BFGS', control=list(maxit=1000, reltol=1e-12))


loglik = optimResult$value   # note: this is the minimized negative log-likelihood
rho = optimResult$par[1]
coefsPrivate = optimResult$par[2:(ncol(X)+1)]
coefsVote = optimResult$par[(ncol(X)+2):length(init)]
names(coefsPrivate) = names(coefsVote) = c('Int', 'years', 'logptax', 'loginc')

list(loglik = loglik, rho=rho, Private = coefsPrivate, Vote= coefsVote)
$loglik
[1] 89.25403

$rho
[1] -0.269658

$Private
        Int       years     logptax      loginc 
-4.17648484 -0.01191014 -0.10653563  0.37529006 

$Vote
        Int       years     logptax      loginc 
-0.53882675 -0.01685062 -1.28876081  0.99859618 
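
Though not part of the Stata comparison, approximate standard errors can be obtained from the numerical Hessian of the negative log-likelihood. A minimal sketch, simply re-running optim with hessian=TRUE:

optimResultSE = optim(fn=bivariateProbitLL, par=init, X=X, y1=y1, y2=y2, method='BFGS',
                      control=list(maxit=1000, reltol=1e-12), hessian=TRUE)
se = sqrt(diag(solve(optimResultSE$hessian)))   # inverse Hessian of the negative log-likelihood
round(cbind(estimate=optimResultSE$par, se=se), 4)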

Ordered logistic model

Compare to the example provided in the ordered categorical model section.
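
With ordered thresholds \(\theta_1 < \dots < \theta_{C-1}\) and linear predictor \(X\beta\) (no intercept, as the thresholds take its place), the category probabilities used in the function below are

\[ P(y = c) = \mathrm{logit}^{-1}(\theta_c - X\beta) - \mathrm{logit}^{-1}(\theta_{c-1} - X\beta) \]

with \(\theta_0 = -\infty\) and \(\theta_C = \infty\), so the first and last categories reduce to a single term.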

ologit_ll = function(par, X, y){
  ncats = dplyr::n_distinct(y)
  theta = par[1:(ncats-1)]            # ordered thresholds (cutpoints)
  beta = par[-(1:(ncats-1))]          # regression coefficients

  probs = rep(0, length(y))
  LP = X %*% beta                     # linear predictor (no intercept; the thresholds play that role)
  for (i in 1:ncats){
    if (i == 1){
      probs[y==i] = plogis(theta[i] - LP[y==i, ])
    } else if (i < ncats) {
      probs[y==i] = plogis(theta[i] - LP[y==i, ]) - plogis(theta[i-1] - LP[y==i, ])
    } else {
      probs[y==i] = 1 - plogis(theta[i-1] - LP[y==i, ])
    }
  }
  ll = -sum(log(probs))
  ll
}

# starting values
pars = c(-(4:1), runif(2))

# compare to wine data example
out = optim(pars, ologit_ll, X=model.matrix(rating ~ contact + temp, data=wine)[,-1], y=as.integer(wine$rating), 
            method='BFGS')
out
$par
[1] -1.344381  1.250808  3.466895  5.006414  1.527797  2.503120

$value
[1] 86.49192

$counts
function gradient 
      51       22 

$convergence
[1] 0

$message
NULL
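
As a quick cross-check, the same model can be fit with MASS::polr, which uses the same \(\theta_c - X\beta\) parameterization. This sketch assumes the wine data used above (from the ordinal package) is loaded and that the factor coding matches the model matrix passed to optim.

library(MASS)
polr_fit = polr(rating ~ contact + temp, data=wine, method='logistic')
# thresholds (zeta) followed by coefficients, in the same order as out$par
rbind(optim = out$par, polr = c(polr_fit$zeta, coef(polr_fit)))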

Comparing Counts

The following demonstrates the link between binomial logistic regression for counts, logistic regression for binary outcomes, and Poisson regression.

Comparing count-based vs. binary logistic regression

# standard binomial regression
set.seed(1234)                                # if you want to duplicate exactly
N = 500                                       # Sample size
total = sample(10:20, N, replace=T)           # total occurrences
x = rnorm(N)                                  # covariate
eta = 0 + .3*x                                # logit; 0 is intercept (notice no 'error' as in standard regression)
probs = plogis(eta)                           # inverse logit to put on probability scale
y = rbinom(N, size=total, prob = probs)       # successes, generated by probability and totals
minusy = total - y                            # failures

df = data.frame(x, y, minusy, total)          # create dataframe and inspect
head(df)
           x  y minusy total
1  0.4369306  6      5    11
2  1.0601239  9      7    16
3  0.4521904  8      8    16
4  0.6631986  7      9    16
5 -1.1363736 10      9    19
6 -0.3704975  9      8    17

Note the results: the coefficients are about where we'd expect (0, .3).

summary(glm(cbind(y, minusy) ~ x, data=df, family='binomial'))

Call:
glm(formula = cbind(y, minusy) ~ x, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7019  -0.6384   0.0502   0.6842   2.9755  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.005208   0.023337   0.223    0.823    
x           0.321189   0.023798  13.497   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 653.01  on 499  degrees of freedom
Residual deviance: 463.86  on 498  degrees of freedom
AIC: 2009.7

Number of Fisher Scoring iterations: 4

Convert the data to binary.

# as binary: expand each row into y ones (successes) and minusy zeros (failures)
ybinary = unlist(mapply(function(s, f) c(rep(1, s), rep(0, f)), df$y, df$minusy))
df2 = data.frame(x=rep(x, total), ybinary)
head(df2, 20)
           x ybinary
1  0.4369306       1
2  0.4369306       1
3  0.4369306       1
4  0.4369306       1
5  0.4369306       1
6  0.4369306       1
7  0.4369306       0
8  0.4369306       0
9  0.4369306       0
10 0.4369306       0
11 0.4369306       0
12 1.0601239       1
13 1.0601239       1
14 1.0601239       1
15 1.0601239       1
16 1.0601239       1
17 1.0601239       1
18 1.0601239       1
19 1.0601239       1
20 1.0601239       1

Identical results.

summary(glm(ybinary ~ x, data=df2, family='binomial'))

Call:
glm(formula = ybinary ~ x, family = "binomial", data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6349  -1.1519  -0.7628   1.1572   1.6592  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.005208   0.023337   0.223    0.823    
x           0.321189   0.023798  13.497   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10454  on 7540  degrees of freedom
Residual deviance: 10265  on 7539  degrees of freedom
AIC: 10269

Number of Fisher Scoring iterations: 4

Compare to Poisson Regression

What if the events are rare?

set.seed(1234)
totalRare = total*100
probs = plogis(-5 + .3*x)
yRare = rbinom(N, size=totalRare, prob = probs ) 

summary(data_frame(yRare, totalRare, proportion=yRare/totalRare))
     yRare         totalRare      proportion      
 Min.   : 0.00   Min.   :1000   Min.   :0.000000  
 1st Qu.: 6.00   1st Qu.:1200   1st Qu.:0.004654  
 Median :10.00   Median :1500   Median :0.006583  
 Mean   :10.82   Mean   :1508   Mean   :0.006759  
 3rd Qu.:14.00   3rd Qu.:1800   3rd Qu.:0.008667  
 Max.   :34.00   Max.   :2000   Max.   :0.017000  
minusyRare = totalRare - yRare  
df3 = data_frame(x, yRare, minusyRare, totalRare)  

glmBinomRare = glm(cbind(yRare, minusyRare) ~ x, data=df3, family='binomial')
summary(glmBinomRare)

Call:
glm(formula = cbind(yRare, minusyRare) ~ x, family = "binomial", 
    data = df3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4226  -0.7975  -0.1562   0.4712   2.8769  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.95876    0.01404 -353.17   <2e-16 ***
x            0.28229    0.01379   20.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 945.52  on 499  degrees of freedom
Residual deviance: 521.81  on 498  degrees of freedom
AIC: 2541.4

Number of Fisher Scoring iterations: 4

We now make the comparison to Poisson regression. The coefficient for x is similar. The intercept in the Poisson regression is roughly the (log) count we'd expect on average given the totals involved.
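
The reason for the similarity is that with rare events \(\pi\) is small, so \(\mathrm{logit}(\pi) = \log\frac{\pi}{1-\pi} \approx \log(\pi)\), and the Poisson mean is the total times the probability:

\[ \log(\mu_i) = \log(\mathrm{total}_i \cdot \pi_i) = \log(\mathrm{total}_i) + \log(\pi_i) \approx \log(\mathrm{total}_i) + \mathrm{logit}(\pi_i) \]

so the slope approximates the binomial coefficient, while the Poisson intercept absorbs the (log of the) typical total.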

glmPois = glm(yRare ~ x, data=df3, family='poisson')
summary(glmPois)

Call:
glm(formula = yRare ~ x, family = "poisson", data = df3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.4351  -1.2913  -0.1535   1.0588   4.0025  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.35223    0.01400  168.08   <2e-16 ***
x            0.27891    0.01377   20.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1764.9  on 499  degrees of freedom
Residual deviance: 1352.0  on 498  degrees of freedom
AIC: 3375

Number of Fisher Scoring iterations: 4
c(mean(plogis(coef(glmBinomRare)[1])*totalRare), 
  exp(coef(glmPois)[1]))
            (Intercept) 
   10.51620    10.50902 

If we use log(totalRare) as an offset in the Poisson model, we're modeling the rate rather than the raw count, which corresponds to the proportion we're modeling in the binomial regression, and so the results are essentially the same.
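
To see why, note that with \(\log(\mathrm{total}_i)\) as an offset the Poisson model is

\[ \log(\mu_i) = \log(\mathrm{total}_i) + b_0 + b_1 x_i \quad\Longleftrightarrow\quad \log\left(\frac{\mu_i}{\mathrm{total}_i}\right) = b_0 + b_1 x_i \]

so the linear predictor describes the (log) rate, the analogue of the proportion modeled in the binomial regression.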

glmPoisOffset = glm(yRare ~ x, offset=log(totalRare), data=df3, family='poisson')
summary(glmPoisOffset)

Call:
glm(formula = yRare ~ x, family = "poisson", data = df3, offset = log(totalRare))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4175  -0.7953  -0.1558   0.4681   2.8653  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.96602    0.01399 -354.90   <2e-16 ***
x            0.28014    0.01373   20.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 939.01  on 499  degrees of freedom
Residual deviance: 518.42  on 498  degrees of freedom
AIC: 2541.4

Number of Fisher Scoring iterations: 4
summary(glmBinomRare)

Call:
glm(formula = cbind(yRare, minusyRare) ~ x, family = "binomial", 
    data = df3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4226  -0.7975  -0.1562   0.4712   2.8769  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.95876    0.01404 -353.17   <2e-16 ***
x            0.28229    0.01379   20.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 945.52  on 499  degrees of freedom
Residual deviance: 521.81  on 498  degrees of freedom
AIC: 2541.4

Number of Fisher Scoring iterations: 4

Let’s now compare predicted values from these different models.

# identical to the previous offset model, but this form lets us set a value for the offset when predicting
logtot =  log(totalRare)
glmPoisOffset2 = glm(yRare ~ x, offset=logtot, data=df3, family='poisson')

# compare predicted value ranges
summary(data_frame(pois=fitted(glmPoisOffset),
                   binom=fitted(glmBinomRare)))
      pois            binom         
 Min.   : 3.614   Min.   :0.002685  
 1st Qu.: 8.019   1st Qu.:0.005743  
 Median :10.297   Median :0.006944  
 Mean   :10.818   Mean   :0.007178  
 3rd Qu.:13.177   3rd Qu.:0.008259  
 Max.   :26.899   Max.   :0.017014  
summary(data_frame(pois=fitted(glmPoisOffset)/mean(df3$totalRare), 
                   binom=fitted(glmBinomRare)))
      pois              binom         
 Min.   :0.002397   Min.   :0.002685  
 1st Qu.:0.005317   1st Qu.:0.005743  
 Median :0.006827   Median :0.006944  
 Mean   :0.007173   Mean   :0.007178  
 3rd Qu.:0.008737   3rd Qu.:0.008259  
 Max.   :0.017835   Max.   :0.017014  
# on count scale
predict(glmPoisOffset, newdata = data_frame(x=2, totalRare=mean(totalRare)), type='response')
       1 
18.41081 
exp(cbind(int=1, x=2) %*% coef(glmPoisOffset) + log(mean(totalRare)))
         [,1]
[1,] 18.41081
# on rate scale
predict(glmPoisOffset, newdata = data_frame(x=2, totalRare=mean(totalRare)), type='response') / mean(totalRare)
         1 
0.01220714 
predict(glmPoisOffset2, newdata = data_frame(x=2, logtot=0), type='response') # offset=0
         1 
0.01220714 
exp(cbind(int=1, x=2) %*% coef(glmPoisOffset))  # offset=0
           [,1]
[1,] 0.01220714
predict(glmBinomRare, newdata = data_frame(x=2, totalRare=mean(totalRare)), type='response')
         1 
0.01219848 

References

Train, K. Discrete Choice Methods with Simulation.

Interpreting logistic regression in all its forms

See this for more points on multinomial.

Non-open texts

Agresti, A. (2012). Categorical Data Analysis.

Long, S. (1997). Regression Models for Categorical and Limited Dependent Variables.

Hensher, D., Rose, J., & Greene, W. (2015). Applied Choice Analysis.


  1. I drop the subscript to denote individual observations as it doesn’t add clarity here. You may treat it as a single observation or \(N\) observations. Later it will become more important.

  2. In the previous examples, we'd use qnorm/pnorm in R instead of qlogis/plogis.

  3. Technically one of the strata has a single case and control.

  4. It would be more accurate to simply estimate two coefficients while fixing the reference to zero.

  5. One place you will see this situation routinely ignored is with machine learning approaches, where ‘multi-class’ problems are often handled as multiple comparisons of one category vs. all others, or possibly all pairwise comparisons of categories. In other words, the problem is simply turned into multiple binary classification problems.

  6. If you’re like me, you’ll have to mentally transpose the rmultinom output. You’ll also have to transpose it explicitly if you actually work with it, as your data will almost never be arranged that way. Typically your data frame will have a single column representing the class label, a column for each class with a one marking the chosen class, or a count column for each class.

  7. See also the mnlogit package, which has a faster implementation.

  8. This is how Stata refers to the exponentiated coefficients.

  9. The notion of utility doesn’t work so well if your outcome is a disease or a similarly negatively oriented outcome. In that case, one can just think of it as a latent variable representing the likelihood of some category.

  10. This is a summary of a nicely laid out portion of the vignette for the mlogit package.

  11. Some statistical packages use a parameterization in which the interpretation is reversed, so the coefficients will be of opposite sign. I think it makes more sense to keep the interpretation consistent with standard regression.

  12. This would suggest the underlying latent variable has an extreme value distribution.

  13. A random draw from a Dirichlet distribution is a simplex of values between 0 and 1 that sum to 1.

  14. This example comes from a document I made on machine learning that will eventually be updated.

  15. This term has always struck me as highly sub-optimal.

  16. But not always. Despite the many who believe machine learning to be magic, if you have a lot of data and very good predictors, standard logistic regression will do just fine in a lot of cases. Conversely, if you have crap data, machine learning techniques are not going to make the variables suddenly correlate where they didn’t before.