Maximum Likelihood

This is a brief refresher on maximum likelihood estimation using a standard regression approach as an example, and more or less assumes one hasn’t tried to roll their own such function in a programming environment before. Given the likelihood’s role in Bayesian estimation and statistics in general, and the ties between specific Bayesian results and maximum likelihood estimates one typically comes across, one should be conceptually comfortable with some basic likelihood estimation. The following is taken directly from my document with mostly just cleaned up code and visualization. The TLDR version can be viewed in the Linear Regression chapter.

In the standard model setting we attempt to find parameters \(\theta\) that will maximize the probability of the data we actually observe. We’ll start with an observed random target vector \(y\) with \(i...N\) independent and identically distributed observations and some data-generating process underlying it \(f(\cdot|\theta)\). We are interested in estimating the model parameter(s), \(\theta\), that would make the data most likely to have occurred. The probability density function for \(y\) given some particular estimate for the parameters can be noted as \(f(y_i|\theta)\). The joint probability distribution of the (independent) observations given those parameters, \(f(y_i|\theta)\), is the product of the individual densities, and is our likelihood function. We can write it out generally as:

\[\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)\]

Thus, the likelihood for one set of parameter estimates given a fixed set of data y, is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare one set, \(\mathcal{L}(\theta_A)\), to that of another, \(\mathcal{L}(\theta_B)\), and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter. The data is drawn from Poisson distributed variable with mean \(\theta=5\). We note the calculated likelihood increases as we estimate values for \(\theta\) closer to \(5\), or more precisely, whatever the mean observed value is for the data. However, with more and more data, the final ML estimate will converge on the true value.

Final estimate = 5.02

For computational reasons, we instead work with the sum of the natural log probabilities, and so deal with the log likelihood:

\[\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]\]

Concretely, we calculate a log likelihood for each observation and then sum them for the total likelihood for parameter(s) \(\theta\).

The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the estimates of the parameter that make the data most likely. As an example, for a normally distributed variable of interest we can write the log likelihood as follows:

\[\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-\mu)^2}{2\sigma^2})]\]


In the following we will demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values \(y\). First the data is created, and then we create the function that will compute the log likelihood. Using the built in R distributions makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the bbmle package, which has some nice summary functionality and other features. However, one should take a glance at optim and the other underlying functions that do the work.

# for replication

# create the data
y = rnorm(1000, mean = 5, sd = 2)
starting_values = c(0, 1)

# the log likelihood function
simple_ll <- function(mu, sigma, verbose = TRUE) {
  ll = sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  if (verbose)
    message(paste(mu, sigma, ll))

The simple_ll function takes starting points for the parameters as arguments, in this case we call them \(\mu\) and \(\sigma\), which will be set to 0 and 1 respectively. Only the first line (ll = -sum…) is actually necessary, and we use dnorm to get the density for each point. Since this optimizer is by default minimization, we reverse the sign of the sum so as to minimize the negative log likelihood, which is the same as maximizing the likelihood. Note that the bit of other code just allows you to see the estimates as the optimization procedure searches for the best values. I do not show that here but you’ll see it in your console if trace = TRUE.

We are now ready to obtain maximum likelihood estimates for the parameters. For comparison we will use bbmle due to its nice summary result, but you can use optim as in the other demonstrations. For the mle2 function we will need the function we’ve created, plus other inputs related to that function or the underlying optimizing function used (by default optim). In this case we will use an optimization procedure that will allow us to set a lower bound for \(\sigma\). This isn’t strictly necessary, but otherwise you would get warnings and possibly lack of convergence if negative estimates for \(\sigma\) were allowed.

# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm = bbmle::mle2(
  start  = list(mu = 2, sigma = 1),
  method = "L-BFGS-B",
  lower  = c(sigma = 0),
  trace = TRUE


bbmle::mle2(minuslogl = simple_ll, start = list(mu = 2, sigma = 1), 
    method = "L-BFGS-B", trace = TRUE, lower = c(sigma = 0))

      mu    sigma 
4.946803 1.993680 

Log-likelihood: -2108.92 
# compare to an intercept only regression model

lm(formula = y ~ 1)

    Min      1Q  Median      3Q     Max 
-6.7389 -1.2933 -0.0264  1.2848  6.4450 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.94681    0.06308   78.42   <2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.995 on 999 degrees of freedom

We can see that the ML estimates are the same as the lm model estimates based on least squares, and which given the sample size are close to the true values.

In terms of the parameters we estimate, instead of the curve presented previously, in the typical case of two or more parameters we can think of a likelihood surface that represents the possible likelihood values given any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive- feel free to adjust). I also plot the path of the optimizer from a top down view. The large dot noted represents the minimum negative log likelihood.

Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the audience for which this is intended, we do not want to lose sight of the forest for the trees. We now move next to a slightly more complicated regression example.

Linear Model

In the standard regression context, our expected value for the target variable comes from our linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, then write out the likelihood function with starting parameters. As we need to estimate our intercept and coefficient for the X predictor (collectively referred to as \(\beta\)), we can think of our likelihood explicitly as before:

\[\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-X\beta)^2}{2\sigma^2})]\]

# for replication

# predictor
X = rnorm(1000)

# coefficients for intercept and predictor
beta = c(5, 2)

# add intercept to X and create y with some noise
y = cbind(1, X) %*% beta + rnorm(1000, sd = 2.5)

regression_ll <- function(sigma = 1, Int = 0, b1 = 0) {
  coefs = c(Int, b1)
  mu = cbind(1,X)%*%coefs
  ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  message(paste(sigma, Int, b1, ll))

mlopt =  bbmle::mle2(regression_ll, method = "L-BFGS-B", lower = c(sigma = 0)) 

Maximum likelihood estimation

bbmle::mle2(minuslogl = regression_ll, method = "L-BFGS-B", lower = c(sigma = 0))

      Estimate Std. Error z value     Pr(z)    
sigma 2.447823   0.054735  44.721 < 2.2e-16 ***
Int   5.039976   0.077435  65.087 < 2.2e-16 ***
b1    2.139284   0.077652  27.549 < 2.2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 4628.273 
# plot(profile(mlopt), absVal=F)

modlm = lm(y ~ X)

lm(formula = y ~ X)

    Min      1Q  Median      3Q     Max 
-7.9152 -1.6097  0.0363  1.6343  7.6711 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.03998    0.07751   65.02   <2e-16 ***
X            2.13928    0.07773   27.52   <2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.45 on 998 degrees of freedom
Multiple R-squared:  0.4315,    Adjusted R-squared:  0.4309 
F-statistic: 757.5 on 1 and 998 DF,  p-value: < 2.2e-16
- 2 * logLik(modlm)
'log Lik.' 4628.273 (df=3)

As before, our estimates and final log likelihood value are about where they should be, and reflect the lm output, as the OLS estimates are the maximum likelihood estimates. The visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.

To move to generalized linear models, very little changes of the process outside of the distribution assumed and that we are typically modeling a function of the target variable (e.g. \(\log(y)=X\beta; \mu = e^{X\beta}\)).