Penalized Maximum Likelihood

Introduction

This demonstration estimates a standard regression model via penalized maximum likelihood. See the Maximum Likelihood chapter for a starting point. Here the penalty is specified directly (via the lambda argument), but one would typically choose it via cross-validation or some other means. Two penalties are available in the function: one uses the (squared) L2 norm (a.k.a. ridge regression, Tikhonov regularization), the other the L1 norm (a.k.a. the lasso), which can shrink coefficients all the way to zero and thus can also serve as a variable selection procedure. I take more technical approaches to the lasso and ridge in their respective sections.
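
Concretely, the objectives minimized below are the negative log likelihood plus a penalty on the (non-intercept) coefficients, i.e. (my notation):

$$
\text{L2:}\;\; -\ell(\beta, \sigma^2) + \lambda \sum_j \beta_j^2
\qquad\qquad
\text{L1:}\;\; -\ell(\beta, \sigma^2) + \lambda \sum_j |\beta_j|
$$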

Note that both the L2 and L1 approaches can be seen as maximum a posteriori (MAP) estimates for a Bayesian regression with a specific prior on the coefficients. The L2 penalty corresponds to a zero-mean normal prior, while the L1 penalty corresponds to a zero-mean Laplace prior. See the Bayesian regression chapter for an approach in that vein.
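
To make the correspondence explicit (my notation, not from the original): with a zero-mean normal prior $\beta_j \sim \mathcal{N}(0, \tau^2)$ on the coefficients, the negative log posterior is, up to additive constants,

$$
-\log p(\beta \mid y) = -\ell(\beta) + \frac{1}{2\tau^2} \sum_j \beta_j^2 + \text{const.},
$$

so the L2 penalty corresponds to $\lambda = 1/(2\tau^2)$ in the objective above; a zero-mean Laplace prior with scale $b$ likewise gives the L1 penalty with $\lambda = 1/b$.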

Data Setup

library(tidyverse)

set.seed(123)  # ensures replication

# predictors and response
N = 100 # sample size
k = 2   # number of desired predictors
X = matrix(rnorm(N * k), ncol = k)  
y = -.5 + .2*X[, 1] + .1*X[, 2] + rnorm(N, sd = .5)  # increasing N will get estimated values closer to these

dfXy = data.frame(X,y)

Functions

A straightforward function that computes the penalized negative log likelihood to be minimized.

penalized_ML <- function(
  par,
  X,
  y,
  lambda = .1,
  type = 'L2'
) {
  # arguments- 
  # par: parameters to be estimated
  # X: predictor matrix with intercept column
  # y: response
  # lambda: penalty coefficient
  # type: penalty approach
  
  # setup
  beta   = par[-1]                               # coefficients
  sigma2 = par[1]                                # error variance
  sigma  = sqrt(sigma2)
  N = nrow(X)

  LP = X %*% beta                                # linear predictor
  mu = LP                                        # identity link in the glm sense
  
  # calculate likelihood
  L = dnorm(y, mean = mu, sd = sigma, log = T)   # log likelihood
  
  switch(
    type,
    'L2' = -sum(L) + lambda * crossprod(beta[-1]),    # the intercept is not penalized
    'L1' = -sum(L) + lambda * sum(abs(beta[-1]))
  )
}

This next function takes a glmnet-style approach that puts the lambda coefficient on an equivalent scale by using a different objective function. Note that glmnet actually uses the elastic net, which mixes both the L1 and L2 penalties.
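
In particular, penalized_ML2 minimizes a least squares objective scaled by the sample size, which mirrors the scaling glmnet uses for its Gaussian objective (my notation):

$$
\frac{1}{2N} \sum_{i=1}^N (y_i - x_i^\top \beta)^2 + \lambda\, P(\beta),
$$

where $P(\beta)$ is the squared L2 norm or the L1 norm of the non-intercept coefficients.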

penalized_ML2 <- function(
  par,
  X,
  y,
  lambda = .1,
  type = 'L2'
) {
  
  # arguments- 
  # par: parameters to be estimated
  # X: predictor matrix with intercept column
  # y: response
  # lambda: penalty coefficient
  # type: penalty approach
  
  # setup
  beta = par                                   # coefficients
  N = nrow(X)
  
  # linear predictor
  LP = X %*% beta                              # linear predictor
  mu = LP                                      # identity link in the glm sense
  
  switch(
    type,
    'L2' = .5 * crossprod(y - LP) / N + lambda * crossprod(beta[-1]),
    'L1' = .5 * crossprod(y - LP) / N + lambda * sum(abs(beta[-1]))
  )
}

Estimation

Set up the model matrix for use with optim.

X = cbind(1, X)

We'll need to set initial values. Note that we'd normally want to handle the sigma parameter differently, as it's bounded below by zero, but we'll ignore that for this demonstration (a sketch of one alternative follows the fits below).

init = c(1, rep(0, ncol(X)))
names(init) = c('sigma2', 'intercept', 'b1', 'b2')

fit_l2 = optim(
  par = init,
  fn  = penalized_ML,
  X   = X,
  y   = y,
  lambda  = 1,
  control = list(reltol = 1e-12)
)

fit_l1 = optim(
  par = init,
  fn  = penalized_ML,
  X   = X,
  y   = y,
  lambda  = 1,
  type    = 'L1',
  control = list(reltol = 1e-12)
)
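
As an aside regarding the earlier note on sigma: one way to keep the variance positive during optimization is to estimate log(sigma2) instead and exponentiate inside the function. A minimal sketch of such a variant (hypothetical, not used in the comparisons below):

penalized_ML_logvar <- function(par, X, y, lambda = .1, type = 'L2') {
  # par[1] is log(sigma2), so the implied variance is always positive
  beta  = par[-1]
  sigma = sqrt(exp(par[1]))
  
  L = dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE)
  
  switch(
    type,
    'L2' = -sum(L) + lambda * crossprod(beta[-1]),
    'L1' = -sum(L) + lambda * sum(abs(beta[-1]))
  )
}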

Comparison

Compare to lm in base R.

fit_lm = lm(y ~ ., dfXy)

        sigma2  intercept     b1     b2
fit_l2   0.219     -0.432  0.133  0.111
fit_l1   0.219     -0.432  0.131  0.109
fit_lm   0.226     -0.432  0.133  0.112

Compare to glmnet. Setting alpha to 0 or 1 gives the L2 (ridge) or L1 (lasso) penalty, respectively. Normally you also wouldn't specify lambda directly, but would let it be chosen as part of the estimation procedure (e.g. by cross-validation); we fix it here just for demonstration.

library(glmnet)

glmnetL2 = glmnet(
  X[, -1],
  y,
  alpha  = 0,
  lambda = .01,
  standardize = FALSE
)

glmnetL1 = glmnet(
  X[, -1],
  y,
  alpha  = 1,
  lambda = .01,
  standardize = FALSE
)

pars_L2 = optim(
  par = init[-1],
  fn  = penalized_ML2,
  X   = X,
  y   = y,
  lambda  = .01,
  control = list(reltol = 1e-12)
)$par

pars_L1 = optim(
  par = init[-1],
  fn  = penalized_ML2,
  X   = X,
  y   = y,
  lambda  = .01,
  type    = 'L1',
  control = list(reltol = 1e-12)
)$par

               (Intercept)      V1      V2
glmnetL2 (s0)      -0.4324  0.1301  0.1094
pars_L2            -0.4324  0.1301  0.1094
glmnetL1 (s0)      -0.4325  0.1207  0.1005
pars_L1            -0.4325  0.1207  0.1005
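
As noted earlier, in practice you would typically let lambda be chosen by cross-validation rather than fixing it. A minimal sketch using cv.glmnet (not part of the comparisons above):

cv_l1 = cv.glmnet(X[, -1], y, alpha = 1)  # 10-fold CV over a lambda path
cv_l1$lambda.min                          # lambda with minimum CV error
cv_l1$lambda.1se                          # a more heavily penalized, conservative choice
coef(cv_l1, s = 'lambda.1se')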

See the subsequent chapters for an additional look at both lasso and ridge regression approaches.

L1 (lasso) regularization

See Tibshirani (1996) for the original paper, or Murphy PML (2012) for a nice overview (watch for typos in depictions).

Data Setup

library(tidyverse)

set.seed(8675309)

N = 500
p = 10
X = scale(matrix(rnorm(N*p), ncol = p))
b = c(.5, -.5, .25, -.25, .125, -.125, rep(0, p - 6))
y = scale(X %*% b + rnorm(N, sd = .5))
lambda = .1

Function

Coordinate descent function.
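
For reference, the coordinate-wise update implemented below is soft-thresholding of the single-coordinate least squares solution (my notation; note the code computes all $c_j$ from the previous iteration's weights rather than updating in place):

$$
w_j \leftarrow S\!\left(\frac{c_j}{a_j}, \frac{N\lambda}{a_j}\right), \qquad
a_j = \sum_i x_{ij}^2, \quad
c_j = \sum_i x_{ij}\Big(y_i - \sum_{k \ne j} x_{ik} w_k\Big), \quad
S(z, \gamma) = \operatorname{sign}(z)\,\max(|z| - \gamma,\, 0).
$$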

lasso <- function(
  X,                   # model matrix
  y,                   # target
  lambda  = .1,        # penalty parameter
  soft    = TRUE,      # soft vs. hard thresholding
  tol     = 1e-6,      # tolerance
  iter    = 100,       # number of max iterations
  verbose = TRUE       # print out iteration number
) {
  
  # soft thresholding function
  soft_thresh <- function(a, b) {
    out = rep(0, length(a))
    out[a >  b] = a[a > b] - b
    out[a < -b] = a[a < -b] + b
    out
  }
  
  w  = solve(crossprod(X) + diag(lambda, ncol(X))) %*% crossprod(X,y)
  tol_curr = 1
  J  = ncol(X)
  a  = rep(0, J)
  c_ = rep(0, J)
  i  = 1
  
  while (tol < tol_curr && i < iter) {
    w_old = w 
    a  = colSums(X^2)
    l  = length(y)*lambda  # for consistency with glmnet approach
    c_ = sapply(1:J, function(j)  sum( X[,j] * (y - X[,-j] %*% w_old[-j]) ))
    
    if (soft) {
      for (j in 1:J) {
        w[j] = soft_thresh(c_[j]/a[j], l/a[j])
      }
    }
    else {
      w = w_old
      w[c_< l & c_ > -l] = 0
    }
    
    tol_curr = crossprod(w - w_old)  
    
    i = i + 1
    
    if (verbose && i%%10 == 0) message(i)
  }
  
  w
}

Estimation

Note that if lambda = 0, i.e. no penalty, the result is the same as what you'd get from base R's lm.fit (a quick check follows the fits below).

fit_soft = lasso(
  X,
  y,
  lambda = lambda,
  tol    = 1e-12,
  soft   = TRUE
)

fit_hard = lasso(
  X,
  y,
  lambda = lambda,
  tol    = 1e-12,
  soft   = FALSE
)
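
As a quick check of the note above, with no penalty the coordinate descent result matches ordinary least squares (a sketch):

# compare unpenalized coordinate descent to OLS
fit_nopen = lasso(X, y, lambda = 0, tol = 1e-12, verbose = FALSE)

data.frame(lasso = c(fit_nopen), lm = c(lm.fit(X, y)$coefficients))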

The glmnet approach by default mixes the ridge and lasso penalties; setting alpha = 1 reduces to the lasso (alpha = 0 would be ridge). We supply a few lambda values but extract only the coefficients corresponding to the lambda used above (via the s argument).

library(glmnet)

fit_glmnet = coef(
  glmnet(
    X,
    y,
    alpha  = 1,
    lambda = c(10, 1, lambda),
    thresh = 1e-12,
    intercept = FALSE
  ),
  s = lambda
)

The lassoshooting package works with the unscaled sum of squares, so its lambda is multiplied by the sample size here to match the scaling used by the other approaches.

library(lassoshooting)

fit_ls = lassoshooting(
  X = X,
  y = y,
  lambda = length(y) * lambda,
  thr = 1e-12
)

Comparison

We can now compare the coefficients of all our results.

         lm  lasso_soft  lasso_hard  lspack  glmnet   truth
X1    0.535       0.435       0.535   0.435   0.436   0.500
X2   -0.530      -0.429      -0.530  -0.429  -0.429  -0.500
X3    0.234       0.124       0.234   0.124   0.124   0.250
X4   -0.294      -0.207      -0.294  -0.207  -0.208  -0.250
X5    0.126       0.020       0.126   0.020   0.020   0.125
X6   -0.159      -0.055      -0.159  -0.055  -0.055  -0.125
X7   -0.017       0.000       0.000   0.000   0.000   0.000
X8    0.010       0.000       0.000   0.000   0.000   0.000
X9   -0.005       0.000       0.000   0.000   0.000   0.000
X10   0.011       0.000       0.000   0.000   0.000   0.000

L2 (ridge) regularization

Compare to the lasso section.

Data Setup

library(tidyverse)

set.seed(8675309)

N = 500
p = 10
X = scale(matrix(rnorm(N * p), ncol = p))
b = c(.5, -.5, .25, -.25, .125, -.125, rep(0, 4))
y = scale(X %*% b + rnorm(N, sd = .5))

Function

ridge <- function(w, X, y, lambda = .1) {
  # X: model matrix; 
  # y: target; 
  # lambda: penalty parameter; 
  # w: the weights/coefficients
  
  crossprod(y - X %*% w) + lambda * length(y) * crossprod(w)
}

Estimation

Note that if lambda = 0, i.e. no penalty, the result is the same as what you'd get from base R's lm.fit.

fit_ridge = optim(
  rep(0, ncol(X)),
  ridge,
  X = X,
  y = y,
  lambda = .1,
  method = 'BFGS'
)

Analytical result.
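
In formula form, this is the usual ridge estimator, with the penalty scaled by N to match the objective above (my notation):

$$
\hat{\beta}_{\text{ridge}} = (X^\top X + N\lambda I)^{-1} X^\top y
$$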

fit_ridge2 = solve(crossprod(X) + diag(length(y)*.1, ncol(X))) %*% crossprod(X, y)

An alternative approach uses 'augmented' data (sigma is ignored here since it equals 1; more generally one would use X/sigma and y/sigma).
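
The reason this works (my notation): stacking $\sqrt{N\lambda}\,I$ below $X$ and zeros below $y$ means ordinary least squares on the augmented data minimizes

$$
\left\lVert \begin{pmatrix} y \\ 0 \end{pmatrix} - \begin{pmatrix} X \\ \sqrt{N\lambda}\,I \end{pmatrix} \beta \right\rVert^2
= \lVert y - X\beta \rVert^2 + N\lambda \lVert \beta \rVert^2,
$$

which is exactly the ridge objective above.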

X2 = rbind(X, diag(sqrt(length(y)*.1), ncol(X)))
y2 = c(y, rep(0, ncol(X)))

tail(X2)
       [,1] [,2] [,3] [,4]     [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
[505,]    0    0    0    0 7.071068 0.000000 0.000000 0.000000 0.000000 0.000000
[506,]    0    0    0    0 0.000000 7.071068 0.000000 0.000000 0.000000 0.000000
[507,]    0    0    0    0 0.000000 0.000000 7.071068 0.000000 0.000000 0.000000
[508,]    0    0    0    0 0.000000 0.000000 0.000000 7.071068 0.000000 0.000000
[509,]    0    0    0    0 0.000000 0.000000 0.000000 0.000000 7.071068 0.000000
[510,]    0    0    0    0 0.000000 0.000000 0.000000 0.000000 0.000000 7.071068
tail(y2)
[1] 0 0 0 0 0 0

fit_ridge3 = solve(crossprod(X2)) %*% crossprod(X2, y2)

The glmnet approach by default mixes the ridge and lasso penalties; setting alpha = 0 gives the ridge penalty (alpha = 1 would be the lasso). We supply a few lambda values but extract only the coefficients corresponding to the lambda used above (via the s argument).

library(glmnet)

fit_glmnet = coef(
  glmnet(
    X,
    y,
    alpha = 0,
    lambda = c(10, 1, .1),
    thresh = 1e-12,
    intercept = F
  ), 
  s = .1
)

Comparison

We can now compare the coefficients of all our results.

         lm   ridge  ridge2  ridge3  glmnet   truth
X1    0.535   0.485   0.485   0.485   0.485   0.500
X2   -0.530  -0.481  -0.481  -0.481  -0.481  -0.500
X3    0.234   0.209   0.209   0.209   0.209   0.250
X4   -0.294  -0.269  -0.269  -0.269  -0.269  -0.250
X5    0.126   0.115   0.115   0.115   0.115   0.125
X6   -0.159  -0.146  -0.146  -0.146  -0.146  -0.125
X7   -0.017  -0.022  -0.022  -0.022  -0.022   0.000
X8    0.010   0.007   0.007   0.007   0.007   0.000
X9   -0.005   0.001   0.001   0.001   0.001   0.000
X10   0.011   0.011   0.011   0.011   0.011   0.000