# Penalized Maximum Likelihood

## Introduction

This demonstration regards a standard regression model via penalized likelihood. See the Maximum Likelihood chapter for a starting point. Here the penalty is specified (via lambda argument), but one would typically estimate the model via cross-validation or some other fashion. Two penalties are possible with the function. One using the (squared) L2 norm (aka ridge regression, Tikhonov regularization), another using the L1 norm (aka lasso) which has the possibility of penalizing coefficients to zero, and thus can serve as a model selection procedure. I have more technical approaches to the lasso and ridge in the lasso and ridge sections.

Note that both L2 and L1 approaches can be seen as maximum a posteriori (MAP) estimates for a Bayesian regression with a specific prior on the coefficients. The L2 approach is akin to a normal prior with zero mean, while L1 is akin to a zero mean Laplace prior. See the Bayesian regression chapter for an approach in that regard.

### Data Setup

library(tidyverse)

set.seed(123)  # ensures replication

# predictors and response
N = 100 # sample size
k = 2   # number of desired predictors
X = matrix(rnorm(N * k), ncol = k)
y = -.5 + .2*X[, 1] + .1*X[, 2] + rnorm(N, sd = .5)  # increasing N will get estimated values closer to these

dfXy = data.frame(X,y)

### Functions

A straightforward function to estimate the log likelihood.

penalized_ML <- function(
par,
X,
y,
lambda = .1,
type = 'L2'
) {
# arguments-
# par: parameters to be estimated
# X: predictor matrix with intercept column
# y: response
# lambda: penalty coefficient
# type: penalty approach

# setup
beta   = par[-1]                               # coefficients
sigma2 = par                                # error variance
sigma  = sqrt(sigma2)
N = nrow(X)

LP = X %*% beta                                # linear predictor
mu = LP                                        # identity link in the glm sense

# calculate likelihood
L = dnorm(y, mean = mu, sd = sigma, log = T)   # log likelihood

switch(
type,
'L2' = -sum(L) + lambda * crossprod(beta[-1]),    # the intercept is not penalized
'L1' = -sum(L) + lambda * sum(abs(beta[-1]))
)
}

This next function is a glmnet style approach that will put the lambda coefficient on equivalent scale. It uses a different objective function. Note that glmnet is actually using elasticnet, which mixes both L1 and L2 penalties.

penalized_ML2 <- function(
par,
X,
y,
lambda = .1,
type = 'L2'
) {

# arguments-
# par: parameters to be estimated
# X: predictor matrix with intercept column
# y: response
# lambda: penalty coefficient
# type: penalty approach

# setup
beta = par                                   # coefficients
N = nrow(X)

# linear predictor
LP = X %*% beta                              # linear predictor
mu = LP                                      # identity link in the glm sense

switch(
type,
'L2' = .5 * crossprod(y - X %*% beta) / N + lambda * crossprod(beta[-1]),
'L1' = .5 * crossprod(y - X %*% beta) / N + lambda * sum(abs(beta[-1]))
)
}

### Estimation

Setup the model matrix for use with optim.

X = cbind(1, X)

We’ll need to set initial values. Note we’d normally want to handle the sigma parameter differently as it’s bounded by zero, but we’ll ignore for demonstration.

init = c(1, rep(0, ncol(X)))
names(init) = c('sigma2', 'intercept', 'b1', 'b2')

fit_l2 = optim(
par = init,
fn  = penalized_ML,
X   = X,
y   = y,
lambda  = 1,
control = list(reltol = 1e-12)
)

fit_l1 = optim(
par = init,
fn  = penalized_ML,
X   = X,
y   = y,
lambda  = 1,
type    = 'L1',
control = list(reltol = 1e-12)
)

### Comparison

Compare to lm in base R.

fit_lm = lm(y ~ ., dfXy)
sigma2 intercept b1 b2
fit_l2 0.219 -0.432 0.133 0.111
fit_l1 0.219 -0.432 0.131 0.109
fit_lm 0.226 -0.432 0.133 0.112

Compare to glmnet. Setting alpha to 0 and 1 is equivalent to L2 and L1 penalties respectively. You also wouldn’t want to specify lambda normally, and rather let it come about as part of the estimation procedure. We do so here just for demonstration.

library(glmnet)

glmnetL2 = glmnet(
X[, -1],
y,
alpha  = 0,
lambda = .01,
standardize = FALSE
)

glmnetL1 = glmnet(
X[, -1],
y,
alpha  = 1,
lambda = .01,
standardize = FALSE
)

pars_L2 = optim(
par = init[-1],
fn  = penalized_ML2,
X   = X,
y   = y,
lambda  = .01,
control = list(reltol = 1e-12)
)$par pars_L1 = optim( par = init[-1], fn = penalized_ML2, X = X, y = y, lambda = .01, type = 'L1', control = list(reltol = 1e-12) )$par
(Intercept) V1 V2
s0 -0.4324 0.1301 0.1094
pars_L2 -0.4324 0.1301 0.1094
s0 -0.4325 0.1207 0.1005
pars_L1 -0.4325 0.1207 0.1005

See the subsequent chapters for an additional look at both lasso and ridge regression approaches.

### Source

Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/penalized_ML.R

## L1 (lasso) regularization

See Tibshirani (1996) for the original paper, or Murphy PML (2012) for a nice overview (watch for typos in depictions).

### Data Setup

library(tidyverse)

set.seed(8675309)

N = 500
p = 10
X = scale(matrix(rnorm(N*p), ncol = p))
b = c(.5, -.5, .25, -.25, .125, -.125, rep(0, p - 6))
y = scale(X %*% b + rnorm(N, sd = .5))
lambda = .1

### Function

Coordinate descent function.

lasso <- function(
X,                   # model matrix
y,                   # target
lambda  = .1,        # penalty parameter
soft    = TRUE,      # soft vs. hard thresholding
tol     = 1e-6,      # tolerance
iter    = 100,       # number of max iterations
verbose = TRUE       # print out iteration number
) {

# soft thresholding function
soft_thresh <- function(a, b) {
out = rep(0, length(a))
out[a >  b] = a[a > b] - b
out[a < -b] = a[a < -b] + b
out
}

w  = solve(crossprod(X) + diag(lambda, ncol(X))) %*% crossprod(X,y)
tol_curr = 1
J  = ncol(X)
a  = rep(0, J)
c_ = rep(0, J)
i  = 1

while (tol < tol_curr && i < iter) {
w_old = w
a  = colSums(X^2)
l  = length(y)*lambda  # for consistency with glmnet approach
c_ = sapply(1:J, function(j)  sum( X[,j] * (y - X[,-j] %*% w_old[-j]) ))

if (soft) {
for (j in 1:J) {
w[j] = soft_thresh(c_[j]/a[j], l/a[j])
}
}
else {
w = w_old
w[c_< l & c_ > -l] = 0
}

tol_curr = crossprod(w - w_old)

i = i + 1

if (verbose && i%%10 == 0) message(i)
}

w
}

### Estimation

Note, if lambda = 0, i.e. no penalty, the result is the same as what you would get from the base R lm.fit.

fit_soft = lasso(
X,
y,
lambda = lambda,
tol    = 1e-12,
soft   = TRUE
)

fit_hard = lasso(
X,
y,
lambda = lambda,
tol    = 1e-12,
soft   = FALSE
)

The glmnet approach is by default a mixture of ridge and lasso penalties, setting alpha = 1 reduces to lasso (alpha = 0 would be ridge). We set the lambda to a couple values while only wanting the one set to the same lambda value as above (s).

library(glmnet)

fit_glmnet = coef(
glmnet(
X,
y,
alpha  = 1,
lambda = c(10, 1, lambda),
thresh = 1e-12,
intercept = FALSE
),
s = lambda
)

library(lassoshooting)

fit_ls = lassoshooting(
X = X,
y = y,
lambda = length(y) * lambda,
thr = 1e-12
)

### Comparison

We can now compare the coefficients of all our results.

lm lasso_soft lasso_hard lspack glmnet truth
X1 0.535 0.435 0.535 0.435 0.436 0.500
X2 -0.530 -0.429 -0.530 -0.429 -0.429 -0.500
X3 0.234 0.124 0.234 0.124 0.124 0.250
X4 -0.294 -0.207 -0.294 -0.207 -0.208 -0.250
X5 0.126 0.020 0.126 0.020 0.020 0.125
X6 -0.159 -0.055 -0.159 -0.055 -0.055 -0.125
X7 -0.017 0.000 0.000 0.000 0.000 0.000
X8 0.010 0.000 0.000 0.000 0.000 0.000
X9 -0.005 0.000 0.000 0.000 0.000 0.000
X10 0.011 0.000 0.000 0.000 0.000 0.000

### Source

Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/lasso.R

## L2 (ridge) regularization

Compare to the lasso section.

### Data Setup

library(tidyverse)

set.seed(8675309)

N = 500
p = 10
X = scale(matrix(rnorm(N * p), ncol = p))
b = c(.5, -.5, .25, -.25, .125, -.125, rep(0, 4))
y = scale(X %*% b + rnorm(N, sd = .5))

### Function

ridge <- function(w, X, y, lambda = .1) {
# X: model matrix;
# y: target;
# lambda: penalty parameter;
# w: the weights/coefficients

crossprod(y - X %*% w) + lambda * length(y) * crossprod(w)
}

### Estimation

Note, if lambda = 0, i.e. no penalty, the result is the same as what you would get from the base R lm.fit.

fit_ridge = optim(
rep(0, ncol(X)),
ridge,
X = X,
y = y,
lambda = .1,
method = 'BFGS'
)

Analytical result.

fit_ridge2 = solve(crossprod(X) + diag(length(y)*.1, ncol(X))) %*% crossprod(X, y)

An alternative approach using ‘augmented’ data (note sigma is ignored as it equals 1, but otherwise X/sigma and y/sigma).

X2 = rbind(X, diag(sqrt(length(y)*.1), ncol(X)))
y2 = c(y, rep(0, ncol(X)))

tail(X2)
       [,1] [,2] [,3] [,4]     [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
[505,]    0    0    0    0 7.071068 0.000000 0.000000 0.000000 0.000000 0.000000
[506,]    0    0    0    0 0.000000 7.071068 0.000000 0.000000 0.000000 0.000000
[507,]    0    0    0    0 0.000000 0.000000 7.071068 0.000000 0.000000 0.000000
[508,]    0    0    0    0 0.000000 0.000000 0.000000 7.071068 0.000000 0.000000
[509,]    0    0    0    0 0.000000 0.000000 0.000000 0.000000 7.071068 0.000000
[510,]    0    0    0    0 0.000000 0.000000 0.000000 0.000000 0.000000 7.071068
tail(y2)
 0 0 0 0 0 0
fit_ridge3 = solve(crossprod(X2)) %*% crossprod(X2, y2)

The glmnet approach is by default a mixture of ridge and lasso penalties, setting alpha = 1 reduces to lasso (alpha = 0 would be ridge). We set the lambda to a couple values while only wanting the one set to the same lambda value as above (s).

library(glmnet)

fit_glmnet = coef(
glmnet(
X,
y,
alpha = 0,
lambda = c(10, 1, .1),
thresh = 1e-12,
intercept = F
),
s = .1
)

### Comparison

We can now compare the coefficients of all our results.

lm ridge ridge2 ridge3 glmnet truth
X1 0.535 0.485 0.485 0.485 0.485 0.500
X2 -0.530 -0.481 -0.481 -0.481 -0.481 -0.500
X3 0.234 0.209 0.209 0.209 0.209 0.250
X4 -0.294 -0.269 -0.269 -0.269 -0.269 -0.250
X5 0.126 0.115 0.115 0.115 0.115 0.125
X6 -0.159 -0.146 -0.146 -0.146 -0.146 -0.125
X7 -0.017 -0.022 -0.022 -0.022 -0.022 0.000
X8 0.010 0.007 0.007 0.007 0.007 0.000
X9 -0.005 0.001 0.001 0.001 0.001 0.000
X10 0.011 0.011 0.011 0.011 0.011 0.000

### Source

Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/ridge.R