Penalized Maximum Likelihood
Introduction
This demonstration estimates a standard regression model via penalized likelihood.
See the Maximum Likelihood chapter for a starting point. Here
the penalty is specified directly (via the lambda
argument), but one would typically
choose it via cross-validation or some other approach. Two penalties are
possible with the function: one uses the (squared) L2 norm (aka ridge
regression, Tikhonov regularization), the other the L1 norm (aka the lasso),
which can penalize coefficients all the way to zero, and thus can serve
as a model selection procedure. I provide more technical approaches to the lasso
and ridge in the lasso and ridge sections that follow.
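In symbols, writing $\ell(\beta, \sigma^2)$ for the usual Gaussian log likelihood, the two objectives minimized by the function below are

$$
-\ell(\beta, \sigma^2) + \lambda \sum_{j} \beta_j^2
\qquad\text{and}\qquad
-\ell(\beta, \sigma^2) + \lambda \sum_{j} |\beta_j|,
$$

where the sums run over the non-intercept coefficients.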
Note that both L2 and L1 approaches can be seen as maximum a posteriori (MAP) estimates for a Bayesian regression with a specific prior on the coefficients. The L2 approach is akin to a normal prior with zero mean, while L1 is akin to a zero mean Laplace prior. See the Bayesian regression chapter for an approach in that regard.
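To make the MAP connection explicit (a sketch of the standard argument; the exact correspondence for $\lambda$ depends on how $\sigma^2$ enters the prior scale):

$$
\hat{\beta}_{\text{MAP}} = \underset{\beta}{\arg\max}\; \log p(y \mid X, \beta) + \log p(\beta)
$$

A $\mathcal{N}(0, \tau^2)$ prior on each coefficient contributes $-\frac{1}{2\tau^2}\sum_j \beta_j^2$ to the log posterior, recovering the L2 penalty with $\lambda$ in the role of $\frac{1}{2\tau^2}$, while a Laplace$(0, b)$ prior contributes $-\frac{1}{b}\sum_j |\beta_j|$, recovering the L1 penalty with $\lambda$ in the role of $\frac{1}{b}$.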
Data Setup
library(tidyverse)
set.seed(123) # ensures replication
# predictors and response
N = 100   # sample size
k = 2     # number of desired predictors
X = matrix(rnorm(N * k), ncol = k)
y = -.5 + .2*X[, 1] + .1*X[, 2] + rnorm(N, sd = .5)  # increasing N will get estimated values closer to these

dfXy = data.frame(X, y)
Functions
A straightforward function that computes the penalized negative log likelihood.
penalized_ML <- function(
  par,
  X,
  y,
  lambda = .1,
  type = 'L2'
) {
  # arguments-
  # par: parameters to be estimated
  # X: predictor matrix with intercept column
  # y: response
  # lambda: penalty coefficient
  # type: penalty approach

  # setup
  beta   = par[-1]        # coefficients
  sigma2 = par[1]         # error variance
  sigma  = sqrt(sigma2)
  N = nrow(X)

  LP = X %*% beta         # linear predictor
  mu = LP                 # identity link in the glm sense

  # calculate likelihood
  L = dnorm(y, mean = mu, sd = sigma, log = TRUE)   # log likelihood

  switch(
    type,
    'L2' = -sum(L) + lambda * crossprod(beta[-1]),  # the intercept is not penalized
    'L1' = -sum(L) + lambda * sum(abs(beta[-1]))
  )
}
This next function is a glmnet-style
approach that puts the lambda coefficient on an equivalent scale. It uses a
different objective function. Note that glmnet actually uses the elastic net,
which mixes both L1 and L2 penalties.
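In symbols, the objective minimized by penalized_ML2 is

$$
\frac{1}{2N}\sum_{i=1}^N \big(y_i - x_i^\top\beta\big)^2 + \lambda\, P(\beta),
\qquad
P(\beta) = \sum_{j} \beta_j^2 \;\;(\text{L2}) \quad\text{or}\quad \sum_{j} |\beta_j| \;\;(\text{L1}),
$$

with the intercept again left out of the penalty. Using the average squared error rather than the negative log likelihood is what puts lambda on a scale comparable to the one passed to glmnet below.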
penalized_ML2 <- function(
  par,
  X,
  y,
  lambda = .1,
  type = 'L2'
) {
  # arguments-
  # par: parameters to be estimated
  # X: predictor matrix with intercept column
  # y: response
  # lambda: penalty coefficient
  # type: penalty approach

  # setup
  beta = par              # coefficients
  N = nrow(X)

  # linear predictor
  LP = X %*% beta
  mu = LP                 # identity link in the glm sense

  switch(
    type,
    'L2' = .5 * crossprod(y - X %*% beta) / N + lambda * crossprod(beta[-1]),
    'L1' = .5 * crossprod(y - X %*% beta) / N + lambda * sum(abs(beta[-1]))
  )
}
Estimation
Set up the model matrix for use with optim.
X = cbind(1, X)
We’ll need to set initial values. Note that we’d normally want to handle the sigma parameter differently, as it’s bounded below by zero, but we’ll ignore that for this demonstration.
init = c(1, rep(0, ncol(X)))
names(init) = c('sigma2', 'intercept', 'b1', 'b2')
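As an aside, one standard way to respect that bound is to optimize over log(sigma2) instead of sigma2 itself. A minimal sketch of such a wrapper around penalized_ML (hypothetical; not used in what follows):

penalized_ML_logsig <- function(par, X, y, lambda = .1, type = 'L2') {
  # hypothetical wrapper: par[1] holds log(sigma2), which is unconstrained
  penalized_ML(c(exp(par[1]), par[-1]), X, y, lambda = lambda, type = type)
}

init_log = c(log(1), rep(0, ncol(X)))   # log(sigma2) = 0, i.e. sigma2 = 1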
fit_l2 = optim(
  par = init,
  fn  = penalized_ML,
  X   = X,
  y   = y,
  lambda  = 1,
  control = list(reltol = 1e-12)
)

fit_l1 = optim(
  par = init,
  fn  = penalized_ML,
  X   = X,
  y   = y,
  lambda  = 1,
  type    = 'L1',
  control = list(reltol = 1e-12)
)
Comparison
Compare to lm in base R.

fit_lm = lm(y ~ ., dfXy)
|        | sigma2 | intercept | b1    | b2    |
|--------|--------|-----------|-------|-------|
| fit_l2 | 0.219  | -0.432    | 0.133 | 0.111 |
| fit_l1 | 0.219  | -0.432    | 0.131 | 0.109 |
| fit_lm | 0.226  | -0.432    | 0.133 | 0.112 |
Compare to glmnet. Setting alpha to 0 or 1 gives the L2 and L1 penalties, respectively. Normally you also wouldn't specify lambda directly, but rather let it be determined as part of the estimation procedure (e.g. via cross-validation); we fix it here just for demonstration.
library(glmnet)
glmnetL2 = glmnet(
  X[, -1],
  y,
  alpha  = 0,
  lambda = .01,
  standardize = FALSE
)

glmnetL1 = glmnet(
  X[, -1],
  y,
  alpha  = 1,
  lambda = .01,
  standardize = FALSE
)
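As an aside, lambda would normally come from cross-validation rather than being fixed in advance; a minimal sketch with cv.glmnet (not part of the original comparison; cv_fit is just an illustrative name):

cv_fit = cv.glmnet(X[, -1], y, alpha = 0)   # 10-fold CV over a default lambda sequence
cv_fit$lambda.min                           # lambda minimizing CV error
cv_fit$lambda.1se                           # largest lambda within one SE of that minimum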
pars_L2 = optim(
  par = init[-1],
  fn  = penalized_ML2,
  X   = X,
  y   = y,
  lambda  = .01,
  control = list(reltol = 1e-12)
)$par

pars_L1 = optim(
  par = init[-1],
  fn  = penalized_ML2,
  X   = X,
  y   = y,
  lambda  = .01,
  type    = 'L1',
  control = list(reltol = 1e-12)
)$par
|          | (Intercept) | V1     | V2     |
|----------|-------------|--------|--------|
| glmnetL2 | -0.4324     | 0.1301 | 0.1094 |
| pars_L2  | -0.4324     | 0.1301 | 0.1094 |
| glmnetL1 | -0.4325     | 0.1207 | 0.1005 |
| pars_L1  | -0.4325     | 0.1207 | 0.1005 |
See the subsequent chapters for an additional look at both lasso and ridge regression approaches.
Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/penalized_ML.R
L1 (lasso) regularization
See Tibshirani (1996) for the original paper, or Murphy PML (2012) for a nice overview (watch for typos in depictions).
Data Setup
library(tidyverse)
set.seed(8675309)
N = 500
p = 10
X = scale(matrix(rnorm(N * p), ncol = p))
b = c(.5, -.5, .25, -.25, .125, -.125, rep(0, p - 6))
y = scale(X %*% b + rnorm(N, sd = .5))

lambda = .1
Function
Coordinate descent function.
lasso <- function(
  X,                 # model matrix
  y,                 # target
  lambda  = .1,      # penalty parameter
  soft    = TRUE,    # soft vs. hard thresholding
  tol     = 1e-6,    # tolerance
  iter    = 100,     # number of max iterations
  verbose = TRUE     # print out iteration number
) {

  # soft thresholding function
  soft_thresh <- function(a, b) {
    out = rep(0, length(a))
    out[a >  b] = a[a >  b] - b
    out[a < -b] = a[a < -b] + b
    out
  }

  w = solve(crossprod(X) + diag(lambda, ncol(X))) %*% crossprod(X, y)
  tol_curr = 1
  J  = ncol(X)
  a  = rep(0, J)
  c_ = rep(0, J)
  i  = 1

  while (tol < tol_curr && i < iter) {
    w_old = w
    a  = colSums(X^2)
    l  = length(y) * lambda   # for consistency with glmnet approach
    c_ = sapply(1:J, function(j) sum( X[, j] * (y - X[, -j] %*% w_old[-j]) ))

    if (soft) {
      for (j in 1:J) {
        w[j] = soft_thresh(c_[j] / a[j], l / a[j])
      }
    }
    else {
      w = w_old
      w[c_ < l & c_ > -l] = 0
    }

    tol_curr = crossprod(w - w_old)

    i = i + 1

    if (verbose && i %% 10 == 0) message(i)
  }

  w
}
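For reference, each pass of the loop applies the standard coordinate-wise update. With $a_j = \sum_i x_{ij}^2$ and $c_j = \sum_i x_{ij}\big(y_i - \sum_{k \neq j} x_{ik} w_k\big)$, the soft-thresholding update is

$$
w_j \leftarrow S\!\left(\frac{c_j}{a_j}, \frac{N\lambda}{a_j}\right),
\qquad
S(z, \gamma) = \operatorname{sign}(z)\max(|z| - \gamma,\, 0),
$$

where $N\lambda$ is the quantity l in the code. The hard-thresholding variant instead sets $w_j = 0$ whenever $|c_j| < N\lambda$ and otherwise leaves the coefficient at its previous value.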
Estimation
Note, if lambda = 0, i.e. no penalty, the result is the same as what you would get from base R's lm.fit.
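A quick check of that claim (a sketch, not part of the original code; fit_nopen is just an illustrative name):

fit_nopen = lasso(X, y, lambda = 0, tol = 1e-12, verbose = FALSE)

data.frame(
  coord_desc = as.vector(fit_nopen),
  lm_fit     = as.vector(coef(lm.fit(X, y)))
)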
fit_soft = lasso(
  X,
  y,
  lambda = lambda,
  tol    = 1e-12,
  soft   = TRUE
)

fit_hard = lasso(
  X,
  y,
  lambda = lambda,
  tol    = 1e-12,
  soft   = FALSE
)
The glmnet approach is by default a mixture
of ridge and lasso penalties; setting alpha = 1
reduces to the lasso (alpha = 0
would be ridge). We supply a few lambda values, but only extract the coefficients
for the one equal to the lambda set above (via the s argument).
library(glmnet)
fit_glmnet = coef(
  glmnet(
    X,
    y,
    alpha  = 1,
    lambda = c(10, 1, lambda),
    thresh = 1e-12,
    intercept = FALSE
  ),
  s = lambda
)

library(lassoshooting)

fit_ls = lassoshooting(
  X = X,
  y = y,
  lambda = length(y) * lambda,
  thr    = 1e-12
)
Comparison
We can now compare the coefficients of all our results.
|     | lm     | lasso_soft | lasso_hard | lspack | glmnet | truth  |
|-----|--------|------------|------------|--------|--------|--------|
| X1  | 0.535  | 0.435      | 0.535      | 0.435  | 0.436  | 0.500  |
| X2  | -0.530 | -0.429     | -0.530     | -0.429 | -0.429 | -0.500 |
| X3  | 0.234  | 0.124      | 0.234      | 0.124  | 0.124  | 0.250  |
| X4  | -0.294 | -0.207     | -0.294     | -0.207 | -0.208 | -0.250 |
| X5  | 0.126  | 0.020      | 0.126      | 0.020  | 0.020  | 0.125  |
| X6  | -0.159 | -0.055     | -0.159     | -0.055 | -0.055 | -0.125 |
| X7  | -0.017 | 0.000      | 0.000      | 0.000  | 0.000  | 0.000  |
| X8  | 0.010  | 0.000      | 0.000      | 0.000  | 0.000  | 0.000  |
| X9  | -0.005 | 0.000      | 0.000      | 0.000  | 0.000  | 0.000  |
| X10 | 0.011  | 0.000      | 0.000      | 0.000  | 0.000  | 0.000  |
Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/lasso.R
L2 (ridge) regularization
Compare to the lasso section.
Data Setup
library(tidyverse)
set.seed(8675309)
N = 500
p = 10
X = scale(matrix(rnorm(N * p), ncol = p))
b = c(.5, -.5, .25, -.25, .125, -.125, rep(0, 4))
y = scale(X %*% b + rnorm(N, sd = .5))
Function
ridge <- function(w, X, y, lambda = .1) {
  # w: the weights/coefficients
  # X: model matrix
  # y: target
  # lambda: penalty parameter

  crossprod(y - X %*% w) + lambda * length(y) * crossprod(w)
}
Estimation
Note, if lambda = 0, i.e. no penalty, the result is the same as what you would get from base R's lm.fit.
fit_ridge = optim(
  rep(0, ncol(X)),
  ridge,
  X = X,
  y = y,
  lambda = .1,
  method = 'BFGS'
)
Analytical result.
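Setting the gradient of the objective above to zero gives the familiar closed-form ridge estimator, which the next line computes directly:

$$
\hat{w} = \big(X^\top X + N\lambda I\big)^{-1} X^\top y
$$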
fit_ridge2 = solve(crossprod(X) + diag(length(y) * .1, ncol(X))) %*% crossprod(X, y)
An alternative approach uses ‘augmented’ data (note that sigma is ignored here as it equals 1, but otherwise we would use X/sigma and y/sigma).
X2 = rbind(X, diag(sqrt(length(y) * .1), ncol(X)))
y2 = c(y, rep(0, ncol(X)))
tail(X2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[505,] 0 0 0 0 7.071068 0.000000 0.000000 0.000000 0.000000 0.000000
[506,] 0 0 0 0 0.000000 7.071068 0.000000 0.000000 0.000000 0.000000
[507,] 0 0 0 0 0.000000 0.000000 7.071068 0.000000 0.000000 0.000000
[508,] 0 0 0 0 0.000000 0.000000 0.000000 7.071068 0.000000 0.000000
[509,] 0 0 0 0 0.000000 0.000000 0.000000 0.000000 7.071068 0.000000
[510,] 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 7.071068
tail(y2)
[1] 0 0 0 0 0 0
fit_ridge3 = solve(crossprod(X2)) %*% crossprod(X2, y2)
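This works because the augmented residual sum of squares reproduces the penalized objective:

$$
\lVert y_2 - X_2 w \rVert^2 \;=\; \lVert y - X w \rVert^2 + N\lambda \lVert w \rVert^2,
$$

so ordinary least squares on the augmented data $(X_2, y_2)$ is exactly ridge regression on $(X, y)$.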
The glmnet approach is by default a mixture
of ridge and lasso penalties; setting alpha = 0
reduces to ridge (alpha = 1
would be the lasso). We supply a few lambda values, but only extract the coefficients
for the one equal to the lambda used above (via the s argument).
library(glmnet)
fit_glmnet = coef(
  glmnet(
    X,
    y,
    alpha  = 0,
    lambda = c(10, 1, .1),
    thresh = 1e-12,
    intercept = FALSE
  ),
  s = .1
)
Comparison
We can now compare the coefficients of all our results.
|     | lm     | ridge  | ridge2 | ridge3 | glmnet | truth  |
|-----|--------|--------|--------|--------|--------|--------|
| X1  | 0.535  | 0.485  | 0.485  | 0.485  | 0.485  | 0.500  |
| X2  | -0.530 | -0.481 | -0.481 | -0.481 | -0.481 | -0.500 |
| X3  | 0.234  | 0.209  | 0.209  | 0.209  | 0.209  | 0.250  |
| X4  | -0.294 | -0.269 | -0.269 | -0.269 | -0.269 | -0.250 |
| X5  | 0.126  | 0.115  | 0.115  | 0.115  | 0.115  | 0.125  |
| X6  | -0.159 | -0.146 | -0.146 | -0.146 | -0.146 | -0.125 |
| X7  | -0.017 | -0.022 | -0.022 | -0.022 | -0.022 | 0.000  |
| X8  | 0.010  | 0.007  | 0.007  | 0.007  | 0.007  | 0.000  |
| X9  | -0.005 | 0.001  | 0.001  | 0.001  | 0.001  | 0.000  |
| X10 | 0.011  | 0.011  | 0.011  | 0.011  | 0.011  | 0.000  |
Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/ridge.R