Stochastic Gradient Descent
Here we have ‘online’ learning via stochastic gradient descent. See the standard gradient descent chapter. In the following we have basic data for standard regression, but in this ‘online’ case we can assume each observation comes to us as a stream over time rather than as a single batch, and that observations would keep arriving. Note that there are plenty of variations of this approach, and it can be applied in the batch case as well. No stopping point is currently implemented, so that results can be traced over all data points/iterations. On revisiting this much later, I thought it useful to add that I believe this was motivated by the example in Murphy’s Probabilistic Machine Learning text.
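In equation form, the basic per-observation update is just a gradient step applied one data point at a time; for the squared error loss on observation $(x_i, y_i)$ used in the function below:

$$
\beta \leftarrow \beta - \eta\, \nabla \ell_i(\beta),
\qquad
\nabla \ell_i(\beta) = x_i^\top (x_i \beta - y_i)
$$

where $\eta$ is the learning rate (`stepsize`).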
Data Setup
Create some data for a standard linear regression.
library(tidyverse)

set.seed(1234)

n  = 1000
x1 = rnorm(n)
x2 = rnorm(n)
y  = 1 + .5*x1 + .2*x2 + rnorm(n)
X  = cbind(Intercept = 1, x1, x2)
Function
The estimating function using the adagrad approach.
sgd <- function(
  par,                   # parameter estimates
  X,                     # model matrix
  y,                     # target variable
  stepsize     = 1,      # the learning rate
  stepsize_tau = 0,      # if > 0, a check on the LR at early iterations
  average = FALSE        # a variation of the approach
){
  # initialize
  beta = par
  names(beta) = colnames(X)
  betamat = matrix(0, nrow(X), ncol = length(beta))   # Collect all estimates
  fits    = NA                                        # Collect fitted values at each point
  loss    = NA                                        # Collect loss at each point
  s       = 0                                         # adagrad per-parameter learning rate adjustment
  eps     = 1e-8                                      # a smoothing term to avoid division by zero

  for (i in 1:nrow(X)) {
    Xi   = X[i, , drop = FALSE]
    yi   = y[i]
    LP   = Xi %*% beta               # matrix operations not necessary,
    grad = t(Xi) %*% (LP - yi)       # but makes consistent with standard gd func
    s    = s + grad^2                # adagrad approach

    # update
    beta = beta - stepsize / (stepsize_tau + sqrt(s + eps)) * grad

    if (average & i > 1) {
      beta = beta - 1/i * (betamat[i - 1, ] - beta)   # a variation
    }

    betamat[i, ] = beta
    fits[i]  = LP
    loss[i]  = (LP - yi)^2
    grad_old = grad
  }

  LP = X %*% beta
  lastloss = crossprod(LP - y)

  list(
    par = beta,                      # final estimates
    par_chain = betamat,             # estimates at each iteration
    RMSE = sqrt(sum(lastloss) / nrow(X)),
    fitted = LP
  )
}
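For reference, the adagrad adjustment implemented above accumulates squared gradients and scales the step for each parameter accordingly (elementwise):

$$
s \leftarrow s + g^2,
\qquad
\beta \leftarrow \beta - \frac{\eta}{\tau + \sqrt{s + \epsilon}}\, g
$$

where $g$ is the current gradient, $\eta$ is `stepsize`, $\tau$ is `stepsize_tau`, and $\epsilon$ is the smoothing term.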
Estimation
Set starting values.
starting_values = rep(0, 3)
For any particular data you might have to fiddle with the `stepsize`, perhaps choosing one based on cross-validation with old data.
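For example, a rough sketch of checking a small grid of step sizes via the RMSE the function returns (here applied to the current data for brevity; with streaming data you would do this on data already collected):

```r
# hypothetical grid of candidate learning rates, evaluated with the sgd function above
stepsize_grid = c(.01, .05, .1, .5, 1)

sapply(stepsize_grid, function(s)
  sgd(
    starting_values,
    X = X,
    y = y,
    stepsize     = s,
    stepsize_tau = .5
  )$RMSE
)
```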
fit_sgd = sgd(
  starting_values,
  X = X,
  y = y,
  stepsize     = .1,
  stepsize_tau = .5,
  average = FALSE
)
str(fit_sgd)
List of 4
$ par : num [1:3, 1] 1.024 0.537 0.148
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "Intercept" "x1" "x2"
.. ..$ : NULL
$ par_chain: num [1:1000, 1:3] -0.06208 -0.00264 0.04781 0.09866 0.08242 ...
$ RMSE : num 1.01
$ fitted : num [1:1000, 1] 0.198 1.218 1.379 -0.141 1.358 ...
fit_sgd$par

               [,1]
Intercept 1.0241049
x1        0.5368198
x2        0.1478470
Comparison
We can compare to standard linear regression.
# summary(lm(y ~ x1 + x2))
coef1 = coef(lm(y ~ x1 + x2))
|         | Intercept | x1    | x2    |
|---------|-----------|-------|-------|
| fit_sgd | 1.024     | 0.537 | 0.148 |
| lm      | 1.030     | 0.518 | 0.163 |
Visualize Estimates
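The original figure isn't reproduced here, but a minimal sketch of how the estimates across iterations might be plotted from `par_chain` (column names and reference lines for the true coefficients are added for this sketch):

```r
# estimates at each iteration; columns correspond to Intercept, x1, x2
par_chain = fit_sgd$par_chain
colnames(par_chain) = c('Intercept', 'x1', 'x2')

as_tibble(par_chain) %>%
  mutate(iteration = row_number()) %>%
  pivot_longer(-iteration, names_to = 'parameter', values_to = 'estimate') %>%
  ggplot(aes(x = iteration, y = estimate, color = parameter)) +
  geom_line() +
  geom_hline(yintercept = c(1, .5, .2), alpha = .25)   # true coefficients for reference
```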
Data Set Shift
This example adds a shift to the previous data, such that the underlying data generating process fundamentally changes at certain points.
Data Setup
We’ll add data with different underlying generating processes.
set.seed(1234)

n2   = 1000
x1.2 = rnorm(n2)
x2.2 = rnorm(n2)
y2 = -1 + .25*x1.2 - .25*x2.2 + rnorm(n2)
X2 = rbind(X, cbind(1, x1.2, x2.2))
coef2 = coef(lm(y2 ~ x1.2 + x2.2))
y2 = c(y, y2)

n3   = 1000
x1.3 = rnorm(n3)
x2.3 = rnorm(n3)
y3 = 1 - .25*x1.3 + .25*x2.3 + rnorm(n3)
coef3 = coef(lm(y3 ~ x1.3 + x2.3))

X3 = rbind(X2, cbind(1, x1.3, x2.3))
y3 = c(y2, y3)
Estimation
We’ll use the same function as before.
fit_sgd_shift = sgd(
  starting_values,
  X = X3,
  y = y3,
  stepsize     = 1,
  stepsize_tau = 0,
  average = FALSE
)
str(fit_sgd_shift)
List of 4
$ par : num [1:3, 1] 0.821 -0.223 0.211
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "Intercept" "x1" "x2"
.. ..$ : NULL
$ par_chain: num [1:3000, 1:3] -1 -0.119 0.624 1.531 1.063 ...
$ RMSE : num 1.57
$ fitted : num [1:3000, 1] 0.836 0.823 0.254 1.479 0.874 ...
Comparison
Compare with lm result for each data part.
|           | Intercept | x1     | x2     |
|-----------|-----------|--------|--------|
| lm_part1  | 1.030     | 0.518  | 0.163  |
| lm_part2  | -0.970    | 0.268  | -0.287 |
| lm_part3  | 1.045     | -0.236 | 0.242  |
| sgd_part1 | 1.086     | 0.513  | 0.146  |
| sgd_part2 | -0.925    | 0.295  | -0.294 |
| sgd_part3 | 0.821     | -0.223 | 0.211  |
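The per-segment SGD estimates shown above can be read off the chain of estimates; a sketch, assuming the last observation of each segment (rows 1000, 2000, and 3000) is used:

```r
# estimates at the end of each of the three data segments
fit_sgd_shift$par_chain[c(1000, 2000, 3000), ]
```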
Visualize Estimates
Visualize estimates across iterations.
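As before, a rough sketch of the estimate paths (not the original figure), with dashed lines marking where the data generating process changes:

```r
par_chain_shift = fit_sgd_shift$par_chain
colnames(par_chain_shift) = c('Intercept', 'x1', 'x2')

as_tibble(par_chain_shift) %>%
  mutate(iteration = row_number()) %>%
  pivot_longer(-iteration, names_to = 'parameter', values_to = 'estimate') %>%
  ggplot(aes(x = iteration, y = estimate, color = parameter)) +
  geom_line() +
  geom_vline(xintercept = c(1000, 2000), linetype = 'dashed')   # shift points
```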
SGD Variants
The above uses the Adagrad approach for stochastic gradient descent, but there are many variations. A good resource can be found here, as well as this post covering more recent developments. We will compare the Adagrad, RMSprop, Adam, and Nadam approaches.
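For reference, the per-parameter updates as implemented in the function below are, elementwise, with $g$ the current gradient, $g_{\text{old}}$ the previous gradient, $\eta$ the stepsize, $\epsilon$ a small smoothing constant, and $i$ the iteration:

$$
\begin{aligned}
\text{Adagrad:}\quad & v \leftarrow v + g^2, \qquad \Delta = \frac{\eta}{\tau + \sqrt{v + \epsilon}}\, g \\[4pt]
\text{RMSprop:}\quad & v = \gamma\, g_{\text{old}}^2 + (1-\gamma)\, g^2, \qquad \Delta = \frac{\eta}{\sqrt{v + \epsilon}}\, g \\[4pt]
\text{Adam:}\quad & m \leftarrow b_1 m + (1-b_1)\, g, \quad v \leftarrow b_2 v + (1-b_2)\, g^2, \quad \Delta = \frac{\eta}{\sqrt{v/(1-b_2^{\,i})} + \epsilon} \cdot \frac{m}{1-b_1^{\,i}} \\[4pt]
\text{Nadam:}\quad & \text{as Adam, but } \Delta = \frac{\eta}{\sqrt{v/(1-b_2^{\,i})} + \epsilon} \left(b_1 m + \frac{1-b_1}{1-b_1^{\,i}}\, g\right)
\end{aligned}
$$

with $\beta \leftarrow \beta - \Delta$ at each iteration. Note that the rmsprop update as coded here is based on the previous gradient rather than a running average of $v$.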
Data Setup
For this demo we’ll bump the sample size. I’ve also made the coefficients a little different.
library(tidyverse)
set.seed(1234)
= 10000
n = rnorm(n)
x1 = rnorm(n)
x2 = cbind(Intercept = 1, x1, x2)
X = c(Intercept = 1, x1 = 1, x2 = -.75)
true
= X %*% true + rnorm(n) y
Function
For this we'll add a functional component to the primary function. We create a function factory `update_ff` that, based on the input, will create an appropriate update step (`update`) for use at each iteration. This is mostly just a programming exercise, but it might allow you to add additional arguments or methods more easily.
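As a toy illustration of the function-factory idea (hypothetical names, not part of the sgd code), a factory returns a function specialized by its input, analogous to `update_ff` returning `update`:

```r
# make_step is a hypothetical function factory: given a type, it returns the
# function that will actually be called later
make_step <- function(type) {
  switch(
    type,
    double = function(x) 2 * x,
    square = function(x) x^2
  )
}

step_fun = make_step('square')
step_fun(3)  # 9
```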
sgd <- function(
  par,                # parameter estimates
  X,                  # model matrix
  y,                  # target variable
  stepsize = 1e-2,    # the learning rate; suggest 1e-3 for non-adagrad methods
  type = 'adagrad',   # one of adagrad, rmsprop, adam or nadam
  average = FALSE,    # a variation of the approach
  ...                 # arguments to pass to an updating function, e.g. gamma in rmsprop
){
  # initialize
  beta = par
  names(beta) = colnames(X)
  betamat  = matrix(0, nrow(X), ncol = length(beta))  # Collect all estimates
  v        = rep(0, length(beta))                     # gradient variance (sum of squares)
  m        = rep(0, length(beta))                     # average of gradients for n/adam
  eps      = 1e-8                                     # a smoothing term to avoid division by zero
  grad_old = rep(0, length(beta))

  update_ff <- function(type, ...) {

    # if stepsize_tau > 0, a check on the LR at early iterations
    adagrad <- function(grad, stepsize_tau = 0) {
      v <<- v + grad^2

      stepsize / (stepsize_tau + sqrt(v + eps)) * grad
    }

    rmsprop <- function(grad, grad_old, gamma = .9) {
      v = gamma * grad_old^2 + (1 - gamma) * grad^2

      stepsize / sqrt(v + eps) * grad
    }

    adam <- function(grad, b1 = .9, b2 = .999) {
      m <<- b1 * m + (1 - b1) * grad
      v <<- b2 * v + (1 - b2) * grad^2

      if (type == 'adam')
        # dividing v and m by 1 - b*^i is the 'bias correction'
        stepsize / (sqrt(v / (1 - b2^i)) + eps) * (m / (1 - b1^i))
      else
        # nadam
        stepsize / (sqrt(v / (1 - b2^i)) + eps) * (b1 * m + (1 - b1)/(1 - b1^i) * grad)
    }

    switch(
      type,
      adagrad = function(grad, ...) adagrad(grad, ...),
      rmsprop = function(grad, ...) rmsprop(grad, grad_old, ...),
      adam    = function(grad, ...) adam(grad, ...),
      nadam   = function(grad, ...) adam(grad, ...)
    )
  }

  update = update_ff(type, ...)

  for (i in 1:nrow(X)) {
    Xi   = X[i, , drop = FALSE]
    yi   = y[i]
    LP   = Xi %*% beta              # matrix operations not necessary,
    grad = t(Xi) %*% (LP - yi)      # but makes consistent with standard gd func

    # update
    beta = beta - update(grad, ...)

    if (average & i > 1) {
      beta = beta - 1/i * (betamat[i - 1, ] - beta)  # a variation
    }

    betamat[i, ] = beta
    grad_old     = grad
  }

  LP = X %*% beta
  lastloss = crossprod(LP - y)

  list(
    par = beta,                     # final estimates
    par_chain = betamat,            # estimates at each iteration
    RMSE = sqrt(sum(lastloss) / nrow(X)),
    fitted = LP
  )
}
Estimation
We’ll now use all four methods for estimation.
starting_values = rep(0, ncol(X))
# starting_values = runif(3, min = -1)

fit_adagrad = sgd(
  starting_values,
  X = X,
  y = y,
  stepsize = .1     # suggestion is .01 for many settings, but this works better here
)

fit_rmsprop = sgd(
  starting_values,
  X = X,
  y = y,
  stepsize = 1e-3,
  type = 'rmsprop'
)

fit_adam = sgd(
  starting_values,
  X = X,
  y = y,
  stepsize = 1e-3,
  type = 'adam'
)

fit_nadam = sgd(
  starting_values,
  X = X,
  y = y,
  stepsize = 1e-3,
  type = 'nadam'
)
Comparison
We’ll compare our results to standard linear regression and the true values.
| fit         | Intercept | x1     | x2      |
|-------------|-----------|--------|---------|
| fit_adagrad | 1.0439    | 0.9748 | -0.7459 |
| fit_rmsprop | 1.0355    | 0.9802 | -0.7160 |
| fit_adam    | 1.0429    | 0.9733 | -0.7458 |
| fit_nadam   | 1.0430    | 0.9735 | -0.7459 |
| fit_lm      | 1.0042    | 0.9851 | -0.7510 |
| true        | 1.0000    | 1.0000 | -0.7500 |
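One way the comparison above might be put together (a sketch, not necessarily the original table code):

```r
# bind the final estimates from each fit with the lm coefficients and true values
rbind(
  fit_adagrad = fit_adagrad$par[, 1],
  fit_rmsprop = fit_rmsprop$par[, 1],
  fit_adam    = fit_adam$par[, 1],
  fit_nadam   = fit_nadam$par[, 1],
  fit_lm      = coef(lm(y ~ x1 + x2)),
  true        = true
) %>%
  round(4)
```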
Visualize Estimates
We can visualize the route of estimation for each technique. While Adagrad works well for this particular problem, in standard machine learning contexts, with possibly millions of parameters and massive data, it would quickly reach a point where it is effectively no longer updating: its accumulated squared-gradient denominator only grows, so the effective learning rate shrinks toward zero. The other techniques are attempts to get around this limitation of Adagrad.
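A quick toy demonstration of that shrinkage (not part of the original code), assuming a constant unit gradient:

```r
# with the gradient fixed at 1, the accumulated sum of squares after t updates is t,
# so the adagrad step scale is eta / sqrt(t), which heads toward zero
eta = .1
t   = 1:5000
effective_step = eta / sqrt(cumsum(rep(1, length(t))))

plot(t, effective_step, type = 'l', xlab = 'iteration', ylab = 'effective step size')
```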
Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/stochastic_gradient_descent.R