Extreme Learning Machine

A very simple implementation of an extreme learning machine for regression. An ELM can be seen as a quick way to estimate a 'good enough' neural net: the input-to-hidden weights are drawn at random and left fixed, and only the hidden-to-output weights are estimated, by least squares, so it can be nearly as performant as a standard single-hidden-layer network with a lot less computational overhead. See elmNN and ELMR for some R package implementations. I add a comparison to generalized additive models (ELMs/neural networks and GAMs are both adaptive basis function models).

Data Setup

One variable, complex function.

library(tidyverse)
library(mgcv)

set.seed(123)

n = 5000
x = runif(n)
# x = rnorm(n)
mu = sin(2 * (4 * x - 2)) + 2 * exp(-(16^2) * ((x - .5)^2))
y  = rnorm(n, mu, .3)

d = data.frame(x, y) 

qplot(x, y, color = I('#ff55001A'))

Motorcycle accident data.

data('mcycle', package = 'MASS')

times = matrix(mcycle$times, ncol = 1)
accel = mcycle$accel

Function

elm <- function(X, y, n_hidden = NULL, active_fun = tanh) {
  # X: an N observations x p features matrix
  # y: the target
  # n_hidden: the number of hidden nodes
  # active_fun: activation function
  
  pp1 = ncol(X) + 1
  w0  = matrix(rnorm(pp1*n_hidden), pp1, n_hidden)       # random input-to-hidden weights (fixed, not estimated)
  h   = active_fun(cbind(1, scale(X)) %*% w0)            # compute hidden layer
  B   = MASS::ginv(h) %*% y                              # output weights via Moore-Penrose pseudoinverse (least squares)
  
  fit = h %*% B                                          # fitted values
  
  list(
    fit  = fit,
    loss = crossprod(y - fit),                           # sum of squared errors
    B    = B,
    w0   = w0
  )
}
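
The pseudoinverse step is just a least squares solve of the target on the hidden-layer basis, so the same output weights can be obtained with lm.fit. A quick sanity check on made-up data (the dimensions here are arbitrary; with more rows than hidden nodes the two solutions coincide):

set.seed(1234)

x_chk = matrix(runif(200), ncol = 1)
y_chk = rnorm(200)

w_chk = matrix(rnorm(2 * 10), 2, 10)                    # random weights, as in elm()
h_chk = tanh(cbind(1, scale(x_chk)) %*% w_chk)          # hidden layer

max(abs(MASS::ginv(h_chk) %*% y_chk - coef(lm.fit(h_chk, y_chk))))  # ~ 0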

Estimation

X_mat = matrix(x, ncol = 1)

fit_elm = elm(X_mat, y, n_hidden = 100)
str(fit_elm)
List of 4
 $ fit : num [1:5000, 1] -1.0239 0.7311 -0.413 0.0806 -0.4112 ...
 $ loss: num [1, 1] 442
 $ B   : num [1:100, 1] 217 -608 1408 -1433 -4575 ...
 $ w0  : num [1:2, 1:100] 0.35 0.814 -0.517 -2.692 -1.097 ...
ggplot(aes(x, y), data = d) +
  geom_point(color = '#ff55001A') + 
  geom_line(aes(y = fit_elm$fit), color = '#00aaff')

cor(fit_elm$fit[,1], y)^2
[1] 0.8862518
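
Note that the fit depends on the random draw of hidden weights, so rerunning with a different seed gives a slightly different curve. One common stabilizer is to average fits over several random draws; a minimal sketch (the choice of 10 draws is arbitrary):

fit_avg = rowMeans(replicate(10, elm(X_mat, y, n_hidden = 100)$fit[, 1]))

cor(fit_avg, y)^2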
fit_elm_mcycle = elm(times, accel, n_hidden = 100)
cor(fit_elm_mcycle$fit[,1], accel)^2
[1] 0.8122349

Comparison

We’ll compare to a generalized additive model with a Gaussian process smooth.

fit_gam = gam(y ~ s(x, bs = 'gp', k = 20), data = d)
summary(fit_gam)$r.sq
[1] 0.8856188
d %>%
  mutate(fit_elm = fit_elm$fit,
         fit_gam = fitted(fit_gam)) %>%
  ggplot() +
  geom_point(aes(x, y), color = '#ff55001A') +
  geom_line(aes(x, y = fit_elm), color = '#1e90ff') +
  geom_line(aes(x, y = fit_gam), color = '#990024')

fit_gam_mcycle = gam(accel ~ s(times), data = mcycle)
summary(fit_gam_mcycle)$r.sq
[1] 0.7832988
mcycle %>% 
  ggplot(aes(times, accel)) +
  geom_point(color = '#ff55001A') +
  geom_line(aes(y = fit_elm_mcycle$fit), color = '#1e90ff') +
  geom_line(aes(y = fitted(fit_gam_mcycle)), color = '#990024')
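
For a numeric check beyond R-squared, we can also compare in-sample root mean squared error of the two fits on the motorcycle data (bearing in mind the more flexible ELM has a built-in advantage in-sample):

rmse = function(y, fit) sqrt(mean((y - fit)^2))

c(
  elm = rmse(accel, fit_elm_mcycle$fit[, 1]),
  gam = rmse(accel, fitted(fit_gam_mcycle))
)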

Supplemental Example

Yet another example with additional covariates.

d = gamSim(eg = 7, n = 10000)
Gu & Wahba 4 term additive model, correlated predictors
X = as.matrix(d[, 2:5])
y = d[, 1]

n_nodes = c(10, 25, 100, 250, 500, 1000)

The following estimation over multiple models will take several seconds.

fit_elm = purrr::map(n_nodes, function(n) elm(X, y, n_hidden = n))

Now find the best fitting model.

# index of the model with the lowest training loss
best_loss = which.min(map_dbl(fit_elm, function(x) x$loss))
fit_best  = fit_elm[[best_loss]]
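
Selecting by training loss will generally favor the largest hidden layer, since extra nodes can only reduce in-sample error. A fairer selection would hold out part of the data; a rough sketch along those lines (the 80/20 split is arbitrary, and each candidate is refit on the training rows only):

idx   = sample(nrow(X), floor(.8 * nrow(X)))
X_trn = X[idx, , drop = FALSE]
X_tst = X[-idx, , drop = FALSE]

test_mse = purrr::map_dbl(n_nodes, function(n) {
  fit  = elm(X_trn, y[idx], n_hidden = n)
  pred = tanh(cbind(1, scale(X_tst)) %*% fit$w0) %*% fit$B   # same activation as in fitting
  mean((y[-idx] - pred)^2)
})

n_nodes[which.min(test_mse)]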

A quick check of the fit.

# str(fit_best)
# qplot(fit_best$fit[, 1], y, alpha = .2)
cor(fit_best$fit[, 1], y)^2
[1] 0.7241967

And compare again to mgcv. In this case, we’re comparing fit on test data of the same form.

fit_gam = gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = d)
gam.check(fit_gam)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 15 iterations.
The RMS GCV score gradient at convergence was 9.309879e-07 .
The Hessian was positive definite.
Model rank =  37 / 37 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'  edf k-index p-value
s(x0) 9.00 4.71    1.00    0.48
s(x1) 9.00 4.89    1.00    0.35
s(x2) 9.00 8.96    0.99    0.20
s(x3) 9.00 1.00    1.00    0.47
summary(fit_gam)$r.sq
[1] 0.6952978
test_data0 = gamSim(eg = 7)  # default n = 400
Gu & Wahba 4 term additive model, correlated predictors
test_data  = cbind(1, scale(test_data0[, 2:5]))

 # remember to use your specific activation function here
elm_prediction = tanh(test_data %*% fit_best$w0) %*% fit_best$B         
gam_prediction = predict(fit_gam, newdata = test_data0)

cor(data.frame(elm_prediction, gam_prediction), test_data0$y)^2
                    [,1]
elm_prediction 0.6873090
gam_prediction 0.7185687
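
One caveat on the prediction step above: scale() standardizes the test data by its own means and standard deviations. With a small or shifted test set it is usually safer to reuse the training centers and scales, which scale() stores as attributes; a minimal sketch:

X_scaled = scale(X)                                     # training data used for fit_best

test_scaled = cbind(
  1,
  scale(
    test_data0[, 2:5],
    center = attr(X_scaled, 'scaled:center'),
    scale  = attr(X_scaled, 'scaled:scale')
  )
)

elm_prediction2 = tanh(test_scaled %*% fit_best$w0) %*% fit_best$B
cor(elm_prediction2[, 1], test_data0$y)^2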