Technical details

Workshops see all types of folks at different levels. This is for some of the more technically inclined, though you’ll find more context and detail in my document and Wood (2006), from which a good chunk of this section is taken more or less directly from.

GAM Introduction

A GAM is a GLM whose linear predictor includes a sum of smooth functions of covariates. With link function \(g(.)\), model matrix \(X\) of \(n\) rows and \(p\) covariates (plus a column for the intercept), a vector of \(p\) coefficients \(\beta\), we can write a GLM as follows:

\[\mathrm{GLM:}\quad g(\mu) = X\beta\] For the GAM, it could look something like:

\[\mathrm{GAM:}\quad g(\mu) = X\beta + f(x_1) + f(x_2) + f(x_3, x_4) + \ldots\] or as we did when discussing mixed models:

\[\mathrm{GAM:}\quad g(\mu) = X\beta + Z\gamma\]

Where \(Z\) represents the basis functions of some subset of \(X\) or other covariates. Thus we can have any number of smooth terms, possibly of different bases, and even combinations of them.

Finally we can depict the structured additive regression, or STAR, model as a GAM + random effects (\(\Upsilon\)) + spatial effects (\(\Xi\)) + other fun stuff.

\[\mathrm{STAR:}\quad g(\mu) = X\beta + Z\gamma + \Upsilon\varpi + \Xi\varrho + \ldots \mathbf{?}\vartheta \]

Penalized regression

Consider a standard GLM that we usually estimate with maximum likelihood, \(l(\beta)\), where \(\beta\) are the associated regression coefficients. We can write the penalized likelihood as follows:

\[l_p(\beta)= l(\beta) - \color{darkred}{\lambda B'SB}\]

Where \(S\) is a penalty matrix of known coefficients14. If you prefer least squares as the loss function, we can put it as:

\[\mathcal{Loss} = \sum (y-X\beta)^2 + \color{darkred}{\lambda B'SB}\]

So if we’re maximizing the likelihood will make it lower than it otherwise would have been, and vice versa in terms of a loss function. What it means for a GAM is that we’ll add a penalty for the coefficients associated with the basis functions15. The practical side is that it will help to keep us from overfitting the data, where our smooth function might get too wiggly. As \(\lambda \rightarrow \infty\), the result is a linear fit because any wiggliness will add too much to the loss function. As \(\lambda \rightarrow 0\), we have the opposite effect, where any wiggliness is incorporated into the model. In mgcv, by default the estimated parameters are chosen via a generalized cross validation, or GCV, approach, and that statistic is reported in the summary. It modifies the loss function depicted above to approximate leave-one-out cross-validation selection.

A detailed example

The following will demonstrate a polynomial spline by hand16. The example, and in particular the visual display, is based on a depiction in Fahrmeier et al. (2013).

The approach is defined as follows, with \(\kappa\) knots on the interval \([a,b]\) as \(a=\kappa_1 < ... < \kappa_m =b\):

\[y_i = \gamma_1 + \gamma_2x_i + ... + \gamma_{l+1}(x_i)_+^l + \gamma_{l+2}(x_i - \kappa_2)_+^l ... + \gamma_{l+m-1}(x_i - \kappa_{m-1})_+^l + e_i\]

\[(x - \kappa_j)^l_+= \begin{cases} (x-\kappa_j)^l & x \geq \kappa_j \\ 0 & \textrm{otherwise} \end{cases}\]

So we subtract the current knot being considered from \(x\) for values of \(x\) greater than or equal to the knot, otherwise, it’s 0. It might look complicated, but note that there is nothing particularly special about the model itself. It is just a standard linear regression model when everything is said and done.

More generally we can write a GAM as follows:

\[y = f(x) + e = \sum_{j=1}^{d}B_j(x)\gamma_j + e\]

With the spline above this becomes:

\[B_1(x)=1, B_2(x)=x, ..., B_{l+1}(x)=x^l, B_{l+2}(x)=(x-\kappa_2)_+^l...B_{d}(x)=(x-\kappa_{m-1})_+^l\]

Let’s see it in action. Here our polynomial spline will be done with degree \(l\) equal to 1, which means that we are just fitting a linear regression between knots. This uses the data we employed for demonstration before.

knots = seq(0, 1, by=.1)
knots = knots[-length(knots)]  # don't need the last value
l = 1
bs = sapply(1:length(knots), function(k) ifelse(x >= knots[k], (x-knots[k])^l, 0))

          [,1]      [,2]       [,3]      [,4]        [,5]      [,6]      [,7]       [,8]      [,9]      [,10]
[1,] 0.2875775 0.1875775 0.08757752 0.0000000 0.000000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000
[2,] 0.7883051 0.6883051 0.58830514 0.4883051 0.388305135 0.2883051 0.1883051 0.08830514 0.0000000 0.00000000
[3,] 0.4089769 0.3089769 0.20897692 0.1089769 0.008976922 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000
[4,] 0.8830174 0.7830174 0.68301740 0.5830174 0.483017404 0.3830174 0.2830174 0.18301740 0.0830174 0.00000000
[5,] 0.9404673 0.8404673 0.74046728 0.6404673 0.540467284 0.4404673 0.3404673 0.24046728 0.1404673 0.04046728
[6,] 0.0455565 0.0000000 0.00000000 0.0000000 0.000000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000
# a more digestible example to make things even more clear
# x2 = seq(0,1, length.out=30)
# sapply(1:3, function(k) ifelse(x2 >= knots[k], (x2-knots[k])^l, 0))

If we plot this against our target variable \(y\), it doesn’t look like much.

If we multiply each basis by it’s corresponding regression coefficient we can start to interpret the result.

lmMod = lm(y~.-1, bs)  # just a regression!
bscoefs = coef(lmMod)
bsScaled = sweep(bs, 2, bscoefs,`*`)
colnames(bsScaled) = c('int', paste0('X', 1:10))
round(bscoefs, 3)
    int      X1      X2      X3      X4      X5      X6      X7      X8      X9     X10 
  0.783  -6.637  -1.388   3.839   7.366  25.769 -41.462  15.167  -7.144  -1.738  -3.394 

In the plot above, the initial dot represents the global constant (\(\gamma_1\), i.e. our intercept). We have a decreasing function starting from that point onward (line). Between .1 and .2 (line), the coefficient is negative again, furthering the already decreasing slope (i.e. steeper downward). The fourth coefficient is positive, which means that between .2 and .3 our decreasing trend is lessening (line). Thus our coefficients \(j\) tell us the change in slope for the data from the previous section of data defined by the knots. The length’s of the lines reflect the size of the coefficient, i.e. how dramatic the change is. If this gets you to thinking about interactions in more common model settings, you’re on the right track (e.g. adding a quadratic term is just letting x have an interaction with itself; same thing is going on here).

Finally if we plot the sum of the basis functions, which is the same as taking our fitted values from a regression on the basis expansion of X, we get the following fit to the data. And we can see the trend our previous plot suggested.

One of the more common approaches with GAMs uses a cubic spline fit. So we’ll change our polynomial degree from 1 to 3 (i.e. l = 3).

Now we’re getting somewhere. Let’s compare it to the gam function from the mgcv package. We won’t usually specify the knots directly, and even as we have set things up similar to the mgcv approach, the gam function is still doing some things our by-hand approach is not (penalized regression). We still get pretty close agreement however.

We can see that we’re on the right track by using the constructor function within mgcv and a custom function for truncated power series like what we’re using above. See the example in the help file for smooth.construct for the underlying truncated power series function. I only show a few of the columns, but our by-hand construction and that used by gam are identical.

xs = scale(x, scale=F)
bs = sapply(1:length(knots), function(k) ifelse(x >= knots[k], (x-knots[k])^l, 0))
sm = smoothCon(s(x, bs='tr', k=14), data=d, knots=list(x=knots))[[1]]
     [,1]        [,2]        [,3]          [,4]         [,5]        [,6]
[1,]    1 -0.20770617 0.043141852 -0.0089608288 2.378290e-02 0.006599976
[2,]    1  0.29302145 0.085861568  0.0251592810 4.898725e-01 0.326094166
[3,]    1 -0.08630677 0.007448858 -0.0006428868 6.840635e-02 0.029497019
[4,]    1  0.38773372 0.150337434  0.0582908920 6.885061e-01 0.480080698
[5,]    1  0.44518360 0.198188434  0.0882302397 8.318233e-01 0.593693698
[6,]    1 -0.44972719 0.202254545 -0.0909593678 9.454771e-05 0.000000000
modelMatrix = cbind(1, xs, xs^2, xs^3, bs)
all.equal(sm$X, modelMatrix)
[1] TRUE

Preview of other bases

As an example of other types of smooth terms we might use, here are the basis functions for b-splines. They work notably differently, e.g. over intervals of \(l+2\) knots.

The number of knots and where to put them.

A natural question may arise as to how many knots to use. More knots potentially means more ‘wiggliness’, as demonstrated here. Note that these are number of knots, not powers in a polynomial regression as shown in the main section of this document.

However, as we’ll come to learn in the next section, we don’t really have to worry about this except in the conceptual sense, i.e. being able to control the wiggliness. The odds of you knowing beforehand the number of knots and where to put them is somewhere between slim and none, so this is a good thing.

Interpreting output for smooth terms

Effective degrees of freedom

In interpreting the output from mgcv, we’ll start with the effective degrees of freedom, or edf. In typical OLS regression the model degrees of freedom is equivalent to the number of predictors/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure. In this example there are actually 9 terms associated with this smooth, but their corresponding parameters are each penalized to some extent and thus the effective degrees of freedom does not equal 9. For hypothesis testing an alternate edf is actually used, which is the other one provided there in the summary result (Ref.df). For more on this see my document, ?summary.gam and ?anova.gam.

At this point you might be thinking these p-values are a bit fuzzy, and you’d be right. As is the case with mixed models, machine learning approaches, etc., p-values are not straightforward. The gist is that in the GAM setting they aren’t to be used for harsh cutoffs, say, at an arbitrary .05 level, but then standard p-values shouldn’t be used that way either. If they are pretty low you can feel comfortable claiming statistical significance, but if you want a tight p-value you’ll need to go back to using a GLM.

The edf would equal 1 if the model penalized the smooth term to a simple linear relationship17, and so the effective degrees of freedom falls somewhere between 1 and k-1 (or k), where k is chosen based on the basis. You can think of it as akin to the number of knots. There is functionality to choose the k value, but note the following from Wood in the help file for ?choose.k:

So, exact choice of k is not generally critical: it should be chosen to be large enough that you are reasonably sure of having enough degrees of freedom to represent the underlying ‘truth’ reasonably well, but small enough to maintain reasonable computational efficiency. Clearly ‘large’ and ‘small’ are dependent on the particular problem being addressed.

And the following:

One scenario that can cause confusion is this: a model is fitted with k=10 for a smooth term, and the EDF for the term is estimated as 7.6, some way below the maximum of 9. The model is then refitted with k=20 and the EDF increases to 8.7 - what is happening - how come the EDF was not 8.7 the first time around? The explanation is that the function space with k=20 contains a larger subspace of functions with EDF 8.7 than did the function space with k=10: one of the functions in this larger subspace fits the data a little better than did any function in the smaller subspace. These subtleties seldom have much impact on the statistical conclusions to be drawn from a model fit, however.

If you want a more statistically oriented approach, see ?gam.check.

Deviance explained

R-sq.(adj) =  0.904   Deviance explained = 92.1%
GCV = 0.010074  Scale est. = 0.0081687  n = 58

For the standard Gaussian setting, we can use our R2. Also provided is ‘deviance explained’, which in this setting is identical to the unadjusted R2, but for non-Gaussian families would be preferred.

As noted above, the GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation estimation process. Steps can be taken to choose model parameters specifically based on this (it’s actually the default, as opposed to, e.g. maximum likelihood).

Visual depiction

The following reproduces the plots produced by mgcv and visreg. I’ll even use base R plotting to ‘keep it real’.

First we’ll start with the basic GAM plot. First we get the term for year, i.e. the linear combination of the basis functions for year. The mgcv package provides this for you via the predict function with type='terms'. For non-smooth components this is just the original covariate times its corresponding coefficient. Next we need the partial residuals, which are just the basic residuals plus the term of interest added. With the original predictor variable, we’re ready to proceed.

year_term = predict(gam_model, type='terms')[,'s(year)']
res_partial = residuals(gam_model) + year_term
year = 1948:2005 
plot(year, res_partial, ylim=c(-.8,.8), col='black', 
     ylab='s(year, 7.06)', pch=19, main='ours')
lines(year, year_term)
plot(gam_model, se=F, residuals=T, pch=19, rug=F, ylim=c(-.8,.8), main='mgcv')

Now for visreg. It uses the underlying predict function also, but gets a standard prediction while holding the other variables at key values (which you can manipulate). By default these key values are at the median for numeric variables, and most common category for categorical variables.

pred_data = movies_yearly %>% 
  select(budget_2016, votes, length) %>% 
  na.omit() %>% 
  summarise_all(median) %>% 
preds = predict(gam_model, newdata=pred_data)
res_partial = gam_model$residuals + preds

plot(year, preds, col='dodgerblue', ylab='f(year)', lwd=3, 
     ylim=c(5,7), type='l', main='ours')
points(year, res_partial, col='gray50', pch=19)
visreg(gam_model, xvar='year', points.par=list(cex=1), band=F, 
       ylim=c(5,7), main='visreg')

  1. For example, it would be an identity matrix if we were dealing with random effects, or a neighborhood matrix when dealing with a spatial context.

  2. More formally, the penalty focuses on the second derivative of the function (i.e. it’s curvature): \(\lambda \int f''(x)^2dx\)

  3. This is the truncated power series variant of polynomial splines.

  4. You can actually penalize the covariate right out of the model if desired, i.e. edf=0.