Technical details
This section is for the more technically inclined, though you’ll find more context and detail in S. N. Wood (2006) and S. N. Wood (2017), from which a good chunk of this section is taken more or less directly.
GAM
As we noted before, a GAM is a GLM whose linear predictor includes a sum of smooth functions of covariates. With link function \(g(.)\), model matrix \(X\) of \(n\) rows and \(p\) features (plus a column for the intercept), and a vector of \(p\) coefficients \(\beta\), we can write a GLM as follows:
\[\mathrm{GLM:}\quad g(\mu) = X\beta\] For the GAM, it could look something like:
\[\mathrm{GAM:}\quad g(\mu) = X\beta + f(x_1) + f(x_2) + f(x_3, x_4) + \ldots\] or in a manner similar to mixed models:
\[\mathrm{GAM:}\quad g(\mu) = X\beta + Z\gamma\]
Where \(Z\) represents the basis functions of some subset of \(X\) or other features. So, we can have any number of smooth terms, possibly of different bases, and even combinations of them.
Finally, we can depict the structured additive regression, or STAR, model as a GAM + categorical random effects (\(\Upsilon\)) + spatial effects (\(\Xi\)) + other fun stuff (\(?\)).
\[\mathrm{STAR:}\quad g(\mu) = X\beta + Z\gamma + \Upsilon\varpi + \Xi\varrho + \ldots \mathbf{?}\vartheta \]
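To tie that notation to practice, here is a hedged sketch of such a model in mgcv syntax. The simulated data from gamSim and the added grouping factor g are purely illustrative stand-ins (as is the name mod_sketch), not objects used elsewhere in this document.

```r
library(mgcv)

set.seed(123)
d_sim   = gamSim(1, n = 400)                                   # built-in mgcv data simulator
d_sim$g = factor(sample(letters[1:10], 400, replace = TRUE))   # a fake grouping factor

mod_sketch = gam(
  y ~ x0 +                       # standard parametric part (X beta)
    s(x1) + s(x2, bs = 'cr') +   # smooth terms, possibly with different bases (Z gamma)
    ti(x1, x2) +                 # a smooth interaction, f(x1, x2)
    s(g, bs = 're'),             # a categorical random effect, as in the STAR formulation
  data = d_sim
)

summary(mod_sketch)
```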
Penalized regression
Consider a standard GLM that we usually estimate with maximum likelihood, \(l(\beta)\), where \(\beta\) are the associated regression coefficients. We can write the penalized likelihood as follows:
\[l_p(\beta)= l(\beta) - \color{#b2001d}{\lambda B'SB}\]
Where \(S\) is a penalty matrix of known coefficients [49]. If you prefer least squares as the loss function, we can put it as:
\[\mathcal{Loss} = \sum (y-X\beta)^2 + \color{#b2001d}{\lambda B'SB}\]
So, if we’re maximizing the likelihood, the penalty will make it lower than it otherwise would have been, and it will likewise inflate a loss function we’re minimizing. What it means for a GAM is that we’ll add a penalty for the coefficients associated with the basis functions [50]. The practical side is that it helps keep us from overfitting the data, where our smooth function might get too wiggly. As \(\lambda \rightarrow \infty\), the result is a linear fit, because any wiggliness adds too much to the loss function. As \(\lambda \rightarrow 0\), we have the opposite effect, where any amount of wiggliness can be incorporated into the model. In mgcv, by default the smoothing parameter is chosen via a generalized cross validation, or GCV, approach, and that statistic is reported in the summary. It modifies the loss function depicted above to approximate leave-one-out cross-validation selection.
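As a toy illustration of the penalty’s effect, the sketch below uses a simple ridge-style identity penalty rather than the penalty matrix mgcv actually builds; the names X, S, and beta_pen are illustrative only.

```r
set.seed(123)
n = 100
x = runif(n)
y = sin(2 * pi * x) + rnorm(n, sd = .3)

X = cbind(1, poly(x, 5, raw = TRUE))   # a small polynomial 'basis'
S = diag(ncol(X))
S[1, 1] = 0                            # don't penalize the intercept

# penalized least squares solution: (X'X + lambda*S)^(-1) X'y
beta_pen = function(lambda)
  solve(crossprod(X) + lambda * S, crossprod(X, y))

round(
  data.frame(
    lambda_0  = c(beta_pen(0)),
    lambda_10 = c(beta_pen(10))
  ),
  3
)
# larger lambda shrinks the penalized coefficients toward zero, i.e. less wiggle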
Effective degrees of freedom again
If we define a matrix \(F\) that maps the unpenalized estimates of \(\beta\) to the penalized estimates such that
\[F = (X^T X + S)^{-1} X^T X\]
and note
\[\tilde{\beta} = (X^T X)^{-1} X^T y\] \[\hat{\beta} = F\tilde{\beta}\]
the diagonal elements of \(F\) are the effective degrees of freedom for each feature.
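A toy sketch of that calculation follows; the names (F_map, to avoid clashing with R’s F, plus a made-up basis and penalty with \(\lambda\) folded into \(S\)) are illustrative only.

```r
set.seed(123)
n = 100
x = runif(n)
X = cbind(1, x, x^2, x^3)           # a small polynomial basis
S = 5 * diag(c(0, 0, 1, 1))         # penalize only the higher-order terms (lambda folded in)

F_map = solve(crossprod(X) + S) %*% crossprod(X)

diag(F_map)        # per-term effective degrees of freedom
sum(diag(F_map))   # total edf; less than ncol(X) = 4 because of the penalty
```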
A detailed example
The following will demonstrate a polynomial spline by hand [51]. The example, and in particular the visual display, is based on a depiction in Fahrmeir et al. (2013) (figure 8.6) [52].
The approach is defined as follows, with \(m\) knots \(\kappa_1, \ldots, \kappa_m\) on the interval \([a,b]\) such that \(a=\kappa_1 < ... < \kappa_m =b\):
\[y_i = \gamma_1 + \gamma_2x_i + ... + \gamma_{l+1}x_i^l + \gamma_{l+2}(x_i - \kappa_2)_+^l + ... + \gamma_{l+m-1}(x_i - \kappa_{m-1})_+^l + e_i\]
\[(x - \kappa_j)^l_+= \begin{cases} (x-\kappa_j)^l & x \geq \kappa_j \\ 0 & \textrm{otherwise} \end{cases}\]
So we subtract the current knot being considered from \(x\) for values of \(x\) greater than or equal to the knot; otherwise, it’s 0. It might look complicated, but note that there is nothing particularly special about the model itself. It is just a standard linear regression model when everything is said and done.
More generally we can write a GAM as follows:
\[y = f(x) + e = \sum_{j=1}^{d}B_j(x)\gamma_j + e\]
With the spline above this becomes:
\[B_1(x)=1, B_2(x)=x, ..., B_{l+1}(x)=x^l, B_{l+2}(x)=(x-\kappa_2)_+^l...B_{d}(x)=(x-\kappa_{m-1})_+^l\]
Let’s see it in action. Here our polynomial spline will be done with degree \(l\) equal to 1, which means that we are just fitting a linear regression between knots. The following uses the data we employed for demonstration before.
```r
# data same as earlier examples (assumes the tidyverse is loaded, as in earlier sections)
set.seed(123)

x  = runif(500)
mu = sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2))
y  = rnorm(500, mu, .3)

knots = seq(0, 1, by = .1)

d = tibble(x, y) %>%
  mutate(
    xcut = cut(x, knots, right = F),
    xcut = factor(xcut, levels = c('Int', levels(xcut)))
  )

knots = knots[-length(knots)]  # don't need the last value
l     = 1

bs = sapply(
  1:length(knots),
  function(k)
    ifelse(x >= knots[k], (x - knots[k]) ^ l, 0)
)

# head(bs)
```
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 |
---|---|---|---|---|---|---|---|---|---|
0.288 | 0.188 | 0.088 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
0.788 | 0.688 | 0.588 | 0.488 | 0.388 | 0.288 | 0.188 | 0.088 | 0.000 | 0.000 |
0.409 | 0.309 | 0.209 | 0.109 | 0.009 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
0.883 | 0.783 | 0.683 | 0.583 | 0.483 | 0.383 | 0.283 | 0.183 | 0.083 | 0.000 |
0.940 | 0.840 | 0.740 | 0.640 | 0.540 | 0.440 | 0.340 | 0.240 | 0.140 | 0.040 |
0.046 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
If we plot this against our target variable \(y\), it doesn’t look like much, but we can maybe start to see the partitioning of the effect by knots.
If we multiply each basis by its corresponding regression coefficient we can start to interpret the result.
```r
lmMod    = lm(y ~ . - 1, data = bs)  # just a regression!
bscoefs  = coef(lmMod)
bsScaled = sweep(bs, 2, bscoefs, `*`)

colnames(bsScaled) = c('int', paste0('X', 1:10))

# bscoefs
```
int | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|
0.783 | −6.637 | −1.388 | 3.839 | 7.366 | 25.769 | −41.462 | 15.167 | −7.144 | −1.738 | −3.394 |
In the plot above, the initial dot represents the global constant (\(\gamma_1\), i.e. our intercept). We have a decreasing function starting from that point onward (line). Between .1 and .2 (line), the coefficient is negative again, furthering the already decreasing slope (i.e. steeper downward). The fourth coefficient is positive, which means that between .2 and .3 our decreasing trend is lessening (line). So our coefficients \(\gamma_j\) tell us the change in slope relative to the previous section of data defined by the knots. The lengths of the lines reflect the sizes of the coefficients, i.e. how dramatic the change is. If this gets you thinking about interactions in more common model settings, you’re on the right track (e.g. adding a quadratic term is just letting x have an interaction with itself; the same thing is going on here).
Finally, if we plot the sum of the basis functions, which is the same as taking our fitted values from a regression on the basis expansion of X, we get the following fit to the data. And we can see the trend our previous plot suggested.
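A sketch of that idea with the objects created above (this assumes the first element of bscoefs is the intercept, as in the coefficient table):

```r
f_hat = rowSums(bsScaled)      # sum of the scaled basis functions
# all.equal(unname(f_hat), unname(fitted(lmMod)))  # should match the lm fitted values

plot(x, y, col = 'gray75', pch = 19, cex = .5)
lines(x[order(x)], f_hat[order(x)])

# each coefficient after the intercept is a change in slope at its knot, so the
# cumulative sum gives the slope within each successive knot interval
cumsum(bscoefs[-1])
```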
One of the more common approaches with GAMs uses a cubic spline fit. So we’ll change our polynomial degree from 1 to 3 (i.e. l = 3).
Now we’re getting somewhere. Let’s compare it to the gam function from the mgcv package. We won’t usually specify the knots directly, and even though we have set things up similarly to the mgcv approach, the gam function is still doing some things our by-hand approach is not (penalized regression). We still get pretty close agreement, however.
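A hedged version of that comparison (not necessarily the exact call behind the original figure, and gam_fit is just an illustrative name):

```r
library(mgcv)

gam_fit = gam(y ~ s(x))   # penalized fit with mgcv's default thin plate basis
sum(gam_fit$edf)          # total effective degrees of freedom

# plot(gam_fit, residuals = TRUE, pch = 19, cex = .5)
```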
We can see that we’re on the right track by using the constructor function within mgcv and a custom function for truncated power series like what we’re using above. See the example in the help file for smooth.construct for the underlying truncated power series function. I only show a few of the columns, but our by-hand construction and that used by gam are identical.
```r
l  = 3                     # cubic, per the change noted above
xs = scale(x, scale = F)

bs = sapply(1:length(knots), function(k)
  ifelse(x >= knots[k], (x - knots[k]) ^ l, 0))

sm = smoothCon(s(x, bs = 'tr', k = 14),
               data  = d,
               knots = list(x = knots))[[1]]

modelMatrix = cbind(1, xs, xs^2, xs^3, bs)

all.equal(sm$X, modelMatrix)
```

```
[1] TRUE
```

The first few rows and columns of sm$X (head(sm$X[, 1:6])):
V1 | V2 | V3 | V4 | V5 | V6 |
---|---|---|---|---|---|
1.000 | −0.208 | 0.043 | −0.009 | 0.024 | 0.007 |
1.000 | 0.293 | 0.086 | 0.025 | 0.490 | 0.326 |
1.000 | −0.086 | 0.007 | −0.001 | 0.068 | 0.029 |
1.000 | 0.388 | 0.150 | 0.058 | 0.689 | 0.480 |
1.000 | 0.445 | 0.198 | 0.088 | 0.832 | 0.594 |
1.000 | −0.450 | 0.202 | −0.091 | 0.000 | 0.000 |
Preview of other bases
As an example of other types of smooth terms we might use, here are the basis functions for b-splines. They work notably differently, e.g. they are defined over intervals of \(l+2\) knots. An easy way to create your own basis matrix and subsequent plot of this sort is to use the basis function in the gratia package, which is basically a cleaner version of the mgcv approach.
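A hedged sketch of that approach, reusing the x from the earlier by-hand example (note that the exact structure of the object basis() returns has changed across gratia versions):

```r
library(gratia)

# evaluate a B-spline basis over the earlier x
bs_basis = basis(s(x, bs = 'bs'), data = data.frame(x = x))

draw(bs_basis)   # gratia's plot method for basis objects
```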
The number of knots and where to put them
A natural question is how many knots to use. More knots potentially means more ‘wiggliness’, as demonstrated here (feel free to click on the different knot values). Note that these are numbers of knots, not powers in a polynomial regression as shown in the main section of this document.
However, we don’t really have to worry about this except in the conceptual sense, i.e. being able to control the wiggliness. The odds of you knowing beforehand the number of knots and where to put them are somewhere between slim and none, so it’s good that we can control this via a single parameter and a more automatic process.
Interpreting output for smooth terms
Effective degrees of freedom
In interpreting the output from mgcv, we’ll start with the effective degrees of freedom, or edf. In typical OLS regression, the model degrees of freedom equals the number of predictors/terms in the model. This is not so straightforward with a GAM, due to the smoothing process and the penalized regression estimation procedure. In our previous example in the application section, there are actually 9 terms associated with the smooth term, but their corresponding parameters are each penalized to some extent, so the effective degrees of freedom does not equal 9. For hypothesis testing, an alternate edf is actually used, which is the other one provided in the summary result (Ref.df). For more on this, see ?summary.gam and ?anova.gam.
At this point you might be thinking these p-values are a bit fuzzy, and you’d be right. As is the case with mixed models, machine learning approaches, etc., p-values are not straightforward. The gist is that in the GAM setting they aren’t to be used for harsh cutoffs, say, at an arbitrary .05 level, but then standard p-values shouldn’t be used that way either. If they are pretty low you can feel comfortable claiming statistical significance, but if you want a tight p-value you’ll need to go back to using a non-penalized approach like standard GLM.
The edf would equal 1 if the model penalized the smooth term to a simple linear relationship [53], and so the effective degrees of freedom falls somewhere between 1 and k-1 (or k), where k is chosen based on the basis. You can think of it as akin to the number of knots. There is functionality to choose the k value yourself, but note the following from Wood in the help file for ?choose.k:
So, exact choice of k is not generally critical: it should be chosen to be large enough that you are reasonably sure of having enough degrees of freedom to represent the underlying ‘truth’ reasonably well, but small enough to maintain reasonable computational efficiency. Clearly ‘large’ and ‘small’ are dependent on the particular problem being addressed.
And the following:
One scenario that can cause confusion is this: a model is fitted with k=10 for a smooth term, and the EDF for the term is estimated as 7.6, some way below the maximum of 9. The model is then refitted with k=20 and the EDF increases to 8.7 - what is happening - how come the EDF was not 8.7 the first time around? The explanation is that the function space with k=20 contains a larger subspace of functions with EDF 8.7 than did the function space with k=10: one of the functions in this larger subspace fits the data a little better than did any function in the smaller subspace. These subtleties seldom have much impact on the statistical conclusions to be drawn from a model fit, however.
If you want a more statistically oriented approach, see ?gam.check.
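For instance, applied to the model from the application section (a sketch; gam.check also produces residual diagnostic plots, and k.check, available in more recent mgcv versions, runs just the basis dimension check on its own):

```r
gam.check(mod_gam2)   # residual plots plus a basis-dimension check per smooth

# the basis dimension check alone (k', edf, k-index, p-value for each smooth)
k.check(mod_gam2)
```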
Deviance explained
```
R-sq.(adj) =  0.904   Deviance explained = 92.1%
GCV = 0.010074  Scale est. = 0.0081687  n = 58
```
For the standard Gaussian setting, we can use our familiar \(R^2\). Also provided is the ‘deviance explained’, which in this setting is identical to the unadjusted \(R^2\), and which is the value to focus on for non-Gaussian families.
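Both values can be pulled directly from the summary object if needed (a sketch using mod_gam2 from the application section; s_gam is just an illustrative name):

```r
s_gam = summary(mod_gam2)

c(r_sq = s_gam$r.sq, dev_expl = s_gam$dev.expl)
```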
As noted above, the GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross-validation process. Steps can be taken to choose the smoothing parameters specifically based on this criterion (it’s actually the default, as opposed to, e.g., maximum likelihood).
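A hedged sketch of making that choice explicit; REML is a common alternative to the GCV default, and this assumes the original data are still available for update() to refit (mod_gcv and mod_reml are illustrative names):

```r
mod_gcv  = update(mod_gam2, method = 'GCV.Cp')   # the default criterion
mod_reml = update(mod_gam2, method = 'REML')     # a common alternative

c(GCV = sum(mod_gcv$edf), REML = sum(mod_reml$edf))   # compare total edf
```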
Visual depiction
The following reproduces the plots produced by mgcv. I’ll even use base R plotting to ‘keep it real’.
First we’ll start with the basic GAM plot using our mod_gam2 from the application section. We get the term for Income, i.e. the linear combination of the basis functions for Income. The mgcv package provides this for you via the predict function with type = 'terms'. For non-smooth components, this is just the original covariate times its corresponding coefficient. Next we need the partial residuals, which are just the basic residuals plus the term of interest added. With the original predictor variable, we’re ready to proceed.
```r
income_term = predict(mod_gam2, type = 'terms')[, 's(Income)']
res_partial = residuals(mod_gam2) + income_term
Income      = mod_gam2$model$Income

par(mfrow = c(1, 2))

plot(
  x    = Income,
  y    = res_partial,
  ylim = c(-200, 100),
  col  = 'black',
  ylab = 's(income, 7.59)',
  pch  = 19,
  main = 'ours'
)

lines(Income[order(Income)], income_term[order(Income)])

plot(
  mod_gam2,
  select    = 1,
  se        = F,
  residuals = T,
  pch       = 19,
  rug       = F,
  ylim      = c(-200, 100),
  main      = 'mgcv'
)
```
Now for a comparison to ggeffects, an easy-to-use tool for getting predictions on the response scale. It also uses the underlying predict function, but gets a standard prediction while holding the other variables at key values (which you can manipulate). By default, these key values are the median for numeric variables and the most common category for categorical variables. Note that we could add more fitted values to make the line smoother, but for our purposes getting the concept across is the goal.
```r
pred_data = pisa %>%
  select(Edu, Health) %>%
  summarise_all(median, na.rm = TRUE) %>%
  tibble(Income = mod_gam2$model$Income)

plot_data = tibble(
  Income  = pred_data$Income,
  preds   = predict(mod_gam2, newdata = pred_data),
  p_resid = mod_gam2$residuals + preds
)

p1 = plot_data %>%
  ggplot(aes(x = Income, y = preds)) +
  geom_line(alpha = .5)

p2 = plot(
  ggeffects::ggpredict(mod_gam2, terms = 'Income'),
  ci         = FALSE,
  use.theme  = FALSE,
  show.title = FALSE
)
```
Examining first derivatives
It may be the case that we’d like to investigate the slopes of the smooth terms at various points to better understand how change is taking place. In a standard linear model, the slope is constant, so the regression coefficient for a particular feature tells us all we need to know. For additive models or others with nonlinear effects, the slope changes over the values of the feature of interest.
We can easily extract this information using the gratia package [54], specifically with the derivatives function. This allows one to get the estimated slope at various points, either by specifying them with the newdata argument or by specifying the number of values desired along the range of the model covariate, as below.
```r
library(gratia)

fd_inc = derivatives(mod_gam1, n = 500)
```
While we can start to make sense of things by looking at the standard effect plot, we can get more precise with this approach if desired. Here we see the initial flattening a little after .6 on Income, as well as both the peak positive slope and the sharpest negative slope (blue dots). Depending on the modeling context, one may see patterns of theoretical importance or what theory would predict. In other cases, it just gives you more to talk about.
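A sketch of pulling those points out directly; the name of the derivative column differs across gratia versions (e.g. derivative vs. .derivative), so it’s located by pattern here, and deriv_col is an illustrative name:

```r
deriv_col = grep('derivative', names(fd_inc), value = TRUE)[1]

fd_inc[which.max(fd_inc[[deriv_col]]), ]   # peak positive slope
fd_inc[which.min(fd_inc[[deriv_col]]), ]   # sharpest negative slope
```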
References
49. For example, it would be an identity matrix if we were dealing with random effects, or a neighborhood matrix when dealing with a spatial context.
50. More formally, the penalty focuses on the second derivative of the function (i.e. its curvature): \(\lambda \int f''(x)^2dx\)
51. This is the truncated power series variant of polynomial splines.
52. The notation is slightly different here but consistent with the previous sections. Given the applied nature of the previous sections, I opted to keep to the typical linear regression convention of using \(b\) or \(\beta\) to represent the coefficients, but was too lazy to redo this section, which just uses Fahrmeir’s.
53. You can actually penalize the covariate right out of the model if desired, i.e. edf = 0.
54. Fairly new and still being developed as of this writing, it is available on GitHub.