Application Using R
We’re now ready to do some modeling! We’ve seen how GAMs relate to techniques we already know, the deficiencies of common approaches, and a little about how GAMs work conceptually. The next step is to dive in.
Initial Examination
The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. The key variables are as follows (variable abbreviations in bold)9:
- Overall Science Score (average score for 15 year olds)
- Interest in science
- Identifying scientific Issues
- Explaining phenomena scientifically
- Support for scientific inquiry
- Income Index
- Health Index
- Education Index
- Human Development Index (composed of the Income index, Health Index, and Education Index)
But even before we get too far, it would be good to know our options in the world of GAMs. The appendix of this document has a list of some packages to be aware of, and there is quite a bit of GAM functionality available within R, even for just plotting10. We will use mgcv for our purposes.
The first thing to do is get the data in and do some initial inspections.
# url for the data is:
# https://raw.githubusercontent.com/m-clark/generalized-additive-models/master/data/pisasci2006.csv
pisa = read_csv('data/pisasci2006.csv')
Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max | % Missing |
---|---|---|---|---|---|---|---|---|---|
Overall | 57.0 | 473.1 | 54.6 | 322.0 | 428.0 | 489.0 | 513.0 | 563.0 | 12.0 |
Issues | 57.0 | 469.9 | 53.9 | 321.0 | 427.0 | 489.0 | 514.0 | 555.0 | 12.0 |
Explain | 57.0 | 475.0 | 54.0 | 334.0 | 432.0 | 490.0 | 517.0 | 566.0 | 12.0 |
Evidence | 57.0 | 469.8 | 61.7 | 288.0 | 423.0 | 489.0 | 515.0 | 567.0 | 12.0 |
Interest | 57.0 | 528.2 | 49.8 | 448.0 | 501.0 | 522.0 | 565.0 | 644.0 | 12.0 |
Support | 57.0 | 512.2 | 26.1 | 447.0 | 494.0 | 512.0 | 529.0 | 569.0 | 12.0 |
Income | 61.0 | 0.7 | 0.1 | 0.4 | 0.7 | 0.8 | 0.8 | 0.9 | 6.0 |
Health | 61.0 | 0.9 | 0.1 | 0.7 | 0.8 | 0.9 | 0.9 | 1.0 | 6.0 |
Edu | 59.0 | 0.8 | 0.1 | 0.5 | 0.7 | 0.8 | 0.9 | 1.0 | 9.0 |
HDI | 59.0 | 0.8 | 0.1 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 | 9.0 |
The scatterplot matrix has quite a bit of information to spend some time with: univariate densities, loess curves, and correlations. For our purposes, we will ignore the issue regarding the haves vs. have-nots on the science scale and save that for another day. Note the surprising negative correlation between interest and overall score for science, which might remind us of Simpson’s paradox, namely, that what occurs for the individual may not be the same for the group. One will also note that while there is a positive correlation between Income and Overall science scores, it flattens out after an initial rise. Linear fits might be difficult to swallow for some of these, but let’s take a closer look.
We can see again that linear fits aren’t going to do so well for some of these relationships, though they might be a close approximation for interest in science and support for science. Now let’s run the smooth. By default, ggplot2 will use a loess smoother for small data sets (i.e. < 1000 observations), but one can use the mgcv gam function as a smoother by setting method = 'gam' when using geom_smooth.
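For instance, a minimal sketch of such a plot, assuming the pisa data read in above, might look like the following.

```r
# Scatterplot of science score against Income with a GAM smooth from mgcv
library(ggplot2)

ggplot(pisa, aes(x = Income, y = Overall)) +
  geom_point() +
  geom_smooth(method = 'gam', formula = y ~ s(x))
```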
Often in these situations people will perform some standard transformation, such as a log, but as we noted earlier, it doesn’t help nearly as often as it is used. For example, in this case one can log the overall score, Income, or both, and a linear relation will still not be seen.
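A quick visual check of that claim (a sketch, not from the original text): even on the log-log scale, the loess curve remains clearly nonlinear relative to the straight-line fit.

```r
# Compare a linear fit to a loess smooth on the log-log scale
ggplot(pisa, aes(x = log(Income), y = log(Overall))) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_smooth(se = FALSE, color = 'darkred')
```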
Single Feature
Linear Fit
We will start with the simple situation of using a single feature for prediction. Let’s begin by using a typical linear regression to predict science scores by the Income index. We could use the standard R lm function, but I’ll leave that as an exercise for comparison. We can still do straightforward linear models with the gam function, and again it is important to note that the standard linear model can be seen as a special case of a GAM.
library(mgcv)
mod_lm = gam(Overall ~ Income, data = pisa)

summary(mod_lm)
Family: gaussian
Link function: identity
Formula:
Overall ~ Income
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 204.32 35.37 5.777 4.32e-07 ***
Income 355.85 46.79 7.606 5.36e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.518 Deviance explained = 52.7%
GCV = 1504.5 Scale est. = 1448.8 n = 54
What are we getting here? The same thing you get from a regular linear model, because you just ran one. However, there are a couple things to look at. The coefficient is statistically significant, but serves as a reminder that it is usually a good idea to scale feature variables so that the effect is more meaningful. Here, moving one unit on Income is akin to going from broke to being the richest country. But in a more meaningful sense, if we moved from, say, .7 to .8, we’d expect an increase of about 35 points on the science score. We also see the deviance explained11, which serves as a generalization of R-squared, and in this case, it actually is equivalent to the unadjusted R-squared. Likewise, there is the familiar adjusted version of it to account for small sample size and model complexity. The scale estimate is the scaled deviance, which here is equivalent to the residual sums of squares. The GCV score we will save for when we run a GAM.
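As a quick sanity check of that interpretation (a sketch, not part of the original text), we can compare the model’s predictions at those two Income values.

```r
# Predicted scores at Income = .7 and .8; the difference is roughly 35 points
preds = predict(mod_lm, newdata = data.frame(Income = c(.7, .8)))
diff(preds)
```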
GAM
Let’s now try some nonlinear approaches12, keeping in mind that \(\mu=f(x)\). As a point of comparison, we can start by trying a standard polynomial regression, and it might do well enough, but let’s go further. To begin we must consider a basis to use, a space that \(f\) is an element of. Doing so leads to choosing a set of basis functions \(F_j\), with parameters \(b_j\) that will be combined to produce \(f(x)\):
\[f(x)=\displaystyle\sum\limits_{j=1}^q F_{j}(x)b_{j}\]
To better explain by example, if we use a cubic polynomial, the basis is: \(F_1(x) = 1\), \(F_2(x)=x\), \(F_3(x)=x^2\), \(F_4(x)=x^3\), which leads to the following:
\[f(x) = b_1 + b_2\cdot x + b_3\cdot x^2 + b_4\cdot x^3\]
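As a point of reference, a minimal sketch of fitting such a cubic polynomial is shown below (see also footnote 13 regarding poly); the exact basis construction used for the plots that follow may differ, so the coefficients need not match those quoted below.

```r
# Cubic polynomial fit using the raw basis 1, x, x^2, x^3
mod_poly = gam(Overall ~ poly(Income, 3, raw = TRUE), data = pisa)
coef(mod_poly)  # b_1, ..., b_4 in the equation above
```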
The following visualization allows us to see the effects in action. It is based on the results extracted from running such a model13 and obtaining the coefficients. The first plot represents the intercept of 470.44, the second plot, our \(b_2\) coefficient of 289.5 multiplied by Income and so forth. The bottom plot shows the final fit \(f(x)\), i.e. the linear combination of the basis functions.
At this point we have done nothing we couldn’t do in our regular regression approach, as polynomial regression has a long history of use in modeling. However, the take home message is that as we move to GAMs we are going about things in much the same fashion; we are simply changing the nature of the basis, and have a great deal more flexibility in choosing the form.
In the next visualization I show the fit using a ‘by-hand’ cubic spline basis (see the Appendix and p.126-7 in S. N. Wood (2006)).
A cubic spline is essentially a connection of multiple cubic polynomial regressions, similar to what we demonstrated previously. We choose points of the feature variable at which to create sections, and these points are referred to as knots. Separate cubic polynomials are fit at each section, and then joined at the knots to create a continuous curve. The above graph represents a cubic spline with 8 knots between the first and third quartiles14. The red line uses the GAM results from mgcv.
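A rough sketch of the same idea using a B-spline basis from the splines package follows; this is an approximation for illustration, not the ‘by-hand’ basis given in the Appendix, and the knot placement simply follows the description above.

```r
library(splines)

# Place 8 knots between the first and third quartiles of Income
qs    = quantile(pisa$Income, c(.25, .75), na.rm = TRUE)
knots = seq(qs[1], qs[2], length.out = 8)

# Cubic B-spline regression (degree 3 is the default for bs)
mod_spline = lm(Overall ~ bs(Income, knots = knots), data = pisa)
summary(mod_spline)
```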
Fitting the model
Let’s now fit an actual generalized additive model using the same cubic spline as our basis. We again use the gam function as before for basic model fitting, but now we are using a function s within the formula to denote the smooth terms. Within that function we also specify the type of smooth, though a default is available. I chose bs = 'cr', denoting cubic regression splines, to keep consistent with our previous example.
mod_gam1 = gam(Overall ~ s(Income, bs = "cr"), data = pisa)

summary(mod_gam1)
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 470.444 4.082 115.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 6.895 7.741 16.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.7 Deviance explained = 73.9%
GCV = 1053.7 Scale est. = 899.67 n = 54
The first thing to note is that, aside from the smooth part, our model code is similar to what we’re used to with core R functions such as lm and glm. In the summary, we first see the distribution assumed, as well as the link function used, in this case normal and identity, respectively, which, to reiterate, had we had no smoothing, would result in a SLiM. After that we see that the output is separated into parametric and smooth, or nonparametric, parts15. In this case, the only parametric component is the intercept, but it’s good to remember that you are not bound to use a smooth for every feature of interest, and indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be linear for the most part anyway. The smooth component of our model, regarding a country’s income and its relationship with overall science score, is statistically significant, but there are a couple of things in the model summary that would be unfamiliar.
We’ll start with the effective degrees of freedom16, or edf. In typical OLS regression the model degrees of freedom is equivalent to the number of features/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure, something that will be discussed more later17. In this situation, we are still trying to minimize the residual sums of squares, but we also have a built-in penalty for ‘wiggliness’ of the fit, where in general we try to strike a balance between an undersmoothed fit and an oversmoothed fit. The default p-value for the test is based on the effective degrees of freedom and the rank \(r\) of the covariance matrix for the coefficients for a particular smooth, so here, conceptually, it is the p-value associated with the \(F(r, n-edf)\). However, there are still other issues to be concerned about, and ?summary.gam will provide your first step down that particular rabbit hole. For hypothesis testing an alternate edf is actually used, which is the other one provided there in the summary result18. At this point you might be thinking these p-values are a bit fuzzy, and you’d be right. The gist is, they aren’t to be used for harsh cutoffs, say, at an arbitrary .05 level19, but if they are pretty low you can feel comfortable claiming statistical significance, which of course is the end all, be all, of the scientific endeavor, right?
The GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation estimation process. We estimate the model for all observations except \(i\), then note the squared residual predicting observation \(i\) from the model. Then we do this for all observations. However, the GCV score is an efficient measure of this concept that doesn’t actually require fitting all those models and overcomes other issues20. It is this score that is minimized by default when determining the specific nature of the smooth. On its own it doesn’t tell us much, but we can use it, much like AIC, as a comparative measure for choosing among different models, with lower being better.
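For reference, the general form of the score (a standard result, not spelled out in the text above) is

\[GCV = \frac{n\displaystyle\sum_{i=1}^n (y_i - \hat{f}_i)^2}{[n - \mathrm{tr}(\boldsymbol{A})]^2}\]

where \(\boldsymbol{A}\) is the influence (hat) matrix, whose trace corresponds to the effective degrees of freedom of the model.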
Visualization
One can get a sense of the form of the fit by plotting the model object as follows21:

plot(mod_gam1)
Note that the plot function will only plot the smooth terms in the model; the others are straight lines since they are linear effects. The intervals are Bayesian credible intervals based on the posterior predictive distribution. In this single feature case, one can also revisit the previous graph.
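A few optional arguments to the plot method are also worth knowing about (a sketch; see ?plot.gam for the full set).

```r
# Shade the interval, add partial residuals, and include the uncertainty
# of the overall mean in the interval
plot(mod_gam1, shade = TRUE, residuals = TRUE, seWithMean = TRUE)
```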
Model Comparison
Let’s compare our regular regression fit to the GAM fit. The following shows how one can extract various measures of performance, and the subsequent table shows them gathered together.
AIC(mod_lm)
[1] 550.2449
summary(mod_lm)$sp.criterion
GCV.Cp
1504.496
summary(mod_lm)$r.sq # adjusted R squared
[1] 0.5175346
We can do the same to extract those same elements from the GAM, as sketched below, and the subsequent table gathers everything together for easy comparison.
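For example (the values should match the table below up to rounding):

```r
# Extract AIC, GCV, and adjusted R-squared from the GAM
AIC(mod_gam1)
summary(mod_gam1)$sp.criterion
summary(mod_gam1)$r.sq  # adjusted R squared
```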
model | AIC | GCV | R^2 |
---|---|---|---|
LM | 550.24 | 1,504.50 | 0.52 |
GAM | 529.81 | 1,053.73 | 0.70 |
Comparing these various measures, it’s safe to conclude that the GAM fits better. We can also perform the familiar statistical test via the anova function we apply to other R model objects. As with the previous p-value issue, we can’t get too carried away, and technically one could have a model with even more terms but a lower edf, though it would be difficult to interpret22. As it would be best to be conservative, we’ll proceed cautiously.
anova(mod_lm, mod_gam1, test = "Chisq")
Analysis of Deviance Table
Model 1: Overall ~ Income
Model 2: Overall ~ s(Income, bs = "cr")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 52.000 75336
2 45.259 41479 6.7411 33857 2.778e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It would appear the ANOVA results tell us what we have probably come to believe already, that incorporating nonlinear effects has improved the model considerably.
Multiple Features
Let’s now see what we can do with a more realistic case where we have added model complexity.
Linear Fit
We’ll start with the linear model approach again, this time adding the Health and Education indices.
mod_lm2 = gam(Overall ~ Income + Edu + Health, data = pisa)

summary(mod_lm2)
Family: gaussian
Link function: identity
Formula:
Overall ~ Income + Edu + Health
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.18 78.97 1.535 0.1314
Income 182.32 85.27 2.138 0.0376 *
Edu 234.11 54.78 4.274 9.06e-05 ***
Health 27.01 134.90 0.200 0.8421
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.616 Deviance explained = 63.9%
GCV = 1212.3 Scale est. = 1119 n = 52
It appears we have statistical effects for Income and Education, but not for Health, and the adjusted R-squared suggests a notable amount of the variance is accounted for23. Let’s see about nonlinear effects.
GAM
As far as the generalized additive model goes, we can approach things in a similar manner as before, and now look for nonlinear effects for each feature24.
mod_gam2 = gam(Overall ~ s(Income) + s(Edu) + s(Health), data = pisa)

summary(mod_gam2)
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income) + s(Edu) + s(Health)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 471.154 2.772 170 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 7.593 8.415 8.826 1.29e-06 ***
s(Edu) 6.204 7.178 3.308 0.00771 **
s(Health) 1.000 1.000 2.736 0.10679
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.863 Deviance explained = 90.3%
GCV = 573.83 Scale est. = 399.5 n = 52
There are again a couple things to take note of. First, statistically speaking, we come to the same conclusion as the linear model regarding the individual effects. One should take particular note of the effect of the Health index. The effective degrees of freedom with value 1 suggests that it has essentially been reduced to a simple linear effect. The following updates the model to treat that effect explicitly as linear, but as one can see based on GCV and other values, the results are identical, because the penalty had already effectively rendered it linear.
mod_gam2B = update(mod_gam2, . ~ . - s(Health) + Health)

summary(mod_gam2B)
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income) + s(Edu) + Health
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 640.3 102.3 6.260 3.06e-07 ***
Health -189.5 114.6 -1.654 0.107
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 7.593 8.415 8.826 1.29e-06 ***
s(Edu) 6.204 7.178 3.308 0.00771 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.863 Deviance explained = 90.3%
GCV = 573.83 Scale est. = 399.5 n = 52
We can also note that this model accounts for much of the variance in Overall science scores, with an adjusted R-squared of .86. In short, it looks like the living standards and educational resources of a country are associated with overall science scores, even if we don’t really need the Health index in the model.
Visualization
Now we examine the effects of interest visually via component plots25. The plots shown use my own function from visibly, but you can also use packages like ggeffects or gratia; the following code would produce something close.
plot(ggeffects::ggpredict(mod_gam2), facets = TRUE)
gratia::draw(mod_gam2)
Here we can see the effects of interest, and one might again note the penalized-to-linear effect of Health26. As before, we see the tapering off of Income’s effect at its highest levels, and in addition, a kind of sweet spot for a positive effect of Education in the mid-range values, with a slight positive effect overall. Health, as noted, has been reduced to a linear, surprisingly negative effect, but again this is not statistically significant.
The following code demonstrates how to create data necessary for the plot, specifically for incomes, with the other features held at their mean values. See the technical section for more details on these visualizations.
# Note that mod_gam2$model is the data that was used in the modeling process,
# so it will have NAs removed.
testdata = data.frame(
  Income = seq(.4, 1, length = 100),
  Edu    = mean(mod_gam2$model$Edu),
  Health = mean(mod_gam2$model$Health)
)

predictions = predict(
  mod_gam2,
  newdata = testdata,
  type = 'response',
  se = TRUE
)

df_preds = data.frame(testdata, predictions) %>%
  mutate(lower = fit - 1.96 * se.fit,
         upper = fit + 1.96 * se.fit)

ggplot(aes(x = Income, y = fit), data = df_preds) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = 'gray92') +
  geom_line(color = '#56B4E9')
2d Smooths
The GAM gives us a sense for one feature, but let’s now take a gander at Income and Education at the same time. Previously, we saw how to use the plot method for a GAM class object. There is another plotting function, vis.gam, that will give us a bit more to play with, and specifically allows us to display 2d smooths. The plot provided here instead depicts a heatmap with the values on the response scale.
vis.gam(mod_gam2, type = 'response', plot.type = 'contour')
First and foremost, the figure reflects the individual plots: we can see that being high on Income generally produces the highest scores, while Education has less of an effect. Conversely, being low on both the Education and Income indices is associated with poor Overall science scores. While interesting, these respective smooths were created separately of one another, and there is another way we might examine how these two effects work together in predicting the response.
So let’s take a look at another approach, continuing the focus on visualization. It may not be obvious at all, but one can utilize smooths of more than one feature, in effect, a smooth of the smooths of the variables that go into it. This is akin to an interaction in typical model settings27. Let’s create a new model to play around with this. After fitting the model, I provide both a visualization for comparison to the previous one and a 3D view one can interactively rotate to their liking.
mod_gam3 = gam(Overall ~ te(Income, Edu), data = pisa)

summary(mod_gam3)
Family: gaussian
Link function: identity
Formula:
Overall ~ te(Income, Edu)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 471.154 3.349 140.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
te(Income,Edu) 10.1 12.19 16.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.8 Deviance explained = 84%
GCV = 741.42 Scale est. = 583.17 n = 52
vis.gam(
  mod_gam3,
  type = 'response',
  plot.type = 'persp',
  phi = 30,
  theta = 30,
  n.grid = 500,
  border = NA
)
In the above we are using a type of smooth called a tensor product smooth, and by smoothing the marginal smooths of Income and Education, we see a somewhat clearer story. As we might suspect, wealthy countries with more of an apparent educational infrastructure are going to score higher on the Overall science score. However, wealth alone does not necessarily guarantee higher science scores (note the dark bottom right corner on the contour plot)28, though without at least moderate wealth hopes are fairly dim for a decent score.
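As footnote 27 notes, the ti smooth allows an ANOVA-style decomposition of such a tensor product into main effects and an interaction; a minimal sketch of that alternative specification follows.

```r
# Main-effect smooths plus a pure interaction smooth (see ?ti)
mod_gam3_ti = gam(Overall ~ ti(Income) + ti(Edu) + ti(Income, Edu), data = pisa)
summary(mod_gam3_ti)
```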
One can also, for example, examine interactions between a smooth and a linear term \(f(x)z\), and in a similar vein of thought look at smooths at different levels of a grouping factor.
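For a rough idea of the syntax, both of those cases use mgcv’s by argument; the following are illustrative sketches, not part of the analysis above.

```r
# Varying-coefficient term: the linear effect of Edu varies smoothly with
# Income, i.e. f(Income) * Edu
mod_vc = gam(Overall ~ s(Income, by = Edu), data = pisa)

# With a grouping factor (a hypothetical 'grp' variable, not in these data),
# separate smooths per group would be requested the same way:
# gam(Overall ~ grp + s(Income, by = grp), data = pisa)
```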
Model Comparison
As before, we can examine indices such as GCV or perhaps adjusted R-squared, which both suggest our GAM performs considerably better. Statistically we can compare the two models with anova as before.
anova(mod_lm2, mod_gam2, test = "Chisq")
Analysis of Deviance Table
Model 1: Overall ~ Income + Edu + Health
Model 2: Overall ~ s(Income) + s(Edu) + s(Health)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 48.000 53713
2 34.408 14463 13.592 39250 6.754e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not that we couldn’t have assumed as such already, but now we have additional statistical evidence to suggest that incorporating nonlinear relationships of the features improves the model.
References
9. The education component is measured by mean of years of schooling for adults aged 25 years and expected years of schooling for children of school entering age, the health index by life expectancy at birth, and the wealth component is based on the gross national income per capita. The HDI sets a minimum and a maximum for each dimension, and values indicate where each country stands in relation to these endpoints, expressed as a value between 0 and 1. More information on the HDI measures can be found here.↩︎
10. ggplot2 has basic gam functionality for scatterplot smoothing.↩︎
11. For those more familiar with generalized linear models, this is calculated as \((\mathrm{Dev}_{Null} - \mathrm{Dev}_{Residual})/\mathrm{Dev}_{Null}\). One can verify this by running the same model via the glm function, and using the corresponding values from the summary of the model object.↩︎
12. This example is pretty much straight from S. N. Wood (2006) with little modification.↩︎
13. See poly for how to fit a polynomial in R.↩︎
14. So ten total knots including the endpoints.↩︎
15. As an aside, the term nonparametric has at least two general uses in the statistical world. Wikipedia has a nice delineation.↩︎
16. This part regarding effective degrees of freedom should ring a bell for those who use the lme4 package for mixed models. All of this applies there too, and may provide more insight as to why they don’t even provide p-values as a default. See the Other Approaches section.↩︎
17. In this example, there are actually 9 terms associated with this smooth, but they are each ‘penalized’ to some extent and so the edf does not equal 9.↩︎
18. Here it is noted Ref.df. In the past, there were four or five p-values to choose from, but now the options have settled to Bayesian-esque vs. frequentist approaches. The full story of edf, p-values, and related is scattered throughout Wood’s text. See also ?anova.gam.↩︎
19. But then, standard p-values shouldn’t be used that way either.↩︎
20. In this initial model the GCV can be found as: \(GCV = \frac{n \cdot \mathrm{scale\ est.}}{(n - edf - [\mathrm{n\ of\ parametric\ terms}])^{2}}\)↩︎
21. More visually appealing alternatives to the default mgcv plot can be found with ggeffects, mgcViz, gratia, and probably others. In most places, I’m using my own functions from visibly, which you can get from GitHub.↩︎
22. See ?anova.gam for more information.↩︎
23. Note that a difference in sample sizes does not make this directly comparable to the first model.↩︎
24. The default smoother for s() is the argument bs = 'tp', a thin plate regression spline.↩︎
25. See e.g. John Fox’s texts and chapters regarding regression diagnostics, and his crplots function in the car package.↩︎
26. One will notice that for the default gam plot in mgcv, the y-axis scale is on that of the linear predictor, but due to identifiability constraints, the smooths must sum to zero, and so they are presented in a mean-centered fashion.↩︎
27. See the smooth ti for an ANOVA decomposition of such smooths into main effects and interaction.↩︎
28. Probably this is due to Qatar. Refer again to the previous scatterplot.↩︎