Application Using R
We’re now ready to do some modeling! We’ve now seen the relationships of GAM to techniques we know, deficiencies with common approaches, and a little bit about how GAMs work conceptually. The next step is to dive in.
Initial Examination
The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. The key variables are as follows (variable abbreviations in bold):
The education component is measured by mean of years of schooling for adults aged 25 years and expected years of schooling for children of school entering age, the health index by life expectancy at birth, and the wealth component is based on the gross national income per capita. The HDI sets a minimum and a maximum for each dimension, and values indicate where each country stands in relation to these endpoints, expressed as a value between 0 and 1. More information on the HDI measures can be found here.
- Overall Science Score (average score for 15 year olds)
- Interest in science
- Support for scientific inquiry
- Income Index
- Health Index
- Education Index
- Human Development Index (composed of the Income index, Health Index, and Education Index)
But even before we get too far, it would be good to know our options in the world of GAMs. The appendix of this document has a list of some packages to be aware of, and there is quite a bit of GAM functionality available within R, even for just plottingggplot2 has basic gam functionality for scatterplot smoothing. Examples are further in the text.. We will use mgcv for our purposes.
The first thing to do is get the data in and do some initial inspections.
Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max | Missing |
---|---|---|---|---|---|---|---|---|---|
Overall | 57 | 473.14 | 54.58 | 322.00 | 428.00 | 489.00 | 513.00 | 563.00 | 8 |
Issues | 57 | 469.91 | 53.93 | 321.00 | 427.00 | 489.00 | 514.00 | 555.00 | 8 |
Explain | 57 | 475.02 | 54.02 | 334.00 | 432.00 | 490.00 | 517.00 | 566.00 | 8 |
Evidence | 57 | 469.81 | 61.74 | 288.00 | 423.00 | 489.00 | 515.00 | 567.00 | 8 |
Interest | 57 | 528.16 | 49.84 | 448.00 | 501.00 | 522.00 | 565.00 | 644.00 | 8 |
Support | 57 | 512.18 | 26.08 | 447.00 | 494.00 | 512.00 | 529.00 | 569.00 | 8 |
Income | 61 | 0.74 | 0.11 | 0.41 | 0.66 | 0.76 | 0.83 | 0.94 | 4 |
Health | 61 | 0.89 | 0.07 | 0.72 | 0.84 | 0.89 | 0.94 | 0.99 | 4 |
Edu | 59 | 0.80 | 0.11 | 0.54 | 0.72 | 0.81 | 0.88 | 0.99 | 6 |
HDI | 59 | 0.81 | 0.09 | 0.58 | 0.75 | 0.82 | 0.88 | 0.94 | 6 |
The scatterplot matrix has quite a bit of information to spend some time with- univariate density, loess curves, and correlations. For our purposes, we will ignore the issue regarding the haves vs. have-nots on the science scale and save that for another day. Note the surprising negative correlation between interest and overall score for science, which might remind us of Simpson’s paradox, namely, that what occurs for the individual may not be the same for the group. One will also note that while there is a positive correlation between Income and Overall science scores, it flattens out after an initial rise. Linear fits might be difficult to swallow for some of these, but let’s take a closer look.
We can see again that linear fits aren’t going to do so well for some, though it might be a close approximation for interest in science and support for science. Now let’s run the smooth. By default, ggplot2 will use a loess smoother for small data sets (i.e. < 1000 observations), but one can use the mgcv gam function as a smoother via the method
argument.
Often in these situations people will perform some standard transformation, such as a log, but as we noted earlier it doesn’t help as nearly as often as it is used. For example, in this case one can log the overall score, Income, or both and a linear relation will still not be seen.
Single Predictor
Linear Fit
We will start with the simple situation of a single predictor. Let’s begin by using a typical linear regression to predict science scores by the Income index. We could use the standard R lm function, but I’ll leave that as an exercise for comparison. We can still do straightforward linear models with the gam function, and again it is important to note that the standard linear model can be seen as a special case of a GAM.
Family: gaussian
Link function: identity
Formula:
Overall ~ Income
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 204.32 35.37 5.777 4.32e-07 ***
Income 355.85 46.79 7.606 5.36e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.518 Deviance explained = 52.7%
GCV = 1504.5 Scale est. = 1448.8 n = 54
What are we getting here? The same thing you get from a regular linear model, because you just ran one. However, there are a couple things to look at. The coefficient is statistically significant, but serves as a reminder that it usually a good idea to scale predictor variables so that the effect is more meaningful. Here, moving one unit on Income is akin from going broke to being the richest country. But in a more meaningful sense, if we moved from say, .7 to .8, we’d expect an increase of about 35 points on the science score. We also see the deviance explained^{10}, which serves as a generalization of R-squared, and in this case, it actually is equivalent to the unadjusted R-squared. Likewise, there is the familiar adjusted version of it to account for small sample size and model complexity. The scale estimate is the scaled deviance, which here is equivalent to the residual sums of squares. The GCV score we will save for when we run a GAM.
GAM
Let’s now try some nonlinear approaches, keeping in mind that \(\mu=f(x)\). As a point of comparison, we can start by trying a standard polynomial regression, and it might do well enoughThis example is pretty much straight from Wood (2006) with little modification.. To begin we must consider a basis to use, a space that \(f\) is an element of. Doing so leads to choosing a set of basis functions \(b_j\), with parameters \(\gamma_j\) that will be combined to produce \(f(x)\):
\[f(x)=\displaystyle\sum\limits_{j=1}^q b_{j}(x)\gamma_{j}\]
To better explain by example, if we use a cubic polynomial, the basis is: \(b_1(x) = 1\), \(b_2(x)=x\), \(b_3(x)=x^2\), \(b_4(x)=x^3\), which leads to the following:
\[f(x) = \gamma_1 + \gamma_2x + \gamma_3x^2 + \gamma_4x^3\]
The following visualization allows us to see the effects in action. It is based on the results extracted from running such a model^{11} and obtaining the coefficients. The first plot represents the intercept of 470.44, the second plot, our \(b_2\) coefficient of 289.5 multiplied by Income and so forth. The bottom plot shows the final fit \(f(x)\), i.e. the linear combination of the basis functions.
At this point we have done nothing we couldn’t do in our regular regression approach, as polynomial regression has a long history of use in modeling. However, the take home message is that as we move to GAMs we are going about things in much the same fashion; we are simply changing the nature of the basis, and have a great deal more flexibility in choosing the form.
In the next figure I show the fit using a ‘by-hand’ cubic spline basis (see the Appendix and p.126-7 in Wood (2006)).
A cubic spline is essentially a connection of multiple cubic polynomial regressions, similar to what we demonstrated previously. We choose points of the predictor variable at which to create sections, and these points are referred to as knots. Separate cubic polynomials are fit at each section, and then joined at the knots to create a continuous curve. The above graph represents a cubic spline with 8 knots^{12} between the first and third quartiles. The red line uses the GAM functionality within ggplot2’s geom_smooth as a point of comparison.
Fitting the model
Let’s now fit an actual generalized additive model using the same cubic spline as our basis. We again use the gam function as before for basic model fitting, but now we are using a function s within the formula to denote the smooth terms. Within that function we also specify the type of smooth, though a default is available. I chose bs = cr
, denoting cubic regression splines, to keep consistent with our previous example.
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 470.444 4.082 115.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 6.895 7.741 16.67 1.59e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.7 Deviance explained = 73.9%
GCV = 1053.7 Scale est. = 899.67 n = 54
The first thing to note is that, aside from the smooth part, our model code is similar to what we’re used to with core R functions such as lm and glm. In the summary, we first see the distribution assumed, as well as the link function used, in this case normal and identity, respectively, which to iterate, had we had no smoothing, would result in a SLiM. After that we see that the output is separated into parametric and smooth, or nonparametric parts.As an aside, the term nonparametric has at least two general uses in the statistical world. Wikipedia has a nice delineation. In this case, the only parametric component is the intercept, but it’s good to remember that you are not bound to smooth every effect of interest, and indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be linear for the most part anyway. The smooth component of our model regarding a country’s income and its relationship with overall science score suggests it is statistically significant, but there are a couple of things in the model summary that would be unfamiliar.
This part regarding effective degrees of freedom should ring a bell for those who use the lme4 package for mixed models. All of this applies there too, and may provide more insight as to why they don’t even provide p-values as a default. See the Other Approaches section.
We’ll start with the effective degrees of freedom, or edf
. In typical OLS regression the model degrees of freedom is equivalent to the number of predictors/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure, something that will be discussed more later^{13}. In this situation, we are still trying to minimize the residual sums of squares, but we also have a built-in penalty for ‘wiggliness’ of the fit, where in general we try to strike a balance between an undersmoothed fit and an oversmoothed fit. The default p-value for the test is based on the effective degrees of freedom and the rank \(r\) of the covariance matrix for the coefficients for a particular smooth, so here, conceptually, it is the p-value associated with the \(F(r, n-edf)\). However, there are still other issues to be concerned about, and ?summary.gam
will provide your first step down that particular rabbit hole. For hypothesis testing an alternate edf is actually used, which is the other one provided there in the summary result^{14}. At this point you might be thinking these p-values are a bit fuzzy, and you’d be right. The gist is, they aren’t to be used for harsh cutoffs, say, at an arbitrary .05 level^{15}, but if they are pretty low you can feel comfortable claiming statistical significance, which of course is the end all, be all, of the scientific endeavor- right?
The GCV, or generalized cross validation score can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation estimation process. We estimate the model for all observations except \(i\), then note the squared residual predicting observation \(i\) from the model. Then we do this for all observations. However, the GCV score is an efficient measure of this concept that doesn’t actually require fitting all those models and overcomes other issuesIn this initial model the GCV can be found as:
\(GCV = \frac{n*scaled\, est.}{(n-edf-{[n\,of\, parametric\, terms]})^{2}}\). It is this score that is minimized by default when determining the specific nature of the smooth. On its own it doesn’t tell us much, but we can use it similar to AIC as a comparative measure to choose among different models, with lower being better.
Graphical Display
One can get sense of the form of the fit by simply plotting the model object as follows:
Note that this will only plot the smooth terms in the model. A more visually appealing alternative to the default mgcv plot can be found with the visreg package, which will plot all effects, and by default be nicer looking and easier to customize. You can simply feed the model to the visreg function. It also has 2 and 3d capabilities like the corresponding mgcv functions. In most places, I’m using my own functions from visibly, which you can get from GitHub. The intervals are Bayesian credible intervals based on the posterior predictive distribution. In this single predictor case, one can also revisit the previous graph.
Model Comparison
Let us now compare our regular regression fit to the GAM model fit. The following shows how one can extract various measures of performance, and the subsequent table shows them gathered together.
[1] 550.2449
GCV.Cp
1504.496
[1] 0.5175346
Do the same to extract those same elements from the GAM. The following display makes for easy comparison.
AIC | GCV | R^{2} | |
---|---|---|---|
LM | 550.24 | 1504.50 | 0.52 |
GAM | 529.81 | 1053.73 | 0.70 |
Comparing these various measures, it’s safe to conclude that the GAM fits better. We can also perform the familiar statistical test via the anova function we apply to other R model objects. As with the previous p-value issue, we can’t get too carried away, and technically one could have a model with even more terms but lower edf, but would be difficult to interpret?anova.gam
for more information.. As it would be best to be conservative, we’ll proceed cautiously.
Analysis of Deviance Table
Model 1: Overall ~ Income
Model 2: Overall ~ s(Income, bs = "cr")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 52.000 75336
2 45.259 41479 6.7411 33857 2.778e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It would appear the anova results tell us what we have probably come to believe already, that incorporating nonlinear effects has improved the model considerably.
Multiple Predictors
Let’s now see what we can do with a more realistic case where we have added model complexity.
Linear Fit
We’ll start with the SLiM approach again, this time adding the Health and Education indices.
Family: gaussian
Link function: identity
Formula:
Overall ~ Income + Edu + Health
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.18 78.97 1.535 0.1314
Income 182.32 85.27 2.138 0.0376 *
Edu 234.11 54.78 4.274 9.06e-05 ***
Health 27.01 134.90 0.200 0.8421
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.616 Deviance explained = 63.9%
GCV = 1212.3 Scale est. = 1119 n = 52
It appears we have statistical effects for Income and Education, but not for Health, and the adjusted R-squared suggests a notable amount of the variance is accounted forNote that a difference in sample sizes do not make this directly comparable to the first model.. Let’s see about nonlinear effects.
GAM
As far as the generalized additive model goes, we can approach things in a similar manner as before. We will ignore the results of the linear model for now and look for nonlinear effects for each covariate.
The default smoother for s()
is the argument bs='tp'
, a thin plate regression spline.
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income) + s(Edu) + s(Health)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 471.154 2.772 170 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 7.593 8.415 8.826 1.33e-07 ***
s(Edu) 6.204 7.178 3.308 0.00733 **
s(Health) 1.000 1.000 2.736 0.10661
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.863 Deviance explained = 90.3%
GCV = 573.83 Scale est. = 399.5 n = 52
There are again a couple things to take note of. First, statistically speaking, we come to the same conclusion as the linear model regarding the individual effects. One should take particular note of the effect of Health index. The effective degrees of freedom with value 1 suggests that it has essentially been reduced to a simple linear effect. The following will update the model to explicitly model the effect as such, but as one can see based on GCV, the results are identical.
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income) + s(Edu) + Health
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 640.3 102.3 6.260 3.06e-07 ***
Health -189.5 114.6 -1.654 0.107
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 7.593 8.415 8.826 1.33e-07 ***
s(Edu) 6.204 7.178 3.308 0.00733 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.863 Deviance explained = 90.3%
GCV = 573.83 Scale est. = 399.5 n = 52
We can also note that this model accounts for much of the variance in Overall science scores, with an adjusted R-squared of .86. In short, it looks like the living standards and educational resources of a country are associated with overall science scores, even if we don’t really need the Health index in the model.
Graphical Display
Now we examine the effects of interest visually via component plots^{16}. The following uses my own function from visibly but you can also use the visreg function.
Here we can see the effects of interestOne will notice that for the default gam plot in mgcv, the y-axis scale is on that of the linear predictor, but due to identifiability constraints, the smooths must sum to zero, and thus are presented in a mean-centered fashion., and again one might again note the penalized-to-linear effect of Health. As before, we see the tapering off of Income’s effect at its highest levels, and in addition, a kind of sweet spot for a positive effect of Education in the mid-range values, with a slight positive effect overall. Health, as noted, has been reduced to a linear, surprisingly negative effect, but again this is not statistically significant.
The following code demonstrates how to create data necessary for the plot, specifically for incomes, with the other predictors held at their mean. See the technical section for more details on these visualizations.
# Note that mod_gam2$model is the data that was used in the modeling process,
# so it will have NAs removed.
testdata = data.frame(Income = seq(.4, 1, length = 100),
Edu = mean(mod_gam2$model$Edu),
Health = mean(mod_gam2$model$Health))
fits = predict(mod_gam2, newdata=testdata, type='response', se=T)
predicts = data.frame(testdata, fits) %>%
mutate(lower = fit - 1.96*se.fit,
upper = fit + 1.96*se.fit)
plot_mod_gam2_response = ggplot(aes(x=Income,y=fit), data=predicts) +
geom_ribbon(aes(ymin = lower, ymax=upper), fill='gray90') +
geom_line(color='#00aaff') +
theme_trueMinimal()
2d Smooths
The GAM gives us a sense for one predictor, but let’s now take a gander at Income and Education at the same time. Previously, we saw how to use the plot method for a GAM class object. There is another plotting function, vis.gam, that will give us a bit more to play with, and specifically to display 2d smooths. It has a corresponding version from visreg which will look a bit better by default, visreg2d. The following code will produce a contour plot with Income on the x axis, Education on the y axis, with values on the response scale given by the contours, with lighter color indicating higher values. The actual plot provided instead depicts a heatmap.
vis.gam(mod_gam2, type='response', plot.type='contour')
visreg2d(mod_gam2, xvar='Income', yvar='Edu', scale='response')
First and foremost, the figure reflects the individual plots, and we can see that middling to high on Education and high on Income generally produces the highest scores. Conversely, being low on both the Education and Income indices are associated with poor Overall science scores. However, while interesting, these respective smooths were created separately of one another, and there is another way we might examine how these two effects work together in predicting the response.
So let’s take a look at another approach, continuing the focus on visual display. It may not be obvious at all, but one can utilize smooths of more than one variable, in effect, a smooth of the smooths of the variables that go into it. This is akin to an interaction in typical model settings^{17}. Let’s create a new model to play around with this feature. After fitting the model, I provide a visualization for comparison to the previous, as well as a 3D view one can rotate to their liking.
Family: gaussian
Link function: identity
Formula:
Overall ~ te(Income, Edu)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 471.154 3.349 140.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
te(Income,Edu) 10.1 12.19 16.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.8 Deviance explained = 84%
GCV = 741.42 Scale est. = 583.17 n = 52