Concluding remarks
Generalized additive models are a conceptually straightforward tool that allows one to incorporate nonlinear and other relationships into an otherwise linear model. In addition, they let one stay within the linear and generalized linear modeling frameworks that are already familiar, while providing new avenues of model exploration and improved results. As demonstrated, only a modicum of familiarity is needed to put them to use, and it is hoped that this document has provided the means to do so.
In closing, I offer the following, lifted directly from Shalizi (2016), *Advanced Data Analysis from an Elementary Point of View*, as I don't think I could have put it more clearly:
With modern computing power, there are very few situations in which it is actually better to do linear regression than to fit an additive model. In fact, there seem to be only two good reasons to prefer linear models.
- Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure (not others, for which our observables serve as imperfect proxies).
- Our data set is so massive that either the extra processing time, or the extra computer memory, needed to fit and store an additive rather than a linear model is prohibitive.
Even when the first reason applies, and we have good reasons to believe a linear theory, the truly scientific thing to do would be to check linearity, by fitting a flexible non-linear model and seeing if it looks close to linear. Even when the second reason applies, we would like to know how much bias we’re introducing by using linear predictors, which we could do by randomly selecting a subset of the data which is small enough for us to manage, and fitting an additive model.
In the vast majority of cases when users of statistical software fit linear models, neither of these justifications applies: theory doesn’t tell us to expect linearity, and our machines don’t compel us to use it. Linear regression is then employed for no better reason than that users know how to type `lm` but not `gam`. You now know better, and can spread the word.
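As a practical illustration of Shalizi's two checks, consider the following R sketch using the mgcv package. The data frame `d`, outcome `y`, predictor `x`, and subset size are hypothetical stand-ins, so treat this as a template rather than a prescription.

```r
library(mgcv)

## Check 1: fit a flexible nonlinear model and see if it looks close to linear
mod_lm  <- lm(y ~ x, data = d)
mod_gam <- gam(y ~ s(x), data = d)

AIC(mod_lm, mod_gam)  # similar values suggest the linear fit suffices
plot(mod_gam)         # a (near) straight-line smooth also supports linearity

## Check 2: if the full data are unmanageable, gauge the bias of the linear
## predictor by fitting the additive model to a random subset
d_sub   <- d[sample(nrow(d), 10000), ]
gam_sub <- gam(y ~ s(x), data = d_sub)
summary(gam_sub)      # compare the estimated smooth to the linear slope
```

Note also that mgcv provides `bam` for fitting additive models to very large data sets, which weakens the second justification even further.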
Nowadays, GAMs serve as a potentially great starting point for modeling, especially with tabular data. A well-specified GAM can perform well even compared to boosting or deep learning methods, and so at the very least serves as a strong baseline, with enhanced interpretability and more readily obtained uncertainty estimates. For many data situations, GAMs may be all you need.
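To make that concrete, a minimal baseline along these lines might look like the following sketch, again via mgcv; the data frame `dat`, outcome `y`, numeric predictors `x1` and `x2`, and factor `f` are all hypothetical.

```r
library(mgcv)

# Smooths for the numeric predictors; the factor enters as in any linear model
baseline <- gam(
  y ~ s(x1) + s(x2) + f,
  data   = dat,
  method = "REML"  # REML is a generally recommended smoothing-selection method
)

summary(baseline)                          # effective df, approximate p-values
plot(baseline, pages = 1)                  # interpretable per-term smooths
preds <- predict(baseline, se.fit = TRUE)  # predictions with standard errors
```

The appeal as a baseline is visible here: each smooth can be plotted and inspected directly, and standard errors come along with predictions at no extra effort.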
Best of luck with your data!