Polynomial Regression

A common application in regression to deal with nonlinear relationships involves polynomial regression. For the predictor in question, $$x$$, we add terms e.g. quadratic ($$x^2$$), cubic ($$x^3$$) etc. to get a better fit. Consider the following data situation.

Let’s fit a quadratic term and note the result.

mod_poly = lm(y ~ poly(x, 2))

The R2 for this is 0.78, that’s great! But look closely and you might notice that we aren’t capturing the lower tail of this target at all, and not doing so great for observations that are high on both variables. Here’s what the data and fit would look like if we extend the data based on the underlying true function, and things only get worse.

Part of the reason is that, outside of deterministic relationships due known physical or other causes, you are unlikely to discover a quadratic relationship between variables among found data. Even when it appears to fit, without a lot of data it is almost certainly overfit due to this reason. Fitting a polynomial is more akin to enforcing our vision of how the data should be, rather than letting the data speak for itself. Sure, it might be a good approximation some of the time, just as assuming a linear relationship is, but often it’s just wishful thinking.

Compare the previous result to the following fit from a generalized additive model. GAMs are susceptible to extrapolation, as is every statistical model ever created. However, the original fit (in red) is much better. Notice how it was better able to follow the straightened out data points at the high end, rather than continuing the bend the quadratic enforced.

A more complex relationship

Perhaps you would have been satisfied with the initial quadratic fit above or perhaps a cubic fit2. We may come across a situation where the target of interest $$y$$ is a function of some covariate $$x$$, whose effect is not straightforward at all. In a standard regression we can generally write the model as follows:

$y = f(x) + e$

If we are talking a linear relationship between y and x, f(x) might be a simple sum of the covariates.

$y = b_0 + b_1*x + e$

If that were the case, we could do our standard regression model. Now consider the following functional form for x:

$f(x) = sin(2(4x-2)) + 2e^{-(16^2)(x-.5)^2} + \epsilon$ $\epsilon \sim N(0,.3^2)$

Let’s generate some data and take a look at it visually.

set.seed(123)
x = runif(500)
mu = sin(2*(4*x-2)) + 2*exp(-(16^2)*((x-.5)^2))
y = rnorm(500, mu, .3)
d = data.frame(x,y) 

Polynomial regression is problematic

A standard linear regression is definitely not going to capture this relationship. As above, we could try and use polynomial regression here, e.g. fitting a quadratic or cubic function within the standard regression framework. However, this is unrealistic at best and at worst isn’t useful for complex relationships. In the following even with a polynomial of degree 15 the fit is fairly poor in many areas, and ‘wiggles’ in some places where there doesn’t appear to be a need to.

The same would hold true for other approaches that require the functional form to be specified (e.g. so-called logistic growth curve models). It’s maybe also obvious that a target transformation (e.g. log) isn’t going to help us in this case either. In general, we’ll need tools better suited to more complex relationships, or simply ones that don’t require us to overfit/simplify the relationship we see, or guess about the form randomly until finding a decent fit.

1. This example was actually generated with a cubic polynomial.