Introduction

Explanation & Prediction

For any particular analysis, emphasis can be placed on understanding the underlying mechanisms, which have specific theoretical underpinnings, or on performance and, more to the point, future performance. These are not mutually exclusive goals in the least, and probably most studies contain a little of both in some form or fashion. I will refer to the former emphasis as that of explanation, and the latter as that of prediction3.

In studies with a more explanatory focus, traditional analysis concerns a single data set. For example, one assumes a data generating distribution for the response and evaluates the overall fit of a single model to the data at hand, e.g. in terms of R-squared and the statistical significance of the various predictors in the model. One assesses how well the model lines up with the theory that led to the analysis, and modifies it accordingly, if need be, for future studies to consider. Some studies may look at predictions for specific, possibly hypothetical, values of the predictors, or examine the particular nature of covariate effects. In many cases, only a single model is considered. In general though, little attempt is made to explicitly understand how well the model will do with future data, but we hope to have gained greater insight into the underlying mechanisms guiding the response of interest. Following Breiman (2001), this would be more akin to the data modeling culture.

For studies focused on prediction, other techniques are available that are far more focused on performance, not only for the current data under examination but also for future data the selected model might be applied to. While still possible to assess, relative predictor importance is less of an issue, and oftentimes there may be no particular theory to drive the analysis. There may be thousands of input variables and thousands of parameters to estimate, such that no simple summary would likely be possible anyway. However, many of the techniques applied in such analyses are quite powerful, and steps are taken to ensure better results for new data. Referencing Breiman again, this perspective is more of the algorithmic modeling culture.

While the two approaches are not mutually exclusive, I present two extreme (though thankfully dated) views of the situation4:

To paraphrase provocatively, ‘machine learning is statistics minus any checking of models and assumptions’.
\(\sim\) Brian Ripley, 2004

… the focus in the statistical community on data models has:

  • Led to irrelevant theory and questionable scientific conclusions.
  • Kept statisticians from using more suitable algorithmic models.
  • Prevented statisticians from working on exciting new problems.
    \(\sim\) Leo Breiman, 2001

Departments of computer science, statistics, and related fields now overlap more than ever, as more relaxed views prevail today. However, there are potential drawbacks to placing too much emphasis on either explanation or prediction. Performant models that ‘just work’ have the potential to be dangerous if they are little understood. Conversely, spending a great deal of time sorting out the details of an ill-fitting model yields the opposite problem: some amount of understanding coupled with little pragmatism. While this document will focus on more prediction-focused approaches, guidance will be provided with an eye toward their use in situations where the typical explanatory approach would be applied, thereby hopefully shedding some light on a path toward obtaining the best of both worlds.

Terminology

For those used to statistical concepts such as dependent variables, clustering, and predictors, you will have to get used to some differences in terminology5, such as targets, unsupervised learning, and features. It doesn’t take too much to get acclimated, even if it is somewhat annoying when one is first starting out. I won’t be too beholden to either set of terms in this paper, and it should be clear from the context what’s being referred to. Initially I will start off mostly with non-ML terms and note an ML version in brackets to help the orientation along.

Supervised vs. Unsupervised

At least one distinction should be made from the outset: supervised vs. unsupervised learning. Supervised learning is the typical regression or classification situation, where we are trying to predict some target variable of interest, possibly several. Unsupervised learning seeks patterns within the data without regard to any specific target variable, and would include things like cluster analysis, principal components analysis, etc. The two can also be used in conjunction. The focus of this document will be almost entirely on supervised learning, though we will discuss some forms of unsupervised learning toward the end.

Tools you already have

One thing that is important to keep in mind as you begin is that standard techniques from traditional statistics are still available in ML, although even then we might tweak them or do more with them. So, having a basic background in statistics is all that is required to get started with machine learning. Again, the difference between ML and traditional statistical analysis is more of focus than method.

The Standard Linear Model

All introductory statistics courses will cover linear regression in great detail, and it certainly can serve as a starting point here. We can describe it as follows in matrix notation in terms of the underlying data generating process:

\[\mu = X\beta\] \[y \sim N(\mu,\sigma^{2}\textrm{I})\]

Where \(y\) is a conditionally normally distributed response [target] with mean \(\mu\) and constant variance \(\sigma^{2}\). \(X\) is a typical model matrix, i.e. a matrix of predictor variables in which the first column is a vector of 1s for the intercept [bias]6, and \(\beta\) is the vector of coefficients [weights] corresponding to the intercept and predictors in the model matrix.
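As a quick illustration, here is a minimal sketch in R with simulated data (the variables x1 and x2 and the coefficient values are entirely made up); the model matrix and coefficients correspond to the \(X\) and \(\beta\) above.

```r
# Simulated data purely for illustration; predictors and coefficients are arbitrary
set.seed(123)
N  <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 1 + .5*x1 - .25*x2 + rnorm(N, sd = .5)
d  <- data.frame(y, x1, x2)

# Fit the standard linear model
fit_lm <- lm(y ~ x1 + x2, data = d)

coef(fit_lm)               # estimated beta: intercept [bias] and coefficients [weights]
head(model.matrix(fit_lm)) # the model matrix X, with a leading column of 1s
```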

What might be given less focus in applied courses is how often standard regression won’t be the best tool for the job, or even applicable in the form it is presented. Because of this, many applied researchers are still hammering screws with it, even as the explosion of statistical techniques over the past quarter century or more has rendered obsolete many introductory statistical texts written for specific disciplines. Even so, the concepts one gains in learning the standard linear model are generalizable, and a few modifications, while maintaining the basic design, can make it very effective in situations where it is appropriate.

Typically, in fitting [training] a model we tend to talk about R-squared and the statistical significance of the coefficients for a small number of predictors. For our purposes, let the focus instead be on the residual sum of squares7, or error, with an eye towards its reduction and model comparison. In ML, we will rarely be considering only one model fit, and so must find one that reduces the sum of the squared errors, but without unnecessary complexity and overfitting, concepts we’ll return to later. Furthermore, we will be much more concerned with how the model fits new data that we haven’t seen yet [generalization].
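To make that concrete, here is a sketch, again with arbitrary simulated data, that computes the residual sum of squares both on the data used for fitting and on a held-out portion standing in for new data (the 75/25 split is just for illustration).

```r
# Simulated data as in the previous sketch
set.seed(123)
N <- 100
d <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
d$y <- 1 + .5*d$x1 - .25*d$x2 + rnorm(N, sd = .5)

# Hold out a random 25% of observations to play the role of new data
train_idx <- sample(1:N, size = .75 * N)
d_train   <- d[train_idx, ]
d_test    <- d[-train_idx, ]

fit <- lm(y ~ x1 + x2, data = d_train)

# Residual sum of squares on the data used to fit [train] the model
rss_train <- sum(residuals(fit)^2)

# The same quantity on data the model has not seen [generalization]
rss_test <- sum((d_test$y - predict(fit, newdata = d_test))^2)

c(train = rss_train, test = rss_test)
```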

Logistic Regression

Logistic regression is often used where the response is categorical in nature, usually with a binary outcome in which some event occurs or does not occur [label]. One could still use the standard linear model here, but it could produce nonsensical predictions that fall outside the 0-1 range for the probability of the event occurring, to go along with other shortcomings. Furthermore, it takes no more effort, and no understanding is lost, in using logistic regression over the linear probability model. It is also good to keep logistic regression in mind as we discuss other classification approaches later on.

Logistic regression is also typically covered in an introduction to statistics for applied disciplines because of the pervasiveness of binary responses, or responses that have been made as such8. As with the standard linear model, just a few modifications can provide better performance, particularly with new data. The gist is that we do not have to abandon familiar tools in the move toward a machine learning perspective.
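As a rough illustration with simulated data (the coefficient values are arbitrary), the following compares the linear probability model to logistic regression, showing how the former can produce fitted probabilities outside the 0-1 range while the latter cannot.

```r
# Simulated binary outcome; the true coefficients are made up for illustration
set.seed(123)
N <- 500
x <- rnorm(N)
p <- plogis(-1 + 2*x)              # true probabilities via the inverse logit
y <- rbinom(N, size = 1, prob = p)
d <- data.frame(y, x)

fit_lpm   <- lm(y ~ x, data = d)                      # linear probability model
fit_logit <- glm(y ~ x, data = d, family = binomial)  # logistic regression

range(fitted(fit_lpm))    # fitted 'probabilities' can fall outside 0-1
range(fitted(fit_logit))  # fitted probabilities are bounded between 0 and 1
```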

Expansions of Those Tools

Generalized Linear Models

To begin, logistic regression is a generalized linear model assuming a binomial distribution for the response and with a logit link function as follows:

\[\eta = X\beta\] \[\eta = g(\mu)\] \[\mu = g^{-1}(\eta)\] \[y \sim \mathrm{Bin}(\mu, \mathrm{size}=1)\]

This is the same presentation as with the standard linear model before, except now we have a link function \(g(.)\), and so are dealing with a transformed response. In the case of the standard linear model, the distribution assumed is the Gaussian and the link function is the identity, i.e. no transformation is made. The link function used will depend on the analysis performed, and while there is some choice in the matter, each distribution has a typical, or canonical, link function9.

Generalized linear models expand the standard linear model, which is itself a special case, beyond the Gaussian distribution for the response, and allow for better fitting models of categorical, count, and skewed response variables. We also have a counterpart to the residual sum of squares, though we’ll now refer to it as the deviance.
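As a small sketch of this more general setting, here is a Poisson model for a simulated count response (the data and coefficients are arbitrary), which uses the canonical log link; the deviance takes the place the residual sum of squares had before.

```r
# Simulated count response; predictor and coefficients are arbitrary
set.seed(123)
N  <- 200
x  <- rnorm(N)
mu <- exp(.5 + .25*x)          # inverse of the log link applied to eta
y  <- rpois(N, lambda = mu)
d  <- data.frame(y, x)

fit_pois <- glm(y ~ x, data = d, family = poisson)  # canonical log link

coef(fit_pois)      # coefficients on the scale of the linear predictor eta
deviance(fit_pois)  # the GLM counterpart to the residual sum of squares
```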

Generalized Additive Models

Additive models extend the generalized linear model to incorporate nonlinear relationships of predictors to the response. We might note it as follows:

\[\eta = X\beta + f(Z)\] \[\mu = g^{-1}(\eta)\] \[y \sim \mathrm{family}(\mu, ...)\]

So we have the generalized linear model, but also smooth functions \(f(Z)\) of one or more predictors. More detail can be found in Wood (2017) and I provide an introduction here.
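As a minimal sketch, the mgcv package that accompanies Wood (2017) could be used along the following lines; the data are again simulated (the nonlinear form for z is arbitrary), and the smooth term s(z) plays the role of \(f(Z)\) above.

```r
library(mgcv)

# Simulated data with an arbitrary nonlinear effect of z
set.seed(123)
N <- 200
x <- rnorm(N)
z <- runif(N, -3, 3)
y <- 1 + .5*x + sin(z) + rnorm(N, sd = .5)
d <- data.frame(y, x, z)

# x enters linearly, z through a smooth function estimated from the data
fit_gam <- gam(y ~ x + s(z), data = d)

summary(fit_gam)  # the linear coefficient for x plus a summary of the smooth for z
plot(fit_gam)     # visualize the estimated smooth f(z)
```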

Things do start to get fuzzy with GAMs. It becomes more difficult to obtain statistical inference for the smooth terms in the model, and the nonlinearity does not always lend itself to easy interpretation. Really, though, this just means a little more work to get the desired level of understanding. GAMs can be seen as a segue toward more black box/algorithmic techniques. Compared to some of those techniques in machine learning, GAMs are notably more interpretable, though perhaps less so than GLMs. Also, part of the estimation process includes regularization and validation in determining the nature of the smooth function, topics to which we will return later.

References

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231. doi:10.1214/ss/1009213726.

Wood, S. N. 2017. Generalized Additive Models: An Introduction with R. Vol. 66. CRC Press.


  3. In other words, there are cases where we are more interested in statistical inference and hypothesis testing, versus times where we are more interested in performance.

  4. A favorite around our office is ‘Machine learning is magic!’, attributed to our director Kerby Shedden in response to a client’s over-enthusiasm about something they did not understand well.

  5. See this for a comparison.

  6. Yes, you will see ‘bias’ refer to an intercept, and also mean something entirely different in our discussion of bias vs. variance. Often, techniques inherently induce bias to get better prediction (see the regularization section).

  7. \(\sum(y-f(x))^{2}\), where \(f(x)\) is a function of the model predictors, and in this context a linear combination of them (\(X\beta\)).

  8. It is generally a bad idea to discretize continuous variables, especially the dependent variable. However, contextual issues, e.g. disease diagnosis, might warrant it.

  9. As another example, for the Poisson distribution the typical link function is \(\log(\mu)\).