Mixed Models



Mixed models have been around a long time in the statistical realm. For example, standard ANOVA methods can be seen as special cases of a mixed model. More recently, mixed models have seen a variety of applications and extensions, allowing them to encompass a diverse range of data situations. They can be seen as a first step in expanding one’s tool set beyond the generalized linear model.

Terminology

For the uninitiated, the terminology surrounding mixed models, especially across disciplines, can be a bit confusing. Some terms you might come across regarding these types of models include:

  • Variance components
  • Random intercepts and slopes
  • Random effects
  • Random coefficients
  • Varying coefficients
  • Intercepts- and/or slopes-as-outcomes
  • Hierarchical linear models
  • Multilevel models (implies multiple levels of hierarchically clustered data)
  • Growth curve models (possibly Latent GCM)
  • Mixed effects models

All describe types of mixed models. Some terms are more historical, others are seen more often in a specific discipline, others refer to a certain data structure, and still others are special cases. Mixed effects, or simply mixed, models generally refer to a mixture of fixed and random effects. For the models in general, I prefer the terms ‘mixed models’ or ‘random effects models’ because they are simpler terms, no specific structure is implied, and the latter can also apply to extensions that many would not think of when other terms are used[1].

Regarding the mixed effects, fixed effects is perhaps a poor, but nonetheless stubborn, term for the typical main effects one would see in a linear regression model, i.e. the non-random part of a mixed model. In some contexts, they are referred to as the population average effect. Though you will hear many definitions, random effects are simply those specific to an observational unit, however defined. The approach outlined in this document largely pertains to the case where the observational unit is the level of some grouping factor, but this is only one of several possibilities.

Kinds of Structure

In terms of structure, data might have one or multiple sources of clustering, and that clustering may be hierarchical, such that clusters are nested within other clusters. An example would be scholastic aptitude tests given multiple times to students (repeated observations nested within students, students nested within schools, schools nested within districts). In other cases, there is no nesting structure. An example would be a reaction time experiment where participants perform the same set of tasks. While observations are nested within individuals, observations are also clustered according to task type. Some use the terms nested and crossed to distinguish between these scenarios. In addition, clustering may be balanced or not. We might expect more balance in studies of an experimental nature, but definitely not in other cases, e.g. where the cluster is something like a geographical unit and the observations are people.

In what follows we’ll see mixed effects models in all these data situations. In general, our approach will be the same, as such clustering is really more a property of the data than of the model. However, it’s important to get a sense of the flexibility of mixed models to handle a variety of data situations. It’s also worth noting that other structure may arise, such as temporal or spatial structure. We’ll note those in some places in this document, but they will not be the focus.
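
As a rough preview of syntax to come, here is how those two scenarios might be specified with lme4. The data and variable names are hypothetical, shown only to illustrate the form.

library(lme4)

# nested: students within schools, a random intercept at each level
# lmer(score ~ 1 + (1 | school/student), data = aptitude)

# crossed: participants and tasks, separate non-nested random intercepts
# lmer(rt ~ 1 + (1 | participant) + (1 | task), data = reaction)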

Random Intercepts Model

For the following we’ll demonstrate the simplest[2] and most common case of a mixed model, that in which we have a single grouping/cluster structure for the random effect. For reasons that will hopefully become clear soon, this is commonly called a random intercepts model.

Example: Student GPA

For our first example, we’ll assess factors predicting college grade point average (GPA). Each of the 200 students is assessed on six occasions (each semester for the first three years), so we have observations clustered within students. We have other variables such as job status, sex, and high school GPA. Some will be in both labeled and numeric form. See the appendix for more detail.



The Standard Regression Model

Now for the underlying model. We can show it in a couple different ways. First we start with just a standard regression to get our bearings.

\[\mathrm{gpa} = b_{\mathrm{intercept}} + b_{\mathrm{occ}}\cdot \mathrm{occasion} + \epsilon\]

We have coefficients (\(b_*\)) for the intercept and the effect of time. The error (\(\epsilon\)) is assumed to be normally distributed with mean 0 and some standard deviation \(\sigma\).

\[\epsilon \sim \mathcal{N}(0, \sigma)\]

An alternate way to write the model which puts emphasis on the underlying data generating process for \(\mathrm{gpa}\) can be shown as follows.

\[\mathrm{gpa} \sim \mathcal{N}(\mu, \sigma)\] \[\mu = b_{\mathrm{intercept}} + b_{\mathrm{occ}}\cdot \mathrm{occasion}\]

More technically, the GPA and \(\mu\) variables have an implicit subscript to denote each observation, but you can also think of it as a model for a single individual at a single time point.
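
To make the data generating process concrete, here is a minimal simulation sketch of it in R. The parameter values are made up purely for illustration.

# simulate GPA from the standard regression data generating process
set.seed(123)

b_intercept = 2.6   # made-up values for illustration
b_occ       = 0.1
sigma       = 0.3

occasion = rep(0:5, times = 200)            # six occasions for 200 students
mu       = b_intercept + b_occ * occasion
gpa_sim  = rnorm(length(mu), mean = mu, sd = sigma)

head(gpa_sim)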

The Mixed Model

Initial depiction

Now we show one way of depicting a mixed model that includes a unique effect for each student. Consider the following model for a single student[3]. This shows that the student-specific effect, i.e. the deviation in GPA just for that student being who they are, can be seen as an additional source of variance.

\[\mathrm{gpa} = b_{\mathrm{intercept}} + b_{\mathrm{occ}}\cdot \mathrm{occasion} + (\mathrm{effect}_{\mathrm{student}} + \epsilon)\]

We would (usually) assume the following for the student effects.

\[\mathrm{effect}_{\mathrm{student}} \sim \mathcal{N}(0, \tau)\]

So the student effects are random, and specifically they are normally distributed with mean of zero and some estimated standard deviation (\(\tau\)). In other words, conceptually, the only difference between this mixed model and a standard regression is the student effect, which on average is no effect, but typically varies from student to student by some amount that is on average \(\tau\).

If we rearrange it, we can instead focus on model coefficients, rather than viewing the student effect as an additional source of error.

\[\mathrm{gpa} = (b_{\mathrm{intercept}} + \mathrm{effect}_{\mathrm{student}}) + b_{\mathrm{occ}}\cdot \mathrm{occasion} + \epsilon\] Or more succinctly:

\[\mathrm{gpa} = b_{\mathrm{int\_student}} + b_{\mathrm{occ}}\cdot \mathrm{occasion} + \epsilon\]

In this way, we’ll have student-specific intercepts, as each person will have their own unique effect added to the overall intercept, resulting in a different intercept for each person.

\[b_{\mathrm{int\_student}} \sim \mathcal{N}(b_{\mathrm{intercept}}, \tau)\]

Now we see the intercepts as normally distributed with a mean of the overall intercept and some standard deviation. As such, this is often called a random intercepts model.
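
A quick numerical sketch of that statement: each student’s intercept is a draw from a normal distribution centered on the overall intercept. The parameter values are made up for illustration.

# student-specific intercepts centered on the overall intercept
set.seed(123)

b_intercept = 2.6    # made-up values for illustration
tau         = 0.25

b_int_student = rnorm(200, mean = b_intercept, sd = tau)
head(b_int_student)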

As a multi-level model

Another way of showing the mixed model is commonly seen in the multilevel modeling literature. It is shown more explicitly as a two part regression model, one at the observation level and one at the student level.

\[\mathrm{gpa} = b_{\mathrm{int\_student}} + b_{\mathrm{occ}}\cdot \mathrm{occasion} + \epsilon\]

\[b_{\mathrm{int\_student}} = b_{\mathrm{intercept}} + \mathrm{effect}_{\mathrm{student}}\]

However, after ‘plugging in’ the second level part to the first, it is identical to the previous depiction.

Note how we don’t have a student-specific effect for occasion. In this context, occasion is said to be a fixed effect only, and there is no random component. This definitely does not have to be the case though, as we’ll see later.

Application

Initial visualization

It always helps to look before we leap, so let’s do so. Here we plot GPA vs. occasion (i.e. semester) to get a sense of the variability in starting points and trends.


All student paths are shown faded, with a sample of 10 in bold. The overall trend, as estimated by the regression we’ll do later, is shown in red. Two things stand out. One is that students have a lot of variability when starting out. Secondly, while the general trend in GPA is upward over time, as we’d expect, individual students may vary in that trajectory.
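
A plot along these lines could be constructed roughly as follows; this is just a sketch, using the gpa data we load below and assuming student is stored as a factor.

library(ggplot2)

set.seed(123)
sampled = sample(levels(gpa$student), 10)

ggplot(gpa, aes(x = occasion, y = gpa, group = student)) +
  geom_line(alpha = .1) +                                # all paths, faded
  geom_line(data = subset(gpa, student %in% sampled)) +  # sample of 10 in bold
  geom_smooth(aes(group = 1), method = 'lm', se = FALSE, color = 'red')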

Standard regression

So let’s get started. First, we’ll look at a standard regression with only the time indicator as a covariate, which we’ll treat as numeric for simplicity. Note that I present a cleaner version of the summarized objects for the purposes of this document.

load('data/gpa.RData')
gpa_lm = lm(gpa ~ occasion, data = gpa)
## summary(gpa_lm)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     2.599       0.018    145.7         0
occasion        0.106       0.006    18.04         0

Fitting linear model: gpa ~ occasion

Observations  Residual Std. Error     R^2  Adjusted R^2
        1200               0.3487  0.2136        0.2129

The above tells us that starting out, i.e. when occasion is zero, the average GPA, denoted by the intercept, is 2.6. In addition, as we move from semester to semester, we can expect GPA to increase by about 0.11 points. This would be fine except that we are ignoring the clustering. A side effect of doing so is that our standard errors are biased, and thus claims about statistical significance based on them would be off. More importantly, we simply don’t get to explore the student effect, which would be of interest by itself.

Regression by cluster

An alternative approach we could take would be to run a separate regression for every student. However, there are many drawbacks to this: the results are not easily summarized when there are many groups, there is typically very little data within each cluster with which to do so (as in this case), and the models are over-contextualized, meaning they ignore what students have in common. We’ll compare such an approach to the mixed model later.
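
For reference, lme4 makes this per-cluster approach easy to sketch via its lmList function, though we won’t rely on it here.

library(lme4)

# a separate regression for each student: 200 sets of coefficients to sift through
gpa_lm_list = lmList(gpa ~ occasion | student, data = gpa)
head(coef(gpa_lm_list))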

Running a mixed model

Next we run a mixed model that will allow for a student-specific effect. Such a model is easily conducted in R, specifically with the package lme4. In the following, the code will look just like what you used for regression with lm, but with an additional component specifying the group, i.e. student, effect. The (1 | student) means that we are allowing the intercept, represented by 1, to vary by student. With the mixed model, we get the same results as the regression, but as we’ll see, we’ll have more to talk about.

library(lme4)
gpa_mixed = lmer(gpa ~ occasion + (1 | student), data = gpa)
## summary(gpa_mixed)
term       value     se        t  p_value  lower_2.5  upper_97.5
Intercept  2.599  0.022  119.800        0      2.557       2.642
occasion   0.106  0.004   26.096        0      0.098       0.114

group     effect     variance     sd
student   Intercept     0.064  0.252
Residual                0.058  0.241

First we see that the coefficients, or in this context the fixed effects, for the intercept and time are the same as we saw with the standard regression[4], as would be their interpretation. The standard errors, on the other hand, are different here, though in the end our conclusion would be the same as far as statistical significance goes. Note specifically that the standard error for the intercept has increased. Conceptually, allowing random intercepts per person lets us gain information about the individual, while recognizing the uncertainty regarding the overall average that we were underestimating before[5].

While we have coefficients and standard errors, you might have noticed that lme4 does not provide p-values! There are several reasons for this, namely that with mixed models we are essentially dealing with different sample sizes, the \(N_c\) within clusters, which may vary from cluster to cluster (and even be a single observation!), and \(N\) total observations, which puts us in kind of a fuzzy situation with regard to reference distributions, denominator degrees of freedom, and how to approximate a ‘best’ solution. Other programs provide p-values automatically as if there is no issue, and without telling you which approach they use to calculate them (there are several). Furthermore, those approximations may be very poor in some scenarios, or make assumptions that may not be appropriate for the situation[6].
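
If you do want p-values for this sort of model, one commonly used option is the lmerTest package, which refits the model and applies the Satterthwaite approximation for the denominator degrees of freedom; keep the above caveats in mind when using it.

library(lmerTest)   # its lmer masks lme4::lmer

gpa_mixed_p = lmer(gpa ~ occasion + (1 | student), data = gpa)
summary(gpa_mixed_p)   # now includes approximate p-values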

However, it’s more straightforward to get confidence intervals, and we can do so with lme4 as follows[7].

confint(gpa_mixed)
# output shown via mixedup::extract_vc; see footnote 7

                        group     effect     variance     sd  sd_2.5  sd_97.5  var_prop
sd_(Intercept)|student  student   Intercept     0.064  0.252   0.225    0.282     0.523
sigma                   Residual                0.058  0.241   0.231    0.252     0.477

Variance components

One thing that’s new compared to the standard regression output is the estimated standard deviation/variance of the student effect (\(\tau\)/\(\tau^2\) in our formula depiction from before). This tells us how much, on average, GPA bounces around as we move from student to student. In other words, even after making a prediction based on time point, each student has their own unique deviation, and that value (in terms of the standard deviation) is the estimated average deviation across students. Note that GPA varies due to the student (0.252) by more than double what it changes from one semester to the next (0.106). This is an important interpretive aspect not available to us with a standard regression model.

Another way to interpret the variance output is to note the percentage of the student variance out of the total, i.e. 0.064 / (0.064 + 0.058) = 52%. In this setting, this value is also called the intraclass correlation, because it is also an estimate of the within-cluster correlation, as we’ll see later.
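
We can compute this directly from the fitted model, for example via lme4’s VarCorr.

vc = as.data.frame(VarCorr(gpa_mixed))

# student variance over the total (student + residual) variance
vc$vcov[vc$grp == 'student'] / sum(vc$vcov)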

Estimates of the random effects

After running the model, we can actually get estimates of the student effects[8]. I show two ways for the first five students, both as the random effect and as the random intercept (i.e. intercept + random effect).

ranef(gpa_mixed)$student %>% head(5)

# showing mixedup::extract_random_effects(gpa_mixed)
group_var  effect     group   value     se  lower_2.5  upper_97.5
student    Intercept      1  -0.071  0.092     -0.251       0.109
student    Intercept      2  -0.216  0.092     -0.395      -0.036
student    Intercept      3   0.088  0.092     -0.091       0.268
student    Intercept      4  -0.187  0.092     -0.366      -0.007
student    Intercept      5   0.030  0.092     -0.149       0.210
coef(gpa_mixed)$student %>% head(5)
group_var  effect     group  value     se  lower_2.5  upper_97.5
student    Intercept      1  2.528  0.095      2.343       2.713
student    Intercept      2  2.383  0.095      2.198       2.568
student    Intercept      3  2.687  0.095      2.502       2.872
student    Intercept      4  2.412  0.095      2.227       2.597
student    Intercept      5  2.629  0.095      2.444       2.814

Note that we did not allow occasion to vary, so it is a constant, i.e. fixed, effect for all students.

Often, we are keenly interested in these effects, and want some sense of uncertainty regarding them. With lme4 this would typically be done via bootstrapping, specifically with the bootMer function within lme4. However, for some users this may be a more complex undertaking. The merTools package provides an easier way to get this with the predictInterval function[9]. Or you can go straight to plotting them.

library(merTools)

predictInterval(gpa_mixed)   # for various model predictions, possibly with new data

REsim(gpa_mixed)             # mean, median and sd of the random effect estimates

plotREsim(REsim(gpa_mixed))  # plot the interval estimates
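
For comparison, bootstrap intervals are also available through confint, which calls bootMer under the hood; this can take a while to run.

set.seed(123)
confint(gpa_mixed, method = 'boot', nsim = 500)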

The following plot shows the estimated random effect for each student along with its interval estimate (a modified version of the plot produced by that last line of code[10]). Recall that the random effects are normally distributed with a mean of zero, shown by the horizontal line. Intervals that do not include zero are in bold. In this case, such students start out relatively higher or lower compared to the typical student.

Prediction

Let’s now examine standard predictions vs. cluster-specific predictions. As with most R models, we can use the predict function on the model object.

predict(gpa_mixed, re.form=NA) %>% head()
       1        2        3        4        5        6 
2.599214 2.705529 2.811843 2.918157 3.024471 3.130786 

In the above code we specified not to use the random effects (re.form = NA), and as such, our predictions for the observations are pretty much what we’d get from the standard linear model.

predict_no_re = predict(gpa_mixed, re.form=NA)
predict_lm    = predict(gpa_lm)

But each person has their unique intercept, so let’s see how the predictions differ when we incorporate that information.

predict_with_re = predict(gpa_mixed)

Depending on the estimated student effect, students will start above or below the estimated intercept for all students. The following visualizes the unconditional prediction vs. the conditional prediction that incorporates the random intercept for the first two students.
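
As a sketch, that comparison can be pulled together as follows, assuming the dplyr and ggplot2 packages are available.

library(dplyr)
library(ggplot2)

gpa %>%
  mutate(
    lm_pred    = predict(gpa_mixed, re.form = NA),  # fixed effects only
    mixed_pred = predict(gpa_mixed)                 # adds the student intercept
  ) %>%
  filter(student %in% c(1, 2)) %>%
  ggplot(aes(x = occasion, y = gpa)) +
  geom_point() +
  geom_line(aes(y = lm_pred), linetype = 'dashed') +
  geom_line(aes(y = mixed_pred)) +
  facet_wrap(~ student)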


We can see that the predictions from the mixed model are shifted because of having a different intercept. For these students, the shift reflects their relatively poorer start.

Cluster Level Covariates

Note our depiction of a mixed model as a multilevel model.

\[\mathrm{gpa} = b_{\mathrm{int\_student}} + b_{\mathrm{occ}}\cdot \mathrm{occasion} + \epsilon\]

\[b_{\mathrm{int\_student}} = b_{\mathrm{intercept}} + \mathrm{effect}_{\mathrm{student}}\]

If we add a student level covariate, e.g. sex, to the model, we then have the following.

\[b_{\mathrm{int\_student}} = b_{\mathrm{intercept}} + b_{\mathrm{sex}}\cdot \mathrm{sex} + \mathrm{effect}_{\mathrm{student}}\]

Which, after plugging in, we still have the same model as before, just with an additional predictor.

\[\mathrm{gpa} = b_{\mathrm{intercept}} + b_{\mathrm{occ}}\cdot \mathrm{occasion}+ b_{\mathrm{sex}}\cdot \mathrm{sex} + (\mathrm{effect}_{\mathrm{student}} + \epsilon)\]

So in the end, adding cluster level covariates doesn’t have any unusual effect on how we think about the model[11]. We simply add them to our set of predictor variables. Note also, that we can create cluster level covariates as group means or some other summary of the observation level variables. This is especially common when the clusters represent geographical units and observations are people. For example, we might have income as a person level covariate, and use the median to represent the overall wealth of the geographical region.
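
For example, with the GPA data, the student level covariate sex enters the model formula like any other predictor.

gpa_mixed_sex = lmer(gpa ~ occasion + sex + (1 | student), data = gpa)
summary(gpa_mixed_sex)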

Summary of Mixed Model Basics

Mixed models allow us to take into account observed structure in the data. If this were all they were used for, we would have more accurate inference relative to what we would get if we ignored that structure. However, we get much more! We better understand the sources of variability in the target variable. We also get group-specific estimates of the parameters in the model, allowing us to understand exactly how the groups differ from one another. Furthermore, this in turn allows for group-specific prediction, and thus much more accurate prediction, assuming there is appreciable variance due to the clustering. In short, there is much to be gained by mixed models, even in the simplest of settings.

Exercises for Starting Out

Sleep

For this exercise, we’ll use the sleep study data from the lme4 package. The following describes it.

The average reaction time per day for subjects in a sleep deprivation study. On day 0 the subjects had their normal amount of sleep. Starting that night they were restricted to 3 hours of sleep per night. The observations represent the average reaction time (in milliseconds) on a series of tests given each day to each subject.

After loading the package, the data can be loaded as follows. I show the first few observations.

library(lme4)
data("sleepstudy")
Reaction  Days  Subject
 249.560     0      308
 258.705     1      308
 250.801     2      308
 321.440     3      308
 356.852     4      308
 414.690     5      308
  1. Run a regression with Reaction as the target variable and Days as the predictor.

  2. Run a mixed model with a random intercept for Subject.

  3. Interpret the variance components and fixed effects.
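
If you want to check your setup, the two model calls would look something like the following.

sleep_lm    = lm(Reaction ~ Days, data = sleepstudy)
sleep_mixed = lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)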

Adding the cluster-level covariate

Rerun the mixed model with the GPA data adding the cluster level covariate of sex, or high school GPA (highgpa), or both. Interpret all aspects of the results.

What happened to the student variance after adding cluster level covariates to the model?

Simulating a mixed model

The following represents a simple way to simulate a random intercepts model. Note what each object is, and make sure the code makes sense to you. Then run it.

set.seed(1234)  # this will allow you to exactly duplicate your result
Ngroups = 100
NperGroup = 3
N = Ngroups * NperGroup
groups = factor(rep(1:Ngroups, each = NperGroup))
u = rnorm(Ngroups, sd = .5)
e = rnorm(N, sd = .25)
x = rnorm(N)
y = 2 + .5 * x + u[groups] + e

d = data.frame(x, y, groups)

Which of the above represent the fixed and random effects? Now run the following.

model = lmer(y ~ x + (1|groups), data=d)

summary(model)

confint(model)


library(ggplot2)

ggplot(aes(x, y), data=d) +
  geom_point()

Do the results seem in keeping with what you expect?

In what follows we’ll change various aspects of the data, rerun the model after each change, then summarize and get confidence intervals as before. For each change, note specifically at least one thing that differs in the results.

  1. First calculate or simply eyeball the intraclass correlation coefficient \(\frac{\textrm{random effect variance}}{\textrm{residual + random effect variance}}\). In addition, create a density plot of the random effects as follows.

re = ranef(model)$groups[, '(Intercept)']   # extract the random effects as a vector
qplot(x = re, geom = 'density', xlim = c(-3, 3))

  2. Change the random effect variance/sd and/or the residual variance/sd and note your new estimate of the ICC, and plot the random effects as before.
  3. Reset the values to the original. Change Ngroups to 50. What differences do you see in the confidence interval estimates?
  4. Set Ngroups back to 100. Now change NperGroup to 10, and note again how the CI is different from the base condition.

  1. I actually like Richly Parameterized Linear Models, or Structured Additive Regression Models. Both are a mouthful, but at least the latter reduces to STARs.

  2. Actually, the simplest model would have no covariates at all, just variance components, with no correlations among the random effects. Such a model can be interesting to look at while exploring your data, but would rarely suffice on its own to tell the story you desire to tell.

  3. Note that I leave out the observation level subscript to keep things clean. I find that multilevel style notation quickly becomes unwieldy, and don’t wish to reproduce the problem. It also tends to add confusion for a lot of applied researchers starting out with mixed models.

  4. This will not always be the case, e.g. with unbalanced data, but they should be fairly close.

  5. The standard error for our time covariate went down due to our estimate of \(\sigma\) being lower for this model, and there being no additional variance due to cluster membership.

  6. Note that many common modeling situations involve a fuzzy p setting, especially penalized regression approaches such as mixed, additive, and ridge regression models. Rather than being a bad thing, this is usually a sign you’re doing something interesting, or handling complexity in an appropriate way.

  7. See ?confint.merMod for details and options. The output you see is based on my wrapper mixedup::extract_vc.

  8. These are sometimes referred to as BLUPs or EBLUPs, which stands for (empirical) best linear unbiased prediction. However, they are only BLUP for linear mixed effects models. As such, you will also see them referred to as conditional modes. Furthermore, in the Bayesian context, the effects are actually estimated as additional model parameters, rather than estimated/predicted after the fact.

  9. Note that while predictInterval does not quite incorporate all sources of uncertainty as bootMer does, it’s actually feasible for larger data sets, and on par with the Bayesian results (e.g. with rstanarm).

  10. Note that the default plot from merTools is confusingly labeled for a single random effect, because it unnecessarily adds a facet. You’ll understand it better by looking at the plot in the discussion of crossed random effects later. However, the one displayed is from my own package, visibly.

  11. This is why the multilevel depiction is sub-par, and leads many to confusion at times. You have a target variable and selected feature covariates to predict that target based on theory. Whether they are cluster level variables or whether there are interactions doesn’t have anything to do with the data structure as much as it does the theoretical motivations. However, if you choose to depict the model in multilevel fashion, the final model must adhere to the ‘plugged in’ result. So if, for example, you posit a cluster level variable for a random slope, you must include the implied interaction of the cluster level and observation level covariates.