More Random Effects
Previously we’ve looked at random intercepts, but any observation or lower level covariate effect could be allowed to vary by cluster as well. This is identical in concept to an interaction, where the effect of some feature is allowed to vary by the levels/values of another.
Returning to the GPA data, recall the visualization from before. I’ve animated it to highlight the differences among students. We can see that while there is general increase over time, some are relatively flat or otherwise take a not so straightforward path. How might we begin to capture this sort of phenomenon?
To do so, let’s now assume that the trend over time is allowed to vary by student, in addition to the intercepts. Using lme4, this is quite straightforward.
= lmer(gpa ~ occasion + (1 + occasion | student), data = gpa) gpa_mixed summary(gpa_mixed)
Pretty easy huh? Within the parenthesis, to the left of that bar
| we are just positing a model formula as we would do with most modeling functions12. Let’s look at the results.
As before, since we have 0 as our starting semester, the intercept tells us what the average GPA is in the first semester. The coefficient for occasion still reflects a one semester change in GPA. As we have not changed the fixed effect portion of the model, the values are the same as before.
The associated intercept variance tells us how much that starting GPA bounces around from student to student. The variance for the occasion effect might not look like much in comparison, but slopes are on a notably different scale than the intercept. Note that the mean slope for the semester to semester effect, our fixed effect, is 0.11, but from student to student it bounces around half that. So we could expect most students to fall somewhere between a flat effect of zero to more than double the population average13.
Yet another point of interest is the correlation of the intercepts and slopes. In this case it’s -0.1. That’s pretty small, but the interpretation is the same as with any correlation. In this case specifically, it tells us that those with lower intercepts would be associated with increased time trajectories. This makes intuitive sense in that people are improving in general, and those at the bottom would have more room to improve. However, this is very slight, and practically speaking we might not put too much weight on it.
Comparison to many regressions
Let’s compare these results to the ones we would have gotten had we run a separate regression for each student. In what follows we see the distribution of of the estimated intercept and slope coefficients for all the students. Note that these are the same estimates one would have gotten with a fixed effects model with an occasion by student interaction14.
Here we can see that the mixed model intercepts are generally not as extreme, i.e. the tails of the distribution have been pulled toward the overall effect. Same goes for the slopes. In both cases the mixed model shrinks what would have been the by-group estimate, which would otherwise overfit in this scenario. This regularizing effect is yet another bonus when using mixed models15. It comes into play largely when we have few observations per group and less estimated variance for the random effects. In other words, when there is little information in a group, or less group-level variance relative to the observation variance, then the mixed model will produce a group-specific effect that is closer to the overall population effect. In that sense, the mixed model group coefficients better reflect our ignorance. Conversely, with more pronounced group effects, our uncertainty about the overall effect increases.
The following shows what would happen to similar data under a variety of settings with simulated data that is based on the results of the GPA model we had above. On the far left is the plot just shown, our current data setting. Then there are four settings to go along with the original results. The first shows what would happen had we taken many more measurements per student. In the next, we add to the intercept and slope variance, and decrease the residual variance, but keep the sample size the same as the original data. In both cases we have a less regularizing effect of the mixed model. The random coefficients are very similar to the separate regressions results. Then, we keep the data the same but where we only have 4 observations per student, which shows more variability in the per-student results, and with it relatively more shrinkage with the mixed model. Finally, we add to the number of occasions per student (10), but have dropout over time, and so have roughly the same amount of data, but which is imbalanced. For more on this see topic, see my post here.
Visualization of effects
Now let’s compare our predictions visually. First there is the linear regression fit. We assume the same starting point and trend for everyone. If we add the conditional predictions that include the subject specific effects from the mixed model, we now can also make subject specific predictions, greatly enhancing the practical use of the model.
In contrast, the by-group approach is more noisy due to treating everyone independently. Many more students are expected to have downward or flat trends relative to the mixed model. The mixed model meanwhile only had 3 trends estimated to be negative.
Sometimes we may want to assess the effects of a categorical feature across the levels of our grouping variable. For our GPA example, this might include something like the binary indicator of whether they took any higher level/AP/Honor courses that semester. Another would be their working status, such as part-time, full-time, volunteer or not working. For many students, this sort of thing could vary from one semester to the next.
For binary features, nothing changes from what we’ve demonstrated so far. We allow that coefficient to vary across groups, just as we would the single coefficient from a numeric predictor. But what about the case where we have additional categories?
As a fixed effect, a categorical feature would have k-1 coefficients with default dummy coding, which represent the effect of changing from a reference group, whose mean is represented via the intercept, to the given category. If we allow that effect to be random, then we would have separate k-1 slopes to vary by our structured levels (along with our random intercept), resulting in multiple random coefficients, along with their correlation.
Adding these additional random slopes would obviously complicate our model, but perhaps not too much. On the practical side though, it can often lead to convergence problems. But there is a silver lining in that case. Since it is a categorical feature, we can just treat it like we would a typical random effect. The one difference is that, in the general case of random coefficients, just as we are interested in the interaction of the effect with structure already present, we will want to specify our model to capture this.
For our example we will run three models:
- Model 1: The way our intuition would suggest based on what we’ve seen so far
- Model 2: An alternate way to code model 1, but which puts our random effects on similar scales
- Model 3: A different, but identically conceptual way to get at our effect via an interaction random effect.
We demonstrate with the pupils data set16. We predict student achievement with a focus on an ordered socioeconomic indicator. For simplification, I’ve changed it to have only three categories of low, medium, and high (not shown).
= lmer( ach_cat_re_1 ~ ses + (1 + ses|primary_school_id), achievement data = pupils_demo ) = lmer( ach_cat_re_2 ~ ses + (0 + ses|primary_school_id), achievement data = pupils_demo ) = lmer( ach_cat_re_3 ~ ses + (1|primary_school_id) + (1|primary_school_id:ses), achievement data = pupils_demo )
If we look at the model summaries, we’ll see some differences. The first estimates the random intercept, the random coefficients for the k-1 effects for
ses, and their correlation.
summarize_model(ach_cat_re_1, ci = FALSE, cor_re = TRUE)
Group Effect Variance SD Var_prop primary_school_id Intercept 0.17 0.41 0.18 primary_school_id sesmid 0.05 0.22 0.05 primary_school_id seshigh 0.20 0.45 0.21 Residual 0.53 0.73 0.56
Correlation of Random Effects:
Intercept sesmid seshigh Intercept 1.00 -0.35 0.08 sesmid -0.35 1.00 0.91 seshigh 0.08 0.91 1.00
Term Value SE t P_value Lower_2.5 Upper_97.5 Intercept 6.15 0.07 84.83 0.00 6.01 6.29 sesmid 0.25 0.06 4.09 0.00 0.13 0.37 seshigh 0.53 0.10 5.28 0.00 0.33 0.73
The second has the the
ses groups in ‘intercept form’ but is an identical model, just differently parameterized, so that, while the
seslow variance is equivalent to the
Intercept variance of model 1, the other two are on different scales.
summarize_model(ach_cat_re_2, ci = FALSE, cor_re = TRUE)
Group Effect Variance SD Var_prop primary_school_id seslow 0.17 0.41 0.14 primary_school_id sesmid 0.15 0.39 0.12 primary_school_id seshigh 0.40 0.63 0.32 Residual 0.53 0.73 0.42
Correlation of Random Effects:
seslow sesmid seshigh seslow 1.00 0.85 0.71 sesmid 0.85 1.00 0.97 seshigh 0.71 0.97 1.00
Term Value SE t P_value Lower_2.5 Upper_97.5 Intercept 6.15 0.07 84.82 0.00 6.01 6.29 sesmid 0.25 0.06 4.09 0.00 0.13 0.37 seshigh 0.53 0.10 5.28 0.00 0.33 0.73
The third is conceptually identical to the others, and in spirit is the same as model 2. The primary difference is that it removes the correlations among the
ses effects. Note though that they all have a similar ‘intercept’ variance estimate.
summarize_model(ach_cat_re_3, ci = FALSE, cor_re = TRUE) # no cor_re to show
Group Effect Variance SD Var_prop primary_school_id:ses Intercept 0.01 0.12 0.02 primary_school_id Intercept 0.17 0.42 0.24 Residual 0.54 0.74 0.74
Correlation of Random Effects:
Intercept Intercept 1.00
Intercept Intercept 1.00
Term Value SE t P_value Lower_2.5 Upper_97.5 Intercept 6.15 0.08 81.63 0.00 6.00 6.30 sesmid 0.25 0.06 4.26 0.00 0.14 0.37 seshigh 0.51 0.08 6.33 0.00 0.35 0.67
So which model should we choose? We can go about this in a typical fashion, e.g. via theory, AIC, or other considerations. The following table shows AIC differences among other things. We can definitely see the first two are identical, but they estimate different numbers of parameters and random effects relative to the third.
The difference in the number of parameters shown (
K) regards the number of estimated random effect variances and covariances. We can see the initial model ‘wins’ with the lowest AIC, and we can see that the number of random effects is different. There 50 primary schools. In model 1 we have the intercept and two other fixed effects (
seshigh) for each group for 50 * 3 total random effects. In the second our effect is the intercept form, each ses random intercept estimated for every school, i.e. three intercepts for 50 * 3 random effects. For model 3, we estimate both the primary school and the interaction of it with ses, but this is not exactly a balanced data set, which would have 50 * 4 random effects, so only effects for each school-ses combination that is present in the data are estimated (two schools did not have all ses groups).
For the actual random effect differences, let’s inspect them17. Each of the following looks at the
ses effect for primary school 10. The first two models show identical random effects, though the first shows the difference of the two groups relative to the first, as that is the coefficient/slope that’s being estimated.
For our final model we do get different estimates. The variance estimates we saw previously for this model suggested very little variance for the ses within school effect, so these estimated effects are not very large.
Practically speaking, would there be differences in our predictions? Results suggest not much, at least for this school. However, the slight differences do add up across all the data, hence our AIC favoring the model(s) with estimated random effect correlations.
This would definitely not always be the case. And in other situations, it may be those models may not converge, where the form of our model 3 should have a much easier time. At least now you know how to go about dealing with these situations, but for more examples and detail, see my post here.
Summary of Random Slopes
At this point it might be clearer why some would call these richly parameterized linear models. Relative to a standard regression we get extra variance parameters that add to our understanding of the sources of uncertainty in the model, we can get the subjects specific effects, and their correlation, and use that information to obtain far better predictions. What’s not to like?
Exercises for Random Slopes
Run the sleep study model with random coefficient for the Days effect, and interpret the results. What is the correlation between the intercept and Days random effects? Use the ranef and coef functions on the model you’ve created to inspect the individual-specific effects. What do you see?
In the following, replace model with the name of your model object. Run each line, inspecting the result of each as you go along.
= ranef(model)$Subject re = fixef(model) fe apply(re, 1, function(x) x + fe) %>% t()
The above code adds the fixed effects to each row of the random effects (the t just transposes the result). What is the result compared to what you saw before?
The following shows a simplified way to simulate some random slopes, but otherwise is the same as the simulation before. Go ahead and run the code.
set.seed(1234) # this will allow you to exactly duplicate your result = 50 Ngroups = 3 NperGroup = Ngroups * NperGroup N = factor(rep(1:Ngroups, each = NperGroup)) groups = rnorm(Ngroups, sd = .75) re_int = rnorm(Ngroups, sd = .25) re_slope = rnorm(N, sd = .25) e = rnorm(N) x = (2 + re_int[groups]) + (.5 + re_slope[groups]) * x + e y = data.frame(x, y, groups)d
This next bit of code shows a way to run a mixed model while specifying that there is no correlation between intercepts and slopes. There is generally no reason to do this unless the study design warrants it18, but you could do it as a step in the model-building process, such that you fit a model with no correlation, then one with it.
= lmer(y ~ x + (1|groups), data = d) model_ints_only = lmer(y ~ x + (1|groups) + (0 + x|groups), data = d) model_with_slopes summary(model_with_slopes) confint(model_with_slopes) library(ggplot2) ggplot(aes(x, y), data = d) + geom_point()
Compare model fit using the AIC function, e.g.
AIC(model). The model with the lower AIC is the better model, so which would you choose?
Technically the intercept is assumed but you should keep it for clarity.↩︎
In case it’s not clear, I’m using the fact that we assume a normal distribution for the random effect of occasion. A quick rule of thumb for a normal distribution is that 95% falls between \(\pm\) 2 standard deviations of the mean.↩︎
This is just one issue with a fixed effects approach. You would have to estimate 400 parameters, but without anything (inherently) to guard against overfitting. The so-called fixed effects model from the econometrics perspective gets around this by demeaning variables that vary within groups, i.e. subtracting the per group mean. This is also equivalent to a model adding a dummy variable for the groups, though it’s a computationally more viable model to fit, as one no longer includes the grouping variable in the model (not really a concern with data FE models are actually applied to and it being what year it is). But while it controls for group level effects, we still cannot estimate them. Traditional approaches to fixed effects models also do not have any way to incorporate group-specific slopes, except perhaps by way of an interaction of a covariate with the cluster, which brings you back to square one of having to estimate a lot of parameters. For more about FE models and their issues, see my document on clustered data approaches, and Bell et al. (2016). Fixed and Random effects: making an informed choice.↩︎
This phenomenon is also sometimes referred to as partial pooling. This idea of pooling is as in ‘pooling resources’ or borrowing strength. You have complete pooling, which would be the standard regression model case of ignoring the clusters, i.e. all cluster effects are assumed to be the same. With no pooling, we assumes the clusters have nothing in common, i.e. the separate regressions approach. Partial pooling is seen in the mixed model scenario, where the similarity among the clusters is estimated in some fashion, and data for all observations informs the estimates for each cluster. I’ve never really liked the ‘pooling’ terminology, as regularization/shrinkage is a more broad concept that applies beyond mixed models, and I’d prefer to stick to that. In any case, if interested in more detail, see practically anything Andrew Gelman has written on it, and the pool-no-pool document here.↩︎
These tables are produced by
I personally have not come across a situation where I’d do this in practice. Even if the simpler model with no correlation was a slightly better fit, there isn’t much to be gained by it.↩︎