Issues

This section discusses common issues, conundrums, and other things that might come up when implementing mixed models.

Variance Accounted For

People really love the R-squared value that comes from standard regression. Never mind that it is inherently biased, nor does it matter that there is no way to state what would be a ‘good’ result for a given data situation, nor that many actually don’t know how to interpret it, nor does it even matter that many have no qualms about dropping it from a report at the first sign of trouble.

Suffice it to say that when there are multiple sources of ‘variance,’ talking about variance accounted for is not straightforward. Still, many have tried. You might look at the r2glmm package and the references noted in the description, or the r2 function in performance. See also the GLMM FAQ. I would suggest not even bothering with it beyond the standard linear mixed model.

Similar issues would apply to the estimate of intraclass correlation, which has little meaning outside of a linear mixed model with no random slopes. Again, just because you can calculate something that looks like it, doesn’t mean that it actually means what you want it to mean.

Common Alternatives to Mixed Models

I have a document that goes into more detail about many approaches to dealing with clustered data, but we can briefly talk about some here. Common alternatives used in clustered data situations include:

Fixed effects models (also panel linear models with fixed, as opposed to random, effects)
Using cluster-robust standard errors
Generalized estimating equations (GEE)

The first two are commonly used by those trained with an econometrics perspective, while you might see GEE more with those of a biostatistics or other perspective. GEE are in fact a generalization of the cluster-robust approach, and extend generalized least squares (GLS) to nonlinear/GLM settings. GEE approaches allow one to take into account the dependency in the data, but not actually estimate what might be very interesting, i.e. the random effects and associated variance. There are also fewer tools for GEE in more complicated covariance structures beyond a single clustering variable.

The nature of fixed effects models allow you to control for, but not actually explore, cluster level effects. This makes them a non-starter for many investigations, as those are typically of prime theoretical interest. In addition, the main concern econometricians have that leads them to prefer such models is easily accommodated in standard mixed models, so there is very little reason to employ them, as they are special cases of the more flexible mixed model.

Growth curve models

With longitudinal data, growth curve models are a latent variable approach that is commonly used in these situations. With appropriate setup, they will duplicate the results of a mixed model. In my opinion, there are few reasons to use a growth curve approach over a mixed model, and many reasons not to, not least of which is that effects, which would be simple to interpret in the mixed model approach, are now a source of confusion to applied modelers in the growth curve model, even though it’s the same thing. Furthermore, indirect effects, growth mixture models and other extensions common in the latent variable approach are more easily implemented in the mixed model approach. In short, only the most complicated models would perhaps require using structural equation modeling, but as such, would also bring with them many other issues. See more here and here. The supplemental section has a full demonstration of how to equate these models.

Sample Sizes

Small number of clusters

Think about how many values of some variable you’d need before you felt comfortable with statistics based on it, especially standard deviation/variance. That’s at play with mixed models, in the sense you’d like to have enough groups to adequately assess the variance components. Mixed models will run with very small numbers, though the estimates will generally be biased. I have a demo here if interested. One way to deal with this is to move to the Bayesian context, which can be used to induce some regularization in parameter estimates, and better deal with possible variance estimates that are near zero.

This also speaks to the issue some will have regarding whether they should treat something as a fixed vs. random effect. Historical definitions would unnecessarily restrict usage of random effects approaches. For example, random effects were defined to be a (random) sample from some population. If this were the case, some might take issue when your levels do not deal with a sample, but the whole population, as in the case where your cluster is state and you have all 50 states. This doesn’t matter. If you have enough levels to consider a mixed model approach, feel free to do so. In the end these are all just different forms of regularized/penalized models.

Small number of observations within clusters

Mixed models work even with no more than two in each cluster and some singletons. Even in the simple case of pre-post design, mixed models are entirely applicable, though limited (e.g. you can’t have random slopes with just pre-post). So whenever you have clustering of some kind, you should consider mixed models.

Balanced/Missing values

We’ve primarily been looking at balanced data, where each cluster has the same number of observations within them. There is no requirement for this, and in many cases we wouldn’t even expect it, e.g. people within geographical units.

However, if data is only missing on the outcome, or a mix of variables, we essentially have the same issue as with typical data situations, and will have the same considerations for dealing with missingness. If you don’t lose much data, the practical gain by ignoring missingness generally outweighs the complexities that can come with, for example, multiple imputation²⁵, even in the best of settings. By default, mixed models assume missing at random (MAR). On the other hand, longitudinal data has special considerations, as there is typically increasing dropout over time.

Having dealt with missingness in a variety of contexts with different approaches (FIML, MI, Bayesian), the end result is usually that you spend vast amounts more time dealing with the missing data than you do understanding your primary models of interest, and yet don’t feel any better about the results. Unless the missingness would make you lose a large chunk of the data, and/or you actually know something about the underlying mechanism attributing to missing data, it’s probably best just to leave that to the limitations section of your report²⁶. If you do deal with it, under less than ideal circumstances, I’d perhaps suggest an approach that is essentially a one-off imputation (e.g. with missForest) to be compared to the data that ignores the missingness, but still allows you to do everything you want. While you may not incorporate all sources of uncertainty in doing so, it seems to me a viable compromise.

Big data

Mixed model packages are often not so great with largish data, e.g. thousands, coupled with anything beyond random intercepts. However, I’ve used lme4 with millions and simple structure, and 100s of thousands with complicated structure, and it does very well (at least for the gaussian case). For truly big data you’re not going to have a lot of options though, but you’d need a lot before you’d need to worry. I did some testing of lme4, mgcv, and glmmTMB, for up to a million cases with two random effects, and generally even then you may only have to wait seconds to a couple minutes. In applied work where I was using Medicare data with of hundreds of thousands of observations, and at least two random effects with thousands of levels each, those packages were still viable.

Common techniques in machine learning have no special implementation for the inclusion of something like random effects. There has been some effort with trees/forests here (mebt specifically), here, and also vcrpart). However, most approaches in the ML world will simply throw the clustering variable in along with everything else, possibly even as a lower dimensional word embedding, or have enough data to do by-cluster approaches. While this may be fine predictively, it may not be theoretically interesting to those doing mixed models.

On the plus side, if you’re willing to wait, tools like the Stan family will likely do just fine with bigger data as well, eventually. So while massive data may still be problematic, you may be fine with very large data.

Model Comparison

Model comparison takes place in the usual way in the sense of potentially having statistical tests and information criteria. Unfortunately, the typical likelihood ratio tests one might use in standard settings are not so straightforward here. For example, in order to compare models with different fixed effects, at a minimum you’d have to change the default estimation from REML to ML, and the models must have the same random effects structure, for the resulting test p-value to be correct. It works the other way to compare models with different random effects structure (with fixed effects the same across models), and you’d have to use REML.

# to compare fixed effects refit with ML
gpa_1 =  lmer(gpa ~ occasion + (1 + occasion | student), data = gpa)
gpa_2 =  lmer(gpa ~ occasion + sex + (1 + occasion | student), data = gpa)
gpa_3 =  lmer(gpa ~ occasion + (1 | student), data = gpa)
gpa_4 =  lmer(gpa ~ occasion + (1 + occasion | student), data = gpa)

anova(gpa_1, gpa_2, refit = TRUE)
anova(gpa_3, gpa_4, refit = FALSE)

	npar	AIC	BIC	logLik	deviance	Chisq	Df	p-value
gpa_1	6	258.234	288.775	-123.117	246.234
gpa_2	7	250.068	285.699	-118.034	236.068	10.166	1	0.001

	npar	AIC	BIC	logLik	deviance	Chisq	Df	p-value
gpa_3	4	416.893	437.253	-204.446	408.893
gpa_4	6	272.957	303.497	-130.478	260.957	147.936	2	0

One can see that the estimated log likelihood is not the same for gpa_1 (ML) and gpa_4 (REML), even though they are otherwise the same model.

In my opinion, model selection involves considerations of theory, parsimony, and prediction, and those tests do not. I’m not a fan of such tests even in the standard setting, and would use AIC here to potentially aid (not make) a model choice if I thought it was necessary, as I would there²⁷. We can just use the AIC function, but see also the cAIC4 package for a conditional AIC that works specifically for merMod objects. In the following, gpa_2 has the lowest (best) AIC value, and given that both it and gpa_1 are notably superior to gpa_3, one could conclude that having the random coefficient for occasion is useful.

model	AIC	Δ AIC
gpa_2	269.785	0.000
gpa_1	272.957	3.171
gpa_3	416.893	147.108

In general though, trying to determine a ‘best’ model with one set of data is a problematic endeavor at best, and at worst, completely misguided. I think it’s very useful to build models of increasing complexity, and select one to focus on based on the available evidence to simplify exposition. Just don’t get hung up on choosing one based solely on the outcome of a single statistic. If you have a lot of data, you should consider some sort of explicit validation approach if you really want to compare competing models, but that is not without complication given the dependency in the data. If your goal is more on the predictive side, consider model averaging rather than selecting a single model.

Convergence

Data is as data does. It is likely that you will eventually have issues in conducting mixed models, such as lack of convergence, estimates of zero for the random effects variance, warnings about scaling variables etc. These are not easy models to estimate (at least outside of the Bayesian context), so don’t be surprised if all doesn’t go smoothly.

I have a more detailed overview, but we can talk briefly about typical problems and solutions. A few common issues²⁸ I see are:

Lack of centering/scaling will often result in scaling warnings for lme4. It makes sense to do it anyway, so the same goes for mixed models as any other. A very common convergence problem in my experience results from time-based variables, such as incorporating a yearly trend. At the very least, it’s useful to have a meaningful zero value, for example, by starting the time at zero, rather than say, year 2008. You may need to scale the trend variable also to get the model to converge, but can revert back to raw form for interpretation.
The default optimizer options may need to be tweaked, for example, to allow for more iterations. If possible, you may need to change to a different optimizer. I have a function (mixedup::converge_it) that will simply rerun the model for more iterations until convergence, and this is often all that needs to be done.
Zero estimates for the random effect variance, or \(\pm 1\) estimates for correlation of intercepts and slopes, often can be attributed to not having enough data, not having enough clusters, or an overlooked data issue, along with possible scaling. You won’t have this issue in the Bayesian context, but in others, you may have to deal with the dependency in some other fashion (e.g. cluster-robust standard errors/GEE).
Any complicated GLMM or similar model is likely to have problems, so be prepared. If you want to go beyond GLM, you’ll have fewer tools and likely more issues. There are packages like ordinal, mgcv, glmmTMB, and others that can potentially handle alternate distributions and other complexities, however I think one might be better off with a Bayesian approach (e.g. brms/rstan). In practice, I’ve found non-Bayesian approaches to be prohibitively slow, unable to converge, or too limited in post-estimation options.

If you’re using lme4 specifically you have a couple resources and tools at your disposal.

Start with ?convergence. This comes up so often that there is a help file just for convergence issues.
Use the allFit function. If the results are very similar across fits you can feel better about them.
Consult the troubleshooting section of the FAQ.

The main point is that you’ll need to acknowledge the warnings and messages that the packages provide, not ignore them, and be prepared to take necessary steps to deal with these issues when they arise.

Multiple imputation is straightforward only in theory. In practice it becomes a major pain to go very far beyond getting the parameter estimates for simple models. Full information maximum likelihood (FIML) is little implemented outside of SEM software/packages, and more problematic in its assumptions.↩︎
Just note that in some academic disciplines, reviewers, who will rarely do this themselves for the same reasons, nevertheless will make a big deal about the missing data because it’s an easy point for them to make. This is similar to econometrically trained reviewers who shout ‘endogeneity!’ at every turn, but won’t bother to tell you where to get the instrument in the first place, admit that IV analysis is problematic in its own right, or what specifically the approach should be in complex model settings such as mixed models with nonlinear, spatial effects, multiple levels of clustering etc. I’ve even seen a reviewer say that one ‘should conduct a test for whether data is missing at random vs. not, then proceed accordingly.’ That was the entirety of their suggestion. Aside from the fact that there technically is no such test, because it would require the very data one doesn’t have, declaring a type of missingness doesn’t tell you which of dozens of approaches one should take.↩︎
If you really want to go the statistical test route, see the lmertest package for additional functionality. Note also that AIC does not come with a free lunch, and as mentioned, see the cAIC4 package and references therein.↩︎
The Richly Parameterized Linear Models text discusses convergence and other issues probably more than any other text I’ve come across. Most don’t treat the subject at all.↩︎