Note that I have converted these from their original SPSS format to R data.frames saved within RData files. I also cleaned them up with better names/labels etc. For data sets used in that text, most of the description is taken from Joop Hox’s text appendix (‘Data Stories’).
GPA: The GPA data are a longitudinal data set in which 200 college students have been followed for 6 consecutive semesters. In this data set, there are GPA measures on 6 consecutive occasions, with a JOB status variable (how many hours worked) for the same 6 occasions. There are two student-level explanatory variables: the sex (1 = male, 2 = female) and the high school GPA. There is also a dichotomous student-level outcome variable, which indicates whether a student has been admitted to the university of their choice. Since not every student applies to a university, this variable has many missing values.
pupils: Assume that we have data from 1000 pupils who have attended 100 different primary schools, and subsequently went on to 30 secondary schools. Similar to the situation where we have pupils within schools and neighborhoods, we have a cross-classified structure. Pupils are nested within primary and within secondary schools, with primary and secondary schools crossed. In other words: pupils are nested within the cross-classification of primary and secondary schools. In our example, we have a response variable achievement which is measured in secondary school. We have two explanatory variables at the pupil level: pupil sex (0 = male, 1 = female) and a six-point scale for pupil socioeconomic status, pupil SES. We have at the school level a dichotomous variable that indicates if the school is public (denom = 0) or denominational (denom = 1). Since we have both primary and secondary schools, we have two such variables (named pdenom for the primary school and sdenom for the secondary school).
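A quick way to see the difference between a nested and a cross-classified structure is to count how many secondary schools each primary school feeds into. The following is a minimal sketch on simulated data, not the actual pupils file; the column names `primary_school_id` and `secondary_school_id` and the random assignment of pupils to schools are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated stand-in for the pupils data: 1000 pupils, 100 primary schools,
# 30 secondary schools. Column names are hypothetical.
n_pupils = 1000
pupils = pd.DataFrame({
    "primary_school_id": rng.integers(0, 100, size=n_pupils),
    "secondary_school_id": rng.integers(0, 30, size=n_pupils),
})

# Number of distinct secondary schools attended by pupils of each primary school.
# Under strict nesting this would be 1 for every primary school.
n_secondary = pupils.groupby("primary_school_id")["secondary_school_id"].nunique()

is_nested = bool((n_secondary == 1).all())
print("primary schools nested within secondary schools:", is_nested)
print(n_secondary.describe())
```

If the count exceeds 1 for at least some primary schools, the two school factors are crossed rather than nested, and each needs its own random effect.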
nurses: The data in this example are from a hypothetical study on stress in hospitals. The data are from nurses working in wards nested within hospitals. In each of 25 hospitals, four wards are selected and randomly assigned to an experimental and control condition. In the experimental condition, a training program is offered to all nurses to cope with job-related stress. After the program is completed, a sample of about 10 nurses from each ward is given a test that measures job-related stress. Additional variables are: nurse age (years), nurse experience (years), nurse sex (0 = male, 1 = female), type of ward (0 = general care, 1 = special care), and hospital size (0 = small, 1 = medium, 2 = large). This is an example of an experiment where the experimental intervention is carried out at the group level. In biomedical research this design is known as a cluster randomized trial. Such designs are also quite common in educational and organizational research, where entire classes or schools are assigned to experimental and control conditions. Since the design variable Experimental versus Control group (ExpCon) is manipulated at the second (ward) level, we can study whether the experimental effect is different in different hospitals, by defining the regression coefficient for the ExpCon variable as random at the hospital level. In this example, the variable ExpCon is of main interest, and the other variables are covariates. Their function is to control for differences between the groups, which should be small given that randomization is used, and to explain variance in the outcome variable stress. To the extent that they are successful in explaining variance, the power of the test for the effect of ExpCon will be increased.
Therefore, although logically we can test if explanatory variables at the first level have random coefficients at the second or third level, and if explanatory variables at the second level have random coefficients at the third level, these possibilities are not pursued. We do test a model with a random coefficient for ExpCon at the third level, where there turns out to be significant slope variation. This varying slope can be predicted by adding a cross-level interaction between the variables ExpCon and HospSize. In view of this interaction, the variables ExpCon and HospSize have been centered on their overall mean.
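The centering and cross-level interaction described above can be sketched as follows. This uses simulated data rather than the nurses file; the column names (`expcon`, `hospsize`, and the derived `_c` columns) are hypothetical, and the split of two experimental and two control wards per hospital is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

n_hospitals, wards_per_hospital, nurses_per_ward = 25, 4, 10

# Ward-level design: within each hospital, two wards are randomly assigned
# to each condition (the 2/2 split is an assumption).
wards = pd.DataFrame({
    "hospital": np.repeat(np.arange(n_hospitals), wards_per_hospital),
    "ward": np.tile(np.arange(wards_per_hospital), n_hospitals),
    "expcon": np.concatenate(
        [rng.permutation([0, 0, 1, 1]) for _ in range(n_hospitals)]
    ),
})

# Hospital-level covariate (0 = small, 1 = medium, 2 = large), copied to each ward
hospsize = rng.integers(0, 3, size=n_hospitals)
wards["hospsize"] = hospsize[wards["hospital"].to_numpy()]

# Expand to nurse-level rows: 10 nurses per ward
nurses = wards.loc[wards.index.repeat(nurses_per_ward)].reset_index(drop=True)

# Center both variables on their overall (grand) mean
for col in ["expcon", "hospsize"]:
    nurses[col + "_c"] = nurses[col] - nurses[col].mean()

# Cross-level interaction: ward-level treatment by hospital-level size
nurses["expcon_x_hospsize"] = nurses["expcon_c"] * nurses["hospsize_c"]

print(nurses[["expcon_c", "hospsize_c", "expcon_x_hospsize"]].describe())
```

With both variables centered, the coefficient for each one in the interaction model is its effect at the average value of the other, which keeps the main effects interpretable.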
sociometric: The sociometric data are intended to demonstrate a data structure where the cross-classification is at the lowest level, with an added group structure because there are several groups. The story is that in small groups all members are asked to rate each other on a scale of 1-9, where higher numbers indicate a more positive view of the individual (i.e. how much they would like to share some activity with the rated person). Each record is defined by the sender–receiver pairs, with explanatory variables age and sex defined separately for the sender and the receiver. The group variable ‘group size’ is added to this file. There are 20 groups, with sizes ranging from 4 to 11.
speed dating: Data involve a speed dating experiment on a sample of a few hundred students in graduate and professional schools at Columbia University. In the speed dating events, the experiment randomly assigned each participant to ten short dates (four minutes) with participants of the opposite sex. For each date, each person rated six attributes (attractive, sincere, intelligent, fun, ambitious, shared interests) of the other person on a 10-point scale and wrote down whether he or she would like to see the other person again. The data have been filtered to remove constant responders, i.e. those who always or never wanted to see their partner again, those six attributes have scaled versions with mean 0 and standard deviation 1 (suffix _sc), column names have been made useful, and only some of the variables have been kept for the demo. If you run the same model, you can get estimates very similar to those in the Fahrmeir et al. text, table 7.4 (though there is a typo for the Male effect).
- The data come from Gelman’s website but are based on this article.
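The two processing steps mentioned for the speed dating data, dropping constant responders and adding standardized `_sc` columns, might look something like the following. This is a sketch on simulated data; the column names `rater_id`, `decision`, `attractive`, and `fun` are hypothetical, and whether the original scaling was done before or after filtering is not stated, so scaling after filtering is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Simulated stand-in: 50 raters with 10 dates each (column names hypothetical)
n_raters, n_dates = 50, 10
n_rows = n_raters * n_dates
dating = pd.DataFrame({
    "rater_id": np.repeat(np.arange(n_raters), n_dates),
    "decision": rng.integers(0, 2, size=n_rows),
    "attractive": rng.integers(1, 11, size=n_rows).astype(float),
    "fun": rng.integers(1, 11, size=n_rows).astype(float),
})

# Force two raters to be constant responders for the demonstration
dating.loc[dating["rater_id"] == 0, "decision"] = 1  # always yes
dating.loc[dating["rater_id"] == 1, "decision"] = 0  # always no

# Keep only raters whose decisions vary: their yes-rate is strictly between 0 and 1
yes_rate = dating.groupby("rater_id")["decision"].transform("mean")
filtered = dating[(yes_rate > 0) & (yes_rate < 1)].copy()

# Add scaled versions (mean 0, standard deviation 1) with an _sc suffix
for col in ["attractive", "fun"]:
    filtered[col + "_sc"] = (filtered[col] - filtered[col].mean()) / filtered[col].std()

print(filtered.filter(like="_sc").agg(["mean", "std"]).round(3))
```

Constant responders contribute no within-person information about the decision, which is why they are removed before fitting a model with person-level random effects.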
patents: The European Patent Office is able to protect a patent from competition for a certain period of time. The Patent Office has the task of examining inventions and granting a patent if certain prerequisites are fulfilled. The most important requirement is that the invention is something truly new. Even though the office examines each patent carefully, in about 80% of cases competitors raise an objection against already assigned patents. In the economic literature the analysis of patent opposition plays an important role as it allows one to (indirectly) investigate a number of economic questions. For instance, the frequency of patent opposition can be used as an indicator for the intensity of the competition in different market segments.
The primary variables are:
- opp: Patent opposition (1 = yes, 0 = no)
- biopharm: Patent from biotech/pharma sector
- ustwin: US twin patent exists
- patus: Patent holder from the USA
- patgsgr: Patent holder from Germany, Switzerland, or Great Britain
- ncit: Number of citations for the patent
- ncountry: Number of designated states for the patent
- nclaims: Number of claims

Centered and other transformed variables are also present.
In order to analyze objections against patents, a data set with 4,866 patents from the sectors biotechnology/pharmaceutics and semiconductor/computer was collected. The goal of one analysis is to model the probability of patent opposition, while using a variety of explanatory variables for the binary response variable patent opposition (yes/no). This corresponds to a regression problem with a binary response. In other cases you might wish to analyze the number of citations.
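As a rough illustration of the binary-response setup, here is a minimal logistic regression sketch using statsmodels. The data are simulated, not the patents file; only the variable names `opp`, `biopharm`, `ustwin`, and `ncountry` follow the list above, and the coefficients used to generate the outcome are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Simulated stand-in for the patents data (values are invented)
n = 2000
patents = pd.DataFrame({
    "biopharm": rng.integers(0, 2, size=n),
    "ustwin": rng.integers(0, 2, size=n),
    "ncountry": rng.integers(1, 18, size=n),
})

# Generate opposition from an assumed logistic model
linpred = (
    -1.0
    + 1.0 * patents["biopharm"]
    - 0.5 * patents["ustwin"]
    + 0.05 * patents["ncountry"]
)
prob = 1 / (1 + np.exp(-linpred.to_numpy()))
patents["opp"] = rng.binomial(1, prob)

# Logistic regression for the probability of patent opposition
opp_logit = smf.logit("opp ~ biopharm + ustwin + ncountry", data=patents).fit(disp=False)
print(opp_logit.summary())
```

For the number of citations, a count model such as `smf.poisson` could be substituted with the same formula interface.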
R has more mixed modeling capabilities than anything else out there. Here are just a few options within R.
- lme4: generalized linear mixed models; extremely efficient
- nlme: (non-)linear mixed models, but only for the Gaussian case
- mgcv: provides means to use either lme4 or nlme, extends distributional families, correlated residuals, and all that additive model stuff too
- glmmTMB: newer package that similarly extends beyond lme4
- ordinal: for various types of ordinal models
- rstanarm: Bayesian with lme4 level options
- brms: Bayesian with possibly the most extensive mixed model capabilities out there
In Python, the statsmodels module has basic capabilities for mixed models, but not too many frills relatively speaking. CSCAR director Kerby Shedden has been part of this specific development. You can see his notes from the Data Science Skills Series here. The following provides an example.
```r
pycheck = ifelse(
  grepl('Windows', sessionInfo()$running),
  'C:/ProgramData/Anaconda3',
  '/Users/micl/opt/anaconda3/bin/python'
)
```
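```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Long-format GPA data: one row per student per occasion
gpa = pd.read_csv('data/gpa.csv')

# Random intercept model: GPA trend over occasions, intercepts vary by student
gpa_py_mixed = smf.mixedlm("gpa ~ occasion", gpa, groups=gpa["student"]).fit()

print(gpa_py_mixed.summary())
```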
```
         Mixed Linear Model Regression Results
=======================================================
Model:            MixedLM Dependent Variable: gpa
No. Observations: 1200    Method:             REML
No. Groups:       200     Scale:              0.0581
Min. group size:  6       Log-Likelihood:     -204.4464
Max. group size:  6       Converged:          Yes
Mean group size:  6.0
-------------------------------------------------------
            Coef.  Std.Err.    z    P>|z| [0.025 0.975]
-------------------------------------------------------
Intercept   2.599     0.022 119.801 0.000  2.557  2.642
occasion    0.106     0.004  26.096 0.000  0.098  0.114
Group Var   0.064     0.033
=======================================================
```
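The model above only lets the intercept vary by student. As a sketch of how one might also let the occasion slope vary, statsmodels accepts an `re_formula` argument. The data below are simulated to resemble the GPA structure (200 students, 6 occasions coded 0 to 5); the generating values are made up, so the estimates will not match the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Simulated GPA-like data: 200 students observed on 6 occasions (0 to 5)
n_students, n_occasions = 200, 6
student = np.repeat(np.arange(n_students), n_occasions)
occasion = np.tile(np.arange(n_occasions), n_students)

# Student-specific deviations in intercept and slope (assumed values)
u0 = rng.normal(0, 0.25, size=n_students)
u1 = rng.normal(0, 0.07, size=n_students)

gpa_sim = pd.DataFrame({"student": student, "occasion": occasion})
gpa_sim["gpa"] = (
    (2.6 + u0[student])
    + (0.1 + u1[student]) * occasion
    + rng.normal(0, 0.2, size=len(gpa_sim))
)

# Random intercepts and random occasion slopes for each student
slopes_model = smf.mixedlm(
    "gpa ~ occasion",
    gpa_sim,
    groups=gpa_sim["student"],
    re_formula="~occasion",
)
slopes_fit = slopes_model.fit()

print(slopes_fit.summary())
print(slopes_fit.cov_re)  # 2 x 2 covariance matrix of the random effects
```

The summary then reports a variance for the intercept, a variance for the occasion slope, and their covariance, in place of the single `Group Var` row.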
One of the lme4 authors develops a MixedModels package for Julia. Doug Bates has some notebooks on implementation of mixed models in Julia, though they are a bit dated at this point, so consult the documentation on the most recent version of the package.
Among the standard stats packages, I can only recommend Stata for both ease of implementation and flexibility in modeling. SAS and SPSS both have non-intuitive and needlessly verbose syntax, and, while SAS is in fact a fairly good tool for mixed models (from my understanding), I’ve actually seen SPSS consistently struggle with even simple models, and its GUI, the only reason to use SPSS in the first place, is not even remotely intuitive.
Mplus has a lot of functionality both for standard mixed models and growth curve models, though between brms, lavaan, mediation, and the rest of R, there is little need to use Mplus except for very complicated multilevel SEM. Such models require very large samples and a lot of theoretical justification, and most of the people doing SEM are typically lacking in one or the other.
I will also say that there is no reason to use mixed model specific software like HLM. The days for such hijinks have long since passed.
Reference texts and other stuff
An excellent modeling book can be found in Data Analysis Using Regression and Multilevel/Hierarchical Models. You will learn a great deal about statistical modeling in general, as the mixed model stuff only comprises the second part. Gelman has since updated this, or at least the first part, but I can’t speak to the content yet.
On the applied side, former CSCAR consultant Brady West has a book, Linear Mixed Models: A Practical Guide Using Statistical Software. It shows examples in a variety of programming formats while not glossing over the details you need to know.
Two texts come at mixed models from a very broad perspective, which I like. I find Fahrmeir et al. Regression - Models, Methods and Applications to be very good. It will also tie mixed models to spatial and nonparametric approaches. Richly Parameterized Linear Models - Additive, Time Series, and Spatial Models Using Random Effects does as well, and provides some nice historical context and hits on practical issues you will encounter if you play with these models long enough.
There is literally a book on mixed models with lme4 by Doug Bates, one of the developers. While dated with regard to lme4, it’s definitely not with regard to mixed models in general, and worth having around.
Check out the Mixed Model FAQ maintained by Ben Bolker, an lme4 developer.
I have several other docs on various aspects of mixed models, and post about them regularly.
There is a lot of good information on the Stack Overflow and Cross Validated Q & A sites, where lme4 developer Ben Bolker has been quite active in answering people’s questions.