Introduction

The following compares a structural equation modeling (SEM) approach using wide-format data with a mixed model approach using long-format data. In a traditional mixed model we have observations clustered within some group structure, for example, test scores collected over time within students, or survey samples of individuals from each state.

In what follows we can see a traditional mixed model as a form of constrained latent variable model, where the random effects of the mixed model are represented as latent variables in the SEM context. This is laid out most explicitly in the area of growth curve models, which take the standard mixed model setting of longitudinal data, with observations collected over time. In the SEM approach, however, we instead put the data in wide form and explicitly model latent variables that reflect the random intercepts and slopes. Growth curve models are highly constrained compared to the usual SEM setting, as most loadings are fixed rather than estimated.

We’ll start by generating some data with the SEM context in mind, then melt it to long form and run a standard mixed model. Following that we’ll compare the two, and eventually add in random slopes.

Random Intercepts

For the following we start with no additional covariates. In the mixed model framework, a simple random effects model is often depicted as follows:

\[y_{ij} = \mu + u_{j} + e_{ij}\]

In this case, each observation \(i\) within cluster \(j\) is the result of an overall mean \(\mu\), plus a random effect \(u_{j}\) due to belonging to group \(j\), plus residual noise \(e_{ij}\).

Data

In the SEM context, we generate ‘observed’ data \(y\) as if there were a single underlying latent variable \(f\). Unlike traditional SEM models, here we fix the loading of the latent variable to 1 for each observed variable. We give the observed \(y\) residual variances of 1, 2, and 3 respectively, and set the ‘fixed effect’ intercept \(\mu\) to .3.

set.seed(8675309)
n = 1000
lvI = rnorm(n, sd=2)  # Our latent variable representing random intercepts
mu = .3               # Intercept

### make the data to conform to a mixed model
y1 = mu + 1*lvI + rnorm(n, sd=1)
y2 = mu + 1*lvI + rnorm(n, sd=sqrt(2))
y3 = mu + 1*lvI + rnorm(n, sd=sqrt(3))

y = data.frame(y1, y2, y3)
head(y)
          y1        y2         y3
1 -1.9605780 0.1515718 -0.1737814
2  0.4804558 3.8265794  2.8545217
3 -2.1044920 0.7679583 -2.6376176
4  5.0600987 1.5849592  2.4593212
5  4.0626543 3.5548814  5.7934398
6  2.0400965 0.8187298  5.8198780
# reshape to long for later use with nlme
library(magrittr); library(dplyr); library(tidyr)
ylong = y %>% 
  gather(key='variable', value='y') %>% 
  mutate(group = factor(rep(1:n, 3)))
# head(ylong)

# alternative: generate the long form directly
# group = factor(rep(1:n, 3))
# yval  = mu + lvI[group] + rnorm(n*3, sd=rep(sqrt(c(1,2,3)), e=n))
# ylong = data.frame(variable=rep(paste0('y',1:3), e=n), y=yval, group)

As we can see and would expect, the variances of the observed variables are (approximately) equal to the variance of the latent variable (the \(u\) random effect in the mixed model) plus the respective residual variance, in this case `2^2 + c(1, 2, 3)`.

sapply(y, var)
      y1       y2       y3 
5.349871 6.221397 7.282551 

SEM

In the following we’ll set up the SEM in lavaan with appropriate constraints. We will hold off on results for now, but go ahead and display the graph pertaining to the conceptual model1. At this point we’re simply estimating the variances of the latent variable and the residual variances of the observed data.

To make results comparable to the later growth curve model, we’ll use the settings pertaining to lavaan’s growth function, which fixes the observed variable intercepts to zero and estimates the latent variable means.

library(lavaan)
LVmodel1 = '
  I =~ 1*y1 + 1*y2 + 1*y3
'  

semres1 = growth(LVmodel1, y)
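The plotting call for the conceptual graph isn’t shown here; a minimal sketch using the semPlot package (which, per the footnote, produced the original graph; the plotting options are my own):

```r
# Display the conceptual model graph (sketch; plotting options are assumptions)
semPlot::semPaths(semres1, what='path')
```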

Mixed Model

For the mixed model we use the melted data with a random effect for group. The nlme package is used because it will allow for heterogeneous variances for the residuals, which is what we need here.

library(nlme)
nlmemod1 = lme(y ~ 1, random= ~ 1|group, data=ylong, weights=varIdent(form = ~1|variable), method='ML')

Model Results and Comparison

SEM

For the SEM approach we get the results we’d expect, and given the data set size the estimates are right near the true values.

summary(semres1)
lavaan (0.5-20) converged normally after  22 iterations

  Number of observations                          1000

  Estimator                                         ML
  Minimum Function Test Statistic                0.990
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.911

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  I =~                                                
    y1                1.000                           
    y2                1.000                           
    y3                1.000                           

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                0.000                           
    y2                0.000                           
    y3                0.000                           
    I                 0.247    0.069    3.560    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                1.116    0.098   11.405    0.000
    y2                2.004    0.126   15.947    0.000
    y3                3.072    0.167   18.371    0.000
    I                 4.219    0.216   19.517    0.000

Mixed Model

With the mixed model we see that the random intercept variance (2.054² ≈ 4.22) corresponds to the latent variable variance, and the residual variances are likewise what we’d expect.

summary(nlmemod1)
Linear mixed-effects model fit by maximum likelihood
 Data: ylong 
       AIC      BIC    logLik
  12562.33 12592.36 -6276.164

Random effects:
 Formula: ~1 | group
        (Intercept) Residual
StdDev:    2.054055 1.056471

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | variable 
 Parameter estimates:
      y1       y2       y3 
1.000000 1.340076 1.658953 
Fixed effects: y ~ 1 
                Value  Std.Error   DF  t-value p-value
(Intercept) 0.2466467 0.06929644 2000 3.559299   4e-04

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-2.582310510 -0.575948827 -0.001671765  0.577319764  2.865939066 

Number of Observations: 3000
Number of Groups: 1000 

Comparison of latent variable and random effects

It’s a little messy to extract the specific individual variance estimates due to the way nlme estimates them2, but the estimates of the mixed model and the SEM are the same.

coef(nlmemod1$modelStruct$varStruct, unconstrained=FALSE, allCoef=TRUE) * nlmemod1$sigma
      y1       y2       y3 
1.056471 1.415752 1.752636 
sqrt(diag(inspect(semres1, 'est')$theta))
      y1       y2       y3 
1.056469 1.415752 1.752635 

Comparing the latent variable scores to the random effects, we are dealing with almost identical estimates.

comparisonDat = data.frame(LV=lavPredict(semres1)[,1], RE=ranef(nlmemod1)[,1] + fixef(nlmemod1))

head(round(comparisonDat,2), 10)   
psych::describe(comparisonDat)
cor(comparisonDat)
      LV    RE
1  -0.86 -0.86
2   1.70  1.70
3  -1.18 -1.18
4   3.16  3.16
5   3.76  3.76
6   2.14  2.14
7   0.57  0.57
8   1.83  1.83
9   1.71  1.71
10  1.25  1.25
   vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
LV    1 1000 0.25 1.93   0.31    0.26 1.91 -5.97 7.19 13.17 -0.08     0.05 0.06
RE    2 1000 0.25 1.93   0.31    0.26 1.91 -5.97 7.19 13.17 -0.08     0.05 0.06
   LV RE
LV  1  1
RE  1  1

Random intercepts and slopes

Now we will investigate random intercepts and slopes, as in a standard ‘growth curve model’. For these data, \(y\) is a repeated measurement over time; otherwise the setup is much the same as before. However, we will add a slight positive correlation (.3) between intercepts and slopes, and scale time to start at zero so that the intercept represents the average baseline value. We’ll also add an additional time point.
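In equation form, extending the random intercept model above with a random slope for time:

\[y_{ij} = \mu_0 + u_{0j} + (\mu_1 + u_{1j})\,\text{time}_{i} + e_{ij}\]

where \(\mu_0\) and \(\mu_1\) are the fixed intercept and slope, and \(u_{0j}\) and \(u_{1j}\) are the correlated random intercepts and slopes.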

Data

The main difference here is adding a covariate for time and a second latent variable. For this demo, the ‘fixed’ effects in the mixed model sense will be set to .5 and .2 for the intercept and slope for time respectively. The variances of the latent variables are set to one. We add increasing residual variance over time.

set.seed(8675309)
n = 1000
i = .5
s = .2
f = MASS::mvrnorm(n, mu=c(i,s), Sigma=matrix(c(1,.3,.3,1), nrow=2, byrow=T), empirical=T)
f1 = f[,1]
f2 = f[,2]


### make the data to conform to a mixed model
y1 = 1*f1 + 0*(f2) + rnorm(n, sd=1)          
y2 = 1*f1 + 1*(f2) + rnorm(n, sd=sqrt(2))
y3 = 1*f1 + 2*(f2) + rnorm(n, sd=sqrt(3))
y4 = 1*f1 + 3*(f2) + rnorm(n, sd=sqrt(4))

y = data.frame(y1, y2, y3, y4)
head(y)
         y1         y2         y3         y4
1 0.8233074  0.6094268 -2.9239599 -1.9196902
2 2.4320979  0.8722732  0.4130522 -0.6960228
3 0.9059002 -2.6873153 -4.6821983 -4.7305664
4 0.6309905  2.0462769  4.5094473  6.4255891
5 2.6525856  6.4423691  6.0122283  6.2786173
6 0.3857095  4.3329025  3.8898125  2.1616358
# reshape to long for later use with nlme
ylong = y %>% 
  gather(key='variable', value='y') %>% 
  mutate(subject = factor(rep(1:n, 4)),
         time = rep(0:3, e=n))

Let’s take a look at what we have.
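The original plotting code isn’t shown; a minimal sketch of individual trajectories over time (the ggplot2 usage here is my own):

```r
library(ggplot2)

# Spaghetti plot of a random subset of subjects, with an overall linear trend
# (sketch; the original document's plotting code was not shown)
idx = ylong$subject %in% sample(levels(ylong$subject), 50)
ggplot(ylong[idx, ], aes(x=time, y=y, group=subject)) +
  geom_line(alpha=.25) +
  geom_smooth(aes(group=1), method='lm', se=FALSE)
```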

Models

For the SEM we now have two latent structures, one representing the random intercepts, another the random slopes for time. For the mixed model we specify both random intercepts and slopes. The graphical model for the SEM is shown.

LVmodel2 = '
  I =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  S =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
'
semres2 = growth(LVmodel2, y)

nlmemod2 = lme(y ~ time, data=ylong, random =  ~time|subject, 
              weights=varIdent(form = ~1|variable), method='ML')

semPlot::semPaths(semres2, what='path', whatLabels='est', style='lisrel')

Model Results and Comparison

SEM

For the SEM approach we get the results we’d expect, with estimates right near the true values.

summary(semres2)
lavaan (0.5-20) converged normally after  28 iterations

  Number of observations                          1000

  Estimator                                         ML
  Minimum Function Test Statistic                5.918
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.314

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  I =~                                                
    y1                1.000                           
    y2                1.000                           
    y3                1.000                           
    y4                1.000                           
  S =~                                                
    y1                0.000                           
    y2                1.000                           
    y3                2.000                           
    y4                3.000                           

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  I ~~                                                
    S                 0.317    0.064    4.931    0.000

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                0.000                           
    y2                0.000                           
    y3                0.000                           
    y4                0.000                           
    I                 0.480    0.043   11.096    0.000
    S                 0.224    0.038    5.929    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                1.033    0.110    9.411    0.000
    y2                1.922    0.108   17.809    0.000
    y3                3.159    0.199   15.840    0.000
    y4                3.932    0.347   11.323    0.000
    I                 1.006    0.114    8.849    0.000
    S                 0.995    0.069   14.364    0.000

Mixed Model

With the mixed model we see the random effect standard deviations correspond to the latent variable variances (1.003² ≈ 1.006 for the intercept, 0.997² ≈ 0.995 for the slope), and the residual variances are what we’d expect also.

summary(nlmemod2)
Linear mixed-effects model fit by maximum likelihood
 Data: ylong 
       AIC     BIC    logLik
  17107.75 17164.4 -8544.875

Random effects:
 Formula: ~time | subject
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 1.0031982 (Intr)
time        0.9973894 0.317 
Residual    1.0161174       

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | variable 
 Parameter estimates:
      y1       y2       y3       y4 
1.000000 1.364544 1.749243 1.951385 
Fixed effects: y ~ time 
                Value  Std.Error   DF   t-value p-value
(Intercept) 0.4797851 0.04324928 2999 11.093483       0
time        0.2242885 0.03783797 2999  5.927604       0
 Correlation: 
     (Intr)
time -0.054

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-2.668449980 -0.582027043 -0.006143264  0.582212366  3.137291189 

Number of Observations: 4000
Number of Groups: 1000 

Let’s compare the estimated residual variances again.
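The code producing this comparison isn’t shown above; one way to reconstruct it, following the extraction approach used in the random intercepts section (note the values are standard deviations despite the ‘vars’ labels):

```r
# Residual SDs per time point from each model (sketch reconstructing the
# comparison table; uses the nlmemod2 and semres2 objects fit above)
varsMixed  = coef(nlmemod2$modelStruct$varStruct, unconstrained=FALSE, allCoef=TRUE) * nlmemod2$sigma
varsGrowth = sqrt(diag(inspect(semres2, 'est')$theta))
rbind(varsMixed, varsGrowth)
```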

                 y1       y2       y3       y4
varsMixed  1.016117 1.386537 1.777437 1.982836
varsGrowth 1.016122 1.386542 1.777426 1.982842

Comparing the latent variable scores to the per-subject coefficients from the mixed model (i.e., the random effects plus the fixed effects), once again we’re getting essentially identical results.
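The code for this comparison isn’t shown; a sketch using lavPredict and lme’s coef method (which returns the fixed plus random effects, hence the X.Intercept. column name after data.frame coercion):

```r
# Latent variable scores vs. per-subject coefficients (sketch; the original
# code was not shown). coef() on an lme object returns fixef + ranef.
comparisonDat2 = data.frame(lavPredict(semres2), coef(nlmemod2))

head(round(comparisonDat2, 2), 10)
psych::describe(comparisonDat2)
cor(comparisonDat2[, c('I', 'X.Intercept.')])
cor(comparisonDat2[, c('S', 'time')])
```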

       I     S X.Intercept.  time
1   0.38 -0.68         0.38 -0.68
2   1.16 -0.30         1.16 -0.30
3  -0.14 -1.48        -0.14 -1.48
4   0.93  1.43         0.93  1.43
5   2.32  1.62         2.32  1.62
6   1.08  0.92         1.08  0.92
7   0.63  1.01         0.63  1.01
8   1.40  1.76         1.40  1.76
9   1.27  0.43         1.27  0.43
10  0.33  0.82         0.33  0.82
             vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
I               1 1000 0.48 0.80   0.49    0.48 0.79 -2.21 2.96  5.18 -0.08    -0.06 0.03
S               2 1000 0.22 0.88   0.25    0.21 0.89 -2.83 3.04  5.87  0.08     0.06 0.03
X.Intercept.    3 1000 0.48 0.80   0.49    0.48 0.79 -2.21 2.96  5.18 -0.08    -0.06 0.03
time            4 1000 0.22 0.88   0.25    0.21 0.89 -2.83 3.04  5.87  0.08     0.06 0.03
             I X.Intercept.
I            1            1
X.Intercept. 1            1
     S time
S    1    1
time 1    1

Which to use?

It really depends on the model specifics as to which might be best for your situation, but I would suggest defaulting to mixed models for a variety of reasons.

In short, my opinion is that growth curve models would be preferred only in settings where the model is notably complicated; but at that level of complexity, interpretation of such a model is already very difficult, and theoretical justification hard to come by. SEM has its place, but the standard mixed model approach is very flexible, even in fairly complicated settings, both in ease of implementation and in interpretation.

Summary

As has been demonstrated, we can think of random effects in mixed models as latent variables, and conversely, we can specify most growth curve models as standard mixed models. Noting the connection may provide additional insight into how to think about random effects and the ways in which to incorporate them into our models.


  1. Graph made with the semPlot package.

  2. lme actually works by setting the reference group variance to 1, and the coefficients represent the ratios of the other variances to that group. See ?varIdent.