# Introduction

The following compares a structural equation modeling (SEM) approach using wide-format data with a mixed model approach using long-format data. In a traditional mixed model we have observations clustered within some group structure, for example, test scores collected over time within students, or survey samples of individuals from each state.

In what follows we can see a traditional mixed model as a form of constrained latent variable model, where the random effects in the mixed model are represented as latent variables in the SEM context. This is laid out most explicitly in the area of growth curve models, which apply to the standard mixed model setting of longitudinal data, in which observations occur over time. In the SEM approach, however, we take the data in wide form and explicitly model latent variables that reflect the random intercepts and slopes. Growth curve models are highly constrained compared to the usual SEM setting, as most loadings are fixed rather than estimated.

We’ll start by generating some data with the SEM context in mind, then melt it to long form and run a standard mixed model. Following that we’ll compare the two, and eventually add in random slopes.

# Random Intercepts

For the following we start with no additional covariates. In the mixed model framework, a simple random effects model is often depicted as follows:

$$y_{ij} = \mu + u_j + e_{ij}$$

In this case, each observation $$i$$ within cluster $$j$$ is the result of some overall mean $$\mu$$, plus a random effect $$u_j$$ due to belonging to group $$j$$, plus residual noise $$e_{ij}$$.
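Writing out the covariance structure this model implies makes the SEM connection concrete. With random effect variance $$\sigma^2_u$$ and residual variances $$\sigma^2_{e_i}$$ allowed to differ by variable, we have:

$$\mathrm{Var}(y_{ij}) = \sigma^2_u + \sigma^2_{e_i} \qquad \mathrm{Cov}(y_{ij},\, y_{i'j}) = \sigma^2_u \quad (i \neq i')$$

Every pair of observations within a cluster shares exactly the random effect variance, which is precisely what fixing all the loadings on a single latent variable to 1 encodes on the SEM side.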

## Data

In the SEM context, we generate ‘observed’ data $$y$$ as if there were a single underlying latent variable $$f$$. Unlike a traditional SEM, here we fix the loading of the latent variable to 1 for each observed variable. We give the observed $$y$$ residual variances of 1, 2, and 3 respectively, and set the ‘fixed effect’ intercept $$\mu$$ to .3. The latent variable itself, representing the random intercepts, has a standard deviation of 2.

set.seed(8675309)
n = 1000
lvI = rnorm(n, sd=2)  # Our latent variable representing random intercepts
mu = .3               # Intercept

### make the data to conform to a mixed model
y1 = mu + 1*lvI + rnorm(n, sd=1)
y2 = mu + 1*lvI + rnorm(n, sd=sqrt(2))
y3 = mu + 1*lvI + rnorm(n, sd=sqrt(3))

y = data.frame(y1, y2, y3)
head(y)
          y1        y2         y3
1 -1.9605780 0.1515718 -0.1737814
2  0.4804558 3.8265794  2.8545217
3 -2.1044920 0.7679583 -2.6376176
4  5.0600987 1.5849592  2.4593212
5  4.0626543 3.5548814  5.7934398
6  2.0400965 0.8187298  5.8198780
# reshape to long for later use with nlme
library(magrittr); library(dplyr); library(tidyr)
ylong = y %>%
  gather(key = 'variable', value = 'y') %>%
  mutate(group = factor(rep(1:n, 3)))
##
## # alternative: generate the long form directly
## # id    = rep(1:n, 3)
## # yvec  = mu + lvI[id] + rnorm(n*3, sd=rep(sqrt(c(1, 2, 3)), e=n))
## # ylong = data.frame(variable=rep(paste0('y', 1:3), e=n), y=yvec, group=factor(id))
## # head(ylong)

As we can see, and would expect, the variances of the observed variables are equal to the variance of the latent variable (the $$u$$ random effect in the mixed model) plus the respective residual variance; in this case, `2^2 + c(1, 2, 3)`, i.e. 5, 6, and 7.

sapply(y, var)
      y1       y2       y3
5.349871 6.221397 7.282551 
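We can also check the covariances directly; under the model, every off-diagonal element should be near the latent variable variance, $$2^2 = 4$$ (a quick check on the same simulated data, nothing new assumed):

```r
# Off-diagonals estimate the shared latent (random intercept) variance, 2^2 = 4;
# the diagonal adds the respective residual variances 1, 2, 3
round(cov(y), 2)
```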

## SEM

In the following we’ll set up the SEM in lavaan with appropriate constraints. We will hold off on the results for now, but go ahead and display the graph pertaining to the conceptual model.[^1] At this point we’re simply estimating the variance of the latent variable and the residual variances of the observed data.

To make results comparable to later, we’ll use the settings pertaining to lavaan’s growth function.

library(lavaan)
LVmodel1 = '
I =~ 1*y1 + 1*y2 + 1*y3
'

semres1 = growth(LVmodel1, y)
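For reference, growth() is essentially sem() with a particular mean structure: the observed variable intercepts are fixed to zero and the latent means are freely estimated. A roughly equivalent explicit specification (a sketch; the `_explicit` names are my own) would be:

```r
# Making growth()'s default mean structure visible
LVmodel1_explicit = '
  I =~ 1*y1 + 1*y2 + 1*y3
  y1 ~ 0*1   # observed variable intercepts fixed to zero
  y2 ~ 0*1
  y3 ~ 0*1
  I ~ 1      # latent mean (the "fixed effect" intercept) freely estimated
'
semres1_explicit = sem(LVmodel1_explicit, y, meanstructure = TRUE)
```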

## Mixed Model

For the mixed model we use the melted data with a random effect for group. The nlme package is used because it will allow for heterogeneous variances for the residuals, which is what we need here.

library(nlme)
nlmemod1 = lme(y ~ 1, random = ~ 1 | group, data = ylong,
               weights = varIdent(form = ~ 1 | variable), method = 'ML')
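As an aside, if we were willing to assume a single residual variance across y1–y3 (which our generating process violates, and which lme4 cannot relax), the familiar lme4 version would be a one-liner (a sketch for comparison only):

```r
# Homogeneous-residual-variance version; lme4 has no varIdent equivalent
library(lme4)
lmermod1 = lmer(y ~ 1 + (1 | group), data = ylong, REML = FALSE)
```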

## Model Results and Comparison

### SEM

For the SEM approach we get the results we’d expect, and given the data set size the estimates are right near the true values.

summary(semres1)
lavaan (0.5-20) converged normally after  22 iterations

Number of observations                          1000

Estimator                                         ML
Minimum Function Test Statistic                0.990
Degrees of freedom                                 4
P-value (Chi-square)                           0.911

Parameter Estimates:

Information                                 Expected
Standard Errors                             Standard

Latent Variables:
Estimate  Std.Err  Z-value  P(>|z|)
I =~
y1                1.000
y2                1.000
y3                1.000

Intercepts:
Estimate  Std.Err  Z-value  P(>|z|)
y1                0.000
y2                0.000
y3                0.000
I                 0.247    0.069    3.560    0.000

Variances:
Estimate  Std.Err  Z-value  P(>|z|)
y1                1.116    0.098   11.405    0.000
y2                2.004    0.126   15.947    0.000
y3                3.072    0.167   18.371    0.000
I                 4.219    0.216   19.517    0.000

### Mixed Model

With the mixed model we see that the random intercept standard deviation/variance corresponds to the latent variable variance, and the residual variances are what we’d expect as well.

summary(nlmemod1)
Linear mixed-effects model fit by maximum likelihood
Data: ylong
AIC      BIC    logLik
12562.33 12592.36 -6276.164

Random effects:
Formula: ~1 | group
(Intercept) Residual
StdDev:    2.054055 1.056471

Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | variable
Parameter estimates:
y1       y2       y3
1.000000 1.340076 1.658953
Fixed effects: y ~ 1
Value  Std.Error   DF  t-value p-value
(Intercept) 0.2466467 0.06929644 2000 3.559299   4e-04

Standardized Within-Group Residuals:
Min           Q1          Med           Q3          Max
-2.582310510 -0.575948827 -0.001671765  0.577319764  2.865939066

Number of Observations: 3000
Number of Groups: 1000 

#### Comparison of latent variable and random effects

It’s a little messy to extract the specific individual variance estimates due to the way nlme parameterizes them[^2], but the estimates from the mixed model and the SEM are the same.

coef(nlmemod1$modelStruct$varStruct, unconstrained=FALSE, allCoef=TRUE) * nlmemod1$sigma
      y1       y2       y3 
1.056471 1.415752 1.752636 

sqrt(diag(inspect(semres1, 'est')$theta))
      y1       y2       y3
1.056469 1.415752 1.752635 

Comparing the latent variable scores to the random effects, we are dealing with almost identical estimates.

comparisonDat = data.frame(LV = lavPredict(semres1)[,1],
                           RE = ranef(nlmemod1)[,1] + fixef(nlmemod1))

head(round(comparisonDat, 2), 10)
psych::describe(comparisonDat)
cor(comparisonDat)
      LV    RE
1  -0.86 -0.86
2   1.70  1.70
3  -1.18 -1.18
4   3.16  3.16
5   3.76  3.76
6   2.14  2.14
7   0.57  0.57
8   1.83  1.83
9   1.71  1.71
10  1.25  1.25
vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
LV    1 1000 0.25 1.93   0.31    0.26 1.91 -5.97 7.19 13.17 -0.08     0.05 0.06
RE    2 1000 0.25 1.93   0.31    0.26 1.91 -5.97 7.19 13.17 -0.08     0.05 0.06
LV RE
LV  1  1
RE  1  1

# Random intercepts and slopes

Now we will investigate random intercepts and slopes, as in a standard ‘growth curve model’. For these data, the y variables are repeated measurements over time; otherwise the setup is much the same as before. However, we will add a slight positive correlation between the intercepts and slopes, and scale time to start at zero so that the intercept represents the average baseline value. We’ll also add an additional time point.

## Data

The main difference here is adding a covariate for time and a second latent variable. For this demo, the ‘fixed’ effects in the mixed model sense will be set to .5 and .2 for the intercept and the slope for time, respectively. The variances of the latent variables are set to 1, with a correlation of .3 between them, and we add increasing residual variance over time.
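In mixed model notation, the generating process for subject $$i$$ at time $$t = 0, \ldots, 3$$ is:

$$y_{it} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\,t + e_{it}$$

with $$\beta_0 = .5$$, $$\beta_1 = .2$$, unit variances and a .3 correlation for the random effects $$u_{0i}$$ and $$u_{1i}$$, and residual variances of 1 through 4 across the four time points.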

set.seed(8675309)
n = 1000
i = .5
s = .2
f = MASS::mvrnorm(n, mu=c(i,s), Sigma=matrix(c(1,.3,.3,1), nrow=2, byrow=T), empirical=T)
f1 = f[,1]
f2 = f[,2]

### make the data to conform to a mixed model
y1 = 1*f1 + 0*(f2) + rnorm(n, sd=1)
y2 = 1*f1 + 1*(f2) + rnorm(n, sd=sqrt(2))
y3 = 1*f1 + 2*(f2) + rnorm(n, sd=sqrt(3))
y4 = 1*f1 + 3*(f2) + rnorm(n, sd=sqrt(4))

y = data.frame(y1, y2, y3, y4)
head(y)
         y1         y2         y3         y4
1 0.8233074  0.6094268 -2.9239599 -1.9196902
2 2.4320979  0.8722732  0.4130522 -0.6960228
3 0.9059002 -2.6873153 -4.6821983 -4.7305664
4 0.6309905  2.0462769  4.5094473  6.4255891
5 2.6525856  6.4423691  6.0122283  6.2786173
6 0.3857095  4.3329025  3.8898125  2.1616358
# reshape to long for later use with nlme
ylong = y %>%
  gather(key = 'variable', value = 'y') %>%
  mutate(subject = factor(rep(1:n, 4)),
         time    = rep(0:3, each = n))

Let’s take a look at what we have.

## Models

For the SEM we now have two latent structures, one representing the random intercepts, another the random slopes for time. For the mixed model we specify both random intercepts and slopes. The graphical model for the SEM is shown.

LVmodel2 = '
I =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
S =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
'
semres2 = growth(LVmodel2, y)

nlmemod2 = lme(y ~ time, data = ylong, random = ~ time | subject,
               weights = varIdent(form = ~ 1 | variable), method = 'ML')

semPlot::semPaths(semres2, what='path', whatLabels='est', style='lisrel')

## Model Results and Comparison

### SEM

For the SEM approach we get the results we’d expect, with estimates right near the true values.

summary(semres2)
lavaan (0.5-20) converged normally after  28 iterations

Number of observations                          1000

Estimator                                         ML
Minimum Function Test Statistic                5.918
Degrees of freedom                                 5
P-value (Chi-square)                           0.314

Parameter Estimates:

Information                                 Expected
Standard Errors                             Standard

Latent Variables:
Estimate  Std.Err  Z-value  P(>|z|)
I =~
y1                1.000
y2                1.000
y3                1.000
y4                1.000
S =~
y1                0.000
y2                1.000
y3                2.000
y4                3.000

Covariances:
Estimate  Std.Err  Z-value  P(>|z|)
I ~~
S                 0.317    0.064    4.931    0.000

Intercepts:
Estimate  Std.Err  Z-value  P(>|z|)
y1                0.000
y2                0.000
y3                0.000
y4                0.000
I                 0.480    0.043   11.096    0.000
S                 0.224    0.038    5.929    0.000

Variances:
Estimate  Std.Err  Z-value  P(>|z|)
y1                1.033    0.110    9.411    0.000
y2                1.922    0.108   17.809    0.000
y3                3.159    0.199   15.840    0.000
y4                3.932    0.347   11.323    0.000
I                 1.006    0.114    8.849    0.000
S                 0.995    0.069   14.364    0.000

### Mixed Model

With the mixed model we see that the random effect standard deviations/variances correspond to the latent variable variances, and the residual variances are what we’d expect as well.

summary(nlmemod2)
Linear mixed-effects model fit by maximum likelihood
Data: ylong
AIC     BIC    logLik
17107.75 17164.4 -8544.875

Random effects:
Formula: ~time | subject
Structure: General positive-definite, Log-Cholesky parametrization
StdDev    Corr
(Intercept) 1.0031982 (Intr)
time        0.9973894 0.317
Residual    1.0161174

Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | variable
Parameter estimates:
y1       y2       y3       y4
1.000000 1.364544 1.749243 1.951385
Fixed effects: y ~ time
Value  Std.Error   DF   t-value p-value
(Intercept) 0.4797851 0.04324928 2999 11.093483       0
time        0.2242885 0.03783797 2999  5.927604       0
Correlation:
(Intr)
time -0.054

Standardized Within-Group Residuals:
Min           Q1          Med           Q3          Max
-2.668449980 -0.582027043 -0.006143264  0.582212366  3.137291189

Number of Observations: 4000
Number of Groups: 1000 

Let’s compare the estimated residual standard deviations again.
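These can be extracted along the same lines as in the random intercept example; note that the values compared here are the residual standard deviations (varsMixed and varsGrowth are just labels):

```r
# nlme stores the reference-group sigma plus ratios to it; lavaan stores the
# residual variances directly in theta
varsMixed  = coef(nlmemod2$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE) *
  nlmemod2$sigma
varsGrowth = sqrt(diag(inspect(semres2, 'est')$theta))
rbind(varsMixed, varsGrowth)
```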

                 y1       y2       y3       y4
varsMixed  1.016117 1.386537 1.777437 1.982836
varsGrowth 1.016122 1.386542 1.777426 1.982842

Comparing the latent variable scores to the random effects, once again we get essentially identical results. Since the latent variable scores include the latent means, we add the fixed effects to the mixed model random effects to put both on the same scale.
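The comparison output that follows can be reproduced along these lines (comparisonDat2 is a name of my choosing; lavPredict scores incorporate the latent means, so the fixed effects are added to the random effect deviations):

```r
# Put the mixed model estimates on the same scale as the factor scores
REs = sweep(as.matrix(ranef(nlmemod2)), 2, fixef(nlmemod2), `+`)
comparisonDat2 = data.frame(lavPredict(semres2), REs)

head(round(comparisonDat2, 2), 10)
psych::describe(comparisonDat2)
cor(comparisonDat2[, c('I', 'X.Intercept.')])
cor(comparisonDat2[, c('S', 'time')])
```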

       I     S X.Intercept.  time
1   0.38 -0.68         0.38 -0.68
2   1.16 -0.30         1.16 -0.30
3  -0.14 -1.48        -0.14 -1.48
4   0.93  1.43         0.93  1.43
5   2.32  1.62         2.32  1.62
6   1.08  0.92         1.08  0.92
7   0.63  1.01         0.63  1.01
8   1.40  1.76         1.40  1.76
9   1.27  0.43         1.27  0.43
10  0.33  0.82         0.33  0.82
vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
I               1 1000 0.48 0.80   0.49    0.48 0.79 -2.21 2.96  5.18 -0.08    -0.06 0.03
S               2 1000 0.22 0.88   0.25    0.21 0.89 -2.83 3.04  5.87  0.08     0.06 0.03
X.Intercept.    3 1000 0.48 0.80   0.49    0.48 0.79 -2.21 2.96  5.18 -0.08    -0.06 0.03
time            4 1000 0.22 0.88   0.25    0.21 0.89 -2.83 3.04  5.87  0.08     0.06 0.03
I X.Intercept.
I            1            1
X.Intercept. 1            1
S time
S    1    1
time 1    1

# Which to use?

It really depends on the model specifics as to which might be best for your situation, but I would suggest defaulting to mixed models for a variety of reasons.

• Ease of implementation: Very little special syntax is needed for the mixed model approach relative to standard linear model code, whereas every SEM program or package requires its own specialized syntax.
• Ease of interpretation/communication: Mixed models are far more commonly used across disciplines, and allow the less familiar to use their standard regression knowledge to interpret them in a straightforward fashion. I’ve often seen people who would have no trouble interpreting the mixed model (at least the fixed effects portion) get hung up on growth models with additional covariates.
• Time-varying covariates: With time-varying covariates, the multivariate approach has each time point of the dependent variable predicted by each time point of the covariate (akin to a nonlinear relationship or an interaction with time), and the model gets unwieldy very quickly with only a handful of time-varying covariates even if there are few time points, or with few covariates and many time points.
• Nonlinear relationships: While one can incorporate nonlinear relationships very easily in the standard setting, due to the close ties between mixed and additive models, in the SEM setting it becomes cumbersome (e.g. adding a quadratic slope factor as if one knew the functional form) or unusual to interpret (allowing the slope loadings to be estimated rather than fixed).
• Correlated residuals: In growth model settings it’s common to specify autocorrelated residuals based on time. The syntax to do so in the SEM framework is very tedious.
• Parallel processes: Within the Bayesian framework one can incorporate parallel processes (multivariate mixed models) via a multivariate outcome fairly easily with the brms package (example with Stan).
• Indirect effects: Indirect effects can also be incorporated in the standard mixed model framework (example with Stan; see also the mediation package, which uses lme4 for multilevel mediation).
• Sample sizes: SEM is an inherently large sample technique, and growth curve models can become quite complicated in terms of the number of parameters to be estimated. Large samples are always desirable for either approach, but where mixed models can be run on clustered data with 30 clusters, it would be a bit odd to use SEM on 30 observations. I have some simulation results here. It may be that one would need at least 50 clusters, with many data points within each, for the growth curve model to approach the performance of the mixed model, whereas even a few clusters with few time points is okay for the mixed model.
• Balanced data: Growth curve modeling requires balanced data across all time points, so missing data must be estimated or one will potentially lose too much of it. Mixed models have no such requirement, though whether one should still estimate the missing values is a matter of debate.
• Other: Mixed models have natural ties to spatial and additive models, as well as a straightforward Bayesian interpretation in which the random effects are given a prior distribution.
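As a concrete illustration of the correlated residuals point, adding AR(1) residuals within subject to the growth model fit earlier is a single additional argument in nlme (a sketch building on nlmemod2 above):

```r
# AR(1) residual correlation over time within subject
nlmemod2_ar = update(nlmemod2, correlation = corAR1(form = ~ time | subject))
```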

In short, my opinion is that growth curve models would probably be preferred only in settings where the model is notably complicated, and that level of complexity starts at a point where interpretation would already be very difficult and theoretical justification hard to come by. SEM has its place, but the standard mixed model approach is very flexible, even in fairly complicated settings, both in ease of implementation and in interpretation.

# Summary

As has been demonstrated, we can think of the random effects in mixed models as latent variables, and conversely, we can specify most growth curve models as standard mixed models. Noting the connection may provide additional insight into how to think about random effects, and suggest ways in which to incorporate them in modeling.

[^1]: Graph made with the semPlot package.

[^2]: lme actually works by setting the reference group variance to 1, with the coefficients representing the ratios of the other variances to that group’s. See ?varIdent.