Michael Clark

Statistician Lead

CSCAR, ARC, U of Michigan

The following compares a structural equation modeling (SEM) approach using wide-format data with a mixed model approach using long-format data. In a traditional mixed model we have observations clustered within some group structure, for example, test scores collected over time within students, or survey samples of individuals from each state.

In what follows we can see a traditional mixed model as a form of constrained latent variable model, where the random effects in the mixed model are represented as latent variables in the SEM context. This is laid out more explicitly in the area of *growth curve* models, which take the standard mixed model setting of longitudinal data, where observations occur over time. In the SEM approach, however, the data are in wide form, and we explicitly model latent variables that reflect the random intercepts and slopes. Growth curve models are highly constrained compared to the usual SEM setting, as most loadings are fixed rather than estimated.

We’ll start by generating some data with the SEM context in mind, then melt it to long form and run a standard mixed model. Following that we’ll compare the two, and eventually add in random slopes.

For the following we start with no additional covariates. In the mixed model framework, a simple random effects model is often depicted as follows:

\[y_{ij} = \mu + u_{j} + e_{ij}\]

In this case, each observation \(i\) within a cluster is the result of some overall mean plus a random effect due to belonging to group \(j\), plus residual noise.
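This decomposition implies a compound symmetric covariance structure within clusters, since observations from the same group share the random effect (in the demo below, the residual variance is additionally allowed to differ across the three measurements):

\[
\mathrm{Var}(y_{ij}) = \sigma^2_u + \sigma^2_e, \qquad \mathrm{Cov}(y_{ij}, y_{i'j}) = \sigma^2_u \quad (i \neq i')
\]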

In the SEM context, we generate ‘observed’ data \(y\) as if there were a single underlying latent variable \(f\). Unlike traditional SEM models, here we fix the loading of the latent variable to 1 for each observed variable. We give the observed \(y\) residual variances of 1, 2, and 3 respectively, and set the ‘fixed effect’ intercept \(\mu\) to .3.

```
set.seed(8675309)
n = 1000
lvI = rnorm(n, sd=2) # Our latent variable representing random intercepts
mu = .3 # Intercept
### make the data to conform to a mixed model
y1 = mu + 1*lvI + rnorm(n, sd=1)
y2 = mu + 1*lvI + rnorm(n, sd=sqrt(2))
y3 = mu + 1*lvI + rnorm(n, sd=sqrt(3))
y = data.frame(y1, y2, y3)
head(y)
```

```
          y1        y2         y3
1 -1.9605780 0.1515718 -0.1737814
2  0.4804558 3.8265794  2.8545217
3 -2.1044920 0.7679583 -2.6376176
4  5.0600987 1.5849592  2.4593212
5  4.0626543 3.5548814  5.7934398
6  2.0400965 0.8187298  5.8198780
```

```
# reshape to long for later use with nlme
library(magrittr); library(dplyr); library(tidyr)
ylong = y
ylong %<>%
  gather(key='variable', value='y') %>%
  mutate(group = factor(rep(1:n, 3)))
## head(ylong)

## # alternative: generate the long form directly
## group = factor(rep(1:n, 3))
## y2 = mu + lvI[group] + rnorm(n*3, sd=rep(sqrt(c(1,2,3)), e=n))
## ylong = data.frame(variable=rep(paste0('y',1:3), e=n), value=y2, group)
## head(ylong)
```

As we can see, and would expect, the variances of the observed variables are equal to the variance of the latent variable (the \(u\) random effect in the mixed model) plus the residual variance, in this case `2^2 + c(1, 2, 3)`, i.e. 5, 6, and 7.

`sapply(y, var)`

```
      y1       y2       y3 
5.349871 6.221397 7.282551 
```

In the following we’ll set up the SEM in lavaan with appropriate constraints. We will hold off on results for now, but go ahead and display the graph pertaining to the conceptual model^{1}. At this point we’re simply estimating the variances of the latent variable and the residual variances of the observed data.

To make the results comparable to those later, we’ll use the settings pertaining to lavaan’s `growth` function.

```
library(lavaan)
LVmodel1 = '
  I =~ 1*y1 + 1*y2 + 1*y3
'
semres1 = growth(LVmodel1, y)
```
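The plotting code isn’t shown at this point; a call like the one used later for the growth model would display the conceptual graph (assuming the semPlot package is available):

```
# display the conceptual model; mirrors the semPaths call used later
semPlot::semPaths(semres1, what='path', whatLabels='est', style='lisrel')
```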

For the mixed model we use the melted data with a random effect for group. The nlme package is used because it will allow for heterogeneous variances for the residuals, which is what we need here.

```
library(nlme)
nlmemod1 = lme(y ~ 1, random = ~1|group, data=ylong,
               weights=varIdent(form = ~1|variable), method='ML')
```

For the SEM approach we get the results we’d expect, and given the data set size the estimates are right near the true values.

`summary(semres1)`

```
lavaan (0.5-20) converged normally after 22 iterations

  Number of observations                          1000

  Estimator                                         ML
  Minimum Function Test Statistic                0.990
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.911

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  I =~
    y1                1.000
    y2                1.000
    y3                1.000

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                0.000
    y2                0.000
    y3                0.000
    I                 0.247    0.069    3.560    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                1.116    0.098   11.405    0.000
    y2                2.004    0.126   15.947    0.000
    y3                3.072    0.167   18.371    0.000
    I                 4.219    0.216   19.517    0.000
```

With the mixed model we see that the random intercept sd/variance corresponds to the latent variable variance, and the residual variance is also what we’d expect.

`summary(nlmemod1)`

```
Linear mixed-effects model fit by maximum likelihood
 Data: ylong 
       AIC      BIC    logLik
  12562.33 12592.36 -6276.164

Random effects:
 Formula: ~1 | group
        (Intercept) Residual
StdDev:    2.054055 1.056471

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | variable 
 Parameter estimates:
      y1       y2       y3 
1.000000 1.340076 1.658953 

Fixed effects: y ~ 1 
                Value  Std.Error   DF  t-value p-value
(Intercept) 0.2466467 0.06929644 2000 3.559299   4e-04

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-2.582310510 -0.575948827 -0.001671765  0.577319764  2.865939066 

Number of Observations: 3000
Number of Groups: 1000 
```

It’s a little messy to extract the specific individual variance estimates due to the way nlme estimates them^{2}, but the estimates of the mixed model and the SEM are the same.

`coef(nlmemod1$modelStruct$varStruct, unconstrained=FALSE, allCoef=TRUE) * nlmemod1$sigma`

```
      y1       y2       y3 
1.056471 1.415752 1.752636 
```

`sqrt(diag(inspect(semres1, 'est')$theta))`

```
      y1       y2       y3 
1.056469 1.415752 1.752635 
```

Comparing the latent variable scores to the random effects, we are dealing with almost identical estimates.

```
comparisonDat = data.frame(LV=lavPredict(semres1)[,1], RE=ranef(nlmemod1)[,1] + fixef(nlmemod1))
head(round(comparisonDat,2), 10)
psych::describe(comparisonDat)
cor(comparisonDat)
```

```
      LV    RE
1  -0.86 -0.86
2   1.70  1.70
3  -1.18 -1.18
4   3.16  3.16
5   3.76  3.76
6   2.14  2.14
7   0.57  0.57
8   1.83  1.83
9   1.71  1.71
10  1.25  1.25

   vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
LV    1 1000 0.25 1.93   0.31    0.26 1.91 -5.97 7.19 13.17 -0.08     0.05 0.06
RE    2 1000 0.25 1.93   0.31    0.26 1.91 -5.97 7.19 13.17 -0.08     0.05 0.06

   LV RE
LV  1  1
RE  1  1
```

Now we will investigate random intercepts and slopes, as in a standard ‘growth curve model’. For these data, *y* is a repeated measurement over time; otherwise the data are much the same as before. However, we will add a slight positive correlation between intercepts and slopes, and scale time to start at zero so that the intercept represents the average baseline value. We’ll also add an additional time point.

The main difference here is adding a covariate for time and a second latent variable. For this demo, the ‘fixed’ effects in the mixed model sense will be set to .5 and .2 for the intercept and slope for time respectively. The variances of the latent variables are set to one. We add increasing residual variance over time.
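In mixed model notation, the data generated below corresponds to:

\[
y_{tj} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,t + e_{tj}, \qquad t = 0, 1, 2, 3
\]

with \(\beta_0 = .5\), \(\beta_1 = .2\), \(\mathrm{Var}(u_{0j}) = \mathrm{Var}(u_{1j}) = 1\), \(\mathrm{Cov}(u_{0j}, u_{1j}) = .3\), and \(\mathrm{Var}(e_{tj}) = t + 1\).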

```
set.seed(8675309)
n = 1000
i = .5
s = .2
f = MASS::mvrnorm(n, mu=c(i,s), Sigma=matrix(c(1,.3,.3,1), nrow=2, byrow=T), empirical=T)
f1 = f[,1]
f2 = f[,2]
### make the data to conform to a mixed model
y1 = 1*f1 + 0*(f2) + rnorm(n, sd=1)
y2 = 1*f1 + 1*(f2) + rnorm(n, sd=sqrt(2))
y3 = 1*f1 + 2*(f2) + rnorm(n, sd=sqrt(3))
y4 = 1*f1 + 3*(f2) + rnorm(n, sd=sqrt(4))
y = data.frame(y1, y2, y3, y4)
head(y)
```

```
         y1         y2         y3         y4
1 0.8233074  0.6094268 -2.9239599 -1.9196902
2 2.4320979  0.8722732  0.4130522 -0.6960228
3 0.9059002 -2.6873153 -4.6821983 -4.7305664
4 0.6309905  2.0462769  4.5094473  6.4255891
5 2.6525856  6.4423691  6.0122283  6.2786173
6 0.3857095  4.3329025  3.8898125  2.1616358
```

```
# reshape to long for later use with nlme
ylong = y
ylong %<>%
  gather(key='variable', value='y') %>%
  mutate(subject = factor(rep(1:n, 4)),
         time = rep(0:3, each=n))
```

Let’s take a look at what we have.

For the SEM we now have two latent structures, one representing the random intercepts, another the random slopes for time. For the mixed model we specify both random intercepts and slopes. The graphical model for the SEM is shown.

```
LVmodel2 = '
  I =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  S =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
'
semres2 = growth(LVmodel2, y)

nlmemod2 = lme(y ~ time, data=ylong, random = ~time|subject,
               weights=varIdent(form = ~1|variable), method='ML')

semPlot::semPaths(semres2, what='path', whatLabels='est', style='lisrel')
```

For the SEM approach we get the results we’d expect, with estimates right near the true values.

`summary(semres2)`

```
lavaan (0.5-20) converged normally after 28 iterations

  Number of observations                          1000

  Estimator                                         ML
  Minimum Function Test Statistic                5.918
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.314

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  I =~
    y1                1.000
    y2                1.000
    y3                1.000
    y4                1.000
  S =~
    y1                0.000
    y2                1.000
    y3                2.000
    y4                3.000

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  I ~~
    S                 0.317    0.064    4.931    0.000

Intercepts:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                0.000
    y2                0.000
    y3                0.000
    y4                0.000
    I                 0.480    0.043   11.096    0.000
    S                 0.224    0.038    5.929    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    y1                1.033    0.110    9.411    0.000
    y2                1.922    0.108   17.809    0.000
    y3                3.159    0.199   15.840    0.000
    y4                3.932    0.347   11.323    0.000
    I                 1.006    0.114    8.849    0.000
    S                 0.995    0.069   14.364    0.000
```

With the mixed model we see that the between-group sd/variance corresponds to the latent variable variances, and the residual variance is also what we’d expect.

`summary(nlmemod2)`

```
Linear mixed-effects model fit by maximum likelihood
 Data: ylong 
       AIC     BIC    logLik
  17107.75 17164.4 -8544.875

Random effects:
 Formula: ~time | subject
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 1.0031982 (Intr)
time        0.9973894 0.317 
Residual    1.0161174       

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | variable 
 Parameter estimates:
      y1       y2       y3       y4 
1.000000 1.364544 1.749243 1.951385 

Fixed effects: y ~ time 
                Value  Std.Error   DF   t-value p-value
(Intercept) 0.4797851 0.04324928 2999 11.093483       0
time        0.2242885 0.03783797 2999  5.927604       0
 Correlation: 
     (Intr)
time -0.054

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-2.668449980 -0.582027043 -0.006143264  0.582212366  3.137291189 

Number of Observations: 4000
Number of Groups: 1000 
```

Let’s compare the estimated residual variances again.
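The extraction code isn’t shown; squaring the residual standard deviations pulled from each model, as was done earlier, should reproduce the comparison (`varsMixed` and `varsGrowth` are simply the row labels used in the output):

```
varsMixed  = (coef(nlmemod2$modelStruct$varStruct,
                   unconstrained=FALSE, allCoef=TRUE) * nlmemod2$sigma)^2
varsGrowth = diag(inspect(semres2, 'est')$theta)
rbind(varsMixed, varsGrowth)
```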

```
                 y1       y2       y3       y4
varsMixed  1.016117 1.386537 1.777437 1.982836
varsGrowth 1.016122 1.386542 1.777426 1.982842
```

Comparing the latent variable scores to the per-subject estimates from the mixed model (the random effects combined with the fixed effects), once again we get essentially identical results.
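Again the code isn’t displayed; a sketch that matches the column names in the output uses `coef`, which for an lme object returns each subject’s combined fixed plus random effects:

```
comparisonDat = data.frame(lavPredict(semres2), coef(nlmemod2))
head(round(comparisonDat, 2), 10)
psych::describe(comparisonDat)
cor(comparisonDat[, c('I', 'X.Intercept.')])
cor(comparisonDat[, c('S', 'time')])
```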

```
       I     S X.Intercept.  time
1   0.38 -0.68         0.38 -0.68
2   1.16 -0.30         1.16 -0.30
3  -0.14 -1.48        -0.14 -1.48
4   0.93  1.43         0.93  1.43
5   2.32  1.62         2.32  1.62
6   1.08  0.92         1.08  0.92
7   0.63  1.01         0.63  1.01
8   1.40  1.76         1.40  1.76
9   1.27  0.43         1.27  0.43
10  0.33  0.82         0.33  0.82

             vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
I               1 1000 0.48 0.80   0.49    0.48 0.79 -2.21 2.96  5.18 -0.08    -0.06 0.03
S               2 1000 0.22 0.88   0.25    0.21 0.89 -2.83 3.04  5.87  0.08     0.06 0.03
X.Intercept.    3 1000 0.48 0.80   0.49    0.48 0.79 -2.21 2.96  5.18 -0.08    -0.06 0.03
time            4 1000 0.22 0.88   0.25    0.21 0.89 -2.83 3.04  5.87  0.08     0.06 0.03

             I X.Intercept.
I            1            1
X.Intercept. 1            1

     S time
S    1    1
time 1    1
```

It really depends on the model specifics as to which might be best for your situation, but I would suggest defaulting to mixed models for a variety of reasons.

- Ease of implementation: Very little special syntax is needed for a mixed model approach relative to standard linear model code, whereas all SEM programs and packages require special syntax.
- Ease of interpretation/communication: Mixed models are far more commonly used across disciplines, and they allow the less familiar to use their standard regression knowledge to interpret them in a straightforward fashion. I’ve often seen people who would have no trouble interpreting the mixed model (at least the fixed effects portion) get hung up on growth models with additional covariates.
- Time-varying covariates: With time-varying covariates, the multivariate approach has each time point of the dependent variable predicted by each time point of the covariate (akin to a nonlinear relationship or interaction with time), and the model gets unwieldy very quickly with only a handful of time-varying covariates even if there are few time points, or with few covariates measured at many time points.

- Nonlinear relationships: While one can incorporate nonlinear relationships very easily in the standard setting due to the close ties between mixed and additive models, in the SEM setting it becomes cumbersome (e.g. adding a quadratic slope factor as if one knew the functional form) or unusual to interpret (allowing the slope loadings to be estimated, rather than fixed).
- Correlated residuals: In growth model settings it’s common to specify autocorrelated residuals based on time. The syntax to do so in the SEM framework is very tedious.
- Parallel Processes: Within the Bayesian framework one can incorporate parallel processes (multivariate mixed models) via a multivariate outcome fairly easily with the brms package (example with Stan).
- Indirect Effects: Indirect effects can also be incorporated in the standard mixed model framework (example with Stan, see also the mediation package for using lme4 for multilevel mediation).
- Sample sizes: SEM is an inherently large-sample technique, and growth curve models can become quite complicated in terms of the number of parameters to be estimated. Obviously large samples are always desirable for either approach, but, for example, where mixed models can be run on clustered data with 30 clusters, it would be a bit odd to use SEM for 30 observations. I have some simulation results here. It may be that one would need at least 50 clusters, with many data points within each, for the growth curve model to approach the performance of the mixed model, whereas the mixed model setting is fine with even a few clusters and few time points.
- Balanced data: Growth curve modeling requires balanced data across all time points, and so missing data necessarily has to be estimated or one will potentially lose too much of it. Mixed models do not, but whether one should still estimate the missing values is a matter of debate.
- Other: Mixed models have natural ties to spatial and additive models, as well as a straightforward Bayesian interpretation regarding a prior distribution for the random effects.

In short, my opinion is that growth curve models would probably only be preferred in settings where the model is notably complicated, but at that level of complexity the interpretation of such a model would already be very difficult, and theoretical justification hard to come by. SEM has its place, but the standard mixed model approach is very flexible, even in fairly complicated settings, both for ease of implementation and interpretation.

As has been demonstrated, we can think of random effects in mixed models as latent variables, and conversely, we can specify most growth models as standard mixed models. Noting the connection may provide additional insight into how to think about random effects and the ways in which to incorporate them in modeling.