Purpose

In many cases, researchers are interested in what amounts to a latent construct and its effects in a structural/regression model with regard to some target. If we assume the latent construct is the ‘true’ score, using a sum score or a single item will likely not capture the relationship between the construct and the target very well, yet these are by far the more common approaches in practice. Unfortunately, doing so makes many assumptions about how well the indicator(s) measure the underlying construct that are unlikely to hold. And even those who would prefer an SEM approach too often will not have adequate sample size to conduct one confidently.

The following seeks to investigate how well using estimated factor scores in a regression performs relative to single items, sum/mean scores, or full SEM estimation. Comparisons will be made across different loadings on the latent variable, sample sizes, effect sizes, and the number of items/indicators.

Outline

Single factor

In all, 54 data situations will be examined, crossing the following settings:

  • average loadings: .25, .5, or .8
  • number of indicators: 3 or 6
  • effect size: small (.15), medium (.3), or large (.5)
  • sample size: small (100), moderate (500), or large (1000)

Model comparisons

The primary comparisons include the following:

  • compare regression coefficient for estimated factor score to sum score
  • compare regression coefficient for estimated factor score to single (random) item
  • compare lm vs. sem

Data & Model Generation

The following code shows how the data are created. Given a mean loading, a sample size setting, and a number of items, both the indicators x and the target y are created based on a scaled factor score f.

As an example, consider the following 6-item setting, where the average loading is .8 and the true effect of the latent variable is .5. The sample size is set to 1000. For all data generation, the factor has mean 0 and variance 1, as do the observed indicators. For easier comparison across settings, I also standardize the scores used in the regression, though on average they’d be close to mean 0, sd 1 anyway.

nitems = 6
loadings = .8
effect = .5
sampsize = 1000
factorvar = 1

# lambda = rnorm(nitems-1, mean=loadings, sd=.1)  # unstandardized
# lambda = matrix(c(1, lambda), nrow=1)

# lambda = rnorm(nitems, mean=loadings, sd=.1)    # standardized (used in simulations)
lambda = rep(loadings, nitems)                    # just for this demo

# factors and some noise; this approach will create standardized indicators
f = matrix(rnorm(sampsize, mean=0, sd=factorvar), ncol=1)
e = mvtnorm::rmvnorm(sampsize, sigma=diag(1-c(lambda)^2, nitems))

# observed indicators
x = 0 + f %*% lambda + e

# observed target variable
y = 2 + effect*scale(f) + rnorm(sampsize)         # intercept is not necessary

We can see that the results of a regression of y on the true factor score f are what we’d expect.

summary(lm(y ~ f))

Call:
lm(formula = y ~ f)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2196 -0.6938 -0.0095  0.6336  3.5718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.05946    0.03164   65.10   <2e-16 ***
f            0.50248    0.03173   15.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 998 degrees of freedom
Multiple R-squared:  0.2009,    Adjusted R-squared:  0.2001 
F-statistic: 250.9 on 1 and 998 DF,  p-value: < 2.2e-16

Now let’s look at a factor analysis of the items, as well as the factor scores generated from them. The column labeled ML1 contains the loadings (roughly 0.8), and h2 is the squared loading, i.e. the communality. The u2 are the uniquenesses, which by design are 1 - h2; the unique factor for each indicator x comprises all those other causes that are not the latent variable, or common factor, under specific consideration. The final column, com, is the item complexity, which is 1 here since each item reflects only one factor. The sum of the squared loadings is 3.88, which, divided by the total number of items (6), tells us how much of the variance in the items is accounted for by the latent variable (65%). The rest of the output provides various measures of model fit and assessment.

library(psych)
faout = fa(x, fm='ML')
faout
Factor Analysis using method =  ml
Call: fa(r = x, fm = "ML")
Standardized loadings (pattern matrix) based upon correlation matrix
   ML1   h2   u2 com
1 0.80 0.64 0.36   1
2 0.82 0.68 0.32   1
3 0.81 0.65 0.35   1
4 0.81 0.66 0.34   1
5 0.78 0.60 0.40   1
6 0.81 0.65 0.35   1

                ML1
SS loadings    3.88
Proportion Var 0.65

Mean item complexity =  1
Test of the hypothesis that 1 factor is sufficient.

The degrees of freedom for the null model are  15  and the objective function was  3.78 with Chi Square of  3765.66
The degrees of freedom for the model are 9  and the objective function was  0 

The root mean square of the residuals (RMSR) is  0.01 
The df corrected root mean square of the residuals is  0.01 

The harmonic number of observations is  1000 with the empirical chi square  1  with prob <  1 
The total number of observations was  1000  with MLE Chi Square =  3.85  with prob <  0.92 

Tucker Lewis Index of factoring reliability =  1.002
RMSEA index =  0  and the 90 % confidence intervals are  NA 0.012
BIC =  -58.32
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                ML1
Correlation of scores with factors             0.96
Multiple R square of scores with factors       0.92
Minimum correlation of possible factor scores  0.83

Scores are constructed based on the Thurstone (a.k.a. regression) approach. They are produced as follows:

\[S = XW, \qquad W = R^{-1}F\]

where \(X\) is the matrix of (standardized) observed indicators, \(R\) is their correlation matrix, and \(F\) is the factor loading matrix. A by-hand check of this computation follows the score output below.

s = factor.scores(x, faout, method='Thurstone')$scores
describe(data.frame(x,y,s))
    vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1     1 1000 -0.02 0.98  -0.05   -0.03 0.95 -3.39 3.18  6.56  0.06     0.27 0.03
X2     2 1000 -0.07 1.01  -0.09   -0.07 1.01 -3.56 3.36  6.92 -0.01     0.18 0.03
X3     3 1000 -0.02 1.00  -0.01   -0.02 1.01 -2.99 3.45  6.44  0.04     0.04 0.03
X4     4 1000 -0.01 1.01  -0.03   -0.03 1.01 -3.17 3.04  6.21  0.12    -0.11 0.03
X5     5 1000  0.01 0.99   0.03    0.00 0.96 -3.00 3.27  6.27  0.02     0.01 0.03
X6     6 1000  0.01 1.01  -0.02    0.00 1.06 -3.95 3.24  7.18 -0.01     0.04 0.03
y      7 1000  2.05 1.12   2.04    2.03 1.12 -1.06 5.72  6.77  0.15     0.10 0.04
ML1    8 1000  0.00 0.96  -0.03   -0.01 0.94 -3.14 3.25  6.39  0.08     0.24 0.03
lm(y ~ s)

Call:
lm(formula = y ~ s)

Coefficients:
(Intercept)            s  
     2.0461       0.5138  
cor(s, f)
         [,1]
ML1 0.9571474
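
As promised, here is the by-hand check of the formula, a minimal sketch using the objects already created; factor.scores may apply minor numerical adjustments, so expect agreement up to rounding.

R = cor(x)                   # correlation matrix of the indicators
L = faout$loadings           # factor loading matrix from the fa result
W = solve(R) %*% L           # W = R^-1 F
s_manual = scale(x) %*% W    # S = XW, with X standardized
cor(s_manual, s)             # ~1, matching the factor.scores result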

For each of the 54 cases under consideration, 1000 data sets were generated. Note that these data could also be generated via lavaan as follows:

demo.model <- '
y ~ 2*1 + .5*f

f =~ .8*x1 + 0.8*x2 + .8*x3 + .8*x4 + 0.8*x5 + .8*x6

x1 ~~ (1-.8^2)*x1
x2 ~~ (1-.8^2)*x2
x3 ~~ (1-.8^2)*x3
x4 ~~ (1-.8^2)*x4
x5 ~~ (1-.8^2)*x5
x6 ~~ (1-.8^2)*x6
'

# generate data; note, standardized lv is default
myData <- lavaan::simulateData(demo.model, sample.nobs=1000)
describe(myData)[,1:4]
   vars    n  mean   sd
x1    1 1000  0.00 0.99
x2    2 1000  0.00 1.02
x3    3 1000 -0.03 0.99
x4    4 1000 -0.01 1.00
x5    5 1000 -0.04 1.02
x6    6 1000  0.01 1.03
y     7 1000  1.99 1.17

SEM Model

The following results are based on a standard linear model for target \(y\) with a single predictor of three types: the sum score of the items, a randomly selected single item, or a factor score estimated with standard factor analysis. In addition, a formal SEM will be run using lavaan, where the items are assumed to reflect an underlying latent variable, which in turn has some relation to the target \(y\), in typical SEM fashion.
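
As a sketch of that SEM in lavaan, using the demo data generated earlier (std.lv = TRUE fixes the latent variance to 1, which puts the structural coefficient on the same scale as the true effect):

library(lavaan)

# assemble the demo indicators and target
df = data.frame(x, y)
colnames(df) = c(paste0('x', 1:6), 'y')

sem_model = '
  f =~ x1 + x2 + x3 + x4 + x5 + x6   # measurement model
  y ~ f                              # structural model
'

fit_sem = sem(sem_model, data = df, std.lv = TRUE)
coef(fit_sem)['y~f']   # should be near the true effect of .5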

Results: Single factor models

As noted, results will focus on different loading sizes, numbers of items, effect sizes, and sample sizes. Given this setup, it might be obvious to some that a sum score can’t reproduce the result as well as the factor score (nor could a random item), but that is precisely the point. Unless the indicators are measured without error and contribute equally, deficits in estimation are possible.

Reliability

To begin with specific results, we might consider the reliability of the indicators. The alpha displayed below is the over-used Cronbach’s \(\alpha\), but in this demonstration it’s an adequate measure. Not surprisingly, without strong loadings and/or more indicators, the indicators are not very reliable. For more on these measures, see the help file for alpha in the psych package.
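
As a quick check with the demo items from earlier (with 6 items and loadings of .8, \(\alpha\) should be around .91):

psych::alpha(x)$total$raw_alpha   # ~ .91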


Coefficients

Next we move to the coefficient from the structural model.

Single item

The best that a single item can do in estimating the effect is given by the product of its loading and the factor score regression coefficient. In other words, using a single variable as a measure assumes perfect measurement correspondence with the underlying construct, and without that, it will always underestimate the true effect¹.
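
A quick check with the demo data, using the first item (expected slope \(\lambda*\beta = .8 \times .5 = .4\)):

coef(lm(y ~ scale(x[, 1])))   # slope ~ .4, well below the true .5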

Sum score

The sum score’s performance is going to reflect the reliability of all the indicators used to create it. In this situation, you can roughly get its estimate as \(\beta_{\mathrm{true}}*\sqrt\alpha\), using the Cronbach’s \(\alpha\) from the above table². Conversely, \(\frac{\beta_{\mathrm{est}}}{\sqrt\alpha}\) will roughly equal \(\beta_{\mathrm{true}}\). Note that this demonstration is also the best-case scenario, as here we are in fact dealing with equal loadings on average, which is essentially what the sum score assumes. However, this is usually not going to hold in practice.
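
Checking again with the demo data, where \(\alpha \approx .91\), the sum score slope should land around \(.5\sqrt{.91} \approx .48\):

a = psych::alpha(x)$total$raw_alpha   # reliability of the demo items
c(expected  = .5 * sqrt(a),
  estimated = unname(coef(lm(y ~ scale(rowSums(x))))[2]))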

Latent variable

With larger data that is more conducive to SEM, the SEM will correctly estimate the true effect even with less reliable measures, provided there are more items, while the two-step approach will still underestimate to some extent. However, with poor measures and few items, the SEM consistently ran into estimation problems. SEM_err_per is the percentage of failed models.



Standard errors

Given that \(\sigma_y\), i.e. the residual standard error, is 1 in these models, the standard error for the coefficients is \(1/\sqrt{N}\) assuming no measurement error³, and in this case would be 0.1, 0.045, and 0.032 for sample sizes of 100, 500, and 1000. With measurement error, it would be a function of the reliability, and thus depend on the number of items and how well they reflect the true score. For example, with an average loading of .5 and 6 items, the typical \(\alpha\) was around .66. In those situations, the true standard error should be around \(1/\sqrt{\alpha N}\), or about 0.123, 0.055, and 0.039, and this is what the factor score approaches hover around. As such, standard errors for the sum score and single item cases are going to be optimistic (too low) in the presence of unreliable measures, and even the factor score approaches may be problematic in the worst-case scenarios.
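
The arithmetic, for reference:

N = c(100, 500, 1000)
rbind(no_meas_error = 1 / sqrt(N),         # 0.100 0.045 0.032
      alpha_.66     = 1 / sqrt(.66 * N))   # 0.123 0.055 0.039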



Interestingly, we can compare the average estimated standard errors from the models with the raw standard deviation of the estimated coefficients across the 1000 data sets for each scenario. In general, in the least reliable settings, the estimated standard errors may be a bit low. In the better scenarios, there is no efficiency gain for SEM relative to the two-step approach.


In ideal situations, standard errors are identical across the board (maybe a little high for the single item approach). Thus, with attenuated coefficients but similar standard errors, all else being equal, you’re going to miss some effects in terms of statistical significance. Generally speaking, though, you may come to similar conclusions using SEM, the two-step approach, or the sum score, at least in ideal scenarios.

Bias

There are a few ways in which we could assess bias. To begin, for each scenario we can estimate the quantiles of the coefficients generated across the 1000 data sets, then see whether the interval captures the true effect. The following shows such a result. The sum score and single item results are not included, due to their inherent bias already discussed.


The following instead assumes a normal distribution and uses the average coefficient and estimated standard error across the data sets for each scenario. In other words, we calculate \(\bar{\text{coef}}\pm 1.96*\bar{\text{se}}\) for each of the 54 scenarios. Here we see more faltering in the poorer scenarios.
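
A sketch of that calculation, assuming a hypothetical results data frame with one row per fitted model (columns coef, se, and scenario; not shown here):

library(dplyr)

results %>%                    # hypothetical: one row per model per data set
  group_by(scenario) %>%
  summarise(lower = mean(coef) - 1.96 * mean(se),
            upper = mean(coef) + 1.96 * mean(se))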



And finally, we can use the raw coefficients and standard errors from every model on every data set, and see if the 95% confidence interval would capture the true parameter. The following shows the proportions in which this is the case. Problems arise with poor measures, and SEM might have an advantage in coverage for the moderate loading situations when the sample size is small.
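
The corresponding per-data-set coverage, again assuming the hypothetical results data frame (with a true_effect column added per scenario):

results %>%
  mutate(covered = coef - 1.96 * se <= true_effect &
                   true_effect <= coef + 1.96 * se) %>%
  group_by(scenario) %>%
  summarise(coverage = mean(covered, na.rm = TRUE))   # NAs come from failed SEM fits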


SEM problems

Full SEM on 100 observations with few items and low loadings consistently resulted in problems fitting the model, and with poor loadings problems could still arise with larger samples. This isn’t at all surprising, given that SEM is a large-sample technique that requires well-behaved data. The NAs in the table are the result of convergence problems, such that no estimate could be provided. If one looks at the coefficients, it’s clear that the estimates that do result in those settings are often useless anyway.

Summary

If you have poor items and a small sample, it should be no surprise that your model will do poorly in recovering the true effect. This may seem obvious, but I’ve regularly seen clients, and results reported in papers, attempt models with unreliable measures.

Without high loadings, using a randomly chosen single item is a problematic endeavor at best. Using the sum score regularly resulted in lower estimates, but this is expected, as the indicators used in its construction are not perfectly reliable, and lack of reliability in variables attenuates their correlations with other variables. In addition, using a sum/mean assumes that all items are equally useful, which, even if true in theory, will rarely hold in practice. I can’t think of a reason to use either approach in a statistical model relative to a latent variable score, unless you know that the indicators/items are perfectly measured⁴. The best one could hope for is equal performance.

Using a two-step approach is apparently quite feasible in low-N settings, even with fewer and poorer items. It underestimated in the worst settings of few items and low loadings, though not as badly as the non-SEM approaches. With moderate loadings, the two-step approach slightly underestimated the effect relative to the SEM approach, but was at least in the ballpark. In more ideal settings with strong loadings, it was more or less indistinguishable from SEM.

In poorer settings, standard maximum likelihood SEM struggled, and even when results were obtained, they were unreliable in the worst scenarios. SEM’s ability to estimate the model was a function of the size of the loadings, the sample size, and the number of items, more or less in that order. With moderate loadings and larger samples, feel free to use SEM for a simple model such as this.

In general, if you have multiple measures of some latent construct, or can feasibly frame the data situation and model in such a way, you’re better off using the latent variable rather than a sum score or single item. It is one thing to have to make a practical decision given an individual score, such as a clinical diagnosis⁵. When running a statistical model in which we seek to uncover relationships among the variables of interest and make predictions, that is not the situation we find ourselves in. Even if the suggested way to use an inventory in practice is to create a sum score, for analysis use the latent variable.

There is no free lunch, however. There are many ways to create factor scores, and factor scores are not uniquely determined, meaning an infinite number of scores could be created that would be consistent with the factor loadings. None of the standard approaches is perfect⁶, but see the Grice reference for more information. In the end, you’ll still have choices to make, but hopefully you’ll feel better about those you make.

Results: Multiple factor models

Some other time.

References

Estabrook & Neale (2013). A Comparison of Factor Score Estimation Methods in the Presence of Missing Data.

Grice (2001). Computing and Evaluating Factor Scores.

Revelle. An introduction to psychometric theory with applications in R.


  1. Despite this situation, it is by far the most commonly used approach in science. One wonders what’s been missed.

  2. See Revelle’s freely available text on psychometrics, chapter 7 specifically. As noted there, Charles Spearman was hip to this problem back in 1904. Technically, \(\beta_{\text{true}} = \frac{\beta_\text{est}}{\sqrt{\alpha_x\alpha_y}}\), but we’re assuming only random error in \(y\).

  3. \(se = \sqrt{\sigma_y^2/\text{ss}_x}\), where \(\text{ss}_x = \sum_{i=1}^n(x_i-\bar{x})^2\). Since x is standardized with mean 0 and standard deviation 1, and \(\sigma_y=1\), this amounts to \(se \approx 1/\sqrt{N}\).

  4. Hi ‘hard’ sciences!

  5. Assuming one is okay with ignoring the variability in that sum score and is willing to engage in potentially life altering behavior for said individual based on a completely arbitrary cutoff.

  6. I’ve actually estimated the scores as parameters via a Bayesian approach. The estimated values are nearly identical to those produced by the lavaan function lavPredict. Stan code available at the github folder for this doc.