Mixture Models

Thus far we have understood latent variables as possessing an underlying continuum, e.g. normally distributed with a mean of zero and some variance. This does not have to be the case, and we can instead posit a categorical latent variable. You may already be familiar with some approaches here, as any modeling process under the heading of cluster analysis could be said to deal with latent categorical variables. The idea is that we suspect some underlying structure to the data that is best described as discrete, and possibly based on multiple variables.

We will approach this from the statistical and related perspectives, rather than the SEM/psychometric-specific one. This will hopefully make clearer what we’re dealing with, and keep us from getting bogged down in terminology. Furthermore, mixture models are often poorly implemented within SEM, as many of the typical issues found in such models can be magnified there. The goal here, as before, is clarity of thought over ‘being able to run it’.

A common question in such analyses is: how many clusters? There are many, many techniques for answering this question, and not a single one of them is even remotely definitive. On the plus side, we already know the answer, because the answer is always 1. However, that won’t stop us from trying to discover more than that, so here we go.

A Motivating Example

Take a look at the following data. It regards the waiting time between eruptions and the duration of the eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.


Take a good look. This is probably the cleanest separation of clustered data you will ever see in practice58, at least where the categories wouldn’t be obvious without any statistical analysis. Even so, there are still data points that might fall into either cluster.

Create Clustered Data

To get a sense of mixture models, let’s actually create some data that might look like the Old Faithful data above. In the following we start by creating something similar to the eruptions variable in the faithful data. To do so, we draw one random sample from a normal distribution with a mean of 2, and another with a mean of 4.5, with both getting a standard deviation of .25. The first plot is based on the code below, the second on the actual data.
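A minimal sketch of that simulation (the sample sizes and seed are my own assumptions):

set.seed(1234)

# two normal components as described above; 100 draws each is an assumption
x1 <- rnorm(100, mean = 2,   sd = .25)
x2 <- rnorm(100, mean = 4.5, sd = .25)
eruptions_sim <- c(x1, x2)

hist(eruptions_sim, breaks = 30, main = 'Simulated eruptions')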

What do we see here? The data is a mixture of two normals, but we can think of the observations as belonging to latent classes, where each class has its own mean and standard deviation (based on a normal distribution here, though it doesn’t have to be). Each observation has some probability, however great or small, of coming from either cluster, and had we really wanted a more appropriate simulation, we would have drawn each observation’s class membership probabilistically rather than fixing the group sizes.

A basic approach for categorical latent variable analysis from a model based perspective59 could be seen as follows:

  1. Posit the number of clusters you believe there to be
  2. For each observation, estimate the probability of its coming from each cluster
  3. Assign observations to the most likely class (i.e. the one with the highest probability)

More advanced approaches might include:

  • Predict the latent classes with other covariates in a manner akin to logistic regression
  • Allow your model coefficients to vary based on cluster membership
    • For example, have separate regression models for each class
  • Use more recent techniques that will allow the number of clusters to grow with the data (see the BNP section)

Mixture modeling with Old Faithful

The following uses the flexmix package and the function of the same name to estimate a regression model for each latent class. In this case, our model includes only an intercept, and so is equivalent to estimating the mean and variance of each group. We posit k = 2 groups.
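Loading the package and making the call shown in the output below (the object name mod_mix is my own):

library(flexmix)

mod_mix <- flexmix(eruptions ~ 1, data = faithful, k = 2)
summary(mod_mix)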


Call:
flexmix(formula = eruptions ~ 1, data = faithful, k = 2)

       prior size post>0 ratio
Comp.1 0.652  177    190 0.932
Comp.2 0.348   95     98 0.969

'log Lik.' -276.3611 (df=5)
AIC: 562.7222   BIC: 580.7512 

We can see from the summary that about 2/3 of the observations are classified into one group. We also get the estimated means and standard deviations for each group, as well as each observation’s probability of belonging to each class. Note that the group labels are completely arbitrary.
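These can be pulled from the fitted object, continuing with the mod_mix name assumed earlier:

parameters(mod_mix)                   # component means (intercepts) and sigmas
round(posterior(mod_mix)[1:10, ], 4)  # class probabilities, first 10 observations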

                    Comp.1    Comp.2
coef.(Intercept) 4.2735128 2.0187913
sigma            0.4376169 0.2363539
        [,1]   [,2]
 [1,] 1.0000 0.0000
 [2,] 0.0000 1.0000
 [3,] 1.0000 0.0000
 [4,] 0.0001 0.9999
 [5,] 1.0000 0.0000
 [6,] 0.8384 0.1616
 [7,] 1.0000 0.0000
 [8,] 1.0000 0.0000
 [9,] 0.0000 1.0000
[10,] 1.0000 0.0000

The first of the next two plots shows the estimated probabilities for each observation for the clusters (with some jitter). Basically it shows that most observations have a very high probability of belonging to one class or the other. Again, you will probably never see anything like this in practice, but the clarity is useful here. The second plot shows the original data with their classifications, along with contours of the estimated density for each group.

SEM and Latent Categorical Variables

Dealing with categorical latent variables can be somewhat problematic. Interpreting a single SEM model might be difficult enough, but then one might be allowing parts of it to change depending on which latent class observations belong to, while having to assess the latent class measurement model as well. It can be difficult to find clarity of understanding in this process, as one is discovering classes and then immediately assuming key differences among classes one didn’t know existed beforehand60. In addition, one will need even more data than standard SEM requires, to deal with all the additional parameters that are allowed to vary across the classes.

Researchers also tend to find classes that represent ‘hi-lo’ or ‘hi-med-lo’ groups, which may suggest they should have left the latent construct as a continuous variable. When given the choice to discretize continuous variables in other settings, it is rare that the answer is anything but a resounding no. As such, one should think hard about the ultimate goals of such a model. See the IRT section for alternative approaches to categorical data.

Another issue I’ve seen in applied practice is that the ‘latent’ classes uncovered are in fact intact classes representing an interaction of variables available in the data, or possibly some grouping variable that wasn’t in the model. It would be much more straightforward to use the standard interaction approach in a more guided fashion, as opposed to assuming every variable’s effect interacts with a latent variable. In typical modeling scenarios such an option (i.e. laying waste to our model with interactions) is relatively rarely considered, and even then we’d want some sort of regularizing approach61, which is typically non-existent in this SEM setting.

All that said, let’s go ahead and look at a traditional latent class approach. We’ll use the poLCA package, which is geared specifically to such analysis. The term ‘latent class analysis’ would suggest binary outcomes, but as we’ve noted before, using a latent categorical variable does not require them (nor does this package). Our example will likewise not be binary data.

Recall the depression data from the section on latent variable models. We have three items regarding emotional well-being (depression): how often the person felt down or blue, how often they’ve been a happy person, and how often they’ve been depressed in the last month. These are four-point scales ranging from ‘all of the time’ (1) to ‘none of the time’ (4) (i.e. reversed).

The conceptual model is still the same, except that we are thinking of the latent variable as categorical, and thus people will fall under depression types rather than have a continuous score that reflects ‘depression’.

We now run the model. As with the mixture models before (and to be clear, we are still conducting a finite mixture model), one of our primary goals is to assign individuals to a latent class. In addition, we can obtain the probabilities of each item category according to latent class. If we had covariates, we could also look at their relationship to the latent classes.
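A sketch of the call (the item names follow the output below; the data object name is hypothetical):

library(poLCA)

lca_mod <- poLCA(
  cbind(FeltDown, BeenHappy, DepressedLastMonth) ~ 1,
  data   = depression_data,  # hypothetical name for the depression data
  nclass = 2
)

# lca_mod$posterior holds each individual's class probabilities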

Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$FeltDown
          All of the time None of the time Rarely Some of the time
class 1:            0.000           0.4997 0.5003           0.0000
class 2:            0.047           0.0085 0.6637           0.2809

$BeenHappy
          All of the time None of the time Rarely Some of the time
class 1:           0.1418           0.0045 0.1069           0.7468
class 2:           0.0178           0.0518 0.7008           0.2296

$DepressedLastMonth
          All of the time None of the time Rarely Some of the time
class 1:           0.0025           0.9290 0.0654           0.0031
class 2:           0.0235           0.2328 0.6388           0.1049

Estimated class population shares 
 0.6828 0.3172 
 
Predicted class memberships (by modal posterior prob.) 
 0.6251 0.3749 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 7183 
number of estimated parameters: 19 
residual degrees of freedom: 44 
maximum log-likelihood: -17515.44 
 
AIC(2): 35068.88
BIC(2): 35199.59
G^2(2): 1254.267 (Likelihood ratio/deviance statistic) 
X^2(2): 4237.753 (Chi-square goodness of fit) 
 

In general, aside from poLCA reordering the variable labels alphabetically, this is a fairly clear result. The first class, comprising about 2/3 of the individuals, consists of those who rarely or never felt down, were not depressed in the past month, and felt happy some or all of the time. The other class represents our more depressed individuals.

However, this is a good example of a situation where it would be notably more appropriate to treat depression as continuous rather than categorical. We have a lot more output from the LCA, but nothing is really gained in our understanding of the latent aspect underlying the scores. Furthermore, we haven’t actually moved away from a continuous latent variable, as the probabilities of class membership are continuous, and we are merely picking an arbitrary cutpoint to create the latent class.

Latent Categories vs. Multi-group Analysis

The primary difference between latent class analysis and a multiple group approach is that in the latter, the grouping structure explicitly exists in the data, e.g. sex, race, etc. In that case, a multi-group analysis, a.k.a. multi-sample analysis, would allow for separate SEM models per group. In the latent categorical variable situation, one must first discover the latent groups. In multi-group analysis, by contrast, a common goal is to test measurement invariance, a concept which itself has several definitions. An example would be to see if the latent structure holds for a U.S. vs. non-U.S. sample, with the same scale items provided in the respective languages. This makes a lot of sense from a measurement model perspective and has some obvious use.
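As a sketch of how the multi-group case looks with lavaan (the grouping variable and data object name here are hypothetical):

library(lavaan)

model <- 'dep =~ FeltDown + BeenHappy + DepressedLastMonth'

# separate measurement models are estimated for each observed group
fit <- cfa(model, data = depression_data, group = 'country')

# invariance testing proceeds by adding equality constraints, e.g.
# cfa(model, data = depression_data, group = 'country',
#     group.equal = c('loadings', 'intercepts'))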

If one wants to examine a similar situation for a less practically driven model, e.g. to see whether an SEM model is the same for males vs. females, this is equivalent to having an interaction with the sex variable for every path in the model. The same holds for ‘subgroup analysis’ in typical settings, where you won’t find anything more than you would by including the interactions of interest with the whole sample, though you will certainly have less data to work with. In any case, whether the classes are latent or intact, we need a lot of data to estimate parameters that are allowed to vary by group relative to a model in which they are fixed, and many simply do not have enough data for this.

Latent Trajectories

As noted in the growth curve modeling section, these are growth curve models in which intercepts and slopes are allowed to vary across latent clusters. The flexmix package used previously, as well as others, would allow one to estimate such models from the mixed model perspective, which might be preferred.
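A rough sketch with flexmix, assuming hypothetical long-format data with outcome y, a time variable, and an id for each individual; the grouping term after | ensures all of an individual’s observations are assigned to the same latent class.

library(flexmix)

# per-class growth model; classes assigned per individual via the | id grouping
growth_mix <- flexmix(y ~ time | id, data = d_long, k = 2)
summary(growth_mix)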

Estimation

If you would like to see the conceptual innards of estimating mixture models using the EM Algorithm, see my GitHub page for some examples.
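In that spirit, here is a minimal conceptual sketch of EM for a two-component univariate normal mixture (the function is my own illustration and is not meant to be robust):

em_mix <- function(x, tol = 1e-8, maxit = 1000) {
  # crude starting values
  mu     <- unname(quantile(x, c(.25, .75)))
  sigma  <- c(sd(x), sd(x))
  pi_k   <- c(.5, .5)
  ll_old <- -Inf

  for (i in seq_len(maxit)) {
    # E step: each observation's probability of belonging to class 1
    d1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
    d2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
    gamma <- d1 / (d1 + d2)

    # observed-data log-likelihood at the current parameters
    ll <- sum(log(d1 + d2))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll

    # M step: weighted means, standard deviations, and mixing proportions
    mu    <- c(weighted.mean(x, gamma), weighted.mean(x, 1 - gamma))
    sigma <- c(sqrt(weighted.mean((x - mu[1])^2, gamma)),
               sqrt(weighted.mean((x - mu[2])^2, 1 - gamma)))
    pi_k  <- c(mean(gamma), 1 - mean(gamma))
  }

  list(mu = mu, sigma = sigma, pi = pi_k,
       cluster = ifelse(gamma > .5, 1, 2), loglik = ll)
}

# e.g. em_mix(faithful$eruptions)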

Terminology in SEM

Some unnecessary terminology comes into play from the SEM literature, so it might be worth knowing it in order not to get out of sorts. The SEM literature often distinguishes models by whether the latent and observed variables are continuous or categorical:

                        Observed: continuous       Observed: categorical
  Latent: continuous    Factor analysis            Latent trait analysis
  Latent: categorical   Latent profile analysis    Latent class analysis

Aside from noting whether the latent variable is categorical or not, these distinctions aren’t very enlightening62, and they go back as far as the 60s. Back then perhaps they were useful, but in the end, it’s all just ‘latent variable analysis’.

Some other terminology includes:

Mixture models Refers generally to dealing with categorical latent variables as demonstrated above.

Finite Mixture models Same thing.

Cluster analysis Same thing.

Infinite Mixture model A Bayesian nonparametric version, in which the number of clusters is not fixed in advance (see the BNP section).

Latent Class Analysis refers to dealing with categorical latent variables in the context of multivariate data, especially within the measurement model scenario. For example one might have a series of yes/no questions on a survey, and want to discover categories of collections of responses. Some use it for the specific case of categorical indicators, but this is not necessary.

Latent Profile Analysis refers to dealing with categorical latent variables in the context of multivariate numerical/continuous data. Again, most people outside of SEM would simply call this a mixture model.

Latent Trait Analysis refers to dealing with continuous latent variables in the context of multivariate categorical data. This is the setting for Item Response Theory models, where we, for example, estimate a latent ‘ability’ underlying binary test items scored correct vs. not.

Summary

In general, humans are predisposed to think and label things in terms of categories because it presumably simplifies things. More often than not, it’s an oversimplification, and has a long history of causing more trouble than it’s worth. Furthermore, we know from our study of reliability that if we categorize a continuous variable, it is a less reliable measure, and potentially causes numerous issues statistically. One should think long and hard about positing types instead of allowing more nuance. Even so, it might be appropriate for a specific avenue of study, in which case mixture models may serve one well.

R Packages Used

  • flexmix
  • poLCA

Note that one shortcoming of lavaan relative to Mplus is the lack of latent class estimation. Along with flexmix, where the latent classes are on the predictor side of the equation and serve to moderate the predictor effects, one might also consider poLCA, which models latent classes for an outcome similar to traditional LCA within the SEM setting; this is essentially logistic regression where the classes to be predicted are latent. However, as of early 2017, while some have inquired about such models, I’ve not come across an R package that easily does both, i.e. simultaneously estimates latent classes on either side of the regression equation. Mplus can apparently do so, at least with simulated data, but it’s not clear how easily it’s been used in practice. The number of parameters to interpret would easily become overwhelming, not to mention ratios of odds ratios.


  58. Outside of iris, which is also used regularly to give people unrealistic expectations about what to expect from their cluster analysis.

  59. Note that k-means and other common cluster analysis techniques are not model based as in this example. In model-based approaches we assume some probabilistic data generating process (e.g. a normal distribution) rather than some heuristic.

  60. By contrast, continuous latent variables are typically clearly theoretically motivated in usual SEM practice. In addition, SEM data set sizes would usually limit the number of groups to three or four at most, because the smallest latent group will be the limiting factor, and it still needs enough data to run an SEM model confidently. However, there is no reason to believe there would be only three distinct latent classes in any reasonable amount of data.

  61. See, for example, Rendle (2010), Factorization Machines.

  62. I can never keep profile vs. trait straight.