Michael Clark
Not everything we want to measure comes with an obvious yardstick. If one wants to measure something like a person’s happiness, what would they have at their disposal?
Any number of indicators might be useful as a gauge of their current state of happiness, but of course none of them would tell us whether they truly are happy or not. At best, they can be considered imperfect measures. If we consider several such indicators collectively, perhaps we can get an underlying measure of something we might call happiness, contentment, or some other arbitrary but descriptive name.
The above describes a latent variable, and gives at least one sense of how we can think about them. However, models incorporating latent variables show up all over, and are incorporated in many ways. Broadly speaking, factor analysis can be seen as a dimension reduction technique, as an approach to modeling measurement error and understanding underlying constructs, as a means to understand large amounts of text, as a way to make movie recommendations, and more.
Some key ideas:
Each section will provide some suggested packages to serve as starting points for standard implementations, most with code demonstrations. Just note that there will likely be several packages beyond those suggested that provide enhancements in a variety of ways. Almost all the data is available from R packages (package::data):
psych::bfi
psych::ability
lda::poliblog.documents
datasets::UScitiesD
datasets::faithful
recommenderlab::MovieLense
poLCA::election
To load them, load the package and type data('datasetname'), or otherwise data('datasetname', package='pack').
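For example, for the first dataset listed:

library(psych)
data('bfi')
# or, without attaching the package first
data('bfi', package = 'psych')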
And while the MNIST data is available in various places, you can download the processed threes used in the demos here.
This document also makes use of the tidyverse family of packages, so you’ll want to have at least dplyr loaded.
This document is meant to be high level. For more details, peruse my document on Graphical & Latent Variable Models. See also fellow CSCAR consultant James Henderson's recent slides regarding PCA, factor analysis, and multidimensional scaling for a bit more technical detail and additional examples regarding those techniques. For a more technical overview of many matrix factorization techniques, but with nice summary tables, see Singh & Gordon, A Unified View of Matrix Factorization Models.
We’ll start with one of the most widely used techniques for dimension reduction. With principal components analysis (PCA), we seek to find orthogonal (i.e. uncorrelated) factors that maximize the variance accounted for by each factor. For some data, \(X\) is the \(N \times D\) matrix of \(N\) observations of \(D\) variables. We decompose \(X\) into two matrices: \(Z\) is the \(N \times L\) matrix of component scores, where \(L \leq D\), while \(W\) is the \(D \times L\) matrix of weights, typically referred to as the factor loading matrix.
\[X = ZW'\]
Each created component, or factor, accounts for successively less of the variance in the original data. So, while PCA returns the same number of components as there are columns in the original data, the big idea is that we want to use far fewer. However, with all components, we could reconstruct the data exactly.
Take for example, the MNIST data, which contains 28 by 28 pixel images of the digits 0-9. If we unroll the data to 784 columns, each row represents a single digit, with each entry the grayscale value. We can see in the following how well we can reconstruct a digit via PCA. Even with only two components we can get a sense of what we’re looking at. With all components, the reconstruction is perfect.
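Here is a rough sketch of that kind of reconstruction using base R's prcomp, assuming the unrolled digit matrix from the downloadable threes_mnist.RData (used again in the NMF demo later) is loaded as threes, one 784-pixel digit per row:

load('data/threes_mnist.RData')
pca_digits = prcomp(threes)                    # centers the columns by default

k = 2                                          # number of components to keep
recon = pca_digits$x[, 1:k] %*% t(pca_digits$rotation[, 1:k])   # Z W'
recon = sweep(recon, 2, pca_digits$center, `+`)                 # add the column means back

# compare the first digit to its k-component reconstruction
image(matrix(threes[1, ], 28, 28)[, 28:1], main = 'original')
image(matrix(recon[1, ], 28, 28)[, 28:1], main = paste(k, 'components'))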
Let’s see an example with more digestible results. The following data contain 25 personality items regarding the Big 5 scales. There are 2800 total subjects, and five categories of questions1.
We’ll just look at part of this to keep things simple, namely the agreeableness and neuroticism items.
library(psych)
library(dplyr)  # for the pipe and select()

bfi_trim = bfi %>% 
  select(matches('A[1-5]|N[1-5]'))
So let’s do a PCA. We’ll look at two components, though remember that PCA technically returns as many components as there are variables. You’re just seeing part of the results. Note that PCA almost universally requires you to standardize the data, so that each variable has the same variance. Otherwise you’ll just get components reflecting the relative variances. In what follows the PCA is done on the correlation matrix, which amounts to the same thing.
pc = principal(bfi_trim, 2)
pc
Principal Components Analysis
Call: principal(r = bfi_trim, nfactors = 2)
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 h2 u2 com
A1 0.09 -0.50 0.25 0.75 1.1
A2 0.03 0.77 0.60 0.40 1.0
A3 -0.01 0.80 0.63 0.37 1.0
A4 -0.07 0.61 0.38 0.62 1.0
A5 -0.16 0.70 0.52 0.48 1.1
N1 0.81 -0.12 0.67 0.33 1.0
N2 0.80 -0.11 0.65 0.35 1.0
N3 0.81 -0.03 0.66 0.34 1.0
N4 0.68 -0.15 0.49 0.51 1.1
N5 0.66 0.05 0.43 0.57 1.0
RC1 RC2
SS loadings 2.90 2.40
Proportion Var 0.29 0.24
Cumulative Var 0.29 0.53
Proportion Explained 0.55 0.45
Cumulative Proportion 0.55 1.00
Mean item complexity = 1
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 1923.02 with prob < 0
Fit based upon off diagonal values = 0.91
First, focus on the portion of the output where it says SS loadings. The first line is the sum of the squared loadings2 for each component (in this case, where we are using a correlation matrix as the basis, summing across all 10 possible components would equal 10). The Proportion Var tells us how much of the overall variance the component accounts for out of all the variables (e.g. 2.9 / 10 = 0.29). The Cumulative Var tells us that 2 components account for over 53% of the variance. The Proportion Explained and Cumulative Proportion are the same thing, just based on the 2 retained components rather than all 10 variables (i.e. the cumulative explained variance would equal 1). We can see that the second component accounts for less variance, and this would continue with additional components, each accounting for a decreasing amount of variance.
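We can verify those values directly from the loadings shown above (a quick check using the pc object from before):

ss = colSums(pc$loadings[, 1:2]^2)   # sum of squared loadings per component
ss
ss / ncol(bfi_trim)                  # proportion of total variance (10 variables)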
Some explanation of the other parts of the output:
- h2: the amount of variance in the item/variable explained by the (retained) components. It is the sum of the squared loadings, a.k.a. the communality. For example, the A4 item shares only 38% of its variance with these components.
- u2: 1 - h2.
- com: a measure of item complexity. A value of 1 would be seen for an item that loads on only one component and has zero loadings on the others (a.k.a. perfect simple structure).
- RC: these factors are labeled RC to denote they are rotated components. They’ve been rotated to aid interpretability of the loadings.

Loadings, also referred to as the pattern matrix, in this scenario represent the estimated correlation of an item/variable with the respective components, and provide the key way in which we interpret the factors. As an example, we can reproduce the loadings by correlating the observed variables with the estimated component scores3.
cor(bfi_trim, pc$scores, use = 'pair') %>% round(2)
RC1 RC2
A1 0.09 -0.50
A2 0.03 0.77
A3 -0.02 0.80
A4 -0.08 0.62
A5 -0.17 0.70
N1 0.81 -0.12
N2 0.80 -0.12
N3 0.81 -0.04
N4 0.69 -0.15
N5 0.66 0.05
It can be difficult to sort it out just by looking at the values, so we’ll look at it visually. In the following plot, stronger and more positive loadings are indicated by blue (negative are indicated by red), and we can see the different variables associated with different components. Note that the smaller negative loading comes from a reverse scored item (‘I am indifferent to the feelings of others’)4.
Many would use what’s called a biplot, which shows both observations and variables on the same plot, with components on the axes, the component scores of the data as points, and the loadings as vectors whose length and direction specify the strength of the loading on a particular component.
The shearing of the data here is just due to the fact that we have variables that only take on 6 values, and we are looking at them in terms of continuous PC scores. Biplots often take a bit to figure out even for two components, especially when one doesn’t have a clean result, and one can’t really get a sense of more than two at a time. However, using a graphical model makes clear what’s going on for all retained components. Low loadings have been suppressed for additional clarity.
Interpretation is the fun, but commonly difficult, part. In this case, the variables group like we’d expect5. So, we have one component that represents agreeableness, and one that represents neuroticism.
It’s worth mentioning the naming fallacy at this point. Just because we associate a factor with some concept, doesn’t make it so. In addition, for PCA the goal is not interpretation, but to reduce the data while retaining as much of the variance as possible.
Now let’s examine what is sometimes called common factor analysis, and also sometimes exploratory factor analysis in the social sciences, and even ‘classical’ or ‘traditional’, but typically just factor analysis (FA) everywhere else. While we can use both PCA and FA for similar reasons (dimension reduction) and even have similar interpretations (in terms of loadings), there are some underlying subtleties between the two that provide unique distinctions.
First let’s revisit PCA, which we can depict conceptually as an approach in which we attempt to reproduce the correlation matrix \(R\) in terms of the product of components, represented by our loading matrix \(W\).
\[R = WW'\] and
\[C = XV\]
In other words, with the complete component result we can exactly reproduce the correlation matrix. Furthermore, each component score \(C\), i.e. the score for a particular observation with regard to the component, is a weighted combination of the \(D\) observed variables \(X\), the weights (with weight matrix \(V\)) of which are determined by the loadings6.
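As a quick numerical check of the first relation, we can compare the correlations reproduced from the loadings with the observed ones. With only the 2 retained components from before the reproduction is approximate; with all 10 it would be exact.

# reproduced vs. observed correlations for the first few items
round((pc$loadings %*% t(pc$loadings))[1:4, 1:4], 2)
round(cor(bfi_trim, use = 'pair')[1:4, 1:4], 2)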
Things are different with factor analysis. Now the causal/philosophical flow is in the other direction, originating with the latent variables.
\[X \approx ZW\]
Each observed variable \(x\) is a function of the latent variables that it is associated with. In addition, we also take into account the uniquenesses \(\Psi\), or that part which the factors do not explain.
And to reproduce the correlation matrix:
\[R \approx WW'\] \[R = WW' + \Psi\]
So if we just use the loadings from the FA, we cannot reproduce the correlation matrix exactly, we need to add the uniquenesses as well7.
What this amounts to conceptually are a few key ideas:
Now let’s do a factor analysis using the same data as we did for the PCA.
fa_model = fa(bfi_trim, 2)
fa_model
Factor Analysis using method = minres
Call: fa(r = bfi_trim, nfactors = 2)
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
A1 0.07 -0.36 0.14 0.86 1.1
A2 0.05 0.69 0.47 0.53 1.0
A3 0.03 0.76 0.56 0.44 1.0
A4 -0.05 0.47 0.24 0.76 1.0
A5 -0.12 0.60 0.39 0.61 1.1
N1 0.78 -0.03 0.61 0.39 1.0
N2 0.76 -0.02 0.58 0.42 1.0
N3 0.77 0.05 0.58 0.42 1.0
N4 0.58 -0.08 0.36 0.64 1.0
N5 0.54 0.08 0.29 0.71 1.0
MR1 MR2
SS loadings 2.44 1.78
Proportion Var 0.24 0.18
Cumulative Var 0.24 0.42
Proportion Explained 0.58 0.42
Cumulative Proportion 0.58 1.00
With factor correlations of
MR1 MR2
MR1 1.00 -0.19
MR2 -0.19 1.00
Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 2.82 with Chi Square of 7880.99
The degrees of freedom for the model are 26 and the objective function was 0.23
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
The harmonic number of observations is 2759 with the empirical chi square 396.78 with prob < 5.7e-68
The total number of observations was 2800 with Likelihood Chi Square = 636.27 with prob < 1.6e-117
Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.092 and the 90 % confidence intervals are 0.085 0.098
BIC = 429.9
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
MR1 MR2
Correlation of (regression) scores with factors 0.92 0.88
Multiple R square of scores with factors 0.84 0.77
Minimum correlation of possible factor scores 0.68 0.54
The results in terms of loadings are similar to the PCA, but we can see some differences. The PCA components accounted for 53% of the variance, while here we only account for 42%. In addition, we have an estimate of the correlation of the latent factors, -0.19, where the PCA components are uncorrelated by definition. And finally, there are measures of ‘fit’ that come from the structural equation modeling context; see the SEM document for more detail on those measures. I will note, however, that BIC is provided, which allows for a way to compare different factor solutions.
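Since BIC is available, one rough way to compare solutions with different numbers of factors is something like the following sketch (lower BIC is preferred in this output):

sapply(1:3, function(nf) fa(bfi_trim, nf)$BIC)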
Some distinguish between exploratory factor analysis and confirmatory factor analysis, but all the latter is doing is constraining some loadings/paths to be zero. Otherwise they are identical techniques, or can be made to be. For example, if we only had a single factor, the results from ‘confirmatory’ and ‘exploratory’ FA are identical if they use the same estimation techniques (e.g. maximum likelihood).
Why would you set loadings or factor correlations to zero? Theory would suggest it. For example, the agreeableness items shouldn’t be a measure of neuroticism after accounting for the correlation of the factors, so we can set things up to coincide with our substantive assumptions. In other cases, such a constraint may simply be a simplifying assumption rather than a theoretical claim.
The following shows how to do this with the structural equation modeling package lavaan. One just has to specify each latent variable and the corresponding observed indicators that go with it.
modelCode = "
agree =~ A1 + A2 + A3 + A4 + A5
neurot =~ N1 + N2 + N3 + N4 + N5
# not necessary
# agree ~~ neurot
# if you don't want them correlated
# agree ~~ 0*neurot
"
library(lavaan)

cfa_model = cfa(modelCode, data = bfi_trim)
summary(cfa_model, standardized=T, rsq=T, nd=2)
Here is a visualization of the results. Note how the arrows flow from the factor to the observed variables.
In general, our results are very similar to the previous factor analysis, but we didn’t have to estimate all the cross loadings, so this approach is a simpler model. As before, the factor explains what the items have in common, and what they do not explain is partly due to measurement error, or in other words, a lack of reliability of measurement. For more on confirmatory factor analysis, take a look here.
In the following we compare PCA and FA from a probabilistic modeling perspective. This further highlights their differences and similarities.
For factor analysis, the latent linear model can be expressed as follows:
\[ X \sim \mathcal{N}(ZW' + \mu, \Psi) \]
Here \(\mu\) are the intercepts and \(\Psi\) is a \(D \times D\) (diagonal) covariance matrix of the uniquenesses.
Probabilistic PCA (PPCA) constrains the uniquenesses to be equal across variables:
\[\Psi = \sigma^2I\]
The following plot shows the projection of the original data onto the principal component, where we have two variables and one component. PCA results in an orthogonal projection, while PPCA does not require this, and shrinks estimates toward the overall mean.
With standard PCA, we are assuming a noiseless process and constraining \(W\) to be orthogonal, i.e. the limit
\[\sigma^2 \rightarrow 0\]
Commonly our data regards counts or otherwise compositional data. For example, we might have counts of species seen at different sites, data regarding genetic expression, or the number of times words are seen in various documents. In this case we can use techniques that are better suited to such data, and provide a more interpretable solution.
Non-negative matrix factorization is similar to PCA, just with the constraint that scores and weights be non-negative.
\[X \approx ZW\]
It is (in the usual implementation) identical to what’s called probabilistic latent semantic analysis. The following shows how to conduct an NMF using the package of the same name.
load('data/threes_mnist.RData')
# many of the columns are useless
zero_idx = colSums(threes) == 0
library(NMF)
# nmf_50 = nmf(threes[,!zero_idx], rank = 50) # if you don't mind waiting
nmf_10 = nmf(threes[,!zero_idx], rank = 10)
# a current bug may prevent these from displaying in RStudio
coefmap(nmf_10)
basismap(nmf_10)
Here are the reconstruction results of the MNIST digits using a 10 component solution.
Latent Dirichlet Allocation (LDA) takes a different approach to dealing with count data. The classic example regards text analysis, where LDA is applied to word frequencies across documents. Consider a document-term matrix where each row regards a document, and each column a word or term. LDA looks for latent topics that are probability distributions over the terms. There is a probability distribution for the topics as well as the terms, and given both, we will see some occurrence of each term in each document. Each document can be seen as a mixture of topics, though some will be more probable than others.
In the factor analysis sense, each variable (the term) will have some probability of occurring for a given topic, i.e. latent factor. One can think of the term/variable probabilities for a specific topic similar to how we did loadings in the factor analysis sense. Every variable will have some non-zero probability of occurrence for a given factor, but often they are essentially zero. Factor scores are the document topic probabilities.
Given the probabilities of topics and terms within topics, there is a multinomial draw of counts for an observation. With this probabilistic result, there isn’t a ‘reconstruction’ connotation with LDA. However, we can simulate a multinomial draw based on the total counts we see. For example, let’s assume 5 terms seen a total of 20 times for a given document, given some probability of each term.
t(rmultinom(1, size = 20, prob=c(.5, .2, .15, .1, .05))) # first term is most likely, 5th term is least likely
[,1] [,2] [,3] [,4] [,5]
[1,] 9 6 2 1 2
Those are the counts we see for a given observation. This treats documents as a bag of words, where we don’t care about ordering. Surprisingly though, simply using the document-term matrix and this bag-of-words approach does very well at uncovering latent topics that make sense.
The following shows a text analysis of political blogs using the lda package. As in other factor analysis we’ve discussed, one must choose the number of latent topics.
library(lda)
data(poliblog.documents)
data(poliblog.vocab)
set.seed(8675309)
K <- 10 ## Num topics
# this will take at least a few seconds to run
result <- lda.collapsed.gibbs.sampler(poliblog.documents,
                                      K,               ## num topics
                                      poliblog.vocab,  ## vocabulary
                                      2000,            ## num iterations
                                      0.1,             ## alpha (document-topic prior)
                                      0.1)             ## eta (topic-word prior)
## Get the top words in the cluster
top.words <- top.topic.words(result$topics, 5)
This shows the topics (identified by their top 5 words) and their probability of occurrence in a sample of 10 blogs. For example, blog 1 is more about McCain, while blog 6 is more Democratic-leaning.
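Those per-blog topic proportions can be recovered from the sampler output; document_sums holds the per-document topic counts, so we just normalize them:

topic_props = t(result$document_sums) / colSums(result$document_sums)
round(topic_props[1:10, ], 2)   # first 10 blogs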
See my text analysis document for more details and examples. In my experience, this kind of data comes up quite often, and one should have LDA in their Factor Analysis Toolkit (FAT)™, and as readily available as PCA and FA. LDA wouldn’t be the best tool if your goal is prediction and you don’t care as much about interpretation. But when you do, it can be rewarding to use. Furthermore, there is supervised LDA where one can use the topics to simultaneously make predictions on outcomes of interest, and structured topic modeling where one can use covariates to model the topic probabilities. In short, you have plenty to play with here.
Here is the number 3 reconstructed with LDA.
Structural Equation Modeling (SEM) is a very general technique that allows one to incorporate latent variables and observed variables within a single modeling framework, with indirect effects and multiple outcomes. Practically any standard model you’ve conducted can be seen as a special case. We actually already did an SEM when we did the confirmatory factor analysis. Consider the following depiction of a mediation model with three latent variables, each of which would have its respective observed variables (not shown).
We can use factor analysis in the SEM context as a way to assess measurement error. Whatever variance a factor doesn’t explain is a combination of measurement error and random variation. Thus, we can use factor analysis as a means to assess a group of variables’ ability to measure the underlying construct they are supposed to measure.
This document is actually a subset of the SEM document. Check it out for lots more detail and examples.
Item response theory (IRT) can be seen as latent variable modeling for categorical data, with specific parameters that have theoretical underpinnings. It is especially applied in testing environments where the items are binary correct-vs.-not outcomes. An underlying logistic model relates the latent variable, which can be seen as ability in the testing context, to the observed items, which have unique difficulties. The conceptual latent variable model is identical to confirmatory factor analysis.
As an example, the one-parameter, a.k.a. Rasch, model (1PM) can be expressed as follows:
\[P(y=1|\theta, \delta) = \mathrm{logis}(\theta_i-\delta_j)\]
In this setting, the probability of endorsement (or getting an item correct), \(\pi_{ij}\), is a function of the difficulty of item \(j\), \(\delta_j\) above, and the latent trait (ability) of person \(i\), \(\theta_i\). In other words, it’s a specific type of logistic regression model. In the testing context, a person with more ‘ability’ relative to the item difficulty will answer correctly. In terms of the logit:
\[\mathrm{logit_{ij}} = \mathrm{log}(\frac{\pi_{ij}}{1-\pi_{ij}}) = \theta_i-\delta_j\]
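As a small numeric illustration with hypothetical values, here is the probability of a correct response for a person with ability \(\theta = 1\) on items with difficulties -0.5, 0, and 1.5:

plogis(1 - c(-0.5, 0, 1.5))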
The following shows an example. With the psych package a factor analysis is done on the tetrachoric correlation matrix of the binary verbal reasoning items. Compare with the standard IRT ‘two parameter’ model from ltm. While the parameterizations are different, they come to the same conclusions about ‘ability’, as can be seen in the plot. See my IRT chapter for more details.
data("ability", package = 'psych')
verbal = ability %>% 
  data.frame() %>% 
  dplyr::select(contains('reason')) %>% 
  tidyr::drop_na()
library(psych); library(ltm)
irt_fa = fa2irt(fa(verbal, fm='ml'), rho=tetrachoric(verbal), plot = F)
irt_model = ltm(verbal ~ z1)
# irt_fa$fa # to compare to the standard fa output
irt_fa$irt
$difficulty
$difficulty[[1]]
reason.4 reason.16 reason.17 reason.19
-0.5649361 -0.7011017 -0.8416509 -0.4474843
$discrimination
Loadings:
ML1
reason.4 0.746
reason.16 0.559
reason.17 0.894
reason.19 0.590
ML1
SS loadings 2.018
Proportion Var 0.505
summary(irt_model)
Call:
ltm(formula = verbal ~ z1)
Model Summary:
log.Lik AIC BIC
-3058.632 6133.263 6175.102
Coefficients:
value std.err z.vals
Dffclt.reason.4 -0.6156 0.0549 -11.2151
Dffclt.reason.16 -0.9684 0.0798 -12.1367
Dffclt.reason.17 -0.7647 0.0533 -14.3422
Dffclt.reason.19 -0.6010 0.0631 -9.5229
Dscrmn.reason.4 1.8781 0.1897 9.9001
Dscrmn.reason.16 1.3847 0.1356 10.2099
Dscrmn.reason.17 2.5396 0.3141 8.0852
Dscrmn.reason.19 1.4163 0.1357 10.4370
Integration:
method: Gauss-Hermite
quadrature points: 21
Optimization:
Convergence: 0
max(|grad|): 0.002
quasi-Newton: BFGS
The latent linear model versions of PCA and factor analysis assume the observed variables are normally distributed (even standard PCA won’t work nearly as well if the data aren’t). This is not required, however, and independent components analysis (ICA) makes no such assumption. This visualization duplicates one seen in Murphy (2012), where we have two (uniform) independent sources. We can see that ICA correctly recovers those components.
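A minimal sketch of the idea, assuming the fastICA package (not used elsewhere in this document) is installed:

library(fastICA)

set.seed(123)
S = cbind(runif(1000), runif(1000))   # two independent, uniform (non-normal) sources
A = matrix(c(1, 1, -1, 2), 2, 2)      # an arbitrary mixing matrix
X = S %*% A                           # the observed, mixed signals

ica_res = fastICA(X, n.comp = 2)
plot(ica_res$S)                       # recovered (unmixed) components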
When would you use ICA? If you believe that truly independent sources of signal underlie your data, and that those sources are not normally distributed. It is commonly applied to deal with images or sound8.
Multidimensional scaling9 is a technique for converting a matrix of observation dissimilarities into a low-dimensional map that preserves those distances as well as possible.
The first step in any MDS is choosing a similarity measure. Often this will be a metric or distance, e.g. Euclidean distance, or the straight-line distances between cities used below.
The basic idea is to come up with what we might call components or factors that minimize the difference in factor scores for two observations from the actual distance.
data(UScitiesD) # base R data set
UScitiesMDS <- cmdscale(UScitiesD)
  | Atl | Chi | Den | Hou | LA | Mia | NY | SF | Sea | DC |
---|---|---|---|---|---|---|---|---|---|---|
Atlanta | 0 | 587 | 1212 | 701 | 1936 | 604 | 748 | 2139 | 2182 | 543 |
Chicago | 587 | 0 | 920 | 940 | 1745 | 1188 | 713 | 1858 | 1737 | 597 |
Denver | 1212 | 920 | 0 | 879 | 831 | 1726 | 1631 | 949 | 1021 | 1494 |
Houston | 701 | 940 | 879 | 0 | 1374 | 968 | 1420 | 1645 | 1891 | 1220 |
LosAngeles | 1936 | 1745 | 831 | 1374 | 0 | 2339 | 2451 | 347 | 959 | 2300 |
Miami | 604 | 1188 | 1726 | 968 | 2339 | 0 | 1092 | 2594 | 2734 | 923 |
NewYork | 748 | 713 | 1631 | 1420 | 2451 | 1092 | 0 | 2571 | 2408 | 205 |
SanFrancisco | 2139 | 1858 | 949 | 1645 | 347 | 2594 | 2571 | 0 | 678 | 2442 |
Seattle | 2182 | 1737 | 1021 | 1891 | 959 | 2734 | 2408 | 678 | 0 | 2329 |
Washington.DC | 543 | 597 | 1494 | 1220 | 2300 | 923 | 205 | 2442 | 2329 | 0 |
We can use MDS to obtain a 2-dimensional map preserving distances as well as possible.
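For example, plotting the two dimensions returned by cmdscale (flipping an axis if needed so the map matches geography):

plot(UScitiesMDS, type = 'n', xlab = '', ylab = '')
text(UScitiesMDS, labels = rownames(UScitiesMDS))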
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a newer technique in dimension reduction, one that can be seen as a modern version of multidimensional scaling, that tries to put similar rows of data together. Similar in this case can be thought of as follows. If you had a single variable, similarity to data point \(x_i\) would be defined as points with high probability under a normal distribution with mean of the value \(x_i\). However, with several variables we’d be thinking of a multivariate normal distribution with a mean vector. t-SNE will then try to create a new coordinate system of points (call them components or factors if you like) that tries to maintain those similarities. It is particularly well suited to high-dimensional data, or at least some implementations of it are10. People have used it well for visualization purposes, but it’s no different than what people have been doing to visualize the results of MDS and PCA for decades, just that the underlying technique is different.
The t-SNE approach is more difficult to implement, as there are particular parameters to set beyond just the number of components, and they will definitely affect the visual result. For example, I could only duplicate the MDS result for the US cities data with some fiddling and luck. In addition, there is some randomness in the process, so you can get slightly different results each time. None of this really matters, as the only real stated goal with this technique is good visualization. You might also use it as a prelude to determining the number of clusters before cluster analysis.
library(tsne)
tsne_result_us = tsne(UScitiesD, k = 2, initial_dims=6, max_iter = 2000, perplexity = 4, epoch = 250)
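To see the result, we can plot the returned coordinates and add the city labels back from the distance object:

plot(tsne_result_us, type = 'n')
text(tsne_result_us, labels = attr(UScitiesD, 'Labels'))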
Here’s what it came up with for the MNIST data set.
Practically everyone has been exposed to recommender systems such as collaborative filtering and related models. That’s how Netflix, Amazon and others make their recommendations to you given the information you’ve provided about likes and dislikes, what other similar people have provided, and how similar the object of interest is to others.
The following image, taken from Wikipedia (click the image to go there), conceptually shows how a user-based collaborative filtering method would work, where a recommendation is given based on what other similar users have given.
Let’s go with movies as an example. You might only rate a handful, and indeed most people will not rate most movies. But at some point most movies will have been rated. So how can one provide a recommendation for some movie you haven’t seen? If we group similar movies into genres, and similar people into demographic categories and based on taste, one can recommend something from a similar genre of movies that you like, that people in the same demographic category seem to like as well.
If you think of genres as latent categories of movies, now you can employ the techniques we’ve talked about11. Similarly, we can find clusters of people using cluster analytic techniques. In short, collaborative filtering/recommender systems apply latent variable techniques to a specific type of data, e.g. ratings. More modern approaches will incorporate user and item characteristics, recommendations from other systems, and additional information. The following provides some code for you to play with, using a straightforward singular value decomposition on movie ratings, which is the same technique used in base R’s default prcomp function for PCA. You might compare it with method = 'POPULAR'.
library(recommenderlab)
data("MovieLense")
MovieLense
barplot(table(getRatings(MovieLense)), col = '#ff5500', border='#00aaff') # rating frequencies
recommenderRegistry$get_entries(dataType = "realRatingMatrix") # the methods available
recom_svd <- Recommender(MovieLense, method = "SVD")
recom_svd
After running it try some predictions.
# predicted ratings for two users
recom <- predict(recom_svd, MovieLense[2:3], type="ratings")
recom
as(recom, "matrix")[,1:10]
# comparison model
recom_popular <- Recommender(MovieLense, method = "POPULAR")
getModel(recom_popular)$topN
recom <- predict(recom_popular, MovieLense[2:3], type="topNList")
recom
as(recom, "list")
All in all, thinking about your data in terms of a recommendation system might not be too far-fetched, especially if you’re already considering factor analysis of some sort.
Hidden Markov models can be used to model latent discrete states that result in a sequence of observations over time. In terms of a graphical model, we can depict it as follows:
In this sense, we have latent variable \(z\) that represents the hidden state of a system, while the outcome is what we actually observe. There are three latent states above, and the relative width of the edge reflects the transition probability of moving from one state to the other. The icons are the categories of observations we can potentially see, and there is some probability, given a latent state of seeing a particular category. Such a situation might lead to the following observed sequence of observations:
We start in state 1, where the heart and yellow smiley are most probable. Let’s say we observe the heart. The transition to state 2 is most likely, so we move there, and because that state has some probability of staying put, we observe two of its more likely symbols in a row. We eventually reach state 3 and see its typical observation, return to state 1, see the smiley, jump back to latent state 3, and so on. Such a process continues for the length of the sequence we see.
This can be seen as a mixture model/latent class situation that we’ll talk about more later. The outcomes could also be continuous, such that the latent state determines the likelihood of the observation in a manner more directly akin to the latent linear model for standard factor analysis.
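A minimal sketch of such a process, assuming the HMM package (not used elsewhere here) and made-up transition and emission probabilities:

library(HMM)

hmm = initHMM(
  States  = c('state1', 'state2', 'state3'),
  Symbols = c('heart', 'smiley', 'club', 'star'),   # hypothetical observation categories
  transProbs    = matrix(c(.1, .8, .1,
                           .2, .6, .2,
                           .7, .1, .2), nrow = 3, byrow = TRUE),
  emissionProbs = matrix(c(.5, .4, .05, .05,
                           .1, .1, .4,  .4,
                           .1, .1, .1,  .7), nrow = 3, byrow = TRUE)
)

set.seed(1234)
sims = simHMM(hmm, length = 10)   # simulate hidden states and their observations
sims$observation
viterbi(hmm, sims$observation)    # most likely hidden state sequence given the observations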
For those familiar with mixed effects models that incorporate random effects, the random effects can be seen as latent variables in the same manner as we have been discussing here. In fact, with latent growth curve models, widely used in the SEM context, they are explicitly modeled as such, though it turns out these are just a different way to run a mixed model. For typical mixed models, the random effects are essentially a latent construct representing unknown sources of variance that aren’t captured by the model. In addition, the residual covariance structures in mixed models, and covariance functions in Gaussian process models more generally, can actually be modeled with a factor analytic approach where loadings are estimated to produce the covariances. And finally, certain IRT models can be reformulated as mixed models. Thinking about random effects as latent variables possibly opens the door for other interpretive opportunities when dealing with them in other contexts.
There are Bayesian versions for any of the model-based techniques we’ve discussed, essentially just adding priors for the parameters of interest. As mentioned, LDA can be seen as latent semantic analysis with priors for the topics. With standard factor analysis, one can actually estimate the factor scores directly as additional parameters in the model. In other approaches, it can require a bit of finagling to estimate the factor analytic result, as it is inherently unidentified and requires constraints.
Beyond that, Bayesian nonparametric techniques, such as Dirichlet process approaches, allow for the possibility of not having to specify the number of clusters or latent variables in advance. Instead, the number is allowed to grow as the data grows, according to the prior, and so these approaches can be applied to online learning situations as well. See my chapter on this for some examples.
Take a look at the following data. It regards the waiting time between eruptions and the duration of the eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
The visualization suggests probably two clusters: less waiting time results in shorter eruptions, and longer waits mean longer eruptions. It seems the observations more or less come from only those two possibilities.
In some contexts, possibly way too many, some assume the latent variables are discrete in nature rather than continuous. In such cases, we can think of the dimension reduction occurring along the rows rather than the columns. We’ll talk about two approaches, what we might call model-based vs. algorithmic or distance-based.
A basic approach for categorical latent variable analysis from a model-based perspective could be seen as follows:
Typical methods assume an underlying normal distribution, such that each observation’s likelihood is a weighted combination of its likelihood under each proposed cluster’s estimated parameters. For example, with two clusters, each with an estimated mean and variance, we get the likelihood for a particular observation under each cluster, and weight it by that observation’s estimated probability of membership in the cluster, which is also estimated. The parameters of interest are those of the assumed distribution (e.g. means and (co-)variances for a normal distribution).
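A small numeric illustration of that weighting, with hypothetical parameters for two univariate normal clusters:

x    = 3.2                                # a single observation
lik  = c(dnorm(x, mean = 2,   sd = .25),  # likelihood under cluster 1
         dnorm(x, mean = 4.3, sd = .45))  # likelihood under cluster 2
prop = c(.36, .64)                        # estimated cluster probabilities
prop * lik / sum(prop * lik)              # posterior probability of membership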
As an example, we’ll use mclust on the old faithful data, positing 2 clusters. As such we’ll get the proportion of observations classified to each cluster, and the means and covariance matrix (not shown) for each cluster.
library(mclust)
mod = Mclust(faithful, G = 2)
## summary(mod, parameters=TRUE) # more detail
component | size | proportion | mean.eruptions | mean.waiting |
---|---|---|---|---|
1 | 175 | 0.64 | 4.29 | 79.99 |
2 | 97 | 0.36 | 2.04 | 54.51 |
We can see that things come out more or less as we might expect, though there is a bit more uncertainty with observations in the middle.
The nice thing about model-based approaches is that if we want to compare different solutions, we can use the same tools we do in regression settings. Here, mclust does a search over many models (not just different numbers of clusters, but also things like equal variances, shape, etc.) and selects the ‘best’ model according to BIC12. Interestingly, the result here suggests three clusters for the Old Faithful data.
mod = Mclust(faithful, verbose=F)
Mixture models are often used in the regression setting, similar to the latent class setting that we’ve been describing. One approach regards the outcome of interest, whose observations you believe to be the result of hidden category membership, and you model it as such. In a standard regression setting you would get two sets of output, one for each class, along with estimated probabilities of class membership for each observation.
As a starting point, we’ll rerun mclust models for just eruption times. With flexmix, we’ll run a model without any covariates as an initial demonstration.
library(flexmix)
flexmix(eruptions ~ 1, data=faithful, k = 2) %>% parameters()
Mclust(faithful$eruptions, G=2, verbose=F) %>% summary(parameters=T)
  | coef.(Intercept) | sigma |
---|---|---|
Comp.1 | 4.27 | 0.44 |
Comp.2 | 2.02 | 0.24 |
component | mean | variance |
---|---|---|
1 | 2.02 | 0.06 |
2 | 4.27 | 0.19 |
Note that the flexmix output shows the standard deviation (sigma), i.e. the square root of the variance values from mclust. The cluster names are arbitrary. The point is that it’s the same result.
What flexmix will allow is to model the mean with covariates of interest. To keep things simple we’ll stick to this data and model eruption time by waiting time.
mod = flexmix(eruptions ~ waiting, data=faithful, k = 2)
summary(mod)
Call:
flexmix(formula = eruptions ~ waiting, data = faithful, k = 2)
prior size post>0 ratio
Comp.1 0.498 148 272 0.544
Comp.2 0.502 124 272 0.456
'log Lik.' -191.5743 (df=7)
AIC: 397.1485 BIC: 422.3891
We can see from the summary that it’s a fairly even split as far as group membership (54% vs. 46%). We also get the estimated coefficients for each group, as well as the respective probabilities of each observation belonging to each class (a few are shown below). Note again that the group labels are completely arbitrary; even rerunning it might flip-flop the labels. In this case, we’d probably conclude that the groups only differ in their intercepts, as the relationship of waiting time with eruptions appears to be the same for both.
parameters(mod) # means (Intercept) and std dev (sigma) for each group
Comp.1 Comp.2
coef.(Intercept) -1.73204354 -2.06336751
coef.waiting 0.07822496 0.07374856
sigma 0.34866389 0.39692190
X1 | X2 |
---|---|
0.06 | 0.94 |
0.14 | 0.86 |
0.12 | 0.88 |
0.07 | 0.93 |
0.46 | 0.54 |
0.90 | 0.10 |
0.38 | 0.62 |
0.00 | 1.00 |
0.48 | 0.52 |
0.24 | 0.76 |
The key idea is as before, that the response is a mixture of \(k\) distributions, e.g. two normal distributions, each with their own mean, modeled by covariates, and some variance. Each observation has some probability of belonging to the clusters, and their likelihood is weighted by that. Thus each observation is a weighted combination of the clusters.
Such a technique is very commonly applied in cases where there are many zeros in the data, leading to zero-inflated, hurdle, and similar models, where the zeros are modeled by a logistic regression and the counts by a Poisson or negative binomial. However, the approach is much more general, and can accommodate many other settings and distributions.
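As one common example of that setting, a zero-inflated Poisson model might be fit along these lines. This is just a sketch on simulated data using the pscl package, which is one of several implementations:

library(pscl)

set.seed(123)
n = 500
d = data.frame(x = rnorm(n))
# counts with roughly 30% structural (excess) zeros
d$y = rpois(n, lambda = exp(.5 + .5 * d$x)) * rbinom(n, 1, prob = .7)

zip_mod = zeroinfl(y ~ x | 1, data = d)   # count model with x, constant zero-inflation
summary(zip_mod)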
Aside from mixture models, when people use the term ‘cluster analysis’ they are typically referring to distance-based methods. Given a distance matrix that informs how dissimilar observations are from one another, the methods try to create clusters of observations that are similar to one another.
K-means cluster analysis is hands-down the most commonly used clustering method. Conceptually it’s fairly straightforward: find \(k\) clusters that minimize the variance of their members around the cluster mean. As such, it’s easy to implement in standard data settings.
K-means can actually be seen as a special case of the Gaussian mixture model above (similar to EII in the BIC plot, which attempts to find spherical clusters of equal size and shape). It also has connections to PCA and ICA. The general issue is trying to determine just how many clusters one should retain. The following plot shows both a two- and three-cluster solution using the kmeans function in base R, e.g. kmeans(faithful, 2).
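A minimal version of what is shown in the plot:

# two- and three-cluster solutions on the Old Faithful data
km2 = kmeans(faithful, centers = 2, nstart = 25)
km3 = kmeans(faithful, centers = 3, nstart = 25)

plot(faithful, col = km2$cluster, pch = 19)
points(km2$centers, pch = 8, cex = 2)   # cluster means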
Other methods can be thought of as clustering the data in a hierarchical fashion. These can start at the bottom of the hierarchy (agglomerative), with every observation in its own cluster, and successively combine them. For example, first choose a measure of dissimilarity, and combine the two observations that are most alike; then add a third to that cluster, or, if another pair are closer, make a new cluster. Conversely, one can start with every observation in one cluster (divisive), take the most dissimilar and split it off, and continue on until every observation is in its own cluster.
Practically at every turn you’re faced with multiple options for settings to choose (distance, linkage method, cluster determination, general approach), and most decisions will be arbitrary. While these methods are actually still commonly used, you almost always have better alternatives. They are fine to use for a quick visualization that sorts the data more meaningfully though, as above.
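A quick sketch of the agglomerative approach on the same data (Ward linkage on Euclidean distances of the standardized variables; as noted, other distance and linkage choices are possible):

d  = dist(scale(faithful))            # dissimilarity matrix
hc = hclust(d, method = 'ward.D2')    # agglomerative clustering with Ward linkage

plot(hc, labels = FALSE)              # dendrogram
cutree(hc, k = 2)                     # cut the tree into two clusters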
The term latent class analysis (LCA) refers to the case where we assume a latent categorical variable, but is commonly applied to the observed data being categorical as well (usually binary). In fact, some disciplines have a name for every categorical continuous combination you can have.
We could simply call them by the names we have already, mixture models in the case of categorical latent variables, and factor analysis otherwise. The IRT demonstrated above can be seen as an example of latent trait analysis.
The following example comes from survey data from the 2000 American National Election Study: two sets of six questions with four responses each, asking respondents’ opinions of how well various traits (moral, caring, knowledgeable, good leader, dishonest, intelligent) describe presidential candidates Al Gore and George W. Bush. The responses are measured on a four-point scale from ‘Extremely well’ to ‘Not well at all’. In what follows, we assume three latent classes can explain these responses. We can also model the latent class as an outcome, which we do here using likely vote cast (the respondent’s 2000 vote choice when asked).
library(poLCA)
data("election")
# ?election # for details
f <- cbind(MORALG,CARESG,KNOWG,LEADG,DISHONG,INTELG,
MORALB,CARESB,KNOWB,LEADB,DISHONB,INTELB) ~ VOTE3
lca_mod <- poLCA(f, election, nclass=3, verbose = F)
# lca_mod
The problem with LCA, regardless of the tool used, is that the output is ridiculously messy, because you get estimated proportions of response for every latent class (3), for every observed response category (4), for every variable (12). So we’ll just plot the result instead.
Darker colors represent higher probability of response. In general, we have classes representing naysayers, more positive responders, and those with a more mixed response (these would also be known as low, high and moderate scores if we’d just left the latent variable continuous). The following shows the proportions for each latent class by vote type.
Probable Vote | 1 | 2 | 3 |
---|---|---|---|
Bush | 0.40 | 0.01 | 0.08 |
Gore | 0.02 | 0.23 | 0.23 |
Other | 0.01 | 0.00 | 0.02 |
So the more negative responding types were also those likely to vote for Bush, while the other two are more likely to vote for Gore.
- Determining the most interpretable number of factors from a factor analysis is perhaps one of the greatest challenges in factor analysis. There are many solutions to this problem, none of which is uniformly the best.
- W. Revelle
- “Solving the number of factors problem is easy, I do it everyday before breakfast. But knowing the right solution is harder.”
- H. F. Kaiser
- The only rule you have to follow is to never use a scree plot to make your decision. Elbows indeed.
- Anonymous
Many tools have been offered to assist with this, and if one were able to definitively say how many factors were ‘best’, you’d have heard about it by now. For FA and PCA, see the help file for the nfactors function in psych (also unidim), which will provide eight or so metrics, and probably cause more confusion than clarity.
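For example, using the items from earlier (this will produce a number of plots and criteria, which may or may not agree):

library(psych)
nfactors(bfi_trim, n = 6)   # several 'how many factors' criteria at once
fa.parallel(bfi_trim)       # parallel analysis, another common option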
In general, if you’re using a latent variable or factor analysis approach for interpretative purposes, theory should be a strong, but not overwhelming guide. If your data suggest three are clearly better than two (e.g. via AIC comparison), then go with that, because that may just be how it is for your sample. For measurement models, some get hung up on ‘validated’ scales, and are reluctant to even explore other possibilities. But if the scale was validated on different types of people than your sample, you should probably be more exploratory. I can also mention that scales have been used for decades that were never adequately shown to be very reliable in the first place.
Interpretation comes largely from the weights/loadings or related coefficients. With PCA and common FA, rotation goes a long way to improving interpretability. I will say that if you interpret latent clusters in an ordinal fashion, you should strongly consider simply leaving the latent variable continuous. Even a mixture of two continuous latent variables (and/or their interaction) may uncover the latent classes you would have found, but leave the data in a more expressive context.
In many situations, you may be able to treat the observed data as discrete or continuous. Unless there is a strong need otherwise, I would suggest leaving it continuous, even for Likert-type scales. The issue is that categorical outcome models are generally harder to estimate, often without providing much interpretability benefit. PCA wouldn’t be your best option for binary data, but it wouldn’t necessarily be wrong to use it. Remember also that even if your data are binary, that doesn’t mean the latent variables have to be discrete (think IRT).
Many often use the clusters or factor scores obtained from an initial analysis for prediction. As such, they perform a two-step process, one where they obtain the cluster membership or scores, and then perform some other model using it, say standard regression. Unfortunately, this ignores the uncertainty in those estimates of the class membership or factor score. Essentially your standard errors will be a bit optimistic.
While personally I’m less worried about this, using SEM or other approaches will allow for simultaneous estimation, and should be employed if not too much trouble. Even then it might not be straightforward. For example, several suggestions have been made for doing so with latent classes in the SEM context to keep the latent class determination free of contamination by the predictive model. They generally just find a different issue arising with each, and using the ‘appropriate’ technique may mean other standard post-processing is not available. When the latent variables are continuous, there are concerns about smaller sample sizes with SEM, but you can use fsr (factor score regression) in lavaan to correct the standard errors. I would also recommend looking at the help file for the factor.scores function in psych.
There are many disciplines that will just do the two-step and not blink an eye, so you shouldn’t feel bad just using factor analysis and related as an initial pre-processing dimension reduction step.
If prediction is the primary goal of the analysis however, you should actually be wary of dimension reduction techniques. Having more columns than rows or mountains of data is commonly dealt with by using other techniques that can handle it, not by reducing the data so you can do a regression. Dimension reduction would still perhaps be useful in small data settings though, allowing you to retain information from multiple sources without making the ultimate model overly complex.
Non-model-based methods provide no clear-cut way of selecting the number of clusters within a particular method, or of comparing different approaches to the same solution. Many criteria have been offered, but even in typical settings they often do not agree. For example, let’s say you want to choose among k-means, k-medoids, and two hierarchical techniques. You’ll need to look at 2 to 4 clusters, across 4 techniques, based on 3 measures of ‘best’ (e.g. silhouette score), 2 distance measures, 3 linkage methods (where appropriate), etc. So just starting out you get to run over a hundred models and probably still won’t know which to choose. At least clusterSim will automate it for you.
In standard statistical settings, you are never going to have it recommended that you take continuous outcomes or predictors and categorize them. It is sometimes done as a computational convenience, but if you’re interested in interpretability and inference, it’s almost always the wrong choice.
Now ask, why would you do it with latent variables? Unless there is very strong theory, an actual physical aspect involved (e.g. seizure occurrence), or the data demand it (extreme zero-inflation in an outcome) you should think hard before assuming a hard classification of observations.
When you see cluster analysis conducted as part of some attempt to answer a research or other question, you almost never see someone test whether there is only one cluster. Despite the obvious possibility otherwise, it’s just assumed there are latent classes. The techniques that aren’t model-based don’t even have a way to test whether a single class is best. Model-based ones do however, just like they have a straightforward way to compare other cluster numbers. That’s reason enough alone to use them when possible.
Factor Analytic techniques, and latent variable methods more generally, have a history that’s now over a century old. As such, much is available to choose from, and though some approaches may be more common in some disciplines relative to others, all those discussed above are in fact common techniques. You should feel comfortable trying them out, and using them to help you find the hidden, unobserved meaning within your data!
Data: National Longitudinal Survey of Youth (1997, NLSY97), which investigates the transition from youth to adulthood. For this example, a series of questions asked of the participants in 2006, pertaining to the government’s role in promoting well-being, will be investigated. Questions regarded the government’s responsibility for the following issues: provide jobs for everyone, keep prices under control, provide health care, provide for the elderly, help industry, provide for the unemployed, reduce income differences, provide college financial aid, provide decent housing, protect the environment. They have four values, 1:4, which range from ‘definitely should be’ to ‘definitely should not be’.
There are several things you could run here. For example, use the psych package to run PCA, FA, or even IRT. You could also use poLCA to run a latent class analysis, or mclust if you just want to treat the data as numeric to run a standard mixture model. Given the nature of the variables, you could even see what a recommender system would come up with. Note that there is an ID variable that you don’t want to include in your analyses.
library(dplyr)
library(tidyr)  # for drop_na()

nlsy = read.csv('https://raw.githubusercontent.com/m-clark/sem/master/data/nlsy97_governmentNumeric.csv')

nlsy = nlsy %>% 
  select(-ID) %>%   # remove the ID variable
  drop_na()

nlsy_mat = as.matrix(nlsy)  # in case you need it

psych::describe(nlsy)
These descriptions come from the Wikipedia page.↩
They are the eigenvalues of the correlation matrix. In addition, they are the diagonal of the cross-product of the loading matrix, i.e. \(W'W\).↩
These results have been rotated, something not often practiced with PCA, yet almost always done with FA.↩
In almost every case I’ve come across, reverse-scored items are simply poorer items.↩
Muddier results are more the rule than the exception in my experience. Not enough initial development is typically done with scales, and when other people use them in other settings, it often ends with disappointment. Just think, many psychological scales are developed on specific clinical samples or freshman psychology students. How well do you think that will generalize?↩
My description here is a bit loose, but mostly for reasons that allow me to say similar things across techniques. For the technically inclined, \(V\) are the eigenvectors of \(X'X\), or if the data is standardized, this is the same as the eigenvectors of the covariance/correlation matrix of \(X\). Technically, psych uses weights based on the loadings and covariance matrix via solve(cov(bfi_trim), pc$loadings), which amounts to an equivalent result (even if different in sign/value). For factor analysis this becomes more nuanced. There is no unique way to create a factor score, but the common ones use weights that are a function of the loadings and correlation matrix. For example, the most commonly used (‘Thurstone’ or ‘regression’) scores are constructed via \(V = R^{-1}W\), where \(R\) is the correlation matrix and \(W\) are the loadings as before.↩
Note that your actual factor solution will still only approximate the observed correlation matrix, except under certain settings that would make it similar to PCA.↩
These days one would probably be more likely to use deep learning techniques for audio and images, at least if prediction was the primary concern.↩
This section has been gracefully stolen from James Henderson. As with ICA, I personally have never had a need to use this technique, nor was I much interested beyond what I knew about it already, so I decided to lift it from him. I can say that LDAvis shows topic relations from LDA using it, so there’s that. James’ workshop on dimension reduction provides more technical detail and additional examples for PCA, FA, and MDS, for those inclined.↩
The R package for t-SNE is written entirely in R code, and not optimized, and so it’s notably slow with large data. And, though it suggests otherwise in the helpfile, it does not return the probabilities.↩
Indeed, straight up singular value decomposition, as that used for PCA, was the hot machine learning tool used by Netflix back in the day.↩
For whatever reason, mclust displays BIC in higher is better format, in contrast to practically every other modeling context I’ve come across.↩