This is a brief demonstration of a standard latent dirichlet allocation (LDA) for topic modeling, geared toward those who want to dig a little deeper than merely getting a result from a package command. The basic idea is to first generate some documents based on the underlying model, and then we’ll use the topicmodels package to recover the topics via LDA. You can find a basic overview of topic models in a variety of places, and I’ll eventually have a chapter in my graphical and latent variable models document.
Suffice it to say, one can approach this in (at least) one of two ways. In one sense, LDA is a dimension reduction technique, much like the family of techniques that includes PCA, factor analysis, non-negative matrix factorization etc. We’ll take a whole lot of terms, loosely defined, and boil them down to a few topics. In this sense LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics. The following is the plate diagram and description for standard LDA from Blei, Jordan, and Ng (2003).
Both \(z\) and \(w\) are from a multinomial draw based on the \(\theta\) and \(\beta\) distributions respectively. The key idea is that, to produce a given document, one draws a topic, and given the topic, words are drawn from it1.
In the standard setting, to be able to conduct such an analysis from text one needs a document-term matrix, where rows represent documents, and columns terms. Terms are typically words but could be any n-gram of interest. In practice, this is where you’ll spend most of your time, as text is never ready for analysis, and must be scraped, converted, stemmed, cleaned etc. In what follows we’ll be generating the text based on the underlying topic model. We’ll initially create \(\theta\) and \(\beta\) noted above, then given those, draw topics and words given topics based on the multinomial distribution. I’ve tried to keep the code general enough to make it easy to play with the parameter settings.
We begin the simulation by creating topic probabilities. There will be \(k=3\) topics2. Half of our documents will have probabilities of topics for them (\(\theta_1\)) which will be notably different from the other half (\(\theta_2\)). Specifically, the first half will show higher probability of topic ‘A’ and ‘B’, while the second half of documents show higher probability of topic ‘C’. What we’ll end up with here is an \(m\) x \(k\) matrix of probabilities \(\theta\) where each \(m\) document has a non-zero probability for each \(k\) topic. Note that in the plate diagram, these would come from a \(\mathcal{Dirichlet}(\mathbf{\alpha})\) draw rather than be fixed like this, but hopefully this will make things clear starting out. Later we will use the Dirichlet for the terms.
# Note that in what follows, I strive for conceptual clarity, not code efficiency.
library(tidyverse)
Ndocs = 500 # Number of documents
WordsPerDoc = rpois(Ndocs, 100) # Total words/terms in a document
thetaList = list(c(A=.60, B=.25, C=.15), # Topic proportions for first and second half of data
c(A=.10, B=.10, C=.80)) # These values represent a Dir(alpha) draw
theta_1 = t(replicate(Ndocs/2, thetaList[[1]]))
theta_2 = t(replicate(Ndocs/2, thetaList[[2]]))
theta = rbind(theta_1, theta_2) # Topic probabilities for all 500 docs
If you like, the following will use the Dirichlet explicitly, and amounts to the same thing but with randomness thrown in.
theta_1 = t(replicate(Ndocs/2, rdirichlet(1, c(6, 2.5, 1.5)), simplify=T))
theta_2 = t(replicate(Ndocs/2, rdirichlet(1, c(1, 1, 8)), simplify=T))
With topic probabilities in hand, we’ll draw topic assignments from a categorical distribution, which, for those not in computer science, is the multinomial with size=1
(see the commented line at the end).
firsthalf = 1:(Ndocs/2)
secondhalf = (Ndocs/2+1):Ndocs
Z = apply(theta, 1, rmultinom, n=1, size=1) # draw topic assignment
rowMeans(Z[,firsthalf]) # roughly equal to theta_1
[1] 0.612 0.252 0.136
rowMeans(Z[,secondhalf]) # roughly equal to theta_2
[1] 0.116 0.116 0.768
z = apply(Z, 2, which.max) # topic assignment as arbitrary label 1:3
# z = apply(theta, 1, function(topprob) extraDistr::rcat(1, topprob)) # topic assignment via categorical dist
Next we need the topics themselves. Topics are probability distributions of terms, and in what follows we’ll use the Dirichlet distribution to provide the prior probabilities for the terms3. With topic A, we’ll make the first ~40% of terms have a higher probability of occurring, the last ~40% go with topic C, and the middle more associated with topic B. To give a sense of the alpha settings, alpha=c(8,1,1)
would result in topic probabilities of .8, .1, .1, as would alpha=c(80,10,10)
, though the latter would serve as a much stronger prior. We’ll use the gtools package for the rdirichlet function. I also provide a visualization of the term-topic matrix, where the dark represents terms that are notably less likely to be associated with a particular topic.
Nterms = max(WordsPerDoc)
breaks = quantile(1:Nterms, c(.4,.6,1)) %>% round()
cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)
library(gtools)
B_k = matrix(0, ncol=3, nrow=Nterms)
B_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])), # topics for 1st 40% of terms
rep(1, length(cuts[[2]])),
rep(1, length(cuts[[3]]))))
B_k[,2] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), # topics for middle 20%
rep(10, length(cuts[[2]])),
rep(1, length(cuts[[3]]))))
B_k[,3] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), # topics for last 40%
rep(1, length(cuts[[2]])),
rep(10, length(cuts[[3]]))))
Now, given the topic assignment, we draw words for each document according to its size via a multinomial draw, and with that, we have our document-term matrix. However, we can also think of each document a merely a bag of words, where order, grammar etc. is ignored, but the frequency of term usage is retained4.
wordlist_1 = sapply(1:Ndocs,
function(i) t(rmultinom(1, size=WordsPerDoc[i], prob=B_k[,z[i]]))
, simplify = F)
# smash to doc-term matrix
dtm_1 = do.call(rbind, wordlist_1)
colnames(dtm_1) = paste0('word', 1:Nterms)
# bag of words representation
wordlist_1 = lapply(wordlist_1, function(wds) rep(paste0('word', 1:length(wds)), wds))
Here’s a glimpse of the document-term matrix.
And here’s what the first document looks like.
Term | Freq |
---|---|
word34 | 8 |
word25 | 7 |
word7 | 6 |
word1 | 5 |
word30 | 5 |
word129 | 4 |
word20 | 4 |
word43 | 4 |
Note that with a matrix approach, we don’t need an explicit topic label, just the topic indicator matrix and topic probabilities. I will let you see this for yourself that the results are essentially the same, but we’ll just stick to the previous results for this demonstration.
ZB = crossprod(Z, t(B_k))
dtm_2 = t(sapply(1:Ndocs, function(i) rmultinom(1, WordsPerDoc[i], ZB[i,])))
wordlist_2 = apply(dtm_2, 1, function(row) rep(paste0('word', which(row!=0)), row[row!=0]) )
colnames(dtm_2) = paste0('word', 1:Nterms)
We’re now ready to run the models. Depending on the number of documents, terms, and other settings you might fiddle with, it is possible to get an unsatisfactory result. Usually a redo is enough to recover topic probabilities, but the commented settings are an attempt to get better result from the outset. However, they will notably slow things down. The only thing we’ll do here is incorporate multiple random starts.
library(topicmodels)
controlSettings = list(nstart=10, verbose=100)
# others var=list(iter.max=5000, tol=10e-8), em=list(iter.max=5000, tol=10e-8)
LDA_1 = LDA(dtm_1, k=3, control=controlSettings)
Let’s look at the results. To be clear, the topic labels are completely arbitrary, so what is topic “1” for one analysis might be topic “3” for another. In this case, they align such that A-1, B-2, and C-3. The main thing for us is the recovery of the distribution of the topics, which will be roughly 0.6, 0.25, 0.15 for the first half (arbitrary order), and 0.1, 0.1, 0.8 for the other. Visually we can see the expected result, the first half of the documents are mostly associated with the first topic, and a little more so for the second. The last half are mostly associated with the third topic and little else.
LDA_1Post = posterior(LDA_1)
We can compare our results with the true topic probabilities and see we’ve recovered things pretty well.
A | B | C | |
---|---|---|---|
true_topic_probs_first_half | 0.60 | 0.25 | 0.15 |
true_topic_probs_second_half | 0.10 | 0.10 | 0.80 |
LDA_1_first_half | 0.61 | 0.26 | 0.13 |
LDA_1_second_half | 0.10 | 0.09 | 0.81 |
As another approach to looking at how the documents relate to one another, I calculated their KL divergence based on the estimated topic probabilities, and show the (lower triangle) of the resulting distance matrix of documents. Darker indicates documents that are less alike, and we see the later documents are less like the earliest ones, as we would expect.
The tidytext package can facilitate working with LDA results, as well as doing all the text pre-processing beforehand. As it is relatively new to the scene, I thought it might be useful to demonstrate some of how it works. The first step would assume your initial data is in ‘tidy’ format (or what is more typically called ‘long’ form). In this case, each row represents a word within a document. Additionally we have a column for how many times the word occurs and the document id. Note that I drop 0 counts. This is because tidytext will not create a sparse dtm object otherwise.
library(tidytext)
lda_dat_df = dtm_1 %>%
as_data_frame() %>%
mutate(doc=1:n()) %>%
gather(key=term, value=count, -doc) %>%
filter(count>0) %>%
arrange(doc)
lda_dat_df
doc | term | count |
---|---|---|
1 | word1 | 5 |
1 | word2 | 2 |
1 | word3 | 3 |
1 | word5 | 3 |
1 | word7 | 6 |
1 | word8 | 2 |
1 | word10 | 1 |
1 | word11 | 1 |
1 | word12 | 1 |
1 | word16 | 2 |
Again, tidytext will help you get to that state via the initial processing. Once there, you can then create the document term matrix with cast_dtm.
dtm = lda_dat_df %>%
cast_dtm(doc, term, count)
dtm
<<DocumentTermMatrix (documents: 500, terms: 136)>>
Non-/sparse entries: 26121/41879
Sparsity : 62%
Maximal term length: 7
Weighting : term frequency (tf)
The above information tells us how much sparsity (or zero counts) we have in the matrix (62%), the actual number of zeros vs. non-zeros that produce the sparsity percentage, the maximal word/term length (7), and the weighting used. In this case we are dealing with the raw frequencies, thus the weighting is just term frequency.
Once you have the topic model results, you can then start working with term and topic probabilities via the functions in topicmodels or the ‘tidy’ version of the LDA object and tidytext. First we’ll get the top 10 terms and graphically display them. Recall that topic C-3 should be seen with later words, and A-1 with earlier, and this is born out fairly clearly. Topic B was our less frequent topic with a smaller vocabulary. See the LDAvis package also for an interactive way to examine topic and term relationships.
terms(LDA_1, 10)
Topic 1 | Topic 2 | Topic 3 |
---|---|---|
word22 | word67 | word95 |
word44 | word76 | word100 |
word15 | word63 | word120 |
word49 | word74 | word94 |
word37 | word71 | word86 |
word1 | word77 | word115 |
word11 | word57 | word126 |
word45 | word72 | word93 |
word34 | word58 | word96 |
word12 | word56 | word132 |
top_terms = tidy(LDA_1, matrix='beta') %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
We can get at the topic probabilities for each document as follows. For this demonstration, for clarity we set things up such that probabilities are essentially 1 or 0 for documents presenting some topic. Do not expect this in practice.
doctop_probs = tidy(LDA_1, matrix = "gamma")
doctop_probs %>%
group_by(document, topic) %>%
summarise(meanProb=mean(gamma)) %>%
head
document | topic | meanProb |
---|---|---|
1 | 1 | 0.0003705 |
1 | 2 | 0.9993 |
1 | 3 | 0.0003705 |
2 | 1 | 0.9992 |
2 | 2 | 0.0004034 |
2 | 3 | 0.0004034 |
A rough breakdown of expected topic probabilities across all documents is \(.6*Ndocs/2 + .1*Ndocs/2 =\) 0.35 for topic A, \(.25*Ndocs/2 + .1*Ndocs/2 =\) 0.175 for B, and \(.8*Ndocs/2 + .15*Ndocs/2 =\) 0.475 for C. We can use the topics function to extract the most likely topic per document and get the frequency of occurrence. Note that we don’t have to say that each document is only associated with one topic in general.
prop.table(table(topics(LDA_1)))
1 2 3
0.352 0.178 0.470
## # dplyr is notably more verbose here, but can be used also
## doctop_probs %>%
## group_by(document) %>%
## summarise (topic = which.max(gamma)) %>%
## group_by(topic) %>%
## summarise(n = n()) %>%
## mutate(freq = n / sum(n))
topic | n | freq |
---|---|---|
1 | 176 | 0.352 |
2 | 89 | 0.178 |
3 | 235 | 0.47 |
Looks like we’re right where we’d expect.
This document conceptually demonstrates the mechanics of LDA’s data generating process. I suggest you go back and fiddle with the settings to see how things change. For example, one could use an actual Dirichlet draw for theta, and a more loose setting for the term-topic affiliation for a more nuanced result.
One common extension of LDA is the correlated topic model. The above approach does not estimate topic correlations, and the Dirichlet prior fixes them based on alpha. The CTM would help account for the fact that something with a ‘business’ topic might also have the ‘finance’ topic. Other extensions include supervised LDA, where topics might be used in classification tasks, structural topic models, where we examine predictors of topic prevalence, dynamic topic models, where topics may evolve over time, and many others.
Note that just about everything you’ll come across with LDA will regard topic modeling, but it is by no means restricted to the analysis of text. In fact, one of its earliest uses was actually in genetics5. Any time you have a matrix of counts like this, LDA is a potential candidate for analysis, and might be preferable to PCA or factor analysis. It is a very useful technique to have in your statistical toolbox.
Sometimes you will see LDA without the \(\eta\). In Blei et al.‘s original article what is shown is called ’smoothed’ LDA, and is a more fully Bayesian approach where \(\eta\) serves as a prior allowing a non-zero probability for all terms.↩
In practice, you must decide on a \(k\), and that will often depend on the goals of analysis. For example, in a highly predictive focused, possibly big data situation, possibly supervised, one might be dealing with hundreds of topics for a given corpus. In more theoretically driven settings, one might desire a few for practical reasons and perhaps allow for greater interpretability. Statistics such as perplexity can be used to choose the ‘best’ solution, but in that latter setting it might be better to use interpretability.↩
Note that in practice, the alpha parameter would be constant, and serve to simply allow for each term to have some probability of belonging to any topic. See footnote 1 above.↩
It might seem like such things would be important for topic analysis, but it turns out they are not. Topic modeling can distinguish bank as belonging to a topic on rivers as well as one on financial matters.↩
Pritchard, J. K.; Stephens, M.; Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics.↩