Topic modeling

Basic idea

Topic modeling as typically conducted is a tool for much more than text. The primary technique of Latent Dirichlet Allocation (LDA) should be as much a part of your toolbox as principal components and factor analysis. It can be seen merely as a dimension reduction approach, but it can also be used for its rich interpretative quality as well. The basic idea is that we’ll take a whole lot of features and boil them down to a few ‘topics’. In this sense LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics.

In the standard setting, to be able to conduct such an analysis from text one needs a document-term matrix, where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any n-gram of interest.

Outside of text analysis terms could represent bacterial composition, genetic information, or whatever the researcher is interested in. Likewise, documents can be people, geographic regions, etc. The gist is, despite the common text-based application, that what constitutes a document or term is dependent upon the research question, and LDA can be applied in a variety of research settings.

Steps

When it comes to text analysis, most of the time in topic modeling is spent on processing the text itself. Importing/scraping it, dealing with capitalization, punctuation, removing stopwords, dealing with encoding issues, removing other miscellaneous common words. It is a highly iterative process such that once you get to the document-term matrix, you’re just going to find the stuff that was missed before and repeat the process with new ‘cleaning parameters’ in place. So getting to the analysis stage is the hard part. See the Shakespeare section, which comprises 5 acts, of which the first four and some additional scenes represent all the processing needed to get to the final scene of topic modeling. In what follows we’ll start at the end of that journey.

Topic Model Example

Shakespeare

In this example, we’ll look at Shakespeare’s plays and poems, using a topic model with 10 topics. For our needs, we’ll use the topicmodels package for the analysis, and mostly others for post-processing. Due to the large number of terms, this could take a while to run depending on your machine (maybe a minute or two). We can also see how things compare with the academic classifications for the texts.

load('Data/shakes_dtm_stemmed.RData')
library(topicmodels)
shakes_10 = LDA(convert(shakes_dtm, to = "topicmodels"), k = 10)

Examine Terms within Topics

One of the first things to do is attempt to interpret the topics, and we can start by seeing which terms are most probable for each topic.

get_terms(shakes_10, 20)

We can see there is a lot of overlap in these topics for top terms. Just looking at the top 10, love occurs in all of them, god and heart as well, but we could have guessed this just looking at how often they occur in general. Other measures can be used to assess term importance, such as those that seek to balance the term’s probability of occurrence within a document, and term exclusivity, or how likely a term is to occur in only one particular topic. See the Shakespeare section for some examples of those.

Examine Document-Topic Expression

Next we can look at which documents are more likely to express each topic.

t(topics(shakes_10, 2))

For example, based just on term frequency, Hamlet is most likely to be associated with Topic 1. That topic is affiliated with the (stemmed words) love, night, heaven, heart, natur, ey, hear, hand, life, fear, death, prai, poor, friend, soul, hold, word, live, stand, head. Sounds about right for Hamlet.

The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I’ve also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem.

A couple things stand out. To begin with, most works are associated with one topic⁹. In terms of the discovered topics, traditional classification really probably only works for the historical works, as they cluster together as expected (except for Henry the VIII, possibly due to it being a collaborative work). Furthermore, tragedies and comedies might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the poems (love, beauty). If we take four clusters from the cluster analysis, the result boils down to Phoenix (on its own), standard poems, a mixed bag of more love-oriented works and the remaining poems, then everything else.

Alternatively, one could merely classify the works based on their probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization attempts to order them based on their most probable topic. The order is based on the most likely topics across all documents.

So we can see that topic modeling can be used to classify the documents themselves into groups of documents most likely to express the same sorts of topics.

Extensions

There are extensions of LDA used in topic modeling that will allow your analysis to go even further.

Correlated Topic Models: the standard LDA does not estimate the topic correlation as part of the process.
Supervised LDA: In this scenario, topics can be used for prediction, e.g. the classification of tragedy, comedy etc. (similar to PC regression)
Structured Topic Models: Here we want to find the relevant covariates that can explain the topics (e.g. year written, author sex, etc.)
Other: There are still other ways to examine topics.

Topic Model Exercise

Movie reviews

Perform a topic model on the Cornell Movie review data. I’ve done some initial cleaning (e.g. removing stopwords, punctuation, etc.), and have both a tidy data frame and document term matrix for you to use. The former is provided if you want to do additional processing. But otherwise, just use the topicmodels package and perform your own analysis on the DTM. You can compare to this result.

load('data/movie_reviews.RData')
library(topicmodels)

Associated Press articles

Do some topic modeling on articles from the Associated Press data from the First Text Retrieval Conference in 1992. The following will load the DTM, so you are ready to go. See how your result compares with that of Dave Blei, based on 100 topics.

library(topicmodels)
data("AssociatedPress")

There isn’t a lot to work within the realm of choosing an ‘optimal’ number of topics, but I investigated it via a measure called perplexity. It bottomed out at around 50 topics. Usually such an approach is done through cross-validation. However, the solution chosen has no guarantee to produce human interpretable topics.↩