Appendix
Texts
Donald Barthelme
“I have to admit we are mired in the most exquisite mysterious muck.
This muck heaves and palpitates. It is multi-directional and has a mayor.”
“You may not be interested in absurdity, but absurdity is interested in you.”
Some of Us Had Been Threatening Our Friend Colby
A brief work about etiquette and how to act in society.
Raymond Carver
“It ought to make us feel ashamed when we talk like we know what we’re talking about when we talk about love.”
“That’s all we have, finally, the words, and they had better be the right ones.”
What We Talk About When We Talk About Love
The text we use is actually Beginners, i.e. the unedited version. A drink is required in order to read it with the proper context. Probably several. No. Definitely several.
Billy Dee Shakespeare
“It works every time.”
These old works have pretty much no relevance today, and are mostly forgotten by everyone except humanities faculty. The analysis of them depicted in this document is essentially definitive, and leaves little else to say about them, so don’t bother reading them if you haven’t already.
R
Up until even a couple of years ago, R was terrible at text. You really only had base R for basic processing, a couple of packages that were not straightforward to use, and little for scraping the web. Nowadays, I would say it’s probably easier to deal with text in R than it is elsewhere, including Python. Packages like rvest, stringr/stringi, and tidytext, among others, make it almost easy enough to jump right in.
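As a small taste of how low the barrier now is, here is a sketch of a tidytext-style word count; the two-document corpus is made up purely for illustration.
library(dplyr)
library(tidytext)
# a tiny made-up corpus for illustration
texts = tibble(
  doc  = c('hamlet', 'twelfth_night'),
  text = c('To be, or not to be: that is the question.',
           'If music be the food of love, play on.')
)
texts %>%
  unnest_tokens(word, text) %>%           # tokenize: one word per row
  anti_join(stop_words, by = 'word') %>%  # drop common stop words
  count(doc, word, sort = TRUE)           # word counts per document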
One can peruse the Natural Language Processing task view to start getting a sense of what all is available in R.
The one drawback with R is that most of its text processing is slow and/or memory intensive. The Shakespeare texts number only a few dozen, and none are very long, yet a basic LDA might still take a minute or so. Many text analysis situations involve thousands to millions of texts, such that the corpus itself may be too large to hold in memory, and thus R, at least on a standard computing device or with the usual methods, might not be viable for your needs.
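To get a sense of the memory issue, the following toy sketch uses the Matrix package to compare sparse and dense storage of a document-term matrix; the dimensions and density are made up, but plausible for a modest corpus.
library(Matrix)
# a made-up 10,000-document x 50,000-term DTM at 0.1% density
dtm = rsparsematrix(10000, 50000, density = .001)
print(object.size(dtm), units = 'MB')  # sparse storage: a handful of MB
# stored densely, the same matrix would need 10000 * 50000 doubles
10000 * 50000 * 8 / 2^30               # roughly 3.7 GB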
Python
While R has done a lot to catch up, more advanced text analysis techniques are developed in Python (if not lower-level languages), and so the state of the art may be found there. Furthermore, much of text analysis is a high-volume affair, which will likely be handled much more efficiently in the Python environment, though one still might need a high-performance computing environment. Here are some of the popular modules in Python:
- nltk (the long-standing standard toolkit)
- textblob (the tidytext for Python)
- gensim (topic modeling)
- spaCy (fast, industrial-strength NLP)
A Faster LDA
We noted in the Shakespeare start-to-finish example that there are faster alternatives to the standard LDA in topicmodels. In particular, the powerful text2vec package contains a faster and less memory-intensive implementation of LDA, and of text handling generally, both of which are very important if you want to use R for text analysis. The other nice thing is that it works with LDAvis for visualization.
For the following, we’ll use one of the partially cleaned document-term matrices for the Shakespeare texts. One thing to get used to is that text2vec uses the newer R6 class system for R objects, hence the $ approach you see for calling specific methods.
library(text2vec)
library(Matrix)
load('data/shakes_dtm_stemmed.RData')
# load('data/shakes_words_df.RData') # non-stemmed
# convert to the sparse matrix representation from the Matrix package
shakes_dtm = as(shakes_dtm, 'CsparseMatrix')
# setup the model
lda_model = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
# fit the model
doc_topic_distr = lda_model$fit_transform(x = shakes_dtm,
                                          n_iter = 1000,
                                          convergence_tol = 0.0001,
                                          n_check_convergence = 25,
                                          progressbar = FALSE)
INFO [2018-03-06 19:16:15] iter 25 loglikelihood = -1746173.024
INFO [2018-03-06 19:16:16] iter 50 loglikelihood = -1683541.903
INFO [2018-03-06 19:16:17] iter 75 loglikelihood = -1660985.396
INFO [2018-03-06 19:16:17] iter 100 loglikelihood = -1648984.411
INFO [2018-03-06 19:16:18] iter 125 loglikelihood = -1641481.467
INFO [2018-03-06 19:16:19] iter 150 loglikelihood = -1638983.461
INFO [2018-03-06 19:16:20] iter 175 loglikelihood = -1636730.733
INFO [2018-03-06 19:16:20] iter 200 loglikelihood = -1636356.883
INFO [2018-03-06 19:16:21] iter 225 loglikelihood = -1636487.222
INFO [2018-03-06 19:16:21] early stopping at 225 iteration
# top 10 words per topic
lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 1)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "prai" "hear" "ey" "love" "word" "natur" "night" "god" "friend" "death"
[2,] "honor" "madam" "sweet" "dai" "letter" "fortun" "fear" "dai" "hand" "grace"
[3,] "heaven" "bring" "fair" "true" "hous" "world" "ear" "england" "nobl" "soul"
[4,] "life" "sea" "heart" "wit" "prai" "power" "sleep" "crown" "word" "live"
[5,] "matter" "bear" "light" "fair" "sweet" "poor" "death" "war" "stand" "blood"
[6,] "honest" "seek" "desir" "live" "husband" "set" "dead" "arm" "rome" "life"
[7,] "fellow" "heard" "beauti" "youth" "woman" "nobl" "bid" "majesti" "honor" "dai"
[8,] "hear" "lose" "black" "heart" "reason" "truth" "bed" "fight" "leav" "hope"
[9,] "heart" "strang" "kiss" "marri" "hand" "leav" "mad" "sword" "deed" "heaven"
[10,] "friend" "sister" "sun" "night" "talk" "command" "hand" "heart" "tear" "die"
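The fit_transform call also returned doc_topic_distr, a documents-by-topics matrix. As a quick sketch of how one might use it, the following pulls each document’s most probable topic (the object names are just those created above):
# dominant topic for each document; rows of doc_topic_distr are
# documents, columns are the 10 topics
dominant_topic = apply(doc_topic_distr, 1, which.max)
head(dominant_topic)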
# top-words could be sorted by “relevance” which also takes into account
# frequency of word in the corpus (0 < lambda < 1)
lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 0.2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "honest" "madam" "ey" "love" "letter" "natur" "ear" "england" "rome" "bloodi"
[2,] "beseech" "sea" "cheek" "youth" "merri" "report" "sleep" "majesti" "deed" "royal"
[3,] "knave" "water" "black" "wit" "woo" "spirit" "beat" "field" "banish" "graciou"
[4,] "warrant" "sister" "wretch" "signior" "jest" "judgment" "night" "uncl" "countri" "high"
[5,] "glad" "women" "flower" "count" "finger" "worst" "air" "march" "citi" "subject"
[6,] "action" "hair" "sweet" "lover" "choos" "author" "soft" "lieg" "son" "sovereign"
[7,] "worship" "lose" "vow" "danc" "ring" "qualiti" "knock" "fight" "rise" "foe"
[8,] "matter" "entreat" "mortal" "song" "horn" "virgin" "poison" "battl" "kneel" "flourish"
[9,] "fellow" "seek" "wing" "paint" "bond" "wine" "shake" "harri" "fly" "king"
[10,] "walk" "passion" "short" "wed" "troth" "direct" "move" "crown" "wert" "tide"
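And as noted earlier, text2vec hooks into LDAvis. Assuming the fitted lda_model from above and the LDAvis package installed, the interactive visualization is a one-liner (it opens in the browser, so no output is shown here):
# interactive topic model visualization via LDAvis
lda_model$plot()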
Given that fitting models to text can be very time consuming, consider any approach that might gain you some efficiency.