Sentiment Analysis
Basic idea
A common and intuitive approach to text is sentiment analysis. In a grand sense, we are interested in the emotional content of some text, e.g. posts on Facebook, tweets, or movie reviews. Most of the time the sentiment is obvious to a human reader, but if you have hundreds of thousands or millions of strings to analyze, you'd like to be able to assess it efficiently.
We will use the tidytext package for our demonstration. It comes with a sentiment lexicon that is actually a combination of multiple sources: one provides numeric ratings, while the others supply positive/negative labels or other classes of sentiment.
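If you want to peek at it yourself, something like the following will do (a sketch; it assumes an older tidytext release in which the sentiments object combines all of the lexicons, and the shuffle is only so that a mix of lexicons shows up in the first few rows).

library(tidytext)
library(dplyr)

sentiments %>% 
  sample_frac()    # shuffle the rows just to display a mix of the lexicons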
# A tibble: 27,314 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 decomposition negative nrc NA
2 imaculate positive bing NA
3 greatness positive bing NA
4 impatient negative bing NA
5 contradicting negative loughran NA
6 irrecoverableness negative bing NA
7 advisable trust nrc NA
8 humiliation disgust nrc NA
9 obscures negative bing NA
10 affliction negative bing NA
# ... with 27,304 more rows
The gist is that we are dealing with a specific, pre-defined vocabulary. Of course, any analysis will only be as good as the lexicon. The goal is usually to assign a sentiment score to a text, possibly an overall score, or a generally positive or negative grade. Given that, other analyses may be implemented to predict sentiment via standard regression tools or machine learning approaches.
Issues
Context, sarcasm, etc.
Now consider the following.
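Here is a sketch of how you might pull these rows yourself, assuming the combined sentiments data frame shown earlier.

sentiments %>% 
  filter(word == 'sick')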
# A tibble: 5 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 sick disgust nrc NA
2 sick negative nrc NA
3 sick sadness nrc NA
4 sick negative bing NA
5 sick <NA> AFINN -2
Despite the above assigned sentiments, the word sick has been used since at least 1960s surfing culture as slang for positive affect. A basic approach to sentiment analysis as described here will not be able to detect slang or other context like sarcasm. However, lots of training data for a particular context may allow one to correctly predict such sentiment. In addition, there are, for example, slang lexicons, or you can simply add your own entries to complement any available lexicon.
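For instance, nothing stops you from appending a few entries of your own (a sketch; the slang_additions object and the words in it are just made up for illustration).

slang_additions = tibble(
  word      = c('sick', 'gnarly'),
  sentiment = 'positive',
  lexicon   = 'custom',
  score     = NA_integer_
)

# combine with the existing lexicon
my_lexicon = sentiments %>% 
  bind_rows(slang_additions)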
Lexicons
In addition, the lexicons are really only going to be applicable to general usage of English in the Western world. Some might wonder where exactly these came from, or who decided that the word abacus should be affiliated with 'trust'. You may start your path by typing ?sentiments at the console if you have the tidytext package loaded.
Sentiment Analysis Examples
The first thing the baby did wrong
We demonstrate sentiment analysis with the text The first thing the baby did wrong, which is a very popular brief guide to parenting written by world-renowned psychologist Donald Barthelme who, in his spare time, also wrote postmodern literature. This particular text talks about an issue with the baby, whose name is Born Dancin', and who likes to tear pages out of books. Attempts are made by her parents to rectify the situation, without much success, but things are finally resolved at the end. The ultimate goal will be to see how sentiment in the text evolves over time, and in general we'd expect things to end more positively than they began.
How do we start? Let’s look again at the sentiments data set in the tidytext package.
# A tibble: 27,314 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 blunder sadness nrc NA
2 solidity positive nrc NA
3 mortuary fear nrc NA
4 absorbed positive nrc NA
5 successful joy nrc NA
6 virus negative nrc NA
7 exorbitantly negative bing NA
8 discombobulate negative bing NA
9 wail negative nrc NA
10 intimidatingly negative bing NA
# ... with 27,304 more rows
The bing lexicon provides only positive or negative labels. The AFINN lexicon, on the other hand, is numeric, with ratings from -5 to 5 in the score column. The others get more imaginative, but also more problematic. Why assimilate is considered superfluous is beyond me. It clearly should be negative, given the Borg connotations.
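You can see the offending category for yourself with a simple filter on the combined sentiments data frame (a sketch).

sentiments %>% 
  filter(sentiment == 'superfluous')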
# A tibble: 56 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 aegis superfluous loughran NA
2 amorphous superfluous loughran NA
3 anticipatory superfluous loughran NA
4 appertaining superfluous loughran NA
5 assimilate superfluous loughran NA
6 assimilating superfluous loughran NA
7 assimilation superfluous loughran NA
8 bifurcated superfluous loughran NA
9 bifurcation superfluous loughran NA
10 cessions superfluous loughran NA
# ... with 46 more rows
Read in the text files
But I digress. We start with the raw text, reading it in line by line. In what follows, we read in all the texts (three) in a given directory, such that each element of text is the work itself, i.e. text is a list column [5]. The unnest function will then unravel the works so that each entry is essentially a paragraph.
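Something like the following would do it (a sketch; the directory path is hypothetical, and the barth0 name is simply chosen to match the later code).

library(tidyverse)
library(tidytext)    # for the tokenizing steps that follow

# read each file in a (hypothetical) directory into a list column, one element per work
barth0 = tibble(file = list.files('data/barthelme', full.names = TRUE)) %>% 
  mutate(work = basename(file),
         text = map(file, read_lines)) %>%   # text is a list column [5]
  select(-file) %>% 
  unnest(text)                               # now one row per line/paragraph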
Iterative processing
One of the things stressed in this document is the iterative nature of text analysis. You will consistently take two steps forward, and then one or two back as you find issues that need to be addressed. For example, in a subsequent step I found there were encoding issues [6], so the following attempts to fix them. In addition, we want to tokenize the documents such that our tokens are sentences (e.g. as opposed to words or paragraphs). The reason for this is that I will be summarizing the sentiment at the sentence level.
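For instance, something along these lines (a sketch; the specific fixes and the barth_sentences name are my own, and what you actually need to repair will depend on your files).

# fix encoding hiccups, then tokenize so that each token is a sentence
barth0 = barth0 %>% 
  mutate(text = iconv(text, to = 'UTF-8', sub = ''),     # strip bytes that won't convert
         text = str_replace_all(text, '[“”]', '"'))      # example fix for curly quotes

barth_sentences = barth0 %>% 
  unnest_tokens(sentence, input = text, token = 'sentences', to_lower = FALSE)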
Tokenization
The next step is to drill down to just the document we want, and subsequently tokenize to the word level. However, I also create a sentence id so that we can group on it later.
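Continuing with the hypothetical objects above, that step might look like the following (a sketch).

baby_words = barth_sentences %>% 
  filter(work == 'baby.txt') %>% 
  mutate(sentence_id = row_number()) %>%             # id to group on later
  unnest_tokens(word, input = sentence, token = 'words')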
Get sentiments
Now that the data has been prepped, getting the sentiments is ridiculously easy. But that is how it is with text analysis. All the hard work is spent with the data processing. Here all we need is an inner join of our words with a sentiment lexicon of choice. This process will only retain words that are also in the lexicon. I use the numeric-based lexicon here. At that point, we get a sum score of sentiment by sentence.
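With the numeric AFINN entries of the combined sentiments data frame, a sketch of that step (object names are mine):

baby_sentiment_by_sentence = baby_words %>% 
  inner_join(sentiments %>% filter(lexicon == 'AFINN'), by = 'word') %>%  # keeps only words in the lexicon
  group_by(sentence_id) %>% 
  summarise(sentiment = sum(score))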
Alternative approach
As we are interested in the sentence level, it turns out that the sentimentr package has built-in functionality for this, and it includes more nuanced sentiment scores that take into account valence shifters, e.g. words that would negate something with positive or negative sentiment ('I do not like it').
library(sentimentr)

baby_sentiment = barth0 %>% 
  filter(work == 'baby.txt') %>% 
  get_sentences(text) %>%    # sentimentr's own sentence splitter
  sentiment() %>%            # per-sentence sentiment, accounting for valence shifters
  drop_na() %>%              # empty lines
  mutate(sentence_id = row_number())
The following visualizes sentiment over the progression of sentences (note that not every sentence will receive a sentiment score). You can read the sentence by hovering over the dot. The ▬ is the running average.
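A static approximation of that plot can be put together as follows (a sketch with ggplot2; the interactive version in the document uses plotly, and baby_sentiment comes from the sentimentr code above).

library(ggplot2)

baby_sentiment %>% 
  mutate(running_avg = cummean(sentiment)) %>%     # running average of sentiment
  ggplot(aes(x = sentence_id, y = sentiment)) +
  geom_point(alpha = .5) +
  geom_line(aes(y = running_avg), color = 'darkred') +
  labs(x = 'Sentence', y = 'Sentiment')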
In general, the sentiment starts out negative as the problem is explained. It bounces back and forth a bit but ends on a positive note. You'll see that the context of some sentences is not captured. For example, sentence 16 is 'But it didn't do any good'. However, good is going to be marked as positive in any lexicon by default. In addition, the token length will matter. Longer sentences are more likely to have some sentiment, for example.
Romeo & Juliet
For this example, I’ll invite you to more or less follow along, as there is notable pre-processing that must be done. We’ll look at sentiment in Shakespeare’s Romeo and Juliet. I have a cleaner version in the raw texts folder, but we can take the opportunity to use the gutenbergr package to download it directly from Project Gutenberg, a storehouse for works that have entered the public domain.
library(gutenbergr)
gw0 = gutenberg_works(title == "Romeo and Juliet") # look for something with this title
# A tibble: 1 x 4
gutenberg_id title author gutenberg_author_id
<int> <chr> <chr> <int>
1 1513 Romeo and Juliet Shakespeare, William 65
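Downloading the text itself is one more call (the rnj name is just what the later code expects).

rnj = gutenberg_download(gw0$gutenberg_id)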
We’ve got the text now, but there is still work to be done. The following is a quick and dirty approach, but see the Shakespeare section to see a more deliberate one.
We first slice off the initial parts we don’t want like title, author etc. Then we get rid of other tidbits that would interfere, using a little regex as well to aid the process.
rnj_filtered = rnj %>% 
  slice(-(1:49)) %>% 
  filter(!text == str_to_upper(text),    # will remove THE PROLOGUE etc.
         !text == str_to_title(text),    # will remove names/single-word lines
         !str_detect(text, pattern = '^(Scene|SCENE)|^(Act|ACT)|^\\[')) %>% 
  select(-gutenberg_id) %>% 
  unnest_tokens(sentence, input = text, token = 'sentences') %>% 
  mutate(sentenceID = 1:n())
The following unnests the data to word tokens. In addition, you can remove stopwords like a, an, the etc., and tidytext comes with a stop_words data frame. However, some of the stopwords have sentiments, so you would get a bit of a different result if you retain them. As Black Sheep once said, the choice is yours, and you can deal with this, or you can deal with that.
# show some of the matches
stop_words$word[which(stop_words$word %in% sentiments$word)] %>% head(20)
[1] "able" "against" "allow" "almost" "alone" "appear" "appreciate" "appropriate" "available" "awfully" "believe" "best" "better" "certain" "clearly"
[16] "could" "despite" "downwards" "enough" "furthermore"
# remember to call the output 'word' or anti_join won't work without a 'by' argument
rnj_filtered = rnj_filtered %>% 
  unnest_tokens(output = word, input = sentence, token = 'words') %>% 
  anti_join(stop_words)
Now we add the sentiments via the inner_join function. Here I use ‘bing’, but you can use another, and you might get a different result.
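A sketch of those steps follows, assuming the combined sentiments data frame from before; the rnj_sentiment name is my own. The outputs below show the most common words, the sentiment matches by sentence, and the tally of bing's positive and negative labels.

# most common words after stopword removal
rnj_filtered %>% 
  count(word, sort = TRUE)

# attach sentiments; only words found in a lexicon are retained
rnj_sentiment = rnj_filtered %>% 
  inner_join(sentiments, by = 'word')

rnj_sentiment

# tally of the bing positive/negative labels
rnj_sentiment %>% 
  filter(lexicon == 'bing') %>% 
  count(sentiment)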
# A tibble: 3,288 x 2
word n
<chr> <int>
1 thou 276
2 thy 165
3 love 140
4 thee 139
5 romeo 110
6 night 83
7 death 71
8 hath 64
9 sir 58
10 art 55
# ... with 3,278 more rows
# A tibble: 12,668 x 5
sentenceID word sentiment lexicon score
<int> <chr> <chr> <chr> <int>
1 1 dignity positive nrc NA
2 1 dignity trust nrc NA
3 1 dignity positive bing NA
4 1 fair positive nrc NA
5 1 fair positive bing NA
6 1 fair <NA> AFINN 2
7 1 ancient negative nrc NA
8 1 grudge anger nrc NA
9 1 grudge negative nrc NA
10 1 grudge negative bing NA
# ... with 12,658 more rows
negative positive
1244 833
Looks like this one is going to be a downer. The following visualizes the positive and negative sentiment scores as one progresses sentence by sentence through the work, using the plotly package. I also show the same information expressed as a difference (opaque line).
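A static approximation with ggplot2 (a sketch, continuing with the hypothetical rnj_sentiment object from above and using the bing labels only).

rnj_sentiment %>% 
  filter(lexicon == 'bing') %>% 
  count(sentenceID, sentiment) %>% 
  group_by(sentiment) %>% 
  mutate(cumulative = cumsum(n)) %>%    # running total of positive and negative words
  ggplot(aes(x = sentenceID, y = cumulative, color = sentiment)) +
  geom_line()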
It’s a close game until perhaps the midway point, when negativity takes over and despair sets in with the story. By the end [[:SPOILER ALERT:]] Sean Bean is beheaded, Darth Vader reveals himself to be Luke’s father, and Verbal is Keyser Söze.
Sentiment Analysis Summary
In general, sentiment analysis can be a useful exploration of data, but it is highly dependent on the context and the tools used. Note also that 'sentiment' can be anything; it doesn't have to be positive vs. negative. Any vocabulary may be applied, so the approach has more utility than the usual positive/negative implementation suggests.
It should also be noted that the above demonstration is largely conceptual and descriptive. While fun, it’s a bit simplified. For starters, trying to classify words as simply positive or negative itself is not a straightforward endeavor. As we noted at the beginning, context matters, and in general you’d want to take it into account. Modern methods of sentiment analysis would use approaches like word2vec or deep learning to predict a sentiment probability, as opposed to a simple word match. Even in the above, matching sentiments to texts would probably only be a precursor to building a model predicting sentiment, which could then be applied to new data.
Exercise
Step 0: Install the packages
If you haven't already, install the tidytext package. Install the janeaustenr package and load both of them [7].
Step 1: Initial inspection
First you'll want to look at what we're dealing with, so take a gander at austen_books().
# A tibble: 73,422 x 2
text book
* <chr> <fct>
1 SENSE AND SENSIBILITY Sense & Sensibility
2 "" Sense & Sensibility
3 by Jane Austen Sense & Sensibility
4 "" Sense & Sensibility
5 (1811) Sense & Sensibility
6 "" Sense & Sensibility
7 "" Sense & Sensibility
8 "" Sense & Sensibility
9 "" Sense & Sensibility
10 CHAPTER 1 Sense & Sensibility
# ... with 73,412 more rows
# A tibble: 6 x 1
book
<fct>
1 Sense & Sensibility
2 Pride & Prejudice
3 Mansfield Park
4 Emma
5 Northanger Abbey
6 Persuasion
We will examine only one text. In addition, for this exercise we’ll take a little bit of a different approach, looking for a specific kind of sentiment using the NRC database. It contains 10 distinct sentiments.
# A tibble: 10 x 1
sentiment
<chr>
1 trust
2 fear
3 negative
4 sadness
5 anger
6 surprise
7 positive
8 disgust
9 joy
10 anticipation
Now, select from any of those sentiments you like (or more than one), and one of the texts as follows.
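For example, to grab the negative sentiment words (a sketch; the nrc_bad name matches what the later code uses, and the text itself is selected in the next step):

nrc_bad = sentiments %>% 
  filter(lexicon == 'nrc', sentiment == 'negative')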
Step 2: Data prep
Now we do a little prep, and I’ll save you the trouble. You can just run the following.
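A prep along these lines will do (a sketch patterned after the tidytext book; it creates the ja_book object with the chapter, line_book, and line_chapter columns used below).

library(janeaustenr)
library(stringr)

ja_book = austen_books() %>% 
  filter(book == 'Mansfield Park') %>% 
  mutate(line_book = row_number(),
         chapter = cumsum(str_detect(text, regex('^chapter [\\divxlc]', ignore_case = TRUE)))) %>% 
  group_by(chapter) %>% 
  mutate(line_chapter = row_number()) %>% 
  ungroup() %>% 
  unnest_tokens(word, text) %>% 
  group_by(chapter)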
Step 3: Get sentiment
Now, on your own, try the inner join approach we used previously to match the sentiments to the text. Don't try to overthink this. The third pipe step will use the count function with the word column and also the argument sort = TRUE. Note this is just to look at your result; we aren't assigning it to an object yet.
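One possible version (a sketch, assuming the ja_book and nrc_bad objects from the previous steps):

ja_book %>% 
  inner_join(nrc_bad, by = 'word') %>% 
  count(word, sort = TRUE)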
The following shows my negative evaluation of Mansfield Park.
# A tibble: 4,204 x 3
# Groups: chapter [48]
chapter word n
<int> <chr> <int>
1 24 feeling 35
2 7 ill 25
3 46 evil 25
4 26 cross 24
5 27 cross 24
6 48 punishment 24
7 7 cutting 20
8 19 feeling 20
9 33 feeling 20
10 34 feeling 20
# ... with 4,194 more rows
Step 4: Visualize
Now let's do a visualization for sentiment. Redo your inner join, but this time we'll create a data frame that has the information we need.
plot_data = ja_book %>% 
  inner_join(nrc_bad) %>% 
  group_by(chapter, line_book, line_chapter) %>% 
  count() %>% 
  group_by(chapter) %>% 
  mutate(negativity = cumsum(n),
         mean_chapter_negativity = mean(negativity)) %>% 
  group_by(line_chapter) %>% 
  mutate(mean_line_negativity = mean(n))

plot_data
# A tibble: 4,398 x 7
# Groups: line_chapter [453]
chapter line_book line_chapter n negativity mean_chapter_negativity mean_line_negativity
<int> <int> <int> <int> <int> <dbl> <dbl>
1 1 17 7 2 2 111. 3.41
2 1 18 8 4 6 111. 2.65
3 1 20 10 1 7 111. 3.31
4 1 24 14 1 8 111. 2.88
5 1 26 16 2 10 111. 2.54
6 1 27 17 3 13 111. 2.67
7 1 28 18 3 16 111. 3.58
8 1 29 19 2 18 111. 2.31
9 1 34 24 3 21 111. 2.17
10 1 41 31 1 22 111. 2.87
# ... with 4,388 more rows
At this point you have enough to play with, so I leave you to plot whatever you want.
The following [8] shows both the total negativity within a chapter and the per-line negativity within a chapter. We can see that there is less negativity towards the end of chapters. We can also see that there appears to be more negativity in later chapters (darker lines).
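One way to get a plot along those lines (a sketch with ggplot2; the actual figure may have been constructed differently):

plot_data %>% 
  ggplot(aes(x = line_chapter, y = negativity, group = chapter, color = chapter)) +
  geom_line(alpha = .5) +
  scale_color_gradient(low = 'gray90', high = 'gray10') +   # later chapters drawn darker
  labs(x = 'Line within chapter', y = 'Cumulative negativity')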
[5] I suggest not naming your column 'text' in practice. It is a base function in R, and using it within the tidyverse may result in problems distinguishing the function from the column name (similar to the n() function and the n column created by count and tally). I only do so for pedagogical reasons.
[6] There are almost always encoding issues in my experience.
[7] This exercise is more or less taken directly from the tidytext book.
[8] This depiction goes against many of my visualization principles. I like it anyway.