Sentiment Analysis
Basic idea
A common and intuitive approach to text is sentiment analysis. In a grand sense, we are interested in the emotional content of some text, e.g. posts on Facebook, tweets, or movie reviews. Most of the time the sentiment is obvious to a human reader, but if you have hundreds of thousands or millions of strings to analyze, you'd like to be able to assess it efficiently.
We will use the tidytext package for our demonstration. It comes with a sentiment lexicon that is actually a combination of multiple sources: one provides numeric ratings, while the others supply positive/negative labels or other classes of sentiment.
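If you want to peek at it yourself, something like the following will do (a sketch; it assumes an older tidytext release in which the sentiments object combines all of the lexicons, and the shuffle is only so that a mix of lexicons shows up in the first few rows).

library(tidytext)
library(dplyr)

sentiments %>% 
  sample_frac()    # shuffle the rows just to display a mix of the lexicons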
# A tibble: 27,314 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 decomposition negative nrc NA
2 imaculate positive bing NA
3 greatness positive bing NA
4 impatient negative bing NA
5 contradicting negative loughran NA
6 irrecoverableness negative bing NA
7 advisable trust nrc NA
8 humiliation disgust nrc NA
9 obscures negative bing NA
10 affliction negative bing NA
# ... with 27,304 more rows
The gist is that we are dealing with a specific, pre-defined vocabulary. Of course, any analysis will only be as good as the lexicon. The goal is usually to assign a sentiment score to a text, possibly an overall score, or a generally positive or negative grade. Given that, other analyses may be implemented to predict sentiment via standard regression tools or machine learning approaches.
Issues
Context, sarcasm, etc.
Now consider the following.
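Here is a sketch of how you might pull these rows yourself, assuming the combined sentiments data frame shown earlier.

sentiments %>% 
  filter(word == 'sick')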
# A tibble: 5 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 sick disgust nrc NA
2 sick negative nrc NA
3 sick sadness nrc NA
4 sick negative bing NA
5 sick <NA> AFINN -2
Despite the above assigned sentiments, the word sick has been used since at least 1960s surfing culture as slang for positive affect. A basic approach to sentiment analysis as described here will not be able to detect slang or other context like sarcasm. However, lots of training data for a particular context may allow one to correctly predict such sentiment. In addition, there are, for example, slang lexicons, or you can simply add your own entries to complement any available lexicon.
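For instance, nothing stops you from appending a few entries of your own (a sketch; the slang_additions object and the words in it are just made up for illustration).

slang_additions = tibble(
  word      = c('sick', 'gnarly'),
  sentiment = 'positive',
  lexicon   = 'custom',
  score     = NA_integer_
)

# combine with the existing lexicon
my_lexicon = sentiments %>% 
  bind_rows(slang_additions)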
Lexicons
In addition, the lexicons are really only going to be applicable to general usage of English in the Western world. Some might wonder where exactly these came from, or who decided that the word abacus should be affiliated with 'trust'. You may start your path by typing ?sentiments at the console if you have the tidytext package loaded.
Sentiment Analysis Examples
The first thing the baby did wrong
We demonstrate sentiment analysis with the text The first thing the baby did wrong, which is a very popular brief guide to parenting written by world-renowned psychologist Donald Barthelme who, in his spare time, also wrote postmodern literature. This particular text talks about an issue with the baby, whose name is Born Dancin', and who likes to tear pages out of books. Attempts are made by her parents to rectify the situation, without much success, but things are finally resolved at the end. The ultimate goal will be to see how sentiment in the text evolves over time, and in general we'd expect things to end more positively than they began.
How do we start? Let’s look again at the sentiments data set in the tidytext package.
# A tibble: 27,314 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 blunder sadness nrc NA
2 solidity positive nrc NA
3 mortuary fear nrc NA
4 absorbed positive nrc NA
5 successful joy nrc NA
6 virus negative nrc NA
7 exorbitantly negative bing NA
8 discombobulate negative bing NA
9 wail negative nrc NA
10 intimidatingly negative bing NA
# ... with 27,304 more rows
The bing lexicon provides only positive or negative labels. The AFINN lexicon, on the other hand, is numeric, with ratings from -5 to 5 in the score column. The others get more imaginative, but also more problematic. Why assimilate is considered superfluous is beyond me. It clearly should be negative, given the Borg connotations.
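You can see the offending category for yourself with a simple filter on the combined sentiments data frame (a sketch).

sentiments %>% 
  filter(sentiment == 'superfluous')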
# A tibble: 56 x 4
word sentiment lexicon score
<chr> <chr> <chr> <int>
1 aegis superfluous loughran NA
2 amorphous superfluous loughran NA
3 anticipatory superfluous loughran NA
4 appertaining superfluous loughran NA
5 assimilate superfluous loughran NA
6 assimilating superfluous loughran NA
7 assimilation superfluous loughran NA
8 bifurcated superfluous loughran NA
9 bifurcation superfluous loughran NA
10 cessions superfluous loughran NA
# ... with 46 more rows
Read in the text files
But I digress. We start with the raw text, reading it in line by line. In what follows, we read in all the texts (three) in a given directory, such that each element of text is the work itself, i.e. text is a list column [5]. The unnest function will then unravel the works so that each entry is essentially a paragraph.
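Something like the following would do it (a sketch; the directory path is hypothetical, and the barth0 name is simply chosen to match the later code).

library(tidyverse)
library(tidytext)    # for the tokenizing steps that follow

# read each file in a (hypothetical) directory into a list column, one element per work
barth0 = tibble(file = list.files('data/barthelme', full.names = TRUE)) %>% 
  mutate(work = basename(file),
         text = map(file, read_lines)) %>%   # text is a list column [5]
  select(-file) %>% 
  unnest(text)                               # now one row per line/paragraph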
Iterative processing
One of the things stressed in this document is the iterative nature of text analysis. You will consistently take two steps forward, and then one or two back as you find issues that need to be addressed. For example, in a subsequent step I found there were encoding issues [6], so the following attempts to fix them. In addition, we want to tokenize the documents such that our tokens are sentences (e.g. as opposed to words or paragraphs). The reason for this is that I will be summarizing the sentiment at the sentence level.
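For instance, something along these lines (a sketch; the specific fixes and the barth_sentences name are my own, and what you actually need to repair will depend on your files).

# fix encoding hiccups, then tokenize so that each token is a sentence
barth0 = barth0 %>% 
  mutate(text = iconv(text, to = 'UTF-8', sub = ''),     # strip bytes that won't convert
         text = str_replace_all(text, '[“”]', '"'))      # example fix for curly quotes

barth_sentences = barth0 %>% 
  unnest_tokens(sentence, input = text, token = 'sentences', to_lower = FALSE)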
Tokenization
The next step is to drill down to just the document we want, and subsequently tokenize to the word level. However, I also create a sentence id so that we can group on it later.
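Continuing with the hypothetical objects above, that step might look like the following (a sketch).

baby_words = barth_sentences %>% 
  filter(work == 'baby.txt') %>% 
  mutate(sentence_id = row_number()) %>%             # id to group on later
  unnest_tokens(word, input = sentence, token = 'words')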
Get sentiments
Now that the data has been prepped, getting the sentiments is ridiculously easy. But that is how it is with text analysis. All the hard work is spent with the data processing. Here all we need is an inner join of our words with a sentiment lexicon of choice. This process will only retain words that are also in the lexicon. I use the numeric-based lexicon here. At that point, we get a sum score of sentiment by sentence.
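With the numeric AFINN entries of the combined sentiments data frame, a sketch of that step (object names are mine):

baby_sentiment_by_sentence = baby_words %>% 
  inner_join(sentiments %>% filter(lexicon == 'AFINN'), by = 'word') %>%  # keeps only words in the lexicon
  group_by(sentence_id) %>% 
  summarise(sentiment = sum(score))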
Alternative approach
As we are interested in the sentence level, it turns out that the sentimentr package has built-in functionality for this, and it includes more nuanced sentiment scores that take into account valence shifters, e.g. words that would negate something with positive or negative sentiment ('I do not like it').
library(sentimentr)

baby_sentiment = barth0 %>% 
  filter(work == 'baby.txt') %>% 
  get_sentences(text) %>%    # sentimentr's own sentence splitter
  sentiment() %>%            # per-sentence sentiment, accounting for valence shifters
  drop_na() %>%              # empty lines
  mutate(sentence_id = row_number())
The following visualizes sentiment over the progression of sentences (note that not every sentence will receive a sentiment score). You can read the sentence by hovering over the dot. The ▬ is the running average.
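A static approximation of that plot can be put together as follows (a sketch with ggplot2; the interactive version in the document uses plotly, and baby_sentiment comes from the sentimentr code above).

library(ggplot2)

baby_sentiment %>% 
  mutate(running_avg = cummean(sentiment)) %>%     # running average of sentiment
  ggplot(aes(x = sentence_id, y = sentiment)) +
  geom_point(alpha = .5) +
  geom_line(aes(y = running_avg), color = 'darkred') +
  labs(x = 'Sentence', y = 'Sentiment')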
In general, the sentiment starts out negative as the problem is explained. It bounces back and forth a bit but ends on a positive note. You'll see that the context of some sentences is not captured. For example, sentence 16 is 'But it didn't do any good'. However, good is going to be marked as positive in any lexicon by default. In addition, the token length will matter. Longer sentences are more likely to have some sentiment, for example.
Romeo & Juliet
For this example, I’ll invite you to more or less follow along, as there is notable pre-processing that must be done. We’ll look at sentiment in Shakespeare’s Romeo and Juliet. I have a cleaner version in the raw texts folder, but we can take the opportunity to use the gutenbergr package to download it directly from Project Gutenberg, a storehouse for works that have entered the public domain.
library(gutenbergr)
gw0 = gutenberg_works(title == "Romeo and Juliet") # look for something with this title
# A tibble: 1 x 4
gutenberg_id title author gutenberg_author_id
<int> <chr> <chr> <int>
1 1513 Romeo and Juliet Shakespeare, William 65
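Downloading the text itself is one more call (the rnj name is just what the later code expects).

rnj = gutenberg_download(gw0$gutenberg_id)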
We’ve got the text now, but there is still work to be done. The following is a quick and dirty approach, but see the Shakespeare section to see a more deliberate one.
We first slice off the initial parts we don’t want like title, author etc. Then we get rid of other tidbits that would interfere, using a little regex as well to aid the process.
rnj_filtered = rnj %>% 
  slice(-(1:49)) %>% 
  filter(!text == str_to_upper(text),    # will remove THE PROLOGUE etc.
         !text == str_to_title(text),    # will remove names/single-word lines
         !str_detect(text, pattern = '^(Scene|SCENE)|^(Act|ACT)|^\\[')) %>% 
  select(-gutenberg_id) %>% 
  unnest_tokens(sentence, input = text, token = 'sentences') %>% 
  mutate(sentenceID = 1:n())
The following unnests the data to word tokens. In addition, you can remove stopwords like a, an, the etc., and tidytext comes with a stop_words data frame. However, some of the stopwords have sentiments, so you would get a bit of a different result if you retain them. As Black Sheep once said, the choice is yours, and you can deal with this, or you can deal with that.
# show some of the matches
stop_words$word[which(stop_words$word %in% sentiments$word)] %>% head(20)
[1] "able" "against" "allow" "almost" "alone" "appear" "appreciate" "appropriate" "available" "awfully" "believe" "best" "better" "certain" "clearly"
[16] "could" "despite" "downwards" "enough" "furthermore"
# remember to call the output 'word' or anti_join won't work without a 'by' argument
rnj_filtered = rnj_filtered %>% 
  unnest_tokens(output = word, input = sentence, token = 'words') %>% 
  anti_join(stop_words)
Now we add the sentiments via the inner_join function. Here I use ‘bing’, but you can use another, and you might get a different result.
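A sketch of those steps follows, assuming the combined sentiments data frame from before; the rnj_sentiment name is my own. The outputs below show the most common words, the sentiment matches by sentence, and the tally of bing's positive and negative labels.

# most common words after stopword removal
rnj_filtered %>% 
  count(word, sort = TRUE)

# attach sentiments; only words found in a lexicon are retained
rnj_sentiment = rnj_filtered %>% 
  inner_join(sentiments, by = 'word')

rnj_sentiment

# tally of the bing positive/negative labels
rnj_sentiment %>% 
  filter(lexicon == 'bing') %>% 
  count(sentiment)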
# A tibble: 3,288 x 2
word n
<chr> <int>
1 thou 276
2 thy 165
3 love 140
4 thee 139
5 romeo 110
6 night 83
7 death 71
8 hath 64
9 sir 58
10 art 55
# ... with 3,278 more rows
# A tibble: 12,668 x 5
sentenceID word sentiment lexicon score
<int> <chr> <chr> <chr> <int>
1 1 dignity positive nrc NA
2 1 dignity trust nrc NA
3 1 dignity positive bing NA
4 1 fair positive nrc NA
5 1 fair positive bing NA
6 1 fair <NA> AFINN 2
7 1 ancient negative nrc NA
8 1 grudge anger nrc NA
9 1 grudge negative nrc NA
10 1 grudge negative bing NA
# ... with 12,658 more rows
negative positive
1244 833
Looks like this one is going to be a downer. The following visualizes the positive and negative sentiment scores as one progresses sentence by sentence through the work, using the plotly package. I also show the same information expressed as a difference (opaque line).
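A static approximation with ggplot2 (a sketch, continuing with the hypothetical rnj_sentiment object from above and using the bing labels only).

rnj_sentiment %>% 
  filter(lexicon == 'bing') %>% 
  count(sentenceID, sentiment) %>% 
  group_by(sentiment) %>% 
  mutate(cumulative = cumsum(n)) %>%    # running total of positive and negative words
  ggplot(aes(x = sentenceID, y = cumulative, color = sentiment)) +
  geom_line()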
It’s a close game until perhaps the midway point, when negativity takes over and despair sets in with the story. By the end [[:SPOILER ALERT:]] Sean Bean is beheaded, Darth Vader reveals himself to be Luke’s father, and Verbal is Keyser Söze.
Sentiment Analysis Summary
In general, sentiment analysis can be a useful exploration of data, but it is highly dependent on the context and the tools used. Note also that 'sentiment' can be anything; it doesn't have to be positive vs. negative. Any vocabulary may be applied, so the approach has more utility than the usual positive/negative implementation suggests.
It should also be noted that the above demonstration is largely conceptual and descriptive. While fun, it’s a bit simplified. For starters, trying to classify words as simply positive or negative itself is not a straightforward endeavor. As we noted at the beginning, context matters, and in general you’d want to take it into account. Modern methods of sentiment analysis would use approaches like word2vec or deep learning to predict a sentiment probability, as opposed to a simple word match. Even in the above, matching sentiments to texts would probably only be a precursor to building a model predicting sentiment, which could then be applied to new data.
Exercise
Step 0: Install the packages
If you haven't already, install the tidytext package. Install the janeaustenr package and load both of them [7].
Step 1: Initial inspection
First you'll want to look at what we're dealing with, so take a gander at austen_books().
# A tibble: 73,422 x 2
text book
* <chr> <fct>
1 SENSE AND SENSIBILITY Sense & Sensibility
2 "" Sense & Sensibility
3 by Jane Austen Sense & Sensibility
4 "" Sense & Sensibility
5 (1811) Sense & Sensibility
6 "" Sense & Sensibility
7 "" Sense & Sensibility
8 "" Sense & Sensibility
9 "" Sense & Sensibility
10 CHAPTER 1 Sense & Sensibility
# ... with 73,412 more rows
# A tibble: 6 x 1
book
<fct>
1 Sense & Sensibility
2 Pride & Prejudice
3 Mansfield Park
4 Emma
5 Northanger Abbey
6 Persuasion
We will examine only one text. In addition, for this exercise we’ll take a little bit of a different approach, looking for a specific kind of sentiment using the NRC database. It contains 10 distinct sentiments.
# A tibble: 10 x 1
sentiment
<chr>
1 trust
2 fear
3 negative
4 sadness
5 anger
6 surprise
7 positive
8 disgust
9 joy
10 anticipation
Now, select from any of those sentiments you like (or more than one), and one of the texts as follows.
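For example, to grab the negative sentiment words (a sketch; the nrc_bad name matches what the later code uses, and the text itself is selected in the next step):

nrc_bad = sentiments %>% 
  filter(lexicon == 'nrc', sentiment == 'negative')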
Step 2: Data prep
Now we do a little prep, and I’ll save you the trouble. You can just run the following.
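A prep along these lines will do (a sketch patterned after the tidytext book; it creates the ja_book object with the chapter, line_book, and line_chapter columns used below).

library(janeaustenr)
library(stringr)

ja_book = austen_books() %>% 
  filter(book == 'Mansfield Park') %>% 
  mutate(line_book = row_number(),
         chapter = cumsum(str_detect(text, regex('^chapter [\\divxlc]', ignore_case = TRUE)))) %>% 
  group_by(chapter) %>% 
  mutate(line_chapter = row_number()) %>% 
  ungroup() %>% 
  unnest_tokens(word, text) %>% 
  group_by(chapter)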
Step 3: Get sentiment
Now, on your own, try the inner join approach we used previously to match the sentiments to the text. Don't try to overthink this. The third pipe step will use the count function with the word column and also the argument sort = TRUE. Note this is just to look at your result; we aren't assigning it to an object yet.
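One possible version (a sketch, assuming the ja_book and nrc_bad objects from the previous steps):

ja_book %>% 
  inner_join(nrc_bad, by = 'word') %>% 
  count(word, sort = TRUE)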
The following shows my negative evaluation of Mansfield Park.
# A tibble: 4,204 x 3
# Groups: chapter [48]
chapter word n
<int> <chr> <int>
1 24 feeling 35
2 7 ill 25
3 46 evil 25
4 26 cross 24
5 27 cross 24
6 48 punishment 24
7 7 cutting 20
8 19 feeling 20
9 33 feeling 20
10 34 feeling 20
# ... with 4,194 more rows
Step 4: Visualize
Now let's do a visualization for sentiment. Redo your inner join, but this time we'll create a data frame that has the information we need.
plot_data = ja_book %>% 
  inner_join(nrc_bad) %>% 
  group_by(chapter, line_book, line_chapter) %>% 
  count() %>% 
  group_by(chapter) %>% 
  mutate(negativity = cumsum(n),
         mean_chapter_negativity = mean(negativity)) %>% 
  group_by(line_chapter) %>% 
  mutate(mean_line_negativity = mean(n))

plot_data
# A tibble: 4,398 x 7
# Groups: line_chapter [453]
chapter line_book line_chapter n negativity mean_chapter_negativity mean_line_negativity
<int> <int> <int> <int> <int> <dbl> <dbl>
1 1 17 7 2 2 111. 3.41
2 1 18 8 4 6 111. 2.65
3 1 20 10 1 7 111. 3.31
4 1 24 14 1 8 111. 2.88
5 1 26 16 2 10 111. 2.54
6 1 27 17 3 13 111. 2.67
7 1 28 18 3 16 111. 3.58
8 1 29 19 2 18 111. 2.31
9 1 34 24 3 21 111. 2.17
10 1 41 31 1 22 111. 2.87
# ... with 4,388 more rows
At this point you have enough to play with, so I leave you to plot whatever you want.
The following [8] shows both the total negativity within a chapter and the per-line negativity within a chapter. We can see that there is less negativity towards the end of chapters. We can also see that there appears to be more negativity in later chapters (darker lines).
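One way to get a plot along those lines (a sketch with ggplot2; the actual figure may have been constructed differently):

plot_data %>% 
  ggplot(aes(x = line_chapter, y = negativity, group = chapter, color = chapter)) +
  geom_line(alpha = .5) +
  scale_color_gradient(low = 'gray90', high = 'gray10') +   # later chapters drawn darker
  labs(x = 'Line within chapter', y = 'Cumulative negativity')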
[5] I suggest not naming your column 'text' in practice. It is a base function in R, and using it within the tidyverse may result in problems distinguishing the function from the column name (similar to the n() function and the n column created by count and tally). I only do so for pedagogical reasons.
[6] There are almost always encoding issues in my experience.
[7] This exercise is more or less taken directly from the tidytext book.
[8] This depiction goes against many of my visualization principles. I like it anyway.