Shakespeare Start to Finish

The following attempts to demonstrate the usual difficulties one encounters when dealing with text, by procuring and processing the works of Shakespeare. The source is MIT, which has made the ‘complete’ works available on the web since 1993, plus one other work from Gutenberg. The initial issue is simply getting the works from the web. Subsequently there is metadata, character names, stopwords etc. to be removed. At that point, we can stem and count the words in each work, after which we are ready for analysis.

The primary packages used are tidytext, stringr, and when things are ready for analysis, quanteda.

ACT I. Scrape MIT and Gutenberg Shakespeare

Scene I. Scrape main works

Initially we must scrape the web to get the documents we need. The rvest package will be used as follows.

  • Start with the url of the site
  • Get the links off that page to serve as base urls for the works
  • Scrape the document for each url
  • Deal with the collection of Sonnets separately
  • Write out results
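The first steps above can be sketched roughly as follows. This is a minimal illustration rather than the original code: the index page is an inline HTML string so the example runs offline, and `main_url` and the link names are placeholders, not the real site layout.

```r
library(rvest)

# Stand-in for the index page; in practice this would be read_html(main_url).
# The URL and file names are placeholders (assumptions for illustration).
main_url   <- "http://example.org/shakespeare/"
index_page <- read_html(
  '<html><body>
     <a href="hamlet/index.html">Hamlet</a>
     <a href="macbeth/index.html">Macbeth</a>
     <a href="Poetry/sonnets.html">The Sonnets</a>
   </body></html>'
)

# Get the links off the page to serve as base urls for the works
links <- html_attr(html_elements(index_page, "a"), "href")

# Set the Sonnets aside so they can be handled separately
is_sonnets <- grepl("sonnets", links, ignore.case = TRUE)
work_urls  <- paste0(main_url, links[!is_sonnets])

work_urls

# Each url would then be scraped in turn, e.g.
# texts <- lapply(work_urls, function(u) html_text(read_html(u)))
```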

Now we just paste the main site url to the work urls and download them. Here is where we come across our first snag. The html_text function has what I would call a bug but what the author feels is a feature. Basically, it ignores line breaks of the form <br> in certain situations. This means it will smash text together that shouldn’t be, thereby making any analysis of it fairly useless [1]. Luckily, @rentrop provided a solution, which is in r/fix_read_html.R.
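To see the <br> problem in miniature, here is a small sketch. It is not the contents of r/fix_read_html.R, just one way to get the same effect: pull out the individual text nodes with xml2 and join them with newlines.

```r
library(rvest)
library(xml2)

page <- read_html("<html><body><p>x<br>y<br>z</p></body></html>")
node <- html_element(page, "p")

# Default extraction drops the <br> tags and smashes the pieces together
smashed <- html_text(node)

# Workaround: collect the separate text nodes and join them with newlines
pieces <- xml_text(xml_find_all(node, ".//text()"))
fixed  <- paste(pieces, collapse = "\n")

smashed
cat(fixed)
```

More recent versions of rvest also provide html_text2, which turns <br> into line breaks, so that may be worth trying first if your installed version has it.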

Scene III. Save and write out

Now we can save our results so we won’t have to repeat any of the previous scraping. We want to save the main text object as an RData file, and write out the texts to their own file. When dealing with text, you’ll regularly want to save stages so you can avoid repeating what you don’t have to, as often you will need to go back after discovering new issues further down the line.
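A minimal sketch of that save step, assuming the scraped works sit in a named list of character vectors (the object name `works`, the toy lines, and the output folder are all made up for illustration):

```r
# Toy stand-in for the scraped works: a named list of character vectors
works <- list(
  Hamlet  = c("ACT I", "Who's there?"),
  Macbeth = c("ACT I", "When shall we three meet again")
)

out_dir <- file.path(tempdir(), "shakes_raw")
dir.create(out_dir, showWarnings = FALSE)

# Save the whole object so the scraping never has to be repeated
save(works, file = file.path(out_dir, "works.RData"))

# Write each work out to its own text file
for (title in names(works)) {
  writeLines(works[[title]], file.path(out_dir, paste0(title, ".txt")))
}

list.files(out_dir)
```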

Scene IV. Read text from files

Once the above is done, it doesn’t need to be redone, and we can always retrieve what we need from the saved files. I’ll start with the raw text as files, as that is one of the more common ways one deals with documents. When text is nice and clean, this can be fairly straightforward.

The function at the end comes from the tidyr package. Up to that line, each element in the text column is the entire text, while the column itself is thus a ‘list-column’. In other words, we have a 42 x 2 tibble. But to do what we need, we’ll want to have access to each line, and the unnest function unpacks each line within the title, giving one row per line. The first few lines of the result are shown after.
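Here is a small self-contained sketch of the list-column idea. The two files are tiny stand-ins written to a temporary folder, and the object names are my own, not those of the original code.

```r
library(dplyr)
library(tidyr)

# Write two tiny stand-in files so the example runs on its own
txt_dir <- file.path(tempdir(), "shakes_txt")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("ACT I", "Who's there?"), file.path(txt_dir, "Hamlet.txt"))
writeLines(c("ACT I", "When shall we three meet again", "In thunder, lightning, or in rain?"),
           file.path(txt_dir, "Macbeth.txt"))

files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# One row per file; text is a list-column holding all the lines of that file
works_nested <- tibble(
  id   = basename(files),
  text = lapply(files, readLines)
)

# Unpack the list-column: one row per line
works_lines <- unnest(works_nested, text)

works_lines
```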

Scene V. Add additional works

It is typical to be gathering texts from multiple sources. In this case, we’ll get The Phoenix and the Turtle from the Project Gutenberg website. There is an R package that will allow us to work directly with the site, making the process straightforward [2]. I also considered two other works, but I refrained from “The Two Noble Kinsmen” because, like many of the Shakespeare versions on Gutenberg, it’s basically written in a different language. I also refrained from The Passionate Pilgrim because it’s mostly not Shakespeare.

When first doing this project, I actually started with Gutenberg, but it became a notable PITA. The texts were inconsistent in source, and sometimes reproduced printing errors purposely, which would have compounded typical problems. I thought it could have been solved by using the Complete Works of Shakespeare, but the download only came with that title, meaning one would have to hunt for and delineate each separate work. This might not have been too big of an issue, except that there is no table of contents, nor consistent naming of titles across different printings. The MIT approach, on the other hand, was a few lines of code. This represents a common issue in text analysis when dealing with sources: a different option may save a lot of time in the end.

The following code could be more succinct to deal with one text, but I initially was dealing with multiple works, so I’ve left it in that mode. In the end, we’ll have a tibble with an id column for the file/work name, and another column that contains the lines of text.

ACT II. Preliminary Cleaning

If you think we’re even remotely getting close to being ready for analysis, I say Ha! to you. Our journey has only just begun (cue the Carpenters).

Now we can start thinking about prepping the data for eventual analysis. One of the nice things about having the data in a tidy format is that we can use string functionality over the column of text in a simple fashion.

Scene I. Remove initial text/metadata

First on our to-do list is to get rid of all the preliminary text of titles, authorship, and similar. This is fairly straightforward when you realize the text we want will be associated with something like ACT I, or in the case of the Sonnets, the word Sonnet. So, the idea is to drop all text up to those points. I’ve created a function that will do that, and then just apply it to each work’s tibble [3]. For the poems and A Funeral Elegy for Master William Peter, we look instead for the line where his name or initials start the line.

# A tibble: 6 x 2
  id               text                         
  <chr>            <chr>                        
1 Romeo_and_Juliet Romeo and Juliet: Entire Play
2 Romeo_and_Juliet " "                          
3 Romeo_and_Juliet ""                           
4 Romeo_and_Juliet ""                           
5 Romeo_and_Juliet ""                           
6 Romeo_and_Juliet Romeo and Juliet             

# A tibble: 6 x 2
  id               text    
  <chr>            <chr>   
1 Romeo_and_Juliet ""      
2 Romeo_and_Juliet ""      
3 Romeo_and_Juliet PROLOGUE
4 Romeo_and_Juliet ""      
5 Romeo_and_Juliet ""      
6 Romeo_and_Juliet ""      
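The front-matter removal described in this scene could be sketched along these lines. The helper name `drop_front_matter` and its start pattern are hypothetical, and the original function evidently keeps a few lines that this one would not; the point is only the split, apply, and recombine pattern.

```r
library(dplyr)

# Hypothetical helper: keep everything from the first line that starts with a marker
drop_front_matter <- function(work, start_pattern = "^(ACT I|PROLOGUE|Sonnet)") {
  first_hit <- which(grepl(start_pattern, work$text))[1]
  if (is.na(first_hit)) return(work)   # no marker found: leave the work untouched
  work[first_hit:nrow(work), ]
}

works <- tibble(
  id   = c(rep("Romeo_and_Juliet", 4), rep("Sonnets", 3)),
  text = c("Romeo and Juliet: Entire Play", "", "PROLOGUE",
           "Two households, both alike in dignity,",
           "The Sonnets", "Sonnet I",
           "From fairest creatures we desire increase,")
)

# Split on id, apply the helper to each work, then recombine
cleaned <- bind_rows(lapply(split(works, works$id), drop_front_matter))

cleaned
```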

Scene II. Miscellaneous removal

Next, we’ll want to remove empty rows, any remaining titles, lines that denote the act or scene, and other stuff. I’m going to remove the words prologue and epilogue as stopwords later. While some texts have a line that just says that (PROLOGUE), others have text that describes the scene (Prologue. Blah blah), which I’ve decided to keep. As such, we just need the word itself gone.

# A tibble: 3,992 x 2
   id               text                                           
   <chr>            <chr>                                          
 1 Romeo_and_Juliet PROLOGUE                                       
 2 Romeo_and_Juliet Two households, both alike in dignity,         
 3 Romeo_and_Juliet In fair Verona, where we lay our scene,        
 4 Romeo_and_Juliet From ancient grudge break to new mutiny,       
 5 Romeo_and_Juliet Where civil blood makes civil hands unclean.   
 6 Romeo_and_Juliet From forth the fatal loins of these two foes   
 7 Romeo_and_Juliet A pair of star-cross'd lovers take their life; 
 8 Romeo_and_Juliet Whose misadventured piteous overthrows         
 9 Romeo_and_Juliet Do with their death bury their parents' strife.
10 Romeo_and_Juliet The fearful passage of their death-mark'd love,
# ... with 3,982 more rows
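A rough sketch of this kind of row filtering with dplyr and stringr follows. The regular expressions are assumptions about what the act and scene lines look like, not the patterns actually used above.

```r
library(dplyr)
library(stringr)

lines <- tibble(
  id   = "Romeo_and_Juliet",
  text = c("", "ACT I", "SCENE I. Verona. A public place.", " ",
           "Two households, both alike in dignity,", "Romeo and Juliet")
)

cleaned <- lines %>%
  filter(str_trim(text) != "") %>%                       # empty or whitespace-only rows
  filter(!str_detect(text, "^(ACT|SCENE) [IVX]+")) %>%   # act and scene headers
  filter(text != str_replace_all(id, "_", " "))          # a leftover title line

cleaned
```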

ACT III. Stop words

As we’ve noted before, we’ll want to get rid of stop words, things like articles, possessive pronouns, and other very common words. In this case, we also want to include character names. However, the big wrinkle here is that this is not English as currently spoken, so we need to remove ‘ye’, ‘thee’, ‘thine’ etc. In addition, there are things that need to be replaced, like o’er to over, which may then also be removed. In short, this is not so straightforward.

Scene I. Character names

We’ll get the list of character names via rvest, but I added some from the poems and others that still came through the processing one way or another, e.g. abbreviated names.

A new snag is that some characters with multiple names may be represented (typically) by the first or last name, or in the case of three names, the middle, e.g. Sir Toby Belch. Others have more awkward names, e.g. RICHARD PLANTAGENET (DUKE OF GLOUCESTER). The following should capture everything by splitting the names on spaces, removing parentheses, and keeping unique terms.

[1] "Children" "Dionyza"  "Aaron"   
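The split, strip, and deduplicate idea can be sketched as below. The cast entries are just the examples mentioned above, and lower-casing at the end is my own assumption, made so the terms would match lower-cased tokens later.

```r
library(stringr)

# A few names in the style of a cast list (illustrative only)
cast <- c("SIR TOBY BELCH", "RICHARD PLANTAGENET (DUKE OF GLOUCESTER)", "Aaron")

no_parens  <- str_remove_all(cast, "[()]")          # remove parentheses
pieces     <- unlist(str_split(no_parens, "\\s+"))  # split the names on spaces
name_terms <- unique(str_to_lower(pieces))          # keep unique terms

name_terms
```

Note that this also lets through words like ‘of’ and ‘sir’, which is harmless if they are going to be treated as stopwords anyway.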

Scene II. Old, Middle, & Modern English

While Shakespeare is considered Early Modern English, some text may be more historical, so I include Middle and Old English stopwords, as they were readily available from the cltk Python module. I also added some things to the modern English list like “thou’ldst” that I found lingering after initial passes. I first started using the works from Gutenberg, and there, the Old English might have had some utility. As the texts there were inconsistently translated and otherwise problematic, I abandoned using them. Here, applying the Old English vocabulary to these texts only removes ‘wit’, so I refrain from using it.

Scene III. Remove stopwords

We’re now ready to start removing words. However, right now, we have lines not words. We can use the tidytext function unnest_tokens, which is like unnest from tidyr, but works on different tokens, e.g. words, sentences, or paragraphs. Note that by default, the function will make all words lower case to make matching more efficient.

We also will be doing a little stemming here. I’m getting rid of the suffixes that follow an apostrophe. Many of the remaining words will either be stopwords or need to be further stemmed later. I also created a middle/modern English stemmer for words that are not caught otherwise (me_st_stem). Again, this is the sort of thing you discover after initial passes (e.g. ‘criedst’). After that, we can use anti_join to remove the stopwords.

As before, you should do a couple spot checks.
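The tokenize, trim, and anti_join sequence from this scene might look something like the following on two lines of Romeo and Juliet. The stopword list here is a toy one, and the apostrophe rule is a simplification of whatever the real cleaning does.

```r
library(dplyr)
library(stringr)
library(tidytext)

lines <- tibble(
  id   = "Romeo_and_Juliet",
  text = c("A pair of star-cross'd lovers take their life;",
           "The fearful passage of their death-mark'd love,")
)

# Toy stopword list; the real one combines several sources plus character names
my_stopwords <- tibble(word = c("a", "of", "the", "their"))

words <- lines %>%
  unnest_tokens(word, text) %>%                 # one row per lower-cased word
  mutate(word = str_remove(word, "'.*$")) %>%   # drop an apostrophe and what follows it
  filter(word != "") %>%
  anti_join(my_stopwords, by = "word")          # remove the stopwords

words$word
```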


ACT IV. Other fixes

Now we’re ready to finally do the word counts. Just kidding! There is still work to do for the remainder, and you’ll continue to spot things after runs. One remaining issue is the words that end in ‘st’ and ‘est’, and others that are not consistently spelled or otherwise need to be dealt with. For example, ‘crost’ will not be stemmed to ‘cross’, as ‘crossed’ would be. Finally, I limit the result to words that have more than two characters, as my inspection suggested the shorter ones are left-over suffixes, or otherwise would be considered stopwords anyway.

At this point we could still maybe add things to this list of additional fixes, but I think it’s time to actually start playing with the data.
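As a tiny illustration of these last fixes, here is a sketch with a hypothetical lookup of irregular spellings followed by the length filter. The `fixes` vector is made up; a real list would grow as you inspect the results.

```r
library(dplyr)

words <- tibble(word = c("crost", "love", "st", "er", "night"))

# Hypothetical lookup of spellings found by inspection: names are the
# originals, values are the replacements
fixes <- c(crost = "cross")

words_fixed <- words %>%
  mutate(word = ifelse(word %in% names(fixes), unname(fixes[word]), word)) %>%
  filter(nchar(word) > 2)   # keep only words with more than two characters

words_fixed$word
```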

ACT V. Fun stuff

We are finally ready to get to the fun stuff. Finally! And now things get easy.

Scene I. Count the terms

We can get term counts with standard dplyr approaches, and packages like tidytext will take that and also do some other things we might want. Specifically, we can use the latter to create the document-term matrix (DTM) that will be used in other analysis. The function cast_dfm will create a dfm class object, or ‘document-feature’ matrix class object (from quanteda), which is the same thing but recognizes this sort of stuff is not specific to words. With word counts in hand, it would be good to save at this point, since they’ll serve as the basis for other processing.

# A tibble: 115,954 x 3
# Groups:   id, word [115,954]
   id                          word      n
   <chr>                       <chr> <int>
 1 Sonnets                     love    195
 2 The_Two_Gentlemen_of_Verona love    171
 3 Romeo_and_Juliet            love    150
 4 As_You_Like_It              love    118
 5 Love_s_Labour_s_Lost        love    118
 6 A_Midsummer_Night_s_Dream   love    114
 7 Richard_III                 god     111
 8 Titus_Andronicus            rome    103
 9 Much_Ado_about_Nothing      love     92
10 Coriolanus                  rome     90
# ... with 115,944 more rows

Now things are looking like Shakespeare, with love for everyone [4]. You’ll notice I’ve kept place names such as Rome, but this might be something you’d prefer to remove. Other candidates would be madam, woman, man, majesty (as in ‘his/her’) etc. This sort of thing is up to the researcher.
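For reference, the count-then-cast step can be sketched on toy data as follows. The words and counts here are invented; only count and cast_dfm are the point.

```r
library(dplyr)
library(tidytext)
library(quanteda)

words <- tibble(
  id   = c("Sonnets", "Sonnets", "Sonnets", "Coriolanus", "Coriolanus"),
  word = c("love", "love", "beauty", "rome", "love")
)

# Term counts per document, most frequent first
term_counts <- count(words, id, word, sort = TRUE)

# Cast the long counts to a quanteda document-feature matrix
shakes_dfm <- cast_dfm(term_counts, document = id, term = word, value = n)

term_counts
shakes_dfm
```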

Scene II. Stemming

Now we’ll stem the words. This is actually more of a pre-processing step, one that we’d do along with (and typically after) stopword removal. I do it here mostly to demonstrate how to use quanteda to do it, as it can also be used to remove stopwords and do many of the other things we did with tidytext.

Stemming will make words like eye and eyes just ey, or convert war, wars and warring to war. In other words, it will reduce variations of a word to a common root form, or ‘word stem’. We could have done this in a step prior to counting the terms, but then you only have the stemmed result to work with for the document term matrix from then on. Depending on your situation, you may or may not want to stem, or maybe you’d want to compare results. The quanteda package will actually stem with the DTM (i.e. work on the column names) and collapse the word counts accordingly. I note the difference in words before and after stemming.

Document-feature matrix of: 43 documents, 22,052 features (87.8% sparse).
[1] 22052
Document-feature matrix of: 43 documents, 13,325 features (83.8% sparse).
[1] 13325

The result is notably fewer columns, which will speed up any analysis, as well as produce a slightly more dense matrix.
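A toy version of stemming a DTM, to show how the columns collapse. The two ‘documents’ are invented, and the exact stems you get depend on the stemmer quanteda is set to use.

```r
library(quanteda)

toks <- tokens(c(doc1 = "war wars warring", doc2 = "eye eyes war"))

raw_dfm     <- dfm(toks)
stemmed_dfm <- dfm_wordstem(raw_dfm)   # stems the features and merges their counts

nfeat(raw_dfm)       # number of columns before stemming
nfeat(stemmed_dfm)   # number of columns after stemming
featnames(stemmed_dfm)
```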

Scene III. Exploration

Top features

Let’s start looking at the data more intently. The following shows the 10 most common words and their respective counts. This is also an easy way to find candidates to add to the stopword list. Note that dai and prai are stems for day and pray. Love occurs 2.15 times as often as the second most frequent word!

 love heart   eye   god   day  hand  hear  live death night 
 2918  1359  1300  1284  1229  1226  1043  1015  1010  1001 

The following is a word cloud. They are among the most useless visual displays imaginable. Just because you can, doesn’t mean you should.

If you want to display relative frequency do so.
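As a small sketch of both ideas, the following pulls the top features from a toy dfm and turns them into relative frequencies, which could then go into an ordinary bar or dot plot instead of a word cloud. The two documents are invented.

```r
library(quanteda)

toks <- tokens(c(d1 = "love love heart eye", d2 = "love god heart"))
x    <- dfm(toks)

# Most frequent features and their counts
top <- topfeatures(x, 2)

# Relative frequency: share of all tokens in the matrix
rel <- top / sum(colSums(x))

top
round(rel, 3)
```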


The quanteda package has some built in similarity measures such as cosine similarity, which you can think of similarly to the standard correlation (also available as an option). I display it visually to better get a sense of things.

We can already begin to see the clusters of documents. For example, the more historical are the clump in the upper left. The oddball is The Phoenix and the Turtle, though Lover’s Complaint and the Elegy are also less similar than standard Shakespeare. The Phoenix and the Turtle is about the death of ideal love, represented by the Phoenix and Turtledove, for which there is a funeral. It actually is considered by scholars to be in stark contrast to his other output. Elegy itself is actually written for a funeral, but probably not by Shakespeare. A Lover’s Complaint is thought to be an inferior work by the Bard by some critics, and maybe not even authored by him, so perhaps what we’re seeing is a reflection of that lack of quality. In general, we’re seeing things that we might expect.
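If cosine similarity is unfamiliar, it is just the dot product of two documents’ term-count vectors divided by the product of their lengths. Here it is by hand in base R on a made-up matrix; the quanteda function textstat_simil (housed in the quanteda.textstats package in recent versions) does the same thing at scale.

```r
# Toy document-term counts (rows are documents); the values are invented
m <- rbind(
  Hamlet  = c(love = 2, death = 3, crown = 1),
  Macbeth = c(love = 1, death = 4, crown = 2),
  Sonnets = c(love = 5, death = 1, crown = 0)
)

# Cosine similarity: dot products scaled by the vector lengths
norms   <- sqrt(rowSums(m^2))
cos_sim <- (m %*% t(m)) / (norms %o% norms)

round(cos_sim, 2)
```

In this toy case the two tragedies come out more similar to each other than either is to the Sonnets, which is the kind of pattern the visualization is meant to reveal.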


We can examine readability scores for the texts, but for this we’ll need them in raw form. We already had them from before; I just added Phoenix from the Gutenberg download.

# A tibble: 43 x 2
   id                            text          
   <chr>                         <list>        
 1 A_Lover_s_Complaint.txt       <chr [813]>   
 2 A_Midsummer_Night_s_Dream.txt <chr [6,630]> 
 3 All_s_Well_That_Ends_Well.txt <chr [10,993]>
 4 Antony_and_Cleopatra.txt      <chr [14,064]>
 5 As_You_Like_It.txt            <chr [9,706]> 
 6 Coriolanus.txt                <chr [13,440]>
 7 Cymbeline.txt                 <chr [11,388]>
 8 Elegy.txt                     <chr [1,316]> 
 9 Hamlet.txt                    <chr [13,950]>
10 Henry_V.txt                   <chr [9,777]> 
# ... with 33 more rows

With raw texts, we need to convert them to a corpus object to proceed more easily. The corpus function from quanteda won’t read directly from a list column, or a list at all, so we’ll convert it via the tm package. This more or less defeats the purpose of using the quanteda package, except that the textstat_readability function gives us what we want. But I digress.
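As an aside, a different route from the tm conversion used here is to collapse each work’s lines into a single string and hand quanteda a named character vector. This is only a sketch of that alternative on toy data, not what was done for the results that follow.

```r
library(quanteda)

# Toy stand-in for the list-column of raw lines
raw_texts <- list(
  Hamlet.txt  = c("Who's there?", "Nay, answer me."),
  Macbeth.txt = c("When shall we three meet again?")
)

# Collapse each work's lines into one string; the names become document ids
collapsed <- vapply(raw_texts, paste, character(1), collapse = " ")

shakes_corpus <- corpus(collapsed)

ndoc(shakes_corpus)
docnames(shakes_corpus)
```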

Unfortunately, the concept of readability is ill-defined, and as such, there are dozens of measures available dating back nearly 75 years. The following is based on the Coleman-Liau grade score (higher grade = more difficult). The conclusion here is first, that Shakespeare isn’t exactly a difficult read, and second, that the poems may be more so relative to the other works.

Lexical diversity

There are also metrics of lexical diversity. As with readability, there is no one way to measure ‘diversity’. Here we’ll go back to using the standard DTM, as the focus is on the terms, whereas readability is more at the sentence level. Most standard measures of lexical diversity are variants on what is called the type-token ratio, which in our setting is the number of unique terms (types) relative to the total terms (tokens). We can use textstat_lexdiv for our purposes here, which will provide several measures of diversity by default.

This visual is based on the (absolute) scaled values of those several metrics, and might suggest that the poems are relatively more diverse. This certainly might be the case for Phoenix, but it could also be a reflection of the limitation of several of the measures, such that longer works are seen as less diverse, as tokens are added more so than types the longer the text goes.
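To make the type-token ratio concrete, and to show why longer texts tend to score lower, here is a hand calculation on two invented ‘works’ using dplyr.

```r
library(dplyr)

tokens_df <- tibble(
  id   = c(rep("short_poem", 4), rep("long_play", 8)),
  word = c("love", "death", "phoenix", "turtle",
           "love", "love", "king", "love", "king", "crown", "love", "king")
)

ttr <- tokens_df %>%
  group_by(id) %>%
  summarise(
    types  = n_distinct(word),   # unique terms
    tokens = n(),                # total terms
    ttr    = types / tokens
  )

ttr
```

The longer text keeps reusing the same words, so tokens grow faster than types and the ratio drops, even though nothing about the writing is necessarily less ‘diverse’.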

As a comparison, the following shows the results of the ‘Measure of Textual Lexical Diversity’ (MTLD) calculated using the koRpus package [5]. It is notably less affected by text length, though the conclusions are largely the same. There is notable correlation between the MTLD and readability as well [6]. In general, Shakespeare tends to be more expressive in poems, and less so with comedies.

Scene IV. Topic model

I’d say we’re now ready for a topic model. That didn’t take too much, did it?

Running the model and exploring the topics

We’ll run one with 10 topics. As in the previous example in this document, we’ll use topicmodels and the LDA function. Later, we’ll also compare our results with the traditional classifications of the texts. Note that this will take a while to run depending on your machine (maybe a minute or two). A faster implementation can be found with text2vec.

One of the first things to do is to interpret the topics, and we can start by seeing which terms are most probable for each topic.
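The following is a toy-sized sketch of fitting the model and pulling out the most probable terms. The counts are invented, two topics are used instead of ten, and going through cast_dtm (which needs the tm package installed) is my assumption rather than necessarily how the DTM was passed in the original analysis.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

term_counts <- tibble(
  id   = rep(c("Hamlet", "Macbeth", "Sonnets", "Venus_and_Adonis"), each = 4),
  word = c("king", "death", "crown", "ghost",
           "king", "death", "crown", "blood",
           "love", "beauty", "time", "rose",
           "love", "beauty", "kiss", "rose"),
  n    = as.integer(c(5, 4, 3, 2,  6, 5, 2, 3,  7, 5, 3, 2,  6, 4, 3, 2))
)

# Document-term matrix in the format topicmodels expects
shakes_dtm <- cast_dtm(term_counts, document = id, term = word, value = n)

# k and the seed are arbitrary choices for this illustration
shakes_lda <- LDA(shakes_dtm, k = 2, control = list(seed = 1234))

terms(shakes_lda, 3)                       # most probable terms per topic
doc_topics <- posterior(shakes_lda)$topics # topic probabilities per document
round(doc_topics, 2)
```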

We can see there is a lot of overlap in these topics for top terms. Just looking at the top 10, love occurs in all of them, god and heart are common as well, but we could have guessed this just looking at how often they occur in general. Other measures can be used to assess term importance, such as those that seek to balance the term’s probability of occurrence within a document, and term exclusivity, or how likely a term is to occur in only one particular topic. See the stm package and corresponding labelTopics function as a way to get several alternatives. As an example, I show the results of their version of the following [7]:

  • FREX: FRequency and EXclusivity, it is a weighted harmonic mean of a term’s rank within a topic in terms of frequency and exclusivity.
  • lift: Ratio of the term’s probability within a topic to its probability of occurrence across all documents. Overly sensitive to rare words.
  • score: Another approach that will give more weight to more exclusive terms.
  • prob: This is just the raw probability of the term within a given topic.
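To see how prob and lift can disagree, here is the arithmetic on a made-up pair of topics in base R. The probabilities are invented, and this is just the ratio as defined in the list above, not the stm implementation.

```r
# Toy topic-term probabilities (each row is a topic and sums to 1)
beta <- rbind(
  topic1 = c(love = 0.50, king = 0.10, dane = 0.40),
  topic2 = c(love = 0.60, king = 0.35, dane = 0.05)
)

# Overall probability of each term across all documents (made-up values)
overall <- c(love = 0.55, king = 0.30, dane = 0.15)

# prob: raw probability of the term within a topic
prob <- beta

# lift: probability within the topic relative to the overall probability
lift <- sweep(beta, 2, overall, "/")

round(prob, 2)
round(lift, 2)
```

By raw probability, love tops both topics. By lift, dane stands out for the first topic, since it is far more likely there than it is in general. This is the sense in which the other measures pick up on more distinctive words.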

As another approach, consider the saliency and relevance of a term via the LDAvis package. While you can play with it here, it’s probably easier to open it separately. Note that this has to be done separately from the model, and may have topic numbers in a different order.

Given all these measures, one can assess how well they match what topics the documents would be most associated with.

For example, based just on term frequency, Hamlet is most likely to be associated with Topic 1. That topic is affiliated with the (stemmed words) love, night, heaven, heart, natur, ey, hear, hand, life, fear, death, prai, poor, friend, soul, hold, word, live, stand, head. The other measures pick up on words like Dane and Denmark. Sounds about right for Hamlet.

The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I’ve also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram [8]. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem.

A couple things stand out. To begin with, most works are associated with one topic [9]. In terms of the discovered topics, traditional classification probably only works for the historical works, as they cluster together as expected (except for Henry VIII, possibly due to it being a collaborative work). Furthermore, tragedies and comedies might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the poems (love, beauty). If we take four clusters from the cluster analysis, the result boils down to Phoenix, Complaint, standard poems, a mixed bag of more romance-oriented works and the remaining poems, then everything else.

Alternatively, one could merely classify the works based on their probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization attempts to order them based on their most probable topic. The order is based on the most likely topics across all documents.

The following shows the average topic probability for each of the traditional classes. Topics are represented by their first five most probable terms.

Aside from the poems, the classes are a good mix of topics, and appear to have some overlap. Tragedies are perhaps most diverse.

Summary of Topic Models

This is where the summary would go, but I grow weary…


  1. If you can think of a use case where x<br>y<br>z leading to xyz would be both expected as default behavior and desired please let me know.

  2. If this surprises you, let me remind you that there are over 10k packages on CRAN alone.

  3. I found it easier to work with the entire data frame for the function, hence splitting it on id and recombining. Some attempt was made to work within the tidyverse, but there were numerous issues to what should have been a fairly easy task.

  4. Love might as well be a stopword for Shakespeare.

  5. I don’t show this as I actually did it in parallel due to longer works taking a notable time to calculate MTLD.

  6. The Pearson correlation between MTLD and the Coleman-Liau grade readability depicted previously was .87.

  7. These descriptions are from Sievert and Shirley 2014.

  8. If you are actually interested in clustering the documents (or anything for that matter in my opinion), this would not be the way to do so. For one, the documents are already clustered based on most probable topic. Second, cosine distance isn’t actually a proper distance. Third, as shocking as it may seem, newer methods have been developed since the hierarchical clustering approach, which basically has a dozen arbitrary choices to be made at each step. However, as a simple means to a visualization, the method is valuable if it helps with understanding the data.

  9. There isn’t a lot to work with in the realm of choosing an ‘optimal’ number of topics, but I investigated it via a measure called perplexity. It bottomed out at around 50 topics. Usually such an approach is done through cross-validation. However, the solution chosen has no guarantee to produce human interpretable topics.