Dealing with text is typically not even considered in the applied statistical training of most disciplines. This stands in direct contrast to how often text has to be dealt with before more common analyses can begin, and to how interesting it can be as the focus of analysis in its own right. This document and the corresponding workshop aim to provide a sense of the things one can do with text, and the sorts of analyses that might be useful.
The goal of this workshop is primarily to provide a sense of common tasks related to dealing with text, whether as part of the data or as the focus of analysis, and to provide some relatively easy-to-use tools. It must be stressed that this is only a starting point, a hopefully fun foray into the world of text, not a definitive statement of how you should analyze text. In fact, some of the methods demonstrated would likely be too rudimentary for most goals.
Additionally, we’ll have exercises for practice, and those comfortable enough to do so should also follow along with the in-text examples. Note that there is more content here than will be covered in a single workshop.
The document is for the most part very applied in nature, and doesn’t assume much beyond familiarity with the R statistical computing environment. For programming purposes, it would be useful if you are familiar with the tidyverse, or at least dplyr specifically; otherwise some of the code may be difficult to understand. Those packages are also required if you want to run the code.
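For readers unsure whether they are familiar enough with dplyr, the following is a minimal sketch of the piped style that tidyverse code tends to use. The `docs` data and its column names are hypothetical and made up for illustration; they are not part of the workshop files.

```r
library(dplyr)

# Hypothetical toy data: three short "documents" (not from the workshop files)
docs <- tibble(
  id   = 1:3,
  text = c("Text is data", "Data can be messy", "Text can be fun")
)

docs %>%
  mutate(n_char = nchar(text)) %>%              # add a column: characters per document
  filter(grepl("Text", text, fixed = TRUE)) %>% # keep rows whose text contains "Text"
  arrange(desc(n_char))                         # sort from longest to shortest
```

If each step of that pipeline reads naturally, the code in the rest of the document should be reasonably easy to follow.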
Here are some of the packages used in this document:
- Topic Models
- Word Embedding
Note the following color coding used in this document:
To get set up for the workshop:
- Download the zip file here. It contains an RStudio project with several data files that you can use as you attempt to replicate the analyses. Be mindful of where you put it.
- Unzip it. Be mindful of where you put the resulting folder.
- Open RStudio.
- Choose File/Open Project, navigate to the folder you just created, and click on the blue icon in it.
- Install any of the above packages you want.
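The last setup step can be sketched in R as follows. Only `tidyverse` is named here because it is the one package requirement stated explicitly in the text; the `pkgs` vector is an assumption to be extended with whichever other packages you want.

```r
# Packages to install; "tidyverse" comes from the text above,
# add any others you want to this vector
pkgs <- c("tidyverse")

# Install only the packages that are not already present
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking against `installed.packages()` first simply avoids reinstalling what you already have; calling `install.packages()` directly works as well.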