String Theory
Basic data types
R has several core data structures:
- Vectors
- Factors
- Lists
- Matrices/arrays
- Data frames
Vectors form the basis of R data structures. There are two main types: atomic vectors and lists. All elements of an atomic vector are the same type.
Examples include:
- character
- numeric (double)
- integer
- logical
Character strings
When dealing with text, you’ll typically be working with objects of class character.
Not much to it, but be aware there is no real limit to what is represented as a character vector. For example, in a data frame, you could have a column where each entry is one of the works of Shakespeare.
Factors
Although not exactly precise, one can think of factors as integers with labels. So, the underlying representation of a variable for sex is 1:2 with labels ‘Male’ and ‘Female’. They are a special class with attributes, or metadata, that contain the information about the levels.
$levels
[1] "a" "b" "c"
$class
[1] "factor"
While the underlying representation is numeric, it is important to remember that factors are categorical. They can’t be used as numbers would be, as the following demonstrates.
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Error in Summary.factor(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : 'sum' not meaningful for factors
Any numbers could be used, what we’re interested in are the labels, so a ‘sum’ doesn’t make any sense. All of the following would produce the same factor.
factor(c(1, 2, 3), labels=c('a', 'b', 'c'))
factor(c(3.2, 10, 500000), labels=c('a', 'b', 'c'))
factor(c(.49, 1, 5), labels=c('a', 'b', 'c'))
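The output earlier in this section appears without the code that produced it. The following is a minimal sketch of calls that would behave the same way; the vector x is an assumption built to match the printed levels and integer codes, not necessarily the original object.

```r
# Assumed example factor: three levels, ten observations each
x = factor(rep(letters[1:3], each = 10))

attributes(x)    # the levels and class metadata
as.numeric(x)    # the underlying integer codes

# sum() is refused for factors; capture the error message instead of stopping
result = tryCatch(sum(x), error = function(e) conditionMessage(e))
result

# Different underlying numbers, same labels: the resulting factors are identical
f1 = factor(c(1, 2, 3), labels = c('a', 'b', 'c'))
f2 = factor(c(3.2, 10, 500000), labels = c('a', 'b', 'c'))
f3 = factor(c(.49, 1, 5), labels = c('a', 'b', 'c'))
identical(f1, f2) && identical(f2, f3)
```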
Because of the integer+metadata representation, factors are actually smaller than character strings, often notably so.
[1] "80.8 Kb"
[1] "42.4 Kb"
[1] "39.1 Kb"
However, if memory is really a concern, switching to factors probably won’t help much; better hardware will.
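A size comparison of this kind can be sketched with object.size. The data here are made up for illustration, so the numbers will not match the ones printed above; only the direction of the difference is expected to hold.

```r
# Hypothetical data: a character vector with few distinct values
set.seed(1)
x_char = sample(c('Male', 'Female'), 10000, replace = TRUE)
x_fac  = factor(x_char)

# Report the memory footprint of each representation
format(object.size(x_char), units = 'Kb')
format(object.size(x_fac), units = 'Kb')
```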
Analysis
It is important to know that raw text cannot be analyzed quantitatively. There is no magic that takes a categorical variable with text labels and estimates correlations among words and other words or numeric data. Everything that can be analyzed must have some numeric representation first, and this is where factors come in. For example, here is a data frame with two categorical predictors (factor*), a numeric predictor (x), and a numeric target (y). What follows is what it looks like if you wanted to run a regression model in that setting.
library(tidyverse)  # provides crossing(), mutate() and %>%

df =
  crossing(factor_1 = c('A', 'B'),
           factor_2 = c('Q', 'X', 'J')) %>%
  mutate(x = rnorm(6),
         y = rnorm(6))
df
# A tibble: 6 x 4
factor_1 factor_2 x y
<chr> <chr> <dbl> <dbl>
1 A J 0.797 -0.190
2 A Q -1.000 -0.496
3 A X 1.05 0.487
4 B J -0.329 -0.101
5 B Q 0.905 -0.809
6 B X 1.18 -1.92
(Intercept) | x | factor_1B | factor_2Q | factor_2X |
---|---|---|---|---|
1 | 0.7968603 | 0 | 0 | 0 |
1 | -0.9999264 | 0 | 1 | 0 |
1 | 1.0522363 | 0 | 0 | 1 |
1 | -0.3291774 | 1 | 0 | 0 |
1 | 0.9049071 | 1 | 1 | 0 |
1 | 1.1754300 | 1 | 0 | 1 |
The model.matrix function exposes the underlying matrix that is actually used in the regression analysis. You’d get a coefficient for each column of that matrix. As such, even the intercept must be represented in some fashion. For categorical data, the default coding scheme is dummy coding. By default, R takes the first level as the reference category (alphabetically first unless you set the levels yourself; which one is used doesn’t matter for the model fit, and you can always change it), while the other categories are represented by indicator variables, where a 1 represents the corresponding label and everything else is zero. For details on this coding scheme or others, consult any basic statistical modeling book.
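The call that produced the matrix above is not shown. As a rough sketch, something like the following generates a design matrix with the same columns. The data frame df_demo is a stand-in built with base R, so its values will differ from those printed above.

```r
# Hypothetical stand-in for the data above (no seed was given in the text)
set.seed(123)
df_demo = expand.grid(factor_1 = c('A', 'B'),
                      factor_2 = c('J', 'Q', 'X'),
                      stringsAsFactors = TRUE)
df_demo$x = rnorm(6)
df_demo$y = rnorm(6)

# Design matrix for a regression of y on all three predictors
X = model.matrix(y ~ x + factor_1 + factor_2, data = df_demo)
colnames(X)   # the reference levels A and J get no column of their own

# Changing the reference category changes which indicators appear
df_demo$factor_2 = relevel(df_demo$factor_2, ref = 'X')
X_releveled = model.matrix(y ~ x + factor_1 + factor_2, data = df_demo)
colnames(X_releveled)
```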
In addition, you’ll note that in all text-specific analysis, the underlying information is numeric. For example, with topic models, the base data structure is a document-term matrix of counts.
Characters vs. Factors
The main thing to note is that factors are generally a statistical phenomenon, and are required to do statistical things with data that would otherwise be a simple character string. If you know the relatively few levels the data can take, you’ll generally want to use factors, or at least know that statistical packages and methods will require them. In addition, factors allow you to easily overcome the silly default alphabetical ordering of category levels in some very popular visualization packages.
For other things, such as text analysis, you’ll almost certainly want character strings instead, and in many cases it will be required. It’s also worth noting that a lot of base R and other behavior will coerce strings to factors. This made a lot more sense in the early days of R, but is not really necessary these days, and since R 4.0.0 functions such as data.frame and read.table no longer do so by default.
Basic Text Functionality
Base R
A lot of folks new to R are not aware of just how much basic text processing R comes with out of the box. Here are examples of note.
- paste: glue text/numeric values together
- substr: extract or replace substrings in a character vector
- grep family: use regular expressions to deal with patterns of text
- strsplit: split strings
- nchar: how many characters in a string
- as.numeric: convert a string to numeric if it can be
- strtoi: convert a string to integer if it can be (faster than as.integer)
- adist: string distances
I probably use paste/paste0 more than most things when dealing with text, as string concatenation comes up so often. The following provides some demonstration.
[1] "a|b|cd"
[1] "abcd"
[1] "abcd"
[1] "x1" "x2" "x3"
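The calls behind the output above are not shown, so here is one set of calls that yields the same four results, followed by a few of the other base R functions from the list. The example strings beyond 'a', 'b', 'cd' and 'x' are made up for illustration.

```r
# paste and paste0
p1 = paste(c('a', 'b', 'cd'), collapse = '|')   # "a|b|cd"
p2 = paste(c('a', 'b', 'cd'), collapse = '')    # "abcd"
p3 = paste0('a', 'b', 'cd')                     # "abcd"; paste0 uses sep = ''
p4 = paste0('x', 1:3)                           # "x1" "x2" "x3"

# A few other functions from the list above
substr('abcdef', 2, 4)             # "bcd"
strsplit('a-b-c', split = '-')     # a list holding c("a", "b", "c")
nchar('hello')                     # 5
as.numeric('3.14')                 # 3.14
strtoi('42')                       # 42L
adist('kitten', 'sitting')         # 1 x 1 matrix with edit distance 3
```

Note that strsplit returns a list even for a single input string, which is the kind of extra unwrapping step to expect.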
Beyond that, use of regular expression and functionality included in the grep family is a major way to save a lot of time during data processing. I leave that to its own section later.
Useful packages
A couple packages will probably take care of the vast majority of your standard text processing needs. Note that even if they aren’t adding anything to the functionality of the base R functions, they typically will have been optimized in some fashion, particularly with regard to speed.
- stringr/stringi: More or less the same stuff you’ll find with substr, grep etc. except easier to use and/or faster. They also add useful functionality not in base R (e.g. str_to_title). The stringr package is mostly a wrapper for the stringi functions, with some additional functions.
- tidyr: has functions such as unite, separate, replace_na that can often come in handy when working with data frames.
- glue: a newer package that can be seen as a fancier paste. Most likely it will be useful when creating functions or shiny apps in which variable text output is desired.
One issue I have with both packages and base R is that they often return a list object when they should simplify to the vector format they were initially fed. This sometimes requires an additional step or two of further processing that shouldn’t be necessary, so be prepared for it.¹
Other
In this section, I’ll add some things that come to mind that might come into play when you’re dealing with text.
Dates
Dates are not character strings. Though they may start that way, if you actually want to treat them as dates you’ll need to convert the string to the appropriate date class. The lubridate package makes dealing with dates much easier. It comes with conversion, extraction and other functionality that will be sure to save you some time.
[1] "2018-03-06"
[1] "2018-03-07"
[1] "2019-03-06"
[1] TRUE
[1] 2017-07-01 UTC--2017-07-04 UTC
[1] "259200s (~3 days)"
[1] 4320
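The code for the output above is missing. The sketch below uses a fixed date rather than the current one, and the leap-year check is only a guess at what produced the TRUE; the remaining calls are standard lubridate functions that give results of the same form.

```r
library(lubridate)

d = ymd('2018-03-06')      # parse a string into a Date object
d_next_day  = d + days(1)  # one day later
d_next_year = d + years(1) # one year later
leap_year(2016)            # assumed check; returns TRUE

# An interval between two dates, its duration, and its length in minutes
span = interval(ymd('2017-07-01'), ymd('2017-07-04'))
span
as.duration(span)
span_minutes = span / dminutes(1)
span_minutes
```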
This package makes dates so much easier that you should always use it when dealing with them.
Categorical Time
In regression modeling with few time points, one often has to decide on whether to treat the year as categorical (factor) or numeric (continuous). This greatly depends on how you want to tell your data story or other practical concerns. For example, if you have five years in your data, treating year as categorical means you are interested in accounting for unspecified things that go on in a given year. If you treat it as numeric, you are more interested in trends. Either is fine.
Web
A major resource for text is of course the web. Packages like rvest, httr, xml2, and many others specific to website APIs are available to help you here. See the R task view for web technologies as a starting point.
Encoding
Encoding can be a sizable PITA sometimes, and will often come up when dealing with webscraping and other languages. The rvest and stringr packages may be able to get you past some issues at least. See their respective functions repair_encoding and str_conv as starting points on this issue.
Summary of basic text functionality
Being familiar with commonly used string functionality in base R and packages like stringr can save a ridiculous amount of time in your data processing. The more familiar you are with them the easier time you’ll have with text.
Regular Expressions
A regular expression, regex for short, is a sequence of characters that can be used as a search pattern for a string. Common operations are to merely detect, extract, or replace the matching string. There are actually many different flavors of regex for different programming languages, most of which either follow the Perl approach or can enable it to be used. However, knowing one means you pretty much know the others with only minor modifications if any.
To be clear, not only is regex another language, it’s nigh on indecipherable. You will not learn much regex, but what you do learn will save a potentially enormous amount of time you’d otherwise spend trying to do things in a more haphazard fashion. Furthermore, practically every situation that will come up has already been asked and answered on Stack Overflow, so you’ll almost always be able to search for what you need.
Here is an example:
^r.*shiny[0-9]$
What is that you may ask? Well here is an example of strings it would and wouldn’t match.
string = c('r is the shiny', 'r is the shiny1', 'r shines brightly')
grepl(string, pattern='^r.*shiny[0-9]$')
[1] FALSE TRUE FALSE
What the regex is esoterically attempting to match is any string that starts with ‘r’ and ends with ‘shiny_’ where _ is some single digit. Specifically, it breaks down as follows:
- ^ : starts with, so ^r means starts with r
- . : any character
- * : match the preceding zero or more times
- shiny : match ‘shiny’
- [0-9] : any digit 0-9 (note that we are still talking about strings, not actual numbered values)
- $ : ends with preceding
Typical Uses
None of it makes sense, so don’t attempt to make sense of it. Just try to remember a couple of key approaches, and search the web for the rest.
Along with ^ . * [0-9] $, a couple more common ones are:
- [a-z] : letters a-z
- [A-Z] : capital letters
- + : match the preceding one or more times
- () : groupings
- | : logical or e.g. [a-z]|[0-9] (a lower-case letter or a number)
- ? : preceding item is optional, and will be matched at most once. It also appears in the syntax for ‘look ahead’ and ‘look behind’, e.g. (?=...) and (?<=...), which in base R require perl=TRUE
- \ : escape a character; for example, to search for a literal period rather than use it as a regex pattern, you’d use \., though in R you need \\, i.e. a double backslash, for the escape.
In addition, in R there are certain predefined character classes that can be called. Note that they must be used inside a bracket expression, e.g. [[:punct:]]:
- [:punct:] : punctuation
- [:blank:] : spaces and tabs
- [:alnum:] : alphanumeric characters
Those are just a few. The key functions can be found by looking at the help file for the grep function (?grep). However, the stringr package has the same functionality with perhaps slightly faster processing (though that’s due to the underlying stringi package).
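To make the operators above a little more concrete, here is a small sketch using base R only. The strings are invented for illustration.

```r
# Made-up strings for demonstration
x = c('apple pie', 'pear', 'banana split!', 'v1.2')

has_digit  = grepl('[0-9]+', x)        # one or more digits
has_either = grepl('pie|split', x)     # logical or
has_period = grepl('\\.', x)           # escaped, so a literal period
has_punct  = grepl('[[:punct:]]', x)   # predefined class inside brackets
no_blanks  = gsub('[[:blank:]]+', '_', x)  # replace runs of spaces/tabs

# ? makes the preceding item optional
optional_u = grepl('colou?r', c('color', 'colour', 'colr'))

# look behind needs perl = TRUE in base R: replace a digit preceded by 'v'
look_behind = sub('(?<=v)[0-9]', 'X', 'v1.2', perl = TRUE)
```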
See if you can guess which of the following will turn up TRUE.
grepl(c('apple', 'pear', 'banana'), pattern='a')
grepl(c('apple', 'pear', 'banana'), pattern='^a')
grepl(c('apple', 'pear', 'banana'), pattern='^a|a$')
Scraping the web, munging data, just finding things in your scripts … you can potentially use this all the time, and not only with text analysis, as we’ll now see.
dplyr helper functions
The dplyr package comes with some poorly documented² but quite useful helper functions that essentially serve as human-readable regex, which is a very good thing. These functions allow you to select variables³ based on their names. They are usually just calling base R functions in the end.
- starts_with: starts with a prefix (same as regex ‘^blah’)
- ends_with: ends with a suffix (same as regex ‘blah$’)
- contains: contains a literal string (same as regex ‘blah’)
- matches: matches a regular expression (put your regex here)
- num_range: a numerical range like x01, x02, x03. (same as regex ‘x[0-9][0-9]’)
- one_of: variables in character vector. (if you need to quote variable names, e.g. within a function)
- everything: all variables. (a good way to spend time doing something only to accomplish what you would have by doing nothing, or a way to reorder variables)
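As an illustration of the helpers listed above, here is a short sketch on a toy data frame. The data frame d and its column names are hypothetical. The any_of call is included because recent versions of dplyr generally recommend any_of/all_of over one_of.

```r
library(dplyr)

# Hypothetical toy data
d = data.frame(id = 1:3, x01 = rnorm(3), x02 = rnorm(3),
               y_pre = rnorm(3), y_post = rnorm(3))

n_starts  = names(select(d, starts_with('y')))
n_ends    = names(select(d, ends_with('post')))
n_has     = names(select(d, contains('_')))
n_regex   = names(select(d, matches('^x[0-9]+$')))
n_range   = names(select(d, num_range('x', 1:2, width = 2)))  # x01, x02
n_any     = names(select(d, any_of(c('id', 'not_a_column')))) # skips missing names
n_reorder = names(select(d, y_post, everything()))            # move y_post first
```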
For more on using stringr and regular expressions in R, you may find this cheatsheet useful.
Text Processing Examples
Example 1
Let’s say you’re dealing with some data that has been handled typically, that is to say, poorly. For example, you have a variable in your data representing whether something is from the north or south region.
It might seem okay until…
Var1 | Freq |
---|---|
South | 76 |
north | 68 |
North | 75 |
north | 70 |
North | 70 |
south | 65 |
South | 76 |
Even if you spotted the casing issue, there is still a white space problem⁴. Let’s say you want this to be capitalized ‘North’ and ‘South’. How might you do it? It’s actually quite easy with the stringr tools.
The str_trim function trims white space from either side (or both), while str_to_title converts everything to first letter capitalized.
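The cleaning code itself is not shown, so the following is a minimal sketch of how the two functions combine. The region vector is a small made-up stand-in with the same kinds of casing and white space problems, not the original data.

```r
library(stringr)

# Hypothetical stand-in for the messy variable
region = c('South', 'north ', 'North', ' north', 'North ', 'south', ' South')
table(region)   # seven apparent categories

# Trim white space first, then standardize the casing
region_clean = str_to_title(str_trim(region))
table(region_clean)   # only North and South remain
```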
Var1 | Freq |
---|---|
North | 283 |
South | 217 |
Compare that to how you would have done it before knowing how to use text processing tools. One might have spent several minutes with some find and replace approach in a spreadsheet, or maybe even several if... else statements in R until all problematic cases were taken care of. Not very efficient.
Example 2
Suppose you import a data frame, and the data was originally in wide format, where each column represented a year of data collection for the individual. Since it is bad form for data columns to have numbers for names, when you import it, the result looks like the following.
So, the problem now is to change the names to be Year_1, Year_2, etc. You might think you have to use colnames and manually create a string of names to replace the current ones.
Or perhaps you’re thinking of the paste0 function, which works fine and saves some typing.
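The two approaches just described might look like the following. The data frame df_wide is a hypothetical stand-in for the imported data, with an id column followed by X1 through X5.

```r
# Hypothetical stand-in: data.frame() names the unnamed matrix columns X1..X5
set.seed(1)
df_wide = data.frame(id = 1:6, matrix(rnorm(30), ncol = 5))
colnames(df_wide)

# Approach 1: type out every name by hand
df_manual = df_wide
colnames(df_manual) = c('id', 'Year_1', 'Year_2', 'Year_3', 'Year_4', 'Year_5')

# Approach 2: build the names with paste0, skipping the first column by position
df_paste = df_wide
colnames(df_paste)[-1] = paste0('Year_', 1:5)
```

Both rely on knowing where the columns sit, which is the weakness discussed next.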
However, data sets may be hundreds of columns, and the columns of data may have the same pattern but not be next to one another. For example, the first few dozen columns are all data that belongs to the first wave, etc. It is tedious to figure out which columns you don’t want, but even then you’re resorting to magic numbers with the above approach, and a single column change in the data will mean that the name change fails when redone.
However, the following accomplishes what we want, and is reproducible regardless of where the columns are in the data set.
df %>%
  rename_at(vars(num_range('X', 1:5)),
            str_replace, pattern='X', replacement='Year_') %>%
  head()
id Year_1 Year_2 Year_3 Year_4 Year_5
1 1 1.18 -2.04 -0.03 -0.36 0.43
2 2 0.34 -1.34 -0.30 -0.15 0.47
3 3 -0.32 -0.97 1.03 0.20 0.97
4 4 -0.57 1.36 1.29 0.00 0.32
5 5 0.64 0.73 -0.16 -1.29 -0.79
6 6 -0.59 0.16 -1.28 0.55 0.75
Let’s parse what it’s specifically doing.
- rename_at allows us to rename specific columns
- Which columns? X1 through X5. The num_range helper function creates the character strings X1, X2, X3, X4, and X5.
- Now that we have the names, we use vars to tell rename_at which ones. It would have allowed additional sets of variables as well.
- rename_at needs a function to apply to each of those column names. In this case the function is str_replace, to replace patterns of strings with some other string
- The specific arguments to str_replace (pattern to be replaced, replacement pattern) are also supplied.
So in the end we just have to use the num_range helper function within the function that tells rename_at what it should be renaming, and let str_replace do the rest.
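In dplyr 1.0.0 and later, rename_at is superseded by rename_with, which takes the renaming function first and the column selection second. A rough equivalent is sketched below, assuming a hypothetical data frame df_wide with columns id and X1 through X5, and using base R's sub in place of str_replace.

```r
library(dplyr)

# Hypothetical stand-in for the imported data: columns id, X1..X5
set.seed(1)
df_wide = data.frame(id = 1:6, matrix(rnorm(30), ncol = 5))

# Rename only the columns picked out by num_range, wherever they sit
df_renamed = df_wide %>%
  rename_with(~ sub('^X', 'Year_', .x), .cols = num_range('X', 1:5))

colnames(df_renamed)
```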
Exercises
In your own words, state the difference between a character string and a factor variable.
Consider the following character vector.
How might you paste the elements together so that there is an underscore _ between characters and no space (“A_1_Q”)?
Hint: revisit how we used the collapse argument within paste. paste(..., collapse=?)
Paste Part 2: The following application of paste produces this result.
[1] "A B" "1 2" "Q z"
Now try to produce "A - B" "1 - 2" "Q - z". To do this, note that one can paste any number of things together (i.e. more than two). So try adding ‘ - ’ to it.
- Use regex to grab the Star Wars names that have a number. Use both grep and grepl and compare the results
Now use your hacking skills to determine which one is the tallest.
- Load the dplyr package, and use its helper functions to grab all the columns in the starwars data set (comes with the package) with color in the name but without referring to them directly. The following shows a generic example. There are several ways to do this. Try two if you can.
1. I also don’t think it necessary to have separate functions for str_* functions in stringr depending on whether, e.g., I want ‘all’ matches (practically every situation) or just the first (very rarely). It could have just been an additional argument with default all=TRUE.
2. At least they’re exposed now.
3. For rows, you’ll have to use a grepl/str_detect approach. For example, filter(grepl(col1, pattern='^X')) would subset to only rows where col1 starts with X.
4. This is a very common issue among Excel users, and just one of the many reasons not to use it.