Data Structures

In order to use R to understand the world around you, you have to know the basics of how R works. Everything in R revolves around information in the form of data, so let’s start with how data exists within R.

R has several core data structures, and we’ll take a look at each.

  • Vectors
  • Factors
  • Lists
  • Matrices/arrays
  • Data frames

The more you know about R data structures, the better you’ll understand how to use them and how packages use them, the easier it will be to see why things go wrong when they do, and the further you’ll be able to go with your data.

Vectors

Vectors form the basis of R data structures. The two main types are atomic vectors and lists, but we’ll talk about lists separately.

Here is an R vector. The elements of the vector are numeric values.

x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4

All elements of an atomic vector are the same type. Example types include:

  • character
  • numeric (double)
  • integer
  • logical

In addition, there are special kinds of values like NA (‘not available’ i.e. missing), NULL, NaN (not a number), Inf (infinite) and so forth.
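
As a rough sketch of how these special values behave in practice (the object name v is only illustrative):

```r
# NA marks a missing value but still occupies a slot in the vector
v = c(1, NA, 3)
length(v)                 # 3
is.na(v)                  # FALSE TRUE FALSE
mean(v)                   # NA, because the missing value propagates
mean(v, na.rm = TRUE)     # 2, after dropping the missing value

# NULL is the absence of an object, and has length zero
length(NULL)              # 0
is.null(NULL)             # TRUE

# NaN and Inf arise from undefined or unbounded arithmetic
0 / 0                     # NaN
1 / 0                     # Inf
is.nan(0 / 0)             # TRUE
is.finite(c(1, Inf, NA))  # TRUE FALSE FALSE
```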

You can use typeof to examine an object’s type, or use an is function, e.g. is.logical, to check if an object is a specific type.
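
A minimal sketch of both approaches follows. The last lines also illustrate the same-type rule for atomic vectors: when types are mixed inside c(), R coerces every element to a common type. The object name mixed is just for illustration.

```r
# typeof reports the underlying storage type
typeof(c(1.5, 2))            # "double"
typeof(1:3)                  # "integer"
typeof(c("a", "b"))          # "character"
typeof(c(TRUE, FALSE))       # "logical"

# is.* functions return a single TRUE or FALSE
is.logical(c(TRUE, FALSE))   # TRUE
is.character(1:3)            # FALSE

# mixing types in c() coerces everything to the most flexible type
mixed = c(1, "a", TRUE)
typeof(mixed)                # "character"
mixed                        # "1" "a" "TRUE"
```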

Character strings

When dealing with text, you’ll typically be working with objects of class character.

x = c('... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy', 'Memory')
x

Not much to it, but be aware there is no real limit to what is represented as a character vector. For example, in a data frame, you could have a column where each entry is one of the works of Shakespeare.
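
To make this concrete, here is a small sketch using a few base R string functions. The names titles and long_text are illustrative, and the repeated phrase merely stands in for a long document.

```r
titles = c('Ephemeron', 'Dryswch', 'Memory')

nchar(titles)                     # characters per string: 9 7 6
toupper(titles)                   # "EPHEMERON" "DRYSWCH" "MEMORY"
paste(titles, collapse = ' / ')   # one string: "Ephemeron / Dryswch / Memory"

# a single element can hold arbitrarily long text
long_text = paste(rep('to be or not to be', 1000), collapse = ' ')
length(long_text)                 # 1, still one element
nchar(long_text)                  # 18999
```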

Factors

An important type of vector is a factor. Factors are used to represent categorical data. Although not exactly precise, one can think of factors as integers with labels. For example, the underlying representation of a variable for sex might be the integers 1:2 with labels ‘Male’ and ‘Female’. Factors are a special class with attributes, or metadata, that contain the information about the levels.

x = factor(rep(letters[1:3], each = 10))
x
 [1] a a a a a a a a a a b b b b b b b b b b c c c c c c c c c c
Levels: a b c
attributes(x)
$levels
[1] "a" "b" "c"

$class
[1] "factor"

The underlying representation is a set of integer codes, but it is important to remember that factors are categorical. Thus, they can’t be used as numbers would be, as the following demonstrates.

x_num = as.numeric(x)  # convert to a numeric object
sum(x_num)
[1] 60
sum(x)
Error in Summary.factor(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : 'sum' not meaningful for factors
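
A related pitfall, sketched here with a hypothetical factor f whose labels happen to look like numbers: converting the factor directly returns the integer codes, not the labels. Going through character first is one common way to recover the labelled values.

```r
# a hypothetical factor whose labels look like numbers
f = factor(c('10', '20', '10', '30'))

levels(f)                      # "10" "20" "30"
as.integer(f)                  # 1 2 1 3, the underlying integer codes
as.numeric(as.character(f))    # 10 20 10 30, the labels converted to numbers
```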

Strings vs. Factors

The main thing to note is that factors are generally a statistical phenomenon, and are required to do statistical things with data that would otherwise be a simple character string. If you know the relatively few levels the data can take, you’ll generally want to use factors, or at least know that statistical packages and methods may require them. In addition, factors allow you to easily overcome the silly default alphabetical ordering of category levels in some very popular visualization packages.
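
As an illustration of controlling level order, the following sketch uses a made-up size variable. Setting levels explicitly replaces the alphabetical default, and functions that respect level order, such as table, will follow it.

```r
size = c('medium', 'small', 'large', 'small')   # hypothetical categories

# default: levels are sorted alphabetically
levels(factor(size))             # "large" "medium" "small"

# explicit levels give a meaningful order
size_f = factor(size, levels = c('small', 'medium', 'large'))
levels(size_f)                   # "small" "medium" "large"
table(size_f)                    # counts reported in the chosen order
as.integer(size_f)               # 2 1 3 1
```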

For other things, such as text analysis, you’ll almost certainly want character strings instead, and in many cases they will be required. It’s also worth noting that a lot of older base R and package behavior coerces strings to factors; before R 4.0.0, for example, data.frame() and read.table() did so by default. This made a lot more sense in the early days of R, but is not really necessary these days.

Some packages to note to help you with processing strings and factors:

  • forcats
  • stringr

Logicals

Logical vectors take on one of two values, TRUE or FALSE (plus NA when a value is missing). They are especially useful for flagging whether to run certain parts of code, and for indexing certain parts of data structures (e.g. taking rows that correspond to TRUE). We’ll talk about the latter usage more later in the document.

Here is a logical vector.

my_logic = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)

Note that logicals are also treated as binary 0/1 values, so, for example, taking the mean gives the proportion of TRUE values.

!my_logic
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE
as.numeric(my_logic)
[1] 1 0 1 0 1 1
mean(my_logic)
[1] 0.6666667
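
The two uses mentioned earlier, indexing and flagging, might look like the following sketch. The scores values and the verbose flag are made up for illustration.

```r
scores = c(52, 87, 91, 45, 78)   # hypothetical values
passed = scores >= 60            # a comparison returns a logical vector

passed           # FALSE  TRUE  TRUE FALSE  TRUE
scores[passed]   # 87 91 78, only the elements where passed is TRUE
sum(passed)      # 3, the count of TRUE values
which(passed)    # 2 3 5, positions of the TRUE values

# flagging whether to run a piece of code
verbose = FALSE
if (verbose) print(scores)       # skipped because verbose is FALSE
```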

Numeric and integer

The most common data structures you’ll deal with are integer and numeric vectors.

class(1:3)
[1] "integer"
x = rnorm(5)
x
[1] -1.7796807  1.1137213  0.9723414 -0.7039966 -0.3132836
class(x)
[1] "numeric"
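
The distinction between the two is easy to miss, so here is a short sketch of how R decides which one you get. Nothing here goes beyond base R.

```r
class(1)          # "numeric", a plain number is stored as a double
class(1L)         # "integer", the L suffix requests an integer
typeof(1)         # "double"

is.numeric(1L)    # TRUE, integers also count as numeric
is.integer(1)     # FALSE

class(1:3 / 2)    # "numeric", division returns doubles
class(1:3 * 2L)   # "integer", integer arithmetic stays integer
```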

Matrices

With multiple dimensions, we are dealing with arrays. Matrices are 2-d arrays, and are extremely commonly used in scientific computing. The values making up a matrix must all be of the same type. For example, all values in a matrix might be numeric, or all character strings.
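
A brief sketch of these points, using illustrative object names: an array with three dimensions, a matrix as the two-dimensional case, and the coercion that happens when types are mixed.

```r
# a 3-d array: 2 rows, 3 columns, 2 'layers'
arr = array(1:12, dim = c(2, 3, 2))
dim(arr)            # 2 3 2
arr[1, 2, 2]        # 9, the element in row 1, column 2, layer 2

# a matrix is an array with two dimensions
m = matrix(1:6, nrow = 2)
is.array(m)         # TRUE
dim(m)              # 2 3

# mixing types coerces the whole matrix to a single type
m_mixed = cbind(1:2, c('a', 'b'))
typeof(m_mixed)     # "character"
```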

Creating a matrix

Creating a matrix can be done in a variety of ways.

# create vectors
x = 1:4
y = 5:8
z = 9:12

rbind(x, y, z)   # row bind
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    5    6    7    8
z    9   10   11   12
cbind(x, y, z)   # column bind
     x y  z
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(
  c(x, y, z),
  nrow = 3,
  ncol = 4,
  byrow = TRUE
)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
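
Once a matrix exists, you will usually want to inspect or index it. The following sketch rebuilds the same 3 x 4 matrix and shows a few common base R operations.

```r
m = matrix(1:12, nrow = 3, byrow = TRUE)

dim(m)        # 3 4
m[2, 3]       # 7, row 2, column 3
m[2, ]        # 5 6 7 8, the entire second row
m[, 4]        # 4 8 12, the entire fourth column

t(m)          # transpose: a 4 x 3 matrix
m %*% t(m)    # matrix multiplication: a 3 x 3 matrix
```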

Lists

Lists in R are highly flexible objects, and probably the most commonly used for applied data science. Unlike atomic vectors, whose elements must be of the same type, lists can contain anything as their elements, even other lists.

Here is a list. We use the list function to create it.

x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] "cat"

We often want to loop some function over a list.

for (elem in x) print(class(elem))
[1] "numeric"
[1] "character"
[1] "list"
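
Base R also provides lapply and sapply for this purpose. The sketch below shows them as an alternative to the explicit loop; which approach to use is largely a matter of preference.

```r
x = list(1, "apple", list(3, "cat"))

lapply(x, class)           # returns a list with one result per element
sapply(x, class)           # simplifies to: "numeric" "character" "list"

# an anonymous function works the same way
sapply(x, function(elem) length(elem))   # 1 1 2
```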

Lists can, and often do, have named elements.

x = list("a" = 25, "b" = -1, "c" = 0)
x[["b"]]
[1] -1
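
Named elements can be reached in several ways. As a sketch, the following contrasts double brackets and $, which extract the element itself, with single brackets, which return a smaller list.

```r
x = list("a" = 25, "b" = -1, "c" = 0)

names(x)         # "a" "b" "c"
x$b              # -1, same as x[["b"]]
x[[2]]           # -1, extraction by position

x["b"]           # single brackets return a list of length 1
class(x["b"])    # "list"
class(x[["b"]])  # "numeric"

x$d = "new"      # add an element
x$a = NULL       # remove an element
names(x)         # "b" "c" "d"
```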

Almost all standard models in base R and other packages return an object that is a list. Knowing how to work with a list will allow you to easily access the contents of the model object for further processing.
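
As one example, here is a sketch with a simple linear model fit to the mtcars data that ships with R. The model itself is chosen only for illustration.

```r
# mtcars ships with R; the model is only for illustration
fit = lm(mpg ~ wt, data = mtcars)

is.list(fit)               # TRUE
class(fit)                 # "lm"
names(fit)                 # includes "coefficients", "residuals", "fitted.values", ...

fit$coefficients           # intercept and slope, extracted like any list element
fit[["df.residual"]]       # 30, residual degrees of freedom (32 rows - 2 coefficients)
```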

Python has similar structures, lists and dictionaries, where the latter works similarly to R’s named list.

Data Frames

Data frames are a very commonly used data structure. Elements of a data frame can be different types, because the data.frame class is actually just a list whose elements, the columns, all have the same length. As such, everything about lists applies to them. But they can also be indexed by row or column, just like matrices. There are other very common object classes associated with packages that are both a data.frame and some other type of structure (e.g. tibbles in the tidyverse).
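
A small sketch of this dual nature, using a made-up data frame df. The stringsAsFactors argument is set explicitly so the result does not depend on the R version.

```r
df = data.frame(id = 1:3, label = c('a', 'b', 'c'),
                stringsAsFactors = FALSE)

is.list(df)        # TRUE
typeof(df)         # "list"
length(df)         # 2, the number of columns (list elements)
sapply(df, class)  # id: "integer", label: "character"

df[["label"]]      # list-style access to a column
df[2, ]            # matrix-style access to the second row
df[2, "label"]     # "b", row 2 of the label column
dim(df)            # 3 2
```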

Usually your data frame will come directly from import or manipulation of other R objects (e.g. matrices). However, you should know how to create one from scratch.

Creating a data frame

The following will create a data frame with two columns, a and b.

mydf = data.frame(a = c(1,5,2), 
                  b = c(3,8,1))

Much to the disdain of the tidyverse, we can add row names also.

rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1

Everything about lists applies to data.frames, so we can add, select, and remove elements of a data frame just like lists. However, we’ll visit this in more depth later, and see that we have much more flexibility with data frames than we would with lists for common data analysis and visualization.
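
For instance, the list-style operations might look like this sketch, which recreates the data frame from above. The column name total is an illustrative addition.

```r
mydf = data.frame(a = c(1, 5, 2),
                  b = c(3, 8, 1))

mydf$total = mydf$a + mydf$b    # add a column
mydf[["b"]]                     # select a column: 3 8 1
mydf$total                      # 4 13 3
mydf$a = NULL                   # remove a column
names(mydf)                     # "b" "total"
```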

Data Structure Exercises

Exercise #1

Create an object that is a matrix and/or a data.frame, and inspect its class or structure (use the class or str functions on the object you just created).

Exercise #2

Create a list of 3 elements, the first of which contains character strings, the second numbers, and the third, the data.frame or matrix you just created.

Thinking Exercises

  • How is a factor different from a character vector?

  • How is a data.frame the same as and different from a matrix?

  • How is a data.frame the same as and different from a list?