Data Structures



The goal of data science is to use data to understand the world around you. The primary tool of data science is a programming language that can convert human intention and collected evidence to actionable results. The tool we’ll demonstrate here is R.

In order to use R to understand the world around you, you have to know the basics of how R works. Everything in R revolves around information in the form of data, so let’s start with how data exists within R.

R has several core data structures, and we’ll take a look at each.

  • Vectors
  • Factors
  • Lists
  • Matrices/arrays
  • Data frames

The more you know about R data structures, the better you’ll understand how to use them, how packages use them, and why things go wrong when they do, and the further you’ll be able to go with your data. Furthermore, most of these data structures are common to many programming languages (e.g. vectors, lists, matrices), so what you learn with R will often generalize to other languages as well.

R and other programming languages are typically used via an IDE (integrated development environment), which makes programming vastly easier through syntax highlighting, code completion, and more. RStudio is the IDE of choice for R, while Python options are more varied (e.g. PyCharm for software developers, Spyder for users of Anaconda), and others like VSCode can be useful for many languages.

Vectors

Vectors form the basis of R data structures. The two main types are atomic vectors and lists, but we’ll talk about lists separately.

Here is an R vector. The elements of the vector are numeric values.

x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4

All elements of an atomic vector are the same type. Example types include:

  • character
  • numeric (double)
  • integer
  • logical

In addition, there are special kinds of values like NA (‘not available’ i.e. missing), NULL, NaN (not a number), Inf (infinite) and so forth.

You can use typeof to examine an object’s type, or use an is function, e.g. is.logical, to check if an object is a specific type.
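As a minimal sketch of the points above, the following checks the type of a few small vectors, shows some of the special values, and illustrates what happens when types are mixed. The object names (x_chr, x_dbl, and so on) are made up for illustration.

```r
# Illustrative vectors; the names are arbitrary
x_chr = c('a', 'b')
x_dbl = c(1.5, 2, NA)        # NA marks a missing value
x_int = 1:3
x_lgl = c(TRUE, FALSE)

typeof(x_chr)      # "character"
typeof(x_dbl)      # "double"
typeof(x_int)      # "integer"
is.logical(x_lgl)  # TRUE
is.logical(x_int)  # FALSE

is.na(x_dbl)       # FALSE FALSE  TRUE
0 / 0              # NaN
1 / 0              # Inf

# Because all elements must share a type, mixing types coerces them to a common one
typeof(c(1L, 2.5, TRUE))   # "double"
c(1, 'a', TRUE)            # "1" "a" "TRUE"
```

The coercion in the last two lines happens silently, which is a common source of surprises when a supposedly numeric vector turns out to be character.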

Character strings

When dealing with text, you’ll typically work with objects of the character class.

x = c('... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy', 'Memory')
class(x)
[1] "character"

Not much to it, but be aware there is no real limit to what is represented as a character vector. For example, in a data frame, a special class we’ll talk about later, you could have a column where each entry is one of the works of Shakespeare.

Factors

An important type of vector is a factor. Factors are used to represent categorical data. Although not exactly precise, one can think of factors as integers with labels. For example, the underlying representation of a variable for sex might be 1:2 with labels ‘Male’ and ‘Female’. Factors are a special class with attributes, or metadata, that contain the information about the levels.

x = factor(rep(letters[1:3], each = 10))  # repeat each letter 10 times
x
 [1] a a a a a a a a a a b b b b b b b b b b c c c c c c c c c c
Levels: a b c
attributes(x)
$levels
[1] "a" "b" "c"

$class
[1] "factor"

The underlying representation is numeric, but it is important to remember that factors are categorical. Thus, they can’t be used as numbers would be, as the following demonstrates.

x_num = as.numeric(x)  # convert to a numeric object
sum(x_num)
[1] 60
sum(x)
Error in Summary.factor(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : 'sum' not meaningful for factors

Strings vs. factors

The main thing to note is that factors are generally a statistical phenomenon, and are required to do statistical things with data that would otherwise be a simple character string. If you know the relatively few levels the data can take, you’ll generally want to use factors, or at least know that statistical packages and methods may require them. In addition, factors allow you to easily overcome the silly default alphabetical ordering of category levels in some very popular visualization packages.
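To make the point about level ordering concrete, here is a small sketch using base R only. The response values are hypothetical and serve only as an illustration.

```r
# Hypothetical survey responses; the values are illustrative only
response = c('medium', 'low', 'high', 'low', 'medium')

f_default = factor(response)                                       # levels sorted alphabetically
f_ordered = factor(response, levels = c('low', 'medium', 'high'))  # levels in a meaningful order

levels(f_default)       # "high"   "low"    "medium"
levels(f_ordered)       # "low"    "medium" "high"
as.integer(f_ordered)   # 2 1 3 1 2, the integer codes behind the labels
table(f_ordered)        # counts are reported in level order
```

Plotting and modeling functions generally follow the level order, so setting levels explicitly is usually enough to control how categories are displayed.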

For other things, such as text analysis, you’ll almost certainly want character strings instead, and in many cases they will be required. It’s also worth noting that a lot of base R and other behavior has historically coerced strings to factors; for example, before R 4.0.0, data.frame and read.csv converted strings to factors by default (stringsAsFactors = TRUE). This made a lot more sense in the early days of R, but is not really necessary these days.
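Since strings and factors are converted back and forth so often, a short sketch of the conversions may help. The vector chr below is made up for illustration, and the example shows one well-known pitfall: converting a factor of number-like labels directly to numeric returns the integer codes rather than the labels.

```r
# Illustrative data: numbers stored as strings
chr = c('10', '20', '10', '30')
f = factor(chr)

as.character(f)               # "10" "20" "10" "30"
as.numeric(f)                 # 1 2 1 3  (integer codes, not the original values)
as.numeric(as.character(f))   # 10 20 10 30

# Being explicit avoids depending on the default of whatever R version is in use
d = data.frame(id = chr, stringsAsFactors = FALSE)
class(d$id)                   # "character"
```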

Some packages to note to help you with processing strings and factors:

  • forcats
  • stringr

Logicals

Logical scalars/vectors are those whose elements take on one of two values: TRUE or FALSE. They are especially useful in flagging whether to run certain parts of code, and indexing certain parts of data structures (e.g. taking rows that correspond to TRUE). We’ll talk about the latter usage later.

Here is a logical vector.

my_logic = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)

Note that logicals are also treated as binary (0 for FALSE, 1 for TRUE), so, for example, taking the mean will provide the proportion of TRUE values.

!my_logic
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE
as.numeric(my_logic)
[1] 1 0 1 0 1 1
mean(my_logic)
[1] 0.6666667
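As a brief preview of the two uses of logicals mentioned earlier, flagging code to run and indexing, here is a minimal sketch. The values vector and the run_summary flag are hypothetical names introduced only for this example.

```r
my_logic = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
values   = c(10, 20, 30, 40, 50, 60)   # hypothetical data

sum(my_logic)         # 4, the number of TRUE values
which(my_logic)       # 1 3 5 6, the positions of the TRUE values
values[my_logic]      # 10 30 50 60, keeps elements where my_logic is TRUE
values[values > 25]   # 30 40 50 60, a comparison creates the logical vector on the fly

# A logical scalar used as a flag for whether to run a piece of code
run_summary = TRUE    # hypothetical flag
if (run_summary) {
  avg = mean(values)  # 35
}
```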

Numeric and integer

The most common types of data structure you’ll deal with are integer and numeric vectors.

ints = -3:3   # integer sequences are easily constructed with the colon operator
class(ints)
[1] "integer"
x = rnorm(5)  # 5 random values from the standard normal distribution

x
[1] -0.7613756  0.4875454 -0.2499864 -1.1288420  0.5874086
typeof(x)
[1] "double"
class(x)
[1] "numeric"
typeof(ints)
[1] "integer"
is.numeric(ints)  # also numeric!
[1] TRUE

The main difference between the two is that integers represent whole numbers only and take up less memory, but practically speaking you typically won’t need to distinguish them for most of your data science needs.
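The following sketch illustrates the distinction. The L suffix creates an integer literal, and the object sizes are compared rather than printed, since the exact byte counts can depend on the platform.

```r
typeof(5L)        # "integer", the L suffix requests an integer
typeof(5)         # "double", plain numbers are doubles by default
5L == 5           # TRUE, the values compare as equal
identical(5L, 5)  # FALSE, because the types differ
typeof(5L / 2L)   # "double", division always returns a double

# Compare memory use for vectors of the same length
x_int = integer(1000)
x_dbl = numeric(1000)
object.size(x_int) < object.size(x_dbl)   # TRUE
```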

Dates

Another common data structure you’ll deal with is a date variable. Dates typically require special treatment to work as intended, but they can be stored as character strings or factors if desired. The following shows some of the base R functionality for this.

Sys.Date()
[1] "2020-08-19"
x = c(Sys.Date(), as.Date('2020-09-01'))  # convert the string first so both elements are dates

x
[1] "2020-08-19" "2020-09-01"

In almost every case however, a package like lubridate will make processing them much easier. The following shows how to strip out certain aspects of a date using it.

library(lubridate)

month(Sys.Date())
[1] 8
day(Sys.Date())
[1] 19
wday(Sys.Date(), label = TRUE)
[1] Wed
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
quarter(Sys.Date())
[1] 3
as_date('2000-01-01') + 100
[1] "2000-04-10"

In general though, dates are treated as numeric variables with a consistent (but arbitrary) starting point; in base R, a Date is stored as the number of days since 1970-01-01. If you use these in analysis, you’ll probably want to make zero a useful value (e.g. the starting date).

as.numeric(Sys.Date())
[1] 18493
as.Date(10, origin = '2000-01-01')  # 10 days after a supplied origin
[1] "2000-01-11"
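One way to make zero a useful value is to measure time from the earliest date in the data. This is a minimal sketch using base R only; the dates are made up for illustration.

```r
# Hypothetical observation dates
dates = as.Date(c('2020-01-01', '2020-01-15', '2020-03-01'))

as.numeric(dates)                    # 18262 18276 18322, days since 1970-01-01
as.numeric(as.Date('1970-01-01'))    # 0, the base R origin

# Re-center so that zero is the first observed date
days_since_start = as.numeric(dates - min(dates))
days_since_start                     # 0 14 60
```

With this coding, a model intercept refers to the first date in the data rather than to 1970.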

For visualization purposes, you can typically treat date variables as is, as ordered factors, or use the values as labels, and get the desired result.

Matrices

With multiple dimensions, we are dealing with arrays. Matrices are two-dimensional (2-d) arrays, and are very commonly used for scientific computing. All elements of a matrix must be of the same type. For example, all values in a matrix might be numeric, or all character strings.

Creating a matrix

Creating a matrix can be done in a variety of ways.

# create vectors
x = 1:4
y = 5:8
z = 9:12

rbind(x, y, z)   # row bind
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    5    6    7    8
z    9   10   11   12
cbind(x, y, z)   # column bind
     x y  z
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(
  c(x, y, z),
  nrow = 3,
  ncol = 4,
  byrow = TRUE
)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
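Once a matrix exists, you will usually want to inspect its dimensions and pull out elements. The sketch below builds the same 3 x 4 matrix as above and shows basic indexing, along with what happens when types are mixed. The names m and m_chr are arbitrary.

```r
m = matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)

dim(m)       # 3 4, rows then columns
m[2, 3]      # 7, row 2 and column 3
m[, 1]       # 1 5 9, the entire first column
m[3, ]       # 9 10 11 12, the entire third row
dim(t(m))    # 4 3, the transpose swaps rows and columns
typeof(m)    # "integer"

# Mixing types coerces every element to a common type, as with atomic vectors
m_chr = rbind(1:2, c('a', 'b'))
typeof(m_chr)   # "character"
```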

Lists

Lists in R are highly flexible objects, and probably the most commonly used for applied data science. Unlike atomic vectors, whose elements must be of the same type, lists can contain anything as their elements, even other lists.

Here is a list. We use the list function to create it.

x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] "cat"

We often want to loop some function over a list.

for (element in x) print(class(element))
[1] "numeric"
[1] "character"
[1] "list"
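An explicit loop works, but base R also provides the apply family for this task. As a sketch, lapply returns a list with one result per element, while sapply tries to simplify the result to a vector.

```r
x = list(1, "apple", list(3, "cat"))

classes = sapply(x, class)    # simplified to a character vector
classes                       # "numeric" "character" "list"

lengths_list = lapply(x, length)   # always returns a list
unlist(lengths_list)               # 1 1 2
```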

Lists can, and often do, have named elements, which we can then extract by name.

x = list("a" = 25, "b" = -1, "c" = 0)
x[["b"]]
[1] -1

Almost all standard models in base R and other packages return an object that is a list. Knowing how to work with a list will allow you to easily access the contents of the model object for further processing.
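As an illustration, the following fits a simple linear model with lm on the mtcars data that ships with R, then treats the result as the list it is. The choice of model and variables is arbitrary.

```r
# A simple regression on a built-in data set; the formula is only an example
fit = lm(mpg ~ wt, data = mtcars)

class(fit)             # "lm"
is.list(fit)           # TRUE, the model object is a list underneath
names(fit)             # element names, e.g. "coefficients", "residuals", ...

fit$coefficients       # extract by name with $
fit[["df.residual"]]   # 30, or extract by name with [[ ]]
```

The same approach applies to many other model objects, although the element names differ from one modeling function to another.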

Python has similar structures, lists and dictionaries, where the latter works similarly to R’s named list.

Data Frames

Data frames are a very commonly used data structure, and are essentially a representation of data in a table format with rows and columns. Elements of a data frame can be different types, and this is because the data.frame class is actually just a list whose elements (the columns) all have the same length. As such, everything about lists applies to them. But they can also be indexed by row or column, just like matrices. There are other very common types of object classes associated with packages that are both a data.frame and some other type of structure (e.g. tibbles in the tidyverse).

Usually your data frame will come directly from import or manipulation of other R objects (e.g. matrices). However, you should know how to create one from scratch.

Creating a data frame

The following will create a data frame with two columns, a and b.

mydf = data.frame(
  a = c(1, 5, 2),
  b = c(3, 8, 1)
)

Much to the disdain of the tidyverse, we can add row names also.

rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1

Everything about lists applies to data.frames, so we can add, select, and remove elements of a data frame just like lists. However, we’ll visit this in more depth later, and see that we have much more flexibility with data frames than we would with lists for common data analysis and visualization.
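A minimal sketch of this list-like behavior, alongside matrix-style indexing, follows. It recreates the small data frame from above so that it can be run on its own; the new column c is made up for illustration.

```r
mydf = data.frame(
  a = c(1, 5, 2),
  b = c(3, 8, 1)
)

is.list(mydf)               # TRUE

# List-style access: select, add, and remove columns
mydf$a                      # 1 5 2
mydf[['b']]                 # 3 8 1
mydf$c = mydf$a + mydf$b    # add a column
mydf$b = NULL               # remove a column
names(mydf)                 # "a" "c"

# Matrix-style access: rows first, then columns
mydf[2, 'c']                # 13
mydf[mydf$a > 1, ]          # keeps rows 2 and 3, where a is greater than 1
```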

Data Structure Exercises

Exercise 1

Create an object that is a matrix and/or a data.frame, and inspect its class or structure (use the class or str functions on the object you just created).

Exercise 2

Create a list of 3 elements, the first of which contains character strings, the second numbers, and the third, the data.frame or matrix you just created in Exercise 1.

Thinking Exercises

  • How is a factor different from a character vector?

  • How is a data.frame the same as and different from a matrix?

  • How is a data.frame the same as and different from a list?

Python Data Structures Notebook

Available on GitHub