Until you get comfortable getting data into your chosen programming and analytical tool, you’re not going to use it as much as you could. You should at least be able to read in common data formats like comma/tab-separated, Excel, etc. Standard functions in base R for reading in tabular data include read.table, read.csv, and read.delim.
Base R also comes with the foreign package for reading in other types of files, especially those produced by other statistical packages. However, while you may still see it in use, it’s not as useful as what’s found in other packages.
Most of the read.* functions are going to have a corresponding write.* function. For example, if I read a comma-separated file as follows:
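A minimal sketch of the reading step, using a small temporary file created only for illustration. The file contents and the object name `mydata` are assumptions, not taken from the original.

```r
# Create a small comma-separated file to read (hypothetical example data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,3.5", "2,4.0", "3,2.5"), csv_path)

# read.csv returns a data frame; the first line supplies the column names
mydata <- read.csv(csv_path)
str(mydata)
```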
Then I would save an R object, e.g. a data frame, as a csv file as follows:
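A minimal sketch of the writing step. The data frame `mydata` and the temporary output path are hypothetical stand-ins for your own object and file name.

```r
# A small data frame standing in for whatever object you want to save
mydata <- data.frame(id = 1:3, score = c(3.5, 4.0, 2.5))

# Write to a temporary location; row.names = FALSE keeps R's row
# numbers from being written as an extra unnamed column
out_path <- tempfile(fileext = ".csv")
write.csv(mydata, file = out_path, row.names = FALSE)

# Inspect the raw lines that were written
readLines(out_path)
```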
Better & Faster Approaches
Now that you know how base R tools work, you can mostly forget those functions, as there are better and faster ways to read in data. The readr package has read_csv, read_delim, and others. These guess the type of each column after an initial scan of the data, then proceed accordingly. If you don’t have ‘big’ data, the speed gain won’t help much. However, this approach can also serve as a diagnostic for picking up potential data entry errors, as warnings are given when unexpected values occur.
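As a rough illustration of the diagnostic use, the sketch below reads a small made-up file in which one age was entered as text. To keep the example deterministic, the column types are declared explicitly with `col_types` rather than guessed; the file, column names, and values are all assumptions for illustration.

```r
library(readr)

# Hypothetical data with an entry error: a non-numeric age in row 2
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age", "1,34", "2,unknown", "3,29"), csv_path)

# Declare both columns as integer; the bad value becomes NA and
# readr issues a warning pointing to problems()
dat <- read_csv(
  csv_path,
  col_types = cols(id = col_integer(), age = col_integer())
)

# problems() lists each value that could not be parsed as expected
problems(dat)
```

The unparseable value is kept as a missing value rather than silently changing the whole column to character, which is what makes the warning useful as a data check.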
The data.table package provides fread, a faster version of read.table that is typically faster than the readr approaches as well.
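A minimal sketch of fread on a small tab-separated file. The file is made up for illustration, and a file this small says nothing about speed; the point is only the calling pattern.

```r
library(data.table)

# Hypothetical tab-separated file
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tscore", "1\t3.5", "2\t4.0", "3\t2.5"), tsv_path)

# fread detects the separator and column types on its own
dt <- fread(tsv_path)

# The result is a data.table, which is also a data frame
class(dt)
```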
A package for reading in foreign statistical files is haven, which has functions like read_spss and read_dta for SPSS and Stata files respectively. The package readxl is a clean way to read Excel files that doesn’t require any outside packages or languages. The package rio uses haven, readxl etc., but with just two functions for everything: import, export (also convert).
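A minimal sketch of reading a Stata file with haven. Since no Stata file is assumed to be on hand, the example first writes a hypothetical one with `write_dta` and then reads it back with `read_dta`; the data and names are made up.

```r
library(haven)

# Hypothetical data, written out as a Stata .dta file
survey <- data.frame(id = 1:3, income = c(41000, 52500, 38750))
dta_path <- tempfile(fileext = ".dta")
write_dta(survey, dta_path)

# read_dta returns a tibble
stata_data <- read_dta(dta_path)
stata_data
```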
For common tabular data types, readr and haven will likely serve most of your needs, at least to start.
Reading in data is usually a one-off event, such that you’ll never need to use the package again after the data is loaded. In that case, you might use the following approach, so that you don’t need to attach the whole package.
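A minimal sketch of that approach, assuming it refers to the `package::function` syntax. The file here is a made-up temporary one.

```r
# Hypothetical file to read
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,3.5", "2,4.0"), csv_path)

# Call read_csv directly from readr without library(readr);
# show_col_types = FALSE just silences the column type message
mydata <- readr::read_csv(csv_path, show_col_types = FALSE)
mydata
```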
You can use that for any package, which can help avoid naming conflicts by not loading a bunch of different packages. Furthermore, if you need packages that do have a naming conflict, using this approach will ensure the function from the package you want will be used.
R provides the means to read and store compressed data types. While there are a variety of ways to do so, save and save.image are probably the most common. To save one or more objects we can use save as follows:
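A minimal sketch, with hypothetical object names and a temporary file path standing in for your own:

```r
# Two objects we want to keep for a later session
scores <- c(3.5, 4.0, 2.5)
labels <- data.frame(id = 1:3, group = c("a", "b", "a"))

# Name the objects to save, then the file to save them in
rdata_path <- tempfile(fileext = ".RData")
save(scores, labels, file = rdata_path)

file.exists(rdata_path)
```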
To get those objects when we next use R, we can use the load function.
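A minimal sketch of the round trip. The object name and file path are made up, and the object is removed before loading to mimic a fresh session.

```r
# Save an object, then remove it as if starting a new session
scores <- c(3.5, 4.0, 2.5)
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)
rm(scores)

# load restores the objects under their original names and
# invisibly returns those names as a character vector
restored <- load(rdata_path)
restored
scores
```

Note that load overwrites any existing object with the same name without warning.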
The save.image function works just like save, but saves all objects currently in your working environment. You would still just use load to load the objects back into your working environment. This might seem handy at first glance, but I would suggest you be more precise with which objects you save[1].
If you just need some quick data to learn a new package or try something out, you can consider the datasets package that’s automatically loaded with R. To be honest, most of them are not very accessible conceptually, are too small to be interesting, or have other issues, but again, this doesn’t preclude them from helping you learn new things.
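For example, the iris data from the datasets package is available without loading anything. This sketch assumes a plain call to `head`, which shows the first six rows.

```r
# iris ships with the datasets package, attached by default
first_rows <- head(iris)
first_rows
```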
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
In addition, many packages come with their own data, and these are either available when you load the package, or can be loaded with the data function.
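A minimal sketch using the data function. The choice of the mcycle data from the MASS package is an assumption for illustration; MASS is one of the recommended packages normally installed with R.

```r
# Load a single data set from a package without attaching the package
data(mcycle, package = "MASS")
head(mcycle)
```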
  times accel
1   2.4   0.0
2   2.6  -1.3
3   3.2  -2.7
4   3.6   0.0
5   4.0  -2.7
6   6.2  -2.7
Other Types of Data
Be aware that R can handle practically any type of data you want to throw at it. Some examples include:
- text (e.g. a novel)
- shapefiles (e.g. for geographic data)
- Google spreadsheets
And many, many others.
On the Horizon
feather is designed to make reading/writing data frames efficient, and the really nice thing about it is that it works in both Python and R. It’s still in early stages of development on the R side though.
You may come across the situation where your data cannot be held in memory. One of the first things to be aware of for data processing is that you may not need to have the data all in memory at once. Before shifting to a hardware solution, consider if the following is possible.
- Chunking: reading and processing the data in chunks
- Line at a time: dealing with individual lines of data
- Other data formats: for example SQL databases (sqldf package, src_dbi in dplyr)
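The chunking idea can be sketched in base R with a file connection. This is only an illustration under simple assumptions: a small made-up file, a hypothetical chunk size of four rows, and a summary (the mean) that can be built up from running totals.

```r
# Hypothetical data file: 10 rows with an id and a value column
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2),
          csv_path, row.names = FALSE)

# Open a connection so each read continues where the last one stopped
con <- file(csv_path, open = "r")
header <- readLines(con, n = 1)   # keep the header line for each chunk

chunk_size <- 4                   # rows per chunk (assumed for illustration)
total <- 0
n_rows <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break   # nothing left to read

  # Parse only this chunk, reusing the header for column names
  chunk <- read.csv(text = c(header, lines))

  # Update running totals; the chunk is discarded on the next pass
  total <- total + sum(chunk$value)
  n_rows <- n_rows + nrow(chunk)
}
close(con)

# Mean of value, computed without holding all rows at once
total / n_rows
```

Only one chunk is in memory at any time, so the same pattern scales to files far larger than this one, provided the quantity of interest can be accumulated piece by piece.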
However, it may be that the end result is still too large. In that case you’ll have to consider a cluster-based or distributed data situation. Of course R will have tools for that as well.
Use readr and haven to read the following files. Use the URL just like you would any file name. The latter is a Stata file. You can use RStudio’s menu approach to import the file if you want.
If you downloaded the data for this workshop, the files can be accessed in that folder.
Why might you use read_csv from the readr package rather than read.csv in base R?
What is your definition of ‘big’ data?
Python I/O Notebook
For some reason the default for RStudio is to save and reload your entire workspace of objects, essentially an automatic save.image whenever you close your session. You should turn this option off in Tools/Global Options/General, as I’ve never seen it do much besides cause people issues: they aren’t sure what version of objects are actually loaded and so rerun their data prep anyway, and with large data it can simply be time-consuming for no benefit.