Until you get comfortable getting data into R, you’re not going to use it as much as you otherwise would. You should at least be able to read in common data formats like comma/tab-separated, Excel, etc. Standard methods of reading in tabular data include the following functions:
Base R also comes with the foreign package for reading in other types of files, especially those from other statistical packages. However, while you may still see it in use, it’s not as useful as what’s found in other packages.
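As a minimal sketch of the base R readers, the following writes a small, made-up data frame to temporary files and reads it back with read.table, read.csv, and read.delim. The object names and the example data are assumptions for illustration only.

```r
# Hypothetical example data, written to temporary files so the sketch is self-contained
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
example_df <- data.frame(id = 1:3, group = c("a", "b", "c"))
write.csv(example_df, csv_file, row.names = FALSE)
write.table(example_df, tsv_file, sep = "\t", row.names = FALSE)

# General-purpose reader: the header and delimiter are stated explicitly
from_table <- read.table(csv_file, header = TRUE, sep = ",")

# Convenience wrappers with defaults suited to each format
from_csv   <- read.csv(csv_file)    # comma-separated
from_delim <- read.delim(tsv_file)  # tab-separated

str(from_csv)
```

read.csv and read.delim are wrappers around read.table with different defaults, so any of the three can read a delimited file once the separator is set.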
Reading in data is usually a one-off event, such that you’ll never need to use the package again after the data is loaded. In that case, you might use the following approach, so that you don’t need to attach the whole package.
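A sketch of that approach, assuming the foreign package that ships with R and a made-up Stata file created on the fly: the function is called through its namespace with `::`, so library() is never needed.

```r
# Hypothetical Stata file, created here so the example is self-contained
dta_file <- tempfile(fileext = ".dta")
foreign::write.dta(data.frame(id = 1:3, score = c(2.5, 3.1, 4.8)), dta_file)

# Call the reader through its namespace; library(foreign) is never run
scores <- foreign::read.dta(dta_file)
head(scores)

# Check whether the package was attached to the search path
"package:foreign" %in% search()
```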
You can use the package::function form with any package, which helps avoid naming conflicts because you aren’t loading a bunch of different packages. Furthermore, if you need packages that do have a naming conflict, this form ensures that the function from the package you want is the one used.
There are some better and faster ways to read in data than the base R approach. A package for reading in foreign statistical files is haven, which has functions like read_spss and read_dta for SPSS and Stata files respectively. The package readxl is a clean way to read Excel files that doesn’t require any outside packages or languages. The package rio uses haven, readxl etc., but with just two functions for everything: import, export (also convert).
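The following is a rough sketch of those three packages, not a full tour. Each step is skipped if the package is not installed, the data frame is made up, and the Excel file is the example workbook bundled with readxl.

```r
example_df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))  # hypothetical data

# haven: write and read back a Stata file
if (requireNamespace("haven", quietly = TRUE)) {
  dta_file <- tempfile(fileext = ".dta")
  haven::write_dta(example_df, dta_file)
  from_haven <- haven::read_dta(dta_file)
}

# readxl: read an example workbook that ships with the package
if (requireNamespace("readxl", quietly = TRUE)) {
  xlsx_file <- readxl::readxl_example("datasets.xlsx")
  from_excel <- readxl::read_excel(xlsx_file)
}

# rio: the same two functions for every format, chosen by file extension
if (requireNamespace("rio", quietly = TRUE)) {
  csv_file <- tempfile(fileext = ".csv")
  rio::export(example_df, csv_file)
  from_rio <- rio::import(csv_file)
}
```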
For faster versions of base R functions, readr has read_csv, read_delim, and others. These make assumptions about what type each vector is after an initial scan of the data, then proceed accordingly. If you don’t have ‘big’ data, the speed gain won’t help much. However, this approach can also serve as a diagnostic for potential data entry errors, because warnings are given when unexpected observations occur.
The data.table package provides fread, a faster version of read.table that is also typically faster than the readr approaches.
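To illustrate the diagnostic use, here is a small sketch with a made-up file in which one score was entered as text. It declares the expected column types explicitly rather than relying on the initial scan, so the outcome doesn’t depend on the guessing heuristics. It is skipped if readr is not installed.

```r
if (requireNamespace("readr", quietly = TRUE)) {
  # Hypothetical file: the second score is a data entry error
  csv_file <- tempfile(fileext = ".csv")
  writeLines(c("id,score", "1,3.5", "2,three", "3,4.1"), csv_file)

  # Declare the types we expect; values that don't fit trigger a warning
  dat <- readr::read_csv(
    csv_file,
    col_types = readr::cols(
      id    = readr::col_integer(),
      score = readr::col_double()
    )
  )

  # The offending value is read as NA, and problems() reports where it occurred
  parse_issues <- readr::problems(dat)
  print(parse_issues)
}
```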
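A minimal fread sketch, assuming a made-up file of 1,000 rows (far too small to show any speed difference, but enough to show the call). It is skipped if data.table is not installed.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Hypothetical data written to a temporary file
  csv_file <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = 1:1000, value = rnorm(1000)),
            csv_file, row.names = FALSE)

  # fread detects the delimiter and column types on its own
  dt <- data.table::fread(csv_file)
  print(class(dt))

  # Ask for a plain data frame instead of a data.table
  df <- data.table::fread(csv_file, data.table = FALSE)
}
```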
Be aware that R can handle practically any type of data you want to throw at it. Some examples include:
- text (e.g. a novel)
- shapefiles (e.g. for geographic data)
- Google spreadsheets
And many, many others.
On the horizon
feather is designed to make reading/writing data frames efficient, and the really nice thing about it is that it works in both Python and R. It’s still in early stages of development on the R side though.
You may come across the situation where your data cannot be held in memory. One of the first things to be aware of for data processing is that you may not need to have all the data in memory at once. Before shifting to a hardware solution, consider whether any of the following is possible.
- Chunking: reading and processing the data in chunks
- Line at a time: dealing with individual lines of data
- Other data formats: for example SQL databases (sqldf package, src_dbi in dplyr)
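As a sketch of the chunking idea using only base R, the following computes a mean without ever holding the whole file in memory. The file, the chunk size, and the variable names are made up for illustration; a real case would use a much larger file and chunk size.

```r
# Hypothetical data: 10 rows, processed 4 rows at a time
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), csv_file, row.names = FALSE)

con <- file(csv_file, open = "r")

# Read the header line once and strip the quotes that write.csv adds
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunk_size <- 4
total  <- 0
n_rows <- 0

repeat {
  lines <- readLines(con, n = chunk_size)  # next chunk of raw lines
  if (length(lines) == 0) break            # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total  <- total + sum(chunk$value)       # keep only the running summaries
  n_rows <- n_rows + nrow(chunk)
}
close(con)

total / n_rows  # mean of value, computed chunk by chunk
```

Setting chunk_size to 1 turns the same loop into the line-at-a-time approach.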
However, it may be that the end result is still too large. In that case you’ll have to consider a cluster-based or distributed data setup. R has tools for that as well.
Use readr and haven to read the following files. Use the URL just like you would any file name. The latter is a Stata file. You can use RStudio’s menu approach to import the file if you want.
If you downloaded the data for this workshop, the files can be accessed in that folder.
Why might you use read_csv from the readr package rather than read.csv in base R?
What is your definition of ‘big’ data?