- Installing R
- Installing RStudio
- Getting to know the power of RStudio
- Installing additional functionality
R is a programming language (GNU S)
R is an environment for statistical computing and graphics
R is not a "stats package"
It is rarely useful to ask whether you can do something in R; the answer is almost always yes.
Installing R
You now have a fully functioning statistical programming environment.
However… you will likely never need to use base R directly
Installing RStudio
Cross-platform
Integrated Development Environment
Making programming easier
But lots more…
Not only can you easily publish your work to HTML, you can even code directly in HTML, CSS, JavaScript, etc.
The RStudio group and others maintain a suite of packages devoted to interactive visualizations that often require no more effort than standard plots
Projects are a way to keep your research focused
Easily pick up where you left off
Easily switch from one project to another
Version control
Any project is able to take advantage of version control
Among its many benefits, version control provides peace of mind for research reproducibility, enhances collaboration, and lets you keep on top of all the changes a typical research endeavor undergoes.
R's true power comes from the thousands of packages produced by the statistical and greater research communities
At this point there is a package for practically everything you'd want to do, especially when it comes to modeling
The best part is that adding this functionality is very easy
CRAN
Other
To install packages there are two primary approaches with RStudio
R does not load packages in your library automatically. To do so, use the library function to select the package.
Common coding practice loads a package only at the point it's needed (rather than loading all at the beginning)
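As a minimal sketch of the install-then-load workflow: the package used here is stats, chosen only because it ships with R and so the code runs offline; substitute any package you have installed. The install line is commented out and dplyr is just an example name.

```r
# One-time installation from CRAN (commented out so the sketch runs offline)
# install.packages("dplyr")

# Check whether a package is installed without attaching it
statsInstalled = requireNamespace("stats", quietly = TRUE)

# Attach a package so its functions can be called by name
library(stats)
sdAttached = sd(c(2, 4, 6))

# Or call a single function directly, without attaching the package
sdQualified = stats::sd(c(2, 4, 6))

identical(sdAttached, sdQualified)
```

The pkg::fun form is handy when you need only one function from a package, or when two attached packages export the same name.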
Creating and manipulating objects
Using functions
In R, you will constantly create objects and manipulate objects with functions.
Objects can be anything, and this will become clear as we continue through the course.
We will use many functions over the course of the day, including building our own.
The most common data structures can be seen to possess:
One type only
Multiple types
Dataframes are what you've been using if you have experience with other statistical packages or programming languages.
Typical use is that each column represents attributes, targets etc., while rows represent observations (e.g. people, time points)
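A small sketch of the "one type only" versus "multiple types" distinction. The grouping used here (vectors and matrices on one side, lists and data frames on the other) is the usual reading and is an assumption about what the original slide listed; all object names are made up for illustration.

```r
# One type only
v = c(1.5, 2, 3)                  # atomic vector (numeric)
m = matrix(1:6, nrow = 2)         # matrix: a vector with dimensions
mixed = c(1, 'a', TRUE)           # mixing types coerces everything to character

# Multiple types
l = list(name = 'a', values = v, flag = TRUE)   # list: elements can be anything
df = data.frame(id = 1:3,                       # data frame: a list of equal-length columns
                score = v,
                group = c('a', 'b', 'a'))

str(df)   # rows are observations, columns are variables of possibly different types
```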
R is a functional language.
Functions are objects that take some input and return a value
They are highly flexible, in that the nature of the input can range from nothing to a couple dozen arguments, values can be lists of modeling output, a plot, or even another function.
myfunction(x=someX, y=someY)
Use the matrix function to create a matrix, use c to concatenate values, use : to create sequences, and finally provide arguments pertaining to the number of rows and columns.
myMatrix = matrix(c(1:3, 4:6, 7:9), nrow=3, ncol=3, byrow=T)
Try creating a dataframe version of the myMatrix object, called myDF. To convert a matrix object to a dataframe, use the as.data.frame function on it.
While most functions are unique, many are generic and have methods that will allow the function to operate differently for different classes of objects. The following demonstrates this with the commonly used summary function.
methods(summary)
##  [1] summary.aov                    summary.aovlist*
##  [3] summary.aspell*                summary.check_packages_in_dir*
##  [5] summary.connection             summary.data.frame
##  [7] summary.Date                   summary.default
##  [9] summary.ecdf*                  summary.factor
## [11] summary.glm                    summary.infl*
## [13] summary.lm                     summary.loess*
## [15] summary.manova                 summary.matrix
## [17] summary.mlm*                   summary.nls*
## [19] summary.packageStatus*         summary.PDF_Dictionary*
## [21] summary.PDF_Stream*            summary.POSIXct
## [23] summary.POSIXlt                summary.ppr*
## [25] summary.prcomp*                summary.princomp*
## [27] summary.proc_time              summary.srcfile
## [29] summary.srcref                 summary.stepfun
## [31] summary.stl*                   summary.table
## [33] summary.tukeysmooth*
## see '?methods' for accessing help and source code
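To see method dispatch in action, the following sketch calls summary on three objects of different classes. The objects (num, fac, mod) are invented for illustration; the point is only that the same function name gives class-appropriate results.

```r
set.seed(123)
num = rnorm(20)                    # numeric vector
fac = factor(rep(c('a', 'b'), 10)) # factor with two levels
mod = lm(num ~ fac)                # fitted linear model

summary(num)   # dispatches to summary.default: quantiles and mean
summary(fac)   # dispatches to summary.factor: counts per level
summary(mod)   # dispatches to summary.lm: coefficient table, R-squared, etc.

class(summary(mod))
```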
Write and run the following lines, then use some of the following functions on those objects:
x = rnorm(100)
y = 2*x + rnorm(100)
z = sample(letters[1:5], 100, replace=T)
unstructured
Saving and using RData files.
read.* family of functions
myData = read.csv('my/file/location/myFile.csv')
The readr package offers similar functionality with the same naming convention, using an underscore instead of a dot (e.g. read_csv).
Some efficiencies are built in, and it makes it easier to see whether there are any issues with the import, even minor ones
In addition, the readxl package can read MS Excel files (or particular sheets) without anything additional needed.
In general, there still appears to be some misconception that entirely separate programs are needed to transfer data from one statistical environment to another.
However, this hasn't been the case for a very long time.
Base R comes with a package called foreign for reading in data files from various packages such as SPSS, Stata etc.
However, the haven package will read current Stata files and offers some capacity to write to SPSS or Stata files.
Other packages may have the ability to read and write very spec
A variety of data formats are web-oriented, such as json, xml, html tables and the like.
In addition, one can pull a lot of data directly via APIs (e.g. use Google Maps or similar to geocode your data).
There are a variety of packages to help with this. A starting point would be the Web Technologies Task View.
Furthermore, a variety of packages are available to analyze or otherwise deal with the data in a streaming fashion.
Install the dplyr (for many things as we'll see), leaflet (for vis), and ggmap (for a simple geocode function that uses Google API) packages
Create an R object that is a character string of your address (or use the following for ND: 'Notre Dame, Indiana')
place = ggmap::geocode('Notre Dame, Indiana')
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Notre+Dame,+Indiana&sensor=false
library(dplyr); library(leaflet)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
leaflet() %>% addTiles() %>% addCircles(lng=place$lon, lat=place$lat, radius=50, opacity=1, color='firebrick', fillColor = 'darkred', fillOpacity = 1)
Unstructured data doesn't really have a precise definition, and usually means anything that isn't in tabular format, with the most common example being text.
While there are a vast number of ways to attack text specifically, other data might require alternative solutions.
The goal is almost always to impose a common structure in order for the data to be analyzed.
Again, R has many packages that might be of use, but it will depend on your situation
Big data is that which can't be housed or analyzed in a normal computing environment (e.g. even desktop with lots of horsepower could not crunch it).
R is memory intensive, and even data that can be imported might not be feasible with some types of analysis.
However there are cluster computing (e.g. CRC) and distributed data solutions that make R viable in these situations as well.
AWS, Spark and other means of dealing with large data sets will eventually be easy enough for average folk to engage with such data through the web.
Writing to some data format is typically as easy as reading it in.
Just as with reading the data, what package you might use for the format in question will vary, but one or two packages will cover the majority of common data scenarios.
Writing to text files pretty much ensures all other analytical programs will be able to read it.
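A minimal round trip with base R, assuming a small made-up data frame and a temporary file so nothing is written to your project folder:

```r
scores = data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

csvFile = tempfile(fileext = '.csv')           # throwaway file location
write.csv(scores, csvFile, row.names = FALSE)  # write as plain text

scoresBack = read.csv(csvFile)                 # read it back in
all.equal(scores, scoresBack)                  # same data after the round trip
```

Setting row.names = FALSE avoids writing an extra column of row numbers, which would otherwise show up as a new variable when the file is read back.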
One unique data format to get used to working with is the RData file.
When you use the functions save (for specific objects) or save.image (for all objects, i.e. your workspace), you have the capacity to save everything you've created to a file
When you return to using R, the load function on the file will restore all the objects you've created.
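The save and load cycle can be sketched as follows. The objects a and b and the temporary file are placeholders; in practice you would save to a file in your project directory.

```r
a = 1:5
b = list(note = 'keep me')

rdataFile = tempfile(fileext = '.RData')
save(a, b, file = rdataFile)   # save specific objects
rm(a, b)                       # simulate starting a fresh session

restored = load(rdataFile)     # load returns the names of the restored objects
restored
a
```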
Install the readr package
Use the read_tsv (tab separated) function to read in the following data:
http://csr.nd.edu/assets/22641/testwebdata.txt
Save your workspace
Base R has a very flexible/powerful indexing system with which to get at your data.
Consider the following simple example, where we want values of x between 0 and 1
x = rnorm(10)
x[x>0 & x<1]
## [1] 0.2774292 0.4291247 0.5060559
A more complicated approach. Values of y in which the absolute residuals of a regression on x are greater than 1.
y[abs(resid(lm(y~x))) > 1]
This is not the way you should be doing things, but serves as an example of the possibilities
Indexing
Lists (or dataframes)
Matrices (or dataframes)
see ?Extract
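A short tour of the main indexing forms, using small made-up objects and the first rows of the built-in mtcars data (so as not to give away the exercise that follows):

```r
# Vectors: by position, by name, by exclusion, by logical condition
v = c(a = 10, b = 20, c = 30)
v[2]; v['c']; v[-1]; v[v > 15]

# Lists: single brackets return a list, double brackets or $ return the element
l = list(num = 1:3, chr = 'x')
l['num']
l[['num']]
l$chr

# Matrices: [row, column]; leave one blank to take all of it
m = matrix(1:6, nrow = 2)
m[1, ]; m[, 2]; m[2, 3]

# Data frames accept both the list and the matrix style
cars = mtcars[1:5, c('mpg', 'cyl')]
cars[cars$mpg > 21, ]
cars[['mpg']][1]
```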
For this exercise, you will work with one of the many datasets that come with a basic R installation (?datasets), but note that packages typically come with their own as well.
For the following: Examine the iris dataset using the str (structure) function. You do not have to do anything to get the iris data, it's an object in the base R environment.
Create a separate object that contains just the Species column of data
Create an object containing the 10th through 15th rows
Note the Petal Width of the 100th observation via indexing
Bonus: subset the data to only the virginica species. Hint: use ==
Piping is an alternative way to successively perform data manipulation and indexing (mainly through magrittr package).
In conjunction with dplyr, it can make data manipulation and wrangling straightforward and with very clear code.
Install the rvest package (read html functionality).
# ☻ ☻ ☻
library(dplyr); library(rvest); library(stringr)
link = "http://www.inflationdata.com/inflation/Inflation_Rate/HistoricalInflation.aspx"
read_html(link) %>%                     # read a webpage (html() in older rvest versions)
  html_table(header = TRUE) %>%         # grab all the tables
  `[[`(1) %>%                           # grab the first table
  select(Year, Ave.) %>%                # select specific columns
  rename(Ave = Ave.) %>%                # rename
  filter(Year >= 2010) %>%              # select certain rows
  mutate(AveNum = str_extract(Ave, '[0-9].[0-9]+'),
         AveNum2 = as.numeric(str_replace(Ave, ' %', '')))   # create new variables
##   Year    Ave AveNum AveNum2
## 1 2015          <NA>      NA
## 2 2014 1.62 %   1.62    1.62
## 3 2013 1.47 %   1.47    1.47
## 4 2012 2.07 %   2.07    2.07
## 5 2011 3.16 %   3.16    3.16
## 6 2010 1.64 %   1.64    1.64
Note that you can assign the results of the piped operations to create an object
myObjectSubset = mydata %>% filter(Year > 2000)
Piping is especially useful for visualization, and ggvis and newer more web-directed visualization packages will typically work with standard pipe operators.
library(ggvis)
iris %>%
  ggvis(x=~Petal.Length, y=~Petal.Width) %>%
  layer_points(fill:='#ff5500', fillOpacity:=.35, size:=25) %>%
  layer_smooths(stroke:='darkred')
Filtering (or Slicing) refers to subsetting by rows.
You have already seen how to do this with base R functionality with numbered indexing
The dplyr package has two primary functions for this
It also has other functions that can make the code clearer and easier to work with
state.x77 %>% as.data.frame %>% filter(Frost >150) %>% slice(1:3)
##   Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## 1        365   6315        1.5    69.31   11.3    66.7   152 566432
## 2       2541   4884        0.7    72.06    6.8    63.9   166 103766
## 3       1058   3694        0.7    70.39    2.7    54.7   161  30920
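For comparison, here is a sketch of the same row subset written with base R indexing only. The object names (states, frosty, firstThree) are made up; the result should match the three rows shown above, with the state names kept as row names.

```r
states = as.data.frame(state.x77)

frosty = states[states$Frost > 150, ]   # filter: rows meeting a condition
firstThree = frosty[1:3, ]              # slice: rows by position

firstThree
```

The dplyr version reads left to right as a sequence of steps, whereas the base version needs intermediate objects or nested brackets to do the same thing.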
Selecting here refers to the subset of columns, and with dplyr again you get enhanced functionality and clarity
iris %>% select(starts_with('Petal')) %>% slice(1:2)
##   Petal.Length Petal.Width
## 1          1.4         0.2
## 2          1.4         0.2
Oftentimes we engage in such operations in order to obtain summary information about groups.
iris %>% group_by(Species) %>% summarise_each('mean')
## Source: local data frame [3 x 5]
##
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
state.x77 %>% data.frame %>% mutate(region = state.region) %>% group_by(region) %>% tally
## Source: local data frame [4 x 2]
##
##          region  n
## 1     Northeast  9
## 2         South 16
## 3 North Central 12
## 4          West 13
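The same two grouped summaries can also be sketched in base R, which is useful to know when dplyr is not available. This swaps in aggregate and table for the dplyr verbs; the object names are made up.

```r
# Group means for every numeric column, by Species
groupMeans = aggregate(. ~ Species, data = iris, FUN = mean)
groupMeans

# Number of states per region
groupCounts = table(state.region)
groupCounts
```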
Piping can be fed to various types of functions to make exploratory analysis easier.
iris %>% group_by(Species) %>% summarise_each('mean') %>% select(-Species) %>% as.matrix %>% heatmap
iris %>% select(-Species) %>% cor %>% round(2)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length         1.00       -0.12         0.87        0.82
## Sepal.Width         -0.12        1.00        -0.43       -0.37
## Petal.Length         0.87       -0.43         1.00        0.96
## Petal.Width          0.82       -0.37         0.96        1.00
Examples of non-standard data:
Most of the data manipulation of such objects will be to put into a more analyzable format.
You will likely have to use package specific functions (e.g. jsonlite)
Link
Functions
Functions are extremely important in the R world.
The key thing to remember is that functions (almost always) take specific inputs and always return a value
Every R helpfile states explicitly the syntax required to use a function
The basic components
Note that not all arguments are required, and some will have default values.
It is important to know the arguments of any function you use, or you might be missing out on quite a bit, or not understand why you're getting an error.
As an example, examine the arguments for the mean function (i.e. type ?mean for the help file)
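A brief sketch of why the arguments matter, using mean on a made-up vector that contains a missing value and an outlier:

```r
vals = c(1, 2, 3, NA, 100)

mean(vals)                               # NA: missing values propagate by default
mean(vals, na.rm = TRUE)                 # drop the NA first
mean(vals, trim = 0.25, na.rm = TRUE)    # also trim 25% from each end before averaging

# The argument names and their defaults can be inspected directly
argNames = names(formals(mean.default))
argNames
```

Only x is required here; trim and na.rm have defaults, which is why a plain mean(vals) runs but may not give what you expect.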
The body of a function is where all the code that works on those arguments resides
Type median.default at your console to see the body of median, which is fairly simple as far as functions go.
median.default
## function (x, na.rm = FALSE)
## {
##     if (is.factor(x) || is.data.frame(x))
##         stop("need numeric data")
##     if (length(names(x)))
##         names(x) <- NULL
##     if (na.rm)
##         x <- x[!is.na(x)]
##     else if (any(is.na(x)))
##         return(x[FALSE][NA])
##     n <- length(x)
##     if (n == 0L)
##         return(x[FALSE][NA])
##     half <- (n + 1L)%/%2L
##     if (n%%2L == 1L)
##         sort(x, partial = half)[half]
##     else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
## }
## <bytecode: 0x000000001510c9d8>
## <environment: namespace:stats>
Note that many of R's core functions actually call other functions that are written in C for faster computation.
If they are written well, you will typically see much of the first part devoted to error checking.
A good way to start learning decent coding style is by looking at code from base R functions.
The environment is the frame or map of the location of the function's variables.
Its purpose is primarily to bind names to a set of values
What do you think the last line will produce?
b = 3
f = function(b){
  return(b)
}
f(2)
f = function(){
  return(b)
}
f()
In the first case, 2 is returned instead of 3 because the first environment searched to locate b is the one created by the function. As b is passed as an argument, the value associated with the argument is returned.
In the second case, b is not found in the current (function) environment, so the global environment is searched.
f = function(){
  print(environment())
  # return(b)
}
f(); environment()
## <environment: 0x00000000096df488>
## <environment: R_GlobalEnv>
This will be more important when writing your own functions, and more so if you create your own package.
However, it's important to at least be aware of environments to better understand how functions are working when you use them.
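One practical consequence, sketched with a made-up counter: a function can read a variable from the enclosing environment, but an ordinary assignment inside the function creates a local copy and leaves the outer one alone.

```r
counter = 0

bump = function(){
  counter = counter + 1   # reads the outer counter, then assigns to a new local one
  counter
}

bumped = bump()
bumped    # value computed inside the function
counter   # the outer object is unchanged
```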
Once you get the hang of R, you'll want to write your own functions to further make your data wrangling and exploration efficient.
Furthermore, functions aid reproducibility.
A simple rule is, if you've written the same line of code more than twice, you should probably write a function that does the operation you're attempting.
And it doesn't have to be complicated.
As noted, functions take arguments and return values.
To create a function, we use the following approach:
funcName = function(argsGoHere){
  Body
  .
  .
  .
  Value to be returned
}
The following returns 'Positive' if the sum of the input is greater than zero, 'Negative' if not.
posNeg = function(x){
  result = ifelse(sum(x) > 0, 'Positive', 'Negative')
  result
}
randomData = rnorm(10)
posNeg(x=randomData)
## [1] "Negative"
This does the job but could be made much better. Can you think of any ideas for improvement?
Suggestions:
Remember that functions are objects.
We will demonstrate some very important functionals later.
At that point we will also see some anonymous functions, in which we create a simple function on the fly.
There are also functions that return functions.
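As a small preview, here is a sketch of both ideas: a function that returns a function, and an anonymous function passed to another function. The names makePower, square and cube are made up for illustration.

```r
# A function that returns a function
makePower = function(p){
  function(x) x^p
}
square = makePower(2)
cube = makePower(3)
square(4)
cube(2)

# An anonymous function created on the fly and handed to sapply
plusOne = sapply(1:3, function(x) x + 1)
plusOne
```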
Write a function that does the following:
Take an input and return a list object containing the mean, sd, and sum.
For the purposes of this exercise, in the body of the function create separate objects for each thing to return, and combine them into a list.
Use a snippet to get started (type fun and hit tab).
RStudio's debugger also makes testing very easy.
As an example, one can use a function like debugonce to start the debugging process.
debugonce(myfunc)
myfunc(arg)
Try it with the function you just created.
Data manipulation involves a lot of repeated operations, and R shines in this area in that it makes a lot of this easier.
You'll often see iterative programming within a function, though it can be and often is used in the normal course of programming.
The typical approach seen in other languages can be used, and sometimes is still the way to go.
Calculate column means:
# million column matrix
d = matrix(rnorm(10000000), nrow=10)   # might take a second or two
dMeans = numeric(ncol(d))              # not necessary, but predefining the object does speed up explicit loops
# The following may take up to ~ 10 seconds
for (i in 1:ncol(d)){
  dMeans[i] = mean(d[,i])
}
One may loop:
A more general way to loop is with a while in place of a for.
The loop will run until some condition is met.
Any for loop can be written as a while statement.
n = 1
while(n <= ncol(d)){
  dMeans[n] = mean(d[,n])
  n = n + 1
}
Within functions you will often see something like the following:
if(conditionMet){
  doThis
} else if(otherConditionMet) {
  doThat
} else {
  doSomethingElse
}
For more on for, while, and if…else constructs see ?Control.
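A runnable instance of the if, else if, else pattern, wrapped in a made-up function (describeSign) that expects a single number:

```r
describeSign = function(x){
  if (x > 0){
    'positive'
  } else if (x < 0){
    'negative'
  } else {
    'zero'
  }
}

describeSign(3)
describeSign(-2)
describeSign(0)
```

Note that if works on a single TRUE or FALSE; for a whole vector of conditions, the vectorized ifelse function is the usual tool.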
While standard looping works, other approaches are available, and some vectorized operations can reduce code and are often faster.
Vectorizing your code means working on whole objects rather than iterating over individual elements.
Consider the following loop (type it or something similar yourself):
a = rnorm(10)
b = rnorm(10)
for(i in seq(10)){
  message(a[i] + b[i])
}
The following is easier to write, clearer to read, and faster:
a + b
There are many vectorized functions that come with base R, or are otherwise optimized such that you'll want to quickly learn them and avoid using explicit loops.
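For instance, the column means computed earlier with an explicit loop have a dedicated vectorized function, colMeans. The sketch below uses a small made-up matrix so it runs instantly, and checks that both approaches agree.

```r
set.seed(1)
small = matrix(rnorm(50), nrow = 10)   # 10 rows, 5 columns

# Explicit loop
loopMeans = numeric(ncol(small))
for (i in seq_len(ncol(small))){
  loopMeans[i] = mean(small[, i])
}

# Vectorized equivalent
vecMeans = colMeans(small)

all.equal(loopMeans, vecMeans)
```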
The vectorized approach works in a lot of other places also, and is especially useful in indexing.
myvar = 1:100
myvar[sample(myvar, 50)] = NA   # insert 50 missing values
sum(is.na(myvar))               # check how many
myvar[is.na(myvar)] = 0         # change missing to 0
any(is.na(myvar))               # any left
1:10 >= 5                       # which of the sequence 1:10 is greater or equal to 5
rowSums(d > 0)                  # count values greater than zero for each row
## [1] 50
## [1] FALSE
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [1] 500172 500014 500230 499986 500501 500293 500732 499196 499581 500193
The quicker you learn the apply family of functions, as well as related versions in the aforementioned plyr package, the quicker your data processing will be.
Apply the mean function to the columns. Similar time as a loop in this case but cleaner code.
dmeans = apply(d, 2, mean) # the 2 specifies columns
Note that we supply a function as an argument to another function.
As mentioned earlier, one will often use anonymous functions with apply.
We create a simple function on the fly, without assignment.
apply(d, 1, function(x) any(x>5))
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
apply is one of a family of functions:
They also have parallelized versions, which means that even if they weren't any faster than the standard loop, they could be sped up well beyond the base speed.
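A quick sketch of some commonly used members of the family. The selection shown here (lapply, sapply, vapply, mapply, tapply) is an assumption about which functions the original slide listed; the inputs are made up.

```r
squares    = lapply(1:3, function(x) x^2)                       # always returns a list
squaresVec = sapply(1:3, function(x) x^2)                       # simplifies to a vector when it can
lens       = vapply(list(1:3, letters), length, integer(1))     # like sapply, but checks the return type
sums       = mapply(function(a, b) a + b, 1:3, 4:6)             # iterates over several inputs in parallel
meanBySpecies = tapply(iris$Sepal.Length, iris$Species, mean)   # applies a function within groups

squaresVec
meanBySpecies
```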
plyr, dplyr, and tidyr packages have functions with similar capabilities but:
library(plyr)
aaply(d, 1, mean)                   # array to array
adply(d, 1, mean, .parallel = T)    # array to dataframe in parallel
Using either of the two matrices below (or create one yourself), perform a row or column operation with two different approaches: one using an explicit loop, the other using apply-like functionality.
Examples: whether any value in a row equals some value; which columns have any values greater than some other value.
Suggestions:
Hint: first decide on your problem, and keep it as simple as need be. Then write pseudo-code for the anonymous function part if using one. Then try to convert it to a working example.
If you want to try something that isn't clear from the above suggestions just ask.
Example:
nums = matrix(sample(1:3, 9, replace = T), 3, 3)
lets = matrix(sample(letters, 9, replace = T), 3, 3)
After all that we've gone through at this point, modeling will be the easy part for you.
The biggest hurdle in modeling is preparing the data, and the better you are at programming:
Once complete, you now have thousands of packages to choose from for modeling.
The standard format for regression modeling can serve as baseline code for modeling in other packages.
Most take a formula and data argument, e.g. a regression of y on x and z:
lm(y ~ x + z, data=myData)
Some might take matrix/vector arguments.
lm.fit(X, y) # lm.fit is the workhorse of lm
Unsupervised methods usually will need a matrix argument
princomp(X)
And then there are variations on these themes, but those constitute the vast majority of standard and complex models.
In addition, they all come with additional arguments that you'll need to familiarize yourself with before using the function
library(pscl)   # provides the bioChemists data as well as the hurdle function
glm(art ~ ., data=bioChemists, family = 'poisson')    # number of articles predicted by everything else
hurdle(art ~ ., data=bioChemists, dist = 'poisson')   # hurdle model to deal with excess zeros
library(lme4)
lmer(Reaction ~ Days + (1|Subject), data=sleepstudy)   # a mixed model with a random intercept for Subject
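Returning to the formula and matrix interfaces described earlier, the following sketch fits the same regression both ways on the built-in mtcars data and confirms the coefficients agree. The choice of variables (mpg, wt, hp) is arbitrary and only for illustration.

```r
# Formula and data interface
fit = lm(mpg ~ wt + hp, data = mtcars)

# Matrix and vector interface: the intercept column must be added by hand
X = cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
fitMatrix = lm.fit(X, mtcars$mpg)

all.equal(unname(coef(fit)), unname(fitMatrix$coefficients))

# An unsupervised method taking the data columns directly
pcs = princomp(mtcars[, c('mpg', 'wt', 'hp')], cor = TRUE)
pcs$sdev
```

The formula interface handles the intercept, factor coding and missing values for you, which is why it is the default choice for most modeling work.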
Do a search for a model that you would like to learn more about.
If it's a somewhat standard model try an R search
??`generalized linear model`
For something you're less sure about you might try several approaches:
General web search of model name with R attached (best approach)
RDocumentation.org and search 'all fields' or 'description' with the model name.
You can also ask us to help narrow your search to maybe some popular implementations.
Once you find a package that has something you want to try, take the following steps.
example('modelingFunctionName')
Otherwise you will:
Think of one or two models you're considering, or are simply interested in, and try this for yourself.
We'll come around and help you implement them if you have trouble, or answer other questions.
ggplot2
shiny, ggvis, htmlwidgets
Never forget that Base R has a lot of visualization capabilities that can produce high quality graphs
For many quick peeks, it's still often to be preferred.
x = rnorm(100)
y = 2*x^2 + rnorm(100)
hist(x)
plot(x, y)
Unfortunately base R plots can take a lot to get to that professional quality, as the defaults are generally poor (my opinion)
Many packages or other base R functions serve a particular purpose
In addition, a widely used static plotting package is ggplot2
ggplot2 is based on the 'Grammar of Graphics' and takes on a layered approach to building visualizations, using a technique similar to the piping we showed earlier.
It primarily works on dataframes.
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:ggvis':
##
##     resolution
ggplot(aes(x=Petal.Length, y=Sepal.Length), data=iris) + geom_point()
ggplot2 works on certain 'aesthetics', which are typically variables that take on different values, with 'geoms', things like points, lines, densities etc.
Let's add more.
ggplot(aes(x=Petal.Length, y=Sepal.Length), data=iris) + geom_point(color='red', size=4, alpha=.5) # not within aesthetic function
ggplot(aes(x=Petal.Length, y=Sepal.Length), data=iris) + geom_point(aes(color=Species)) + theme_minimal()
Install ggplot2 if you haven't.
Change the aesthetics in the first line to look at Widths instead of Lengths.
Delete the interior portion of geom_point (i.e. it should look like geom_point()).
Add the following lines after the geom, keep the theme at the end if you want.
geom_smooth() + facet_wrap(~Species) +
If you like, use facet_grid rather than wrap.
ggplot2 was and still is an awesome tool and still very useful for standard static plots.
However, ggvis, its successor, enables interactivity and is more web-oriented.
Furthermore, there is a growing number of packages put out by RStudio and others that embrace modern methods of visual display.
The following shows how to do one of the previous plots with ggvis.
library(ggvis)
ggvis(x=~Petal.Length, y=~Sepal.Length, data=iris) %>%
  layer_points(fill=~Species, fillOpacity:=.75)
It's already a clean looking plot, and ready to add interactive components.
As before, we'll build up from the above.
Try to add layer_smooths to the code, with an argument se=TRUE.
ggvis(x=~Petal.Length, y=~Sepal.Length, data=iris) %>% layer_points(fill=~Species, fillOpacity:=.75)
Search htmlwidgets showcase
Shiny
My own (with Josh's assistance) demo using these tools.
Cleaner is almost always better
Think multivariately (beyond bivariate)
Think interactively
Roughly 8% of men and 0.5% of women have some form of color blindness.
Color is required, not 'extra'