Outline

  • Installing R
  • Installing Rstudio
    • Getting to know the power of Rstudio
  • Installing additional functionality

What is R?

R is a programming language (a GNU implementation of the S language)

R is an environment for statistical computing and graphics

R is not a "stats package"

It is never useful to ask if you can do something in R.

  • R's only limitation is your programming ability and creativity.

Installing R

  • Pick a mirror (closer is better, but it doesn't really matter; 0-Cloud is fine)
  • Pick an OS
    • Windows: Download base
    • Mac: first .pkg file
    • Linux: follow instructions according to your flavor

You now have a fully functioning statistical programming environment.

However… you will likely never need to use base R directly

Installing Rstudio

Cross-platform

  • Go to website
  • Download
  • Install

Using Rstudio

What is Rstudio?

Integrated Development Environment

Using Rstudio

Making programming easier

  • Code completion
  • Keyboard shortcuts
  • Debugging

But lots more…

Using Rstudio

  • Web-enabled
    • Interactive graphics
    • Create webpages, presentations (like this one) and more
  • Projects
  • Built in version-control
  • Package development

Multiple File Types

  • R scripts
  • R markdown
  • Latex
  • Presentation
  • General text/code editor that recognizes several other languages

Web-enabled

Not only can you easily publish your efforts to html, you can even code directly in html, css, javascript etc.

The Rstudio group and others have a suite of packages devoted to interactive visualizations that often will require no more effort than standard plots

Projects

Projects are a way to keep your research focused

Easily pick up where you left off

Easily switch from one project to another

Version control

Any project is able to take advantage of version control

  • Git
  • SVN

Among the many benefits, version-control provides peace of mind for research reproducibility, enhances collaboration, and allows one to keep on top of all the changes a typical research endeavor undergoes.

Adding Functionality

R's true power comes from the thousands of packages produced by the statistical and greater research communities

At this point there is a package for practically everything you'd want to do, especially when it comes to modeling

The best part is that adding this functionality is very easy

Installing Packages

CRAN

  • Comprehensive R Archive Network
  • Main place for R packages

Other

  • GitHub (also R-Forge)
  • Bioconductor

To install packages there are two primary approaches with Rstudio

  • install.packages function
  • Using the packages tab

Using Packages

R does not load packages in your library automatically. To do so, use the library function to select the package.

  • library('packagename')

Common coding practice loads a package only at the point it's needed (rather than loading all at the beginning)
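
A minimal sketch of installing and then loading a package follows. The install line is commented out because it needs internet access, and 'dplyr' is only an example package name. The rest relies on the stats package, which ships with R, so it should run on a standard installation.

```r
# Install once per machine (requires internet access); 'dplyr' is just an example
# install.packages('dplyr')

# Check whether a package is available before trying to load it
hasStats = requireNamespace('stats', quietly = TRUE)

# Load the package so its functions can be called by name
library(stats)
medLoaded = median(c(1, 5, 9))

# Alternatively, call a single function without loading the whole package
medDirect = stats::median(c(1, 5, 9))
```

The pkg::fun form is handy when you need only one function from a package.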

Outline

  • Creating and manipulating objects

    • Classes of objects
  • Data Structures
  • Using functions

Creating and manipulating objects

In R, you will constantly create objects and manipulate objects with functions.

Objects can be anything, and this will become clear as we continue through the course.

We will use many functions over the course of the day, including building our own.

Data Structures

The most common data structures can be grouped by whether they hold one type of data only or multiple types.

One type only

  • Vector (may be of length 1)
    • numeric, integer, character and logical are the most commonly used
    • In addition, factors are a special class built on the integer type
  • Matrix
    • collection of vectors
  • Array
    • beyond 2 dimensions

Data Structures

Multiple types

  • Lists
    • Dataframes
      • Perhaps the most commonly used class of lists

Dataframes are what you've been using if you have experience with other statistical packages or programming languages.

Typical use is that each column represents attributes, targets etc., while rows represent observations (e.g. people, time points)
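
As a rough illustration of the structures listed above, the following creates one of each. All object names here are made up for the example.

```r
v = c(1.5, 2, 3)                     # numeric vector
f = factor(c('a', 'b', 'a'))         # factor: integer codes plus labels
m = matrix(1:6, nrow = 2)            # matrix: 2 rows, 3 columns, one type
a = array(1:24, dim = c(2, 3, 4))    # array: more than 2 dimensions

# A list can mix types and structures
l = list(num = v, chr = 'text', mat = m)

# A dataframe is a list of equal-length columns, possibly of different types
df = data.frame(id = 1:3, group = f, score = v)
str(df)
```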

Using functions

R is a functional language.

Functions are objects that take some input and return a value

They are highly flexible: the input can range from nothing to a couple dozen arguments, and the returned value can be a list of modeling output, a plot, or even another function.

Using functions

myfunction(x=someX, y=someY)

Using functions

Use the matrix function to create a matrix, use c to concatenate values, use : to create sequences, and finally provide arguments pertaining to the number of rows and columns.

myMatrix = matrix(c(1:3, 4:6, 7:9), nrow=3, ncol=3, byrow=T)

Try creating a dataframe version of the myMatrix object, called myDF. To convert a matrix object to a dataframe, use the as.data.frame function on it.

Using functions

While most functions are unique, many are generic and have methods that will allow the function to operate differently for different classes of objects. The following demonstrates this with the commonly used summary function.

methods(summary)
##  [1] summary.aov                    summary.aovlist*              
##  [3] summary.aspell*                summary.check_packages_in_dir*
##  [5] summary.connection             summary.data.frame            
##  [7] summary.Date                   summary.default               
##  [9] summary.ecdf*                  summary.factor                
## [11] summary.glm                    summary.infl*                 
## [13] summary.lm                     summary.loess*                
## [15] summary.manova                 summary.matrix                
## [17] summary.mlm*                   summary.nls*                  
## [19] summary.packageStatus*         summary.PDF_Dictionary*       
## [21] summary.PDF_Stream*            summary.POSIXct               
## [23] summary.POSIXlt                summary.ppr*                  
## [25] summary.prcomp*                summary.princomp*             
## [27] summary.proc_time              summary.srcfile               
## [29] summary.srcref                 summary.stepfun               
## [31] summary.stl*                   summary.table                 
## [33] summary.tukeysmooth*          
## see '?methods' for accessing help and source code
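
A small sketch of that dispatch in action, using made-up vectors and the built-in cars data: the same call to summary behaves differently depending on the class of its input.

```r
num = c(2, 4, 6, 8)
fac = factor(c('a', 'b', 'a', 'a'))

numSummary = summary(num)   # numeric input: quartiles and mean
facSummary = summary(fac)   # factor input: counts per level

# A fitted model gets its own method as well (summary.lm)
modSummary = summary(lm(dist ~ speed, data = cars))
class(modSummary)
```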

Getting Started with Statistical Functions

Write and run the following lines, then use some of the following functions on those objects:

  • summary
  • mean
  • sd
  • table
  • plot
  • cbind
x = rnorm(100)
y = 2*x + rnorm(100)
z = sample(letters[1:5], 100, replace=T)

Playtime

Outline:

  • Getting data in from outside sources
    • text
    • statistical packages
    • web
    • unstructured
  • Saving R objects
    • Writing to common data formats
    • Saving and using RData files

Text/flat files

read.* family of functions

  • read.table (most general), read.csv, others
myData = read.csv('my/file/location/myFile.csv')
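
Since the path above is only a placeholder, here is a self-contained sketch that writes a small csv to a temporary file and reads it back. The data and object names are invented for illustration.

```r
# Create a small dataframe and write it to a temporary csv file
tmpFile = tempfile(fileext = '.csv')
original = data.frame(id = 1:3,
                      group = c('a', 'b', 'c'),
                      score = c(2.5, 3.1, 4.8))
write.csv(original, tmpFile, row.names = FALSE)

# Read it back in and inspect the result
myData = read.csv(tmpFile)
str(myData)

unlink(tmpFile)   # clean up the temporary file
```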

readr as an alternative

The readr package has similar functionality and the same naming convention, using an underscore instead of a dot.

  • read_csv

Some efficiencies are built in, and it is easier to spot issues with the import, even minor ones.

In addition, the readxl package provides the ability to read in MS Excel files (or particular sheets) without anything additional needed.

  • Doesn't fix the fact that Excel is a terrible format to keep your data in.
  • You'd often still be better off writing to csv from Excel first

Statistical Packages

In general, there appears to still be some misconception that entirely separate programs are needed to transfer data from one statistical environment to another.

However, this hasn't been the case for a very long time.

  • Different packages have long been able to read each others' files directly
  • One can always write to a text file (e.g. csv)

Base R comes with a package called foreign for reading in data files from various packages such as SPSS, Stata etc.

However, the haven package will read current Stata files and offers some capacity to write to SPSS or Stata files.

Other packages may have the ability to read and write very spec

Web

A variety of data formats are web-oriented, such as json, xml, html tables and the like.

In addition, one can pull a lot of data directly via APIs (e.g. use google maps or similar to geocode your data).

There are a variety of packages to help with this. A starting point would be the Web Technologies Task View.

Furthermore, a variety of packages are available to analyze or otherwise deal with the data in a streaming fashion.

Web

Install the dplyr (for many things as we'll see), leaflet (for vis), and ggmap (for a simple geocode function that uses Google API) packages

Create an R object that is a character string of your address (or use the following for ND: 'Notre Dame, Indiana')

place = ggmap::geocode('Notre Dame, Indiana')
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Notre+Dame,+Indiana&sensor=false
library(dplyr); library(leaflet)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
leaflet() %>% 
  addTiles() %>% 
  addCircles(lng=place$lon, lat=place$lat, radius=50, opacity=1, color='firebrick', fillColor = 'darkred', fillOpacity = 1)

Unstructured Data

Unstructured data doesn't really have a precise definition, and usually means anything that isn't in tabular format, with the most common example being text.

  • Perhaps better would be to say it's uncommonly structured (relative to tabular data)

While there is a vast number of ways to attack text specifically, other data might require alternative solutions.

The goal is almost always to impose a common structure in order for the data to be analyzed.

Again, R has many packages that might be of use, but it will depend on your situation

Big Data

Big data is that which can't be housed or analyzed in a normal computing environment (e.g. even a desktop with lots of horsepower could not crunch it).

R is memory intensive, and even data that can be imported might not be feasible with some types of analysis.

  • Largish data that turns into four chains of thousands of simulated values for hundreds of parameters

However there are cluster computing (e.g. CRC) and distributed data solutions that make R viable in these situations as well.

AWS, Spark and other means to deal with large data sets will eventually be easy enough for average folk to engage such data through the web.

Output: writing to common formats

Writing to some data format is typically as easy as reading it in.

  • e.g. write.csv(rObject, 'fileLocation/fileName.csv')

Just as with reading the data, what package you might use for the format in question will vary, but one or two packages will cover the majority of common data scenarios.

Writing to text files pretty much ensures all other analytical programs will be able to read it.

Saving and Using .RData files

One unique data format to get used to working with is the RData file.

When you use the functions save (for specific objects) or save.image (for all objects, i.e. your workspace), you have the capacity to save everything you've created to a file

When you return to using R, the load function on the file will restore all the objects you've created.
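
A minimal sketch of the save and load cycle, using a temporary file and two throwaway objects (x1 and x2 are hypothetical names):

```r
x1 = 1:5
x2 = 'keep me'

# Save specific objects to an .RData file
rdataFile = tempfile(fileext = '.RData')
save(x1, x2, file = rdataFile)

# Remove them to mimic starting a fresh session
rm(x1, x2)

# load restores the objects under their original names
# and invisibly returns those names
restored = load(rdataFile)
restored
x2

unlink(rdataFile)   # clean up the temporary file
```

Note that load overwrites any existing objects with the same names without warning.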

Playtime

Install the readr package

Use the read_tsv (tab separated) function to read in the following data:

http://csr.nd.edu/assets/22641/testwebdata.txt

Save your workspace

  • Use save.image('fileLocation/fileName.RData')
  • Close out Rstudio
  • Open it back up
  • Use load('fileLocation/fileName.RData'), and print one of your objects to the console

Outline

  • Two approaches
    • Base R
    • Piping
  • Filtering
  • Selecting
  • Slicing
  • Using non-standard data objects

Indexing

Base R Indexing

Base R has a very flexible/powerful indexing system with which to get at your data.

Consider the following simple example, where we want values of x between 0 and 1

x = rnorm(10)
x[x>0 & x<1]
## [1] 0.2774292 0.4291247 0.5060559

A more complicated example: values of y for which the absolute residuals from a regression on x are greater than 1.

y = 2*x + rnorm(10)           # y must have the same length as the current x
y[abs(resid(lm(y~x))) > 1]

This is not the way you should be doing things, but serves as an example of the possibilities

Base R Indexing

Indexing

Lists (or dataframes)

  • ['name'] or [number] to extract list elements (including data.frames)
    • myList['myelement']
    • [[ ]] selects a single element, whereas [ ] can return multiple
  • $name also
    • myData$myvariable

Matrices (or dataframes)

  • [rows, columns] or ['rownames', 'columnnames']
    • myData[1:3, 6:10]
    • myData[,'var1']

see ?Extract
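
The following sketch runs through the indexing forms above, using a small made-up list and the built-in iris data.

```r
myList = list(nums = 1:3, chars = c('a', 'b'))

sub1 = myList['nums']     # single bracket: returns a list of length 1
sub2 = myList[['nums']]   # double bracket: returns the element itself
sub3 = myList$chars       # $ also returns the element itself

# [rows, columns] on a dataframe
firstRows = iris[1:3, c('Sepal.Length', 'Species')]

# Leaving rows blank selects all rows
petalWidth = iris[, 'Petal.Width']
```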

Base R Indexing

For this exercise, you will work with one of the many datasets that come with a basic R installation (?datasets), but note that packages typically come with their own as well.

For the following: Examine the iris dataset using the str (structure) function. You do not have to do anything to get the iris data, it's an object in the base R environment.

Create a separate object that just contains the Species column of data

Create an object containing the 10th through 15th rows

Note the Petal Width of the 100th observation via indexing

Bonus: subset the data to only the virginica species. Hint- use ==

Piping

Piping is an alternative way to successively perform data manipulation and indexing (mainly through the magrittr package).

  • The most common way to do so is with this symbol %>%
  • However, there are other ways to do so, and some with other packages

In conjunction with dplyr, it can make data manipulation and wrangling straightforward and with very clear code.
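
To show the idea without any extra packages, here is a small sketch comparing nested calls with a piped version. It uses the native |> pipe, which is an assumption on our part: it is only available in R 4.1 or later and is not the %>% operator used in the rest of these slides, though it reads the same way for simple chains.

```r
vals = c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested = round(mean(sqrt(vals)), 1)

# Piped calls are read left to right, one step at a time
piped = vals |> sqrt() |> mean() |> round(1)

identical(nested, piped)
```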

Piping

Install the rvest package (read html functionality).

# ☻ ☻ ☻
library(dplyr); library(rvest); library(stringr)
link ="http://www.inflationdata.com/inflation/Inflation_Rate/HistoricalInflation.aspx"
read_html(link) %>%                                   # read a webpage (read_html replaces the deprecated html)
  html_table(header = TRUE) %>%                       # grab all the tables
  `[[`(1) %>%                                         # grab the first table
  select(Year, Ave.) %>%                              # select specific columns
  rename(Ave = Ave.) %>%                              # rename
  filter(Year >= 2010) %>%                            # select certain rows
  mutate(AveNum = str_extract(Ave, '[0-9]\\.[0-9]+'),   
         AveNum2 = as.numeric(str_replace(Ave, ' %', '')))  # create new variables
##   Year    Ave AveNum AveNum2
## 1 2015          <NA>      NA
## 2 2014 1.62 %   1.62    1.62
## 3 2013 1.47 %   1.47    1.47
## 4 2012 2.07 %   2.07    2.07
## 5 2011 3.16 %   3.16    3.16
## 6 2010 1.64 %   1.64    1.64

Piping

Note that you can assign the results of the piped operations to create an object

myObjectSubset = mydata %>% 
  filter(Year > 2000)

Piping is especially useful for visualization, and ggvis and newer more web-directed visualization packages will typically work with standard pipe operators.

library(ggvis)
iris %>% 
  ggvis(x=~Petal.Length, y=~Petal.Width) %>% 
  layer_points(fill:='#ff5500', fillOpacity:=.35, size:=25) %>% 
  layer_smooths(stroke:='darkred')

Filtering

Filtering (or Slicing) refers to subsetting by rows.

You have already seen how to do this in base R with numbered indexing.

The dplyr package has two primary functions for this

  • filter
  • slice

It also has other functions that can make the code clearer and easier to work with

state.x77 %>% 
  as.data.frame %>% 
  filter(Frost >150) %>% 
  slice(1:3)
##   Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## 1        365   6315        1.5    69.31   11.3    66.7   152 566432
## 2       2541   4884        0.7    72.06    6.8    63.9   166 103766
## 3       1058   3694        0.7    70.39    2.7    54.7   161  30920
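
For comparison, a base R sketch of the same filter and slice is shown here (object names are our own). One difference worth noting: base indexing keeps the state names as row names, whereas the dplyr output above shows plain row numbers.

```r
states = as.data.frame(state.x77)

# Logical indexing on rows does the job of filter
coldStates = states[states$Frost > 150, ]

# head does the job of slice(1:3)
head(coldStates, 3)
```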

Selecting

Selecting here refers to subsetting by columns, and with dplyr again you get enhanced functionality and clarity

iris %>% 
  select(starts_with('Petal')) %>% 
  slice(1:2)
##   Petal.Length Petal.Width
## 1          1.4         0.2
## 2          1.4         0.2

Aggregation

Oftentimes we engage in such operations in order to obtain summary information about groups.

iris %>% 
  group_by(Species) %>% 
  summarise_each('mean')
## Source: local data frame [3 x 5]
## 
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
state.x77 %>% 
  data.frame %>% 
  mutate(region = state.region) %>% 
  group_by(region) %>% 
  tally
## Source: local data frame [4 x 2]
## 
##          region  n
## 1     Northeast  9
## 2         South 16
## 3 North Central 12
## 4          West 13
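
The same two summaries can be sketched in base R, which may be useful if your dplyr version no longer supports summarise_each (it was deprecated in later releases). The object names are our own.

```r
# Group means for every numeric column, by Species
speciesMeans = aggregate(. ~ Species, data = iris, FUN = mean)
speciesMeans

# Counts per group
regionCounts = table(state.region)
regionCounts
```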

Basic Stats

Piping can be fed to various types of functions to make exploratory analysis easier.

iris %>% 
  group_by(Species) %>% 
  summarise_each('mean') %>% 
  select(-Species) %>% 
  as.matrix %>% 
  heatmap

iris %>% 
  select(-Species) %>% 
  cor %>% 
  round(2)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length         1.00       -0.12         0.87        0.82
## Sepal.Width         -0.12        1.00        -0.43       -0.37
## Petal.Length         0.87       -0.43         1.00        0.96
## Petal.Width          0.82       -0.37         0.96        1.00

Using non-standard data

Examples of non-standard data:

  • text
  • images
  • spatial data (e.g. maps)
  • web-oriented (json, xml)

Most of the data manipulation of such objects will aim to put them into a more analyzable format.

You will likely have to use package specific functions (e.g. jsonlite)

Playtime

Link

Outline

Functions

  • Review of using
  • Writing your own

Using functions

Functions are extremely important in the R world.

  • they are themselves objects and can be used as such

The key thing to remember is that functions (almost always) take specific inputs and always return a value

Every R helpfile states explicitly the syntax required to use a function

The basic components

  • argument list
  • body
  • environment

Arguments

Note that not all arguments are required, and some will have default values.

It is important to know the arguments of any function you use, or you might be missing out on quite a bit, or not understand why you're getting an error.

As an example, examine the arguments for the mean function (i.e. type ?mean for the help file)

Body

The body of a function is where all the code that works on those arguments resides

Type median.default at your console to see the body of median, which is fairly simple as far as functions go.

median.default
## function (x, na.rm = FALSE) 
## {
##     if (is.factor(x) || is.data.frame(x)) 
##         stop("need numeric data")
##     if (length(names(x))) 
##         names(x) <- NULL
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     else if (any(is.na(x))) 
##         return(x[FALSE][NA])
##     n <- length(x)
##     if (n == 0L) 
##         return(x[FALSE][NA])
##     half <- (n + 1L)%/%2L
##     if (n%%2L == 1L) 
##         sort(x, partial = half)[half]
##     else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
## }
## <bytecode: 0x000000001510c9d8>
## <environment: namespace:stats>

Body

Note that many of R's core functions actually call other functions that are written in C for faster computation.

If they are written well, you will typically see much of the first part devoted to error checking.

A good way to start learning decent coding style is by looking at code from base R functions.

Environment

The environment is the frame or map of the location of the function's variables.

Its purpose is primarily to bind names to a set of values

What do you think the last line will produce?

b = 3

f = function(b){
  return(b)
}

f(2)

f = function(){
  return(b)
}
f()

Environment

In the first case, 2 is returned instead of 3 because the first environment searched to locate b is the one created by the function. Since b is passed as an argument, the value associated with the argument is returned.

In the second case, b is not found in the current (function) environment, so the global environment is searched.

f = function(){
  print(environment())
  # return(b)
}
f(); environment()
## <environment: 0x00000000096df488>
## <environment: R_GlobalEnv>

This will be more important when writing your own functions, and more so if you create your own package.

However, it's important to at least be aware of environments to better understand how functions are working when you use them.
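
One more sketch of the search order, using nested functions (outer and inner are hypothetical names). The inner function has no b of its own, so R looks in the environment where it was defined before reaching the global environment.

```r
b = 3

outer = function() {
  b = 10                  # local b, visible only inside outer
  inner = function() b    # inner has no b, so it looks in outer's environment
  inner()
}

result = outer()   # uses the b found in outer's environment
result
b                  # the global b is untouched
```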

Writing your own functions

Once you get the hang of R, you'll want to write your own functions to make your data wrangling and exploration more efficient.

Furthermore, functions aid reproducibility.

A simple rule is, if you've written the same line of code more than twice, you should probably write a function that does the operation you're attempting.

Writing your own functions

And it doesn't have to be complicated.

As noted, functions take arguments and return values.

To create a function, we use the following approach:

funcName = function(argsGoHere){
  Body
  .
  .
  .
  
  Value to be returned
}

Writing your own functions

The following returns 'Positive' if the sum of the input is greater than zero, 'Negative' if not.

posNeg = function(x){
  result = ifelse(sum(x) > 0, 'Positive', 'Negative')
  result
}

randomData = rnorm(10)
posNeg(x=randomData)
## [1] "Negative"

This does the job but could be made much better. Can you think of any ideas for improvement?

Writing your own functions

Suggestions:

  • error handling
  • support for zero values
  • perhaps add an argument for converting to factor or binary
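
One possible way to act on these suggestions is sketched below. The name posNeg2 and the asFactor argument are our own inventions, and other designs would be just as reasonable.

```r
posNeg2 = function(x, asFactor = FALSE) {
  # error handling: fail early with a clear message
  if (!is.numeric(x)) stop('x must be numeric')

  total = sum(x, na.rm = TRUE)

  # support for zero values
  result = if (total > 0) {
    'Positive'
  } else if (total < 0) {
    'Negative'
  } else {
    'Zero'
  }

  # optional conversion to factor
  if (asFactor) {
    result = factor(result, levels = c('Negative', 'Zero', 'Positive'))
  }
  result
}

posNeg2(c(1, 2))
posNeg2(c(-1, 1))
posNeg2(-3, asFactor = TRUE)
```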

Functionals

Remember that functions are objects.

  • As such, they can be passed as arguments to other functions, while still returning a vector of some kind

We will demonstrate some very important functionals later.

At that point we will also see some anonymous functions, in which we create a simple function on the fly.

There are also functions that return functions.
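
A brief sketch of these three ideas, with made-up names (applyTwice, makePower, square):

```r
# A function passed as an argument to another function
applyTwice = function(f, x) f(f(x))

# A function that returns a function
makePower = function(p) {
  force(p)             # evaluate p now so the returned function remembers it
  function(x) x^p
}
square = makePower(2)

square(4)
applyTwice(square, 3)

# An anonymous function, created on the fly without assignment
tens = sapply(1:3, function(i) i * 10)
tens
```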

Create a function

Write a function that does the following:

Take an input and return a list object containing the mean, sd, and sum.

For the purposes of this exercise, in the body of the function create separate objects for each thing to return, and combine them into a list.

Use a snippet to get started (type fun and hit tab).

Debugging

Rstudio's debugger also makes testing very easy.

As an example, one can use a function like debugonce to start the debugging process.

debugonce(myfunc)
myfunc(arg)

Try it with the function you just created.

Outline

  • Standard looping
  • Using functionals
  • Vectorized approaches
  • Parallelizing

Iterative Programming

Standard looping

Data manipulation involves a lot of repeated operations, and R shines in this area in that it makes a lot of this easier.

You'll often see iterative programming within a function, though it can be and often is used in the normal course of programming.

The typical approach seen in other languages can be used, and sometimes is still the way to go.

Calculate column means:

# million column matrix
d = matrix(rnorm(10000000), nrow=10)    # might take a second or two
dMeans = numeric(ncol(d))               # not necessary but does speed up explicit loops to predefine the object

# The following may take up to ~ 10 seconds
for (i in 1:ncol(d)){
  dMeans[i] = mean(d[,i])
}

Standard looping

One may loop:

  • directly over elements (less optimal)
  • over the numeric indices (as in the demo, possibly the most common in the R community)
  • over the names
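
The three styles can be sketched on a small named vector (the scores data is invented for the example):

```r
scores = c(math = 90, art = 75, music = 82)

# Directly over elements
total = 0
for (s in scores) {
  total = total + s
}

# Over the numeric indices, filling a predefined object
doubled = numeric(length(scores))
for (i in seq_along(scores)) {
  doubled[i] = scores[i] * 2
}

# Over the names
labels = character(0)
for (nm in names(scores)) {
  labels = c(labels, paste(nm, scores[[nm]]))
}
```

seq_along is used rather than 1:length(scores) because it behaves sensibly when the object is empty.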

Standard looping

A more general way to loop is with a while in place of a for.

The loop will run until some condition is met.

Any for loop can be written as a while statement.

n = 1
while(n <= ncol(d)){
  dMeans[n] = mean(d[,n])
  n = n + 1
}

If Else

Within functions you will often see something like the following:

if(conditionMet){
  doThis
} else if(otherConditionMet) {
  doThat
} else {
  doSomethingElse
}

For more on for, while, and if…else constructs see ?Control.
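
A runnable version of that skeleton, using a hypothetical describeTemp function:

```r
describeTemp = function(temp) {
  if (temp < 0) {
    'freezing'
  } else if (temp < 20) {
    'mild'
  } else {
    'warm'
  }
}

describeTemp(-5)
describeTemp(10)
describeTemp(30)

# if expects a single TRUE/FALSE; ifelse is the vectorized counterpart
flags = ifelse(c(-5, 10, 30) < 0, 'freezing', 'not freezing')
flags
```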

Looping Alternatives: Vectorized Code

While standard looping works, other approaches are available, and some vectorized operations can reduce code and are often faster.

Vectorizing your code means working on whole objects rather than iterating over individual elements.

Looping Alternatives: Vectorized Code

Consider the following loop (type it or something similar yourself):

a = rnorm(10)
b = rnorm(10)

for(i in seq(10)){
  message(a[i] + b[i])
}

The following is easier to write, clearer to read, and faster:

a + b

Looping Alternatives: Vectorized Code

There are many vectorized functions that come with base R, or are otherwise optimized such that you'll want to quickly learn them and avoid using explicit loops.

  • rowSums, colSums
  • rowMeans, colMeans
  • matrix operations (e.g. crossprod)
  • scale
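
As a quick check that these functions give the same answer as an explicit loop, here is a sketch on a small random matrix (named small to avoid clashing with d above):

```r
set.seed(1)
small = matrix(rnorm(20), nrow = 5)

# Explicit loop over columns
loopMeans = numeric(ncol(small))
for (i in seq_len(ncol(small))) {
  loopMeans[i] = mean(small[, i])
}

# Vectorized equivalent
vecMeans = colMeans(small)
all.equal(loopMeans, vecMeans)

# crossprod(small) computes t(small) %*% small
all.equal(crossprod(small), t(small) %*% small)

# scale centers (and by default standardizes) each column
centered = scale(small, scale = FALSE)
round(colMeans(centered), 10)
```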

Looping Alternatives: Vectorized Code

The vectorized approach works in a lot of other places also, and is especially useful in indexing.

myvar = 1:100
myvar[sample(myvar, 50)] = NA       # insert 50 missing values
sum(is.na(myvar))                   # check how many
myvar[is.na(myvar)] = 0             # change missing to 0
any(is.na(myvar))                   # any left

1:10 >= 5                           # which of the sequence 1:10 is greater or equal to 5

rowSums(d > 0)                      # count values greater than zero for each row
## [1] 50
## [1] FALSE
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [1] 500172 500014 500230 499986 500501 500293 500732 499196 499581 500193

Looping Alternatives: Functionals

The quicker you learn the apply family of functions, as well as related versions in packages such as plyr, the quicker your data processing will be.

Apply the mean function to the columns. Similar time as a loop in this case but cleaner code.

dmeans = apply(d, 2, mean) # the 2 specifies columns

Note that we supply a function as an argument to another function.

Looping Alternatives: Functionals

As mentioned earlier, one will often use anonymous functions with apply.

We create a simple function on the fly, without assignment.

apply(d, 1, function(x) any(x>5))
##  [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE

Looping Alternatives: Functionals

apply is one of a family of functions:

  • lapply (will return list)
  • sapply and vapply for vectors and lists (can return list or vector/matrix)
  • tapply for groupwise applications
  • mapply for multi-argument passing (for example to do something with 2 lists of input)
  • replicate to do something N times

They also have parallelized versions, which means even if they weren't any faster than the standard loop they could be greatly sped up beyond the base speed.
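
A compact sketch of each family member on toy inputs (all object names are our own):

```r
vals = list(a = 1:5, b = c(2, 4, 6))

asList = lapply(vals, mean)                            # always returns a list
asVec  = sapply(vals, mean)                            # simplifies to a named vector
asVec2 = vapply(vals, mean, FUN.VALUE = numeric(1))    # like sapply, but checks the return type

# Groupwise: mean sepal length per species
groupMeans = tapply(iris$Sepal.Length, iris$Species, mean)

# Multiple inputs walked in parallel
sums = mapply(function(x, y) x + y, 1:3, 4:6)

# Repeat an expression N times
sims = replicate(3, mean(rnorm(10)))
```

vapply is often the safer choice inside functions, since it errors if a result is not of the declared type and length.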

Looping Alternatives: Functionals

plyr, dplyr, and tidyr packages have functions with similar capabilities but:

  • are often faster
  • allow for easier management of input to output
  • often 'just work'
library(plyr)
aaply(d, 1, mean)                  # array to array
adply(d, 1, mean,  .parallel = T)  # array to dataframe in parallel (needs a registered parallel backend)

Your turn

Using either of the two matrices below (or create one yourself), perform a row or column operation with two different approaches: one using an explicit loop, the other using apply-like functionality.

Examples: whether any value in a row equals some value; which columns have values greater than some other value.

Suggestions:

  • Calculations like rowMeans, colSums
  • & | ! (and, or, and not operators, ?Logic)
  • Logical Operators == != > < <= >= (?Logic ?Comparison for more info)
  • any which all

Hint: first decide on your problem, keep as simple as need be. Then write pseudo-code for the anonymous function part if using one. Then try to convert to a working example.

If you want to try something that isn't clear from the above suggestions just ask.

Example:

nums = matrix(sample(1:3, 9, replace = T), 3, 3)
lets = matrix(sample(letters, 9, replace = T), 3, 3)

Modeling

After all that we've gone through at this point, modeling will be the easy part for you.

The biggest hurdle in modeling is preparing the data, and the better you are at programming:

  • the faster it will go
  • the more you can explore
  • the more you can spend time learning other things (like new modeling packages)

Once complete, you now have thousands of packages to choose from for modeling.

Modeling

The standard format for regression modeling can serve as baseline code for modeling in other packages.

Most take a formula and a data argument, e.g. a regression of y on x and z:

lm(y ~ x + z, data=myData)

Some might take matrix/vector arguments.

lm.fit(X, y)          # lm.fit is the workhorse of lm

Unsupervised methods usually will need a matrix argument

  • sometimes a dataframe is ok
princomp(X)

Modeling

And then there are variations on these themes, but those constitute the vast majority of standard and complex models.

In addition, they all come with additional arguments that you'll need to familiarize yourself with before using the function

library(pscl)                                        # provides the bioChemists data and the hurdle function
glm(art ~ ., data=bioChemists, family = 'poisson')   # number of articles predicted by everything else
hurdle(art ~ ., data=bioChemists, dist = 'poisson')  # hurdle model to deal with excess zeros

library(lme4)                                        # provides lmer and the sleepstudy data
lmer(Reaction ~ Days + (1|Subject), data=sleepstudy)  # a mixed model with a random intercept for Subject

Modeling

Do a search for a model that you would like to learn more about.

If it's a somewhat standard model try an R search

??`generalized linear model`

For something you're less sure about you might try several approaches:

  • General web search of model name with R attached (best approach)

  • RDocumentation.org and search 'all fields' or 'description' with the model name.

  • RSiteSearch
    • generally poor
  • You can also ask us to help narrow your search to maybe some popular implementations.

Modeling

Once you find a package that has something you want to try, take the following steps.

  • Install the package
    • Note that many packages are now on github and might require different means for installation
  • Consult the manual
    • Cannot be stressed enough
  • In the manual find one or two primary functions
  • Read the helpfile, run the example code when finished (if available; you don't have to copy and paste)
example('modelingFunctionName')

Modeling

Otherwise you will:

  • not know which functions might be most useful
  • get errors when you try to run it with your own data
  • miss out on neat features of the function
  • miss out on other related functions (e.g. visualization, diagnostics)

Think of one or two models you're thinking about or simply maybe interested in and try this for yourself.

We'll come around and help you try and implement them if you have trouble or answer other questions.

Other

Outline

  • Traditional Static Plotting
    • Base R plotting
    • ggplot2
  • Newer approaches
    • shiny, ggvis, htmlwidgets

Traditional Static Plotting

Never forget that Base R has a lot of visualization capabilities that can produce high quality graphs

For many quick peeks, it is often still the preferred option.

x = rnorm(100)
y = 2*x^2 + rnorm(100)
hist(x)


plot(x, y)

Traditional Static Plotting

Unfortunately base R plots can take a lot to get to that professional quality, as the defaults are generally poor (my opinion)

Many packages or other base R functions serve a particular purpose

  • heatmaps, dendrograms, diagnostics, correlation plots etc.

In addition, a widely used static plotting package is ggplot2

ggplot2

ggplot2 is based on the 'Grammar of Graphics' and takes on a layered approach to building visualizations, using a technique similar to the piping we showed earlier.

It primarily works on dataframes.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:ggvis':
## 
##     resolution
ggplot(aes(x=Petal.Length, y=Sepal.Length), data=iris) +
  geom_point()

ggplot2

ggplot2 works on certain 'aesthetics', which are typically variables that take on different values, with 'geoms', things like points, lines, densities etc.

Let's add more.

ggplot(aes(x=Petal.Length, y=Sepal.Length), data=iris) +
  geom_point(color='red', size=4, alpha=.5)  # not within aesthetic function

ggplot2

ggplot(aes(x=Petal.Length, y=Sepal.Length), data=iris) +
  geom_point(aes(color=Species)) +
  theme_minimal()

You try

Install ggplot2 if you haven't.

Change the aesthetics in the first line to look at Widths instead of Lengths.

Delete the interior portion of geom_point (i.e. it should look like geom_point()).

Add the following lines after the geom, keep the theme at the end if you want.

  geom_smooth() +
  facet_wrap(~Species) +

If you like, use facet_grid rather than wrap.

Newer approaches

ggplot2 was and still is an awesome tool and still very useful for standard static plots.

  • And has continued to develop even as the creator has moved on

However, ggvis, its successor, enables interactivity and is more web-oriented.

Furthermore there is a growing number of packages put out by Rstudio and others that embrace modern methods of visual display.

ggvis

The following shows how to do one of the previous plots with ggvis.

library(ggvis)
ggvis(x=~Petal.Length, y=~Sepal.Length, data=iris) %>% 
  layer_points(fill=~Species, fillOpacity:=.75) 

It's already a clean looking plot, and ready to add interactive components.

ggvis

As before, we'll build up from the above.

Try to add layer_smooths to the code, with an argument se=TRUE.

ggvis(x=~Petal.Length, y=~Sepal.Length, data=iris) %>% 
  layer_points(fill=~Species, fillOpacity:=.75) 

Going further

Things to think about with visualization

Cleaner is almost always better

  • avoid things like unnecessary gridlines, backgrounds etc.

Think multivariately (beyond bivariate)

  • e.g. there is never a good reason for bar plot or similarly simple visuals that can be explained in a single sentence

Think interactively

~10% of the population has some form of color blindness.

Color is required, not 'extra'

Exercise