Writing Functions

You can’t do anything in R without using functions, but have you ever written your own? Why would you?

  • Efficiency
  • Customized functionality
  • Reproducibility
  • Extend the work that’s already been done

There are many benefits to writing your own functions, and it’s actually easy to do. Once you get the basic concept down, you’ll likely find yourself using your own functions more and more.

A Starting Point

Let’s assume you want to calculate the mean, standard deviation, and number of missing values for a variable called myvar. You could do something like the following:

mean(myvar)
sd(myvar)
sum(is.na(myvar))

Now let’s say you need to do it for several variables. Here’s what your custom function could look like. It takes a single input, the variable you want information about, and returns a data frame with that info.

my_summary <- function(x) {
  data.frame(
    mean = mean(x),
    sd = sd(x),
    N_missing = sum(is.na(x))
  )
}

In the above, x is an arbitrary name for an input. You can name it whatever you want, but the more meaningful the better. In R (and other languages) these inputs are called arguments, and they determine in part what the function eventually produces as output.

my_summary(mtcars$mpg)
      mean       sd N_missing
1 20.09062 6.026948         0
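Because x is just the name of the argument, the input can be supplied either by position or by name. A small sketch using the same function and the built-in mtcars data (the object names by_position and by_name are only for illustration):

```r
my_summary <- function(x) {
  data.frame(
    mean = mean(x),
    sd = sd(x),
    N_missing = sum(is.na(x))
  )
}

by_position <- my_summary(mtcars$mpg)       # input matched by position
by_name     <- my_summary(x = mtcars$mpg)   # input matched by argument name

# both calls give the same one-row data frame
identical(by_position, by_name)
```

Naming arguments explicitly matters more once a function has several of them.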

Works fine. However, data typically isn’t that pretty. It often has missing values.

load('data/gapminder.RData')
my_summary(gapminder_2019$lifeExp)
  mean sd N_missing
1   NA NA       516

If there are actually missing values, we need to set na.rm = TRUE or the mean and sd will return NA. Let’s try it. We can either hard-code it, as in the first function below, or add an argument that lets us control how our custom function handles NAs.

my_summary <- function(x) {
  data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_missing = sum(is.na(x))
  )
}


my_summary_na <- function(x, remove_na = TRUE) {
  data.frame(
    mean = mean(x, na.rm = remove_na),
    sd = sd(x, na.rm = remove_na),
    N_missing = sum(is.na(x))
  )
}


my_summary(gapminder_2019$lifeExp)
      mean       sd N_missing
1 43.13218 16.31355       516
my_summary_na(gapminder_2019$lifeExp, remove_na = FALSE)
  mean sd N_missing
1   NA NA       516

Seems to work fine. Let’s add how many total observations there are.

my_summary <- function(x) {
  # create an arbitrarily named object with the summary information
  summary_data = data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_total = length(x),
    N_missing = sum(is.na(x))
  )
  
  # return the result!
  summary_data       
}

That was easy! Let’s try it.

my_summary(gapminder_2019$lifeExp)
      mean       sd N_total N_missing
1 43.13218 16.31355   40953       516

Now let’s do it for every column! We’ve used the map function before; now let’s use a variant that will return a data frame.

gapminder_2019 %>% 
  select_if(is.numeric) %>% 
  map_dfr(my_summary, .id = 'variable')
    variable         mean           sd N_total N_missing
1       year 1.909000e+03 6.321997e+01   40953         0
2    lifeExp 4.313218e+01 1.631355e+01   40953       516
3        pop 1.353928e+07 6.565653e+07   40953         0
4  gdpPercap 4.591026e+03 1.016210e+04   40953         0
5 giniPercap 4.005331e+01 9.102757e+00   40953         0

The map_dfr function works just like map did in the iterative programming section, except that it creates mini data frames and then row-binds them together.
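To make the 'create, then row-bind' idea concrete, here is a rough base R equivalent of those two steps. It uses the built-in mtcars data instead of gapminder_2019, and the object names pieces and result are arbitrary. It is a sketch of the idea, not how map_dfr is implemented.

```r
my_summary <- function(x) {
  data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_total = length(x),
    N_missing = sum(is.na(x))
  )
}

# step 1: one single-row data frame per column
pieces <- lapply(mtcars[c('mpg', 'hp', 'wt')], my_summary)

# step 2: stack the pieces and keep the column names as an id variable
result <- do.call(rbind, pieces)
result <- cbind(variable = names(pieces), result)
rownames(result) <- NULL

result
```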

This shows that writing the first part of any function can be straightforward. Then, once in place, you can usually add functionality without too much trouble. Eventually you could have something very complicated, but which will make sense to you because you built it from the ground up.

Keep in mind as you start out that your initial decisions to make are:

  • What are the inputs (arguments) to the function?
  • What is the value to be returned?

When you think about writing a function, just write the code that can do it first. The goal is then to generalize beyond that single use case. RStudio even has a shortcut to help you get started. Consider our starting point. Highlight the code, hit Ctrl + Alt + X (Cmd + Option + X on a Mac), then give it a name.

mean(myvar)
sd(myvar)
sum(is.na(myvar))

It should look something like this.

test_fun <- function(myvar) {
  mean(myvar)
  sd(myvar)
  sum(is.na(myvar))
}

RStudio could tell that you would need at least one input myvar, but beyond that, you’re now on your way to tweaking the function as you see fit.
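One thing to watch for with the extracted code: a function returns only the value of its last expression, so test_fun as generated would report just the count of missing values. A minimal sketch of one way to keep all three results (test_fun_all is a hypothetical name):

```r
test_fun <- function(myvar) {
  mean(myvar)          # computed, then discarded
  sd(myvar)            # computed, then discarded
  sum(is.na(myvar))    # last expression, so this is what gets returned
}

test_fun(mtcars$mpg)

# combine the pieces into a single object so nothing is lost
test_fun_all <- function(myvar) {
  c(
    mean = mean(myvar),
    sd = sd(myvar),
    N_missing = sum(is.na(myvar))
  )
}

test_fun_all(mtcars$mpg)
```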

Note that what goes in and what comes out could be anything, even nothing!

two <- function() {
  2
}

two()
[1] 2

Or even another function!

center <- function(type) {
  if (type == 'mean') {
    mean
  } else {
    median
  }
}

center(type = 'mean')
function (x, ...) 
UseMethod("mean")
<bytecode: 0x7fe3efc05860>
<environment: namespace:base>
myfun = center(type = 'mean')

myfun(1:5)
[1] 3
myfun = center(type = 'median')

myfun(1:4)
[1] 2.5

We can also set default values for the inputs.

hi <- function(name = 'Beyoncé') {
  paste0('Hi ', name, '!')
}

hi()
[1] "Hi Beyoncé!"
hi(name = 'Jay-Z')
[1] "Hi Jay-Z!"

If you are working within an RStudio project, it would be a good idea to create a folder for your functions and save each as its own script. When you need the function, just use the following:

source('my_functions/awesome_func.R')

This would make it easy to even create your own personal package with the functions you create.
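If the folder grows, you may want to load everything in it at once. The following is a sketch under a few assumptions: a temporary directory stands in for the my_functions folder so the example runs anywhere, and the say_hi function and its file are made up for illustration.

```r
# a temporary directory stands in for a project's 'my_functions' folder
fun_dir <- file.path(tempdir(), 'my_functions')
dir.create(fun_dir, showWarnings = FALSE)

# hypothetical script containing a single function
writeLines(
  "say_hi <- function(name) paste0('Hi ', name, '!')",
  file.path(fun_dir, 'say_hi.R')
)

# find every .R script in the folder and source each one
function_files <- list.files(fun_dir, pattern = '\\.R$', full.names = TRUE)

for (f in function_files) source(f)

say_hi('R')
```

In a real project you would replace fun_dir with the path to your own folder and skip the writeLines step.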

However you go about creating a function and for whatever purpose, try to answer two questions clearly at the beginning:

  • What is the (specific) goal of your function?
  • What is the minimum needed to obtain that goal?

RStudio even has a keyboard shortcut that inserts a roxygen-style documentation skeleton for a function automatically!

Cmd/Ctrl + Option/Alt + Shift + R
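As a rough illustration, a filled-in skeleton for the my_summary function from earlier might look like the following. The tags (@param, @return, @examples) are standard roxygen2 tags; the wording of the descriptions is only a suggestion.

```r
#' Summarize a numeric vector
#'
#' @param x A numeric vector.
#'
#' @return A one-row data frame with the mean, standard deviation,
#'   total number of observations, and number of missing values.
#'
#' @examples
#' my_summary(mtcars$mpg)
my_summary <- function(x) {
  data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_total = length(x),
    N_missing = sum(is.na(x))
  )
}

my_summary(c(1, 2, NA))
```

The comments have no effect on how the function runs, but they become the help file if you later turn your functions into a package.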


DRY

An oft-quoted mantra in programming is Don’t Repeat Yourself. One context is iterative programming, where we would rather write one line of code than one hundred. More generally though, we would like to gain efficiency where possible. A good rule of thumb is that if you are writing the same set of code more than twice, you should write a function to do it instead.

Consider the following example, where we want to flag the observations that meet a set of conditions. Given the cylinder, engine displacement, and mileage, we’ll pick out different parts of the data.

# assumes the tidyverse is loaded (dplyr for mutate/if_else; mpg ships with ggplot2)
mpg %>%
  mutate(
    good_mileage_displ_low_cyl_4  = if_else(cyl == 4 & displ < mean(displ) & hwy > 30, 'yes', 'no'),
    good_mileage_displ_low_cyl_6  = if_else(cyl == 6 & displ < mean(displ) & hwy > 30, 'yes', 'no'),
    good_mileage_displ_low_cyl_8  = if_else(cyl == 8 & displ < mean(displ) & hwy > 30, 'yes', 'no'),
    good_mileage_displ_high_cyl_4 = if_else(cyl == 4 & displ > mean(displ) & hwy > 30, 'yes', 'no'),
    good_mileage_displ_high_cyl_6 = if_else(cyl == 6 & displ > mean(displ) & hwy > 30, 'yes', 'no'),
    good_mileage_displ_high_cyl_8 = if_else(cyl == 8 & displ > mean(displ) & hwy > 30, 'yes', 'no')
  )

It was tedious, but that’s not much code. But now consider: what if you want to change the mpg cutoff? Switch from the mean to the median? Something else? You have to change all of it. Screw that, let’s write a function instead! What kinds of inputs will we need?

  • cylinder: Which cylinder type we want
  • mpg_cutoff: The cutoff for ‘good’ mileage
  • displ_fun: Whether the displacement cutoff is based on the mean or something else
  • displ_low: Whether we are interested in low or high displacement vehicles
  • cls: The class of the vehicle (e.g. compact or suv)

good_mileage <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact'
) {
  
  if (displ_low == TRUE) {              # condition to check, if it holds,
    result <- mpg %>%                   # filter data given the arguments
      filter(
        cyl == cylinder,
        displ <= displ_fun(displ),
        hwy >= mpg_cutoff,
        class == cls
      )
  } else {                              # if the condition doesn't hold, filter
    result <- mpg %>%                   # the data this way instead
      filter(
        cyl == cylinder,
        displ >= displ_fun(displ),      # the only change is here
        hwy >= mpg_cutoff,
        class == cls
      )
  }
  
  result                                # return the object
}

So what’s going on here? Not a whole lot really. The function just filters the data to observations that match the input criteria, and returns that result at the end. We also gave the arguments default values, which you can do at your discretion.

Conditionals

The core of the above function uses a conditional statement with the standard if...else structure. The if part determines whether some condition holds. If it does, the code in the following braces is run. If not, execution skips to the else part. You may have used the ifelse function in base R, or dplyr’s if_else as above, which are vectorized shortcuts for this approach. We can also add further conditions (else if), drop the else part entirely, nest conditionals within other conditionals, etc. Like loops, conditional statements look very similar across programming languages.

JavaScript:

if (Math.random() < 0.5) {
  console.log("You got Heads!")
} else {
  console.log("You got Tails!")
}

Python:

if x == 2:
  print(x)
else:
  print(x*x)
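Back in R, the else if variation mentioned above can be sketched briefly. The function describe_mileage and its cutoffs are made up for illustration; the point is only the chain of conditions and the contrast with the vectorized ifelse.

```r
# hypothetical helper: label a single highway mileage value
describe_mileage <- function(hwy) {
  if (hwy >= 30) {
    'good'
  } else if (hwy >= 20) {     # checked only if the first condition fails
    'okay'
  } else {
    'poor'
  }
}

describe_mileage(44)

# if/else tests one value at a time; ifelse works on a whole vector
labels <- ifelse(c(44, 25, 15) >= 30, 'good', 'not good')
labels
```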

In any case, with our function at the ready, we can now do the things we want to as needed:

good_mileage(mpg_cutoff = 40)
# A tibble: 1 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class  
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
1 volkswagen   jetta   1.9  1999     4 manual(m5) f        33    44 d     compact
good_mileage(
  cylinder = 8,
  mpg_cutoff = 15,
  displ_low = FALSE,
  cls = 'suv'
)
# A tibble: 34 x 11
   manufacturer model              displ  year   cyl trans    drv     cty   hwy fl    class
   <chr>        <chr>              <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>
 1 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4) r        14    20 r     suv  
 2 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4) r        11    15 e     suv  
 3 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4) r        14    20 r     suv  
 4 chevrolet    c1500 suburban 2wd   5.7  1999     8 auto(l4) r        13    17 r     suv  
 5 chevrolet    c1500 suburban 2wd   6    2008     8 auto(l4) r        12    17 r     suv  
 6 chevrolet    k1500 tahoe 4wd      5.3  2008     8 auto(l4) 4        14    19 r     suv  
 7 chevrolet    k1500 tahoe 4wd      5.7  1999     8 auto(l4) 4        11    15 r     suv  
 8 chevrolet    k1500 tahoe 4wd      6.5  1999     8 auto(l4) 4        14    17 d     suv  
 9 dodge        durango 4wd          4.7  2008     8 auto(l5) 4        13    17 r     suv  
10 dodge        durango 4wd          4.7  2008     8 auto(l5) 4        13    17 r     suv  
# … with 24 more rows

Let’s extend the functionality by adding a year argument (the only values available are 2008 and 1999).

good_mileage <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact',
  yr = 2008
) {
  
  if (displ_low) {
    result = mpg %>%
    filter(cyl == cylinder,
           displ <= displ_fun(displ),
           hwy >= mpg_cutoff,
           class == cls,
           year == yr)
  } else {
    result = mpg %>%
    filter(cyl == cylinder,
           displ >= displ_fun(displ),
           hwy >= mpg_cutoff,
           class == cls,
           year == yr)
  }
  
  result
}
good_mileage(
  cylinder = 8,
  mpg_cutoff = 19,
  displ_low = FALSE,
  cls = 'suv',
  yr = 2008
)
# A tibble: 6 x 11
  manufacturer model              displ  year   cyl trans    drv     cty   hwy fl    class
  <chr>        <chr>              <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>
1 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4) r        14    20 r     suv  
2 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4) r        14    20 r     suv  
3 chevrolet    k1500 tahoe 4wd      5.3  2008     8 auto(l4) 4        14    19 r     suv  
4 ford         explorer 4wd         4.6  2008     8 auto(l6) 4        13    19 r     suv  
5 jeep         grand cherokee 4wd   4.7  2008     8 auto(l5) 4        14    19 r     suv  
6 mercury      mountaineer 4wd      4.6  2008     8 auto(l6) 4        13    19 r     suv  

So we now have something that is flexible, reusable, and extensible, and it took less code than writing out the individual lines of code.
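The two branches of good_mileage differ only in the direction of the displacement comparison, so the same DRY idea can be applied inside the function itself. The following is a base R sketch of that idea using the built-in mtcars data rather than mpg; good_mileage_base and its argument names are hypothetical, not part of the function above.

```r
good_mileage_base <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  disp_fun = mean,
  disp_low = TRUE
) {
  cutoff <- disp_fun(mtcars$disp)

  # the conditional now covers only the part that actually differs
  if (disp_low) {
    keep_disp <- mtcars$disp <= cutoff
  } else {
    keep_disp <- mtcars$disp >= cutoff
  }

  mtcars[mtcars$cyl == cylinder & mtcars$mpg >= mpg_cutoff & keep_disp, ]
}

good_mileage_base()

# swap the mean for the median by passing a different function
big_engines <- good_mileage_base(
  cylinder = 8,
  mpg_cutoff = 15,
  disp_fun = median,
  disp_low = FALSE
)
big_engines
```

Changing the cutoff function, as with the median here, now requires no change to the function body at all.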

Anonymous functions

Oftentimes we just need a quick and easy function for a one-off application, especially when using apply/map functions. Consider the following two lines of code.

apply(mtcars, 2, sd)
apply(mtcars, 2, function(x) x / 2 )

The difference between the two is that for the latter, our function didn’t have to be a named object already available. We created a function on the fly just to serve a specific purpose. Base R has no function whose only job is to divide by two, but since the task is simple, we just created one as needed.
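To see that nothing special is going on, compare the anonymous version with one where the same function is named first (halve is an arbitrary name chosen for this sketch):

```r
# named version: create the function, then pass it along by name
halve <- function(x) x / 2
named_result <- apply(mtcars, 2, halve)

# anonymous version: define the function right where it is used
anonymous_result <- apply(mtcars, 2, function(x) x / 2)

# the two approaches give the same result
identical(named_result, anonymous_result)
```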

To further illustrate this, we’ll create a robust standardization function that uses the median and median absolute deviation rather than the mean and standard deviation.

# some variables have a mad = 0, and so return Inf (x/0) or NaN (0/0)
# apply(mtcars, 2, function(x) (x - median(x))/mad(x)) %>% 
#   head()

mtcars %>%
  map_df(function(x) (x - median(x))/mad(x))
# A tibble: 32 x 11
       mpg    cyl   disp     hp     drat     wt   qsec    vs    am   gear   carb
     <dbl>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
 1  0.333   0     -0.258 -0.169  0.291   -0.919 -0.883   NaN   Inf  0      1.35 
 2  0.333   0     -0.258 -0.169  0.291   -0.587 -0.487   NaN   Inf  0      1.35 
 3  0.665  -0.674 -0.629 -0.389  0.220   -1.31   0.636   Inf   Inf  0     -0.674
 4  0.407   0      0.439 -0.169 -0.873   -0.143  1.22    Inf   NaN -0.674 -0.674
 5 -0.0924  0.674  1.17   0.674 -0.774    0.150 -0.487   NaN   NaN -0.674  0    
 6 -0.203   0      0.204 -0.233 -1.33     0.176  1.77    Inf   NaN -0.674 -0.674
 7 -0.905   0.674  1.17   1.58  -0.689    0.319 -1.32    NaN   NaN -0.674  1.35 
 8  0.961  -0.674 -0.353 -0.791 -0.00710 -0.176  1.62    Inf   NaN  0      0    
 9  0.665  -0.674 -0.395 -0.363  0.319   -0.228  3.67    Inf   NaN  0      0    
10  0       0     -0.204  0      0.319    0.150  0.417   Inf   NaN  0      1.35 
# … with 22 more rows
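The Inf and NaN values above come from columns whose mad is zero. Once a one-off function needs extra logic like a guard against that, it is usually worth giving it a name. Below is one possible sketch in base R; robust_scale is a hypothetical name, and returning NA for a zero-mad column is just one choice among several.

```r
robust_scale <- function(x) {
  spread <- mad(x)

  # a mad of zero would mean dividing by zero, so return NA instead
  if (spread == 0) {
    return(rep(NA_real_, length(x)))
  }

  (x - median(x)) / spread
}

scaled <- as.data.frame(lapply(mtcars, robust_scale))

head(scaled)
```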

Even if you don’t use anonymous functions (sometimes called lambda functions), it’s important to understand them, because you’ll often see other people’s code using them.


While it goes beyond the scope of this document at present, I should note that RStudio has a very nice and easy to use debugger. Once you get comfortable writing functions, you can use the debugger to troubleshoot problems that arise, and test new functionality (see the ‘Debug’ menu). In addition, one can profile functions to see what parts are, for example, more memory intensive, or otherwise serve as a bottleneck (see the ‘Profile’ menu). You can use the profiler on any code, not just functions.

Writing Functions Exercises

Exercise 1

Write a function that takes the log of the sum of two values (i.e. just two single numbers) using the log function. Just remember that within a function, you can write R code just like you normally would.

log_sum <- function(a, b) {
  ?
}

Exercise 1b

What happens if the sum of the two numbers is negative? The log of a negative value is undefined, and R returns NaN with a warning rather than stopping. How might we deal with this? Try using a conditional to raise an error with your own message via the stop function. The first part is basically identical to the function you just wrote. Given that result, you will then need to check whether it is negative. The message can be whatever you want.

log_sum <- function(a, b) {
  
  ?
  
  if (? < 0) {
    stop('Your message here.')
  } else {
    ?
    return(your_log_sum_results)    # this is an arbitrary name, change accordingly
  }
}
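If you haven’t used stop before, here is how it behaves in a different setting, so as not to give the exercise away. The safe_sqrt function is hypothetical, and tryCatch is used only so the error message can be captured without halting the script.

```r
# by default, R gives NaN (with a warning) for the log of a negative number
is.nan(suppressWarnings(log(-1)))

# hypothetical function that refuses negative input outright
safe_sqrt <- function(x) {
  if (x < 0) {
    stop('x must be non-negative.')
  }

  sqrt(x)
}

safe_sqrt(16)

# capture the error message instead of stopping execution
msg <- tryCatch(safe_sqrt(-1), error = function(e) conditionMessage(e))
msg
```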

Exercise 2

Let’s write a function that will take a numeric variable and convert it to a character string of ‘positive’ vs. ‘negative’. We can use the if {} ... else {} structure, ifelse, or dplyr::if_else; any of them can accomplish this, though a plain if/else tests only a single value at a time, so the vectorized ifelse or if_else is the more direct route for a whole vector. In this case, the input is a single vector of numbers, and the output will recode any negative value to ‘negative’ and positive values to ‘positive’ (or whatever you want). Here is an example of how we would just do it as a one-off.

set.seed(123)  # so you get the exact same 'random' result
x <- rnorm(10)
if_else(x < 0, "negative", "positive")

Now try your hand at writing a function for that.

pos_neg <- function(?) {
  ?
}