Writing Functions
You can’t do anything in R without using functions, but have you ever written your own? Why would you?
- Efficiency
- Customized functionality
- Reproducibility
- Extend the work that’s already been done
There are many benefits to writing your own functions, and it’s actually easy to do. Once you get the basic concept down, you’ll likely find yourself using your own functions more and more.
A Starting Point
Let’s assume you want to calculate the mean, standard deviation, and number of missing values for a variable called myvar. We could do something like the following.
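A minimal sketch of that one-off calculation. It assumes myvar is a numeric vector; the values below are made up purely for illustration.

```r
# hypothetical example data; in practice myvar would be a column of your data
myvar <- c(2.5, 3.1, NA, 4.8, 5.0)

mean(myvar, na.rm = TRUE)  # mean of the non-missing values
sd(myvar, na.rm = TRUE)    # standard deviation of the non-missing values
sum(is.na(myvar))          # number of missing values
```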
Now let’s say you need to do it for several variables. Here’s what your custom function could look like. It takes a single input, the variable you want information about, and returns a data frame with that info.
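A minimal sketch of such a function. The name my_summary matches the one used later in this section; this first version passes x straight to mean() and sd() without any missing-value handling, which is an assumption consistent with the NA results shown further on.

```r
# x: a numeric vector; returns a one-row data frame of summary statistics
my_summary <- function(x) {
  data.frame(
    mean      = mean(x),
    sd        = sd(x),
    N_missing = sum(is.na(x))
  )
}

my_summary(mtcars$mpg)  # mtcars ships with R
```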
In the above, x is an arbitrary name for an input. You can name it whatever you want, but the more meaningful the better. In R (and other languages) these inputs are called arguments, and they determine in part what the function eventually produces as output.
mean sd N_missing
1 20.09062 6.026948 0
Works fine. However, data typically isn’t that pretty. It often has missing values.
mean sd N_missing
1 NA NA 516
If there are actually missing values, we need to set na.rm = TRUE, or mean and sd will return NA. Let’s try it. We can either hard-code it, as in the initial example, or add an argument that lets us control how our custom function handles NAs.
my_summary <- function(x) {
  data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_missing = sum(is.na(x))
  )
}
my_summary_na <- function(x, remove_na = TRUE) {
  data.frame(
    mean = mean(x, na.rm = remove_na),
    sd = sd(x, na.rm = remove_na),
    N_missing = sum(is.na(x))
  )
}
my_summary(gapminder_2019$lifeExp)
mean sd N_missing
1 43.13218 16.31355 516
my_summary_na(gapminder_2019$lifeExp, remove_na = FALSE)
mean sd N_missing
1 NA NA 516
Seems to work fine. Let’s add how many total observations there are.
my_summary <- function(x) {
  # create an arbitrarily named object with the summary information
  summary_data = data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_total = length(x),
    N_missing = sum(is.na(x))
  )
  # return the result!
  summary_data
}
That was easy! Let’s try it.
mean sd N_total N_missing
1 43.13218 16.31355 40953 516
Now let’s do it for every column! We’ve used the map function before; now let’s use a variant that returns a data frame.
variable mean sd N_total N_missing
1 year 1.909000e+03 6.321997e+01 40953 0
2 lifeExp 4.313218e+01 1.631355e+01 40953 516
3 pop 1.353928e+07 6.565653e+07 40953 0
4 gdpPercap 4.591026e+03 1.016210e+04 40953 0
5 giniPercap 4.005331e+01 9.102757e+00 40953 0
The map_dfr function works just like map did in the iterative programming section, except that it creates mini data frames and then row-binds them together.
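A sketch of that pattern, assuming the purrr and dplyr packages are installed. Since gapminder_2019 is not defined here, three columns of the built-in mtcars data stand in for it. The .id argument stores each column name in a column called variable.

```r
library(purrr)

my_summary <- function(x) {
  data.frame(
    mean      = mean(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE),
    N_total   = length(x),
    N_missing = sum(is.na(x))
  )
}

# one mini data frame per column, row-bound into a single result
map_dfr(mtcars[, c("mpg", "hp", "wt")], my_summary, .id = "variable")
```

In purrr 1.0.0 and later, map_dfr is marked as superseded in favor of map followed by list_rbind, though it still works as shown.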
This shows that writing the first part of any function can be straightforward. Then, once in place, you can usually add functionality without too much trouble. Eventually you could have something very complicated, but which will make sense to you because you built it from the ground up.
Keep in mind as you start out that your initial decisions to make are:
- What are the inputs (arguments) to the function?
- What is the value to be returned?
When you think about writing a function, first just write code that does the job once. The goal is then to generalize beyond that single use case. RStudio even has a shortcut to help you get started. Consider our starting point. Highlight the code, choose Code > Extract Function from the menu (Ctrl + Alt + X, or Cmd + Option + X on macOS), then give it a name.
It should look something like this.
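Assuming the highlighted code was the three one-off lines for myvar, and that the name given was my_summary_extracted (a hypothetical name), the generated skeleton would look roughly like this.

```r
my_summary_extracted <- function(myvar) {
  mean(myvar)
  sd(myvar)
  sum(is.na(myvar))
}

# only the value of the last expression is returned
my_summary_extracted(c(1, 2, NA))
```

Because an R function returns only its last evaluated expression, this raw skeleton would report just the missing-value count. Collecting the results in a data frame, as described earlier, is one way around that.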
RStudio could tell that you would need at least one input, myvar, but beyond that, you’re now on your way to tweaking the function as you see fit.
Note that what goes in and what comes out could be anything, even nothing!
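For instance, a function can take no arguments at all. The name two is a hypothetical one chosen for this sketch.

```r
# no inputs: the function always returns the same value
two <- function() {
  2
}

two()
```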
[1] 2
Or even another function!
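A sketch of a function that returns another function. The name center and its type argument are hypothetical. It hands back either mean or median, which can then be called like any other function.

```r
center <- function(type = "mean") {
  if (type == "mean") {
    mean
  } else {
    median
  }
}

center(type = "mean")            # prints the mean function itself

my_fun <- center(type = "mean")
my_fun(1:5)                      # mean of 1 through 5

my_fun <- center(type = "median")
my_fun(1:4)                      # median of 1 through 4
```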
function (x, ...)
UseMethod("mean")
<bytecode: 0x7fe3efc05860>
<environment: namespace:base>
[1] 3
[1] 2.5
We can also set default values for the inputs.
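A sketch of a default argument value. The function name hi and its name argument are hypothetical.

```r
hi <- function(name = "Beyoncé") {
  paste0("Hi ", name, "!")
}

hi()                # uses the default value
hi(name = "Jay-Z")  # overrides the default
```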
[1] "Hi Beyoncé!"
[1] "Hi Jay-Z!"
If you are working within an RStudio project, it would be a good idea to create a folder for your functions and save each as their own script. When you need the function just use the following:
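A sketch of the idea using source(). The folder name functions and the file name my_total.R are hypothetical, and a temporary directory stands in for the project folder so the snippet can run anywhere.

```r
# stand-in for a project's functions/ folder (hypothetical layout)
fun_dir <- file.path(tempdir(), "functions")
dir.create(fun_dir, showWarnings = FALSE)

# save a function definition as its own script
writeLines(
  "my_total <- function(x) sum(x, na.rm = TRUE)",
  file.path(fun_dir, "my_total.R")
)

# later, load it when needed; inside a project this would typically be
# source("functions/my_total.R")
source(file.path(fun_dir, "my_total.R"))

my_total(c(1, 2, NA))
```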
This would make it easy to even create your own personal package with the functions you create.
However you go about creating a function, and for whatever purpose, try to make clear decisions at the beginning:
- What is the (specific) goal of your function?
- What is the minimum needed to obtain that goal?
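RStudio even has a keyboard shortcut that inserts a roxygen2-style documentation skeleton for a function (Code > Insert Roxygen Skeleton, or Ctrl + Alt + Shift + R, Cmd + Option + Shift + R on macOS)!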
DRY
An oft-quoted mantra in programming is Don’t Repeat Yourself. One context regards iterative programming, where we would rather write one line of code than one-hundred. More generally though, we would like to gain efficiency where possible. A good rule of thumb is, if you are writing the same set of code more than twice, you should write a function to do it instead.
Consider the following example, where we want to subset the data given a set of conditions. Given the cylinder, engine displacement, and mileage, we’ll get different parts of the data.
good_mileage_displ_low_cyl_4 = if_else(cyl == 4 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_low_cyl_6 = if_else(cyl == 6 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_low_cyl_8 = if_else(cyl == 8 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_4 = if_else(cyl == 4 & displ > mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_6 = if_else(cyl == 6 & displ > mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_8 = if_else(cyl == 8 & displ > mean(displ) & hwy > 30, 'yes', 'no')
It was tedious, but that’s not much code. But now consider: what if you want to change the mpg cutoff? The mean to the median? Something else? You have to change all of it. Screw that, let’s write a function instead! What kinds of inputs will we need?
- cylinder: Which cylinder type we want
- mpg_cutoff: The cutoff for ‘good’ mileage
- displ_fun: Whether the displacement cutoff is based on the mean or some other function
- displ_low: Whether we are interested in low or high displacement vehicles
- cls: The class of the vehicle (e.g. compact or suv)
good_mileage <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact'
) {
  if (displ_low == TRUE) {          # condition to check, if it holds,
    result <- mpg %>%               # filter data given the arguments
      filter(
        cyl == cylinder,
        displ <= displ_fun(displ),
        hwy >= mpg_cutoff,
        class == cls
      )
  } else {                          # if the condition doesn't hold, filter
    result <- mpg %>%               # the data this way instead
      filter(
        cyl == cylinder,
        displ >= displ_fun(displ),  # the only change is here
        hwy >= mpg_cutoff,
        class == cls
      )
  }
  result                            # return the object
}
So what’s going on here? Not a whole lot really. The function just filters the data to observations that match the input criteria, and returns that result at the end. We also gave the arguments default values, which you can do at your discretion.
Conditionals
The core of the above function is a conditional statement with the standard if…else structure. The if part determines whether some condition holds. If it does, R proceeds to the code in the brackets that follow. If not, it skips to the else part. You may have used the ifelse function in base R, or dplyr’s if_else as above, which are vectorized shortcuts for this approach. We can also add conditional else statements (else if), drop the else part entirely, nest conditionals within other conditionals, etc. Like loops, conditional statements look very similar across all programming languages.
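A short sketch of these structures in R. The function name mileage_label and its cutoffs are made up for illustration.

```r
# if / else if / else works on a single value
mileage_label <- function(hwy) {
  if (hwy >= 30) {
    "good"
  } else if (hwy >= 20) {
    "ok"
  } else {
    "poor"
  }
}

mileage_label(33)

# ifelse() is vectorized, so it handles a whole vector at once
ifelse(c(33, 24, 17) >= 30, "good", "not good")
```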
JavaScript:
Python:
In any case, with our function at the ready, we can now do the things we want to as needed:
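A sketch of such calls, assuming dplyr and ggplot2 (which supplies the mpg data) are installed. The function is restated in a condensed form so the snippet runs on its own, and the argument values are illustrative choices.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# condensed restatement: the if/else only picks the direction of the comparison
good_mileage <- function(cylinder = 4, mpg_cutoff = 30, displ_fun = mean,
                         displ_low = TRUE, cls = "compact") {
  displ_cut <- displ_fun(mpg$displ)
  mpg %>%
    filter(
      cyl == cylinder,
      if (displ_low) displ <= displ_cut else displ >= displ_cut,
      hwy >= mpg_cutoff,
      class == cls
    )
}

# compact four-cylinder cars with low displacement and very high mileage
good_mileage(mpg_cutoff = 40)

# eight-cylinder SUVs with high displacement and a much lower mileage bar
good_mileage(cylinder = 8, mpg_cutoff = 15, displ_low = FALSE, cls = "suv")
```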
# A tibble: 1 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
# A tibble: 34 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
2 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 11 15 e suv
3 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
4 chevrolet c1500 suburban 2wd 5.7 1999 8 auto(l4) r 13 17 r suv
5 chevrolet c1500 suburban 2wd 6 2008 8 auto(l4) r 12 17 r suv
6 chevrolet k1500 tahoe 4wd 5.3 2008 8 auto(l4) 4 14 19 r suv
7 chevrolet k1500 tahoe 4wd 5.7 1999 8 auto(l4) 4 11 15 r suv
8 chevrolet k1500 tahoe 4wd 6.5 1999 8 auto(l4) 4 14 17 d suv
9 dodge durango 4wd 4.7 2008 8 auto(l5) 4 13 17 r suv
10 dodge durango 4wd 4.7 2008 8 auto(l5) 4 13 17 r suv
# … with 24 more rows
Let’s extend the functionality by adding a year argument (the only values available are 2008 and 1999).
good_mileage <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact',
  yr = 2008
) {
  if (displ_low) {
    result = mpg %>%
      filter(cyl == cylinder,
             displ <= displ_fun(displ),
             hwy >= mpg_cutoff,
             class == cls,
             year == yr)
  } else {
    result = mpg %>%
      filter(cyl == cylinder,
             displ >= displ_fun(displ),
             hwy >= mpg_cutoff,
             class == cls,
             year == yr)
  }
  result
}
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
2 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
3 chevrolet k1500 tahoe 4wd 5.3 2008 8 auto(l4) 4 14 19 r suv
4 ford explorer 4wd 4.6 2008 8 auto(l6) 4 13 19 r suv
5 jeep grand cherokee 4wd 4.7 2008 8 auto(l5) 4 14 19 r suv
6 mercury mountaineer 4wd 4.6 2008 8 auto(l6) 4 13 19 r suv
So we now have something that is flexible, reusable, and extensible, and it took less code than writing out the individual lines of code.
Anonymous functions
Oftentimes we just need a quick and easy function for a one-off application, especially when using apply/map functions. Consider the following two lines of code.
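A sketch of the kind of comparison meant here, using the built-in mtcars data. The first line passes an existing named function, while the second defines a function on the spot.

```r
# named function: sd already exists in R
apply(mtcars, 2, sd)

# anonymous function: created on the fly to divide each column by two
apply(mtcars, 2, function(x) x / 2)
```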
The difference between the two is that for the latter, our function didn’t have to be a named object already available. We created a function on the fly just to serve a specific purpose. Base R has no function that does nothing but divide by two, but since the operation is simple, we just created one as needed.
To further illustrate this, we’ll create a robust standardization function that uses the median and median absolute deviation rather than the mean and standard deviation.
# some variables have a mad = 0, and so return Inf (x/0) or NaN (0/0)
# apply(mtcars, 2, function(x) (x - median(x))/mad(x)) %>%
# head()
mtcars %>%
map_df(function(x) (x - median(x))/mad(x))
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.333 0 -0.258 -0.169 0.291 -0.919 -0.883 NaN Inf 0 1.35
2 0.333 0 -0.258 -0.169 0.291 -0.587 -0.487 NaN Inf 0 1.35
3 0.665 -0.674 -0.629 -0.389 0.220 -1.31 0.636 Inf Inf 0 -0.674
4 0.407 0 0.439 -0.169 -0.873 -0.143 1.22 Inf NaN -0.674 -0.674
5 -0.0924 0.674 1.17 0.674 -0.774 0.150 -0.487 NaN NaN -0.674 0
6 -0.203 0 0.204 -0.233 -1.33 0.176 1.77 Inf NaN -0.674 -0.674
7 -0.905 0.674 1.17 1.58 -0.689 0.319 -1.32 NaN NaN -0.674 1.35
8 0.961 -0.674 -0.353 -0.791 -0.00710 -0.176 1.62 Inf NaN 0 0
9 0.665 -0.674 -0.395 -0.363 0.319 -0.228 3.67 Inf NaN 0 0
10 0 0 -0.204 0 0.319 0.150 0.417 Inf NaN 0 1.35
# … with 22 more rows
Even if you don’t use anonymous functions (sometimes called lambda functions), it’s important to understand them, because you’ll often see other people’s code using them.
While it goes beyond the scope of this document at present, I should note that RStudio has a very nice and easy to use debugger. Once you get comfortable writing functions, you can use the debugger to troubleshoot problems that arise, and test new functionality (see the ‘Debug’ menu). In addition, one can profile functions to see what parts are, for example, more memory intensive, or otherwise serve as a bottleneck (see the ‘Profile’ menu). You can use the profiler on any code, not just functions.
Writing Functions Exercises
Exercise 1
Write a function that takes the log of the sum of two values (i.e. just two single numbers) using the log function. Just remember that within a function, you can write R code just like you normally would.
Exercise 1b
What happens if the sum of the two numbers is negative? You can’t take a log of a negative value, so it’s an error. How might we deal with this? Try using a conditional to provide an error message using the stop function. The first part is basically identical to the function you just did. But given that result, you will need to check for whether it is negative or not. The message can be whatever you want.
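As a hint, here is how stop() can be combined with a conditional in a different setting. The function half and its check are hypothetical and deliberately not the solution to this exercise.

```r
half <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")  # halts the function and reports the message
  }
  x / 2
}

half(10)
# half("ten") would stop with the error message above
```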
Exercise 2
Let’s write a function that will take a numeric variable and convert it to a character string of ‘positive’ vs. ‘negative’. We can use an if {}... else {} structure, ifelse, or dplyr::if_else; any of them can accomplish this. In this case, the input is a single vector of numbers, and the output will recode any negative value to ‘negative’ and positive values to ‘positive’ (or whatever you want). Here is an example of how we would just do it as a one-off.
set.seed(123) # so you get the exact same 'random' result
x <- rnorm(10)
if_else(x < 0, "negative", "positive")
Now try your hand at writing a function for that.