More Programming

This section is kind of a grab bag of miscellaneous things related to programming. If you’ve made it this far, feel free to keep going!

Code Style

A lot has been written about coding style over the decades. If there were a definitive answer, you would have heard of it by now. However, there are a couple of things you can do early on that go a long way toward making your code notably better.

Why does your code exist?

Either use text in an R Markdown file or comment your R script. Explain why, not what, the code is doing. Think of it as leaving your future self a note (they will thank you!). Be clear, and don’t assume you’ll remember why you were doing what you did.

Assignment

You will see people use either <- or = for assignment. While there is a slight difference, if you’re writing decent code it shouldn’t matter. Far more programming languages use =, so that’s a reason to prefer it. However, if you like being snobby about things, go with <-. Whichever you use, do so consistently¹⁶.
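As a rough illustration of that slight difference, here is a minimal sketch. Inside a function call, = only names an argument, while <- actually creates an object. The demo_env environment is just a scratch space made up for this example, so nothing in your workspace gets overwritten.

```r
demo_env = new.env()   # hypothetical scratch environment for the demo

# '=' inside a call only matches the argument named x; nothing is assigned
with_equals = evalq(mean(x = c(1, 2, 3)), demo_env)
created_by_equals = exists('x', envir = demo_env, inherits = FALSE)

# '<-' inside a call creates x in demo_env, then passes its value to mean
with_arrow = evalq(mean(x <- c(1, 2, 3)), demo_env)
created_by_arrow = exists('x', envir = demo_env, inherits = FALSE)

c(created_by_equals, created_by_arrow)
```

Both calls return the same mean, but only the arrow version leaves an x behind. If you never assign inside function calls, you won’t run into this.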

Code length

If your script is becoming hundreds of lines long, you probably need to compartmentalize your operations into separate scripts. For example, separate your data processing from your model scripts.

Spacing

Don’t be stingy with spaces. As you start out, err on the side of using them. Just note there are exceptions (e.g. no space between function name and parenthesis, unless that function is something like if or else), but you’ll get used to the exceptions over time.

x=rnorm(10, mean=0,sd=1)              # harder to read
                                      # space between lines too!
x = rnorm(10, mean = 0, sd = 1)       # easier to read

Naming things

You might not think of it as such initially, but one of the more difficult challenges in programming is naming things. Even if we can come up with a name for an object or file, there are different styles we can use for the name.

Here is a brief list of things to keep in mind.

The name should make sense to you, your future self, and others that will use the code
Try to be concise, but see the previous point
Make liberal use of suffixes/prefixes for naming the same types of things e.g. model_x, model_z
For function names, try for verbs that describe what they do (e.g. add_two vs. two_more or plus2)
Don’t name anything with ‘final’
Don’t name something that is already an R function/object (e.g. T, c, data, etc.)
Avoid distinguishing names only by number, e.g. data1 data2
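To see why reusing existing names can bite, consider T. A small sketch, assuming you run it in a fresh session: T is an ordinary object that happens to hold TRUE, not a reserved word, so R will happily let you overwrite it.

```r
T = 0                      # legal: T is just a variable, unlike the reserved word TRUE
masked = isTRUE(T)         # our T now hides the built-in one

rm(T)                      # remove our copy; the base R value is visible again
restored = isTRUE(T)

c(masked, restored)
```

Any later code that relied on T meaning TRUE would have silently misbehaved while the masking was in place.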

Common naming styles include:

snake_case
CamelCase or camelCase
spinal-case (e.g. for file names)
dot.case

For objects and functions, I find snake case easier to read and less prone to issues¹⁷. For example, camel case can fail miserably when acronyms are involved. Dots already have specific uses (file name extensions, function methods, etc.), so probably should be avoided unless you’re using them for that specific purpose (they can also make selecting the whole name difficult depending on the context).
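As a sketch of how dots can collide with function methods, suppose you name a helper summary.results, meaning something like ‘a summary of my results’. The names here are hypothetical. R’s method dispatch will treat that function as the summary method for any object of class ‘results’.

```r
# intended as a stand-alone helper, but the name looks like generic.class
summary.results = function(object, ...) {
  'my helper, not a real summary'
}

# an object that happens to have class 'results'
fit_info = structure(list(n = 10), class = 'results')

# summary() dispatches to our helper because of its name
dispatched = summary(fit_info)
dispatched
```

With snake case, e.g. summary_results, no such surprise is possible.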

Other

Use tools like RStudio’s built-in code cleanup shortcut, Ctrl/Cmd + Shift + A. It’s not perfect, in the sense that I disagree with some of its style choices, but it will definitely be better than you will do on your own starting out.

Vectorization

Boolean indexing

Assume x is a vector of numbers. How would we create an index representing any value greater than 2?

x = c(-1, 2, 10, -5)
idx = x > 2
idx

[1] FALSE FALSE  TRUE FALSE

x[idx]

[1] 10

As mentioned previously, logicals are objects with values of TRUE or FALSE, like the idx variable above. While sometimes we want to deal with the logical object as an end, it is extremely commonly used as an index in data processing. Note how we don’t have to create an explicit index object first (though often you should), as R indexing is ridiculously flexible. Here are more examples, not necessarily recommended, but just to demonstrate the flexibility of Boolean indexing.

x[x > 2]

x[x != 'cat']

x[ifelse(x > 2 & x !=10, TRUE, FALSE)]

x[{y = idx; y}]

x[resid(lm(y ~ x)) > 0]

All of these will transfer to the tidyverse filter function.

df %>% 
  filter(x > 2, z == 'a')  # commas are like &
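For comparison, here is a base R sketch of the same kind of row selection using Boolean indexing directly. The small df is made up for illustration.

```r
# hypothetical data for the demo
df = data.frame(
  x = c(-1, 2, 10, 5),
  z = c('a', 'a', 'a', 'b')
)

# combine conditions with &, giving one TRUE/FALSE per row
keep = df$x > 2 & df$z == 'a'

# use the logical vector as a row index
df_kept = df[keep, ]

keep
df_kept
```

Only the third row satisfies both conditions, so only that row is retained.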

Vectorized operations

Boolean indexing allows us to take vectorized approaches to dealing with data. Consider the following unfortunately coded loop, where we create a variable y, which takes on the value of Yes if the variable x is greater than 2, and No otherwise.

for (i in 1:nrow(mydf)) {
  
  check = mydf$x[i] > 2
  
  if (check == TRUE) {
    mydf$y[i] = 'Yes'
  }
  else {
    mydf$y[i] = 'No'
  }
}

Compare¹⁸:

mydf$y = 'No'

mydf$y[mydf$x > 2] = 'Yes'

This gets us the same thing, and would be much faster than the looped approach. Boolean indexing is an example of a vectorized operation. The whole vector is considered, rather than each element individually. The result is that any preprocessing is done once rather than in each of the n iterations of the loop. In R, this will almost always be faster.
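If you want to convince yourself the approaches agree, here is a small self-contained check. The data frame is made up for the example, and the results go in separate columns (y_loop, y_vec, y_ifelse) so they can be compared. The ifelse version is included only as a third vectorized option.

```r
mydf = data.frame(x = c(-1, 2, 10, -5, 3))   # hypothetical data

# looped approach: one comparison and one assignment per row
mydf$y_loop = NA_character_
for (i in 1:nrow(mydf)) {
  if (mydf$x[i] > 2) {
    mydf$y_loop[i] = 'Yes'
  } else {
    mydf$y_loop[i] = 'No'
  }
}

# vectorized approach: Boolean indexing on the whole column
mydf$y_vec = 'No'
mydf$y_vec[mydf$x > 2] = 'Yes'

# vectorized approach with ifelse
mydf$y_ifelse = ifelse(mydf$x > 2, 'Yes', 'No')

same_loop_vec   = identical(mydf$y_loop, mydf$y_vec)
same_vec_ifelse = identical(mydf$y_vec, mydf$y_ifelse)
c(same_loop_vec, same_vec_ifelse)
```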

Example: Log all values in a matrix.

mymatrix_log = log(mymatrix)

This is way faster than looping over elements, rows or columns. Here we’ll let the apply function stand in for our loop, logging the elements of each column.

mymatrix = matrix(runif(100), 10, 10)

identical(apply(mymatrix, 2, log), log(mymatrix))

[1] TRUE

library(microbenchmark)

microbenchmark(apply(mymatrix, 2, log), log(mymatrix))

Unit: nanoseconds
                    expr   min      lq     mean median      uq    max neval cld
 apply(mymatrix, 2, log) 33961 41309.5 77630.22  64910 96955.0 289128   100   b
           log(mymatrix)   918  1040.5  3473.66   1258  1704.5 189819   100  a

Many vectorized functions already exist in R. They are often written in C, Fortran etc., and so are even faster. Not all programming languages lean toward vectorized operations, and some may not see much speed gain from them. In R, however, you’ll want to prefer them. Even without the speed gain, the code is cleaner and clearer, which is another reason to use the approach.
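As one more small sketch of the cleaner-code point, compare a hand-rolled running total with the built-in cumsum. The input vector is arbitrary.

```r
x = c(3, 1, 4, 1, 5)

# loop version: track the total and fill in one element at a time
running_total_loop = numeric(length(x))
total = 0
for (i in seq_along(x)) {
  total = total + x[i]
  running_total_loop[i] = total
}

# vectorized version: a single call
running_total_vec = cumsum(x)

same_totals = identical(running_total_loop, running_total_vec)
same_totals
```

The one-liner says what it does, and there are no counters or preallocated vectors to get wrong.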

Timings

We made our own function before; however, there is a scale function in base R that uses a more vectorized approach under the hood to standardize variables. The following demonstrates various approaches to standardizing the columns of a matrix, including a parallelized one. As you’ll see, the base R function requires very little code and beats the others.

mymat = matrix(rnorm(100000), ncol=1000)
mymat_asdf = as.data.frame(mymat)   # assumed: a data frame copy of mymat for the loop versions

stdize <- function(x) {
  (x-mean(x)) / sd(x)
}

doubleloop = function() {
  
  for (i in 1:ncol(mymat_asdf)) {
    x = mymat_asdf[, i]
    for (j in 1:length(x)) {
      x[j] = (x[j] - mean(x)) / sd(x)
    }
  }
  
}

singleloop = function() {
  
  for (i in 1:ncol(mymat_asdf)) {
    x = mymat_asdf[, i]
    x = (x - mean(x)) / sd(x)
  }
  
}


library(parallel)
cl = makeCluster(8)
clusterExport(cl, c('stdize', 'mymat'))

test = microbenchmark::microbenchmark(
  doubleloop = doubleloop(),
  singleloop = singleloop(),
  apply = apply(mymat, 2, stdize),
  parApply = parApply(cl, mymat, 2, stdize),
  vectorized = scale(mymat),
  times = 25
)

stopCluster(cl)

test
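Speed only matters if the answers agree, so here is a quick sketch checking that a loop, apply, and scale produce the same standardized columns. It uses a smaller made-up matrix, small_mat, and ignores the extra attributes that scale attaches to its result.

```r
set.seed(123)
small_mat = matrix(rnorm(200), ncol = 10)   # hypothetical smaller matrix

stdize <- function(x) {
  (x - mean(x)) / sd(x)
}

# loop over columns
by_loop = small_mat
for (i in 1:ncol(small_mat)) {
  by_loop[, i] = stdize(small_mat[, i])
}

# apply over columns
by_apply = apply(small_mat, 2, stdize)

# base R scale
by_scale = scale(small_mat)

loop_vs_apply  = isTRUE(all.equal(by_loop, by_apply))
apply_vs_scale = isTRUE(all.equal(by_apply, by_scale, check.attributes = FALSE))
c(loop_vs_apply, apply_vs_scale)
```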

Regular Expressions

A regular expression, regex for short, is a sequence of characters that can be used as a search pattern for a string. Common operations are to merely detect, extract, or replace the matching string. There are actually many different flavors of regex for different programming languages, many of which derive from the Perl approach or allow it to be enabled. However, knowing one means you pretty much know the others, with only minor modifications if any.

To be clear, not only is regex another language, it’s nigh on indecipherable. You will not learn much regex, but what you do learn will save a potentially enormous amount of time you’d otherwise spend trying to do things in a more haphazard fashion. Furthermore, practically every situation that will come up has already been asked and answered on Stack Overflow, so you’ll almost always be able to search for what you need.

Here is an example of a pattern we might be interested in:

^r.*shiny[0-9]$

What is that, you may ask? Well, here are some strings it would and wouldn’t match. We’re using grepl to return a logical (i.e. TRUE or FALSE) for each string, indicating whether it matches the pattern.

string = c('r is the shiny', 'r is the shiny1', 'r shines brightly')

grepl(string, pattern = '^r.*shiny[0-9]$')

[1] FALSE  TRUE FALSE

What the regex is esoterically attempting to match is any string that starts with ‘r’ and ends with ‘shiny_’ where _ is some single digit. Specifically it breaks down as follows:

^ : starts with, so ^r means starts with r
. : any character
* : match the preceding zero or more times
shiny : match ‘shiny’
[0-9] : any digit 0-9 (note that we are still talking about strings, not actual numbered values)
$ : ends with preceding
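One way to get a feel for the breakdown is to test the pieces separately on the same strings. The following is just a sketch using base R functions, with sub and regmatches added to show replacing and extracting alongside detecting.

```r
string = c('r is the shiny', 'r is the shiny1', 'r shines brightly')

starts_with_r   = grepl('^r', string)       # all three start with r
contains_shiny  = grepl('shiny', string)    # only the first two contain 'shiny'
ends_with_digit = grepl('[0-9]$', string)   # only the second ends in a digit

# the full pattern requires all of the above, in order
full_match = grepl('^r.*shiny[0-9]$', string)

# replace: drop a trailing digit where there is one
no_digit = sub('[0-9]$', '', string)

# extract: pull out the trailing digit from strings that have one
trailing_digit = regmatches(string, regexpr('[0-9]$', string))
```

Only the second string passes every piece, which is why it is the only TRUE for the full pattern.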

Typical uses

None of it makes sense, so don’t attempt to make sense of it. Just try to remember a couple key approaches, and search the web for the rest.

Along with ^ . * [0-9] $, a couple more common ones are:

[a-z] : letters a-z
[A-Z] : capital letters
+ : match the preceding one or more times
() : groupings
| : logical or e.g. [a-z]|[0-9] (a lower case letter or a number)
? : preceding item is optional, and will be matched at most once. It also appears in the syntax for ‘look ahead’ and ‘look behind’
\ : escape a character, like if you actually wanted to search for a period instead of using it as a regex pattern, you’d use \., though in R you need \\, i.e. double backslashes, for escape.
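Here is a short sketch of a few of these in action with base R. The strings are made up for the example.

```r
files = c('data.csv', 'data_csv', 'notes.txt', 'colour', 'color')

# escaped period: only a literal '.csv' at the end matches
literal_dot = grepl('\\.csv$', files)

# unescaped period: '.' matches any character, including '_'
any_char = grepl('.csv$', files)

# ? makes the preceding 'u' optional
optional_u = grepl('^colou?r$', files)

# | matches either alternative
csv_or_txt = grepl('csv|txt', files)

# + requires one or more: letters followed by digits
letters_then_digits = grepl('[a-z]+[0-9]+', c('abc123', 'abc', '123'))

# () captures groups that can be reused in the replacement as \\1, \\2
swapped = sub('(data)[._](csv)', '\\2_\\1', files)
```

Note the difference between the first two results: forgetting to escape the period lets data_csv slip through.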

In addition, in R there are certain predefined character classes that can be called, typically inside another set of brackets, e.g. [[:punct:]]:

[:punct:] : punctuation
[:blank:] : spaces and tabs
[:alnum:] : alphanumeric characters
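A brief sketch of these classes with base R follows. Note that each class sits inside an outer pair of brackets, and the example strings are made up.

```r
# strip punctuation
no_punct = gsub('[[:punct:]]', '', 'Hello, world!')

# collapse runs of spaces and tabs into a single space
single_spaced = gsub('[[:blank:]]+', ' ', 'too   many\tspaces')

# check whether a string is made up only of letters and digits
only_alnum = grepl('^[[:alnum:]]+$', c('abc123', 'abc 123', 'abc_123'))

no_punct
single_spaced
only_alnum
```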

Those are just a few. The key functions can be found by looking at the help file for the grep function (?grep). However, the stringr package has the same functionality with perhaps slightly faster processing (though that’s due to the underlying stringi package).

See if you can guess which of the following will turn up TRUE.

grepl(c('apple', 'pear', 'banana'), pattern='a')
grepl(c('apple', 'pear', 'banana'), pattern='^a')
grepl(c('apple', 'pear', 'banana'), pattern='^a|a$')

Scraping the web, munging data, just finding things in your scripts … you can potentially use this all the time, and not only with text analysis, as we’ll now see.

Code Style Exercises

Exercise 1

For the following model related output, come up with a name for each object.

lm(hwy ~ cyl, data = mpg)                 # hwy mileage predicted by number of cylinders

summary(lm(hwy ~ cyl, data = mpg))        # the summary of that

lm(hwy ~ cyl + displ + year, data = mpg)  # an extension of that

Exercise 2

Fix this code.

x=rnorm(100, 10, 2)
y=.2* x+ rnorm(100)
data = data.frame(x,y)
q = lm(y~x, data=data)
summary(q)

Vectorization Exercises

Before we do this, did you remember to fix the names in the previous exercise?

Exercise 1

Show a non-vectorized (e.g. a loop) and a vectorized way to add two to the numbers 1 through 3.

Exercise 2

Of the following, which do you think is faster? Test it with the bench package.

x = matrix(rpois(100000, lambda = 5), ncol = 100)
colSums(x)
apply(x, 2, sum)

bench::mark(
  cs  = colSums(x),
  app = apply(x, 2, sum),
  time_unit = 'ms'   # milliseconds
)

Regex Exercises

Exercise 1

Using stringr and str_replace, replace all the a’s in the state names with nothing.

library(stringr)

str_replace(state.name, pattern = ?, replacement = ?)

¹⁶ I usually reserve arrows for functions and package development.
¹⁷ Even though I think it’s obviously easier to read, based on how eyes/brains work, this is not necessarily the case for everyone. Many style guides you encounter will try to come across as definitive. They are not, nor are most of their claims tested. Same here.
¹⁸ For those familiar with ifelse, that would be applicable, but is not the point of the example.