Drastic R speed-ups via vectorization (and bug fixes)

by Sergiy Nesterko on April 29th, 2011


Figure 1: A screenshot of the corrected and enhanced dynamic visualization of RDS. Green balls are the convenience sample, pink balls are subsequently recruited individuals, pink lines are links between network nodes that have been explored by the process, and the numbers in circles correspond to the sample wave number.

It is common to hear that R is slow, so when I needed to scale old R code (pertaining to material described in this post) to operate on data 100 times larger than before, I was initially at a loss. The old code already took several days and about 4000 semi-parallel jobs to complete, and with the data growing by a factor of 100, the task was becoming infeasible. Eventually, however, I achieved an over 100-fold speedup of the R code by addressing two issues:

  1. Vectorization. R code is sped up drastically by vectorizing wherever possible. Vectorization means turning explicit loops and calls to apply-family functions like sapply into vector operations, as in the following example (a further sketch contrasting a for loop with the vectorized call appears after this list):
    > time1 <- proc.time()
    > res1 <- sapply(1:1000000, exp)
    > time2 <- proc.time()
    > time2 - time1
    user system elapsed
    6.321 0.146 6.429
    >
    > time1 <- proc.time()
    > res2 <- exp(1:1000000)
    > time2 <- proc.time()
    > time2 - time1
    user system elapsed
    0.022 0.016 0.051
    > all(res1 == res2)
    [1] TRUE

    In this particular instance, the speedup is over 100x.
  2. Vector indexing. This is a slightly more complex issue to illustrate. The gist is that environments in R are not hashed by default, and accessing vector entries by name is very slow. So, if there is any way to access a vector by numeric index instead, implementing it may lead to significant speedups:
    > foo <- rnorm(1000000)
    > names(foo) <- sample(letters, 1000000, replace = TRUE)
    >
    > ind <- sample(letters, 500000, replace = TRUE)
    > time1 <- proc.time()
    > res1 <- foo[ind]
    > time2 <- proc.time()
    > time2 - time1
    user system elapsed
    0.445 0.019 0.474
    >
    > idx <- match(ind, names(foo))
    > time1 <- proc.time()
    > res2 <- foo[idx]
    > time2 <- proc.time()
    > time2 - time1
    user system elapsed
    0.017 0.000 0.021
    >
    > all(res1 == res2)
    [1] TRUE

    There is a speedup of over 20x when indexing the vector by a numeric index instead of by name.
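
To make the first point above more concrete, here is a minimal sketch contrasting an explicit for loop with the equivalent vectorized call (the data and variable names are made up for illustration, and exact timings will vary by machine):

    x <- rnorm(1000000)

    ## Looped version: fills the result one element at a time,
    ## paying interpreter overhead on every iteration.
    res_loop <- numeric(length(x))
    for (i in seq_along(x)) {
        res_loop[i] <- exp(x[i])
    }

    ## Vectorized version: a single call over the whole vector,
    ## with the element-wise work done in compiled code.
    res_vec <- exp(x)

    all(res_loop == res_vec)
    ## [1] TRUE

Timing these two versions with proc.time(), as above, shows the same kind of gap as the sapply example: the point is simply that the per-element work moves from interpreted R code into compiled code.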

The final issue I addressed when looking back at my code was fixing bugs: the kind that don't crash execution, but instead produce believable yet unintended results. I think this sort of bug can be caught by looking critically at the output and evaluating it from many angles. An example of such a fix (and enhancement) is the screenshot taken from the dynamic visualization of the RDS process in Figure 1. The previous version did not actually cycle through different sensitivity constants for each network type; instead, it presented the user with random variations of a network of the given type, with the sensitivity constant fixed at 1. Now it is fixed and working.

There was another hidden bug in the core simulation code whose performance I was optimizing, and fixing it also led to a speed increase; but since that is (hopefully) not the main source of the speedup, I did not include it in the list above.

The conclusion I draw from this exercise is to put more emphasis on code checking and optimization.
