Archive for April, 2011

Drastic R speed-ups via vectorization (and bug fixes)

Friday, April 29th, 2011


Figure 1: A screenshot of the corrected and enhanced dynamic visualization of RDS. Green balls are the convenience sample, pink balls are subsequently recruited individuals, pink lines are links between network nodes that have been explored by the process, and the numbers in circles give the sample wave number.

It is common to hear that R is slow, so when I had to scale old R code (pertaining to material described in this post) to operate on data 100 times larger than before, I was initially at a loss. The old code already took several days and about 4000 semi-parallel jobs to complete; with the size of the data increasing by a factor of 100, the task was becoming infeasible. Eventually, however, I was able to achieve an over 100-fold speedup of the R code by addressing two issues: (more…)

The data science puzzle

Monday, April 11th, 2011

Over the past few years, I have heard several times that the demand for quantitatively oriented and data-oriented professionals is growing. Clearly, this is good news for statisticians, as statistics is central to the process of extracting a meaningful and actionable signal from data. The terms data science and data scientist have accompanied many of the related articles. So I decided to do some research and look for evidence of an increase in demand for information analysis. My goal has been to understand the peculiarities of how our profession is perceived in the community, and to attempt to clarify the meaning of the new term data science. (more…)