Data science term in The Economist

by Sergiy Nesterko on May 19th, 2011

It seems there is no stopping it now: the term data science appears prominently in the headline article of the current issue of The Economist.

The Economist

Compared with the rest of America, Silicon Valley feels like a boomtown. Corporate chefs are in demand again, office rents are soaring and the pay being offered to talented folk in fashionable fields like data science is reaching Hollywood levels. And no wonder, given the prices now being put on web companies.

It is indeed quite misleading that the term has the word science in it, as it implies an established field, while in fact the science of data is statistics. I wrote a post on the subject earlier in an attempt to single out what it is that distinguishes data science from statistics. That aside, however, the article is supportive of the rise in demand for our profession, which is good news for the specialists. Hopefully, the tech bubble mentioned there won’t be inflated further by people who misuse the data science term.

Does homophily exist?

by Sergiy Nesterko on May 9th, 2011

This spring I gave two talks: one at the New England Statistics Symposium (NESS), hosted by the Department of Statistics at the University of Connecticut, and a post-qualifying talk in my home department. Both talks were on my work with my advisor Joe Blitzstein, and both drew heavily on the term homophily. The first talk presented refined simulation study results on design-based estimation, and the second was about model-based estimation under Respondent-Driven Sampling. For the latter, we consider the data collected within a recent study of populations at high risk of HIV conducted in San Diego. The study took nearly 2 years to complete and was aimed at collecting information describing behavioral and health aspects of the target population. It is a privilege and a responsibility to be commissioned to analyze the collected data, as the results of the analysis may be used for subsequent policy decisions. Figure 1 shows the (anonymized) recruitment trees of the study.

Figure 1: San Diego study recruitment trees as functions of HIV status. On the x axis, observation means the HIV status group.

Read the rest of this entry »

Drastic R speed-ups via vectorization (and bug fixes)

by Sergiy Nesterko on April 29th, 2011

Figure 1: A screenshot of the corrected and enhanced dynamic visualization of RDS. Green balls are the convenience sample, pink balls are subsequently recruited individuals, pink lines are links between network nodes that have been explored by the process, and numbers in circles correspond to the sample wave number.

It is common to hear that R is slow, so when I faced the necessity of scaling old R code (pertaining to material described in this post) to operate on data 100 times larger than before, I was initially at a loss. The problem with the old code was that it took several days and about 4000 semi-parallel jobs to complete. With the size of the data increasing by a factor of 100, the task was becoming infeasible. Eventually, however, I was able to achieve an over 100-fold speedup of the R code, with the speedup being due to addressing two issues: Read the rest of this entry »

The data science puzzle

by Sergiy Nesterko on April 11th, 2011

Throughout the past few years, I have heard several times that the demand for quantitatively oriented and data-oriented professionals is growing. Clearly, this is good news for statisticians, as statistics is central to the process of extracting a meaningful and actionable signal from data. The terms data science and data scientist have been accompanying many of the related articles. So, I decided to do some research and look for evidence of an increase in demand for information analysis. My goal has been to understand the peculiarities of how our profession is perceived in the community, and to attempt to clarify the meaning of the new data science term. Read the rest of this entry »

Dynamic visualization of RDS version 2

by Sergiy Nesterko on March 27th, 2011

Early this semester, I worked on complementing my visualization of the Respondent-Driven Sampling (RDS) process presented in this post to illustrate its evolution over time. That was how the second version was created, which is displayed here.

Please refer to the earlier post for a detailed description of the main functionality. The second version implements an additional view of the process, which plots the portion of the underlying network discovered by the RDS process over time. To switch to the alternate view at any time, press the change view button. The wide pink horizontal line in the alternate view marks the true population mean. Read the rest of this entry »

Dynamic visualization of RDS

by Sergiy Nesterko on December 18th, 2010

The visualization below is the last element of work with my advisor Joe Blitzstein on exploring the Respondent-Driven Sampling (RDS) process via simulation. Read the rest of this entry »

Tradeoffs in estimation under Respondent-Driven Sampling, and Chernoff faces

by Sergiy Nesterko on October 6th, 2010

Recently I have been working hard on finalizing the paper that I am writing with my advisor Joe Blitzstein about estimation under Respondent-Driven Sampling (RDS). Specifically, the paper aims to develop general intuition about how the process works on networks with different topologies, and about what drives current estimators’ performance (or lack thereof).

To do this, we simulated many networks belonging to one of three main types (homophily, rich-gets-richer and inverse homophily), simulated many RDS processes of different configurations on each, and compared the performance of the well-established Volz-Heckathorn (VH) estimator and the plain vanilla mean as point estimators under each scenario. Among other findings, it turned out that the VH estimator underperforms the plain mean on the considered class of homophily networks, and prevails in some other cases. Read the rest of this entry »
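
For readers who have not seen the two point estimators side by side, here is a minimal sketch in Python. It is not the code used for the paper. It assumes the commonly cited form of the VH estimator, a mean weighted by the inverse of each respondent's reported degree, and the function names and the toy sample are made up purely for illustration.

```python
def plain_mean(values):
    """Unweighted sample mean of the observed trait values."""
    return sum(values) / len(values)


def vh_estimate(values, degrees):
    """Inverse-degree weighted mean: sum(y_i / d_i) / sum(1 / d_i).

    Assumption: this is the usual textbook form of the VH estimator,
    not necessarily the exact variant studied in the paper.
    """
    if len(values) != len(degrees):
        raise ValueError("values and degrees must have the same length")
    if any(d <= 0 for d in degrees):
        raise ValueError("every reported degree must be positive")
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)


# Hypothetical toy sample: the trait (1 = present) is concentrated
# among the respondents with higher network degree.
sample_values = [1, 1, 0, 0]
sample_degrees = [4, 4, 2, 2]

print("plain mean:", plain_mean(sample_values))
print("VH estimate:", vh_estimate(sample_values, sample_degrees))
```

In this toy sample the plain mean is 0.5, while the VH estimate is pulled down to about one third because high-degree respondents receive smaller weights. Which of the two lands closer to the truth depends on the network and the sampling process, which is what the simulations explore.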

Working with In4mation Insights

by Sergiy Nesterko on September 30th, 2010

Since the summer of 2010 I have been fortunate to work on several projects with In4mation Insights, a leading market research firm based in Needham Heights, MA.

My role has led me to work closely with the firm’s partners Steve Cohen and Mark Garratt, who have both been examples of impeccable professionalism and wit, and also with other members of the team: Mark Irwin, Sanjib Mohanty and Ryan Hickey, who are all quite sharp. Read the rest of this entry »

Visualizing while at the Opening Workshop on Complex Networks at SAMSI

by Sergiy Nesterko on August 31st, 2010

It is now almost the end of my stay here in Research Triangle Park, NC, at the Opening Workshop on Complex Networks organized by SAMSI. I presented a poster here on some of my work with Joe Blitzstein on estimation under respondent-driven sampling. It covered the simulation studies we have done to lay the foundations for our development of the new estimation method outlined in this post. I will prepare a post describing this earlier work once we submit a paper on it, which should be soon. I also had the pleasure of meeting other researchers working in the field, in particular Matt Salganik and Erik Volz. It was really enjoyable and inspiring to discuss problems relevant to estimation in RDS.

Apart from enjoying the workshop, I have had a chance to play with Processing and experiment with some ideas about visualizing high-dimensional dependent data (that is, data where the number of dimensions is larger than 3). Read the rest of this entry »

Conferences in the summer of 2010

by Sergiy Nesterko on August 20th, 2010

This summer I attended the Joint Statistical Meetings (JSM) in Vancouver, and have been fortunate to be accepted to the Complex Networks Opening Workshop held by the Statistical and Applied Mathematical Sciences Institute (SAMSI) in North Carolina, near Chapel Hill. Both events are exciting and intellectually stimulating. Read the rest of this entry »