The gap between academia and current industry practices in data analysis

by Sergiy Nesterko on March 25th, 2012

The demand for specialists who can extract meaningful insights from data is increasing, which is good for statisticians, as statistics is, among other things, the science of extracting signal from data. This is discussed in articles such as this January article in Forbes, and also in the McKinsey Global Institute report published in May last year, an excerpt from which is given below:

There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

This sounds encouraging for the current students in fields such as statistics as they are looking to get out on the hot job market after graduation. However, are they prepared for what industry jobs need them to do?

One piece of news that hasn't been covered in the press yet is that the methods and data-related problems encountered in industry often differ from those described in the scientific literature. By different I mean either scientifically invalid, or scientifically valid and cutting edge.

An example of this phenomenon is the so-called Technical Analysis of financial data, which is often used by algorithmic trading groups to devise computer-based trading strategies. Technical analysis is a term that people came up with to describe a set of methods that are often useful, but whose validity is questionable from the scientific perspective. Quantitative traders have been employing this type of analysis for a long time without knowing whether it is valid.

Another example is a project I worked on, described in this post: creating an algorithm to optimize annual marketing campaigns for a large consumer packaged goods company (over $6 billion in sales), with the goal of a 3-5% revenue increase without increasing expenditure. Essentially, this was an exercise in Response Surface methods with dimensionality as high as 327,600,000. There are no scientific papers in the field that consider problems of such high dimensionality. And yet companies are interested in such projects, even though the methods for solving them are not scientifically verified (we worked hard to justify the validity of our approach for the project).

Recently I received an email inviting quantitatively oriented PhDs to apply for a summer fellowship in California to work on data science projects. Here is a quote from the email:

The Insight Data Science Fellows Program is a new post-doctoral training fellowship designed to bridge the gap between academia and a career data science.

Further, here is what is stated on the website of the organization sponsoring the program:

Bridging the gap between academia and data science.

As with algorithmic trading about 15 years ago, the use of sometimes scientifically questionable data analysis techniques is driven by the increased demand for insights from quantitative information. Such approaches, which in the world of quantitative finance are called Technical Analysis, are named Data Science during the current data boom.

When using the term, one should keep in mind that the methods employed by inadequately trained "data scientists" may be scientifically valid, but may well not be. There is an inherent danger in calling something that encompasses incorrect methods a "science", as this creates the perception of a well-established and trustworthy field. The term, however, is only a couple of years old. In my opinion, a more accurate one would be "current data analysis practices employed in industry".

The way we name the phenomenon does not change what it is. The fact is that there is a lot of data and a lot of problems in industry that often go beyond what has been seen or addressed in academia. This is an exciting time for statisticians.

How to convert d3.js SVG to PDF

by Sergiy Nesterko on January 30th, 2012

Consider the following plot:

The image above is an SVG created using d3.js. I am going to use it in a paper, so I need to convert it to PDF. It so happens that how to do this is not common knowledge, as is mentioned in this blog (which is, by the way, a good and balanced source of information relevant to our profession). Here is a short tutorial on how to do it: Read the rest of this entry »

Informative versus flashy visualizations, and growth in Harvard Stat concentration enrollment

by Sergiy Nesterko on December 4th, 2011

Some time ago my advisor Joseph Blitzstein asked me to create a visualization of the numbers of Harvard Statistics concentrators (undergraduate students who major in Statistics). The picture would be used by the chair of the department to illustrate the growth of the program for university officials, so I decided to make it look pretty. The first form that came to my mind was showing the enrollment growth over years using a bar plot.

Starting in 2005, the numbers follow exponential growth, which is a remarkable achievement of the department. We then decided to follow the trend and extrapolate by adding predicted enrollment numbers for 2011, 2012 and 2013. At that time, no data for 2011 were available. Read the rest of this entry »
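
The kind of extrapolation described above can be sketched as a least-squares fit on the log scale. This is only a minimal illustration under assumptions, not the method or the data from the post: the counts below are made-up placeholders rather than real enrollment numbers, and the names `fit_exponential` and `predict` are hypothetical.

```python
import math


def fit_exponential(years, counts):
    """Fit log(count) = a + b * (year - base) by ordinary least squares."""
    base = years[0]
    ts = [y - base for y in years]          # years since the first observation
    logs = [math.log(c) for c in counts]    # exponential growth is linear in logs
    n = len(ts)
    mean_t = sum(ts) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(ts, logs))
    b = sxy / sxx
    a = mean_l - b * mean_t
    return base, a, b


def predict(model, year):
    """Predicted count for a given year under the fitted exponential trend."""
    base, a, b = model
    return math.exp(a + b * (year - base))


# Placeholder counts for illustration only (NOT the actual concentrator numbers).
years = [2005, 2006, 2007, 2008, 2009, 2010]
counts = [10, 15, 22, 34, 50, 76]

model = fit_exponential(years, counts)
for year in (2011, 2012, 2013):
    print(year, round(predict(model, year)))
```

Fitting on the log scale keeps the arithmetic simple, but it weights relative rather than absolute errors, which may or may not be appropriate for a given data set.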

Interactive MCMC visualizations project

by Sergiy Nesterko on September 29th, 2011

I have recently become actively involved in a Markov Chain Monte Carlo visualization project with Marc Lipsitch and Miguel Hernán from the Department of Epidemiology in the Harvard School of Public Health. The work is done jointly with Sarah Cobey. The idea is to create a set of interactive visualizations to comprehensively describe and explain the concepts of how MCMC and its variants operate. Figure 1 shows a screenshot of the first sketch of an interactive visualization of a trace plot.

Figure 1: a screenshot of the first version of an interactive trace plot of an MCMC chain.

I am very excited about the project, both from the technical perspective of making interactive visualizations work with a minimum amount of code, and because I think that the idea of conceptualizing MCMC through visualization is innovative and will be very useful for those trying to understand the process better.

The tools we are using in development are HTML, CSS, JavaScript and d3.js.
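
For readers unfamiliar with what a trace plot displays, here is a minimal sketch of a chain that could feed one. It is written in Python rather than the JavaScript/d3.js stack named above, and everything in it is an assumption made for illustration: the standard normal target, the random-walk Metropolis sampler, and the names `log_target` and `metropolis` are not taken from the project.

```python
import math
import random


def log_target(x):
    """Log-density of a standard normal, up to a constant (assumed target)."""
    return -0.5 * x * x


def metropolis(n_steps, start=0.0, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler; returns (trace, acceptance rate)."""
    rng = random.Random(seed)
    x = start
    trace = [x]
    accepted = 0
    for _ in range(n_steps):
        # Propose a symmetric Gaussian move around the current state.
        proposal = x + rng.gauss(0.0, step_size)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        trace.append(x)  # a rejected proposal repeats the current state
    return trace, accepted / n_steps


# Start far from the mode so the early part of the trace shows burn-in.
trace, rate = metropolis(5000, start=5.0)
second_half = trace[len(trace) // 2:]
print("acceptance rate:", round(rate, 2))
print("mean of second half:", round(sum(second_half) / len(second_half), 2))
```

A trace plot is simply `trace` drawn against the iteration index; the flat stretches correspond to rejected proposals.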

JSM2011, and a final stretch at RDS

by Sergiy Nesterko on August 18th, 2011

The Joint Statistical Meetings conference took place in Miami Beach from July 30 to August 5. It went very well, and the definite highlight was the keynote lecture by Sir David Cox. Among the other sessions, the following stand out:

  1. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics by Georg M. Goerg, a student at CMU Stat. The talk was very enjoyable, and the ideas it conveyed were crisp and exciting. The main one is that zero-mean time series can be thought of as histograms by representing them as frequency distributions, which allows for an elegant non-parametric classification approach based on minimizing the KL divergence between observed and simulated frequency histograms.
  2. Large Scale Data at Facebook by Eric Sun from Facebook. Though not groundbreaking, the talk was exciting as it described the work environment at Facebook and the approach taken to getting signals out of massive data. Mostly, curious facts were presented from analyzing the frequencies of word occurrences in user status updates, with the interesting part being the analysis framework developed to do that.
  3. Jointly Modeling Homophily in Networks and Recruitment Patterns in Respondent-Driven Sampling of Networks by my advisor Joe Blitzstein, about our most recent research on model-based estimation for Respondent-Driven Sampling (RDS). The approach we are developing is expected to have several very attractive features in comparison to current estimation techniques, and is designed for the case of homophily of varying degree. An example is illustrated in Figure 1.

    Figure 1: An example of homophily, with the network plotted over the histogram of the homophily inducing quantity (left), and resulting (normalized) vertex degrees plotted over the same histogram (right).

    We hope to finish the relevant paper soon and open the approach to extensions by the research community.
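
The idea from the first talk in the list above, treating a zero-mean series as a frequency histogram and comparing histograms by KL divergence, can be illustrated roughly as follows. This is one possible reading and not the algorithm from the talk: it assumes the "frequency distribution" is a periodogram normalized to sum to one, uses a plain nearest-reference rule instead of the EM algorithm, and all function names are hypothetical.

```python
import cmath
import math


def normalized_periodogram(series):
    """Power at each positive Fourier frequency, scaled to sum to 1.

    Assumes the series is not constant (otherwise the total power is zero).
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]   # enforce the zero-mean assumption
    power = []
    for k in range(1, n // 2 + 1):
        coef = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(centered))
        power.append(abs(coef) ** 2)
    total = sum(power)
    return [p / total for p in power]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms; eps guards against empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def classify(series, references):
    """Label of the reference series whose spectrum is closest in KL divergence."""
    p = normalized_periodogram(series)
    return min(references,
               key=lambda name: kl_divergence(p, normalized_periodogram(references[name])))


n = 64


def sine(cycles):
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]


references = {"slow": sine(3), "fast": sine(10)}
# Mostly a 3-cycle signal with a small 5-cycle component mixed in.
query = [a + 0.2 * b for a, b in zip(sine(3), sine(5))]
print(classify(query, references))
```

The direct O(n^2) transform keeps the sketch dependency-free; an FFT would be the natural choice for longer series.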

During the conference, I also had a chance to finish making a dynamic 3D visualization of a constrained optimization algorithm I developed for In4mation Insights, which is exciting. As for Miami Beach itself, it is a great place to go out and enjoy the good food, sun and beach. JSM2012 will be held in San Diego.

I created the visualization in this post using Processing.

Optimization, experiment design, and Sir David Cox

by Sergiy Nesterko on August 3rd, 2011

For almost a year I have been involved in a project to build a global marketing mix optimization solution for a large consumer packaged goods company. Conceptually, the problem is simple: given a fitted model of a company's revenue as a function of promotion campaigns for its products, and using the past year's allocation of promotion campaigns as a starting point, find a revenue-maximizing scenario subject to promotion expenditure constraints.
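
To make the problem statement concrete, here is a toy sketch of it. Nothing in it comes from the actual project: the square-root revenue model, the single total-budget constraint, the pairwise reallocation search, and the names `revenue` and `reallocate` are all assumptions chosen to keep the example small.

```python
import math


def revenue(spend, coef):
    """Toy fitted model (assumed): diminishing returns, coef_i * sqrt(spend_i)."""
    return sum(c * math.sqrt(s) for c, s in zip(coef, spend))


def reallocate(spend, coef, delta=1.0, max_iter=100000):
    """Local search that shifts `delta` of spend between campaigns.

    Total expenditure never changes, so a budget equal to last year's
    total is respected at every step.
    """
    spend = list(spend)                      # start from last year's scenario
    idx = range(len(spend))
    for _ in range(max_iter):
        # Revenue gained by adding delta to each campaign.
        gains = [c * (math.sqrt(s + delta) - math.sqrt(s))
                 for c, s in zip(coef, spend)]
        # Revenue lost by removing delta from each campaign (if affordable).
        losses = [c * (math.sqrt(s) - math.sqrt(s - delta)) if s >= delta
                  else float("inf")
                  for c, s in zip(coef, spend)]
        best = max(idx, key=gains.__getitem__)
        worst = min(idx, key=losses.__getitem__)
        if gains[best] <= losses[worst]:
            break                            # no transfer improves revenue
        spend[best] += delta
        spend[worst] -= delta
    return spend


last_year = [100.0, 100.0]                   # hypothetical spend per campaign
coef = [1.0, 3.0]                            # hypothetical response coefficients
plan = reallocate(last_year, coef)
print("spend:", plan)
print("revenue before:", round(revenue(last_year, coef), 2))
print("revenue after:", round(revenue(plan, coef), 2))
```

With a concave toy model like this one, moving spend from the campaign with the smallest marginal loss to the one with the largest marginal gain is enough; a real fitted model with interactions and many constraints would need a more careful method.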

Figure 1: A visualization of a step in a solution of an optimization problem. To see the full dynamic visualization, go to

The problem becomes more interesting when we go into details. Read the rest of this entry »

, a new project

by Sergiy Nesterko on July 12th, 2011

Recently I purchased the domain and created an interactive logo/visualization for Theory Information Analysis, a screenshot of which is presented above. Theory is a new project that I would like to represent applied real-world work, including quantitative consulting and applied research. Read the rest of this entry »

Dynamic visualization, paper supplement 1

by Sergiy Nesterko on May 28th, 2011

Read the rest of this entry »

Dynamic visualization, paper supplement 2

by Sergiy Nesterko on May 28th, 2011

Read the rest of this entry »

Data science term in The Economist

by Sergiy Nesterko on May 19th, 2011

It seems that there is no stopping now: the term data science appears prominently in the headline article of the current issue of The Economist.

Compared with the rest of America, Silicon Valley feels like a boomtown. Corporate chefs are in demand again, office rents are soaring and the pay being offered to talented folk in fashionable fields like data science is reaching Hollywood levels. And no wonder, given the prices now being put on web companies.

It is indeed quite misleading that the term has the word science in it, as it implies an established field, while in fact the science of data is statistics. I wrote a post on the subject earlier in an attempt to single out what it is that distinguishes data science from statistics. That aside, however, the article is supportive of the rise in demand for our profession, which is good news for the specialists. Hopefully, the tech bubble mentioned there won't be inflated further by people who misuse the data science term.