The gap between academia and current industry practices in data analysis

by Sergiy Nesterko on March 25th, 2012

The demand for specialists who can extract meaningful insights from data is increasing. This is good news for statisticians, since statistics is, among other things, the science of extracting signal from data. The trend is discussed in articles such as this January article in Forbes, and in the McKinsey Global Institute report published in May last year, an excerpt from which is given below:

There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

This sounds encouraging for current students in fields such as statistics, who are looking to enter a hot job market after graduation. However, are they prepared for what industry jobs need them to do?

One piece of news that hasn't been covered in the press yet is that the methods and data-related problems in industry are often different from those described in the scientific literature. By different I mean either scientifically invalid, or scientifically valid and cutting edge.

An example of this phenomenon is the so-called Technical Analysis of financial data, which is often used by algorithmic trading groups to devise computer-based trading strategies. Technical analysis is a term that people came up with to describe a set of methods that are often useful, but whose validity is questionable from a scientific perspective. Quantitative traders have been employing this type of analysis for a long time without knowing whether it is valid.

Another example is a project I worked on, described in this post. The task was to create an algorithm for optimizing annual marketing campaigns for a large consumer packaged goods company (over $6 billion in sales), with the goal of a 3-5% revenue increase without increasing expenditure. Essentially, this was an exercise in Response Surface methods with dimensionality as high as 327,600,000. There are no scientific papers in the field that consider problems of such high dimensionality. And yet companies are interested in such projects, even though the methods for solving them are not scientifically verified (we worked hard to justify the validity of our approach for the project).

Recently I received an email inviting quantitatively oriented PhDs to apply for a summer fellowship in California to work on data science projects. Here is a quote from the email:

The Insight Data Science Fellows Program is a new post-doctoral training fellowship designed to bridge the gap between academia and a career in data science.

Further, here is what is stated on the website of the organization sponsoring the program:

Bridging the gap between academia and data science.

As with algorithmic trading about 15 years ago, the use of sometimes scientifically questionable data analysis techniques is driven by the increased demand for insights from quantitative information. In the world of quantitative finance such approaches are called Technical Analysis; in the current data boom they are named Data Science.

When using the term, one should keep in mind that while the methods employed by inadequately trained "data scientists" may be scientifically valid, they may well not be. There is an inherent danger in calling something that encompasses incorrect methods a sort of "science", as this instills a perception of a field that is well-established and trustworthy. Yet the term is only a couple of years old. In my opinion, a more accurate one would be "current data analysis practices employed in industry".

The way we name the phenomenon does not change what it is. The fact is that there is a lot of data, and a lot of problems in industry that often go beyond what has been seen or addressed in academia. This is an exciting time for statisticians.

