The data science puzzle

by Sergiy Nesterko on April 11th, 2011

Over the past few years, I have heard several times that the demand for quantitatively oriented, data-literate professionals is growing. Clearly, this is good news for statisticians, as statistics is central to the process of extracting a meaningful and actionable signal from data. The terms data science and data scientist have accompanied many of the related articles. So I decided to do some research and look for evidence of an increase in demand for data analysis. My goal has been to understand the peculiarities of how our profession is perceived in the community, and to attempt to clarify the meaning of the new term data science.

Here are a few articles that highlight how the trend is perceived in the community:

  1. “For Today’s Graduate, Just One Word: Statistics” is an article in the New York Times from August 2009 featuring the now well-known quote from Hal Varian, the chief economist of Google:

    … the sexy job of the next 10 years will be statisticians.

    The article gives several examples of problems that statisticians address, such as devising models to improve a web crawler’s performance in terms of network usage, improving the Netflix recommendation system, and analyzing sensor and location data to optimize food shipments. More so than before, companies are seeking to improve their products and optimize performance by leveraging the ever-increasing amounts of recorded data with the help of data analysis experts. However, what is it that makes the solution process move forward?

  2. “The Rise of Data Science”, written by David Champagne (CTO of Revolution Analytics) in January 2011, describes the phenomenon in more detail, and also starts the discussion of which characteristics help statisticians (or data scientists) be most effective when working on data analysis projects. Figure 1 gives a graphic by Drew Conway (a PhD student in Politics at NYU) that summarizes the relevant message of the article.

    Figure 1: Skill sets for data scientists, by Drew Conway

    The idea of the article is that while there is a boom in the amount of recorded data, it takes more than simple knowledge of statistical techniques to analyze them successfully. There are other technical components to the process, such as data cleaning, algorithm design, programming language selection, parallelization of computation for large datasets, and goodness-of-fit testing. There are also less technical components, such as model or method selection and formulation, and the interpretation of results into actionable conclusions. The former and the latter are inseparable, but it is the latter that makes our work most valuable. So how much of that Substantive Expertise is really necessary?

This is where the authors of the articles I have found start being vague. Everyone seems to know that a statistician’s knowledge of the substantive subject matter is important, or even crucial, for a successful data analysis project, but no one can pin down what exactly that means, or how much would be enough. This is why, in my opinion, the term data science needs clarification.

There are many aspects on the technical side. For example, there are different flavours of ML, MCMC and HMC algorithms and subtleties thereof, current developments in modeling and machine learning techniques for different purposes and data types, strategies for dealing with massive datasets, missing data problems, causal inference issues and so on. All these components interact and have a unique impact on every problem, and a statistician must be comfortable navigating these issues. Moreover, an incorrect decision when addressing these technical aspects of a problem may lead to completely bogus results. As statisticians, we often concentrate on the validity of the approach, but put less emphasis on making the results interpretable and comprehensible to the people we work with.

My point of view is that the skills that distinguish a successful project are genuine interest in the subject matter and the ability to communicate effectively. It is in the dialogue with collaborators or clients that the approach may evolve and be refined into a superior one. This dialogue is what allows a statistician to gain sufficient substantive expertise on the timescale of a project, and to broaden it in the course of subsequent engagements. Listening to the people we work with, understanding their perception of the problem, translating their intuition into problem-solving machinery, and then interpreting the results back to them is an art. I believe it is the final and most important component of what the term data science stands for.

The idea that communication and an adequate approach to the subject matter are important is not new. For example, graduate students in our department were treated to this wisdom by faculty members just recently during the departmental retreat, which was designed to help students navigate their future career prospects. The faculty discussing the relevant features of the profession were Prof. David Harrington (Professor in the Harvard Biostat department and Director of the Biostatistics Research Program in the Dana-Farber/Harvard Cancer Center), Prof. Tirthankar Dasgupta (an Assistant Professor in the Harvard Statistics department), and my dear advisor Prof. Joseph Blitzstein, who spoke about the future of the field.

And the future looks bright. But what will make it successful for statisticians is an impeccable work ethic, the ability to communicate effectively, and excitement about and genuine interest in the subject matter and impact of the problem.

2 Responses to “The data science puzzle”

  1. Andy says:

    This is what Sergiy works on when he should be listening to colloquium speakers.

  2. […] in it as it implies an established field, while in fact the science of data is statistics. I wrote a post on the subject earlier in an attempt to single out what is it that distinguishes data science from […]
