Management consulting view on big data

by Sergiy Nesterko on June 25th, 2012


The amount of data recorded and analyzed in business, medicine, education and public policy is growing at a rapid rate, to the extent that it is hard to keep pace with it. I am particularly interested in how, and whether, the leaders of organizations and government bodies are responding to the phenomenon and extracting value from it.

The point of view of top management consulting firms, who are also very interested in the trend, is especially telling. For example, the McKinsey Global Institute published a report on big data a year ago. More recently, about a week ago, a recording of a Q&A session on big data with BCG senior partner Philip Evans was posted on The Economist's Schumpeter blog.

Specifically, Mr. Evans alluded to how the emergence of “big data” may change the course of companies' strategic development. The prevailing approach has been vertical integration, whereby companies aim to acquire or develop more entities along the supply chain (e.g., an electric power supplier aims to operate not only power plants but also raw material extraction, power grids, etc.) to reduce costs. According to Mr. Evans, in the “big data” era we will see more horizontal integration, where instead of operating several entities along the supply chain, a company focuses on one and grows by scaling the product up to many markets. As per Mr. Evans, Google is an example of this approach.

Additionally, Mr. Evans stated that companies will split into two camps: one where there exists a well-defined, serializable product or service around which a company can scale up, such as “inferring patterns in large amounts of data”, and another where more unique individual skills are needed, such as entrepreneurship and creativity.

I found the interview very interesting. We do see successful companies employing horizontal integration (Google, Apple, Amazon): they focus on a few important products or services and scale them up to multiple markets. Does this have anything to do with “big data”? It certainly does, as horizontal integration is also employed by big players in the big data realm, such as EMC. However, horizontal integration is more a product of the Internet and the evolution of IT, as is the “big data” phenomenon itself.

Secondly, I have to disagree with the statement that inferring patterns in large amounts of data is (easily) serializable. This task is an open scientific problem and a subject of active current research. The only solutions that exist at the moment belong to the second camp as defined by Mr. Evans. Designing an algorithm to extract a specific answer to a specific question from a dataset in a given format needs to be approached individually by qualified specialists such as statisticians. Such a project involves creativity and a substantial amount of intellectual effort. Once an approach is developed, it can be reused on the specific dataset it was designed for (say, when more observations have been collected), but not on other datasets; otherwise the results may be unreliable.

More broadly, what does the phenomenon mean for companies? Horizontal integration follows from the ability to quickly scale up products and services, which itself follows from the development of the Internet and IT, and the same is true of big data. So, what is the message of big data by itself?

Let us not make the matter overly complicated. Buried in the terabytes of “big data” is the ability of companies to be better informed about the market around them and their own internal operations: to optimize activities better, to keep better track of what the competition is up to, to price their products better than the competition, and so on. “Being better informed” is a value-generating asset, and companies with large amounts of repeated features (many instances of the same product or service sold, large numbers of employees, many visitors seeing their ads on the Internet) need to realize this. Those that realize it first, and those that employ the better methods of extracting interpretable information from the relevant data sources, will benefit from being better informed than others.

I couldn’t be more excited that companies, governments, educational institutions and public policy agencies are beginning to realize the value of being better informed by patterns inferred from data, be they massive, big, or not so big. The fact that top management consultants are talking about it means that top executives are showing interest in it as well.

The gap between academia and current industry practices in data analysis

by Sergiy Nesterko on March 25th, 2012

The demand for specialists who can extract meaningful insights from data is increasing, which is good for statisticians, as statistics is, among other things, the science of extracting signal from data. This is discussed in articles such as this January article in Forbes, and also in the McKinsey Global Institute report published in May last year, an excerpt from which is given below:

There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

This sounds encouraging for current students in fields such as statistics who are looking to enter a hot job market after graduation. However, are they prepared for what industry jobs will need them to do?

One thing that hasn’t been covered in the press yet is that the methods and data-related problems encountered in industry are often different from those described in the body of scientific publications. By different I mean either scientifically invalid, or scientifically valid and cutting edge.

An example of this phenomenon is the so-called Technical Analysis of financial data, which is often used by algorithmic trading groups to devise computer-based trading strategies. Technical Analysis is a term coined to describe a set of methods that are often useful, yet whose validity is questionable from a scientific perspective. Quantitative traders have been employing this type of analysis for a long time without knowing whether it is valid.

Another example is a project I worked on, described in this post, which was to create an algorithm for optimizing annual marketing campaigns for a large consumer packaged goods company (over $6 billion in sales) to achieve a 3-5% revenue increase without increasing expenditure. Essentially, this was an exercise in Response Surface methods with dimensionality as high as 327,600,000. There are no scientific papers in the field that consider problems of such high dimensionality. And yet companies are interested in such projects, even though the methods for solving them are not scientifically verified (we worked hard to justify the validity of our approach for the project).
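To make the term concrete, here is a toy sketch of the basic idea behind response surface methods: approximate an unknown revenue function by a quadratic surface fitted to observed spend scenarios. The data, the handful of dimensions, and the quadratic form are all invented for illustration and have nothing to do with the actual project.

    import numpy as np

    rng = np.random.default_rng(1)
    n_campaigns, n_obs = 3, 150

    # Invented historical scenarios: spend on each campaign and the revenue observed.
    X = rng.uniform(0, 10, size=(n_obs, n_campaigns))
    revenue = (5 + X @ np.array([2.0, 3.0, 1.0])
               - 0.2 * (X ** 2).sum(axis=1)
               + rng.normal(0, 0.5, n_obs))

    # Second-order response surface: intercept, linear terms, and squared terms.
    design = np.column_stack([np.ones(n_obs), X, X ** 2])
    coef, *_ = np.linalg.lstsq(design, revenue, rcond=None)

    def surface(x):
        """Predicted revenue at spend vector x under the fitted quadratic surface."""
        x = np.asarray(x, dtype=float)
        return coef[0] + coef[1:1 + n_campaigns] @ x + coef[1 + n_campaigns:] @ x ** 2

    print(surface([5.0, 5.0, 5.0]))

The real project differed in scale and rigor, but the pattern of fitting a surrogate surface and then optimizing over it is the same.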

Recently I received an email inviting quantitatively oriented PhD’s to apply for a summer fellowship in California to work on data science projects. Here is a quote from the email:

The Insight Data Science Fellows Program is a new post-doctoral training fellowship designed to bridge the gap between academia and a career in data science.

Further, here is what is stated on the website of the organization sponsoring the program:

INSIGHT DATA SCIENCE
FELLOWS PROGRAM
Bridging the gap between academia and data science.

As with algorithmic trading about 15 years ago, the use of sometimes scientifically questionable data analysis techniques is driven by the increased demand for insights from quantitative information. Such approaches are called Technical Analysis in the world of quantitative finance; during the current data boom they are being named Data Science.

When using the term, one should keep in mind that while the methods employed by inadequately trained “data scientists” may be scientifically valid, they may well not be. There is an inherent danger in calling something that encompasses incorrect methods a kind of “science”, as this instills the perception of a field that is well-established and trustworthy. Yet the term is only about a couple of years old. In my opinion, a more accurate one would be “current data analysis practices employed in industry”.

The way we name the phenomenon does not change what it is: a lot of data and a lot of problems in industry that often go beyond what has been seen or addressed in academia. This is an exciting time for statisticians.

How to convert d3.js SVG to PDF

by Sergiy Nesterko on January 30th, 2012

Consider the following plot:


The image above is an SVG created using d3.js. I am going to use it in a paper, so I need to convert it to PDF. As it happens, how to do this is not common knowledge, as mentioned in this blog (which is, by the way, a good and balanced source of information relevant to our profession). Here is a short tutorial on how to do it: Read the rest of this entry »
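The full steps are behind the cut; as a rough illustration of one possible route (not necessarily the one described in the tutorial), once the rendered SVG markup has been saved to a file, say chart.svg, a library such as CairoSVG can convert it to a vector PDF:

    import cairosvg

    # Convert the saved d3-generated SVG to a PDF, preserving vector graphics.
    cairosvg.svg2pdf(url="chart.svg", write_to="chart.pdf")

Command-line tools such as Inkscape or rsvg-convert offer the same kind of conversion.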

Informative versus flashy visualizations, and growth in Harvard Stat concentration enrollment

by Sergiy Nesterko on December 4th, 2011

Some time ago my advisor Joseph Blitzstein asked me to create a visualization of the number of Harvard Statistics concentrators (undergraduate students who major in Statistics). The picture would be used by the chair of the department to illustrate the growth of the program for university officials, so I decided to make it look pretty. The first form that came to mind was showing the enrollment growth over the years as a bar plot.

Starting in 2005, the numbers follow exponential growth, which is a remarkable achievement for the department. We then decided to follow the trend and extrapolate by adding predicted enrollment numbers for 2011, 2012, and 2013. At the time, there was no data for 2011. Read the rest of this entry »
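A rough sketch of that extrapolation step, with made-up enrollment numbers standing in for the department's actual data: fit the exponential trend on a log scale and predict the next few years.

    import numpy as np

    years = np.array([2005, 2006, 2007, 2008, 2009, 2010])
    counts = np.array([8, 11, 15, 22, 30, 43])  # hypothetical enrollment numbers

    # Exponential growth is linear on the log scale; fit log(counts) by least squares.
    slope, intercept = np.polyfit(years - years[0], np.log(counts), 1)

    for year in (2011, 2012, 2013):
        predicted = np.exp(intercept + slope * (year - years[0]))
        print(year, int(round(predicted)))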

Interactive MCMC visualizations project

by Sergiy Nesterko on September 29th, 2011

Recently I became actively involved in a Markov Chain Monte Carlo visualization project with Marc Lipsitch and Miguel Hernán from the Department of Epidemiology at the Harvard School of Public Health. The work is done jointly with Sarah Cobey. The idea is to create a set of interactive visualizations that comprehensively describe and explain how MCMC and its variants operate. Figure 1 shows a screenshot of the first sketch of an interactive visualization of a trace plot.


Figure 1: a screenshot of the first version of an interactive trace plot of an MCMC chain.


I am very excited about the project, both from the technical perspective of making interactive visualizations work with a minimal amount of code, and because I think that the idea of explaining MCMC through visualization is innovative and will be very useful for those trying to understand the process better.

The tools we are using in development are HTML, CSS, Javascript and d3.js.
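For readers unfamiliar with what a trace plot shows, the sketch below generates the kind of sequence it visualizes: a tiny random-walk Metropolis sampler targeting a standard normal, written in Python purely for illustration (the project itself is built with the web tools listed above).

    import math
    import random

    def metropolis_trace(n_iter=1000, step=1.0, seed=42):
        """Random-walk Metropolis chain targeting a standard normal distribution."""
        random.seed(seed)
        x = 0.0
        trace = []
        for _ in range(n_iter):
            proposal = x + random.gauss(0.0, step)
            # Accept with probability min(1, pi(proposal) / pi(x)) for pi = N(0, 1).
            log_ratio = -0.5 * (proposal ** 2 - x ** 2)
            if math.log(random.random()) < log_ratio:
                x = proposal
            trace.append(x)
        return trace

    print(metropolis_trace(n_iter=5))

A trace plot is simply this sequence of draws plotted against iteration number; the interactive version lets the viewer watch it evolve.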

JSM2011, and a final stretch at RDS

by Sergiy Nesterko on August 18th, 2011

The Joint Statistical Meetings conference took place in Miami Beach on July 30-August 5. It went very well, and the definite highlight was the keynote lecture by Sir David Cox. Among the other sessions, the following stood out:

  1. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics by Georg M. Goerg, a student at CMU Stat. The talk was very enjoyable and the ideas it conveyed were crisp and exciting. The main one was that zero-mean time series can be treated as histograms by representing them as frequency distributions, which allows for an elegant non-parametric classification approach that minimizes the KL divergence between observed and simulated frequency histograms (see the sketch after this list).
  2. Large Scale Data at Facebook by Eric Sun from Facebook. Though not groundbreaking, the talk was exciting in its description of the work environment at Facebook and the approach taken to getting signals out of massive data. Most of it presented curious facts from analyzing the frequencies of word occurrences in user status updates, with the interesting part being the analysis framework developed to do that.
  3. Jointly Modeling Homophily in Networks and Recruitment Patterns in Respondent-Driven Sampling of Networks by my advisor Joe Blitzstein, about our most recent research on model-based estimation for Respondent-Driven Sampling (RDS). The approach we are developing promises several very attractive features compared to current estimation techniques and is designed for the case of homophily of varying degree. An example is illustrated in Figure 1.

    Figure 1: An example of homophily, with the network plotted over the histogram of the homophily inducing quantity (left), and resulting (normalized) vertex degrees plotted over the same histogram (right).

    We hope to finish the relevant paper soon and open the approach to extensions by the research community.
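To make the classification criterion from the first talk concrete, here is a minimal sketch of the KL divergence between two normalized frequency histograms; the histograms are invented, and the frequency-domain EM machinery of the talk is not reproduced here.

    import numpy as np
    from scipy.stats import entropy

    observed = np.array([12.0, 30.0, 25.0, 18.0, 15.0])   # toy observed frequency histogram
    simulated = np.array([10.0, 28.0, 30.0, 20.0, 12.0])  # toy simulated frequency histogram

    # entropy(p, q) returns the Kullback-Leibler divergence KL(p || q);
    # both inputs are normalized internally to sum to one.
    print(entropy(observed, simulated))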

During the conference, I also had a chance to finish making a dynamic 3D visualization of a constrained optimization algorithm I developed for In4mation Insights, which is exciting. As for Miami Beach itself, it is a great place to go out and enjoy the good food, sun and beach. JSM2012 will be held in San Diego.

I created the visualization in this post using Processing.

Optimization, experiment design, and Sir David Cox

by Sergiy Nesterko on August 3rd, 2011

It has been almost a year since I became involved in a global marketing mix optimization project for a large consumer packaged goods company. Conceptually, the problem is simple: given a fitted model of the company’s revenue as a function of promotion campaigns for its products, and using the past year’s campaign allocation scenario as a starting point, find a revenue-maximizing scenario subject to promotion expenditure constraints.
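A minimal conceptual sketch of that problem statement (not the actual model or solution developed for the client): maximize a stand-in fitted revenue function over campaign spend, starting from last year's allocation and keeping total expenditure fixed.

    import numpy as np
    from scipy.optimize import minimize

    def fitted_revenue(x):
        """Stand-in for the fitted revenue model: made-up diminishing returns per campaign."""
        return float(np.sum(np.array([4.0, 6.0, 3.0, 5.0]) * np.sqrt(np.maximum(x, 0.0))))

    last_year = np.array([10.0, 5.0, 8.0, 7.0])  # last year's allocation (starting scenario)
    budget = last_year.sum()                      # promotion expenditure constraint

    result = minimize(
        lambda x: -fitted_revenue(x),             # maximize revenue
        x0=last_year,
        bounds=[(0.0, None)] * len(last_year),    # spend cannot be negative
        constraints=[{"type": "ineq", "fun": lambda x: budget - x.sum()}],
    )
    print("revenue at last year's allocation:", round(fitted_revenue(last_year), 2))
    print("revenue at optimized allocation:  ", round(fitted_revenue(result.x), 2))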

Figure 1: A visualization of a step in a solution of an optimization problem. To see the full dynamic visualization, go to theory.info.

The problem becomes more interesting when we go into details. Read the rest of this entry »

theory.info, a new project

by Sergiy Nesterko on July 12th, 2011


Recently I purchased the domain and created an interactive logo/visualization for Theory Information Analysis, a screenshot of which is presented above. Theory is a new project that I intend to represent my applied, real-world work, including quantitative consulting and applied research. Read the rest of this entry »

Dynamic visualization, paper supplement 1

by Sergiy Nesterko on May 28th, 2011

Read the rest of this entry »

Dynamic visualization, paper supplement 2

by Sergiy Nesterko on May 28th, 2011

Read the rest of this entry »