Archive for November, 2012

Democratization of data science: why this is inefficient

Sunday, November 4th, 2012

The use of data in industry is increasing by the hour, and so does investment in Big Data. Gartner, an information technology research and advisory firm, says the spending on big data will be $28 billion in 2012 alone. This is estimated to trigger a domino effect of $232 billion in spending through the next 5 years.

The business world is evolving rapidly to meet the demands of data-hungry executives. On the data storage front, for example, new technology is quickly developed under the Hadoop umbrella. On the data analysis front, there are startups that tackle and productize related problems such as quid, Healthrageous, Bidgely, and many others. What drives this innovation in analyzing data? What allows so many companies to claim that their products are credible?

Not surprisingly, the demand for analytic talent has been growing, with McKinsey calling Big Data the next frontier of innovation. So, let's make this clear - businesses need specialists to innovate, to generate ideas and algorithms that would extract value from data.

Who are those specialists, where do they come from? With a shortage of up to 190,000 data specialists projected for 2018, there is a new trend emerging for "the democratization of data science" which means bringing skills to meaningfully analyze data to more people:

The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it’s new education platforms (e.g., Coursera and Udacity) teaching students fundamental skills in everything from basic statistics to natural language processing and machine learning.
...
Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.

This quote is optimistic at best. Where is the guarantee that the product developed by a "data scientist" with a couple of classes worth of training is going to work for the entire market? In academic statistics and machine learning programs, students spend several years learning the proper existing methods, how to design new ones, and prove their general validity.

When people without adequate training make analytic products and offer them to many customers, such verification of the product is crucial. Otherwise, the customer may soon discover that the product doesn't work well enough or not at all, thus bringing down the ROI on the product. The customer will then go back and invest in hiring talent and designing solutions that would actually work for the case of this customer. If all customers have to do this, the whole vehicle with the democratized data science becomes significantly inefficient.

Behind each data analysis decision there must be rigorous scientific justification. For example, consider a very simple Binomial statistical model. We can think about customers visiting a website through a Google ad. Each customer is encoded as 1 if he or she ends up purchasing something on the website, and zero otherwise. The question of interest is, what proportion of customers coming through Google ads ends up buying on the website?

Below is a visualization of the log-likelihood and inverse Fisher information functions. Many inappropriately trained data specialists would not be able to interpret these curves correctly even for the simple model like this. But what about the complex algorithmic solutions they are required to build on a daily basis and roll out on the market?

We can simply take the proportion of customers who bought something, that will be our best guess about the underlying percentage of buying Google ad website visitors. This is not just common sense, the average can be proved to be the best estimator theoretically.

The uncertainty about our estimate can also be quantified by the value of the inverse observed Fisher Information function (picture, left) at the estimated value of p. The three curves correspond to the different numbers of customers who visited our website. The more customers we get, the lower our uncertainty about the proportion of the buying customers is. Try increasing the value of n. You will see that the corresponding curve goes down - our uncertainty about the estimated proportion vanishes.

This is the kind of theory that we need specialists who develop algorithmic products to be equipped with. It requires an investment in their proper education first. If we skip the proper education step, we risk lowering the usefulness and practicality of the products such data scientists design.