## Posts Tagged ‘algorithm design’

### Adaptive and social media in MOOCs: the data-driven and the people-driven

Thursday, May 23rd, 2013

In light of my new position as a HarvardX Research Fellow, I have been thinking about the role of data in improving online learning experiences (aka MOOCs) at edX. Can data tell us everything about the ideal learning experience of tomorrow? Can product developers at edX come up with the best version single-handedly? Or, maybe, the online students could also tell us what the ideal MOOC is?

First, let’s think about what the “ideal MOOC” could be. There is a broad consensus that an ideal online learning experience would yield the best “educational outcomes” for the students. For now, let’s think about the educational outcome as something that’s well approximated by the amount of learning. Specifically, this means that we want students to extract and internalize as much educational content from the interactive learning experience as possible. Here, educational content is information that is relevant to the substance of the class. For example, for a probability course, this would include information on how to use Bayes’ rule or the change of variables. For a Python programming class, this would include information on how to use Python modules and language syntax. For a class on interactive visualization, this could include (of course!) information on how to use d3js.

This is an important point. Educational content is information relevant to the substance of the class. We want the students to internalize as much of it as possible, make it their knowledge. How can we do that?

Let’s assume that the educational materials (lectures, homework, tests, examples) have already been prepared and we believe that they are good. How do we expose the materials to the students in the best possible way so that students learn the most, stay engaged, and more students complete the class?

Clearly, the setting of a MOOC is different from the setting of a standard classroom. One of the significant differences is the number of students – it’s massive. Depending on the course, the number of enrolled students can exceed 150 thousand – CS50x by David Malan on HarvardX is a great example. Do we want to expose every single student, no matter what country he/she is from, no matter what talents and aspirations he/she has, no matter how many peers he/she will study with, all to the same sequence of the material? Maybe, yes. And maybe, no.

The setting of MOOCs can be a wonderful platform for adaptive media – an algorithmic way of sequentially presenting content and interacting with the user in order to maximize the informational content that the user “internalizes”.

Adaptive media. It’s the characterizing trait of a computer as a medium – the ability to simulate responses, interact, predict, “act like a living being”. We can use it to model, predict, and synthesize the best way to serve content to users, algorithmically.
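To make the idea of serving content algorithmically a little more concrete, here is a minimal sketch in Python. It uses an epsilon-greedy rule, which is just one simple technique chosen for illustration and not something edX or HarvardX is known to use. The class name `EpsilonGreedySequencer`, the variant labels, and the use of a follow-up quiz as the measure of learning are all hypothetical.

```python
import random


class EpsilonGreedySequencer:
    """Hypothetical helper: picks which variant of a lesson to show next,
    based on how often students passed a follow-up quiz after each variant."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.variants = list(variants)
        self.epsilon = epsilon                        # share of purely exploratory picks
        self.rng = random.Random(seed)
        self.shown = {v: 0 for v in self.variants}    # times each variant was shown
        self.passed = {v: 0 for v in self.variants}   # follow-up quizzes passed

    def pass_rate(self, variant):
        shown = self.shown[variant]
        return self.passed[variant] / shown if shown else 0.0

    def choose(self):
        # Explore: occasionally show a random variant so the estimates keep improving.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.variants)
        # Exploit: otherwise show the variant with the best observed pass rate.
        return max(self.variants, key=self.pass_rate)

    def record(self, variant, passed_quiz):
        self.shown[variant] += 1
        self.passed[variant] += int(passed_quiz)


# Usage: log two observed outcomes, then ask which variant to show next.
sequencer = EpsilonGreedySequencer(["video", "text"], epsilon=0.1, seed=0)
sequencer.record("video", passed_quiz=False)
sequencer.record("text", passed_quiz=True)
next_variant = sequencer.choose()
```

A real adaptive system would need a much richer model of the student than a single pass rate per variant; the sketch only shows the feedback loop of presenting content, observing a response, and updating what is served next.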

Can we use adaptive media in MOOCs? The benefits are obvious – with hundreds of thousands of enrollees, it is impossible to adequately staff the course with enough qualified facilitators. Adaptive media could be used together with the teachers’ input and social media such as forums, social grading, and study groups. The purpose, instead of displaying personalized ads, would be to make sure each student learns as much as possible from the interactive learning experience, in his or her unique way. There could also be a multitude of positive extras – reduced dropout rate, higher engagement, higher enrollment for adaptive MOOCs.

Isn’t this interesting?

### Democratization of data science: why this is inefficient

Sunday, November 4th, 2012

The use of data in industry is increasing by the hour, and so is investment in Big Data. Gartner, an information technology research and advisory firm, says that spending on big data will be $28 billion in 2012 alone. This is estimated to trigger a domino effect of $232 billion in spending over the next 5 years.

The business world is evolving rapidly to meet the demands of data-hungry executives. On the data storage front, for example, new technology is quickly being developed under the Hadoop umbrella. On the data analysis front, there are startups such as Quid, Healthrageous, Bidgely, and many others that tackle and productize related problems. What drives this innovation in analyzing data? What allows so many companies to claim that their products are credible?

Not surprisingly, the demand for analytic talent has been growing, with McKinsey calling Big Data the next frontier of innovation. So, let’s make this clear – businesses need specialists to innovate, to generate ideas and algorithms that would extract value from data.

Who are those specialists, and where do they come from? With a shortage of up to 190,000 data specialists projected for 2018, a new trend is emerging toward “the democratization of data science”, which means bringing the skills to meaningfully analyze data to more people:

The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it’s new education platforms (e.g., Coursera and Udacity) teaching students fundamental skills in everything from basic statistics to natural language processing and machine learning.

Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.

This quote is optimistic at best. Where is the guarantee that a product developed by a “data scientist” with a couple of classes’ worth of training is going to work for the entire market? In academic statistics and machine learning programs, students spend several years learning the proper existing methods, how to design new ones, and how to prove their general validity.

When people without adequate training make analytic products and offer them to many customers, such verification of the product is crucial. Otherwise, the customer may soon discover that the product doesn’t work well enough, or doesn’t work at all, which brings down the ROI on the product. The customer will then go back and invest in hiring talent and designing solutions that actually work for that customer’s case. If all customers have to do this, the whole enterprise of democratized data science becomes highly inefficient.

Behind each data analysis decision there must be rigorous scientific justification. For example, consider a very simple Binomial statistical model. We can think about customers visiting a website through a Google ad. Each customer is encoded as 1 if he or she ends up purchasing something on the website, and zero otherwise. The question of interest is, what proportion of customers coming through Google ads ends up buying on the website?

Below is a visualization of the log-likelihood and inverse Fisher information functions. Many inadequately trained data specialists would not be able to interpret these curves correctly even for a simple model like this. But what about the complex algorithmic solutions they are required to build on a daily basis and roll out on the market?

We can simply take the proportion of customers who bought something; that will be our best guess about the underlying proportion p of Google ad visitors who buy. This is not just common sense: the sample proportion can be proved to be the maximum likelihood estimator of p, and it is unbiased.
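One way to see where this estimate comes from is to maximize the Binomial log-likelihood. The following is a short sketch; the notation is introduced here for illustration and assumes n independent visitors, with x_i equal to 1 if visitor i buys and 0 otherwise, and s the total number of buyers.

$$
\ell(p) = s \log p + (n - s) \log(1 - p), \qquad s = \sum_{i=1}^{n} x_i
$$

$$
\ell'(p) = \frac{s}{p} - \frac{n - s}{1 - p} = 0 \quad \Longrightarrow \quad \hat{p} = \frac{s}{n}
$$

So the value of p that makes the observed purchases most likely is exactly the observed proportion of buyers.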

The uncertainty about our estimate can also be quantified by the value of the inverse observed Fisher Information function (picture, left) at the estimated value of p. The three curves correspond to the different numbers of customers who visited our website. The more customers we get, the lower our uncertainty about the proportion of the buying customers is. Try increasing the value of n. You will see that the corresponding curve goes down – our uncertainty about the estimated proportion vanishes.
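The shrinking uncertainty can also be checked numerically. Below is a minimal Python sketch, assuming the standard closed form for this model: at the estimate, the inverse observed Fisher information equals p̂(1 − p̂)/n. The function name, the 10% purchase rate, and the three sample sizes are illustrative assumptions, not the values used in the visualization.

```python
import math


def inverse_observed_fisher_info(successes, n):
    """Inverse observed Fisher information for a Binomial proportion,
    evaluated at the estimate p_hat = successes / n.

    Equals p_hat * (1 - p_hat) / n, the usual variance estimate for p_hat.
    """
    p_hat = successes / n
    return p_hat * (1 - p_hat) / n


# Assumed purchase rate of 10%, for illustration only.
for n in (100, 1000, 10000):
    successes = round(0.1 * n)
    p_hat = successes / n
    variance = inverse_observed_fisher_info(successes, n)
    print(f"n={n:>6}  p_hat={p_hat:.3f}  variance={variance:.2e}  "
          f"std_err={math.sqrt(variance):.4f}")
```

With the same observed proportion, multiplying the number of visitors by ten divides the variance by ten, which is the downward shift between the curves.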

This is the kind of theory that specialists who develop algorithmic products need to be equipped with. It requires an investment in their proper education first. If we skip the proper education step, we risk lowering the usefulness and practicality of the products such data scientists design.

### Algorithms as products: lucrative, but what is the real value?

Friday, October 12th, 2012

Recently I attended a talk by Nate Silver (@fivethirtyeight), who runs a popular NYT election forecast blog, where he talked about how he uses algorithms to predict the results of the election given the information available on the day of the forecast. Nate didn’t go in-depth on how his algorithms work, though there were such questions from the audience. On the one hand, it makes sense. Why explain how the algorithms work when what matters is whether they predict the election correctly? Indeed, they did in 2008, calling 49 of 50 states correctly, as well as all 35 Senate races.

But on the other hand, if Nate Silver never publicly discloses how the algorithm works, how do we really know what it is based on, what weights it puts on surveys, how it accounts for all the biases, etc.? In science, algorithms are expected to be disclosed so that third parties can replicate them. Such an approach is not employed by Nate Silver, and it is understandable. His algorithm is a product: it gives him a job at NYT, prestige, and status. What would happen if anybody could replicate it?

The same non-disclosure strategy is employed by LinkedIn for its Talent Brand Index algorithm. The index is a new measure offered by LinkedIn of how attractive a company is to prospective and current employees.

The index will prove to be very lucrative for LinkedIn:

While there is likely to be a lot of quibbling about how the numbers are calculated, this product has the potential to make LinkedIn the “currency” by which corporations measure their professional recruitment efforts.

No wonder the company is trading at 23 X sales.

However, there is a key difference between LinkedIn’s Talent Brand Index and Nate Silver’s election forecast algorithms: it can never be checked whether the Talent Brand Index is right. Indeed, do we know how it is constructed? Here’s what I could find on that:

Last year, LinkedIn was home to over 15 billion interactions between professionals and companies. We cross-referenced our data with thousands of survey responses to pinpoint the specific activities that best indicate familiarity and interest in working for a company: connecting with employees, viewing employee profiles, visiting Company and Career Pages, and following companies. After crunching this data and normalizing for things like company size, we developed our top 100 global list. We then applied LinkedIn profile data to rank the most sought-after employers among professionals in five countries and four job functions.

The index cannot be re-created, both because there is no publicly available description of how it is calculated and because the LinkedIn data from which it is calculated is proprietary.

So, the Talent Brand Index is a black box: recruiters don’t know how it works. But they will pay to get access to it because the index provides employer rankings in terms of “people’s perception of working for them”. The companies will then work and invest heavily to improve their index ranking because the information is publicly available and will help them recruit better talent.

However, how are the employers going to find out the ROI of their efforts to improve their Talent Brand Index if they don’t know how it works? Not knowing how the index works makes this a hard task. Let me give an example.

For simplicity, let us assume that the Talent Brand Index gives a weight of 5 to the positive sentiment that current employees express about the company on their LinkedIn profiles, and a weight of 100 to the number of times those employees’ profiles are viewed on LinkedIn. Since the weights are hidden from the employers, they would first have to run a randomized experiment to determine the effect of a particular company policy on employee profile views, and then measure the impact on the index. This is costly and hard to implement: the employer has to devise a potentially index-improving policy that applies to only part of the company’s employees (the treatment group) and not the rest (the control group), randomly assign employees to those groups, measure the profile views in each, and so on.
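The experiment described above can be sketched as a small simulation in Python. Everything here is hypothetical: the index is modeled as a plain weighted sum using the made-up weights from the example, and the baseline views, the size of the policy’s effect, and the function names are assumptions for illustration. Nothing in it reflects how LinkedIn actually computes the index.

```python
import random
from statistics import mean

# Hypothetical hidden weights from the example; employers never see these.
W_SENTIMENT = 5
W_VIEWS = 100


def talent_index(sentiment, views):
    """Toy stand-in for the index: a weighted sum of two per-employee signals."""
    return W_SENTIMENT * sentiment + W_VIEWS * views


def run_experiment(n_employees=2000, true_lift=2.0, seed=0):
    """Randomly assign employees to treatment or control, then estimate the
    policy's effect on profile views and on the index by a difference in means."""
    rng = random.Random(seed)
    views_by_group = {"treatment": [], "control": []}
    index_by_group = {"treatment": [], "control": []}
    for _ in range(n_employees):
        sentiment = rng.uniform(0, 1)   # assumed sentiment score in [0, 1]
        views = rng.gauss(10, 3)        # assumed baseline profile views
        group = "treatment" if rng.random() < 0.5 else "control"  # coin flip
        if group == "treatment":
            views += true_lift          # assumed effect of the policy on views
        views_by_group[group].append(views)
        index_by_group[group].append(talent_index(sentiment, views))
    view_effect = mean(views_by_group["treatment"]) - mean(views_by_group["control"])
    index_effect = mean(index_by_group["treatment"]) - mean(index_by_group["control"])
    return view_effect, index_effect


view_effect, index_effect = run_experiment()
print(f"estimated effect on views: {view_effect:.2f}")
print(f"estimated effect on index: {index_effect:.2f}")
```

In the simulation the estimated index effect comes out at roughly 100 times the estimated effect on views, which is how the hidden weight would show up. Getting that ratio, however, took a clean random assignment over thousands of employees and per-employee access to the index, which is exactly the part that is hard to arrange in a real company.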

But in our example, LinkedIn gives a very large weight to the number of views of the employees’ profiles! How can the employers find that out?

Practically, the answer is – they cannot.

This means that while the Talent Brand Index is a lucrative product for LinkedIn, the real value it provides to companies is vague. It provides no information as to which areas of an employer’s HR policy need to be improved in order to increase the Talent Brand Index, or in what order of priority. As a result, the high-index companies will enjoy an increased influx of great talent, while the low-index companies will suffer a talent drain. This will reinforce the leaders’ positions and worsen the positions of the HR underdogs.

Coming back to the broader picture, there are algorithms, and there are algorithms. Nate Silver’s election prediction algorithm is in fact a valuable product to its users even though its details are largely unknown. This is because it can be checked for truthfulness. LinkedIn’s Talent Brand Index product will bring double-digit growth to the company due to the Big Data hype, but will it really be useful to its consumers in terms of helping them improve their hiring? The answer is not straightforward.

Algorithms as products should be designed with enough transparency to make them useful, or with a mechanism to externally verify them. Otherwise, their value to the customer is questionable.