Computer Programming as a Necessary Team Skill for Research-Based Production Data Science in Industry

Friday, September 4th, 2020

Here is an article I mainly wrote last year while working at Deloitte Canada. A lot of the credit goes to Ian J. Scott who was a pleasure to work with, and always inspired me to try to inspire myself to keep creating something new. The article came out as a rather trivial piece of writing. Many data scientists would think — how uninteresting of a topic — coding skills for data science. I completely agree, and admit that I don’t like coding myself, on average!

The code used for the analysis of arXiv abstracts presented in this article can be accessed on GitHub at https://github.com/nesterko/ds_code_availability_on_arxiv.

Let me know any feedback. Views expressed here are my own and confidentiality of examples is preserved to the best of my knowledge. Enjoy!

Introduction

Jolted by the abundance of computational power and data sources, data science is becoming better defined and is gaining wider adoption (Meng, 2020), (Jordan, 2019). Given this reality, industry data science teams now have a growing platform to position and drive more value from academic research results as part of the models we put in production, by which I mean repeatedly (or continuously) executed and maintained models.

However, practicing data science teams such as my own often have just “one try” to address a problem and productionize a model, usually with a constrained time limit specific to each project. In this article, I introduce computer coding for data science as a critically necessary (but not sufficient) ingredient in team skills, which enables practicing data scientists to deploy competitive solutions incorporating relevant academic research in areas where we practice.

Using an analysis of research paper abstracts and examples from personal experience, I show that research reproducibility in data science-related domains has started to increase in recent years, making it easier to reproduce and test research results in industry settings. I then discuss how one skill, namely the computer programming (coding) skill among practicing data scientists and our teammates, serves as a necessary (but not alone sufficient) ingredient enabling the prototyping work, which is needed to adapt research results in production models. I provide several coding training strategies that I have seen translate effectively into the development and deployment of production models in industry. Finally, I conclude that computer programming is a critical skill in data science teams for deploying academic research results in industry settings.

In the next section, I explore a recent positive uptick in the availability of open source code in research publications in data science-related domains, and introduce how coding skills for practicing data scientists relate to their deployment in production. Then, through examples of implemented projects, I provide a view into how computer programming skills and research results replication enter the operational environment of project delivery. I conclude with a selection of coding training strategies for data scientists that I have found effective for production model development and deployment, and suggest future avenues of inquiry into how data science as a capability can leverage academic research to help generate value from data in industry.

Trends in open source code availability in data science research publications

In my experience designing and performing model implementations in industry, the practice of bringing scientific elements to project approach is regularly tested and revisited for improvements and efficiencies. The main elements of the project implementation process that connect with academic research include a review of relevant research literature, and prototyping/deployment of selected methods, often performed in parallel and via adaptation or combination of a number of methods through trial and error.

In this section, I conduct a simplistic analysis of reproducibility of academic research in data science related domains. I show that it is becoming easier to reproduce academic research, by observing a positive uptick in reproducibility within popular arXiv article categories that are related to data science.

Following (Peng, 2011), let us consider reproducibility of data science research as a spectrum running from “least reproducible” to “gold standard”. On the “least reproducible” side of the spectrum, we find research represented with publication only. One step closer to “gold standard” is publication and code, another step is publication, code and data, and so on until we reach “gold standard”. Here I use a lens of estimating trends in availability of publication and code, that is one step closer to “gold standard” departing from “least reproducible” in the reproducibility spectrum described above, which I call “code availability” in research publications.

In order to look at trends in code availability in popular data science research domains, I adopt a similar approach to the one taken in (Sanders, 2019). Specifically, I downloaded arXiv publication abstracts for a few popular data science related categories, and analyzed them for the presence of a set of keywords that suggest code availability. The categories I downloaded from arXiv were cs.ai, cs.cv, stat.ap, stat.ml, and stat.th, for the period from 2007 to 2019 inclusive. This data pull yielded close to 130 thousand articles in total. Within the analyzed categories, we see an increase in the number of articles in the stat.ml category starting in 2014, followed by the cs.cv and cs.ai categories. The total number of articles in the stat.ap and stat.th categories has remained steady since 2014 (see the bottom chart of the Figure below for numbers of articles by year).

In order to gauge trends in code availability in downloaded articles, I searched their abstracts for the following terms related to presence of open source code in the publication: ‘github’, ‘sourceforge’, ‘open source’, ‘code’, ‘r package’, ‘python module’, ‘python package’, ‘r module’. I then examined the trends in percent of abstracts each year that contain one or more of these terms.
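The keyword search described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked repository: the function name `mentions_code` and the choice of plain case-insensitive substring matching are assumptions.

```python
# Terms listed in the text above.
CODE_TERMS = [
    "github", "sourceforge", "open source", "code",
    "r package", "python module", "python package", "r module",
]

def mentions_code(abstract: str) -> bool:
    """Return True if the abstract contains at least one code-related term.

    Assumption: simple substring matching on lowercased text with
    whitespace collapsed, so terms split across line breaks still match.
    """
    text = " ".join(abstract.lower().split())
    return any(term in text for term in CODE_TERMS)

print(mentions_code("Our implementation is available on GitHub."))
print(mentions_code("We prove a new bound for estimators."))
```

Substring matching is deliberately crude: "code" also matches words such as "encoder", and "r package" matches "solver package", so a share computed this way is best read as a rough proxy for code availability rather than a precise count.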

To describe the analysis, I adopt a similar notation to (Sanders, 2019). Let the total count of abstracts from category \\(s\\) for year \\(y\\) be denoted as \\(N_{s,y}\\). Let the count of those abstracts containing at least one of the terms corresponding to code availability be denoted as \\(C_{s,y}\\). Finally, let the proportion of articles in each category and each year using at least one of the terms listed above be defined as \\(U_{s,y}=C_{s,y}/N_{s,y}\\).
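As a rough sketch of the quantities just defined, the snippet below computes \\(N_{s,y}\\), \\(C_{s,y}\\) and \\(U_{s,y}\\) from a list of records. The input layout (tuples of category, year, and a code-availability flag) and the function name `code_availability` are hypothetical. The binomial standard error is also an assumption, since the text does not say how the standard errors in the Figure were obtained.

```python
from collections import Counter
from math import sqrt

def code_availability(records):
    """Summarize records of the form (category, year, has_code).

    Returns a dict mapping (category, year) to a tuple (N, C, U, SE),
    where U = C / N and SE is the binomial standard error of U
    (an assumed choice, not taken from the article).
    """
    n = Counter()
    c = Counter()
    for category, year, has_code in records:
        n[(category, year)] += 1
        c[(category, year)] += int(has_code)
    summary = {}
    for key, total in n.items():
        u = c[key] / total
        se = sqrt(u * (1 - u) / total)
        summary[key] = (total, c[key], u, se)
    return summary

# Hypothetical toy input: four stat.ml abstracts from 2019, one mentioning code.
toy = [("stat.ml", 2019, True)] + [("stat.ml", 2019, False)] * 3
print(code_availability(toy)[("stat.ml", 2019)])
```

Keeping the counts alongside the ratio makes it easy to see when a high value of \\(U_{s,y}\\) rests on only a handful of abstracts.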

Trends in \\(U_{s,y}\\) over time readily suggest interesting insights when it comes to code availability in data science related research articles.

In the Figure below, I plotted \\(U_{s,y}\\) as well as total numbers of articles in each category \\(N_{s,y}\\) as a time series for each category \\(s\\) and year \\(y\\). In years prior to 2012, total numbers of articles are observed as trending up, but all below 2 thousand articles in each category. Following 2012, the total numbers of articles have been increasing in cs.ai, cs.cv and stat.ml. The percentage of articles with terms indicating code availability shows a steady increase in these categories as well. In stat.ap and stat.th, the total number of submitted articles has slightly increased yet remained below 2000 per year throughout the considered time period, and the percentage of articles with code availability has shown mild growth over the considered time horizon as well.

Figure. Top chart: average proportion of abstracts indicating code availability in arXiv articles within several categories (solid coloured lines) and standard errors (coloured shaded regions). Bottom chart: total quantity of articles in each category is displayed as dashed coloured lines.

The simplistic analysis here suggests that in the considered popular data science domains, code availability has increased in recent years, with up to one in five academic publications now having available code references. This is a great first step towards gold standard in reproducibility of published research. A more complete discussion of reproducibility and replicability can be found in (NASEM, 2019).

How can practicing data science teams be more prepared to translate academic research to production models? I propose that computer programming for data science is a critically relevant skill and part of the answer to this question given the positive dynamics in code availability above.

In my experience, computer programming is a frequently utilized and important element of data science project delivery in industry, especially in production models which rarely leverage standardized out-of-the-box pre-implemented models. Perhaps unsurprisingly, data science teams that I have worked with often had computer programming for data science as a required skill. It seems appropriate to highlight that data science-specific computer programming is a form of coding acumen that should warrant project delivery outcomes, which include academic research results analysis and testing, as I introduce through examples provided in the following section.

Examples of industry projects and their connection with computer programming skills for data science

This section describes two data science project implementations in industry within the context of these projects’ connection with computer programming skills and reproducibility of relevant academic research. With this specific context in mind, I describe certain aspects of these projects at a high level in order to maintain confidentiality while aiming to provide sufficient detail and identify potential academic citations as relevant candidates for implementation. As I describe below, a connection of the delivered models with computer programming and reproducibility of relevant academic research is an important practical element of the project delivery process.

Each of the two project descriptions outlined in this section uses the following structure. First, I introduce the business context and challenge addressed by the delivered model. Then I describe how research literature review and candidate model implementation were combined as part of project delivery. Finally, in each case I highlight the link between research reproducibility and computer programming skills for data science practitioners.

1. Testing for surprising events at retail stores

The first example is a project to implement a model to prioritize (test) for notable events out of all records of ordering, inventorying, and selling products at store locations of a large national retailer. As stores are franchised to individual owners, each store could exhibit its own patterns in these events, which a central enterprise team could utilize to recommend improvements to store owners for how to operate their respective stores to minimize product waste, better serve customers, and improve profitability.

The task of defining and monitoring events to improve service to customers can be considered from an academic perspective. Defining the right quantities to measure event outcomes, as well as testing procedures taking into account the time-dependent nature of underlying data on store orders, inventory, and sales, can be considered as a research project in itself. However, in this real world industry setting, the project team had a limited time to put in place a production model for use by the central enterprise team. We therefore needed to conduct a literature search while concurrently prototyping and comparing candidate models to deploy in production.

The literature search process yielded several references, including univariate hypothesis test methods adapted to the technology sector (Ng, 2019), time series-based methods (Lin et al., 2003), (Shipmon et al., 2017), and methods spanning a range of other models (Liu et al., 2008), (Ma & Perkins, 2003). As with the next example, the identified references came in different forms. For example, the material included in Andrew Ng’s course included code, while the findings of the team from Google (Shipmon et al., 2017) were described in a written report without specific code references. During our project implementation, the availability of computer programming acumen among our data science team members was an important factor determining the extent to which the methods described in the considered references were reproduced in candidate models for production deployment.

2. Inference of grocery customer food tastes

The second example is a project to implement a model to estimate (infer), in an interpretable way, grocery customers’ underlying food tastes based on their shopping records. This project proved a challenge due to a high volume of recorded purchase data patterns and their variations across customers. Once solved to a sufficient degree, the model’s output could be incorporated into guiding a retail client’s strategy and technology systems to better cater to customer needs.

At the outset of the project, it became immediately clear to the team that individual food tastes of customers and underlying data to be summarized presented a fertile ground for academic thought. Relevant methodologies concern defining the right quantity to be estimated from available data, and performing estimation taking into account unique characteristics of time-dependent patterns in how customers shop for groceries. As in the prior example, these topics could be a subject of an academic research project. However, given enterprise project delivery realities, the time period available to deliver a production model was limited to months rather than years. Team members needed to use available time wisely, performing literature review concurrently with prototyping and comparing candidate models.

During the literature review process, directly relevant research approaches in the publications identified by the team included Explicit Semantic Analysis (Gabrilovich & Markovitch, 2007), and Latent Dirichlet Allocation (LDA) modelling (Blei, Ng, & Jordan, 2003) and its variants such as Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) (Hansen, 2013). Other modelling approaches relevant to the project included models for text, sequences, or time series such as deep neural network formulations (Hochreiter & Schmidhuber, 1997), or alternative topic inference models (Kong, Scott, & Goerg, 2016). These methods provided a window into how (with sufficient effort) available data could be summarized and interpreted as customer food tastes suitable for enterprise decision making.

As with the previous example, each of the research references identified above was “ready to be implemented” to a different degree. Some references were available as established community-driven software packages (e.g. ESA and LDA), some were in the form of a written report (e.g. referenced work from Google), and others represented a mix of reproducibility characteristics (e.g. LDA-STWD, available as a report and a code prototype on GitHub). Additionally, all identified research references had varying characteristics when it came to their computational efficiency at high data volumes. The presence of sufficient computer programming skills among data scientists on the project team was a necessary ingredient to successfully navigate the prototyping of identified methods while staying within project timelines.

Conclusion

Computer programming for data science is a necessary team skill in translating research results into production models in industry. The level of maturity of this skill in data science teams often serves as a bottleneck in the rate with which teams are able to iterate through working implementations of candidate models.

In practicing data science teams, computer programming skills are a subject of continuous improvement. Here are a few effective strategies that I have experienced in industry:

  • Learn through practice. When practicing computer programming for data science, it is best to do so while solving a concrete problem. No problem is too small when it comes to learning coding. While it is often effective to practice on problems from day-to-day data science work, online learning and academic resources abound, ranging from websites such as Codecademy and edX to traditional educational programs.
  • Share your code with others. From getting your classmate’s or fellow teammate’s take on your code to posting it on your profile on GitHub, there are many opportunities to responsibly share your learning with your peers. This can help get more feedback on your progress. Check with your team lead or professor for their advice on how to best share and get feedback on your code.
  • Get inspiration from other domains. For example, Computer Science as an academic field offers much to learn in effective coding strategies. In industry settings, looking at code examples from environments other than those you are used to can spur ideas for how to improve your own coding acumen.
  • Write code for reuse. When practicing, imagine that someone else may need to run your code. If you mean for others to use your code, try to improve your code’s ease of use through turning code snippets into functions, breaking up long codes into logical “chapters”, utilizing comments within code, or even stand-alone documentation.

In addition to computer programming skills, effective data science teams in industry exhibit other skillsets such as foundational training in related academic fields, project ownership skills, communication, and domain context knowledge in industry domains where they operate. An excellent discussion of relevant skills and academic training can be found in (Irizarry, 2020), (Berthold, 2019), (Garber, 2019). In practical settings, each data science project that needs to be delivered presents itself with unique requirements. Almost invariably, however, computer programming for data science is a staple required to incorporate academic research results in production model development, deployment, and maintenance.

Therefore, computer programming for data science is an important ingredient to building the skills of an effective, research oriented data science team. However, there is much more to delivering value from data than just coding.

As introduced in (Irizarry, 2020), data science teams can include teammates with non-overlapping, or partially overlapping skills. Further, our teams need to effectively engage relevant stakeholders, which often includes navigating overlapping team skills and responsibilities for delivering real world outcomes based on data science project objectives. Few would question that the field is maturing, with data science teams facing a greater responsibility in how solutions are developed, delivered, and maintained across sectors. Areas where such levels of responsibility have been historically high include Statistics heavy domains such as US Census, where rigour in approaching team skillsets, delivery patterns, and stakeholder engagement has been motivated by the importance of Census as a vehicle for informing a wide ranging array of applications. Indeed, recent changes to the anonymization aspect of Census data are stimulating serious discussions of challenges and opportunities they bring, and rightfully so as emphasized in (Meng, 2020).

It is comforting that the community is starting to have a rigorous discussion about best practices in data science as a discipline mandated to generate value from data (Irizarry, 2020). In industry settings, data science is increasingly viewed as a business capability (Omnia AI, 2018). Viewed as part of a capability to deliver business value from data, data science teams need to participate in discussing the relevant questions in the community. For example, how can we borrow learnings from the more mature domains such as US Census, to improve how data science is deployed as a capability? How do we optimally structure data science teams to respond to the increasing requirements of generating value from data? What are the optimal ways to bring scientific elements to industry projects? How can data science teams best engage business stakeholders in order to deliver more value from data in various industry domains? What academic research methods can data science teams in industry use in order to minimize the risk of “black swan” events when production models break and cause adverse outcomes for an unforeseen reason?

Answers to questions related to how data science teams can deliver value from data are diverse, highly domain-specific, and require collaborative input from academic researchers, practicing data scientists, and business stakeholders. It is encouraging to see the increased rigour in developing the discussion and equipping the community with the right skills, tools, and methodologies to deliver value in data science.

Bibliography

Berthold, M. R. (2019). What Does It Take to be a Successful Data Scientist? Harvard Data Science Review, 1(2).

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. IJCAI, 7, 1606-1611.

Garber, A. M. (2019). Data Science: What the Educated Citizen Needs to Know. Harvard Data Science Review, 1(1).

Hansen, J. A. (2013). Probabilistic Explicit Topic Modeling. Brigham Young University ScholarsArchive.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Irizarry, R. A. (2020). The Role of Academia in Data Science Education. Harvard Data Science Review, 2(1).

Jordan, M. I. (2019). Artificial Intelligence—The Revolution Hasn’t Happened Yet. Harvard Data Science Review, 1(1).

Kong, J., Scott, A., & Goerg, G. M. (2016). Improving topic clustering on search queries with word co-occurrence and bipartite graph co-clustering. Google Inc (to appear).

Lin et al. (2003). A symbolic representation of time series, with implications for streaming algorithms. Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2-11.

Liu et al. (2008). Isolation forest. Eighth IEEE International Conference on Data Mining, 413-422.

Ma, J., & Perkins, S. (2003). Time-series novelty detection using one-class support vector machines. Proceedings of the International Joint Conference on Neural Networks, 1741-1745.

Meng, X.-L. (2020). 2020: A Very Busy Year for Data Science (and for HDSR). Harvard Data Science Review, 2(1).

NASEM. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press.

Ng, A. (2019, 03 01). Anomaly Detection Using the Multivariate Gaussian Distribution. Retrieved 03 01, 2019, from Coursera: https://www.coursera.org/lecture/machine-learning/anomaly-detection-using-the-multivariate-gaussian-distribution-DnNr9

Omnia AI. (2018, June 1). Deloitte’s Artificial Intelligence Practice. Retrieved February 9, 2020, from Deloitte: https://www2.deloitte.com/ca/en/pages/deloitte-analytics/articles/omnia-artificial-intelligence.html#

Peng, R. D. (2011). Reproducible research in computational science. Science, 1226-1227.

Sanders, N. (2019, June 22). A Balanced Perspective on Prediction and Inference for Data Science in Industry. Retrieved from Harvard Data Science Review: https://doi.org/10.1162/99608f92.644ef4a4

Shipmon et al. (2017). Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data. arXiv, 1-9.

Responsive design for data visualization

Monday, March 2nd, 2015

Interactive data visualization is an increasingly important tool for conveying analytic information. It is used in teaching, in product design, data journalism, scientific research, and many other fields.

Much of the analytic insight that we consume every day goes through the minds and hands of data scientists. Data scientists own data analysis; this is our profession. Sometimes, we need to create interactive visualizations to communicate our work.

Our audience is increasingly diverse: analytics product owners, researchers, journalists, and the general public are just a few examples. How do we make sure our visualizations are maximally understandable and usable by our target audience?

In order to make interactive visualization more usable, data scientists need to start thinking a little bit like designers. Yes, it means expanding the significant skill set we already possess even further, or partnering with a designer. With the abundance of information and tools currently available, it’s not that hard — and even fun.

A cornerstone of good design is responsiveness. The interactive visualization we make needs to look good on any screen width. Here is an example, an interactive dashboard of edX learner progress by Zach Guo:

Zach’s visualization adapts to different widths of the containing box. See it for yourself by clicking and dragging the resize handle on the bottom right of the visualization:

Here is another example, a network visualization I made some time ago. It also adapts to different screen widths.

If your visualization is web-based (like both visualizations above), most of the time you can make it responsive by defining different CSS layout rules based on screen width. Here’s how I did it in the visualization above (check out source code for a complete picture of what’s going on):

@media (min-width: 1150px) {
    .ivcont.info {
        width : 250px;
    }
}

There is more discussion on data visualization usability in the material of a course on statistical computing and interactive visualization I taught at Harvard Statistics department in Spring 2013 (for example, consider Lecture 24).

Data scientists own data analysis. As part of making the end result accessible to the target audience, responsive design for data visualization can be a fun addition to our toolkit.

Gender Balances: A look at the makeup of HarvardX registrants

Thursday, December 5th, 2013

Although the first semester of the 2013/14 academic year is coming to a close on campus and residential students are finishing up coursework and preparing for the break, the timelines are more asynchronous for students registered for 10 currently running online offerings. This batch of 10 consists of courses and modules launched by HarvardX at different times during the Fall of 2013.

While course development teams are working to create the most stimulating learning experiences and thinking about whether and how to give students a mini winter break in their courses or modules (or summer break for those in the southern hemisphere), the research team is busy studying the troves of data produced by past and current online offerings, working with course developers to set up learning experiments, and helping to facilitate research-based innovation at HarvardX.

As part of our work to inform course development and research, the research team generated course-specific and HarvardX-wide worldwide gender composition data.


The interactive visualization above shows self-reported gender composition data for all past and current online offerings, as well as overall for HarvardX. Choosing an item from the drop down menu shows data on a particular course or module on the left, while the chart on the right displays overall gender composition to facilitate comparison. Hovering the mouse over the chart brings up information on the specific numbers used to calculate percentages. As of November 17, 7-10% of students in different offerings didn’t specify gender information, which is reflected by the Missing category. Checking the box ‘Only male/female’ leaves only these categories, and calculates the percentages using the total number of reported males and females as the denominator. The data specification file including the source code and technical information can be accessed here.
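The percentage calculation behind the ‘Only male/female’ checkbox can be sketched as follows. This is an illustration only: the function name, the category labels, and the counts in the usage line are hypothetical and do not come from the HarvardX data.

```python
def gender_percentages(counts, only_male_female=False):
    """Convert category counts into percentages.

    counts: dict such as {"Male": ..., "Female": ..., "Missing": ...}.
    With only_male_female=True, the Missing category is dropped and the
    denominator is the number of reported males plus reported females.
    """
    if only_male_female:
        counts = {k: v for k, v in counts.items() if k in ("Male", "Female")}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no students to compute percentages from")
    return {k: 100 * v / total for k, v in counts.items()}

# Hypothetical counts, for illustration only.
example = {"Male": 60, "Female": 30, "Missing": 10}
print(gender_percentages(example))
print(gender_percentages(example, only_male_female=True))
```

Dropping the Missing category changes the denominator, so the male and female percentages both rise when the box is checked even though the underlying counts are unchanged.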

The HarvardX student body is estimated to be mostly male (62% as of November 17, 2013), although there is considerable variation in gender balance from one offering to the next. For example, CS50x Introduction to Computer Science has decidedly more male students (estimated 79%), while both offerings on Poetry in America register mostly female students (estimated 57% and 61%). Some courses have almost equal percentages of female and male students. For example, GSE1x Unlocking the Immunity to Change, launching in Spring 2014, so far has registered an estimated 51% females and 49% males. We generally do not recommend interpreting the overall HarvardX average when overall enrollment is so heavily influenced by a small number of courses (e.g., Computer Science and the Science of Cooking).

In order to gain a better understanding of gender balance in HarvardX offerings, we made a world map, showing estimated gender composition of our students enrolled from different countries around the world.


The map above is an interactive visualization of the estimated gender composition of students enrolled in HarvardX offerings worldwide. Blue color means that the balance is tilted towards male registrants, yellow towards female registrants, and green is approximate parity. Hovering the mouse over countries brings up information on exact estimated proportions of female and male registrants and the numbers the estimation is based on. Estimation was performed using a Missing At Random assumption for missing data, and countries with fewer than 100 detected students are not colored as the estimated percentages can have a margin of error greater than ±5 percentage points. Choosing items from the drop down menu will bring up information on worldwide gender composition for a particular offering. The data specification file including the source code and technical information can be accessed here.
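To get a feel for the 100-student cutoff, here is a minimal sketch of a margin-of-error calculation for an estimated proportion. It assumes the usual normal approximation at a 95% level and treats the students who reported gender as a simple random sample, which is one way to read the Missing At Random assumption; the exact method used for the map is not stated here, and the function name is hypothetical.

```python
from math import sqrt

def margin_of_error_pp(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error, in percentage points, for a proportion.

    p_hat: estimated proportion (between 0 and 1).
    n: number of students with reported gender (assumed sample size).
    z: normal quantile; 1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 100 * z * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical country with 100 reporting students, half of them female.
print(round(margin_of_error_pp(0.5, 100), 1))
```

Under these assumptions, a country with 100 reporting students and a balance near parity has a margin of roughly ±10 percentage points, and the margin only drops below ±5 points once the sample reaches about 385 students.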

In most countries of the world, estimated gender balance is tilted towards males, with the pattern being strongest in African and South Asian countries. Exceptions include the Philippines, Georgia, Armenia, Mongolia, and Uruguay, where overall estimated HarvardX gender balance is either close to 50% or tilted towards females. Possible explanations for this finding include cultural trends, selective registration for courses that are more popular among females, Internet access, economic factors, etc.

Individual HarvardX offerings exhibit very different patterns in worldwide gender composition. For example, MCB80.1x Fundamentals of Neuroscience, Part I (launched end of October) shows gender parity in the US, Canada, and Australia, while we estimate more female students from the Philippines, Argentina, and Greece. Other countries including China, India, Pakistan, France, Sweden, and others are estimated to have more male students. At the same time, the recently launched PH201x Health and Society exhibits strong female enrollment from many countries around the world, while India, China, and Pakistan as well as other countries still are estimated to have more male students in the course.

There could be many possible explanations for the observed picture of worldwide gender composition in HarvardX offerings. One aspect to consider is popularity of courses in certain fields among males or females depending on the context of a particular country. For example, gender balance in the US varies greatly from one course to the next. The ways in which online learning (edX, HarvardX, and beyond) is perceived and promoted in a particular country through advertising, word of mouth, and other means may also have some influence on who ends up enrolling for courses. There are also other country-specific factors such as cultural setting, Internet access, religion, and others, all of which may contribute to the gender balance patterns we are observing.

One parallel that I find interesting is comparing worldwide gender compositions in HarvardX offerings and residential education.

The picture above is taken from UNESCO’s Worldwide Atlas of Gender Equality in Education from 2012, and visualizes worldwide gender composition in tertiary education. Yellow color means that there are more females enrolled in tertiary education than males, green means parity, and blue means that there are more males.

What’s interesting about UNESCO’s gender parity map and the interactive visualization of worldwide gender composition for HarvardX offerings is that they should match if residential tertiary education exhibited the same gender enrollment patterns as HarvardX. However, while there are similarities, the two pictures don’t quite match. On average, across multiple countries, more females participate in tertiary (that is, residential) education than they do in HarvardX online courses.

Why is that?

It could be that at HarvardX, technical courses such as CS50x skew the enrollment demographic, which has been shown to be mostly male for technical/STEM subjects in all settings. It could also be that in some countries women, on average, see the initial MOOCs as less relevant to their lives and work than men do. It remains to be seen whether the patterns in these initial gender composition data show fundamental differences between gender demographics of residential tertiary and online education, or whether the observed patterns are due to a limited number of initial online offerings and are specific to HarvardX.

Clearly, our analysis generates more research questions than it answers. Finding and polishing bits and pieces of the puzzle to answer these questions is what makes working at HarvardX research so stimulating.

HarvardX research: both foundational and immediately applicable

Wednesday, October 23rd, 2013

There is a difference between research and how innovation happens in industry. Research tends to be more foundational and forward-thinking, while innovation in industry is more agile and looks to generate value as soon as possible. Bret Victor, one of my favorite people in interaction design, summarizes it nicely in the diagram below.

Bret Victor’s differences between industry and research innovation

HarvardX is a unique combination of industry and research by the classification above. The team I am part of (HarvardX research) works to generate research and help shape online learning now, as well as contribute to foundational knowledge. Course development teams, who create course content and define course structure, sit on the same floor as us. Course developers work together with the research team looking for ways to improve learning continuously and generalize findings beyond HarvardX to online and residential learning in general. Although the process still needs to be streamlined as we are scaling the effort, we are making progress. One example is the project on using assignment due dates to get a handle on student learning goals and inform course creation.

Here is how it got started.

As we were looking at the structure of past HarvardX courses, we discovered that there was a difference in how graded components were used across courses. Graded components include assignments, problem sets, or exams that contribute to the final grade of the course which determines whether a student gets a certificate of completion. Below is public information on when graded components occurred for 3 HarvardX courses.

The visualization above shows publicly available graded components structure for three completed HarvardX courses: PH207x (Health in Numbers), ER22x (Justice), and CB22x (The Ancient Greek Hero). Hovering the mouse over different elements of the plot reveals detailed information, clicking on course codes removes extra courses from display. For PH207x, each assignment had a due date preceding the release time of the next assignment (except the final exam). For the other two courses, students had the flexibility of completing their graded assignments at any time up until the end of the course.

When the due date passes on a particular graded component, students are no longer able to access and answer it for credit. The “word on the street” among course development teams so far has been that it’s generally desirable to set generous due dates on the graded components, as this promotes alternative (formative) modes of learning by allowing students not interested in obtaining a grade to access the graded components. Also, this way students who register for a class late have an opportunity to “catch up” by completing all assignments that they “missed”. However, so far it has been unclear what impact such a due date structure has on academic achievement (certificate attainment rates) versus other modes of learning (non-certificate track, i.e., leisurely browsing).

Indeed, one of the major metrics of online courses is certificate attainment – the proportion of students who register for the course and end up earning a certificate. It turns out that PH207x had an attainment rate of over 8.5%, the highest among all open HarvardX courses completed to date (the average rate is around 4.5%). Does this mean that setting meaningful due dates boosts academic achievement by helping students “stay on track” and not postpone working on the assignments until the work becomes overwhelming? While the hypothesis is plausible, it is too early to draw causal conclusions. It may be that the observation is specific to public health courses, or that PH207x happened to have more committed students to begin with.

While the effect on certificate attainment is certainly important, an equally important question is what impact due dates have on alternative modes of learning. That’s why we are planning to start an A/B test (randomized controlled experiment) to study the effect of due dates, in close collaboration with course development teams. Sitting on the same floor and being immersed in the same everyday context of HarvardX allows for agile planning, so we are hoping to launch the experiment as early as November 15 or even October 31. The findings of the study have the potential to immediately inform course development for new as well as future iterations of current courses, aiming to improve educational outcomes of learners around the world and on campus.
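As a rough illustration of what such an experiment involves, here is a minimal sketch of random assignment followed by a pooled two-proportion z-test on certificate attainment. The arm names, function names, and counts are all hypothetical and chosen only for illustration; this is not the actual HarvardX experimental design, and the counts are not HarvardX data.

```python
import math
import random


def assign_arms(student_ids, seed=0):
    """Randomly assign each student to one of two hypothetical due-date arms."""
    rng = random.Random(seed)
    arms = ("strict_due_dates", "flexible_due_dates")
    return {sid: rng.choice(arms) for sid in student_ids}


def two_proportion_z(cert_a, n_a, cert_b, n_b):
    """Pooled two-proportion z statistic for the difference in attainment rates."""
    p_a, p_b = cert_a / n_a, cert_b / n_b
    pooled = (cert_a + cert_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / se


# Made-up counts for illustration only: 85 of 1000 vs 45 of 1000 certificates.
assignment = assign_arms(range(10), seed=42)
z = two_proportion_z(cert_a=85, n_a=1000, cert_b=45, n_b=1000)
print(assignment)
print(f"z = {z:.2f}")
```

Because assignment is random, a large z statistic here could be read causally, which is exactly what the observational PH207x comparison cannot offer.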

HarvardX is a great example of a place where research is not only foundational but also immediately applicable. While the combination is certainly stimulating, I wonder to what extent this paradigm translates to other fields, and what benefits and risks it carries. With these questions in mind, I cannot wait to see what results our experimentation will bring and how we can use data to improve online learning.

Data from HarvardX Research: worldwide student enrollment

Thursday, August 22nd, 2013

Summer is coming to an end: weather is starting to cool down, the lazy toasty feel in Cambridge streets is gradually going away. The fall is about to set in, residential courses at Harvard are about to start. At HarvardX, however, learning and teaching have been going on full speed and show no signs of slowing down or changing pace with the start of the conventional academic year. If anything, every day it’s becoming busier and busier, and much, much more interesting. It’s challenging to keep pace, but I want to say that we at HarvardX Research are managing.

The past two months have been quite busy: we needed to define infrastructure to store and analyze HarvardX data to supply course developers, researchers, leaders, and the public with much needed tools to gain insight on this one-year-old enterprise that’s shown immense growth. Indeed, just a few months ago HarvardX offered just 5 courses. Now the number is 17, with 12 more courses due to launch in the Fall and early 2014. HarvardX enrollment reached 200,000 in winter and has now more than doubled to reach 516,479 students worldwide as of August 18. Currently, HarvardX has students from almost anywhere in the world:

The visualization above lets you hover over countries to check their enrollment and select one of the 17 HarvardX courses from the menu to see the breakdown by course. It is going to appear on the official HarvardX website.

Global reach of HarvardX. In about one year, over 500 thousand students from 204 countries registered for HarvardX courses. That is a larger number than the number of students Harvard College graduated in its entire 377-year history. Of course, the HarvardX student body is substantially different from residential students, and it is going to take much effort and experimentation to identify meaningful differences and the most effective ways to deliver educational content to such a diverse student body in the new setting of online learning.

Enrollment activity. One of the immediate insights is that there are many American students, but they account for less than half of all HarvardX students. In Africa, Nigeria has the highest enrollment with 11,490 students; in Europe, Spain has the most students (8,668), followed by Great Britain with 7,321; India is the highest-enrolled country in Asia with almost 50 thousand students; and in South America, Brazil has 10,535 students who registered for HarvardX courses. Enrollment per country is estimated from the known total enrollment and the relative numbers of students with reported and recognized countries. The enrollment may be affected by various factors such as country population, Internet use in each country, legal regulations, and cultural patterns.

Expanding access. Although HarvardX has demonstrated a vast reach over the past year, this interactive graphic suggests opportunities for expansion. The majority of global HarvardX enrollment comes from English-speaking countries. An estimated total of 4,497 students, or 0.87% of enrolled students, come from China in spite of its huge population of 1.3 billion (over 19% of the Earth’s population). These findings suggest that we can further adapt HarvardX educational content to different cultures, languages, and student learning goals.
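The per-country estimation described above can be sketched as follows: under a missing-at-random assumption, students with unknown countries are spread across countries in the same proportions as students with recognized ones. The function name and the counts below are hypothetical, and the actual HarvardX pipeline may differ in its details.

```python
def estimate_enrollment_by_country(total_enrollment, reported_counts):
    """Scale counts of students with recognized countries up to total enrollment.

    Assumes country information is missing at random, so the country shares
    among students with a recognized country apply to everyone.
    """
    reported_total = sum(reported_counts.values())
    if reported_total == 0:
        raise ValueError("no students with a recognized country")
    return {
        country: total_enrollment * count / reported_total
        for country, count in reported_counts.items()
    }


# Hypothetical numbers for illustration only (not HarvardX data):
# 1,000 of 2,000 enrolled students have a recognized country.
reported = {"Country A": 600, "Country B": 300, "Country C": 100}
estimates = estimate_enrollment_by_country(2000, reported)
print(estimates)
```

The estimates always add up to the known total, and any systematic difference in who reports a mailing address carries straight through to the country shares.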

As we are shaping the HarvardX Research infrastructure, we are working to cope with data idiosyncrasies caused by the rapid evolution of the edX platform. In addition, we are working to convert data from edX and other platforms to a common format (for example, MCB80x). The data and visualization above are subject to several possible biases and errors: (1) total enrollment includes course team and edX staff registrations, (2) enrollment by country is estimated based on self-reported mailing address at registration and assumes missing-at-random (MAR), (3) mailing address parsing may have accuracy issues, (4) this data is from a system in beta development stage and may contain errors. A technical document further specifying the way the data were obtained, listing possible limitations and links to source files can be downloaded here.

Data and visualizations such as above are going to help power innovation at HarvardX.

Adaptive and social media in MOOCs: the data-driven and the people-driven

Thursday, May 23rd, 2013

In light of my new position as a HarvardX Research Fellow, I have been thinking about the role of data in improving online learning experiences (aka MOOCs) at edX. Can data tell us everything about the ideal learning experience of tomorrow? Can product developers at edX come up with the best version single-handedly? Or, maybe, the online students could also tell us what the ideal MOOC is?

First, let’s think about what could be the “ideal MOOC”. There is a broad consensus that an ideal online learning experience would yield the best “educational outcomes” for the students. For now, let’s think about the educational outcome as something that’s well-approximated by the amount of learning. Specifically, this means that we want students to extract and internalize as much educational content from the interactive learning experience as possible. Finally, the educational content is information that is relevant to the substance of the class. For example, for a probability course, this would include information on how to use Bayes’ rule or the change of variables. For a Python programming class this would include information on how to operate Python modules and language syntax. For a class on interactive visualization, this could include (of course!) information on how to use d3js.

This is an important point. Educational content is information relevant to the substance of the class. We want the students to internalize as much of it as possible, make it their knowledge. How can we do that?

Let’s assume that the educational materials (lectures, homework, tests, examples) have already been prepared and we believe that they are good. How do we expose the materials to the students in the best possible way so that students learn the most, stay engaged, and more students complete the class?

Clearly, the setting of a MOOC is different from the setting of a standard classroom. One of the significant differences is the number of students – it’s massive. Depending on the course, the number of enrolled students can exceed 150 thousand – CS50x by David Malan on HarvardX is a great example. Do we want to expose every single student, no matter what country he/she is from, no matter what talents and aspirations he/she has, no matter how many peers he/she will study with, all to the same sequence of the material? Maybe, yes. And maybe, no.

The setting of MOOCs can be a wonderful platform for adaptive media – an algorithmic way of sequentially presenting content and interacting with the user in order to maximize the informational content that the user “internalizes”.

Adaptive media. It’s the characterizing trait of a computer as a medium – the ability to simulate responses, interact, predict, “act like a living being”. We can use it to model, predict, and synthesize the best way to serve content to users, algorithmically.

Adaptive media is used actively across the Web in conjunction with social media. Often, the inputs of adaptive media are the outputs of social media (and then it repeats). When you share an article on Facebook, the system learns about your preferences and makes sure that the next time you see content it’ll be more relevant to your interests. A lot of the time, by the custom-tailored content we mean advertisements. Same goes for LinkedIn – ever noticed the “Ads you may be interested in” section to the right on your LinkedIn profile?

Can we use adaptive media in MOOCs? The benefits are obvious – with hundreds of thousands of enrollees, it is impossible to adequately staff the course with enough qualified facilitators. Adaptive media could be used together with the teachers’ input and social media such as forums, social grading, and study groups. The purpose, instead of displaying personalized ads, would be to make sure each student learns as much as possible from the interactive learning experience, in his or her unique way. There could also be a multitude of positive extras – reduced dropout rate, higher engagement, higher enrollment for adaptive MOOCs.

Isn’t this interesting?

Democratization of data science: why this is inefficient

Sunday, November 4th, 2012

The use of data in industry is increasing by the hour, and so is investment in Big Data. Gartner, an information technology research and advisory firm, says that spending on big data will be $28 billion in 2012 alone. This is estimated to trigger a domino effect of $232 billion in spending through the next 5 years.

The business world is evolving rapidly to meet the demands of data-hungry executives. On the data storage front, for example, new technology is quickly developed under the Hadoop umbrella. On the data analysis front, there are startups that tackle and productize related problems such as quid, Healthrageous, Bidgely, and many others. What drives this innovation in analyzing data? What allows so many companies to claim that their products are credible?

Not surprisingly, the demand for analytic talent has been growing, with McKinsey calling Big Data the next frontier of innovation. So, let’s make this clear – businesses need specialists to innovate, to generate ideas and algorithms that would extract value from data.

Who are those specialists, where do they come from? With a shortage of up to 190,000 data specialists projected for 2018, there is a new trend emerging for “the democratization of data science” which means bringing skills to meaningfully analyze data to more people:

The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it’s new education platforms (e.g., Coursera and Udacity) teaching students fundamental skills in everything from basic statistics to natural language processing and machine learning.

Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.

This quote is optimistic at best. Where is the guarantee that the product developed by a “data scientist” with a couple of classes worth of training is going to work for the entire market? In academic statistics and machine learning programs, students spend several years learning the proper existing methods, how to design new ones, and prove their general validity.

When people without adequate training make analytic products and offer them to many customers, such verification of the product is crucial. Otherwise, the customer may soon discover that the product doesn’t work well enough or not at all, thus bringing down the ROI on the product. The customer will then go back and invest in hiring talent and designing solutions that would actually work for the case of this customer. If all customers have to do this, the whole vehicle with the democratized data science becomes significantly inefficient.

Behind each data analysis decision there must be rigorous scientific justification. For example, consider a very simple Binomial statistical model. We can think about customers visiting a website through a Google ad. Each customer is encoded as 1 if he or she ends up purchasing something on the website, and zero otherwise. The question of interest is, what proportion of customers coming through Google ads ends up buying on the website?

Below is a visualization of the log-likelihood and inverse Fisher information functions. Many inadequately trained data specialists would not be able to interpret these curves correctly even for a simple model like this. But what about the complex algorithmic solutions they are required to build on a daily basis and roll out on the market?

We can simply take the proportion of customers who bought something; that will be our best guess about the underlying proportion of Google ad visitors who end up buying. This is not just common sense: the sample average can be proved theoretically to be the best estimator.

The uncertainty about our estimate can also be quantified by the value of the inverse observed Fisher Information function (picture, left) at the estimated value of p. The three curves correspond to the different numbers of customers who visited our website. The more customers we get, the lower our uncertainty about the proportion of the buying customers is. Try increasing the value of n. You will see that the corresponding curve goes down – our uncertainty about the estimated proportion vanishes.
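The quantities in this example can be computed in a few lines. The sketch below uses the standard Binomial (Bernoulli) log-likelihood and observed Fisher information; the function names and the 20% purchase rate are assumptions made for illustration, not values from the visualization.

```python
import math


def log_likelihood(p, purchases, n):
    """Binomial log-likelihood (up to an additive constant) of purchase probability p."""
    return purchases * math.log(p) + (n - purchases) * math.log(1.0 - p)


def observed_fisher_information(p, purchases, n):
    """Negative second derivative of the log-likelihood with respect to p."""
    return purchases / p**2 + (n - purchases) / (1.0 - p) ** 2


def mle(purchases, n):
    """Maximum likelihood estimate: the observed proportion of buyers."""
    return purchases / n


# Hypothetical data: 20% of visitors buy, at three different sample sizes.
for n in (10, 100, 1000):
    purchases = n // 5
    p_hat = mle(purchases, n)
    inverse_info = 1.0 / observed_fisher_information(p_hat, purchases, n)
    print(f"n = {n:4d}: p_hat = {p_hat:.2f}, inverse Fisher information = {inverse_info:.6f}")
```

At the estimate, the inverse observed Fisher information simplifies to p_hat(1 - p_hat)/n, which is why the curve drops as n grows.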

This is the kind of theory that we need specialists who develop algorithmic products to be equipped with. It requires an investment in their proper education first. If we skip the proper education step, we risk lowering the usefulness and practicality of the products such data scientists design.

Algorithms as products: lucrative, but what is the real value?

Friday, October 12th, 2012

Recently I attended a talk by Nate Silver (@fivethirtyeight), who runs a popular NYT election forecasting blog, where he talked about how he uses algorithms to predict the results of the election given the information available on the day of. Nate didn’t go in depth on how his algorithms work, though there were such questions from the audience. On the one hand, it makes sense. Why explain how the algorithms work when what matters is whether they predict the election correctly? Indeed, they did in 2008, calling 49 of 50 states correctly, as well as all 35 Senate races.

But on the other hand, if Nate Silver never publicly discloses how it works, how do we really know what the algorithm is based on, what the weights on surveys are, how it accounts for all the biases, etc.? In science, algorithms are always disclosed and can be replicated by third parties. Such an approach is not employed by Nate Silver, and it is understandable. His algorithm is a product: it gives him a job at NYT, prestige, and status. What would happen if anybody could replicate it?

The same non-disclosure strategy is employed by LinkedIn for its Talent Brand Index algorithm. The index is a new measure offered by LinkedIn of how attractive the company is for prospective and current employees.

The index will prove to be very lucrative for LinkedIn:

While there is likely to be a lot of quibbling about how the numbers are calculated, this product has the potential to make LinkedIn the “currency” by which corporations measure their professional recruitment efforts.

No wonder the company is trading at 23 X sales.

However, there is a key difference between LinkedIn’s Talent Brand Index and Nate Silver’s election forecast algorithms: it can never be checked whether the Talent Brand Index is right. Indeed, do we know how it is constructed? Here’s what I could find on that:

Last year, LinkedIn was home to over 15 billion interactions between professionals and companies. We cross-referenced our data with thousands of survey responses to pinpoint the specific activities that best indicate familiarity and interest in working for a company: connecting with employees, viewing employee profiles, visiting Company and Career Pages, and following companies. After crunching this data and normalizing for things like company size, we developed our top 100 global list. We then applied LinkedIn profile data to rank the most sought-after employers among professionals in five countries and four job functions.

The index cannot be re-created not only because there is no publicly available description of how it is calculated, but also because LinkedIn’s data on which it is calculated is proprietary.

So, the Talent Brand Index is a black box: recruiters don’t know how it works. But they will pay to get access to it because the index provides employer rankings in terms of “people’s perception of working for them”. The companies will then work and invest heavily to improve their index ranking because the information is publicly available, and will help them recruit better talent.

However, how are the employers going to find out what their ROI is from trying to improve their Talent Brand Index if they don’t know how it works? Not having the information on how the index works makes it a hard task. Let me give an example.

For simplicity, let us assume that the Talent Brand Index gives the weight of 5 to the positive sentiment expressed about the company by the current employees on their LinkedIn profiles, and a weight of 100 on the number of times the profiles of the employees are viewed on LinkedIn. Since the information on weights is hidden from the employers, they’d have to first run a randomized experiment to determine the effect of a particular company policy on employee profile views, and then measure the impact on the index. This is very costly and hard to implement, because it is hard to devise a potentially index-improving policy that would only involve a part of company’s employees (treatment group), and not the other part (control group), and to randomly assign employees to those parts, and then to measure the profile clicks, and so on.

But in our example, LinkedIn gives a very large weight to the number of views of the employees’ profiles! How can the employers find that out?

Practically, the answer is – they cannot.
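To make the hypothetical example concrete, the sketch below assumes the index is a simple weighted sum of the two inputs with the weights given above. The linear form, the feature names, and the baseline values are all assumptions for illustration; nothing here reflects how LinkedIn actually computes the index.

```python
# Hypothetical hidden weights from the example in the text.
HIDDEN_WEIGHTS = {"positive_sentiment": 5.0, "profile_views": 100.0}


def talent_index(features, weights=HIDDEN_WEIGHTS):
    """Assumed linear index: weighted sum of the input features."""
    return sum(weights[name] * value for name, value in features.items())


# Made-up baseline values for one employer.
baseline = {"positive_sentiment": 10.0, "profile_views": 10.0}

# Effect of a one-unit improvement in each input, holding the other fixed.
for name in baseline:
    improved = dict(baseline, **{name: baseline[name] + 1.0})
    gain = talent_index(improved) - talent_index(baseline)
    print(f"+1 {name}: index changes by {gain:.0f}")
```

An employer who sees only the final index value cannot tell that one input moves it twenty times more than the other, short of running the costly experiment described above.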

This means that while the Talent Brand Index is a lucrative product for LinkedIn, the real value it provides to companies is vague. It provides no information as to what areas of an employer’s HR policy need to be improved in order to increase the Talent Brand Index, and in what priority. That is why the high-index companies will enjoy an increased influx of great talent, while the low-index companies will suffer a talent drain. This will reinforce the leaders’ positions, and worsen the positions of the HR underdogs.

Coming back to the broader picture, there are algorithms, and there are algorithms. Nate Silver’s election prediction algorithm is in fact a valuable product to its users even though its details are largely unknown. This is because it can be checked for truthfulness. LinkedIn’s Talent Brand Index product will bring double-digit growth to the company due to the Big Data hype, but will it be really useful to its consumers in terms of helping them improve their hiring? The answer is not straightforward.

Algorithms as products should be designed with enough transparency to make them useful, or with a mechanism to externally verify them. Otherwise, their value to the customer is questionable.

Management consulting view on big data

Monday, June 25th, 2012

The Economist

The amount of data recorded and analyzed in business, medicine, education and public policy is increasing every day at a rapid rate, to the extent that it is hard to keep pace with it. I am particularly interested in how, and whether, the leaders of organizations and government bodies are responding to and extracting value from the phenomenon.

Particularly interesting is the point of view of top management consulting firms, who are also very interested in the trend. For example, the McKinsey Global Institute published a report on big data a year ago. More recently, a recording of a Q&A session on big data with Philip Evans, a senior partner at BCG, was posted on the Schumpeter blog of The Economist about a week ago.

Specifically, Mr. Evans alluded to how the emergence of “big data” may change the course of strategic development of companies. The most recent method has been vertical integration, when companies aim to acquire or develop more entities along the supply chain (e.g., an electric power supplier aims to operate not only power plants, but also raw materials, power grids, etc.) to reduce costs. According to Mr. Evans, during the “big data” era, we will see more horizontal integration, when instead of operating several entities along the supply chain, a company focuses on one, and grows by scaling the product up to many markets. As per Mr. Evans, an example of this approach is Google.

Additionally, Mr. Evans stated that companies will become fragmented into two camps, the one where there exists a well-defined serializable product or service around which a company can scale up, such as “inferring patterns in large amounts of data”, and another where more unique individual skills are needed, such as entrepreneurship, creativity etc.

I found the interview very interesting. We do see successful companies employing horizontal integration (Google, Apple, Amazon). That is, they do focus on a few important products or services, and scale them up to multiple markets. Does this have anything to do with “big data”? It certainly does, as horizontal integration is employed by big players in the big data realm as well, such as EMC. However, horizontal integration is more inherent to the concept of the Internet and the evolution of IT, as is the “big data” phenomenon.

Secondly, I have to disagree with the statement that inferring patterns in large amounts of data is (easily) serializable. This task is an open scientific problem that is a subject of active current research. The only solutions existing at the moment are those belonging to the second camp as defined by Mr. Evans. A task of attempting to design an algorithm to extract a specific answer to a specific question from a dataset in a given format needs to be approached individually by qualified specialists such as statisticians. Such a project does involve creativity and a substantial amount of intellectual effort. After an approach is developed, it can be replicated for the specific dataset it has been designed for (say, when more observations have been collected), and not for other datasets, otherwise the results may be unreliable.

More broadly, what does the phenomenon mean for companies? Horizontal integration follows from the ability to quickly scale up products and services brought about by the development of the Internet and IT, as does big data. So, what is the message of the latter by itself?

Let us not make the matter overly complicated. Buried in the terabytes of “big data” is the ability of companies to be better informed about the market around them and their own internal operations, to optimize activities better, to find out what the competition is up to better, to price their products better than competition, and so on. “Being better informed” is a value generating asset, and companies with large amounts of repeated features (many instances of the same product/service sold, large numbers of employees, many visitors seeing their ads on the Internet) need to realize this. The first ones that do, and those who employ the better methods of extracting interpretable information from the relevant data sources will benefit from the value of being better informed than others.

I couldn’t be more excited about the fact that companies, governments, educational institutions and public policy agencies are beginning to realize the value of being better informed by patterns inferred from data, be they massive, big, or not so big. The fact that top management consultants are talking about it means that top executives are demonstrating this interest.

The gap between academia and current industry practices in data analysis

Sunday, March 25th, 2012

The demand for specialists who can extract meaningful insights from data is increasing, which is good for statisticians as statistics is, among other things, the science of extracting signal from data. This is discussed in articles such as this January article in Forbes, and also the McKinsey Global Institute report published in May last year, an excerpt from which is given below:

There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

This sounds encouraging for the current students in fields such as statistics as they are looking to get out on the hot job market after graduation. However, are they prepared for what industry jobs need them to do?

One piece of news that hasn’t been covered in the press yet is that methods and data-related problems in industry are often different from those described in the body of scientific publications. By different I mean either scientifically invalid or scientifically valid and cutting edge.

An example of this phenomenon is the so-called Technical Analysis of financial data, which is often used by algorithmic trading groups to devise computer-based trading strategies. Technical analysis is a term that people came up with to describe a set of methods that are often useful, and yet their validity is questionable from the scientific perspective. Quantitative traders have been employing this type of analysis for a long time without knowing whether it is valid.

Another example is a project I worked on, described in this post, which was to create an algorithm for optimizing annual marketing campaigns for a large consumer packaged goods company (over $6 billion in sales) to achieve a 3-5% revenue increase without increasing expenditure. Essentially, this was an exercise in Response Surface methods with dimensionality as high as 327,600,000. There are no scientific papers in the field that consider problems of such high dimensionality. And yet companies are interested in such projects, even though the methods for solving them are not scientifically verified (we worked hard to justify the validity of our approach for the project).

Recently I received an email inviting quantitatively oriented PhDs to apply for a summer fellowship in California to work on data science projects. Here is a quote from the email:

The Insight Data Science Fellows Program is a new post-doctoral training fellowship designed to bridge the gap between academia and a career data science.

Further, here is what is stated on the website of the organization sponsoring the program:

INSIGHT DATA SCIENCE
FELLOWS PROGRAM
Bridging the gap between academia and data science.

As with algorithmic trading about 15 years ago, the use of sometimes scientifically questionable data analysis techniques is driven by the increased demand for insights from quantitative information. Such approaches, which in the world of quantitative finance are called Technical Analysis, are named Data Science during the current data boom.

When using the term, one should keep in mind that while the methods employed by inadequately trained “data scientists” may be scientifically valid, they may well not be. There is an inherent danger in calling something that encompasses incorrect methods a sort of “science”, as this instills a perception of a field that is well established and trustworthy. However, the term is only a couple of years old. In my opinion, a more accurate one would be “current data analysis practices employed in industry”.

The way we name the phenomenon does not change what it is. The fact is that there is a lot of data and a lot of problems in industry that often go beyond what has been seen or addressed in academia. This is an exciting time for statisticians.