
Computer Programming as a Necessary Team Skill for Research-Based Production Data Science in Industry

Friday, September 4th, 2020

Here is an article I mainly wrote last year while working at Deloitte Canada. A lot of the credit goes to Ian J. Scott who was a pleasure to work with, and always inspired me to try to inspire myself to keep creating something new. The article came out as a rather trivial piece of writing. Many data scientists would think — how uninteresting of a topic — coding skills for data science. I completely agree, and admit that I don’t like coding myself, on average!

The code used for the analysis of arXiv abstracts presented in this article can be accessed on GitHub at https://github.com/nesterko/ds_code_availability_on_arxiv.

Let me know any feedback. Views expressed here are my own and confidentiality of examples is preserved to the best of my knowledge. Enjoy!

Introduction

Jolted by the abundance of computational power and data sources, data science is becoming better defined and is gaining wider adoption (Meng, 2020), (Jordan, 2019). Given this reality, industry data science teams now have a growing platform to position and drive more value from academic research results as part of the models we put in production, by which I mean repeatedly (or continuously) executed and maintained models.

However, practicing data science teams such as my own often have just “one try” to address a problem and productionize a model, usually with a constrained time limit specific to each project. In this article, I introduce computer coding for data science as a critically necessary (but not sufficient) ingredient in team skills, which enables practicing data scientists to deploy competitive solutions incorporating relevant academic research in areas where we practice.

Using an analysis of research paper abstracts and examples from personal experience, I show that research reproducibility in data science-related domains has started to increase in recent years, making it easier to reproduce and test research results in industry settings. I then discuss how one skill, namely the computer programming (coding) skill among practicing data scientists and our teammates, serves as a necessary (but not alone sufficient) ingredient enabling the prototyping work, which is needed to adapt research results in production models. I provide several coding training strategies that I have seen translate effectively into the development and deployment of production models in industry. Finally, I conclude that computer programming is a critical skill in data science teams for deploying academic research results in industry settings.

In the next section, I explore a recent positive uptick in the availability of open source code in research publications in data science-related domains, and introduce how coding skills for practicing data scientists relate to their deployment in production. Then, through examples of implemented projects, I provide a view into how computer programming skills and research results replication enter the operational environment of industry project delivery. I conclude with a selection of coding training strategies for data scientists that I have found effective for production model development and deployment, and suggest future avenues of inquiry into how data science as a capability can leverage academic research to help generate value from data in industry.

Trends in open source code availability in data science research publications

In my experience designing and performing model implementations in industry, the practice of bringing scientific elements to project approach is regularly tested and revisited for improvements and efficiencies. The main elements of the project implementation process that connect with academic research include a review of relevant research literature, and prototyping/deployment of selected methods, often performed in parallel and via adaptation or combination of a number of methods through trial and error.

In this section, I conduct a simplistic analysis of reproducibility of academic research in data science related domains. I show that it is becoming easier to reproduce academic research, by observing a positive uptick in reproducibility within popular arXiv article categories that are related to data science.

Following (Peng, 2011), let us consider reproducibility of data science research as a spectrum running from “least reproducible” to “gold standard”. On the “least reproducible” side of the spectrum, we find research represented with publication only. One step closer to “gold standard” is publication and code, another step is publication, code and data, and so on until we reach “gold standard”. Here I use a lens of estimating trends in availability of publication and code, that is one step closer to “gold standard” departing from “least reproducible” in the reproducibility spectrum described above, which I call “code availability” in research publications.

In order to look at trends in code availability in popular data science research domains, I adopt a similar approach to the one taken in (Sanders, 2019). Specifically, I downloaded arXiv publication abstracts for a few popular data science related categories, and analyzed them for presence of a set of keywords that suggest code availability. The categories I downloaded from arXiv were cs.ai, cs.cv, stat.ap, stat.ml, and stat.th, for the period from 2007 to 2019 inclusive. This data pull yielded close to 130 thousand articles in total. Within the analyzed categories, we see an increase in the number of articles in the stat.ml category starting in 2014, followed by the cs.cv and cs.ai categories. The total number of articles in the stat.ap and stat.th categories has remained steady since 2014 (see the bottom chart of the Figure below for numbers of articles by year).

In order to gauge trends in code availability in downloaded articles, I searched their abstracts for the following terms related to presence of open source code in the publication: ‘github’, ‘sourceforge’, ‘open source’, ‘code’, ‘r package’, ‘python module’, ‘python package’, ‘r module’. I then examined the trends in percent of abstracts each year that contain one or more of these terms.

To describe the analysis, I adopt a similar notation to (Sanders, 2019). Let the total count of abstracts from category \\(s\\) for year \\(y\\) be denoted as \\(N_{s,y}\\). Let the count of those abstracts containing at least one of the terms corresponding to code availability be denoted as \\(C_{s,y}\\). Finally, let the proportion of articles in each category and each year using at least one of the terms listed above be defined as \\(U_{s,y}=C_{s,y}/N_{s,y}\\).
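The counting defined above can be sketched in a few lines of Python. This is a minimal illustration and not the code from the repository linked earlier: the record layout (category, year, abstract), the function names, and the toy records are assumptions, and matching is done by plain case-insensitive substring search.

```python
from collections import defaultdict

# Terms listed in the text as suggesting code availability.
CODE_TERMS = (
    "github", "sourceforge", "open source", "code",
    "r package", "python module", "python package", "r module",
)


def mentions_code(abstract):
    """Return True if the abstract contains at least one term (case-insensitive substring)."""
    text = abstract.lower()
    return any(term in text for term in CODE_TERMS)


def code_availability(records):
    """Compute U[s, y] = C[s, y] / N[s, y] from (category, year, abstract) tuples."""
    n = defaultdict(int)  # N_{s,y}: all abstracts in category s, year y
    c = defaultdict(int)  # C_{s,y}: abstracts with at least one term
    for category, year, abstract in records:
        n[(category, year)] += 1
        if mentions_code(abstract):
            c[(category, year)] += 1
    return {key: c[key] / n[key] for key in n}


# Hypothetical toy records, for illustration only.
records = [
    ("stat.ml", 2019, "Our implementation is available on GitHub."),
    ("stat.ml", 2019, "We prove a minimax lower bound."),
    ("stat.th", 2019, "We study asymptotic normality of an estimator."),
]
shares = code_availability(records)
print(shares)
```

Substring matching is deliberately crude: a term such as ‘code’ also matches words like “encoder”, so a keyword count of this kind is best read as a rough indicator rather than an exact measure of code availability.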

Trends in \\(U_{s,y}\\) over time readily suggest interesting insights when it comes to code availability in data science related research articles.

In the Figure below, I plotted \\(U_{s,y}\\) as well as the total number of articles \\(N_{s,y}\\) as a time series for each category \\(s\\) and year \\(y\\). In years prior to 2012, the total numbers of articles trend up, but stay below 2 thousand articles in each category. Following 2012, the total numbers of articles have been increasing in cs.ai, cs.cv and stat.ml. The percentage of articles with terms indicating code availability shows a steady increase in these categories as well. In stat.ap and stat.th, the total number of submitted articles has slightly increased yet remained below 2 thousand per year throughout the considered time period, and the percentage of articles with code availability has shown mild growth over the considered time horizon as well.

Figure. Top chart: average proportion of abstracts indicating code availability in arXiv articles within several categories (solid coloured lines) and standard errors (coloured shaded regions). Bottom chart: total quantity of articles in each category is displayed as dashed coloured lines.
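The caption does not state how the standard errors were computed. One common choice, assumed here purely for illustration, is the binomial standard error of a proportion, \\(\sqrt{U_{s,y}(1-U_{s,y})/N_{s,y}}\\). The function name and the counts below are hypothetical.

```python
import math


def proportion_standard_error(c, n):
    """Binomial standard error of the proportion c / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    u = c / n
    return math.sqrt(u * (1.0 - u) / n)


# Hypothetical counts: 200 of 1000 abstracts mention code.
se = proportion_standard_error(200, 1000)
print(se)
```

Under this assumption, the shaded regions are narrower for categories and years with more articles, since the standard error shrinks as \\(N_{s,y}\\) grows.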

The simplistic analysis here suggests that in the considered popular data science domains, code availability has increased in recent years, with up to one in five academic publications now having available code references. This is a great first step towards gold standard in reproducibility of published research. A more complete discussion of reproducibility and replicability can be found in (NASEM, 2019).

How can practicing data science teams be more prepared to translate academic research to production models? I propose that computer programming for data science is a critically relevant skill and part of the answer to this question given the positive dynamics in code availability above.

In my experience, computer programming is a frequently utilized and important element of data science project delivery in industry, especially in production models which rarely leverage standardized out-of-the-box pre-implemented models. Perhaps unsurprisingly, data science teams that I have worked with often had computer programming for data science as a required skill. It seems appropriate to highlight that data science-specific computer programming is a form of coding acumen that should warrant project delivery outcomes, which include academic research results analysis and testing, as I introduce through examples provided in the following section.

Examples of industry projects and their connection with computer programming skills for data science

This section describes two data science project implementations in industry within the context of these projects’ connection with computer programming skills and reproducibility of relevant academic research. With this specific context in mind, I describe certain aspects of these projects at a high level in order to maintain confidentiality while aiming to provide sufficient detail and identify potential academic citations as relevant candidates for implementation. As I describe below, a connection of the delivered models with computer programming and reproducibility of relevant academic research is an important practical element of the project delivery process.

Each of the two project descriptions outlined in this section uses the following structure. First, I introduce the business context and challenge addressed by the delivered model. Then I describe how research literature review and candidate model implementation were combined as part of project delivery. Finally, in each case I highlight the link between research reproducibility and computer programming skills for data science practitioners.

1. Testing for surprising events at retail stores

The first example is a project to implement a model to prioritize (test) for notable events out of all records of ordering, inventorying, and selling products at store locations of a large national retailer. As stores are franchised to individual owners, each store could exhibit its own patterns in these events, which a central enterprise team could utilize to recommend improvements to store owners for how to operate their respective stores to minimize product waste, better serve customers, and improve profitability.

The task of defining and monitoring events to improve service to customers can be considered from an academic perspective. Defining the right quantities to measure event outcomes, as well as testing procedures taking into account the time-dependent nature of underlying data on store orders, inventory, and sales, can be considered as a research project in itself. However, in this real world industry setting, the project team had a limited time to put in place a production model for use by the central enterprise team. We therefore needed to conduct a literature search while concurrently prototyping and comparing candidate models to deploy in production.

The literature search process yielded several references, including univariate hypothesis test methods adapted to the technology sector (Ng, 2019), time series-based methods (Lin et al., 2003), (Shipmon et al., 2017), and methods spanning a range of other models (Liu et al., 2008), (Ma & Perkins, 2003). As with the next example, the identified references came in different forms. For example, the material included in Andrew Ng’s course included code, while the findings of the team from Google (Shipmon et al., 2017) were described in a written report without specific code references. During our project implementation, the availability of computer programming acumen among our data science team members was an important factor determining the extent to which the methods described in considered references were reproduced in candidate models for production deployment.
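To give a flavour of the simplest family of methods mentioned above, here is a minimal sketch of a univariate Gaussian test that flags values far from the historical mean. It is not the model delivered on the project: the function name, the threshold of three standard deviations, and the sales figures are hypothetical, and the sketch ignores the time-dependent structure that the time series methods cited above are designed to handle.

```python
import statistics


def flag_surprising(history, new_values, z_threshold=3.0):
    """Flag new values lying more than z_threshold standard deviations from the historical mean."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        raise ValueError("history has no variation")
    return [abs(x - mean) / stdev > z_threshold for x in new_values]


# Hypothetical daily unit sales for one product at one store.
history = [20, 22, 19, 21, 20, 23, 18, 21]
flags = flag_surprising(history, [21, 45])
print(flags)
```

A prototype of this size can be written and compared against alternatives in hours, which is the kind of iteration speed that a constrained project timeline calls for.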

2. Inference of grocery customer food tastes

The second example is a project to implement a model to estimate (infer), in an interpretable way, grocery customers’ underlying food tastes based on their shopping records. This project proved a challenge due to a high volume of recorded purchase data patterns and their variations across customers. Once solved to a sufficient degree, the model’s output could be incorporated into guiding a retail client’s strategy and technology systems to better cater to customer needs.

At the outset of the project, it became immediately clear to the team that individual food tastes of customers and underlying data to be summarized presented a fertile ground for academic thought. Relevant methodologies concern defining the right quantity to be estimated from available data, and performing estimation taking into account unique characteristics of time-dependent patterns in how customers shop for groceries. As in the prior example, these topics could be a subject of an academic research project. However, given enterprise project delivery realities, the time period available to deliver a production model was limited to months rather than years. Team members needed to use available time wisely, performing literature review concurrently with prototyping and comparing candidate models.

During the literature review process, directly relevant research approaches in literature identified by the team included Explicit Semantic Analysis (Gabrilovich & Markovitch, 2007), and Latent Dirichlet Allocation (LDA) modelling (Blei, Ng, & Jordan, 2003) and its variants such as Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) (Hansen, 2013). Other modelling approaches relevant to the project included models for text, sequences, or time series such as deep neural network formulations (Hochreiter & Schmidhuber, 1997), or alternative topic inference models (Kong, Scott, & Goerg, 2016). These methods provided a window into how (with sufficient effort) available data could be summarized and interpreted as customer food tastes suitable for enterprise decision making.

As with the previous example, each of the research references identified above was “ready to be implemented” to a different degree. Some references were available as established community-driven software packages (e.g., ESA and LDA), some were in the form of a written report (e.g., referenced work from Google), and others represented a mix of reproducibility characteristics (e.g., LDA-STWD, available as a report and a code prototype on GitHub). Additionally, all identified research references had varying characteristics when it came to their computational efficiency at high data volumes. The presence of sufficient computer programming skills among data scientists on the project team was a necessary ingredient to successfully navigate the prototyping of identified methods while staying within project timelines.

Conclusion

Computer programming for data science is a necessary team skill in translating research results into production models in industry. The level of maturity of this skill in data science teams often serves as a bottleneck in the rate with which teams are able to iterate through working implementations of candidate models.

In practicing data science teams, computer programming skills are a subject of continuous improvement. Here are a few effective strategies that I have experienced in industry:

  • Learn through practice. When practicing computer programming for data science, it is best to do so while solving a concrete problem. No problem is too small when it comes to learning coding. While it is often effective to practice on problems from day-to-day data science work, online learning and academic resources abound, ranging from websites such as Codecademy and edX to traditional educational programs.
  • Share your code with others. From getting your classmate’s or fellow teammate’s take on your code to posting it on your profile on GitHub, there are many opportunities to responsibly share your learning with your peers. This can help get more feedback on your progress. Check with your team lead or professor for their advice on how to best share and get feedback on your code.
  • Get inspiration from other domains. For example, Computer Science as an academic field offers much to learn in effective coding strategies. In industry settings, looking at code examples from environments other than those you are used to can spur ideas for how to improve your own coding acumen.
  • Write code for reuse. When practicing, imagine that someone else may need to run your code. If you mean for others to use your code, try to improve its ease of use by turning code snippets into functions, breaking up long scripts into logical “chapters”, adding comments within the code, or even writing stand-alone documentation.
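As a small illustration of the last strategy, the sketch below turns a one-off calculation into a documented function and lays out the script in commented “chapters”. The function name and the figures are hypothetical and serve only as an example.

```python
def percent_change(old, new):
    """Return the percent change from old to new.

    Raises ValueError when old is zero, because the change is undefined.
    """
    if old == 0:
        raise ValueError("old must be non-zero")
    return 100.0 * (new - old) / old


# --- Chapter 1: inputs (hypothetical weekly sales figures) ---
last_week, this_week = 200, 250

# --- Chapter 2: computation ---
change = percent_change(last_week, this_week)

# --- Chapter 3: reporting ---
print(f"Weekly sales changed by {change:.1f}%")
```

A teammate can now reuse or test the function without reading the rest of the script, and the chapter comments show where inputs end and reporting begins.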

In addition to computer programming skills, effective data science teams in industry exhibit other skillsets such as foundational training in related academic fields, project ownership skills, communication, and domain context knowledge in industry domains where they operate. An excellent discussion of relevant skills and academic training can be found in (Irizarry, 2020), (Berthold, 2019), (Garber, 2019). In practical settings, each data science project that needs to be delivered presents itself with unique requirements. Almost invariably, however, computer programming for data science is a staple required to incorporate academic research results in production model development, deployment, and maintenance.

Therefore, computer programming for data science is an important ingredient in building the skills of an effective, research-oriented data science team. However, there is much more to delivering value from data than just coding.

As introduced in (Irizarry, 2020), data science teams can include teammates with non-overlapping, or partially overlapping skills. Further, our teams need to effectively engage relevant stakeholders, which often includes navigating overlapping team skills and responsibilities for delivering real world outcomes based on data science project objectives. Few would question that the field is maturing, with data science teams facing a greater responsibility in how solutions are developed, delivered, and maintained across sectors. Areas where such levels of responsibility have been historically high include Statistics heavy domains such as US Census, where rigour in approaching team skillsets, delivery patterns, and stakeholder engagement has been motivated by the importance of Census as a vehicle for informing a wide ranging array of applications. Indeed, recent changes to the anonymization aspect of Census data are stimulating serious discussions of challenges and opportunities they bring, and rightfully so as emphasized in (Meng, 2020).

It is comforting that the community is starting to have a rigorous discussion about best practices in data science as a discipline mandated to generate value from data (Irizarry, 2020). In industry settings, data science is increasingly viewed as a business capability (Omnia AI, 2018). Viewed as part of a capability to deliver business value from data, data science teams need to participate in discussing the relevant questions in the community. For example, how can we borrow learnings from the more mature domains such as US Census, to improve how data science is deployed as a capability? How do we optimally structure data science teams to respond to the increasing requirements of generating value from data? What are the optimal ways to bring scientific elements to industry projects? How can data science teams best engage business stakeholders in order to deliver more value from data in various industry domains? What academic research methods can data science teams in industry use in order to minimize the risk of “black swan” events when production models break and cause adverse outcomes for an unforeseen reason?

Answers to questions related to how data science teams can deliver value from data are diverse, highly domain-specific, and require collaborative input from academic researchers, practicing data scientists, and business stakeholders. It is encouraging to see the increased rigour in developing the discussion and equipping the community with the right skills, tools, and methodologies to deliver value in data science.

Bibliography

Berthold, M. R. (2019). What Does It Take to be a Successful Data Scientist? Harvard Data Science Review, 1(2).

Blei, D., Ng, A., & Jordan, M. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJCAI, 7, 1606-1611.

Garber, A. M. (2019). Data Science: What the Educated Citizen Needs to Know. Harvard Data Science Review, 1(1).

Hansen, J. A. (2013). Probabilistic Explicit Topic Modeling. Brigham Young University ScholarsArchive.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Irizarry, R. A. (2020). The Role of Academia in Data Science Education. Harvard Data Science Review, 2(1).

Jordan, M. I. (2019). Artificial Intelligence—The Revolution Hasn’t Happened Yet. Harvard Data Science Review, 1(1).

Kong, J., Scott, A., & Goerg, G. M. (2016). Improving topic clustering on search queries with word co-occurrence and bipartite graph co-clustering. Google Inc (to appear).

Lin et al. (2003). A symbolic representation of time series, with implications for streaming algorithms. Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2-11.

Liu et al. (2008). Isolation forest. Eighth IEEE International Conference on Data Mining, 413-422.

Ma, J., & Perkins, S. (2003). Time-series novelty detection using one-class support vector machines. Proceedings of the International Joint Conference on Neural Networks, 1741-1745.

Meng, X.-L. (2020). 2020: A Very Busy Year for Data Science (and for HDSR). Harvard Data Science Review, 2(1).

NASEM. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press.

Ng, A. (2019, 03 01). Anomaly Detection Using the Multivariate Gaussian Distribution. Retrieved 03 01, 2019, from Coursera: https://www.coursera.org/lecture/machine-learning/anomaly-detection-using-the-multivariate-gaussian-distribution-DnNr9

Omnia AI. (2018, June 1). Deloitte’s Artificial Intelligence Practice. Retrieved February 9, 2020, from Deloitte: https://www2.deloitte.com/ca/en/pages/deloitte-analytics/articles/omnia-artificial-intelligence.html#

Peng, R. D. (2011). Reproducible research in computational science. Science, 1226-1227.

Sanders, N. (2019, June 22). A Balanced Perspective on Prediction and Inference for Data Science in Industry. Retrieved from Harvard Data Science Review: https://doi.org/10.1162/99608f92.644ef4a4

Shipmon et al. (2017). Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data. arXiv, 1-9.