## Posts Tagged ‘visualization’

### Dataviz needs good UI tools

Tuesday, May 12th, 2015

A few weeks ago I set out to participate in a geospatial visualization challenge. It was exciting to finally get a chance to make an interactive map with more than markers and basic mouseover effects. It was also intriguing to try and display uncertainty on a map.

But even the basic visualization below took more time than planned.

See databit Geospatial Prototype by Sergiy Nesterko on Databits.

What I realized very soon after starting was that I was spending 80% of the time implementing the user interface for my visualization. This UI experiment, for example, combines a tooltip with a clickable menu. It took the most time and still isn’t well rounded!

This is a real problem. I want to think about how to make an aesthetically pleasing and effective visualization, and not endlessly debug a misbehaving UI element.

Can the lack of good open source UI elements be a limitation for dataviz? Are makers of visualizations constrained in their creativity by the lack of good UI elements? I set out to investigate. Here is a competing challenge entry by Quinn Lee:

See databit medical geospatial by Quinn Lee on Databits.

Here Quinn uses DC.js, a very nice charting library built on the generalized Crossfilter library.

The main UI elements of Quinn’s data visualization are the plots themselves — you can click on their parts, zoom in/out on maps, and highlight a range of the time series plot. All these interactions change the way the plots look. This is great. The only caveat is that this UI pattern is native to DC.js and is not very generalizable should you decide to, say, replace the maps in Quinn’s visualization with maps built with d3.geo.

Here is another challenge entry, by Moritz Winger.

See databit GeoSpatial Challenge by Moritz Winger on Databits.

I really like the colors and style of the piece.

In terms of UI, you can interact with the visualization by hovering over different elements, and clicking some of them. The question is this: would a rich open source UI element library help Moritz make an even better visualization?

I would like to see these elements available when making an interactive visualization:

1. Good tooltip menus for hover effects. D3-Tip is a great start, and no wonder it gets so much usage and development. But it does not always work.
2. Good contextual menus. That is what I am working to implement with my TooltipMenuHybrid class in the first visualization in this post.
3. Good open source libraries for positioning plots in a mobile-friendly layout and for letting users drag and resize them. Again, my LittlePlot class above is one such experiment.
4. Libraries for keeping the state of the visualization in the URL bar.
5. …and so on.
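As an illustration of item 4, here is a minimal sketch of round-tripping visualization state through a URL-friendly string; the function names and state fields are hypothetical:

```javascript
// Serialize a flat state object into a query-style string suitable for
// the URL hash, and parse it back. Keys are sorted so equal states
// always produce the same string.
function encodeState(state) {
  return Object.keys(state)
    .sort()
    .map(function (k) {
      return encodeURIComponent(k) + "=" + encodeURIComponent(state[k]);
    })
    .join("&");
}

function decodeState(encoded) {
  var state = {};
  encoded.split("&").forEach(function (pair) {
    if (!pair) return;
    var parts = pair.split("=");
    state[decodeURIComponent(parts[0])] = decodeURIComponent(parts[1]);
  });
  return state;
}
```

In a browser one would write the encoded string to `location.hash` and restore it on page load; the sketch leaves that wiring out.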

That said, it has certainly become easier than ever to analyze data and convey data-driven insight. The growth in open source tools and related resources for dataviz isn’t slowing down. That means that the set of UI options available to creative coders will keep expanding, and the bottleneck I describe here will soon go away.

### Responsive design for data visualization

Monday, March 2nd, 2015

Interactive data visualization is an increasingly important tool for conveying analytic information. It is used in teaching, product design, data journalism, scientific research, and many other fields.

Much of the analytic insight that we consume every day goes through the minds and hands of data scientists. Data scientists own data analysis; it is our profession. Sometimes we need to create interactive visualizations to communicate our work.

Our audience is increasingly diverse: analytics product owners, researchers, journalists, and the general public are just a few examples. How do we make sure our visualizations are maximally understandable and usable by our target audience?

In order to make interactive visualization more usable, data scientists need to start thinking a little bit like designers. Yes, it means expanding the significant skill set we already possess even further, or partnering with a designer. With the abundance of information and tools currently available, it’s not that hard — and even fun.

A cornerstone of good design is responsiveness. The interactive visualization we make needs to look good on any screen width. Here is an example, an interactive dashboard of edX learner progress by Zach Guo:

Zach’s visualization adapts to different widths of the containing box. See it for yourself by clicking and dragging the resize handle on the bottom right of the visualization:

Here is another example, a network visualization I made some time ago. It also adapts to different screen widths.

If your visualization is web-based (like both visualizations above), most of the time you can make it responsive by defining different CSS layout rules based on screen width. Here’s how I did it in the visualization above (check out the source code for a complete picture of what’s going on):

```css
/* Widen the info panel only on viewports at least 1150px wide */
@media (min-width: 1150px) {
  .ivcont.info {
    width: 250px;
  }
}
```
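The media query handles layout, but an SVG plot usually needs its width and height recomputed as well. Here is a sketch of a helper one might pair with such a rule; the 1150px breakpoint mirrors the CSS, while the margins and aspect ratio are illustrative:

```javascript
// Compute plot dimensions from the container width: tighter margins on
// narrow screens, and a fixed aspect ratio so the plot scales cleanly.
function plotSize(containerWidth) {
  var margin = containerWidth >= 1150 ? 40 : 20;
  var width = containerWidth - 2 * margin;
  var height = Math.round(width * 0.6);
  return { width: width, height: height, margin: margin };
}
```

A resize handler would call this and reset the SVG’s width and height attributes before redrawing.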

There is more discussion of data visualization usability in the materials of a course on statistical computing and interactive visualization that I taught at the Harvard Statistics department in Spring 2013 (for example, see Lecture 24).

Data scientists own data analysis. As part of making the end result accessible to the target audience, responsive design for data vis can be a fun addition to our toolkit.

### Effective way for data scientists to grow impact

Sunday, October 5th, 2014

In order to get things done people need to communicate effectively. At school, teachers present to students. In consulting, consultants make powerpoint slide decks. In research, researchers make presentations and talks to spread their ideas.

When it comes to data scientists, many of us write code (in R, Python, Julia, etc.) to analyze data and inform decisions. To many people, what we do is rocket science.

What is the most effective and easy way to spread our ideas and grow impact? A good answer is interactive visualization. And not just for data scientists, but for anyone working with analytics.

Sure enough, pretty and intuitive graphics are a good way to deliver insight. And, with modern technologies interactive visualization can grow into products, viral marketing campaigns, and journalism pieces.

I have been doing interactive visualization for a while. Below is a visualization I made to explore geographic enrollment patterns of HarvardX. What started as an exploratory project ended up as a product — an interactive analytics platform we called HarvardX Insights. It ended up on the cover of Campus Technology, and several universities from around the world contacted HarvardX to get the code.

See databit HarvardX Certificates World Map by Sergiy Nesterko on Databits.

And here is something for data scientists — a visualization of the Hamiltonian Monte Carlo algorithm. I taught it to my students last year during a graduate course on statistical computing and interactive visualization at Harvard Statistics. This visualization was one of several I created for the course together with students.

See databit Hamiltonian (Hybrid) Monte Carlo by Sergiy Nesterko on Databits.

People who work with data increasingly need to acquire and apply creative coding skills in order to put their ideas to work. This helps them come closer to the end user of an analytic insight and avoid possible operational distortions and dead ends along the way. That’s why resources that promote and teach creative coding are in high demand among my peer data scientists. I am a big fan of Mike Bostock’s Blocks, and of other resources such as Codepen, JSFiddle, and Stack Overflow.

Recently, I have been using and contributing to Databits more and more. Databits is a website for data scientists, data journalists, and other creative coders to share work, connect, and grow impact. I believe that eventually the site will let users be more targeted: to specifically learn from and follow peer data scientists and other creative coders who are focused on producing effective interactive visualization and other cool stuff. For example, I look forward to learning some Processing applications from this guy. In the meantime, I helped put together a simple databit based on Processing.js:

See databit Processing.js Hello World Sketch by Sergiy Nesterko on Databits.

The site also runs Challenges, an initiative aimed at finding meaningful problems for creative data scientists to solve, and put on their portfolios. I find this pretty cool.

I look forward to learning new things, finding cool problems, and making the world a better place with data. Now my creative work has a home — you can check out my creative endeavors and interests on my Databits profile page.

### Gender Balances: A look at the makeup of HarvardX registrants

Thursday, December 5th, 2013

Although the first semester of the 2013/14 academic year is coming to a close on campus, and residential students are finishing up coursework and preparing for the break, the timelines are more asynchronous for students registered for the 10 currently running online offerings. This batch of 10 consists of courses and modules launched by HarvardX at different times during the Fall of 2013.

While course development teams are working to create the most stimulating learning experiences and thinking about whether and how to give students a mini winter break in their courses or modules (or summer break for those in the southern hemisphere), the research team is busy studying the troves of data produced by past and current online offerings, working with course developers to set up learning experiments, and helping to facilitate research-based innovation at HarvardX.

As part of our work to inform course development and research, the research team generated course-specific and HarvardX-wide worldwide gender composition data.

The interactive visualization above shows self-reported gender composition data for all past and current online offerings, as well as for HarvardX overall. Choosing an item from the drop-down menu shows data on a particular course or module on the left, while the chart on the right displays overall gender composition to facilitate comparison. Hovering the mouse over the chart brings up the specific numbers used to calculate percentages. As of November 17, 7-10% of students in different offerings didn’t specify gender information, which is reflected in the Missing category. Checking the box ‘Only male/female’ keeps only these two categories and calculates the percentages using the total number of reported males and females as the denominator. The data specification file, including the source code and technical information, can be accessed here.

The HarvardX student body is estimated to be mostly male (62% as of November 17, 2013), although there is considerable variation in gender balance from one offering to the next. For example, CS50x Introduction to Computer Science has decidedly more male students (estimated 79%), while both offerings on Poetry in America register mostly female students (estimated 57% and 61%). Some courses have almost equal percentages of female and male students. For example, GSE1x Unlocking the Immunity to Change, launching in Spring 2014, so far has registered an estimated 51% females and 49% males. We generally do not recommend interpreting the overall HarvardX average when overall enrollment is so heavily influenced by a small number of courses (e.g., Computer Science and the Science of Cooking).

In order to gain a better understanding of gender balance in HarvardX offerings, we made a world map, showing estimated gender composition of our students enrolled from different countries around the world.

The map above is an interactive visualization of the estimated worldwide gender composition of students enrolled in HarvardX offerings. Blue means the balance is tilted towards male registrants, yellow towards female registrants, and green indicates approximate parity. Hovering the mouse over a country brings up the exact estimated proportions of female and male registrants and the numbers the estimation is based on. Estimation was performed under a Missing At Random assumption for the missing data, and countries with fewer than 100 detected students are not colored, as their estimated percentages can have a margin of error greater than ±5 percentage points. Choosing items from the drop-down menu brings up worldwide gender composition for a particular offering. The data specification file, including the source code and technical information, can be accessed here.
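The 100-student cutoff can be motivated with a quick normal-approximation calculation. The sketch below is my own illustration, not the research team’s actual code:

```javascript
// 95% margin of error for an estimated proportion p based on n students,
// using the normal approximation to the Binomial.
function marginOfError(p, n) {
  return 1.96 * Math.sqrt(p * (1 - p) / n);
}
```

At the worst case p = 0.5, a country with 100 students has a margin of error of about ±9.8 percentage points, comfortably above ±5 points; it drops below ±5 points only around n ≈ 385.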

In most countries of the world, the estimated gender balance is tilted towards males, with the pattern strongest in African and South Asian countries. Exceptions include the Philippines, Georgia, Armenia, Mongolia, and Uruguay, where the overall estimated HarvardX gender balance is either close to 50% or tilted towards females. Possible explanations for this finding include cultural trends, selective registration in courses that are more popular among females, Internet access, economic factors, etc.

Individual HarvardX offerings exhibit very different patterns in worldwide gender composition. For example, MCB80.1x Fundamentals of Neuroscience, Part I (launched at the end of October) shows gender parity in the US, Canada, and Australia, while we estimate more female students from the Philippines, Argentina, and Greece. Other countries, including China, India, Pakistan, France, and Sweden, are estimated to have more male students. At the same time, the recently launched PH201x Health and Society exhibits strong female enrollment from many countries around the world, while India, China, and Pakistan, as well as other countries, are still estimated to have more male students in the course.

There could be many possible explanations for the observed picture of worldwide gender composition in HarvardX offerings. One aspect to consider is popularity of courses in certain fields among males or females depending on the context of a particular country. For example, gender balance in the US varies greatly from one course to the next. The ways in which online learning (edX, HarvardX, and beyond) is perceived and promoted in a particular country through advertising, word of mouth, and other means may also have some influence on who ends up enrolling for courses. There are also other country-specific factors such as cultural setting, Internet access, religion, and others, all of which may contribute to the gender balance patterns we are observing.

One parallel that I find interesting is comparing worldwide gender compositions in HarvardX offerings and residential education.

The picture above is taken from UNESCO’s Worldwide Atlas of Gender Equality in Education from 2012, and visualizes worldwide gender composition in tertiary education. Yellow color means that there are more females enrolled in tertiary education than males, green means parity, and blue means that there are more males.

What’s interesting about UNESCO’s gender parity map and our interactive visualization of worldwide gender composition for HarvardX offerings is that they would match if residential tertiary education exhibited the same gender enrollment patterns as HarvardX. However, while there are similarities, the two pictures don’t quite match. On average, across multiple countries, more females participate in tertiary (that is, residential) education than they do in HarvardX online courses.

Why is that?

It could be that at HarvardX, technical courses such as CS50x skew the enrollment demographic; enrollment in technical/STEM subjects has been shown to be mostly male in all settings. It could also be that in some countries, on average, women don’t think the initial MOOCs are as relevant to their lives and work as men do. It remains to be seen whether the patterns in these initial gender composition data reflect fundamental differences between the gender demographics of residential tertiary and online education, or whether the observed patterns are due to a limited number of initial online offerings and are specific to HarvardX.

Clearly, our analysis generates more research questions than it answers. Finding and polishing bits and pieces of the puzzle to answer these questions is what makes working at HarvardX research so stimulating.

### HarvardX research: both foundational and immediately applicable

Wednesday, October 23rd, 2013

There is a difference between research and how innovation happens in industry. Research tends to be more foundational and forward-thinking, while innovation in industry is more agile and looks to generate value as soon as possible. Bret Victor, one of my favorite people in interaction design, summarizes it nicely in the diagram below.

Bret Victor’s differences between industry and research innovation

HarvardX is a unique combination of industry and research by the classification above. The team I am part of (HarvardX research) works to generate research and help shape online learning now, as well as to contribute to foundational knowledge. Course development teams, who create course content and define course structure, sit on the same floor as us. Course developers work together with the research team, looking for ways to continuously improve learning and to generalize findings beyond HarvardX to online and residential learning in general. Although the process still needs to be streamlined as we scale the effort, we are making progress. One example is the project on using assignment due dates to get a handle on student learning goals and inform course creation.

Here is how it got started.

As we were looking at the structure of past HarvardX courses, we discovered differences in how graded components were used across courses. Graded components include assignments, problem sets, and exams that contribute to the final grade of the course, which determines whether a student earns a certificate of completion. Below is public information on when graded components occurred for three HarvardX courses.

The visualization above shows publicly available graded components structure for three completed HarvardX courses: PH207x (Health in Numbers), ER22x (Justice), and CB22x (The Ancient Greek Hero). Hovering the mouse over different elements of the plot reveals detailed information, clicking on course codes removes extra courses from display. For PH207x, each assignment had a due date preceding the release time of the next assignment (except the final exam). For the other two courses, students had the flexibility of completing their graded assignments at any time up until the end of the course.

When the due date passes on a particular graded component, students are no longer able to access and answer it for credit. The “word on the street” among course development teams so far has been that it is generally desirable to set generous due dates on graded components, as this promotes alternative (formative) modes of learning by allowing students not interested in obtaining a grade to access the graded components. This way, students who register for a class late also have an opportunity to “catch up” by completing all the assignments they “missed”. However, so far it has been unclear what impact such a due date structure has on academic achievement (certificate attainment rates) versus other modes of learning (the non-certificate track, i.e., leisurely browsing).

Indeed, one of the major metrics of online courses is certificate attainment – the proportion of students who register for the course and end up earning a certificate. It turns out that PH207x achieved an attainment rate of over 8.5%, the highest among all open HarvardX courses completed to date (the average is around 4.5%). Does this mean that setting meaningful due dates boosts academic achievement by helping students “stay on track” and not postpone working on assignments until the work becomes overwhelming? While the hypothesis is plausible, it is too early to draw causal conclusions. It may be that the observation is specific to public health courses, or that PH207x happened to have more committed students to begin with, etc.
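To make the metric concrete, here is a small sketch with illustrative counts (not HarvardX data), along with the kind of two-proportion z statistic one might eventually use when comparing courses:

```javascript
// Certificate attainment: the share of registrants who earn a certificate.
function attainmentRate(certified, registered) {
  return certified / registered;
}

// A standard two-proportion z statistic for comparing attainment between
// two courses; large |z| suggests the difference is not just noise.
function twoProportionZ(x1, n1, x2, n2) {
  var p1 = x1 / n1;
  var p2 = x2 / n2;
  var pooled = (x1 + x2) / (n1 + n2);
  var se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p1 - p2) / se;
}
```

Even a large z statistic would not settle the causal question, for exactly the reasons above: courses differ in subject and audience, not just in due date structure.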

While the effect on certificate attainment is certainly important, an equally important question is what impact due dates have on alternative modes of learning. That is why we are planning an A/B test (randomized controlled experiment) to study the effect of due dates, in close collaboration with course development teams. Sitting on the same floor and being immersed in the same everyday context of HarvardX allows for agile planning, so we are hoping to launch the experiment as early as November 15 or even October 31. The findings have the potential to immediately inform course development for new courses as well as future iterations of current ones, aiming to improve the educational outcomes of learners around the world and on campus.

HarvardX is a great example of a place where research is not only foundational but also immediately applicable. While the combination is certainly stimulating, I wonder to what extent this paradigm translates to other fields, and what benefits and risks it carries. With these questions in mind, I cannot wait to see what results our experimentation will bring and how we can use data to improve online learning.

### Democratization of data science: why this is inefficient

Sunday, November 4th, 2012

The use of data in industry is increasing by the hour, and so is investment in Big Data. Gartner, an information technology research and advisory firm, says spending on big data will reach $28 billion in 2012 alone. This is estimated to trigger a domino effect of $232 billion in spending over the next 5 years.

The business world is evolving rapidly to meet the demands of data-hungry executives. On the data storage front, for example, new technology is quickly being developed under the Hadoop umbrella. On the data analysis front, there are startups that tackle and productize related problems, such as quid, Healthrageous, Bidgely, and many others. What drives this innovation in analyzing data? What allows so many companies to claim that their products are credible?

Not surprisingly, the demand for analytic talent has been growing, with McKinsey calling Big Data the next frontier of innovation. So, let’s make this clear – businesses need specialists to innovate, to generate ideas and algorithms that would extract value from data.

Who are these specialists, and where do they come from? With a shortage of up to 190,000 data specialists projected for 2018, there is a new trend emerging, “the democratization of data science”, which means bringing the skills to meaningfully analyze data to more people:

> The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it’s new education platforms (e.g., Coursera and Udacity) teaching students fundamental skills in everything from basic statistics to natural language processing and machine learning.
>
> Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.

This quote is optimistic at best. Where is the guarantee that a product developed by a “data scientist” with a couple of classes’ worth of training is going to work for the entire market? In academic statistics and machine learning programs, students spend several years learning the proper existing methods, how to design new ones, and how to prove their general validity.

When people without adequate training build analytic products and offer them to many customers, such verification of the product is crucial. Otherwise, customers may soon discover that the product works poorly or not at all, bringing down its ROI. Each customer will then go back and invest in hiring talent and designing solutions that actually work for their case. If all customers have to do this, the whole vehicle of democratized data science becomes significantly inefficient.

Behind each data analysis decision there must be rigorous scientific justification. For example, consider a very simple Binomial statistical model. We can think about customers visiting a website through a Google ad. Each customer is encoded as 1 if he or she ends up purchasing something on the website, and 0 otherwise. The question of interest is: what proportion of customers coming through Google ads ends up buying on the website?

Below is a visualization of the log-likelihood and inverse Fisher information functions. Many inadequately trained data specialists would not be able to interpret these curves correctly even for a simple model like this. What, then, about the complex algorithmic solutions they are required to build on a daily basis and roll out on the market?

We can simply take the proportion of customers who bought something; that will be our best guess at the underlying percentage of buying Google ad website visitors. This is not just common sense: the sample proportion can be shown theoretically to be the best estimator.

The uncertainty about our estimate can also be quantified by the value of the inverse observed Fisher information function (picture, left) at the estimated value of p. The three curves correspond to different numbers of customers who visited our website. The more customers we get, the lower our uncertainty about the proportion of buying customers. Try increasing the value of n: you will see that the corresponding curve goes down, as our uncertainty about the estimated proportion vanishes.
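To make the example concrete, here is a plain-JavaScript sketch (illustrative code, not the code behind the figure) of the Binomial log-likelihood, its maximizer, and the inverse observed Fisher information:

```javascript
// Log-likelihood of p for k buyers out of n visitors, dropping the
// constant binomial coefficient term.
function logLikelihood(p, n, k) {
  return k * Math.log(p) + (n - k) * Math.log(1 - p);
}

// The sample proportion maximizes the Binomial log-likelihood.
function mle(n, k) {
  return k / n;
}

// Inverse observed Fisher information p(1 - p)/n: the approximate
// variance of the estimator, which shrinks as n grows.
function inverseFisherInfo(p, n) {
  return p * (1 - p) / n;
}
```

With 30 buyers out of 100 visitors the estimate is 0.3, and quadrupling n cuts the approximate variance by a factor of four — exactly the curve-lowering behavior described above.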

This is the kind of theory that we need specialists who develop algorithmic products to be equipped with. It requires an investment in their proper education first. If we skip the proper education step, we risk lowering the usefulness and practicality of the products such data scientists design.

### How to convert d3.js SVG to PDF

Monday, January 30th, 2012

Consider the following plot:

The image above is an SVG created using d3.js. I am going to use it in a paper, so I need to convert it to PDF. How to do this turns out not to be common knowledge, as is mentioned in this blog (which is, by the way, a good and balanced source of information relevant to our profession). Here is a short tutorial on how to do it:

### Informative versus flashy visualizations, and growth in Harvard Stat concentration enrollment

Sunday, December 4th, 2011

Some time ago my advisor Joseph Blitzstein asked me to create a visualization of the numbers of Harvard Statistics concentrators (undergraduate students who major in Statistics). The picture would be used by the chair of the department to illustrate the growth of the program for university officials, so I decided to make it look pretty. The first form that came to my mind was showing the enrollment growth over years using a bar plot.

Starting in 2005, the numbers follow exponential growth, which is a remarkable achievement of the department. We then decided to follow the trend and extrapolate by adding predicted enrollment numbers for 2011, 2012, and 2013. At that moment, there was no data for 2011.

### Interactive MCMC visualizations project

Thursday, September 29th, 2011

Recently I have started being actively involved in a Markov Chain Monte Carlo visualization project with Marc Lipsitch and Miguel Hernán from the Department of Epidemiology in the Harvard School of Public Health. Work is done jointly with Sarah Cobey. The idea is to create a set of interactive visualizations to comprehensively describe and explain the concepts of how MCMC and its variants operate. Figure 1 shows a screenshot of the first sketch of an interactive visualization of a trace plot.

Figure 1: a screenshot of the first version of an interactive trace plot of an MCMC chain.

I am very excited about the project, both for the technical challenge of making interactive visualizations work with a minimal amount of code and because I think the idea of explaining how MCMC operates through visualization is innovative and will be very useful for those trying to understand the process better.

The tools we are using in development are HTML, CSS, JavaScript, and d3.js.
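To illustrate the kind of chain whose trace plot we are visualizing, here is a minimal Metropolis sampler in plain JavaScript. It is a sketch, not the project's code: the target is a standard normal, and the tiny Park-Miller generator is there only to keep runs reproducible:

```javascript
// Draw a Metropolis chain targeting a standard normal distribution and
// return the trace (the sampled values in order).
function metropolisTrace(steps, stepSize, seed) {
  // Park-Miller linear congruential generator for reproducible runs.
  var state = seed;
  function rand() {
    state = (state * 16807) % 2147483647;
    return state / 2147483647;
  }
  // Log-density of the standard normal target, up to a constant.
  function logTarget(x) {
    return -0.5 * x * x;
  }
  var x = 0;
  var trace = [];
  for (var i = 0; i < steps; i++) {
    // Symmetric uniform proposal around the current state.
    var proposal = x + stepSize * (2 * rand() - 1);
    // Accept with probability min(1, target(proposal) / target(x)).
    if (Math.log(rand()) < logTarget(proposal) - logTarget(x)) {
      x = proposal;
    }
    trace.push(x);
  }
  return trace;
}
```

Plotting the returned array against the step index gives the kind of trace plot shown in Figure 1.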

### JSM2011, and a final stretch at RDS

Thursday, August 18th, 2011

The Joint Statistical Meetings conference took place in Miami Beach on July 30-August 5. It went very well, and the definite highlight was the keynote lecture by Sir David Cox. Among the other sessions, the following stood out:

1. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics by Georg M. Goerg, a student at CMU Stat. The talk was very enjoyable, and the ideas conveyed were crisp and exciting. The main one was that zero-mean time series can be thought of as histograms by representing them as frequency distributions, which allows for an elegant non-parametric classification approach: minimizing the KL divergence between observed and simulated frequency histograms.
2. Large Scale Data at Facebook by Eric Sun from Facebook. Though not groundbreaking, the talk was exciting as it described the work environment at Facebook and the approach taken to getting signals out of massive data. Mostly, curious facts were presented from analyzing the frequencies of word occurrences in user status updates, with the interesting part being the analysis framework developed to do that.
3. Jointly Modeling Homophily in Networks and Recruitment Patterns in Respondent-Driven Sampling of Networks by my advisor Joe Blitzstein, about our most recent research on model-based estimation for Respondent-Driven Sampling (RDS). The approach we are developing promises several very attractive features in comparison to current estimation techniques and is designed for the case of homophily of varying degree. An example is illustrated in Figure 1.

Figure 1: An example of homophily, with the network plotted over the histogram of the homophily inducing quantity (left), and resulting (normalized) vertex degrees plotted over the same histogram (right).

We hope to finish the relevant paper soon and open the approach to extensions by the research community.
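As a side note, the KL-based classification idea from the first talk above can be sketched in a few lines; this is my own illustration, not Goerg's code:

```javascript
// Kullback-Leibler divergence between two frequency histograms p and q
// (arrays of probabilities over the same bins). Classification then
// amounts to picking the simulated histogram with the smallest
// divergence from the observed one.
function klDivergence(p, q) {
  var d = 0;
  for (var i = 0; i < p.length; i++) {
    if (p[i] > 0) {
      d += p[i] * Math.log(p[i] / q[i]);
    }
  }
  return d;
}
```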

During the conference, I also had a chance to finish a dynamic 3D visualization of a constrained optimization algorithm I developed for In4mation Insights, which is exciting. As for Miami Beach itself, it is a great place to go out and enjoy good food, sun, and beach. JSM2012 will be held in San Diego.

I created the visualization in this post using Processing.