Archive for the ‘Research’ Category

How to convert d3.js SVG to PDF

Monday, January 30th, 2012

Consider the following plot:


The image above is an SVG created using d3.js. I am going to use it in a paper, so I need to convert it to PDF. As it happens, how to do this is not common knowledge, as mentioned in this blog (which is, by the way, a good and balanced source of information relevant to our profession). Here is a short tutorial on how to do it: (more…)

Interactive MCMC visualizations project

Thursday, September 29th, 2011

Recently I became actively involved in a Markov Chain Monte Carlo visualization project with Marc Lipsitch and Miguel Hernán of the Department of Epidemiology at the Harvard School of Public Health. The work is being done jointly with Sarah Cobey. The idea is to create a set of interactive visualizations that comprehensively describe and explain how MCMC and its variants operate. Figure 1 shows a screenshot of the first sketch of an interactive visualization of a trace plot.


Figure 1: A screenshot of the first version of an interactive trace plot of an MCMC chain.


I am very excited about the project, both from the technical perspective of making interactive visualizations work with a minimum amount of code, and because I think the idea of conceptualizing MCMC through visualization is innovative and will be very useful for those trying to understand the process better.

The tools we are using in development are HTML, CSS, JavaScript and d3.js.
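
For context on what such a plot shows, here is a minimal static analogue in R: a toy random-walk Metropolis sampler targeting a standard normal, not the project's code. The d3.js version makes this kind of trace interactive.

```r
# Toy example (not the project's code): generate and plot the trace of a
# random-walk Metropolis chain targeting a standard normal distribution.
set.seed(42)
n_iter <- 5000
chain  <- numeric(n_iter)
chain[1] <- -3                                   # deliberately poor starting value

for (i in 2:n_iter) {
  proposal   <- chain[i - 1] + rnorm(1, sd = 1)  # random-walk proposal
  log_accept <- dnorm(proposal, log = TRUE) - dnorm(chain[i - 1], log = TRUE)
  chain[i]   <- if (log(runif(1)) < log_accept) proposal else chain[i - 1]
}

plot(chain, type = "l", xlab = "iteration", ylab = "sampled value",
     main = "Trace plot of a Metropolis chain")
```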

JSM2011, and a final stretch at RDS

Thursday, August 18th, 2011

The Joint Statistical Meetings conference took place in Miami Beach on July 30–August 5. It went very well, and the definite highlight was the keynote lecture by Sir David Cox. Among the other sessions, the following stood out:

  1. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics by Georg M. Goerg, a student at CMU Statistics. The talk was very enjoyable, and the ideas it conveyed were crisp and exciting. The main one is that a zero-mean time series can be treated like a histogram by representing it through its frequency distribution, which allows for an elegant non-parametric classification approach: minimize the KL divergence between observed and simulated frequency histograms (a toy sketch of this idea appears after the list).
  2. Large Scale Data at Facebook by Eric Sun from Facebook. Though not groundbreaking, the talk was exciting in that it described the work environment at Facebook and the approach taken to getting signal out of massive data. It mostly presented curious facts from analyzing the frequencies of word occurrences in user status updates, with the interesting part being the analysis framework developed to do that.
  3. Jointly Modeling Homophily in Networks and Recruitment Patterns in Respondent-Driven Sampling of Networks by my advisor Joe Blitzstein, about our most recent research on model-based estimation for Respondent-Driven Sampling (RDS). The approach we are developing looks to have several very attractive features compared with current estimation techniques, and it is designed for the case of homophily of varying degree. An example is illustrated in Figure 1.

    Figure 1: An example of homophily, with the network plotted over the histogram of the homophily-inducing quantity (left), and the resulting (normalized) vertex degrees plotted over the same histogram (right).

    We hope to finish the relevant paper soon and open the approach to extensions by the research community.
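
Coming back to the first talk, here is a toy sketch in R of the general flavor of that idea as I took it away, with simulated series; it is not Goerg's actual algorithm. It treats the normalized (lightly smoothed) periodogram of a zero-mean series as a frequency histogram and compares series via KL divergence.

```r
# Toy sketch of the frequency-histogram idea (not Goerg's actual algorithm):
# compare zero-mean series via the KL divergence of their normalized periodograms.

kl <- function(p, q) sum(p * log(p / q))        # KL(p || q) for strictly positive p, q

freq_hist <- function(x) {
  spec <- spec.pgram(x, spans = 5, plot = FALSE)$spec  # lightly smoothed periodogram
  spec / sum(spec)                                     # normalize so it sums to one
}

set.seed(1)
x <- arima.sim(list(ar = 0.8), n = 512)         # two AR(1) series with similar dynamics
y <- arima.sim(list(ar = 0.8), n = 512)
z <- rnorm(512)                                 # white noise: different dynamics

kl(freq_hist(x), freq_hist(y))                  # typically small divergence
kl(freq_hist(x), freq_hist(z))                  # typically much larger divergence
```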

During the conference, I also had a chance to finish making a dynamic 3D visualization of a constrained optimization algorithm I developed for In4mation Insights, which is exciting. As for Miami Beach itself, it is a great place to go out and enjoy the good food, sun and beach. JSM2012 will be held in San Diego.

I created the visualization in this post using Processing.

Optimization, experiment design, and Sir David Cox

Wednesday, August 3rd, 2011

For almost a year now, I have been involved in a project to build a global marketing mix optimization solution for a large consumer packaged goods company. Conceptually, the problem is simple: given a fitted model of the company's revenue as a function of the promotion campaigns for its products, and using the past year's promotion campaign allocation scenario as a starting point, find a revenue-maximizing scenario subject to promotion expenditure constraints.
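
To make the setup concrete, here is a minimal sketch in base R with a made-up revenue model and a single overall budget cap. The real problem involves a client-specific fitted model and its own optimization algorithm, so this only illustrates the shape of the problem.

```r
# Minimal sketch with a made-up revenue model (not the client's fitted model):
# maximize revenue over a promotion-spend allocation x, starting from last
# year's allocation, subject to x >= 0 and a total budget cap.
revenue <- function(x, beta = c(5, 3, 2)) sum(beta * log1p(x))  # diminishing returns

x0     <- c(10, 10, 10)   # last year's allocation, used as the starting point
budget <- 40              # total promotion expenditure cap

# constrOptim() encodes linear constraints as ui %*% x - ci >= 0:
# the first three rows give x_i >= 0, the last row gives sum(x) <= budget.
ui <- rbind(diag(3), rep(-1, 3))
ci <- c(rep(0, 3), -budget)

fit <- constrOptim(theta = x0, f = revenue, grad = NULL, ui = ui, ci = ci,
                   control = list(fnscale = -1))  # fnscale = -1 turns it into maximization

fit$par    # revenue-maximizing allocation under the constraints
fit$value  # revenue at that allocation
```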

Figure 1: A visualization of a step in a solution of an optimization problem. To see the full dynamic visualization, go to theory.info.

The problem becomes more interesting when we go into details. (more…)

theory.info, a new project

Tuesday, July 12th, 2011


Recently I purchased the domain and created an interactive logo/visualization for Theory Information Analysis, a screenshot of which is presented above. Theory is a new project that I would like to represent my applied, real-world work, including quantitative consulting and applied research. (more…)

Dynamic visualization, paper supplement 1

Saturday, May 28th, 2011

(more…)

Dynamic visualization, paper supplement 2

Saturday, May 28th, 2011

(more…)

Data science term in The Economist

Thursday, May 19th, 2011

It seems that there is no stopping it now: the term data science appears prominently in the headline article of the current issue of The Economist.

The Economist

Compared with the rest of America, Silicon Valley feels like a boomtown. Corporate chefs are in demand again, office rents are soaring and the pay being offered to talented folk in fashionable fields like data science is reaching Hollywood levels. And no wonder, given the prices now being put on web companies.

It is indeed quite misleading that the term has the word science in it, as it implies an established field, while in fact the science of data is statistics. I wrote a post on the subject earlier in an attempt to single out what it is that distinguishes data science from statistics. That aside, however, the article is supportive of the rise in demand for our profession, which is good news for the specialists. Hopefully, the tech bubble mentioned there won't be inflated further by people who misuse the term data science.

Does homophily exist?

Monday, May 9th, 2011

This spring I gave two talks: one at the New England Statistics Symposium (NESS), hosted by the Department of Statistics at the University of Connecticut, and a post-qualifying talk in my home department. Both talks were on my work with my advisor Joe Blitzstein, and both drew heavily on the term homophily. The first talk presented refined simulation study results on design-based estimation, and the second was about model-based estimation under Respondent-Driven Sampling. For the latter, we consider the data collected within a recent study of populations at high risk of HIV conducted in San Diego. The study took nearly two years to complete and was aimed at collecting information describing behavioral and health aspects of the target population. It is both a privilege and a responsibility to be commissioned to analyze the collected data, as the results of the analysis may be used for subsequent policy decisions. Figure 1 shows the (anonymized) recruitment trees of the study.


Figure 1: San Diego study recruitment trees as functions of HIV status. On the x axis, "observation" refers to the HIV status group.

(more…)

Drastic R speed-ups via vectorization (and bug fixes)

Friday, April 29th, 2011

Figure 1: A screenshot of the corrected and enhanced dynamic visualization of RDS. Green balls are the convenience sample, pink balls are subsequently recruited individuals, pink lines are links between network nodes that have been explored by the process, and the numbers in the circles correspond to sample wave numbers.

It is common to hear that R is slow, so when I needed to scale old R code (pertaining to material described in this post) to operate on data 100 times larger than before, I was initially at a loss. The problem with the old code was that it took several days and about 4000 semi-parallel jobs to complete; with the size of the data increasing by a factor of 100, the task was becoming infeasible. Eventually, however, I was able to achieve an over 100-fold speedup of the R code by addressing two issues: (more…)
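
The details are behind the link, but to illustrate the first ingredient, here is a toy example (not the project's code) of the kind of change vectorization entails: an element-by-element loop replaced by a single vectorized expression operating on whole vectors.

```r
# Toy example (not the project's code): the same computation written as an
# explicit loop and as a single vectorized expression.
n <- 1e6
x <- runif(n)
y <- runif(n)

slow <- function(x, y) {                # loop: one element at a time
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- sqrt(x[i]^2 + y[i]^2)
  }
  out
}

fast <- function(x, y) sqrt(x^2 + y^2)  # vectorized: whole vectors at once

system.time(r1 <- slow(x, y))           # noticeably slower
system.time(r2 <- fast(x, y))           # much faster
all.equal(r1, r2)                       # TRUE: identical results
```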