Posts Tagged ‘Joe’

How to convert d3.js SVG to PDF

Monday, January 30th, 2012

Consider the following plot:


The image above is an SVG created using d3.js. I am going to use it in a paper, so I need to convert it to PDF. As it turns out, how to do this is not common knowledge, as mentioned in this blog (which is, by the way, a good and balanced source of information relevant to our profession). Here is a short tutorial on how to do it: (more…)

Informative versus flashy visualizations, and growth in Harvard Stat concentration enrollment

Sunday, December 4th, 2011

Some time ago my advisor Joseph Blitzstein asked me to create a visualization of the numbers of Harvard Statistics concentrators (undergraduate students who major in Statistics). The picture would be used by the chair of the department to illustrate the growth of the program for university officials, so I decided to make it look pretty. The first form that came to mind was a bar plot showing the enrollment growth over the years.

Starting in 2005, the numbers follow exponential growth, which is a remarkable achievement of the department. We then decided to follow the trend and extrapolate by adding predicted enrollment numbers for 2011, 2012 and 2013. At that time, there was no data for 2011. (more…)
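One common way to extrapolate an exponential trend is to fit a straight line to the logarithm of the counts and project it forward. The sketch below illustrates that idea only; it is not necessarily how the predictions in the post were produced. The function names and the enrollment counts are hypothetical, made up purely for illustration.

```python
import math

def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) by least squares on log(counts)."""
    t = [y - years[0] for y in years]
    z = [math.log(c) for c in counts]
    n = len(t)
    t_mean = sum(t) / n
    z_mean = sum(z) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxz = sum((ti - t_mean) * (zi - z_mean) for ti, zi in zip(t, z))
    b = sxz / sxx                      # growth rate per year (log scale)
    a = math.exp(z_mean - b * t_mean)  # fitted level in the base year
    return a, b

def predict(a, b, base_year, years):
    """Evaluate the fitted curve at the given years."""
    return [a * math.exp(b * (y - base_year)) for y in years]

# Hypothetical counts, NOT the actual enrollment data from the post.
years = [2005, 2006, 2007, 2008, 2009, 2010]
counts = [10, 15, 22, 33, 50, 75]

a, b = fit_exponential(years, counts)
forecast = predict(a, b, years[0], [2011, 2012, 2013])
print([round(f) for f in forecast])
```

Fitting on the log scale treats relative errors equally across years, which suits count data that grow by a roughly constant factor.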

JSM2011, and a final stretch at RDS

Thursday, August 18th, 2011

The Joint Statistical Meetings conference took place in Miami Beach from July 30 to August 5. It went very well, and the definite highlight was the keynote lecture by Sir David Cox. Among the other sessions, the following stand out:

  1. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics by Georg M. Goerg, a student at CMU Stat. The talk was very enjoyable, and the ideas it conveyed were crisp and exciting. The main one is that zero-mean time series can be thought of as histograms by representing them as frequency distributions. This allows for an elegant non-parametric classification approach that minimizes the KL divergence between observed and simulated frequency histograms.
  2. Large Scale Data at Facebook by Eric Sun from Facebook. Though not groundbreaking, the talk was exciting as it described the work environment at Facebook and the approach taken to getting signals out of massive data. Mostly, curious facts were presented from analyzing the frequencies of word occurrences in user status updates, with the interesting part being the analysis framework developed to do that.
  3. Jointly Modeling Homophily in Networks and Recruitment Patterns in Respondent-Driven Sampling of Networks by my advisor Joe Blitzstein, about our most recent research on model-based estimation for Respondent-Driven Sampling (RDS). The approach we are developing appears to have several very attractive features in comparison to current estimation techniques, and it is designed for the case of homophily of varying degree. An example is illustrated in Figure 1.

    Figure 1: An example of homophily, with the network plotted over the histogram of the homophily-inducing quantity (left), and the resulting (normalized) vertex degrees plotted over the same histogram (right).

    We hope to finish the relevant paper soon and open the approach to extensions by the research community.
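The idea from the first talk above, treating a zero-mean time series as a frequency histogram and comparing such histograms by KL divergence, can be sketched as follows. This is a minimal illustration and not the speaker's EM algorithm: it assumes the raw periodogram as the spectral estimate, and the function names and the small smoothing constant `eps` are hypothetical choices.

```python
import cmath
import math

def spectral_histogram(series):
    """Normalized periodogram: power at each positive frequency, scaled to sum to 1.

    Assumes the series is not constant (otherwise the total power is zero).
    """
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]  # enforce zero mean
    power = []
    for k in range(1, n // 2 + 1):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coef) ** 2)
    total = sum(power)
    return [p / total for p in power]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms on the same bins; eps guards against log(0)."""
    return sum(
        p_k * math.log((p_k + eps) / (q_k + eps))
        for p_k, q_k in zip(p, q)
        if p_k > 0
    )

n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
# Same frequency as `slow`, but rescaled and phase-shifted.
slow_shifted = [3 * math.sin(2 * math.pi * 2 * t / n + 0.5) for t in range(n)]
fast = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]

same = kl_divergence(spectral_histogram(slow), spectral_histogram(slow_shifted))
different = kl_divergence(spectral_histogram(slow), spectral_histogram(fast))
print(same, different)
```

Series with the same dynamics give a divergence near zero regardless of amplitude or phase, while series dominated by different frequencies give a large one, which is what makes the divergence usable as a classification criterion.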

During the conference, I also had a chance to finish making a dynamic 3D visualization of a constrained optimization algorithm I developed for In4mation Insights, which is exciting. As for Miami Beach itself, it is a great place to go out and enjoy the good food, sun and beach. JSM2012 will be held in San Diego.

I created the visualization in this post using Processing.

Dynamic visualization, paper supplement 1

Saturday, May 28th, 2011

(more…)

Dynamic visualization, paper supplement 2

Saturday, May 28th, 2011

(more…)

Does homophily exist?

Monday, May 9th, 2011

This spring I gave two talks, one at the New England Statistics Symposium (NESS) hosted by the Department of Statistics, University of Connecticut, and a post-qualifying talk in my home department. Both talks were on my work with my advisor Joe Blitzstein, and both drew heavily on the term homophily. The first talk presented refined simulation study results on design-based estimation, and the second one was about model-based estimation under Respondent-Driven Sampling. For the latter, we consider the data collected within a recent study of populations at high risk of HIV conducted in San Diego. The study took nearly 2 years to complete and was aimed at collecting information describing behavioral and health aspects of the target population. It is a privilege and a responsibility to be commissioned to analyze the collected data, as the results of the analysis may be used for subsequent policy decisions. Figure 1 shows the (anonymized) recruitment trees of the study.

San Diego study recruitment tree

Figure 1: San Diego study recruitment trees as functions of HIV status. On the x axis, observation means the HIV status group.

(more…)

The data science puzzle

Monday, April 11th, 2011

Throughout the past few years, I have heard several times that the demand for quantitatively oriented and data-oriented professionals is growing. Clearly, this is good news for statisticians, as statistics is central to the process of extracting a meaningful and actionable signal from data. The terms data science and data scientist have accompanied many of the related articles. So I decided to do some research and look for evidence of an increase in demand for information analysis. My goal has been to understand the peculiarities of how our profession is perceived in the community, and to attempt to clarify the meaning of the new term data science. (more…)

Dynamic visualization of RDS version 2

Sunday, March 27th, 2011

Early this semester, I worked on complementing my visualization of the Respondent-Driven Sampling (RDS) process presented in this post to illustrate its evolution over time. The result is the second version, which is displayed here.

Please refer to the earlier post for a detailed description of the main functionality. The second version implements an additional view of the process, which plots the portion of the underlying network discovered by the RDS process over time. To switch to the alternate view at any time, press the change view button. The wide pink horizontal line in the alternate view marks the true population mean. (more…)

Dynamic visualization of RDS

Saturday, December 18th, 2010

The visualization below is the last element of work with my advisor Joe Blitzstein on exploring the Respondent-Driven Sampling (RDS) process via simulation. (more…)

Tradeoffs in estimation under Respondent-Driven Sampling, and Chernoff faces

Wednesday, October 6th, 2010

Recently I have been working hard on finalizing the paper that I am writing with my advisor Joe Blitzstein about estimation under Respondent-Driven Sampling (RDS). Specifically, the paper aims to develop general intuition about how the process works on networks with different topologies, and about what drives the performance of current estimators (or the lack thereof).

To do this, we simulated many networks belonging to one of three main types (homophily, rich-gets-richer and inverse homophily), simulated many RDS processes of different configurations on each, and compared the performance of the well-established Volz-Heckathorn (VH) estimator and the plain vanilla mean as point estimators under each scenario. Among other findings, it turned out that the VH estimator underperforms the plain mean on the considered class of homophily networks, and prevails in some other cases. (more…)
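For readers unfamiliar with the two point estimators being compared, here is a minimal sketch. It uses the usual form of the VH estimator of a population mean, which weights each sampled value by the inverse of the respondent's reported degree. The function names and the tiny sample are hypothetical and serve only to show how the two estimates can differ; they are not taken from the simulation study.

```python
def plain_mean(values):
    """Unweighted sample mean."""
    return sum(values) / len(values)

def vh_estimate(values, degrees):
    """Volz-Heckathorn estimate: mean weighted by inverse reported degree."""
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

# Hypothetical sample: trait indicator and reported network degree per respondent.
values = [1, 1, 0, 0]
degrees = [10, 10, 2, 2]

print(plain_mean(values))            # 0.5
print(vh_estimate(values, degrees))  # about 0.167
```

High-degree respondents are more likely to be recruited, so the VH estimator down-weights them. In this made-up sample the trait is concentrated among high-degree respondents, which pulls the VH estimate well below the plain mean.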