Recently I have been working hard on finalizing the paper that I am writing with my advisor Joe Blitzstein about estimation under Respondent-Driven Sampling (RDS). Specifically, the paper aims to develop general intuition about how the process works on networks with different topologies, and about what drives the performance of current estimators (or the lack thereof).

To do this, we simulated many networks, each belonging to one of three main types (homophily, rich-gets-richer and inverse homophily), simulated many RDS processes of different configurations on each network, and compared the performance of the well-established Volz-Heckathorn (VH) estimator and the plain vanilla mean *as point estimators* under each scenario. Among other findings, it turned out that the VH estimator underperforms the plain mean on the considered class of homophily networks, and prevails in some other cases.
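For readers unfamiliar with the two point estimators, here is a minimal sketch of how they differ. It assumes the usual form of the VH estimator, an inverse-degree weighted mean of the observed trait; the function names and the four-person sample are hypothetical and are not taken from our simulation code.

```python
def plain_mean(values):
    """Unweighted sample mean of the observed trait values."""
    return sum(values) / len(values)


def vh_estimate(values, degrees):
    """Inverse-degree weighted mean (assumed form of the VH estimator).

    Each respondent is weighted by 1 / reported degree, so that
    high-degree vertices, which RDS tends to oversample, count less.
    """
    if len(values) != len(degrees):
        raise ValueError("values and degrees must have the same length")
    # Assumes every reported degree is strictly positive.
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)


# Hypothetical sample: a 0/1 trait and a reported degree per respondent.
sample_values = [1, 1, 0, 0]
sample_degrees = [10, 5, 2, 1]

print(plain_mean(sample_values))                   # unweighted estimate
print(vh_estimate(sample_values, sample_degrees))  # degree-corrected estimate
```

In this toy sample the trait is concentrated among high-degree respondents, so the weighted estimate comes out lower than the plain mean.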

To summarize the wealth of information obtained from the simulations, I decided to create a visualization that could incorporate as many dimensions as possible. In fact, I worked on some visualization ideas while at the opening workshop at SAMSI, as described in an earlier post. When showing them to my advisor, I was surprised to learn that similar ideas had in fact already been expressed by Herman Chernoff, currently a professor emeritus in our department. They are known as Chernoff faces. In the end, I decided to keep the design because it is highly informative.

The visualization below encompasses 36 plots of different RDS configurations on different types of networks, chosen through the user’s selection below the plot. The x axis corresponds to sensitivity points of a topology. That is, we generated 10 similar network subtypes within each topology type by varying a parameter in the network formation function. The y axis corresponds to the number of coupons in an RDS process. Inside the plot, the bottom half of each “face” corresponds to stochastic degree reporting (i.e., every vertex in the sample reports its degree with some error), and the top half corresponds to exact degree reporting. The larger half circles correspond to the difference in MSE between the VH and plain mean estimators. Of the smaller half circles within each “face”, the left ones correspond to the difference in squared bias, and the right ones to the difference in variance. Yellow corresponds to a positive sign, and blue to a negative one. Finally, the watermark numbers are the average-average number of times RDS processes came to a dead end and had to self-restart to reach the needed sample size.
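The three quantities encoded in each “face” are tied together by the standard identity MSE = squared bias + variance. The sketch below shows how they could be computed from repeated simulation runs. It is an illustration only: the function names are hypothetical, and the sign convention (VH minus plain mean) is my assumption for the example.

```python
def decompose_mse(estimates, truth):
    """Split the MSE of repeated estimates into squared bias and variance."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias_sq = (mean_est - truth) ** 2
    # Divide by n (not n - 1) so that mse == bias_sq + variance holds exactly.
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return {"mse": mse, "bias_sq": bias_sq, "variance": variance}


def face_values(vh_estimates, mean_estimates, truth):
    """Differences (VH minus plain mean, an assumed convention) per component."""
    vh = decompose_mse(vh_estimates, truth)
    pm = decompose_mse(mean_estimates, truth)
    return {key: vh[key] - pm[key] for key in vh}


# Hypothetical estimates from two simulated RDS runs, with true value 0.5.
parts = decompose_mse([0.2, 0.4], truth=0.5)
print(parts)
```

Under this convention, a positive MSE difference would mean that the VH estimator did worse than the plain mean in that configuration.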

The visualization was created in Processing, a language for visualization that I started looking at recently. Please check this post for more details. I like this language because, though highly verbose, it still makes it feasible to develop aesthetically appealing visualizations within a reasonable time frame.

I look forward to finishing writing and submitting the paper on this work, and also to giving a talk at a network visualization symposium on October 22, which will use another interactive visualization, and will generally build on my related experience.

Tags: Joe, Processing, RDS, visualization
