- Team assignment details.
- Descriptions released approx. February 2.

- Can submit visualizations to class website! Good for participation.
- What is good style of code?
- Homework previews.
- Time commitment to problem sets.
- d3js support possibilities (online tutorials, workshop, TF)
- T-shirt competition clarified.

- Statistics
- Computer science
- Economics
- Physics
- Biology
- others

- Close to physics, chemistry, biology etc.
- Defining trait - distributional assumptions on noise.
- Similar to Computer Science when no clear intuition driving the process can be found.

Response, data, noise, and parameters:$$\begin{align} \rr{response} &= f(\rr{data}, \rr{noise})\\ Y &= f_{\theta}(X, \epsilon) \end{align}$$

Representation:$$\begin{align}Y &= f_{\theta}(X, \epsilon) \rr{ or} \\ Y &\sim f_{\theta}(X, \epsilon)\end{align}$$

Graphical or generative:

Linear regression:$$\vec{Y} \sim \mathbf{X}\vec{\beta} + \vec{\epsilon}, \rr{ } \vec{\epsilon} \sim N\left(0, \sigma^2\mathbf{I}\right)$$

Anchor Process: $$\begin{align} Y_i \given Y_{r(i)} & \sim N \left(\rho Y_{r(i)} + (1 - \rho) \mu, (1 - \rho^2) \sigma^2 \right), \\Y_i & \sim N\left( \mu, \sigma^2 \right) \rr{ if } r(i)\rr{ doesn't exist}\end{align}$$
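Both generative statements above can be read as recipes for simulating data. The sketch below is one way to do that in plain Python. The function names, the one-covariate form of the regression, and the recruitment rule (respondent 0 is the only seed, every later respondent is referred by a uniformly chosen earlier respondent) are illustrative assumptions, not part of the lecture.

```python
import math
import random

def simulate_linear(x, beta0, beta1, sigma, rng):
    """Y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ N(0, sigma^2) independently."""
    return [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]

def simulate_referrals(n, rng):
    """Hypothetical recruitment tree: index 0 is a seed (no referrer);
    respondent i > 0 is referred by a uniformly chosen earlier respondent."""
    return [None] + [rng.randrange(i) for i in range(1, n)]

def simulate_anchor(referrer, mu, sigma, rho, rng):
    """Draw Y_i given Y_{r(i)} from the anchor process, in recruitment order."""
    y = []
    for r in referrer:
        if r is None:
            # r(i) does not exist: draw from the marginal N(mu, sigma^2).
            y.append(rng.gauss(mu, sigma))
        else:
            mean = rho * y[r] + (1.0 - rho) * mu
            sd = math.sqrt(1.0 - rho ** 2) * sigma
            y.append(rng.gauss(mean, sd))
    return y

rng = random.Random(221)  # arbitrary seed, for reproducibility only
x = [i / 10 for i in range(200)]
y_lin = simulate_linear(x, beta0=1.0, beta1=2.0, sigma=0.5, rng=rng)
referrer = simulate_referrals(1000, rng)
y_anchor = simulate_anchor(referrer, mu=10.0, sigma=1.0, rho=0.6, rng=rng)
print(len(y_lin), len(y_anchor))
```

Marginally, each \\( Y_i \\) in the anchor process still has mean \\( \mu \\) and variance \\( \sigma^2 \\); \\( \rho \\) only controls how strongly a respondent resembles the referrer.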

- Translate scientific/expert intuition into mathematical statements.
- Follow the quantitative analysis workflow.

- Each model tells a story.
- Let's tell a story - class activity.

Iterate between the stages:

- **Initialize**: talk to collaborator/client, get the data, understand the question of interest.
- **Design**: an analytic method for a solution, evaluate its properties and assumptions.
- **Implement**: a working, practical computing solution to perform the designed procedure.
- **Communicate**: the findings, help understand and interpret them. Advise on the method's applicability in case of repeated use.

- Respondent-Driven Systems/Sampling
- Respondents follow their social network to recruit peers to achieve a common goal.
- Public health agencies are interested in estimating population averages such as income, illness status, etc.

The first guess for the model is $$Y_i \mathop{\sim}^{\rr{iid}} N\left( \mu, \sigma^2\right)$$

Yields the classic vanilla estimator $$\hat{\mu} = \bar{Y}$$

with liberal (tight) confidence bounds.

Almost no computation is needed for this model.
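As a minimal sketch of how little computation this involves, the snippet below computes the sample mean and a normal-approximation 95% interval. The helper name `naive_interval` and the 1.96 multiplier (the large-sample normal quantile) are assumptions made for illustration.

```python
import math
import statistics

def naive_interval(y, z=1.96):
    """Sample mean and the iid-normal interval mean +/- z * s / sqrt(n)."""
    n = len(y)
    mu_hat = statistics.fmean(y)
    se = statistics.stdev(y) / math.sqrt(n)  # s uses the n - 1 denominator
    return mu_hat, (mu_hat - z * se, mu_hat + z * se)

y = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy data, not from the lecture
mu_hat, (lower, upper) = naive_interval(y)
print(mu_hat, lower, upper)
```

The interval width depends only on the sample standard deviation and \\( n \\); nothing in it reflects who recruited whom.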

- The estimator is unbiased, but
- The naive 95% confidence interval has coverage of around 20% in case of homophily.
- How do we enhance our model to keep the unbiasedness, but achieve the right coverage rate for uncertainty intervals?
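The undercoverage can be checked by simulation. The sketch below assumes one particular homophily mechanism: the anchor process on a chain, where each respondent is referred by the previous one. It then counts how often the naive interval contains the true mean. The chain structure, the parameter values, and all function names are hypothetical choices for illustration.

```python
import math
import random
import statistics

def simulate_chain(n, mu, sigma, rho, rng):
    """Anchor process on a chain: respondent i is referred by respondent i - 1."""
    y = [rng.gauss(mu, sigma)]
    sd = math.sqrt(1.0 - rho ** 2) * sigma
    for _ in range(n - 1):
        y.append(rng.gauss(rho * y[-1] + (1.0 - rho) * mu, sd))
    return y

def naive_covers(y, mu, z=1.96):
    """True if the iid-normal interval mean +/- z * s / sqrt(n) contains mu."""
    half_width = z * statistics.stdev(y) / math.sqrt(len(y))
    return abs(statistics.fmean(y) - mu) <= half_width

def coverage(rho, n=200, reps=500, mu=0.0, sigma=1.0, seed=221):
    """Monte Carlo estimate of the naive interval's coverage rate."""
    rng = random.Random(seed)
    hits = sum(
        naive_covers(simulate_chain(n, mu, sigma, rho, rng), mu)
        for _ in range(reps)
    )
    return hits / reps

cov_iid = coverage(rho=0.0)        # no homophily: the iid model is correct
cov_homophily = coverage(rho=0.9)  # strong homophily along the chain
print(cov_iid, cov_homophily)
```

The exact coverage depends on \\( \rho \\), the sample size, and the recruitment structure, so this sketch is not meant to reproduce the 20% figure, only to show the direction of the effect.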

- Understand the system better.
- Build the intuition about system mechanics into our estimation model.
- Hope that our intuition and the way we've postulated it are good enough.

Model statement: $$\begin{align} Y_i \given Y_{r(i)} & \sim N \left(\rho Y_{r(i)} + (1 - \rho) \mu, (1 - \rho^2) \sigma^2 \right), \\Y_i & \sim N\left( \mu, \sigma^2 \right) \rr{ if } r(i)\rr{ doesn't exist}\end{align}$$

- \\( \{ \rho, \mu, \sigma^2 \} \\) is the parameter set.
- \\( r(i) \\) is the index of referrer of respondent \\( i \\).
- \\( Y_i \\) is the observation of respondent \\( i \\).
- Needs more involved computation - there is no closed-form solution for \\( \hat{\mu} \\).
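One possible way to do the computation: for a fixed \\( \rho \\) the likelihood factorizes over respondents, and \\( \mu \\) and \\( \sigma^2 \\) have closed-form maximizers, so only a one-dimensional search over \\( \rho \\) remains. The sketch below does that search on a grid. This profile-likelihood approach, the grid resolution, the function names, and the chain recruitment in the usage example are assumptions for illustration, not necessarily the fitting method used in the course.

```python
import math
import random

def profile_fit(y, referrer, rho):
    """For fixed rho, return (mu_hat, sigma2_hat, log_lik).

    Each respondent contributes z_i ~ N(c_i * mu, v_i * sigma^2), where
    seeds have z = Y_i, c = 1, v = 1 and referred respondents have
    z = Y_i - rho * Y_r(i), c = 1 - rho, v = 1 - rho^2.
    """
    n = len(y)
    terms = []
    for yi, r in zip(y, referrer):
        if r is None:
            terms.append((yi, 1.0, 1.0))
        else:
            terms.append((yi - rho * y[r], 1.0 - rho, 1.0 - rho ** 2))
    mu_hat = (sum(c * z / v for z, c, v in terms)
              / sum(c * c / v for _, c, v in terms))
    sigma2_hat = sum((z - c * mu_hat) ** 2 / v for z, c, v in terms) / n
    log_lik = -0.5 * (n * math.log(2.0 * math.pi * sigma2_hat)
                      + sum(math.log(v) for _, _, v in terms) + n)
    return mu_hat, sigma2_hat, log_lik

def fit_anchor(y, referrer, grid=None):
    """Grid search over rho in (-1, 1); returns (rho_hat, mu_hat, sigma2_hat)."""
    if grid is None:
        grid = [k / 100 for k in range(-99, 100)]
    best = None
    for rho in grid:
        mu_hat, sigma2_hat, log_lik = profile_fit(y, referrer, rho)
        if best is None or log_lik > best[3]:
            best = (rho, mu_hat, sigma2_hat, log_lik)
    return best[:3]

# Usage on simulated data (hypothetical chain: i is referred by i - 1).
rng = random.Random(221)
n, mu, sigma, rho = 2000, 10.0, 2.0, 0.5
y = [rng.gauss(mu, sigma)]
for _ in range(n - 1):
    y.append(rng.gauss(rho * y[-1] + (1.0 - rho) * mu,
                       math.sqrt(1.0 - rho ** 2) * sigma))
referrer = [None] + list(range(n - 1))
rho_hat, mu_hat, sigma2_hat = fit_anchor(y, referrer)
print(rho_hat, mu_hat, sigma2_hat)
```

A finer grid or a one-dimensional optimizer could replace the grid search; the closed-form inner step stays the same.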

- The question to be answered a.k.a. the estimand.
- Developing the intuition of how things work.
- Translating the intuition into models, evaluating their properties.
- Being able to fit/compute with these models.
- Reporting the results.
- Looping back.

- Final project proposals available soon - check email and course Twitter!
- Odyssey access, ask your TF for clarifications.
- Free T-shirt competitions start now!

- Slides nesterko.com/lectures/stat221-2012/lecture2
- Class website theory.info/harvardstat221
- Class Piazza piazza.com/class#spring2013/stat221
- Class Twitter twitter.com/harvardstat221

- Next lecture: Introduction to data Visualization + statistical Modeling + Computing (VMC) II
- Don't hesitate to talk to course staff about the class and your specific needs.