- Intro.
- Modeling and computing material.
- Computing classification of models.
- Intro to computing with Odyssey.
- Thinking like a programmer - pseudocode, class languages.

- Critiquing a visualization, pset 1 information.

- Final projects out.
- Ranking survey out soon.
- Piazza is good, use it.

- Distinctive features of parametric stat modeling.
- Ingredients of a stat model.
- Finding the estimand.
- Telling a story with the model.
- Agile movement setup \\( \longleftrightarrow \\) modeling.
- Parsimony in modeling.
- Visualization for model design.

- Problem setup.
- Method design.
- Implementation.
- Reporting.

Specifically, we will look at the connection and tradeoffs between modeling and computing.

- Survey done with RDS (respondent-driven sampling); we need to estimate the population mean.
- The researchers are using a naive mean model, which has a non-ignorability problem.
- We need to decide on a better method.

- Classify models in terms of computing intensity.
- Easily switch between workflow stages, and available options within a stage.
- Think a little like a programmer.

| | Easy computing | Hard computing |
| --- | --- | --- |
| Simple model | Linear regression | X |
| Complex model | naive MCMC, vEM, HMM, GLM+HMC | complex HMM, HMC |

Being able to navigate the tradeoffs in concrete applications is necessary to make the right choices.

Simple models:

- Regressions (logistic, linear), mixed effects regressions (somewhat).
- Autoregressive models.

Simple-to-complex models:

- Generalized linear models (GLM), spline/smooth regressions.

Complex models:

- Models with missing data/latent variables, HMMs.
- GLMs without the L, combinations of the models above.

- Standard models often don't work.
- Increase in the amount of available data - 2.8 trillion GB in 2012.
- Less and less cookie-cutter settings and challenges.
- Personal computers cannot handle large datasets, need distributed computing such as Odyssey.

- Design a model.
- Work on selecting a good computing technique for it, evaluate findings.
- Go back to either look for another computing technique, or change the model if the results are not satisfactory.

- Simple model: $$Y_i \overset{\text{iid}}{\sim} N\left( \mu, \sigma^2\right)$$
- Can be estimated on a personal computer easily.
- But CI coverage rates are bad. The model is not a good representation of the underlying mechanism.
- Go back to the design phase.
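To see why the iid model's intervals undercover with referral data, here is a small illustrative R simulation (the single-chain referral structure, `rho = 0.8`, and sample sizes are assumptions for demonstration, not the actual survey design): data are generated with referral correlation, but the 95% CI is computed as if they were iid.

```r
set.seed(1)
mu <- 0; sigma <- 1; rho <- 0.8   # illustrative values only
nsim <- 2000; n <- 100
covered <- logical(nsim)
for (s in 1:nsim) {
  ## one referral chain: Y_i depends on its referrer Y_{i-1}
  y <- numeric(n)
  y[1] <- rnorm(1, mu, sigma)
  for (i in 2:n) {
    y[i] <- rnorm(1, rho * y[i - 1] + (1 - rho) * mu,
                  sigma * sqrt(1 - rho^2))
  }
  ## naive iid-normal 95% CI for mu
  ci <- mean(y) + c(-1, 1) * 1.96 * sd(y) / sqrt(n)
  covered[s] <- ci[1] <= mu && mu <= ci[2]
}
mean(covered)  # far below the nominal 0.95
```

The correlation inflates the variance of the sample mean, so the naive interval is far too narrow.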

- \\(\{h_1, h_2, \ldots \} \\) live on a discrete space and follow a Markov process with transition matrix \\( T_\theta \\).
- At time \\( t \\), state \\( h_t \\) *emits* observation \\( y_t \\) according to some specified model.

- Why would this be a bad candidate?
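As a concrete sketch of these ingredients (the two-state transition matrix and the Gaussian emission means below are made up for illustration), simulating an HMM forward is straightforward:

```r
set.seed(2)
## illustrative transition matrix T_theta over two hidden states
T_theta <- matrix(c(0.9, 0.1,
                    0.2, 0.8), nrow = 2, byrow = TRUE)
emit_mean <- c(-1, 1)   # Gaussian emission mean for each state
n <- 200
h <- integer(n); y <- numeric(n)
h[1] <- 1
y[1] <- rnorm(1, emit_mean[h[1]], 1)
for (t in 2:n) {
  h[t] <- sample(1:2, 1, prob = T_theta[h[t - 1], ])  # Markov transition
  y[t] <- rnorm(1, emit_mean[h[t]], 1)                # emission given state
}
```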

$$\begin{align} Y_i \mid Y_{r(i)} & \sim N \left(\rho Y_{r(i)} + (1 - \rho) \mu,\ (1 - \rho^2) \sigma^2 \right), \\ Y_i & \sim N\left( \mu, \sigma^2 \right) \text{ if } r(i) \text{ doesn't exist.} \end{align}$$

- Needs Gibbs sampler to fit.
- Mostly analytic updates.
- Needs distributed computing for sensitivity analysis and coverage rate analysis.
- Provides good interpretation and interval coverage properties.

- Second candidate.
- Numerically stable (not so many parameters/missing data).
- Other favorable features.

- The estimand.
- Model performance metrics.
- What models are available/can be adapted?
- Work towards parsimony.

- Numeric stability.
- Cost of implementation and execution.
- Point estimation vs. full sampling approaches.
- Point estimation: MLE (via LP, QP approximations, EM, Fisher scoring, simulated annealing), MAP.
- Full sampling: MCMC, HMC, Gibbs, exact sampling.

Simulate from the simple model:

```
for i in 1:N
    generate and store Y_i from Normal(mu, sigma-sq)
```

```
for i in 1:N
    referrer = r(i)
    if referrer exists
        Yr = Y_referrer
        generate and store Y_i from Normal(rho * Yr + (1 - rho) * mu, (1 - rho^2) * sigma-sq)
    else
        generate and store Y_i from Normal(mu, sigma-sq)
```
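A runnable R version of this pseudocode (here `r` is a hypothetical vector of referrer indices, `NA` for seed respondents, with each referrer appearing before the person they refer):

```r
simulate_referral <- function(n, mu, ss, rho, r) {
  y <- numeric(n)
  for (i in 1:n) {
    if (!is.na(r[i])) {
      yr <- y[r[i]]   # referrer's value, already generated
      y[i] <- rnorm(1, rho * yr + (1 - rho) * mu, sqrt((1 - rho^2) * ss))
    } else {
      y[i] <- rnorm(1, mu, sqrt(ss))   # no referrer: marginal model
    }
  }
  y
}
set.seed(4)
y <- simulate_referral(10, mu = 0, ss = 1, rho = 0.5,
                       r = c(NA, 1, 2, NA, 4, 5, 1, 7, 3, 9))
```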

```
initialize parameter values
for i in 1:ndraws
    ## Gibbs sampler: draw each parameter given current values of the others
    draw mu analytically given current rho and sigma-sq
    draw rho via Metropolis proposals on transformed space given current mu and sigma-sq
    draw sigma-sq analytically given current mu and rho
discard first nburnin draws of parameters
```

```
tr = init(data)
for (ii in 1:settings$ndraws) {
  ## this is a Gibbs sampler
  pars = lastParams(tr)
  tr$mu = combine(tr$mu, drawMu(data, pars))
  pars = lastParams(tr)
  tr$rho = combine(tr$rho, drawRho(data, pars))
  pars = lastParams(tr)
  tr$ss = combine(tr$ss, drawSigmaS(data, pars))
}
tr = discard(tr, settings$burnin)
tr
```

```
## initialize parameter values
tr = init(data, settings)
## perform the main Gibbs sampler loop
for (ii in 1:settings$ndraws) {
  ## lastParams gets the most recent parameter draws
  pars = lastParams(tr)
  tr$mu = combine(old = tr$mu, add = drawMu(data, pars))
  pars = lastParams(tr)
  tr$rho = combine(old = tr$rho, add = drawRho(data, pars))
  pars = lastParams(tr)
  tr$ss = combine(old = tr$ss, add = drawSigmaS(data, pars))
}
## burn-in: throw away the first burnin draws set in settings
tr = discard(allparameters = tr, nthrow = settings$burnin)
tr
```
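For intuition, here is a toy, fully runnable Gibbs sampler of the same shape for the simple iid-normal model only (no \\( \rho \\)), assuming flat priors so both conditional draws are analytic; it is a sketch, not the course implementation:

```r
set.seed(3)
y <- rnorm(50, mean = 2, sd = 1)   # toy data with known truth
n <- length(y); ndraws <- 1000; burnin <- 200
ss <- 1                            # initialize parameter values
draws <- matrix(NA, ndraws, 2, dimnames = list(NULL, c("mu", "ss")))
for (ii in 1:ndraws) {
  ## draw mu | ss: with a flat prior, mu | ss, y ~ N(ybar, ss / n)
  mu <- rnorm(1, mean(y), sqrt(ss / n))
  ## draw ss | mu: scaled inverse chi-square with n degrees of freedom
  ss <- sum((y - mu)^2) / rchisq(1, df = n)
  draws[ii, ] <- c(mu, ss)
}
draws <- draws[-(1:burnin), ]      # discard burn-in
colMeans(draws)                    # posterior means, close to the truth
```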

- Saves time.
- Easier to handle new languages.

R

```
for (ii in 1:10) {
  print("R is awesome")
}
```

Bash

```
for ii in $(seq 1 10)
do
  echo "R is awesome"
done
```

- Many code snippets provided in homeworks.

Javascript (optional), snippets provided

```
for (var ii = 0; ii < 10; ii++) {
  console.log("R is awesome");
}
```

CSS (optional), snippets provided

```
svg line {
  stroke: #000000;
  opacity: 0.3;
}
```

HTML (optional), snippets provided

```
<html>
<body><p>R is awesome</p></body>
</html>
```

- Personal to distributed computing leap.
- short_serial, short_parallel queues on Odyssey.
- Computing resources are usually constrained.

- Vectorization.
- Think in terms of matrix operations.
- Vector indexing.
- Built-in functions and functions in packages, some are in lower level languages and so are faster.
- (Hint for pset 1: you may want to consider `xpnd` in MCMCpack.)

```
time1 <- proc.time()
res1 <- sapply(1:1000000, exp)
time2 <- proc.time()
time2 - time1
##    user  system elapsed
##   6.321   0.146   6.429

time1 <- proc.time()
res2 <- exp(1:1000000)
time2 <- proc.time()
time2 - time1
##    user  system elapsed
##   0.022   0.016   0.051

all(res1 == res2)
## [1] TRUE
```
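Vector indexing works the same way: logical or integer indices replace explicit loops, and matrix helpers like `rowSums` replace nested loops (values below are just for illustration):

```r
x <- c(5, -2, 7, 0, -3)
x[x < 0] <- 0        # logical index: zero out negatives -> 5 0 7 0 0
x[c(1, 3)]           # integer index: elements 1 and 3 -> 5 7
m <- matrix(1:6, nrow = 2)
rowSums(m)           # one call instead of a nested loop -> 9 12
```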

- Homework visualization is on the course website.
- It's also in the problem set distribution files.

- Submit your final project rankings by February 10 at noon!
- Odyssey access: send your name, email, and phone # to your TF.

- Slides nesterko.com/lectures/stat221-2012/lecture3
- Class website theory.info/harvardstat221
- Class Piazza piazza.com/class#spring2013/stat221
- Class Twitter twitter.com/harvardstat221

- Next lecture: guest lecture by Rachel Schutt