# Decision Theory and Statistical Inference

### Stat 221, Lecture 16

@snesterko

• Decision Making based on Inference
• Relevance to multimodality
• Case study - marketing campaigns optimization

## In the presence of multimodality, maximum likelihood breaks down

• We use the full likelihood/posterior to get information about the parameters that describe the system.
• But how exactly do we use it to make decisions where the parameter values are inputs?
• One step beyond inference.

## Example: IT materials for schools

• The data is binary - students getting into college or not.
• We are interested in one parameter: the association between time spent on IT materials and getting into college.
• Fit a logistic regression-like model to check that.

## Questions

• Are there two groups of people in the class?
• Is this an artifact of the model?
• So how many IT materials should we order?

### Ex.: running man HMM

Non-identifiability even in this simple model.

## Introduction to Decision Theory

• There is a "cost" to each decision.
• Some would call it a game "statistician against nature".
• The cost is postulated via a loss function.
• The cost depends on the underlying true value of the parameter.

## Classic cost functions

Squared error $$C(\theta, d) = (d - \theta)^2$$

Absolute error $$C(\theta, d) = | d - \theta |$$

Constant error $$C(\theta, d) = A \cdot I_{|d - \theta| > \epsilon}$$
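The three cost functions can be written down directly. This is a minimal sketch; the default values of `A` and `eps` and the example numbers are arbitrary choices for illustration, not values from the lecture.

```python
def squared_error(theta, d):
    """Squared error cost: (d - theta)^2."""
    return (d - theta) ** 2


def absolute_error(theta, d):
    """Absolute error cost: |d - theta|."""
    return abs(d - theta)


def constant_error(theta, d, A=1.0, eps=0.1):
    """Constant cost A whenever the decision misses theta by more than eps."""
    return A if abs(d - theta) > eps else 0.0


theta_true = 0.3  # hypothetical true parameter value
for d in (0.25, 0.5):
    print(d, squared_error(theta_true, d), absolute_error(theta_true, d),
          constant_error(theta_true, d))
```

Squared error punishes large misses much more than small ones, while the constant error only asks whether the decision landed within `eps` of the truth.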

## In the IT materials example, what cost function would we choose?

• What other ways could we employ to plan the next IT materials order?

## Definitions

Let \$$\delta (y) \$$ stand for a decision we make based on data.

Define the risk function \$$R_\delta(\theta) \$$ as the expected cost of the rule \$$\delta \$$ when the true parameter value is \$$\theta \$$:

$$R_\delta(\theta) = \rr{E}_\theta \left[ C\left(\theta , \delta\left(y \right) \right)\right]$$

A decision rule \$$\delta \$$ is admissible if there exists no other decision rule \$$\tilde{\delta} \$$ such that \$$R_{\tilde{\delta}}(\theta) \leq R_{\delta}(\theta) \$$ for all \$$\theta \in \Theta \$$, with the inequality strict for at least one \$$\theta \$$.

## Example

• Interval parameter space \$$\Theta = [0, 1] \$$.
• Assume Mean Squared Error loss.
• Can we find a decision rule that has a minimal MSE uniformly over \$$\Theta \$$?
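One way to explore this question numerically is sketched below. The slide does not specify a data model, so the sketch assumes `y ~ Binomial(n, theta)` with `n = 10` and compares two hypothetical rules: the sample proportion and a rule that always answers 0.5.

```python
from math import comb


def risk(rule, theta, n):
    """Exact risk E_theta[(rule(y) - theta)^2] for y ~ Binomial(n, theta)."""
    return sum(
        comb(n, y) * theta**y * (1 - theta) ** (n - y) * (rule(y) - theta) ** 2
        for y in range(n + 1)
    )


n = 10  # assumed sample size


def sample_proportion(y):
    return y / n


def always_half(y):
    # Ignores the data entirely.
    return 0.5


for theta in (0.1, 0.5, 0.9):
    print(theta, risk(sample_proportion, theta, n), risk(always_half, theta, n))
```

Under these assumptions the constant rule has zero risk at `theta = 0.5`, which no data-driven rule can match there, yet it does far worse near the ends of the interval. The two risk curves cross, so neither rule is uniformly better than the other.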

## Minimaxity

The decision rule \$$\delta \$$ is minimax if for any other decision rule \$$\tilde{\delta} \$$, the risk function for \$$\delta \$$ satisfies

$$\mathop{\rr{sup}}_{\theta \in \Theta} \{ R_\delta(\theta)\} \leq \mathop{\rr{sup}}_{\theta \in \Theta} \{ R_{\tilde{\delta}}(\theta)\}$$
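As an illustration of comparing rules by their worst-case risk, the sketch below again assumes a binomial model with squared error loss (an assumption, not part of the slide). It compares the sample proportion with the shrinkage rule `(y + sqrt(n)/2) / (n + sqrt(n))`, which is the textbook minimax rule for this model. The supremum is approximated by a maximum over a grid of `theta` values.

```python
from math import comb, sqrt


def risk(rule, theta, n):
    """Exact risk E_theta[(rule(y) - theta)^2] for y ~ Binomial(n, theta)."""
    return sum(
        comb(n, y) * theta**y * (1 - theta) ** (n - y) * (rule(y) - theta) ** 2
        for y in range(n + 1)
    )


def worst_case_risk(rule, n, grid_size=101):
    """Grid approximation to the supremum of the risk over theta in [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(risk(rule, theta, n) for theta in grid)


n = 10  # assumed sample size


def sample_proportion(y):
    return y / n


def shrinkage_rule(y):
    # Pulls the estimate toward 1/2; its risk does not depend on theta.
    return (y + sqrt(n) / 2) / (n + sqrt(n))


print(worst_case_risk(sample_proportion, n))  # attained at theta = 0.5
print(worst_case_risk(shrinkage_rule, n))
```

The shrinkage rule has constant risk `1 / (4 * (1 + sqrt(n))**2)`, which lies below the peak risk `1 / (4 * n)` of the sample proportion, even though the sample proportion does better for `theta` near 0 or 1.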

## Bayesian Decision Analysis

The Bayes risk of \$$\delta \$$ under the prior \$$\pi \$$ is

$$\rr{E}_\pi \left[ R_\delta (\theta)\right] = \int_\Theta R_\delta (\theta) \pi(\theta) d\theta$$

This is a double expectation, averaged over both the data \$$y \$$ and the parameter \$$\theta \$$.

## Rearrange

\begin{align} \rr{E}_\pi \{ R_\delta (\theta)\} & = \int \int C(\theta, \delta(y)) f(y \given \theta) dy \pi(\theta) d\theta \\ & = \int \int C(\theta, \delta(y)) p(\theta \given y) f(y) d \theta dy \\ & = \int f(y) \int C(\theta, \delta(y)) p(\theta \given y) d \theta dy\end{align}

Minimizing Bayes risk is equivalent to minimizing the posterior expectation of the loss function.
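The step behind this statement, spelled out: because \$$f(y) \geq 0 \$$ and the rule can be chosen separately for every \$$y \$$, the outer integral is smallest when the inner integral is minimized pointwise. Writing \$$d \$$ for a candidate decision at a fixed \$$y \$$ (notation introduced here for illustration), the Bayes rule is

$$\delta^{\ast}(y) = \rr{argmin}_{d} \int C(\theta, d) p(\theta \given y) d\theta$$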

## Convenience of Bayes risk

• It defines a single number.
• So, it's easy to look for the best decision rule - just look for the one with the lowest Bayes risk.
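A small sketch of ranking rules by this single number. The model (binomial with `n = 10`), the Uniform(0, 1) prior, and the two candidate rules are assumptions chosen for illustration. The integral over the prior is approximated with a midpoint rule.

```python
from math import comb


def risk(rule, theta, n):
    """Exact risk E_theta[(rule(y) - theta)^2] for y ~ Binomial(n, theta)."""
    return sum(
        comb(n, y) * theta**y * (1 - theta) ** (n - y) * (rule(y) - theta) ** 2
        for y in range(n + 1)
    )


def bayes_risk(rule, n, grid_size=1000):
    """Midpoint-rule approximation of the risk averaged over a Uniform(0, 1) prior."""
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    return sum(risk(rule, theta, n) for theta in thetas) / grid_size


n = 10  # assumed sample size


def sample_proportion(y):
    return y / n


def posterior_mean(y):
    # Posterior under the uniform prior is Beta(y + 1, n - y + 1).
    return (y + 1) / (n + 2)


print(bayes_risk(sample_proportion, n))  # close to 1 / (6 * n)
print(bayes_risk(posterior_mean, n))     # close to 1 / (6 * (n + 2))
```

Each rule collapses to one number, so the comparison is a plain inequality. Under this prior the posterior mean has the lower Bayes risk.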

## Theorem

• Any Bayes rule corresponding to a proper nondogmatic prior \$$\pi (\theta) \$$ is admissible.

This means Bayesian procedures are admissible in the frequentist sense.

## Example

Minimize the Bayes Risk with respect to squared error loss.

$$\rr{argmin}_\delta \left\{ \int \left( \theta - \delta(y)\right)^2 p(\theta \given y) d\theta \right\}$$

Let's find the corresponding decision rule.

## Calculations

\begin{align}\int \left( \theta - \delta(y)\right)^2 & p(\theta \given y) d\theta \\ = & \rr{Var} \left[ \theta - \delta(y) \given y \right] + \\ & \left( \rr{E} \left[ \theta - \delta(y) \given y \right]\right)^2 \\ = & \rr{Var} \left[ \theta \given y \right] + \left( \rr{E} \left[ \theta \given y \right] - \delta(y)\right)^2\end{align}

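The first term does not depend on \$$\delta(y) \$$ and the second is zero exactly when \$$\delta(y) = \rr{E} \left[ \theta \given y \right] \$$, so the posterior mean minimizes the posterior expected squared error loss.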
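The decomposition can be checked numerically on posterior draws. In this sketch the draws come from a Beta(3, 5) distribution, which is only a stand-in for a real posterior, and the candidate decisions are a grid on [0, 1].

```python
import random
from statistics import fmean, pvariance

random.seed(0)
# Hypothetical posterior draws; Beta(3, 5) is an arbitrary stand-in.
draws = [random.betavariate(3, 5) for _ in range(5000)]


def expected_sq_loss(d, draws):
    """Monte Carlo estimate of E[(theta - d)^2 | y] from posterior draws."""
    return fmean([(theta - d) ** 2 for theta in draws])


post_mean = fmean(draws)
post_var = pvariance(draws)

# Brute-force search over candidate decisions.
grid = [i / 200 for i in range(201)]
best = min(grid, key=lambda d: expected_sq_loss(d, draws))

print(post_mean, best)
# Variance plus squared bias reproduces the expected loss at any d, e.g. d = 0.7.
print(expected_sq_loss(0.7, draws), post_var + (post_mean - 0.7) ** 2)
```

The grid minimizer lands on the grid point nearest the mean of the draws, and the two numbers on the last line agree up to floating point error.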

## However, is the posterior mean really what we want when the posterior surface is irregular?

Student discussion.
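One way to seed the discussion is a toy bimodal posterior. The sketch below assumes an equal mixture of two normal components centered at -2 and 2 (purely illustrative) and evaluates the constant error cost, with hypothetical `A = 1` and `eps = 0.5`, at the posterior mean and at one of the modes.

```python
import random
from statistics import fmean

random.seed(1)
# Hypothetical bimodal posterior: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
draws = [random.gauss(random.choice((-2.0, 2.0)), 0.5) for _ in range(4000)]


def expected_constant_loss(d, draws, A=1.0, eps=0.5):
    """Posterior expected constant error cost: A * P(|theta - d| > eps | y)."""
    return A * fmean([1.0 if abs(theta - d) > eps else 0.0 for theta in draws])


post_mean = fmean(draws)
loss_at_mean = expected_constant_loss(post_mean, draws)
loss_at_mode = expected_constant_loss(2.0, draws)

print(post_mean, loss_at_mean, loss_at_mode)
```

Here the posterior mean sits near zero, between the modes, where there is almost no posterior mass. It is still optimal for squared error, but under the constant error cost a decision at either mode has a clearly lower expected cost.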

## Inferential decisions

• When doing point estimation
  • Usually, this means we are minimizing mean squared error.
  • This allows for the handy bias-variance tradeoff intuition.
• When doing full posterior inference
  • We get full draws from the posterior surface.
  • Often, we collapse that information down into posterior mean and variance.
• Is this practical? Is there middle ground?

## One option: custom cost functions

• Use the full information provided by the posterior surface to arrive at an optimal decision.
• But let's not get carried away! There are serious risks to blindly doing this.

### Case study: marketing optimization

• A large CPG manufacturer wants to link its marketing campaigns to revenue.
• Further, they want to make the next year's marketing campaigns better.
• Possible solutions:
  • Randomized experiments + response surface methods
  • RCM (Rubin Causal Model)
  • Decision theory?

## Model statements

• Revenue \$$R \$$ as a function of promotion campaigns: $$R \given \theta, X \sim f(\theta, X)$$
• For example, regression-like model $$\rr{log} R \given \beta, X \sim X\beta + \epsilon$$
• This allows us to obtain the posterior $$p(\theta \given R, X)$$
• The posterior is the information about model parameters as provided by the data subject to the model formulation.

## Further, optimize marketing allocation for next year

Define a suitable cost function and optimize its posterior expectation with respect to \$$\delta(X) \$$

$$\int C(\theta, \delta(X)) p(\theta \given R, X) d \theta$$

What cost function should we define?

## Cost function

There are several ways (simple regression example):

• Link to mean revenue $$C(\theta, \delta(X)) = -\rr{exp} \left\{ \delta(X)\beta\right\}$$
• Link to resource reallocation costs $$C(\theta, \delta(X)) = -\rr{exp} \left\{ \delta(X)\beta\right\} + L(\delta(X))$$
• How do we define \$$L \$$?
• Incorporate nonlinearity
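A minimal sketch of the second cost function and its posterior expectation, estimated from posterior draws of `beta`. Everything concrete here is hypothetical: two campaign types, normal draws standing in for the posterior, a fixed budget of 1, and a quadratic `reallocation_cost` playing the role of `L`.

```python
import random
from math import exp
from statistics import fmean

random.seed(2)
# Hypothetical posterior draws of beta for two campaign types.
beta_draws = [(random.gauss(0.8, 0.2), random.gauss(0.3, 0.1)) for _ in range(2000)]

current = (0.5, 0.5)  # hypothetical current allocation


def reallocation_cost(delta, weight=2.0):
    """Hypothetical L: quadratic penalty for moving away from `current`."""
    return weight * sum((d - c) ** 2 for d, c in zip(delta, current))


def cost(beta, delta):
    """C(theta, delta) = -exp(delta . beta) + L(delta)."""
    return -exp(sum(d * b for d, b in zip(delta, beta))) + reallocation_cost(delta)


def posterior_expected_cost(delta):
    """Monte Carlo estimate of the posterior expectation of the cost."""
    return fmean([cost(beta, delta) for beta in beta_draws])


# Candidate allocations splitting a budget of 1 between the two campaigns.
candidates = [(i / 10, 1 - i / 10) for i in range(11)]
best = min(candidates, key=posterior_expected_cost)
print(best, posterior_expected_cost(best))
```

With these made-up numbers the chosen allocation shifts toward the campaign with the larger coefficient, but only part of the way, because the penalty term makes large moves expensive. The choice of `L` and its weight drives the answer, which is why defining it deserves care.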

## Optimization

Once we've decided on our cost, we need to find the \$$\delta(X) \$$ that minimizes its posterior expectation.

• Complex optimization
• The CPG manufacturer has 100 brands with up to 300 product items each, across 30 markets, 52 weeks, and 7 promotion campaigns.
• In total, this is up to $$100 \times 300 \times 30 \times 52 \times 7 = 327,600,000 \approx 328\rr{M} \rr{ dimensions}$$
• Add to this that this needs to be done at each point in the parameter space.
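The dimension count is a product of the figures on this slide and can be checked directly. The variable names are illustrative.

```python
brands = 100
items_per_brand = 300  # an upper bound: "up to 300 product items each"
markets = 30
weeks = 52
campaigns = 7

dimensions = brands * items_per_brand * markets * weeks * campaigns
print(f"{dimensions:,}")  # 327,600,000
```

Since 300 items per brand is a maximum, this is an upper bound on the size of the decision vector.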

## Issues

• Computational issues - how do we do this?
• Validity concerns
• What are the alternative ways to go?

## Further resources

• Causal Inference, Don Rubin
• Response Surface Methodology, Myers et al
• Parallel computing

## Announcements

• T-shirt comps!
• April 4 second final project checkpoint

## Final slide

• Next lecture: Parallel statistical computing.