18 Modelling Poorly
18.1 Motivation
Selecting Predictors
As you’ve seen, choosing predictors is hard!
Wouldn’t it be nice if the computer would choose the best model for you?
The “Best” Model
- Best represents the Data Generating Process (DGP)
  - Best for inference
- Misses the DGP, but provides useful insights into the relationships
  - Also best for inference
- Fits the current data the best
  - Overfitting?
- Best able to predict new values
  - Random Forests and Neural Nets
The “best” model depends on the goal of the study!
18.2 Picking the “Best Fitting” Model
Model Comparison Criteria
- \(R^2\), or adjusted \(R^2\)
  - Higher is better, but \(R^2\) never decreases as we add predictors; adjusted \(R^2\) penalizes the extra parameters.
- \(s^2\), the residual variance
  - Lower is better, but adding predictors almost always decreases \(s^2\), since the RSS always decreases.
- Mallows’ \(C_p\) statistic
  - \(C_p = RSS_p/s^2 - (n - 2p)\)
  - \(RSS_p\) is the RSS of the smaller model, with \(p\) parameters
  - \(s^2\) is the MSE from the largest model under consideration
  - Adding predictors does not necessarily decrease this statistic.
- AIC, the Akaike Information Criterion
  - \(AIC = 2p - 2\ln(\hat L)\), where \(\hat L\) is the likelihood evaluated at the estimated parameters.
  - E.g., when \(\epsilon_i\sim N(0,\sigma^2)\), the likelihood is the product of normal densities with mean \(X\hat{\underline\beta}\) and variance \(s^2\).
  - Does not necessarily decrease with added predictors.
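To make these criteria concrete, here is a minimal R sketch (using the built-in mtcars data purely for illustration) that extracts each statistic for a candidate model, with \(C_p\) computed by hand from the formula above:

``` r
small <- lm(mpg ~ wt + qsec, data = mtcars)  # a candidate model
full  <- lm(mpg ~ ., data = mtcars)          # largest model under consideration

summary(small)$r.squared      # R^2
summary(small)$adj.r.squared  # adjusted R^2
sigma(small)^2                # s^2, the residual variance

# Mallows' Cp, computed directly from the formula above
n   <- nrow(mtcars)
p   <- length(coef(small))    # number of parameters in the smaller model
RSS <- sum(resid(small)^2)    # RSS_p
s2  <- sigma(full)^2          # MSE from the largest model
RSS / s2 - (n - 2 * p)

AIC(small)                    # Akaike Information Criterion
```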
More on AIC
\[ AIC = 2p - 2\ln(\hat L) \]
Recall from Maximum Likelihood Estimation that the likelihood is the probability density of observing the particular data, given the parameters: \[ L(\underline y|\underline\beta, X, \sigma) = \prod_{i=1}^n f_Y(y_i|\underline x_i, \underline\beta, \sigma^2), \] where \(f_Y(y_i|\underline x_i, \underline\beta, \sigma^2)\) is the normal density with mean \(\underline x_i^\top\underline\beta\) and variance \(\sigma^2\).
- A high AIC means that either:
  - We have too many parameters, or
  - Our model doesn’t fit the data well.
- A low AIC means we’ve got a good model that isn’t overly complicated.
  - “Low” is relative to other models.
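To connect the formula to R’s output, a quick sketch (same mtcars assumption as above); note that R counts \(\sigma\) as one of the \(p\) parameters:

``` r
fit <- lm(mpg ~ wt + qsec, data = mtcars)

ll <- logLik(fit)           # ln(L-hat), the maximized log-likelihood
p  <- attr(ll, "df")        # parameter count, including sigma
2 * p - 2 * as.numeric(ll)  # matches AIC(fit)
AIC(fit)
```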
18.3 Automated Model Selection Algorithms
Best Subset
- Find the collection of predictors that optimizes the statistic of interest.
That’s it. You just try them all.
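With \(k\) candidate predictors, that means fitting \(2^k\) models. In R, the leaps package (an add-on package, assumed installed here) implements an efficient exhaustive search:

``` r
library(leaps)  # install.packages("leaps") if needed

# Try every subset of the predictors, keeping the best model of each size
all_subsets <- regsubsets(mpg ~ disp + wt + am + cyl + qsec, data = mtcars)
summary(all_subsets)$adjr2  # adjusted R^2 for the best model of each size
summary(all_subsets)$cp     # Mallows' Cp for the best model of each size
```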
Backward and Forward Selection
- Backward Selection
  - Include all predictors, try removing one
  - Check the \(R^2\), p-values, Mallows’ \(C_p\), or AIC
  - Put that one back in the model, try removing another
  - Repeat until you’ve found the “best” predictor to exclude. Then find the next best one!
- Forward Selection
  - Find the best predictor to include first
  - Find the best predictor to include second
  - …
Both need some sort of stopping criterion.
Backward Selection
- Fit a model with all \(p\) predictors.
- Try all models with \(p-1\) predictors.
- Identify the best one; say we remove \(x_j\).
- Check the stopping criteria.
- If the stopping criteria are not met, try all models with \(p-2\) predictors, not including \(x_j\).
Backward Selection Example
- Start with mpg ~ disp + wt + am + cyl + qsec.
- Check all of the AICs, remove cyl.
- Check all of the AICs, remove disp.
- Check all AICs, stop.
Final model: mpg ~ wt + am + qsec
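R’s built-in step() function automates this. A sketch of the example above (it should reproduce the same final model here, though step() offers no guarantees in general):

``` r
full <- lm(mpg ~ disp + wt + am + cyl + qsec, data = mtcars)
backward <- step(full, direction = "backward")  # prints the AIC at each step
formula(backward)  # mpg ~ wt + am + qsec
```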
Forward Selection
- Start with mpg ~ 1.
- Test each predictor individually, check AIC, keep wt.
- Test each remaining predictor, check AIC, keep cyl.
- Test each remaining predictor, stop.
Final model: mpg ~ wt + cyl
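Forward selection with step() starts from the intercept-only model and needs an explicit scope (same mtcars sketch as before):

``` r
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = ~ disp + wt + am + cyl + qsec,
                direction = "forward")
formula(forward)  # mpg ~ wt + cyl
```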
Evaluating Algorithmic Predictor Selection
Suppose we have measured 30 predictors that we know are not related to the response.
How many predictors should Backward Selection select?
Try it out yourself!
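Here is one way to try it (a sketch: a hypothetical sample of 100 observations, all 30 predictors pure noise):

``` r
set.seed(1)  # arbitrary seed; results vary with the seed
n <- 100
noise <- as.data.frame(matrix(rnorm(n * 30), nrow = n))
noise$y <- rnorm(n)  # the response is unrelated to every predictor

full   <- lm(y ~ ., data = noise)
chosen <- step(full, direction = "backward", trace = 0)

# The right answer is 0, but backward selection
# typically keeps several noise predictors anyway.
length(coef(chosen)) - 1
```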
A Special Case: Race
- In general, we almost always want to include race if it was measured.
- If our model is using race to make a decision, we want to know about it!
- Possible approach: not dummy variables.
18.4 The Best Model
Neural Networks and Random Forests
- Neural Networks
  - Essentially a series of linear regressions, each followed by a minor non-linear transformation.
  - A “deep” neural network stacks many of these layers, capturing non-linear effects and all interactions.
  - Very finicky, but very powerful when necessary.
- Random Forests
  - Also capture non-linear effects with interactions.
  - Much much much less finicky.
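Both are a single call in R (a sketch assuming the randomForest and nnet packages are installed):

``` r
library(randomForest)  # install.packages("randomForest") if needed
library(nnet)

rf <- randomForest(mpg ~ ., data = mtcars)    # works out of the box
nn <- nnet(mpg ~ ., data = mtcars, size = 5,  # one hidden layer of 5 units
           linout = TRUE, trace = FALSE)      # finicky: results change run to
                                              # run, and size needs tuning
```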
Causal Inference
Experiments are our way of controlling variables so that we can isolate their effect.
Most data we tend to use is observational.
- Causal inference is statistical magic to determine causality from observation.
- … with varying degrees of success.
Why are you telling us this, Devan?
Which model is “best”?
- Best predictions?
  - NN and RF, with cross-validation (see the sketch after this list).
- Best inference?
  - Build a model based on the context of the problem.
  - Choose transformations and interactions appropriately.
  - Only check p-values at the very end.
- Best subset of predictors?
  - Recall: multicollinearity. Without an experiment, correlated predictors mean that there’s no way to tell which predictors are best.
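As promised, a minimal cross-validation sketch (5 folds on mtcars, comparing a linear model to a random forest by out-of-sample error):

``` r
library(randomForest)

set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
mse <- matrix(NA, k, 2, dimnames = list(NULL, c("lm", "rf")))

for (i in 1:k) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  lm_fit <- lm(mpg ~ ., data = train)
  rf_fit <- randomForest(mpg ~ ., data = train)
  mse[i, "lm"] <- mean((test$mpg - predict(lm_fit, test))^2)
  mse[i, "rf"] <- mean((test$mpg - predict(rf_fit, test))^2)
}
colMeans(mse)  # average out-of-sample MSE for each model
```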
Opinion: Algorithmic selection methods are bad approximations to better techniques that are outside the scope of this course.