18  Modelling Poorly

Author: Dr. Devan Becker

Affiliation: Bad by Michael Jackson

Published: 2024-07-17

18.1 Motivation

Selecting Predictors

As you’ve seen, choosing predictors is hard!

Wouldn’t it be nice if the computer would choose the best model for you?

The “Best” Model

  • Best represents the Data Generating Process (DGP)
    • Best for inference
  • Misses the DGP, but provides useful insights into the relationships
    • Also best for inference
  • Fits the current data the best
    • Overfitting?
  • Best able to predict new values
    • Random Forests and Neural Nets

The “best” model depends on the goal of the study!

18.2 Picking the “Best Fitting” Model

Model Comparison Criteria

  • \(R^2\), or adjusted \(R^2\)
    • Higher is better, but \(R^2\) increases as we add predictors
  • \(s^2\), the residual variance.
    • Adding predictors never increases the RSS, so \(s^2\) essentially always decreases
  • Mallows’ \(C_p\) statistic
    • \(C_p = RSS_p/s^2 - (n - 2p)\)
      • \(RSS_p\) is the RSS of the smaller model, with \(p\) parameters
      • \(s^2\) is the MSE from the largest model under consideration
    • The \(2p\) term penalizes complexity, so adding predictors does not automatically decrease this statistic.
  • AIC, the Akaike Information Criterion
    • \(AIC = 2p - 2\ln(\hat L)\), where \(\hat L\) is the likelihood evaluated at the estimated parameters.
      • E.g., when \(\epsilon_i\sim N(0,\sigma^2)\), the likelihood is the product of normal densities with means \(X\hat{\underline\beta}\) and variance \(s^2\).
    • The \(2p\) penalty means AIC does not automatically decrease with added predictors (see the sketch after this list).
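To make these concrete, here is a minimal R sketch computing each criterion for nested models of the built-in mtcars data (the same data as the selection examples later in this chapter); the particular models are arbitrary choices for illustration.

```r
# Compare fit criteria across nested models of mtcars (illustrative choices)
medium <- lm(mpg ~ wt + qsec, data = mtcars)
large  <- lm(mpg ~ disp + wt + am + cyl + qsec, data = mtcars)

# R^2 and adjusted R^2: higher is "better"
summary(medium)$r.squared
summary(medium)$adj.r.squared

# Residual variance s^2 (sigma is the residual standard error)
summary(medium)$sigma^2

# Mallows' Cp: RSS_p / s^2 - (n - 2p), with s^2 taken from the largest model
n   <- nrow(mtcars)
p   <- length(coef(medium))        # parameters in the smaller model
s2  <- summary(large)$sigma^2      # MSE from the largest model considered
rss <- sum(resid(medium)^2)
rss / s2 - (n - 2 * p)

# AIC: lower is better, relative to other models on the same data
AIC(medium)
```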

More on AIC

\[ AIC = 2p - 2\ln(\hat L) \]

Recall from Maximum Likelihood Estimation: the likelihood is the probability density of observing the particular data, given the parameters, \[ L(\underline y|\underline\beta, X, \sigma) = \prod_{i=1}^n f_Y(y_i|X_i\underline\beta, \sigma^2), \] where \(f_Y(y_i|X_i\underline\beta, \sigma^2)\) is the normal density with mean \(X_i\underline\beta\) and variance \(\sigma^2\).

  • A high AIC means we either:
    • Have too many parameters, or
    • Our model doesn’t fit the data well.
  • A low AIC means we’ve got a good model that isn’t overly complicated.
    • “Low” is relative to other models fit to the same data (see the sketch below).
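To connect the formula to the likelihood above, here is a hedged sketch that rebuilds a linear model’s AIC from the product of normal densities; under the stated assumptions it should agree with R’s built-in AIC().

```r
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# lm's maximum likelihood estimate of the variance divides by n, not n - p
sigma_mle <- sqrt(sum(resid(fit)^2) / nrow(mtcars))

# ln(L-hat): the sum of log normal densities at the fitted means
ll <- sum(dnorm(mtcars$mpg, mean = fitted(fit), sd = sigma_mle, log = TRUE))

# p counts the regression coefficients plus sigma itself
p <- length(coef(fit)) + 1
2 * p - 2 * ll   # should match the built-in value below
AIC(fit)
```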

18.3 Automated Model Selection Algorithms

Best Subset

  1. Find the collection of predictors that optimizes the statistic of interest.

That’s it. You just try them all.
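A sketch of what “try them all” might look like in R, using AIC as the statistic of interest and the five mtcars predictors from the examples below. With \(p\) candidate predictors there are \(2^p - 1\) non-empty subsets, so brute force is only feasible for small \(p\).

```r
predictors <- c("disp", "wt", "am", "cyl", "qsec")

# Fit every non-empty subset of the candidate predictors
fits <- lapply(seq_along(predictors), function(k) {
  lapply(combn(predictors, k, simplify = FALSE), function(vars) {
    lm(reformulate(vars, response = "mpg"), data = mtcars)
  })
})
fits <- unlist(fits, recursive = FALSE)   # 2^5 - 1 = 31 models

# Keep whichever subset minimizes AIC
aics <- sapply(fits, AIC)
formula(fits[[which.min(aics)]])
```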

Backward and Forward Selection

  • Backward Selection
    • Include all predictors, try removing one
      • Check the \(R^2\), p-values, Mallow’s Cp, or AIC
    • Put that one back in the model, try removing another
    • Repeat until you’ve found the “best” predictor to exclude. Then find the next best one!
  • Forward Selection
    • Find the best predictor to include first
    • Find the best predictor to include second

Both require some sort of stopping criterion.

Backward Selection

  1. Fit a model with all \(p\) predictors
  2. Try all models with \(p-1\) predictors.
    • Identify the best one, say remove \(x_j\)
    • Check stopping criteria.
  3. If the stopping criteria are not met, try all models with \(p-2\) predictors, not including \(x_j\).

Backward Selection Example

  1. Start with mpg ~ disp + wt + am + cyl + qsec.
  2. Check all of the AICs, remove cyl.
  3. Check all of the AICs, remove disp.
  4. Check all AICs, stop.

Final model: mpg ~ wt + am + qsec
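In R, backward selection by AIC is what step() does with direction = "backward"; a sketch that should reproduce the steps above:

```r
full <- lm(mpg ~ disp + wt + am + cyl + qsec, data = mtcars)

# step() prints the AIC comparisons at each stage
back <- step(full, direction = "backward")
formula(back)   # should end at mpg ~ wt + am + qsec
```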

Forward Selection

  1. Start with mpg ~ 1.
  2. Test each predictor individually, check AIC, keep wt.
  3. Test each remaining predictor, check AIC, keep cyl.
  4. Test each remaining predictor, stop.

Final model: mpg ~ wt + cyl
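The forward version starts from the intercept-only model and uses a scope to say which predictors may enter; a sketch that should reproduce the steps above:

```r
null <- lm(mpg ~ 1, data = mtcars)

# scope gives the largest model the search may grow toward
fwd <- step(null, direction = "forward",
            scope = ~ disp + wt + am + cyl + qsec)
formula(fwd)   # should end at mpg ~ wt + cyl
```

Notice that forward and backward selection arrived at different “best” models from the same data and the same criterion.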

Think-Pair-Share

What might these methods be missing?

When would these methods be useful?

Evaluating Algorithmic Predictor Selection

Suppose we have measured 30 predictors that we know are not related to the response.

How many predictors should Backward Selection select?

Try it out yourself!
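Here is one way to try it, as a simulation sketch (the seed, sample size, and use of AIC are arbitrary choices): every one of the 30 predictors is pure noise, so an honest procedure should keep none of them. With AIC’s relatively light penalty, backward selection typically retains several anyway.

```r
set.seed(18)
n <- 100

# 30 predictors and a response, all independent standard normal noise
noise <- as.data.frame(matrix(rnorm(n * 30), ncol = 30))
noise$y <- rnorm(n)

full <- lm(y ~ ., data = noise)
back <- step(full, direction = "backward", trace = 0)
length(coef(back)) - 1   # number of spurious predictors retained
```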

A Special Case: Race

  • In general, we almost always want to include race if it was measured.
    • If our model is using race to make a decision, we want to know about it!
  • Possible approach: encode race as something other than dummy variables.

18.4 The Best Model

Neural Networks and Random Forests

  • Neural Networks
    • Essentially a series of linear regressions, each followed by a minor non-linear transformation.
    • A “deep” neural network stacks many such layers, capturing non-linear effects and all interactions.
    • Very finicky, but very powerful when necessary.
  • Random Forests
    • Also capture non-linear effects with interactions.
    • Much, much less finicky (see the sketch below).
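A hedged sketch comparing a random forest to the earlier linear model, assuming the randomForest package is installed; mtcars is far too small for this comparison to mean much, so treat it as a syntax illustration only.

```r
library(randomForest)

set.seed(18)
rf  <- randomForest(mpg ~ ., data = mtcars)
lin <- lm(mpg ~ wt + am + qsec, data = mtcars)

# predict(rf) with no new data returns out-of-bag predictions,
# an honest error estimate that needs no separate holdout set
mean((mtcars$mpg - predict(rf))^2)    # OOB MSE
mean((mtcars$mpg - predict(lin))^2)   # in-sample MSE (optimistic)
```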

Causal Inference

Experiments are our way of controlling variables so that we can isolate their effect.

Most data we tend to use is observational.

  • Causal inference is statistical magic to determine causality from observation.
    • … with varying degrees of success.

Why are you telling us this, Devan?

Which model is “best”?

  • Best predictions?
    • NN and RF, with cross-validation.
  • Best inference?
    • Build a model based on the context of the problem.
    • Choose transformations and interactions appropriately.
    • Only check p-values at the very end.
  • Best subset of predictors?
    • Recall: multicollinearity. Without an experiment, correlated predictors mean that there’s no way to tell which predictors are best.

Opinion: Algorithmic selection methods are bad approximations to better techniques that are outside the scope of this course.