22  Final Exam Review

Dr. Devan Becker, Wilfrid Laurier University

Published 2024-07-31

22.1 Summaries

Generally important things

  • The bias and variance of \(\hat{\underline\beta}\).
  • Interpreting coefficients and inference.
  • The variance of the estimators, rather than just the point estimates.
  • Interpreting residual plots.
  • Choosing a reasonable model given the context of the problem.

“Extra Topics” Lecture

  • Standardizing
    • Effect on parameter estimates (especially correlations); see the sketch after this list
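As a quick illustration (a minimal sketch using the built-in mtcars data, not taken from the lecture itself), standardizing both the response and a single predictor makes the fitted slope equal to their sample correlation:

```r
# Standardize y and x, then fit a simple regression:
# the resulting slope is exactly the sample correlation.
z_mpg <- as.numeric(scale(mtcars$mpg))
z_wt  <- as.numeric(scale(mtcars$wt))

coef(lm(z_mpg ~ z_wt))[2]   # slope of the standardized regression
cor(mtcars$mpg, mtcars$wt)  # same value
```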

Getting the Wrong Model

  • Bias due to missing predictors (see the simulation sketch after this list).
  • What does it even mean to have the “right” model?
    • Proxy measures and their effect on the other parameter estimates
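The omitted-variable effect can be demonstrated with a minimal simulation sketch (hypothetical coefficients, chosen only for illustration): leaving out a predictor that is correlated with an included one biases the included predictor's slope estimate.

```r
# Omitted-variable bias: x2 is correlated with x1 and affects y.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)            # correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))["x1"]  # approx. 2 (the true slope)
coef(lm(y ~ x1))["x1"]       # approx. 2 + 3 * 0.8 = 4.4 (biased)
```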

Transforming the Predictors

  • Polynomial models (a short sketch follows this list)
    • When to use them
    • Lower order terms
    • Extrapolation
  • Other transformations (e.g. log, combining predictors, etc.)
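Here is a minimal polynomial-regression sketch using the built-in mtcars data (my example, not the lecture's): the quadratic term is added while keeping the lower-order linear term, and extrapolation beyond the observed range is where polynomial fits behave worst.

```r
# Quadratic model: poly(hp, 2) automatically keeps the lower-order term.
fit <- lm(mpg ~ poly(hp, 2), data = mtcars)
summary(fit)

# Extrapolation: hp ranges from 52 to 335 in mtcars, so the prediction
# at hp = 600 is driven entirely by the polynomial's shape, not by data.
predict(fit, newdata = data.frame(hp = c(100, 600)))
```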

Transforming the Response

  • Explain “Stabilizing the variance”
  • Diagnosing the need for a transformation
  • Choosing transformations
  • The effect on the model
    • E.g. multiplicative errors, changes to the parameter estimates (see the sketch after this list)
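A minimal simulation sketch (hypothetical data) of why a log transformation of the response "stabilizes the variance": multiplicative errors on the original scale become additive, constant-variance errors on the log scale.

```r
# y has multiplicative errors, so its spread grows with its mean.
set.seed(2)
n <- 200
x <- runif(n, 1, 10)
y <- exp(0.5 + 0.3 * x) * exp(rnorm(n, sd = 0.2))

fit <- lm(log(y) ~ x)  # errors are additive on the log scale
coef(fit)              # slope approx. 0.3; a 1-unit increase in x
                       # multiplies the median of y by exp(0.3)
```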

Dummy Variables

  • Definition and interpretation
    • factor(cyl)8 in the coefficients table
  • If there are three categories, we need two dummies
    • “Reference” category is absorbed into the intercept.
  • Interaction terms: different intercept, different slope.
  • Significance of a dummy variable (or interaction with one)
  • Extra sum-of-squares to test whether categories are statistically different (see the sketch after this list)
    • What kind of test is this? ANOVA? ANCOVA?
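A minimal sketch with mtcars tying these bullets together: factor(cyl) creates two dummies for three categories, the reference category (cyl = 4) is absorbed into the intercept, and an extra sum-of-squares (partial F) test compares the models with and without the dummies.

```r
# Three categories of cyl (4, 6, 8) -> two dummy variables;
# cyl = 4 is the reference category, absorbed into the intercept.
fit0 <- lm(mpg ~ wt, data = mtcars)
fit1 <- lm(mpg ~ wt + factor(cyl), data = mtcars)
coef(fit1)  # includes factor(cyl)6 and factor(cyl)8

# Extra sum-of-squares test: do the cyl categories differ
# after accounting for wt?
anova(fit0, fit1)
```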

Multicollinearity!

  • Why it increases variance (many different parameter combinations are equivalent)
  • Detecting via the variance inflation factor (see the sketch after this list)
    • A VIF of approximately 10 is often treated as bad, but this is just a rule of thumb.
  • What to do about it, and what it means for interpreting coefficients.
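A minimal VIF sketch with mtcars (this assumes the car package is installed; vif() comes from car, not base R):

```r
library(car)  # provides vif(); assumed to be installed

fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif(fit)  # values near 10 or above suggest problematic multicollinearity

# Equivalently, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors.
r2_disp <- summary(lm(disp ~ hp + wt, data = mtcars))$r.squared
1 / (1 - r2_disp)
```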

Best Subset Selection

  • Goal: Inference or prediction?
    • How does Subset Selection fit into this?
  • General idea of the algorithms (see the sketch after this list).
    • Don’t need to know Mallows’ Cp etc.
  • Why the p-values can’t really be trusted.
  • Useful as a preliminary step (sometimes).
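A minimal best-subset sketch (this assumes the leaps package is installed; regsubsets() is its function):

```r
library(leaps)  # provides regsubsets(); assumed to be installed

fits <- regsubsets(mpg ~ ., data = mtcars)
summary(fits)$which   # predictors in the best model of each size
summary(fits)$adjr2   # one criterion for comparing different sizes

# Note: p-values from a model refit after this search can't really be
# trusted, since the same data both chose the model and tests it.
```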

Degrees of Freedom

  • General modelling strategies.
  • Choosing transformations based on domain knowledge.
  • Being explicit about the decisions made while modelling.

Regularization

  • It’s just linear regression with “smaller” slope estimates!
    • Intercept isn’t constrained.
    • Some slopes can be bigger than the slopes for linear regression, but sum of abs/squares is smaller.
  • Regularization prevents overfitting.
    • Choose the penalty parameter \(\lambda\) by minimizing out-of-sample prediction error, measured via cross-validation.
    • Within 1 SE of minimum MSE leads to a simpler (more regularized) model.
  • Adds bias, which can be a good thing: the bias is traded for a reduction in variance, which can lower overall prediction error.
  • Requires standardization of the predictors.
  • LASSO can set slope estimates exactly to 0; ridge shrinks them toward 0 but never exactly to 0 (see the sketch after this list).
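A minimal cross-validated LASSO sketch (this assumes the glmnet package is installed; cv.glmnet() is its function):

```r
library(glmnet)  # provides cv.glmnet(); assumed to be installed

X <- as.matrix(mtcars[, -1])  # all predictors
y <- mtcars$mpg

# alpha = 1 gives LASSO (alpha = 0 gives ridge);
# glmnet standardizes the predictors internally by default.
cvfit <- cv.glmnet(X, y, alpha = 1)

coef(cvfit, s = "lambda.min")  # lambda that minimizes CV error
coef(cvfit, s = "lambda.1se")  # simpler model within 1 SE of the minimum
```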

Classification

  • Response values are 0 or 1
    • Expected value of \(y\) is the proportion of 1s given a value of \(x\).
    • Predictions can be converted to 0s and 1s, with two types of errors (confusion matrix); see the sketch after this list.
  • Logistic function looks like an “S”
    • Exact shape determined by linear predictor \(\eta(X) = X\underline\beta\)
  • Other than transformations of response, all lm topics apply.
    • Including regularization!
  • “Residuals” are weird
    • Not “observed minus expected” anymore!
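A minimal logistic-regression sketch with mtcars (am is already a 0/1 variable there):

```r
# Logistic regression: am = 0 (automatic) or 1 (manual).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities trace out the S-shaped logistic curve.
p_hat <- predict(fit, type = "response")

# Convert to 0/1 predictions and tabulate the two types of errors.
table(predicted = as.numeric(p_hat > 0.5), observed = mtcars$am)
```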

22.2 Self-Study

In addition to a quiz that will be added to MyLS, below is the Quizizz that we went through in class!

22.3 Midterm Solutions

ANOVA

Explain how a hypothesis test based on the ratio \(MS_{Reg}/MS_E\) in the ANOVA table is a test for whether any of the slope parameters are nonzero (i.e. \(H_0\): all slopes are 0).

  • A model with all slopes equal to 0 is a horizontal line: every fitted value is \(\bar y\), so \(SS_{Reg} = \sum(\hat y_i - \bar y)^2 = 0\), i.e. \(MS_{Reg} = 0\).
  • Due to random chance, the estimated slopes are never exactly 0, so we never actually get \(MS_{Reg} = 0\). Dividing by \(MS_E\) measures the size of \(MS_{Reg}\) relative to the noise variance.
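To state the test explicitly (a standard result; here \(k\) denotes the number of slope parameters, notation assumed rather than taken from the solutions):

\[ F = \frac{MS_{Reg}}{MS_E} = \frac{SS_{Reg}/k}{SS_E/(n-k-1)} \sim F_{k,\, n-k-1} \quad \text{under } H_0: \beta_1 = \cdots = \beta_k = 0, \]

so we reject \(H_0\) when the observed ratio is larger than this distribution makes plausible.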

Bias/Variance

For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).

  1. Given the model \(y_i = \beta_0 + \epsilon_i\), show that \(\hat\beta_0 = \bar y\) minimizes the sum of squared error.

\[\begin{align*} \frac{1}{n}\sum(y_i - \hat y_i)^2 &= \frac{1}{n}\sum(y_i - \beta_0)^2\\ \frac{d}{d\beta_0}\,\frac{1}{n}\sum(y_i - \beta_0)^2 &= -\frac{2}{n}\sum(y_i - \beta_0)\\ &= -\frac{2}{n}\left(\sum y_i - n\beta_0\right) \stackrel{set}{=}0\\ \implies \sum y_i &= n\beta_0 \implies \hat\beta_0 = \bar y \end{align*}\]

This is a minimum because the objective is a quadratic in \(\beta_0\) with a positive coefficient on \(\beta_0^2\) (equivalently, the second derivative is \(2 > 0\)).

  2. Given the model \(y_i = \beta_0 + \epsilon_i\), show that \(E(\hat\beta_0) = \beta_0\) and \(V(\hat\beta_0) = \sigma^2/n\).

\[\begin{align*} E(\hat\beta_0) &= E\left(\frac{1}{n}\sum y_i\right)= \frac{1}{n}\sum E(y_i)\\ &= \frac{1}{n}\sum E(\beta_0 + \epsilon_i)= \frac{1}{n}n\beta_0 = \beta_0 \end{align*}\]

\[\begin{align*} V(\hat\beta_0) &= V\left(\frac{1}{n}\sum y_i\right)= \frac{1}{n^2}\sum V(y_i)\\ &= \frac{1}{n^2}\sum V(\beta_0 + \epsilon_i)= \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n} \end{align*}\]
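As an optional numerical check (a minimal simulation sketch, not part of the original solutions), both results can be verified by simulating many samples in R:

```r
# Monte Carlo check of E(beta0_hat) = beta0 and V(beta0_hat) = sigma^2 / n.
set.seed(3)
beta0 <- 5
sigma <- 2
n <- 50

# Each replicate draws y_i = beta0 + eps_i and computes beta0_hat = ybar.
beta0_hat <- replicate(10000, mean(beta0 + rnorm(n, sd = sigma)))

mean(beta0_hat)  # close to beta0 = 5
var(beta0_hat)   # close to sigma^2 / n = 4 / 50 = 0.08
```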

  3. Now consider the multiple linear regression model \(Y = X\underline\beta + \underline\epsilon\), where we know that \(\hat{\underline{\beta}} = (X^TX)^{-1}X^TY\). Show that \(E(\hat{\underline{\beta}}) = \underline{\beta}\).

\[ E(\hat{\underline{\beta}}) = E((X^TX)^{-1}X^TY) = (X^TX)^{-1}X^TE(Y) = (X^TX)^{-1}X^TX\underline\beta = \underline\beta, \]

where \(E(Y) = X\underline\beta + E(\underline\epsilon) = X\underline\beta\) because \(X\) is treated as fixed and \(E(\underline\epsilon) = \underline 0\).