22 Final Exam Review
22.1 Summaries
Generally important things
- The bias and variance of \(\hat{\underline\beta}\).
- Interpreting coefficients and inference.
- Variance, rather than point estimates.
- Interpreting residual plots.
- Choosing a reasonable model given the context of the problem.
“Extra Topics” Lecture
- Standardizing
- Effect on parameter estimates (especially correlations)
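A minimal R sketch of the standardizing point, using the built-in mtcars data (an assumption here, not necessarily the lecture's example):

```r
# Centering/standardizing a predictor changes the scale of the slope estimate
# and the correlation between the intercept and slope estimates
fit_raw <- lm(mpg ~ hp, data = mtcars)
fit_std <- lm(mpg ~ scale(hp), data = mtcars)

coef(fit_raw)           # slope per unit of hp
coef(fit_std)           # slope per standard deviation of hp
cov2cor(vcov(fit_raw))  # intercept and slope estimates are strongly correlated
cov2cor(vcov(fit_std))  # after centering, that correlation is essentially zero
```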
Getting the Wrong Model
- Bias due to missing predictors.
- What does it even mean to have the “right” model?
- Proxy measures and their effect on the other parameter estimates
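A small simulation (not from the lecture, just a hedged illustration) of the bias caused by a missing predictor that is correlated with an included one:

```r
# Omitting x2, which is correlated with x1, biases the slope on x1
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n) # true slopes: 2 and 3

coef(lm(y ~ x1 + x2))  # both slopes close to the true values
coef(lm(y ~ x1))       # slope on x1 absorbs part of x2's effect (about 2 + 3 * 0.8)
```

The same mechanism explains why adding a proxy measure shifts the other parameter estimates.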
Transforming the Predictors
- Polynomial models
- When to use them
- Lower order terms
- Extrapolation
- Other transformations (e.g. log, combining predictors, etc.)
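A quick sketch of polynomial and log transformations of a predictor, again using mtcars as a stand-in dataset:

```r
# Quadratic in wt; poly() keeps the lower-order term automatically,
# and raw = TRUE gives coefficients on wt and wt^2 directly
fit_quad <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)

# Equivalent model written with I(); a log transformation is just as easy
fit_quad2 <- lm(mpg ~ wt + I(wt^2), data = mtcars)
fit_log   <- lm(mpg ~ log(wt), data = mtcars)

# Extrapolation warning: predictions far outside the observed range of wt
# (max is about 5.4) are driven entirely by the assumed polynomial shape
predict(fit_quad, newdata = data.frame(wt = 10))
```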
Transforming the Response
- Explain “Stabilizing the variance”
- Diagnosing the need for a transformation
- Choosing transformations
- The effect on the model
- E.g. multiplicative errors, changes to the parameter estimates
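A sketch of diagnosing and applying a log transformation of the response (mtcars as a stand-in; the funnel-shaped residual plot is the thing to look for):

```r
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
plot(fitted(fit_raw), resid(fit_raw))  # spread growing with fitted values suggests a transformation

fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)
plot(fitted(fit_log), resid(fit_log))  # check whether the variance looks stabilized

# On the original scale the errors are now multiplicative:
# log(y) = b0 + b1*wt + b2*hp + e  <=>  y = exp(b0 + b1*wt + b2*hp) * exp(e)
exp(coef(fit_log))  # coefficients act multiplicatively on y
```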
Dummy Variables
- Definition and interpretation
- E.g. factor(cyl)8 in the coefficients table
- If there are three categories, we need two dummies
- “Reference” category is absorbed into the intercept.
- Interaction terms: different intercept, different slope.
- Significance of a dummy variable (or interaction with one)
- Extra sum-of-squares to test whether categories are statistically different
- What kind of test is this? ANOVA? ANCOVA?
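A minimal sketch with the built-in mtcars data, presumably the kind of fit behind the factor(cyl)8 coefficient mentioned above:

```r
# cyl has three categories (4, 6, 8), so factor(cyl) creates two dummies;
# the reference category (cyl == 4) is absorbed into the intercept
fit_add <- lm(mpg ~ wt + factor(cyl), data = mtcars)
coef(fit_add)  # includes factor(cyl)6 and factor(cyl)8

# Interaction: a different intercept AND a different slope for each group
fit_int <- lm(mpg ~ wt * factor(cyl), data = mtcars)

# Extra sum-of-squares (partial F) tests: are the categories statistically different?
fit_null <- lm(mpg ~ wt, data = mtcars)
anova(fit_null, fit_add, fit_int)
```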
Multicollinearity!
- Why it increases variance (many different parameter combinations are equivalent)
- Detecting via the variance inflation factor
- A VIF of roughly 10 or more is considered problematic, but this is just a rule of thumb.
- What to do about it, and what it means for interpreting coefficients.
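A sketch of the VIF in practice; vif() here comes from the car package (an assumption, since the course may use a different function):

```r
library(car)  # for vif()

fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
vif(fit)   # values near or above roughly 10 flag problematic multicollinearity

# One possible response: drop (or combine) offending predictors and refit
fit2 <- lm(mpg ~ hp + wt, data = mtcars)
vif(fit2)
```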
Best Subset Selection
- Goal: Inference or prediction?
- How does Subset Selection fit into this?
- General idea of the algorithms.
- Don’t need to know Mallows’ \(C_p\) etc.
- Why the p-values can’t really be trusted.
- Useful as a preliminary step (sometimes).
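A sketch using regsubsets() from the leaps package (an assumption; the course may have used different software), plus a reminder of why follow-up p-values are suspect:

```r
library(leaps)  # for regsubsets()

all_fits <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
summary(all_fits)$which  # which predictors appear in the best model of each size

# p-values from a model chosen this way can't really be trusted:
# the same data both picked the model and is then reused to test it
chosen <- lm(mpg ~ wt + qsec + am, data = mtcars)  # e.g. a selected 3-variable model
summary(chosen)
```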
Degrees of Freedom
- General modelling strategies.
- Choosing transformations based on domain knowledge.
- Being explicit about the decisions made while modelling.
Regularization
- It’s just linear regression with “smaller” slope estimates!
- Intercept isn’t constrained.
- Individual slopes can be bigger than the corresponding least squares slopes, but the sum of their absolute values (LASSO) or squares (ridge) is smaller.
- Regularization prevents overfitting.
- Choose the penalty parameter \(\lambda\) by minimizing out-of-sample prediction error, measured via cross-validation.
- Choosing the largest \(\lambda\) within 1 SE of the minimum CV MSE leads to a simpler (more regularized) model.
- Adds bias, which is a good thing?
- Requires standardization of the predictors.
- LASSO can set some slope estimates exactly to 0; Ridge does not.
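A sketch with the glmnet package (an assumption about the software; glmnet standardizes the predictors internally and leaves the intercept unpenalized, matching the bullets above):

```r
library(glmnet)

X <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix without the intercept column
y <- mtcars$mpg

cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1: LASSO; alpha = 0: ridge
cv_lasso$lambda.min   # lambda minimizing cross-validated prediction error
cv_lasso$lambda.1se   # larger lambda within 1 SE: a simpler, more regularized model

# LASSO sets some slopes exactly to 0; ridge only shrinks them
coef(cv_lasso, s = "lambda.1se")
```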
Classification
- Response values are 0 or 1
- Expected value of \(y\) is the proportion of 1s given a value of \(x\).
- Predictions can be converted to 0s and 1s, with two types of errors (confusion matrix).
- Logistic function looks like an “S”
- Exact shape determined by linear predictor \(\eta(X) = X\underline\beta\)
- Other than transformations of the response, all lm topics apply.
- Including regularization!
- “Residuals” are weird
- Not “observed minus expected” anymore!
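A minimal logistic regression sketch, again with mtcars as a stand-in (am is a 0/1 response):

```r
# Logistic regression: the exact shape of the S-curve comes from eta = X beta
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

p_hat <- predict(fit_logit, type = "response")  # predicted probabilities
y_hat <- as.numeric(p_hat > 0.5)                # convert to 0/1 with a 0.5 cutoff

table(observed = mtcars$am, predicted = y_hat)  # confusion matrix: two types of errors
```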
22.2 Self-Study
In addition to a quiz that will be added to MyLS, below is the Quizizz that we went through in class!
22.3 Midterm Solutions
ANOVA
Explain how a hypothesis test based on the ratio \(MS_{Reg}/MS_E\) in the ANOVA table is a test for whether any of the slope parameters are 0.
- A model with all slopes equal to 0 fits a horizontal line at \(\bar y\), so the fitted values have no variation around \(\bar y\), i.e. \(MS_{Reg} = 0\).
- Due to random chance, we will never actually get \(MS_{Reg} = 0\). Dividing by MSE gives us a way to evaluate the size of \(MS_{Reg}\).
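In symbols (standard notation, added here as a sketch: \(k\) slope parameters and \(n\) observations):
\[ F = \frac{MS_{Reg}}{MS_E} = \frac{SS_{Reg}/k}{SS_E/(n-k-1)} \sim F_{k,\,n-k-1} \quad \text{under } H_0: \beta_1 = \cdots = \beta_k = 0, \]
so a large observed \(F\) relative to this reference distribution is evidence that at least one slope is nonzero.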
Bias/Variance
For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
- Given the model \(y_i = \beta_0 + \epsilon_i\), show that \(\hat\beta_0 = \bar y\) minimizes the sum of squared error.
\[\begin{align*} \frac{1}{n}\sum(y_i - \hat y_i)^2 &= \frac{1}{n}\sum(y_i - \beta_0)^2\\ \frac{d}{d\beta_0}\frac{1}{n}\sum(y_i - \beta_0)^2 &= -\frac{2}{n}\sum(y_i - \beta_0)\\ &= -\frac{2}{n}\left(\sum y_i - n\beta_0\right) \stackrel{set}{=}0\\ \implies \frac{2}{n}\sum y_i = \frac{2}{n}n\beta_0 &\implies \hat\beta_0 = \bar y \end{align*}\]
This is a minimum since the objective is a quadratic in \(\beta_0\) with a positive coefficient on \(\beta_0^2\) (equivalently, the second derivative is positive).
Bias/Variance
For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
- Given the model \(y_i = \beta_0 + \epsilon_i\), show that \(E(\hat\beta_0) = \beta_0\) and \(V(\hat\beta_0) = \sigma^2/n\).
\[\begin{align*} E(\hat\beta_0) &= E\left(\frac{1}{n}\sum y_i\right)= \frac{1}{n}\sum E(y_i)\\ &= \frac{1}{n}\sum E(\beta_0 + \epsilon_i)= \frac{1}{n}n\beta_0 = \beta_0 \end{align*}\]
\[\begin{align*} V(\hat\beta_0) &= V\left(\frac{1}{n}\sum y_i\right)= \frac{1}{n^2}\sum V(y_i)\\ &= \frac{1}{n^2}\sum V(\beta_0 + \epsilon_i)= \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n} \end{align*}\]
Bias/Variance
For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
- Now consider the multiple linear regression model \(Y = X\underline\beta + \underline\epsilon\), where we know that \(\hat{\underline{\beta}} = (X^TX)^{-1}X^TY\). Show that \(E(\hat{\underline{\beta}}) = \underline{\beta}\).
\[ E(\hat{\underline{\beta}}) = E((X^TX)^{-1}X^TY) = (X^TX)^{-1}X^TE(Y) = (X^TX)^{-1}X^TX\underline\beta = \underline\beta \]