22 Final Exam Review
22.1 Summaries
Generally important things
- The bias and variance of \(\hat{\underline\beta}\).
- Interpreting coefficients and inference.
- Variance, rather than point estimates.
- Interpreting residual plots.
- Choosing a reasonable model given the context of the problem.
“Extra Topics” Lecture
- Standardizing
- Effect on parameter estimates (especially correlations)
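A minimal R sketch of the standardizing point, using the built-in mtcars data (an assumption here, not necessarily the lecture's example):

```r
# Centering/standardizing a predictor changes the scale of the slope estimate
# and the correlation between the intercept and slope estimates
fit_raw <- lm(mpg ~ hp, data = mtcars)
fit_std <- lm(mpg ~ scale(hp), data = mtcars)

coef(fit_raw)           # slope per unit of hp
coef(fit_std)           # slope per standard deviation of hp
cov2cor(vcov(fit_raw))  # intercept and slope estimates are strongly correlated
cov2cor(vcov(fit_std))  # after centering, that correlation is essentially zero
```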
Getting the Wrong Model
- Bias due to missing predictors.
- What does it even mean to have the “right” model?
- Proxy measures and their effect on the other parameter estimates
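A small simulation (not from the lecture, just a hedged illustration) of the bias caused by a missing predictor that is correlated with an included one:

```r
# Omitting x2, which is correlated with x1, biases the slope on x1
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n) # true slopes: 2 and 3

coef(lm(y ~ x1 + x2))  # both slopes close to the true values
coef(lm(y ~ x1))       # slope on x1 absorbs part of x2's effect (about 2 + 3 * 0.8)
```

The same mechanism explains why adding a proxy measure shifts the other parameter estimates.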
Transforming the Predictors
- Polynomial models
- When to use them
- Lower order terms
- Extrapolation
- Other transformations (e.g. log, combining predictors, etc.)
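A quick sketch of polynomial and log transformations of a predictor, again using mtcars as a stand-in dataset:

```r
# Quadratic in wt; poly() keeps the lower-order term automatically,
# and raw = TRUE gives coefficients on wt and wt^2 directly
fit_quad <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)

# Equivalent model written with I(); a log transformation is just as easy
fit_quad2 <- lm(mpg ~ wt + I(wt^2), data = mtcars)
fit_log   <- lm(mpg ~ log(wt), data = mtcars)

# Extrapolation warning: predictions far outside the observed range of wt
# (max is about 5.4) are driven entirely by the assumed polynomial shape
predict(fit_quad, newdata = data.frame(wt = 10))
```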
Transforming the Response
- Explain “Stabilizing the variance”
- Diagnosing the need for a transformation
- Choosing transformations
- The effect on the model
- E.g. multiplicative errors, changes to the parameter estimates
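A sketch of diagnosing and applying a log transformation of the response (mtcars as a stand-in; the funnel-shaped residual plot is the thing to look for):

```r
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
plot(fitted(fit_raw), resid(fit_raw))  # spread growing with fitted values suggests a transformation

fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)
plot(fitted(fit_log), resid(fit_log))  # check whether the variance looks stabilized

# On the original scale the errors are now multiplicative:
# log(y) = b0 + b1*wt + b2*hp + e  <=>  y = exp(b0 + b1*wt + b2*hp) * exp(e)
exp(coef(fit_log))  # coefficients act multiplicatively on y
```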
Dummy Variables
- Definition and interpretation
- E.g. factor(cyl)8 in the coefficients table
- If there are three categories, we need two dummies
- “Reference” category is absorbed into the intercept.
- Interaction terms: different intercept, different slope.
- Significance of a dummy variable (or interaction with one)
- Extra sum-of-squares to test whether categories are statistically different
- What kind of test is this? ANOVA? ANCOVA?
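A minimal sketch with the built-in mtcars data, presumably the kind of fit behind the factor(cyl)8 coefficient mentioned above:

```r
# cyl has three categories (4, 6, 8), so factor(cyl) creates two dummies;
# the reference category (cyl == 4) is absorbed into the intercept
fit_add <- lm(mpg ~ wt + factor(cyl), data = mtcars)
coef(fit_add)  # includes factor(cyl)6 and factor(cyl)8

# Interaction: a different intercept AND a different slope for each group
fit_int <- lm(mpg ~ wt * factor(cyl), data = mtcars)

# Extra sum-of-squares (partial F) tests: are the categories statistically different?
fit_null <- lm(mpg ~ wt, data = mtcars)
anova(fit_null, fit_add, fit_int)
```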
Multicollinearity!
- Why it increases variance (many different parameter combinations are equivalent)
- Detecting via the variance inflation factor
- A VIF of roughly 10 or more is considered problematic, but this is just a rule of thumb.
- What to do about it, and what it means for interpreting coefficients.
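A sketch of the VIF in practice; vif() here comes from the car package (an assumption, since the course may use a different function):

```r
library(car)  # for vif()

fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
vif(fit)   # values near or above roughly 10 flag problematic multicollinearity

# One possible response: drop (or combine) offending predictors and refit
fit2 <- lm(mpg ~ hp + wt, data = mtcars)
vif(fit2)
```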
Best Subset Selection
- Goal: Inference or prediction?
- How does Subset Selection fit into this?
- General idea of the algorithms.
- Don’t need to know Mallows’ \(C_p\) etc.
- Why the p-values can’t really be trusted.
- Useful as a preliminary step (sometimes).
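A sketch using regsubsets() from the leaps package (an assumption; the course may have used different software), plus a reminder of why follow-up p-values are suspect:

```r
library(leaps)  # for regsubsets()

all_fits <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
summary(all_fits)$which  # which predictors appear in the best model of each size

# p-values from a model chosen this way can't really be trusted:
# the same data both picked the model and is then reused to test it
chosen <- lm(mpg ~ wt + qsec + am, data = mtcars)  # e.g. a selected 3-variable model
summary(chosen)
```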
Degrees of Freedom
- General modelling strategies.
- Choosing transformations based on domain knowledge.
- Being explicit about the decisions made while modelling.
Regularization
- It’s just linear regression with “smaller” slope estimates!
- Intercept isn’t constrained.
- Individual slopes can be bigger than the corresponding least squares slopes, but the sum of their absolute values (LASSO) or squares (ridge) is smaller.
- Regularization prevents overfitting.
- Choose the penalty parameter \(\lambda\) by minimizing out-of-sample prediction error, measured via cross-validation.
- Choosing the largest \(\lambda\) within 1 SE of the minimum CV MSE leads to a simpler (more regularized) model.
- Adds bias, which is a good thing?
- Requires standardization of the predictors.
- LASSO can set some slope estimates exactly to 0; Ridge does not.
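A sketch with the glmnet package (an assumption about the software; glmnet standardizes the predictors internally and leaves the intercept unpenalized, matching the bullets above):

```r
library(glmnet)

X <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix without the intercept column
y <- mtcars$mpg

cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1: LASSO; alpha = 0: ridge
cv_lasso$lambda.min   # lambda minimizing cross-validated prediction error
cv_lasso$lambda.1se   # larger lambda within 1 SE: a simpler, more regularized model

# LASSO sets some slopes exactly to 0; ridge only shrinks them
coef(cv_lasso, s = "lambda.1se")
```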
Classification
- Response values are 0 or 1
- Expected value of \(y\) is the proportion of 1s given a value of \(x\).
- Predictions can be converted to 0s and 1s, with two types of errors (confusion matrix).
- Logistic function looks like an “S”
- Exact shape determined by linear predictor \(\eta(X) = X\underline\beta\)
- Other than transformations of the response, all lm topics apply.
- Including regularization!
- “Residuals” are weird
- Not “observed minus expected” anymore!
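A minimal logistic regression sketch, again with mtcars as a stand-in (am is a 0/1 response):

```r
# Logistic regression: the exact shape of the S-curve comes from eta = X beta
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

p_hat <- predict(fit_logit, type = "response")  # predicted probabilities
y_hat <- as.numeric(p_hat > 0.5)                # convert to 0/1 with a 0.5 cutoff

table(observed = mtcars$am, predicted = y_hat)  # confusion matrix: two types of errors
```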
22.2 Self-Study
In addition to a quiz that will be added to MyLS, below is the Quizizz that we went through in class!
22.3 Midterm Solutions
ANOVA
Explain how a hypothesis test based on the ratio \(MS_{Reg}/MS_E\) in the ANOVA table is a test for whether any of the slope parameters are 0.
- A model with all slopes equal to 0 fits a horizontal line at \(\bar y\), so the fitted values have no variation around \(\bar y\), i.e. \(MS_{Reg} = 0\).
- Due to random chance, we will never actually get \(MS_{Reg} = 0\). Dividing by MSE gives us a way to evaluate the size of \(MS_{Reg}\).
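In symbols (standard notation, added here as a sketch: \(k\) slope parameters and \(n\) observations):
\[ F = \frac{MS_{Reg}}{MS_E} = \frac{SS_{Reg}/k}{SS_E/(n-k-1)} \sim F_{k,\,n-k-1} \quad \text{under } H_0: \beta_1 = \cdots = \beta_k = 0, \]
so a large observed \(F\) relative to this reference distribution is evidence that at least one slope is nonzero.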
Bias/Variance
For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
- Given the model \(y_i = \beta_0 + \epsilon_i\), show that \(\hat\beta_0 = \bar y\) minimizes the sum of squared error.
\[\begin{align*} \frac{1}{n}\sum(y_i - \hat y_i)^2 &= \frac{1}{n}\sum(y_i - \beta_0)^2\\ \frac{d}{d\beta_0}\frac{1}{n}\sum(y_i - \beta_0)^2 &= -\frac{2}{n}\sum(y_i - \beta_0)\\ &= -\frac{2}{n}\left(\sum y_i - n\beta_0\right) \stackrel{set}{=}0\\ \implies \frac{2}{n}\sum y_i = \frac{2}{n}n\beta_0 &\implies \hat\beta_0 = \bar y \end{align*}\]
This is a minimum since the objective is a quadratic in \(\beta_0\) with a positive coefficient on \(\beta_0^2\) (equivalently, the second derivative is positive).
Bias/Variance
For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
- Given the model \(y_i = \beta_0 + \epsilon_i\), show that \(E(\hat\beta_0) = \beta_0\) and \(V(\hat\beta_0) = \sigma^2/n\).
\[\begin{align*} E(\hat\beta_0) &= E\left(\frac{1}{n}\sum y_i\right)= \frac{1}{n}\sum E(y_i)\\ &= \frac{1}{n}\sum E(\beta_0 + \epsilon_i)= \frac{1}{n}n\beta_0 = \beta_0 \end{align*}\]
\[\begin{align*} V(\hat\beta_0) &= V\left(\frac{1}{n}\sum y_i\right)= \frac{1}{n^2}\sum V(y_i)\\ &= \frac{1}{n^2}\sum V(\beta_0 + \epsilon_i)= \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n} \end{align*}\]
Bias/Variance
For this question, assume that \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
- Now consider the multiple linear regression model \(Y = X\underline\beta + \underline\epsilon\), where we know that \(\hat{\underline{\beta}} = (X^TX)^{-1}X^TY\). Show that \(E(\hat{\underline{\beta}}) = \underline{\beta}\).
\[ E(\hat{\underline{\beta}}) = E((X^TX)^{-1}X^TY) = (X^TX)^{-1}X^TE(Y) = (X^TX)^{-1}X^TX\underline\beta = \underline\beta \]