5  The General Regression Situation

Author: Dr. Devan Becker

Jam: I Predict a Riot by Kaiser Chiefs

Published: 2024-05-21

5.1 Chapter Summary

The Normal Equations

We estimate \(\underline\beta\) by minimizing the sum of squared errors:

\[\begin{align*} \underline\epsilon^T\underline\epsilon &= (Y - X\underline\beta)^T(Y - X\underline\beta)\\ &= ... = Y^TY - 2\underline\beta^TX^TY + \underline\beta^TX^TX\underline\beta \end{align*}\]

We then take the derivative with respect to \(\underline\beta\). Note that \(X^TX\) is symmetric and \(Y^TX\underline\beta\) is a scalar (so it equals its transpose \(\underline\beta^TX^TY\)).

\[\begin{align*} \frac{\partial}{\partial\underline\beta}\underline\epsilon^T\underline\epsilon &= 0 - 2X^TY + 2X^TX\underline\beta \end{align*}\]

  • For the 1 predictor case, verify that the equations look the same!

Setting the derivative to 0, rearranging, and plugging in our data gives us the Normal equations: \[\begin{align*} X^TX\underline{\hat\beta} &= X^T\underline y \end{align*}\]

Facts

The solution to the normal equations satisfies: \[X^TX\underline{\hat\beta} = X^T\underline y\]

  1. No distributional assumptions.
  2. If \(X^TX\) is invertible, \(\hat{\underline\beta} = (X^TX)^{-1}X^T\underline y\).
    • \(\hat{\underline\beta}\) is a linear transformation of \(\underline y\)!
    • This is the same as the MLE under normal errors.
  3. \(E(\hat{\underline\beta}) = \underline\beta\) and \(V(\hat{\underline\beta}) = \sigma^2(X^TX)^{-1}\).
    • This is the smallest variance amongst all linear unbiased estimators of \(\underline\beta\) (the Gauss-Markov theorem).
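Here's a minimal sketch (my own check; mtcars and the disp + wt predictors are arbitrary choices) verifying facts 2 and 3 numerically:

## Sketch: verify the closed-form solution and its variance against lm()
X <- model.matrix(mpg ~ disp + wt, data = mtcars)
y <- mtcars$mpg
mylm <- lm(mpg ~ disp + wt, data = mtcars)

beta_hat <- drop(solve(t(X) %*% X) %*% t(X) %*% y)   # (X^TX)^{-1} X^T y
cbind(by_hand = beta_hat, from_lm = coef(mylm))

sigma2_hat <- sum(resid(mylm)^2) / (nrow(X) - ncol(X))  # MSE estimates sigma^2
all.equal(sigma2_hat * solve(t(X) %*% X), vcov(mylm))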

Example Proof Problems

  1. Prove that \(\sum_{i=1}^n\hat\epsilon_i\hat y_i = 0\).
  2. Prove that \(\sum_{i=1}^n\hat\epsilon_i = 0\).
  3. Prove that \(X^TX\) is symmetric. Is \(A^TA\) symmetric in general?
## Demonstration that they're true (up to a rounding error)
mylm <- lm(mpg ~ disp, data = mtcars)
sum(resid(mylm) * predict(mylm))
[1] -2.664535e-14
mean(resid(mylm))
[1] -1.12757e-16
X <- model.matrix(mpg ~ disp, data = mtcars)
all.equal(t(X) %*% X, t(t(X) %*% X))
[1] TRUE

Analysis of Variance (Corrected)

Source \(df\) \(SS\)
Regression (corrected) \(p - 1\) \(\underline{\hat\beta}^TX^T\underline y - n\bar{y}^2\)
Error \(n - p\) \(\underline y^T\underline y - \underline{\hat\beta}^TX^T\underline y\)
Total (corrected) \(n - 1\) \(\underline y^T\underline y - n\bar{y}^2\)
  • Note that \(p\) is the number of parameters, not the index of the largest parameter.
    • \(\underline\beta = (\beta_0, \beta_1, ..., \beta_{p-1})\)
  • We’ll always be using corrected sum-of-squares.
    • Especially next chapter!
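Here's a minimal sketch (my own check; mpg ~ disp is the example used later in this chapter) computing the corrected sums of squares directly from the matrix formulas above:

## Sketch: corrected ANOVA quantities from the matrix formulas
X <- model.matrix(mpg ~ disp, data = mtcars)
y <- mtcars$mpg
n <- nrow(X); p <- ncol(X)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

SSReg_c <- t(beta_hat) %*% t(X) %*% y - n * mean(y)^2   # Regression (corrected), df = p - 1
SSE     <- t(y) %*% y - t(beta_hat) %*% t(X) %*% y      # Error, df = n - p
SST_c   <- t(y) %*% y - n * mean(y)^2                   # Total (corrected), df = n - 1
c(SSReg = SSReg_c, SSE = SSE, Total = SST_c)
## Compare with anova(lm(mpg ~ disp, data = mtcars))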

\(F\)-test for overall significance

If SSReg is significantly larger than SSE, then fitting the model was worth it!

  • This is a test for \(\beta_1 = \beta_2 = ... = \beta_{p-1} = 0\), versus any \(\beta_j\ne 0\).

As before, we find a quantity with a known distribution, then use it for hypothesis tests.

\[ F = \frac{MS(Reg|\hat\beta_0)}{MSE} = \frac{SS(Reg|\hat\beta_0)/(p-1)}{SSE/(n-p)} \sim F_{p-1, n-p} \]

Again, note that a regression with no predictors always has \(\hat\beta_0 = \bar y\).
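Here's a minimal sketch (my own check, reusing mpg ~ disp) of computing the overall F statistic and its p-value by hand; summary() reports the same values:

## Sketch: overall F-test by hand
mylm <- lm(mpg ~ disp, data = mtcars)
n <- nrow(mtcars); p <- length(coef(mylm))
SSE   <- sum(resid(mylm)^2)
SSReg <- sum((fitted(mylm) - mean(mtcars$mpg))^2)   # SS(Reg | beta_0-hat)
Fstat <- (SSReg / (p - 1)) / (SSE / (n - p))
c(F = Fstat, p_value = pf(Fstat, p - 1, n - p, lower.tail = FALSE))
## Compare with summary(mylm)$fstatistic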

Example: Significance of disp

anova(lm(mpg ~ disp, data = mtcars)) |> 
    knitr::kable()
Df Sum Sq Mean Sq F value Pr(>F)
disp 1 808.8885 808.88850 76.51266 0
Residuals 30 317.1587 10.57196 NA NA
  • \(p = 2\), \(n = 32\).
    • \(df_R' + df_E' = df_T'\), where \(df'\) is the df for corrected SS.
  • Verified these numbers in the last lecture
plot(mpg ~ disp, data = mtcars)
abline(lm(mpg ~ disp, data = mtcars))

Example: \(F_{1,n-p} = t^2_{n-p}\)

anova(lm(mpg ~ 1, data = mtcars), lm(mpg ~ qsec, data = mtcars)) |> 
    knitr::kable()
Res.Df RSS Df Sum of Sq F Pr(>F)
31 1126.0472 NA NA NA NA
30 928.6553 1 197.3919 6.376702 0.017082
summary(lm(mpg ~ qsec, data = mtcars))$coef |>
    knitr::kable()
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.114038 10.0295433 -0.5098974 0.6138544
qsec 1.412125 0.5592101 2.5252133 0.0170820

What do you notice about these two tables?
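One way to see it (a quick sketch of my own): square the t statistic for qsec and compare it to the F statistic from the model comparison.

## Sketch: t^2 equals F when a single predictor is tested
qsec_lm <- lm(mpg ~ qsec, data = mtcars)
t_val <- summary(qsec_lm)$coefficients["qsec", "t value"]
F_val <- anova(lm(mpg ~ 1, data = mtcars), qsec_lm)$F[2]
c(t_squared = t_val^2, F = F_val)   # both about 6.38, with identical p-values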

Example: Significance of Regression (\(F_{2,n-p} \ne t^2_{n-p}\))

anova(lm(mpg ~ 1, data = mtcars), lm(mpg ~ qsec + disp, data = mtcars)) |> 
    knitr::kable()
Res.Df RSS Df Sum of Sq F Pr(>F)
31 1126.0472 NA NA NA NA
29 313.5368 2 812.5104 37.57582 0
summary(lm(mpg ~ qsec + disp, data = mtcars))$coef |>
    knitr::kable()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.5045079 7.1840940 3.5501356 0.0013359
qsec 0.2122880 0.3667758 0.5787951 0.5671961
disp -0.0398877 0.0052882 -7.5428272 0.0000000

We’ll learn more about the ANOVA table next lecture.

\(R^2\) again

\[ R^2 = \frac{SS(Reg|\hat\beta_0)}{Y^TY - SS(\beta_0)} = \frac{\sum(\hat y_i - \bar y)^2}{\sum(y_i - \bar y)^2} \]

Works for multiple predictors!… kinda.

\(R^2\) is bad?

nx <- 10 # Number of uncorrelated predictors
uncorr <- matrix(rnorm(50*(nx + 1)), 
    nrow = 50, ncol = nx + 1)
## First column is y, rest are x
colnames(uncorr) <- c("y", paste0("x", 1:nx))
uncorr <- as.data.frame(uncorr)

rsquares <- NA
for (i in 2:(nx + 1)) {
    rsquares <- c(rsquares,
        summary(lm(y ~ ., 
            data = uncorr[,1:i]))$r.squared)
}
plot(1:10, rsquares[-1], type = "b",
    xlab = "Number of Uncorrelated Predictors",
    ylab = "R^2 Value",
    main = "R^2 ALWAYS increases")

Adjusted (Multiple) \(R^2\)

\[ R^2_a = 1 - (1 - R^2)\left(\frac{n-1}{n-p}\right) \]

  • Penalizes added predictors - won’t always increase!
    • Still might increase by chance alone! (This is why we still rely on the F-test.)
    • \(R^2_a = R^2\) when \(p=1\) (intercept model)
  • Still not perfect!
    • Works for comparing different models on same data
    • Works (poorly) for comparing different models on different data.
  • In general you should use \(R^2_a\), but always be careful.
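Here's a minimal sketch (my own check, using the mpg ~ qsec + disp model from earlier) of the \(R^2_a\) formula against what summary() reports:

## Sketch: adjusted R^2 by hand versus summary()
mylm <- lm(mpg ~ qsec + disp, data = mtcars)
n <- nrow(mtcars); p <- length(coef(mylm))
R2  <- summary(mylm)$r.squared
R2a <- 1 - (1 - R2) * (n - 1) / (n - p)
c(by_hand = R2a, from_summary = summary(mylm)$adj.r.squared)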

5.2 Prediction and Confidence Intervals (Again)

\(R^2\) and \(F\)

Recall that \[ F = \frac{MS(Reg|\hat\beta_0)}{MSE} = \frac{SS(Reg|\hat\beta_0)/(p-1)}{SSE/(n-p)} \sim F_{p-1, n-p} \]

From the definition of \(R^2\), \[\begin{align*} R^2 &= \frac{SS(Reg|\hat\beta_0)}{SST}\\ &= \frac{SS(Reg|\hat\beta_0)}{SS(Reg|\hat\beta_0) + SSE}\\ &= \frac{(p-1)F}{(p-1)F + (n-p)} \end{align*}\] Conclusion: Hypothesis tests/CIs for \(R^2\) aren’t useful. Just use \(F\)!
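A minimal numerical check of this identity (my own sketch, again using mpg ~ qsec + disp):

## Sketch: R^2 recovered from the overall F statistic
mylm <- lm(mpg ~ qsec + disp, data = mtcars)
n <- nrow(mtcars); p <- length(coef(mylm))
Fstat <- unname(summary(mylm)$fstatistic["value"])
c(from_F = (p - 1) * Fstat / ((p - 1) * Fstat + (n - p)),
  R_squared = summary(mylm)$r.squared)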

Correlation of \(\hat\beta_0\), \(\hat\beta_1\), \(\hat\beta_2\), etc.

With a different sample, we would have gotten slightly different numbers!

  • If the slope changed, the intercept must change to fit the data
    • (and \(\hat\beta_0 = \bar y - \hat\beta_1\bar x\))
    • The parameter estimates are correlated!
  • Similar things happen with multiple predictors!
  • This correlation can be a problem for confidence regions

Uncorrelated \(\hat{\underline\beta}\)

\[ V(\hat{\underline\beta}) = \sigma^2(X^TX)^{-1} \]

In simple linear regression, \[ (X^TX)^{-1} = \frac{1}{nS_{XX}}\begin{bmatrix}\sum x_i^2 & -n\bar x\\-n \bar x & n\end{bmatrix} \]

so the correlation is 0 when \(\bar x = 0\)!

  • It’s common practice to mean-center the predictors for this very reason!
  • For 2 or more predictors, \((X^TX)^{-1}\) has a more complicated expression.
    • The correlation of the \(\beta_j\)s is a function of the correlations in the data.
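Here's a minimal sketch (my own demonstration; mpg ~ disp is just a convenient example) of how mean-centering the predictor removes the correlation between \(\hat\beta_0\) and \(\hat\beta_1\):

## Sketch: covariance of the estimates before and after centering
raw_fit      <- lm(mpg ~ disp, data = mtcars)
centered_fit <- lm(mpg ~ I(disp - mean(disp)), data = mtcars)
vcov(raw_fit)        # off-diagonal (the covariance) is far from 0
vcov(centered_fit)   # off-diagonal is 0, up to rounding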

Prediction and Confidence Intervals for \(Y\)

\(\hat Y = X\hat\beta\).

  • A confidence interval around \(\hat Y\) is based on the variance of \(\hat\beta\).
  • \(\hat Y \pm t * se(X\hat\beta)\)

\(Y_{n+1} = X\beta + \epsilon_{n+1}\)

  • A prediction interval around \(Y_{n+1}\) is based on the variance of \(\hat\beta\) and \(\epsilon\)!
  • \(\hat Y_{n+1} \pm t * se(X\hat\beta + \epsilon_{n+1})\)
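Here's a minimal sketch (my own; the value disp = 200 is an arbitrary new point) computing both intervals by hand. The only difference is the extra \(\sigma^2\) term in the prediction interval's standard error; predict() with interval = "confidence" or "prediction" gives the same numbers.

## Sketch: confidence vs. prediction interval at a new point, by hand
mylm  <- lm(mpg ~ disp, data = mtcars)
x0    <- c(1, 200)                        # new row of X: intercept and disp = 200
tcrit <- qt(0.975, df = df.residual(mylm))
yhat  <- sum(x0 * coef(mylm))

se_conf <- sqrt(drop(t(x0) %*% vcov(mylm) %*% x0))     # se of X beta-hat only
se_pred <- sqrt(summary(mylm)$sigma^2 + se_conf^2)     # adds the variance of epsilon_{n+1}
rbind(confidence = yhat + c(-1, 1) * tcrit * se_conf,
      prediction = yhat + c(-1, 1) * tcrit * se_pred)
## Compare with predict(mylm, data.frame(disp = 200), interval = "confidence")
## and predict(mylm, data.frame(disp = 200), interval = "prediction")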

5.3 Exercises

  1. Prove that \(\sum_{i=1}^n\hat\epsilon_i\hat y_i = 0\).
  2. Prove that \(\sum_{i=1}^n\hat\epsilon_i = 0\).
  3. Prove that \(X^TX\) is symmetric. Is \(A^TA\) symmetric in general?
  4. Simulate data with \(n = 50\), \(p = 11\), and \(\beta_1 = \beta_2 = ... = \beta_{10} = 0\). Check the individual t-tests for whether the slopes are 0, and the overall F-test. Do this multiple times. Do you find a significant result from one of the t-tests but not the overall F-test? What about vice versa?
Solution
n <- 50
p <- 11
X <- matrix(
    data = c(rep(1, n),
        runif(n * (p - 1), -5, 5)),
    ncol = p)
beta <- c(p - 1, rep(0, p - 1)) # beta_0 = 10, will not affect results
y <- X %*% beta + rnorm(n, 0, 3)

mydf <- data.frame(y, X[, -1])
mylm <- lm(y ~ ., data = mydf)
summary(mylm)

Call:
lm(formula = y ~ ., data = mydf)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7871 -1.9480  0.7211  1.5483  4.3626 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.44474    0.44502  23.470  < 2e-16 ***
X1          -0.24881    0.15095  -1.648  0.10733    
X2          -0.04107    0.16778  -0.245  0.80789    
X3           0.23421    0.18441   1.270  0.21159    
X4          -0.07099    0.15537  -0.457  0.65026    
X5           0.02299    0.17153   0.134  0.89407    
X6          -0.05605    0.16334  -0.343  0.73333    
X7          -0.12174    0.15332  -0.794  0.43198    
X8           0.20164    0.15557   1.296  0.20254    
X9          -0.48608    0.15305  -3.176  0.00292 ** 
X10         -0.28487    0.18491  -1.541  0.13149    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.859 on 39 degrees of freedom
Multiple R-squared:   0.24, Adjusted R-squared:  0.04509 
F-statistic: 1.231 on 10 and 39 DF,  p-value: 0.3023

In my run (this will be different each time you run it), one of the slopes was found to be significant! This is a Type 1 error - we found a significant result where we shouldn’t have.

However, the F-test successfully found that the overall fit is not significantly better than a model without any predictors. F-tests exist to help prevent Type 1 errors!

As a side note, check out the difference between the Multiple R-squared and Adjusted R-squared. Which one would you believe in this situation?
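As an extension (a sketch of my own; the 500 replications and the seed are arbitrary choices), we can repeat the whole experiment many times and count how often each test is fooled:

## Sketch: Type 1 error rates for "any t-test significant" vs. the overall F-test
set.seed(1)
n <- 50; p <- 11; nsim <- 500
any_t <- F_sig <- logical(nsim)
for (i in seq_len(nsim)) {
    X <- matrix(runif(n * (p - 1), -5, 5), ncol = p - 1)
    y <- 10 + rnorm(n, 0, 3)              # all slopes are truly 0
    fit <- summary(lm(y ~ X))
    any_t[i] <- any(fit$coefficients[-1, "Pr(>|t|)"] < 0.05)
    f <- fit$fstatistic
    F_sig[i] <- pf(f[1], f[2], f[3], lower.tail = FALSE) < 0.05
}
## Roughly 1 - 0.95^10 (~0.40) of runs flag at least one t-test,
## while the F-test stays near the nominal 0.05.
c(any_t = mean(any_t), F_test = mean(F_sig))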


  5. Plot the confidence intervals and the prediction intervals for a simple linear regression of body mass versus flipper length for penguins. Do this for a lot of points, and interpret the difference between the two.
library(palmerpenguins)
penguins <- penguins[complete.cases(penguins),]
peng_lm <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
Solution
library(palmerpenguins)
penguins <- penguins[complete.cases(penguins),]
peng_lm <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# confidence intervals along all flipper lengths
flippers <- seq(from = min(penguins$flipper_length_mm),
    to = max(penguins$flipper_length_mm), length.out = 100)

confi <- predict(peng_lm, newdata = data.frame(flipper_length_mm = flippers), interval = "confidence")
predi <- predict(peng_lm, newdata = data.frame(flipper_length_mm = flippers), interval = "prediction")

plot(body_mass_g ~ flipper_length_mm, data = penguins)
lines(x = flippers, y = confi[, "upr"], col = 3, lwd = 2, lty = 2)
lines(x = flippers, y = confi[, "lwr"], col = 3, lwd = 2, lty = 2)
lines(x = flippers, y = predi[, "upr"], col = 2, lwd = 2, lty = 2)
lines(x = flippers, y = predi[, "lwr"], col = 2, lwd = 2, lty = 2)
legend("topleft", legend = c("Confidence", "Prediction"), col = 3:2, lwd = 2, lty = 2)

The confidence “band” is much thinner than the prediction band! With this many points, the model is pretty sure about the slope and intercept (i.e. \(\hat y_i\)); a new sample would be expected to produce a similar line.

However, even though the model knows about the slope and intercept of the line, there’s still lots of uncertainty around a new observation \(y_{n + 1}\).


  6. Use a simulation to demonstrate a confidence interval for \(\hat y_i\) (an observed data point). For a model \(y_i = 1 + 4x_i + \epsilon_i\), \(\epsilon_i \sim N(0, 16)\), \(n = 30\), simulate 100 data sets, calculate the line, and add it to a plot. What do you notice about the shape? Why is this?
Solution
n <- 30
x <- runif(n, -5, 5)
plot(NA, xlim = c(-5, 5), ylim = c(-25, 25), xlab = "x", ylab = "y")
for (i in 1:1000) { # more than the 100 asked for, just to make the shape clearer
    y <- 1 + 4 * x + rnorm(n, 0, 4)  # sd = sqrt(16) = 4, matching epsilon ~ N(0, 16)
    abline(lm(y ~ x))
}

It looks kind of like a bowtie!

As mentioned in lecture, the model can be seen as fitting a point at \((\bar x, \bar y)\), then finding a slope that fits the data well. If we got the exact same value for \(\bar y\) every time, every single line would converge at the point \((\bar x, \bar y)\)!

We also saw that the variance of \(\hat y_i\) is minimized at \(\bar x\); this plot demonstrates that fact.
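A small follow-up sketch (my own; the seed and the noise sd of 4 are arbitrary choices consistent with the exercise) shows the same thing with predict(): the standard error of the fitted value is smallest near \(\bar x\).

## Sketch: se of the fitted value is smallest near the mean of x
set.seed(1)
n <- 30
x <- runif(n, -5, 5)
y <- 1 + 4 * x + rnorm(n, 0, 4)
fit <- lm(y ~ x)
newx <- data.frame(x = seq(-5, 5, length.out = 101))
se_fit <- predict(fit, newdata = newx, se.fit = TRUE)$se.fit
plot(newx$x, se_fit, type = "l", xlab = "x", ylab = "se of fitted value")
abline(v = mean(x), lty = 2)   # the minimum sits at x-bar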