Appendix D — ANOVA

Author: Dr. Devan Becker
Affiliation: Wilfrid Laurier University
Published: 2024-04-22

D.1 Basic ANOVA

First, a simple linear regression on the mtcars data:

set.seed(2221)
plot(mpg ~ disp, data = mtcars)
abline(lm(mpg ~ disp, data = mtcars))

Now the ANOVA table for mpg ~ qsec, along with the corresponding coefficient table:

anova(lm(mpg ~ qsec, data = mtcars))
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value  Pr(>F)  
qsec       1 197.39 197.392  6.3767 0.01708 *
Residuals 30 928.66  30.955                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(mpg ~ qsec, data = mtcars))$coef
             Estimate Std. Error    t value   Pr(>|t|)
(Intercept) -5.114038 10.0295433 -0.5098974 0.61385436
qsec         1.412125  0.5592101  2.5252133 0.01708199

Notice that the p-values are identical! Also notice that the \(F\)-value is the square of the \(t\)-value (\(2.5252^2 \approx 6.3767\))! It’s like magic! Math is cool.
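
A quick check of that claim (nothing new here, just re-using the two fits above and squaring the \(t\) statistic):

tval <- summary(lm(mpg ~ qsec, data = mtcars))$coef["qsec", "t value"]
fval <- anova(lm(mpg ~ qsec, data = mtcars))["qsec", "F value"]
c(t_squared = tval^2, F = fval) # the two values agree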

D.2 \(R^2\) always increases with new predictors

nx <- 10 # Number of uncorrelated predictors
uncorr <- matrix(rnorm(50*(nx + 1)), 
    nrow = 50, ncol = nx + 1)
## First column is y, rest are x
colnames(uncorr) <- c("y", paste0("x", 1:nx))
uncorr <- as.data.frame(uncorr)

## Fit models with 1, 2, ..., nx predictors and record R^2 each time.
## (The leading NA is a placeholder that gets dropped when plotting.)
rsquares <- NA
for (i in 2:(nx + 1)) {
    rsquares <- c(rsquares,
        summary(lm(y ~ ., data = uncorr[,1:i]))$r.squared)
}
plot(1:nx, rsquares[-1], type = "b",
    xlab = "Number of Uncorrelated Predictors",
    ylab = "R^2 Value",
    main = "R^2 ALWAYS increases")

Why does \(R^2\) always increase? Because the smaller model is a special case of the larger one, so adding a predictor can never increase the residual sum of squares. The one exception is when a new predictor is an exact linear combination of the previous predictors. In that case, \(R^2\) will not change at all!

uncorr[, nx + 2] <- uncorr[,2] + 3*uncorr[,3] # x1 + 3*x2: perfectly collinear
rsquares <- c(rsquares, summary(lm(y ~ ., data = uncorr))$r.squared)
rsquares
 [1]          NA 0.003898013 0.056398466 0.056407225 0.069142075 0.073155890
 [7] 0.105824965 0.122449599 0.145307322 0.168746574 0.172758068 0.172758068
plot(rsquares, type = "b")
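
Incidentally, R notices the redundancy: when a column is an exact linear combination of the others, lm() reports an NA coefficient for it, so it contributes nothing to the fit. A quick check, re-using the uncorr data frame from above:

## The redundant column (x1 + 3*x2) gets an NA coefficient
coef(lm(y ~ ., data = uncorr))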

How much does the fitted line itself vary from sample to sample? Let's simulate 1000 responses from the same data generating process (with the same \(x\) values) and plot all of the fitted lines:

x <- runif(20, 0, 10)
b0 <- 2; b1 <- 5; sigma <- 10
plot(NA, pch = 0, 
    xlim = c(-2, 12), ylim = c(-10, 65),
    xlab = "x", ylab = "y",
    main = "Variance of the line")
abline(h = b0 + b1*mean(x)) # true mean response at mean(x)
abline(v = mean(x))
for (i in 1:1000) {
    ## New response from the same data generating process, same x
    y <- b0 + b1*x + rnorm(20, 0, sigma)
    abline(lm(y ~ x), col = rgb(0,0,0,0.1))
}

Let’s do that again, but record the values and only show the 89% quantiles!

x <- runif(20, 0, 10)
b0 <- 2; b1 <- 5; sigma <- 10
all_lines <- replicate(1000, {
    y <- b0 + b1*x + rnorm(20, 0, sigma)
    predict(lm(y ~ x), newdata = list(x = seq(0, 10, 0.1)))
})

eightnine <- apply(all_lines, 1, quantile, probs = c(0.055, 0.945)) # central 89% interval
plot(NA, pch = 0, 
    xlim = c(0, 10), ylim = c(-10, 65),
    xlab = "x", ylab = "y",
    main = "Variance of the line")
lines(seq(0, 10, 0.1), eightnine[1, ])
lines(seq(0, 10, 0.1), eightnine[2, ])

Note that the theoretical calculation of these bounds is built into R (predict() with interval = "confidence" gives a 95% interval by default, so it will be a little wider than the simulated 89% band):

plot(NA, pch = 0, 
    xlim = c(0, 10), ylim = c(-10, 65),
    xlab = "x", ylab = "y",
    main = "Variance of the line")
lines(seq(0, 10, 0.1), eightnine[1, ])
lines(seq(0, 10, 0.1), eightnine[2, ])

## Add the TRUE relationship
xseq <- seq(0, 10, 0.1)
lines(xseq, b0 + b1*xseq, col = 3)

## New sample from the data generating process
x <- runif(20, 0, 10)
y <- b0 + b1*x + rnorm(20, 0, sigma)

## Extract the CI
mylm <- lm(y ~ x)
xbeta <- predict(mylm, interval = "confidence",
    newdata = list(x = xseq))
#lines(xseq, xbeta[,"fit"], col = 4)
lines(xseq, xbeta[,"upr"], col = 4)
lines(xseq, xbeta[,"lwr"], col = 4)

Note that the intervals won’t exactly align - the samples are different each time, and the simulated band is 89% while the theoretical one is 95%! In 95% of the samples we collect from this data generating process, the CI constructed from the sample will contain the true mean response (the green line) at any given value of \(x\). This is the basic definition of a confidence interval, but it’s neat to see it around a line.
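
Here's a minimal simulation check of that coverage claim at a single point, say \(x_0 = 5\) (re-using b0, b1, sigma, and the fixed x from the chunks above):

x0 <- 5
covered <- replicate(1000, {
    y <- b0 + b1*x + rnorm(20, 0, sigma)
    ci <- predict(lm(y ~ x), newdata = list(x = x0), interval = "confidence")
    ci[, "lwr"] <= b0 + b1*x0 && b0 + b1*x0 <= ci[, "upr"]
})
mean(covered) # should be close to 0.95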

Notice how the CI is curved. This is completely, 100% expected. Recall that \((\bar x, \bar y)\) is always a point on the fitted line. If \(x\) is the same for all samples, then the variance in the height of the line at \(\bar x\) is just the variance of \(\bar y\). However, we can rotate the line around this point and still fit most of the data “pretty well”, which is where the curved shape of the interval comes from!
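
A minimal check of that shape, using the usual standard-error formula for the mean response, \(\hat\sigma\sqrt{1/n + (x_0 - \bar x)^2 / S_{xx}}\), and re-using x, xseq, and mylm from the chunk above:

## Width of the 95% CI for the mean response as a function of x:
## narrowest at the mean of x, wider toward the edges.
xbar <- mean(x)
Sxx <- sum((x - xbar)^2)
sigma_hat <- summary(mylm)$sigma
se_line <- sigma_hat * sqrt(1/length(x) + (xseq - xbar)^2 / Sxx)
plot(xseq, 2 * qt(0.975, df = length(x) - 2) * se_line, type = "l",
    xlab = "x", ylab = "Width of 95% CI")
abline(v = xbar, lty = 2) # the minimum is at the mean of x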

Aside

Why did I use the same \(x\) values for all of the simulations? Because the usual linear model treats \(x\) as fixed - it's part of the assumptions (not a crucial point, but worth noting). Again, notice how the point \((\bar x, \bar y)\) is always on the line, and how the variance of the line is smallest at \(\bar x\). If \(x\) (and hence \(\bar x\)) changes from sample to sample, then there's extra variance in the line.

b0 <- 2; b1 <- 5; sigma <- 10
all_lines2 <- replicate(1000, {
    x <- runif(20, 0, 10)
    y <- b0 + b1*x + rnorm(20, 0, sigma)
    predict(lm(y ~ x), newdata = list(x = seq(0, 10, 0.1)))
})

eightnine2 <- apply(all_lines2, 1, quantile, probs = c(0.055, 0.945)) # central 89% interval
plot(NA, pch = 0, 
    xlim = c(0, 10), ylim = c(-10, 65),
    xlab = "x", ylab = "y",
    main = "Variance of the line")
lines(seq(0, 10, 0.1), eightnine[1, ])
lines(seq(0, 10, 0.1), eightnine[2, ])
lines(seq(0, 10, 0.1), eightnine2[1, ], col = 2)
lines(seq(0, 10, 0.1), eightnine2[2, ], col = 2)
legend("topleft", legend = c("non-random x", "random x"), col = 1:2, lty = 1)

The textbook for this course also covers models that incorporate randomness in \(X\), but we won’t cover them here.

D.3 Covariance of the parameters

b0 <- 2; b1 <- 5; sigma <- 10
all_params <- replicate(1000, {
    x <- runif(20, 0, 10)
    y <- b0 + b1*x + rnorm(20, 0, sigma)
    coef(lm(y ~ x))
})
plot(t(all_params))

Intuition check: were you expecting a negative trend in this plot? Does this make sense? If \(\hat\beta_0\) increases, why would \(\hat\beta_1\) have to decrease?

For homework, try a negative intercept and see what happens! What about a negative slope?
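
A quick numerical check of that intuition, just summarizing the simulation above:

## Correlation between the simulated intercepts and slopes. It's negative:
## the line has to pass near (x-bar, y-bar) and x-bar > 0, so a higher
## intercept forces a lower slope.
cor(all_params[1, ], all_params[2, ])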

D.4 Joint Normality of the \(\underline{\hat\beta}\)s

Joint normality implies marginal normality! This means we can create a confidence interval for each parameter based on its marginal distribution alone. However, if the joint distribution has a strong correlation, the box formed by the marginal confidence intervals can contain points that are jointly very unlikely!

par(mfrow = c(1, 3))

## Marginal distribution of beta_0
plot(density(all_params[1,]),
    main = "Distribution of b_0",
    xlab = "b_0")
abline(v = quantile(all_params[1,], c(0.055, 0.945)), col = 2)
abline(v = 8, col = 3) # Hypothesized beta_0

## Marginal distribution of beta_1
plot(density(all_params[2,]),
    main = "Distribution of b_1",
    xlab = "b_1")
abline(v = quantile(all_params[2,], c(0.055, 0.945)), col = 2)
abline(v = 6, col = 3) # Hypothesized beta_1

## Joint distribution
plot(t(all_params), main = "Joint Distribution",
    xlab = "b_0", ylab = "b_1")
abline(v = quantile(all_params[1,], c(0.055, 0.945)), col = 2)
abline(h = quantile(all_params[2,], c(0.055, 0.945)), col = 2)
points(x = 8, y = 6, col = 3, pch = 16, cex = 1.5)

Notice how the rectangular confidence region in the joint distribution contains regions where there are no points! For example, a hypothesis test for whether \(\beta_0=8\) and \(\beta_1 = 6\) (the green lines/points) would not be rejected if we checked the two confidence intervals separately, but likely should be rejected given the joint distribution! This is exactly what happens when the F-test is significant but none of the t-tests for individual predictors is significant.
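
Here's a minimal sketch of that phenomenon (a made-up example, not from the data used above): two nearly collinear predictors that jointly explain the response.

set.seed(2221)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.05) # almost a copy of x1
y_new <- x1 + x2 + rnorm(50)
## The overall F-test is highly significant, but the individual t-tests
## typically are not: neither predictor is needed *given* the other.
summary(lm(y_new ~ x1 + x2))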

In general, the CIs for each individual \(\hat\beta\) are missing something - especially if there’s correlation in the predictors!

In these examples, we repeatedly sampled from the true relationship to obtain simulation-based confidence intervals. The normality assumption allows us to make inferences about the distribution of the parameters - including the joint distribution - from a single sample! Inference is powerful!
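
As a final sketch, the joint covariance of the estimates can be estimated from a single sample with vcov(). The x values were random in the simulation above, so the two matrices below should be in the same ballpark rather than identical:

## Estimated covariance matrix of the coefficients from one new sample...
x <- runif(20, 0, 10)
y <- b0 + b1*x + rnorm(20, 0, sigma)
vcov(lm(y ~ x))
## ...versus the covariance across the 1000 simulated fits.
cov(t(all_params))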