We’re going to do this a little differently than other days. Let’s look at the penguins data again:
```r
set.seed(2121)
library(palmerpenguins)

## Keep only the Chinstrap penguins
penguins <- subset(penguins, species == "Chinstrap")

## Remove NAs and take the continuous variables
peng <- penguins[complete.cases(penguins), c(3, 4, 5, 6)]

## Standardize the x values (not run)
# peng[, 1:3] <- apply(peng[, 1:3], 2, scale)
head(peng)
```
We’ll use these values as if they are population values and forget that they were calculated from a sample.
The x-values will stay the same; we’ll simulate new \(y\) values according to the model in the next section.
The advantage of this approach is that the predictors retain any correlation they had in the real data, while the simulated response satisfies the linear modelling assumptions perfectly!
G.2 Simulating from the “Right” Model
Let’s forget the actual values of body_mass_g, and pretend that this is the true relationship: \[
\text{body mass} = -3157.5 + 16.0 \cdot \text{bill length} + 91.5 \cdot \text{bill depth} + 22.6 \cdot \text{flipper length} + \epsilon
\] where \(\epsilon_i \sim N(0, 393^2)\). These coefficients are just the (rounded) estimates from fitting the model to the original data, which we now treat as the population truth.
We can simulate from this as follows:
```r
## beta and sigma are not defined in this excerpt; presumably they come from
## the fit to the original data, reconstructed here so the block runs on its own.
full_lm <- lm(body_mass_g ~ ., data = peng)
beta <- coef(full_lm)
sigma <- 393  # residual standard deviation used in the text

X <- cbind(1, as.matrix(peng[, 1:3]))
n <- nrow(X)
body_mass_g <- X %*% beta + rnorm(n, 0, sigma)
unname(beta)
```

```
[1] -3157.53005 16.03916 91.51275 22.57975
```
```r
unname(coef(lm(body_mass_g ~ -1 + X)))
```

```
[1] -3311.31867 16.27829 120.41292 20.77560
```
```r
unname(coef(lm(body_mass_g ~ X[, -1])))
```

```
[1] -3311.31867 16.27829 120.41292 20.77560
```
Both parameterizations give identical estimates: removing lm()’s default intercept with -1 and supplying our own column of 1s is equivalent to dropping that column and letting lm() add the intercept itself. Now let’s do this 10,000 times!
```r
res <- matrix(ncol = 4, nrow = 0)
for (i in 1:10000) {
  body_mass_g <- X %*% beta + rnorm(n, 0, sigma)
  right_lm <- lm(body_mass_g ~ -1 + X)
  res <- rbind(res, unname(coef(right_lm)))
}
dim(res)
```
This looks good: we simulated according to the values in beta, and on average we recover them. We’ve also shown (empirically) that the least-squares estimates are unbiased when we fit the model that actually generated the data!
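To see this concretely, we can compare the average of the estimates across the 10,000 simulations to the true coefficients. This check is my own addition; it assumes the res and beta objects defined above:

```r
## Average estimate across simulations vs. the true coefficients
rbind(true = unname(beta), mean_estimate = colMeans(res))
```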
G.3 Estimating the Wrong Model - Too Few Predictors
In the following code, I remove flipper_length_mm (the third predictor) by taking only the first three columns of X, which include the column of 1s.
I then fit the model without flipper length, which we’ve seen before is an important predictor!
```r
res_reduced <- matrix(ncol = 3, nrow = 0)
X_reduced <- X[, 1:3] # Still includes the column of 1s
beta_reduced <- beta[1:3]
for (i in 1:10000) {
  # Simulate from the correct model
  body_mass_g <- X %*% beta + rnorm(n, 0, sigma)
  # Only estimate beta0, beta1, and beta2 (not beta3, the flipper length coefficient)
  wrong_lm <- lm(body_mass_g ~ -1 + X_reduced)
  res_reduced <- rbind(res_reduced, unname(coef(wrong_lm)))
}
dim(res_reduced)
```
Everything is biased! Since flipper_length_mm was an important predictor, the estimates of the remaining coefficients are biased!
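As before, we can compare the average estimates to the true values of the coefficients we kept (again my own check, using the objects defined above):

```r
## Average estimate across simulations vs. the true values of the remaining coefficients
rbind(true = unname(beta_reduced), mean_estimate = colMeans(res_reduced))
```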
Here’s how I like to think of this: the machine is trying to learn a pattern using the predictors we give it, and those predictors pick up on as much of the pattern as they can. Without the omitted predictor, the rest have to adjust.
A big part of this comes from the fact that there’s correlation in the predictors. Since they’re correlated, the remaining predictors can recover part of the missing pattern through their correlation with the omitted one. Instead of flipper_length_mm causing a change in body mass, flipper_length_mm is correlated with bill_length_mm and bill_depth_mm, which then absorb flipper_length_mm’s effect on body_mass_g. In other words, they’re making up for the missing pattern through the correlation, like a game of telephone where information has been lost along the way.
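The correlation among the predictors is easy to check directly (using peng as defined at the top of this section):

```r
## Correlation among the three predictors
cor(peng[, 1:3])
```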
G.4 Too Many Predictors
What happens if we include predictors that aren’t correlated with the response?
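Following the same pattern as above, here’s a minimal sketch of how we might check; the noise predictor and the object names are my own invention, not from the original analysis:

```r
## Add a pure-noise predictor that is unrelated to body mass
res_extra <- matrix(ncol = 5, nrow = 0)
for (i in 1:10000) {
  body_mass_g <- X %*% beta + rnorm(n, 0, sigma)
  X_extra <- cbind(X, noise = rnorm(n))  # irrelevant extra predictor
  extra_lm <- lm(body_mass_g ~ -1 + X_extra)
  res_extra <- rbind(res_extra, unname(coef(extra_lm)))
}
## The first four averages should still be close to beta,
## and the noise coefficient should average near 0.
colMeans(res_extra)
```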