Appendix J — Multicollinearity

Author: Dr. Devan Becker
Affiliation: Wilfrid Laurier University
Published: 2024-04-22

J.1 The problem with multicollinearity

Consider the model \[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i \] with the added assumption that \(x_{i1} = a + bx_{i2} + z_i\), where \(z_i\) is a small amount of leftover variation. In other words, \(x_{i1}\) is almost a linear function of \(x_{i2}\).

(As a technical note, none of this is assumed to be random; just uncertain. We’ll use the language of probability in this section, but it’s just to quantify uncertainty rather than make modelling assumptions.)

Show the code
library(plot3D)
n <- 100
x1 <- runif(n, 0, 10)
x2 <- 2 + x1 + runif(n, -1, 1)
y <- 0 + 4*x1 + 3*x2 + rnorm(n, 0, 5)

scatter3D(x1, x2, y, phi = 30, theta = 15, bty = "g",
    xlab = "x1", ylab = "x2", zlab = "y")

The 3D plot looks like a tube! By “tube”, we can think of the 3D plot as being a 2D plot that’s in the wrong coordinate system. In other words, the data define a single axis, which is different from the x1, x2, and y axes. When fitting multiple linear regression, we’re fitting a hyperplane. If the relationship is multicollinear, then we might imagine rotating a hyperplane around the axis defined by the “tube” in the plot above. Since any rotation of the plane around that axis fits the data pretty well, the coefficients that define the plane are not well-estimated.
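To make this concrete, here's a small sketch that isn't part of the original code: since x1 and x2 move together, we can trade slope between them (keeping the total near the true 4 + 3 = 7), re-optimize only the intercept, and the residual sum of squares changes surprisingly little.

Show the code
# Residual sum of squares for several "rotated" planes whose slopes sum to 7.
# The intercept is re-optimized for each pair of slopes.
rss <- function(b1, b2) {
    b0 <- mean(y - b1*x1 - b2*x2) # best intercept for these fixed slopes
    sum((y - b0 - b1*x1 - b2*x2)^2)
}
sapply(c(0, 2, 4, 6), function(b1) rss(b1, 7 - b1))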

It’s hard to imagine 3D sets of points, so let’s do a 2D analogy. In the 3D example, we actually had a 2D relationship, so let’s consider a 2D example that’s almost 1D.

In the plot below, I’ve made \(x\) almost one-dimensional. That is, I’ve given it a very small range. I intentionally changed the x limits of the plot to emphasize this.

Show the code
x <- c(rep(-0.025, 10), rep(0.025, 10))
y <- x + rnorm(20, 0, 3)
plot(y ~ x, xlim = c(-2, 2))

If we were to fit a regression line to this, there are many, many slopes that would work!

Show the code
plot(y ~ x, xlim = c(-1, 1))
for(i in 1:20) abline(a = mean(y), b = runif(1, -20, 20))

All of the lines I added will fit the data pretty well, even though they all have completely different slopes! Yes, there’s one with a “best” slope, but slightly different data would have given us a very different slope:

Show the code
# Set up empty plot
plot(NA, xlab = "x", ylab = "y",
    xlim = c(-2, 2), ylim = c(-6,6))
# Generate new y values from the same data-generating process (dgp) as before
for(i in 1:100) {
    y <- x + rnorm(20, 0, 3)
    abline(lm(y ~ x))
}

All of these lines would have worked for our data!

This can easily be seen in the variance of the parameters:

Show the code
summary(lm(y ~ x))$coef
              Estimate Std. Error    t value  Pr(>|t|)
(Intercept) -0.6205096  0.6779798 -0.9152331 0.3721680
x            6.0310840 27.1191916  0.2223917 0.8265128
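For contrast (this example is an addition, not part of the original): if \(x\) covered a wide range instead of being squeezed near zero, the same amount of noise would give a much smaller standard error for the slope.

Show the code
# Same noise level, but x now spans (-2, 2) instead of (-0.025, 0.025)
x_wide <- seq(-2, 2, length.out = 20)
y_wide <- x_wide + rnorm(20, 0, 3)
summary(lm(y_wide ~ x_wide))$coef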

J.2 Exact Multicollinearity

In the example above, the line was very sensitive to a slight change in the data because we essentially had one dimension. If we actually had one dimension, then any line that goes through \(\bar y\) has the exact same error, regardless of the slope. We can see this in the design matrix: \(X\) has a column of 1s for the intercept (which is good), but also has a column for \(x_1\) that has zero variance. This means that the \(x_1\) column is a linear combination of the first column (a constant multiple of the column of 1s), and thus \(X\) is rank-deficient.

Show the code
X <- cbind(rep(1, 20), rep(0, 20))
try(solve(t(X) %*% X))
Error in solve.default(t(X) %*% X) : 
  Lapack routine dgesv: system is exactly singular: U[2,2] = 0
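We can also confirm the rank deficiency directly (a quick check added here, using the same \(X\) as above):

Show the code
# The 20-by-2 matrix has rank 1, not 2: the second column adds no new direction
qr(X)$rank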

Since one column is a linear combination of the other, \(X^TX\) cannot be inverted. Recalling that \(\text{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}\), this means the variance of \(\hat\beta_1\) is infinite!

This can also happen if one column is “close” to being a linear combination of the others, such as:

Show the code
# Change the first element in the second column to 10^-23
X[1, 2] <- 1e-23
try(solve(t(X) %*% X))
Error in solve.default(t(X) %*% X) : 
  system is computationally singular: reciprocal condition number = 4.75e-48

The columns are mathematically different (they really are linearly independent), but the difference is so small that the computer cannot tell. Even when the inversion does succeed, the entries of \((X^TX)^{-1}\) can be enormous, which translates into enormous variances for the coefficients:

Show the code
x <- runif(20, 0, 1) / 1e7
X <- cbind(1, x)
try(solve(t(X) %*% X))
                            x
   1.697904e-01 -2.406244e+06
x -2.406244e+06  4.833450e+13

In the next code chunk, try playing around with the power of 10 (e.g., try 10^50, 10^-50, etc., for both positive and negative powers). At some point, the matrix is not invertible and the coefficient table stops reporting the slope! At the other end, the line is a near-perfect fit (why?).

Show the code
x <- runif(20, 0, 1) / (10^350)
y <- x + rnorm(20, 0, 3)
summary(lm(y ~ x))$coef
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.1092976  0.6173655 0.1770387 0.8613516
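Why does the slope vanish at 10^350? (This note is an addition.) The divisor overflows to Inf in double precision, so the predictor becomes a column of exact zeros, which is perfectly collinear with the intercept and carries no information about a slope.

Show the code
# 10^350 is Inf in double precision, so x / (10^350) is exactly 0 everywhere
10^350
range(runif(20, 0, 1) / (10^350))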

J.3 Back to Regression

We can kinda detect multicollinearity using the standard errors of the estimates:

Show the code
library(palmerpenguins)
peng_lm <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
    data = penguins)
summary(peng_lm)

Call:
lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm + 
    bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1054.94  -290.33   -21.91   239.04  1276.64 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6424.765    561.469 -11.443   <2e-16 ***
flipper_length_mm    50.269      2.477  20.293   <2e-16 ***
bill_length_mm        4.162      5.329   0.781    0.435    
bill_depth_mm        20.050     13.694   1.464    0.144    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.4 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7615,    Adjusted R-squared:  0.7594 
F-statistic: 359.7 on 3 and 338 DF,  p-value: < 2.2e-16

It looks like bill_depth_mm has a large standard error! Of course, this might be because:

  • The variance of bill_depth_mm is high to begin with.
  • The other predictors have explained most of the variance and any estimate for bill_depth_mm will do.
  • There is multicollinearity among the predictors.

Multicollinearity is just one of the possible reasons why the SE might be high, so we need to look into it more to be sure.

Show the code
library(car)
vif(peng_lm)
flipper_length_mm    bill_length_mm     bill_depth_mm 
         2.673338          1.865090          1.611292 

It actually has quite a small VIF!
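As a reminder of what the VIF measures, here is the same number computed by hand: regress bill_depth_mm on the other predictors and take \(1/(1 - R^2)\). (This chunk is an added sketch; r2_depth is just a name chosen here.)

Show the code
# VIF for bill_depth_mm "by hand"; this should match vif(peng_lm) above
r2_depth <- summary(lm(bill_depth_mm ~ flipper_length_mm + bill_length_mm,
    data = penguins))$r.squared
1 / (1 - r2_depth)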

J.4 Centering and Scaling

One common suggestion for dealing with multicollinearity is to centre and scale the predictors. Let's see what this does to the correlation structure of the design matrix:

Show the code
X <- model.matrix(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
    data = penguins)

Z <- cbind(1, apply(X[, -1], 2, scale))
print("correlation of X")
[1] "correlation of X"
Show the code
round(cor(X), 4)
                  (Intercept) flipper_length_mm bill_length_mm bill_depth_mm
(Intercept)                 1                NA             NA            NA
flipper_length_mm          NA            1.0000         0.6562       -0.5839
bill_length_mm             NA            0.6562         1.0000       -0.2351
bill_depth_mm              NA           -0.5839        -0.2351        1.0000
Show the code
print("Also correlation of X")
[1] "Also correlation of X"
Show the code
round(t(Z) %*% Z/ (nrow(Z) - 1), 4)
                         flipper_length_mm bill_length_mm bill_depth_mm
                  1.0029            0.0000         0.0000        0.0000
flipper_length_mm 0.0000            1.0000         0.6562       -0.5839
bill_length_mm    0.0000            0.6562         1.0000       -0.2351
bill_depth_mm     0.0000           -0.5839        -0.2351        1.0000
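(The 1.0029 in the top-left corner is not a correlation; it's \(n/(n-1)\), since the column of 1s gets divided by \(n - 1\) along with everything else. A quick check, added here:)

Show the code
# Top-left entry of t(Z) %*% Z / (n - 1) is sum(1^2) / (n - 1) = n / (n - 1)
nrow(Z) / (nrow(Z) - 1)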

By simulation, we can see how correlated the coefficient estimates are across repeated samples from the same data-generating process:

Show the code
n <- 100
x1 <- runif(n, 0, 10)
x2 <- 2 + x1 + runif(n, -3, 3)

reps_1 <- replicate(1000, {
    y <- 0 + 4*x1 + 3*x2 + rnorm(n, 0, 5)
    coef(lm(y ~ x1 + x2))
}) |> t()
cor(reps_1)
            (Intercept)         x1         x2
(Intercept)   1.0000000  0.1764180 -0.6112093
x1            0.1764180  1.0000000 -0.8609998
x2           -0.6112093 -0.8609998  1.0000000
Show the code
x1 <- scale(x1)
x2 <- scale(x2)

reps_2 <- replicate(1000, {
    y <- 0 + 4*x1 + 3*x2 + rnorm(n, 0, 5)
    coef(lm(y ~ x1 + x2))
}) |> t()
cor(reps_2)
            (Intercept)          x1          x2
(Intercept)  1.00000000 -0.01364783  0.03646404
x1          -0.01364783  1.00000000 -0.85526575
x2           0.03646404 -0.85526575  1.00000000

The correlation is… slightly lower? It’s much lower for the intercept, but it doesn’t make much of a difference for the correlation between the slopes. (It will generally be lower, and should be with a large enough number of simulations.)
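The same comparison can be made without simulating at all, by turning the estimated covariance matrix of the coefficients into a correlation matrix (an added sketch using vcov() and cov2cor(); the data are regenerated so that we have an unscaled version to compare against):

Show the code
# Coefficient correlations implied by a single fit, before and after scaling
x1 <- runif(n, 0, 10)
x2 <- 2 + x1 + runif(n, -3, 3)
y <- 0 + 4*x1 + 3*x2 + rnorm(n, 0, 5)
round(cov2cor(vcov(lm(y ~ x1 + x2))), 3)               # unscaled
round(cov2cor(vcov(lm(y ~ scale(x1) + scale(x2)))), 3) # centred and scaled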

Just for fun, here’s the VIF for these data. I’ve added a parameter x2_around_x1 to allow you to play around with how closely x2 tracks x1, and hence the correlation between x1 and x2.

Show the code
x2_around_x1 <- 3
x1 <- runif(n, 0, 10)
x2 <- 2 + x1 + runif(n, -x2_around_x1, x2_around_x1)
y <- 0 + 4*x1 + 3*x2 + rnorm(n, 0, 5)

vif(lm(y ~ x1 + x2))
      x1       x2 
4.494537 4.494537
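Notice that the two VIFs are identical. With only two predictors this is always the case, since each VIF is \(1/(1 - r^2)\), where \(r\) is the correlation between x1 and x2 (a quick check, added here):

Show the code
# With two predictors, both VIFs equal 1 / (1 - cor(x1, x2)^2)
1 / (1 - cor(x1, x2)^2)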