Appendix I — Regression with Categorical Predictors

Dr. Devan Becker
Wilfrid Laurier University

Published 2024-04-22

I.1 Regression as a t-test

When we do regression against a single dummy variable, we are actually just doing a two-sample t-test (with equal variances).

  • A regression models the mean of \(y\) for each value of \(x\).
  • A one-unit increase in \(X\) is the difference between 0 and 1.
    • The slope is the difference in means.
  • We assume constant variance.
    • Same variance at \(x=0\) and \(x=1\).
  • A two-sample t-test with pooled variance has \(n-2\) degrees of freedom, and so does a regression with just one predictor.
Show the code
# Note that I need the var.equal=TRUE to match the assumptions of regression.
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

    Two Sample t-test

data:  mpg by am
t = -4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -10.84837  -3.64151
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 
Show the code
lm(mpg ~ am, data = mtcars) |> summary()

Call:
lm(formula = mpg ~ am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
am             7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The two outputs have the exact same test statistic (up to sign) and p-value!
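As a quick check, we can pull the pieces out directly. This is just a sketch (the object name am_lm is my own): the difference in group means should equal the am coefficient, and the two t statistics should agree up to sign.

am_lm <- lm(mpg ~ am, data = mtcars)
with(mtcars, diff(tapply(mpg, am, mean)))  # difference in group means; should equal the slope for am
c(t_test = unname(t.test(mpg ~ am, data = mtcars, var.equal = TRUE)$statistic),
  regression = unname(summary(am_lm)$coefficients["am", "t value"]))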

I.2 Categorical Predictors as ANOVA

In order for R to recognize a variable as a dummy variable, we need to tell it that the values should be interpreted as a factor. This is a very particular data type:

  • Every observation is one level of the categorical variable.

  • Every observation can only be one level of that categorical variable.

    • E.g., a car can’t have both 4 cylinders and 6 cylinders.
  • There are a finite number of possible categories.

    • Unless we specify otherwise, R assumes that the unique values it sees constitute all of the possible categories. In other words, it won’t let us ask about cars with 5 cylinders because we’re making a factor variable with 4, 6, and 8 as the observed values (see the sketch below).
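Here is a minimal sketch of what factor() records (the object name cyl_factor is my own):

cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)  # the observed values become the only possible levels
table(cyl_factor)   # every car falls into exactly one level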
Show the code
anova(aov(mpg ~ factor(cyl), data = mtcars))
Analysis of Variance Table

Response: mpg
            Df Sum Sq Mean Sq F value    Pr(>F)    
factor(cyl)  2 824.78  412.39  39.697 4.979e-09 ***
Residuals   29 301.26   10.39                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
summary(lm(mpg ~ factor(cyl), data = mtcars))

Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   26.6636     0.9718  27.437  < 2e-16 ***
factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

The overall test of significance has an F-value of 39.7 and a p-value of \(4.979\times 10^{-9}\).

But what is up with the summary.lm() output? Why do we have a row labelled factor(cyl)6??

I.3 Coding Dummy Variables (what factor is actually doing)

When a categorical variable has multiple categories, we set one aside as the “reference” category. For a factor variable with \(k\) categories, we set up \(k-1\) new variables to denote category membership. For example:

  • 4 is the reference category for cyl. If all of the new dummy variables that we create are 0, then the car must have 4 cylinders.

  • We set up a dummy variable for 6 cylinders, which is a column with a 1 if the car has 6 cylinders and a 0 otherwise. This is what’s labelled as factor(cyl)6 in the output.

    • This can be denoted \(I(cyl == 6)\).
  • Similarly, we have factor(cyl)8 = \(I(cyl == 8)\).

Consider the model \(y = \beta_0 + \beta_1I(cyl == 6) + \beta_2I(cyl == 8)\).

With this setup, the intercept represents the estimate for 4 cylinder cars, \(\beta_1\) represents the difference in means between 4 and 6 cylinder cars, while \(\beta_2\) represents the difference in means between 4 and 8 cylinder cars (the difference between 6 and 8 cylinder cars works out to \(\beta_2 - \beta_1\); see the check below).
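Here is a quick sketch of that “right math”, using the coefficient names from the summary output above (cf is my own object name):

cf <- coef(lm(mpg ~ factor(cyl), data = mtcars))
cf["factor(cyl)8"] - cf["factor(cyl)6"]  # beta2 - beta1: difference in mean mpg between 8 and 6 cylinder cars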

From model.matrix(), we can see this in action:

Show the code
cbind(model.matrix(mpg ~ factor(cyl), data = mtcars), cyl = cbind(mtcars$cyl)) |> head(12)
                  (Intercept) factor(cyl)6 factor(cyl)8  
Mazda RX4                   1            1            0 6
Mazda RX4 Wag               1            1            0 6
Datsun 710                  1            0            0 4
Hornet 4 Drive              1            1            0 6
Hornet Sportabout           1            0            1 8
Valiant                     1            1            0 6
Duster 360                  1            0            1 8
Merc 240D                   1            0            0 4
Merc 230                    1            0            0 4
Merc 280                    1            1            0 6
Merc 280C                   1            1            0 6
Merc 450SE                  1            0            1 8

The model can be written as:

\[ y_i = \begin{cases}\beta_0 & \text{if }\; cyl == 4\\ \beta_0 + \beta_1 & \text{if }\; cyl == 6\\\beta_0 + \beta_2 & \text{if }\; cyl == 8\end{cases} \]

And indeed we can show that this is equivalent to fitting three separate models:

Show the code
cyl4 <- coef(lm(mpg ~ 1, data = subset(mtcars, cyl == 4)))
cyl6 <- coef(lm(mpg ~ 1, data = subset(mtcars, cyl == 6)))
cyl8 <- coef(lm(mpg ~ 1, data = subset(mtcars, cyl == 8)))
allcyl <- coef(lm(mpg ~ factor(cyl), data = mtcars))
print(c(cyl4 = unname(cyl4), beta0=unname(allcyl[1])))
    cyl4    beta0 
26.66364 26.66364 
Show the code
print(c(cyl6 = unname(cyl6), beta0_plus_beta1 = unname(allcyl[1] + allcyl[2])))
            cyl6 beta0_plus_beta1 
        19.74286         19.74286 
Show the code
print(c(cyl8 = unname(cyl8), beta0_plus_beta2 = unname(allcyl[1] + allcyl[3])))
            cyl8 beta0_plus_beta2 
            15.1             15.1 

The advantage of having all three in one is that we can test for significance easily!

One way (a bad way) to visualize this is to treat \(I(cyl ==6)\) and \(I(cyl==8)\) as separate variables:

Show the code
par(mfrow = c(1,2))
plot(mpg ~ I(as.numeric(factor(cyl)) - 1), data = subset(mtcars, cyl %in% c(4,6)),
    xlab = "Cyl (0 = 4, 1 = 6)", main = "I(cyl == 6)")
abline(lm(mpg ~ factor(cyl), data = subset(mtcars, cyl %in% c(4,6))))
plot(mpg ~ I(as.numeric(factor(cyl)) - 1), data = subset(mtcars, cyl %in% c(4,8)),
    xlab = "Cyl (0 = 4, 1 = 8)", main = "I(cyl == 8)")
abline(lm(mpg ~ factor(cyl), data = subset(mtcars, cyl %in% c(4,8))))

From this, we can see that the slope of the line is indeed looking at the difference in means.

Categorical and Continuous Variables

If we have cyl and disp in the model, we get the following:

\[ y = \beta_0 + \beta_1I(cyl == 6) + \beta_2I(cyl == 8) + \beta_3 disp \] which is equivalent to: \[ y_i = \begin{cases}\beta_0 + \beta_3 disp& \text{if }\; cyl == 4\\ \beta_0 + \beta_1 + \beta_3 disp& \text{if }\; cyl == 6\\\beta_0 + \beta_2 + \beta_3 disp& \text{if }\; cyl == 8\end{cases} \] This is three different models of mpg versus disp, but with a different intercept depending on the value of cyl.

Show the code
cyldisp <- coef(lm(mpg ~ factor(cyl) + disp, data = mtcars))
cyldisp
 (Intercept) factor(cyl)6 factor(cyl)8         disp 
 29.53476781  -4.78584624  -4.79208587  -0.02730864 
Show the code
plot(mpg ~ disp, col = factor(cyl), data = mtcars)
abline(a = cyldisp[1], b = cyldisp[4], col = 1, lty = 1)
abline(a = cyldisp[1] + cyldisp[2], b = cyldisp[4], col = 2, lty = 1)
abline(a = cyldisp[1] + cyldisp[3], b = cyldisp[4], col = 3, lty = 2)

The plot looks like it only has 2 lines, but that’s because \(\beta_1 \approx \beta_2\) (-4.786 versus -4.792), so one line is plotted almost exactly on top of the other! I changed the linetypes so you can see this.

It looks like the red and green lines are fitting the red and green data, but the black line doesn’t look quite right.

Three Different Models

If we have an interaction between cyl and disp, then we essentially get 3 models. \[ y = \beta_0 + \beta_1I(6) + \beta_2I(8) + \beta_3 disp + \beta_4I(6)disp + \beta_5I(8)disp \] where \(I(6)\) is just shorthand for \(I(cyl == 6)\).

This is the same as: \[ y_i = \begin{cases}\beta_0 + \beta_3 disp& \text{if }\; cyl == 4\\ (\beta_0 + \beta_1) + (\beta_3 + \beta_4) disp& \text{if }\; cyl == 6\\(\beta_0 + \beta_2) + (\beta_3 + \beta_5) disp& \text{if }\; cyl == 8\end{cases} \]

In R, we can use the fanciness of the formula notation: R interprets * as the interaction as well as the lower-order terms.

Show the code
coef(lm(mpg ~ disp * factor(cyl), data = mtcars))
      (Intercept)              disp      factor(cyl)6      factor(cyl)8 
       40.8719553        -0.1351418       -21.7899679       -18.8391564 
disp:factor(cyl)6 disp:factor(cyl)8 
        0.1387469         0.1155077 
Show the code
coef(lm(mpg ~ disp, data = subset(mtcars, cyl == 4))) # The 6 and 8 cylinder fits can be rebuilt similarly (see below)
(Intercept)        disp 
 40.8719553  -0.1351418 
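For instance, here is a sketch of rebuilding the 6-cylinder line from the interaction model’s coefficients (int_cf is my own object name); it should match coef(lm(mpg ~ disp, data = subset(mtcars, cyl == 6))).

int_cf <- coef(lm(mpg ~ disp * factor(cyl), data = mtcars))
c(intercept = unname(int_cf["(Intercept)"] + int_cf["factor(cyl)6"]),  # beta0 + beta1
  slope     = unname(int_cf["disp"] + int_cf["disp:factor(cyl)6"]))    # beta3 + beta4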

ggplot2 makes it super easy to plot this.

Show the code
library(ggplot2); theme_set(theme_bw())
ggplot(mtcars) +
    aes(x = disp, y = mpg, colour = factor(cyl)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

Recall that the model without interaction terms (different intercepts, same slopes) had practically the same intercept for 6 and 8. The slopes look different here, but there isn’t a lot of data in the 6 category and I suspect that we could force it to have the same intercept and slope as the 8 category.
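One way to check that suspicion (just a sketch; the variable cyl68 and the object names are my own) is to collapse the 6 and 8 cylinder cars into one group and ask whether giving them separate lines actually improves the fit:

mtcars$cyl68 <- factor(ifelse(mtcars$cyl == 4, "4", "6 or 8"))
collapsed_lm <- lm(mpg ~ disp * cyl68, data = mtcars)   # one line shared by the 6 and 8 groups
full_lm <- lm(mpg ~ disp * factor(cyl), data = mtcars)  # separate lines for 6 and 8
anova(collapsed_lm, full_lm)  # a large p-value would support merging the 6 and 8 groups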

I.4 Comparison to Other Models

In previous lectures, we’ve looked at this same relationship using a polynomial model for displacement and a transformation for mpg. Let’s see how these stack up!

How do we compare such models? There isn’t a statistical test for which one fits best, but, as always, we want to know about the residuals!

Show the code
poly_lm <- lm(mpg ~ poly(disp, 2), data = mtcars)
log_lm <- lm(log(mpg) ~ disp, data = mtcars)
interact_lm <- lm(mpg ~ factor(cyl)*disp, data = mtcars)
Show the code
par(mfrow = c(1,3))

resid_plot <- 1 # Re-run with different values to check other plots
# Use similar colours to the ggplot.
mycolours <- c("red", "green", "blue")[as.numeric(factor(mtcars$cyl))]

plot(poly_lm, which = resid_plot, main = "Polynomial", col = mycolours)
plot(log_lm, which = resid_plot, main = "Log Transform", col = mycolours)
plot(interact_lm, which = resid_plot, main = "Interaction", col = mycolours)

  • Residuals versus fitted looks quite a bit better for the interaction model.

  • Normal Q-Q also looks better for the interaction model.

  • Scale-Location is slightly better for the interaction model, although still not perfect.

  • Residuals versus leverage indicates the Hornet 4 Drive car has somewhat high leverage. I’m guessing this is the largest green dot in the plot up above - the green line would be very different without it (a quick leverage check is sketched below).

    • This highlights the importance of carefully interpreting Cook’s Distance. In this case, the interaction model is a combination of three different models.

The plots are quite similar, but for the most part the interaction model seems to work best.
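If you’d rather not squint at the plot, hatvalues() reports the leverage of each observation directly. A small sketch using the interact_lm object defined above; the Hornet 4 Drive should show up near the top if the claim holds.

sort(hatvalues(interact_lm), decreasing = TRUE) |> head(3)  # highest-leverage cars in the interaction model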

ANOVA can be used to compare the residual variance across non-nested models, but it is not appropriate for one of the three models we just saw. See if you can guess which one!

Show the code
anova(poly_lm, interact_lm)
Analysis of Variance Table

Model 1: mpg ~ poly(disp, 2)
Model 2: mpg ~ factor(cyl) * disp
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     29 233.39                                
2     26 146.23  3    87.159 5.1655 0.006199 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The models have a significantly different fit, so the one with the lower residual variance probably fits better.

Show the code
c(summary(poly_lm)$sigma, summary(interact_lm)$sigma)
[1] 2.836907 2.371581

There are two important points here:

  1. The interaction model fits the context of the problem. It absolutely makes sense that an engine with 4 cylinders is different from an engine with 6 cylinders, and part of that difference is the relationship between mpg and displacement. There are no two engines that are identical except that one has two extra cylinders; the cylinder values represent fundamentally different groups.
    • In other words, the theory supports an interaction model. There’s no theory that I know of that states there should be a quadratic relationship between mpg and displacement. If we want to say something about cars in general (inference), it’s best to go with the context of the problem.
  2. The interaction model should fit better since it contains more information - it has both the displacement and the number of cylinders, while the other models only had displacement.
    • Also note that the interaction model takes up more degrees of freedom. This can be a negative, especially with small samples (a quick check of the residual df follows).
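A small sketch of that degrees-of-freedom cost, using the model objects defined earlier:

c(polynomial = df.residual(poly_lm),      # n minus 3 estimated coefficients
  log_transform = df.residual(log_lm),    # n minus 2
  interaction = df.residual(interact_lm)) # n minus 6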

This is an ANCOVA Model

Show the code
anova(interact_lm)
Analysis of Variance Table

Response: mpg
                 Df Sum Sq Mean Sq F value   Pr(>F)    
factor(cyl)       2 824.78  412.39 73.3221 2.05e-11 ***
disp              1  57.64   57.64 10.2487 0.003591 ** 
factor(cyl):disp  2  97.39   48.69  8.6574 0.001313 ** 
Residuals        26 146.23    5.62                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the ANCOVA test is 0.001313, indicating that the relationship between mpg and displacement (the slope) differs significantly depending on the number of cylinders.

  • The lower-order terms factor(cyl) and disp must be present in order for this test to make sense.

    • It is technically possible to fit a model with just the interaction term, but it’s slightly better to have extra predictors with a 0 coefficient than to be missing predictors with a non-0 coefficient.

      • For the curious, the syntax for this model would be lm(mpg ~ factor(cyl):disp, data = mtcars), where the : includes only the interaction (product) term without the main effects. This isn’t used often, since it’s almost always incorrect to include interaction terms without the individual effects.

      • It’s not clearly better in all cases! There may be some contextual reason why it makes sense to only have an interaction.

  • Recall that the anova() function reports the sequential sum-of-squares. In this situation, we do not care about the p-values for factor(cyl) and disp, so we do not care which order they enter the model in. We only care about the p-value for inclusion of the interaction term factor(cyl):disp.

    • R’s formula notation is clunky, but leads to a lot of great situations like this. By using factor(cyl)*disp, it added the lower-order terms first and then the interaction, thus making the sequential sum-of-squares useful! (This is demonstrated below.)
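To see the order-invariance for yourself, here is a sketch that enters the lower-order terms the other way around; the interaction row should be identical to the table above, while the rows for disp and factor(cyl) will change.

anova(lm(mpg ~ disp * factor(cyl), data = mtcars))  # the interaction still enters last, so its row is unchanged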