17 Analysis of MTCars
17.1 Exploratory Data Analysis
Understanding Data
What do the column names mean?
The help file gives a (very) brief description. I spent a few minutes just looking at the descriptions and trying to guess what relationships I might find.
Overall, most of the predictors are trying to answer the question “Is this a powerful car?”
In the code below, change head() to str().
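The code chunk that this step refers to is not shown here, so the following is a minimal sketch of what it might look like. It assumes only that mtcars is the built-in data set that ships with base R.

```r
# mtcars ships with base R (the datasets package), so nothing needs to be loaded.
head(mtcars)  # first six rows, useful for a quick look at the values

# str() gives one line per column: its type and the first few values.
str(mtcars)

# The (very) brief column descriptions are in the help file: ?mtcars
```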
Plotting the Data
Write down any conclusions about the relationships you see in the pairs plot below:
- Only 1 car has carb = 6, 1 has carb = 8
- wt and drat are (-ively) correlated
- disp and hp
- disp and drat (-ive)
- disp and wt
- hp and wt
- hp and qsec
wt and disp are clearly multicollinear; they’re measuring the same thing, so I might want to include just one of them.
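The pairs plot itself is not reproduced here. As a rough sketch, the observations above could be checked with something like the following. The choice of columns and the names carb_counts and cont_cor are my own, not from the original code.

```r
# Pairs plot of the response and the continuous predictors.
cont_vars <- c("disp", "hp", "drat", "wt", "qsec")
pairs(mtcars[, c("mpg", cont_vars)])

# How many cars are there at each value of carb?
carb_counts <- table(mtcars$carb)
carb_counts

# Correlation matrix of the continuous predictors, to back up what the plot suggests.
cont_cor <- cor(mtcars[, cont_vars])
round(cont_cor, 2)
```

The sign of each entry in the correlation matrix should agree with the direction of the trend in the corresponding panel of the pairs plot.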
Patterns in the Predictors
In the following code, try setting the term on the right of the ~ as am, cyl, gear, and carb. Try the terms on the left side of the ~ as wt, disp, drat, and qsec. Essentially, I tried every combination of these and wrote down the most interesting patterns.
- wt is different across categories of am, cyl, carb, gear (all positive)
- disp has same relationships
- hp has same relationships, except 4 gear cars have lower hp than 3 and 5 gear cars
- drat has opposite relationships
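The original chunk is not shown, so this is only a sketch of one way to run through the combinations. It assumes boxplots were used; the loop and the name wt_by_am are illustrative.

```r
# One boxplot per categorical predictor, with wt as the continuous variable.
# Swap "wt" for "disp", "drat", or "qsec" to cover the other combinations.
par(mfrow = c(2, 2))
for (x in c("am", "cyl", "gear", "carb")) {
  # reformulate() builds the formula wt ~ am, wt ~ cyl, and so on.
  boxplot(reformulate(x, response = "wt"), data = mtcars, xlab = x, ylab = "wt")
}
par(mfrow = c(1, 1))

# Group means give a numeric summary of the same comparison.
wt_by_am <- tapply(mtcars$wt, mtcars$am, mean)
wt_by_am
```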
Do something similar with the following code, checking every combination of all relevant predictors and writing down anything that sticks out.
- Clear separation between disp and wt when coloured by am or cyl.
  - In other words, there are distinct groups. This probably means that one of the continuous predictors has all of the information necessary, and it won’t be necessary to include an interaction between continuous predictors (it rarely is).
- Otherwise, there are not many relationships that might be present.
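A minimal sketch of a scatterplot of two continuous predictors coloured by a categorical one, assuming base graphics were used. The name cyl_f is illustrative.

```r
# Scatterplot of two continuous predictors, coloured by a categorical predictor.
cyl_f <- factor(mtcars$cyl)
plot(wt ~ disp, data = mtcars, col = as.integer(cyl_f), pch = 19)
legend("topleft", legend = levels(cyl_f), col = seq_along(levels(cyl_f)),
       pch = 19, title = "cyl")
```

Replacing cyl with am, vs, or gear, and wt or disp with the other continuous predictors, covers the remaining combinations.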
The following plot can also be used with all combinations of categorical predictors.
- Some kind of “correlation” between am and cyl.
  - Measuring something similar, but from different perspectives.
- Very little relation between am and vs: they’re measuring different things.
  - Might be worth checking models where am is switched with vs.
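The plot used for pairs of categorical predictors is not shown. One possibility, offered here only as an assumption, is a mosaic plot built from a two-way table.

```r
# Cross-tabulate two categorical predictors.
am_cyl <- table(am = mtcars$am, cyl = mtcars$cyl)
am_cyl

# A mosaic plot draws each cell with area proportional to its count.
mosaicplot(am_cyl, main = "am versus cyl")

# The same idea for am and vs.
am_vs <- table(am = mtcars$am, vs = mtcars$vs)
mosaicplot(am_vs, main = "am versus vs")
```

If two predictors are unrelated, the tiles in each column split at roughly the same heights; very uneven splits suggest that the two predictors carry overlapping information.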
Conclusions
Most things are measuring “how powerful is this car” or “how heavy is this car”, so we should just choose the ones that make sense to us and check a few categorical predictors.
wt and disp make the most sense as measures for mpg, and am and cyl also make some sense. I’ll try switching out some of the other predictors, but I expect that the final model will either be wt*am or disp*cyl.
17.2 More EDA: Relationships with the Response / Interactions
Now we’re finally looking at mpg!
From looking at many many plots, propose 3 (and only 3) candidate models.
- mpg versus disp*cyl
- mpg versus wt*am (cyl?)
- mpg versus wt*vs (maybe not an interaction)
- mpg versus wt*gear?
I had also considered including qsec, but a plot of mpg versus qsec with colours from cyl revealed that cyl explains the relationship; if we include cyl, then the slope for mpg versus qsec is 0. The same thing happens with drat, so cyl is probably enough to include in the model rather than either qsec or drat.
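The qsec check described above can be sketched as follows. This is one way to do it, not necessarily the code behind the original plot; slope_marginal and slope_within_cyl are illustrative names.

```r
# mpg versus qsec, coloured by number of cylinders.
cyl_f <- factor(mtcars$cyl)
plot(mpg ~ qsec, data = mtcars, col = as.integer(cyl_f), pch = 19)
legend("topleft", legend = levels(cyl_f), col = seq_along(levels(cyl_f)),
       pch = 19, title = "cyl")

# Slope for qsec when cyl is ignored ...
slope_marginal <- coef(lm(mpg ~ qsec, data = mtcars))[["qsec"]]
# ... and the slope for qsec once cyl is already in the model.
slope_within_cyl <- coef(lm(mpg ~ qsec + factor(cyl), data = mtcars))[["qsec"]]

c(marginal = slope_marginal, within_cyl = slope_within_cyl)
```

Comparing the two slopes is a numeric version of what the coloured plot shows; replacing qsec with drat repeats the check for the other predictor.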
17.3 Modelling
Let’s test out some models! The following code is already set up with a potentially reasonable model (but it isn’t the final model I would have chosen). Change it to test out your top 3 models, writing down any conclusions about the residuals.
When you’ve chosen your top candidate write it down!
My final model is as follows:
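The model code is missing from this copy of the notes. Based on the name dispmodel used later and the three-case equation in the interpretation, a plausible reconstruction is mpg against disp interacted with cyl as a factor. Treat the exact formula as an assumption.

```r
# Assumed form of the final model: a separate line in disp for each cyl group.
dispmodel <- lm(mpg ~ disp * factor(cyl), data = mtcars)

# The four standard diagnostic plots: residuals versus fitted, normal QQ,
# scale-location, and residuals versus leverage.
par(mfrow = c(2, 2))
plot(dispmodel)
par(mfrow = c(1, 1))
```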
Now, investigate a couple changes to the model, justifying each change based on your plots above. The following code tests removing the interaction term, but you should completely change it according to what you’ve done. You are encouraged to go back and try new plots before you test them in a model.
Once you have your final model, write some interpretations!
- Residuals versus fitted looks good
- QQ norm looks great! For this small of a data set, we don’t expect much from the qq-plot, so this is actually very nice.
- Scale-Location has a slight U shape, which isn’t ideal. There may still be a predictor that’s worth including.
- There’s a high influence point. This is likely due to the interaction between cyl and disp.
- When we have this kind of interaction, there are essentially three lines, each with fewer observations. It is much easier for a point to be influential with interaction present.
My second model is as follows:
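The second model's code is also missing. The earlier conclusions name wt*am as the other leading candidate, so the sketch below assumes that is the model meant here. The name wtmodel is my own.

```r
# Assumed second candidate: mpg against wt, with a separate line for each
# transmission type. (wtmodel is a hypothetical name.)
wtmodel <- lm(mpg ~ wt * factor(am), data = mtcars)

# Same four diagnostic plots as for the first model.
par(mfrow = c(2, 2))
plot(wtmodel)
par(mfrow = c(1, 1))
```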
- First plot looks good!
- QQ plot has some heavy tails: not bad, but not ideal. dispmodel was better.
- Scale-location is great!
- No high leverage points.
Both models are good in different ways. Let’s check their summaries.
(Because of the way webr works, each code chunk is a separate R process. This means I have to re-define the model again in every code chunk.)
The \(R^2\) for dispmodel is a fair bit higher (although there’s no standard for how much an \(R^2\) should change, so this might not be a meaningful difference). As we saw in class, the \(R^2\) is based on the same quantities as the F-test for different models.
The models fit significantly differently. Which one fits better?
dispmodel has a higher \(R^2\) and a lower MSE, so it seems to be the winner.
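A sketch of how the two fits could be compared side by side. Both formulas are assumptions (see the notes on each model above), and fit_stats is a hypothetical helper, not a base R function.

```r
# Re-define both models so this chunk is self-contained.
dispmodel <- lm(mpg ~ disp * factor(cyl), data = mtcars)  # assumed formula
wtmodel   <- lm(mpg ~ wt * factor(am), data = mtcars)     # assumed formula

# Hypothetical helper: pull R^2, adjusted R^2, and the MSE out of a fitted lm.
fit_stats <- function(model) {
  s <- summary(model)
  c(r_squared = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    mse = s$sigma^2)  # residual variance estimate
}

comparison <- rbind(dispmodel = fit_stats(dispmodel),
                    wtmodel   = fit_stats(wtmodel))
round(comparison, 3)
```

The adjusted \(R^2\) is included because the two models use different numbers of parameters, which the plain \(R^2\) does not account for.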
From the pairs plot, I saw that disp has a slight relationship with other continuous predictors, and the scale-location plot wasn’t perfect. Perhaps another predictor will help?
I can do this with the magical update() function. In the notation ~ . + hp, the ~ means “versus” (with the response on the left, which isn’t allowed to change in this case, and the predictors on the right), and the . means “everything”, which in this case refers to everything that was already in the model. So ~ . + hp means: keep the response and all of the existing predictors, then add hp. In contrast, the form lm(mpg ~ ., data = mtcars) will fit mpg against every other column in the mtcars dataset.
I checked qsec, drat, and hp, and none seemed worth including in the model. I’ll just leave it as is.
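A minimal sketch of the update() step, assuming dispmodel has the formula shown in the comment. The name dispmodel_hp is illustrative.

```r
# Re-define the model so this chunk is self-contained (assumed formula).
dispmodel <- lm(mpg ~ disp * factor(cyl), data = mtcars)

# Keep the response and every existing term, then add hp.
dispmodel_hp <- update(dispmodel, ~ . + hp)
formula(dispmodel_hp)

# The models are nested, so an F-test checks whether hp improves the fit.
anova(dispmodel, dispmodel_hp)
```

Swapping hp for qsec or drat repeats the check for the other predictors.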
To interpret the model we must be careful about the interaction term!
\[ mpg = \begin{cases} \beta_0 + \beta_1 disp & \text{if }cyl == 4\\ (\beta_0 + \beta_2) + (\beta_1 + \beta_4) disp & \text{if }cyl == 6\\ (\beta_0 + \beta_3) + (\beta_1 + \beta_5) disp & \text{if }cyl == 8\\ \end{cases} \]
- For 4 cylinder cars, the baseline mpg (the intercept, i.e. the predicted mpg at disp = 0) is about 40 and decreases by 0.135 for each one-unit increase in disp.
- For 6 cylinder cars, the baseline mpg is about 21.5 and isn’t really related to the displacement.
- For 8 cylinder cars, the baseline mpg is about 24.5 and decreases by about 0.02 for each one-unit increase in displacement.
- Note that displacement has really large units, so 0.02 over hundreds of one-unit increases is still a lot!
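The intercept and slope for each cylinder group come from adding coefficients as in the piecewise equation above. A sketch, assuming the model is mpg ~ disp * factor(cyl); lines_by_cyl is an illustrative name.

```r
dispmodel <- lm(mpg ~ disp * factor(cyl), data = mtcars)  # assumed formula
b <- coef(dispmodel)

# Combine the coefficients into one intercept and one slope per group,
# matching beta_0 ... beta_5 in the piecewise equation.
lines_by_cyl <- data.frame(
  cyl = c(4, 6, 8),
  intercept = c(b[["(Intercept)"]],
                b[["(Intercept)"]] + b[["factor(cyl)6"]],
                b[["(Intercept)"]] + b[["factor(cyl)8"]]),
  slope = c(b[["disp"]],
            b[["disp"]] + b[["disp:factor(cyl)6"]],
            b[["disp"]] + b[["disp:factor(cyl)8"]])
)
lines_by_cyl
```

Because the interaction model lets both the intercept and the slope vary by group, each row matches the simple regression of mpg on disp fitted to that cylinder group alone.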
17.4 Conclusions
We chose the final model purely based on balancing a good fit without overfitting. This led us to modelling the fuel efficiency of the car as a linear function of the engine displacement, with a different relationship depending on the number of cylinders. Generally, cars with more cylinders tend to have a lower fuel efficiency, and a larger engine is also associated with lower efficiency.
The engine displacement is a measure of both the weight of the car as well as the power of the car. The number of cylinders can also be seen as a measure of these two things, so our final model seems reasonable.
In this study, we are limited by both the size of the data and the purpose of data collection. These are observational data, with no effort made to control any of the predictors. For instance, we did not simply take a Mazda RX4 and change the number of cylinders to see what might happen. Instead, we observed a correlation in the data. A car company that wants to build a more fuel efficient car might want to use these associations as a starting point, but would need to do some controlled experiments to see what causal relationships might exist.
The data do not include all of the predictors that I would have liked. For instance, there is no measure of the aerodynamics of the cars. Any study focused on fuel efficiency would benefit from such a measure. Furthermore, the fuel efficiency is the one reported by the company; a controlled study would measure the actual fuel efficiency in identical conditions for all cars.
In addition, the data are old. Like, very old. Cars are very different now, and these data are very outdated. There are absolutely no conclusions from this model that would actually be useful for modern cars.
17.5 Comment: The purpose of a study
In this study, we focused on the model fit. However, if a company was interested specifically in the speed of a car rather than its weight, we might have instead focused on the qsec predictor.
In many studies, researchers may have a set of predictors that they are particularly interested in. For example, a study might ask whether a new treatment affects the patients’ heart rate while “controlling” for other factors (biosex, race, weight, age, etc.). Such a regression task could be written as: \[ heartrate_i = \beta_0 + \beta_1I(treatment_i) + \beta_2age_i + \beta_3I(biosexMale_i) + \beta_4weight_i + \beta_5I(raceBlack_i) + \beta_6I(raceHispanic_i) + ... + \epsilon_i \] and we would never investigate a model that does not have the predictor that defines the treatment. Instead, we might find the best model according to all other predictors (possibly with transformations), then look at whether the treatment variable is significant.
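There are no treatment data in these notes, so the following is only a rough mtcars analogue of that workflow. It treats qsec as the predictor of interest; the covariates wt and cyl are chosen purely for illustration, and interest_model is a hypothetical name.

```r
# The predictor of interest (qsec) stays in every model we consider;
# only the other covariates are up for negotiation.
interest_model <- lm(mpg ~ qsec + wt + factor(cyl), data = mtcars)

# After settling on the covariates, look at the estimate, standard error,
# t statistic, and p-value for the predictor of interest.
qsec_row <- summary(interest_model)$coefficients["qsec", ]
qsec_row
```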