Appendix L — Permutation Tests
L.1 Introduction
A permutation is a re-ordering of a set of things. In this case, we’re randomly re-ordering the data. Note that we are not randomly sampling rows; we are just changing their order. In particular, we’re randomly re-ordering the measurements independently of each other!
Why would we do this? Because it destroys the association between our variables while preserving each variable’s marginal distribution. This gives us a “null” distribution that makes no assumptions about the distribution of the data!
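Here’s a minimal sketch of what a single permutation does; the data are simulated just for illustration, and none of this comes from the book’s own code:

```r
# Permuting one variable breaks its association with the other,
# but leaves each variable's marginal distribution untouched.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

cor(x, y)            # strong correlation in the original pairing
y_perm <- sample(y)  # randomly re-order y: same values, new order
cor(x, y_perm)       # the correlation is destroyed

sort(y)[1:5]         # identical values...
sort(y_perm)[1:5]    # ...so every marginal summary is unchanged
```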
L.2 Intuition: the facial impact test
The best statistical test is the “facial impact test”. In this test, we reject the null hypothesis if the results are so obvious that they jump off the page and slap you in the face (facial impact).
For this example I’m going to make 20 plots. One of them is going to be the original data, and the rest will be random permutations. If you can tell which is the true data, then there’s probably something important going on! In other words, the relationship between the response and the predictor matters.
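The book’s interactive code cell isn’t reproduced here, but a rough sketch of the idea, using the cars data and base R graphics as stand-ins, might look like this:

```r
# Draw 20 scatterplots: one is the real data, 19 are permutations.
real <- sample(1:20, 1)  # secretly choose where the real data goes
par(mfrow = c(4, 5), mar = c(2, 2, 1, 1))
for (i in 1:20) {
  if (i == real) {
    plot(cars$speed, cars$dist, main = i)          # the true data
  } else {
    plot(cars$speed, sample(cars$dist), main = i)  # a permutation
  }
}
real  # reveal which panel held the real data
```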
I’ve run this several times and my cheek hurts from how much the results have slapped me in the face.
Here’s a less obvious example, but I’m sure you can still pick out the real data:
L.3 The Permutation Test for Significance of Regression
A test for significance finds the probability of getting results at least as extreme as the results we got, assuming that the null hypothesis is true.
For regression, the null hypothesis is that there is no relationship between the response and the predictor. Permutations destroy the relationship, so they create a null distribution without making any normality assumptions! To get the p-value, we can do a bunch of permutations and see where our observed statistic falls in the distribution under the null hypothesis.
The code below demonstrates this for the cars data. Notice how similar this is to the simulation code we’ve used, but we’re not generating any new data. Try it again with the women data!
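The original code cell isn’t shown here; the following is a reconstruction that matches the discussion below (in particular, the objects betas, below, and above), so treat the details as my assumptions rather than the book’s exact code:

```r
# Observed slope of dist ~ speed in the cars data
beta_hat <- coef(lm(dist ~ speed, data = cars))[2]

# Null distribution: permute the response, refit, record the slope
betas <- replicate(10000, {
  coef(lm(sample(dist) ~ speed, data = cars))[2]
})

hist(betas, main = "Slopes under the null")
abline(v = beta_hat, col = "red")

# Two-sided p-value: proportion of permuted slopes at least as far
# from the centre of the null distribution as the observed slope
distance <- abs(beta_hat - mean(betas))
below <- mean(betas) - distance
above <- mean(betas) + distance
mean(betas < below | betas > above)
```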
Under the null, the sample slopes range from roughly -2 to 2. However, our actual slope was about 4! Notice that the p-value calculation is the probability of a value at least as extreme: we first find the observed slope’s distance from the mean of the null distribution, then count the number of permuted slopes that far or further from the mean. The expression betas < below | betas > above returns a vector of TRUE/FALSE values, with TRUE meaning that the point is either below below or above above. The mean() function, when applied to TRUE/FALSE values, gives a proportion. Thus, we are calculating the proportion of permuted slopes that were at least as far away as the value we actually observed, which is exactly the definition of a p-value!
In the end, we got a p-value of exactly 0. The p-value from normality assumptions is below:
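Something like the following pulls that p-value out of the usual lm() summary (a sketch, not necessarily the book’s exact cell):

```r
# t-test p-value for the slope, relying on normality assumptions
summary(lm(dist ~ speed, data = cars))$coefficients["speed", "Pr(>|t|)"]
```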
That’s a p-value of 0.0000000000015 (about 1.5 × 10^-12). To get a nonzero estimate of a p-value this low from a permutation test, we’d need somewhere near 1,000,000,000,000 permutations, and even then we might not get a single permuted slope as extreme as the one we observed. It’s not a surprise that our permutation test p-value was 0!
L.4 Permute the response or the predictor(s)?
For the simple linear regression examples above, it doesn’t matter: once either variable is permuted, any existing relationship is destroyed.
For multiple linear regression, it suddenly matters a lot!
- If we permute the response, it severs the relationship between the response and all predictors.
  - This is like an F-test for overall significance!!!
- If we permute one of the predictors, it severs the relationship between the response and that particular predictor, while also severing any multicollinearity between that predictor and the others.
  - This is often a good way to do things, since it tests whether that predictor contributes to the regression, not just the response.
  - Watch how the other predictors’ coefficients change!
- If we permute all of the predictors together, then the relationship between the response and the predictors is removed, but all multicollinearity remains.
- If we permute the predictors individually, then there should be no relationships anywhere among the data.
For the last two, we often just care about the response given our predictors; it doesn’t often make sense to permute the predictors individually.
Note that there’s a secret fifth option: permute some of the predictors together. This is like an Extra Sum-of-Squares test for a group of predictors!
Run the following code many times with kind = "y" and watch what changes. Then change kind to "x" to see the difference, and also try "xx".
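The interactive code cell isn’t reproduced here, but a sketch of what it might contain is below; the simulated data, the coefficient values, and the exact behaviour of kind are my reconstruction from the description that follows:

```r
kind <- "y"  # try "y", "x", and "xx"

n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # built-in multicollinearity
y <- 2 * x1 + 3 * x2 + rnorm(n)
df <- data.frame(y, x1, x2)

if (kind == "y") {
  df$y <- sample(df$y)    # severs y from both predictors
} else if (kind == "x") {
  df$x1 <- sample(df$x1)  # severs everything involving x1
} else if (kind == "xx") {
  rows <- sample(1:n)     # permute the predictors *together*:
  df$x1 <- df$x1[rows]    # y is severed from x1 and x2, but the
  df$x2 <- df$x2[rows]    # x1-x2 relationship survives
}

summary(lm(y ~ x1 + x2, data = df))
```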
(Note: you can click on the last line and hit “Shift+Enter” to run the code cell without scrolling up and clicking the “Run Code” button.)
- kind = "y" affects the response, without affecting multicollinearity.
- kind = "x" affects all relationships involving x1.
- kind = "xx" affects relationships involving x1 and x2, but there’s an important exception…
L.5 Example: Permutation Test for Bill Depth
Is bill_depth_mm an important predictor in the Palmer Penguins data?
The method for getting the answer depends on the model that we choose! In the following, we’re modelling the body mass against the flipper length, bill length, and bill depth (each with an interaction term with species), and the different islands are given different intercepts. This is by no means a “good” model; flipper_length_mm * species is a good enough model. However, this model has interaction terms that will persist if we take bill depth out of the equation, and so it’s a good example of doing an ESS test for all parameters involving bill_depth_mm. We also know that there’s some multicollinearity in these data, which has implications for the permutation test.
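Here’s a sketch of the permutation test being described, assuming the palmerpenguins package and reconstructing the model formula from the paragraph above:

```r
library(palmerpenguins)
peng <- na.omit(penguins)

full_formula <- body_mass_g ~
  (flipper_length_mm + bill_length_mm + bill_depth_mm) * species + island
reduced_formula <- body_mass_g ~
  (flipper_length_mm + bill_length_mm) * species + island

# Observed ESS F statistic for all terms involving bill_depth_mm
obs_F <- anova(lm(reduced_formula, data = peng),
               lm(full_formula, data = peng))$F[2]

# Null distribution: permute bill_depth_mm, refit, recompute F
null_F <- replicate(1000, {
  perm <- peng
  perm$bill_depth_mm <- sample(perm$bill_depth_mm)
  anova(lm(reduced_formula, data = perm),
        lm(full_formula, data = perm))$F[2]
})

hist(null_F, main = "F statistics under the null")
abline(v = obs_F, col = "red")
mean(null_F >= obs_F)  # permutation p-value
```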
That looks pretty definitively like bill_depth_mm affects the F statistic!
Compare this to the ESS test:
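Using the same two models as in the sketch above, the classical version is a single anova() call:

```r
# ESS F-test comparing the reduced and full models directly
anova(lm(reduced_formula, data = peng),
      lm(full_formula, data = peng))
```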
If you entered the formula correctly, you should get a p-value of 1.352e-11, which is also a pretty definitive difference.
If we wanted to do an ANOVA test for overall significance, we would have permuted body_mass_g. Try it out using the two code blocks above!