We’ve been over this in confidence intervals, and the same thing applies to hypothesis tests! If the population is normal (or the sample size is large enough) and we have an SRS, then \[ \frac{\bar X - \mu}{s/\sqrt{n}}\sim t_{n-1} \]
Again, the \(t\) distribution is used to account for the extra variability from the estimated standard deviation.1
This means our test statistic is \[ t_{obs} = \frac{\bar x - \mu}{s/\sqrt{n}} \]
Since this is a \(t\) distribution, we use pt(t_obs, df = n - 1), possibly with one minus and/or doubling, depending on the alternative hypothesis.2
That’s it. That’s the big difference. When we estimate the standard deviation, we use the t-distribution.
Given a sample mean \(\bar x\) and a sample standard deviation \(s\), our test statistic is: \[ t_{obs} = \frac{\bar x - \mu}{s/\sqrt{n}} \] Our hypotheses and calculations of the p-value work the same as they did for the z-test.
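As a rough sketch (using made-up summary statistics and a hypothetical null value mu0, not values from these notes), the whole procedure in R looks like this:

xbar <- 10.2  # hypothetical sample mean
s    <- 1.5   # hypothetical sample standard deviation
n    <- 20    # hypothetical sample size
mu0  <- 10    # hypothesized mean under the null

t_obs <- (xbar - mu0) / (s / sqrt(n))
pt(t_obs, df = n - 1)            # alternative: mu < mu0
1 - pt(t_obs, df = n - 1)        # alternative: mu > mu0
2 * pt(-abs(t_obs), df = n - 1)  # alternative: mu != mu0 (double the smaller tail)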
In the pilot fatigue example from the Understanding p-values lecture, we assumed that we had the population sd. I lied - it was actually a sample statistic! We should have used a t-test, not a z-test.

Recall: the observed test statistic was 3, based on a sample of size \(n = 16\).
Using the \(t\) distribution, our p-value is:
1 - pt(3, df = 16 - 1)
[1] 0.004486369
This is larger than our previous p-value of 0.0013. This will always be the case: if the \(z_{obs}\) test statistic is equal to the \(t_{obs}\) test statistic, then the p-value based on \(t_{obs}\) will be larger.
We almost never know the population standard deviation, so we have extra uncertainty. With extra uncertainty, we require more evidence! Recall that a p-value is a measure of evidence against a null.
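To see this numerically, here is a small comparison using the test statistic of 3 from the pilot fatigue example. The normal-based p-value is the 0.0013 quoted above; the t-based p-values are larger, shrinking toward the normal answer as the degrees of freedom grow:

1 - pnorm(3)             # z-test p-value (approx 0.0013)
1 - pt(3, df = 16 - 1)   # t-test p-value with n = 16 (approx 0.0045)
1 - pt(3, df = 100 - 1)  # with a larger sample, closer to the normal answer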
A matched pairs design allows us to use a one-sample t-test when it looks like we have two samples3. Since the pairs are matched, we can calculate the differences between pairs and treat this like a single vector of observations. It is honkey tonk ridonkulous to say that we know the true population standard deviation for the difference in observations, so a \(z\) test could never be appropriate.
Consider the following example of a matched pairs experiment. Given a sample of brave volunteers, we create a small cut on both hands and put ointment on one of the two cuts4. This study design eliminates the variation in healing times for different people since both cuts are on the same person! For each individual, we observe a difference. That is, one observation per person!
| | Subject 1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 |
|---|---|---|---|---|---|---|---|---|
| With Ointment | 6.44 | 6.06 | 4.22 | 3.3 | 6.5 | 3.49 | 7.01 | 4.22 |
| Without | 7.22 | 6.05 | 4.55 | 4 | 6.7 | 2.88 | 7.88 | 6.32 |
| Difference | -0.78 | 0.01 | -0.33 | -0.7 | -0.2 | 0.61 | -0.87 | -2.1 |
Note: Differences were calculated as “With minus Without”! This will be important for setting up the alternative hypothesis later.
The important thing here is that last row of this table now represents our data - we can forget that the other two rows exist! In other words, we have one observation per person, rather than two sets of observations.
This is where the assumption that we know the population standard deviation is especially preposterous: we’re looking just at the differences! Even if there’s a true value of the sd for healing time for all people, the standard deviation of the difference between healing times isn’t a reasonable quantity to speak of.
Since we’re looking at the difference, we no longer have a hypothesized value of \(\mu_0\). Instead, we hypothesize that the average pairwise difference is 0, i.e. \(\mu_{with\; minus\; without} = \mu_{diff} = 0\)5. The alternative is “with” < “without”, i.e. \(\mu_{diff} < 0\).6
x <- c(-0.78, 0.01, -0.33, -0.7, -0.2, 0.61, -0.87, -2.1)
xbar <- mean(x)
s <- sd(x)
n <- length(x)
t_obs <- (xbar - 0)/(s/sqrt(n)) # xbar is with - w/out
# Notice that we use pt() instead of pnorm()
pt(t_obs, df = n - 1) # Alternative is <
[1] 0.04662624
So our p-value is approximately 0.047. At the 5% level, the null hypothesis would be rejected and we would conclude that the ointment works7. At the 1% level, we would fail to reject the null and could not conclude that the ointment has an effect. This is why it’s important to know the significance level before calculating the p-value - we shouldn’t get to choose whether our results are statistically significant!
Do you think that researchers in the field are typing test statistics into their calculator? Of course not! We’re finally at the point in this class where the methods are so commonly used that the built-in functions in R can calculate them.
with_oint <- c(6.44, 6.06, 4.22, 3.3, 6.5, 3.49, 7.0, 4.22)
without <- c(7.22, 6.05, 4.55, 4, 6.7, 2.88, 7.8, 6.32)
difference <- with_oint - without
t.test(difference, alternative = "less")
One Sample t-test
data: difference
t = -1.9199, df = 7, p-value = 0.04817
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf -0.007063183
sample estimates:
mean of x
-0.53625
Notice that the output shows a one-sided confidence interval. This isn’t a big leap from what you know: a confidence interval consists of all of the values that would not be rejected by a hypothesis test, and this works for one-sided as well as two-sided alternate hypotheses!
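As a quick sketch of this duality (re-using the ointment differences from above), we can test two candidate null values with the mu argument of t.test(): one inside the reported one-sided interval and one outside it. The specific values -0.1 and 0 are just illustrative choices.

with_oint <- c(6.44, 6.06, 4.22, 3.3, 6.5, 3.49, 7.0, 4.22)
without <- c(7.22, 6.05, 4.55, 4, 6.7, 2.88, 7.8, 6.32)
difference <- with_oint - without

# -0.1 is inside (-Inf, -0.00706], so it should not be rejected at the 5% level;
# 0 is outside the interval, so it should be rejected.
t.test(difference, mu = -0.1, alternative = "less")$p.value
t.test(difference, mu = 0, alternative = "less")$p.value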
To get a two-sided confidence interval, we can either leave alternative at its default value or set it to "two.sided". We can also change the confidence level with the conf.level argument. For an 89% CI:
t.test(difference, alternative = "two.sided", conf.level = 0.89)
One Sample t-test
data: difference
t = -1.9199, df = 7, p-value = 0.09635
alternative hypothesis: true mean is not equal to 0
89 percent confidence interval:
-1.04730428 -0.02519572
sample estimates:
mean of x
-0.53625
Notice that this calculated a two-sided p-value, which is twice what we saw before (and no longer significant at the 5% level!).
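The relationship between the two outputs can be checked directly from the reported test statistic (t = -1.9199, df = 7):

pt(-1.9199, df = 7)      # one-sided p-value, approx 0.048
2 * pt(-1.9199, df = 7)  # two-sided p-value, approx 0.096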
If the alternative is \(<\), the p-value is pnorm(z_obs); if \(>\), then 1 - pnorm(z_obs); if two-sided, double the correct one.

New York is sometimes called “the city that never sleeps”. At the 5% level, do the following data provide evidence that the average New Yorker gets less than 8 hours of sleep per night?
| \(\bar x\) | \(s\) | \(n\) |
|---|---|---|
| 7.73 | 0.77 | 25 |
pt(-1.75, 24) = 0.0464. Since this is less than 0.05, there is evidence at the 5% level that the average New Yorker gets less than 8 hours of sleep per night. The full calculation is sketched below.
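A sketch of the full calculation, using the summary statistics from the table:

xbar <- 7.73
s    <- 0.77
n    <- 25
t_obs <- (xbar - 8) / (s / sqrt(n))  # approx -1.75
pt(t_obs, df = n - 1)                # one-sided p-value, approx 0.046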
Construct a 95% CI for the New York sleep example.

| \(\bar x\) | \(s\) | \(n\) |
|---|---|---|
| 7.73 | 0.77 | 25 |
qt(0.025, 24) = -2.0639, so the 95% CI is \(7.73 \pm 2.0639 \times 0.77/\sqrt{25}\). A sketch of this calculation is below.
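A sketch of the interval calculation in R:

xbar <- 7.73
s    <- 0.77
n    <- 25
t_star <- qt(0.975, df = n - 1)         # approx 2.0639
xbar + c(-1, 1) * t_star * s / sqrt(n)  # roughly (7.41, 8.05)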
What is the standard error?

What is the standard deviation of the sampling distribution?
Why does the sampling distribution have a lower variance than the population?
After conducting a study, we found a p-value of 0.04. Did we find a statistically significant result?
After conducting a study, we found a 95% confidence interval for \(\mu\) from -0.1 to 1.9. What can we conclude?
Under which condition does the CLT not apply?
The following questions are added from the Winter 2024 section of ST231 at Wilfrid Laurier University. The students submitted questions for bonus marks, and I have included them here (with permission, and possibly minor modifications).
Set up the hypotheses: \(H_0: \mu = 8\) versus \(H_A: \mu < 8\), where \(\mu\) is the true mean sleep duration in hours.
Calculate the test statistic using the sample mean, population mean, standard deviation, and sample size: \[ t_{obs} = \frac{\bar x - \mu_0}{s/\sqrt{n}} = -2.946 \]
The p-value can be found using the t-distribution on \(n-1\) degrees of freedom:
pt(-2.946, df = 50 - 1)
[1] 0.002457447
Since this is less than 0.05, we reject the null hypothesis. There is evidence at the 5% level that the true mean sleep duration is less than 8 hours.
Note that this is a matched pairs t-test!
Set up the hypotheses: \(H_0: \mu = 5\) versus \(H_A: \mu > 5\), where \(\mu\) is the mean weight loss in pounds.
Calculate the test statistic using the formula for a t-test: \[ t_{obs} = \frac{\bar x - \mu_0}{s/\sqrt{n}} = \frac{5.8 - 5}{0.9/\sqrt{40}} = 5.62 \]
Using the t-distribution on \(n - 1\) degrees of freedom:
1 - pt(5.62, df = 40 - 1)
[1] 8.727633e-07
We get a value much smaller than our significance level of 0.01! We reject the null, and conclude that the average weight loss is more than 5 pounds.
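For reference, the same arithmetic can be done directly in R:

t_obs <- (5.8 - 5) / (0.9 / sqrt(40))  # approx 5.62
1 - pt(t_obs, df = 40 - 1)             # alternative is ">"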
Our hypotheses are \(H_0:\mu = 80\) versus \(H_A:\mu \ne 80\).
\[ t_{obs} = \frac{\bar x - \mu_0}{s/\sqrt{n}} = \frac{82 - 80}{6 / \sqrt{36}} = 2 \]
The p-value can be found as:
2 * (1 - pt(2, df = 36 - 1))

This gives a p-value of approximately 0.053. Since our p-value is larger than a 5% significance level, we do not reject the null. We have not gathered evidence that the claim of 80mg is incorrect.
The correct answer is (a) Reject the null hypothesis at the 5% significance level because the confidence interval does not include 0, indicating a significant difference in mean reductions. The 95% confidence interval represents the range of values within which we are 95% confident the true mean difference lies. If the null hypothesis were true (no difference in mean reductions), we would expect the interval to include 0. However, since the interval (2, 8) does not include 0, we have evidence that the mean reduction from the new medication is significantly different from the standard treatment at the 5% significance level. This conclusion is based on the principle that if a 95% confidence interval for a difference does not include 0, the difference is statistically significant at the 5% level. Hence, the confidence interval directly informs the hypothesis test outcome.
This is a matched pairs test, since it’s looking at the same patient before and after treatment. Instead of a sample of people before treatment and a sample of people after treatment, the researchers can look at a single sample of the differences.
Differences: 3, 5, -1, 4, 6, 2, 7, 4, 3, 5, 2, 4, 6, 8, 3
Assuming the differences in anxiety levels are normally distributed, conduct a one-sample t-test to determine if the therapy leads to a significant reduction in anxiety levels at a 5% significance level.
A.) The researchers have a sample of differences, meaning that each value comes from two observations on a natural pairing (e.g. the same individual). This can be done as a one-sample t-test.
B.) Null hypothesis (\(H_0\)): The therapy has no effect on anxiety levels, so the mean difference in anxiety levels is 0 (\(\mu_d=0\)).
Alternative hypothesis (\(H_A\)): The therapy leads to a reduction in anxiety levels, so the mean difference in anxiety levels is greater than 0 (\(\mu_d >0\)).
C.) The calculated test statistic (t-statistic) for the differences in anxiety levels is approximately 7.00.
D.) The critical t-value for a one-tailed test at a 5% significance level with 14 degrees of freedom is approximately 1.76.
E.) Since the calculated t-statistic (7.00) is greater than the critical t-value (1.76), we reject the null hypothesis. There is sufficient evidence to support that the therapy leads to a significant reduction in anxiety levels. This conclusion is based on the assumption that if the therapy had no effect, the likelihood of observing a sample mean difference as extreme as this, or more extreme, is very low under the null hypothesis. Therefore, the therapy appears to be effective in reducing anxiety levels among the patients in this study.
In R:
t.test(c(3, 5, -1, 4, 6, 2, 7, 4, 3, 5, 2, 4, 6, 8, 3), alternative = "greater")
One Sample t-test
data: c(3, 5, -1, 4, 6, 2, 7, 4, 3, 5, 2, 4, 6, 8, 3)
t = 6.9972, df = 14, p-value = 3.138e-06
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
3.043017 Inf
sample estimates:
mean of x
4.066667
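The critical value quoted in step D can also be checked in R:

qt(0.95, df = 14)  # approx 1.76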
Hours: 12, 15, 9, 11, 14, 8, 10, 13, 12, 7
Assuming the number of hours follows a normal distribution, calculate a 90% confidence interval for the average number of hours all employees at the startup spend on professional development activities per month.
B.) The critical t-value for constructing a 90% confidence interval with 9 degrees of freedom is approximately 1.83.
C.) The 90% confidence interval for the average number of hours all employees at the startup spend on professional development activities per month is approximately (9.59, 12.61) hours.

D.) Based on the sample data, we can be 90% confident that the true average number of hours spent on professional development activities by all employees at the startup falls between 9.59 and 12.61 hours per month.
In R:
hours <- c(12, 15, 9, 11, 14, 8, 10, 13, 12, 7)
t.test(hours, conf.level = 0.9)
One Sample t-test
data: hours
t = 13.494, df = 9, p-value = 2.818e-07
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
9.592086 12.607914
sample estimates:
mean of x
11.1
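For comparison, the same 90% interval can be built from the formula (this should match the t.test() output above):

hours <- c(12, 15, 9, 11, 14, 8, 10, 13, 12, 7)
t_star <- qt(0.95, df = length(hours) - 1)                          # approx 1.83
mean(hours) + c(-1, 1) * t_star * sd(hours) / sqrt(length(hours))   # approx (9.59, 12.61)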
Which is used in the calculation of the Estimated Standard Error.↩︎
Like pnorm(), it always calculates the probability below the test statistic.↩︎
We’ll learn about two-sample t-tests in the next lecture.↩︎
And most likely a bandage on both.↩︎
In other words, the healing times are the same for each subject.↩︎
This is where it’s important to know that we did “with minus without”; we could have done without minus with, but then our alternate hypotheses would need to be “>”.↩︎
A p-value says nothing about the effect size, so we can’t say whether it’s practically significant.↩︎