This lesson is going to be a little different from the rest. I’m not going to give you the answers; I’m going to give you the tools.
As we’ve seen many times, this is \[ SE(\bar X) = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\sigma^2}{n}}, \] but we often use \[ \hat{SE}(\bar X) = \frac{S}{\sqrt{n}} = \sqrt{\frac{S^2}{n}}, \] which is the estimated standard error (the “hat” on top of the letters “SE” indicates that it’s estimated). For the rest of this lecture, we’ll always use the estimated standard error.
Even though we are subtracting means, we add their variances and then take the square root. \[ \hat{SE}(\bar X_1 - \bar X_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \] It is an important fact that we add the variances first and only then take the square root; standard errors themselves don’t add.
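As a quick sketch in R (with two small made-up samples; the numbers are just for illustration):

x1 <- c(5.1, 4.8, 6.0, 5.5, 5.9)
x2 <- c(4.2, 4.9, 5.3, 4.6, 5.0)
sd(x1) / sqrt(length(x1))                           # estimated SE of one mean
sqrt(sd(x1)^2 / length(x1) + sd(x2)^2 / length(x2)) # estimated SE of the difference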
From this lesson, also recall that we had to make the following assumptions:
This is nothing new, I’m just repeating it here: \[ \hat{SE}(\hat p) = \sqrt{\frac{\hat p(1 - \hat p)}{n}} \]
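For instance, a hypothetical sample of 100 observations with 40 successes would give:

p_hat <- 40 / 100
sqrt(p_hat * (1 - p_hat) / 100)  # approximately 0.049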
This is up to you to find! Keep these in mind:
This question is from OpenIntro Introductory Statistics for the Life and Biomedical Sciences, First Edition.
The way a question is phrased can influence a person’s response. For example, Pew Research Center conducted a survey with the following question:
As you may know, by 2014 nearly all Americans will be required to have health insurance. [People who do not buy insurance will pay a penalty] while [People who cannot afford it will receive financial help from the government]. Do you approve or disapprove of this policy?
For each randomly sampled respondent, the statements in brackets were randomized: either they were kept in the order given above, or the order of the two statements was reversed. The table below shows the results of this experiment. Calculate and interpret a 90% confidence interval of the difference in the probability of approval of the policy.
| | Sample size \(n_i\) | Approve % |
|---|---|---|
| Original Ordering | 771 | 47 |
| Reversed Ordering | 732 | 34 |
Solution
Let \(p_1\) be the proportion who approve when given the original ordering with sample size \(n_1\), and \(p_2\) be the proportion who approve when given the reversed ordering with sample size \(n_2\). This question is asking us to calculate a confidence interval for \(p_1 - p_2\).
We first check the conditions required to use the normal approximation.
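For the success-failure part of that check, here is a quick sketch in R using the counts implied by the table:

n1 <- 771; p1 <- 0.47   # original ordering
n2 <- 732; p2 <- 0.34   # reversed ordering
c(n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2))  # all well above 10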
Now that that’s covered, we can make a confidence interval. The general form is: \[ \text{Point Estimate}\pm\text{Critical Value}*\text{Standard Error} \] From your homework above, verify that you can calculate the standard error as 0.025.3
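If you want to verify that number in R:

sqrt(0.47 * (1 - 0.47) / 771 + 0.34 * (1 - 0.34) / 732)  # approximately 0.025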
Our point estimate is \(\hat p_1 - \hat p_2 = 0.47 - 0.34 = 0.13\). Since we are doing a difference in proportions, our critical value comes from the normal distribution:
qnorm((1 - 0.9)/2)
[1] -1.644854
So our confidence interval is: \[ 0.13 \pm 1.65*0.025 = (0.09, 0.17) \]
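As a quick sketch, here is the whole interval in R (using the unpooled standard error, since this is a confidence interval):

pe <- 0.47 - 0.34
se <- sqrt(0.47 * 0.53 / 771 + 0.34 * 0.66 / 732)
pe + c(-1, 1) * qnorm(0.95) * se  # approximately (0.089, 0.171)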
We are 90% confident that the true difference in the proportions who approve is between 0.09 and 0.17. Since 0 is not in this interval, it provides evidence that the two proportions are indeed different.
Here’s where things are a little less obvious, so I’m not going to make you find the standard error yourself!
We are generally looking at a hypothesis test for whether two proportions are equal, that is, \[ H_0: p_1 = p_2\implies p_1 - p_2 = 0 \] with an alternative that they are not equal, or that one is bigger than the other. In other words, we’re looking at the hypotheses:4 \[\begin{align*} H_0: &p_{1-2} = 0\\ H_A: &p_{1-2} \ne 0\text{ or }p_{1-2} > 0\text{ or }p_{1-2} < 0\\ \end{align*}\]
In the lesson on proportions, we saw that the standard error depended on the null hypothesis being true, since we calculate p-values under the assumption that the null hypothesis is true. How do we do that here?
Under the null hypothesis, \(p_1 = p_2\). That’s like saying that we observed a bunch of successes and failures from a single group, instead of two. Let \(x_1\) be the number of successes in the first group, and \(x_2\) the number for the second. Then \[ \hat p = \frac{x_1 + x_2}{n_1 + n_2} \] That is, we observed \(x_1 + x_2\) successes out of \(n_1 + n_2\) trials.
For example, suppose we assume that two coins have the same probability of heads, and we get 5 heads in 9 flips of one coin and 3 heads in 6 flips of the other. The two coins are assumed to be identical, so it’s as if we flipped one coin 15 times and got 8 heads.
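In R, the pooled estimate for this coin example is simply:

(5 + 3) / (9 + 6)  # approximately 0.533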
As before, the assumption that the null hypothesis is true is used everywhere. In particular, it applies when checking whether the normal approximation is appropriate: we must check that \(n_1\hat p\), \(n_1(1 - \hat p)\), \(n_2\hat p\), and \(n_2(1 - \hat p)\) are all at least 10, using the pooled \(\hat p\).
From this, we might assume that our standard error is something like: \[ \hat{SE}(\hat p_1 - \hat p_2) = \sqrt{\frac{\hat p(1 - \hat p)}{???}} \] The ??? might seem like it should be \(n_1 + n_2\), but some advanced math shows that this doesn’t quite work. Again, this is from the problem of adding variances, but working with standard deviations. Instead, the standard error is: \[ \hat{SE}(\hat p_1 - \hat p_2) = \sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]
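One way to see where the correct form comes from: add the two variances (each using the pooled \(\hat p\)) and factor out the common term before taking the square root: \[ \frac{\hat p(1 - \hat p)}{n_1} + \frac{\hat p(1 - \hat p)}{n_2} = \hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \]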
This standard error is based on the null hypothesis, specifically the assumption that the groups are identical, so it’s as if we took two samples from the same population.
As before, the test statistic is \[ \frac{\text{sample statistic} - \text{hypothesized value}}{\text{standard error}} = \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \] and this is compared to the normal distribution.
Using the same example as before, we can set up our null hypothesis as \(p_1 = p_2\), and we’ll choose the alternative hypothesis \(p_1 \ne p_2\).5 We’ll use the 5% level.
The “pooled” estimate is based on \(x_1\) and \(x_2\), which we can find based on \(\hat p_1\) and \(n_1\). Since \(\hat p_1 = x_1/n_1\), we can find \(x_1 = \hat p_1 n_1 = 771 * 0.47 = 362.37\), which we’ll round to 362. Similarly, we’ll use \(x_2\) as 249. \[ \hat p = \frac{362 + 249}{771 + 732} = 0.4065 \]
The test statistic is calculated as \[ \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.47 - 0.34) - 0}{\sqrt{0.4065(1-0.4065)(1/771 + 1/732)}} = 5.12 \]
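Verifying this in R (a sketch; small differences from the numbers above are just rounding):

x1 <- round(0.47 * 771)  # 362 approvals, original ordering
x2 <- round(0.34 * 732)  # 249 approvals, reversed ordering
p_pool <- (x1 + x2) / (771 + 732)
se <- sqrt(p_pool * (1 - p_pool) * (1/771 + 1/732))
(0.47 - 0.34) / se  # approximately 5.13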
We all remember the all-important value of 1.96, right? The area under the normal curve above 1.96 plus the area below -1.96 adds up to 5%. If we get a z-score above 1.96 or below -1.96, we know that the p-value is smaller than 5%. Intuitively, 5.12 is a massive z-score, and thus will have a minuscule p-value.
2 * (1 - pnorm(5.12))
[1] 3.055357e-07
That’s a p-value of approximately 0.0000003. We can safely reject the null hypothesis.
This isn’t surprising: the original proportions were 0.47 and 0.34, with sample sizes of 771 and 732. Given these sample sizes, we expect a pretty small standard error, so we shouldn’t be surprised that a difference of 0.13 counts as a “big” difference!
The following example comes from OpenIntro Statistics for Health and Life Sciences.
The use of screening mammograms for breast cancer has been controversial for decades because the overall benefit on breast cancer mortality is uncertain. Several large randomized studies have been conducted in an attempt to estimate the effect of mammogram screening. A 30-year study to investigate the effectiveness of mammograms versus a standard non-mammogram breast cancer exam was conducted in Canada with 89,835 female participants. During a 5-year screening period, each woman was randomized to either receive annual mammograms or standard physical exams for breast cancer. During the 25 years following the screening period, each woman was screened for breast cancer according to the standard of care at her health care center.
At the end of the 25 year follow-up period, 1,005 women died from breast cancer. The results by intervention are summarized below.
| | Died | Survived |
|---|---|---|
| Mammogram | 500 | 44,425 |
| Control | 505 | 44,405 |
Assess whether the normal model can be used to analyze the study results.
Since the participants were randomly assigned to each group, the groups can be treated as independent, and it is reasonable to assume independence of patients within each group. Participants in randomized studies are rarely random samples from a population, but the investigators in the Canadian trial recruited participants using a general publicity campaign, by sending personal invitation letters to women identified from general population lists, and through contacting family doctors. In this study, the participants can reasonably be thought of as a random sample.
The pooled proportion \(\hat{p}\) is
\[ \hat{p} = \dfrac{x_{1} + x_{2}}{n_{1} + n_{2}} = \dfrac{500 + 505}{500 + 44425 + 505 + 44405} = 0.0112 \]
Checking the success-failure condition for each group: \[\begin{align*} \hat{p} \times n_{mgm} &= 0.0112 \times \text{44,925} = 503\\ (1 - \hat{p}) \times n_{mgm} &= 0.9888 \times \text{44,925} = \text{44,422} \\ \hat{p} \times n_{ctrl} &= 0.0112 \times \text{44,910} = 503\\ (1 - \hat{p}) \times n_{ctrl} &= 0.9888 \times \text{44,910} = \text{44,407} \end{align*}\] All values are at least 10.6
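These checks can be reproduced in R using the counts from the table:

n_mgm <- 500 + 44425   # 44,925
n_ctrl <- 505 + 44405  # 44,910
p_pool <- (500 + 505) / (n_mgm + n_ctrl)
c(p_pool * n_mgm, (1 - p_pool) * n_mgm, p_pool * n_ctrl, (1 - p_pool) * n_ctrl)  # all far above 10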
The normal model can be used to analyze the study results.
We can use this information to do a hypothesis test for the equality of proportions.
The standard error is still: \[ \sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.000702 \] That’s quite small, but this is to be expected with such a large sample size.
The test statistic is \((\hat p_1 - \hat p_2)/0.000702 \approx -0.16\). Again, using our intuition, this is far smaller in absolute value than our 1.96 cutoff, so this is very much not a significant result.
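A quick verification in R (a sketch; small differences from the rounded values above are expected):

p1 <- 500 / 44925
p2 <- 505 / 44910
p_pool <- (500 + 505) / (44925 + 44910)
se <- sqrt(p_pool * (1 - p_pool) * (1/44925 + 1/44910))
(p1 - p2) / se             # approximately -0.16
2 * pnorm((p1 - p2) / se)  # p-value of roughly 0.87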
We conclude that there is insufficient evidence to reject the null hypothesis; the observed difference in breast cancer death rates is reasonably explained by sampling error when the two proportions are equal.
Evaluating medical treatments typically requires accounting for additional evidence that cannot be evaluated from a statistical test. For example, if mammograms are much more expensive than a standard screening and do not offer clear benefits, there is reason to recommend standard screenings over mammograms. This study also found a higher proportion of diagnosed breast cancer cases in the mammogram screening arm (3,250 in the mammogram group vs. 3,133 in the physical exam group), despite the nearly equal number of breast cancer deaths. The investigators inferred that mammograms may cause over-diagnosis of breast cancer, a phenomenon in which a breast cancer diagnosed via mammogram and subsequent biopsy may never become symptomatic. The possibility of over-diagnosis is one of the reasons mammogram screening remains controversial.
Recall that independence means that knowledge of one outcome does not give you a better guess at other outcomes.↩︎
In this example, we could find the difference in heights between spouses, then use this collection of differences in a one-sample t-test; this gives us different information, but it’s also interesting.↩︎
This is to ensure you have the correct calculation; you won’t need to do this on a test.↩︎
There may be a time in your life where you test whether \(p_1 - p_2 = 0.25\) or something like that, and you’ll need to modify the methods a little bit.↩︎
You might have also chosen \(p_1 > p_2\) if you thought, before seeing the results of the study, that the original order would lead to more agreement.↩︎
It is worth noting that these values are very close to the original values we were given. If we were doing a confidence interval where we don’t use the pooled proportion, we could have just checked the values in the given table!↩︎