17  Inference for the Difference in Proportions

Author

Dr. Devan Becker

Published

2023-07-14

17.1 DIY Confidence Intervals

This lesson is going to be a little different from the rest. I’m not going to give you the answers; I’m going to give you the tools.

Standard Error for a Single Mean

As we’ve seen many times, this is \[ SE(\bar X) = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\sigma^2}{n}}, \] but we often use \[ \hat{SE}(\bar X) = \frac{S}{\sqrt{n}} = \sqrt{\frac{S^2}{n}}, \] which is the estimated standard error (the “hat” on top of the letters “SE” indicates that it’s estimated). For the rest of this lecture, we’ll always use the estimated standard error.
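Here it is in R, with made-up numbers:

s <- 2    # hypothetical sample standard deviation
n <- 25   # hypothetical sample size
s / sqrt(n)   # estimated SE of the mean: 2/5 = 0.4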

Standard Error for the Difference in Means

Even though we subtract the means, we add their variances and then take the square root. \[ \hat{SE}(\bar X_1 - \bar X_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \] It is an important fact that we add the variances first and then take the square root; we never add standard deviations directly.
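Here’s that formula in R, again with made-up summary statistics:

s1 <- 2.1; n1 <- 30   # hypothetical sd and size for group 1
s2 <- 1.8; n2 <- 25   # hypothetical sd and size for group 2
sqrt(s1^2 / n1 + s2^2 / n2)   # add variances, then square root: ≈ 0.526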

From this lesson, also note that we had to make the assumptions:

  • The individuals within a group are independent of other individuals in that group.
    • For example, if we sample people in our own family, then the samples are not independent. People in the same family tend to have similar characteristics, so knowledge of the characteristics of one family member is informative about the others.1
  • The groups are independent.
    • For example, suppose we’re looking at the difference in mean heights between men and women, but our sample consists of spousal pairs. The height difference within a spousal pair tends to be smaller than the average difference between men and women, so the two groups are not independent.2

Standard Error for a Single Proportion

This is nothing new, I’m just repeating it here: \[ \hat{SE}(\hat p) = \sqrt{\frac{\hat p(1 - \hat p)}{n}} \]
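In R, using the numbers from the survey example later in this lesson:

p_hat <- 0.47; n <- 771
sqrt(p_hat * (1 - p_hat) / n)   # ≈ 0.018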

Standard Error for the Difference in Proportions

This is up to you to find! Keep these in mind:

  • The standard error cannot be negative, so you probably can’t subtract things.
  • Variances can be added if we assume things are independent.
    • Make these assumptions explicit!

Confidence Interval Example

This question is from OpenIntro Introductory Statistics for the Life and Biomedical Sciences, First Edition.

The way a question is phrased can influence a person’s response. For example, Pew Research Center conducted a survey with the following question:

As you may know, by 2014 nearly all Americans will be required to have health insurance. [People who do not buy insurance will pay a penalty] while [People who cannot afford it will receive financial help from the government]. Do you approve or disapprove of this policy?

For each randomly sampled respondent, the statements in brackets were randomized: either they were kept in the order given above, or the order of the two statements was reversed. The table below shows the results of this experiment. Calculate and interpret a 90% confidence interval of the difference in the probability of approval of the policy.

                    Sample size \(n_i\)   Approve %
Original Ordering   771                   47
Reversed Ordering   732                   34

Solution

Let \(p_1\) be the proportion who approve when given the original ordering with sample size \(n_1\), and \(p_2\) be the proportion who approve when given the reversed ordering with sample size \(n_2\). This question is asking us to calculate a confidence interval for \(p_1 - p_2\).

We first check the conditions required to use the normal approximation.

  1. Pew Research Center is basically the world expert on opinion polling, so the samples are probably good.
  2. We can safely assume that the samples are independent.
  3. The two statements were randomly assigned, so it’s safe to say that the two groups are independent.
  4. \(n_1 \hat p_1 = 771 * 0.47 = 362.37\), which is well above 10.
    • There are three other calculations to check: \(n_1(1 - \hat p_1)\), \(n_2\hat p_2\), and \(n_2(1 - \hat p_2)\). Check them!

Now that that’s covered, we can make a confidence interval. The general form is: \[ \text{Point Estimate}\pm\text{Critical Value}*\text{Standard Error} \] From your homework above, verify that you can calculate the standard error as 0.025.3

Our point estimate is \(\hat p_1 - \hat p_2 = 0.47 - 0.34 = 0.13\). Since we’re doing a difference in proportions, our critical value comes from the normal distribution:

qnorm((1 - 0.9)/2)   # lower critical value for a 90% confidence interval
[1] -1.644854

So our confidence interval is: \[ 0.13 \pm 1.65*0.025 = (0.09, 0.17) \]
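As a quick R check of this interval (the standard error of 0.025 is taken as given from your homework):

point_est <- 0.47 - 0.34
crit <- qnorm(1 - 0.1/2)   # 1.645 for a 90% interval
point_est + c(-1, 1) * crit * 0.025   # ≈ (0.089, 0.171)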

We are 90% confident that the true difference in the proportions of approval is between 0.09 and 0.17. Since 0 is not in this interval, this provides evidence that the two proportions are indeed different.

17.2 Hypothesis Testing

Here’s where things are a little less obvious, so I’m not going to make you find the standard error yourself!

We are generally looking at a hypothesis test for whether two proportions are equal, that is, \[ H_0: p_1 = p_2\implies p_1 - p_2 = 0 \] with an alternative that they are not equal, or that one is bigger than the other. In other words, we’re looking at the hypotheses:4 \[\begin{align*} H_0: &p_{1-2} = 0\\ H_A: &p_{1-2} \ne 0\text{ or }p_{1-2} > 0\text{ or }p_{1-2} < 0\\ \end{align*}\]

In the lesson on proportions, we saw that the standard error depended on the null hypothesis being true, since we calculate p-values under the assumption that the null hypothesis is true. How do we do that here?

The Pooled Proportion

Under the null hypothesis, \(p_1 = p_2\). That’s like saying that we observed a bunch of successes and failures from a single group, instead of two. Let \(x_1\) be the number of successes in the first group, and \(x_2\) the number for the second. Then \[ \hat p = \frac{x_1 + x_2}{n_1 + n_2} \] That is, we observed \(x_1 + x_2\) successes out of \(n_1 + n_2\) trials.

For example, suppose we assume that two coins have the same probability of heads, and we get 5 heads in 9 flips for one coin and 3 heads in 6 flips for the other. The two coins are assumed to be identical, so it’s as if we flipped one coin 15 times and got 8 heads.
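In R, the pooled proportion for the coin example is just:

x1 <- 5; n1 <- 9   # coin 1: 5 heads in 9 flips
x2 <- 3; n2 <- 6   # coin 2: 3 heads in 6 flips
(x1 + x2) / (n1 + n2)   # 8/15 ≈ 0.533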

As before, the assumption that the null hypothesis is true is used everywhere, including when we check whether the normal approximation is appropriate. We must check \(n_1\hat p\), \(n_1(1 - \hat p)\), \(n_2\hat p\), and \(n_2(1 - \hat p)\), all using the pooled proportion \(\hat p\).

From this, we might guess that our standard error is something like: \[ \hat{SE}(\hat p_1 - \hat p_2) = \sqrt{\frac{\hat p(1 - \hat p)}{???}} \] The ??? might seem like it should be \(n_1 + n_2\), but some advanced math shows that this doesn’t quite work. Again, this is the problem of adding variances while working with standard deviations. Instead, the standard error is: \[ \hat{SE}(\hat p_1 - \hat p_2) = \sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

This standard error is based on the null hypothesis, specifically the assumption that the groups are identical, so it’s as if we took two samples from the same population.

As before, the test statistic is \[ \frac{\text{sample statistic} - \text{hypothesized value}}{\text{standard error}} = \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \] and this is compared to the normal distribution.
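Here’s a minimal R sketch of the whole procedure. The name prop_z_test is made up for this lesson (R’s built-in prop.test() does something similar, but with a continuity correction):

prop_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)   # pooled proportion under H0
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  c(z = z, p_value = 2 * (1 - pnorm(abs(z))))   # two-sided p-value
}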

Hypothesis Test Example

Using the same example as before, we can set up our null hypothesis as \(p_1 = p_2\), and we’ll choose the alternative hypothesis \(p_1 \ne p_2\).5 We’ll use the 5% level.

The “pooled” estimate is based on \(x_1\) and \(x_2\), which we can recover from \(\hat p_1\) and \(n_1\). Since \(\hat p_1 = x_1/n_1\), we have \(x_1 = \hat p_1 n_1 = 771 * 0.47 = 362.37\), which we’ll round to 362. Similarly, \(x_2 = 732 * 0.34 = 248.88\), which we’ll round to 249. \[ \hat p = \frac{362 + 249}{771 + 732} = 0.4065 \]

The test statistic is calculated as \[ \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.47 - 0.34) - 0}{\sqrt{0.4065(1-0.4065)(1/771 + 1/732)}} = 5.12 \]
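We can verify this with the prop_z_test() sketch from above; the small discrepancy comes from using unrounded proportions:

prop_z_test(362, 771, 249, 732)   # z ≈ 5.10; the 5.12 above used the rounded 0.47 and 0.34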

We all remember the all-important value of 1.96, right? The total area under the normal curve above 1.96 plus the area below -1.96 adds up to 5%. If we get a z-score above 1.96 or below -1.96, we know that the p-value is smaller than 5%. Intuitively, 5.12 is a massive z-score, and thus will have a minuscule p-value.

2 * (1 - pnorm(5.12))   # two-sided p-value for z = 5.12
[1] 3.055357e-07

That’s a p-value of approximately 0.0000003. We can safely reject the null hypothesis.

This isn’t surprising: the original proportions were 0.47 and 0.34, with sample sizes of 771 and 732. Given these sample sizes, we expect a pretty small standard error, so we shouldn’t be surprised that a difference of 0.13 counts as a “big” difference!

17.3 Example

The following example comes from OpenIntro Statistics for Health and Life Sciences.

The use of screening mammograms for breast cancer has been controversial for decades because the overall benefit on breast cancer mortality is uncertain. Several large randomized studies have been conducted in an attempt to estimate the effect of mammogram screening. A 30-year study to investigate the effectiveness of mammograms versus a standard non-mammogram breast cancer exam was conducted in Canada with 89,835 female participants. During a 5-year screening period, each woman was randomized to either receive annual mammograms or standard physical exams for breast cancer. During the 25 years following the screening period, each woman was screened for breast cancer according to the standard of care at her health care center.

At the end of the 25 year follow-up period, 1,005 women died from breast cancer. The results by intervention are summarized below.

            Died   Survived
Mammogram   500    44,425
Control     505    44,405

Assess whether the normal model can be used to analyze the study results.

Since the participants were randomly assigned to each group, the groups can be treated as independent, and it is reasonable to assume independence of patients within each group. Participants in randomized studies are rarely random samples from a population, but the investigators in the Canadian trial recruited participants using a general publicity campaign, by sending personal invitation letters to women identified from general population lists, and through contacting family doctors. In this study, the participants can reasonably be thought of as a random sample.

The pooled proportion \(\hat{p}\) is

\[ \hat{p} = \dfrac{x_{1} + x_{2}}{n_{1} + n_{2}} = \dfrac{500 + 505}{500 + 44425 + 505 + 44405} = 0.0112 \]

Checking the success-failure condition for each group: \[\begin{align*} \hat{p} \times n_{mgm} &= 0.0112 \times \text{44,925} = 503\\ (1 - \hat{p}) \times n_{mgm} &= 0.9888 \times \text{44,925} = \text{44,422} \\ \hat{p} \times n_{ctrl} &= 0.0112 \times \text{44,910} = 503\\ (1 - \hat{p}) \times n_{ctrl} &= 0.9888 \times \text{44,910} = \text{44,407} \end{align*}\] All values are at least 10.6
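These checks are easy to verify in R:

n_mgm <- 500 + 44425    # 44,925
n_ctrl <- 505 + 44405   # 44,910
p_pool <- (500 + 505) / (n_mgm + n_ctrl)
p_pool * c(n_mgm, n_ctrl)         # ≈ 503 in each group
(1 - p_pool) * c(n_mgm, n_ctrl)   # ≈ 44,422 and 44,408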

The normal model can be used to analyze the study results.

We can use this information to do a hypothesis test for the equality of proportions.

The standard error is still: \[ \sqrt{\hat p(1 - \hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.000702 \] That’s quite small, but this is to be expected with such a large sample size.

The test statistic is \((\hat p_1 - \hat p_2) / 0.000702 = -0.17\). Again, using our intuition, this is far smaller in magnitude than our 1.96 cutoff, so this is very much not a significant result.
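Verifying with the (made-up) prop_z_test() helper from earlier:

prop_z_test(500, 44925, 505, 44910)   # z ≈ -0.16, p-value ≈ 0.87 (rounding explains the -0.17 above)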

We conclude that there is insufficient evidence to reject the null hypothesis; the observed difference in breast cancer death rates is reasonably explained by sampling error when the two proportions are equal.

Evaluating medical treatments typically requires accounting for additional evidence that cannot be evaluated from a statistical test. For example, if mammograms are much more expensive than a standard screening and do not offer clear benefits, there is reason to recommend standard screenings over mammograms. This study also found a higher proportion of diagnosed breast cancer cases in the mammogram screening arm (3,250 in the mammogram group vs. 3,133 in the physical exam group), despite the nearly equal number of breast cancer deaths. The investigators inferred that mammograms may cause over-diagnosis of breast cancer, a phenomenon in which a breast cancer diagnosed with mammogram and subsequent biopsy may never become symptomatic. The possibility of over-diagnosis is one of the reasons mammogram screening remains controversial.


  1. Recall that dependence means that knowledge of one outcome gives you a better guess at other outcomes; independence means that it does not.↩︎

  2. In this example, we could find the difference in heights between spouses, then use this collection of differences in a one-sample t-test. This gives us different (but also interesting) information.↩︎

  3. This is to ensure you have the correct calculation; you won’t need to do this on a test.↩︎

  4. There may be a time in your life when you test whether \(p_1 - p_2 = 0.25\) or something like that, and you’ll need to modify the methods a little bit.↩︎

  5. You might have also chosen \(p_1 > p_2\) if you thought, before seeing the results of the study, that the original order would lead to more agreement.↩︎

  6. It is worth noting that these values are very close to the original values we were given. If we were doing a confidence interval where we don’t use the pooled proportion, we could have just checked the values in the given table!↩︎