We assume there’s some “true” population mean.
We cannot know the population parameter!
If we were able to measure the height of every Canadian at this very instant, we would get one value. We can’t do this though, so we collect a sample instead. We try to collect the sample in such a way that we get a value close to the population parameter.
In this course, we make a big distinction between the population parameter and the sample statistic. A sample statistic is something that we calculate from a sample, whereas the population parameter is the value that we would get if we had the whole population.
Most of this course is centred around using the sample statistic to get an idea of what the population parameter might be.
The “centre” is a strange concept.
We want to summarize the data using a few words/numbers.
I prefer the second description here - if I were to make a prediction, I want my predicted value to be close to the actual value. For example, I might want to design something that fits the average person’s height. I must make a prediction about the height of the people that will use it. My prediction is chosen to minimize how far off I would be, which means that I want to be near the “centre”.

### Centre 1: The Median
The median (\(M\)) is the midpoint of a distribution: half the observations are smaller and the other half are larger than it. To find the median:
Basically, the median is the middle value. With odd \(n\), the middle value is one of the observations, but when \(n\) is even we have to go in between two values.
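The rule above can be sketched as a small R function (the name `course_median` is my own, not from the course; R’s built-in `median()` does the same thing for simple cases):

```r
# Median by the rule above: sort the data, then take the middle value,
# or the average of the two middle values when n is even.
course_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]       # odd n: a single middle observation
  } else {
    mean(x[n / 2 + 0:1]) # even n: average of the two middle values
  }
}

course_median(c(5, 1, 3))    # 3
course_median(c(1, 2, 3, 4)) # 2.5
```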
The distribution of needle lengths is how species of pine trees are characterized. The following data are the lengths (in cm) of a sample of 15 needles taken at random from different parts of several Aleppo pine trees (Southern California). What is the median length?
7.20 7.60 8.50 8.50 8.70 9.00 9.00 9.30
9.40 9.40 10.20 10.90 11.30 12.10 12.80
We also have the lengths (in cm) of 18 needles from trees of the noticeably different Torrey pine species. What is the median length for these 18 pine needles? The ordered data are:
21.20 21.60 21.70 23.10 23.70 24.20 24.20 25.50 26.60
26.80 28.90 29.00 29.70 29.70 30.20 32.50 33.70 33.70
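To check the hand calculations, both medians can be computed in R (values copied from the lists above):

```r
aleppo <- c(7.20, 7.60, 8.50, 8.50, 8.70, 9.00, 9.00, 9.30,
            9.40, 9.40, 10.20, 10.90, 11.30, 12.10, 12.80)
torrey <- c(21.20, 21.60, 21.70, 23.10, 23.70, 24.20, 24.20, 25.50, 26.60,
            26.80, 28.90, 29.00, 29.70, 29.70, 30.20, 32.50, 33.70, 33.70)
median(aleppo)  # 9.3: with n = 15 (odd), this is the 8th ordered value
median(torrey)  # 26.7: with n = 18 (even), the average of the 9th and 10th values
```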
Set 1: 1 2 3 4 5 6
Set 2: 1 1 1 6 6 6
Set 3: -2000 -1000 3 4 5000 60000000
All three have the same median!
Do the “centres” make sense? Do they provide a good summary?
These three data sets demonstrate that the median only depends on the middle two numbers when \(n\) is even. For the first two data sets, the median does seem to describe the centre of the data well.
The last data set is a little bit different from the others and the median might not be enough information. This will come up again and again: if your data set is “nice” (unimodal and no clear outliers), then summary statistics work well; a simple number can describe a simple data set. A complex data set needs more complicated numbers.
The mean is defined as:
\[ \bar x = \frac{1}{n}(x_1 + x_2 + ... + x_n)= \frac{1}{n}\sum_{i=1}^nx_i \]
For example, the mean of (1, 2, 3, 4) is \[ \frac{1}{4}(1 + 2 + 3 + 4) = \frac{10}{4} = 2.5 \] (this also happens to be the median!)
The mean has a few interesting interpretations:
Create a couple examples yourself and find the mean! If you create a list of values, you can use R to check your work as follows:
my_values <- c(1, 2, 2, 3, 4, 5, 6.3212, 3)
mean(my_values)
[1] 3.29015
The name my_values could have been anything that is just letters, numbers, and underscores (no spaces, can’t start with a number). It’s just the name of an object in R. my_values is a vector; that is, it’s a collection of values. The vector is created with the c() function, which means “concatenate”, or “put together”. To run this code, create a new R notebook, insert an “R cell”, and copy/paste the code above into that cell. Either hit the “play” button, or Alt+Enter (Option+Enter on a Mac). Alternatively, you can open up an R script in RStudio if you have it installed on your computer.
A word of caution: R sometimes calculates the median in a different way than we do in the course. Check your work with R, but do it by hand when asked!
Consider the following sets of data:
Set 1: 1, 2, 3, 4, 5, 6
Set 2: 1, 2, 3, 4, 5, 60
Which do you think is the correct measure of the centre?
Unlike the median, which only depends on the middle value, the mean uses information from all of the values. This means that if there’s an outlier or a misrecorded point, the mean will reflect this. The median will not change, though, which is why we call it robust. Sometimes this is what we want and sometimes it is not.
A common example of when the mean is not what we want is income. The lowest possible income is zero, but there is no maximum income, so incomes tend to be right skewed. The right skew affects the mean a lot more than it affects the median, so in this case the median is a better measure of the most common levels of income.
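A tiny made-up income example (numbers are mine, in thousands of dollars) shows the effect:

```r
# One very high earner drags the mean up but barely moves the median.
income <- c(20, 25, 30, 35, 40, 45, 50, 400)
mean(income)    # 80.625 - pulled toward the extreme value
median(income)  # 37.5   - stays near the typical values
```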
shiny::runGitHub(repo = "DBecker7/DB7_TeachingApps",
    subdir = "Apps/MeanLessMeansLeft")
The app above will be used in class, so do not worry if you’re unable to run it. It demonstrates that when the mean is less than the median, the distribution is left skewed. The reason for this is that extra weight in the skewed direction affects the mean more than it affects the median.
Set 1: 1 1 1 5 5 5
Set 2: 1 2 3 3 4 5
Set 3: 1 3 3 3 3 5
These three data sets all have the same mean and median, but just looking at them shows that they are different collections of numbers. The first set only has two unique values, but those values are relatively far away from each other compared to the other sets. The second set is a more even spread from one to five. The third set has four values equal to the mean and two values that may be outliers.
To me, the first set looks like it’s the most variable because all of the values are far away from both the mean and the median. The second set has a smaller variance because there are values closer to the mean, and I expect the last set to have the lowest variance because most of its values are actually equal to the mean.
The formula for the variance, which will be introduced next, matches this intuition. However, many students have different intuitions about which has the most variance - those are valid but harder to quantify.
Consider set 1, which has a mean of 3:
1 1 1 5 5 5
The variance is the average squared distance to the mean.
We are basically (but not quite) looking at the average deviation from the mean. We want that deviation to be positive and there are several ways to do this. We have settled on squaring the numbers for the same reason we drive on the right side of the road in Canada: it’s just convention. There are benefits to using the absolute distance from the mean, but there are many mathematical advantages to squaring the values first.
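For the first data set above, here is how the two choices compare (dividing by \(n\) here just to show the deviations; the variance formula, introduced next, divides by \(n-1\)):

```r
x <- c(1, 1, 1, 5, 5, 5)  # mean is 3, so every deviation is +2 or -2
mean(abs(x - mean(x)))    # mean absolute deviation: 2
mean((x - mean(x))^2)     # mean squared deviation: 4
```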
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2 \]
We use \(n-1\) because of math reasons.
The easiest way to calculate this is to put it in a table:
\(i\) | \(x_i\) | \(x_i - \bar x\) | \((x_i - \bar x)^2\) |
---|---|---|---|
1 | 1 | -2 | 4 |
2 | 1 | -2 | 4 |
3 | 1 | -2 | 4 |
4 | 5 | 2 | 4 |
5 | 5 | 2 | 4 |
6 | 5 | 2 | 4 |
\(\sum\) | 18 | 0 | 24 |
The mean is 3, and the variance is 24/5 = 4.8.
In the table above, as before, the subscript \(i\) is just used to denote different observations. For example \(x_1\) is the first observation, \(x_2\) is the second observation in our data, and so on (this ordering is arbitrary).
In order to calculate the variance, we must first know the mean, and so it’s convenient to put this at the bottom of the table. We then square the deviations from the mean, add them up, and divide by \(n-1\). There are very good technical reasons why we divide by \(n-1\) that we won’t get into here. Come to my office and chat if you’d like to know more, or just ask ChatGPT!
As a quick explanation for \(n-1\), consider the variance of a single observation. It doesn’t vary! There isn’t enough information in a single data point to see how much the values vary. The \(n-1\) in the denominator enforces this - we can’t calculate the variance of one observation.
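Both points can be checked in R: the by-hand table result matches `var()`, and a single observation returns `NA`:

```r
x <- c(1, 1, 1, 5, 5, 5)
sum((x - mean(x))^2) / (length(x) - 1)  # 24/5 = 4.8, same as the table
var(x)                                  # 4.8 again
var(c(5))                               # NA: the n - 1 denominator is zero
```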
Note that the variance can be calculated in R as follows:
my_values <- c(1, 2, 2, 3, 4, 5, 6.3212, 3)
var(my_values)
[1] 3.050982
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2} \]
The standard deviation (or sd) is just the square root of the variance.
In addition, if we have two data sets and the variance of one is larger than the other, then the standard deviation is also larger. They’re the same thing, just in different units!
I will often refer to one when I mean the other. When I’m comparing standard deviations, I may call them variances because the same patterns will be there.
Here’s the R code:
sd(my_values)
[1] 1.746706
var(my_values)
[1] 3.050982
sd(my_values)^2 # the sd is the square root of the variance
[1] 3.050982
Set 1: 1 1 1 5 5 5
Set 2: 1 2 3 3 4 5
Set 3: 1 3 3 3 3 5
\(i\) | \(x_i\) | \(x_i - \bar x\) | \((x_i - \bar x)^2\) |
---|---|---|---|
1 | |||
2 | |||
3 | |||
4 | |||
5 | |||
6 | |||
\(\sum\) |
Fill out the table yourself, then try with R.
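After filling out the table by hand, a check in R might look like this (all three sets have mean 3):

```r
set1 <- c(1, 1, 1, 5, 5, 5)
set2 <- c(1, 2, 3, 3, 4, 5)
set3 <- c(1, 3, 3, 3, 3, 5)
c(mean(set1), mean(set2), mean(set3))  # all three means are 3
c(var(set1), var(set2), var(set3))     # variances: 4.8, 2, 1.6
```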
The IQR is very closely related to the median. But first, we must learn what quartiles are.
Consider the data:
1 2 3 4 5 6 7 8
The median of these data is 4.5; 50% of the data are to the left of this point. This is half the data. If, instead, we wanted a quarter of the data, we could find half of the half.
1 2 3 4 5 6 7 8
The algorithm we just used for computing the quartiles is not the only one! In R, there are NINE different ways to calculate the quartiles. You should stick to doing this by hand if you want to get the WeBWorK answers right.
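For instance, R’s `quantile()` function has a `type` argument that selects among those nine methods. The default does not match our hand rule, but `type = 2` (which averages at the discontinuities) does for this example:

```r
x <- 1:8
quantile(x, probs = c(0.25, 0.75), type = 7)  # R's default: 2.75 and 6.25
quantile(x, probs = c(0.25, 0.75), type = 2)  # 2.5 and 6.5, matching the hand rule
```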
Let’s use the following example:
1, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8, 8, 9, 10, 10, 11, 12
The quartiles give an excellent way to summarise data:
Q0 (min) | Q1 | Q2 (median) | Q3 | Q4 (max) |
---|---|---|---|---|
1 | 4.5 | 7 | 9.5 | 12 |
The five number summary just shows all five of the quartiles. Note that there are five quartiles, because Q0 (the minimum) is also one of them.
For practice, make sure you can calculate the median, and then the median of all the values to the left of it!
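R also has `fivenum()`, which returns a five-number summary based on Tukey’s hinges. For many data sets it matches the hand rule, but not always - another reason to compute by hand when asked:

```r
fivenum(1:8)  # 1.0 2.5 4.5 6.5 8.0 - matches the hand rule for these data
```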
The plot on the right shows the body masses for the Palmer Penguins.
The boxplot and the histogram both demonstrate the right skew of the data, but the boxplot is much more compact!
Take a moment to compare the two plots and make sure you can explain the skewness. Remember that 25% of the data are in each interval shown in the box plot!
The IQR is defined as: Q3 - Q1.
This is the second measure of spread that we will learn. The IQR is commonly used when we have highly skewed data or data with outliers. The sd measures the average squared deviation from the mean, whereas the IQR measures the middle 50% of the data.
Notice how this is not centered on the median. Consider the following data:
my_values <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 7, 8, 10)
length(my_values) # number of observations
[1] 20
hist(my_values)
The median of these should be at position \((n+1)/2 = (20 + 1)/2 = 10.5\) (I used the length() function in R to count the number of observations for me). This value is halfway between 3 and 3, meaning it’s 3. Q1 is the median of the first 10 data points (we don’t include the median, but this doesn’t matter here), which is at position 5.5, giving us a value of 2. Q3 is 5.5 positions from the end, which is 4.5. Thus the IQR is \(4.5 - 2 = 2.5\).
First, does this make sense to you? Does 2.5 sound like a reasonable width for the middle 50%?
Now consider that the distribution is clearly skewed to the right. This affects the variance a lot, but the IQR would have been the same no matter what the first 4 or last 4 values were.
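A quick demonstration with the data above, changing only the largest value (`type = 2` in `quantile()` matches the hand rule here):

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 7, 8, 10)
y <- replace(x, 20, 1000)  # make the largest value extreme

c(var(x), var(y))                     # the variance explodes
quantile(x, c(0.25, 0.75), type = 2)  # 2 and 4.5
quantile(y, c(0.25, 0.75), type = 2)  # unchanged, so the IQR is still 2.5
```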
In this class, we use a rule of thumb for identifying outliers. Anything that is below \(Q1 - 1.5\times IQR\) or above \(Q3 + 1.5\times IQR\) is considered an outlier.
This rule of thumb is not based on any mathematical derivation; it just seems to work in most situations.
The idea is that the IQR gives a measure of spread, and the median gives the measure of the centre, so anything too far from the centre is an outlier. We use the spread to figure out how far away from the centre we are willing to accept. This will show up several times in this course. We’ve seen it in this example for the IQR and median because this is simple and easy to interpret.
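A sketch of the rule as a function (the helper name `flag_outliers` is mine; the quartiles are passed in so you can use the hand rule to find them):

```r
# Flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
flag_outliers <- function(x, q1, q3) {
  iqr <- q3 - q1
  x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
}

# Exam scores worked later in these notes: Q1 = 72, Q3 = 88, fences 48 and 112.
flag_outliers(c(64, 70, 72, 77, 77, 79, 81, 88, 89, 99), q1 = 72, q3 = 88)
# numeric(0): no outliers
```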
Most of the rest of this course will be spent looking at something similar for the mean. We will still use this idea of the centre plus or minus some measure of the spread, but will incorporate information about the sample and assumptions about the population that allow us to make much stronger conclusions beyond simply checking if something is an outlier.
We saw the same thing a couple of times throughout this lesson. We saw measures of centre that try to describe the middle of a distribution and centres of spread that tell us how spread out the data are. The mean and the sd are intrinsically linked, and the median and IQR are intrinsically linked.
We also saw the rule-of-thumb that uses the IQR for finding outliers: the median plus-or-minus some number times the spread. You better believe that this idea will show up again later in this course!
Boxplots are a visual representation of the five number summary. These can be very small while still showing the shape of our data. However, they only work well for unimodal data - there isn’t a good way to show a bimodal distribution on a boxplot. On the other hand, it is very easy to plot two boxplots side-by-side for two different data sets in order to compare the distributions.
For assignments and exams, be ready to calculate any of these values and compare the mean/median and sd/IQR. Also be ready to compare the five number summary to a boxplot.
Exercises
164.00 173.00 176.10 236.10 251.30 270.50 270.50
272.40 282.20 288.80 290.70 300.60 327.20 329.00
332.10 351.70 358.20 362.00 448.90 478.70 740.20
Use the hist() function to create a histogram with breaks = 10. What do you see? (Note that this example uses base R rather than ggplot2 because it has simpler code - ggplot2 has more flexibility, but that flexibility isn’t necessary here.) Then use the boxplot() function to create a boxplot (you do not need the breaks = 10 part of the code). Compare this to the histogram. Also comment on any points that stand out (when there are outliers, R shows \(Q1 - 1.5IQR\) rather than Q0).
silk_stress <- c(164.00, 173.00, 176.10, 236.10, 251.30, 270.50, 270.50,
                 272.40, 282.20, 288.80, 290.70, 300.60, 327.20, 329.00,
                 332.10, 351.70, 358.20, 362.00, 448.90, 478.70, 740.20)
hist(silk_stress, breaks = 10)
Deep-sea sediments. Phytopigments are markers of the amount of organic matter that settles in sediments on the ocean floor. Phytopigment concentrations in deep-sea sediments collected worldwide showed a very strong right-skew. Of two summary statistics, 0.015 and 0.009 gram per square meter of bottom surface, which one is the mean and which one is the median? Explain your reasoning.
Glucose levels. People with diabetes must monitor and control their blood glucose level. The goal is to maintain a “fasting plasma glucose” between approximately 90 and 130 milligrams per deciliter (mg/dl). The data tables below give the fasting plasma glucose levels for two groups of diabetics five months after they received either group instruction or individual instruction on glucose control.
I provide the data as vectors in R, but you don’t need R for this question (it’s good practice to do it both ways).
group <- c(78.00, 95.00, 96.00, 103.00, 112.00, 134.00, 141.00, 145.00, 147.00,
           148.00, 153.00, 158.00, 172.00, 172.00, 200.00, 255.00, 271.00, 359.00)
individual <- c(128.00, 128.00, 158.00, 159.00, 160.00, 163.00, 164.00, 188.00, 195.00,
                198.00, 220.00, 221.00, 223.00, 226.00, 227.00, 283.00)
(Hint: two groups can be compared side-by-side with boxplot(variable_1, variable_2).) Once you’re very certain that you know the answer (you don’t learn anything unless you put effort in, so do the hard thing), uncomment the commented code (i.e., remove the # at the start of the last two lines) and see if you were right!
g1 <- c(1, 2, 3, 4, 5, 6)
g2 <- c(4, 3, 5, 2, 3, 4)
g3 <- c(6, 7, 4, 7, 8, 9, 5, 6)
g4 <- c(7, 4, 5, 7, 3, 6, 7)
g5 <- c(7, 7, 7, 7, 5, 12)
g6 <- c(2, 3, 4)
g7 <- c(4, 3, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0)
all_g <- c(g1, g2, g3, g4, g5, g6, g7)
all_means <- c(mean(g1), mean(g2), mean(g3), mean(g4), mean(g5), mean(g6), mean(g7))
# sd(all_g)
# sd(all_means)
The Ecological Fallacy is that data look less variable if we only look at the average values in different groups.
For example, Statistics Canada releases information on the average household income for people who are 16-24 years old, 25-34, etc. The variance of these averages is NOT related to the actual variance of the values. When we learn more about statistical tests, we learn just how important the variance is - getting it wrong by not acknowledging that we’re working with means is a HUGE problem!
These variables are much harder to quantify with a single summary statistic. Instead, we usually give a table of their counts or just draw a plot. There is no concept of “mean” that applies to all categorical variables, and there is especially no concept of “variance”. The one exception is the mode (the most commonly observed category), but on its own this gives us very little information about the “centre” or “spread” of the data.
Instead, we simply show as few numbers as we can while still displaying enough information:
female male
165 168
Adelie Chinstrap Gentoo
152 68 124
 | Adelie | Chinstrap | Gentoo |
---|---|---|---|
female | 73 | 34 | 58 |
male | 73 | 34 | 61 |
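Counts like the ones above come from R’s `table()` function: one vector gives one-way counts, and two vectors give a two-way contingency table. A toy sketch (made-up values, not the penguin data):

```r
sex <- c("female", "male", "male", "female", "male")
species <- c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo")
table(sex)            # one-way counts of each category
table(sex, species)   # two-way contingency table
```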
The following questions are added from the Winter 2024 section of ST231 at Wilfrid Laurier University. The students submitted questions for bonus marks, and I have included them here (with permission, and possibly minor modifications).
First we will re-order the numbers from smallest value to largest value. 19, 21, 23, 25, 27, 29, 36, 38, 39, 42.
Next, we find the quartiles: Q1 is the median of the lower half (19, 21, 23, 25, 27), so Q1 = 23, and Q3 is the median of the upper half (29, 36, 38, 39, 42), so Q3 = 38. To determine the IQR, we subtract Q1 from Q3.
IQR = Q3 - Q1 = 38 - 23 = 15
Therefore, the Interquartile range for this set of data is 15.
Answer: d
Firstly, put data in numerical order: 64, 70, 72, 77, 77, 79, 81, 88, 89, 99
Secondly, find the five-number summary: Q0 (min) = 64, Q1 (25%) = 72, Q2 (median) = 78 ((77+79)/2) , Q3 (75%) = 88, Q4 (max) = 99
Thirdly, determine the interquartile range: IQR = Q3-Q1 = 88-72 = 16
Fourthly, use the outliers formula, below Q1-1.5(IQR) OR above Q3+1.5(IQR) : lower bound = 72-1.5(16) = 48 ; upper bound = 88+1.5(16) = 112
Fifthly, make a conclusion: since no values fall below 48 or above 112, there are no outliers for this data set.