Outliers (Edexcel GCSE Statistics)

Revision Note

Author: Roger

Expertise: Maths

Outlier Basics

What are outliers?

  • Outliers are extreme data values that do not fit with the general pattern of the data

  • Outliers in a data set can be due to

    • genuine extreme events

      • these are valid data, even if unusual

    • mistakes in the data collection

      • these should be identified and removed if possible

  • Outliers will affect some statistics that are calculated from the data

    • They can have a big effect on the mean

      • but usually little or no effect on the median

      • and usually none on the mode

    • The range can be changed completely by a single outlier

      • but the interquartile range is usually affected very little, if at all

  • When calculating the mean or the range it is important to decide whether any outlier(s) should be included in the calculations

    • An exam question will tell you whether to include outliers or not

      • But you may have to decide which value(s) are outliers

      • Look for values that are much bigger or smaller than the rest of the data set

  • In general outliers are

    • included if they are a valid piece of data

    • excluded if it is likely that they are erroneous

Worked Example

The following data was collected about the ages of a number of students at the time that they sat their GCSE Maths exam

3       13       15       15       15       15       16       16       16       16       16       57


(a) Suggest possible outliers in the data set.

Most students sit their GCSEs when they are 15 or 16
Some students sit them a bit younger, so the '13' is not very unusual
However the '3' and the '57' are definitely extreme data values compared to the rest of the set!

3 and 57 should probably be considered to be outliers


(b) For each outlier identified in part (a), suggest with a reason whether the data value should be kept in or excluded from the data set.

It is essentially impossible that a 3-year-old would be sitting a GCSE exam, so that data value is surely a mistake

On the other hand older people do sometimes sit GCSE exams, so the '57' shouldn't be excluded from the data set without further information

The '3' should be excluded. A 3-year-old would not be sitting a GCSE exam, so that value is almost certainly an error in the data collection.

The '57' should be kept. It is unusual for older people to sit GCSEs, but it is not impossible. So that may be a valid data value.
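
As a rough check on how much these decisions matter, the short Python sketch below recalculates some statistics for the ages in this worked example, first with all twelve values and then with the '3' excluded. It is only an illustration and not part of the exam method; the names `ages`, `cleaned` and `summary` are made up for this sketch.

```python
from statistics import mean, median, mode

# Ages from the worked example above
ages = [3, 13, 15, 15, 15, 15, 16, 16, 16, 16, 16, 57]

# Exclude the '3' (judged to be an error); keep the '57' (unusual but possible)
cleaned = [age for age in ages if age != 3]


def summary(data):
    """Return the mean (to 2 d.p.), median, mode and range of a list of numbers."""
    return {
        "mean": round(mean(data), 2),
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }


print(summary(ages))
print(summary(cleaned))
```

Excluding the '3' changes the mean from 17.75 to 19.09 and the range from 54 to 44, while the median only moves from 15.5 to 16 and the mode stays at 16.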

Calculating Outlier Boundaries

How do I calculate outlier boundaries?

  • It is sometimes possible to find outliers by inspection

    • i.e. look for unusually large or small data values

  • But on your Higher tier paper you will usually be expected to use outlier boundaries

    • An upper boundary

      • Any data value greater than this is considered an outlier

    • A lower boundary

      • Any data value less than this is considered an outlier

  • There are two ways of calculating outlier boundaries that you need to know

    • The formulas for these are not on the exam formula sheet, so you need to remember them

Using quartiles and interquartile range

  • This method uses the lower quartile (LQ), upper quartile (UQ) and interquartile range (IQR)

  • Use this if you already know the quartiles (or can easily calculate them)

    • This includes data presented on a box plot

  • The lower boundary is the LQ subtract one and a half times the IQR

    • $\text{small outlier} < \text{LQ} - 1.5 \times \text{IQR}$

  • The upper boundary is the UQ plus one and a half times the IQR

    • $\text{large outlier} > \text{UQ} + 1.5 \times \text{IQR}$

Using mean and standard deviation

  • This method uses the mean ($\mu$) and standard deviation ($\sigma$)

  • Use this if you already know $\mu$ and $\sigma$ (or can easily calculate them)

    • Or for data presented in a form that doesn't allow quartiles to be calculated

  • The lower boundary is three times the standard deviation less than the mean

    • $\text{small outlier} < \mu - 3\sigma$

  • The upper boundary is three times the standard deviation greater than the mean

    • $\text{large outlier} > \mu + 3\sigma$
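
The two boundary rules above can be sketched as a pair of small Python functions. This is only an illustration of the arithmetic, not something you would write in the exam; the function names and the numbers in the usage lines are made up for this sketch.

```python
def iqr_boundaries(lq, uq):
    """Outlier boundaries from the quartiles: LQ - 1.5 x IQR and UQ + 1.5 x IQR."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr


def sd_boundaries(mu, sigma):
    """Outlier boundaries from the mean and standard deviation: mu - 3 sigma and mu + 3 sigma."""
    return mu - 3 * sigma, mu + 3 * sigma


def is_outlier(value, lower, upper):
    """A value is an outlier if it lies strictly outside the two boundaries."""
    return value < lower or value > upper


# Illustrative (made-up) values, not taken from any exam question
print(iqr_boundaries(10, 20))   # (-5.0, 35.0)
print(sd_boundaries(50, 4))     # (38, 62)
print(is_outlier(36, -5, 35))   # True
```

Note that a value exactly equal to a boundary is not counted as an outlier, because the rules use "less than" and "greater than".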

Worked Example

The ages, in years, of a number of children attending a birthday party are given below:

 2,   7,   5,   4,   8,   4,   6,   5,   5,   29,   2,   5,   13

The following statistics have been calculated for that data set:

lower quartile: 4
median: 5
upper quartile: 7.5

Identify any outliers within the data set.

Start by calculating the interquartile range

IQR = UQ - LQ = 7.5 - 4 = 3.5

Now use $\text{LQ} - 1.5 \times \text{IQR}$ to calculate the lower outlier boundary

lower boundary $= 4 - 1.5 \times 3.5 = -1.25$

There are no values less than -1.25, so there are no 'small outliers'

And use $\text{UQ} + 1.5 \times \text{IQR}$ to calculate the upper outlier boundary

upper boundary $= 7.5 + 1.5 \times 3.5 = 12.75$

There are two values greater than 12.75 (13 and 29), so those are the 'large outliers'

The outliers are 13 and 29

You would not receive full marks for that answer if you did not show in your working that you had calculated the lower and upper outlier boundaries!
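
If you want to check this working on a computer, the following Python sketch repeats the same steps using the quartiles given in the question. The variable names are made up for this sketch.

```python
# Ages from the worked example, with the quartiles given in the question
ages = [2, 7, 5, 4, 8, 4, 6, 5, 5, 29, 2, 5, 13]
lq, uq = 4, 7.5

# Interquartile range and the two outlier boundaries
iqr = uq - lq
lower = lq - 1.5 * iqr
upper = uq + 1.5 * iqr

# Any value strictly outside the boundaries is an outlier
outliers = sorted(age for age in ages if age < lower or age > upper)

print(lower, upper, outliers)  # -1.25 12.75 [13, 29]
```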

Worked Example

Data was collected for the number of eggs, x, found in each of 25 American alligator nests. The data is summarised in the following way:

$\sum x = 1108 \qquad \sum x^2 = 51632$

(a) Calculate appropriate upper and lower boundaries for defining outliers in the data set. Give your answers correct to 2 decimal places.

There is no way to calculate quartiles from that data summary, so this is definitely a 'mean and standard deviation' question!

Start by calculating the mean
Divide $\sum x$ by the total number of data values (25)

$\mu = \frac{1108}{25} = 44.32$


Now use $\sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$ to calculate the standard deviation

$\sigma = \sqrt{\frac{51632}{25} - \left(\frac{1108}{25}\right)^2} = 10.050751\ldots = 10.05 \text{ (2 d.p.)}$

Now use $\mu \pm 3\sigma$ to find the outlier boundaries

$44.32 - 3 \times 10.05 = 14.17$

$44.32 + 3 \times 10.05 = 74.47$


The lower boundary is 14.17 (2 d.p.)
The upper boundary is 74.47 (2 d.p.)


(b) Write down the smallest and largest data values that would not be outliers.

Remember that the data is numbers of eggs in a nest
So the data values have to be whole numbers

The smallest value that would not be an outlier is 15
The largest value that would not be an outlier is 74
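
The whole calculation for parts (a) and (b) can be checked with the short Python sketch below. It is an illustration only, with variable names made up for this sketch. It keeps the unrounded standard deviation throughout, which gives the same boundaries to 2 decimal places as the working above.

```python
from math import sqrt, floor, ceil

# Summary statistics given in the question
n = 25
sum_x = 1108
sum_x2 = 51632

# Mean and standard deviation from the summary statistics
mu = sum_x / n
sigma = sqrt(sum_x2 / n - mu ** 2)

# Outlier boundaries: mean minus/plus three standard deviations
lower = mu - 3 * sigma
upper = mu + 3 * sigma

# Egg counts are whole numbers, so round inwards to the nearest whole number.
# (This assumes neither boundary is itself a whole number, which is true here.)
smallest_ok = ceil(lower)
largest_ok = floor(upper)

print(round(lower, 2), round(upper, 2))  # 14.17 74.47
print(smallest_ok, largest_ok)           # 15 74
```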


Author: Roger

Roger's teaching experience stretches all the way back to 1992, and in that time he has taught students at all levels between Year 7 and university undergraduate. Having conducted and published postgraduate research into the mathematical theory behind quantum computing, he is more than confident in dealing with mathematics at any level the exam boards might throw at you.