6.1 Large Data Set

Using a Large Data Set

What is a large data set?

As part of your course there is a large data set that you can use
It contains lots of information
You are not expected to memorise any results from the data
You will have an advantage if you are familiar with the large data set
- Understand what the variables are
- Understand the terminology used
- Understand the context
You will not get a copy of the large data set in your exam
- If you are required to calculate anything using the large data set, you will be given an extract within the question

What skills can I practise with a large data set?

Cleaning data
- There might be missing data
- You could identify outliers and question their validity
Sampling and hypothesis testing
- You can practise different methods of sampling using the data
- You could use a sample to test a hypothesis
Statistical measures and diagrams
- You could calculate summary statistics for different variables
- You could create different diagrams
- You can interpret the summary statistics and diagrams (as it is real data you could explore the context behind the results)
- You could compare summary statistics and diagrams
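As a rough illustration of the summary statistics mentioned above, the sketch below uses Python's standard library rather than a spreadsheet. The engine sizes are made-up values standing in for an extract, not figures from the large data set, and the variable names are only illustrative.

```python
import statistics

# Made-up engine sizes in cc, standing in for an extract (NOT real data).
engine_sizes = [998, 1242, 1390, 1598, 1598, 1796, 1968, 1995, 2198, 2993]

mean_size = statistics.mean(engine_sizes)
median_size = statistics.median(engine_sizes)
data_range = max(engine_sizes) - min(engine_sizes)
# stdev() is the sample standard deviation (it divides by n - 1).
sd_size = statistics.stdev(engine_sizes)

print("mean:", mean_size)
print("median:", median_size)
print("range:", data_range)
print("standard deviation:", round(sd_size, 1))
```

The statistics mode on a calculator should give the same mean and median for the same list. Check which version of the standard deviation (dividing by n or by n - 1) your calculator is reporting before comparing.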

Do I have to use spreadsheets and other technology?

You will not be assessed on using spreadsheets
- However, it is a useful skill for your future career
You could use technology to calculate the summary statistics and create the statistical diagrams
- This will help you to practise these skills whilst using real data
- Spreadsheets can calculate summary statistics
- In the exam you could use the statistics mode on your calculator

Summary of the AQA Large Data Set

What is the data about?

The large data set for AQA comes from the UK Department for Transport Stock Vehicle Database (loosely referred to as “Cars” or “Vehicle data”)
- The full database is too large to use in its entirety, so AQA has extracted some of the data into a spreadsheet; this extract is what you should use when studying the relevant parts of the statistics course
Some of the data in the spreadsheet is coded, so keep a close eye on the information contained under “Definition of fields” and “Field Values”
- Beware! Because the codes are numbers, it may look as though you can calculate statistics such as the mean from them, but the results would be meaningless

e.g. “The mean of the propulsion type data is 2 so the mean propulsion type is diesel” does not make sense but it may be okay to say “diesel is the modal (most frequent) propulsion type of vehicles in the sample”
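The point about coded data can be checked with a short sketch. The list of codes below is made up for illustration; only the meanings of codes 1 (petrol) and 2 (diesel) are taken from these notes, and any other code would need to be looked up under “Field Values”.

```python
from collections import Counter

# Meanings of codes 1 and 2 come from the notes; other codes are left unlabelled here.
PROPULSION_LABELS = {1: "petrol", 2: "diesel"}

# Made-up sample of PropulsionTypeid values (NOT real data).
propulsion_codes = [1, 2, 2, 1, 2, 2, 1, 2, 8, 2]

# The mode is meaningful for coded (qualitative) data.
counts = Counter(propulsion_codes)
modal_code, modal_count = counts.most_common(1)[0]
modal_label = PROPULSION_LABELS.get(modal_code, f"code {modal_code} (see Field Values)")

# The mean can be computed, but it does not correspond to any propulsion type.
mean_code = sum(propulsion_codes) / len(propulsion_codes)

print("modal propulsion type:", modal_label, "with frequency", modal_count)
print("mean of the codes (meaningless):", mean_code)
```

Here the mean of the codes is 2.3, which is not even a valid code, whereas the mode identifies diesel as the most frequent type in this made-up sample.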

You are likely to be asked to “use your knowledge of the large data set” – this is where familiarity with its key features can be an advantage
- e.g. knowing that the mass of a vehicle includes an average 75 kg driver
Only mention things that can be justified from the dataset
- e.g. there is only one electric vehicle in the whole data set, so do not use or assume things you may have heard about electric cars in the news recently

What variables are included in the large data set?

Reference
- A unique number given to each individual vehicle by AQA to index the data
- Could be used to easily identify a vehicle and all its information
The first few pieces of data about the vehicles are qualitative
Make
- Only the five most frequently registered makes are included
- BMW, Ford, Toyota, Vauxhall and Volkswagen
PropulsionTypeid
- A data value of 1, 2, 3, 7 or 8 indicates the type of fuel powering the vehicle
  (4, 5 and 6 are not used in the AQA extracted dataset)
- 1 indicates a petrol-powered vehicle and 2 indicates a diesel-powered vehicle
- The full codes are listed under “Field Values”
BodyTypeid
- Also given by coded values defined in “Field Values”; these represent the style of vehicle, including (amongst others) convertibles and MPVs (multi-purpose vehicles)
GovRegion
- The database only includes cars registered in England (rather than the UK)
- The region of a vehicle is determined by the postcode of the current registered keeper
- The regions included are London, North West and South West
KeeperTitleid
- The last of the coded values defined in “Field Values” represents whether the current registered keeper is male, female, a company or unknown
The remaining data values are all quantitative
Engine size
- Size (capacity) of the engine measured in cubic centimetres (cc)
Year registered
- Vehicles included in the extract were first registered in either 2002 or 2016
- The introduction says the precise dates are
  - 3 June 2002 – 9 June 2002
  - 6 June 2016 – 12 June 2016
- Knowing that only a few days from each year are included gives an idea of the enormity of the full database
Mass
- Measured in kilograms (kg)
  - the mass of an average driver (75 kg) is included in the figures quoted
Emissions
- The remaining data values centre around the emissions from the vehicles
  - CO2 – carbon dioxide emissions, measured in g/km
  - CO – carbon monoxide emissions, measured in g/km
  - NOX – oxides of nitrogen emissions, measured in g/km
  - part – particulate emissions, measured in g/km
    (this measure only applies to diesel cars)
  - hc – hydrocarbon emissions, measured in g/km

Random number
- The spreadsheet generates a random number for each vehicle; this is not part of the data set, but it can be used to select vehicles at random when sampling
- Be aware that the random numbers change each time the spreadsheet is refreshed
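One common way to use a random-number column for simple random sampling is to sort the vehicles by their random number and take the first few. The sketch below imitates that idea in Python. The function name, the reference numbers 1 to 100 and the seed are all assumptions made for illustration, and this is not necessarily how the AQA spreadsheet is intended to be used.

```python
import random


def simple_random_sample(references, sample_size, seed=None):
    """Pick sample_size references by sorting on a random number per vehicle."""
    rng = random.Random(seed)
    # Attach a random number to each reference, like a random-number column.
    tagged = [(rng.random(), ref) for ref in references]
    # Sorting by the random number puts the vehicles in a random order.
    tagged.sort()
    return [ref for _, ref in tagged[:sample_size]]


# Hypothetical reference numbers 1 to 100 (NOT the real extract).
references = list(range(1, 101))
sample = simple_random_sample(references, 10, seed=1)
print(sample)
```

Fixing the seed plays a similar role to freezing the random numbers in a spreadsheet: the same sample is produced each time, so your working can be checked.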

Is the data complete?

Various data values are blank within the spreadsheet; others are 0 where this makes no sense (such as the mass of the car)
- There is no information as to why these occur but be aware they exist
- Under the “Definition of fields” tab there is some extra information about the emissions data
  - CO2 emissions are known for 83% of vehicles in the whole database
  - CO emissions are known for 82% of vehicles in the whole database
  - NOX emissions are known for 81% of vehicles in the whole database
  - Part emissions apply only to diesel vehicles (24% of the whole database)
  - Hc emissions are known for 51% of vehicles in the whole database
The above means that the data should be cleaned before samples are taken
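A minimal sketch of this cleaning step for the mass variable is shown below, assuming blank cells are read in as None and that a mass of 0 is treated as invalid. The masses are made-up values and the function name is only illustrative.

```python
def clean_masses(raw_masses):
    """Keep only masses that are present and greater than zero."""
    return [m for m in raw_masses if m is not None and m > 0]


# Made-up masses in kg; None stands for a blank cell (NOT real data).
raw_masses = [1340, None, 0, 1185, 1520, None, 1275, 0]

cleaned = clean_masses(raw_masses)
removed = len(raw_masses) - len(cleaned)

print("cleaned masses:", cleaned)
print("values removed:", removed)
```

Cleaning before sampling means every vehicle that could be selected has a usable value, so the sample size is not reduced afterwards.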

What are the key features I need to know about the data set?

These have been mentioned in the lists above but here is a summary of those we have seen used in exam and practice papers
- There are only five makes, and Ford was the most frequently registered
- There is only one electric vehicle in the data set
- Data is from a few days in summer and only in two years – 2002 and 2016
- The mass of a vehicle includes an average 75 kg driver
- Emissions data (CO2, CO and NOX) is only known for around 80% of the whole database
- Particulate emissions are only applicable to diesel cars

Worked example

Jay collects data on the masses of vehicles first registered in 2002, taking a random sample of size 30.

(a)

Use your knowledge of the large data set to explain why Jay should clean the data before taking a sample.

(b)

Jay’s calculations show the mean mass of a vehicle in his sample is 1340 kg.
Using your knowledge of the large data set, write down an estimate for the mean mass of an empty vehicle in the whole database, justifying your answer.

(a)

[Image: worked solution to part (a)]

(b)

[Image: worked solution to part (b)]
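One plausible line of reasoning for part (b), based on the key feature that each recorded mass includes a 75 kg driver, is to subtract that allowance from the sample mean. The arithmetic can be checked as below. This is a sketch of the calculation only, not necessarily the wording of the original solution, and the variable names are illustrative.

```python
# Each recorded mass includes an average 75 kg driver (from the notes above).
DRIVER_ALLOWANCE_KG = 75

# Mean mass from Jay's sample, as given in the question.
sample_mean_mass_kg = 1340

# Subtracting 75 from every value lowers the mean by exactly 75.
estimated_empty_mean_kg = sample_mean_mass_kg - DRIVER_ALLOWANCE_KG

print(estimated_empty_mean_kg)  # 1265
```

Remember that this is only an estimate: Jay's sample uses vehicles first registered in 2002, so it may not be representative of the whole database.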

Exam Tip

As vehicle emissions are frequently mentioned in news articles, be wary of confusing popular opinion with what can be justified using the information contained within the large data set.

AQA AS Maths: Statistics

Revision Notes