### Using a Large Data Set

#### What is a large data set?

- As part of your course there is a large data set that you can use
- It contains lots of information
- You are not expected to memorise any results from the data
- You will have an advantage if you are familiar with the large data set
- Understand what the variables are
- Understand the terminology used
- Understand the context

- You will **not** get a copy of the large data set in your exam
- If you are required to calculate anything using the large data set, you will be given an extract within the question

#### What skills can I practise with a large data set?

- Cleaning data
- There might be missing data
- You could identify outliers and question their validity

- Sampling and hypothesis testing
- You can practise different methods of sampling using the data
- You could use a sample to test a hypothesis

- Statistical measures and diagrams
- You could calculate summary statistics for different variables
- You could create different diagrams
- You can interpret the summary statistics and diagrams (as it is real data you could explore the context behind the results)
- You could compare summary statistics and diagrams
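
The sampling practice described above can be tried out in code. The following is a minimal sketch, assuming a made-up list of readings standing in for one location and one time period; the data values and the function names `simple_random_sample` and `systematic_sample` are hypothetical and not part of the large data set.

```python
import random

# Hypothetical stand-in data: (day number, daily mean temperature in °C)
# for 184 days. These values are invented for illustration only.
readings = [(day, 10.0 + (day % 7) * 0.5) for day in range(1, 185)]

def simple_random_sample(data, n, seed=0):
    """Pick n items without replacement, each item equally likely."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(data, n)

def systematic_sample(data, n, seed=0):
    """Pick every k-th item after a random start, where k = len(data) // n."""
    k = len(data) // n
    start = random.Random(seed).randrange(k)  # random start in the first k items
    return data[start::k][:n]

srs = simple_random_sample(readings, 30)
sys_sample = systematic_sample(readings, 30)
print(len(srs), len(sys_sample))
```

Comparing the means of the two samples is one way to see how the sampling method can affect the results.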

#### Do I have to use spreadsheets and other technology?

- You will not be assessed on using spreadsheets
- However, it is a useful skill for your future career

- You could use technology to calculate the summary statistics and create the statistical diagrams
- This will help you to practise these skills whilst using real data
- Spreadsheets can calculate summary statistics
- In the exam you could use the statistics mode on your calculator
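
As an illustration of using technology for summary statistics, here is a short sketch using Python's built-in `statistics` module. Python is only one possible tool, and the sunshine values below are invented for the example rather than taken from the large data set.

```python
import statistics

# Hypothetical daily total sunshine values in hours (not real data)
sunshine_hours = [5.2, 0.0, 7.8, 3.1, 9.4, 6.0, 2.5]

mean = statistics.mean(sunshine_hours)
median = statistics.median(sunshine_hours)

# pstdev divides by n; stdev divides by n - 1.
# Check which formula your course uses before comparing with your calculator.
sd_divide_by_n = statistics.pstdev(sunshine_hours)
sd_divide_by_n_minus_1 = statistics.stdev(sunshine_hours)

print(round(mean, 2), median, round(sd_divide_by_n, 2))
```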

### Summary of the Large Data Set

#### What is the data about?

- The large data set consists of samples of weather data for eight locations over two different time periods
- The five UK locations are:
- Leuchars: town in Scotland
- Leeming: village in North Yorkshire
- Heathrow: hamlet in Greater London
- Hurn: village in Dorset (South West England)
- Camborne: town in Cornwall (South West England)

- The three international locations are:
- Beijing: capital city of China
- Perth: capital city of Western Australia (state of Australia)
- Jacksonville: city in Florida (state of USA)

- The two time periods are:
- May to October 1987
- May to October 2015

#### What variables are included in the large data set?

- Daily mean (air) temperature
- Measured in degrees Celsius (°C) given to 1dp
- Average of hourly temperature readings between 0900 and 0900 GMT (a 24-hour period)

- Daily total rainfall
- Measured in millimetres (mm) given to 1dp
- Measured for the 24 hours starting at 0900 GMT
- A trace of rain 'tr' is an amount less than 0.05mm

- Daily total sunshine
- Measured in hours (hr) given to 1dp

- Daily maximum relative humidity
- Given as a percentage to the nearest integer
- A reading above 95% is associated with mist and fog

- Daily mean windspeed and direction
- Mean speed is measured in knots (1 kn ≈ 1.15 mph), given to the nearest integer, and is also described using the Beaufort conversion (calm, light, etc.)
- Direction is measured in degrees, rounded to the nearest 10, and is also given as a cardinal direction (north, south, etc.)
- Averaged for 24 hours starting at 0000 GMT

- Daily maximum gust and direction
- Measured using the same units as windspeed
- The maximum instantaneous speed over the 24 hours

- Cloud cover
- Measured in Oktas (eighths of the sky covered by cloud)

- Daily mean visibility
- Measured in decametres (1 Dm = 10 m) horizontally

- Daily mean pressure
- Measured in hectopascals (1 hPa = 100 Pa = 1 millibar)
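
The unit conversions quoted above can be written as small helper functions. This is a sketch only: the function names are hypothetical, and the knots conversion uses the rounded factor of 1.15 given in these notes.

```python
KNOTS_TO_MPH = 1.15  # rounded factor quoted in the notes

def knots_to_mph(knots):
    """Convert a speed in knots to miles per hour."""
    return knots * KNOTS_TO_MPH

def decametres_to_metres(decametres):
    """Convert a visibility in decametres to metres (1 Dm = 10 m)."""
    return decametres * 10

def hectopascals_to_pascals(hpa):
    """Convert a pressure in hectopascals to pascals (1 hPa = 100 Pa)."""
    return hpa * 100

print(knots_to_mph(10))              # a 10 kn wind in mph
print(decametres_to_metres(2500))    # a visibility of 2500 Dm in metres
print(hectopascals_to_pascals(1013)) # a pressure of 1013 hPa in Pa
```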

#### Is the data complete?

- There are missing or unknown pieces of data
- These are listed as 'n/a' or '-'
- The daily total sunshine, daily mean windspeed and daily maximum gust are unknown for the first half of May 1987 for the UK locations
- The data should be cleaned before samples are taken

- The three international cities only contain data for:
- Daily mean temperature, daily total rainfall, daily mean pressure and daily mean windspeed
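
Cleaning the data before sampling could look something like the sketch below. It assumes each reading arrives as text, with 'n/a' or '-' marking a missing value as described above; the function names and the example values are hypothetical. It does not handle 'tr' entries for rainfall, which need a separate decision.

```python
MISSING_MARKERS = {"n/a", "-"}  # markers for missing data, as listed in the notes

def parse_reading(raw):
    """Return the reading as a float, or None if it is marked as missing."""
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None
    return float(text)

def clean(raw_values):
    """Drop missing readings so that only usable values remain."""
    parsed = [parse_reading(value) for value in raw_values]
    return [value for value in parsed if value is not None]

# Hypothetical extract of daily total sunshine readings (hours)
raw_sunshine = ["n/a", "n/a", "6.3", "-", "0.0", "11.2"]
cleaned = clean(raw_sunshine)
print(cleaned)
```

Sampling from `cleaned` rather than `raw_sunshine` avoids selecting a day that has no usable value.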

#### What are some of the important features?

- Consider which locations are closer to the equator
- Consider which locations are near a coast
- Jacksonville, Perth, Camborne, Hurn, Leuchars are near the coast

- Consider which locations are in each hemisphere
- Perth is in the southern hemisphere, so it has winter when the UK has summer

- Consider which variables are discrete and which are continuous
- Cloud cover is discrete

- You can use 0 or 0.025 for rainfall that is listed as 'tr'
- The Great Storm of 1987 happened on 15-16 October in the UK
- The wind speeds were high at this time
- The south of England was affected
- This will skew some variables (wind/gust/rainfall)
- This won't have much impact on some variables (sunshine/cloud cover)
- October in the UK is normally cloudy and has less sunshine

- Don't worry about remembering the exact dates of this but it is something to be aware of

- Consider the number of days in each month
- 30 days in June and September
- 31 days in May, July, August and October
- In total, each time period in the large data set (LDS) covers 184 days
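
Two of the points above lend themselves to a quick check in code: replacing 'tr' with a number, and counting the days from May to October. This is a minimal sketch; `rainfall_value` is a hypothetical helper, and the choice between 0 and 0.025 for a trace is left as a parameter.

```python
import calendar

def rainfall_value(raw, trace_value=0.025):
    """Convert a rainfall entry to a number, using trace_value for 'tr'."""
    if raw.strip().lower() == "tr":
        return trace_value
    return float(raw)

# Number of days from May (month 5) to October (month 10) in 1987
days = sum(calendar.monthrange(1987, month)[1] for month in range(5, 11))

print(rainfall_value("tr"), rainfall_value("tr", trace_value=0), days)
```

The day count agrees with 2 × 30 + 4 × 31 = 184.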

#### Worked Example

Using the large data set, Dylan collects data on the daily total sunshine in Leuchars from May to October 1987 by taking a random sample of 30 days.

(a)

Using your knowledge of the large data set, explain why Dylan will have to first clean the data before taking a sample.

(b)

Dylan calculates the mean value from his sample to be 25.3 hours. Using your knowledge of the large data set, explain how you know Dylan has made a mistake.
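
One way to approach part (b) is a simple range check. The sketch below assumes only that a day has 24 hours, so a daily total of sunshine, and therefore a mean of daily totals, cannot be larger than 24. The function name is hypothetical.

```python
MAX_HOURS_IN_DAY = 24

def is_plausible_sunshine_mean(mean_hours):
    """A mean of daily sunshine totals must lie between 0 and 24 hours."""
    return 0 <= mean_hours <= MAX_HOURS_IN_DAY

print(is_plausible_sunshine_mean(25.3))
```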
