###### Scatterplots

Maths Applications (Year 12) - Bivariate Data Analysis

Zara Johnson

Scatterplots are mainly used to show the relationship (or association) between two numerical variables. A numerical variable takes values that can be counted or measured. An example of numerical data is the number of sales made annually at a retail store.

When representing relationships between two variables, the value being measured or observed will be represented on the y-axis. This is called the dependent variable. The variable that is changed is represented on the x-axis. This variable is called the independent variable.

For example, consider a study conducted to investigate the number of ice creams sold from Charlie's Ice Cream Van depending on the weather. In this study, the number of ice creams is being measured. Therefore, it is the dependent variable and so will be represented on the y-axis. The outdoor temperature (in degrees Celsius) is the variable that is being changed. Therefore, it is the independent variable and is represented on the x-axis.

In a scatterplot, data points are plotted as dots. This is why they are given the name 'scatterplots', as the dots will look scattered. Once all data points are plotted, the relationship between the two variables can be seen, and you can comment on trends in the graph. In this case, a trend could be described by saying ice cream sales increase as the temperature outside increases, so the graph has a positive correlation (or association).

A **line of best fit** can then be drawn to represent this correlation. A line of best fit is a straight line drawn so that it best represents the general trend of the data. This line can then be used to make predictions, even for values where we don't have a data point.
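A line of best fit can be drawn by eye, but it can also be calculated. The sketch below uses the least-squares method, which is one common way of calculating it and is not necessarily how the line in these notes was drawn. The function name `line_of_best_fit` and the temperature and sales figures are made up for illustration only.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, invented for this example (not from the study above)
temperatures = [18, 21, 24, 27, 30, 33]      # degrees Celsius (x-axis)
ice_creams = [95, 120, 150, 170, 205, 230]   # ice creams sold (y-axis)

slope, intercept = line_of_best_fit(temperatures, ice_creams)
print(f"ice creams sold = {slope:.2f} * temperature + {intercept:.2f}")
```

A positive slope here matches a positive correlation: as the temperature goes up, the predicted sales go up.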

For example, if we want to predict how many ice creams will be sold on a day where the temperature is 31°C, we would draw a straight vertical line from 31°C on the x-axis to the line of best fit. Then, where the drawn line hits the line of best fit, draw a horizontal line to the y-axis. Read where the horizontal line hits the y-axis to find the predicted number of ice creams sold. An example is shown with a pink line. In this example, we can predict that 223 ice creams will be sold on a 31°C day.
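Reading a prediction off the graph is the same as substituting the x-value into the equation of the line. The sketch below shows this, assuming a line with slope 9 and y-intercept -56. These two numbers are hypothetical: the notes do not give the equation of the line, and the values were chosen only so that the result agrees with the 223 read from the graph.

```python
def predict(slope, intercept, x):
    """Predicted y-value on the straight line y = slope * x + intercept."""
    return slope * x + intercept


# Hypothetical slope and intercept (assumed, not taken from the graph)
slope = 9.0
intercept = -56.0

predicted_sales = predict(slope, intercept, 31)  # temperature of 31 degrees Celsius
print(predicted_sales)
```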

It is important to understand the reliability of our prediction. If the data points are scattered across the graph with no apparent trend, making accurate predictions is difficult. Predictions are most reliable when most of our data points fall close to the line of best fit.
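One common way to put a number on how closely the points follow a straight line is Pearson's correlation coefficient, r. It lies between -1 and 1, and values near 1 or -1 mean the points sit close to a line. The sketch below is an illustration only: the function name `correlation` and both data sets are made up.

```python
import math


def correlation(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for this example
temperatures = [18, 21, 24, 27, 30, 33]
clear_trend = [95, 120, 150, 170, 205, 230]   # points close to a line
no_trend = [180, 90, 210, 120, 100, 190]      # points scattered

r_clear = correlation(temperatures, clear_trend)
r_scattered = correlation(temperatures, no_trend)
print(f"clear trend: r = {r_clear:.2f}")
print(f"no trend:    r = {r_scattered:.2f}")
```

The data set with the clear trend gives an r much closer to 1, so predictions from its line of best fit deserve more trust.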

You also need to consider the number of known data points you have. If you have lots of data points, you are more likely to have an accurate prediction. If there are very few data points, then there is more room for error.

Predictions that fall within the range of the known data points are also more likely to be accurate and reliable. This sort of prediction is called interpolation. Predictions made outside the range of known data points are less likely to be reliable. This is called extrapolation. In the example below, we can trust the prediction given by the pink line more than the extrapolations shown by the yellow lines.
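The difference between the two comes down to a simple check: is the x-value inside the range of the known data? A minimal sketch of that check is below. The function name `prediction_type` and the list of temperatures are hypothetical.

```python
def prediction_type(known_xs, x):
    """Classify a prediction at x as interpolation or extrapolation."""
    if min(known_xs) <= x <= max(known_xs):
        return "interpolation"
    return "extrapolation"


# Hypothetical temperatures (degrees Celsius) for which sales were recorded
temperatures = [18, 21, 24, 27, 30, 33]

print(prediction_type(temperatures, 31))  # inside the range 18 to 33
print(prediction_type(temperatures, 40))  # outside the range 18 to 33
```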