Maths Applications (Year 12) - Bivariate Data Analysis
Linear models are a fundamental and versatile tool in statistics and data analysis. They provide a systematic way to investigate and quantify the relationship between two variables. In this section, we look at what linear models are, how they are applied, and work through an example to illustrate their practical use.
Linear models are a class of statistical models that assume a linear relationship between an explanatory variable (often denoted as x) and a response variable (typically denoted as y). This means that the change in the response variable is proportional to changes in the explanatory variable. In essence, linear models provide a way to describe and predict how one variable changes as another variable changes.
Line of Best Fit
One of the key concepts in linear modelling is the least-squares method, which finds the best-fitting straight line through a scatterplot of data points. You do not need to know the exact mathematics of how the least-squares method is implemented, but it is important to know how to interpret the results. The line is typically represented by a linear equation:
y = mx + c
where:
y is the response variable
x is the explanatory variable
c is the intercept, or the predicted value of y when x is zero
m is the slope or gradient of the line
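For readers who are curious, the following is a minimal Python sketch of how a least-squares slope and intercept can be calculated. It is optional background rather than required knowledge, and the data points are made up for illustration; they are not the house price data. The function name least_squares_line is a hypothetical helper.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope (gradient)
    c = mean_y - m * mean_x  # intercept
    return m, c


# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = least_squares_line(xs, ys)
print(f"y = {m:.1f}x + {c:.1f}")
```

For this made-up data the slope is 0.6 and the intercept is 2.2, so each one-unit increase in x is associated with a predicted increase of 0.6 in y.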
Let us explore the example of Australian Median House Prices during the period 1997 to 2006.
The above table should remind you of the two-way frequency tables you have learnt about in previous chapters. Below we have plotted a scatterplot of the data and drawn the line of best fit using the least-squares method:
Using the least squares method, the equation for the line of best fit is given by:
We can make useful predictions using this model. If we assume that the same linear trend continues, the predicted median house price for the following year, 2007, is given by:
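Making a prediction from a linear model is a matter of substituting the chosen x value into y = mx + c. A small Python sketch of that substitution is shown here. The slope, intercept and x value are placeholder numbers chosen for illustration, not the coefficients of the house price model, and predict is a hypothetical helper name.

```python
def predict(m, c, x):
    """Predicted response from the line y = m*x + c."""
    return m * x + c


# Placeholder coefficients for illustration only
m = 2.5   # slope: predicted change in y per one-unit increase in x
c = 10.0  # intercept: predicted y when x is zero

x_new = 12
y_pred = predict(m, c, x_new)
print(y_pred)  # 40.0
```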
Coefficient of Determination
Another crucial element in linear modelling is the coefficient of determination, r². This statistic measures the proportion of the variation in the response variable that is explained by the linear model. It ranges from 0 to 1, with higher values indicating a stronger linear relationship. An r² value of 0.75, for example, suggests that 75% of the variation in the response variable can be explained by the variation in the explanatory variable within the observed data range.
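One common way to calculate the coefficient of determination is to compare the variation left over after fitting the line with the total variation in the response variable. The Python sketch below follows that approach, assuming a least-squares line is fitted first. The data are made up for illustration, and r_squared is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Fit the least-squares line y = m*x + c
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    # Variation not explained by the line (squared residuals)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    # Total variation in the response variable
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys)
print(f"{r2:.2f}")
```

For this made-up data the result is 0.60, so about 60% of the variation in y is explained by the linear model.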
Interpolation and Extrapolation
When using linear models for predictions, it is essential to distinguish between interpolation and extrapolation. Interpolation involves predicting values within the range of the observed data, which is typically reliable because it relies on the established linear relationship. Extrapolation, by contrast, involves predicting values outside the observed data range and can be risky. Returning to our Australian Median House Price example, if we were to predict the Australian median house price for the year 1994, which is outside our data range, we would get a value of:
Clearly, we could never have a negative median house price, but by extrapolating our results we get a value that does not make any sense. Linear models assume that the same linear relationship holds, which may not always be valid when moving far beyond the observed data.
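A simple check of whether a prediction is an interpolation or an extrapolation is to compare the x value with the smallest and largest observed x values. The Python sketch below does this for the observed years 1997 to 2006 from the example. The function name classify_prediction is a hypothetical helper.

```python
def classify_prediction(x, observed_xs):
    """Label a prediction at x as interpolation or extrapolation."""
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"


# Observed years in the example: 1997 to 2006 inclusive
observed_years = list(range(1997, 2007))

for year in (2001, 2007, 1994):
    print(year, classify_prediction(year, observed_years))
```

Here 2001 is an interpolation, while 2007 and 1994 are both extrapolations. The further a year lies from the observed range, the less the prediction should be trusted.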