Linear Regression
OVERVIEW
What is Linear Regression?
Linear regression is a type of supervised machine learning algorithm that is used to model the relationship between a dependent variable and one or more independent variables. The dependent variable, also called the target or response variable, is a continuous value, while the independent variables, also called the features or predictors, can be either continuous or categorical. It is called linear regression because it models the relationship between the input and output variables as a linear function.
The goal of linear regression is to find the best-fitting line or hyperplane that describes the relationship between the independent variables and the dependent variable. This line or hyperplane is used to make predictions for new input values. In linear regression, the line is represented by a linear equation: Y = b0 + b1*X, where Y is the dependent variable (output), X is the independent variable (input), b0 is the intercept, and b1 is the slope of the line. The coefficients of this equation are estimated from the data during the training phase. Linear regression can be divided into two main types: simple linear regression and multiple linear regression. Simple linear regression involves only one independent variable, while multiple linear regression involves two or more independent variables.
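As a minimal sketch of the equation above, the snippet below fits a simple linear regression (one predictor) with NumPy; the x and y arrays are made-up values used only for illustration.

```python
import numpy as np

# Made-up example data: one independent variable (x) and one dependent variable (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# np.polyfit with degree 1 returns the slope (b1) and intercept (b0) of the best-fit line
b1, b0 = np.polyfit(x, y, 1)

# Predict Y for a new input value using Y = b0 + b1 * X
y_new = b0 + b1 * 6.0
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}, prediction at X = 6: {y_new:.3f}")
```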
To find the best-fitting line, the linear regression algorithm uses a cost function that measures the difference between the predicted values and the actual values. The goal is to minimize this cost function, which can be done analytically with ordinary least squares or iteratively with an optimization technique such as gradient descent. The performance of a linear regression model is evaluated using various metrics, such as the mean squared error (MSE), R-squared, and root mean squared error (RMSE). These metrics measure the difference between the predicted values and the actual values of the dependent variable.
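To make the cost-minimization idea concrete, here is a small sketch of gradient descent applied to simple linear regression with mean squared error as the cost function; the data, learning rate, and iteration count are illustrative choices, not prescriptions.

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

b0, b1 = 0.0, 0.0          # start with zero intercept and slope
learning_rate = 0.01
n = len(x)

for _ in range(5000):
    y_pred = b0 + b1 * x                     # current predictions
    error = y_pred - y                       # prediction errors
    # Gradients of the MSE cost with respect to b0 and b1
    grad_b0 = (2.0 / n) * error.sum()
    grad_b1 = (2.0 / n) * (error * x).sum()
    # Move the coefficients a small step against the gradient
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

mse = ((b0 + b1 * x - y) ** 2).mean()
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, MSE = {mse:.4f}")
```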
Once the model is trained on a set of data, it can be used to predict the value of the output variable for new input values. Linear regression is commonly used to analyze and predict relationships between variables in various fields, such as finance, economics, social sciences, and engineering.
In layman's terms, linear regression is a method of finding a straight line or plane that best describes the relationship between two or more variables. It is like drawing a line through a scatterplot of data points so that the line is as close as possible to all the points. For example, imagine you want to predict someone's weight based on their height. You collect data on the height and weight of a group of people and plot the data on a graph. By using linear regression, you can draw a line that best describes the relationship between height and weight. This line can then be used to predict the weight of a person based on their height.
Key terms of significance in Linear Regression
Below are some of the key terms that are frequently used in the context of Linear Regression.
Dependent variable: The dependent variable, also known as the response variable or target variable, is the variable being predicted by the model.
Independent variable: The independent variable, also known as the predictor or feature variable, is the variable used to predict the value of the dependent variable.
Coefficient: A coefficient is a value that represents the strength and direction of the relationship between the independent variable and the dependent variable. In linear regression, the coefficients are estimated during the training phase and are used to create the linear equation that models the relationship between the variables.
Intercept: The intercept, also known as the bias term, is the value of the dependent variable when all the independent variables are equal to zero.
Residual: A residual is the difference between the actual (observed) value and the predicted value of the dependent variable. In other words, it is the error of the model for that observation.
Mean squared error (MSE): The mean squared error is a metric that measures the average of the squared differences between the predicted and actual values of the dependent variable. It is used to evaluate the performance of the model.
R-squared: R-squared is a metric that measures the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, with higher values indicating a better fit of the model.
Root mean squared error (RMSE): The root mean squared error is the square root of the mean squared error. It is a commonly used metric to evaluate the performance of regression models.
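A brief sketch of how these metrics can be computed in practice with scikit-learn, assuming y_true holds observed values and y_pred holds model predictions (both arrays are made up here purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values of the dependent variable
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.7])

residuals = y_true - y_pred                  # residual = observed - predicted
mse = mean_squared_error(y_true, y_pred)     # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error
r2 = r2_score(y_true, y_pred)                # R-squared

print(f"residuals: {residuals}")
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}, R^2 = {r2:.4f}")
```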
How does the algorithm work?
Below is a step-by-step approach to understanding how the linear regression algorithm works:
Collect and preprocess data: Gather data on the dependent and independent variables, and clean and preprocess the data to remove any errors or missing values.
Split data into training and test sets: Split the data into two parts: the training set and the test set. The training set is used to estimate the coefficients of the linear equation, and the test set is used to evaluate the performance of the model.
Define the model: Choose the appropriate linear regression model based on the number of independent variables and the relationship between them and the dependent variable. For example, choose simple linear regression for one independent variable, and multiple linear regression for two or more independent variables.
Estimate coefficients: During the training phase, the algorithm estimates the coefficients of the linear equation that minimize the difference between the predicted values and the actual values of the dependent variable. The most common method used is least squares regression.
Evaluate the model: Use the test set to evaluate the performance of the model. Calculate the mean squared error (MSE), root mean squared error (RMSE), and R-squared value to assess how well the model fits the data.
Make predictions: Once the coefficients are estimated, the linear equation can be used to make predictions for new input values. Simply plug in the values for the independent variables and solve for the dependent variable using the linear equation.
Refine the model: If the model's performance is not satisfactory, refine it by adjusting the hyperparameters or using a different type of regression model.
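The steps above map fairly directly onto a typical scikit-learn workflow. The sketch below uses synthetic data purely for illustration; the split ratio and the generated coefficients are arbitrary assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 1. Collect/preprocess: here we simply generate synthetic data with two predictors
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 3-4. Define the model and estimate coefficients (ordinary least squares)
model = LinearRegression()
model.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"intercept: {model.intercept_:.3f}, coefficients: {model.coef_}")
print(f"test MSE = {mse:.4f}, RMSE = {np.sqrt(mse):.4f}, R^2 = {r2_score(y_test, y_pred):.4f}")

# 6. Make a prediction for a new input value
print("prediction for [1.0, -0.5]:", model.predict([[1.0, -0.5]]))
```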
Overall, the linear regression algorithm works by estimating the coefficients of a linear equation that models the relationship between the independent variables and the dependent variable. It is an iterative process that involves preprocessing data, defining the model, estimating coefficients, evaluating the model, making predictions, and refining the model as needed.
Observed Values vs Fitted Values
Observed values are the actual values of the dependent variable (y) in the dataset that you have collected. They are the values that you use to estimate the coefficients of the linear equation during the training phase.
Fitted values, on the other hand, are the predicted values of the dependent variable (y) based on the estimated coefficients and the independent variables (x) in the dataset. Fitted values are obtained by plugging the independent variables into the linear equation and solving for the dependent variable.
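The distinction can be seen directly in a small sketch with made-up data: the observed values are the y values you collected, while the fitted values are what the estimated equation returns for the same inputs.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_observed = np.array([2.0, 4.1, 5.9, 8.2])   # observed values collected in the data

b1, b0 = np.polyfit(x, y_observed, 1)          # estimate slope and intercept
y_fitted = b0 + b1 * x                         # fitted values from the linear equation

print("observed:", y_observed)
print("fitted:  ", np.round(y_fitted, 2))
```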
Estimating Model Parameters
The line of best fit to the data is the line that minimizes the sum of the squared vertical distances (the residuals) between the line and the observed points, as formalized below.
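In symbols, for simple linear regression the estimated intercept and slope are the values that minimize the residual sum of squares (a standard least-squares formulation, stated here for reference):

```latex
(\hat{b}_0, \hat{b}_1) \;=\; \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2
```

This minimization has the well-known closed-form solution

```latex
\hat{b}_1 \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{b}_0 \;=\; \bar{y} - \hat{b}_1 \bar{x}
```

where x-bar and y-bar are the sample means of the independent and dependent variables.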
Anatomy of Regression Errors
In linear regression, the difference between the predicted values and the actual values of the dependent variable (y) is known as the error. Regression errors are used to evaluate the performance of a linear regression model.
Residual Sum of Squares (RSS): RSS is the sum of the squared residuals, where each residual is the difference between the actual value and the predicted value of the dependent variable. It represents the unexplained variation in the dependent variable that is not accounted for by the independent variables.
Explained Sum of Squares (ESS): ESS is the sum of the squared differences between the predicted values of the dependent variable and the mean of the dependent variable. It represents the variation in the dependent variable that is explained by the independent variables.
Total Sum of Squares (TSS): TSS is the sum of the squared differences between the actual values of the dependent variable and the mean of the dependent variable. It represents the total variation in the dependent variable.
Mean Squared Error (MSE): This measures the average squared difference between the predicted values and the actual values. It is calculated as the RSS divided by the number of observations. The formula for MSE for a model with n observations is
MSE = RSS/n
R-squared (R²): This measures the proportion of variation in the dependent variable that is explained by the regression model. It is calculated as the ratio of ESS to TSS (equivalently, 1 − RSS/TSS for a model fitted with an intercept, since ESS + RSS = TSS in that case). The formula for R-squared is:
R² = ESS/TSS
R-squared takes values between 0 and 1, where 0 indicates that the regression model explains none of the variation in the dependent variable, and 1 indicates that the regression model explains all of the variation.
Root Mean Squared Error (RMSE): This is the square root of the MSE and is a measure of the average difference between the predicted values and the actual values. The formula for RMSE is:
RMSE = √MSE
These regression errors provide a way to evaluate the performance of a linear regression model and assess how well it explains the variation in the dependent variable.
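As a check on the definitions above, the sketch below computes RSS, ESS, TSS and the derived metrics directly with NumPy on made-up observed and predicted values; note that the identity TSS = ESS + RSS (and hence R² = ESS/TSS = 1 − RSS/TSS) holds exactly only for ordinary least squares fits with an intercept.

```python
import numpy as np

# Made-up observed values and predictions from a fitted linear model
y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.7])
n = len(y_actual)

rss = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
ess = np.sum((y_pred - y_actual.mean()) ** 2)     # explained sum of squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares

mse = rss / n
rmse = np.sqrt(mse)
r_squared = 1 - rss / tss                         # equals ESS/TSS for OLS with intercept

print(f"RSS = {rss:.3f}, ESS = {ess:.3f}, TSS = {tss:.3f}")
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R^2 = {r_squared:.3f}")
```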
Key Assumptions of a Linear Regression Model
The four main assumptions of a linear regression model are:
Linearity: The relationship between the dependent variable and the independent variables is linear. This means that the change in the dependent variable is proportional to the change in the independent variable.
Independence: The observations in the dataset are independent of each other. This means that the value of one observation does not depend on the value of another observation.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variables. This means that the spread of the residuals is the same for all values of the independent variables.
Normality: The errors are normally distributed, with a mean of zero. This means that the residuals follow a normal distribution and the mean of the residuals is zero.
These assumptions are important because if any of them are violated, it can affect the accuracy of the model and the validity of the statistical inference. Therefore, it is important to check these assumptions before using linear regression and take appropriate measures to address any violations if they are present.
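Several of these assumptions can be checked informally once a model is fitted. The sketch below, on made-up data, fits an ordinary least squares model and inspects the residuals: a residuals-versus-fitted plot for linearity and homoscedasticity, and a Shapiro-Wilk test from SciPy for normality; the choice of test and the synthetic data are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Illustrative data with two predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=150)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Normality: Shapiro-Wilk test on the residuals (null hypothesis = normally distributed)
stat, p_value = stats.shapiro(residuals)
print(f"residual mean = {residuals.mean():.4f}, Shapiro-Wilk p-value = {p_value:.3f}")

# Linearity / homoscedasticity: residuals should scatter evenly around zero
plt.scatter(fitted, residuals)
plt.axhline(0.0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```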
The Problem of Multicollinearity
Multicollinearity is a challenge that can occur in linear regression when the independent variables have a strong correlation with one another. This can create problems in accurately estimating the relationship between the independent variables and the dependent variable. The presence of multicollinearity can make it challenging to interpret the coefficients of the independent variables, as they can become unstable and less reliable. It can also make it harder to determine which independent variable is driving the relationship with the dependent variable, potentially leading to misleading interpretations. To tackle this issue, one can either remove highly correlated independent variables from the model or use techniques such as principal component analysis or ridge regression. Addressing multicollinearity is crucial in ensuring the accuracy and validity of linear regression models.
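One common way to detect multicollinearity is the variance inflation factor (VIF), computed below with a helper from statsmodels; the data are synthetic, and the rule of thumb that a VIF above roughly 5-10 signals a problem is a convention rather than a hard rule.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors where x2 is nearly a copy of x1 (strong collinearity)
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF for each predictor: how much its variance is inflated by correlation
# with the other predictors (a VIF of 1 means no collinearity at all)
for i, col in enumerate(X.columns):
    vif = variance_inflation_factor(X.values, i)
    print(f"{col}: VIF = {vif:.2f}")
```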
Limitations of Linear Regression
A linear regression model has a number of drawbacks, some of which are described below:
Linearity assumption: Linear regression assumes that the relationship between the independent and dependent variables is linear. However, in real-world scenarios, this may not always be the case, and other nonlinear relationships may exist.
Limited dependent variable range: Linear regression assumes that the dependent variable has a continuous and unlimited range. However, in some cases, the dependent variable may be limited in range, such as binary (0/1) or count data.
Sensitivity to outliers: Linear regression is sensitive to outliers, which are data points that are significantly different from other data points in the dataset. Outliers can have a significant impact on the model, leading to inaccurate results.
Multicollinearity: Linear regression assumes that the independent variables are not highly correlated with each other. When multicollinearity is present, it can lead to unstable and unreliable coefficient estimates.
Overfitting: Linear regression can be prone to overfitting, where the model fits too closely to the training data and does not generalize well to new data.
Underfitting: Linear regression can underfit the data if the model is too simple or the sample size is too small, which can lead to poor performance and an inability to capture the underlying patterns in the data.
Non-normality of errors: Linear regression assumes that the errors are normally distributed. If the errors are not normally distributed, it can affect the performance of the model.
Limited applicability: Linear regression is limited to modeling linear relationships between variables, and may not be suitable for modeling complex relationships or non-linear patterns in the data.
It is important to consider these limitations when using linear regression and to choose an appropriate model based on the nature of the data and the research question.
How can Linear Regression be leveraged to study traffic incidents and fatalities?
Linear regression can be used to analyze the relationship between traffic incidents or fatalities and various factors such as weather, road conditions, driver behavior, or vehicle characteristics. By collecting data on these variables and using linear regression to model their relationship with traffic incidents or fatalities, we can gain insights into which factors are most strongly associated with these outcomes and potentially develop interventions to reduce them. For example, if the analysis reveals that a particular road design feature is associated with a higher rate of traffic incidents or fatalities, this information could be used to improve the design of future roads and highways.
Linear regression can also be used to estimate certain aspects of a potential crash from variables describing historical driver parameters, such as claim history and driving records. However, it is important to note that the accuracy of the predictions will depend on the quality and completeness of the data and the appropriateness of the model assumptions, such as the linearity assumption and the absence of multicollinearity.
AREA OF INTEREST
Using linear regression to model the Impact score for a crash using information such as the Average Lane Deviation Score, Driver Claim History Rating, number of years without a claim, and Gender can provide valuable insights into the factors that contribute to traffic incidents and fatalities. By analyzing the relationships between these variables and the total impact of a crash, we can identify which factors are most strongly associated with the severity of crashes and inform strategies for reducing the overall impact of traffic incidents. Additionally, this analysis can help inform decisions about how to allocate resources and prioritize interventions to improve road safety. However, it is important to note that linear regression has its limitations, and it is crucial to carefully consider the assumptions and potential sources of bias in the analysis.
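As a purely illustrative sketch of how such a model might be set up, the code below assumes a hypothetical dataset (crash_data.csv) with columns named lane_deviation_score, claim_history_rating, years_without_claim, gender, and impact_score; these file and column names, and the one-hot encoding of gender, are assumptions made for the example rather than a description of any real dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical dataset and column names -- assumptions for illustration only
df = pd.read_csv("crash_data.csv")
X = pd.get_dummies(
    df[["lane_deviation_score", "claim_history_rating", "years_without_claim", "gender"]],
    columns=["gender"], drop_first=True,   # encode the categorical gender variable
)
y = df["impact_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Coefficients indicate how each factor is associated with crash impact,
# subject to the linearity and multicollinearity caveats discussed above
print(dict(zip(X.columns, model.coef_.round(3))))
print(f"test R^2: {r2_score(y_test, model.predict(X_test)):.3f}")
```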