Linear Regression

OVERVIEW

What is Linear Regression? 

Linear regression is a supervised machine learning algorithm used to model the relationship between a dependent variable and one or more independent variables. The dependent variable, also called the target or response variable, is a continuous value, while the independent variables, also called features or predictors, can be either continuous or categorical. The algorithm is called linear regression because it models the relationship between the input and output variables as a linear function.

The goal of linear regression is to find the best-fitting line or hyperplane that describes the relationship between the independent variables and the dependent variable. This line or hyperplane is then used to make predictions for new input values. In linear regression, the line is represented by a linear equation: Y = b0 + b1*X, where Y is the dependent variable (output), X is the independent variable (input), b0 is the intercept, and b1 is the slope of the line. The coefficients of this equation are estimated from the data during the training phase. Linear regression can be divided into two main types: simple linear regression and multiple linear regression. Simple linear regression involves only one independent variable, while multiple linear regression involves two or more independent variables.

To find the best-fitting line, the linear regression algorithm uses a cost function that measures the difference between the predicted values and the actual values. The goal is to minimize this cost function, which is typically achieved through an optimization technique such as gradient descent. The performance of a linear regression model is evaluated using various metrics, such as the mean squared error (MSE), R-squared, and root mean squared error (RMSE). These metrics measure the difference between the predicted values and the actual values of the dependent variable.
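As an illustrative sketch of this process, the minimal example below fits a simple linear regression with batch gradient descent on the MSE cost. The data, learning rate, and iteration count are all assumptions chosen for illustration, not values from the text:

```python
import numpy as np

# Toy data generated from y = 2x + 1 (assumed for illustration)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

b0, b1 = 0.0, 0.0   # intercept and slope, initialized at zero
lr = 0.05           # learning rate (step size)
n = len(X)

for _ in range(5000):
    y_pred = b0 + b1 * X
    error = y_pred - y
    # Gradients of the MSE cost with respect to b0 and b1
    grad_b0 = (2.0 / n) * error.sum()
    grad_b1 = (2.0 / n) * (error * X).sum()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 2), round(b1, 2))  # converges toward the true values 1 and 2
```

Each iteration nudges the coefficients in the direction that reduces the MSE, which is exactly the "minimize the cost function" step described above.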

Once the model is trained on a set of data, it can be used to predict the value of the output variable for new input values. Linear regression is commonly used to analyze and predict relationships between variables in various fields, such as finance, economics, social sciences, and engineering.

In layman's terms, linear regression is a method of finding a straight line or plane that best describes the relationship between two or more variables. It is like drawing a line through a scatterplot of data points so that the line is as close as possible to all the points. For example, imagine you want to predict someone's weight based on their height. You collect data on the height and weight of a group of people and plot the data on a graph. By using linear regression, you can draw a line that best describes the relationship between height and weight. This line can then be used to predict the weight of a person based on their height.
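To make the height-and-weight example concrete, here is a minimal sketch using NumPy's `polyfit`. The measurements are invented for illustration, not real data:

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) measurements -- invented data
heights = np.array([150, 160, 165, 170, 175, 180, 185])
weights = np.array([50, 56, 61, 66, 70, 75, 80])

# Fit a straight line weight = b1*height + b0 by least squares
b1, b0 = np.polyfit(heights, weights, deg=1)

# Predict the weight of a hypothetical person who is 172 cm tall
predicted = b1 * 172 + b0
print(f"{predicted:.1f} kg")
```

The fitted line plays the role of the "line drawn through the scatterplot" described above, and plugging in a new height gives the predicted weight.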

Key terms of significance in Linear Regression

Below are some of the key terms that are frequently used in the context of Linear Regression.


How does the algorithm work? 

Below is a step-by-step approach to understanding how the linear regression algorithm works:

1. Preprocess the data: clean the dataset and prepare the independent and dependent variables.
2. Define the model: specify the linear equation that relates the inputs to the output.
3. Estimate the coefficients: fit the intercept and slope(s) to the training data.
4. Evaluate the model: measure its performance using metrics such as MSE, R-squared, and RMSE.
5. Make predictions: apply the fitted equation to new input values.
6. Refine the model: revisit the earlier steps as needed to improve performance.

Overall, the linear regression algorithm works by estimating the coefficients of a linear equation that models the relationship between the independent variables and the dependent variable. It is an iterative process that involves preprocessing data, defining the model, estimating coefficients, evaluating the model, making predictions, and refining the model as needed.
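This workflow can be sketched with scikit-learn, using toy data and the library's LinearRegression as a stand-in for the coefficient-estimation step (all numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Preprocess: assemble features X and target y (toy data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# 2. Define the model
model = LinearRegression()

# 3. Estimate the coefficients from the data
model.fit(X, y)

# 4. Evaluate the model on the training data
mse = mean_squared_error(y, model.predict(X))

# 5. Predict for a new input value
new_pred = model.predict(np.array([[6.0]]))

print(model.intercept_, model.coef_[0], mse, new_pred[0])
```

Step 6, refining the model, would loop back through these stages with different features or transformations as needed.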


Observed Values vs Fitted Values

Observed values are the actual values of the dependent variable (y) in the dataset that you have collected. They are the values that you use to estimate the coefficients of the linear equation during the training phase. 

Fitted values, on the other hand, are the predicted values of the dependent variable (y) based on the estimated coefficients and the independent variables (x) in the dataset. Fitted values are obtained by plugging the independent variables into the linear equation and solving for the dependent variable.

Estimating Model Parameters 

The line of best fit to the data is the line that minimizes the sum of the squared vertical distances (the residuals) between the line and the observed points:

RSS = Σ(yi − (b0 + b1*xi))²
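For simple linear regression, the least-squares estimates that minimize these squared distances have a well-known closed form, which also makes the distinction between observed and fitted values concrete. The sketch below uses noise-free toy data assumed for illustration:

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1 (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Closed-form least-squares estimates:
#   b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),   b0 = ȳ - b1 * x̄
x_bar, y_bar = x.mean(), y.mean()
b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
b0 = y_bar - b1 * x_bar

fitted = b0 + b1 * x      # fitted (predicted) values
residuals = y - fitted    # observed minus fitted

print(b0, b1)  # 1.0 and 2.0 for this noise-free data
```

Because the data are perfectly linear here, the residuals are all zero; with real data they would be small but nonzero.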

Anatomy of Regression Errors

In linear regression, the difference between the predicted values and the actual values of the dependent variable (y) is known as the error. Regression errors are used to evaluate the performance of a linear regression model. 

MSE = RSS/n, where RSS is the residual sum of squares (the sum of the squared errors) and n is the number of observations.

R² = ESS/TSS, where ESS is the explained sum of squares and TSS is the total sum of squares.

R-squared takes values between 0 and 1, where 0 indicates that the regression model explains none of the variation in the dependent variable, and 1 indicates that the regression model explains all of the variation.

RMSE = √MSE, which expresses the typical prediction error in the same units as the dependent variable.


These regression errors provide a way to evaluate the performance of a linear regression model and assess how well it explains the variation in the dependent variable.
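These metrics can be computed directly from predicted and actual values. The sketch below uses invented numbers and the identity R² = 1 − RSS/TSS, which equals ESS/TSS when the model includes an intercept:

```python
import numpy as np

# Invented actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.6])

n = len(y_actual)
rss = ((y_actual - y_pred) ** 2).sum()            # residual sum of squares
tss = ((y_actual - y_actual.mean()) ** 2).sum()   # total sum of squares

mse = rss / n
rmse = np.sqrt(mse)
r_squared = 1 - rss / tss  # = ESS/TSS for a model with an intercept

print(mse, rmse, r_squared)
```

A small MSE/RMSE and an R² close to 1 indicate, respectively, small prediction errors and a model that explains most of the variation.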

Key Assumptions of a Linear Regression Model

The four main assumptions of a linear regression model are:

1. Linearity: the relationship between the independent variables and the dependent variable is linear.
2. Independence: the observations, and therefore the errors, are independent of one another.
3. Homoscedasticity: the errors have constant variance across all values of the independent variables.
4. Normality: the errors are normally distributed.

These assumptions are important because if any of them are violated, it can affect the accuracy of the model and the validity of the statistical inference. Therefore, it is important to check these assumptions before using linear regression and take appropriate measures to address any violations if they are present.

The Problem of Multicollinearity

Multicollinearity is a challenge that can occur in linear regression when the independent variables have a strong correlation with one another. This can create problems in accurately estimating the relationship between the independent variables and the dependent variable. The presence of multicollinearity can make it challenging to interpret the coefficients of the independent variables, as they can become unstable and less reliable. It can also make it harder to determine which independent variable is driving the relationship with the dependent variable, potentially leading to misleading interpretations. To tackle this issue, one can either remove highly correlated independent variables from the model or use techniques such as principal component analysis or ridge regression. Addressing multicollinearity is crucial in ensuring the accuracy and validity of linear regression models.
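One common diagnostic for multicollinearity is the variance inflation factor (VIF). The sketch below builds synthetic predictors, two of which are nearly identical, and computes VIFs with plain NumPy; all data and thresholds are assumptions for illustration (a VIF above roughly 5–10 is often treated as a warning sign):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                          # unrelated predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R²) from regressing X[:, j]
    on the remaining columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])
# x1 and x2 get large VIFs; x3 stays near 1
```

Here the large VIFs for x1 and x2 flag exactly the kind of strong inter-predictor correlation described above, while the unrelated x3 is unaffected.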

Limitations of Linear Regression

A linear regression model has a number of drawbacks, some of which are described below:

1. It assumes a linear relationship between the variables, so it cannot capture more complex, non-linear patterns without transforming the features.
2. It is sensitive to outliers, which can pull the fitted line away from the bulk of the data.
3. Its estimates become unreliable when the assumptions discussed above are violated, for example under multicollinearity or heteroscedasticity.
4. It can overfit when the number of independent variables is large relative to the number of observations.

It is important to consider these limitations when using linear regression and to choose an appropriate model based on the nature of the data and the research question.

How can Linear Regression be leveraged with regard to traffic incidents and fatalities?

Linear regression can be used to analyze the relationship between traffic incidents or fatalities and various factors such as weather, road conditions, driver behavior, or vehicle characteristics. By collecting data on these variables and using linear regression to model their relationship with traffic incidents or fatalities, we can gain insights into which factors are most strongly associated with these outcomes and potentially develop interventions to reduce them. For example, if the analysis reveals that a particular road design feature is associated with a higher rate of traffic incidents or fatalities, this information could be used to improve the design of future roads and highways.


Linear regression can also potentially be used to predict certain aspects of a crash from historical driver data, such as claim history and driving records. However, it is important to note that the accuracy of the predictions will depend on the quality and completeness of the data and the appropriateness of the model assumptions, such as the linearity assumption and the absence of multicollinearity.


AREA OF INTEREST

Using linear regression to model the Impact score for a crash using information such as the Average Lane Deviation Score, Driver Claim History Rating, number of years without a claim, and Gender can provide valuable insights into the factors that contribute to traffic incidents and fatalities. By analyzing the relationships between these variables and the total impact of a crash, we can identify which factors are most strongly associated with the severity of crashes and inform strategies for reducing the overall impact of traffic incidents. Additionally, this analysis can help inform decisions about how to allocate resources and prioritize interventions to improve road safety. However, it is important to note that linear regression has its limitations, and it is crucial to carefully consider the assumptions and potential sources of bias in the analysis.
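A hedged sketch of such a model follows, using entirely synthetic stand-ins for the variables named above. The generating process, coefficients, and encodings are all invented for illustration; a real analysis would use collected crash data and validate the model assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins for the predictors (all values invented)
lane_deviation = rng.uniform(0, 10, n)   # Average Lane Deviation Score
claim_rating = rng.uniform(1, 5, n)      # Driver Claim History Rating
years_no_claim = rng.integers(0, 20, n)  # years without a claim
gender = rng.integers(0, 2, n)           # Gender, encoded 0/1

# A hypothetical generating process for the Impact score
impact = (5.0 * lane_deviation + 3.0 * claim_rating
          - 0.8 * years_no_claim + 1.5 * gender
          + rng.normal(scale=2.0, size=n))

X = np.column_stack([lane_deviation, claim_rating, years_no_claim, gender])
model = LinearRegression().fit(X, impact)

print(model.coef_.round(1))  # roughly recovers 5.0, 3.0, -0.8, 1.5
```

The fitted coefficients indicate how strongly each factor is associated with crash severity under this synthetic setup, which is the kind of insight the analysis above aims for with real data.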