Results and Conclusions

Multiple Linear Regression Model

Multiple linear regression models the relationship between a single dependent variable and two or more independent variables, and is used to predict the dependent variable from the values of those predictors. In this model, the dependent variable is Total_Impact and the independent variables are Average EuroNCAP Lane Deviation Score, Driver Claim History Rating, No Year no claims, and Gender.
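Written out, the model has the following form. This is only a sketch of the structure: the shortened variable names and the coding of Gender as a 0/1 indicator are assumptions made for illustration, not details taken from the fitted model.

```latex
\[
\text{Total\_Impact}_i
  = \beta_0
  + \beta_1\,\text{LaneDeviation}_i
  + \beta_2\,\text{ClaimRating}_i
  + \beta_3\,\text{NoClaimYears}_i
  + \beta_4\,\text{Gender}_i
  + \varepsilon_i
\]
```

Here \beta_0 is the intercept, \beta_1 to \beta_4 are the coefficients of the four predictors, and \varepsilon_i is the error term for observation i.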

Summary of this LR Model

The summary of the model provides important information on the coefficients of the independent variables and the overall performance of the model. Each coefficient represents the change in the dependent variable for a one-unit change in the corresponding independent variable, holding all other independent variables constant. The coefficients are estimated using the method of least squares, and the standard error of each coefficient measures the uncertainty (sampling variability) of that estimate. The t-value is the estimate divided by its standard error, and the associated p-value is used to test the null hypothesis that the coefficient is equal to zero. If the p-value is less than the significance level (usually 0.05), the coefficient is considered statistically significant.
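The quantities in a coefficient table can be sketched for the one-predictor case as follows. This is a minimal Python illustration on made-up data, not the code used for this analysis; the function name simple_ols and the toy values are hypothetical, and the p-value step is omitted because it needs the t-distribution, which the standard library does not provide.

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x with standard errors and t-values."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_intercept": intercept / se_intercept,
        "t_slope": slope / se_slope,
    }

# Toy data (hypothetical, not from the crash dataset).
toy_fit = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

A large absolute t-value corresponds to a small p-value, which is what the summary table reports for each coefficient.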

The summary also provides information on the overall fit of the model. The multiple R-squared value measures the proportion of variation in the dependent variable that can be explained by the independent variables. The adjusted R-squared value is similar to the R-squared value, but it penalizes the number of independent variables in the model. The F-statistic tests the null hypothesis that all of the slope coefficients in the model are equal to zero. If the p-value of the F-statistic is less than the significance level, at least one predictor is related to the dependent variable; this shows that the model is better than an intercept-only model, not that it fits the data well. The residual standard error measures the typical amount by which the actual values deviate from the predicted values, and it is used to assess the accuracy of the model's predictions.
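As a rough illustration of how these fit statistics relate to one another, the sketch below computes them from observed and fitted values. It assumes the fitted values come from a least-squares model with an intercept and k predictors; the function name fit_statistics and the toy numbers are hypothetical.

```python
import math

def fit_statistics(observed, fitted, k):
    """Return (R-squared, adjusted R-squared, F-statistic, residual standard error).

    k is the number of predictors, excluding the intercept.
    """
    n = len(observed)
    mean_y = sum(observed) / n
    tss = sum((y - mean_y) ** 2 for y in observed)              # total sum of squares
    rss = sum((y - f) ** 2 for y, f in zip(observed, fitted))   # residual sum of squares
    df_resid = n - k - 1
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
    f_stat = (r2 / k) / ((1 - r2) / df_resid)
    rse = math.sqrt(rss / df_resid)
    return r2, adj_r2, f_stat, rse

# Toy values (hypothetical): one predictor, five observations.
r2, adj_r2, f_stat, rse = fit_statistics(
    [2.1, 3.9, 6.2, 7.8, 10.1],
    [2.04, 4.03, 6.02, 8.01, 10.0],
    k=1,
)
```

Because the adjustment factor (n - 1) / (n - k - 1) is greater than 1, the adjusted R-squared is always below the plain R-squared when the model has at least one predictor and R-squared is below 1.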


Interpreting the summary statistics

Based on the summary output of the multiple linear regression model, we can infer the following:


The model has a coefficient of determination (R-squared) of 0.213, which means that the independent variables explain 21.3% of the variability in the dependent variable (Total_Impact). Each coefficient represents the change in Total_Impact for a unit increase in the respective independent variable, holding all other variables constant. For instance, a unit increase in Average EuroNCAP Lane Deviation Score is associated with an increase of 20.134 in Total_Impact. The coefficient for Driver Claim History Rating is negative, indicating that an increase in the rating is associated with a decrease in Total_Impact, although this estimate is not statistically distinguishable from zero. The p-values associated with the coefficients indicate the statistical significance of the corresponding independent variables. Average EuroNCAP Lane Deviation Score and No Year no claims are statistically significant, as their p-values are less than 0.05. Driver Claim History Rating and Gender are not statistically significant, since their p-values are greater than 0.05.

The residual standard error indicates the typical deviation of the observed values from the predicted values. In this model, the residual standard error is 30.09. The F-statistic of 32.13 and its associated p-value (p-value < 2.2e-16) test the overall significance of the model and indicate that the model is significant as a whole. Overall, this model suggests that Average EuroNCAP Lane Deviation Score and No Year no claims are statistically significant predictors of Total_Impact, while Driver Claim History Rating and Gender are not.


Simple Linear Regression Model

The second model is a simple linear regression model that predicts the total impact based on the Average EuroNCAP Lane Deviation Score feature.

Summary of this LR Model

Interpreting the summary statistics

The linear regression model indicates that there is a significant positive relationship between the "Total_Impact" variable and the "Average EuroNCAP Lane Deviation Score" variable. The intercept of 205.357 means that when the "Average EuroNCAP Lane Deviation Score" is zero, the predicted value of "Total_Impact" is 205.357. The coefficient of 21.323 means that for every one unit increase in the "Average EuroNCAP Lane Deviation Score", the predicted value of "Total_Impact" increases by 21.323. 
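The fitted line can be written as a small prediction function. The intercept and slope below are the values reported in the summary; the function name predict_total_impact is only illustrative.

```python
# Coefficients reported in the simple linear regression summary.
INTERCEPT = 205.357
SLOPE = 21.323

def predict_total_impact(lane_deviation_score):
    """Predicted Total_Impact for a given Average EuroNCAP Lane Deviation Score."""
    return INTERCEPT + SLOPE * lane_deviation_score

prediction_at_zero = predict_total_impact(0)   # equals the intercept
prediction_at_one = predict_total_impact(1)    # intercept plus one slope
```

Any two scores that differ by one unit give predictions that differ by the slope, 21.323, regardless of where on the scale they lie.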

The p-value for the coefficient of the "Average EuroNCAP Lane Deviation Score" is less than 0.05, which indicates that the coefficient is statistically significant. The R-squared value of 0.1849 indicates that only 18.49% of the variation in the "Total_Impact" variable is explained by the "Average EuroNCAP Lane Deviation Score" variable. The residual standard error of 30.52 means that the model's predictions typically deviate from the observed values by about 30.52 units. The F-statistic of 108.5 with a p-value of <2.2e-16 indicates that the model is significant; with a single predictor, this test is equivalent to the t-test on that predictor's coefficient. Overall, the linear regression model suggests that the "Average EuroNCAP Lane Deviation Score" is a statistically significant predictor of the "Total_Impact" variable, but most of the variation in "Total_Impact" is left unexplained, so other factors not accounted for in the model are likely to influence it.

A few useful visuals to understand the model

Residuals vs Fitted Values Plot

Normal Q-Q Plot

The 'Residuals vs Fitted values plot' is a scatterplot of the residuals (i.e., the actual values minus the predicted values) versus the fitted values (i.e., the predicted values) of the dependent variable. This plot helps us to assess the linearity and constant variance (i.e., homoscedasticity) assumptions of the model and, to a lesser extent, the independence of the errors. A good model will have residuals scattered randomly around the zero line, with no clear patterns or trends. If there is a clear pattern, such as a curve or funnel shape, it suggests that the model may not fit the data well or may not meet the assumptions of linear regression.
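A numerical companion to this plot is to compare the spread of the residuals across bands of fitted values. The sketch below is a crude check under that idea, not a formal test of homoscedasticity; the function name residual_spread_by_band and the toy data are hypothetical.

```python
import statistics

def residual_spread_by_band(observed, fitted, bands=3):
    """Standard deviation of residuals within equal-sized bands of fitted values."""
    # Residual = observed - fitted; sort the pairs by fitted value.
    pairs = sorted((f, y - f) for y, f in zip(observed, fitted))
    size = len(pairs) // bands
    spreads = []
    for b in range(bands):
        start = b * size
        # The last band absorbs any leftover points.
        end = (b + 1) * size if b < bands - 1 else len(pairs)
        band_residuals = [r for _, r in pairs[start:end]]
        spreads.append(statistics.pstdev(band_residuals))
    return spreads

# Toy data (hypothetical): residuals are widest in the middle band.
toy_fitted = [1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_observed = [1.1, 1.9, 3.1, 6.0, 3.0, 8.0, 7.5, 7.5, 9.5]
spreads = residual_spread_by_band(toy_observed, toy_fitted)
```

Roughly equal spreads across bands are consistent with constant variance, while spreads that grow or shrink noticeably from band to band point to the funnel-like patterns described above.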

Based on the plot comparing the observed values with the fitted values in this case, there is a positive trend between the two variables and no indication of clear curvature. The assumption of linearity therefore does not appear to be violated. However, the assumption of constant variance may be violated: the points are narrowly spread at the beginning, become more widely spread towards the middle sector, and converge a little towards the end of the plot, which suggests that the variance may not be constant.

The 'normal Q-Q plot' is a graphical tool for checking the normality assumption of the residuals. It plots the quantiles of the standardized residuals against the expected quantiles of a standard normal distribution. A good model will have residuals that follow a straight line on the plot, indicating that they are normally distributed. Departures from a straight line indicate that the residuals are not normally distributed, which may indicate that the model is not the best fit for the data or that there are outliers or influential observations in the data. 
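The points of a normal Q-Q plot can be computed as in the sketch below. Two simplifications are assumptions of this illustration: the residuals are standardized by their own mean and standard deviation (statistical packages typically also adjust for leverage), and the plotting position (i - 0.5) / n is one common convention among several. The function name qq_points and the toy residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(residuals)
    centre, scale = mean(residuals), stdev(residuals)
    # Sample quantiles: standardized residuals in ascending order.
    sample = sorted((r - centre) / scale for r in residuals)
    # Theoretical quantiles of the standard normal at positions (i - 0.5) / n.
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Toy residuals (hypothetical).
points = qq_points([-1.2, -0.4, 0.0, 0.3, 1.3])
```

If the residuals are approximately normal, the pairs lie close to the line y = x; systematic departures at either end correspond to the tail deviations that a Q-Q plot makes visible.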

When evaluating the normal Q-Q plot with a margin of 2 standard deviations, it can be observed that the majority of the points fall along the line, indicating that the normality assumption may hold. However, there are a few points towards the beginning and the end of the plot that deviate from the line, which requires further investigation. Thus, it cannot be conclusively stated that the normality assumption has been violated, and additional analysis is necessary to confirm the validity of the model with respect to the normality assumption.

Visualizing the regression line on Testing Dataset

The regression line visualization on the testing set suggests that the model may not perform well in accurately predicting the output variable. The plot shows that the points are scattered far from the regression line, which indicates that the model may not be the most appropriate one for this data. The model's predictions may not match the actual output values, and the level of error might be substantial.

This type of plot, where the predicted values are plotted against the actual values, is a common technique used to evaluate the performance of regression models. The plot allows for the assessment of the goodness of fit of the model, and in this case, it suggests that the model's fit is not very good. It is important to note that the plot only shows the performance of the model on the testing set, and it is possible that the model may have performed well on the training set. However, if the model's performance on the testing set is poor, it is likely that the model needs further refinement or that a different model may be more appropriate for the data.
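The visual impression from a predicted-versus-actual plot can be backed by summary error measures on the testing set. The sketch below shows three common ones; the function name holdout_metrics and the toy numbers are hypothetical and are not results from this analysis.

```python
import math

def holdout_metrics(actual, predicted):
    """RMSE, MAE and R-squared of predictions on a held-out set."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "rmse": math.sqrt(sse / n),               # typical size of a prediction error
        "mae": sum(abs(e) for e in errors) / n,   # average absolute error
        "r2": 1 - sse / tss,                      # can be negative on a held-out set
    }

# Toy values (hypothetical).
metrics = holdout_metrics([10, 20, 30, 40], [12, 18, 33, 37])
```

Comparing these values between the training and testing sets gives a quick indication of whether the model generalizes or only fits the data it was trained on.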

Comparing both the LR Models

ANOVA (Analysis of Variance) can be used to compare nested regression models, that is, models where one contains all the predictors of the other plus some additional ones. It compares the residual sums of squares of the models to determine whether adding the new predictor variables significantly improves the fit.

In the given scenario, ANOVA is used to compare two linear regression models: lm_model1 and lm_model. lm_model1 is a simple linear regression model with only one predictor variable, and lm_model is a multiple linear regression model with four predictor variables, i.e., three more than lm_model1. The null hypothesis is that the coefficients of the three additional predictors are all zero, so that the larger model explains no more variation than the smaller one. If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis and conclude that the additional predictors, taken together, significantly improve the fit.

The ANOVA results show that the p-value (0.000842) is less than the significance level, indicating that the model with additional predictor variables fits significantly better. In other words, adding the predictor variables Driver Claim History Rating, No Year no claims, and Gender to the model has jointly improved its fit to Total_Impact. The F-statistic is 5.6355; its significance is judged from the p-value rather than from the fact that it exceeds 1. The sum of squares (SS) of 15306 is the reduction in the residual sum of squares obtained by adding the new variables. Because this comparison tests the three added predictors together, it does not show whether any single one of them, such as Gender, is individually significant; that is assessed from the coefficient p-values in the model summary.
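The reported F-statistic can be reproduced from figures already given in this report. For nested models, F is the reduction in the residual sum of squares per added predictor, divided by the residual mean square of the larger model, which is the square of its residual standard error (30.09). The sketch below uses that relationship; the function name partial_f is illustrative.

```python
def partial_f(ss_reduction, added_predictors, rse_full):
    """F-statistic for comparing nested models.

    ss_reduction: drop in residual sum of squares from adding the predictors
    added_predictors: number of predictors added to the smaller model
    rse_full: residual standard error of the larger model
    """
    mean_square_added = ss_reduction / added_predictors
    residual_mean_square = rse_full ** 2
    return mean_square_added / residual_mean_square

# Values reported above: SS = 15306, three added predictors, RSE = 30.09.
f_value = partial_f(15306, 3, 30.09)
```

The result is approximately 5.635, which agrees with the reported 5.6355 up to rounding of the residual standard error.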

Insights and Takeaways

The use of linear regression to model the Impact score for a crash is an appropriate approach for identifying factors that contribute to traffic incidents and fatalities. The multiple linear regression model that was developed takes into account four independent variables, namely the Average EuroNCAP Lane Deviation Score, Driver Claim History Rating, No Year no claims, and Gender, to predict the dependent variable Total_Impact. Additionally, a simple linear regression model was developed using only the Average EuroNCAP Lane Deviation Score as the predictor.

The summary of the multiple linear regression model shows that Average EuroNCAP Lane Deviation Score and No Year no claims are significant predictors of Total_Impact. The p-values for these variables are less than 0.05, which indicates that they are statistically significant at the 5% significance level. Driver Claim History Rating and Gender, however, are not significant predictors of Total_Impact, as their p-values are greater than 0.05. The R-squared value of the model is 0.213, which means that the model explains 21.3% of the variability in Total_Impact.

The summary of the simple linear regression model shows that the Average EuroNCAP Lane Deviation Score is a significant predictor of Total_Impact. The p-value for this variable is less than 0.05, which indicates that it is statistically significant at the 5% significance level. The R-squared value of the model is 0.1849, which means that the model explains 18.5% of the variability in Total_Impact.

The ANOVA results indicate that the multiple linear regression model is a better fit than the simple linear regression model for predicting Total_Impact. The F-statistic is 5.6355 with a p-value of 0.000842, which means that there is a significant difference between the two models, and the multiple linear regression model provides a better fit. 

Some insights and takeaways from this approach and the model developed are:

Conclusions

In conclusion, the approach of using linear regression to model the impact score for a crash based on features such as Average EuroNCAP Lane Deviation Score, Driver Claim History Rating, No Year no claims, and Gender can provide valuable insights into the factors that contribute to traffic incidents and fatalities. The multiple linear regression model developed shows that the Average EuroNCAP Lane Deviation Score and No Year no claims have a significant positive association with the total impact of a crash, while Driver Claim History Rating and Gender have a non-significant negative and positive association, respectively.

The simple linear regression model also shows that the Average EuroNCAP Lane Deviation Score is a significant predictor of the total impact of a crash. However, the R-squared values for both models are relatively low (0.213 and 0.1849), indicating that the models explain only a small proportion of the variation in the dependent variable. This may be due to the exclusion of other relevant features or the presence of unmeasured confounding factors. Therefore, caution should be exercised when interpreting the results, and further research is needed to investigate additional factors that may influence the severity of crashes.

In a nutshell, this analysis highlights the potential of linear regression as a tool for identifying the most important factors contributing to traffic incidents and informing road safety strategies. However, it also emphasizes the importance of considering the limitations of the model and carefully evaluating its assumptions and potential sources of bias to ensure reliable and valid results.

Source Code