Data Preparation

Data in Scope: 

US CrashData

This dataset contains crash frequencies and risk variables. The crash frequencies are given as the number of crashes per 100,000 vehicles for four types of impact: frontal, rear, front offset, and side. The risk variables are the average EuroNCAP lane deviation score, the driver claim history rating, the number of years with no claims, and gender. The dataset can be used to analyze the relationship between these risk variables and crash frequencies, and to build a linear regression model that predicts the total impact frequency (the sum of the four crash frequencies) from these variables.

Data preparation step-by-step

Requirement of Labeled and Numeric data

Linear regression requires labeled and numeric data because it is a supervised learning algorithm that aims to find the relationship between the independent variables and the dependent variable (which is the labeled data) through a linear equation. 

Labeled data means that every record comes with a known value of the target variable (also known as the dependent variable), which is the quantity the model is trying to predict. In linear regression, the target variable must be numeric because the model fits a linear equation that maps the independent variables to a numeric output. Without a numeric target variable, linear regression cannot be used. The independent variables, which are used to predict the target, must also be numeric, because the fitted equation multiplies each of them by a coefficient and sums the results, and these operations are only defined for numbers.
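As a minimal sketch of how the numeric requirement could be checked, assuming the data is loaded into a pandas DataFrame, the snippet below lists the columns that are not yet numeric. The rows are toy values for illustration only, and "Years_No_Claims" is a hypothetical column name, not necessarily the one used in the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the crash dataset.
# "Years_No_Claims" is a hypothetical column name.
df = pd.DataFrame({
    "Years_No_Claims": [3, 0, 7],
    "Gender": ["male", "female", "female"],
})

# Columns whose dtype is not numeric cannot be passed to the regression as-is.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Any column reported here needs to be converted to numbers, or dropped, before the model is fitted.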

Factorization is the process of converting categorical data into numeric form so that it can be used as input for machine learning algorithms. It is typically done by assigning a unique integer to each level of the categorical variable. For the Gender attribute, this means replacing the categories male and female with numbers, for example 0 for male and 1 for female. Factorizing the Gender attribute allows us to include it in the regression analysis, since regression requires numeric data.
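The factorization of Gender can be sketched with pandas as follows. This is an illustration under assumptions rather than the exact code used for this report: the rows are toy values, and the mapping of 0 to male and 1 to female arises only because "male" appears first in the toy data.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the crash dataset.
df = pd.DataFrame({"Gender": ["male", "female", "female", "male"]})

# pd.factorize assigns integer codes in order of first appearance,
# so in this toy data "male" becomes 0 and "female" becomes 1.
codes, categories = pd.factorize(df["Gender"])
df["Gender"] = codes

print(df["Gender"].tolist())
print(list(categories))
```

Because the codes depend on the order in which categories first appear, it is worth keeping the returned categories so that the numeric codes can be mapped back to their labels later.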

The next step involved dropping the redundant attributes. 

Dropping redundant attributes means removing columns from the dataset that are not useful for the analysis or that are highly correlated with other attributes. Redundant attributes can cause multicollinearity, which makes the estimated regression coefficients unstable and hard to interpret. Removing them can improve the reliability of the model and reduce overfitting. In this dataset, the redundant attributes are the individual crash frequencies, 'Frontal Impact (per 100,000 vehicles)', 'Rear Impact (per 100,000 vehicles)', 'Front Offset Impact (per 100,000 vehicles)', and 'Side Impact (per 100,000 vehicles)', since these variables have been summed to create the 'Total_Impact' variable. Because 'Total_Impact' is their sum, keeping them would only duplicate information already contained in the target, so they can be dropped from the dataset to avoid redundancy and simplify the analysis.
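A minimal sketch of this step, assuming a pandas DataFrame with the four crash frequency columns named as above, could look like the following. The numbers are toy values for illustration, not figures from the dataset.

```python
import pandas as pd

impact_cols = [
    "Frontal Impact (per 100,000 vehicles)",
    "Rear Impact (per 100,000 vehicles)",
    "Front Offset Impact (per 100,000 vehicles)",
    "Side Impact (per 100,000 vehicles)",
]

# Toy rows for illustration only; these are not values from the crash dataset.
df = pd.DataFrame({
    impact_cols[0]: [10.0, 12.0],
    impact_cols[1]: [5.0, 4.0],
    impact_cols[2]: [3.0, 6.0],
    impact_cols[3]: [2.0, 1.0],
    "Gender": [0, 1],
})

# Sum the four crash frequencies row by row to form the target variable.
df["Total_Impact"] = df[impact_cols].sum(axis=1)

# The individual frequencies are now redundant, so remove them.
df = df.drop(columns=impact_cols)

print(df)
```

The total has to be computed before the individual columns are dropped; reversing the two steps would discard the values needed for the sum.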

Snippets of the data before preprocessing

Snippets of the data after preprocessing

Splitting the data

In machine learning, we split the data into training and testing sets to evaluate the performance of a model on new, unseen data. The purpose of training a model is to learn from the available data and make accurate predictions on new data. If we evaluate the model on the same data used for training, overfitting goes undetected: a model that has learned the noise in the data instead of the underlying patterns will score well on the training data yet perform poorly on new data. Thus, keeping the training and testing datasets disjoint is essential.

By splitting the data into training and testing sets, we can train the model on the training set and evaluate its performance on the testing set. This helps us to estimate the model's performance on new data and ensure that the model is able to generalize well to new, unseen data. 
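The split can be sketched with pandas alone, as shown below. The 80/20 ratio and the random seed are assumptions made for this illustration, since the report does not state the proportions used, and the rows are toy values rather than data from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the crash dataset.
df = pd.DataFrame({
    "Gender": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Total_Impact": [20.0, 23.0, 18.5, 30.0, 25.0, 22.0, 19.0, 27.5, 21.0, 24.0],
})

# Randomly draw 80% of the rows for training (assumed ratio);
# a fixed random_state makes the split reproducible.
train = df.sample(frac=0.8, random_state=42)

# The rows that were not drawn form the testing set, so the two sets never overlap.
test = df.drop(train.index)

print(len(train), len(test))
```

scikit-learn's train_test_split function is a common alternative that achieves the same result in one call.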

Snippets of training dataset

Training set  - Link

Snippets of testing dataset

Testing set  - Link

Data Cleaning Code