Data Preparation
Data in Scope:
US CrashData
This dataset contains information on crash frequencies and risk variables. The crash frequencies are represented by the number of crashes per 100,000 vehicles for four types of impact: frontal, rear, front offset, and side. The risk variables include the average EuroNCAP lane deviation score, driver claim history rating, number of years with no claims, and gender. The dataset can be used to analyze the relationship between these risk variables and crash frequencies, and to build a linear regression model that predicts the total crash impact from these variables.
Data preparation step-by-step
The initial data preparation step involved checking for null values; no missing or null values were found in the dataset.
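A minimal sketch of this check with pandas (the file name us_crash_data.csv is an assumption for illustration):

```python
import pandas as pd

# Load the crash dataset; the file name is assumed for illustration.
df = pd.read_csv("us_crash_data.csv")

# Count missing values per column; every count should be zero for this dataset.
print(df.isnull().sum())
```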
The next step in data preparation was feature engineering. The four crash-frequency columns were summed row-wise into a new variable, Total_Impact, which provides a single, more comprehensive measure of the overall crash impact.
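A sketch of this step, assuming the column names match those described later in this section:

```python
# Column names are assumed from the descriptions in this report.
impact_cols = [
    "Frontal Impact (per 100,000 vehicles)",
    "Rear Impact (per 100,000 vehicles)",
    "Front Offset Impact (per 100,000 vehicles)",
    "Side Impact (per 100,000 vehicles)",
]

# Sum the four crash-frequency columns row-wise into the new target variable.
df["Total_Impact"] = df[impact_cols].sum(axis=1)
```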
The data was further examined by checking the correlations between features using a correlation plot. The highest correlation coefficient, 0.69, was observed between the 'Frontal Impact' and 'Side Impact' variables; in contrast, the correlation between 'Driver Claim History Rating' and 'Frontal Impact' was relatively low, at 0.06. The correlation plot is crucial for identifying redundant or highly correlated variables that can affect the accuracy of the model.
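One common way to produce such a plot is a seaborn heatmap of the pandas correlation matrix; a sketch, continuing from the dataframe above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation of the numeric columns.
corr = df.corr(numeric_only=True)

# Annotated heatmap of the correlation matrix.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between features")
plt.tight_layout()
plt.show()
```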
Requirement of Labeled and Numeric data
Linear regression requires labeled and numeric data because it is a supervised learning algorithm: it aims to find the relationship between the independent variables and the dependent variable (the label) through a linear equation.
Labeled data refers to the target variable (also known as the dependent variable) that the model is trying to predict. In linear regression, the target variable must be numeric, because the model fits a linear relationship between the independent variables and that target; without a numeric target, linear regression cannot be used. Likewise, the independent variables used to predict the target must be numeric, because the model estimates the relationship through mathematical operations that require numeric inputs.
The subsequent step was to factorize the Gender attribute.
Factorization is the process of converting categorical data into numeric form so it can be used as input to machine learning algorithms, typically by assigning a unique numeric value to each level of the categorical variable. Factorizing the Gender attribute therefore means converting it from a categorical variable (male/female) into a numerical one, for example 0 for male and 1 for female. This allows Gender to be included in the regression analysis, since regression requires numeric data.
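A sketch of this encoding in pandas, assuming the Gender column holds the strings 'male' and 'female':

```python
# Map each Gender category to a numeric code (0 for male, 1 for female).
# The exact string values in the column are an assumption.
df["Gender"] = df["Gender"].map({"male": 0, "female": 1})

# Alternatively, pandas can assign integer codes automatically:
# df["Gender"], _ = pd.factorize(df["Gender"])
```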
The next step involved dropping the redundant attributes.
Dropping redundant attributes means removing columns that are not useful for the analysis or that are highly correlated with other attributes. Redundant attributes can cause multicollinearity, which can lead to inaccurate regression results; removing them can improve the accuracy of the model and reduce overfitting. In this context, the redundant attributes are the individual crash frequencies, namely 'Frontal Impact (per 100,000 vehicles)', 'Rear Impact (per 100,000 vehicles)', 'Front Offset Impact (per 100,000 vehicles)', and 'Side Impact (per 100,000 vehicles)', since these variables have already been summed into the 'Total_Impact' variable. They can therefore be dropped from the dataset to avoid redundancy and simplify the analysis.
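Continuing the sketch, the four individual impact columns can be dropped once Total_Impact has been created:

```python
# Remove the individual impact columns now that Total_Impact captures their sum.
df = df.drop(columns=impact_cols)
```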
Snippets of the data before preprocessing
Snippets of the data after preprocessing
Splitting the data
The subsequent step involved splitting the data into a training set and a testing set, with the training set comprising 80% of the data and the testing set the remaining 20%. It is crucial that the split into training and testing sets is disjoint.
In machine learning, we split the data into training and testing sets to evaluate the performance of a model on new, unseen data. The purpose of training a model is to learn from the available data and make accurate predictions on new data. If the model is evaluated on the same data used for training, the evaluation can be misleadingly optimistic: an overfit model that has learned the noise in the data rather than the underlying patterns will score well on the training data yet perform poorly on new data. Disjoint training and testing datasets are therefore essential.
By splitting the data into training and testing sets, we can train the model on the training set and evaluate its performance on the testing set. This helps us to estimate the model's performance on new data and ensure that the model is able to generalize well to new, unseen data.
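A sketch of this split with scikit-learn's train_test_split, assuming Total_Impact is the target variable:

```python
from sklearn.model_selection import train_test_split

# Separate the predictors from the target variable.
X = df.drop(columns=["Total_Impact"])
y = df["Total_Impact"]

# Hold out 20% of the rows as a disjoint testing set;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```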
Snippets of training dataset
Snippets of testing dataset