Data Preparation
Data in Scope:
US Accidents
This is a publicly available dataset containing information about traffic accidents that occurred in the United States. It holds over 3 million records, each with a wide range of attributes such as location, time, weather conditions, and the severity of the accident. Researchers and data scientists often use it to study the patterns and causes of traffic accidents and to develop predictive models for accident prevention.
Data preparation step-by-step
During the exploratory data analysis phase of the project, the initial steps to clean and prepare the dataset were already taken. These steps include:
Dropping irrelevant attributes
Handling missing values
Performing preliminary feature engineering, particularly for weather conditions, wind direction, start date and time, and road type.
A more comprehensive description of these steps can be found in the Data Cleaning section of the DataPrep_EDA tab.
To proceed, the correlation between features was checked by plotting a heatmap, to observe whether certain features were highly correlated and whether any redundant features should be removed. The correlation matrix reveals a strong correlation between the start and end GPS coordinates of the accidents. Additionally, the earlier analysis of the median distance of a crash indicated that the end location of an accident is usually in close proximity to its start location. It is therefore feasible to use only one set of GPS coordinates (either start or end) in the machine learning models instead of both; for the model, the starting coordinates are used. Apart from the GPS coordinates, no significant correlation was observed between the features of interest in the dataset.
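The correlation check above can be sketched as follows. This is a minimal illustration on synthetic data; the column names (`Start_Lat`, `End_Lat`, `Temperature(F)`) follow the US Accidents schema, but the values are generated, not real. In practice the full correlation matrix would be plotted as a heatmap (e.g. with seaborn's `heatmap`).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the accidents dataframe: end coordinates are
# generated to lie very close to the start coordinates, mimicking the
# pattern observed in the real data.
rng = np.random.default_rng(0)
start_lat = rng.uniform(25, 49, 1000)
df = pd.DataFrame({
    "Start_Lat": start_lat,
    "End_Lat": start_lat + rng.normal(0, 0.01, 1000),
    "Temperature(F)": rng.normal(60, 15, 1000),
})

# Compute the correlation matrix (this is what the heatmap visualizes).
corr = df.corr()
print(corr.loc["Start_Lat", "End_Lat"])  # close to 1.0

# Start/end coordinates are nearly perfectly correlated, so the end
# coordinate is dropped and only the starting coordinate is kept.
df = df.drop(columns=["End_Lat"])
```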
The problem of Class Imbalance
The subsequent step involved checking for class imbalance for the target attribute which is the severity of a crash. By creating a bar plot of the number of accidents by severity, it was observed that the severity attribute is significantly unbalanced, with a low count of accidents categorized as 'Severe', and a relatively higher count of accidents categorized as 'Not Severe'.
Class imbalance is a common problem in machine learning where the number of samples in one class is significantly smaller than the number of samples in another class. This can lead to biased models that are accurate in predicting the majority class but perform poorly in predicting the minority class. In other words, the model may focus on maximizing overall accuracy at the expense of correctly predicting the minority class. This can have serious consequences in applications where correctly identifying the minority class is important, such as in fraud detection or medical diagnosis. Therefore, addressing class imbalance is important in order to build a reliable and effective machine learning model.
To rectify the class imbalance, the 'Not Severe' category was undersampled until it matched the record count of the minority 'Severe' category. This approach was chosen because it still leaves a substantial number of records in each category, approximately 270,000. The count of accidents by severity before and after undersampling is presented in the following plots.
Before undersampling
After undersampling
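The undersampling step can be sketched with pandas alone. The dataframe below is a small synthetic example (the real 'Not Severe'/'Severe' counts are far larger); the majority class is randomly downsampled to the size of the minority class and the result is shuffled.

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced target mimicking the Severity attribute.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Severity": ["Not Severe"] * 900 + ["Severe"] * 100,
    "feature": rng.normal(size=1000),
})

minority = df[df["Severity"] == "Severe"]
majority = df[df["Severity"] == "Not Severe"]

# Randomly undersample the majority class down to the minority count,
# then shuffle the combined result.
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)

print(balanced["Severity"].value_counts())
```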
Requirement of Labeled and Numeric data
SVMs (Support Vector Machines) are designed to work with numeric data because they rely on mathematical operations and computations to learn and optimize the decision boundary for classification. SVMs require labeled data, which means that each data point must be associated with a class label indicating its category or class membership. The underlying mathematical formulations and optimization algorithms used in SVMs are designed to work with numerical values. For example, SVMs use numerical features to represent data points in a multi-dimensional feature space, and numerical labels to represent the target classes or categories. The optimization process in SVMs involves finding the optimal hyperplane or decision boundary that can best separate the data points based on their numeric feature values.
Additionally, the kernel functions used in SVMs to handle non-linear data require numeric inputs for computation. Kernel functions are used to map the data points from the original feature space to a higher-dimensional space, where a linear decision boundary can be found. These kernel functions operate on numeric values and cannot be directly applied to non-numeric or categorical data.
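As a small illustration of why the data must be numeric, the sketch below fits scikit-learn's `SVC` with an RBF kernel on a tiny hand-made numeric dataset (labels 0/1 standing in for 'Not Severe'/'Severe'). The data here is illustrative, not drawn from the accidents dataset.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny numeric dataset: two features, binary labels.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

# The RBF kernel computes exp(-gamma * ||x - x'||^2), which is only
# defined for numeric inputs -- hence the encoding steps that follow.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)
pred = clf.predict([[0.05, 0.1], [0.95, 1.0]])
print(pred)
```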
The subsequent step was to drop the redundant variables.
After dropping unnecessary attributes, the next step involved converting boolean values to numerical form.
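The boolean-to-numeric conversion can be sketched as below. The column names (`Crossing`, `Junction`) are illustrative examples of the boolean road-feature attributes in the dataset, not its exact schema.

```python
import pandas as pd

# Illustrative boolean road-feature columns.
df = pd.DataFrame({
    "Crossing": [True, False, True],
    "Junction": [False, False, True],
})

# Select all boolean columns and cast them to 0/1 integers.
bool_cols = df.select_dtypes(bool).columns
df[bool_cols] = df[bool_cols].astype(int)
print(df.dtypes)
```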
Following this, feature encoding for categorical attributes was performed. One-hot encoding was used to encode categorical features using the get_dummies() method.
One hot encoding is a technique used to transform categorical data into a numerical format that can be used by machine learning algorithms. It involves creating a binary column for each category of a categorical variable, where a value of 1 indicates that the observation belongs to that category and 0 indicates that it does not. For each encoded variable, every row therefore has exactly one column set to 1 and the rest set to 0, producing a sparse matrix. One hot encoding is useful for machine learning algorithms that require numerical inputs, as it represents the categories without introducing any artificial ordering or hierarchy among them.
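The encoding step can be sketched with `pd.get_dummies()`. The `Weather_Condition` column and its values below are illustrative, not the exact contents of the prepared dataset.

```python
import pandas as pd

# Illustrative categorical attribute.
df = pd.DataFrame({"Weather_Condition": ["Clear", "Rain", "Snow", "Clear"]})

# One binary 0/1 column is created per category.
encoded = pd.get_dummies(df, columns=["Weather_Condition"], dtype=int)
print(encoded.columns.tolist())
```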
Snippets of the data before preprocessing
Snippets of the data after preprocessing
Splitting the data
The subsequent step involved splitting the data into a training set comprising 75% of the data and a testing set comprising the remaining 25%. It is crucial that the split into training and testing sets is disjoint.
In machine learning, we split the data into training and testing sets to evaluate the performance of a model on new, unseen data. The purpose of training a model is to learn from the available data and make accurate predictions on new data. If we evaluate the model on the same data used for training, it can lead to overfitting, where the model learns the noise in the data instead of the underlying patterns, resulting in poor performance on new data. Thus, keeping the training and testing datasets disjoint is essential.
By splitting the data into training and testing sets, we can train the model on the training set and evaluate its performance on the testing set. This helps us to estimate the model's performance on new data and ensure that the model is able to generalize well to new, unseen data.
Snippets of training dataset
Snippets of testing dataset