Data Preparation

Data in Scope: 

US Accidents

This is a publicly available dataset containing information about traffic accidents that occurred in the United States. The dataset contains over 3 million records, and each record includes a wide range of attributes such as location, time, weather conditions, accident severity, and more. It is often used by researchers and data scientists to study the patterns and causes of traffic accidents and to develop predictive models for preventing accidents.
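For reference, a minimal sketch of loading and inspecting the dataset with pandas; the file name and the Severity column name are assumptions based on the public release and should be adjusted to match the downloaded file:

    import pandas as pd

    # File name is an assumption; substitute the path of the downloaded CSV.
    accidents = pd.read_csv("US_Accidents.csv")

    print(accidents.shape)                        # number of records and attributes
    print(accidents["Severity"].value_counts())   # accident counts per severity level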

Data preparation step-by-step

A more comprehensive description of these steps can be found in the Data Cleaning section of the DataPrep_EDA tab.

The problem of Class Imbalance

Class imbalance is a common problem in machine learning where the number of samples in one class is significantly smaller than the number of samples in another class. This can lead to biased models that are accurate in predicting the majority class but perform poorly in predicting the minority class. In other words, the model may focus on maximizing overall accuracy at the expense of correctly predicting the minority class. This can have serious consequences in applications where correctly identifying the minority class is important, such as in fraud detection or medical diagnosis. Therefore, addressing class imbalance is important in order to build a reliable and effective machine learning model. 

To rectify the class imbalance, the 'Not Severe' category was undersampled until its record count matched that of the minority ('Severe') category. This approach was chosen because it still left a substantial number of records in each category, approximately 270,000. The count of accidents by severity before and after undersampling is presented in the following plots.

Before undersampling

After undersampling
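For illustration, a minimal sketch of this random undersampling step with pandas; df is assumed to be the cleaned DataFrame from the preceding steps, and the "Severity_Label" column name is an assumption:

    import pandas as pd

    # df is the cleaned accidents DataFrame; the binary severity column
    # name "Severity_Label" is an assumption for this sketch.
    severe = df[df["Severity_Label"] == "Severe"]
    not_severe = df[df["Severity_Label"] == "Not Severe"]

    # Randomly down-sample the majority class to the size of the minority class.
    not_severe_down = not_severe.sample(n=len(severe), random_state=42)

    # Recombine and shuffle so the two classes are interleaved.
    balanced = pd.concat([severe, not_severe_down]).sample(frac=1, random_state=42)
    print(balanced["Severity_Label"].value_counts())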

Requirement of Labeled and Numeric data

Support Vector Machines (SVMs) are designed to work with numeric data because they rely on mathematical operations and computations to learn and optimize the decision boundary for classification. SVMs also require labeled data: each data point must be associated with a class label indicating its category or class membership. The underlying mathematical formulations and optimization algorithms in SVMs operate on numerical values. Numerical features represent data points in a multi-dimensional feature space, and numerical labels represent the target classes or categories. The optimization process then finds the hyperplane, or decision boundary, that best separates the data points based on their numeric feature values.
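As a minimal illustration of this requirement, the sketch below fits scikit-learn's SVC on a tiny synthetic dataset of numeric features and integer-encoded labels (the data is invented purely for demonstration):

    import numpy as np
    from sklearn.svm import SVC

    # Numeric feature vectors in a 2-dimensional feature space.
    X = np.array([[0.0, 1.2], [1.5, 0.3], [3.1, 2.8], [2.9, 3.5]])
    # Integer-encoded class labels: 0 = Not Severe, 1 = Severe.
    y = np.array([0, 0, 1, 1])

    clf = SVC(kernel="linear")
    clf.fit(X, y)
    print(clf.predict([[2.0, 2.0]]))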

Additionally, the kernel functions used in SVMs to handle non-linear data require numeric inputs for computation. Kernel functions are used to map the data points from the original feature space to a higher-dimensional space, where a linear decision boundary can be found. These kernel functions operate on numeric values and cannot be directly applied to non-numeric or categorical data.
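For example, the widely used radial basis function (RBF) kernel, K(x, y) = exp(-gamma * ||x - y||^2), is defined purely in terms of arithmetic on numeric vectors, as this small sketch illustrates:

    import numpy as np

    def rbf_kernel(x, y, gamma=0.5):
        # Similarity of two numeric vectors; this computation has no meaning
        # for raw categorical values such as "Rain" or "Snow".
        return np.exp(-gamma * np.sum((x - y) ** 2))

    print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0])))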

One hot encoding is a technique for transforming categorical data into a numerical format that can be used by machine learning algorithms. It involves creating a binary column for each category of a categorical variable, where a value of 1 indicates that the observation belongs to that category and 0 indicates that it does not. This produces a sparse matrix in which, for each original variable, every row has exactly one 1 and the rest are 0s. One hot encoding is useful for machine learning algorithms that require numerical inputs, as it represents the categories without introducing any artificial ordering or hierarchy among them.
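A minimal sketch of one hot encoding with pandas; the Weather_Condition column and its values here are illustrative assumptions:

    import pandas as pd

    df = pd.DataFrame({"Weather_Condition": ["Rain", "Clear", "Snow", "Clear"]})

    # Each category becomes its own binary (0/1) column.
    encoded = pd.get_dummies(df, columns=["Weather_Condition"])
    print(encoded)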

Snippets of the data before preprocessing

Snippets of the data after preprocessing

Splitting the data

The subsequent step involved splitting the data into a training set and a testing set, with the training set comprising 75% of the data and the testing set the remaining 25%. It is crucial that the split be disjoint, so that no record appears in both sets.

In machine learning, we split the data into training and testing sets to evaluate the performance of a model on new, unseen data. The purpose of training a model is to learn from the available data and make accurate predictions on new data. If we evaluate the model on the same data used for training, it can lead to overfitting, where the model learns the noise in the data instead of the underlying patterns, resulting in poor performance on new data. Thus, keeping the training and testing datasets disjoint is essential.

By splitting the data into training and testing sets, we can train the model on the training set and evaluate its performance on the testing set. This helps us to estimate the model's performance on new data and ensure that the model is able to generalize well to new, unseen data. 
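A minimal sketch of the 75/25 split with scikit-learn, assuming X holds the encoded features and y the severity labels from the steps above:

    from sklearn.model_selection import train_test_split

    # X and y are the preprocessed features and labels from the steps above.
    # test_size=0.25 reserves 25% of the records for testing; a fixed
    # random_state makes the disjoint split reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    print(len(X_train), len(X_test))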

Snippets of training dataset

Sample training features data - Link

Sample training label data - Link

Snippets of testing dataset

Sample testing features data - Link

Sample testing label data - Link

Data Cleaning Code