Data Cleaning
Once the necessary data is collected, the next step is to organize and clean it. Data cleaning and preprocessing are crucial steps in the data science pipeline: their goal is to prepare the data for further analysis by removing inconsistencies, errors, and missing values. This includes tasks such as handling missing values, converting data into a standardized format, normalizing data, and removing outliers. Completing these steps transforms the data into a usable format that can be readily analyzed, modeled, and visualized, which helps ensure the reliability and accuracy of the results and improves the overall quality of the analysis.
Since the data comes from several sources, I decided to clean it in stages and keep the data under version control. For now, I have completed a first round of cleaning aimed at visual exploration. As the project evolves, I will document any changes made and keep the website updated with subsequent phases of data cleaning.
Looking at the data at hand, these are the areas I focused on in the first round of cleaning:
Handling missing values
Checking and reformatting data types
Feature selection
Feature engineering
Sampling
US Accidents Data:
This is a static dataset containing over 2.8 million recorded traffic accidents.
Snippets of the data before cleaning
Matrix plot visualizing the missing values before cleaning the data
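A matrix plot like this can be generated with the missingno library. The sketch below assumes pandas and a hypothetical file name for the raw data:

```python
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

# Load the raw accidents data (file name is an assumption for illustration)
df = pd.read_csv("us_accidents_raw.csv")

# Each white gap in the matrix marks a missing value in that column;
# sampling keeps the plot readable for a 2.8-million-row dataset
msno.matrix(df.sample(5000, random_state=42))
plt.show()
```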
The data has missing values in multiple attributes, as is evident in the data snippet and the matrix plot. The cleaning process began with a thorough review of the data and the removal of unwanted attributes before the missing values were handled. For numerical attributes, missing values were replaced with the mean of the attribute. For categorical attributes, since the dataset is large and the missing entries made up only a very small percentage of it, the affected rows were dropped; a sketch of this approach follows.
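Here is a minimal sketch of that imputation strategy, continuing with the dataframe loaded above (the dropped columns are illustrative, not the exact ones removed):

```python
# Drop unwanted attributes (example columns; the actual list differs)
df = df.drop(columns=["Number", "Wind_Chill(F)"], errors="ignore")

# Numerical attributes: replace missing values with the column mean
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical attributes: missing rows are a tiny fraction of the data, so drop them
cat_cols = df.select_dtypes(include="object").columns
df = df.dropna(subset=cat_cols)
```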
The data also has a large variety of weather conditions, which makes analysis difficult, so I consolidated the unique conditions into more general categories. Similarly, I reduced the number of unique values for Wind Direction. I also broke the Start Time attribute down into year, month, day, weekday, hour, and minute to help with certain visualizations and to later feed these features to the models; see the sketch below.
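A hedged sketch of these transformations; the category mappings here are illustrative assumptions rather than the exact ones used:

```python
import pandas as pd

# Consolidate detailed weather strings into broader categories (illustrative mapping)
weather_map = {
    "Light Rain": "Rain", "Heavy Rain": "Rain",
    "Light Snow": "Snow", "Heavy Snow": "Snow",
    "Partly Cloudy": "Cloudy", "Mostly Cloudy": "Cloudy",
}
df["Weather_Condition"] = df["Weather_Condition"].replace(weather_map)

# Collapse fine-grained wind directions into the main compass points (illustrative)
df["Wind_Direction"] = df["Wind_Direction"].replace({"NNW": "NW", "WNW": "NW", "SSE": "SE"})

# Break Start_Time into its datetime components
df["Start_Time"] = pd.to_datetime(df["Start_Time"])
df["Year"] = df["Start_Time"].dt.year
df["Month"] = df["Start_Time"].dt.month
df["Day"] = df["Start_Time"].dt.day
df["Weekday"] = df["Start_Time"].dt.weekday
df["Hour"] = df["Start_Time"].dt.hour
df["Minute"] = df["Start_Time"].dt.minute
```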
While investigating the data, it was interesting to see that most of the accidents are recorded on interstate highways. So I did some further feature engineering to flag interstate highways for later analysis, as sketched below.
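One possible way to derive such a flag; the Street column and the regular expression are assumptions on my part (interstates are typically written as "I-95", "I-10", and so on):

```python
# Flag accidents whose street name matches an interstate pattern
# (assumed pattern: "I-" followed by the route number)
df["Is_Interstate"] = df["Street"].str.contains(r"\bI-\d+", regex=True, na=False)
```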
Snippets of the data post cleaning
Matrix plot visualizing the missing values after cleaning the data
US Traffic Fatality Records Data:
This is a dynamic dataset maintained by the United States Department of Transportation and hosted on Google's BigQuery platform. The data is accessed through the BigQuery API.
Snippets of the data before cleaning
SQL queries were used to retrieve data through the API for specific use cases. Although the retrieved data was free of missing values, it came back as query results that had to be converted into a dataframe before further processing; a sketch of this step follows.
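As a minimal sketch of this retrieval step using the google-cloud-bigquery client (the query and table name are illustrative assumptions, and credentials are assumed to be configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application credentials are set up

# Illustrative query; the actual queries depend on the use case
sql = """
    SELECT state_name, COUNT(*) AS fatal_crashes
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
    GROUP BY state_name
    ORDER BY fatal_crashes DESC
"""

# Run the query and convert the result rows into a pandas dataframe
df_fatalities = client.query(sql).to_dataframe()
print(df_fatalities.head())
```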
Snippets of the data after cleaning