Data Preparation
Data in Scope:
Chicago Traffic Crashes
This dataset contains records of every reportable traffic crash incident within the City of Chicago. It is a public dataset maintained by the Chicago Police Department and is updated daily. Each record includes a wide range of information about the crash, such as the date, time, location, severity, and the number of injuries and fatalities. It also includes information about the vehicles and people involved, including driver age, gender, and whether the driver was impaired at the time of the crash. The dataset can be used for a variety of purposes, such as analyzing traffic patterns, identifying dangerous intersections, and evaluating the effectiveness of traffic safety measures.
To conduct association rule mining successfully, some preliminary data cleaning is usually necessary to make the data suitable for analysis. For the data at hand, there are several focus areas. One important step is handling missing values, which can disrupt the analysis and lead to inaccurate results. Depending on the nature of the missing data, this may mean dropping the affected records or imputing values using methods such as mean imputation, median imputation, or regression imputation.
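As a rough illustration, both strategies might look like the following in pandas; the file name and column names (e.g. PRIM_CONTRIBUTORY_CAUSE, INJURIES_TOTAL) are assumptions based on the public dataset's schema, not the exact code used here.

```python
import pandas as pd

# Illustrative file name for the raw export from the data portal.
crashes = pd.read_csv("traffic_crashes.csv")

# Check how much is missing per column before choosing a strategy.
print(crashes.isna().sum().sort_values(ascending=False).head(10))

# Option 1: drop rows missing a key categorical attribute.
crashes = crashes.dropna(subset=["PRIM_CONTRIBUTORY_CAUSE"])

# Option 2: impute a numeric column, here with its median.
crashes["INJURIES_TOTAL"] = crashes["INJURIES_TOTAL"].fillna(
    crashes["INJURIES_TOTAL"].median()
)
```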
Another important focus area is feature engineering: transforming the raw data into a format better suited to analysis. This may include creating new variables from existing ones, or normalizing and standardizing the data to make it more consistent. Feature engineering helps reduce noise and improve the accuracy of the analysis.
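For example, new variables can be derived from the crash timestamp and a numeric column can be standardized; as before, the column names are assumptions.

```python
import pandas as pd

crashes = pd.read_csv("traffic_crashes.csv")  # illustrative file name

# Derive new variables from the raw timestamp.
crashes["CRASH_DATE"] = pd.to_datetime(crashes["CRASH_DATE"])
crashes["CRASH_HOUR"] = crashes["CRASH_DATE"].dt.hour
crashes["CRASH_DOW"] = crashes["CRASH_DATE"].dt.day_name()

# Standardize a numeric column (z-score) so scales are comparable.
col = crashes["INJURIES_TOTAL"]
crashes["INJURIES_TOTAL_STD"] = (col - col.mean()) / col.std()
```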
Dimensionality reduction is another crucial step in the data-cleaning process. Large datasets can contain many variables, which makes the analysis complex and hard to interpret. Reducing the number of variables in the dataset simplifies the analysis and makes it more efficient.
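In a mostly categorical dataset such as this one, the simplest form of dimensionality reduction is pruning uninformative columns; the thresholds in the sketch below are arbitrary choices for illustration.

```python
import pandas as pd

crashes = pd.read_csv("traffic_crashes.csv")  # illustrative file name

# Drop columns that are mostly empty or carry a single value; for
# purely numeric data, a projection such as PCA would be an alternative.
mostly_missing = [c for c in crashes.columns if crashes[c].isna().mean() > 0.5]
near_constant = [c for c in crashes.columns
                 if crashes[c].nunique(dropna=True) <= 1]
crashes = crashes.drop(columns=sorted(set(mostly_missing) | set(near_constant)))
```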
Finally, data transformation is another focus area for cleaning. This involves reshaping the data into a form better suited to analysis, which might include changing the data types of variables, converting categorical variables into numerical ones, normalizing the data so that it falls within a specific range, or restructuring the data. Following these steps to clean and prepare the data for association rule mining helps ensure accuracy, reliability, and useful insights, which in turn can inform decision-making.
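Two common transformations, type conversion with categorical encoding and min-max scaling, might look like this (column names assumed as before):

```python
import pandas as pd

crashes = pd.read_csv("traffic_crashes.csv")  # illustrative file name

# Cast a text column to a categorical dtype and expose integer codes.
crashes["WEATHER_CONDITION"] = crashes["WEATHER_CONDITION"].astype("category")
crashes["WEATHER_CODE"] = crashes["WEATHER_CONDITION"].cat.codes

# Min-max normalization to map a numeric column into [0, 1].
col = crashes["INJURIES_TOTAL"]
crashes["INJURIES_SCALED"] = (col - col.min()) / (col.max() - col.min())
```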
Snippets of the data before cleaning
The data has missing and unknown values in multiple attributes, as evident in the data snippet. The initial cleaning was performed in Python and involved a thorough understanding of the data and dropping certain unwanted attributes before handling the missing values. Because the dataset is large and the key attributes are categorical, rows with missing values were dropped rather than imputed.
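A sketch of this initial pass might look as follows; the retained columns and the sentinel values treated as missing are assumptions, not the exact code used.

```python
import pandas as pd

crashes = pd.read_csv("traffic_crashes.csv")  # illustrative file name

# Keep only the attributes of interest (assumed selection).
keep = ["CRASH_RECORD_ID", "CRASH_DATE", "PRIM_CONTRIBUTORY_CAUSE",
        "WEATHER_CONDITION", "LIGHTING_CONDITION"]
crashes = crashes[keep]

# Treat explicit sentinel entries as missing, then drop those rows,
# since the key attributes are categorical and the data is abundant.
crashes = crashes.replace({"UNKNOWN": pd.NA, "UNABLE TO DETERMINE": pd.NA})
crashes = crashes.dropna()
```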
Later stages of cleaning, carried out in R, involved filtering out the irrelevant columns and keeping only the date and the primary contributory cause of each crash. The data was then grouped by date, producing a list of the unique crash contributory cause values for each date. Finally, the result was converted into a transaction dataset format, where each row represents a unique date and the columns represent the unique primary contributory cause values.
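Although this step was done in R (presumably with a transaction format such as the one provided by the arules package), a rough pandas equivalent conveys the idea; the column names remain assumptions.

```python
import pandas as pd

crashes = pd.read_csv("traffic_crashes.csv")  # illustrative file name

# One transaction per date: that date's unique contributory causes.
crashes["date"] = pd.to_datetime(crashes["CRASH_DATE"]).dt.date
transactions = (crashes.groupby("date")["PRIM_CONTRIBUTORY_CAUSE"]
                       .agg(lambda s: sorted(s.dropna().unique())))

# Widen into the transaction matrix: rows are dates, columns are the
# unique causes, and entries flag whether a cause occurred that day.
basket = transactions.explode().reset_index()
matrix = pd.crosstab(basket["date"],
                     basket["PRIM_CONTRIBUTORY_CAUSE"]).astype(bool)
print(matrix.head())
```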
Snippets of the data post cleaning