Results and Conclusions
Choosing the Best Parameters
GridSearchCV is a powerful tool for hyperparameter tuning, an essential step in machine learning model development in which the set of hyperparameters that best improves the model's performance is selected. GridSearchCV performs an exhaustive search over a specified grid of hyperparameters and evaluates each candidate combination with cross-validation; the best combination is chosen according to a specified performance metric, such as accuracy or F1 score.
In this case, GridSearchCV was used to determine the best arguments for a DecisionTreeClassifier. The search returned the following values:
criterion = 'entropy'
max_depth = 15
min_samples_leaf = 500
min_samples_split = 5000
The criterion parameter specifies the impurity measure used when splitting the nodes of the tree; with 'entropy', splits are chosen to maximize information gain. The max_depth parameter caps the depth of the tree, limiting the number of successive splits that can be made. Finally, min_samples_leaf and min_samples_split set the minimum number of samples required at a leaf node and the minimum number of samples required to split an internal node, respectively.
It's worth noting that the optimal hyperparameters found by GridSearchCV may depend on the specific dataset and problem being tackled, and may not necessarily generalize to other datasets. Therefore, it's important to perform hyperparameter tuning for each specific problem to obtain the best performance.
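The sketch below illustrates how such a search might be set up with scikit-learn. It is a minimal example rather than the project's actual code: X_train and y_train stand in for the assumed training split, and the grid is a small illustrative subset centered on the winning values above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid centered on the reported best parameters.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [500, 1000],
    "min_samples_split": [5000, 10000],
}

# Exhaustive search with 5-fold cross-validation, scored on accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train, y_train: assumed training split

print(search.best_params_)
best_tree = search.best_estimator_
```

Because refit=True by default, best_estimator_ is the tree retrained on the full training set with the winning parameters, and it can then be used directly for the evaluations that follow.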
Feature Importance
Feature importances are calculated from the impurity reduction contributed by each feature across the tree's splits, a measure of how much the feature helps reduce the entropy or Gini impurity of the nodes.
The feature importances are returned as an array of values, where each value represents the relative importance of the corresponding feature in the input data. The importances sum to 1.0, and higher values indicate that the feature has a stronger influence on the target variable in the model.
Feature importances can be used to gain insight into the most important predictors in the model, which helps with feature selection, data preprocessing, and model optimization. For example, features with low importance scores can be removed from the input data to simplify the model and reduce the risk of overfitting, while features with high importance scores can be used to explain the model's predictions and identify the key factors that contribute to the target variable.
Plot representing the feature importances for the decision tree classifier model
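The sketch below shows one way such a plot can be produced, assuming the fitted classifier best_tree from the search above and a list feature_names holding the training column names; both are assumptions, not the project's actual code.

```python
import matplotlib.pyplot as plt
import numpy as np

importances = best_tree.feature_importances_  # impurity-based, sums to 1.0
order = np.argsort(importances)[::-1]         # most important features first

plt.figure(figsize=(10, 4))
plt.bar(np.array(feature_names)[order], importances[order])
plt.ylabel("Impurity-based importance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```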
Model Performance Metrics
Training Accuracy ~88%; Testing Accuracy ~86%
Confusion Matrix
A confusion matrix is used to evaluate the performance of a classification model. It summarizes the actual and predicted class labels for a set of data, showing the number of correct and incorrect predictions the model made.
In binary classification, the confusion matrix is typically a 2x2 matrix in which each row represents the actual class and each column represents the predicted class:
True Positive (TP): The model correctly predicted the positive class.
False Positive (FP): The model predicted the positive class, but the true class is negative.
False Negative (FN): The model predicted the negative class, but the true class is positive.
True Negative (TN): The model correctly predicted the negative class.
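As a minimal sketch, the matrix can be computed with scikit-learn; X_test and y_test denote the assumed held-out split, and best_tree the fitted classifier from above.

```python
from sklearn.metrics import confusion_matrix

y_pred = best_tree.predict(X_test)     # X_test, y_test: assumed test split
cm = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted
tn, fp, fn, tp = cm.ravel()            # unpack the four entries (binary case)
print(cm)
```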
Interpreting the Confusion Matrix
In this model's confusion matrix, the rows represent the actual values of the target variable (the severity of an accident) and the columns represent the values predicted by the model, giving the counts of true positives, true negatives, false positives, and false negatives.
From the confusion matrix, we can see that the model correctly predicted 57,279 not severe accidents and 61,399 severe accidents. However, it also misclassified 7,823 severe accidents as not severe and 11,641 not severe accidents as severe. These false positives and false negatives can have serious consequences in the context of traffic accidents, so it's important to keep them as low as possible.
While the model seems to perform reasonably well, there is room for improvement in reducing the number of false positives and false negatives to increase the accuracy of the model.
Classification Report
The classification report of a model provides precision, recall, F1 score, and support for each class, as well as the accuracy, macro average, and weighted average across all classes.
From the classification report, we can see that the model has an accuracy of 86%, which is calculated as the proportion of correct predictions over the total number of predictions.
For the Not Severe class, precision is 0.88 and recall is 0.83: the model is highly precise when predicting Not Severe accidents, with moderate recall. For the Severe class, precision is 0.84 and recall is 0.89, indicating both high precision and high recall. The F1 score, the harmonic mean of precision and recall, is 0.85 for the Not Severe class and 0.86 for the Severe class. These values indicate that the model performs reasonably well on both classes.
In summary, the classification report suggests that the model has a good performance in predicting the severity of traffic accidents, with high accuracy and reasonable precision, recall, and F1 score for both classes. However, it's important to note that the evaluation metrics may vary depending on the specific problem and dataset, and it's always recommended to perform a thorough analysis of the model's performance before making any decisions based on its predictions.
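A minimal sketch of generating this report with scikit-learn, under the same assumptions as before (fitted classifier best_tree, held-out X_test and y_test, and label 0 meaning Not Severe):

```python
from sklearn.metrics import classification_report

y_pred = best_tree.predict(X_test)  # assumed test split, as above
print(classification_report(
    y_test, y_pred,
    target_names=["Not Severe", "Severe"],  # assumes label order 0, 1
))
```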
Visualizing the Tree
Pruned Visually until Depth 4

Pruned Visually until Depth 5
The image shows a decision tree used to identify potentially severe accidents. The tree is grown by recursively partitioning the input data on the feature that provides the most information gain at each split. The blue-colored nodes represent the model's final prediction that an accident is severe, and every path leading to a blue node represents the sequence of decisions the model makes to arrive at that conclusion.
It's important to note that the accuracy of the decision tree model may depend on the quality and relevance of the input features, as well as the size and representativeness of the training data used to grow the tree. Regular evaluation and refinement of the model are necessary to ensure optimal performance in identifying potential severe accidents.
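Depth-limited renderings like those above can be produced with scikit-learn's plot_tree, whose max_depth argument truncates the drawing without changing the fitted model. The sketch below assumes the fitted classifier best_tree and the feature_names list from earlier.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(
    best_tree,
    max_depth=4,                        # draw only the first 4 levels
    feature_names=feature_names,        # assumed list of column names
    class_names=["Not Severe", "Severe"],
    filled=True,                        # color nodes by majority class
)
plt.show()
```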
Below are some other possible decision trees:
Decision Tree 2
Setting max_depth to 5, min_samples_leaf to 10000, and min_samples_split to 10000
Train Accuracy ~83%; Test Accuracy ~83%

Decision Tree 3
Dropping the 'Year' attribute and setting criterion to 'gini', max_depth to 5, min_samples_leaf to 500, and min_samples_split to 10000
Train Accuracy ~75%; Test Accuracy ~75%
Pruned Visually until Depth 4

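As a sketch, these two variants might be trained as follows; X_train and y_train are the assumed training split (a pandas DataFrame, so that the 'Year' column can be dropped), not the project's actual code.

```python
from sklearn.tree import DecisionTreeClassifier

# Decision Tree 2: shallower tree with much larger leaf/split thresholds.
tree2 = DecisionTreeClassifier(
    criterion="entropy", max_depth=5,
    min_samples_leaf=10000, min_samples_split=10000,
).fit(X_train, y_train)

# Decision Tree 3: drop 'Year' and switch to the Gini criterion.
X_train_no_year = X_train.drop(columns=["Year"])  # assumes a DataFrame
tree3 = DecisionTreeClassifier(
    criterion="gini", max_depth=5,
    min_samples_leaf=500, min_samples_split=10000,
).fit(X_train_no_year, y_train)
```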
Conclusions
The decision tree classifier with optimized hyperparameters appears to be a strong model for predicting the severity of accidents in the US accidents dataset. The model achieved an accuracy of 0.86 and performed well on both not-severe and severe accidents, with high recall. The use of GridSearchCV to fine-tune the hyperparameters also contributed to the model's accuracy and robustness. However, the visualization of the decision tree was complex and difficult to interpret, which makes it harder to understand the key factors contributing to severe accidents.
Additionally, further refinement of the model and validation against new datasets is necessary to ensure its effectiveness in different geographic regions.
Insights and Takeaways
The use of decision trees to predict the severity of a crash provides valuable insight and interpretability into the key factors contributing to severe accidents. The model's ability to handle both categorical and continuous variables makes it a versatile tool for analyzing accident data. However, one challenge with decision tree models is the complexity of their visualization, which can make it difficult for policymakers and law enforcement agencies to interpret the results.
To optimize the accuracy and robustness of the decision tree model, techniques such as GridSearchCV can be used to fine-tune the hyperparameters. However, if the tree depth is not appropriately optimized, or if the data is high-dimensional and noisy, the model may overfit the data. Therefore, it is essential to carefully select the relevant features to ensure the model's accuracy and generalizability.
Despite these challenges, decision tree models are powerful tools for predicting the severity of a crash and identifying the critical factors contributing to severe accidents. The insights gained from the model can be used to develop targeted interventions to reduce their occurrence, making them valuable tools for stakeholders in the transportation industry. Additionally, decision trees can be combined with other machine learning algorithms such as random forests and gradient boosting to improve the accuracy of the model.