Results and Conclusions
Partitioning Clustering
The first approach to clustering states by fatalities was a partition clustering analysis using the K-means algorithm in Python.
K-means is an unsupervised learning technique that clusters similar data points together using a similarity metric. The ultimate objective of K-means is to group data points into a predetermined number of clusters (K) based on their characteristics.
To achieve this objective, K-means involves four main steps:
Step 1: Initialization - Select a value for K and choose K random data points as centroids.
Step 2: Assignment - Assign each data point to the nearest centroid based on a distance metric.
Step 3: Recalculation - Recalculate the centroid of each cluster as the mean of the data points assigned to that cluster.
Step 4: Convergence - Repeat steps 2 and 3 until the centroids no longer change, or a maximum number of iterations is reached.
In short, the algorithm alternates between the assignment and recalculation steps until the centroids stop moving, indicating convergence.
To evaluate the quality of the clustering, K-means uses the within-cluster sum of squares (WCSS) metric. However, K-means has some limitations, such as sensitivity to initial centroid selection and a tendency to converge to local optima. To address these issues, variations of the algorithm, such as k-means++, have been proposed.
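The four steps above can be sketched with scikit-learn, which implements them (including the k-means++ initialization mentioned here) behind a single `fit` call. The data below is a synthetic placeholder standing in for the per-state fatality-rate attributes used in this report.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Placeholder for the three attributes: fatalities per drivers,
# per registered vehicles, and per population (50 "states").
X = rng.normal(size=(50, 3))

# init="k-means++" is the variant noted above that mitigates
# sensitivity to the initial centroid selection.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.inertia_)          # within-cluster sum of squares (WCSS)
print(np.bincount(labels))  # size of each of the 4 clusters
```

`inertia_` is scikit-learn's name for the WCSS metric discussed above; comparing it across values of K is the basis of elbow-style diagnostics.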
The approach involved generating 3-D clustering plots for 2, 3, 4, and 5 clusters to visualize and compare the results across configurations and to identify the number of clusters that best represents the underlying structure of the data. These plots show how the data points are distributed across clusters in each configuration, and how the algorithm groups similar states together based on their attributes. Principal Component Analysis (PCA) was also performed to reduce the dimensionality for visualization, and 2-D clustering plots were generated from the projected data.
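The PCA step for the 2-D plots can be sketched as follows: project the three fatality-rate attributes onto their first two principal components and color the points by their K-means labels. Synthetic data stands in for the per-state table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # placeholder for the per-state attributes

X_std = StandardScaler().fit_transform(X)        # put attributes on one scale
X_2d = PCA(n_components=2).fit_transform(X_std)  # reduce 3-D to 2-D

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)
# X_2d[:, 0] and X_2d[:, 1] can now be scattered, colored by `labels`,
# to produce the 2-D clustering plots shown below.
print(X_2d.shape)
```

Note the clustering is fitted on the full (standardized) attribute space; PCA is used only for display, matching the report's stated purpose.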
3 Dimensional Clustering Plots
K = 3
K = 4
K = 5
2 Dimensional Clustering Plots
K = 2
K = 3
K = 4
K = 5
From the above clustering plots, the partitioning of the states appears effective for 2, 3, and 4 clusters. For 5 clusters, the fifth cluster consists of a single data point (Mississippi), possibly due to its very high number of fatalities. For 2 clusters, although the partitioning looks clean, it is less useful for our purposes: we would prefer smaller, dedicated clusters of states, which would be more actionable for targeted efforts to improve overall road safety.
Silhouette Analysis
To further determine the most suitable number of clusters, a silhouette plot was generated as shown below. Interpreting the silhouette scores from these plots, four clusters seemed a reasonable choice.
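The silhouette analysis can be sketched as follows: compute the average silhouette score for each candidate K and pick the value that scores highest. Synthetic data is substituted so the snippet is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))  # placeholder for the per-state attributes

scores = {}
for k in range(2, 6):  # the candidate cluster counts considered above
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score measures how close each point is to its own cluster relative to the nearest other cluster, so the peak across K is a common way to justify the choice of four clusters.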
Visualizing the clusters with K as 4
Visualizing our clusters’ size/spread
Interactive plot
Understanding the clusters:
Below is the partitioned list of states by their respective clusters:
Cluster 0
Cluster 1
Cluster 2
Cluster 3
It appears that the K-means clustering algorithm was able to group the states based on similarities in their "Fatalities per Drivers", "Fatalities per Registered Vehicles", and "Fatalities per Population" attributes. The states in each cluster seem to have similar patterns in terms of their fatality rates. Cluster 0 includes states such as Connecticut, Massachusetts, and New York, while cluster 1 includes states such as Florida, Georgia, and Texas. Cluster 2 has states like California, Maryland, and Michigan, and cluster 3 includes states such as Alabama, Mississippi, and Oklahoma.
OBSERVATIONS:
Based on the clusters that were formed from K-means clustering when K is 4, we can interpret the clustering as follows:
Cluster 0: This cluster consists of states with relatively low fatality rates across all three categories. These states may have implemented strict traffic laws or have good infrastructure for road safety.
Cluster 1: This cluster consists of states with high fatality rates across all three categories. These states may have more dangerous roads or higher rates of reckless driving.
Cluster 2: This cluster consists of states with moderate fatality rates across all three categories. These states may have a mix of good and bad road safety practices.
Cluster 3: This cluster consists of states with very high fatality rates for drivers and registered vehicles, but relatively low fatality rates for the population as a whole. This may suggest that these states have higher rates of accidents involving drivers and vehicles, but lower rates of accidents involving pedestrians and non-motorists.
These clusters may offer insights into common factors that contribute to fatalities in the states, and may help inform policies and initiatives aimed at reducing fatalities on the roads. However, it's important to note that further analysis and domain knowledge will be necessary to fully understand the factors that contribute to fatality rates in each of these clusters.
Hierarchical Clustering
To deepen the clustering analysis, and to check whether other clustering techniques produce materially different results, I also clustered the data using a hierarchical approach.
Hierarchical clustering is a type of clustering algorithm used in unsupervised machine learning that groups similar data points together in a hierarchy or tree-like structure based on their distance or similarity. The resulting hierarchy can be represented using a dendrogram, which displays the branching structure of the hierarchy.
The steps involved in hierarchical clustering are:
Initialization: At the beginning of the algorithm, each data point is considered as a cluster of its own.
Calculate distance/similarity: A distance or similarity measure is calculated between each pair of data points. This measure is used to determine how similar or dissimilar each pair of data points is.
Merge closest clusters: The two closest clusters are merged to form a new cluster. The distance/similarity measure used in step 2 is used to determine which clusters are the closest.
Recalculate distance/similarity: The distance/similarity matrix is updated to reflect the new cluster that was formed in step 3.
Repeat: Steps 3 and 4 are repeated until all the data points have been merged into a single cluster.
There are two types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each data point as a separate cluster and successively merges the closest pairs of clusters, while divisive clustering starts with all data points in a single cluster and recursively splits it into smaller clusters.
Agglomerative hierarchical clustering is the most commonly used type of hierarchical clustering. In agglomerative clustering, the distance or similarity between clusters is calculated using one of several methods, such as single linkage, complete linkage, or average linkage. These methods differ in how they calculate the distance or similarity between clusters.
Once the clusters have been formed, they can be visualized using a dendrogram. A dendrogram is a tree-like diagram that shows the branching structure of the hierarchy. The height of each branch in the dendrogram represents the distance or dissimilarity between the clusters that are being merged at that point. The dendrogram can be used to identify the optimal number of clusters to use in subsequent analyses.
The approach involved performing hierarchical clustering in R with the hclust function, using cosine similarity as the distance measure. By generating dendrograms, I intended to discover the relationships among the cluster variables (the states, in our case) and to compare the results against those observed with partitioning clustering.
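The report performs this step with R's hclust; an equivalent sketch of the same steps (pairwise cosine distances, agglomerative average linkage, then cutting the tree into four clusters) can be written in Python with SciPy. The data here is a synthetic placeholder, not the report's state table.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))  # placeholder for the per-state attributes

# Pairwise cosine *distances* (1 - cosine similarity), condensed form.
d = pdist(X, metric="cosine")

# Agglomerative clustering; "average" is one of the linkage methods
# (single/complete/average) discussed above.
Z = linkage(d, method="average")

# Cut the tree to obtain 4 clusters, mirroring the K-means setup.
clusters = fcluster(Z, t=4, criterion="maxclust")
print(np.bincount(clusters)[1:])  # sizes of the 4 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw a tree like Figures 1-4.
```

The linkage matrix `Z` records each merge and its height, which is exactly the branching structure a dendrogram visualizes.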
Fig 1- Tree Dendrogram
Fig 2- Tree Dendrogram with colored clusters
Fig 3- Phylogenetic Dendrogram
Fig 4- Circular Dendrogram
OBSERVATIONS:
From figures 1 and 2 above, obtained by hierarchically clustering the states into 4 clusters, the states were grouped as follows:
Mississippi is grouped as one cluster. (Cluster A)
South Dakota, Iowa, North Dakota, Idaho, Nebraska, Alaska, Delaware, Michigan, Colorado, Oregon, Indiana, Nevada, Maine are grouped into one cluster. (Cluster B)
Connecticut, Dist of Columbia, Massachusetts, Minnesota, New Jersey, New York, Rhode Island, Washington, Maryland, Virginia, Illinois, Hawaii, Pennsylvania, Vermont, New Hampshire, Wisconsin, Ohio, Utah, California are grouped into one cluster. (Cluster C)
Alabama, Kentucky, New Mexico, Oklahoma, South Carolina, Montana, Wyoming, Arkansas, Tennessee, Louisiana, Florida, Georgia, Arizona, North Carolina, Texas, West Virginia, Kansas, Missouri are grouped into one cluster. (Cluster D)
Based on the clusters obtained from the hierarchical clustering, we can categorize them as follows:
Cluster A: This cluster consists of only one state, Mississippi. This suggests that Mississippi has a distinct pattern of fatalities compared to the other states, which may warrant further investigation.
Cluster B: This cluster consists of mainly Midwestern and Western states, along with Alaska and Delaware. These states have relatively low fatalities per 100,000 drivers, registered vehicles, and population compared to the other clusters. This may indicate that these states have implemented effective road safety measures.
Cluster C: This cluster consists of mainly Northeastern states, along with California and Illinois. These states have moderate fatalities per 100,000 drivers, registered vehicles, and population compared to the other clusters.
Cluster D: This cluster consists of mainly Southern and Southwestern states, along with Montana and Wyoming. These states have relatively high fatalities per 100,000 drivers, registered vehicles, and population compared to the other clusters. This may indicate that these states have a higher incidence of road accidents and may require more attention to road safety measures.
Partitioning Clustering vs Hierarchical Clustering
The differences in the clustering results obtained through partitioning clustering (K-means) and hierarchical clustering can be attributed to the different algorithms used to form the clusters.
In partitioning clustering, the algorithm starts with K randomly chosen centroids, assigns each observation to its nearest centroid, and then iteratively updates the centroids and re-assigns the observations. This process continues until the cluster assignments no longer change or a specified number of iterations is reached. The main advantage of partitioning clustering is its scalability and ability to handle large datasets.
On the other hand, hierarchical clustering works by recursively merging or splitting clusters based on their similarity, resulting in a tree-like structure (dendrogram). The algorithm starts by considering each observation as a separate cluster and then iteratively merges the most similar clusters until all the observations belong to a single cluster. The main advantage of hierarchical clustering is its ability to visualize the hierarchical relationships among the clusters and observations.
In our case, it seems that the partitioning clustering algorithm (K-means) has formed clusters based on the proximity of the observations to their respective cluster centroids, while the hierarchical clustering algorithm (hclust) has formed clusters based on the similarity between the observations.
Moreover, K-means requires the number of clusters to be specified beforehand, while in hierarchical clustering the number of clusters is obtained from the dendrogram by selecting a cutoff point.
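The cutoff-point idea can be illustrated concretely: a single linkage tree yields different numbers of clusters depending on the height at which it is cut. The data and cut heights below are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))  # placeholder data

# Average linkage over cosine distances, as in the R analysis.
Z = linkage(X, method="average", metric="cosine")

# Cutting at progressively larger heights merges more branches,
# leaving fewer clusters.
for t in (0.2, 0.5, 1.0):
    n = fcluster(Z, t=t, criterion="distance").max()
    print(f"cut height {t}: {n} clusters")
```

This is the step that K-means has no analogue for: the same fitted tree supports any cluster count after the fact, whereas K-means must be re-run per K.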
Furthermore, close examination of the Phylogenetic Dendrogram in Figure 3 shows that the states assigned to different clusters under the two methods sit close to one another in the tree. For example, Mississippi lies next to the branch containing Alabama, Kentucky, New Mexico, Oklahoma, and South Carolina.
Conclusions
In this analysis, we explored the clustering of US states based on fatality trends and rates. We used two methods, hierarchical clustering and partitioning clustering with the K-means algorithm, to group the states into clusters.
In a nutshell, both methods showed that clustering US states based on fatality trends and rates is possible, and the results were consistent to a certain extent. The clusters obtained from both methods can help institutions, policymakers, and stakeholders identify areas where interventions are needed to improve road safety and reduce fatalities. It is important to note that other factors, such as road infrastructure, vehicle safety standards, and driver behavior, may also influence fatality rates and should be considered in any comprehensive road safety analysis. Thus, this analysis is beneficial to a certain extent, but several other factors need to be analyzed before strong conclusions can be drawn.