Association Rule Mining
OVERVIEW
What is Association Rule Mining?
Association Rule Mining (ARM) is a technique used in data mining to discover relationships between variables in a dataset. It involves finding patterns or associations among items or events that occur together in large datasets, such as market baskets, customer transaction logs, or web usage logs. It is an unsupervised learning method, meaning that it does not require labeled data to work.
The basic idea behind ARM is to identify sets of items that frequently occur together in a dataset. These sets of items are known as itemsets, and the relationships between them are expressed as association rules. An association rule is an "if-then" statement that describes the relationship between two or more items. For example, if we have a dataset of customer transactions at a grocery store, an association rule might be "if a customer buys bread and milk, then they are likely to also buy eggs". This rule tells us that there is a strong relationship between the items bread, milk, and eggs in the dataset.
There are several algorithms that can be used to perform association rule mining, including Apriori, Eclat, and FP-Growth. These algorithms use different approaches to identify frequent itemsets and generate association rules from them.
Concepts Behind the Measures
Support: The support of an itemset is the proportion of transactions in the dataset that contain the itemset. For example, if a dataset has 10 transactions and 4 of them include both items A and B, then the support of the itemset {A, B} is 4/10 = 0.4, or 40%.
Confidence: Confidence is the probability of item B being purchased given that item A is purchased. It is calculated by dividing the number of transactions that include both A and B by the number of transactions that include A, which is the same as dividing the support of {A, B} by the support of {A}. For instance, if the support of {A, B} is 0.4 and the support of {A} is 0.5, then the confidence of the rule "A -> B" is 0.4 / 0.5 = 0.8, or 80%.
Lift: Lift measures how much more likely item B is to be purchased when item A is purchased, relative to B's overall popularity; it is the confidence of "A -> B" divided by the support of B. A lift value greater than 1 indicates that the purchase of item A is associated with a higher likelihood of purchasing item B, a value less than 1 indicates a lower likelihood, and a value of exactly 1 indicates that the two items are purchased independently of each other.
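To make these definitions concrete, here is a small Python sketch on a hypothetical list of ten transactions, constructed so that the numbers match the examples above (the items and counts are invented for illustration):

# Ten hypothetical transactions chosen so that support({A, B}) = 0.4,
# support({A}) = 0.5, and support({B}) = 0.5.
transactions = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"},
    {"B"}, {"C"}, {"C"}, {"C"}, {"C"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_ab = support({"A", "B"}, transactions)   # 0.4
sup_a = support({"A"}, transactions)         # 0.5
sup_b = support({"B"}, transactions)         # 0.5

confidence_ab = sup_ab / sup_a               # 0.8, i.e. P(B | A)
lift_ab = confidence_ab / sup_b              # 1.6, so buying A makes B more likely

print(f"support={sup_ab}, confidence={confidence_ab}, lift={lift_ab}")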
What is the Apriori algorithm?
The Apriori algorithm is a widely used algorithm for mining frequent itemsets and generating association rules from a dataset. It is based on the Apriori property, which states that every subset of a frequent itemset must also be frequent. The algorithm works by scanning the dataset to identify frequent itemsets that meet a minimum support threshold specified by the user.
The Apriori algorithm uses a bottom-up approach to find frequent itemsets, starting with individual items and progressively extending the itemsets until no more frequent extensions can be found. Because any extension of an infrequent itemset must itself be infrequent, the algorithm can prune large parts of the search space, which makes it practical for mining frequent itemsets in large datasets.
Once the frequent itemsets are identified, the algorithm generates association rules that have a minimum confidence level, which is also specified by the user. An association rule is a statement of the form "If X, then Y", where X and Y are sets of items. The confidence level of an association rule is the proportion of transactions in which the rule holds true among those transactions that contain X.
The Apriori algorithm proceeds in passes, commonly referred to as "iterations." The first iteration scans the dataset for every individual item that meets the minimum support threshold; these form the frequent one-item sets. In each subsequent iteration, the algorithm builds candidate itemsets of length k+1 from the frequent itemsets of length k found in the previous iteration, then scans the dataset again to count the support of each candidate. A candidate itemset that falls below the minimum support threshold is discarded; otherwise, it is added to the collection of frequent itemsets.
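The following is a simplified, unoptimized sketch of those passes in Python, assuming the transactions are given as a list of sets of items; it omits the subset-pruning step that full Apriori implementations apply when generating candidates:

def apriori_frequent_itemsets(transactions, min_support):
    # Simplified sketch of the Apriori passes described above.
    # transactions: list of sets of items; returns {frozenset: support}.
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # First pass: frequent one-item sets.
    singletons = {frozenset([item]) for t in transactions for item in t}
    frequent = {s: support(s) for s in singletons if support(s) >= min_support}
    all_frequent, k = dict(frequent), 1

    # Later passes: join frequent k-item sets into (k+1)-item candidates,
    # rescan the data, and keep only candidates that meet the threshold.
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

On the toy transactions from the earlier sketch, a minimum support of 0.4 would return {A}, {B}, {C}, and {A, B} along with their supports.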
Finally, the algorithm generates candidate rules from the frequent itemsets and evaluates the support, confidence, and lift of each prospective rule. The set of interesting rules that meet the minimum thresholds for these measures is the algorithm's final output.
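In practice, the whole pipeline is usually run with an existing library rather than written by hand. As an illustration, here is a short sketch using the mlxtend library in Python on a small set of hypothetical grocery transactions (the data and thresholds are invented for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical grocery transactions.
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
    ["bread", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)

# Frequent itemsets above the minimum support, then rules above the minimum confidence.
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])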
How can ARM be leveraged to address traffic incidents and fatalities?
Association rule mining (ARM) can be leveraged to analyze data related to traffic incidents and fatalities in order to identify patterns and relationships between different variables. This information can be used to develop strategies and interventions aimed at reducing the incidence of traffic incidents and fatalities.
For instance, ARM could be used to identify factors that are frequently associated with traffic incidents, such as driver age, road conditions, time of day, and weather conditions. By analyzing these factors and their relationships, it may be possible to develop targeted interventions that address specific risk factors. For example, if the analysis suggests that driver fatigue is a significant risk factor, measures could be taken to promote better sleep habits or rest stops for long-haul truckers.
In addition, ARM can be used to identify groups of factors that tend to occur together, which can provide insight into complex relationships and contribute to a more comprehensive understanding of the causes of traffic incidents. For example, ARM could be used to identify a set of factors that commonly co-occur in accidents involving young drivers, such as high speed, distracted driving, and lack of experience. This information could be used to develop targeted education and outreach campaigns aimed at improving driver safety among this demographic.
Another potential application of ARM is to analyze data related to near-miss incidents, or incidents in which an accident was narrowly avoided. By analyzing the factors that contributed to these incidents, it may be possible to identify risk factors that are not evident in actual accidents.
Overall, ARM can be a valuable tool for analyzing traffic incident and fatality data, and it has the potential to contribute to the development of effective interventions aimed at reducing the incidence of accidents and improving road safety.
AREA OF INTEREST
What are the common causes of traffic crashes in a city, and how are they associated with each other?
One way to use association rule mining to answer this question is by following these steps:
1. Preprocess the traffic crashes dataset for a city and extract the recorded causes of each crash.
2. Apply data mining techniques, such as the Apriori algorithm, to discover frequent itemsets and association rules (see the sketch after these steps).
3. Interpret the association rules to identify the common causes of traffic crashes in the city and their relationships. For example, it may be discovered that a notable number of accidents are attributed to speeding, which is often linked to other factors like reckless driving, distracted driving, or road conditions.
4. Based on the insights from the association rules, develop strategies to tackle the common causes of traffic crashes and minimize their incidence. For instance, measures like enhancing public awareness campaigns to curb speeding and distracted driving or improving road infrastructure to boost safety may be prioritized.
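A minimal Python sketch of steps 1 and 2, again using mlxtend and assuming a hypothetical crashes file named traffic_crashes.csv with contributing-factor columns such as PRIM_CONTRIBUTORY_CAUSE, WEATHER_CONDITION, and LIGHTING_CONDITION (the file name, column names, and thresholds are placeholders to adapt to the actual dataset):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical file and column names; adjust to the real crash dataset's schema.
crashes = pd.read_csv("traffic_crashes.csv")
factor_columns = ["PRIM_CONTRIBUTORY_CAUSE", "WEATHER_CONDITION", "LIGHTING_CONDITION"]

# Step 1: treat each crash as a "transaction" of its recorded contributing factors.
transactions = [
    [f"{col}={row[col]}" for col in factor_columns if pd.notna(row[col])]
    for _, row in crashes.iterrows()
]

# Step 2: one-hot encode and mine frequent factor combinations and rules.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)
itemsets = apriori(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.2)

# Inspect the most strongly co-occurring causes (highest lift first).
print(rules.sort_values("lift", ascending=False)
           .head(10)[["antecedents", "consequents", "support", "confidence", "lift"]])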