Decision Trees
OVERVIEW
What is a Decision Tree?
A decision tree is a predictive modeling tool used in machine learning and data mining. It is a supervised learning technique with a tree-like structure that represents a set of decisions and their possible consequences, including chance events and resource costs. Decision trees are commonly used in classification and regression analysis to help identify a strategy for solving a problem or making a decision.
The decision tree starts with a single node, called the root node, which represents the entire dataset. The root node is split into two or more child nodes, each representing a subset of the data that shares a common characteristic or attribute. The splitting process continues recursively, with each node being split into child nodes until a stopping criterion is met. The stopping criterion can be a maximum tree depth, a minimum number of samples per leaf node, or a minimum reduction in the impurity of the data. At each internal node, a decision is made based on the values of one or more input features or attributes; at each leaf node, an output is assigned. For classification tasks, the output is a class label; for regression tasks, it is a predicted continuous value.
The construction of a decision tree involves several steps:
Feature selection: The input features or attributes that best separate the data into the different classes or groups are selected.
Splitting: The feature or attribute with the greatest information gain or reduction in impurity is selected to split the data into two or more child nodes.
Stopping criterion: A stopping criterion is defined to determine when to stop splitting the nodes into child nodes.
Pruning: The decision tree is pruned to prevent overfitting, which is when the tree is too complex and fits the training data too closely, resulting in poor generalization to new data.
Once the decision tree is constructed, it can be used to predict the class or output value for new data points by following the path from the root node to the appropriate leaf node based on the values of the input features. Decision trees are often used in conjunction with other machine learning techniques, such as ensemble methods like random forests, to improve their accuracy and robustness.
Figure: A sample representation of a decision tree.
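As a rough sketch of this workflow, the example below builds a decision tree with scikit-learn; the dataset (Iris) and the specific hyperparameter values are illustrative assumptions rather than part of the original discussion.

```python
# Illustrative sketch: building and using a decision tree with scikit-learn.
# The dataset (Iris) and hyperparameter values are assumptions for demonstration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(
    criterion="gini",        # impurity measure used to choose splits
    max_depth=4,             # maximum tree depth (stopping criterion)
    min_samples_leaf=5,      # minimum samples required in a leaf node
    ccp_alpha=0.01,          # cost-complexity pruning strength
    random_state=42,
)
tree.fit(X_train, y_train)

# Prediction traverses the tree from root to leaf for each test sample.
print("Test accuracy:", tree.score(X_test, y_test))
```

Here max_depth and min_samples_leaf act as stopping criteria, and ccp_alpha applies cost-complexity pruning after the tree is grown.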
Decision Tree Terminologies
Root Node: The root node initiates the decision tree by representing the entire dataset, which is further divided into two or more homogeneous sets.
Leaf Node: Leaf nodes represent the final output of the decision tree; a leaf node cannot be split any further.
Splitting: The process of dividing the decision node/root node into sub-nodes based on the given conditions is known as splitting.
Branch/Sub-tree: A subtree formed by splitting a node of the decision tree is referred to as a branch or sub-tree.
Pruning: The process of removing the unwanted branches from the decision tree is called pruning.
Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are referred to as its child nodes.
In layman's terms, a decision tree is a graphical representation of a set of rules used to make decisions. Think of it as a flowchart that helps you make decisions based on a set of conditions or rules. Each branch of the tree represents a decision that is made based on a condition or attribute, leading to additional branches or a final decision. Decision trees are often used in machine learning as a predictive model, where the goal is to classify new data based on a set of pre-defined criteria or attributes.
How does the algorithm work?
Below is a step-by-step explanation of the decision tree algorithm, including the concepts of Gini impurity, entropy, and information gain:
Step 1: Data Preparation
The first step in building a decision tree is to prepare the data. This includes selecting the relevant features and pre-processing the data by handling missing values, outliers, and other data quality issues.
Step 2: Node Selection
The next step is to select the feature on which to split the data at a node. This is done using an Attribute Selection Measure (ASM), such as information gain or the Gini index, which identifies the feature that best separates the data.
Step 3: Splitting the Data
Once a node has been selected for splitting, the data is split into subsets based on the values of the selected feature. For each subset, a child node is created.
Step 4: Recursive Splitting
The splitting process is repeated recursively for each child node until a stopping criterion is met. This criterion could be a maximum depth limit, a minimum number of data points in a node, or a minimum reduction in impurity.
Step 5: Prediction
Finally, the decision tree is used to make predictions by traversing the tree from the root node to a leaf node based on the values of the input features. The predicted output is the class label associated with the leaf node.
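To make Step 5 concrete, here is a minimal from-scratch sketch of prediction by traversal; the node representation and the tiny hand-built tree are hypothetical.

```python
# Minimal sketch of prediction by tree traversal (hypothetical node structure).
# Internal nodes store a feature name and threshold; leaves store a class label.
def predict(node, sample):
    # A leaf node is represented here as a plain class label (string).
    if not isinstance(node, dict):
        return node
    # Go left if the sample's feature value is below the threshold, else right.
    if sample[node["feature"]] < node["threshold"]:
        return predict(node["left"], sample)
    return predict(node["right"], sample)

# A tiny hand-built tree: first split on age, then on income.
tree = {
    "feature": "age", "threshold": 30,
    "left": "low_risk",
    "right": {
        "feature": "income", "threshold": 50_000,
        "left": "high_risk",
        "right": "low_risk",
    },
}

print(predict(tree, {"age": 45, "income": 40_000}))  # -> high_risk
```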
Attribute Selection Measures
When constructing a Decision tree, a crucial concern is determining the optimal attribute for the root node and sub-nodes. An effective solution to this challenge is using an Attribute Selection Measure (ASM) technique, which aids in selecting the most appropriate attribute for tree nodes. Information Gain and Gini Index are two widely used ASM techniques.
1. Information Gain:
Information gain measures the change in entropy after a dataset is segmented on an attribute; it indicates how much information a feature provides about the class. Nodes are split based on information gain when constructing a decision tree. The algorithm seeks to maximize information gain, and the attribute with the highest information gain is chosen for the first split. Information gain is calculated by subtracting the weighted average entropy of the resulting subsets from the entropy of the original dataset.
Information Gain = Entropy(S) − [Weighted Average × Entropy(each subset after the split)]
Entropy: Entropy is a metric that measures the impurity of a given set of samples; it quantifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
where:
S = the set of samples
P(yes) = the proportion of samples labeled "yes"
P(no) = the proportion of samples labeled "no"
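As a worked sketch of these formulas, the snippet below computes the entropy of a small set of yes/no labels and the information gain of a hypothetical binary split; the toy counts are made up for illustration.

```python
# Toy computation of entropy and information gain (made-up yes/no labels).
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Information Gain = Entropy(S) - weighted average entropy of the subsets.
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

labels = ["yes"] * 9 + ["no"] * 5   # the parent dataset S
left   = ["yes"] * 6 + ["no"] * 2   # subset where the feature value is 0
right  = ["yes"] * 3 + ["no"] * 3   # subset where the feature value is 1

print(round(entropy(labels), 3))                          # ~0.940
print(round(information_gain(labels, [left, right]), 3))  # ~0.048
```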
2. Gini Index:
The Gini Index is a measure of impurity or purity that is utilized during the creation of a decision tree in the CART (Classification and Regression Tree) algorithm. Attributes with lower Gini indexes are preferred over those with higher ones. The Gini index creates only binary splits, and the CART algorithm employs it to generate these splits. The calculation of the Gini index can be performed using the following formula:
Gini Index = 1 − Σⱼ Pⱼ²
where Pⱼ is the proportion of samples belonging to class j.
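For comparison, here is a minimal sketch of the Gini index computation on the same kind of toy label set (the labels are made up):

```python
# Toy computation of the Gini index: 1 minus the sum of squared class proportions.
from collections import Counter

def gini_index(labels):
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_index(["yes"] * 9 + ["no"] * 5))   # ~0.459 (impure node)
print(gini_index(["yes"] * 8))                # 0.0 (pure node)
```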
In a nutshell, the decision tree algorithm works by recursively partitioning the feature space based on the most informative features until a stopping criterion is met. The resulting tree can be used for classification or regression tasks by traversing the tree and making predictions based on the values of the input features.
Example:
Buying a car: The scenario involves evaluating an individual's car-purchasing preference. If the desired color is blue, additional criteria such as the car's year of manufacture and mileage are taken into account. If the color preference is not blue, the brand of the car is prioritized. If none of these conditions are met, the car is not purchased. However, if the car is blue and manufactured after 2020, if it is a blue car with good mileage, or if it is a red Ferrari, then it would be purchased. A sample decision tree for this scenario can be drawn as a flowchart of these conditions.
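As a rough illustration, the same purchasing rules can be written as nested conditions in Python; the attribute names and the threshold used for "good mileage" are assumptions.

```python
# Hypothetical encoding of the car-buying decision tree as nested conditions.
def buy_car(color, year, mileage, brand):
    if color == "blue":
        # For blue cars, look at year of manufacture, then mileage.
        if year > 2020:
            return "buy"
        if mileage < 50_000:      # "good mileage" threshold is an assumption
            return "buy"
        return "do not buy"
    # For non-blue cars, the brand is the deciding factor.
    if color == "red" and brand == "Ferrari":
        return "buy"
    return "do not buy"

print(buy_car("blue", 2022, 80_000, "Toyota"))   # buy (blue, made after 2020)
print(buy_car("red", 2015, 90_000, "Ferrari"))   # buy (red Ferrari)
print(buy_car("green", 2023, 10_000, "Honda"))   # do not buy
```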
Why use Decision Trees?
There are many algorithms in machine learning, so choosing the one best suited to the given dataset and problem is a key consideration when building a model. Below are two primary reasons for using a decision tree:
Decision trees mimic the way humans reason when making a decision, so they are easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Advantages of the Decision Tree
Decision trees are easy to understand since they mirror the way decisions are made in real life.
They can be effective in solving decision-based problems.
They enable the exploration of all possible outcomes for a given problem.
Decision trees require less data preprocessing in comparison to other algorithms.
Disadvantages of the Decision Tree
Decision trees can be complex since they consist of multiple layers.
Overfitting can be an issue, but this can be addressed with pruning or with ensemble methods such as the Random Forest algorithm (see the sketch after this list).
For problems with many class labels, the computational complexity of the decision tree may increase.
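To illustrate the second point above, the sketch below compares a single unconstrained tree with a random forest on synthetic data; the dataset and parameter choices are illustrative assumptions.

```python
# Illustrative sketch: a random forest averages many decision trees, which
# usually reduces the overfitting of a single deep tree. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```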
How can Decision Trees be leveraged to address traffic incidents and fatalities?
Decision trees can be a valuable tool in predicting and preventing traffic incidents and fatalities. By analyzing data such as weather conditions, time of day, road type, and other relevant factors, decision trees can help identify high-risk situations and inform proactive measures to mitigate those risks. For example, if a decision tree analysis shows that a particular type of road or intersection is associated with a high frequency of accidents, city planners can use that information to make improvements to the infrastructure and reduce the likelihood of accidents in the future. Similarly, if the analysis shows that certain weather conditions or times of day are associated with a higher risk of accidents, authorities can take steps such as issuing warnings or adjusting traffic flow to reduce the risk. Additionally, decision trees can be used to develop models for predicting the severity of accidents, which can help emergency responders better allocate resources and respond more effectively to accidents.
Moreover, decision trees can also be used to predict the likelihood of fatalities in the event of an accident. By analyzing data such as the type of vehicle involved, speed of impact, and use of safety features such as seatbelts and airbags, decision trees can provide insight into the factors that are most strongly associated with fatal accidents. This information can inform public education campaigns, targeted enforcement efforts, and vehicle safety regulations to help reduce the number of fatalities on the roads. Overall, decision trees can play an important role in promoting road safety and saving lives.
AREA OF INTEREST
To leverage a decision tree classifier in the context of traffic accidents and fatalities, I plan to utilize the US Accidents dataset to predict the severity of a crash. Using decision trees to predict the severity of a crash is a promising area of application. By analyzing the various attributes related to an accident, such as road conditions, weather, vehicle type, and driver behavior, a decision tree can help predict the severity of a crash. This can assist emergency responders in providing appropriate medical care and directing resources to the scene of the accident. Furthermore, by identifying the factors that contribute to the severity of a crash, decision trees can aid in the development of preventative measures, such as better road design or driver education programs, to reduce the frequency and severity of accidents in the future.
Decision trees are relatively simple to understand and interpret, making them accessible to non-technical stakeholders, such as policymakers or law enforcement personnel. However, care must be taken to avoid overfitting the model, as decision trees can become overly complex and difficult to interpret when trained on large datasets with many features. To mitigate this, techniques such as pruning or ensemble methods like random forests can be employed to increase the accuracy and interpretability of the model.
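As a preliminary sketch of this plan, the code below trains a pruned decision tree to predict crash severity. The file name, column names, and preprocessing steps are assumptions about the US Accidents dataset and would need to be adapted to the actual fields.

```python
# Hedged sketch: predicting crash severity with a pruned decision tree.
# The file name and column names below are assumptions, not verified fields.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("US_Accidents.csv")  # assumed local copy of the dataset

# Assumed feature columns (weather and visibility attributes) and target.
feature_cols = ["Temperature(F)", "Visibility(mi)", "Weather_Condition"]
target_col = "Severity"

# One-hot encode the categorical feature and drop rows with missing values.
data = pd.get_dummies(df[feature_cols + [target_col]].dropna(),
                      columns=["Weather_Condition"])
X = data.drop(columns=[target_col])
y = data[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Limit depth and prune to keep the tree interpretable and avoid overfitting.
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=50,
                               ccp_alpha=0.001, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```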