SUPPORT VECTOR MACHINE

OVERVIEW

What is a Support Vector Machine (SVM)? 

A support vector machine (SVM) is a type of machine learning algorithm used for binary and multi-class classification tasks. SVMs are supervised learning models that analyze labeled training data and classify new data into different categories. They can also be used for regression (predicting continuous values) and outlier detection.

The basic idea behind SVM is to find a hyperplane that best separates the data into different classes by maximizing the margin between the classes. The margin is defined as the perpendicular distance between the hyperplane and the nearest data points from each class, called support vectors. SVMs aim to find the hyperplane that has the largest margin, which can result in better generalization performance and improved robustness to noisy data. 

In simple terms, SVM is like a virtual line that tries to separate different groups of data points in the best possible way. Imagine you have a bunch of colored dots on a piece of paper, and you want to draw a straight line to separate them into different groups. The goal of SVM is to find the best possible line that can accurately divide the dots into their respective groups. This line is called the "decision boundary" or "hyperplane" in SVM.

Types of Support Vector Machine

Support Vector Machines (SVM) can be classified into two types, namely 'Linear SVM' and 'Non-Linear SVM'.

Linear SVM comes into play when the data points form distinct groups that can be effortlessly separated by a single straight line in a two-dimensional space. It's like drawing a clear-cut line between two sets of dots on a piece of paper. 

Non-Linear SVM steps in when the data points are more tangled and cannot be accurately separated by a straight line in a 2D space. This happens when the data is not linearly separable and requires more advanced techniques, such as the kernel trick, to classify the data points effectively. The kernel trick transforms the data into a higher-dimensional space where the points may arrange themselves more neatly, making it easier for SVM to separate them accurately. In real-world scenarios, it is common to encounter data that doesn't follow a linear pattern, which is why Non-Linear SVM is often used to handle complex patterns and make precise predictions.

A simple intuition of Linearly separable and Non Linearly separable data.

Key terms of significance in SVM

Below are the three key terms that are frequently used in the context of Support Vector Machines.

Hyperplane: the decision boundary that separates the data points of different classes; in two dimensions it is simply a straight line.

Support Vectors: the data points from each class that lie closest to the hyperplane; these points determine the position of the hyperplane.

Margin: the perpendicular distance between the hyperplane and the support vectors; SVM tries to make this distance as large as possible.



How does Support Vector Machine algorithm work?

SVM is a powerful algorithm that works by finding an optimal hyperplane that best separates the data points into different classes. The walkthrough below shows, step by step, how this happens.

Let's understand the functioning of SVM with an example:

Consider a dataset consisting of two classes A and B, represented by the colors green and blue. Our objective is to use SVM to classify new data points given by (x,y) coordinates as either green or blue. The first step is to plot our already labeled training data on a plane. 

A support vector machine (SVM) uses the given data points to determine the optimal hyperplane (which is a straight line in two dimensions) that can best separate the different tags or classes. This hyperplane serves as the decision boundary, where data points falling on one side are classified as blue and those on the other side are classified as green. While there are multiple potential decision boundaries that could be used to classify the data points, finding the optimal hyperplane can be a challenging task.


The best hyperplane in SVM is the one that maximizes the distance between the classes. This is the primary objective of SVM. To achieve this, SVM considers multiple hyperplanes and selects the one that is farthest from the data points, or the one with the maximum margin, in order to effectively classify the labels or classes.
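
To make this concrete, here is a minimal sketch (not from the original text) of fitting a linear SVM on a small, made-up two-class dataset with scikit-learn; the coordinates and labels are invented purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy (x, y) coordinates for two classes; values are invented for illustration.
X = np.array([[1, 2], [2, 3], [2, 1],    # class "green"
              [6, 5], [7, 7], [8, 6]])   # class "blue"
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear SVM: it searches for the maximum-margin separating hyperplane.
clf = SVC(kernel="linear")
clf.fit(X, y)

print("w (normal vector):", clf.coef_[0])
print("b (bias term):", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
print("prediction for (3, 2):", clf.predict([[3, 2]]))
```

The fitted model exposes the hyperplane's normal vector (w), the bias term (b), and the support vectors, which are exactly the quantities discussed in the sections that follow.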

Mathematical Intuition behind Support Vector Machine

Dot-Product

In mathematics, the dot product, also known as the scalar product or inner product, is an operation that combines two vectors to produce a scalar value. Geometrically, it can be described as the length of the projection of one vector onto the other, multiplied by the magnitude of that other vector.

The dot product between two vectors, A and B, is calculated by first finding their magnitudes (using the Pythagorean theorem or the distance formula) and then multiplying the magnitudes by the cosine of the angle between them. Mathematically, it can be expressed as:

A . B = |A| cosθ * |B|

where |A| cosθ represents the projection of vector A onto vector B, and |B| represents the magnitude of vector B.
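
A small NumPy check (added here for illustration, with arbitrarily chosen vectors) shows that the component-wise dot product, the |A||B|cosθ form, and the projection interpretation all agree:

```python
import numpy as np

# Two example vectors (arbitrary values).
A = np.array([3.0, 4.0])
B = np.array([2.0, 0.0])

# Component-wise definition of the dot product: 3*2 + 4*0 = 6.
dot = np.dot(A, B)

# Geometric form: |A| * |B| * cos(theta) gives the same value.
cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
geometric = np.linalg.norm(A) * np.linalg.norm(B) * cos_theta

# |A| * cos(theta) is the length of the projection of A onto B,
# so the dot product is that projection length times |B|.
projection_length = dot / np.linalg.norm(B)   # = 3.0 here

print(dot, geometric, projection_length)
```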


Use of Dot Product in SVM

Suppose we have a point X and we want to determine if it lies on the right side of the plane or the left side of the plane (positive or negative).

To do this, we treat X as a vector and create another vector (w) that is perpendicular to the hyperplane. Let's assume the distance from the origin to the decision boundary along vector w is 'c'. Next, we calculate the projection of vector X onto vector w.

As we know, the projection of one vector onto another vector is represented by the dot product. So, we calculate the dot product of vectors X and w. If the dot product is greater than 'c', we can conclude that the point lies on the right side of the hyperplane. If the dot product is less than 'c', the point is on the left side, and if the dot product is equal to 'c', then the point lies on the decision boundary.

The reason for taking the perpendicular vector w to the hyperplane is to determine the distance of vector X from the decision boundary. Since there can be an infinite number of points on the boundary, it becomes impractical to measure the distance from each one. By using the perpendicular vector as a reference, we can take projections of all other data points on this vector and then compare the distances. This allows us to standardize the measurement and determine the relative distances of the data points from the decision boundary. 

The equation of a hyperplane in SVM is typically represented as w.x + b = 0, where:

w is the weight vector perpendicular (normal) to the hyperplane,

x is the input feature vector, and

b is the bias term, which shifts the hyperplane away from the origin.


Margin in SVM

We can use the equation of the hyperplane, w.x + b = 0, to define a decision rule for classifying a given point as negative or positive. The rule can be stated as follows: if the value of w.x + b is greater than or equal to zero, the input data point is classified as a positive point; otherwise it is classified as a negative point.

The SVM algorithm aims to find the optimal values of w and b such that the margin between the classes is maximized. The goal is to find the values of w and b that result in a decision boundary or hyperplane with the maximum distance or margin between the classes, denoted as 'd'. Maximizing the margin ensures a larger separation between the classes and potentially better generalization performance of the SVM model on unseen data points.
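
As a quick illustrative sketch, the decision rule can be written in a few lines of Python; the values of w and b below are arbitrary assumptions standing in for parameters learned by the algorithm:

```python
import numpy as np

# Assumed (made-up) hyperplane parameters standing in for learned values.
w = np.array([1.0, -1.0])   # vector perpendicular to the hyperplane
b = -0.5                    # bias term

def classify(x):
    """Decision rule: positive class if w.x + b >= 0, otherwise negative."""
    return +1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))   # w.x + b = 1.5  -> +1
print(classify(np.array([0.0, 2.0])))   # w.x + b = -2.5 -> -1
```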

To calculate the margin 'd', we need to establish the equations for the parallel lines L1 and L2, which are used to define the margin. For convenience, we assume that the equation of L1 is w.x + b = 1 and the equation of L2 is w.x + b = -1.

Now, the question arises as to why we chose the magnitudes of 1 and -1 for L1 and L2, respectively, and why we didn't choose other values like 1 and -2, or 24 and -100. The reason is that we want the hyperplane to have an equal distance from both classes, which means it should pass through the center of L1 and L2. Therefore, we choose the magnitudes of 1 and -1 to ensure this symmetry. 

The position of these parallel lines, which are used to define the margin, depends on the values of (w, b) of our hyperplane. If we multiply the equation of the hyperplane by a factor greater than 1, the parallel lines will shrink, and if we multiply it by a factor less than 1, they will expand. 

From this observation, we can conclude that the position of these lines will change as we make changes in (w, b), and this is how the optimization of SVM occurs. The objective of SVM is to maximize the margin, which represents the distance 'd'. But there are constraints on this distance. 

Optimization function and its constraints

To address the optimization problem in SVM, we make use of an optimization function. To derive it, we need to take into account certain constraints. The main constraint is that we will calculate the distance/margin (m) in such a way that no positive or negative point can cross the margin line. Expressed mathematically: w.xi + b >= 1 for every positive point, and w.xi + b <= -1 for every negative point.

Instead of dealing with two separate constraints, we can combine them into one. We assume that negative classes are represented by y = -1 and positive classes by y = 1. For every point to be correctly classified, the following condition should always hold true: yi(w.xi + b) >= 1.

If a positive (green) point is correctly identified, it satisfies w.xi + b >= 1, which, when multiplied by y = 1, yields the combined condition above. Likewise, if we repeat the process with a negative point where y = -1, we once more obtain the same condition. Therefore, in order to maximize the distance or the margin (m) between the two support vectors, we need to ensure that this constraint holds true.

Now let us select two support vectors, one from the negative class and the other from the positive class. The distance between these two vectors, denoted as x1 and x2 respectively, can be calculated as the vector (x2 - x1). To find the shortest distance between these points, we use a trick involving the dot product. We introduce a vector 'w' that is perpendicular to the hyperplane, and then find the projection of the vector (x2 - x1) onto 'w'. It's important to note that 'w' should be a unit vector for this to work. To make 'w' a unit vector, we divide it by the norm of 'w'.

We can now take the dot product of (x2 - x1) with the unit vector w/||w|| to find this projection, which gives the margin: m = (x2 - x1) . w / ||w||   ... (1)

As x2 and x1 are both support vectors lying on the margin lines L1 and L2, they satisfy the condition yi(w.x + b) = 1. This can be expressed as follows: for the positive support vector x2, w.x2 + b = 1, so w.x2 = 1 - b   ... (2); and for the negative support vector x1, w.x1 + b = -1, so w.x1 = -1 - b   ... (3).

Substituting equations (2) and (3) in equation (1), we get: m = (w.x2 - w.x1) / ||w|| = ((1 - b) - (-1 - b)) / ||w|| = 2 / ||w||

So we get the length of the margin (m) to be 2/||w||

Now, the optimization problem that we need to solve is: maximize m = 2/||w||, subject to yi(w.xi + b) >= 1 for all points.

From this, we can conclude that maximizing the margin m is the same as maximizing 2/||w||.

Now, maximizing 2/||w|| is the same as maximizing 1/||w||, which is the same as minimizing ||w||, which in turn is the same as minimizing ½||w||^2 (the squared form is used because it is easier to differentiate).

So, the goal of the optimization is now to find w and b such that ½||w||^2 is minimized, subject to yi(w.xi + b) >= 1 for all i.
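
In practice this constrained problem is solved by dedicated optimizers (as discussed in the next section), but as a rough illustrative sketch, a closely related soft-margin relaxation of the same objective (regularized hinge loss) can be minimized with simple subgradient descent. This is not the exact hard-margin formulation above, just a toy approximation on invented data:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Subgradient descent on the soft-margin objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w.x_i + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violators = margins < 1   # points inside or on the wrong side of the margin
        grad_w = lam * w - (y[violators, None] * X[violators]).sum(axis=0) / n
        grad_b = -y[violators].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny linearly separable toy data (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(w, b, np.sign(X @ w + b))   # the signs should match the labels
```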

Lagrangian Multipliers

The Lagrangian is a mathematical concept used in constrained optimization problems, to find the optimal solution by introducing Lagrange multipliers. The use of Lagrange multipliers in SVM optimization provides a mathematical framework for formulating and solving the optimization problem with constraints, identifying support vectors, and incorporating non-linear kernels, making it a powerful and widely used technique in machine learning. The general formulation of the Lagrangian is as follows: 


L(x, a) = f(x) + Σ a_i * g_i(x)

where:

f(x) is the original objective function to be minimized or maximized,

g_i(x) are the constraint functions, and

a_i are the Lagrange multipliers, one for each constraint.

The Lagrangian is used to convert the constrained optimization problem into an unconstrained optimization problem, which can be easier to solve. The Lagrange multipliers, represented by "a", are introduced to enforce the constraints of the problem by penalizing violations of the constraints in the objective function. 

The Lagrangian allows us to incorporate the constraints into the objective function by adding a weighted sum of the constraint functions, multiplied by the Lagrange multipliers, to the original objective function. The Lagrange multipliers act as weights that determine the importance of each constraint in the optimization problem. By varying the values of the Lagrange multipliers, we can find the optimal values of the variables x that satisfy the constraints and optimize the objective function.


Lagrangian Formulation

The Lagrangian Dual Problem: instead of minimizing over w and b subject to the original constraints, we can maximize over a (the dual variables), subject to the relations obtained previously for w and b.
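
For reference, the standard Lagrangian and dual formulation for the hard-margin SVM can be written as follows (a textbook formulation, supplied here since the original figure is not reproduced):

```latex
% Hard-margin SVM: primal Lagrangian, stationarity conditions, and dual problem.
\begin{aligned}
L(w, b, \alpha) &= \tfrac{1}{2}\lVert w \rVert^2
  - \sum_i \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0 \\[4pt]
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
  \qquad
  \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\[4pt]
\text{Dual: } \max_{\alpha \ge 0} &\;\sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \,(x_i \cdot x_j)
  \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0
\end{aligned}
```

Notice that in the dual, the training data appear only through the dot products x_i · x_j; this is exactly what makes the kernel trick, discussed below, possible.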

Why are SVMs often referred to as linear separators?

Support Vector Machines (SVMs) are often referred to as linear classifiers or linear separators, even though they are capable of handling non-linear data distributions. This is because SVMs find a linear decision boundary or hyperplane in the transformed feature space, which is equivalent to a non-linear decision boundary in the original feature space. 

The key idea behind SVMs is to find the hyperplane that maximizes the margin between the data points of different classes, while minimizing the classification error. The margin is defined as the perpendicular distance between the hyperplane and the nearest data points from each class, which are called support vectors. SVMs aim to find the hyperplane that has the largest margin, which results in better generalization performance and improved robustness to noisy data. 

In cases where the data is linearly separable, meaning that it is possible to draw a straight line or hyperplane that can perfectly separate the data points of different classes, SVMs find the optimal linear hyperplane that achieves maximum margin between the classes. This results in a linear decision boundary in the original feature space, which makes SVMs linear separators. 

However, SVMs are not limited to linearly separable data. They can also effectively handle non-linear data distributions by using a technique called the kernel trick. The kernel trick allows SVMs to implicitly transform the input data into a higher-dimensional feature space, where the data points may become linearly separable. This is achieved by using a kernel function, which is a mathematical function that computes the similarity or distance between pairs of data points in the original feature space by computing the dot product between pairs of data points, as if they were already transformed into the higher-dimensional space. The kernel trick is employed to overcome the computational burden of explicitly transforming the data into a higher-dimensional space, which can be computationally expensive or even infeasible for high-dimensional data. This makes SVMs versatile and able to handle complex data distributions.

Kernels in SVM

One of the most intriguing capabilities of Support Vector Machines (SVMs) is their ability to handle non-linear datasets. This is achieved through the use of a technique called the "Kernel Trick," which simplifies the classification of data points. Consider a dataset with a non-linear distribution as an example:

In this scenario, it is not feasible to draw a single straight line or hyperplane that can accurately classify the data points. To overcome this challenge, we transform the lower-dimensional space into a higher-dimensional space, for example by adding quadratic features. This transformation enables us to identify a decision boundary that effectively separates the data points. Kernel functions allow us to perform this transformation implicitly, working only with the dot products between pairs of data points.

Different Kernel functions: 

Below are some of the kernel functions that can be employed in SVM.

1. Linear kernel: This is the simplest kernel function. It computes the standard dot product between the input data points, K(X₁, X₂) = X₁ · X₂, and is suitable when the data is (approximately) linearly separable.

SVM classifier with linear kernel

2. Polynomial kernel: This kernel function introduces non-linearity by raising the dot product of the input data points (plus a constant) to a certain power. It has a hyperparameter, the degree of the polynomial, which determines the complexity of the decision boundary. The formula for the polynomial kernel is:

K(X₁, X₂) = (r + X₁ · X₂)^d

where r is a constant term and d represents the degree of the polynomial, both of which need to be specified manually.

Suppose we have two features, X1 and X2, and output variable Y. For two points X = (X1, X2) and X' = (X1', X2'), the degree-2 polynomial kernel can be written as:

K(X, X') = (1 + X1X1' + X2X2')^2

So, to represent this kernel as an explicit mapping, we need to compute X1^2, X2^2, and X1 * X2 in addition to X1 and X2, and as a result the original 2-dimensional data has been converted into 5 dimensions.
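
A short NumPy sketch (illustrative, using the textbook degree-2 feature map, which also includes a constant term and √2 scaling factors) confirms that the polynomial kernel value equals a dot product in the expanded feature space, so the mapping never has to be carried out explicitly:

```python
import numpy as np

def poly_kernel(a, b, r=1, d=2):
    """Polynomial kernel K(a, b) = (r + a.b)^d."""
    return (r + np.dot(a, b)) ** d

def phi(x):
    """Explicit degree-2 feature map for a 2D point (x1, x2):
    (1, sqrt(2)*x1, sqrt(2)*x2, x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# The kernel value equals the dot product of the explicitly mapped vectors.
print(poly_kernel(a, b))         # (1 + 1*3 + 2*(-1))^2 = 4
print(np.dot(phi(a), phi(b)))    # also 4 (up to floating-point error)
```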

SVM classifier with polynomial kernel

Understanding the math behind the operation of a polynomial kernel with a simple example-

Consider a simple example of SVM classification using a polynomial kernel with r=1 and d=2. We have a set of 2D points with their corresponding labels as follows:

Point A: (-1, 2), Label A: +1

Point B: (1, -1), Label B: -1

Point C: (3, 4), Label C: +1

Point D: (-2, -3), Label D: -1

We first apply the polynomial kernel with r=1 and d=2 to cast these 2D points into a higher-dimensional space.

Step 1: Compute the polynomial kernel function for each point

The polynomial kernel function is defined as: K(x, y) = (r + x · y)^d, where x · y is the ordinary dot product of the two 2D points, x₁y₁ + x₂y₂.

For point A (-1, 2):

K(A, A) = (1 + ((-1)(-1) + (2)(2)))^2 = (1 + 5)^2 = 36

K(A, B) = (1 + ((-1)(1) + (2)(-1)))^2 = (1 - 3)^2 = 4

K(A, C) = (1 + ((-1)(3) + (2)(4)))^2 = (1 + 5)^2 = 36

K(A, D) = (1 + ((-1)(-2) + (2)(-3)))^2 = (1 - 4)^2 = 9

For point B (1, -1):

K(B, A) = K(A, B) = 4

K(B, B) = (1 + ((1)(1) + (-1)(-1)))^2 = (1 + 2)^2 = 9

K(B, C) = (1 + ((1)(3) + (-1)(4)))^2 = (1 - 1)^2 = 0

K(B, D) = (1 + ((1)(-2) + (-1)(-3)))^2 = (1 + 1)^2 = 4

For point C (3, 4):

K(C, A) = K(A, C) = 36

K(C, B) = K(B, C) = 0

K(C, C) = (1 + ((3)(3) + (4)(4)))^2 = (1 + 25)^2 = 676

K(C, D) = (1 + ((3)(-2) + (4)(-3)))^2 = (1 - 18)^2 = 289

For point D (-2, -3):

K(D, A) = K(A, D) = 9

K(D, B) = K(B, D) = 4

K(D, C) = K(C, D) = 289

K(D, D) = (1 + ((-2)(-2) + (-3)(-3)))^2 = (1 + 13)^2 = 196

(The kernel is symmetric, so K(x, y) = K(y, x), which saves recomputing the repeated pairs.)

Step 2: Cast points into higher-dimensional space

We now represent each point by its vector of kernel values against all four points, i.e. its row of the kernel (Gram) matrix.

Point A: (36, 4, 36, 9)

Point B: (4, 9, 0, 4)

Point C: (36, 0, 676, 289)

Point D: (9, 4, 289, 196)

Step 3: Plot points in the higher-dimensional space

We now plot the points in the higher-dimensional space, using the kernel values computed above as each point's coordinates.


Now, we can use an SVM classifier to find the optimal decision boundary that separates the points with different labels in the higher-dimensional space.
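
The kernel values above can be verified with a few lines of NumPy; this snippet (added for illustration) builds the 4×4 kernel (Gram) matrix whose rows are the higher-dimensional representations used in Step 2:

```python
import numpy as np

# The four 2D points and labels from the worked example above.
X = np.array([[-1, 2],    # A, label +1
              [1, -1],    # B, label -1
              [3, 4],     # C, label +1
              [-2, -3]])  # D, label -1
y = np.array([1, -1, 1, -1])

def poly_kernel(a, b, r=1, d=2):
    """Polynomial kernel K(a, b) = (r + a.b)^d."""
    return (r + np.dot(a, b)) ** d

# Kernel (Gram) matrix: row i is point i's representation from Step 2.
K = np.array([[poly_kernel(a, b) for b in X] for a in X])
print(K)
# [[ 36   4  36   9]
#  [  4   9   0   4]
#  [ 36   0 676 289]
#  [  9   4 289 196]]

# An SVM can then be trained directly on this precomputed matrix, e.g.
# sklearn.svm.SVC(kernel="precomputed").fit(K, y)
```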


3. Radial basis function (RBF) kernel: This is a popular kernel function that is commonly used in SVMs. It uses a Gaussian function to compute the similarity between data points in a higher-dimensional space. The RBF kernel is particularly useful when the decision boundary is expected to be complex or when the data points are not clearly separable. Mathematically, it is defined as:

K(X₁, X₂) = exp(−||X₁ − X₂||² / (2σ²))

where,

'σ' represents the variance and serves as a hyperparameter

||X₁ – X₂|| denotes the Euclidean Distance between two points, X₁ and X₂

4. Sigmoid kernel: This kernel function uses a sigmoid activation function to map the input data points into a higher-dimensional space. It is less commonly used compared to the linear, polynomial, and RBF kernels. Mathematically, it is defined as:

K(X₁, X₂) = tanh(γ(X₁ · X₂) + r)

It passes the dot product of the inputs through an S-shaped (tanh) function, similar to the activation used in a neural network, squashing the values into a bounded range before a linear boundary is applied.

The selection of an appropriate kernel function depends on the specific characteristics of the data and the problem at hand, and it may require experimentation and model evaluation to determine the optimal kernel for a particular application.
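
As an illustrative sketch of such experimentation, the snippet below compares the four kernels discussed above on a synthetic non-linear dataset using scikit-learn's SVC; the dataset, parameter choices, and resulting accuracies are for demonstration only:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic non-linear data: one class forms a ring around the other.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# degree/coef0 correspond to d and r of the polynomial kernel, and gamma
# plays the role of 1/(2*sigma^2) in the RBF kernel.
kernels = [
    ("linear", {}),
    ("poly", {"degree": 2, "coef0": 1}),
    ("rbf", {"gamma": "scale"}),
    ("sigmoid", {"gamma": "scale", "coef0": 0}),
]
for name, params in kernels:
    clf = SVC(kernel=name, **params).fit(X_train, y_train)
    print(f"{name:8s} test accuracy: {clf.score(X_test, y_test):.2f}")
```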

Explore Yourself

Below is an illustrative demonstration for examining various kernels. This demonstration was created using Python and is hosted via Streamlit.
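
The hosted demo itself is not reproduced here, but a minimal sketch of what such a Streamlit app might look like is shown below; the dataset and widget choices are assumptions, not the original demo's code:

```python
# Minimal sketch of a Streamlit app for exploring SVM kernels (illustrative only).
import matplotlib.pyplot as plt
import numpy as np
import streamlit as st
from sklearn.datasets import make_moons
from sklearn.svm import SVC

kernel = st.selectbox("Kernel", ["linear", "poly", "rbf", "sigmoid"])
C = st.slider("Regularization parameter C", 0.01, 10.0, 1.0)

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel=kernel, C=C).fit(X, y)

# Plot the decision regions on a grid covering the data.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, Z, alpha=0.3)
ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
st.pyplot(fig)
```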

How can SVM classifiers be leveraged to address traffic incidents and fatalities?

Support Vector Machine (SVM) classifiers can play a significant role in addressing traffic incidents and fatalities. One of the primary applications of SVM in this context is predicting the likelihood of future accidents. By training SVM classifiers on historical data that includes information on road conditions, weather, time of day, and vehicle characteristics, these classifiers can generate predictions on the probability of accidents occurring. This information can be used to implement proactive measures to prevent accidents, such as optimizing traffic management strategies, enhancing road safety measures, and improving traffic flow in high-risk areas.

Another important application of SVM in traffic incidents and fatality analysis is predicting the severity of accidents. SVM classifiers can be trained on features such as impact force, vehicle speed, and collision type to predict the severity of accidents in real-time. This information can help emergency responders and medical personnel assess the severity of accidents and allocate appropriate resources for effective response and treatment.

Additionally, SVM classifiers can be used to predict the probability of fatalities in traffic accidents by considering factors such as the age, gender, and physical condition of involved individuals, as well as collision characteristics. This information can aid in implementing targeted safety campaigns, enforcing stricter regulations, and improving road infrastructure to reduce the risk of fatalities in traffic incidents.

Additionally, SVM classifiers can be utilized to analyze driver behavior data, such as acceleration, braking, and lane-changing patterns, to detect risky driving behaviors that may lead to traffic incidents and fatalities. This information can be used to identify high-risk drivers, implement targeted interventions, and improve driver training programs. Furthermore, SVM classifiers can be applied to analyze data from various sensors, such as cameras and sensors embedded in roadways, to detect hazards like debris, potholes, and other obstacles that may pose risks to traffic safety. This information can be used to trigger timely alerts, implement hazard mitigation measures, and improve road maintenance and infrastructure planning.

SVM classifiers have diverse applications in addressing traffic incidents and fatalities, ranging from accident prediction and severity assessment to driver behavior analysis and roadway hazard detection. Their ability to analyze labeled numeric data makes them a valuable tool in traffic safety analysis, allowing for proactive measures, targeted interventions, and improved road safety strategies to mitigate risks and prevent accidents on roadways.


AREA OF INTEREST

The US Accidents dataset is a rich source of data that contains information on traffic accidents and fatalities across the United States. By leveraging a Support Vector Machine classifier, it is possible to predict the severity of a crash based on various features such as the location, time, and weather conditions. Using an SVM classifier to predict the severity of accidents in the US accidents dataset can have significant implications for traffic safety. The SVM classifier can leverage the labeled numeric data to learn patterns and relationships between different features and the severity of accidents. This information can be used to implement proactive measures such as optimizing traffic management strategies, improving road safety measures, and allocating appropriate resources for emergency response. Additionally, the SVM classifier can provide insights into risky driving behaviors and aid in targeted interventions and driver training programs to prevent accidents or minimize their severity. With regular validation, updates, and retraining, the SVM classifier can be a valuable tool for data-driven decision-making in traffic safety management and accident prevention efforts in my specific area of interest.
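
As a purely hypothetical sketch of this idea, the snippet below trains an SVM severity classifier on a handful of numeric features; the file name and column names (Severity, Start_Lat, Start_Lng, Temperature(F), Humidity(%), Visibility(mi)) are assumptions about the dataset and may differ in practice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical file path and column names; adjust to the actual CSV.
df = pd.read_csv("US_Accidents.csv")
features = ["Start_Lat", "Start_Lng", "Temperature(F)",
            "Humidity(%)", "Visibility(mi)"]
df = df.dropna(subset=features + ["Severity"])

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Severity"], test_size=0.2, random_state=0)

# Feature scaling matters for SVMs, since the margin is distance-based.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```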