The Math behind Support Vector Machines
Introduction
A Support Vector Machine (SVM) is a powerful machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane (a decision boundary in high-dimensional space) that maximizes the margin, the distance between the hyperplane and the nearest data points of different classes. SVMs remain effective even when the data is not linearly separable: the "kernel trick" transforms the data into a higher-dimensional space, allowing them to handle complex, non-linear relationships. SVMs have found applications in diverse fields such as image classification, spam email detection, and medical diagnosis, thanks to their ability to make robust predictions and generalize well to new data.
Datasets
Let's assume the following images represent three different datasets, each containing two classes. The first dataset is easy to classify; however, the second and third offer some challenges.
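Since the original images are not reproduced here, below is a minimal sketch of three comparable toy datasets, assuming scikit-learn's make_blobs, make_moons, and make_circles generators (these are stand-ins, not the original data):

```python
from sklearn.datasets import make_blobs, make_moons, make_circles

# Dataset 1: two well-separated clusters (linearly separable)
X1, y1 = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

# Dataset 2: two interleaving half-moons with noise (only partially separable)
X2, y2 = make_moons(n_samples=200, noise=0.2, random_state=42)

# Dataset 3: one class encircling the other (not linearly separable)
X3, y3 = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=42)

print(X1.shape, X2.shape, X3.shape)  # each is (200, 2), labels in {0, 1}
```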
Understanding SVM for Classification
For an n-dimensional feature space, SVM creates an (n-1)-dimensional hyperplane to separate the classes. Any hyperplane can be represented as:
w^T X + b = 0
For example, if the data had a single feature (points on a line), the hyperplane would be a point. In our case we have 2 independent features and 1 dependent feature, so our hyperplane is a line with the equation:
w1 X1 + w2 X2 + b = 0
where w = [w1, w2]
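As a tiny sketch, this is how the decision function w^T X + b assigns 2-D points to a side of the hyperplane by sign; the weights and bias below are made-up numbers for illustration, not learned values:

```python
import numpy as np

# Hypothetical weights and bias; in practice the SVM learns these.
w = np.array([1.0, -2.0])   # w = [w1, w2]
b = 0.5

X = np.array([[2.0, 1.0],    # a few 2-D points
              [0.0, 0.0],
              [-1.0, 3.0]])

# w^T x + b is 0 on the hyperplane, positive on one side, negative on the other.
scores = X @ w + b
print(scores, np.sign(scores))
```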
Hard Margin
If the dataset can be linearly separated, like our first dataset (and partially the second), we can create two parallel hyperplanes (lines here), one for each class, so that every data point lies on or beyond its class's hyperplane. Our goal is to maximize the distance between these two parallel hyperplanes. The region between them is called the margin. The equations of the two hyperplanes are:
w^T X + b = -1 and w^T X + b = 1
The distance between them is 2/||w||, since the gap between the parallel planes w^T X + b = 1 and w^T X + b = -1 is |1 - (-1)| / ||w||. To maximize this distance, ||w|| should be as small as possible, which amounts to minimizing w^T w.
Objective
- Increase the margin by minimizing (1/2) w^T w
- To prevent any data point from falling inside the margin, the minimization is subject to the constraint
y_i (w^T X_i + b) >= 1
where y_i is the i-th value of the target and X_i is the i-th row of X
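A minimal sketch of an (approximately) hard-margin fit, assuming scikit-learn's SVC with a linear kernel and a very large C so that margin violations are effectively forbidden; the margin width 2/||w|| is recovered from the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data (stands in for "dataset 1").
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

# A very large C approximates a hard margin: essentially no violations allowed.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # [w1, w2]
b = clf.intercept_[0]
margin = 2 / np.linalg.norm(w)    # distance between w^T X + b = -1 and +1

print("w =", w, "b =", b)
print("margin width =", margin)
print("number of support vectors =", len(clf.support_vectors_))
```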
Soft Margin
If the dataset is not linearly separable (datasets 2 and 3; note that if one or more points in dataset 1 sit on the wrong side, it also becomes non-linearly separable), we use a soft margin. We introduce slack variables ζ(i), which determine how much margin violation is allowed.
Objective
- Now we have 2 conflicting objectives:
a. reduce the slack variables, to decrease margin violations
b. reduce (1/2) w^T w, to increase the margin
Thus we use a hyperparameter C, which defines the trade-off between the two. We are therefore minimizing:
(1/2) w^T w + C Σ_{i=1}^{m} ζ(i)
- The above equation is subject to
y_i (w^T X_i + b) >= 1 - ζ(i) and ζ(i) >= 0
for i = 1, 2, ..., m training instances
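A minimal sketch of the C trade-off on an overlapping toy dataset, assuming scikit-learn's SVC: a small C tolerates many margin violations (wide margin), while a large C penalizes them heavily (narrow margin):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Overlapping toy data: not linearly separable, so slack is required.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width={margin:.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```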
Kernels
In Support Vector Machines (SVM), kernels are a crucial component that allows SVMs to handle non-linearly separable data. Kernels are functions that implicitly transform the original input data from a lower-dimensional space into a higher-dimensional space, making it easier to find a hyperplane (decision boundary) that separates the data. For example, the third dataset is circular in 2D, but after lifting it into 3D it becomes easy to place a separating plane with a margin.
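To make the idea concrete, here is a minimal sketch using a hand-built mapping (x1, x2) → (x1, x2, x1² + x2²) on circle-like toy data; the mapping and dataset are assumptions for illustration, and kernels achieve the same effect implicitly without constructing the extra feature:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Hand-built third feature: squared distance from the origin.
# The inner circle gets small values, the outer ring large ones.
X3d = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

flat = SVC(kernel="linear").fit(X, y)        # struggles in the original 2-D space
lifted = SVC(kernel="linear").fit(X3d, y)    # a flat plane separates the lifted data

print("accuracy in 2D:", flat.score(X, y))
print("accuracy in 3D:", lifted.score(X3d, y))
```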
Here are some common kernels used in SVMs:
- Linear Kernel (Linear SVM):
The linear kernel is the simplest and is used for linearly separable data. It represents a dot product between the input features in the original space. Mathematically, it can be represented as:
K(x, x') = x . x'
Linear SVMs work well when the data can be effectively separated by a straight line (in 2D) or a hyperplane (in higher dimensions).
- Polynomial Kernel:
The polynomial kernel allows SVMs to capture complex, non-linear relationships in the data. It involves raising the dot product of the input data to a certain power. The degree of the polynomial determines the complexity of the decision boundary. Mathematically:
K(x, x') = (x . x' + c)^d
- d is the degree of the polynomial.
- c is a constant.
In SVM packages (e.g., scikit-learn's SVC), the degree parameter d is only used by the polynomial kernel.
Polynomial kernels are useful for data that doesn't have a clear linear separation.
- Radial Basis Function (RBF) Kernel:
The RBF kernel, also known as the Gaussian kernel, is highly flexible and suitable for data with complex non-linear patterns. It transforms the data into an infinite-dimensional space using a Gaussian function. Mathematically:
K(x, x') = e^(-γ ||x - x'||^2)
- gamma(γ) is a positive constant that determines the spread of the Gaussian.
RBF kernels are popular because they can handle various types of data distributions effectively. However, they come with more hyperparameters to tune.
- Sigmoid Kernel:
The sigmoid kernel is inspired by the logistic function and is used for binary classification problems. It can capture S-shaped decision boundaries. Mathematically:
K(x, x') = tanh(α x . x' + β)
- α and β are constants.
Sigmoid kernels are less commonly used than the others but can be effective in specific scenarios. A small NumPy sketch of the four kernel functions above is given below.
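A minimal sketch of the four kernel formulas written directly in NumPy; the constants c, d, γ, α, and β are arbitrary example values, not recommended settings:

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)                              # K(x, x') = x . x'

def polynomial_kernel(x, xp, c=1.0, d=3):
    return (np.dot(x, xp) + c) ** d                   # K(x, x') = (x . x' + c)^d

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))     # K(x, x') = e^(-γ||x - x'||^2)

def sigmoid_kernel(x, xp, alpha=0.1, beta=0.0):
    return np.tanh(alpha * np.dot(x, xp) + beta)      # K(x, x') = tanh(α x . x' + β)

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for name, k in [("linear", linear_kernel), ("polynomial", polynomial_kernel),
                ("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    print(f"{name:>10}: {k(x, xp):.4f}")
```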