The Math behind Support Vector Machines
Introduction
A Support Vector Machine (SVM) is a powerful machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane (a decision boundary in high-dimensional space) that maximizes the margin, the distance between the hyperplane and the nearest data points of different classes. SVMs remain effective even when the data is not linearly separable: the "kernel trick" transforms the data into a higher-dimensional space, allowing them to handle complex, non-linear relationships. SVMs have found applications in diverse fields such as image classification, spam email detection, and medical diagnosis, thanks to their ability to make robust predictions and generalize well to new data.
Datasets
Let's assume the following images represent three different datasets, each containing two classes. The first dataset is easy to classify; however, the second and third offer some challenges.
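Since the original images are not reproduced here, below is a minimal sketch of three comparable toy datasets, assuming scikit-learn's make_blobs, make_moons, and make_circles generators (these are stand-ins, not the original data):

```python
from sklearn.datasets import make_blobs, make_moons, make_circles

# Dataset 1: two well-separated clusters (linearly separable)
X1, y1 = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

# Dataset 2: two interleaving half-moons with noise (only partially separable)
X2, y2 = make_moons(n_samples=200, noise=0.2, random_state=42)

# Dataset 3: one class encircling the other (not linearly separable)
X3, y3 = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=42)

print(X1.shape, X2.shape, X3.shape)  # each is (200, 2), labels in {0, 1}
```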
Understanding SVM for Classification
For an n-dimensional feature space, SVM creates an (n-1)-dimensional hyperplane to separate the classes. Any hyperplane can be represented as:
w^T X + b = 0
For example, if the data had a single feature (points on a line), the hyperplane would be a point. In our case we have 2 independent features and 1 dependent feature, so our hyperplane is a line with the equation:
w1 X1 + w2 X2 + b = 0
where w = [w1, w2]
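As a tiny sketch, this is how the decision function w^T X + b assigns 2-D points to a side of the hyperplane by sign; the weights and bias below are made-up numbers for illustration, not learned values:

```python
import numpy as np

# Hypothetical weights and bias; in practice the SVM learns these.
w = np.array([1.0, -2.0])   # w = [w1, w2]
b = 0.5

X = np.array([[2.0, 1.0],    # a few 2-D points
              [0.0, 0.0],
              [-1.0, 3.0]])

# w^T x + b is 0 on the hyperplane, positive on one side, negative on the other.
scores = X @ w + b
print(scores, np.sign(scores))
```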
Hard Margin
If the dataset can be linearly separated, like our first dataset (and partially the second), we can create two parallel hyperplanes (lines here), one for each class, so that every data point lies on or beyond its class's hyperplane. Our goal is to maximize the distance between these two parallel hyperplanes. The region between them is called the margin. The equations of the two hyperplanes are:
w^T X + b = -1 and w^T X + b = 1
The distance between them is 2/||w||, since the gap between the parallel planes w^T X + b = 1 and w^T X + b = -1 is |1 - (-1)| / ||w||. To maximize this distance, ||w|| should be as small as possible, which amounts to minimizing w^T w.
Objective
- Increase the margin by minimizing (1/2) w^T w
- To prevent any data point from falling inside the margin, the minimization is subject to the constraint
y_i (w^T X_i + b) >= 1
where y_i is the i-th value of the target and X_i is the i-th row of X
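A minimal sketch of an (approximately) hard-margin fit, assuming scikit-learn's SVC with a linear kernel and a very large C so that margin violations are effectively forbidden; the margin width 2/||w|| is recovered from the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data (stands in for "dataset 1").
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

# A very large C approximates a hard margin: essentially no violations allowed.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # [w1, w2]
b = clf.intercept_[0]
margin = 2 / np.linalg.norm(w)    # distance between w^T X + b = -1 and +1

print("w =", w, "b =", b)
print("margin width =", margin)
print("number of support vectors =", len(clf.support_vectors_))
```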
Soft Margin
If the dataset is not linearly separable (datasets 2 and 3; note that if one or more points in dataset 1 sit on the wrong side, it also becomes non-linearly separable), we use a soft margin. We introduce slack variables ζ(i), which determine how much margin violation is allowed.
Objective
- Now we have 2 conflicting objectives:
a. reduce the slack variables, to decrease margin violations
b. reduce (1/2) w^T w, to increase the margin
Thus we use a hyperparameter C, which defines the trade-off between the two. We are therefore minimizing:
(1/2) w^T w + C Σ_{i=1}^{m} ζ(i)
- The above equation is subject to
y_i (w^T X_i + b) >= 1 - ζ(i) and ζ(i) >= 0
for i = 1, 2, ..., m training instances
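A minimal sketch of the C trade-off on an overlapping toy dataset, assuming scikit-learn's SVC: a small C tolerates many margin violations (wide margin), while a large C penalizes them heavily (narrow margin):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Overlapping toy data: not linearly separable, so slack is required.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width={margin:.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```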
Kernels
In Support Vector Machines (SVM), kernels are a crucial component that allows SVMs to handle non-linearly separable data. Kernels are functions that implicitly transform the original input data from a lower-dimensional space into a higher-dimensional space, making it easier to find a hyperplane (decision boundary) that separates the data. For example, the third dataset is circular in 2D, but after lifting it into 3D it becomes easy to place a separating plane with a margin.
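To make the idea concrete, here is a minimal sketch using a hand-built mapping (x1, x2) → (x1, x2, x1² + x2²) on circle-like toy data; the mapping and dataset are assumptions for illustration, and kernels achieve the same effect implicitly without constructing the extra feature:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Hand-built third feature: squared distance from the origin.
# The inner circle gets small values, the outer ring large ones.
X3d = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

flat = SVC(kernel="linear").fit(X, y)        # struggles in the original 2-D space
lifted = SVC(kernel="linear").fit(X3d, y)    # a flat plane separates the lifted data

print("accuracy in 2D:", flat.score(X, y))
print("accuracy in 3D:", lifted.score(X3d, y))
```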
Here are some common kernels used in SVMs:
- Linear Kernel (Linear SVM):
The linear kernel is the simplest and is used for linearly separable data. It represents a dot product between the input features in the original space. Mathematically, it can be represented as:
K(x, x') = x . x'
Linear SVMs work well when the data can be effectively separated by a straight line (in 2D) or a hyperplane (in higher dimensions).
- Polynomial Kernel:
The polynomial kernel allows SVMs to capture complex, non-linear relationships in the data. It involves raising the dot product of the input data to a certain power. The degree of the polynomial determines the complexity of the decision boundary. Mathematically:
K(x, x') = (x . x' + c)^d
- d is the degree of the polynomial.
- c is a constant.
In SVM packages (e.g., scikit-learn's SVC), the degree parameter d is only used by the polynomial kernel.
Polynomial kernels are useful for data that doesn't have a clear linear separation.
- Radial Basis Function (RBF) Kernel:
The RBF kernel, also known as the Gaussian kernel, is highly flexible and suitable for data with complex non-linear patterns. It transforms the data into an infinite-dimensional space using a Gaussian function. Mathematically:
K(x, x') = e^(-γ ||x - x'||^2)
- gamma(γ) is a positive constant that determines the spread of the Gaussian.
RBF kernels are popular because they can handle various types of data distributions effectively. However, they come with more hyperparameters to tune.
- Sigmoid Kernel:
The sigmoid kernel is inspired by the logistic function and is used for binary classification problems. It can capture S-shaped decision boundaries. Mathematically:
K(x, x') = tanh(α x . x' + β)
- α and β are constants.
Sigmoid kernels are less commonly used than the others but can be effective in specific scenarios. A small NumPy sketch of the four kernel functions above is given below.
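A minimal sketch of the four kernel formulas written directly in NumPy; the constants c, d, γ, α, and β are arbitrary example values, not recommended settings:

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)                              # K(x, x') = x . x'

def polynomial_kernel(x, xp, c=1.0, d=3):
    return (np.dot(x, xp) + c) ** d                   # K(x, x') = (x . x' + c)^d

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))     # K(x, x') = e^(-γ||x - x'||^2)

def sigmoid_kernel(x, xp, alpha=0.1, beta=0.0):
    return np.tanh(alpha * np.dot(x, xp) + beta)      # K(x, x') = tanh(α x . x' + β)

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for name, k in [("linear", linear_kernel), ("polynomial", polynomial_kernel),
                ("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    print(f"{name:>10}: {k(x, xp):.4f}")
```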