SVM: from classification to optimization
Solving the classification problem starts with understanding the intuition
Support vector machine: we will look at the math behind SVM and see how solving the classification problem leads to solving an optimization problem.
Goal: consider the problem of classifying students who apply to high-ranked colleges, i.e., predicting whether or not they get admitted.
We have n students, and for each student there is a vector of predictors (features that contribute to the classification): X1, ..., Xn.
An important point to note: each Xi is a vector, not a single number.
We are going to come up with a supervised learning algorithm that uses the predictors and generates a response.
Let's plot the predictors in two dimensions, taking only GPA and TOEFL score into consideration.
Here is the plot:
Consider the equation of a line, y = mx + b: here m plays the role of the weight vector (w) and x is the training-data vector for each student.
What's the spirit of SVM? We want to create a maximum margin classifier: we choose the decision boundary in the middle, such that it maximizes the space between the classes.
This is the equation of the hyperplane, where b is the intercept and W1 through Wp are the coefficients. Each vector X is p-dimensional (one entry per feature under consideration).
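For reference, in the usual notation the hyperplane equation is:

$$ w \cdot x + b \;=\; w_1 x_1 + w_2 x_2 + \dots + w_p x_p + b \;=\; 0 $$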
W and b are the only coefficients we care about in order to find the best margin between the hyperplanes (good values of W and b maximize the margin). W is the vector of weights we are choosing.
We want to know the size of the margin, i.e., the distance between the decision boundary and the margin hyperplanes on either side.
First observation: the normal vector of the decision boundary is W, i.e., W is perpendicular to the hyperplane.
We can derive this from the fact that two vectors are orthogonal when their dot product is zero: for any two points X1 and X2 on the hyperplane, W·X1 + b = 0 and W·X2 + b = 0, so W·(X1 - X2) = 0, meaning W is orthogonal to every direction that lies within the hyperplane.
Let's consider a point X that lies on the decision boundary (the middle line in the diagram above), so it satisfies the hyperplane equation W·X + b = 0. We want to know how far we have to walk from X to reach the margin hyperplane. The unit vector W/||W|| points in the same direction as W, so we need to find how many units k to walk in the W direction to end up on the hyperplane above the decision boundary.
In other words, we are looking at the point X plus k times the unit vector in the W direction, i.e., X + k·(W/||W||), walking toward the margin hyperplane above the decision boundary.
Second observation: the distance from the decision boundary to either margin hyperplane is 1/||W||.
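To see where the 1/||W|| comes from, here is the short derivation, using the convention that the upper margin hyperplane satisfies w · x + b = 1:

$$ w \cdot \Bigl(x + k \tfrac{w}{\|w\|}\Bigr) + b = 1 \;\;\Rightarrow\;\; \underbrace{(w \cdot x + b)}_{=\,0} + k\,\frac{w \cdot w}{\|w\|} = 1 \;\;\Rightarrow\;\; k\,\|w\| = 1 \;\;\Rightarrow\;\; k = \frac{1}{\|w\|} $$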
Now we know the margin size. The goal of SVM is to maximize it, and since the margin is 1/||W||, the maximum margin is found by minimizing ||W|| (the magnitude of the weight vector).
1st part of SVM: From the above, we have to come up with W and b such that we minimize the magnitude of W.
W and b should be picked subject to the classification constraints: everything on one side of the margin should be classified as +1 and everything on the other side as -1.
Mathematically, every vector should obey the condition that points on its margin hyperplane, or beyond it, satisfy the equation below:
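With the usual labeling convention (say yi = +1 for admitted and yi = -1 for not admitted), that condition is:

$$ y_i\,(w \cdot x_i + b) \;\ge\; 1 \qquad \text{for all } i = 1, \dots, N $$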
Two conclusions so far:
- We need to maximize the margin between the hyperplanes, and for that, we need to minimize ||W|| (the magnitude of W).
- We need to do this while the above constraint involving yi is obeyed.
So we arrive at the "Hard Margin SVM", which assumes the data is linearly separable and does not allow any misclassification.
Now we can say what the support vectors are: think about the "equality" case of the above expression, the points that lie exactly on the solid hyperplanes above and below the decision boundary, i.e., the points with yi(W·Xi + b) = 1.
Any point for which this quantity is not exactly 1 is not a support vector; those points sit strictly beyond the margin on one side or the other.
But the real world is different. We are not going to have that linear separability with real-world data, so we should not be too rigid about misclassifications.
There are cases where students have a high GPA and a high TOEFL score but do not get into a high-ranked college, or the other way around.
Hard margin SVM doesn't know what to do when we can't achieve a clean linear separability.
So the margin should not be hard; it should be able to accept a certain level of misclassification.
You guessed it: there is a penalty factor based on how big the misclassification is.
The quantity that is the driving force behind soft margin SVM is the hinge loss, a cost function (it measures how well the model performs, specifically for SVM).
The score is nothing but the value of the hyperplane equation: a data point (new observation) beyond the positive margin gets a score > 1, and similarly a data point beyond the other margin gets a score < -1. An observation between the two margins gets a score between -1 and +1, i.e., 0 to +1 or 0 to -1 depending on which side of the decision boundary it falls, inside the margin.
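Written out, with s = w · x + b as the score and y ∈ {-1, +1} as the true label, the hinge loss takes the standard form:

$$ \ell_{\text{hinge}}(y, s) \;=\; \max\bigl(0,\; 1 - y \cdot s\bigr) $$

It is 0 when y · s ≥ 1 (correct and outside the margin), between 0 and 1 when the prediction is correct but inside the margin, and greater than 1 when the point is misclassified.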
The loss reflects how confidently, and how correctly, each observation is predicted: the more confident a correct prediction is, the lower the loss.
If an observation lies beyond the positive margin, its score is > 1, and if the truth is also the positive class, the hinge loss there is zero.
Let's break down the hinge loss for observations 1, 2, 3, and 4 shown in the figure above.
Observation 1: the green data point is correctly classified and sits beyond the margin, so let's see the loss: it is 0.
Observation 2: the red data point is on the correct side of the margin and is correctly classified, so its loss is also 0.
No loss till now. We are happy about how the observations are getting classified.
Observation 3: a student who does not get into a high-ranked college but lands on the wrong side of the margin, where they are classified as admitted. Based on the margin, the score is > 1 even though the true label is negative.
Hence, for this observation, we assign a hinge loss > 1 (a penalty to the model).
Observation 4: here the data point is between the margins, but according to the decision boundary the classification is correct, so the score lies between -1 and +1 with the right sign. Why a penalty at all? Because the point is too close to the decision boundary: our goal is to maximize the margin for the classifier, and this point sits inside the margin, so we want to give it a small penalty.
This is what we want: the classifier is a little less confident in this correct prediction, so we assign a small hinge loss.
To summarize: give zero loss to correct classifications with good confidence beyond the margin, give a loss > 1 to points that are incorrectly classified, and give a loss between 0 and 1 to points that are on the correct side of the decision boundary but inside the margin.
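As a quick sanity check, here is a minimal Python sketch of the hinge loss applied to scores that mimic the four observations above; the numeric scores are made up for illustration and are not taken from the figure.

```python
def hinge_loss(y_true, score):
    """Hinge loss for a single observation: max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y_true * score)

# Illustrative (made-up) scores; real values depend on the fitted w and b.
# obs 1: positive point well beyond the positive margin     -> loss 0
# obs 2: negative point well beyond the negative margin     -> loss 0
# obs 3: negative point misclassified beyond the +1 margin  -> loss > 1
# obs 4: point on the correct side but inside the margin    -> loss in (0, 1)
observations = [(+1, 2.5), (-1, -3.0), (-1, 1.8), (-1, -0.4)]
for i, (y, s) in enumerate(observations, start=1):
    print(f"observation {i}: hinge loss = {hinge_loss(y, s):.2f}")
```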
Let's come up with the soft margin SVM.
When compared to hard margin SVM, in the soft margin we are trying to accomplish two things simultaneously with real-world data (both appear in the objective sketched right after this list):
- Maximizing the margin (separability between the classes)
- Keeping the total loss as low as possible, since we now allow misclassifications
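A common textbook way to write that combined objective, with a tunable penalty parameter C trading off the two goals (the exact notation may differ from the figure), is:

$$ \min_{w,\,b}\;\; \frac{1}{2}\,\|w\|^2 \;+\; C \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i\,(w \cdot x_i + b)\bigr) $$

The first term maximizes the margin (by minimizing ||w||), while the second term is the total hinge loss over the training data.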
Now that we know the formulations of hard and soft margin SVM, we need to solve the problem. This is where the concept of duality comes in: primal and dual problems.
This comes from optimization theory. By the duality principle, an optimization problem can be viewed from either of two perspectives: the primal problem or the dual problem. If the primal problem is a minimization, then the dual is a maximization, and vice versa. Solving one of the problems leads to solving the other.
Coming back to SVM, we are going to express it in an alternate form, and we will see why solving that alternate form is better.
Primal problem: (Original formulation)
Minimization: (Hard Margin SVM)
The goal is to maximize the size of the margin between two classes, so the classes are clearly separable by the best possible decision boundary.
The condition is that every data point i from 1 to N must lie either exactly on the margin (value = 1, the support vectors) or outside the margin (value > 1).
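Written out in the usual ½||w||² convention (equivalent to minimizing ||w||, and more convenient for the calculus below), the primal hard margin problem is:

$$ \min_{w,\,b}\;\; \frac{1}{2}\,\|w\|^2 \qquad \text{subject to} \quad y_i\,(w \cdot x_i + b) \;\ge\; 1, \quad i = 1, \dots, N $$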
In order to express the above primal problem in dual form, we need to introduce Lagrange multipliers (alpha i), one for every data point vector.
Let's look at the Lagrangian Formulation of the above problem:
This is obtained from the hard margin SVM objective function above, subtracting the sum of all N constraints, with each data point's constraint multiplied by its Lagrange multiplier.
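In that convention, with one multiplier αi ≥ 0 per data point, the Lagrangian reads:

$$ \mathcal{L}(w, b, \alpha) \;=\; \frac{1}{2}\,\|w\|^2 \;-\; \sum_{i=1}^{N} \alpha_i \,\bigl[\, y_i\,(w \cdot x_i + b) - 1 \,\bigr] $$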
Why is this formulation important? It contains W and b, which determine the size of the margin, together with the Lagrange multipliers.
The general idea behind Lagrange multipliers is that we can turn a constrained optimization problem into an unconstrained one: the constrained minimization is traded for optimizing the Lagrangian function instead. At the optimum, this Lagrangian formulation also satisfies the KKT (Karush-Kuhn-Tucker) conditions.
If we know which constraints are active (binding) and which are slack, we can use Lagrangian methods directly; the KKT conditions give us a way of checking which constraints are active and which are slack (complementary slackness). The multipliers in the KKT conditions act like penalties for violating the constraints. For example, for a constraint g(x) ≤ c in a maximization problem, the Lagrangian function is f(x) - λ·(g(x) - c), where f(x) is the original objective function, g(x) is the constraint function, and λ is the Lagrange multiplier.
Now we come to the dual part of the problem. The hard margin SVM formulation is equivalent to the one below, which means solving one solves the other.
Now we convert the above Lagrangian formulation into an equivalent dual problem:
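In the usual notation, that equivalent dual (max-min) problem is:

$$ \max_{\alpha_i \,\ge\, 0}\;\; \min_{w,\,b}\;\; \mathcal{L}(w, b, \alpha) $$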
Let's break down this dual problem:
Let's say we have the constraint that all Lagrange multipliers are greater than or equal to 0. For a fixed set of alphas we solve the inner sub-problem: minimize the Lagrangian formulation above over different values of W and b, and record the minimum value of the Lagrangian. Then we backtrack, pick a different set of Lagrange multipliers, and solve the corresponding minimization problem again. Which values of alpha do we settle on? The set whose inner minimization gives the maximum value across all possible sets of Lagrange multipliers.
No matter which values of alpha we choose in the above problem, the inner minimization must meet the stationarity condition: at the minimum, the Lagrangian is flat with respect to W and b, so small changes in them produce no change in its value.
Stationarity condition: the partial derivatives must equal 0. For our Lagrangian formulation, we take partial derivatives with respect to W and b.
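Setting those partial derivatives to zero gives the two standard conditions:

$$ \frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i\, y_i\, x_i, \qquad\qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i\, y_i = 0 $$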
An important point to note: only the data samples (vectors) that contribute to the margin are support vectors. The weight vector w in the first equation defines the margin itself, so the only vectors allowed to contribute to w are the support vectors.
So if Xi is not a support vector, its alpha i must equal 0, because only the vectors that contribute to the definition of w are support vectors.
Now we substitute the above expressions into the Lagrangian formulation to rewrite it in terms of alpha alone, which is easier to manipulate. Once we have the best alphas, we can recover the best w and b from the stationarity equation for w.
Since alpha i equals 0 for non-support vectors, and for real-world data only a few data vectors lie exactly on the margin, most of the terms vanish.
First simplification: in the above equation we iterate only over pairs of support vectors, not over all data points, since alpha is 0 for non-support vectors and those terms cancel entirely.
Hence, after the first simplification, the sum can be written as the following:
Dual Formulation of the SVM problem
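In the usual textbook notation, this dual problem reads:

$$ \max_{\alpha}\;\; \sum_{i=1}^{N} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i\, \alpha_j\, y_i\, y_j\, (x_i \cdot x_j) \qquad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{N} \alpha_i\, y_i = 0 $$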
We can also flip the sign of the final expression (take the negative of the inner term) and treat it as a minimization problem, so that we have some symmetry between the primal and the dual.
Big takeaway: solving the primal problem is equivalent to finding the best alphas to solve the dual problem.
Why do we care about solving the problem in its dual form?
- Consider high-dimensional data, such as text embeddings, where there are far more columns (dimensions) than rows.
The number of dimensions p >> N, i.e., p is much bigger than the number of samples; multiplying both sides by N gives Np >> N².
With the original formulation we have to work with all N·p entries of the Xi, but the key point of the dual formulation above is that Xi appears only inside dot products, i.e., the dot product of one vector with another.
What is the maximum number of dot products we have to compute? With N vectors there are only N² dot products. The dual formulation doesn't need the original data, just the inner products between the vectors, so it is computationally less expensive. It is also kernel-friendly, but since I don't know about kernels for now, I am keeping that for another day!
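As a rough illustration of that point, here is a minimal Python sketch (the sizes and data are made up) showing that evaluating the dual objective only ever touches the N × N matrix of inner products, no matter how large p is:

```python
import numpy as np

# Hypothetical sizes for illustration: few samples, very high-dimensional features (p >> N).
N, p = 100, 50_000
rng = np.random.default_rng(0)
X = rng.normal(size=(N, p))          # N training vectors, each p-dimensional
y = rng.choice([-1.0, 1.0], size=N)  # class labels in {-1, +1}

# The dual objective only ever needs the pairwise inner products x_i . x_j:
gram = X @ X.T                       # shape (N, N): N^2 dot products, independent of p

# For any candidate multipliers alpha (random here, just to evaluate the objective),
# sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
# is computable from the Gram matrix alone:
alpha = rng.uniform(0.0, 1.0, size=N)
dual_objective = alpha.sum() - 0.5 * (alpha * y) @ gram @ (alpha * y)
print(gram.shape, dual_objective)
```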