# SVM - Understanding the math - the optimal hyperplane

This is the Part 3 of my series of tutorials about the math behind Support Vector Machine.

If you did not read the previous articles, you might want to start the serie at the beginning by reading this article: an overview of Support Vector Machine.

Here is a quick summary of what we will see:

• How can we find the optimal hyperplane ?
• How do we calculate the distance between two hyperplanes ?
• What is the SVM optimization problem ?

## How to find the optimal hyperplane ?

At the end of Part 2 we computed the distance  $\|p\|$ between a point  $A$ and a hyperplane. We then computed the margin which was equal to $2 \|p\|$.

However, even if it did quite a good job at separating the data it was not the optimal hyperplane.

Figure 1: The margin we calculated in Part 2 is shown as M1

As we saw in Part 1, the optimal hyperplane  is the one which maximizes the margin of the training data.

In Figure 1, we can see that the margin $M_1$, delimited by the two blue lines, is not the biggest margin separating perfectly the data. The biggest margin is the margin $M_2$ shown in Figure 2 below.

Figure 2: The optimal hyperplane is slightly on the left of the one we used in Part 2.

You can also see the optimal hyperplane on Figure 2. It is slightly on the left of our initial hyperplane. How did I find it ? I simply traced a line crossing $M_2$ in its middle.

Right now you should have the feeling that hyperplanes and margins are closely related. And you would be right!

If I have an hyperplane I can compute its margin with respect to some data point. If I have a margin delimited by two hyperplanes (the dark blue lines in Figure 2), I can find a third hyperplane passing right in the middle of the margin.

Finding the biggest margin, is the same thing as finding the optimal hyperplane.

## How can we find the biggest margin  ?

It is rather simple:

1. You have a dataset
2. select two hyperplanes which separate the data with no points between them
3. maximize their distance (the margin)

The region bounded by the two hyperplanes will be the biggest possible margin.

If it is so simple why does everybody have so much pain understanding SVM ?
It is because as always the simplicity requires some abstraction and mathematical terminology to be well understood.

So we will now go through this recipe step by step:

### Step 1: You have a dataset $\mathcal{D}$ and you want to classify it

Most of the time your data will be composed of $n$ vectors $\mathbf{x}_i$.

Each $\mathbf{x}_i$ will also be associated with a value $y_i$ indicating if the element belongs to the class (+1) or not (-1).

Note that $y_i$  can only have two possible values -1 or +1.

Moreover, most of the time, for instance when you do text classification, your vector $\mathbf{x}_i$ ends up having a lot of dimensions. We can say that $\mathbf{x}_i$ is a $p$-dimensional vector if it has $p$ dimensions.

So your dataset $\mathcal{D}$ is the set of $n$ couples of element $(\mathbf{x}_i, y_i)$

The more formal definition of an initial dataset in set theory is :

### Step 2: You need to select two hyperplanes separating the data with no points between them

Finding two hyperplanes separating some data is easy when you have a pencil and a paper. But with some $p$-dimensional data it becomes more difficult because you can't draw it.

Moreover, even if your data is only 2-dimensional it might not be possible to find a separating hyperplane !

You can only do that if your data is linearly separable

Figure 3: Data on the left can be separated by an hyperplane, while data on the right can't

So let's assume that our dataset $\mathcal{D}$ IS linearly separable. We now want to find two hyperplanes with no points between them, but we don't have a way to visualize them.

What do we know about hyperplanes that could help us ?

#### Taking another look at the hyperplane equation

We saw previously, that the equation of a hyperplane  can be written

However, in the Wikipedia article about Support Vector Machine it is said that :

Any hyperplane can be written as the set of points $\mathbf{x}$ satisfying $\mathbf{w}\cdot\mathbf{x}+b=0\$.

First, we recognize another notation for the dot product, the article uses $\mathbf{w}\cdot\mathbf{x}$ instead of  $\mathbf{w}^T\mathbf{x}$.

You might wonder... Where does the $+b$ comes from ? Is our previous definition incorrect ?

Not quite. Once again it is a question of notation. In our definition the vectors $\mathbf{w}$ and $\mathbf{x}$ have three dimensions, while in the Wikipedia definition they have two dimensions:

Given two 3-dimensional vectors $\mathbf{w}(b,-a,1)$ and $\mathbf{x}(1,x,y)$

Given two 2-dimensional vectors $\mathbf{w^\prime}(-a,1)$ and $\mathbf{x^\prime}(x,y)$

Now if we add $b$ on both side of the equation $(2)$ we got :

For the rest of this article we will use 2-dimensional vectors (as in equation (2)).

Given a hyperplane $H_0$ separating the dataset and satisfying:

We can select two others hyperplanes $H_1$ and $H_2$ which also separate the data and have the following equations :

and

so that $H_0$ is equidistant from $H_1$ and $H_2$.

However, here the variable $\delta$ is not necessary. So we can set $\delta=1$ to simplify the problem.

and

Now we want to be sure that they have no points between them.

We won't select any hyperplane, we will only select those who meet the two following constraints:

For each vector $\mathbf{x_i}$ either :

or

#### Understanding the constraints

On the following figures, all red points have the class $1$ and all blue points have the class $-1$.

So let's look at Figure 4 below and consider the point $A$. It is red so it has the class $1$ and we need to verify it does not violate the constraint $\mathbf{w}\cdot\mathbf{x_i} + b \geq1\$

When $\mathbf{x_i} = A$  we see that the point is on the hyperplane so $\mathbf{w}\cdot\mathbf{x_i} + b =1\$ and the constraint is respected. The same applies for $B$.

When $\mathbf{x_i} = C$  we see that the point is above the hyperplane so $\mathbf{w}\cdot\mathbf{x_i} + b >1\$ and the constraint is respected. The same applies for $D$, $E$, $F$ and $G$.

With an analogous reasoning you should find that the second constraint is respected for the class $-1$.

Figure 4: Two hyperplanes satisfying the constraints

On Figure 5, we see another couple of hyperplanes respecting the constraints:

Figure 5: Two hyperplanes also satisfying the constraints

And now we will examine cases where the constraints are not respected:

Figure 6: The right hyperplane does not satisfy the first constraint

Figure 7: The left hyperplane does not satisfy the second constraint

Figure 8: Both constraint are not satisfied

What does it means when a constraint is not respected ? It means that we cannot select these two hyperplanes. You can see that every time the constraints are not satisfied (Figure 6, 7 and 8) there are points between the two hyperplanes.

By defining these constraints, we found a way to reach our initial goal of selecting two hyperplanes without points between them.  And it works not only in our examples but also in $p$-dimensions !

#### Combining both constraints

In mathematics, people like things to be expressed concisely.

Equations (4) and (5) can be combined into a single constraint:

And multiply both sides by $y_i$ (which is always -1 in this equation)

Which means equation (5) can also be written:

In equation (4), as $y_i =1$ it doesn't change the sign of the inequation.

We combine equations (6) and (7) :

We now have a unique constraint (equation 8) instead of two (equations 4 and 5) , but they are mathematically equivalent. So their effect is the same (there will be no points between the two hyperplanes).

### Step 3: Maximize the distance between the two hyperplanes

This is probably be the hardest part of the problem. But don't worry, I will explain everything along the way.

#### a) What is the distance between our two hyperplanes ?

Before trying to maximize the distance between the two hyperplane, we will first ask ourselves: how do we compute it ?

Let:

• $\mathcal{H}_0$ be the hyperplane having the equation $\textbf{w}\cdot\textbf{x} + b = -1$
• $\mathcal{H}_1$ be the hyperplane having the equation $\textbf{w}\cdot\textbf{x} + b = 1$
• $\textbf{x}_0$ be a point in the hyperplane $\mathcal{H}_0$.

We will call $m$ the perpendicular distance from $\textbf{x}_0$ to the hyperplane $\mathcal{H}_1$ . By definition, $m$ is what we are used to call the margin.

As $\textbf{x}_0$ is in $\mathcal{H}_0$, $m$ is the distance between hyperplanes $\mathcal{H}_0$ and $\mathcal{H}_1$ .

We will now try to find the value of $m$.

Figure 9: m is the distance between the two hyperplanes

You might be tempted to think that if we add $m$ to $\textbf{x}_0$ we will get another point, and this point will be on the other hyperplane !

But it does not work, because $m$ is a scalar, and $\textbf{x}_0$ is a vector and adding a scalar with a vector is not possible. However, we know that adding two vectors is possible, so if we transform $m$ into a vector we will be able to do an addition.

We can find the set of all points which are at a distance $m$ from  $\textbf{x}_0$. It can be represented as a circle :

Figure 10: All points on the circle are at the distance m from x0

Looking at the picture, the necessity of a vector become clear. With just the length $m$ we don't have one crucial information : the direction. (recall from Part 2 that a vector has a magnitude and a direction).

We can't add a scalar to a vector, but we know if we multiply a scalar with a vector we will get another vector.

From our initial statement, we want  this vector:

1. to have a magnitude of $m$
2. to be perpendicular to the hyperplane $\mathcal{H}_1$

Fortunately, we already know a vector perpendicular to $\mathcal{H}_1$, that is $\textbf{w}$ (because  $\mathcal{H}_1 = \textbf{w}\cdot\textbf{x} + b = 1$)

Figure 11: w is perpendicular to H1

Let's define $\textbf{u} = \frac{\textbf{w}}{\|\textbf{w}\|}$ the unit vector of $\textbf{w}$. As it is a unit vector $\|\textbf{u}\| = 1$ and it has the same direction as $\textbf{w}$ so it is also perpendicular to the hyperplane.

Figure 12: u is also is perpendicular to H1

If we multiply $\textbf{u}$ by $m$ we get the vector $\textbf{k} = m\textbf{u}$ and :

1. $\|\textbf{k}\| = m$
2. $\textbf{k}$ is perpendicular to $\mathcal{H}_1$ (because it has the same direction as $\textbf{u}$)

From these properties we can see that $\textbf{k}$ is the vector we were looking for.

Figure 13: k is a vector of length m perpendicular to H1

We did it ! We transformed our scalar $m$ into a vector $\textbf{k}$ which we can use to perform an addition with the vector $\textbf{x}_0$.

If we start from the point $\textbf{x}_0$ and add $k$ we find that the point $\textbf{z}_0 = \textbf{x}_0 + \textbf{k}$ is in the hyperplane $\mathcal{H}_1$  as shown on Figure 14.

Figure 14: z0 is a point on H1

The fact that $\textbf{z}_0$ is in $\mathcal{H}_1$ means that

We can replace $\textbf{z}_0$ by $\textbf{x}_0+\textbf{k}$ because that is how we constructed it.

We can now replace $\textbf{k}$ using equation $(9)$

We now expand equation $(12)$

The dot product of a vector with itself is the square of its norm so :

As $\textbf{x}_0$ is in $\mathcal{H}_0$ then $\textbf{w}\cdot\textbf{x}_0 +b = -1$

This is it ! We found a way to compute $m$.

#### b) How to maximize the distance between our two hyperplanes

We now have a formula to compute the margin:

The only variable we can change in this formula is the norm of  $\mathbf{w}$.

Let's try to give it different values:

When $\|\textbf{w}\|=1$ then $m=2$

When $\|\textbf{w}\|=2$ then $m=1$

When $\|\textbf{w}\|=4$ then $m=\frac{1}{2}$

One can easily see that the bigger the norm is, the smaller the margin become.

### Maximizing the margin is the same thing as minimizing the norm of $\textbf{w}$

Our goal is to maximize the margin. Among all possible hyperplanes meeting the constraints,  we will choose the hyperplane with the smallest $\|\textbf{w}\|$ because it is the one which will have the biggest margin.

This give us the following optimization problem:

Minimize in ($\textbf{w}, b$)

$\|\textbf{w}\|$

subject to $y_i(\mathbf{w}\cdot\mathbf{x_i}+b) \geq 1$

(for any $i = 1, \dots, n$)

Solving this problem is like solving and equation. Once we have solved it, we will have found the couple ($\textbf{w}, b$)  for which $\|\textbf{w}\|$ is the smallest possible and the constraints we fixed are met. Which means we will have the equation of the optimal hyperplane !

## Conclusion

We discovered that finding the optimal hyperplane requires us to solve an optimization problem.  Optimization problems are themselves somewhat tricky. And you need more background information to be able to solve them.  So we will go step by step. Let us discover unconstrained minimization problems in Part 4! Thanks for reading.

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect with my readers.

## 145 thoughts on “SVM - Understanding the math - the optimal hyperplane”

1. dragon518

Thank you very much! I have never read a technology blog more fundamental, detailed, clear than yours before. You are a specialist on SVM. BTW, look forward to read part-4.

2. snake2971chris

really nice report! Just one thing, if you have data points in a 2 dimensional space(x1,x2), like in your examples, you should visualize it in 3 dimensions since the hyperplane is spanned in a third dimension(y). If you don't want to mess around with 3 dimensions use a simple 1D example, so your hyperplane can be really visualized with a line.

That was a thing which confused me for quiet a bit. But once I had the right visualization everything else became kind of intuitive.

1. Alexandre KOWALCZYK Post author

Yeah sure. Here the third dimension is depicted by the color, but indeed it can be confusing. Thanks for your feedback 🙂

Why are the hyper planes set to 1 and -1. It seems arbitrary? This essentially sets them 2 units apart. What if the data points with labels 1 and labels -1 were very far away?

1. Alexandre KOWALCZYK Post author

They are set to 1 and -1 because it is the only possible values the vector y can take. As suggested by snake2971chris in a previous comment, visualizing the picture in 3 dimensions can help to understand this.

1. Alexandre KOWALCZYK Post author

This is by definition. The equation is called a normal equation for the hyperplane. I invite you to read the part about hyperplanes in this course (page 9). This is about the calculus of functions of several variables but I used it as a reference several times to write this serie of articles (you can read the full course here).

2. archkrish

The hyperplanes are defined in the way that there is vector which is perpendicular to all the vectors in the hyperplane...imagine hyperplane consists of all the vectors which are perpendicular to w.There are good video lectures available which explains hyperplane.

4. daniel

simple yet insightful explanation on the math behind the scene...cant wait to see the release of part 4...thanks for your effort in producing this tutorial, very helpful

5. Rahul

Let us consider a two dimensional dataset as shown in your tutorial. My doubt is what is the physical significance of w.I know it is a vector perpendicular to the hyperplane but what does it actually signify

1. Alexandre KOWALCZYK Post author

I am afraid I don't understand your question. What do you mean by physical significance?

1. Alexandre KOWALCZYK Post author

Hi Tejas. Thanks for your comment. I don't have a precise date because I have been very busy at work. As there is a lot of people asking me I will try to do my best to publish it soon (before 2016)
I added a small form at the end of the article so that you can put your email in it if you want me to send you a mail when it will be published.

6. Hugo

Thanks for the great tutorial. The expressions wx-b=1 and wx-b=-1 are however a little confusing. You write the following in defence of these equations.

"And we choose +1 and -1 because they are the only values the class can take."

Question: My understanding is that the result of wx-b should not evaluate to a class label. What would evaluate to a class label is an expression such as f(wx-b) where f is the classification rule that assigns the product between a given test vector and the w vector to the classes 1 or -1. Can you clarify why you are having wx-b evaluating to the value of the class labels? Some literature suggests that the numerical values of the class labels are only arbitrary -- in fact -1 and 1 are chosen in SVM algebra because they enable an elegant way to represent the SVM (e.g., we are able to write y(wx-b)>0 because y can only be 1 or -1).

Could you have chosen 1 and -1 because you wanted to have the planes wx-b=-1,1 equidistant from wx-b=0. In that case does it mean that I could have used numbers such as 2 and -2?

1. Alexandre KOWALCZYK Post author

Indeed the expression "And we choose +1 and -1 because they are the only values the class can take." was an error. I checked online and this value has no connection with the class label.
You can see equation (2) of this paper for more details.

I updated the article so that it is more explicit how this number is choosen.
I think we could use numbers such as 2 and -2 but I am not sure, I need to check the math to be 100% sure.

Thanks a lot.

2. ZC Liu

Actually, you can choose +2 and -2 instead of +1 and -1.
Then you follow the steps: 10 ~ 19, you will get m = 4 / w'.
means you find a w', which length is half of the original one(w), when you choose +1 and -1.
then the support hyper plane is: w' * x + b' = 2 => 2w * x + 2b = 2 , which is equivalent to the
original one: w * x + b = 1.

so you can choose other number either, eventually, you will get the same support hyper plane.

7. Mutaz

Really thanks a lot for this series
This is the first time I see such a simplicity of explaining the SVM
I am waiting part 4, I am a bit scared to learn how solve that optimization problem. Please try to explain it as simple as possible.

Thanks a lot

1. 강산하

I really really appriciate for your aritcle.
I am graduate school student in South Korea

I have a question,

Why do you set δ = 1.-1 ?

I think δ is related to margin and

it should smaller if there are data point and equation (4), (5) are not true.

What do you think?

1. Alexandre KOWALCZYK Post author

We could have kept $\delta$ in all the following equations. At the end we would have had $y_i(\mathbf{w}\cdot\mathbf{x_i}-b)\geq\delta$ in equation 8. Indeed, $\delta = y_i(\mathbf{w}\cdot\mathbf{x_i}-b)$ is the functional margin. You can read part 3 of this course to understand better why we can set it to 1 and -1. You will see in part 4 that he replaces the functionnal margin by 1. I hope this helps you.

1. Ganecian

what does \delta means? I still didn't understand this paragraph,
"We now note that we have over-parameterized the problem: if we scale w, b and \delta by a constant factor \alpha, the equations for x are still satisfied. To remove this ambiguity we will require that \delta = 1, this sets the scale of the problem, i.e. if we measure distance in millimeters or meters. " Can you explain this?

8. 강산하

another question

In step 2, we select 2 hyperplane, so I think w is fixed in that time

but step3 you tell we minimize w to get biggest margin

I think w nerver chage after step2...

Can you explain?

1. Alexandre KOWALCZYK Post author

In step 2 we select two hyperplanes among all possible couple of hyperplanes satisfying the constraints (4) and (5). Given one hyperplane separating the data, we can find two other hyperplanes meeting the constraint. Given another hyperplane separating the data (with a different w than the first one), we can find two other different hyperplanes meeting the constraint. So minimizing w in step 3 allow us to find the two hyperplanes having the biggest margin and the optimal hyperplane.

9. Cao Fuxiang

Hello Alexandre!

Thank you very much for your excellent series of posts. You convert a very complicated math problem into a very clear and simple one. I have never seen such a great technical blog before!

Thank you once again.

10. Reshma

Really an awesome tutorial with the unique way of explanations. Each part of SVM was soo well explained. You are an expert in SVM. Looking forward for the part 4.

1. Alexandre KOWALCZYK Post author

You are welcome. I am still very busy right now but I don't give up on part 4. I just have some more SVM related work to do before.

11. Rajive Jain

This is the clearest explanation of a complex topic that I have ever seen. Why can't the professors in classes teach the same way? I have a gut feeling that they don't do it because they don't understand it themselves! I went through your entire tutorial in a couple of hours and undestood the entire conceptual framework of SVM - something I never got from my professors in the entire semester. I am sure it took you many days of hard work to put this together. Thanks for sharing your knowledge and understanding - I really appreciate your service to the community.

12. tyvannguyen

I've to say that this one is one of the best tutorial I've read.
It can viewed as a beginer-to-intermediate lecture with all neccessary visual demonstrations.
Thank you very much.

13. Manziba

Its wonderful article.So simple explanation which I was looking for.Can you please upload more articles explaining Lgranagian to find optimal w

14. Praveen

I have a question on th dot product of the 2 vectors w.x -b = 1 and -1.
Isnt it be >0 and <0 . Since we are taking the dot product of the normal vector w with another point x which is like dot product of the two vectors
|w||x|.cos(theta) and the values will vary between 1 and -1 and when they are on the hyperplane the values of the dot product will be 0.
Please correct me if I am wrong.

1. Alexandre KOWALCZYK Post author

We don't want the point to be too close to the initial hyperplane, that is why we introduce delta. We choose two hyperplanes at a distance delta from the original hyperplane. We then arbitrarely set delta = 1. But that means that points on the other hyperplanes will have values +1 or -1

15. Vamshi Dhar

Dear Praveen,

when they are on the hyperplane the values of the dot product will not be 0, but 1 on one side of plane and it will be >1 as the point goes away from that plane.
Same way with other plane having -1 for point that is exactly on the plane and as it goes out of the plane then it will be < -1.
I guess this can be your answer. If i understood some thing from the Alexandre KOWALCZYK's 3 valuable "Understanding the math" parts.

Cheers to Alexandre KOWALCZYK

16. Ulrich Armel

Really great article, it really made have a good feeling of SVM after I stop learning machine learning when I came across it and it was very poorly explained and I could n't understand anything.

17. imen

I thank you a lot for this great job. It is amazing to find things very simple lik this. you are an angle.

18. Tonja Rand

Hey,
amazing post!
However, I am a bit confused about the definitions of "hyperplane" and "margins". The two hyperplanes you have Ho and H1 (dotted lines) are the actual support vectors. A non-dotted line is a separating hyperplane. Right? I found in the internet two different definitions of "margins". In this figure (https://onionesquereality.files.wordpress.com/2009/03/optimal-margin-classifier.jpg) the margin is defined as a distance between a support vector and a separating hyperplane. In another figure (http://blog.pengyifan.com/wp-content/uploads/2013/09/svm.png) margin is a distance between two support vectors. And as far as I understood you define it also as a distance between two support vectors.

If there are two definitions of "margin" then there should be two different mathematical representations correspondingly. But as I see it does not matter if "margin" is defined as distance between two support vectors or a separating hyperplane and one support vector, it will have the same formula, namely 2/||w||. But how is it possible?

1. Alexandre KOWALCZYK Post author

Hello. The two hyperplanes H0 and H1 are not support vectors, they are hyperplanes. As they separate the data, they could be called separating hyperplanes too. In the case were the central hyperplane is the optimal hyperplane, then vectors which lies on the two others hyperplanes will be called support vectors. In the case were the margin is the distance between a support vector and the hyperplane it will have a length of 1/||w||. It does not matter which one we choose for the margin because at the end we optimize the norm of w and the same argument works if we use 2/||w|| or 1/||w|| (minimizing the norm is like maximizing the margin). If you want to understand more about the margin you can take a look at this lecture of Andrew Ng which introduces the functional and geometric margin. I hope it helps you.

19. Ma Lu

Could you please give a hint about when part 4 could be released, no stress i'mylevels just curious 😉

1. Alexandre KOWALCZYK Post author

Hello. I honestly don't know. I will try my best to do it in the next few months.

20. optimus

Hi Alenxander, could you explain to me what is the correlaction between finding hyperplane with prediction process by svm ? I have read several articles about svm, but I still don't get it

1. Alexandre KOWALCZYK Post author

Hello optimus. This is a two step process: you find any hyperplane which separate data, and you are able to make predictions on your training set. However you have no guarantee that it will perform well in the test set (that it will "generalize" well). With SVMs you find the "optimal" hyperplane, one which separates the data, but which is also as far away as possible from it. As a result it generalize better and you can expect that the prediction you made on unseen data will be better.

21. neran

Really enjoying the article. Thanks for the clear description of something very powerful.

22. Daniel

You explained everything so well and in such a non-patronising manner. I wish you were my teacher years ago. Looking forward to part 4.

23. Sourabh

I really liked your post and it gave me a keen interest in SVm, although i do have some open points which I do not understand after reading this 3rd part. I followed till combining of the constraints in one equation(till eq 8). But I have some open points regarding the initial selection of any hyperplane like how we start initially with first hyperplane which will classify the dataset(I know we can do it through single perceptron),but the thing is here in this post when we maximizing the distance in those steps I really gone confused.

thanks for this wonderful post.
It will be great if you can take one very simple 2 dimensional example to explain what we did in part 3 which explain all steps we covered here , it will be very helpful for visualization ,like what is going on.

24. Kaustav Mukherjee

Hey Alexander,

Let me first thank you for an absolutely outstanding explanation of SVM! I have scoured the internet upside down, ranging from lectures from MIT and Caltech profs, none of the explanations I have found can match yours in terms of simplicity and effectiveness! Please upload part 4 ASAP! I am unable to have peace of mind until I see it working end-to-end!

25. sawssen

Hello, thank you very for this explanation , i need to inderstand the 4 th part , is it ready ?

1. Alexandre KOWALCZYK Post author

Hello. The 4th part is not ready yet but contains already some content, I want to put enough content in it so that people are not frustrated and we the article advance the understanding of SVM. I have some time to work on it this week because I am on holidays s but most of the time I have very little time to write during the week, hence the big delay.

26. marcelo

Hello Alexandre,
I have a question about a special case of SVM. I tried to find the support vectors for these training examples:
x1 = (0,1) y1 = +1
x2 = (1,0) y1 = +1
x3 = (-1,0) y1 = -1
x4 = (0,-1) y1 = -1
Clearly, all examples are support vectors because x1 and x2 are over the hyperplane H+ and x3 and x4 are over the hyperplane H-. However, when the LD is maximized subject to sum(yi.alphai = 0) and alphai >= 0, the obtained results are (I have used the Wolfram Alpha online optimizer):
alpha1 = alpha3
alpha2 = 1 - alpha3
alpha4 = 1 - alpha3
(notice that alphai is the Lagrage Multiplier of xi)
The solution that has sense to me is alpha1 = alpha3 = 1, but then alpha2 = alpha4 = 0, indicating that x2 and x4 are not support vectors, and this is illogical. Even more with alpha1 = alpha3 = 1 and alpha2 = alpha4 = 0, the w vector and b obtained return these results:
(w . x1) + b = +1 (x1 is over the H+, since x1 is a support vector)
(w . x2) + b = +1 (x1 is over the H+, but it is not a support vector because its Lagrange Multiplier is zero)
(w . x3) + b = -1 (x1 is over the H-, since x1 is a support vector)
(w . x4) + b = -1 (x1 is over the H-, but it is not a support vector because its Lagrange Multiplier is zero)
I have also used the svmtrain of Matlab and the support vectors obtained are the same (x1 and x3).
One last observation, if I add one more training example, [x5 = (2, 2) y5 = +1], only alpha5 is zero, this indicate that x1, x2, x3 and x4 are support vectors.
Do you have any suggestion to clarify this situation?

1. Alexandre KOWALCZYK Post author

Hello marcelo,

I performed the same experiment as you with a quadratic programming library and I have found 4 support vectors. This is because I was solving the problem without regularization (the hard-margin). However most libraries give you an implementation of soft-margin SVM with regularization where the hyperparameter C allows you to tune how much you want to regularize. By default they use a value of C which perform some regularization. I made a reproducible example in Python. If you change the value of C to be greater than one, you will have 2 support vectors. However as soon as you use a value less than 1 you have 4 support vectors. I hope it helps you.

1. abood

Hello Alexandre,
thank you for an absolutely outstanding explanation of SVM
have you Code in Java for SVM !!!¿¿¿

1. abood

Thank you very much Alexandre for the java Code.
But I want to java Code only for linearly separable SVM please

27. Chew

Very, very good explanation. Great Job!!!!

Hope you can explain us the next steeps soon! I am excited with your step-by-step, figures and numerical examples.

1. Alexandre KOWALCZYK Post author

Thanks a lot for your comment. I don't have a book yet 😉 The next part is coming up. Don't forget to drop your email and I will keep you informed.

28. G. Krishna veni

Sir really i satisfied with u r modules. Actually I did not have basic maths. When I gone through this blog i cleared doubts myself. I'm eagerly waiting for your next part

29. ntp

If you write a book about Machine Learning with these kind of awesome explanation, I will buy it. Anyone has a same thought?

30. SUMAN KUMAR CHOUDHURY

Great !!! wao !!! what an explanation... thank u very much.... I am waiting for the optimization in SVM (Part 4)

31. Jorge

Thank you very much! It´s great to read a thorough description of the SVM with a clear notation and the intuition behind the math.

32. Cor

Hi, thanks for writing this up. Could you please double-check that w is indeed perpendicular to H1? If you take any point on H1, this point is represented by a vector drawn from the origin to that point. This vector is obviously not perpendicular to w. I'd appreciate it if you could explain why I'm wrong 🙂

1. Alexandre KOWALCZYK Post author

The two things that matter about w are its direction and its magnitude. I can draw it where I want and I will be the same vector (even if it does not have the same starting coordinates). I explain this in Part 2 when talking about the difference between two vectors. I hope it helps you 🙂

Hi!
In equation 9 we have K = mu = m ( w / ||w|| );
I think in defining the vector "k" we need something like b in wx+b=0
ti indicate the location of vector in the space.
just the magnitude and the norm shows where is the starting point of k?????

1. Alexandre KOWALCZYK Post author

Hello. No, you do not need to indicate the location of the vector in the space. This is often confusing, I explain this in Part 2 when talking about the difference between two vectors. Reread this section, it might help you understand.

34. Amal Targhi

Thank's for this course but i can't understand how do you calculate the vecotr W and the value of in order to know if wx+b superior or inferior to 0 please understand me

1. Alexandre KOWALCZYK Post author

The value of w is what we are trying to find by solving the optimization problem explained in the following articles.

35. Sonu Lamba

Dear Sir,
How we calculate w and b?. are they different for every feature and at the end we need to optimize them?..how we decide the the value of w and b for optimal hyperplane equation...if for each feature we are calculating w and b then what value of w and b we will choose?. and why we take unit value for w always even the data are not always at the same distance. I am not getting the base of w and b value. could you please explain it earliest. Thank you very much for your time and article.

1. Alexandre KOWALCZYK Post author

Hello. There is only one vector w, and it will contain several values. If X is an (m,n) matrix, then a single example is a (1,n) vector and w is a (n,1) vector. So one vector w contains the values for all examples. There is a single value of b. They will be chosen by solving an optimization problem (you can read more on that on the later articles).

36. cam nguyen

Hello Alexandre, thank you so much for your great article !

I'm learning a project about human detection which using SVM algorithm to classify non-human or human. Can you tell me the relationship between the variables w, b, xi with the values from the input images? I haven't still imagined yet? TKS so much!

1. Alexandre KOWALCZYK Post author

1. cam nguyen

How to compute the weight vector w and bias b in linear SVM. We have a hyperplane equation and the positive and negative feature. In equation Wx+b= 0, what does it mean by weight vector and how to compute it??

37. Matthias

Hello. I agree with the many comments praising your post. Thank you so much for it!

There's one thing I'm still confused about: How do we find a hyperplane that seperates the set in the first place? In step 2 it says: "Given a hyperplane H0 separating the dataset and satisfying[...]"
So how do we get H0? Did I somehow miss that?

1. Alexandre KOWALCZYK Post author

Hello, Matthias. Thank you for your comment. The hyperplane H0 is introduced in order to derive the optimization problem. I could start an argument by "Given a point x on the horizontal axis", and it does not matter how you pick it, or where it is, what matters is that it is a point, and it is on the horizontal axis. The same apply in this context. But to answer your question, if you have a linearly separable dataset, you can find a separating hyperplane by using an algorithm like the Perceptron Learning Algorithm for instance, or of course the SMO algorithm for the SVM problem.

38. Kamakshi Bansal

You are such an amazing teacher!! kudos to this post. This post taught me SVM in such a simple way.

39. idk222

What is the intuition for why minimizing the norm of w is the same as maximizing the margin m. From the algebra it is obvious.

w is the normal vector to H1. The norm of w is the magnitude of that vector. Do you have any intuition why that would be the same as choosing the biggest m? Thanks. Great articles!!

1. Alexandre KOWALCZYK Post author

2. Numhacker

It's all about dot product. Normal vector w is perpendicular to the hyperplane, and so the margin vector k is. Given arbitrary vector w and k, what we want is the result of this dot product must be 1 (or any positif number, we use δ in this post). Since these two vectors is in parallel, the dot product is only based on its length.
How do you choose w and m which the dot product is 1 while we want to maximize margin? Yap, minimize the length of the normal vector w!

40. Tetsuya

Thanks a lot for the detailed explanation. That's the greatest article on SVM I have ever read before. Here I have one question. In Figure 4, 5, 6, 7, 8, there are hyperplanes written by wx - b = 1 and wx - b = -1, but we are talking about wx + b = 1 and wx + b = -1, right? So, I think that wx + b = 1 and wx + b = -1 should be also written in those figures. What do you think?

1. Alexandre KOWALCZYK Post author

Yes indeed it would be more coherent to use wx+b=1 in the figure as well. Even if the only difference between an hyperplane with wx-b=1 and wx+b=1 is the sign of b. Thanks for pointing that out. It is in fact often confusing in the literature as well as some author use the wx+b convention and others use wx-b (Platt for example).

1. Alexandre KOWALCZYK Post author

I am not sure to understand your question perfectly. But the vector w here is the vector normal to the hyperplane. In Part 2 we defined a vector w which was the direction of a vector u. But they are not the same vectors.

41. Rahul

Simply Amazing. Sorted out all the doubts regarding SVM. Can you please take any classification example along with say MATLAB code without using any direct command?
Thanks

1. Alexandre KOWALCZYK Post author

Thank you Rahul. Unfortunately my MATLAB is a little rusty 🙂

42. Nick

You wrote in a previous tutorial: "Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the margin"
However with the constraints above(in this tutorial) there isn't guarantee that a data point (of the training set) was on the boundary of hyperplane H0 or H1. In this case the given definition of margin seems wrong. Am I wrong?

Another question is why δ=1? It should be correlation between the margin and δ. If you leave δ you get (if I'm not wrong):
margin = (2*δ)/||w||
And the optimization problem should ivolve the parameter δ too.
Why can simplification be done (δ=1)?

1. Alexandre KOWALCZYK Post author

Hello Nick,

1) As the hyperplane will be at equal distance between the two nearest data point finding the distance from one point than multiplying it by 2 works.
However this is not very rigorous, that is why we see the "true" method to compute the margin in this article.
2) To better understand delta, you can read this paper. (Equation 2)

43. Tushar

Great work. I am a Deep Learning researcher. I personally have never seen such simple and effective explanation for SVMs in any of the books (or across the web). This reminded me of the quote by Einstein...."If you cannot explain it simply, you probably have not understood it"....Cheers.

44. Olle Welin

Hi thank's very much for your tutorial. Very informative. (As usual when I got into new intellectual topics I often miss understand basics). Regarding SVM I wounder how will I do to find the first 3 support vector. One guess I have is this (correct me if I am wrong).
1. Create a first version of test "nr1 hyperplane" by Random select an origo and a random angle for the (bolded) 'w' vector (hyperplane)
2. Go through All vectors and if some vector is constrained by this test "nr1 hyperplane" then throe this vector. Maybee se how bad this constrained point and use this data to produce a new test "nr2 hyperplane".
3. When test "nr? hyperplane" satisfy ALL vectors then Start to find the 3 support vectors.
4. Now again go through the margin calculation method you describe for ALL the vectors and when some vector have smallest margin save this until you have find for example 'A' 'B' and 'H' in your example above.

I guess there must be a way to find this 3 support vector as fast as possible.
Very best Regards Olle Welin
I feel so slow but I am curious about machine learning how the basics math "nuts and bolts" works. (play with write my own C code on Raspberry pi)

45. Mohit

Great work Alexandre!Sorry m completely new to this so might be a novice question but am not able to visualise how z0 which is addition of x0 and k .... is in hyperplane H1 ?
"If we start from the point x0 and add k we find that the point z0=x0+k is in the hyperplane H1 ?"

1. Alexandre KOWALCZYK Post author

Well, you do not need to visualize it because you can see it on Figure 14. There is a review of vector addition on part 2 if you need help understanding the concept. Best Regards,

46. Hannah Liu

I am still confused about step 2.

Yes, with all the constraints, I will be able to know whether the hyperplanes are the ones that correctly separate the data. But how do I get the hyperplanes in the first place? Do I take a random guess? If after applying the constraints I found they are not the ones I am looking for, how do I improve myself?

1. Alexandre KOWALCZYK Post author

That is the purpose of the algorithm SMO. Right now this article is building the foundation used to define the problem we want to solve, not explaining how to solve it. (I will explain all of that in my upcoming ebook)

47. Peyman Barjoueian

Hello Dear Alexandre.

Thank you for these good series.
I have a question. it make sense to minimize ||w|| value in our optimization problem. but why should also minimize the b value? how they(||w|| and b) are related?

Cheers.

48. JX Song

First of all I'd like to thank you for this awesome tutorial. There is however something I do not understand. In the part "Understanding the constraints". Why is it that for Xi = A, the point on the hyperplane is =1 and for Xi = C the point on the hyperplane is >1 ?

49. Miruna Stefan

Dear Alexandre,

I have been circling SVM for over a year now, and while i understood the underlying principle, i never truly got how it worked. Most resources provide either a general, high-level explanation, or heavy mathematics that can be difficult to understand on your own.

Your post, on the other hand, is clear, detailed and explicit. I think you have invested a great deal of attention and time into this project, and i want you to know it is truly helpful and of great quality.

Thank you very much!
Miruna

50. Romain C

Hi,

Thanks again for this great tutorial,

I got to the end of the first tutorial steps and I am trying to go further reading your ebook on SVMs

I do not understand why :

- here we minimize ‖w‖
- on your ebook we minimize 1/2 ‖w‖ ^2

1. Alexandre KOWALCZYK Post author

Hello Romain,

I explain this in the book "The factor has been added for later convenience when we will use QP solver to solve the problem and squaring the norm has the advantage of removing the square root."

Best regards,

51. Steve Chamales

I've read the first 3 parts to this tutorial (this page is part 3). In the example you're working through, you chose your first hyperplane to be x2 = -2 * x1 because this is a very basic example that you could visualize. How do you select a hyperplane that best separates multi-dimensional data that can't be visualized by humans?

1. Alexandre KOWALCZYK Post author

Basically we can say that the first hyperplane is defined by a vector w starting at 0 or with random values, then an algorithm updates it by tiny increment until it find the optimal hyperplane (you can read the full detail of this in the chapter about the SMO algorithm in my ebook).

52. Yury Parfenov

Hi,
We have two parameters w and b in w⋅x+b=−1, w⋅x+b=1 equations. w is corresponding for rotation, b is corresponding for offset. I didn't get how we can control distance between two hyperplanes if we have shared b parameter and constant values on the right hand side (-1, 1). We only can rotate and shift hyperplanes by varying w and b. But we cannot change distance between planes.

1. Alexandre KOWALCZYK Post author

Compare the distance between the lines created by:
First equation 9.0x+1.0
Second equation 9.0x+3.0

and by

First equation 0.1x+1.0
Second equation 0.1x+3.0

Clearly by only shifting the hyperplane we can make the distance smaller.

53. Tejveer Singh

Please, can you explain step 1 in a bit detail?

Please explain equation of Hyperplace in step -2 also?
It is a bit unclear in the previous post also.

54. dr. hala omar

Thank you very much
your simple and clear explanation helped me a lot to understand the subject especially as a beginner in it.