This is the **Part 3** of my series of tutorials about the math behind Support Vector Machine.

If you did not read the previous articles, you might want to start the serie at the beginning by reading this article: an overview of Support Vector Machine.

## What is this article about?

The main focus of this article is to show you the reasoning allowing us to select the optimal hyperplane.

Here is a quick summary of what we will see:

- How can we find the optimal hyperplane ?
- How do we calculate the distance between two hyperplanes ?
- What is the SVM optimization problem ?

## How to find the optimal hyperplane ?

At the end of Part 2 we computed the distance between a point and a hyperplane. We then computed the margin which was equal to .

However, even if it did quite a good job at separating the data it was not the *optimal* hyperplane.

As we saw in Part 1, the optimal hyperplane is the one which maximizes the margin of the training data.

In Figure 1, we can see that the margin , delimited by the two blue lines, is not the biggest margin separating perfectly the data. The biggest margin is the margin shown in Figure 2 below.

You can also see the optimal hyperplane on *Figure 2*. It is slightly on the left of our initial hyperplane. How did I find it ? I simply traced a line crossing in its middle.

Right now you should have the feeling that hyperplanes and margins are closely related. And you would be right!

If I have an hyperplane I can compute its margin with respect to some data point. If I have a margin delimited by two hyperplanes (the dark blue lines in *Figure 2*), I can find a third hyperplane passing right in the middle of the margin.

**Finding the biggest margin, is the same thing as finding the optimal hyperplane.**

## How can we find the biggest margin ?

It is rather simple:

- You have a dataset
- select two hyperplanes which separate the data with no points between them
- maximize their distance (the margin)

The region bounded by the two hyperplanes will be the biggest possible margin.

If it is so simple why does everybody have so much pain understanding SVM ?

It is because as always the simplicity requires some abstraction and mathematical terminology to be well understood.

So we will now go through this recipe step by step:

### Step 1: You have a dataset and you want to classify it

Most of the time your data will be composed of vectors .

Each will also be associated with a value indicating if the element belongs to the class (+1) or not (-1).

Note that can only have two possible values -1 or +1.

Moreover, most of the time, for instance when you do text classification, your vector ends up having a lot of dimensions. We can say that is a -dimensional vector if it has dimensions.

So your dataset is the set of couples of element

The more formal definition of an initial dataset in set theory is :

### Step 2: You need to select two hyperplanes separating the data with no points between them

Finding two hyperplanes separating some data is easy when you have a pencil and a paper. But with some -dimensional data it becomes more difficult because you can't draw it.

Moreover, even if your data is only 2-dimensional it might not be possible to find a separating hyperplane !

*You can only do that if your data is linearly separable*

So let's assume that our dataset IS linearly separable. We now want to find two hyperplanes with no points between them, but we don't have a way to visualize them.

What do we know about hyperplanes that could help us ?

#### Taking another look at the hyperplane equation

We saw previously, that the equation of a hyperplane can be written

However, in the Wikipedia article about Support Vector Machine it is said that :

Any hyperplane can be written as the set of points satisfying .

First, we recognize another notation for the dot product, the article uses instead of .

You might wonder... Where does the comes from ? Is our previous definition incorrect ?

Not quite. Once again it is a question of notation. In our definition the vectors and have three dimensions, while in the Wikipedia definition they have two dimensions:

Given two 3-dimensional vectors and

Given two 2-dimensional vectors and

Now if we add on both side of the equation we got :

**For the rest of this article we will use 2-dimensional vectors** (as in equation (2)).

Given a hyperplane separating the dataset and satisfying:

We can select two others hyperplanes and which also separate the data and have the following equations :

and

so that is equidistant from and .

However, here the variable is not necessary. So we can set to simplify the problem.

and

Now we want to be sure that they have no points between them.

We won't select *any *hyperplane, we will only select those who meet the two following **constraints**:

For each vector either :

or

#### Understanding the constraints

On the following figures, all red points have the class and all blue points have the class .

So let's look at *Figure 4 *below and consider the point . It is red so it has the class and we need to verify it does not violate the constraint

When we see that the point is on the hyperplane so and the constraint is respected. The same applies for .

When we see that the point is above the hyperplane so 1\" /> and the constraint is respected. The same applies for , , and .

With an analogous reasoning you should find that the second constraint is respected for the class .

On *Figure 5*, we see another couple of hyperplanes respecting the constraints:

And now we will examine cases where the constraints are not respected:

What does it means when a constraint is not respected ? It means that we cannot select these two hyperplanes. You can see that every time the constraints are not satisfied (*Figure 6, 7 and 8)* there are points between the two hyperplanes.

By defining these constraints, we found a way to reach our initial goal of selecting two hyperplanes **without points between them.** And it works not only in our examples but also in -dimensions !

#### Combining both constraints

In mathematics, people like things to be expressed concisely.

Equations **(4)** and **(5)** can be combined into a single constraint:

We start with equation **(5)**

And multiply both sides by (which is always -1 in this equation)

Which means equation **(5)** can also be written:

In equation **(4)**, as it doesn't change the sign of the inequation.

We combine equations** (6)** and **(7) **:

We now have a unique constraint (equation **8**) instead of two (equations **4** and **5**) , but they are mathematically equivalent. So their effect is the same (there will be no points between the two hyperplanes).

### Step 3: Maximize the distance between the two hyperplanes

This is probably be the hardest part of the problem. But don't worry, I will explain everything along the way.

#### a) What is the distance between our two hyperplanes ?

Before trying to maximize the distance between the two hyperplane, we will first ask ourselves: how do we compute it ?

Let:

- be the hyperplane having the equation
- be the hyperplane having the equation
- be a point in the hyperplane .

We will call the perpendicular distance from to the hyperplane . By definition, is what we are used to call **the margin**.

As is in , is the distance between hyperplanes and .

**We will now try to find the value of** .

You might be tempted to think that if we add to we will get another point, and this point will be on the other hyperplane !

But **it does not work**, because is a *scalar*, and is a *vector* and adding a scalar with a vector is not possible. However, we know that adding two vectors is possible, so if we transform into a vector we will be able to do an addition.

We can find the set of all points which are at a distance from . It can be represented as a circle :

Looking at the picture, the necessity of a vector become clear. With just the length we don't have one crucial information : **the direction**. (recall from Part 2 that a vector has a magnitude and a direction).

We can't add a scalar to a vector, but we know if we multiply a scalar with a vector we will get another vector.

From our initial statement, we want this vector:

- to have a magnitude of
- to be perpendicular to the hyperplane

Fortunately, we already know a vector perpendicular to , that is (because )

Let's define the unit vector of . As it is a unit vector and it has the same direction as so it is also perpendicular to the hyperplane.

If we multiply by we get the vector and :

- is perpendicular to (because it has the same direction as )

From these properties we can see that is the vector we were looking for.

We did it ! We transformed our scalar into a vector which we can use to perform an addition with the vector .

If we start from the point and add we find that the point is in the hyperplane as shown on *Figure 14.*

The fact that is in means that

We can replace by because that is how we constructed it.

We can now replace using equation

We now expand equation

The dot product of a vector with itself is the square of its norm so :

As is in then

This is it ! We found a way to compute .

#### b) How to maximize the distance between our two hyperplanes

We now have a formula to compute the margin:

The only variable we can change in this formula is the norm of .

Let's try to give it different values:

When then

When then

When then

One can easily see that the bigger the norm is, the smaller the margin become.

### Maximizing the margin is the same thing as minimizing the norm of

Our goal is to maximize the margin. Among all possible hyperplanes meeting the constraints, we will choose the hyperplane with the smallest because it is the one which will have the biggest margin.

This give us the following optimization problem:

Minimize in ()

subject to

(for any )

Solving this problem is like solving and equation. Once we have solved it, we will have found the couple () for which is the smallest possible and the constraints we fixed are met. Which means we will have the equation of the optimal hyperplane !

## Conclusion

We discovered that finding the optimal hyperplane requires us to solve an optimization problem. Optimization problems are themselves somewhat tricky. And you need more background information to be able to solve them. So we will go step by step. Let us discover unconstrained minimization problems in Part 4! Thanks for reading.

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect with my readers.

dragon518Thank you very much! I have never read a technology blog more fundamental, detailed, clear than yours before. You are a specialist on SVM. BTW, look forward to read part-4.

snake2971chrisreally nice report! Just one thing, if you have data points in a 2 dimensional space(x1,x2), like in your examples, you should visualize it in 3 dimensions since the hyperplane is spanned in a third dimension(y). If you don't want to mess around with 3 dimensions use a simple 1D example, so your hyperplane can be really visualized with a line.

That was a thing which confused me for quiet a bit. But once I had the right visualization everything else became kind of intuitive.

Alexandre KOWALCZYKPost authorYeah sure. Here the third dimension is depicted by the color, but indeed it can be confusing. Thanks for your feedback 🙂

Ezra DriekaThanks so much bro - beginner

sarath r nairVery great post ....... Please dont stop in the middle . Thumbs up.

Rokesh JankieReally enjoying the article. Thanks for the clear description of something very powerful.

korawitThanks looking forward to how to solve SVM equation

WinsonWaiting for part 4. This tutorial is very clear for beginner. 🙂 Thanks

genetixx01Very helpful. I look forward for part 4.

Abhinand (Abhi) SivaprasadWhy are the hyper planes set to 1 and -1. It seems arbitrary? This essentially sets them 2 units apart. What if the data points with labels 1 and labels -1 were very far away?

Alexandre KOWALCZYKPost authorThey are set to 1 and -1 because it is the only possible values the vector y can take. As suggested by snake2971chris in a previous comment, visualizing the picture in 3 dimensions can help to understand this.

mlsvmGreat SVM series Alexandre! I have a question for this part, specifically on "Fortunately, we already know a vector perpendicular to 1, that is w (because 1=w⋅x−b=1)"

Why is w perpendicular to the hyperplane? I've read an explaination over here: http://datascience.stackexchange.com/questions/6054/in-svm-algorithm-why-vector-w-is-orthogonal-to-the-separating-hyperplane

but I still have no clue about the logic behind.

Alexandre KOWALCZYKPost authorThis is by definition. The equation is called a

normal equationfor the hyperplane. I invite you to read the part about hyperplanes in this course (page 9). This is about the calculus of functions of several variables but I used it as a reference several times to write this serie of articles (you can read the full course here).archkrishThe hyperplanes are defined in the way that there is vector which is perpendicular to all the vectors in the hyperplane...imagine hyperplane consists of all the vectors which are perpendicular to w.There are good video lectures available which explains hyperplane.

Edemir FerreiraThanks for the explanation! I'm anxious for the next part =)

danielsimple yet insightful explanation on the math behind the scene...cant wait to see the release of part 4...thanks for your effort in producing this tutorial, very helpful

RahulHey, really great tutorial, waiting for the next part. I had a doubt...Please help!!

Let us consider a two dimensional dataset as shown in your tutorial. My doubt is what is the physical significance of w.I know it is a vector perpendicular to the hyperplane but what does it actually signify

Alexandre KOWALCZYKPost authorI am afraid I don't understand your question. What do you mean by physical significance?

NrupatungaThank you so much for such a nice and easy tutorial.

Tejas KhotHello Alexandre!

Thank you for an excellent series of posts.

By when can we expect the next part to be released ?

Alexandre KOWALCZYKPost authorHi Tejas. Thanks for your comment. I don't have a precise date because I have been very busy at work. As there is a lot of people asking me I will try to do my best to publish it soon (before 2016)

I added a small form at the end of the article so that you can put your email in it if you want me to send you a mail when it will be published.

HugoThanks for the great tutorial. The expressions wx-b=1 and wx-b=-1 are however a little confusing. You write the following in defence of these equations.

"And we choose +1 and -1 because they are the only values the class can take."

Question: My understanding is that the result of wx-b should not evaluate to a class label. What would evaluate to a class label is an expression such as f(wx-b) where f is the classification rule that assigns the product between a given test vector and the w vector to the classes 1 or -1. Can you clarify why you are having wx-b evaluating to the value of the class labels? Some literature suggests that the numerical values of the class labels are only arbitrary -- in fact -1 and 1 are chosen in SVM algebra because they enable an elegant way to represent the SVM (e.g., we are able to write y(wx-b)>0 because y can only be 1 or -1).

Could you have chosen 1 and -1 because you wanted to have the planes wx-b=-1,1 equidistant from wx-b=0. In that case does it mean that I could have used numbers such as 2 and -2?

Alexandre KOWALCZYKPost authorThanks for your comment Hugo.

Indeed the expression "And we choose +1 and -1 because they are the only values the class can take." was an error. I checked online and this value has no connection with the class label.

You can see equation (2) of this paper for more details.

I updated the article so that it is more explicit how this number is choosen.

I think we could use numbers such as 2 and -2 but I am not sure, I need to check the math to be 100% sure.

Thanks a lot.

seokhyunThanks for the grate tutorial!. To me it was hard to understand, but now i come to realize!

ZC LiuActually, you can choose +2 and -2 instead of +1 and -1.

Then you follow the steps: 10 ~ 19, you will get m = 4 / w'.

means you find a w', which length is half of the original one(w), when you choose +1 and -1.

then the support hyper plane is: w' * x + b' = 2 => 2w * x + 2b = 2 , which is equivalent to the

original one: w * x + b = 1.

so you can choose other number either, eventually, you will get the same support hyper plane.

MutazReally thanks a lot for this series

This is the first time I see such a simplicity of explaining the SVM

I am waiting part 4, I am a bit scared to learn how solve that optimization problem. Please try to explain it as simple as possible.

Thanks a lot

Alexandre KOWALCZYKPost authorThanks. I will try 😉

강산하I really really appriciate for your aritcle.

I am graduate school student in South Korea

I have a question,

Why do you set δ = 1.-1 ?

I think δ is related to margin and

it should smaller if there are data point and equation (4), (5) are not true.

What do you think?

Alexandre KOWALCZYKPost authorWe could have kept in all the following equations. At the end we would have had in equation 8. Indeed, is the functional margin. You can read part 3 of this course to understand better why we can set it to 1 and -1. You will see in part 4 that he replaces the functionnal margin by 1. I hope this helps you.

Ganecianwhat does \delta means? I still didn't understand this paragraph,

"We now note that we have over-parameterized the problem: if we scale w, b and \delta by a constant factor \alpha, the equations for x are still satisfied. To remove this ambiguity we will require that \delta = 1, this sets the scale of the problem, i.e. if we measure distance in millimeters or meters. " Can you explain this?

Gleen A. DalaoraoHi, is this forum still active?

강산하another question

In step 2, we select 2 hyperplane, so I think w is fixed in that time

but step3 you tell we minimize w to get biggest margin

I think w nerver chage after step2...

Can you explain?

Alexandre KOWALCZYKPost authorIn step 2 we select two hyperplanes among all possible couple of hyperplanes satisfying the constraints (4) and (5). Given one hyperplane separating the data, we can find two other hyperplanes meeting the constraint. Given another hyperplane separating the data (with a different w than the first one), we can find two other different hyperplanes meeting the constraint. So minimizing w in step 3 allow us to find the two hyperplanes having the biggest margin and the optimal hyperplane.

BruceThank you very much Alexandre! I was exactly looking for a text like yours.

ChuanA really good series of articles! Thank you!

knightPlease carry on with 4th article, the 3 articles are so informative and extremely useful

Cao FuxiangHello Alexandre!

Thank you very much for your excellent series of posts. You convert a very complicated math problem into a very clear and simple one. I have never seen such a great technical blog before!

I look forward to reading your part-4.

Thank you once again.

agarwalayush9Thanks, waiting for the next tut.

ReshmaReally an awesome tutorial with the unique way of explanations. Each part of SVM was soo well explained. You are an expert in SVM. Looking forward for the part 4.

arkanraThank you so much for very clear articles, I look forward for part 4

Alexandre KOWALCZYKPost authorYou are welcome. I am still very busy right now but I don't give up on part 4. I just have some more SVM related work to do before.

Rajive JainThis is the clearest explanation of a complex topic that I have ever seen. Why can't the professors in classes teach the same way? I have a gut feeling that they don't do it because they don't understand it themselves! I went through your entire tutorial in a couple of hours and undestood the entire conceptual framework of SVM - something I never got from my professors in the entire semester. I am sure it took you many days of hard work to put this together. Thanks for sharing your knowledge and understanding - I really appreciate your service to the community.

Alexandre KOWALCZYKPost authorThank you very much for such a kind comment.

jonathanbrophy47Jonathan BrophyThis is such a great and clear explanation of SVMs, easily the best I've read, can't wait for the next tutorial!

tyvannguyenI've to say that this one is one of the best tutorial I've read.

It can viewed as a beginer-to-intermediate lecture with all neccessary visual demonstrations.

Thank you very much.

Vaibhav Daganicely written!!

WaelThank You Alexandre, great tutorial made SVM look so simple.

ManzibaIts wonderful article.So simple explanation which I was looking for.Can you please upload more articles explaining Lgranagian to find optimal w

Alexandre KOWALCZYKPost authorThis will be in Part 4.

DongLook forward to part 4. This series us amazing

johnp631Great tutorial!

PeachAwesome! Look forward to chapter 4

brianExcellent tutorial but please make time for more!

PraveenI have a question on th dot product of the 2 vectors w.x -b = 1 and -1.

Isnt it be >0 and <0 . Since we are taking the dot product of the normal vector w with another point x which is like dot product of the two vectors

|w||x|.cos(theta) and the values will vary between 1 and -1 and when they are on the hyperplane the values of the dot product will be 0.

Please correct me if I am wrong.

Alexandre KOWALCZYKPost authorWe don't want the point to be too close to the initial hyperplane, that is why we introduce delta. We choose two hyperplanes at a distance delta from the original hyperplane. We then arbitrarely set delta = 1. But that means that points on the other hyperplanes will have values +1 or -1

Vamshi DharDear Praveen,

when they are on the hyperplane the values of the dot product will not be 0, but 1 on one side of plane and it will be >1 as the point goes away from that plane.

Same way with other plane having -1 for point that is exactly on the plane and as it goes out of the plane then it will be < -1.

I guess this can be your answer. If i understood some thing from the Alexandre KOWALCZYK's 3 valuable "Understanding the math" parts.

Cheers to Alexandre KOWALCZYK

Vamshi DharDear Alexandre KOWALCZYK,

Desperately waiting for your fourth part .

Ulrich ArmelReally great article, it really made have a good feeling of SVM after I stop learning machine learning when I came across it and it was very poorly explained and I could n't understand anything.

imenI thank you a lot for this great job. It is amazing to find things very simple lik this. you are an angle.

Tonja RandHey,

amazing post!

However, I am a bit confused about the definitions of "hyperplane" and "margins". The two hyperplanes you have Ho and H1 (dotted lines) are the actual support vectors. A non-dotted line is a separating hyperplane. Right? I found in the internet two different definitions of "margins". In this figure (https://onionesquereality.files.wordpress.com/2009/03/optimal-margin-classifier.jpg) the margin is defined as a distance between a support vector and a separating hyperplane. In another figure (http://blog.pengyifan.com/wp-content/uploads/2013/09/svm.png) margin is a distance between two support vectors. And as far as I understood you define it also as a distance between two support vectors.

If there are two definitions of "margin" then there should be two different mathematical representations correspondingly. But as I see it does not matter if "margin" is defined as distance between two support vectors or a separating hyperplane and one support vector, it will have the same formula, namely 2/||w||. But how is it possible?

Alexandre KOWALCZYKPost authorHello. The two hyperplanes H0 and H1 are not support vectors, they are hyperplanes. As they separate the data, they could be called separating hyperplanes too. In the case were the central hyperplane is the optimal hyperplane, then vectors which lies on the two others hyperplanes will be called support vectors. In the case were the margin is the distance between a support vector and the hyperplane it will have a length of 1/||w||. It does not matter which one we choose for the margin because at the end we optimize the norm of w and the same argument works if we use 2/||w|| or 1/||w|| (minimizing the norm is like maximizing the margin). If you want to understand more about the margin you can take a look at this lecture of Andrew Ng which introduces the functional and geometric margin. I hope it helps you.

sourenathanks man...really wonderful tutorial...look forward to part 4.

sandeepSuch a wonderful explaination any layman can understand SVM. I appreciate your work and the way you explained.

Ma LuCould you please give a hint about when part 4 could be released, no stress i'mylevels just curious 😉

Alexandre KOWALCZYKPost authorHello. I honestly don't know. I will try my best to do it in the next few months.

optimusHi Alenxander, could you explain to me what is the correlaction between finding hyperplane with prediction process by svm ? I have read several articles about svm, but I still don't get it

Alexandre KOWALCZYKPost authorHello optimus. This is a two step process: you find any hyperplane which separate data, and you are able to make predictions on your training set. However you have no guarantee that it will perform well in the test set (that it will "generalize" well). With SVMs you find the "optimal" hyperplane, one which separates the data, but which is also as far away as possible from it. As a result it generalize better and you can expect that the prediction you made on unseen data will be better.

renjithbrilliant!! eagerly waiting for part 4!!! thanx a ton! 🙂

neranReally enjoying the article. Thanks for the clear description of something very powerful.

DanielYou explained everything so well and in such a non-patronising manner. I wish you were my teacher years ago. Looking forward to part 4.

Alexandre KOWALCZYKPost authorThank you very much Daniel. Part 4 is on its way 🙂

SourabhI really liked your post and it gave me a keen interest in SVm, although i do have some open points which I do not understand after reading this 3rd part. I followed till combining of the constraints in one equation(till eq 8). But I have some open points regarding the initial selection of any hyperplane like how we start initially with first hyperplane which will classify the dataset(I know we can do it through single perceptron),but the thing is here in this post when we maximizing the distance in those steps I really gone confused.

thanks for this wonderful post.

It will be great if you can take one very simple 2 dimensional example to explain what we did in part 3 which explain all steps we covered here , it will be very helpful for visualization ,like what is going on.

rossLooking forward to part 4.

Kaustav MukherjeeHey Alexander,

Let me first thank you for an absolutely outstanding explanation of SVM! I have scoured the internet upside down, ranging from lectures from MIT and Caltech profs, none of the explanations I have found can match yours in terms of simplicity and effectiveness! Please upload part 4 ASAP! I am unable to have peace of mind until I see it working end-to-end!

Alexandre KOWALCZYKPost authorThanks for you kind words. I am working on Part 4. 😉

sawssenHello, thank you very for this explanation , i need to inderstand the 4 th part , is it ready ?

Alexandre KOWALCZYKPost authorHello. The 4th part is not ready yet but contains already some content, I want to put enough content in it so that people are not frustrated and we the article advance the understanding of SVM. I have some time to work on it this week because I am on holidays s but most of the time I have very little time to write during the week, hence the big delay.

marceloHello Alexandre,

I have a question about a special case of SVM. I tried to find the support vectors for these training examples:

x1 = (0,1) y1 = +1

x2 = (1,0) y1 = +1

x3 = (-1,0) y1 = -1

x4 = (0,-1) y1 = -1

Clearly, all examples are support vectors because x1 and x2 are over the hyperplane H+ and x3 and x4 are over the hyperplane H-. However, when the LD is maximized subject to sum(yi.alphai = 0) and alphai >= 0, the obtained results are (I have used the Wolfram Alpha online optimizer):

alpha1 = alpha3

alpha2 = 1 - alpha3

alpha4 = 1 - alpha3

(notice that alphai is the Lagrage Multiplier of xi)

The solution that has sense to me is alpha1 = alpha3 = 1, but then alpha2 = alpha4 = 0, indicating that x2 and x4 are not support vectors, and this is illogical. Even more with alpha1 = alpha3 = 1 and alpha2 = alpha4 = 0, the w vector and b obtained return these results:

(w . x1) + b = +1 (x1 is over the H+, since x1 is a support vector)

(w . x2) + b = +1 (x1 is over the H+, but it is not a support vector because its Lagrange Multiplier is zero)

(w . x3) + b = -1 (x1 is over the H-, since x1 is a support vector)

(w . x4) + b = -1 (x1 is over the H-, but it is not a support vector because its Lagrange Multiplier is zero)

I have also used the svmtrain of Matlab and the support vectors obtained are the same (x1 and x3).

One last observation, if I add one more training example, [x5 = (2, 2) y5 = +1], only alpha5 is zero, this indicate that x1, x2, x3 and x4 are support vectors.

Do you have any suggestion to clarify this situation?

Thank you in advance.

Alexandre KOWALCZYKPost authorHello marcelo,

I performed the same experiment as you with a quadratic programming library and I have found 4 support vectors. This is because I was solving the problem without regularization (the hard-margin). However most libraries give you an implementation of soft-margin SVM with regularization where the hyperparameter C allows you to tune how much you want to regularize. By default they use a value of C which perform some regularization. I made a reproducible example in Python. If you change the value of C to be greater than one, you will have 2 support vectors. However as soon as you use a value less than 1 you have 4 support vectors. I hope it helps you.

aboodHello Alexandre,

thank you for an absolutely outstanding explanation of SVM

have you Code in Java for SVM !!!¿¿¿

Alexandre KOWALCZYKPost authorThanks abood. There is a java implementation of libsvm.

aboodThank you very much Alexandre for the java Code.

But I want to java Code only for linearly separable SVM please

ChewVery, very good explanation. Great Job!!!!

You have a book about this? If yes send-us the name, I wanna buy it!

Hope you can explain us the next steeps soon! I am excited with your step-by-step, figures and numerical examples.

Alexandre KOWALCZYKPost authorThanks a lot for your comment. I don't have a book yet 😉 The next part is coming up. Don't forget to drop your email and I will keep you informed.

G. Krishna veniSir really i satisfied with u r modules. Actually I did not have basic maths. When I gone through this blog i cleared doubts myself. I'm eagerly waiting for your next part

ntpIf you write a book about Machine Learning with these kind of awesome explanation, I will buy it. Anyone has a same thought?

aboodHelle Alexandre,

I want java Code for linearly separable SVM please!!!

neoli365Thank you very much for this clear, awesome explanation. Looking for part 4 ~~

CamExcellent tutorial. I'm looking forward to part 4.

Cam

SUMAN KUMAR CHOUDHURYGreat !!! wao !!! what an explanation... thank u very much.... I am waiting for the optimization in SVM (Part 4)

JorgeThank you very much! It´s great to read a thorough description of the SVM with a clear notation and the intuition behind the math.

Arun BMThis was best explanation ever....Now I need not see any vedios books Thanks a Lot

RimpleExcellent tutorial. Best explanation of SVMs. Eagerly waiting for part 4

😀

CorHi, thanks for writing this up. Could you please double-check that w is indeed perpendicular to H1? If you take any point on H1, this point is represented by a vector drawn from the origin to that point. This vector is obviously not perpendicular to w. I'd appreciate it if you could explain why I'm wrong 🙂

Alexandre KOWALCZYKPost authorThe two things that matter about w are its direction and its magnitude. I can draw it where I want and I will be the same vector (even if it does not have the same starting coordinates). I explain this in Part 2 when talking about the difference between two vectors. I hope it helps you 🙂

Bahareh MoradiHi!

In equation 9 we have K = mu = m ( w / ||w|| );

I think in defining the vector "k" we need something like b in wx+b=0

ti indicate the location of vector in the space.

just the magnitude and the norm shows where is the starting point of k?????

Please help me

Alexandre KOWALCZYKPost authorHello. No, you do not need to indicate the location of the vector in the space. This is often confusing, I explain this in Part 2 when talking about the difference between two vectors. Reread this section, it might help you understand.

Amal TarghiThank's for this course but i can't understand how do you calculate the vecotr W and the value of in order to know if wx+b superior or inferior to 0 please understand me

Alexandre KOWALCZYKPost authorThe value of w is what we are trying to find by solving the optimization problem explained in the following articles.

Sonu LambaDear Sir,

How we calculate w and b?. are they different for every feature and at the end we need to optimize them?..how we decide the the value of w and b for optimal hyperplane equation...if for each feature we are calculating w and b then what value of w and b we will choose?. and why we take unit value for w always even the data are not always at the same distance. I am not getting the base of w and b value. could you please explain it earliest. Thank you very much for your time and article.

Alexandre KOWALCZYKPost authorHello. There is only one vector w, and it will contain several values. If X is an (m,n) matrix, then a single example is a (1,n) vector and w is a (n,1) vector. So one vector w contains the values for all examples. There is a single value of b. They will be chosen by solving an optimization problem (you can read more on that on the later articles).

cam nguyenHello Alexandre, thank you so much for your great article !

I'm learning a project about human detection which using SVM algorithm to classify non-human or human. Can you tell me the relationship between the variables w, b, xi with the values from the input images? I haven't still imagined yet? TKS so much!

Alexandre KOWALCZYKPost authorHello. This is a pretty broad question and the answer depends a lot on how you transform your image. I advise you to read this paper to learn more about image classification with SVM.

cam nguyenHow to compute the weight vector w and bias b in linear SVM. We have a hyperplane equation and the positive and negative feature. In equation Wx+b= 0, what does it mean by weight vector and how to compute it??

Can anybody explain it please

MatthiasHello. I agree with the many comments praising your post. Thank you so much for it!

There's one thing I'm still confused about: How do we find a hyperplane that seperates the set in the first place? In step 2 it says: "Given a hyperplane H0 separating the dataset and satisfying[...]"

So how do we get H0? Did I somehow miss that?

Alexandre KOWALCZYKPost authorHello, Matthias. Thank you for your comment. The hyperplane H0 is introduced in order to derive the optimization problem. I could start an argument by "Given a point x on the horizontal axis", and it does not matter how you pick it, or where it is, what matters is that it is a point, and it is on the horizontal axis. The same apply in this context. But to answer your question, if you have a linearly separable dataset, you can find a separating hyperplane by using an algorithm like the Perceptron Learning Algorithm for instance, or of course the SMO algorithm for the SVM problem.

MatthiasThank you so much for clarifying. That was very helpful!

Kamakshi BansalYou are such an amazing teacher!! kudos to this post. This post taught me SVM in such a simple way.

trungthank you so much, you saved my whole day researching how the distance is computed and now I found it. thank you again!

Vijay GuptaAmazing !!!!

idk222What is the intuition for why minimizing the norm of w is the same as maximizing the margin m. From the algebra it is obvious.

w is the normal vector to H1. The norm of w is the magnitude of that vector. Do you have any intuition why that would be the same as choosing the biggest m? Thanks. Great articles!!

Alexandre KOWALCZYKPost authorI already thought about this but I did not come with a simple intuition. I would be glad to hear one if someone has a good explanation.

NumhackerIt's all about dot product. Normal vector w is perpendicular to the hyperplane, and so the margin vector k is. Given arbitrary vector w and k, what we want is the result of this dot product must be 1 (or any positif number, we use δ in this post). Since these two vectors is in parallel, the dot product is only based on its length.

How do you choose w and m which the dot product is 1 while we want to maximize margin? Yap, minimize the length of the normal vector w!

TetsuyaThanks a lot for the detailed explanation. That's the greatest article on SVM I have ever read before. Here I have one question. In Figure 4, 5, 6, 7, 8, there are hyperplanes written by wx - b = 1 and wx - b = -1, but we are talking about wx + b = 1 and wx + b = -1, right? So, I think that wx + b = 1 and wx + b = -1 should be also written in those figures. What do you think?

Alexandre KOWALCZYKPost authorYes indeed it would be more coherent to use wx+b=1 in the figure as well. Even if the only difference between an hyperplane with wx-b=1 and wx+b=1 is the sign of b. Thanks for pointing that out. It is in fact often confusing in the literature as well as some author use the wx+b convention and others use wx-b (Platt for example).

ridarafisyedthe w used here is also direction vectors?

Alexandre KOWALCZYKPost authorI am not sure to understand your question perfectly. But the vector w here is the vector normal to the hyperplane. In Part 2 we defined a vector w which was the direction of a vector u. But they are not the same vectors.

SamCan I upvote the article? I mean, I would click on the upvote icon til my mouse blows off.

RahulSimply Amazing. Sorted out all the doubts regarding SVM. Can you please take any classification example along with say MATLAB code without using any direct command?

Thanks

Alexandre KOWALCZYKPost authorThank you Rahul. Unfortunately my MATLAB is a little rusty 🙂

NickYou wrote in a previous tutorial: "Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the margin"

However with the constraints above(in this tutorial) there isn't guarantee that a data point (of the training set) was on the boundary of hyperplane H0 or H1. In this case the given definition of margin seems wrong. Am I wrong?

Another question is why δ=1? It should be correlation between the margin and δ. If you leave δ you get (if I'm not wrong):

margin = (2*δ)/||w||

And the optimization problem should ivolve the parameter δ too.

Why can simplification be done (δ=1)?

Alexandre KOWALCZYKPost authorHello Nick,

1) As the hyperplane will be at equal distance between the two nearest data point finding the distance from one point than multiplying it by 2 works.

However this is not very rigorous, that is why we see the "true" method to compute the margin in this article.

2) To better understand delta, you can read this paper. (Equation 2)

TusharGreat work. I am a Deep Learning researcher. I personally have never seen such simple and effective explanation for SVMs in any of the books (or across the web). This reminded me of the quote by Einstein...."If you cannot explain it simply, you probably have not understood it"....Cheers.

MariuszHi there. I really liked this article! As a reference for the future I created this simple visualization. Maybe somebody else will find it useful:

https://www.desmos.com/calculator/4jqnpapcv9

Alexandre KOWALCZYKPost authorThanks a lot Mariusz!

Joeygreat! thank you.

Olle WelinHi thank's very much for your tutorial. Very informative. (As usual when I got into new intellectual topics I often miss understand basics). Regarding SVM I wounder how will I do to find the first 3 support vector. One guess I have is this (correct me if I am wrong).

1. Create a first version of test "nr1 hyperplane" by Random select an origo and a random angle for the (bolded) 'w' vector (hyperplane)

2. Go through All vectors and if some vector is constrained by this test "nr1 hyperplane" then throe this vector. Maybee se how bad this constrained point and use this data to produce a new test "nr2 hyperplane".

3. When test "nr? hyperplane" satisfy ALL vectors then Start to find the 3 support vectors.

4. Now again go through the margin calculation method you describe for ALL the vectors and when some vector have smallest margin save this until you have find for example 'A' 'B' and 'H' in your example above.

I guess there must be a way to find this 3 support vector as fast as possible.

Very best Regards Olle Welin

I feel so slow but I am curious about machine learning how the basics math "nuts and bolts" works. (play with write my own C code on Raspberry pi)

Alexandre KOWALCZYKPost authorHello. If you want to solve the problem in a fast way, you should read Platt's SMO algorithm paper 😉

MohitGreat work Alexandre!Sorry m completely new to this so might be a novice question but am not able to visualise how z0 which is addition of x0 and k .... is in hyperplane H1 ?

"If we start from the point x0 and add k we find that the point z0=x0+k is in the hyperplane H1 ?"

Alexandre KOWALCZYKPost authorWell, you do not need to visualize it because you can

seeit on Figure 14. There is a review of vector addition on part 2 if you need help understanding the concept. Best Regards,Hannah LiuI am still confused about step 2.

Yes, with all the constraints, I will be able to know whether the hyperplanes are the ones that correctly separate the data. But how do I get the hyperplanes in the first place? Do I take a random guess? If after applying the constraints I found they are not the ones I am looking for, how do I improve myself?

Alexandre KOWALCZYKPost authorThat is the purpose of the algorithm SMO. Right now this article is building the foundation used to define the problem we want to solve, not explaining how to solve it. (I will explain all of that in my upcoming ebook)

Peyman BarjoueianHello Dear Alexandre.

Thank you for these good series.

I have a question. it make sense to minimize ||w|| value in our optimization problem. but why should also minimize the b value? how they(||w|| and b) are related?

Cheers.

Alexandre KOWALCZYKPost authorI think this page might help you understand the importance of b.

DeweiThanks for the detailed explanation. It's super helpful!

JX SongFirst of all I'd like to thank you for this awesome tutorial. There is however something I do not understand. In the part "Understanding the constraints". Why is it that for Xi = A, the point on the hyperplane is =1 and for Xi = C the point on the hyperplane is >1 ?

Miruna StefanDear Alexandre,

I have been circling SVM for over a year now, and while i understood the underlying principle, i never truly got how it worked. Most resources provide either a general, high-level explanation, or heavy mathematics that can be difficult to understand on your own.

Your post, on the other hand, is clear, detailed and explicit. I think you have invested a great deal of attention and time into this project, and i want you to know it is truly helpful and of great quality.

Thank you very much!

Miruna

Romain CHi,

Thanks again for this great tutorial,

I got to the end of the first tutorial steps and I am trying to go further reading your ebook on SVMs

I do not understand why :

- here we minimize ‖w‖

- on your ebook we minimize 1/2 ‖w‖ ^2

Thanks in advance for your help

Alexandre KOWALCZYKPost authorHello Romain,

I explain this in the book "The factor has been added for later convenience when we will use QP solver to solve the problem and squaring the norm has the advantage of removing the square root."

Best regards,

Steve ChamalesI've read the first 3 parts to this tutorial (this page is part 3). In the example you're working through, you chose your first hyperplane to be x2 = -2 * x1 because this is a very basic example that you could visualize. How do you select a hyperplane that best separates multi-dimensional data that can't be visualized by humans?

Alexandre KOWALCZYKPost authorBasically we can say that the first hyperplane is defined by a vector w starting at 0 or with random values, then an algorithm updates it by tiny increment until it find the optimal hyperplane (you can read the full detail of this in the chapter about the SMO algorithm in my ebook).

Yury ParfenovHi,

We have two parameters w and b in w⋅x+b=−1, w⋅x+b=1 equations. w is corresponding for rotation, b is corresponding for offset. I didn't get how we can control distance between two hyperplanes if we have shared b parameter and constant values on the right hand side (-1, 1). We only can rotate and shift hyperplanes by varying w and b. But we cannot change distance between planes.

Alexandre KOWALCZYKPost authorCompare the distance between the lines created by:

First equation 9.0x+1.0

Second equation 9.0x+3.0

and by

First equation 0.1x+1.0

Second equation 0.1x+3.0

Clearly by only shifting the hyperplane we can make the distance smaller.

Tejveer SinghPlease, can you explain step 1 in a bit detail?

Please explain equation of Hyperplace in step -2 also?

It is a bit unclear in the previous post also.

dr. hala omarThank you very much

your simple and clear explanation helped me a lot to understand the subject especially as a beginner in it.

KshitijThank you Very Much for very nice Explanation.