SVM - Understanding the math : the optimal hyperplane

This is Part 3 of my series of tutorials about the math behind Support Vector Machines.
If you did not read the previous articles, you might want to take a look before reading this one:

SVM - Understanding the math

Part 1: What is the goal of the Support Vector Machine (SVM)?
Part 2: How to compute the margin?
Part 3: How to find the optimal hyperplane?
Part 4: Unconstrained minimization
Part 5: Convex functions
Part 6: Duality and Lagrange multipliers

What is this article about?

The main focus of this article is to show you the reasoning allowing us to select the optimal hyperplane.

Here is a quick summary of what we will see:

  • How can we find the optimal hyperplane?
  • How do we calculate the distance between two hyperplanes?
  • What is the SVM optimization problem?

How to find the optimal hyperplane?

At the end of Part 2 we computed the distance \|p\| between a point A and a hyperplane. We then computed the margin, which was equal to 2\|p\|.

However, even if it did quite a good job at separating the data, it was not the optimal hyperplane.


Figure 1: The margin we calculated in Part 2 is shown as M1

As we saw in Part 1, the optimal hyperplane  is the one which maximizes the margin of the training data.

In Figure 1, we can see that the margin M_1, delimited by the two blue lines, is not the biggest margin separating perfectly the data. The biggest margin is the margin M_2 shown in Figure 2 below.


Figure 2: The optimal hyperplane is slightly on the left of the one we used in Part 2.

You can also see the optimal hyperplane on Figure 2. It is slightly on the left of our initial hyperplane. How did I find it? I simply drew a line crossing M_2 in its middle.

Right now you should have the feeling that hyperplanes and margins are closely related. And you would be right!

If I have a hyperplane, I can compute its margin with respect to some data point. If I have a margin delimited by two hyperplanes (the dark blue lines in Figure 2), I can find a third hyperplane passing right in the middle of the margin.

Finding the biggest margin is the same thing as finding the optimal hyperplane.

How can we find the biggest margin?

It is rather simple:

  1. You have a dataset.
  2. Select two hyperplanes which separate the data with no points between them.
  3. Maximize their distance (the margin).

The region bounded by the two hyperplanes will be the biggest possible margin.

If it is so simple, why does everybody have so much trouble understanding SVMs?
It is because, as always, the simplicity requires some abstraction and mathematical terminology to be well understood.

So we will now go through this recipe step by step:

Step 1: You have a dataset \mathcal{D} and you want to classify it

Most of the time your data will be composed of n vectors \mathbf{x}_i.

Each \mathbf{x}_i will also be associated with a value y_i indicating if the element belongs to the class (+1) or not (-1).

Note that y_i  can only have two possible values -1 or +1.

Moreover, most of the time, for instance when you do text classification, your vector \mathbf{x}_i ends up having a lot of dimensions. We can say that \mathbf{x}_i is a p-dimensional vector if it has p dimensions.

So your dataset \mathcal{D} is the set of n couples (\mathbf{x}_i, y_i).

The more formal definition of an initial dataset in set theory is:

\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n
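If you prefer to see this in code, here is a minimal sketch (in Python with NumPy); the points and labels are made up purely for illustration:

    import numpy as np

    # A tiny, made-up 2-dimensional dataset (p = 2, n = 6).
    # Each row of X is a vector x_i and y[i] is its class, +1 or -1.
    X = np.array([[1.0, 3.0], [2.0, 4.0], [2.5, 3.5],    # class +1
                  [4.0, 0.5], [5.0, 1.0], [5.5, 2.0]])   # class -1
    y = np.array([1, 1, 1, -1, -1, -1])

    # The dataset D is the set of n couples (x_i, y_i).
    D = list(zip(X, y))
    print(D[0])   # (array([1., 3.]), 1)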

Step 2: You need to select two hyperplanes separating the data with no points between them

Finding two hyperplanes separating some data is easy when you have a pencil and paper. But with p-dimensional data it becomes more difficult because you can't draw it.

Moreover, even if your data is only 2-dimensional it might not be possible to find a separating hyperplane !

You can only do that if your data is linearly separable.

Figure 3: Data on the left can be separated by a hyperplane, while data on the right can't

So let's assume that our dataset \mathcal{D} IS linearly separable. We now want to find two hyperplanes with no points between them, but we don't have a way to visualize them.

What do we know about hyperplanes that could help us?

Taking another look at the hyperplane equation

We saw previously that the equation of a hyperplane can be written

\mathbf{w}^T\mathbf{x} = 0

However, in the Wikipedia article about Support Vector Machines it is said that:

Any hyperplane can be written as the set of points \mathbf{x} satisfying \mathbf{w}\cdot\mathbf{x}+b=0.

First, we recognize another notation for the dot product: the article uses \mathbf{w}\cdot\mathbf{x} instead of \mathbf{w}^T\mathbf{x}.

You might wonder... Where does the +b come from? Is our previous definition incorrect?

Not quite. Once again it is a question of notation. In our definition the vectors \mathbf{w} and \mathbf{x} have three dimensions, while in the Wikipedia definition they have two dimensions:

Given two 3-dimensional vectors \mathbf{w}(b,-a,1) and \mathbf{x}(1,x,y)

\mathbf{w}\cdot\mathbf{x} = b\times (1) + (-a)\times x + 1 \times y

\begin{equation}\mathbf{w}\cdot\mathbf{x} = y - ax + b\end{equation}

Given two 2-dimensional vectors \mathbf{w^\prime}(-a,1) and \mathbf{x^\prime}(x,y)

\mathbf{w^\prime}\cdot\mathbf{x^\prime} = (-a)\times x + 1 \times y

\begin{equation}\mathbf{w^\prime}\cdot\mathbf{x^\prime} = y - ax\end{equation}

Now if we add b to both sides of equation (2) we get:

\mathbf{w^\prime}\cdot\mathbf{x^\prime} +b = y - ax +b

\begin{equation}\mathbf{w^\prime}\cdot\mathbf{x^\prime}+b = \mathbf{w}\cdot\mathbf{x}\end{equation}

For the rest of this article we will use 2-dimensional vectors (as in equation (2)).
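If you want to convince yourself numerically that the two notations agree, here is a small sketch (the values of a, b and the point are arbitrary):

    import numpy as np

    a, b = 2.0, 3.0               # arbitrary values
    px, py = 1.5, -0.5            # an arbitrary point (x, y)

    w3 = np.array([b, -a, 1.0])   # 3-dimensional w = (b, -a, 1)
    x3 = np.array([1.0, px, py])  # augmented vector x = (1, x, y)

    w2 = np.array([-a, 1.0])      # 2-dimensional w' = (-a, 1)
    x2 = np.array([px, py])       # plain vector x' = (x, y)

    print(np.dot(w3, x3))         # y - ax + b
    print(np.dot(w2, x2) + b)     # w'.x' + b gives the same value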

Given a hyperplane H_0 separating the dataset and satisfying:

\mathbf{w}\cdot\mathbf{x} + b=0

We can select two other hyperplanes H_1 and H_2 which also separate the data and have the following equations:

\mathbf{w}\cdot\mathbf{x} + b=\delta


\mathbf{w}\cdot\mathbf{x} + b=-\delta

so that H_0 is equidistant from H_1 and H_2.

However, the exact value of \delta does not matter: if we divide \mathbf{w} and b by \delta, the two equations become \mathbf{w}\cdot\mathbf{x} + b=1 and \mathbf{w}\cdot\mathbf{x} + b=-1. So we can set \delta=1 to simplify the problem, without losing any generality.

\mathbf{w}\cdot\mathbf{x} + b=1


\mathbf{w}\cdot\mathbf{x} + b=-1

Now we want to be sure that they have no points between them.

We won't select just any hyperplanes; we will only select those which meet the two following constraints:

For each vector \mathbf{x_i}, either:

\begin{equation}\mathbf{w}\cdot\mathbf{x_i} + b \geq 1\;\text{for }\;\mathbf{x_i}\;\text{having the class}\;1\end{equation}


\begin{equation}\mathbf{w}\cdot\mathbf{x_i} + b \leq -1\;\text{for }\;\mathbf{x_i}\;\text{having the class}\;-1\end{equation}
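To make the constraints concrete, here is a small sketch of how you could check them for a candidate couple (\mathbf{w}, b); the function name and the values used below are hypothetical:

    import numpy as np

    def respects_constraints(X, y, w, b):
        """True if w.x_i + b >= 1 for every x_i of class +1
        and w.x_i + b <= -1 for every x_i of class -1."""
        values = X.dot(w) + b
        return bool(np.all(values[y == 1] >= 1) and np.all(values[y == -1] <= -1))

    # Example with the toy dataset from Step 1 and a hypothetical (w, b):
    X = np.array([[1, 3], [2, 4], [2.5, 3.5], [4, 0.5], [5, 1], [5.5, 2]], dtype=float)
    y = np.array([1, 1, 1, -1, -1, -1])
    print(respects_constraints(X, y, w=np.array([-1.0, 1.0]), b=0.5))   # True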

Understanding the constraints

On the following figures, all red points have the class 1 and all blue points have the class -1.

So let's look at Figure 4 below and consider the point A. It is red, so it has the class 1, and we need to verify that it does not violate the constraint \mathbf{w}\cdot\mathbf{x_i} + b \geq 1.

When \mathbf{x_i} = A we see that the point is on the hyperplane, so \mathbf{w}\cdot\mathbf{x_i} + b = 1 and the constraint is respected. The same applies for B.

When \mathbf{x_i} = C we see that the point is above the hyperplane, so \mathbf{w}\cdot\mathbf{x_i} + b > 1 and the constraint is respected. The same applies for D, E, F and G.

With an analogous reasoning you should find that the second constraint is respected for the class -1.

Figure 4: Two hyperplanes satisfying the constraints

On Figure 5, we see another couple of hyperplanes respecting the constraints:

Figure 5: Two hyperplanes also satisfying the constraints

And now we will examine cases where the constraints are not respected:

Figure 6: The right hyperplane does not satisfy the first constraint

Figure 7: The left hyperplane does not satisfy the second constraint

Figure 8: Neither constraint is satisfied

What does it mean when a constraint is not respected? It means that we cannot select these two hyperplanes. You can see that every time the constraints are not satisfied (Figures 6, 7 and 8) there are points between the two hyperplanes.

By defining these constraints, we found a way to reach our initial goal of selecting two hyperplanes without points between them. And it works not only in our examples but also in p dimensions!

Combining both constraints

In mathematics, people like things to be expressed concisely.

Equations (4) and (5) can be combined into a single constraint:

We start with equation (5)

\text{for }\;\mathbf{x_i}\;\text{having the class}\;-1

\mathbf{w}\cdot\mathbf{x_i}+b \leq -1

We multiply both sides by y_i (which is always -1 in this equation; multiplying an inequality by a negative number reverses its direction):

y_i(\mathbf{w}\cdot\mathbf{x_i}+b ) \geq y_i(-1)

Which means equation (5) can also be written:

\begin{equation}y_i(\mathbf{w}\cdot\mathbf{x_i} + b) \geq 1\;\text{for }\;\mathbf{x_i}\;\text{having the class}\;-1\end{equation}

In equation (4), since y_i = 1, multiplying by y_i does not change the direction of the inequality:

\begin{equation}y_i(\mathbf{w}\cdot\mathbf{x_i} + b) \geq 1\;\text{for }\;\mathbf{x_i}\;\text{having the class}\;1\end{equation}

We combine equations (6) and (7):

\begin{equation}y_i(\mathbf{w}\cdot\mathbf{x_i} + b) \geq 1\;\text{for all}\;1\leq i \leq n\end{equation}

We now have a single constraint (equation 8) instead of two (equations 4 and 5), but they are mathematically equivalent. So their effect is the same (there will be no points between the two hyperplanes).
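In code, the combined constraint becomes a one-line check, equivalent to testing constraints (4) and (5) separately (again with a hypothetical (\mathbf{w}, b)):

    import numpy as np

    def respects_combined_constraint(X, y, w, b):
        # y_i * (w . x_i + b) >= 1 for all i
        return bool(np.all(y * (X.dot(w) + b) >= 1))

    # Same result as checking the two class-specific constraints separately:
    X = np.array([[1, 3], [2, 4], [2.5, 3.5], [4, 0.5], [5, 1], [5.5, 2]], dtype=float)
    y = np.array([1, 1, 1, -1, -1, -1])
    print(respects_combined_constraint(X, y, w=np.array([-1.0, 1.0]), b=0.5))   # True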

Step 3: Maximize the distance between the two hyperplanes

This is probably the hardest part of the problem. But don't worry, I will explain everything along the way.

a) What is the distance between our two hyperplanes?

Before trying to maximize the distance between the two hyperplanes, we will first ask ourselves: how do we compute it?

Let:
  • \mathcal{H}_0 be the hyperplane having the equation \textbf{w}\cdot\textbf{x} + b = -1
  • \mathcal{H}_1 be the hyperplane having the equation \textbf{w}\cdot\textbf{x} + b = 1
  • \textbf{x}_0 be a point in the hyperplane \mathcal{H}_0.

We will call m the perpendicular distance from \textbf{x}_0 to the hyperplane \mathcal{H}_1. By definition, m is what we usually call the margin.

As \textbf{x}_0 is in \mathcal{H}_0, m is the distance between hyperplanes \mathcal{H}_0 and \mathcal{H}_1 .

We will now try to find the value of m.

Figure 9: m is the distance between the two hyperplanes

You might be tempted to think that if we add m to \textbf{x}_0 we will get another point, and this point will be on the other hyperplane!

But it does not work, because m is a scalar and \textbf{x}_0 is a vector, and adding a scalar to a vector is not possible. However, we know that adding two vectors is possible, so if we transform m into a vector we will be able to do an addition.

We can find the set of all points which are at a distance m from \textbf{x}_0. It can be represented as a circle:

Figure 10: All points on the circle are at the distance m from x0

Looking at the picture, the necessity of a vector becomes clear. With just the length m we are missing one crucial piece of information: the direction. (Recall from Part 2 that a vector has a magnitude and a direction.)

We can't add a scalar to a vector, but we know that if we multiply a vector by a scalar we get another vector.

From our initial statement, we want  this vector:

  1. to have a magnitude of m
  2. to be perpendicular to the hyperplane \mathcal{H}_1

Fortunately, we already know a vector perpendicular to \mathcal{H}_1, namely \textbf{w} (because \mathcal{H}_1 is defined by \textbf{w}\cdot\textbf{x} + b = 1).

Figure 11: w is perpendicular to H1

Let's define \textbf{u} = \frac{\textbf{w}}{\|\textbf{w}\|}, the unit vector of \textbf{w}. As it is a unit vector, \|\textbf{u}\| = 1, and it has the same direction as \textbf{w}, so it is also perpendicular to the hyperplane.

Figure 12: u is also perpendicular to H1

If we multiply \textbf{u} by m we get the vector:

\begin{equation}\textbf{k} = m\textbf{u} = m\frac{\textbf{w}}{\|\textbf{w}\|}\end{equation}

and:

  1. \|\textbf{k}\| = m
  2. \textbf{k} is perpendicular to \mathcal{H}_1 (because it has the same direction as \textbf{u})

From these properties we can see that \textbf{k} is the vector we were looking for.

Figure 13: k is a vector of length m perpendicular to H1


We did it! We transformed our scalar m into a vector \textbf{k}, which we can add to the vector \textbf{x}_0.

If we start from the point \textbf{x}_0 and add \textbf{k}, we find that the point \textbf{z}_0 = \textbf{x}_0 + \textbf{k} is in the hyperplane \mathcal{H}_1, as shown in Figure 14.

Figure 14: z0 is a point on H1


The fact that \textbf{z}_0 is in \mathcal{H}_1 means that

 \begin{equation}\textbf{w}\cdot\textbf{z}_0+b = 1\end{equation}

We can replace \textbf{z}_0 by \textbf{x}_0+\textbf{k} because that is how we constructed it.

 \begin{equation}\textbf{w}\cdot(\textbf{x}_0+\textbf{k})+b = 1\end{equation}

We can now replace \textbf{k} using equation (9)

 \begin{equation}\textbf{w}\cdot(\textbf{x}_0+m\frac{\textbf{w}}{\|\textbf{w}\|})+b = 1\end{equation}

We now expand equation (12)

 \begin{equation}\textbf{w}\cdot\textbf{x}_0 +m\frac{\textbf{w}\cdot\textbf{w}}{\|\textbf{w}\|}+b = 1\end{equation}

The dot product of a vector with itself is the square of its norm, so:

 \begin{equation}\textbf{w}\cdot\textbf{x}_0 +m\frac{\|\textbf{w}\|^2}{\|\textbf{w}\|}+b = 1\end{equation}

 \begin{equation}\textbf{w}\cdot\textbf{x}_0 +m\|\textbf{w}\|+b = 1\end{equation}

 \begin{equation}\textbf{w}\cdot\textbf{x}_0 +b = 1 - m\|\textbf{w}\|\end{equation}

As \textbf{x}_0 is in \mathcal{H}_0, we have \textbf{w}\cdot\textbf{x}_0 + b = -1, so:

\begin{equation} -1= 1 - m\|\textbf{w}\|\end{equation}

\begin{equation} m\|\textbf{w}\|= 2\end{equation}

\begin{equation} m = \frac{2}{\|\textbf{w}\|}\end{equation}

This is it! We found a way to compute m.
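You can verify the whole derivation numerically. The sketch below (with an arbitrary \textbf{w} and b) picks a point \textbf{x}_0 on \mathcal{H}_0, builds \textbf{k} = m\frac{\textbf{w}}{\|\textbf{w}\|} and checks that \textbf{z}_0 = \textbf{x}_0 + \textbf{k} lands on \mathcal{H}_1:

    import numpy as np

    w = np.array([3.0, 4.0])          # an arbitrary w, with ||w|| = 5
    b = -2.0

    m = 2.0 / np.linalg.norm(w)       # the margin m = 2 / ||w||

    # Pick a point x0 on H0 (w.x0 + b = -1): fix the first coordinate to 1
    # and solve for the second one.
    x0 = np.array([1.0, (-1.0 - b - w[0] * 1.0) / w[1]])
    print(np.dot(w, x0) + b)          # -1.0 -> x0 is indeed on H0

    # k = m * u, where u = w / ||w|| is the unit vector of w.
    k = m * w / np.linalg.norm(w)

    # z0 = x0 + k should be on H1 (w.z0 + b = 1).
    z0 = x0 + k
    print(np.dot(w, z0) + b)          # 1.0 -> the two hyperplanes are at distance m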

b) How to maximize the distance between our two hyperplanes

We now have a formula to compute the margin:

m= \frac{2}{\|\textbf{w}\|}

The only variable we can change in this formula is the norm of  \mathbf{w}.

Let's try to give it different values:

When \|\textbf{w}\|=1 then m=2

When \|\textbf{w}\|=2 then m=1

When \|\textbf{w}\|=4 then m=\frac{1}{2}

One can easily see that the bigger the norm is, the smaller the margin becomes.

Maximizing the margin is the same thing as minimizing the norm of \textbf{w}.

Our goal is to maximize the margin. Among all possible hyperplanes meeting the constraints, we will choose the hyperplane with the smallest \|\textbf{w}\|, because it is the one which will have the biggest margin.

This gives us the following optimization problem:

Minimize in (\textbf{w}, b)

\|\textbf{w}\|

subject to y_i(\mathbf{w}\cdot\mathbf{x_i}+b) \geq 1

(for any i = 1, \dots, n)

Solving this problem is like solving an equation. Once we have solved it, we will have found the couple (\textbf{w}, b) for which \|\textbf{w}\| is the smallest possible and the constraints we fixed are met. Which means we will have the equation of the optimal hyperplane!
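If you are curious what the solution looks like on the toy dataset from Step 1, here is a sketch using scikit-learn. Note that sklearn.svm.SVC solves the soft-margin version of this problem; a very large C is used here only to approximate the hard-margin formulation described above:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 3], [2, 4], [2.5, 3.5], [4, 0.5], [5, 1], [5.5, 2]], dtype=float)
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)   # large C ~ hard margin
    clf.fit(X, y)

    w = clf.coef_[0]                    # the w of the optimal hyperplane
    b = clf.intercept_[0]               # the b of the optimal hyperplane
    print("w =", w, "b =", b)
    print("margin m =", 2.0 / np.linalg.norm(w))
    # All constraints should be met (up to numerical tolerance):
    print(np.all(y * (X.dot(w) + b) >= 1 - 1e-6))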


We discovered that finding the optimal hyperplane requires us to solve an optimization problem. Optimization problems are themselves somewhat tricky, and you need more background to be able to solve them. So we will go step by step. Let us discover unconstrained minimization problems in Part 4! Thanks for reading.

I am passionate about machine learning and Support Vector Machine. When I am not writing this blog, you can find me on Kaggle participating in some competition.

97 thoughts on “SVM - Understanding the math : the optimal hyperplane”

  1. dragon518

    Thank you very much! I have never read a technology blog more fundamental, detailed, clear than yours before. You are a specialist on SVM. BTW, look forward to read part-4.

  2. snake2971chris

    really nice report! Just one thing, if you have data points in a 2 dimensional space(x1,x2), like in your examples, you should visualize it in 3 dimensions since the hyperplane is spanned in a third dimension(y). If you don't want to mess around with 3 dimensions use a simple 1D example, so your hyperplane can be really visualized with a line.

    That was a thing which confused me for quiet a bit. But once I had the right visualization everything else became kind of intuitive.

    1. Alexandre KOWALCZYK Post author

      Yeah sure. Here the third dimension is depicted by the color, but indeed it can be confusing. Thanks for your feedback 🙂

  3. Abhinand (Abhi) Sivaprasad

    Why are the hyper planes set to 1 and -1. It seems arbitrary? This essentially sets them 2 units apart. What if the data points with labels 1 and labels -1 were very far away?

    1. Alexandre KOWALCZYK Post author

      They are set to 1 and -1 because it is the only possible values the vector y can take. As suggested by snake2971chris in a previous comment, visualizing the picture in 3 dimensions can help to understand this.

    1. Alexandre KOWALCZYK Post author

      This is by definition. The equation is called a normal equation for the hyperplane. I invite you to read the part about hyperplanes in this course (page 9). This is about the calculus of functions of several variables but I used it as a reference several times to write this serie of articles (you can read the full course here).

    2. archkrish

      The hyperplanes are defined in the way that there is vector which is perpendicular to all the vectors in the hyperplane...imagine hyperplane consists of all the vectors which are perpendicular to w.There are good video lectures available which explains hyperplane.

  4. daniel

    simple yet insightful explanation on the math behind the scene...cant wait to see the release of part 4...thanks for your effort in producing this tutorial, very helpful

  5. Rahul

    Hey, really great tutorial, waiting for the next part. I had a doubt...Please help!!
    Let us consider a two dimensional dataset as shown in your tutorial. My doubt is what is the physical significance of w.I know it is a vector perpendicular to the hyperplane but what does it actually signify

    1. Alexandre KOWALCZYK Post author

      Hi Tejas. Thanks for your comment. I don't have a precise date because I have been very busy at work. As there is a lot of people asking me I will try to do my best to publish it soon (before 2016)
      I added a small form at the end of the article so that you can put your email in it if you want me to send you a mail when it will be published.

  6. Hugo

    Thanks for the great tutorial. The expressions wx-b=1 and wx-b=-1 are however a little confusing. You write the following in defence of these equations.

    "And we choose +1 and -1 because they are the only values the class can take."

    Question: My understanding is that the result of wx-b should not evaluate to a class label. What would evaluate to a class label is an expression such as f(wx-b) where f is the classification rule that assigns the product between a given test vector and the w vector to the classes 1 or -1. Can you clarify why you are having wx-b evaluating to the value of the class labels? Some literature suggests that the numerical values of the class labels are only arbitrary -- in fact -1 and 1 are chosen in SVM algebra because they enable an elegant way to represent the SVM (e.g., we are able to write y(wx-b)>0 because y can only be 1 or -1).

    Could you have chosen 1 and -1 because you wanted to have the planes wx-b=-1,1 equidistant from wx-b=0. In that case does it mean that I could have used numbers such as 2 and -2?

    1. Alexandre KOWALCZYK Post author

      Thanks for your comment Hugo.

      Indeed the expression "And we choose +1 and -1 because they are the only values the class can take." was an error. I checked online and this value has no connection with the class label.
      You can see equation (2) of this paper for more details.

      I updated the article so that it is more explicit how this number is choosen.
      I think we could use numbers such as 2 and -2 but I am not sure, I need to check the math to be 100% sure.

      Thanks a lot.

    2. ZC Liu

      Actually, you can choose +2 and -2 instead of +1 and -1.
      Then you follow the steps: 10 ~ 19, you will get m = 4 / w'.
      means you find a w', which length is half of the original one(w), when you choose +1 and -1.
      then the support hyper plane is: w' * x + b' = 2 => 2w * x + 2b = 2 , which is equivalent to the
      original one: w * x + b = 1.

      so you can choose other number either, eventually, you will get the same support hyper plane.

  7. Mutaz

    Really thanks a lot for this series
    This is the first time I see such a simplicity of explaining the SVM
    I am waiting part 4, I am a bit scared to learn how solve that optimization problem. Please try to explain it as simple as possible.

    Thanks a lot

      1. 강산하

        I really really appriciate for your aritcle.
        I am graduate school student in South Korea

        I have a question,

        Why do you set δ = 1.-1 ?

        I think δ is related to margin and

        it should smaller if there are data point and equation (4), (5) are not true.

        What do you think?

        1. Alexandre KOWALCZYK Post author

          We could have kept \delta in all the following equations. At the end we would have had y_i(\mathbf{w}\cdot\mathbf{x_i}-b)\geq\delta in equation 8. Indeed, \delta = y_i(\mathbf{w}\cdot\mathbf{x_i}-b) is the functional margin. You can read part 3 of this course to understand better why we can set it to 1 and -1. You will see in part 4 that he replaces the functionnal margin by 1. I hope this helps you.

          1. Ganecian

            what does \delta means? I still didn't understand this paragraph,
            "We now note that we have over-parameterized the problem: if we scale w, b and \delta by a constant factor \alpha, the equations for x are still satisfied. To remove this ambiguity we will require that \delta = 1, this sets the scale of the problem, i.e. if we measure distance in millimeters or meters. " Can you explain this?

  8. 강산하

    another question

    In step 2, we select 2 hyperplane, so I think w is fixed in that time

    but step3 you tell we minimize w to get biggest margin

    I think w nerver chage after step2...

    Can you explain?

    1. Alexandre KOWALCZYK Post author

      In step 2 we select two hyperplanes among all possible couple of hyperplanes satisfying the constraints (4) and (5). Given one hyperplane separating the data, we can find two other hyperplanes meeting the constraint. Given another hyperplane separating the data (with a different w than the first one), we can find two other different hyperplanes meeting the constraint. So minimizing w in step 3 allow us to find the two hyperplanes having the biggest margin and the optimal hyperplane.

  9. Cao Fuxiang

    Hello Alexandre!

    Thank you very much for your excellent series of posts. You convert a very complicated math problem into a very clear and simple one. I have never seen such a great technical blog before!

    I look forward to reading your part-4.

    Thank you once again.

  10. Reshma

    Really an awesome tutorial with the unique way of explanations. Each part of SVM was soo well explained. You are an expert in SVM. Looking forward for the part 4.

    1. Alexandre KOWALCZYK Post author

      You are welcome. I am still very busy right now but I don't give up on part 4. I just have some more SVM related work to do before.

  11. Rajive Jain

    This is the clearest explanation of a complex topic that I have ever seen. Why can't the professors in classes teach the same way? I have a gut feeling that they don't do it because they don't understand it themselves! I went through your entire tutorial in a couple of hours and undestood the entire conceptual framework of SVM - something I never got from my professors in the entire semester. I am sure it took you many days of hard work to put this together. Thanks for sharing your knowledge and understanding - I really appreciate your service to the community.

  12. tyvannguyen

    I've to say that this one is one of the best tutorial I've read.
    It can viewed as a beginer-to-intermediate lecture with all neccessary visual demonstrations.
    Thank you very much.

  13. Manziba

    Its wonderful article.So simple explanation which I was looking for.Can you please upload more articles explaining Lgranagian to find optimal w

  14. Praveen

    I have a question on th dot product of the 2 vectors w.x -b = 1 and -1.
    Isnt it be >0 and <0 . Since we are taking the dot product of the normal vector w with another point x which is like dot product of the two vectors
    |w||x|.cos(theta) and the values will vary between 1 and -1 and when they are on the hyperplane the values of the dot product will be 0.
    Please correct me if I am wrong.

    1. Alexandre KOWALCZYK Post author

      We don't want the point to be too close to the initial hyperplane, that is why we introduce delta. We choose two hyperplanes at a distance delta from the original hyperplane. We then arbitrarely set delta = 1. But that means that points on the other hyperplanes will have values +1 or -1

  15. Vamshi Dhar

    Dear Praveen,

    when they are on the hyperplane the values of the dot product will not be 0, but 1 on one side of plane and it will be >1 as the point goes away from that plane.
    Same way with other plane having -1 for point that is exactly on the plane and as it goes out of the plane then it will be < -1.
    I guess this can be your answer. If i understood some thing from the Alexandre KOWALCZYK's 3 valuable "Understanding the math" parts.

    Cheers to Alexandre KOWALCZYK

  16. Ulrich Armel

    Really great article, it really made have a good feeling of SVM after I stop learning machine learning when I came across it and it was very poorly explained and I could n't understand anything.

  17. Tonja Rand

    amazing post!
    However, I am a bit confused about the definitions of "hyperplane" and "margins". The two hyperplanes you have Ho and H1 (dotted lines) are the actual support vectors. A non-dotted line is a separating hyperplane. Right? I found in the internet two different definitions of "margins". In this figure ( the margin is defined as a distance between a support vector and a separating hyperplane. In another figure ( margin is a distance between two support vectors. And as far as I understood you define it also as a distance between two support vectors.

    If there are two definitions of "margin" then there should be two different mathematical representations correspondingly. But as I see it does not matter if "margin" is defined as distance between two support vectors or a separating hyperplane and one support vector, it will have the same formula, namely 2/||w||. But how is it possible?

    1. Alexandre KOWALCZYK Post author

      Hello. The two hyperplanes H0 and H1 are not support vectors, they are hyperplanes. As they separate the data, they could be called separating hyperplanes too. In the case were the central hyperplane is the optimal hyperplane, then vectors which lies on the two others hyperplanes will be called support vectors. In the case were the margin is the distance between a support vector and the hyperplane it will have a length of 1/||w||. It does not matter which one we choose for the margin because at the end we optimize the norm of w and the same argument works if we use 2/||w|| or 1/||w|| (minimizing the norm is like maximizing the margin). If you want to understand more about the margin you can take a look at this lecture of Andrew Ng which introduces the functional and geometric margin. I hope it helps you.

    1. Alexandre KOWALCZYK Post author

      Hello. I honestly don't know. I will try my best to do it in the next few months.

  18. optimus

    Hi Alenxander, could you explain to me what is the correlaction between finding hyperplane with prediction process by svm ? I have read several articles about svm, but I still don't get it

    1. Alexandre KOWALCZYK Post author

      Hello optimus. This is a two step process: you find any hyperplane which separate data, and you are able to make predictions on your training set. However you have no guarantee that it will perform well in the test set (that it will "generalize" well). With SVMs you find the "optimal" hyperplane, one which separates the data, but which is also as far away as possible from it. As a result it generalize better and you can expect that the prediction you made on unseen data will be better.

  19. Daniel

    You explained everything so well and in such a non-patronising manner. I wish you were my teacher years ago. Looking forward to part 4.

  20. Sourabh

    I really liked your post and it gave me a keen interest in SVm, although i do have some open points which I do not understand after reading this 3rd part. I followed till combining of the constraints in one equation(till eq 8). But I have some open points regarding the initial selection of any hyperplane like how we start initially with first hyperplane which will classify the dataset(I know we can do it through single perceptron),but the thing is here in this post when we maximizing the distance in those steps I really gone confused.

    thanks for this wonderful post.
    It will be great if you can take one very simple 2 dimensional example to explain what we did in part 3 which explain all steps we covered here , it will be very helpful for visualization ,like what is going on.

  21. Kaustav Mukherjee

    Hey Alexander,

    Let me first thank you for an absolutely outstanding explanation of SVM! I have scoured the internet upside down, ranging from lectures from MIT and Caltech profs, none of the explanations I have found can match yours in terms of simplicity and effectiveness! Please upload part 4 ASAP! I am unable to have peace of mind until I see it working end-to-end!

    1. Alexandre KOWALCZYK Post author

      Hello. The 4th part is not ready yet but contains already some content, I want to put enough content in it so that people are not frustrated and we the article advance the understanding of SVM. I have some time to work on it this week because I am on holidays s but most of the time I have very little time to write during the week, hence the big delay.

  22. marcelo

    Hello Alexandre,
    I have a question about a special case of SVM. I tried to find the support vectors for these training examples:
    x1 = (0,1) y1 = +1
    x2 = (1,0) y1 = +1
    x3 = (-1,0) y1 = -1
    x4 = (0,-1) y1 = -1
    Clearly, all examples are support vectors because x1 and x2 are over the hyperplane H+ and x3 and x4 are over the hyperplane H-. However, when the LD is maximized subject to sum(yi.alphai = 0) and alphai >= 0, the obtained results are (I have used the Wolfram Alpha online optimizer):
    alpha1 = alpha3
    alpha2 = 1 - alpha3
    alpha4 = 1 - alpha3
    (notice that alphai is the Lagrage Multiplier of xi)
    The solution that has sense to me is alpha1 = alpha3 = 1, but then alpha2 = alpha4 = 0, indicating that x2 and x4 are not support vectors, and this is illogical. Even more with alpha1 = alpha3 = 1 and alpha2 = alpha4 = 0, the w vector and b obtained return these results:
    (w . x1) + b = +1 (x1 is over the H+, since x1 is a support vector)
    (w . x2) + b = +1 (x1 is over the H+, but it is not a support vector because its Lagrange Multiplier is zero)
    (w . x3) + b = -1 (x1 is over the H-, since x1 is a support vector)
    (w . x4) + b = -1 (x1 is over the H-, but it is not a support vector because its Lagrange Multiplier is zero)
    I have also used the svmtrain of Matlab and the support vectors obtained are the same (x1 and x3).
    One last observation, if I add one more training example, [x5 = (2, 2) y5 = +1], only alpha5 is zero, this indicate that x1, x2, x3 and x4 are support vectors.
    Do you have any suggestion to clarify this situation?
    Thank you in advance.

    1. Alexandre KOWALCZYK Post author

      Hello marcelo,

      I performed the same experiment as you with a quadratic programming library and I have found 4 support vectors. This is because I was solving the problem without regularization (the hard-margin). However most libraries give you an implementation of soft-margin SVM with regularization where the hyperparameter C allows you to tune how much you want to regularize. By default they use a value of C which perform some regularization. I made a reproducible example in Python. If you change the value of C to be greater than one, you will have 2 support vectors. However as soon as you use a value less than 1 you have 4 support vectors. I hope it helps you.
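      A sketch of the idea (not the exact script mentioned above, just an illustration) could look like this:

          import numpy as np
          from sklearn.svm import SVC

          X = np.array([[0, 1], [1, 0], [-1, 0], [0, -1]], dtype=float)
          y = np.array([1, 1, -1, -1])

          for C in (0.5, 10.0):              # one value below 1, one above
              clf = SVC(kernel="linear", C=C)
              clf.fit(X, y)
              print(C, clf.support_)         # indices of the support vectors found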

      1. abood

        Hello Alexandre,
        thank you for an absolutely outstanding explanation of SVM
        have you Code in Java for SVM !!!¿¿¿

          1. abood

            Thank you very much Alexandre for the java Code.
            But I want to java Code only for linearly separable SVM please

  23. Chew

    Very, very good explanation. Great Job!!!!

    You have a book about this? If yes send-us the name, I wanna buy it!

    Hope you can explain us the next steeps soon! I am excited with your step-by-step, figures and numerical examples.

    1. Alexandre KOWALCZYK Post author

      Thanks a lot for your comment. I don't have a book yet 😉 The next part is coming up. Don't forget to drop your email and I will keep you informed.

  24. G. Krishna veni

    Sir really i satisfied with u r modules. Actually I did not have basic maths. When I gone through this blog i cleared doubts myself. I'm eagerly waiting for your next part

  25. ntp

    If you write a book about Machine Learning with these kind of awesome explanation, I will buy it. Anyone has a same thought?


    Great !!! wao !!! what an explanation... thank u very much.... I am waiting for the optimization in SVM (Part 4)

  27. Jorge

    Thank you very much! It´s great to read a thorough description of the SVM with a clear notation and the intuition behind the math.

  28. Cor

    Hi, thanks for writing this up. Could you please double-check that w is indeed perpendicular to H1? If you take any point on H1, this point is represented by a vector drawn from the origin to that point. This vector is obviously not perpendicular to w. I'd appreciate it if you could explain why I'm wrong 🙂

    1. Alexandre KOWALCZYK Post author

      The two things that matter about w are its direction and its magnitude. I can draw it where I want and I will be the same vector (even if it does not have the same starting coordinates). I explain this in Part 2 when talking about the difference between two vectors. I hope it helps you 🙂

  29. Bahareh Moradi

    In equation 9 we have K = mu = m ( w / ||w|| );
    I think in defining the vector "k" we need something like b in wx+b=0
    ti indicate the location of vector in the space.
    just the magnitude and the norm shows where is the starting point of k?????
    Please help me

    1. Alexandre KOWALCZYK Post author

      Hello. No, you do not need to indicate the location of the vector in the space. This is often confusing, I explain this in Part 2 when talking about the difference between two vectors. Reread this section, it might help you understand.

  30. Amal Targhi

    Thank's for this course but i can't understand how do you calculate the vecotr W and the value of in order to know if wx+b superior or inferior to 0 please understand me

    1. Alexandre KOWALCZYK Post author

      The value of w is what we are trying to find by solving the optimization problem explained in the following articles.

