This is Part 2 of my series of tutorials about the math behind Support Vector Machines.

If you have not read the previous article, you might want to take a look at it before reading this one:

### SVM - Understanding the math

Part 1: What is the goal of the Support Vector Machine (SVM)?

**Part 2: How to compute the margin?**

Part 3: How to find the optimal hyperplane?

Part 4: Unconstrained minimization

Part 5: Convex functions

Part 6: Duality and Lagrange multipliers

In the first part, we saw what the aim of the SVM is: to find the hyperplane which maximizes the margin.

**But how do we calculate this margin?**

## SVM = Support VECTOR Machine

In Support Vector Machine, there is the word **vector**.
That means it is important to understand vectors well and to know how to use them.

Here is a short summary of what we will see today:

- What is a vector?
  - its norm
  - its direction
- How to add and subtract vectors?
- What is the dot product?
- How to project a vector onto another?

Once we have all these tools in our toolbox, we will then see:

- What is the equation of the hyperplane?
- How to compute the margin?

## What is a vector?

If we define a point $A$ in $\mathbb{R}^2$, we can plot it like this.

Definition: Any point $x = (x_1, x_2)$, $x \neq 0$, in $\mathbb{R}^2$ specifies a vector in the plane, namely the vector starting at the origin and ending at $x$.

This definition means that there exists a vector between the origin and A.

If we say that the point at the origin is the point $O(0,0)$, then the vector above is the vector $\overrightarrow{OA}$. We could also give it an arbitrary name such as $\mathbf{u}$.

**Note**: You may notice that vectors are written either with an arrow on top or in bold. In the rest of this text I will use the arrow when there are two letters, like $\overrightarrow{OA}$, and the bold notation otherwise.

OK, so now we know that there is a vector, but we still don't know what a vector **IS**.

Definition: A vector is an object that has both a magnitude and a direction.

We will now look at these two concepts.

### 1) The magnitude

**The magnitude or length of a vector $\mathbf{x}$ is written $\|\mathbf{x}\|$ and is called its norm.**
For our vector $\overrightarrow{OA}$, $\|OA\|$ is the length of the segment $OA$.

From Figure 3 we can easily calculate the distance OA using Pythagoras' theorem. Writing $(x_1, x_2)$ for the coordinates of $A$:

$$\|OA\| = \sqrt{x_1^2 + x_2^2}$$
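As a quick numerical check, here is a minimal Python sketch of the norm computed with Pythagoras' theorem. The helper name `norm` and the sample point $A(3, 4)$ are assumptions chosen for this illustration; they are not read from Figure 3.

```python
import math

def norm(v):
    """Euclidean norm (length) of a vector given as a sequence of coordinates."""
    return math.sqrt(sum(c * c for c in v))

# Hypothetical point A(3, 4): the vector OA has the same coordinates as A.
oa = (3.0, 4.0)
length_oa = norm(oa)
print(length_oa)  # 5.0
```

The same function works unchanged in more than two dimensions, which will be convenient later.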

### 2) The direction

The direction is the second component of a vector.

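Definition: The **direction** of a vector $\mathbf{u}(u_1, u_2)$ is the vector $\mathbf{w}\left(\frac{u_1}{\|\mathbf{u}\|}, \frac{u_2}{\|\mathbf{u}\|}\right)$.

Where do the coordinates of $\mathbf{w}$ come from?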

#### Understanding the definition

To find the direction of a vector, we need to use its angles.

Figure 4 displays the vector $\mathbf{u}(u_1, u_2)$ with its angles $\theta$ and $\alpha$.

We could say that :

*Naive definition 1: The direction of the vector $\mathbf{u}$ is defined by the angle $\theta$ with respect to the horizontal axis, and by the angle $\alpha$ with respect to the vertical axis.*

This is tedious. Instead of that we will use the cosine of the angles.

In a right triangle, the cosine of an angle $\beta$ is defined by:

$$\cos(\beta) = \frac{\text{adjacent}}{\text{hypotenuse}}$$

In Figure 4 we can see that we can form two right triangles, and in both cases the adjacent side will be on one of the axes. This means that the definition of the cosine implicitly contains the axis related to an angle. We can rephrase our naive definition to:

*Naive definition 2: The direction of the vector $\mathbf{u}$ is defined by the cosine of the angle $\theta$ and the cosine of the angle $\alpha$.*

Now if we look at their values:

$$\cos(\theta) = \frac{u_1}{\|\mathbf{u}\|} \qquad \cos(\alpha) = \frac{u_2}{\|\mathbf{u}\|}$$

Hence the original definition of the vector $\mathbf{w}$. That's why its coordinates are also called *direction cosines*.

#### Computing the direction vector

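We will now compute the direction of the vector $\mathbf{u}$ from Figure 4. We divide each coordinate by the norm:

$\cos(\theta) = \frac{u_1}{\|\mathbf{u}\|}$ and $\cos(\alpha) = \frac{u_2}{\|\mathbf{u}\|}$

The direction of $\mathbf{u}$ is the vector $\mathbf{w}(\cos(\theta), \cos(\alpha))$.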

If we draw this vector we get Figure 5:

We can see that $\mathbf{w}$ indeed looks the same as $\mathbf{u}$, except that it is smaller. Something interesting about direction vectors like $\mathbf{w}$ is that their norm is equal to 1. That's why we often call them **unit vectors**.
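To make this concrete, here is a small Python sketch that computes a direction vector and checks that its norm is 1. The function names `norm` and `direction` and the example vector $(3, 4)$ are assumptions made for this illustration.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))

def direction(v):
    """Direction of v: each coordinate divided by the norm of v."""
    n = norm(v)
    if n == 0:
        raise ValueError("the zero vector has no direction")
    return tuple(c / n for c in v)

u = (3.0, 4.0)       # assumed example vector
w = direction(u)     # its coordinates are the direction cosines
print(w)             # (0.6, 0.8)
print(norm(w))       # approximately 1, so w is a unit vector
```

Note that the zero vector is rejected explicitly, because dividing by a norm of 0 is not defined.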

## The sum of two vectors

Given two vectors $\mathbf{u}(u_1, u_2)$ and $\mathbf{v}(v_1, v_2)$, then:

$$\mathbf{u} + \mathbf{v} = (u_1 + v_1, u_2 + v_2)$$

This means that adding two vectors gives us **a third vector** whose coordinates are the sums of the coordinates of the original vectors.

You can convince yourself with the example below:
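The same check can be done numerically. Below is a minimal Python sketch of the coordinate-wise sum; the helper name `add` and the two sample vectors are assumptions chosen for this example.

```python
def add(u, v):
    """Coordinate-wise sum of two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return tuple(a + b for a, b in zip(u, v))

u = (3, 7)   # assumed example vectors
v = (4, 2)
total = add(u, v)
print(total)  # (7, 9)
```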

## The difference between two vectors

The difference works the same way:

$$\mathbf{u} - \mathbf{v} = (u_1 - v_1, u_2 - v_2)$$

Since subtraction is not commutative, we can also consider the other case:

$$\mathbf{v} - \mathbf{u} = (v_1 - u_1, v_2 - u_2)$$
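Here is a short Python sketch showing that the order matters for the difference. The helper name `subtract` and the sample vectors are assumptions made for this illustration.

```python
def subtract(u, v):
    """Coordinate-wise difference u - v of two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return tuple(a - b for a, b in zip(u, v))

u = (3, 7)   # assumed example vectors
v = (4, 2)
u_minus_v = subtract(u, v)
v_minus_u = subtract(v, u)
print(u_minus_v)  # (-1, 5)
print(v_minus_u)  # (1, -5)
```

The two results have the same magnitude but point in opposite directions.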

The last two pictures describe the "*true*" vectors generated by the difference of $\mathbf{u}$ and $\mathbf{v}$.

However, since a vector has a magnitude and a direction, we often consider that parallel translates of a given vector (vectors with the same magnitude and direction but with a different origin) are the same vector, just drawn in a different place in space.

So don't be surprised if you meet difference vectors drawn somewhere other than at the origin.

If you do the math, it looks wrong, because the end of the vector is not at the right point, but it is a convenient way of thinking about vectors which you'll encounter often.

## The dot product

One **very** important notion to understand SVM is the dot product.

Definition: Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.

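This means that if we have two vectors $\mathbf{x}$ and $\mathbf{y}$ and there is an angle $\theta$ (theta) between them, their dot product is:

$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta)$$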

**Why ?**

To understand let's look at the problem geometrically.

The definition talks about $\cos(\theta)$, so let's see what it is.

By definition we know that in a right-angled triangle:

$$\cos(\theta) = \frac{\text{adjacent}}{\text{hypotenuse}}$$

In our example, we don't have a right-angled triangle.

However, if we take a different look at Figure 12, we can find two right-angled triangles formed by each vector with the horizontal axis.

So now we can view our original diagram like this:

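We can see that, if $\beta$ is the angle between $\mathbf{x}(x_1, x_2)$ and the horizontal axis and $\alpha$ is the angle between $\mathbf{y}(y_1, y_2)$ and the horizontal axis, then:

$$\theta = \beta - \alpha$$

So computing $\cos(\theta)$ is like computing $\cos(\beta - \alpha)$.

There is a special formula called the *difference identity for cosine* which says that:

$$\cos(\beta - \alpha) = \cos(\beta)\cos(\alpha) + \sin(\beta)\sin(\alpha)$$

(if you want you can read the demonstration here)

Let's use this formula! In the two right-angled triangles we have:

$$\cos(\beta) = \frac{x_1}{\|\mathbf{x}\|} \qquad \sin(\beta) = \frac{x_2}{\|\mathbf{x}\|} \qquad \cos(\alpha) = \frac{y_1}{\|\mathbf{y}\|} \qquad \sin(\alpha) = \frac{y_2}{\|\mathbf{y}\|}$$

So if we replace each term:

$$\cos(\theta) = \frac{x_1}{\|\mathbf{x}\|}\frac{y_1}{\|\mathbf{y}\|} + \frac{x_2}{\|\mathbf{x}\|}\frac{y_2}{\|\mathbf{y}\|} = \frac{x_1 y_1 + x_2 y_2}{\|\mathbf{x}\| \|\mathbf{y}\|}$$

If we multiply both sides by $\|\mathbf{x}\| \|\mathbf{y}\|$ we get:

$$\|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta) = x_1 y_1 + x_2 y_2$$

Which is the same as:

$$\|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta) = \mathbf{x} \cdot \mathbf{y}$$

**We just found the geometric definition of the dot product!**

Finally, from the last two equations we can see that:

$$\mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 = \sum_{i=1}^{2} x_i y_i$$

**This is the algebraic definition of the dot product!**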
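If you want to convince yourself that the two definitions agree, here is a minimal Python sketch that computes the dot product both ways. The helper names `dot` and `norm` and the sample vectors are assumptions made for this example; the angles are measured from the horizontal axis with `math.atan2`.

```python
import math

def dot(x, y):
    """Algebraic dot product: sum of the products of the coordinates."""
    return sum(a * b for a, b in zip(x, y))

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(v, v))

x = (3.0, 5.0)   # assumed example vectors
y = (8.0, 2.0)

# Angle of each vector with respect to the horizontal axis.
beta = math.atan2(x[1], x[0])
alpha = math.atan2(y[1], y[0])
theta = beta - alpha          # angle between x and y

algebraic = dot(x, y)                               # 3*8 + 5*2
geometric = norm(x) * norm(y) * math.cos(theta)     # ||x|| ||y|| cos(theta)
print(algebraic)  # 34.0
print(geometric)  # approximately the same value, up to rounding
```

In practice we always use the algebraic form, because it does not require knowing the angle.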

### A few words on notation

The dot product is called like that because we write a dot between the two vectors.

Talking about the dot product $\mathbf{x} \cdot \mathbf{y}$ is the same as talking about:

- the **inner product** (in linear algebra)
- the **scalar product**, because we take the product of two vectors and it returns a scalar (a real number)

## The orthogonal projection of a vector

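Given two vectors $\mathbf{x}$ and $\mathbf{y}$, we would like to find the orthogonal projection of $\mathbf{x}$ onto $\mathbf{y}$.

To do this we project the vector $\mathbf{x}$ onto $\mathbf{y}$.

This gives us the vector $\mathbf{z}$.

By definition:

$$\cos(\theta) = \frac{\|\mathbf{z}\|}{\|\mathbf{x}\|}$$

$$\|\mathbf{z}\| = \|\mathbf{x}\| \cos(\theta)$$

We saw in the section about the dot product that:

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$$

So we replace $\cos(\theta)$ in our equation:

$$\|\mathbf{z}\| = \|\mathbf{x}\| \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{y}\|}$$

If we define the vector $\mathbf{u}$ as the **direction** of $\mathbf{y}$ then:

$$\mathbf{u} = \frac{\mathbf{y}}{\|\mathbf{y}\|}$$

and

$$\|\mathbf{z}\| = \mathbf{u} \cdot \mathbf{x}$$

We now have a simple way to compute the norm of the vector $\mathbf{z}$.

Since this vector is in the same direction as $\mathbf{y}$, it has the direction $\mathbf{u}$:

$$\mathbf{u} = \frac{\mathbf{z}}{\|\mathbf{z}\|}$$

$$\mathbf{z} = \|\mathbf{z}\| \mathbf{u}$$

And we can say:

The vector $\mathbf{z} = (\mathbf{u} \cdot \mathbf{x}) \mathbf{u}$ is the orthogonal projection of $\mathbf{x}$ onto $\mathbf{y}$.

Why are we interested in the orthogonal projection? Well, in our example, it allows us to compute the distance between $\mathbf{x}$ and the line which goes through $\mathbf{y}$.

We see that this distance is $\|\mathbf{x} - \mathbf{z}\|$.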
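The projection steps can be sketched in a few lines of Python. The function name `project` and the two sample vectors are assumptions chosen for this illustration; the code simply takes the direction of the second vector and scales it by the dot product with the first.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(v):
    return math.sqrt(dot(v, v))

def project(x, y):
    """Orthogonal projection of x onto y, computed as (u . x) u with u = y / ||y||."""
    n = norm(y)
    if n == 0:
        raise ValueError("cannot project onto the zero vector")
    u = tuple(c / n for c in y)      # direction of y
    scale = dot(u, x)                # signed length of the projection
    return tuple(scale * c for c in u)

x = (3.0, 5.0)   # assumed example vectors
y = (8.0, 2.0)
z = project(x, y)                                  # close to (4, 1)
residual = tuple(a - b for a, b in zip(x, z))      # x - z, close to (-1, 4)
distance = norm(residual)                          # close to sqrt(17)
print(z, distance)
```

The vector `residual` is perpendicular to `y`, which is exactly what makes the projection "orthogonal".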

## The SVM hyperplane

### Understanding the equation of the hyperplane

You probably learnt that an equation of a line is: $y = ax + b$. However, when reading about hyperplanes, you will often find that the equation of a hyperplane is defined by:

$$\mathbf{w}^T \mathbf{x} = 0$$

How do these two forms relate?

In the hyperplane equation you can see that the names of the variables are in bold, which means that they are vectors! Moreover, $\mathbf{w}^T \mathbf{x}$ is how we compute the inner product of two vectors, and if you recall, the inner product is just another name for the dot product!

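Note that

$$y = ax + b$$

is the same thing as

$$y - ax - b = 0$$

Given two column vectors $\mathbf{w} = \begin{pmatrix} -b \\ -a \\ 1 \end{pmatrix}$ and $\mathbf{x} = \begin{pmatrix} 1 \\ x \\ y \end{pmatrix}$:

$$\mathbf{w}^T \mathbf{x} = -b \times 1 + (-a) \times x + 1 \times y = y - ax - b$$

The two equations are just different ways of expressing the same thing.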
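Here is a small Python sketch that checks this equivalence numerically. The slope `a`, the intercept `b` and the test points are values I made up for the example; the helper `hyperplane_value` is also a hypothetical name.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

a, b = 0.5, 2.0            # assumed slope and intercept of the line y = ax + b
w = (-b, -a, 1.0)          # the vector w(-b, -a, 1)

def hyperplane_value(px, py):
    """Value of w^T x for the augmented vector x(1, px, py)."""
    return dot(w, (1.0, px, py))

# A point on the line: y = 0.5 * 4 + 2 = 4
on_line = hyperplane_value(4.0, 4.0)
# A point above the line
off_line = hyperplane_value(4.0, 6.0)
print(on_line)   # 0.0
print(off_line)  # 2.0, which is y - ax - b = 6 - 2 - 2
```

Points on the line give exactly 0, and any other point gives the value of $y - ax - b$.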

It is interesting to note that $w_0$ is $-b$, which means that this value determines the intersection of the line with the vertical axis.

Why do we use the hyperplane equation $\mathbf{w}^T \mathbf{x}$ instead of $y = ax + b$?

For two reasons:

- it is easier to work in more than two dimensions with this notation,
- the vector $\mathbf{w}$ will always be normal to the hyperplane

And this last property will come in handy to compute the distance from a point to the hyperplane.

### Compute the distance from a point to the hyperplane

In Figure 20 we have a hyperplane, which separates two groups of data.

To simplify this example, we have set $w_0 = 0$.

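As you can see in Figure 20, the hyperplane passes through the origin, so its equation has the form:

$$w_1 x_1 + w_2 x_2 = 0$$

which is equivalent to

$$\mathbf{w}^T \mathbf{x} = 0$$

with $\mathbf{w}(w_1, w_2)$ and $\mathbf{x}(x_1, x_2)$.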

Note that the vector $\mathbf{w}$ is shown in Figure 20 ($\mathbf{w}$ is not a data point).

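We would like to compute the distance between the point $A$ and the hyperplane.

This is the distance between $A$ and its projection onto the hyperplane.

We can view the point $A$ as a vector from the origin to $A$.

If we project it onto the normal vector $\mathbf{w}$

We get the vector $\mathbf{p}$

Our goal is to find the distance between the point $A$ and the hyperplane.

We can see in Figure 23 that this distance is the same thing as $\|\mathbf{p}\|$.

Let's compute this value.

We start with two vectors, $\mathbf{w}$ which is normal to the hyperplane, and $\mathbf{a}$ which is the vector between the origin and $A$.

Let the vector $\mathbf{u}$ be the direction of $\mathbf{w}$:

$$\mathbf{u} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$$

$\mathbf{p}$ is the orthogonal projection of $\mathbf{a}$ onto $\mathbf{w}$ so:

$$\mathbf{p} = (\mathbf{u} \cdot \mathbf{a}) \mathbf{u}$$

$$\|\mathbf{p}\| = |\mathbf{u} \cdot \mathbf{a}| = \frac{|\mathbf{w} \cdot \mathbf{a}|}{\|\mathbf{w}\|}$$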
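As an illustration, here is a minimal Python sketch of this distance computation for a hyperplane through the origin. The normal vector $(2, 1)$ and the point $A(3, 4)$ are sample values I assume for the example (they are not read from Figure 20), and `distance_to_hyperplane` is a hypothetical helper name.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(v):
    return math.sqrt(dot(v, v))

def distance_to_hyperplane(w, a):
    """Distance from point a to the hyperplane w . x = 0 (through the origin).

    Projects a onto the normal vector w and returns the projection and its norm.
    """
    n = norm(w)
    if n == 0:
        raise ValueError("the normal vector must not be the zero vector")
    u = tuple(c / n for c in w)          # direction of w
    scale = dot(u, a)
    p = tuple(scale * c for c in u)      # orthogonal projection of a onto w
    return p, norm(p)

w = (2.0, 1.0)   # assumed normal vector
a = (3.0, 4.0)   # assumed point A
p, distance = distance_to_hyperplane(w, a)
print(p)          # close to (4, 2)
print(distance)   # close to 2 * sqrt(5)
```

The result agrees with the closed form $|\mathbf{w} \cdot \mathbf{a}| / \|\mathbf{w}\|$, which avoids building the vector `p` when only the distance is needed.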

## Compute the margin of the hyperplane

Now that we have the distance $\|\mathbf{p}\|$ between $A$ and the hyperplane, the margin is defined by:

$$\text{margin} = 2\|\mathbf{p}\|$$

We did it ! We computed the margin of the hyperplane !
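To close the loop, here is a short Python sketch of the margin under the simplification used in this article, namely that the margin is twice the distance between the point $A$ and a hyperplane through the origin. The values of `w` and `a` and the helper name `margin` are assumptions made for this example.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def margin(w, a):
    """Twice the distance from point a to the hyperplane w . x = 0."""
    w_norm = math.sqrt(dot(w, w))
    if w_norm == 0:
        raise ValueError("the normal vector must not be the zero vector")
    return 2.0 * abs(dot(w, a)) / w_norm

w = (2.0, 1.0)   # assumed normal vector
a = (3.0, 4.0)   # assumed point A
m = margin(w, a)
print(m)         # close to 4 * sqrt(5)
```

This only makes sense when $A$ is the data point closest to the hyperplane; Part 3 treats the margin more carefully.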

## Conclusion

This ends Part 2 of this tutorial about the math behind SVM.

There was a lot more math, but I hope you were able to follow the article without any problems.

### What's next?

Now that we know how to compute the margin, we might want to know how to select the best hyperplane. This is described in Part 3 of the tutorial: How to find the optimal hyperplane?

## Comments

**Oleg Prutz:** Are you planning to cover support vectors, non-linear kernels and optimization (I mean finding the minimum of the distance from the hyperplane to the support vectors) in this tutorial? It seems that one needs to know optimization theory in depth to understand this algorithm. It would be nice to see a simple explanation of what the algorithm is actually doing.

**Alexandre KOWALCZYK (post author):** Yes, that is what I am planning to do. However, optimization theory is indeed very important to understand the algorithm, and I am still figuring out how to explain SVM without going too deep into details.

**Franck Berthuit:** Very clear article, Alexandre... and enjoyable for a poor mathematician like me.

I'm eager to read the next one.

Bye

**thewizardofme:** Marvellous article+lecture! Thanks for making it so clear.

**thewizardofme:** Can you please explain non-linear SVMs and kernels in your following articles?

**Alexandre KOWALCZYK (post author):** Thanks for your kind comment. I need to find more time to write new articles. 🙂

**Shivani Bhardwaj:** I have been trying to understand SVM for a very long time. Your blog really helped me a lot and now I know what I am dealing with. Your tutorial not only helped me understand the mathematical jargon but also gave me a clear perspective on what I am doing.

Thanks a lot!!

**Alexandre KOWALCZYK (post author):** I am glad to hear it helped you. Thanks 🙂

**Shyam:** Very lucid explanation - looking forward to part 3. When's that coming out?

**Alexandre KOWALCZYK (post author):** Thanks for the comment, Shyam. I am afraid that recently I have spent most of my time on Kaggle competitions and playing with convolutional neural networks. I will try to write the next part in the coming weeks in order to finish this work.

**Nasser Tamim:** Thank you very much for your wonderful explanation,

and I am waiting patiently for lectures on deep learning, especially convolutional neural networks.

**korawit:** This is the best tutorial ever!

**Kunal:** Of all the links I found while doing a Google search on SVM, this is by far the best one in terms of the simplicity of the language in which it is explained... Thanks Alex

**ajay:** Very nice explanation. Where can I get part 3?

**Alexandre KOWALCZYK (post author):** I am currently writing it. It is coming soon. 🙂

**dragon518:** This is the best blog about SVM I have ever seen. It helped me so much, thank you very much. I look forward to an excellent part 3. BTW, "To simplify this example, we have set $w_0 = 0$": do you mean setting the start point of the vector at the origin?

**Alexandre KOWALCZYK (post author):** Thanks for your kind comment. No, this does not mean setting the start point of the vector at the origin. We could place it somewhere else because we often consider that the parallel translate of a given vector is the same vector (this is illustrated in the section about the difference of two vectors). In the definition of the equation of a hyperplane, the vector $\mathbf{w}$ is a 3-dimensional vector $\mathbf{w}(-b, -a, 1)$. By setting $w_0$ to 0 we can do the remaining calculations with a 2-dimensional vector. Because the definition says that $w_0 = -b$ and we use $w_0 = 0$ instead, it removes the intercept term from the equation. As a result the hyperplane passes through the origin. In Part 3 I wrote in more detail about the hyperplane equation, so things should be easier to understand.

**Tam Dc:** Very helpful! Thank you. Hoping for part 3.

**Farzad:** Thank you!

**Goutham:** Nice explanation. Waiting for future posts 🙂

**Alexandre KOWALCZYK (post author):** Thanks. Part 3 is now online. (I added the link at the end of the article)

**David:** Thanks a lot, Alexandre!

**Gabriel B. Théberge:** That is effectively crystal clear! I have read a lot of papers on this topic but nothing was as clear and accessible as your presentation, Alexandre!

**Felicia:** This is the most useful blog about SVM I've seen so far, especially for people like me who don't have much knowledge of linear algebra.

A dumb question: why is the direction of $\mathbf{w}$ perpendicular to the hyperplane?

**Alexandre KOWALCZYK (post author):** Thanks Felicia 🙂 For your question, you can see my answer to the same question in the comments of Part 3.

**sidra:** It is the best lecture article...

I have read many lectures but now the concept is clear...

**Ankit PK:** Awesome Alexandre... Thanks for this nice series... it makes SVM a lot easier.

**Subha MG:** Hi Alexandre.. Your blog is simply superb! The way you've explained concepts!! I saw several videos on SVMs.. but I didn't get a clear picture.. Your articles have made it super-clear!! Super-like!!

**MIng:** Really helpful, fantastic.

**Md. Asadur Rahman:** No words to thank you, brother! I was very worried and eager to learn about SVM. You have solved my problem. Be blessed by the Almighty.

**Sameer Panna:** Great blog.. and excuse my ignorance, but can you please explain how one arrives at w(−b,−a,1) and x(1,x,y)?

**Alexandre KOWALCZYK (post author):** Thank you. You find these two vectors by continuing the reasoning.

We want to express the equation y−ax−b=0 with a dot product between two vectors.

The dot product is the sum of several products. In our case there are two minus signs, so there are three elements being summed together. Our vectors will have three elements each. Then we transform the equation to display these products: y−ax−b=0 is equivalent to y*1−a*x−b*1=0, and then we transform the differences into sums: y*1+(−a)*x+(−b)*1=0

**Sameer Panna:** Thank you 🙂

**elmustafa1:** This is extremely helpful. Thank you so much.

**Koushik:** Wonderful post... it gives me a clear understanding even of some linear algebra concepts. Thanks... keep up the good work... 🙂

**adhi21:** I understand now why the equation is w^(T) . x + b. I have another question: what does "T" mean in that equation?

Thank you

**Alexandre KOWALCZYK (post author):** It is the transpose of a matrix. (It transforms a column vector into a row vector, for instance)

**omar:** Thank you very much for that useful tutorial. Can you write an article about using SVM with non-linear datasets?

**Alexandre KOWALCZYK (post author):** Thanks for the suggestion. I will try to finish this tutorial series first. 😉

**sawi:** Thanks for a very useful article, explaining every tiny detail of the calculation with simple language and figures.

**chadaphone:** It is very helpful. Thank you very much. Great job!!

**chansungpark:** Thank you for the very explicit explanation.

I have a question, since I have no background knowledge about cosine, sine, etc.

Shouldn't the direction of a vector be just the angle of the triangle? I am just curious what the formula cos(β) = adjacent/hypotenuse fundamentally means.

**Alexandre KOWALCZYK (post author):** No, because you can have another vector with the same angle between the axis and itself but pointing in another direction. By using the cosine, we use the lengths of the adjacent side and the hypotenuse, and as we are using coordinates we obtain a vector pointing in the same direction.

**Ravi:** How can we show that the vector w will always be normal to the hyperplane?

**Alexandre KOWALCZYK (post author):** This is by definition of the hyperplane. You can read more about it here.

**Farah:** Can you please tell why you multiplied the distance between A and the hyperplane by 2 to compute the margin?

**Alexandre KOWALCZYK (post author):** This was a simplification. The margin is explained in more detail in Part 3.

**Brandon:** Thanks for your blog, it is great. However, I have a question about W(-b,-a,1) and X(1,x,y):

the transpose of W is a column vector and X is a row vector, so the result of [column * row] is a matrix of size (3,3). Can you tell me what I am missing?

**Alexandre KOWALCZYK (post author):** Good catch. Indeed both w and x need to be column vectors so that the transpose of w is a row and we do [row * column]. I updated the article. Thanks!

**Shawn:** So far the best explanation of SVM on the net for those who do not have the required math background. Fantastic job! Please post an explanation of non-linear SVMs and kernels. Thanks a lot!

**Ashutosh Srivastava:** The nicest and most crystal-clear explanation I have ever found on the internet.

It would be very helpful if you gave some practical demonstration of how SVM and other learning algorithms can be implemented and interpreted on various platforms like weka and orange. What are the confusion matrix and ROI? How is the Wt-b equation generated, etc.?

Giving a practical demonstration would be very helpful.

Thanks and Regards...

**Everaldo Aguiar:** Phenomenal explanation of SVMs. Thanks a lot for taking the time to write and publish this. I was wondering if you would mind if I used brief excerpts of your content. I am preparing a few slides for a course that I will be teaching and found some of your images and explanations very helpful. I'll be sure to include citations and a reference to your blog posts.

**Alexandre KOWALCZYK (post author):** Thanks for your comment. No problem, you can use some excerpts. For which course is it?

**Farai Leboho:** Speechless, this is downright simple to understand. This makes SVM move from very hard to simply understandable, thanks a lot mate. At least now I have an idea of what's happening behind the scenes of svm.SVC().fit().

Great work.

**rssoni:** The best explanation I can think of. It made the concept clear. The writing style is lucid and understandable. Nothing else can be easier than this explanation of SVM. Awesome, thanks Alex

**A Logical Geek:** You are amazing. Thanks a lot for this.. Because I lack enough maths background, I was having difficulty getting here. You helped a lot. Would it be possible for you to explain the justification of Lagrange multipliers as well as give a further explanation of SVM?

**leviliard:** Thank you very much. You are a great teacher.

**Ganecian:** I'm still confused about determining the normal vector w. If the equation of the hyperplane is x2 = 1/3 * x1 + 1, what is the normal vector w? How do we calculate it when w0 is not zero? Thanks

**Alexandre KOWALCZYK (post author):** Your example, x2 = 1/3 * x1 + 1, is in the form x2 = a * x1 + b. To get the normal vector you just take the vector w(a,-1). So in this case, we define w(1/3,-1), x(x1,x2) and b = 1. And you can see that wx+b=0 is equivalent to x2 = 1/3 * x1 + 1.

Now if we plot the vector w(1/3,-1), we can start to draw it where we want. I could start drawing at the origin x(0,0) or I can start drawing it directly on the hyperplane. I choose to do so, and I start drawing it at x(1,1+1/3).

As you can see in the figure: it is normal to the hyperplane.

Where I chose to start drawing it does not change the fact that it is normal to the hyperplane.

**Beungeut Boloho:** Hi, can you explain what the bias b is visually? Is it the distance of the vertical axis to the origin or the distance of the hyperplane to the origin?

**Alexandre KOWALCZYK (post author):** Given a hyperplane having the equation wx+b=0 with vectors w(w0,w1) and x(x0,x1), b is the distance between the vertical axis and the origin only when the value w1 of the weight vector is equal to -1. Indeed, when we transform this hyperplane equation to a line equation of the form y=ax+c we get a = -w0/w1 and c = -b/w1. Some books represent b as being the distance between the origin and the hyperplane, but I think this is true only under certain conditions; at least that is what I found when trying to verify it by myself using the first formula of this article and formulas from this page.

**Srinath Shiv Kumar:** Very clearly explained. Thanks a lot!

**Bahareh Moradi:** Thank you a thousand times... You explained Lagrange multipliers in the best way in the world...

Can you recommend some useful books which I can read to get more information about classification?

**Sara Q. Abedulridha:** Many thanks brother, that is great...

wishing you all good things, God bless you...

**Jeetendra Ahuja:** Awesome tutorial, TAL man!

Just a small suggestion: when you give a link like the ones for "cumulative" or "dot product", can you change the code of your site so that clicking on the link opens it in a different tab instead of the current tab?

**Alexandre KOWALCZYK (post author):** Hello. Thank you for your comment. I thought all my links were opening in a new tab but indeed it was not the case. I updated all the problematic links in this article. Thanks for the remark!

**Sonia:** Sir,

$\mathbf{x}_+ \cdot \mathbf{w} + b = +1$

$\mathbf{x}_- \cdot \mathbf{w} + b = -1$

Why is it always equal to +1 and -1 for the positive and negative support vectors respectively? How do we normalize the distance from the hyperplane to the support vectors? Once we normalize, does it always remain the same for any kind of data? Could you explain? I am not clear about the distance between the hyperplane and the support vectors.

**Alexandre KOWALCZYK (post author):** It is always equal to +1 and -1 because we are free to select a w for which this is the case (we can rescale w and b and keep the same hyperplane). So we decide arbitrarily to select among the ones for which it is equal to +1 and -1, because it will make the following computation easier.

**tkotresh:** Thank you for finding time to write this article. I am back to basics and enjoying it as well.

**AnasA. Hadi:** Thanks for this illustration of SVM..