# SVM - Understanding the math - Part 1 - The margin

## Introduction

This is the first article in a series I will be writing about the math behind SVM. There is a lot to talk about, and a fair amount of mathematical background is often necessary. However, I will try to keep a slow pace and give in-depth explanations, so that everything is crystal clear, even for beginners.

### SVM - Understanding the math

• Part 1: What is the goal of the Support Vector Machine (SVM)?
• Part 2: How to compute the margin?
• Part 3: How to find the optimal hyperplane?
• Part 4: Unconstrained minimization
• Part 5: Convex functions
• Part 6: Duality and Lagrange multipliers

## What is the goal of the Support Vector Machine (SVM)?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.

The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm.

It is also important to know that SVM is a classification algorithm, which means we will use it to predict whether something belongs to a particular class.

For instance, we can have the training data below:

Figure 1

We have plotted the height and weight of several people, and the plot also distinguishes between men and women.

With such data, using an SVM will allow us to answer the following question:

Given a particular data point (weight and height), is the person a man or a woman?

For instance: if someone is 175 cm tall and weighs 80 kg, is that person a man or a woman?

## What is a separating hyperplane?

Just by looking at the plot, we can see that it is possible to separate the data. For instance, we could draw a line such that all the data points representing men are above the line and all the data points representing women are below it.

Such a line is called a separating hyperplane and is depicted below:
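To make the idea concrete, here is a minimal sketch in plain Python. It describes a line by a weight vector `w` and an offset `b` (the line is the set of points where `w . x + b = 0`) and checks on which side each point falls. The data values, the hand-picked line and the function names are all assumptions made up for illustration; they are not the data from the figure, and the line is not learned by an SVM.

```python
# Toy training data as (height in cm, weight in kg). These numbers are made up.
men = [(180, 80), (175, 78), (185, 90), (172, 76)]
women = [(160, 55), (165, 60), (158, 50), (170, 65)]


def decision_value(w, b, x):
    """Return w . x + b: positive on one side of the line, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def separates(w, b, positives, negatives):
    """True if all positives are strictly on the positive side
    and all negatives are strictly on the negative side."""
    return (all(decision_value(w, b, x) > 0 for x in positives)
            and all(decision_value(w, b, x) < 0 for x in negatives))


# Hand-picked candidate line: height + weight - 245 = 0 (an assumption, not learned).
w, b = (1.0, 1.0), -245.0

print(separates(w, b, men, women))  # True: this line separates the toy data

# Classify a new person by looking at the side of the line they fall on.
new_person = (175, 80)
print("man" if decision_value(w, b, new_person) > 0 else "woman")  # man
```

With this particular line, the person measuring 175 cm and weighing 80 kg lands on the same side as the men in the toy data. A different separating line could of course give a different answer for points close to the boundary.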

### If it is just a line, why do we call it a hyperplane?

Even though we use a very simple example with data points lying in $\mathbb{R}^2$, the support vector machine can work with any number of dimensions!

A hyperplane is a generalization of a plane.

• in one dimension, a hyperplane is called a point
• in two dimensions, it is a line
• in three dimensions, it is a plane
• in more dimensions you can call it a hyperplane

The point L is a separating hyperplane in one dimension
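The generalization above can be sketched in a few lines of Python: the same test, the sign of `w . x + b`, tells us on which side of the hyperplane a point lies, whatever the number of dimensions. The function name `side` and the example hyperplanes are assumptions chosen for illustration.

```python
def side(w, b, x):
    """Return +1, -1 or 0 depending on which side of the
    hyperplane w . x + b = 0 the point x lies on (0 means on it)."""
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (value > 0) - (value < 0)


# One dimension: the hyperplane is the point x = 3.
print(side((1.0,), -3.0, (5.0,)))  # 1  (to the right of the point)
print(side((1.0,), -3.0, (1.0,)))  # -1 (to the left of the point)

# Two dimensions: the hyperplane is the line x + y = 4.
print(side((1.0, 1.0), -4.0, (1.0, 1.0)))  # -1

# Three dimensions: the hyperplane is the plane z = 2.
print(side((0.0, 0.0, 1.0), -2.0, (7.0, -1.0, 5.0)))  # 1
```

Only the length of `w` and `x` changes between the three cases; the code itself does not care about the dimension.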

## What is the optimal separating hyperplane?

The fact that you can find a separating hyperplane does not mean it is the best one! In the example below there are several separating hyperplanes. Each of them is valid, as it successfully separates our data set with men on one side and women on the other side.

There can be a lot of separating hyperplanes

Suppose we select the green hyperplane and use it to classify real-life data.

This hyperplane does not generalize well

This time, it makes some mistakes, as it wrongly classifies three women. Intuitively, we can see that if we select a hyperplane which is close to the data points of one class, then it might not generalize well.

So we will try to select a hyperplane as far as possible from the data points of each category:

This one looks better. When we use it with real-life data, we can see it still classifies perfectly.

The black hyperplane classifies more accurately than the green one

That's why the objective of an SVM is to find the optimal separating hyperplane:

• because it correctly classifies the training data
• and because it is the one which will generalize better to unseen data
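The following sketch illustrates this intuition on made-up numbers. Two hyperplanes both separate a small training set perfectly, but one passes close to one class and the other stays far from both. All data points, the two hyperplanes and the function names are assumptions invented for this example; they are not taken from the figures.

```python
def decision_value(w, b, x):
    """Return w . x + b for the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def count_errors(w, b, positives, negatives):
    """Count the points that do not fall strictly on their own side."""
    wrong_pos = sum(1 for x in positives if decision_value(w, b, x) <= 0)
    wrong_neg = sum(1 for x in negatives if decision_value(w, b, x) >= 0)
    return wrong_pos + wrong_neg


# Made-up training data and made-up "unseen" data for two classes.
train_pos = [(4, 4), (5, 3), (6, 5)]
train_neg = [(1, 1), (2, 0), (0, 2)]
unseen_pos = [(4, 3), (3.5, 3), (5, 2)]
unseen_neg = [(2, 2), (1, 2)]

# Two parallel candidate hyperplanes sharing the same w (hand-picked).
w = (3.0, 4.0)
close_b = -25.0  # passes close to the positive training points
far_b = -17.5    # stays far from both classes

for name, b in (("close", close_b), ("far", far_b)):
    print(name,
          count_errors(w, b, train_pos, train_neg),
          count_errors(w, b, unseen_pos, unseen_neg))
# close 0 3
# far 0 0
```

Both hyperplanes make no mistake on the training data, but the one hugging the positive class misclassifies three of the unseen points, while the one that keeps its distance from both classes still classifies them all correctly.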

## What is the margin and how does it help in choosing the optimal hyperplane?

The margin of our optimal hyperplane

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the margin.
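As a rough sketch of this computation, the snippet below uses the standard formula for the distance between a point `x` and the hyperplane `w . x + b = 0`, namely `|w . x + b| / ||w||`. That formula is not introduced in this article, so treat it as an assumption here. The data points, the hyperplane and the function names are also made up for illustration.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance between point x and the hyperplane w . x + b = 0."""
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(value) / norm


def margin(w, b, points):
    """Twice the distance between the hyperplane and the closest point."""
    return 2 * min(distance_to_hyperplane(w, b, x) for x in points)


# Made-up data points from both classes.
points = [(4, 4), (5, 3), (6, 5), (1, 1), (2, 0), (0, 2)]

# Hand-picked hyperplane 3x + 4y - 17.5 = 0, assumed to separate the classes.
w, b = (3.0, 4.0), -17.5

print(round(distance_to_hyperplane(w, b, (5, 3)), 3))  # 1.9
print(round(margin(w, b, points), 3))                  # 3.8
```

Here the closest points are at distance 1.9 from the hyperplane, so the margin is 3.8. Note that this number is only meaningful as a margin when the hyperplane actually separates the two classes.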

Basically, the margin is a no man's land: there will never be any data point inside it. (Note: this can cause some problems when the data is noisy, which is why the soft margin classifier will be introduced later.)

For another hyperplane, the margin will look like this:

As you can see, Margin B is smaller than Margin A.

We can make the following observations:

• If a hyperplane is very close to a data point, its margin will be small.
• The further a hyperplane is from a data point, the larger its margin will be.

This means that the optimal hyperplane will be the one with the biggest margin.
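To illustrate the selection rule, the sketch below compares two hand-picked separating hyperplanes and keeps the one with the bigger margin. It is not how an SVM finds the optimal hyperplane (that requires solving an optimization problem over all possible hyperplanes); it only compares a couple of candidates. The data, the candidates, the distance formula `|w . x + b| / ||w||` and the function names are assumptions for this example.

```python
import math


def decision_value(w, b, x):
    """Return w . x + b for the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def separates(w, b, positives, negatives):
    """True if the hyperplane puts each class strictly on its own side."""
    return (all(decision_value(w, b, x) > 0 for x in positives)
            and all(decision_value(w, b, x) < 0 for x in negatives))


def margin(w, b, points):
    """Twice the distance between the hyperplane and the closest point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2 * min(abs(decision_value(w, b, x)) / norm for x in points)


# Made-up training data for two classes.
positives = [(4, 4), (5, 3), (6, 5)]
negatives = [(1, 1), (2, 0), (0, 2)]

# Two hand-picked candidate hyperplanes, stored as (w, b).
candidates = {
    "A": ((3.0, 4.0), -17.5),
    "B": ((3.0, 4.0), -25.0),
}

# Keep only the candidates that separate the data, then compare margins.
margins = {
    name: margin(w, b, positives + negatives)
    for name, (w, b) in candidates.items()
    if separates(w, b, positives, negatives)
}
best = max(margins, key=margins.get)
print(best)  # A
```

Both candidates separate the training data, but candidate A has a margin of about 3.8 against about 0.8 for candidate B, so A is preferred among these two.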

That is why the objective of the SVM is to find the optimal separating hyperplane which maximizes the margin of the training data.

This concludes this introductory post about the math behind SVM. There were not a lot of formulas, but in the next article we will put some numbers on this and try to get the mathematical view of it using geometry and vectors.

SVM - Understanding the math - Part 2 : Calculate the margin

## 67 thoughts on “SVM - Understanding the math - Part 1 - The margin”

1. Alexandre KOWALCZYK Post author

Thanks. I just found out my name was not visible on the site. I am passionate about machine learning and I work as a software engineer in a financial firm. 🙂 I added an author box. You can check my profile on various social media now.

Hi Alexandre,
Thanks for writing up this piece (and the following ones) on SVM. I really like the fact that you have explained all the essentials along with the SVM. It is comprehensive and complete. I could not have wished for a better write-up on the subject.

1. Jose

This is one of the best tutorials out there. Very well explained. Are you thinking of including some illustrative code in R?

Thanks!

1. Arun

Great article. Thanks for such a nice article. I have a fundamental question. The classes here are men and women, and the hyperplane separates them. If my classes are age ranges, say 1-20, 21-50 and 51-75, can we still use SVM, and how will we model the hyperplane?

2. Patrick

Hi Alexandre, I would like to use SVR in my thesis to predict a response with 3 causal factors. What is your opinion: does R do it as you showed in your example? I am pretty new to machine learning.

1. Alexandre KOWALCZYK Post author

Hello Patrick. This is a pretty broad question. R is as good as any other language for machine learning. Most often people use R, Python or MATLAB. Anyway, you can basically do SVR with any language. I don't really understand what you mean by "predicting a response with 3 causal factors". The first thing you have to consider is whether you need to do a classification or a regression. Then you can pick the appropriate algorithm (SVM for classification or SVR for regression).

3. Patrick

Thank you for the response. By causal factors I mean I have 3 independent (causal) variables and one dependent variable. So I would like to use SVR.

4. Ahila

Hi,
I am new to machine learning and classification.
Would you mind giving an example of an EEG classification tutorial?

5. Emmanuel

I have 6 classes to classify from an orthophoto, and I have generated the HSV values from the image, including the mean RGB for each color code. Given this information, how do I find the optimal hyperplane using R? Please advise. Thank you.

6. Rahul

Thank you Alexandre. This is very helpful. SVM was never so clear to me before. However, this clarity on SVM brings me to another question. How is SVM different from discriminant analysis? I understand that discriminant analysis also tries to find a discriminating line which maximizes the distance between points which belong to different categories. Does the difference lie in the way the algorithms are implemented, or is there something more to it? Are the application areas of SVM and discriminant analysis different?

1. Alexandre KOWALCZYK Post author

Hi Rahul. Thank you for your comment. I don't know a lot about LDA, but I found this Quora answer which might help you to understand the difference between SVM and LDA.

7. Abhilasha

Hi, I love your explanation. I have a silly doubt. What exactly is the R^2 that you've mentioned here?
"With data points lying in R^2..."

1. Alexandre KOWALCZYK Post author

R^2 represents the Euclidean plane. It comes from set theory. If R is the set of all real numbers, then R^2 = R × R is the Cartesian product of R with itself. That is, R × R is the set of all ordered pairs whose first coordinate is an element of R and whose second coordinate is an element of R.

8. Chungkwon Ryu

Hello! This tutorial was very good for me. So I want to introduce this tutorial to my co-workers by using some slides. Is it possible?

1. Alexandre KOWALCZYK Post author

Hello. No problem for me as long as you indicate it came from this site. 😉

9. Manny Grewal

Thank you. This is the best tutorial for a newbie. I saw videos on YouTube, but they are too advanced. Great work putting in all this effort to teach ordinary guys like us.

10. Preetham

Hello,
Good introduction into machine learning and SVM.
I had a doubt: does the hyperplane you are referring to need to be a straight line, or can it also be a curve?

1. Alexandre KOWALCZYK Post author

In two dimensions a hyperplane is a straight line. Sometimes you can see pictures with a circle, for instance, but that is just the projection into two dimensions of a hyperplane from a higher-dimensional space. In three dimensions it is a plane. In more than three dimensions you cannot visualize it.

11. inesda

Hello, your tutorial is excellent! The title of my thesis is "the contextual discovery of web services using SVM", but I have searched and never found this done with SVM. I wonder if we can implement web services on SVM.

1. Alexandre KOWALCZYK Post author

Thank you. I don't understand what you mean by "implementing web services" on svm. Sorry.

13. thesoul

Wow. Well explained. I was so confused with other sites explaining SVM when I stumbled upon your site. Thanks for the simple yet powerful explanation.

For needle trajectory detection in ultrasound, I want to estimate the needle trajectory by classifying each pixel in the image as either needle or background. If I first apply the log-Gabor filter, what should the next classifier be? Can I use SVM, and how? I use MATLAB.

thank you very much

15. kamal

Hello Alex,
Great article for beginners. I see the explanation starts by picking the support vectors up front. Basically, those vectors that fall on the decision boundaries are picked up front. As I am new to SVM, I was wondering how a machine would do this. What is the mechanism used? Is the Euclidean distance between two points (vectors) the key?

-Kamal.

1. Alexandre KOWALCZYK Post author

The support vectors are the ones for which the Lagrange multipliers are nonzero. This will be explained in the upcoming article about optimization.

16. Ayush varshney

The best article i've found on this subject.

17. Rohit Tanwar

I read so many articles on SVM but the clarity of concepts I got from here is awesome.... hope to get your guidance in this area in future too..

18. robert d

This is the best explanation of the maths behind svms i have read and i have read quite a few. Will you include kkt conditions when dealing with optimisation with inequalities ?
Thank you very much for this explanation and i look forward to the book.

19. Sanidhya Singh

Great explanation in simple way.. Thanks Alexandre. Looking forward to more of your work. Do you have any other tutorials website or blog on other topics. Or if you can mail me also the docs. It will be of great help.

1. Alexandre KOWALCZYK Post author

Thank you very much. No, I do not have other tutorials I am fully focused on SVM 🙂 Maybe later. Which docs are you talking about?

20. Ravindra M

I wish you could do tutorials for other machine learning techniques as well. Its the best tutorial available in SVM.

21. Ganecian

Nice post... But you're working with continuous dataset in your example so we can easily map it to 2 dimensional vector space. How about discrete/categorized attribute such as skin color (white, black, brown, etc), how to map it?

22. Ming

Passionate to work as a data scientist, I've been looking for tutorials giving me ideas what exactly SVM is and this page is the one, Thank you.

Nice article.
I have a question: in your example the SVM is used to classify 2 classes.
Can we use SVM to classify 3 classes?
My research topic is sentiment analysis (text classification) using SVM. I'm about to classify 3 classes (positive, negative and neutral).
Thanks in advance.

1. Alexandre KOWALCZYK Post author

Yes, SVMs can be used to classify more than two classes. There are several ways to do this: one-vs-one, one-vs-all, ... You can read this page on the subject if you are using scikit-learn. In my upcoming ebook about SVM there will be one chapter dedicated to multi-class classification, as it is a frequent question.

24. Sonu

Dear sir,
How do I calculate the weight vector w and the bias term b? I have searched a lot but did not find a clear answer on how to calculate the weight vector w and b. Could you please explain it in a simple way, as you have done for everything else?

1. Alexandre KOWALCZYK Post author

Hello Sonu. Computing w and b is done once we have solved the optimization problem and found the Lagrange multipliers. Moreover, some algorithms such as SMO compute w and b separately. I explain how to do that in detail in my upcoming ebook. As it is not published yet, I recommend reading this paper by Andrew Ng. You will find how to compute w in equation (9) and b in equation (11).