## Introduction

This is the first article in a series I will be writing about the math behind SVM. There is a lot to talk about, and a fair amount of mathematical background is often necessary. However, I will try to keep a slow pace and give in-depth explanations, so that everything is crystal clear, even for beginners.

### SVM - Understanding the math

**Part 1: What is the goal of the Support Vector Machine (SVM)?**

Part 2: How to compute the margin?

Part 3: How to find the optimal hyperplane?

Part 4: Unconstrained minimization

Part 5: Convex functions

Part 6: Duality and Lagrange multipliers

## What is the goal of the Support Vector Machine (SVM)?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.

The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm.

It is also important to know that SVM is a classification algorithm, which means we will use it to predict whether something belongs to a particular class.

For instance, we can have the training data below:

We have plotted the height and weight of several people, and there is also a way to distinguish between men and women.

With such data, using a SVM will allow us to answer the following question:

Given a particular data point (weight and height), is the person a man or a woman?

For instance: if someone measures 175 cm and weighs 80 kg, is it a man or a woman?

## What is a separating hyperplane?

Just by looking at the plot, we can see that it is possible to separate the data. For instance, we could draw a line so that all the data points representing men are above the line, and all the data points representing women are below it.

Such a line is called a **separating hyperplane** and is depicted below:

### If it is just a line, why do we call it a hyperplane?

Even though we use a very simple example with data points lying in R^2, the support vector machine can work with any number of dimensions!

**A hyperplane is a generalization of a plane**.

- in one dimension, a hyperplane is called a point
- in two dimensions, it is a line
- in three dimensions, it is a plane
- in more dimensions you can call it a hyperplane
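To make this generalization concrete, here is a minimal sketch. It assumes the usual description of a hyperplane as the set of points x satisfying w · x + b = 0; the weight vector `w` and the offset `b` are notation this article has not introduced yet, and the function name `side` and the sample numbers are made up for illustration.

```python
def side(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane
    w . x + b = 0 the point x lies, or 0 if x is on the hyperplane."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of dimensions")
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (value > 0) - (value < 0)

# One dimension: the "hyperplane" x = 2 is a single point on the number line.
print(side([1.0], -2.0, [5.0]))
# Two dimensions: the hyperplane x1 + x2 = 3 is a line.
print(side([1.0, 1.0], -3.0, [0.0, 0.0]))
# Four dimensions: the same code still works, even though we cannot draw it.
print(side([1.0, 0.0, 0.0, 0.0], 0.0, [0.0, 7.0, 7.0, 7.0]))
```

The same function handles every case in the list above because nothing in it depends on the number of coordinates.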

## What is the *optimal* separating hyperplane?

The fact that you can find a **separating hyperplane** does not mean it is the best one! In the example below there are several separating hyperplanes. Each of them is valid, as it successfully separates our data set with men on one side and women on the other side.

Suppose we select the green hyperplane and use it to classify real-life data.

This time, it makes some mistakes, as it wrongly classifies three women. Intuitively, we can see that *if we select a hyperplane which is close to the data points of one class, then it might not generalize well.*

So we will try to select a hyperplane **as far as possible from the data points of each category:**

This one looks better. When we use it with real-life data, we can see it still classifies everything correctly.

That's why the objective of an SVM is to **find the optimal separating hyperplane**:

- because it correctly classifies the training data
- and because it is the one which will generalize better with unseen data

## What is the margin and how does it help in choosing the optimal hyperplane?

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the **margin**.

**Basically the margin is a no man's land. There will never be any data point inside the margin.** (Note: this can cause some problems when the data is noisy, and this is why the soft margin classifier will be introduced later.)
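The computation described above can be sketched in a few lines. This is only an illustration under assumptions: the hyperplane is written as w · x + b = 0 (notation not yet introduced in this article), the distance from a point to it is taken as |w · x + b| / ‖w‖, and the points and the line are invented.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from the point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points):
    """Twice the distance between the hyperplane and its closest point."""
    return 2 * min(distance_to_hyperplane(w, b, p) for p in points)

# Hypothetical 2D data and a hypothetical line 3*x1 + 4*x2 - 10 = 0.
points = [(0.0, 0.0), (2.0, -1.0), (4.0, 2.0), (6.0, 3.0)]
w, b = (3.0, 4.0), -10.0

for p in points:
    print(p, distance_to_hyperplane(w, b, p))
print("margin:", margin(w, b, points))
```

Here the closest point is (2, -1), so it alone determines the margin; moving any of the other points slightly would not change the result.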

For another hyperplane, the margin will look like this:

As you can see, Margin B is smaller than Margin A.

We can make the following observations:

- If a hyperplane is very close to a data point, its margin will be small.
- The further a hyperplane is from its closest data point, the larger its margin will be.

This means that **the optimal hyperplane will be the one with the biggest margin.**

That is why the objective of the SVM **is to find the optimal separating hyperplane which maximizes the margin of the training data.**
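As a rough illustration of this objective, the sketch below compares a handful of hand-picked candidate hyperplanes and keeps the separating one with the largest margin. It is not how an SVM actually finds the optimal hyperplane (that requires solving an optimization problem, covered later in the series); the data, the candidates, and the function names are all assumptions made for this example, with women labeled -1 and men labeled +1.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_if_separating(w, b, points, labels):
    """Margin of the hyperplane if it puts every +1 point on the positive
    side and every -1 point on the negative side, otherwise None."""
    distances = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    closest = min(distances)
    return 2 * closest if closest > 0 else None

def best_hyperplane(candidates, points, labels):
    """Among the candidates, return the separating (w, b) with the largest margin."""
    scored = []
    for w, b in candidates:
        m = margin_if_separating(w, b, points, labels)
        if m is not None:
            scored.append((m, (w, b)))
    return max(scored, key=lambda item: item[0])[1] if scored else None

# Invented toy data: two points per class.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 5.0), (2.0, 5.0)]
labels = [-1, -1, 1, 1]

# Invented candidates: four horizontal lines x2 = c and one tilted line.
candidates = [
    ((0.0, 1.0), -2.0),
    ((0.0, 1.0), -3.0),
    ((0.0, 1.0), -4.0),
    ((0.0, 1.0), -6.0),  # does not separate the data
    ((1.0, 1.0), -5.0),
]

print(best_hyperplane(candidates, points, labels))
```

The line x2 = 3 wins because it sits halfway between the two classes, which is exactly the "as far as possible from each category" intuition.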

This concludes this introductory post about the math behind SVM. There were not a lot of formulas, but in the next article we will put in some numbers and try to get the mathematical view of this using geometry and vectors.

If you want to learn more, read it now:

SVM - Understanding the math - Part 2 : Calculate the margin

## Comments

David Harper CFA FRM (@bionicturtle): Great intro, who are you? Thanks!

Alexandre KOWALCZYK (post author): Thanks. I just found out my name was not visible on the site. I am passionate about machine learning and I work as a software engineer in a financial firm. 🙂 I added an author box. You can check my profile on various social media now.

Chetan Yadati: Hi Alexandre,

Thanks for writing up this piece (and the following ones) on SVM. I really like the fact that you have explained all the essentials along with the SVM. It is comprehensive and complete. I could not have wished for a better write-up on the subject.

Peristic: Hi Alexandre, thank you so much, you're awesome!!

Renny Varghese: Could you explain how SVM works for multiple classes? How would it work for 9 classes? I used a function called multisvm here: http://www.mathworks.com/matlabcentral/fileexchange/39352-multi-class-svm

but I'm not sure how it's working behind the scenes. Everything I've read online is rather confusing.

Alexandre KOWALCZYK (post author): Hello Renny, I will dedicate a full chapter of my upcoming (free) ebook to explaining multi-class SVM. If you wish, you can register at the end of this series so you will receive an email when it is available. 😉

Anmol Biswas: For N-class classification, what are the other possibilities besides N one-vs-rest classifiers, in terms of SVMs?

Alexandre KOWALCZYK (post author): One-vs-One, Crammer & Singer, and DAGSVM are alternative approaches for multi-class SVM.

Jose: This is one of the best tutorials out there. Very well explained. Are you thinking of including some illustrative code in R?

Thanks!

Alexandre KOWALCZYK (post author): Thank you very much, Jose. Right now I am concentrating on the math, but I might add some R code along the way to illustrate some parts. I intend to write an SVM tutorial like the one about text classification, but in R this time, as it is widely used.

Arun: Great article. Thanks for such a nice article. I have a fundamental question. The classes here are men and women, and the hyperplane separates them. If my classes are age ranges, say 1-20, 21-50, and 51-75, can we still use SVM, and how will we model the hyperplane?

Alexandre KOWALCZYK (post author): In your example you have what we call a multiclass classification problem. For each person you can assign one of the three classes. The common approach using SVM is one-vs-rest. So if you have three classes A, B and C, you will train three models: A vs (B and C), B vs (A and C) and C vs (A and B). Here is an example in python using a linear SVM.

Nishant Aggarwal: AWESOME ! AWESOME ! AWESOME ! AWESOME !

Thanks,

Nishant Aggarwal

Alexandre KOWALCZYK (post author): Thanks 😉

Patrick: Hi Alexandre, I would like to use SVR in my thesis to predict a response with 3 causal factors. What is your opinion, does R do it as you showed in your example? I am pretty new to machine learning.

Alexandre KOWALCZYK (post author): Hello Patrick. This is a pretty broad question. R is as good as any other language for machine learning. Most often people use R, Python or MATLAB. Anyway, you can basically do SVR with any language. I don't really understand what you mean by "predicting a response with 3 causal factors". The first thing you have to consider is whether you need to do a classification or a regression. Then you can pick the appropriate algorithm (SVM for classification or SVR for regression).

Patrick: Thank you for the response. By causal factors I mean I have 3 independent (causal) variables and one dependent variable, so I would like to use SVR.

Ahila: Hi,

I am new to machine learning and classification.

Do you mind giving an example of an EEG classification tutorial?

Alexandre KOWALCZYK (post author): I have never done EEG classification before. Maybe this video will help you. 😉

Emmanuel: I have 6 classes to classify from an orthophoto, and I have generated the HSV from the image, including the mean RGB for each color code. Given this information, how do I find the optimal hyperplane using R? Please advise. Thank you

Alexandre KOWALCZYK (post author): Hello. I am afraid your question is too broad. You have to try for yourself. If you have problems programming, try asking on stackoverflow.

Sri Suwarno: Dear Alexandre, thank you for your clear tutorial. It helps me understand SVM.

Rahul: Thank you Alexandre. This is very helpful. SVM was never so clear to me before. However, this clarity on SVM brings me to another question. How is SVM different from discriminant analysis? I understand that discriminant analysis also tries to find a discriminating line which maximizes the distance between points which belong to different categories. Does the difference lie in the way the algorithms are implemented, or is there something more to it? Are the application areas of SVM and discriminant analysis different?

Alexandre KOWALCZYK (post author): Hi Rahul. Thank you for your comment. I don't know a lot about LDA, but I found this quora answer which might help you to understand the difference between SVM and LDA.

Rahul: Thank you Alexandre.

Prihananto Joko: Hey brother, thank you, you saved me. You make everything about SVM clear.

Akshay: Very nicely explained, thank you...

Gurpreet: Thanks for the clarity.

linmaung: Thank you for your SVM tutorial. Your explanation is excellent and easy for me to understand.

Abhilasha: Hi, I love your explanation. I have a silly doubt. What exactly is the R^2 that you've mentioned here?

"With data points lying in R^2..."

Alexandre KOWALCZYK (post author): R^2 represents the Euclidean plane. It comes from set theory. If R is the set of all real numbers, then R^2 = R × R is the Cartesian product of R with itself. That is, R × R is the set of all ordered pairs whose first coordinate is an element of R and whose second coordinate is an element of R.

Chungkwon Ryu: Hello! This tutorial was very good for me. So I want to introduce this tutorial to my co-workers by using some slides. Is it possible?

Alexandre KOWALCZYK (post author): Hello. No problem for me as long as you indicate it came from this site. 😉

Manny Grewal: Thank you. This is the best tutorial for a newbie. I saw videos on youtube, but they are too advanced. Great work for making all this effort to teach ordinary guys like us.

Preetham: Hello,

Good introduction to machine learning and SVM.

I had a doubt: the hyperplane which you are referring to, does it need to be a straight line, or can it also be a curve?

Alexandre KOWALCZYK (post author): In two dimensions a hyperplane is a straight line. Sometimes you can see pictures with a circle, for instance, but it is just a projection of a hyperplane of a higher dimension into the 2 dimensions. In 3 dimensions it is a plane. In more than 3 dimensions you cannot visualize it.

inesda: Hello, your tutorial is excellent! The title of my thesis is the contextual discovery of web services using SVM, but I searched and never found this done with SVM. I wonder if we can implement web services on SVM.

Alexandre KOWALCZYK (post author): Thank you. I don't understand what you mean by "implementing web services" on SVM. Sorry.

mani: Great post, keep it up!


zakaria: Please, how do I classify images with SVM?

thesoul: Wow. Well explained. I was so confused with other sites explaining SVM when I stumbled upon your site. Thanks for the simple yet powerful explanation.

Kishor Bhosale: Really very simple explanation, very easy to understand... I like your work.

abha: very well explained

Ahmad: Thanks a lot for this tutorial. I need your help with my problem, please.

For needle trajectory detection in ultrasound, I want to estimate the needle trajectory by classifying each pixel in the image as needle class or background class. If I first apply the log-Gabor filter, what should the next classifier be? Can I use SVM, and how? I use MATLAB.

Thank you very much.

Pradip Nichite: Great Explanation !!

kamal: Hello Alex,

Great article for beginners. I see the explanation starts by picking the support vectors upfront. Basically, the vectors that fall on the decision boundaries are picked upfront. As I am new to SVM, I was wondering how a machine would do this. What is the mechanism used? Is the Euclidean distance between two points (vectors) the key?

-Kamal.

Alexandre KOWALCZYK (post author): The support vectors are the ones for which the Lagrange multipliers are non-zero. This will be explained in the upcoming article about optimization.

Sonali Nimkar: Beautifully explained, thank you very much Alexandre.

Sidra Sultan: One of the good articles. Thank you so much

Srinivasan Aruchamy: Very nice way of explanation. Thanks Alexandre

Ayush varshney: The best article I've found on this subject.

Thanks for sharing your expert knowledge on the subject, I am really impressed and want to learn more from your blogs.

Ngọc Tân Đỗ: I like your writing style so much. Thank you very much!

Luiz de Souza: Very good explanation. Simple and didactic. Hard to find combination.

Rohit Tanwar: I read so many articles on SVM, but the clarity of concepts I got from here is awesome.... hope to get your guidance in this area in future too.

robert d: This is the best explanation of the maths behind SVMs I have read, and I have read quite a few. Will you include KKT conditions when dealing with optimisation with inequalities?

Thank you very much for this explanation and I look forward to the book.

Alexandre KOWALCZYK (post author): Hello Robert,

Yes, KKT will be included of course.

Thank you for the kind comment.

polosepian: Excellent job

Sanidhya Singh: Great explanation in a simple way. Thanks Alexandre. Looking forward to more of your work. Do you have any other tutorials, website or blog on other topics? Or if you could also mail me the docs, it would be of great help.

Alexandre KOWALCZYK (post author): Thank you very much. No, I do not have other tutorials; I am fully focused on SVM 🙂 Maybe later. Which docs are you talking about?

Ravindra M: I wish you could do tutorials for other machine learning techniques as well. It's the best tutorial available on SVM.

Alexandre KOWALCZYK (post author): Thanks a lot 🙂

Ganecian: Nice post... But you're working with a continuous dataset in your example, so we can easily map it to a 2-dimensional vector space. How about discrete/categorical attributes such as skin color (white, black, brown, etc.)? How do we map them?

Alexandre KOWALCZYK (post author): Usually, you need to one-hot-encode categorical variables. You can read more about preprocessing the data in this article.
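To illustrate the one-hot encoding mentioned in the reply above, here is a minimal pure-Python sketch. The function name `one_hot_encode` and the sample values are made up for this example; in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1.

    Returns the sorted list of categories (one per column) and the
    encoded rows, in the same order as the input values."""
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(["white", "black", "brown", "black"])
print(categories)
for row in encoded:
    print(row)
```

Each category becomes its own dimension, so no artificial ordering (such as white < black < brown) is imposed on the data.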

Amal Targhi: Thank you, it is one of the best tutorials

Ming: Passionate to work as a data scientist, I've been looking for tutorials giving me ideas about what exactly SVM is, and this page is the one. Thank you.

Alexandre KOWALCZYK (post author): Thank you for your kind words. 🙂

jayadi.kurniawan: Nice article

I have a question: in your example the SVM is used to classify 2 classes.

Can we use SVM to classify 3 classes?

My research topic is sentiment analysis (text classification) using SVM. I'm about to classify 3 classes (positive, negative and neutral).

Thanks in advance.

Alexandre KOWALCZYK (post author): Yes, SVMs can be used to classify more than two classes. There are several ways to do this: one-vs-one, one-vs-all, ... You can read this page on the subject if you are using scikit-learn. In my upcoming ebook about SVM there will be one chapter dedicated to multi-class classification, as it is a frequent question.

Sonu: Dear sir,

How do you calculate the weight vector w and the bias term b? I have searched a lot but did not find a clear answer on the calculation of the weight vector w and b. Could you please explain it in a simple way, as you have done with everything else?

Alexandre KOWALCZYK (post author): Hello Sonu. Computing w and b is done once we have solved the optimization problem and found the Lagrange multipliers. Moreover, some algorithms such as SMO compute w and b separately. I explain how to do that in detail in my upcoming ebook. As it is not published yet, I can recommend you read this paper by Andrew Ng. You will find how to compute w in equation (9) and b in equation (11).
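As a rough sketch of that last step, the code below recovers w and b once the Lagrange multipliers are known. It assumes the hard-margin setting, the standard relation w = Σ αᵢ yᵢ xᵢ, and one common way of getting b (b = y_s − w · x_s for any support vector x_s, averaged here). The toy points and the multiplier values are hand-picked for this example, not the output of a solver, and the function names are hypothetical.

```python
def weight_vector(alphas, labels, points):
    """w = sum over i of alpha_i * y_i * x_i."""
    dim = len(points[0])
    w = [0.0] * dim
    for alpha, y, x in zip(alphas, labels, points):
        for j in range(dim):
            w[j] += alpha * y * x[j]
    return w

def bias(w, alphas, labels, points, tol=1e-12):
    """b = y_s - w . x_s, averaged over the support vectors (alpha_s > 0)."""
    values = [
        y - sum(wj * xj for wj, xj in zip(w, x))
        for alpha, y, x in zip(alphas, labels, points)
        if alpha > tol
    ]
    return sum(values) / len(values)

# Toy data: the first two points are support vectors, the third is not.
points = [(0.0, 0.0), (2.0, 2.0), (3.0, 3.0)]
labels = [-1, 1, 1]
alphas = [0.25, 0.25, 0.0]  # assumed to be the already-solved multipliers

w = weight_vector(alphas, labels, points)
b = bias(w, alphas, labels, points)
print(w, b)
```

Only the points with a non-zero multiplier contribute to w, which is why they are the support vectors.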

Ganesh S.: Hi,

This is an awesome post. It clears up every mathematical aspect of the support vector machine.

Can you suggest some sources/books with such mathematical explanations of other algorithms?

Thank you !

Alexandre KOWALCZYK (post author): Hi,

The book Pattern Recognition and Machine Learning by Bishop is very interesting. If you want a much broader view of AI, I recommend Artificial Intelligence: A Modern Approach by Russell and Norvig.

Srinidhi: Hi Alexandre,

Thanks for such a lucid presentation of machine learning algorithms for beginners. It helped me gain confidence that ML is not a bewildering field.