SVM - Understanding the math - Part 1 - The margin


This is the first article in a series I will be writing about the math behind SVM. There is a lot to talk about, and a lot of mathematical background is often necessary. However, I will try to keep a slow pace and give in-depth explanations, so that everything is crystal clear, even for beginners.

If you are new and wish to know a little bit more about SVMs before diving into the math, you can read the article: an overview of Support Vector Machine.

What is the goal of the Support Vector Machine (SVM)?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.

The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm.

It is also important to know that SVM is a classification algorithm, which means we will use it to predict whether something belongs to a particular class.

For instance, we can have the training data below:

Support Vector Machine dataset

Figure 1

We have plotted the size (height) and weight of several people, and there is also a way to distinguish between men and women.

With such data, using an SVM will allow us to answer the following question:

Given a particular data point (weight and size), is the person a man or a woman?

For instance: if someone measures 175 cm and weighs 80 kg, is it a man or a woman?

What is a separating hyperplane?

Just by looking at the plot, we can see that it is possible to separate the data. For instance, we could draw a line so that all the data points representing men are above the line and all the data points representing women are below it.

Such a line is called a separating hyperplane and is depicted below:

An example separating hyperplane
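To make the idea of a separating line concrete, here is a minimal sketch. It assumes the line is written as w · x + b = 0, a notation this article has not introduced yet, and it uses coefficients and data points that are made up for illustration (they are not taken from the figure and not learned by an SVM).

```python
# Hypothetical line in the (height, weight) plane: w . x + b = 0.
# W and B are illustrative values, not the output of an SVM.
W = (1.0, 1.0)
B = -235.0


def side(point, w=W, b=B):
    """Return w . x + b; its sign tells which side of the line the point is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def classify(point):
    """Label points on the positive side 'man' and all others 'woman'."""
    return "man" if side(point) > 0 else "woman"


# Made-up training points: (height in cm, weight in kg).
men = [(180, 80), (185, 90), (175, 78)]
women = [(160, 55), (165, 60), (158, 50)]

# The line is a separating hyperplane if every point lands on its own side.
separates = (all(classify(p) == "man" for p in men)
             and all(classify(p) == "woman" for p in women))

print(separates)            # True
print(classify((175, 80)))  # man
```

With this particular made-up line, the person measuring 175 cm and weighing 80 kg falls on the "man" side. A different line could give a different answer, which is why the choice of hyperplane matters.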

If it is just a line, why do we call it a hyperplane?

Even though we use a very simple example with data points lying in R^2, the support vector machine can work with any number of dimensions!

A hyperplane is a generalization of a plane.

  • in one dimension, a hyperplane is a point
  • in two dimensions, it is a line
  • in three dimensions, it is a plane
  • in more dimensions, it is simply called a hyperplane

A separating hyperplane in one dimension

The point L is a separating hyperplane in one dimension
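The same test works in any number of dimensions, which is one way to see why a single word is used for the point, the line and the plane. The sketch below assumes a hyperplane is described by a weight vector w and an offset b (notation not used in this article so far), and every number in it is hypothetical, including the position chosen for the point L.

```python
def hyperplane_side(w, b, x):
    """Return +1, -1 or 0 depending on the sign of w . x + b."""
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0


# One dimension: the "hyperplane" is the point x = 3 (a made-up position for L).
print(hyperplane_side((1.0,), -3.0, (5.0,)))                    # 1
# Two dimensions: the hyperplane is the line x + y = 4.
print(hyperplane_side((1.0, 1.0), -4.0, (1.0, 1.0)))            # -1
# Three dimensions: the hyperplane is the plane x + y + z = 6.
print(hyperplane_side((1.0, 1.0, 1.0), -6.0, (3.0, 3.0, 3.0)))  # 1
# A point lying exactly on the plane gives 0.
print(hyperplane_side((1.0, 1.0, 1.0), -6.0, (2.0, 2.0, 2.0)))  # 0
```

Only the length of the vectors changes from one case to the next; the function itself is identical.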

What is the optimal separating hyperplane?

The fact that you can find a separating hyperplane does not mean it is the best one! In the example below there are several separating hyperplanes. Each of them is valid, as it successfully separates our data set with men on one side and women on the other.

There are several possible separating hyperplanes

There can be a lot of separating hyperplanes

Suppose we select the green hyperplane and use it to classify real-life data.


This hyperplane does not generalize well

This time, it makes some mistakes, as it wrongly classifies three women. Intuitively, we can see that if we select a hyperplane which is close to the data points of one class, then it might not generalize well.

So we will try to select a hyperplane that is as far as possible from the data points of each category:

The optimal hyperplane

This one looks better. When we use it with real-life data, we can see that it still classifies every point correctly.


The black hyperplane classifies more accurately than the green one

That's why the objective of an SVM is to find the optimal separating hyperplane:

  • because it correctly classifies the training data
  • and because it is the one which will generalize better with unseen data

What is the margin and how does it help in choosing the optimal hyperplane?


The margin of our optimal hyperplane

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the margin.

Basically, the margin is a no man's land. There will never be any data point inside the margin. (Note: this can cause some problems when data is noisy, and this is why the soft margin classifier will be introduced later.)

For another hyperplane, the margin will look like this:
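The definition of the margin can be sketched in a few lines. This example relies on the standard formula for the distance from a point x to the hyperplane w · x + b = 0, namely |w · x + b| / ||w||, which is not derived in this article. The hyperplane and the data points are hypothetical.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Twice the distance between the hyperplane and its closest data point."""
    return 2 * min(distance_to_hyperplane(w, b, p) for p in points)


# Made-up example: the horizontal line y = 3 and three data points.
w, b = (0.0, 1.0), -3.0
points = [(0.0, 5.0), (1.0, 1.0), (2.0, 4.0)]

# The closest point is (2, 4), at distance 1, so the margin is 2.
print(margin(w, b, points))  # 2.0
```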


As you can see, Margin B is smaller than Margin A.

We can make the following observations:

  • If a hyperplane is very close to a data point, its margin will be small.
  • The further a hyperplane is from its closest data point, the larger its margin will be.

This means that the optimal hyperplane will be the one with the biggest margin.
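As a rough illustration of this criterion, the sketch below compares a few hand-picked candidate hyperplanes and keeps the one with the largest margin. This is only an assumption-laden toy: a real SVM does not pick from a short list but solves an optimization problem over all possible hyperplanes. The data, the candidate lines and their color names are all hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def separates(w, b, positives, negatives):
    """True if every positive point is on one side and every negative on the other."""
    return (all(signed_distance(w, b, p) > 0 for p in positives)
            and all(signed_distance(w, b, n) < 0 for n in negatives))


def margin(w, b, points):
    """Twice the distance between the hyperplane and its closest data point."""
    return 2 * min(abs(signed_distance(w, b, p)) for p in points)


# Made-up two-class data set.
positives = [(1.0, 4.0), (2.0, 5.0), (3.0, 4.5)]
negatives = [(1.0, 1.0), (2.0, 0.0), (3.0, 1.5)]

# Hypothetical candidates, all horizontal lines y = c written as (w, b).
candidates = {
    "green": ((0.0, 1.0), -3.8),   # y = 3.8, very close to a positive point
    "blue": ((0.0, 1.0), -2.0),    # y = 2.0, close to a negative point
    "black": ((0.0, 1.0), -2.75),  # y = 2.75, halfway between the classes
}

# Keep only the candidates that actually separate the data.
valid = {name: wb for name, wb in candidates.items()
         if separates(wb[0], wb[1], positives, negatives)}

all_points = positives + negatives
best = max(valid, key=lambda name: margin(valid[name][0], valid[name][1], all_points))

print(best)                                 # black
print(margin(*valid[best], all_points))     # 2.5
```

All three candidates separate the training data, but the black one sits as far as possible from both classes and therefore has the biggest margin.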

That is why the objective of the SVM is to find the optimal separating hyperplane which maximizes the margin of the training data.

This concludes this introductory post about the math behind SVM. There were not a lot of formulas, but in the next article we will put in some numbers and try to get the mathematical view of this using geometry and vectors.

If you want to learn more, read it now:
SVM - Understanding the math - Part 2 : Calculate the margin

    • Thanks. I just found out my name was not visible on the site. I am passionate about machine learning and I work as a software engineer in a financial firm. 🙂 I added an author box. You can check my profile on various social media now.

      • Hi Alexandre,
        Thanks for writing up this piece( and the following ones) on SVM. I really like the fact that you have explained all the essentials along with the SVM. It is comprehensive and complete. I could not have wished for a better write-up on the subject.

        • Hello Renny, I will dedicate a full chapter of my upcoming (free) ebook to explain multi-class SVM. If you wish you can register at the end of this series so you will receive an email when it is available. 😉

          • For N-class classification, what are the other possibilities excluding N one class vs the rest classifiers ? In terms of SVMs

  1. This is one of the best tutorials out there. Very well explained. Are you thinking on including some illustrative code on R?


      • Great article. Thanks for such a nice article. I have a fundamental question. The classification here are men and women and hyperplane separates them. If my classes are age say between 1-20 , 21-50 , 51 -75 can we still use svm and how will we model hyperplane.

  2. Hi Alexandre, I would like to use SVR in my thesis to predict a response with 3 causal factors. What is your opinion, does R do it as u showed in your example. Iam pretty new in machine learning

    • Hello Patrick. This is a pretty broad question. R is as good as any other language for machine learning. Most often people use R, Python or MATLAB. Anyway, you can basically do SVR with any language. I don't really understand what you mean by "predicting a response with 3 causal factors". The first thing you have to consider is whether you need to do a classification or a regression. Then you can pick the appropriate algorithm (SVM for classification or SVR for regression).

  3. Thank you for the response, causal factors I mean I have 3 independent (causal) variables and one dependent variable. So I wd like to use SVR

  4. Hi,
    I am new to machine learning and classification.
    Do you mind giving an example of EEG classification tutorial

  5. I have 6 classes to classify from and Orthophoto. and Have generated the HSV from the Image including the Mean RGB for each color code.. give these information how to I find the best Optimal Hyperplane using R. Advice me.. Thank you

  6. Thank you Alexandre. This is very helpful. SVM was never so clear to me before. However, this clarity on SVM brings me to another question. How is SVM different from the Discriminant analysis. I understand that discriminant analysis as well tries to find a discriminating line which maximizes the distance between points which belongs to different categories. Does the difference lies in the way algorithms are implemented or there is something more to it? Are the application areas of SVM and Discriminant analysis different?

  7. Hi, I love your explanation. I have a silly doubt. What exactly is R^2 that you've mentioned here.
    "With data points lying in R^2..."

    • R^2 represents the Euclidean plane. The notation comes from set theory. If R is the set of all real numbers, then R^2 = R × R is the Cartesian product of R with itself. That is, R × R is the set of all ordered pairs whose first coordinate is an element of R and whose second coordinate is an element of R.

  8. Hello! This tutorial was very good for me. So I want to introduce this tutorial to my co-workers by using some slides. Is it possible?

  9. Thank you. this is the best tutorial of a newbie. I saw videos on youtube, they are too advanced. Great work for doing all this effort to teach ordinary guys like us.

  10. Hello,
    Good introduction into machine learning and SVM.
    I had a doubt, the hyperplane which you are referring, does it need to be a straight line or can it be a curve also?

    • In two dimensions a hyperplane is a straight line. Sometimes you can see pictures with a circle, for instance, but it is just a projection of a hyperplane from a higher dimension into the 2 dimensions. In 3 dimensions it is a plane. In more than 3 dimensions you cannot visualize it.

  11. Hello, your tutorial is excellent! The title of my thesis is the contextual discovery of web services using SVM, but I have searched and never found this done with SVM. I wonder if we can implement web services with SVM.

  13. Wow. Well explained. I was so confused with other sites explaining SVM when I stumbled upon your site. Thanks for the simple yet powerful explanation.

  14. thanks a lot for this tutorial , I need your help about my problem please

    for needle trajectory detection in ultrasound , if I want to estimate the needle trajectory , by classify each pixel in the image to needle class , and background class , if I apply first the log gabor filter what will be the next classifier , can I use svm and how , I use matlab

    thank you very much

  15. Hello Alex,
    Great article for beginners, I see the explaining starts by picking up the support vectors upfront. Basically those vectors that fall on the decision boundaries are picked up upfront. As I was new to SVM, I was wondering how would a machine do this. What is the mechanism used, is the euclidean distance between two points (vectors) the key?


    • The support vectors are the ones for which the Lagrange multipliers are nonzero. This will be explained in the upcoming article about optimization.

  16. The best article i've found on this subject.

    Thanks for sharing your expert knowledge on the subject, i am really impressed and wanted to learn more from your blogs..

  17. I read so many articles on SVM but the clarity of concepts I got from here is awesome.... hope to get your guidance in this area in future too..

  18. This is the best explanation of the maths behind svms i have read and i have read quite a few. Will you include kkt conditions when dealing with optimisation with inequalities ?
    Thank you very much for this explanation and i look forward to the book.

  19. Great explanation in simple way.. Thanks Alexandre. Looking forward to more of your work. Do you have any other tutorials website or blog on other topics. Or if you can mail me also the docs. It will be of great help.

    • Thank you very much. No, I do not have other tutorials I am fully focused on SVM 🙂 Maybe later. Which docs are you talking about?

  20. I wish you could do tutorials for other machine learning techniques as well. Its the best tutorial available in SVM.

  21. Nice post... But you're working with continuous dataset in your example so we can easily map it to 2 dimensional vector space. How about discrete/categorized attribute such as skin color (white, black, brown, etc), how to map it?

  22. Passionate to work as a data scientist, I've been looking for tutorials giving me ideas what exactly SVM is and this page is the one, Thank you.

  23. Nice article
    I have a question, in your example the SVM used to classify 2 classes
    can we use SVM to classify 3 classes ?
    my research topic is sentiment analysis (text classification) using SVM. I'm about to classify 3 classes (positive, negative and neutral class).
    thanks before.

    • Yes, SVMs can be used to classify more than two classes. There are several ways to do this: one-vs-one, one-vs-all, ... You can read this page on the subject if you are using scikit-learn. In my upcoming ebook about SVM there will be one chapter dedicated to multi-class classification, as it is a frequent question.

  24. Dear sir,
    How to calculate the weight vector w and the bias term b?. I have searched a lot but did not find the clear answer of calculation of the weight vector w and the b. Could you please explain in simple way as you done all the things

    • Hello Sonu. Computing w and b is done once we have solved the optimization problem and found the Lagrange multipliers. Moreover, some algorithms such as SMO compute w and b separately. I explain how to do that in detail in my upcoming ebook. As it is not published yet, I can recommend you to read this paper by Andrew Ng. You will find how to compute w in equation (9) and b in equation (11).

  25. Hi,

    This is awesome post. It clears every mathematical aspect of support vector machine.
    Can you suggest some sources/book for such mathematical explanation of other algorithms ?

    Thank you !

    • Hi,
      The book Pattern Recognition and Machine Learning by Bishop is very interesting. If you want a much broader view of AI, I recommend Artificial Intelligence: A Modern Approach by Russell and Norvig.

  26. Hi Alexandre,

    Thanks for such a lucid presentation to beginners of Machine learning algorithms.Helped me gain confidence that ML is not an bewildering field.

  27. Hi,

    it is a very simple tutorial and very clear on SVM. I am very much impressed with your explanation.
    thank you

  28. hi alexandre,

    an amazing work, by you on SVM, i have been trying to understand svm for almost a month, and your site filled that void.

    Just wanna ask you, can you discuss perceptron, neural networks as well ? or can you suggest any websites / book for the above ?


    • Hello. I discuss perceptron in my upcoming ebook 🙂 Feel free to subscribe to receive a mail when it is ready!

  29. Hey! Alexandre, The way you wrote this topic helps a lot.
    thanks for writing such good topics.

  30. Hello Alexander, I enjoyed this article and it is making the concept of SVM very clear to me. Please how do I use SVM and fuzzy logic to develop a medical diagnostic system.

  31. How i can apply svm algorithm to implement automatic identification and data capture using RFID ??

  33. Can you add one part about how the kernel works? like how rbf or polynomial implemented mathematically

    • Hello Sir,

      I love this blog, it makes our SVM life very easy... I'm working on SVM research but I'm new to this, my research is that I want to tweak or make just a simple modification on SVM on it weakness or disadvantage. I hope you can guide me where to focus or what part of SVM will I modify, the important there in is that it is new or novel so that I gather articles on that and start reading. Hope you can share your thoughts and guide. Thanks and more power

  34. Dear Alexandre KOWALCZYK,
    I started taking a course on Machine Learning. Found it easy going until my encounter with SVM. Had to go back and forth multiple times to understand the math and gain some intuition. University courses videos helped to some extent and there too got stuck at few places. Your explanation helped to gain further insights and cleared some of the logic I could not get a hang of. I am grateful to you for your contribution.


  35. I think I can not subscribe to the list by the mail, could you provide another way to get the coming ebook? thanks!

  36. Hello, Great Introduction, couldn't have asked for a better one !!

    Can you please add a paragraph or two about the regression side of SVM ?
    because this explanation mainly focuses on the classification side of SVM... and I am left wondering how does Support Vector Regression (SVR) work? also by finding the optimal hyperplane that minimizes the margin ? if so? how?

    Thanks in advance

  37. Hi Alexandre

    Could you provide some information on how to implement a Fuzzy SVM in R?
    Or how to incorporate a fuzzy target variable?