SVM - Understanding the math - Part 1 - The margin


This is the first article in a series I will be writing about the math behind SVM. There is a lot to talk about, and a good deal of mathematical background is often necessary. However, I will try to keep a slow pace and give in-depth explanations, so that everything is crystal clear, even for beginners.

If you are new and wish to know a little bit more about SVMs before diving into the math, you can read the article: an overview of Support Vector Machine.

What is the goal of the Support Vector Machine (SVM)?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.

The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm.

It is also important to know that SVM is a classification algorithm, which means we will use it to predict whether something belongs to a particular class.

For instance, we can have the training data below:

Support Vector Machine dataset

Figure 1

We have plotted the height and weight of several people, and each point is marked so that we can tell men from women.

With such data, using an SVM will allow us to answer the following question:

Given a particular data point (weight and height), is the person a man or a woman?

For instance: if someone measures 175 cm and weighs 80 kg, is that person a man or a woman?

What is a separating hyperplane?

Just by looking at the plot, we can see that it is possible to separate the data. For instance, we could draw a line so that all the data points representing men are above the line, and all the data points representing women are below it.

Such a line is called a separating hyperplane and is depicted below:

An example separating hyperplane
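
To make the idea concrete, here is a minimal Python sketch. It assumes the line is written as w.x + b = 0 (a notation this article has not introduced), and the height and weight values are made up for illustration, not read from the figure. A line separates the data when every man falls strictly on one side and every woman strictly on the other.

```python
# Hypothetical (height in cm, weight in kg) points; not taken from Figure 1.
men = [(180, 85), (175, 80), (185, 90), (178, 82)]
women = [(160, 55), (165, 60), (158, 52), (170, 62)]


def side(w, b, point):
    """Return +1 or -1 for the side of the line w.x + b = 0 (0 if the point is on it)."""
    value = sum(wi * xi for wi, xi in zip(w, point)) + b
    return (value > 0) - (value < 0)


def separates(w, b, positives, negatives):
    """True if all positives are strictly on the + side and all negatives on the - side."""
    return (all(side(w, b, p) == 1 for p in positives)
            and all(side(w, b, n) == -1 for n in negatives))


# An assumed candidate line: height + weight - 245 = 0
w, b = (1.0, 1.0), -245.0
print(separates(w, b, men, women))  # True
```

Once a separating line is known, classifying a new person is just a call to side: with this assumed line, side(w, b, (175, 80)) returns 1, the side where the men are.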

If it is just a line, why do we call it a hyperplane?

Even though we use a very simple example with data points lying in R^2, the support vector machine can work with any number of dimensions!

A hyperplane is a generalization of a plane.

  • in one dimension, a hyperplane is called a point
  • in two dimensions, it is a line
  • in three dimensions, it is a plane
  • in more dimensions you can call it a hyperplane
A separating hyperplane in one dimension

The point L is a separating hyperplane in one dimension
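
One way to see why a single word covers the point, the line and the plane is that the same test tells us on which side a data point falls, whatever the number of dimensions. The sketch below assumes a hyperplane is described by a vector w and a number b, as the set of points where w.x + b = 0. This notation and all the sample values are assumptions made here for illustration.

```python
def which_side(w, b, x):
    """Sign of w.x + b: +1 on one side of the hyperplane, -1 on the other, 0 on it."""
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (value > 0) - (value < 0)


# One dimension: the "hyperplane" is the point x = 3.
print(which_side((1.0,), -3.0, (5.0,)), which_side((1.0,), -3.0, (1.0,)))  # 1 -1

# Two dimensions: the line x + y = 4.
print(which_side((1.0, 1.0), -4.0, (3.0, 3.0)), which_side((1.0, 1.0), -4.0, (1.0, 1.0)))  # 1 -1

# Three dimensions: the plane z = 2.
print(which_side((0.0, 0.0, 1.0), -2.0, (0.0, 0.0, 5.0)),
      which_side((0.0, 0.0, 1.0), -2.0, (7.0, 7.0, 0.0)))  # 1 -1
```

Nothing in which_side depends on the length of w, so the same code would work with 4 or 400 dimensions.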

What is the optimal separating hyperplane?

The fact that you can find a separating hyperplane does not mean it is the best one! In the example below there are several separating hyperplanes. Each of them is valid, as it successfully separates our data set with men on one side and women on the other.

There are several possible separating hyperplanes

There can be a lot of separating hyperplanes

Suppose we select the green hyperplane and use it to classify real-life data.


This hyperplane does not generalize well

This time, it makes some mistakes: it misclassifies three women. Intuitively, we can see that if we select a hyperplane which is close to the data points of one class, then it might not generalize well.

So we will try to select a hyperplane as far as possible from the data points of each category:

The optimal hyperplane

This one looks better. When we use it with real-life data, we can see it still classifies every point correctly.


The black hyperplane classifies more accurately than the green one

That's why the objective of an SVM is to find the optimal separating hyperplane:

  • because it correctly classifies the training data
  • and because it is the one which will generalize better to unseen data

What is the margin and how does it help in choosing the optimal hyperplane?


The margin of our optimal hyperplane

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the margin.
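
As a rough sketch, the computation described above can be written in a few lines of Python. It relies on the standard point-to-hyperplane distance formula |w.x + b| / ||w||, which this article has not introduced, and on a hyperplane written as w.x + b = 0. The data points and the hyperplane are invented for the example.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def margin(w, b, points):
    """Twice the distance between the hyperplane and its closest data point."""
    return 2 * min(distance_to_hyperplane(w, b, p) for p in points)


# Made-up (height in cm, weight in kg) points, both classes together.
points = [(180, 85), (175, 80), (160, 55), (170, 62)]
w, b = (1.0, 1.0), -245.0  # assumed hyperplane: height + weight = 245

print(round(margin(w, b, points), 2))  # 14.14
```

Here the closest point is (175, 80), so it alone decides the margin of this hyperplane.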

Basically, the margin is a no man's land: there will never be any data point inside the margin. (Note: this can cause problems when the data is noisy, which is why the soft margin classifier will be introduced later.)

For another hyperplane, the margin will look like this:


As you can see, Margin B is smaller than Margin A.

We can make the following observations:

  • If a hyperplane is very close to a data point, its margin will be small.
  • The further a hyperplane is from its closest data point, the larger its margin will be.

This means that the optimal hyperplane will be the one with the biggest margin.
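
To illustrate, the sketch below compares a handful of hand-picked candidate hyperplanes and keeps the one with the biggest margin. This is only a brute-force illustration: a real SVM solves an optimization problem rather than trying candidates one by one. The data, the candidates and the w.x + b = 0 notation are all assumptions made for this example.

```python
import math

# Made-up training data: (height in cm, weight in kg).
men = [(180, 85), (175, 80), (185, 90)]
women = [(160, 55), (165, 60), (170, 62)]


def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_if_separating(w, b):
    """Margin of the hyperplane, or None if it does not separate the two classes."""
    d_men = [signed_distance(w, b, p) for p in men]
    d_women = [signed_distance(w, b, p) for p in women]
    if min(d_men) <= 0 or max(d_women) >= 0:
        return None
    return 2 * min(min(d_men), -max(d_women))


# Hypothetical candidates, each written as (w, b).
candidates = {
    "A": ((1.0, 1.0), -243.5),
    "B": ((1.0, 1.0), -250.0),
    "C": ((1.0, 0.0), -172.5),
    "D": ((1.0, 0.0), -182.0),
}

margins = {name: margin_if_separating(w, b) for name, (w, b) in candidates.items()}
best = max((n for n in margins if margins[n] is not None), key=lambda n: margins[n])
print(margins, best)
```

With these made-up numbers, candidate D is rejected because it does not separate the classes, and candidate A wins because it has the biggest margin of the remaining three.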

That is why the objective of the SVM is to find the optimal separating hyperplane which maximizes the margin of the training data.

This concludes this introductory post about the math behind SVM. There were not a lot of formulas, but in the next article we will put some numbers on these ideas and try to get the mathematical view of this using geometry and vectors.

If you want to learn more, read it now:
SVM - Understanding the math - Part 2 : Calculate the margin

106 thoughts on “SVM - Understanding the math - Part 1 - The margin”

    1. Alexandre KOWALCZYK Post author

      Thanks. I just found out my name was not visible on the site. I am passionate about machine learning and I work as a software engineer in a financial firm. 🙂 I added an author box. You can check my profile on various social media now.

      1. Chetan Yadati

        Hi Alexandre,
        Thanks for writing up this piece( and the following ones) on SVM. I really like the fact that you have explained all the essentials along with the SVM. It is comprehensive and complete. I could not have wished for a better write-up on the subject.

        1. Alexandre KOWALCZYK Post author

          Hello Renny, I will dedicate a full chapter of my upcoming (free) ebook to explain multi-class SVM. If you wish you can register at the end of this series so you will receive an email when it is available. 😉

          1. Anmol Biswas

            For N-class classification, what are the other possibilities excluding N one class vs the rest classifiers ? In terms of SVMs

          2. Alexandre KOWALCZYK Post author

            One-vs-One, Crammer & Singer, DAGSVM are alternative approaches for multi-class SVM.

  1. Jose

    This is one of the best tutorials out there. Very well explained. Are you thinking on including some illustrative code on R?


      1. Arun

        Great article. Thanks for such a nice article. I have a fundamental question. The classification here are men and women and hyperplane separates them. If my classes are age say between 1-20 , 21-50 , 51 -75 can we still use svm and how will we model hyperplane.

  2. Patrick

    Hi Alexandre, I would like to use SVR in my thesis to predict a response with 3 causal factors. What is your opinion, does R do it as u showed in your example. Iam pretty new in machine learning

    1. Alexandre KOWALCZYK Post author

      Hello Patrick. This is a pretty broad question. R is as good as any other language for machine learning. Most often people use R, Python or MATLAB. Anyway, you can basically do SVR with any language. I don't really understand what you mean by "predicting a response with 3 causal factors". The first thing you have to consider is whether you need to do a classification or a regression. Then you can pick the appropriate algorithm (SVM for classification or SVR for regression).

  3. Patrick

    Thank you for the response, causal factors I mean I have 3 independent (causal) variables and one dependent variable. So I wd like to use SVR

  4. Ahila

    I am new to machine learning and classification.
    Do you mind giving an example of EEG classification tutorial

  5. Emmanuel

    I have 6 classes to classify from and Orthophoto. and Have generated the HSV from the Image including the Mean RGB for each color code.. give these information how to I find the best Optimal Hyperplane using R. Advice me.. Thank you

  6. Rahul

    Thank you Alexandre. This is very helpful. SVM was never so clear to me before. However, this clarity on SVM brings me to another question. How is SVM different from the Discriminant analysis. I understand that discriminant analysis as well tries to find a discriminating line which maximizes the distance between points which belongs to different categories. Does the difference lies in the way algorithms are implemented or there is something more to it? Are the application areas of SVM and Discriminant analysis different?

    1. Alexandre KOWALCZYK Post author

      Hi Rahul. Thank you for your comment. I don't know a lot about LDA but I found this quora answer which might help you to understand the difference between SVM and LDA

  7. Abhilasha

    Hi, I love your explanation. I have a silly doubt. What exactly is R^2 that you've mentioned here.
    "With data points lying in R^2..."

    1. Alexandre KOWALCZYK Post author

      R^2 represents the Euclidean plane. It comes from set theory. If R is the set of all real numbers, then R^2 = R × R is the Cartesian product of R with itself. That is, R × R is the set of all ordered pairs whose first coordinate is an element of R and whose second coordinate is an element of R.

  8. Chungkwon Ryu

    Hello! This tutorial was very good for me. So I want to introduce this tutorial to my co-workers by using some slides. Is it possible?

    1. Alexandre KOWALCZYK Post author

      Hello. No problem for me as long as you indicate it came from this site. 😉

  9. Manny Grewal

    Thank you. this is the best tutorial of a newbie. I saw videos on youtube, they are too advanced. Great work for doing all this effort to teach ordinary guys like us.

  10. Preetham

    Good introduction into machine learning and SVM.
    I had a doubt, the hyperplane which you are referring, does it need to be a straight line or can it be a curve also?

    1. Alexandre KOWALCZYK Post author

      In two dimensions a hyperplane is a straight line. Sometimes you can see pictures with a circle, for instance, but it is just a projection of a hyperplane of a higher dimension into 2 dimensions. In 3 dimensions it is a plane. In more than 3 dimensions you cannot visualize it.

  11. inesda

    hello your tutorial is excllent ! the title of my thesis is the contextual discovery of web services using svm but the problem I searched and I have never found this with svm I wonder if we can implement web services on svm

    1. Alexandre KOWALCZYK Post author

      Thank you. I don't understand what you mean by "implementing web services" on svm. Sorry.

  13. thesoul

    Wow. Well explained. I was so confused with other sites explaining SVM when I stumbled upon your site. Thanks for the simple yet powerful explanation.

  14. Ahmad

    thanks a lot for this tutorial , I need your help about my problem please

    for needle trajectory detection in ultrasound , if I want to estimate the needle trajectory , by classify each pixel in the image to needle class , and background class , if I apply first the log gabor filter what will be the next classifier , can I use svm and how , I use matlab

    thank you very much

  15. kamal

    Hello Alex,
    Great article for beginners, I see the explaining starts by picking up the support vectors upfront. Basically those vectors that fall on the decision boundaries are picked up upfront. As I was new to SVM, I was wondering how would a machine do this. What is the mechanism used, is the euclidean distance between two points (vectors) the key?


    1. Alexandre KOWALCZYK Post author

      The support vectors are the ones for which the Lagrange multipliers are non-zero. This will be explained in the upcoming article about optimization.

  16. Ayush varshney

    The best article i've found on this subject.

    Thanks for sharing your expert knowledge on the subject, i am really impressed and wanted to learn more from your blogs..

  17. Rohit Tanwar

    I read so many articles on SVM but the clarity of concepts I got from here is awesome.... hope to get your guidance in this area in future too..

  18. robert d

    This is the best explanation of the maths behind svms i have read and i have read quite a few. Will you include kkt conditions when dealing with optimisation with inequalities ?
    Thank you very much for this explanation and i look forward to the book.

  19. Sanidhya Singh

    Great explanation in simple way.. Thanks Alexandre. Looking forward to more of your work. Do you have any other tutorials website or blog on other topics. Or if you can mail me also the docs. It will be of great help.

    1. Alexandre KOWALCZYK Post author

      Thank you very much. No, I do not have other tutorials I am fully focused on SVM 🙂 Maybe later. Which docs are you talking about?

  20. Ravindra M

    I wish you could do tutorials for other machine learning techniques as well. Its the best tutorial available in SVM.

  21. Ganecian

    Nice post... But you're working with continuous dataset in your example so we can easily map it to 2 dimensional vector space. How about discrete/categorized attribute such as skin color (white, black, brown, etc), how to map it?

  22. Ming

    Passionate to work as a data scientist, I've been looking for tutorials giving me ideas what exactly SVM is and this page is the one, Thank you.

  23. jayadi.kurniawan

    Nice article
    I have a question, in your example the SVM used to classify 2 classes
    can we use SVM to classify 3 classes ?
    my research topic is sentiment analysis (tex classification) using SVM. Im about to classify 3 classes (positive, negative and netral class).
    thanks before.

    1. Alexandre KOWALCZYK Post author

      Yes, SVMs can be used to classify more than two classes. There are several ways to do this: one-vs-one, one-vs-all, ... You can read this page on the subject if you are using scikit-learn. In my upcoming ebook about SVM there will be one chapter dedicated to multi-class classification as it is a frequent question.

  24. Sonu

    Dear sir,
    How to calculate the weight vector w and the bias term b?. I have searched a lot but did not find the clear answer of calculation of the weight vector w and the b. Could you please explain in simple way as you done all the things

    1. Alexandre KOWALCZYK Post author

      Hello Sonu. Computing w and b is done once we have solved the optimization problem and found the Lagrange multipliers. Moreover, some algorithms such as SMO compute w and b separately. I explain how to do that in detail in my upcoming ebook. As it is not published yet, I can recommend you to read this paper by Andrew Ng. You will find how to compute w in equation (9) and b in equation (11).

  25. Ganesh S.


    This is awesome post. It clears every mathematical aspect of support vector machine.
    Can you suggest some sources/book for such mathematical explanation of other algorithms ?

    Thank you !

    1. Alexandre KOWALCZYK Post author

      The book Pattern Recognition and Machine Learning by Bishop is very interesting. If you want a much broader view of AI, I recommend Artificial Intelligence: A Modern Approach by Russell and Norvig.

  26. Srinidhi

    Hi Alexandre,

    Thanks for such a lucid presentation to beginners of Machine learning algorithms.Helped me gain confidence that ML is not an bewildering field.

  27. Remalli Vidyasagar


    it is a very simple tutorail and very clear on SVM. I very much impressed with your explanation.
    thank you

  28. ramamurthi kumar

    hi alexandre,

    an amazing work, by you on SVM, i have been trying to understand svm for almost a month, and your site filled that void.

    Just wanna ask you, can you discuss preceptron, neural networks as well ? or can you suggest any websites / book for the above ?


    1. Alexandre KOWALCZYK Post author

      Hello. I discuss perceptron in my upcoming ebook 🙂 Feel free to subscribe to receive a mail when it is ready!

  29. Anil Chaurasiya

    Hey! Alexandre, The way you wrote this topic help lot.
    thanks for wrighting such a goood topics.

  30. Morgan

    Hello Alexander, I enjoyed this article and it is making the concept of SVM very clear to me. Please how do I use SVM and fuzzy logic to develop a medical diagnostic system.

  31. vamshi polapally

    How i can apply svm algorithm to implement automatic identification and data capture using RFID ??

  33. Ekki rinaldi

    Can you add one part about how the kernel works? like how rbf or polynomial implemented mathematically

    1. Gleen A. Dalaorao

      Hello Sir,

      I love this blog, it makes our SVM life very easy... I'm working on SVM research but I'm new to this, my research is that I want to tweak or make just a simple modification on SVM on it weakness or disadvantage. I hope you can guide me where to focus or what part of SVM will I modify, the important there in is that it is new or novel so that I gather articles on that and start reading. Hope you can share your thoughts and guide. Thanks and more power

  34. Sudhakar Surapaneni

    Dear Alexandre KOWALCZYK,
    I started taking a course on Machine Learning. Found it easy going until my encounter with SVM. Had to go back and forth multiple times to understand the math and gain some intuition. University courses videos helped to some extent and there too got stuck at few places. Your explanation helped to gain further insights and cleared some of the logic I could not get a hang of. I am grateful to you for your contribution.


  35. oymj

    I think I can not subscribe to the list by the mail, could you provide another way to get the coming ebook? thanks!

  36. ENNAJIH Yassin

    Hello, Great Introduction, couldn't have asked for a better one !!

    Can you please add a paragraph or two about the regression side of SVM ?
    because this explanation mainly focuses on the classification side of SVM... and I am left wondering how does Support Vector Regression (SVR) work? also by finding the optimal hyperplane that minimizes the margin ? if so? how?

    Thanks in advance

  37. Allasandro

    Hi Alexandre

    Could you provide some information on how to implement a Fuzzy SVM in R?
    Or how to incorporate a fuzzy target variable?
