SVM - Understanding the math - Part 1 - The margin

Introduction

This is the first article in a series I will be writing about the math behind SVM. There is a lot to talk about, and a lot of mathematical background is often necessary. However, I will try to keep a slow pace and give in-depth explanations, so that everything is crystal clear, even for beginners.


SVM - Understanding the math

Part 1: What is the goal of the Support Vector Machine (SVM)?
Part 2: How to compute the margin?
Part 3: How to find the optimal hyperplane?
Part 4: Unconstrained minimization
Part 5: Convex functions
Part 6: Duality and Lagrange multipliers


What is the goal of the Support Vector Machine (SVM)?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.

The first thing we can see from this definition is that an SVM needs training data, which means it is a supervised learning algorithm.

It is also important to know that SVM is a classification algorithm, which means we will use it to predict if something belongs to a particular class.

For instance, we can have the training data below:

Support Vector Machine dataset

Figure 1

We have plotted the size and weight of several people, and the points are labeled so that we can distinguish between men and women.

With such data, using an SVM will allow us to answer the following question:

Given a particular data point (weight and size), is the person a man or a woman?

For instance: if someone is 175 cm tall and weighs 80 kg, is it a man or a woman?
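To make this concrete, here is a minimal sketch using scikit-learn. The data points and labels below are made up for illustration; they are not the ones from Figure 1:

# A minimal sketch with scikit-learn; the data is hypothetical,
# not the dataset from Figure 1.
import numpy as np
from sklearn.svm import SVC

# Each row is (size in cm, weight in kg); labels: 1 = man, -1 = woman
X = np.array([[181, 80], [177, 70], [176, 77], [170, 72],
              [160, 60], [154, 54], [158, 62], [165, 55]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear")  # a linear SVM looks for a separating hyperplane
clf.fit(X, y)

# 175 cm and 80 kg: man or woman?
print(clf.predict([[175, 80]]))  # [1] -> classified as a man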

What is a separating hyperplane?

Just by looking at the plot, we can see that it is possible to separate the data. For instance, we could draw a line so that all the data points representing men are above the line, and all the data points representing women are below it.

Such a line is called a separating hyperplane and is depicted below:

An example separating hyperplane

If it is just a line, why do we call it a hyperplane?

Even though we use a very simple example with data points lying in R^2, the support vector machine can work with any number of dimensions!

A hyperplane is a generalization of a plane.

  • in one dimension, a hyperplane is called a point
  • in two dimensions, it is a line
  • in three dimensions, it is a plane
  • in more dimensions, you can call it a hyperplane
A separating hyperplane in one dimension

The point L is a separating hyperplane in one dimension
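Part 2 will make this formal, but as a preview: in any number of dimensions, a hyperplane can be described as the set of points x satisfying w · x + b = 0 for some vector w and some number b. The sketch below uses arbitrary, made-up values of w and b to show how a point is classified by checking which side of the hyperplane it falls on:

# A minimal sketch; w and b are arbitrary values chosen for illustration.
import numpy as np

w = np.array([0.4, 1.0])  # hypothetical normal vector of the hyperplane
b = -9.0                  # hypothetical offset

def side(x):
    """Return +1.0 or -1.0 depending on which side of the hyperplane x lies."""
    return np.sign(np.dot(w, x) + b)

print(side(np.array([2.0, 10.0])))  # 1.0  -> one side of the hyperplane
print(side(np.array([2.0, 5.0])))   # -1.0 -> the other side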

What is the optimal separating hyperplane?

The fact that you can find a separating hyperplane does not mean it is the best one! In the example below there are several separating hyperplanes. Each of them is valid, as it successfully separates our data set with men on one side and women on the other side.

There are several possible separating hyperplanes

There can be a lot of separating hyperplanes

Suppose we select the green hyperplane and use it to classify real-life data.


This hyperplane does not generalize well

This time, it makes some mistakes, as it wrongly classifies three women. Intuitively, we can see that if we select a hyperplane which is close to the data points of one class, then it might not generalize well.

So we will try to select a hyperplane as far as possible from the data points of each category:

The optimal hyperplane

This one looks better. When we use it with real-life data, we can see that it still classifies perfectly.


The black hyperplane classifies more accurately than the green one

That's why the objective of a SVM is to find the optimal separating hyperplane:

  • because it correctly classifies the training data
  • and because it is the one which will generalize better on unseen data

What is the margin and how does it help choosing the optimal hyperplane?


The margin of our optimal hyperplane

Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, doubling it gives us what is called the margin.

Basically the margin is a no man's land: there will never be any data point inside the margin. (Note: this can cause some problems when data is noisy, which is why the soft-margin classifier will be introduced later.)
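For readers who want a preview of the computation (derived properly in Part 2): the distance from a point x to the hyperplane w · x + b = 0 is |w · x + b| / ||w||. A minimal sketch, with made-up numbers:

# A minimal sketch of the margin computation; w, b and the data points
# are hypothetical, not taken from the figures.
import numpy as np

w = np.array([0.4, 1.0])
b = -9.0
X = np.array([[2.0, 10.0], [4.0, 11.0], [2.0, 5.0], [5.0, 6.0]])

# Distance from each point to the hyperplane: |w . x + b| / ||w||
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin is twice the distance to the closest point
margin = 2 * distances.min()
print(margin)  # about 1.86 with these numbers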

For another hyperplane, the margin will look like this:

A hyperplane with a smaller margin

As you can see, Margin B is smaller than Margin A.

We can make the following observations:

  • If a hyperplane is very close to a data point, its margin will be small.
  • The further a hyperplane is from a data point, the larger its margin will be.

This means that the optimal hyperplane will be the one with the biggest margin.

That is why the objective of the SVM is to find the optimal separating hyperplane which maximizes the margin of the training data.
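Under the same assumptions as the sketch above, selecting the optimal hyperplane among candidates amounts to comparing margins and keeping the largest one:

# A toy comparison of two hypothetical hyperplanes on the same data.
import numpy as np

def margin(w, b, X):
    """Twice the distance from the hyperplane (w, b) to the closest point."""
    return 2 * np.min(np.abs(X @ w + b) / np.linalg.norm(w))

X = np.array([[2.0, 10.0], [4.0, 11.0], [2.0, 5.0], [5.0, 6.0]])
m_a = margin(np.array([0.4, 1.0]), -9.0, X)  # hypothetical hyperplane A
m_b = margin(np.array([0.1, 1.0]), -7.5, X)  # hypothetical hyperplane B
print("A" if m_a > m_b else "B", "has the larger margin")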

This concludes this introductory post about the math behind SVM. There were not a lot of formulas, but in the next article we will put in some numbers and try to get a mathematical view of this using geometry and vectors.

If you want to learn more, read it now:
SVM - Understanding the math - Part 2: Calculate the margin

I am passionate about machine learning and Support Vector Machines. When I am not writing this blog, you can find me on Kaggle participating in competitions.

67 thoughts on “SVM - Understanding the math - Part 1 - The margin”

    1. Alexandre KOWALCZYK Post author

      Thanks. I just found out my name was not visible on the site. I am passionate about machine learning and I work as a software engineer in a financial firm. 🙂 I added an author box. You can check my profile on various social media now.

      1. Chetan Yadati

        Hi Alexandre,
        Thanks for writing up this piece (and the following ones) on SVM. I really like the fact that you have explained all the essentials along with the SVM. It is comprehensive and complete. I could not have wished for a better write-up on the subject.

  1. Jose

    This is one of the best tutorials out there. Very well explained. Are you thinking of including some illustrative code in R?

    Thanks!

      1. Arun

        Great article. Thanks for such a nice article. I have a fundamental question. The classes here are men and women, and the hyperplane separates them. If my classes are age ranges, say 1-20, 21-50, and 51-75, can we still use SVM, and how will we model the hyperplane?

  2. Patrick

    Hi Alexandre, I would like to use SVR in my thesis to predict a response with 3 causal factors. In your opinion, can R do it as you showed in your example? I am pretty new to machine learning.

    1. Alexandre KOWALCZYK Post author

      Hello Patrick. This is a pretty broad question. R is as good as any other language for machine learning. Most often people use R, Python or MATLAB. Anyway, you can do SVR with basically any language. I don't really understand what you mean by "predicting a response with 3 causal factors". The first thing you have to consider is whether you need to do classification or regression. Then you can pick the appropriate algorithm (SVM for classification or SVR for regression).

  3. Patrick

    Thank you for the response. By causal factors I mean I have 3 independent (causal) variables and one dependent variable. So I would like to use SVR.

  4. Emmanuel

    I have 6 classes to classify from an orthophoto, and I have generated the HSV from the image, including the mean RGB for each color code. Given this information, how do I find the optimal hyperplane using R? Please advise me. Thank you.

  5. Rahul

    Thank you Alexandre. This is very helpful. SVM was never so clear to me before. However, this clarity on SVM brings me to another question. How is SVM different from discriminant analysis? I understand that discriminant analysis also tries to find a discriminating line which maximizes the distance between points which belong to different categories. Does the difference lie in the way the algorithms are implemented, or is there something more to it? Are the application areas of SVM and discriminant analysis different?

    1. Alexandre KOWALCZYK Post author

      Hi Rahul. Thank you for your comment. I don't know a lot about LDA, but I found this Quora answer which might help you to understand the difference between SVM and LDA.

  6. Abhilasha

    Hi, I love your explanation. I have a silly doubt. What exactly is the R^2 that you've mentioned here:
    "With data points lying in R^2..."

    1. Alexandre KOWALCZYK Post author

      R^2 represents the Euclidean plane. It comes from set theory. If R is the set of all real numbers, then R^2 = R × R is the Cartesian product of R with itself. That is, R × R is the set of all ordered pairs whose first coordinate is an element of R and whose second coordinate is an element of R.

  7. Chungkwon Ryu

    Hello! This tutorial was very good for me. So I want to introduce this tutorial to my co-workers by using some slides. Is it possible?

  8. Preetham

    Hello,
    Good introduction to machine learning and SVM.
    I had a doubt: the hyperplane you are referring to, does it need to be a straight line or can it be a curve?

    1. Alexandre KOWALCZYK Post author

      In two dimensions a hyperplane is a straight line. Sometimes you can see pictures with a circle, for instance, but it is just a projection of a hyperplane from a higher dimension into the two dimensions. In 3 dimensions it is a plane. In more than 3 dimensions you cannot visualize it.

  9. inesda

    Hello, your tutorial is excellent! The title of my thesis is the contextual discovery of web services using SVM, but I have searched and never found this done with SVM. I wonder if we can implement web service discovery with SVM.

  10. Pingback: Machine learning related online resources | 研究生主页

  11. thesoul

    Wow. Well explained. I was so confused by other sites explaining SVM when I stumbled upon your site. Thanks for the simple yet powerful explanation.

  12. Ahmad

    Thanks a lot for this tutorial. I need your help with my problem, please.

    For needle trajectory detection in ultrasound, I want to estimate the needle trajectory by classifying each pixel in the image into a needle class and a background class. If I first apply a log-Gabor filter, what should the next classifier be? Can I use SVM, and how? I use MATLAB.

    Thank you very much.

  13. kamal

    Hello Alex,
    Great article for beginners. I see the explanation starts by picking the support vectors upfront; basically, the vectors that fall on the decision boundaries are picked upfront. As I am new to SVM, I was wondering how a machine would do this. What mechanism is used? Is the Euclidean distance between two points (vectors) the key?

    -Kamal.

    1. Alexandre KOWALCZYK Post author

      The support vectors are the ones for which the Lagrange multipliers are non-zero. This will be explained in the upcoming article about optimization.

  14. Ayush varshney

    The best article I've found on this subject.

    Thanks for sharing your expert knowledge on the subject. I am really impressed and want to learn more from your blog.

  15. Rohit Tanwar

    I read so many articles on SVM, but the clarity of concepts I got from here is awesome. I hope to get your guidance in this area in the future too.

  16. robert d

    This is the best explanation of the maths behind SVMs I have read, and I have read quite a few. Will you include the KKT conditions when dealing with optimisation with inequalities?
    Thank you very much for this explanation; I look forward to the book.

  17. Sanidhya Singh

    Great explanation in a simple way. Thanks Alexandre. Looking forward to more of your work. Do you have any other tutorial websites or blogs on other topics? Or could you also mail me the docs? It would be of great help.

    1. Alexandre KOWALCZYK Post author

      Thank you very much. No, I do not have other tutorials; I am fully focused on SVM 🙂 Maybe later. Which docs are you talking about?

  18. Ganecian

    Nice post... But you're working with a continuous dataset in your example, so we can easily map it to a 2-dimensional vector space. What about discrete/categorical attributes such as skin color (white, black, brown, etc.)? How do we map them?

  19. Ming

    Passionate about working as a data scientist, I've been looking for tutorials giving me an idea of what exactly SVM is, and this page is the one. Thank you.

  20. jayadi.kurniawan

    Nice article.
    I have a question: in your example the SVM is used to classify 2 classes.
    Can we use SVM to classify 3 classes?
    My research topic is sentiment analysis (text classification) using SVM. I am about to classify 3 classes (positive, negative and neutral).
    Thanks in advance.

    1. Alexandre KOWALCZYK Post author

      Yes, SVMs can be used to classify more than two classes. There are several ways to do this: one-vs-one, one-vs-all, ... You can read this page on the subject if you are using scikit-learn. In my upcoming ebook about SVM there will be one chapter dedicated to multi-class classification, as it is a frequent question.
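      For example, a minimal sketch with scikit-learn (the toy data here is made up); SVC handles multi-class classification internally with a one-vs-one scheme:

      # A minimal sketch (toy data); scikit-learn's SVC handles multi-class
      # classification internally with a one-vs-one scheme.
      import numpy as np
      from sklearn.svm import SVC

      X = np.array([[0.0, 0.1], [0.2, 0.0],   # class 0, e.g. negative
                    [5.0, 5.0], [5.1, 4.8],   # class 1, e.g. positive
                    [0.0, 5.0], [0.2, 5.1]])  # class 2, e.g. neutral
      y = np.array([0, 0, 1, 1, 2, 2])

      clf = SVC(kernel="linear")
      clf.fit(X, y)
      print(clf.predict([[4.9, 5.2]]))  # [1]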

  21. Sonu

    Dear sir,
    How do we calculate the weight vector w and the bias term b? I have searched a lot but did not find a clear answer on the calculation of w and b. Could you please explain it in a simple way, as you have done with everything else?

    1. Alexandre KOWALCZYK Post author

      Hello Sonu. Computing w and b is done once we have solved the optimization problem and found the Lagrange multipliers. Moreover, some algorithms such as SMO compute w and b separately. I explain how to do that in detail in my upcoming ebook. As it is not published yet, I recommend you read this paper by Andrew Ng. You will find how to compute w in equation (9) and b in equation (11).
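      For reference, here is a minimal sketch (following the standard hard-margin formulas, with assumed variable names) of how w and b can be recovered once the Lagrange multipliers are known:

      # A minimal sketch; alphas are assumed to come from an already-solved
      # dual problem (e.g. via SMO), with labels y in {-1, +1}.
      import numpy as np

      def recover_w_b(alphas, X, y):
          w = (alphas * y) @ X            # w = sum_i alpha_i * y_i * x_i
          sv = alphas > 1e-8              # support vectors: non-zero alpha_i
          b = np.mean(y[sv] - X[sv] @ w)  # from y_i * (w . x_i + b) = 1
          return w, b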

