My ebook Support Vector Machines Succinctly is available for free.

## About Support Vector Machines Succinctly

While I was working on my series of articles about the mathematics behind SVMs, Syncfusion contacted me to write an ebook for their "Succinctly" series. The goal of the series is to cover a particular subject in about 100 pages. I gladly accepted the proposal and started working on the book.

It took me almost one year to finish writing this ebook. I spent hundreds of hours reading about Support Vector Machines in several books and papers, trying to figure out the complex parts in order to give you a clear view, and just as many drawing diagrams and writing code.

## What will you find in this book?

- A refresher about some prerequisites to understand the subject more easily
- An introduction with the Perceptron
- An explanation of the SVM optimization problem
- How to solve the SVM optimization problem with a quadratic programming solver
- A description of kernels
- An explanation of the SMO algorithm
- An overview of multi-class SVMs
- A lot of Python code to show how it works (everything is available in this Bitbucket repository)

I hope this book will help everyone struggling with this subject to gain a fresh and clear understanding of it.

## Is it hard to read?

My goal was to keep the book as easy to understand as possible. However, because of the limited space, I was not able to go into as much detail as I do on this blog.

Chapter 4 about the optimization problem will probably be the most challenging to read. Do not worry if you do not catch everything the first time and feel free to use it as a base to dive deeper into the subject if you wish to understand it fully.

If you struggle with some part, you can ask your question on stats.stackexchange if it is about machine learning, or on math.stackexchange if it is about the mathematics. Both communities are great.

## What now?

If you read the book I would love to hear your feedback!

Do not hesitate to post a comment and to share the book with your friends!

I am passionate about machine learning and Support Vector Machines. I like to explain things simply and to share my knowledge with people from around the world. If you wish, you can add me on LinkedIn; I like to connect with my readers.

S.M. Hosseini

Dear Alexandre, Congratulations!

This is superb! I had no idea a book could look this good. Smooth and easy to read.

Greg42

Congrats! That's probably the best book out there on the subject.

Dmitry Lifshitz

Thank you Alexandre,

Awesome book. The most comprehensive material ever found in the net.

It has everything: refreshment of linear algebra, convex optimization basics, code examples and of course SVMs.

Thank you for sharing your passion,

Dmitry

Qin

Thanks very much for the long-awaited ebook and your hard work!

ALdi

I've been waiting so long. Thank you so much, sir.

C

Hi there Alexandre,

First and foremost, I read through your book over the course of the last two weeks, and all of the concepts were brilliantly explained. This book is an incredible reference that I would definitely recommend to anyone interested in learning about SVMs.

When working through your Python examples, I think I may have found a very small issue. Within `smo_algorithm.py`, on the Second Heuristic C (lines 209-216), I noticed that the `randrange` function appears to be causing the algorithm to begin at the same point in a list every single time the program is run. For instance:

```
rand_i = randrange(self.m)
print(rand_i)
```

will always begin at the following indices every time the program is run:

```
$ python run_smo.py
8
13
13
12
6
```

For reference, I am using the `linearly_separable` dataset, and I'm running this on Python version 2.7.13. Is this behavior by design?

Alexandre KOWALCZYK (Post author)

Hello.

Thank you for your comment. The `randrange` function always returns the same numbers because at line 16 of `run_smo.py` we call the function `seed(5)`. You can read more about the seed function here. I did that so that my results are reproducible, but the algorithm states that we select the numbers randomly :).
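The effect of seeding can be reproduced with a minimal sketch like the one below. The helper `draw_indices` and the value `m = 14` are hypothetical and only serve the illustration; they are not taken from `run_smo.py`.

```python
from random import randrange, seed


def draw_indices(m, count, seed_value=None):
    """Return `count` random indices in [0, m), optionally after seeding."""
    if seed_value is not None:
        # Seeding resets the generator state, so the following draws repeat.
        seed(seed_value)
    return [randrange(m) for _ in range(count)]


# Hypothetical sizes: 14 examples, 5 draws.
first_run = draw_indices(14, 5, seed_value=5)
second_run = draw_indices(14, 5, seed_value=5)
print(first_run == second_run)  # True: same seed, same sequence
```

Removing the `seed` call (or seeding with a different value) should give a different starting index from one run to the next.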

Best,

Tetsuya

Hello Alexandre,

Thank you so much for the book. I had read some books about SVM before, but your book is the most comprehensive and easy to understand for a beginner like me.

So far, I have read up to chapter 3, and here I have a question.

When one hyperplane is given, I think that there could be 2 w vectors which are orthogonal with respect to the hyperplane.

For example, in Figure 25, there could be 2 w vectors, as follows:

(1) w = (1, 1)

In this case, b is computed as below:

-b / w_{1} = 12

b = -12 * w_{1} = -12 * 1 = -12

As for the given data, I suppose:

x_{1} = (4, 6), y_{1} = -1 for the blue star

x_{2} = (8, 6), y_{2} = 1 for the red triangle

Using these data, the functional margins, f_{1} and f_{2} are computed as follows:

f_{1} = y_{1}(⟨w, x_{1}⟩ + b) = -1(⟨(1, 1), (4, 6)⟩ - 12) = -(4 + 6 - 12) = 2

f_{2} = y_{2}(⟨w, x_{2}⟩ + b) = 1(⟨(1, 1), (8, 6)⟩ - 12) = 8 + 6 - 12 = 2

Since both functional margins are positive, those points are correctly classified.

(2) w = (-1, -1)

In this case, b is computed as below:

-b / w_{1} = 12

b = -12 * w_{1} = -12 * -1 = 12

In the same way, the functional margins, f_{1} and f_{2} are computed as follows:

f_{1} = y_{1}(⟨w, x_{1}⟩ + b) = -1(⟨(-1, -1), (4, 6)⟩ + 12) = -(-4 - 6 + 12) = -2

f_{2} = y_{2}(⟨w, x_{2}⟩ + b) = 1(⟨(-1, -1), (8, 6)⟩ + 12) = -8 - 6 + 12 = -2

Since both functional margins are negative, those points are incorrectly classified.

From above considerations, I think w = (1, 1) is correct, and should be taken for Figure 25. Therefore, the functional margin of the data set for Figure 25 is +2. (Yes, I agree with you on this point.)

Then, let's compute the functional margin for Figure 26. In the same idea, there could be 2 w vectors for Figure 26.

(1) w = (-1, 1)

In this case, b is computed as below:

-b / w_{1} = 0

b = 0

So, the functional margins are computed as follows:

f_{1} = y_{1}(⟨w, x_{1}⟩ + b) = -1(⟨(-1, 1), (4, 6)⟩ + 0) = -(-4 + 6) = -2

f_{2} = y_{2}(⟨w, x_{2}⟩ + b) = 1(⟨(-1, 1), (8, 6)⟩ + 0) = -8 + 6 = -2

Since both functional margins are negative, those points are incorrectly classified. Thus, I think we shouldn't take the vector, w = (-1, 1) for Figure 26.

(2) w = (1, -1)

In this case, b is computed as below:

-b / w_{1} = 0

b = 0

In the same way, the functional margins are computed as follows:

f_{1} = y_{1}(⟨w, x_{1}⟩ + b) = -1(⟨(1, -1), (4, 6)⟩ + 0) = -(4 - 6) = 2

f_{2} = y_{2}(⟨w, x_{2}⟩ + b) = 1(⟨(1, -1), (8, 6)⟩ + 0) = 8 - 6 = 2

Since both functional margins are positive, those points are correctly classified. It means that we should take this vector, w = (1, -1) for Figure 26, as a correct vector, instead of w = (-1, 1). If this is true, the functional margin of the data set for Figure 26 is also +2. (Sorry, I can't agree with you on this point.)

This conclusion contradicts the description on page 43:

=====

Using this formula, we find that the functional margin of the hyperplane in Figure 25 is +2, while in Figure 26 it is -2. Because it has a bigger margin, we will select the first one.

=====

To arrive at the statement "while in Figure 26 it is -2", I think you chose w = (-1, 1) for Figure 26. So, here I have a question:

Could you possibly clarify why you chose w = (-1, 1) for Figure 26? Are there any criteria for choosing w, other than being orthogonal to the hyperplane?

I would be grateful if you could respond to my question.

Thanks a lot in advance.

Best regards,

Tetsuya
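The arithmetic in the comment above can be checked with a short sketch. It assumes the two points and labels exactly as the comment reads them off the figure (star = -1, triangle = +1); the function name `functional_margins` is illustrative and does not come from the book.

```python
import numpy as np

# Points and labels as given in the comment above.
X = np.array([[4, 6], [8, 6]])
y = np.array([-1, 1])


def functional_margins(w, b, X, y):
    """Functional margin y_i * (<w, x_i> + b) for every example."""
    return y * (X @ np.asarray(w) + b)


# Two (w, b) pairs that describe the same line x1 + x2 = 12.
margins_pos = functional_margins([1, 1], -12, X, y)
margins_neg = functional_margins([-1, -1], 12, X, y)
print(margins_pos)  # [2 2]
print(margins_neg)  # [-2 -2]
```

Negating both w and b leaves the hyperplane unchanged but flips the sign of every functional margin, which is why the orientation of w matters when comparing the two figures.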

Michael

Dear Alexandre,

Congratulations! Thank you for sharing the book. I've been waiting for months and finally got it today. I'm going to read through the book all the way today.

Best regards,

- Michael

Mike

Thank you for this book, I appreciate all the effort you put into it. It is easy to understand and very well written.

BlestG

Thank you so much 🙂 It can help me a lot.

and...

I hope your mother rest in peace. God bless her...

jacob

amazing.

thank u sir

alice

Alexandre,

Thanks for the great book on SVM. May I ask a question about Figure 26 on page 42? I do not understand why the examples are incorrectly classified. The hyperplane is x1 - x2 = 0, and the red point (8, 6) gets a predicted value of +1 because its value is 2, so it is correctly classified. Any explanation would be much appreciated. Thanks a lot.

Alexandre KOWALCZYK (Post author)

Hello Alice,

Yes indeed, it looks like there is a typo, and the two figures are switched.

Figure 26 correctly classifies the data, and Figure 25 does not.

Thanks for your message!

I used the code below to check:

    import numpy as np

    # We associate each vector x_i with a label y_i,
    # which can have the value +1 or -1
    # (respectively the triangles and the stars in Figure 13).
    # triangle = +1
    # stars = -1

    # Hyperplane of figure 25
    w = np.array([-1, -1])
    bias = 12
    print(np.dot([4, 6], w) + bias)  # returns 2 so gets classified as +1
    print(np.dot([8, 6], w) + bias)  # returns -2 so gets classified as -1

    # Hyperplane of figure 26
    w = np.array([1, -1])
    bias = 0
    print(np.dot([4, 6], w) + bias)  # returns -2 so gets classified as -1
    print(np.dot([8, 6], w) + bias)  # returns 2 so gets classified as +1

    # [8,6] is a triangle, so its value is +1,
    # it is correctly classified by figure 26
    # and it is incorrectly classified by figure 25

Md Humaun Rashid

Dear Alexandre KOWALCZYK

I don't know how to say thank you. The way you teach SVM is really fantastic. I didn't find any other tutorial better than your post. I also want to learn about other methods like logistic regression, random forests, etc. Do you also have tutorials for those? I am really interested.

Thank you. Wish you a happy life.

M S

I think one of the most difficult things is to be simple with simple things, and simple with difficult things. This is an art... and you are an artist 😉

A

Hi Alexandre,

I have really enjoyed your book so far; it is definitely the best tutorial on SVM I have seen.

But on page 59, I simply cannot understand how to compute the vector w.

I know that xi is a particular training example and is of dimension (m x 1), where m = number of training examples. The vector should be a (n x 1) dimensional vector, where n = number of variables, right?

If so, I can not understand how to compute the vector w. For example, how does the first element of w, w1, get computed?

Thanks a lot.

Alexandre KOWALCZYK (Post author)

w is computed as follows (code in Python):

    import numpy as np

    def compute_w(multipliers, X, y):
        # Sum multipliers[i] * y[i] * X[i] over all training examples
        return np.sum([multipliers[i] * y[i] * X[i] for i in range(len(y))], axis=0)

This code comes from this example.
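To make the dimensions concrete, here is a small usage sketch. It assumes `X` has shape (m, n) with one training example per row, and it uses a vectorized form of the same sum; the data and multiplier values are made up for illustration and do not come from the book.

```python
import numpy as np


def compute_w(multipliers, X, y):
    # w = sum over i of multipliers[i] * y[i] * X[i], written as a matrix product
    return (multipliers * y) @ X


# Hypothetical values: m = 3 examples, n = 2 features.
X = np.array([[4.0, 6.0], [8.0, 6.0], [2.0, 1.0]])
y = np.array([-1.0, 1.0, -1.0])
multipliers = np.array([0.5, 0.5, 0.0])

w = compute_w(multipliers, X, y)
print(w.shape)  # (2,): one component per feature
print(w)        # [2. 0.]
```

Each x_i is a vector with n components, so w also has n components. The first element w1 is the sum over all m examples of multipliers[i] * y[i] * X[i][0], here 0.5 * (-1) * 4 + 0.5 * 1 * 8 + 0 = 2.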

Kev

This piece here is incredibly amazing. Congratulations to you sir!!

Hermish

Thank you for the great tutorial! This is the best explanation of the maths behind SVM. I have one question: the SVM libraries usually return 1 or 0 for a classification problem with two classes (e.g. the LibSVM library). They do that by predicting the probabilities for each class and taking the class label that has a probability > 0.5. Can you tell me where this probability comes into play in the SVM maths? Thank you in advance.

LEE KOK CHONG

Thank you sir, very very good mathematical insight on SVM ..

Gordon

Hello. I'm currently reading your book. Your explanations are *coughs* succinct! I would, however, like to raise a potential typo as well as a question on a part where I'm kinda stuck.

On page 55, you stated that we could try to solve for the Lagrange function equating to 0 (line 5). However, you previously stated on page 54 that we want to solve for the *gradient* of the Lagrange function equating to 0, with the inverted triangle sign before the function (lines 10-11), which will give us the minimum of the Lagrange function. I believe there has been a typo.

In minimizing the Lagrange function on page 55, I would like to ask why we are minimising with respect to b. I understand we minimise with respect to w so as to minimise f(w), and maximise a as it is a coefficient of the term we are subtracting, but I don't understand why we minimise with respect to b. This was not intuitive to me. I hope for your reply and thanks for sharing your knowledge.

Much appreciation,

Gordon

Alexandre KOWALCZYK (Post author)

Hi Gordon,

Yes indeed, it looks like a typo on page 55: we want to solve for the gradient of the Lagrange function equating to 0.

We need to minimize with respect to b because the hyperplane is defined by the vector w and the bias b.
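As a sketch, assuming the standard hard-margin Lagrangian (the notation below is the usual one and may differ slightly from the book's), setting the gradient to zero gives one condition for each primal variable, w and b:

$$
\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]
$$

$$
\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{m} \alpha_i y_i = 0
$$

The second condition only appears because b is also treated as a variable of the minimization.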

Best regards,