This is the **Part 6** of my series of tutorials about the math behind Support Vector Machines.

Today we will learn about duality, optimization problems and Lagrange multipliers.

If you did not read the previous articles, you might want to start the serie at the beginning by reading this article: an overview of Support Vector Machine.

## Duality

In mathematical optimization theory,

dualitymeans that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem (the duality principle). The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem. (Wikipedia)

The concept of duality is pretty simple to understand if you know what a lower bound is.

### What is a lower bound?

If you have a partially ordered set (a set having comparable elements, where the relation "less than or equal" can be used), the lower bound is an element of which is less than or equal to every element of .

To be less abstract. If you pick a real number (from the partially ordered set ) and it is less than or equal to every element of a subset of , then you can call this element a lower bound.

Example:

Let us consider the subset of :

- Because 1 is less than or equal to 2, 4 ,8 and 12, I can say that 1 is a lower bound of S.
- The same is true for -3 for instance.
- And even if it is in S we can also call 2 a lower bound of S.

Moreover, because 2 is larger than any other lower bounds, we can give it a special name, we call it the **infimum** (or greatest lower bound).

So in our example, you can find an infinity of lower bound, but there is only one infimum.

**Note**: The same logic apply with the relation "greater than or equal" and we have the concept of **upper-bound** and **supremum**.

**Coming back to duality**

Now that we know what a lower bound is, what do we understand about the definition of duality? Well, this definition means that **if you have a minimization problem, you can also see it as a maximization problem**. And when you find the maximum of this problem, it will be a lower bound to the solution of the minimization problem, i.e. **it will always be less than or equal to the minimum of the minimization problem**.

## Why do we care about duality?

It turns out that sometimes, solving the dual problem is simpler than solving the primal problem. From what we saw about lower bounds, we can see that for some problems solving the dual problem gives us the same result as solving the primal problem! But when?

Let us look at a visual illustration.

In the schema above, imagine that in our primal problem, we are trying to minimize the function at the top of the graph. Its minimum is the point . If we search for a dual function, we could end up with the one at the bottom of the graph, whose maximum is the point . In this case, we clearly see that is a lower bound. We defined the value and call it the **duality gap**. In this example, 0" /> and we say that **weak duality holds**.

In the schema below, we see that , there is no duality gap, and we say that **strong duality holds**.

## Optimization problems with constraints

#### Notation

An optimization problem is typically written:

This notation is called the standard form. You should know that there are others notations as well.

In this notation, is called the **objective function** (it is also sometimes called the **cost function**). By changing (the optimization variable) we wish to find a value for which is at its minimum.

There is also functions which define equality constraints and functions which define inequality constraints.

The value we find **MUST** respect these constraints!

### What does it mean to respect the constraints?

Imagine you try to solve the following optimization problem:

There is no constraint, so finding the minimum is easy, the function is 0 when . This is shown with a red star on the graph below:

#### Equality constraints

However, what if we try to add an equality constraint? For instance, we want to find the minimum, but we must ensure that . It means we need to solve this optimization problem:

This time, when we try we see that the function returns its minimum, however, we cannot say this is the solution of our optimization problem. Indeed, the constraint is violated. In this example, our only choice is to use and this is the solution.

Looking at this example, you might feel like equality constraints are useless. This is not the case because most of the time optimization problems are performed in more than one dimension. So you could try to minimize a function with only an equality constraint on $x$ for instance.

#### Inequality constraint

What if we now use an inequality constraint? It gives us a problem of the form:

This time, we can try more value of . For instance, respects the constraint, so it could potentially be a solution to the problem. In the end, we find that the function has once again its minimum at under the constraint.

On the graph above, the **feasible region **is shown in black bold it is a set of values of we are authorized to use**. **It is also called, the** feasible set**.

In mathematical notation we can write it:

is the set of values of for which the constraints are satisfied.

#### Combining constraints

It is possible to add several constraints to an optimization problem. Here is an example with two inequality constraints and its visual representation:

=1\\&&&x<=2\end{aligned}" />

Note that we can also mix equality and inequality constraints together. The only restriction is that if we use contradictory constraints, we can end up with a problem which does not have a feasible set:

Indeed, for a constraint to be respected, it must be true. When there is several constraints, they **all** must be true. In the example above, it is impossible for to equal 1 and to be less than zero at the same time.

For a function with one equality constraint and one inequality constraint the feasible set is :

### What is the solution of an optimization problem?

The optimal value of the optimization problem:

is:

Basically, it says in mathematical notation that the optimal value is the **infimum **of with all the constraints respected.

## How do we find the solution to an optimization problem with constraints?

We will be using the Lagrange Multipliers. It is a method invented by the Italian mathematician, Joseph-Louis Lagrange around 1806.

### Lagrange multipliers

As often, we can find a pretty clear definition on Wikipedia:

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints. (Wikipedia)

The critical thing to note from this definition is that the method of Lagrange multipliers **only works with equality constraints**. So we can use it to solve some optimization problems: those having one or several equality constraints.

But don't worry, the Lagrange multipliers will be the basis used for solving problems with inequality constraints as well, so it is worth understanding this simpler case 🙂

Before explaining what the idea behind Lagrange multipliers is, let us refresh our memory about contour lines.

### Contour lines

It is important to understand contour lines to appreciate better Lagrange multipliers.

A contour plot is a way to visualize a three-dimensional function into two dimensions.

For instance, the picture below shows the function displayed in 3D (left) and with a contour plot (right):

**Key concepts regarding contour lines**:

- for each point on a line, the function returns the same value
- the darker the area is, the smallest the value of the function is

You can see the last point illustrated in the two figures below. Even if they have the same lines, the change of color allows us to draw a 3D picture of the function in our mind.

Moreover, the gradient of a function can be visualized as a vector field, with the arrow pointing in the direction where the function is increasing:

It turns out we can easily draw gradient vectors on a contour plot:

- they are perpendicular to a contour line
- they should point in the direction where the function is increasing

Here is a representation on two contours plots:

### Back to Lagrangian multipliers

Let us consider the following optimization problem:

The objective function and the constraint function can be visualized as contour in the figure below:

It is interesting to note that we can combine both contour plot to visualize how the two functions behave on one plot. Below you can see the constraint function depicted by blue lines.

Also, I draw some gradient vectors of the objective function (in black) and some gradient vectors of the constraint function (in white).

However, we are not interested in the whole constraint function. We are only interested in points where the constraint is satisfied, i.e. .

It means that we want points where:

In the graph below we plot the line on top of the objective function. We saw earlier that this line is also called the **feasible set**.

What did Lagrange find? He found that the minimum of under the constraint is obtained **when their gradients point in the same direction**. Let us look at how we can find a similar conclusion.

In the figure below, we can see the point where the objective function and the feasible set are tangent. I also added some objective function gradient vectors in black and some constraint gradient vectors in white.

We can see that the is only one point where two vectors point in the same direction: it is the minimum of the objective function, under the constraint.

Imagine you put your finger on the first point with arrows at the top of the figure. Here you evaluate the objective function, and you find out it returns some value . Now, you wish to know if there is a smaller value. You must stay on the blue line, or the constraint will be violated. So you can either go to the left or the right.

You would like to go left, but then you see that the gradient of the objective function is pointing towards the left (remember it is always pointing towards greater values) so it might not be a good idea. To be sure, you try and go to the left. You find out that the objective function returns , so it is increasing which is not okay. Going to the right, this time, the objective function returns , it is good you are going in the correct direction. So you continue like that until you reach the center point where the two arrows are parallel. But then, once you move a little bit you find out that the objective function is increasing again. As you can not move right or left anymore without increasing the objective function, you conclude this point is the minimum.

#### How does it translate mathematically?

Lagrange told us that to find the minimum of a constrained function, we need to look or points were .

But what is and where does it come from?

It is what we call a **Lagrange multiplier**. Indeed, even if the two gradient vectors point in the same direction, they might not have the same length, so there must be some factor allowing to transform one in the other.

Note that this formula does not require the gradients to have the same direction (multiplying by a negative number will change the direction), but only to be parallel. It is because it can be used to find both maximum and minimum at the same time (see Example 1 of this article if you want to see how).

#### How do we find points for which ?

Note that, is equivalent to:

To make things a little easier, we notice that if we define a function:

then its gradient is:

This function is called the Lagrangian, and solving for the gradient of the Lagrangian (solving ) means finding the points where the gradient of and are parallels.

### Let us solve this example using the Lagrange multiplier method!

Remember, the problem we wish to solve is:

#### Step 1: We introduce the Lagrangian function

and its gradient is :

#### Step 2: We solve for its gradient

We solve :

which means solving the following system of equations:

By multiplying the second equation by -1 and adding the first and second equation we get:

so:

We replace by in the third equation:

We replace by in the first equation:

We have found a solution : , and

We conclude that the minimum of under the contraint is obtained for , .

Let us verify if it makes sense with our graphical analysis on the graph below:

We can see that indeed the point is on the feasible set (the blue line) and it is the one we identified earlier.

## Conclusion

In this article, we learned an important concept in optimization: duality. Moreover, we discovered that an optimization problem can have equality and inequality constraints. Eventually, we learned what Lagrange multipliers are and how we can use them to solve an optimization problem with one equality constraint.

If you wish to learn more about Lagrange multipliers in the context of SVM, you can read this very good paper showing how to use them with more equality constraints and with inequality constraints.

### When will next part be ready?

There is no next part, but instead I wrote a free e-book about Support Vector Machines. That's your next step if you wish to learn more about SVM!

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect with my readers.

Great.. Pertamax gan... Anyway, do you have a plan for writing about non-linearly separable hyperplane and kernel function? Or it would be written in your book?

It will be in the book yes, there is a chapter about soft-margin and another about kernels. However, maybe I will continue this series once the book is over because sometimes I cannot go into too many details in the book.

Hi, could you explain about primal and dual problem of SVM?

I will explain it in my upcoming ebook 🙂

a super awesome piece of work!!

i want to ask you about one-class SVM methods. are they included on your book? if not, i was wondering if you could tell me. i have known 2 methods (SVDD & v-SVC). thanks in advance

Sir, a very nice piece of work.

Tried to understand SVM theory from many sources but failed to grasp details and logics,

but here they are explained very clearly and simplified logics which follow mathematics and logic transparently.

Please publish complete theory of duality and langranges which follow in SVM theory explaining the equations in SVM duality theory.

The best written article I've come across regarding svm. Thank you for preventing us from being caught up in the jargon of difficult mathematical notations.

Beautifully explained. Even a monkey could understand it.

Many thanks.

How to set an alert once your book is ready to sell, i definitely will buy your work.please advice, thanks

Hello, you just need to put your mail and subscribe at the end of the article. Moreover, the book will be free 🙂

For the example under Lagrange multiplier method

Just sharing for those who mb lost in this step

In order to used the wolframalpha, replace λ to z.

∇L(x,y,λ)=∇f(x,y)−λ∇g(x,y)

∇L(x,y,λ) = (x^2+y^2) - z( x+y−1)

∇L(x,y,λ) = (x)^2+(y)^2-zx-zy+z

Then, we get

(d)/(dx)(x^2-z x+y^2-y z+z) = 2 x-z

(d)/(dy)(x^2-z x+y^2-y z+z) = 2 y-z

(d)/(dz)(x^2-z x+y^2-y z+z) = -x-y+1

Check using

https://www.wolframalpha.com/input/?i=differentiate+(x)%5E2%2B(y)%5E2-zx-zy%2Bz

Thanks a lot 🙂

In many references, they will use α instead of λ and the constraint will be sigma of all N input vectors. What does it mean? I think this is the next step of your post, but I can't wait. I need your help for explaining it. Thanks

Hello,the sigma or lambda is just a notational choice. The constraint being sigma of all N input vectors means there are N constraints 😉

Beautifully explained. Do you suggest a Data Scientist do a course on optimization? Does it help?

I think it can be a good idea yes. I recommend you the following textbook if you are interested to learn more about convex optimization.

could you introduce something about the using of kernal method, and show us different kinds of these in your e-book? your tutorial helps me a lot.

Yes it will be included 😉

great tutorial! thanks for your great work, looking forward the book. thanks.

Terrific work !

When is the book releasing ?

I've already subscribed

I don't know yet. It's 95% finished, but sometimes the last percents are the hardest 🙂

Great tutorial! Looking forward to your e-book!

Sir, How the perpendicular vector w is minimized so we get maximized margin. In some line you say w is a unit vector. does it value is one.? if yes then in 2/|w|, w is already unit so what is need of minimization. I'am little confuse from basics. could you tell me why we always take w is unit vector. when we have different data set in every training and the distance of support vector to plane is not always 1. so how we take it equal to 1 and how to normalize it. just tell me the basic svm? How to do training and how the w is normalized to 1. Thank you very much for your time.

w is not equal to 1. It is the norm of w which is equal to 1. There are a lot of vectors which can have a norm of 1.

Thank you very much for this article. Could you tell me when the ebook will be posted?

There is no definitive date yet. I still need to finish it, then send it to the editor which will review it... You can subscribe to the mailing list to be notified when it is available.

Just subscribed. Thanks for this great set of tutorials. I assume the book is not yet ready?

Not yet indeed, but I can say I finished most of the writing. Now, I need to focus on the editing, and to read it again and correct/improve things if necessary.

Looking forward to it

Wonderfully clear and with very helpful images. You are an excellent teacher. I really look forward to the ebook! Thanks, Ned.

Wonderfully explained, thanks a lot. Looking forward to the e-book

comprehensive and accessible explanation. Great,many thanks. As everybody looking forward to the book.

Hoping for the SVR tutorial... the Alex J's tutorial paper is too long for me to read.

Hi Wu, thanks for your comment. I do not plan on writing a tutorial for SVR for now. I will take a look at Alex J's tutorial though 😉

Really a nice and clear tutor and helps me understand many intricate notations from my text book.

This tutorial series is very well written. Looking forward to the eBook.

Perfect quantity of details in the explanations.

It's not too long but it has all the necessary basis of undestanding.

If the book follow in that way, it's gonna be awesome.

Subscription done!!

Thank you for this tutorial, it is amazing!!

I look forward to the eBook as well! Thank you!

Hi Alex,

Any update on the Book.

Waiting eagerly to read it.

Hello. It is in the process of being reviewed, so it will come out soon 🙂

Hey Alexandre,

Excellent set of articles. Really appreciate the effort you have put into this.

I know your ebook is being reviewed. If at all possible, can you send me the draft?

Thanks!

Hello,

Unfortunately it is not possible. Sorry!

Alexandre

Thank you Alexandre. I know it's time consuming to write blogs. So TY for everything.

Hi Alexandre ,

Thanks for the great tutorial. I have some doubts cam you please help .

1. Lagrange multipliers can be used only between the gradients of an objective function f and an "equality" constraint g. But in the case of SVM our constraint is >=1.Can we ignore the > part and just roll with =,since points beyond that margin of =1 will give always anyways give us a g(x)>1.

2. Now given a labelled data set we dont know its f(x) , so how do we( or libraries) first find an approximate f(x) that represents our data, then take its contour plot projection and then solve for , Lagrange multipliers and minima?

3.To get an inequality constraint we need to start with some valid hyper plane to plot and then play with the parameters(w) to see which is the best hyperplane. How is the first plane determined.

Sorry if I my Qs are too stupid.I am just trying hard to imagine step by step real time and these doubts pop-up,i could be missing something obvious.

Hello,

1. To solve for SVM we do not ignore the > part of course. Instead we use the KKT conditions which works for inequality constraints. You can think of the KKT as a more general version of the Lagrange multipliers. As a result the multipliers used to solve an SVM optimization problem should be called "KKT multipliers" but people often call them "Lagrange multipliers". I talk about them in more details in my ebook.

2. We do not care about its contour projection to solve for the Lagrange multiplier. It is just a useful visualization technique. In the case of SVM we don't know the function f of the labelled data set, and our goal is to find it. That is why we introduce the optimization problem, which contains another function (the one maximizing the margin). Let us call it g. We know this function g and we can then find its minima using the KKT conditions. From this result, we can construct f 🙂

3. What you say is true for an algorithm like the Perceptron learning algorithm. It start either with a vector w at zero or with random values. The main drawback of this algorithm is that is gives us a different hyperplane every time. As you can see in my answer for question 2. your assumption "To get an inequality constraint we need to start with some valid hyper plane" is incorrect for SVM. Because we use another function to determine f we get the inequality constraints and find the hyperplane for f right away. AS a result the first (and only) plane we find is the optimal one.

Your questions are pretty pertinent, you wonder how to use the knowledge in this article to solve for SVMs and that is exactly what you should be doing when trying to understand this.

I did explain a lot more about how to solve the SVM optimization problem in my ebook.

It is still being reviewed but we are making very good progress on it and the review should be over soon. I hope it will help clear things in your mind.

Alexandre

Thanks a lot Alexandre , cant wait for the e-book. How do i know when its released , i mean how do i ensure i dont miss your e-book.

You can subscribe at the end of this article.

Hi Alexandre,

Did the ebook published ?

Thank you for your tutorial.

Hello. The ebook has been reviewed and will be published soon. (I don't have a date yet :)). It will announce when it is ready.

Thank you.

And I think you should add a full example (for all step) in this tutorial for easy understanding (sorry for my english). EX: there is (-1, 1), (-2, 2) , (-1, 3) for label -1, (2, 1), (3, 2) , (4, 3) for label 1

I have been read all 6 parts of the tutorial but i don't realy know what exactly w, x are ?

Nice job! It helps a lot! Thanks a lot!

Hi Alexandre

Thank you for this Great tutorial series. Eagerly looking forward for your book. Is it done? Any tentative date as of yet? Sorry for being too greedy!!

No date yet. But I asked my editor for more informations 😉

I have to say that you explain everything in details and in a really understandable way.

Many thanks

very good explanation great effort ..... like a blessing for the beginner ..keep it up

waiting remaining part ...

Alexandre,

section "Optimization problem with constraints", the link has broken in the following piece of text:

"This notation is called the standard form. You should know that there are others notations as well."

Keep up with book!

There is something incorrect when you describe "optimal value" of x in optimization problem:

==

The optimal value $x^∗$ of the optimization problem:

$x^* = \text{inf}\{f(x) \mid g_i(x)=0,\; i=1,\dots,p,\;h_{i}(x)\leq 0, i=1,\dots,m\}$

==

Actually, it should not be $x^*$, but $p^*$. Instead, the optimal value of x is a feasible point while $f(x) = p^*$

Hello, Xiao-Feng,

I say that the optimal value of the optimization problem is called $x^*$. Choosing to call it $x^*$ or $p^*$ is just a matter of notation.

Now it's July 10, was the ebook released? I haven't received the email. Thanks!

Hello, it will be released on August 07!

Oh, it was my fault. Very much looking forward to your excellent SVM eBook.

okay! thanks for the update

Eagerly waiting for your book Alex.

Hi Alexandre , thanks again for your great tutorial, i subscribed for the E-book, but seems i have missed it.Please what can i do to have a copy. THANKS

Is the book released ? waiting to read kernel part

Hello, the book publication is currently delayed because of a problem with image equations. It should be published in a week or two. I do not have a precise date as this is in the hand of the editor.

Is there any update about the ebook? Really look forward to that!

The situation is still blocked for now. I hope it will become available soon.

My learning schedule related to your tutorial I am looking forward your ebook to catch up my learning progress

Trust me, now I understand how SVM works...!!!

Can't thank you enough for explaining such complex things so wonderfully...!!!

Eagerly waiting for the ebook...I really hope it get released in a week or two..!!!

Thank you so much!

Great job! Never find an easier tutorial for SVM. Thank you.

This has been excellent piece of work in explaining SVM algorithm and maths behind that. Thank you so very much !!!

Great post.

Please keep doing.

It's the most beautifully written blog on SVM.

Clearly shows you love and passion for SVm.

Can't wait to finish the book as well.

Thank you so much!

Great article, but there is a error, Joseph-Louis Lagrange is not an Italian mathematician, he is a French mathematician. I'm a mathematical student is China, and I highly admire your work in SVM! Thanks for you pretty work!

Hello, thank you for your comment. Lagrange was born in Italy and became French later in his life. I was not sure which nationality to give him so I used the one from wikipedia which say he was Italian. 😉

Sir,you are explanation was really really awesome sir..Thanks for the explanation sir.

Great Work

Loved the way you explained math. Graphically it becomes very easy.

Hi, Alexandre, thank you so much. I've read many materials about SVM, but I got frustrated every time I try to understand those mathematical formulas. But your tutorial is so fascinating and easy to understand, I finished it without a break. Thanks again, I will continue reading your Ebook 🙂

Hello,

Can we write a svm from scratch in python without using the cvxopt? In your book you have mentioned a code which uses cvxopt. I aim to write it without using cvxopt.

Expecting a reply.

Hello,

Yes you can. That is what is done in the following chapters in the book. Code Listing 55 is an implementation of the SMO Algorithm in Python and it solves the SVM optimization problem.

Best regards,

Thank you so much. Such an intuitive explanation of Lagrange multipliers. I’ve been using the for about a year now in my ML course and only just now have I really grokked them. Thank you again!!! 🤓

Wonderfully explained, thanks a lot. Now I understand a SVM a little bit. Can you please send me the link of the ebook.

Really good work sir

But quadratic programming how to find manually lagrange multipliers ?

I searched many youtube videos and documents

Can't find

Hello,

This paper will be very helpful for you 🙂

Hello, thanks for the explanation. It is very straightforward. But I have a basic question. I don't understand why you compute the value of the langrange multiplier at the end. Do we just need the values for x and y? What is reason for computing the langrange multiplier lambda? Has it some meaning in this sample?

In this example it is not really that useful. We compute it just for the sake of solving the system of equation completly. However Lagrange multipliers are importants for SVM because they are used in the SMO algorithm. You will find all the details in my ebook.

Best regards,

Amazing explanation, I can't thank you enough !

However, I have some confusion regarding the Lagrange multiplier ''lambda''. How do we say that the gradient of the f function is supposed to be equal to the gradient of the constraint function g, so they are parallel. Then, while representing that mathematically we have multiplied the g gradient by a scale. Doesn't that make the f and g gradients unequal from the beginning ? Is it like we have forced them to be equal ?

What we say is that we have two gradients f and g, and there exist a value lambda so that the gradient of f is equal to lambda times the gradient of g. Lamba could be all 1, so f and g could be equals.

This is an amazing post - thank you.