This is the **Part 4 **of my series of tutorials about the math behind Support Vector Machines.

Today we are going to learn how to solve an unconstrained minimization problem.

If you did not read the previous articles, you might want to start the serie at the beginning by reading this article: an overview of Support Vector Machine.

## About this Part

It took me a while to write this article because the subject is vast and assume a lot of prior knowledge. What should I explain and what should I skip was kind of a hard line to trace. After a while, I ended up with a large Part 4 which was too long to read. So I decided to split it. Welcome, Part 4, Part 5 and Part 6!

In this article try to make it as simple as possible for everybody. However, I cannot explain everything. I will assume that you know what **derivatives **and **partial derivatives **are. You are also expected to know what a matrix, the transpose of a matrix are and how to compute the determinant of a matrix.

During the last few months, I received a lot of comments and encouragements and several hundred people subscribed to be notified when this part is published. I wish to thank all of you, and I hope you will enjoy reading it.

## Where we left.

In Part 3, we discovered that to maximize the margin we need to minimize the norm of .

It means we need to solve the following optimization problem:

Minimize in ()

subject to

(for any )

The first thing to notice about this optimization problem is that it has **constraints**. They are defined by the line which begins with "subject to". You may think that there is only one constraint, but there is, in fact, constraints. (this is because of the last line "for any"...)

"OK, How do I solve it? I have been waiting for this for one year !!!"

Before tackling such a complicated problem, let us start with a simpler one. We will first look at how to solve an unconstrained optimization problem, more specifically, we will study **unconstrained minimization**. That is the problem of finding which input makes a function return its minimum. (Note: in the SVM case, we wish to minimize the function computing the norm of , we could call it and write it ).

## Unconstrained minimization

Let us consider a point (you should read it "x star", we just add the star so that you know we are talking about a specific variable, and not about any ).

How do we know if is a local minimum of a function ? Well, it is pretty simple, we just need to apply the following theorem:

**Theorem**:

Let be a continuously twice differentiable function at .

If satisfies and is positive definite then is a local minimum.

(Proof, at page 11)

The hard truth with such a theorem is that although being extremely concise, it is totally impossible to understand without some background information. What is ? What is ? What do we mean by positive definite?

Sometimes, we will be given more informations, and the previous theorem can also be rephrased like this :

**Theorem (with more details)**:

If satisfies:

- has a zero gradient at :

and

- the Hessian of at is positive definite:

where

then is a local minimum.

### What does this all mean?

Let us examine this definition step by step.

**Step 1:**

Let be a continuously twice differentiable function at .

First, we introduce a function which we call , this function takes its values from a set (omega) and returns a real number. There is a first difficulty here because we do not state what is, but we will be able to guess it in the next line. This function should be continuous and twice differentiable, or the rest of the definition will not be true.

**Step 2:**

is a local minimum of if and only if:

We want to find a value to give to for it to produce its minimum. We simply name this value .

From the notation we can tell two things:

- is written in bold, so it is a vector. It means that is a multivariate function.
- As a result, the set we saw earlier is the set from which we pick values to give to . It means that the set is a set of vectors and ("x stars belongs to Omega")

**Step 3:**

has a zero gradient at

This one is the first condition which must hold if we want to be a local minimum of . We must check that the gradient of the function at is equal to zero. What is the gradient? Just think of it as a derivative on steroids.

Definition: "thegradientis a generalization of the usual concept of derivative of a function in one dimension to a function in several dimensions" (Wikipedia)

This definition gives us more pieces of information. A gradient is, in fact, the same thing as a derivative, but for functions like which take vectors as input. That is why we wanted to be a differentiable function in the first place, if it is not the case we cannot compute the gradient, and we are stuck.

In calculus, when we want to study a function, we often study the sign of its derivative. It allows you to determine if the function is increasing or decreasing and to identify minimum and maximum. By setting the derivative to zero, we can find the "critical points" of the function at which it reaches a maximum or a minimum. (You can read this excellent explanation if you want to refresh your memory). When we work with functions having more variable, we need to set each partial derivative to zero.

It turns out, the gradient of a function is a vector containing each of its partial derivatives. By studying the sign of the gradient, we can gather important pieces of information about the function. In this case, checking if the gradient equals zero for allow us to determine if is a critical point (and that the function possibly has a minimum at this point). (Note: Checking if the gradient equals zero at a point means checking that each partial derivative equals zero for this point)

The gradient of a function is denoted by the symbol (nabla).

The line

is just a repetition of " has a zero gradient at " in mathematical notation.

For a vector , means:

**Step 4:**

the Hessian of at is positive definite

That is where most people get lost. This single sentence requires a lot of backgrounds. You need to know:

- that the Hessian is a matrix of second-order partial derivatives
- how we can tell if a matrix is positive definite

### The Hessian matrix

The Hessian is a matrix, and we give it a name. We could call it but instead we call it which is more explicit. We keep the symbol used for the gradient, and add a to denote we the fact that this time we are talking about second-order partial derivative. Then we specify the name of the function () from which we will compute these derivates. By writing we know that takes a vector as input and that the Hessian is computed for a given .

To sum up, we need to compute a matrix called the Hessian matrix for .

So we take the function , we take the value of and we compute the value for each cell of the matrix using the following formula:

Eventually we get the Hessian matrix and it contains all the numbers we have computed.

Let us look at the definition to see if we understand it well:

Definition: In mathematics, theHessian matrixorHessianis a square matrix of second-order partial derivatives of a scalar-valued function.It describes the local curvature of a function of many variables. (Wikipedia)

(Note: A scalar valued function is a function that takes one or more values but returns a single value. In our case is a scalar valued function.)

#### Positive definite

Now that we have the Hessian matrix, we want to know if it is positive definite at .

Definition: A symmetric matrix is calledpositive definiteif , for all . (Source)

This time, we note that once again we were given the definition in the first place. It was just a little bit harder to read because of our notational choice. If we replace by and by we get exactly the formula written in the part 2. of the detailed theorem:

The problem with this definition is that it is talking about a symmetric matrix. A **symmetric matrix** is a square matrix this is equal to its transpose.

The Hessian matrix is square, but is it symmetric?

Luckily for us yes!

"if the second derivatives of are all continuous in a neighborhood , then the Hessian of is a symmetric matrix throughout " (Wikipedia)

But even with the definition, we still don't know how to check that the Hessian is positive definite. That is because the formula is for all in .

We can't try this formula for all in !

That is why we will use the following theorem:

**Theorem: **

The following statements are equivalent:

- The symmetric matrix is positive definite.
- All eigenvalues of are positive.
- All the leading principal minors of are positive.
- There exists nonsingular square matrix such that

(Source)

So we have three ways of checking that a matrix is positive definite:

- By computing its eigenvalues and checking they are positive.
- By computing its leading principal minors and checking they are positive.
- By finding a nonsingular square matrix such that .

Let's pick the second method and look at it in more details.

### Computing the leading principal minors

**Minors**

To compute the minor of a matrix we remove the line and the column, and compute the determinant of the remaining matrix.

Example:

Let us consider the following 3 by 3 matrix:

To compute the minor of this matrix we remove the line number 1 and the column number 2. We get:

so we compute the determinant of:

which is :

**Principal minors**

A minor is called a **principal minor** when .

For our 3x3 matrix, the principal minors are :

- ,

But that is not all ! Indeed, minors also have what we call an order.

**Definition**:

A minor of of order is principal if it is obtained by deleting rows and the columns with the same numbers. (Source)

In our previous example, the matrix is so and we deleted 1 line, so we computed principal minors of order 2.

There are principal minors of order , and we write for any of the principal minors of order .

To sum-up:

: does not exist because if we remove three lines and three columns we have deleted our matrix!

: We delete (3-1) = 2 lines and 2 columns with the same number.

So we remove lines 1 and 2 and column 1 and 2.

It means that one of the principal minors of order 1 is . Let us find the others:

We delete lines 2 and 3 and column 2 and 3 and we get .

We delete lines 1 and 3 and column 1 and 3 and we get

: is what we have seen before:

- ,

: We delete nothing. So it is the determinant of the matrix : .

#### Leading principal minor

**Definition**:

The leading principal minor of of order is the minor of order obtained by deleting the last rows and columns.

So it turns out leading principal minors are simpler to get. If we write for the leading principal minor of order k we find that:

(we deleted the last two lines and column)

(we removed the last line and the last column)

Now that we can compute all the leading principal minors of a matrix, we can compute them for the Hessian matrix at and if they are all positive, we will know that the matrix is positive definite.

We now have fully examined what we have to know, and you should be able to understand how to solve an unconstrained minimization problem. Let us check that everything is clear with an example.

#### Example:

In this example we will try to find the minimum of the function: which is known as the Rosenbrock's banana function.

First, we will search for which points its gradient equals zero.

So we compute the partial derivatives and we find:

(Tip: if you want to check your calculation you can use Wolfram alpha)

Our goal is to find when they are both at zero, so we need to solve the following system of equations:

We distribute to get:

We multiply by to obtain:

We now add and to get:

which simplifies into:

We substitute in

It looks like we have found the point for which . But is this a minimum?

The Hessian matrix is :

Let us now compute the Hessian for

The matrix is symetric, we can check its leading principal minors:

Minors of rang 1:

If we remove the last line and last column the minor is .

Minor of rang 2:

This is the determinant of the Hessian:

All the leading principal minors of the Hessian are positives. It means that the Hessian is positive definite.

The two conditions we needed are met, and we can say that the point is a local minimum.

### LOCAL minimum?

A point is called a **local minimum** when it is the smallest value within a range. More formally:

Given a function defined on a domain , a point is said to be a local minimum if there exists some such that within distance of . This is illustrated in the figure below:

A **global minimum**, however, is true for the whole domain of the function:

Given a function defined on a domain , a point is said to be a **global** minimum if

So all our hard work was just to find a local minimum, but in real life, we often want to find the global minimum...

### How can we find a global minimum?

There is one simple way to find the global minimum:

- Find all the local minima
- Take the smallest one; it is the global minimum.

Another approach is to study the function we are trying to minimize. If this function is **convex**, then we are sure its local minimum is a global minimum.

## Conclusion

We discovered that finding the minimum of a function is not so simple, and it was not even a global minimum. However, some functions, called convex functions are easier to work with. What is a convex function? Read the Part 5 of this tutorial series to find out! Thanks for reading.

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect with my readers.

kecai wua point x* is said to be a global minimum if f(x*)>=f(x) for all x in X? This x* is a global maximum!

Alexandre KOWALCZYKPost authorThank you. I corrected the typo.

Ravindra MThank you for writing this article. Explaining the math along the way intuitively and in detail.

Eric LThank you for the attention you pay for explaining every details. It's really helpfull. By the way, I'm not sure but is it normal that f partial derivative relative to x is exactly equal to f(x,y) ?

(2-x)^2+100(y-x^2)^2

Alexandre KOWALCZYKPost authorIndeed it was a typo. I fixed it. Thanks a lot.

LuyuCHENThank you for your excellent explanation about support vector machine! It is so helpful!!!!

NickIt's the best explanation about svm which I ever was seen!

lcmosasohiSorry I do not understand why we compute M22 in the given example if "minor of order kk obtained by deleting the last n−kn−k rows and columns." It seemed it just needed to compute M11 when delete the last 1 raw and columb ...

Alexandre KOWALCZYKPost authorHello. You are right, we do not need to calculate M22 as it is not a leading principal minor. I updated the article to avoid the confusion. Thank you for your comment 🙂

Yixin ChenHi, thanks for the article. It's pretty clear and thorough. Just on confusion I thought in your example of Rosenbrock's banana function, should M11 be 4801 instead of 200 because your try to remove the last line and column. Actually I think this is the same question as lcmosasohi put. Thanks.

Alexandre KOWALCZYKPost authorHello Yixin,

Yes indeed, you are correct. I updated the article.

Thank your for your comment.

Igor PetrovskiMan, this is amazing!

I am doing my master studies at ETH Zürich and this helps me a lot!

Do you have any other tutorials on some other models (Linear Regression, Logistic Regression, etc) ?

Cheers and thank you!

Alexandre KOWALCZYKPost authorHello Igor. Thank you for your comment. For the moment I only have tutorials about SVM. 🙂

gskDear Alex, you're doing great job here.Please continue the same.

If possible, Please make tutorials on Math behind on other machine learning algorithms as well. You're going to be life savior to many.

Thank you in advance.

SrinivasSir, when your book will be available? Are you also covering the kernels in that? eager to see the book.

张翔I hope to read more tutorial about Random forest, Logistic Regression, boosting etc. if you are free.

张翔Thanks very much!

This is the best SVM tutorial I met ever!!!

FelipeFirst of all, thank you very much for this article.

I have a question about hessian matrix, why did you say that the first element of the matrix is 4801? Shouldn't it be 1200×2²−400×4+2 = 3202?

Alexandre KOWALCZYKPost authorYes indeed, I corrected the formula and the value. Thanks a lot!

FelipeI'm happy to help. Your tutorial is the best. Thanks again.

MinseokI study neural network in korea university student, recently, I'm happy to study, because I read your lecture to SVM. You are best teacher for me. I hope that you lecture for Principal Component Analisis. thank you so much. if you will write to lecture, i must read to your lecture.

Nitin BansalBest Article on SVM. Thank You!

RavindraThanks for excellent explanation

Yi LonghaoThank you very much!

Paul NavinThis is the most intuitive and lucid explanation of math behind statistical model I've ever seen. Thank's a lot!

JayeshHi Alex,

Great write-up indeed. Helped me a lot.

Are you planning to take-up "regularizer terms" to avoid overfitting while maximizing margins?

Alexandre KOWALCZYKPost authorNot in this series. I talk about soft margin SVM in my ebook though.

MirekHello Alexandre,

thanks a lot for this tutorial.

I have one question regarding Face Recognition with Support Vector Machines. What is in this case the support vector (Xi): one pixel of the picture or the whole picture as a big matrix with all pixels? In the tutorials will often be used an example with 2D dots. Could you please explain how it works? Thanks!

regards

mirek

Alexandre KOWALCZYKPost authorIn this case Xi represent the whole picture as a vector with all the pixel. X is a matrix containing all the images (each image is a row).

MirekThanks a lot, I saw that for this case can be useful a Principal Component Analysis (PCA) in order to reduce number of pixels (dimention of feature vector).

KevinGreat great thanks to you, it is hard to understand all the slide in the lecture, u really make all the SVMs technical terms and concept much more straightforward!!! Thanks again

Mayank AgarwalThanks a lot for this awesome explanation. I had a doubt. You have taken w as a two dimensional vector(-a,1) and the margin is calculated as 2/|w| , |w| being sqrt(a*a + 1) . So does the final margin not depend on b? But how is that possible

Alexandre KOWALCZYKPost authorHello,

Even if b is not in the margin calculation, it is in the constraint of the minimization problem and as a result it has an impact on the optimal margin which will be selected.

I hope it helps you.

Raj MiglaniHi Alex,

This is the best source on the Internet to learn. I got your tutorial series link to begin with SVM and now I am loving each and every part of the Mathematics that you have explained so beautifully and in a really niche manner. Kudos to you mate you are doing a great job. First time the Mathematics seems to be interesting enough and its been a binge read(to be rightly called) of you articles on the SVM series for me. Expecting more of ML articles from you like these. I have read from a lot of resources online and I can bet that your lessons are above par.

JAYDude , Hats off to your hard work. A big thanks to you.

I would definitely love to read more of your ML explanations of all the other algorithms too , not only SVM.Do you have any more algos explained like this?

Alexandre KOWALCZYKPost authorHello, I am glad it helped 🙂 No I do not have any more algos explained for the moment ^^