Lecture Notes in Pattern Recognition: Episode 22 – Norm-dependent Gradients

These are the lecture notes for FAU’s YouTube Lecture “Pattern Recognition”. This is a full transcript of the lecture video and matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!

Welcome back to pattern recognition. Today we want to look a bit more into optimization, and the topic today is the actual update direction. We’ve seen the gradient descent methods in the previous video, and today we want to share a couple of thoughts on how to pick the particular update direction.

Image under CC BY 4.0 from the Pattern Recognition Lecture

There are different kinds of steepest descent methods, and even for the normalized ones we might want to consider which update direction we actually choose. What we want is the largest decrease in the linear approximation of f. Technically, we can constrain the update direction to the unit ball of an Lp norm. So here we search for the update direction as the minimizer over all u that have a length of one according to our norm, and the quantity we minimize is the projection of the gradient onto u, that is, the inner product of the gradient with u. We observe that if we select the update direction by this kind of optimization, then the steepest direction is not necessarily the negative gradient direction. It may depend on the chosen norm. So the negative gradient is not necessarily the best choice for the search direction.
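Written as a formula, the constrained search described above can be sketched as follows. The notation is our own assumption and not taken from the slides: Δx stands for the normalized steepest descent direction and ‖·‖ for the chosen norm.

```latex
\Delta x \;=\; \operatorname*{argmin}_{u \,:\, \|u\| = 1} \; \nabla f(x)^{\mathsf{T}} u
```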

Let’s look into an idea of how to perform this, and here we want to think about a linear approximation. We consider the first-order Taylor approximation of f(x+u). At the selected position x, we can approximate f(x+u) as f(x) plus the inner product of the gradient of f at x with the direction u. This inner product of the gradient and the direction u is the directional derivative of f at x in direction u. The vector u denotes a descent direction if its inner product with the gradient vector is negative, that is, smaller than zero.
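In symbols, the approximation and the descent condition from this paragraph read as below. This is only a compact restatement, using the transpose notation for the inner product as an assumption about how the slides write it.

```latex
f(x + u) \;\approx\; f(x) + \nabla f(x)^{\mathsf{T}} u,
\qquad
\nabla f(x)^{\mathsf{T}} u < 0 \;\;\Longrightarrow\;\; u \text{ is a descent direction}
```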

This gives rise to a new steepest descent method. We again have the function and the initial estimate, but now we also have a norm. We initialize k to zero, and the first thing we do is compute the update direction. Here it is not just the negative gradient. Instead, we compute the steepest descent direction according to our norm. Then we do the 1D line search, update x with the step size it returns, and iterate until we have converged, meaning that there is only a little change in x.
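The loop just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function names are hypothetical, the 1D line search is a simple backtracking search with made-up parameters `alpha` and `beta`, only the two norms discussed in this lecture (L₂ and L₁) are covered, and the quadratic test function is invented for the example.

```python
import math

def steepest_direction(grad, norm="l2"):
    """Unit-norm direction u that minimizes the inner product <grad, u>."""
    if norm == "l2":
        # For the L2 norm this is the normalized negative gradient.
        length = math.sqrt(sum(g * g for g in grad))
        return [-g / length for g in grad]
    if norm == "l1":
        # For the L1 norm we move only along the coordinate with the
        # largest absolute partial derivative.
        i = max(range(len(grad)), key=lambda j: abs(grad[j]))
        u = [0.0] * len(grad)
        u[i] = -1.0 if grad[i] > 0 else 1.0
        return u
    raise ValueError("norm must be 'l2' or 'l1'")

def line_search(f, x, u, slope, t=1.0, alpha=0.3, beta=0.5):
    """Backtracking 1D search (an assumption): halve t until f decreases enough."""
    fx = f(x)
    while t > 1e-12 and f([xi + t * ui for xi, ui in zip(x, u)]) > fx + alpha * t * slope:
        t *= beta
    return t

def steepest_descent(f, grad_f, x0, norm="l2", tol=1e-8, max_iter=1000):
    x = list(x0)
    for k in range(max_iter):
        g = grad_f(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break  # gradient has vanished, no descent direction left
        u = steepest_direction(g, norm)
        slope = sum(gi * ui for gi, ui in zip(g, u))  # directional derivative, negative
        t = line_search(f, x, u, slope)
        x_new = [xi + t * ui for xi, ui in zip(x, u)]
        change = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if change < tol:
            break  # only a little change in x: converged
    return x

# Hypothetical test function with its minimum at (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

x_l2 = steepest_descent(f, grad_f, [0.0, 0.0], norm="l2")
x_l1 = steepest_descent(f, grad_f, [0.0, 0.0], norm="l1")
```

Both runs should end up close to the same minimizer. Only the path differs: with the L₁ norm every step moves along a single coordinate axis.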

Image under CC BY 4.0 from the Pattern Recognition Lecture

Now let’s have a look at different norms. Let’s look at the unit ball for the L₂ norm. Here we have indicated the negative gradient direction. This is our unit ball, and now we look at all of the directions u on it and seek the direction u onto which the negative gradient has the largest projection. If we think about that, we can see that it is exactly the negative gradient direction. So for the L₂ norm, our statement that the negative gradient direction is the direction of steepest descent is actually true.
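The geometric argument for the L₂ case can also be checked with the Cauchy-Schwarz inequality. The short derivation below is a sketch in our own notation and assumes that the gradient is not zero.

```latex
\nabla f(x)^{\mathsf{T}} u \;\ge\; -\,\|\nabla f(x)\|_2 \, \|u\|_2 \;=\; -\,\|\nabla f(x)\|_2
\quad \text{for all } \|u\|_2 = 1,
\qquad \text{with equality iff } u = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}
```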

Image under CC BY 4.0 from the Pattern Recognition Lecture

Now let’s consider other norms. We’ll start with the L₁ norm. In the L₁ norm we have a different unit ball, which looks like this. Again we vary u over the entire boundary of our unit ball and look for the direction that produces the maximum projection of the negative gradient. We see it lies here. So this is the longest projection of our negative gradient direction onto the unit ball of the L₁ norm. This is interesting because we end up exactly on one of the coordinate axes. Now let’s look at a second example. Here we see that if the negative gradient pointed in this direction, we would again end up with a projection onto one of our coordinate axes.
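The observation that the L₁ search ends on a coordinate axis can be checked numerically in 2D. The following sketch samples the boundary of the L₁ unit ball (a diamond) and picks the sampled point with the smallest inner product with the gradient. The function names and the two example gradients are hypothetical and chosen only for illustration.

```python
def l1_unit_boundary(samples_per_edge=100):
    """Points on the boundary of the 2D L1 unit ball, including its four corners."""
    points = []
    for k in range(samples_per_edge):
        s = k / samples_per_edge
        a, b = 1.0 - s, s  # a + b = 1, so |a| + |b| = 1 on every edge
        points += [(a, b), (-b, a), (-a, -b), (b, -a)]
    return points

def steepest_l1_direction(grad):
    """Brute-force search: boundary point with the smallest inner product with grad."""
    return min(l1_unit_boundary(), key=lambda u: grad[0] * u[0] + grad[1] * u[1])

# Two made-up gradients; the negative gradients point up-left and down-left.
u_first = steepest_l1_direction((1.0, -3.0))
u_second = steepest_l1_direction((2.0, 0.5))
```

In both cases the search selects a corner of the diamond, that is, a unit step along the coordinate axis whose partial derivative has the largest absolute value, with the sign chosen opposite to that partial derivative.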