Lecture 10. Convergence Rate of Gradient Descent


10.1 Gradient flow

At the end of the last lecture, we obtained the following lemma on the gradient descent for smooth functions.

Lemma (Descent lemma)

For an $L$-smooth differentiable function $f$ (not necessarily convex), and $0 < \eta \le 1/L$, we have
$$f(x - \eta\nabla f(x)) \le f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2.$$

However, it is still not easy to show the convergence rate for the gradient descent. We now introduce a continuous version of the gradient descent instead, which is easier to analyse.

Definition (Gradient flow)

A gradient flow is a curve following the direction of steepest descent of a function. Given a smooth convex function $f$ and a point $x_0 \in \mathbb{R}^n$, the gradient flow of $f$ with initial point $x_0$ is the solution $X(t)$ to the following differential equation
$$\dot X(t) = -\nabla f(X(t)), \qquad X(0) = x_0.$$
Here we use the notation $\dot X(t) = \frac{\mathrm{d}}{\mathrm{d}t}X(t)$ for convenience.

Applying the chain rule, $f(X(t))$ is decreasing since
$$\frac{\mathrm{d}}{\mathrm{d}t} f(X(t)) = \langle \nabla f(X(t)), \dot X(t)\rangle = -\|\nabla f(X(t))\|^2 \le 0.$$
Now we can take the derivative of $\frac12\|X(t) - x^*\|^2$:
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\frac12\|X(t) - x^*\|^2 = \langle X(t) - x^*, \dot X(t)\rangle = -\langle X(t) - x^*, \nabla f(X(t))\rangle \le -\big(f(X(t)) - f(x^*)\big)$$
by convexity. Then integrating both sides, we obtain that
$$\frac12\|X(T) - x^*\|^2 - \frac12\|x_0 - x^*\|^2 \le -\int_0^T \big(f(X(t)) - f(x^*)\big)\,\mathrm{d}t,$$
which further gives that
$$f(X(T)) - f(x^*) \le \frac1T\int_0^T \big(f(X(t)) - f(x^*)\big)\,\mathrm{d}t \le \frac{\|x_0 - x^*\|^2}{2T},$$
where the first inequality uses that $f(X(t))$ is decreasing.
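
To make the bound concrete, here is a small numerical sketch (my own illustration, not part of the lecture): it approximates the gradient flow for the simple quadratic $f(x) = \frac12\|x\|^2$ by forward Euler with a very fine step, and checks the inequality $f(X(T)) - f(x^*) \le \|x_0 - x^*\|^2/(2T)$ derived above.

```python
import numpy as np

# Illustration (not from the lecture): approximate the gradient flow for
# f(x) = 0.5 * ||x||^2 by forward Euler with a very small step, and check
# the bound f(X(T)) - f(x*) <= ||x0 - x*||^2 / (2T) derived above.
def grad(x):
    return x  # gradient of f(x) = 0.5 * ||x||^2

x0 = np.array([3.0, -4.0])
T, dt = 5.0, 1e-4
x = x0.copy()
for _ in range(int(T / dt)):
    x = x - dt * grad(x)          # fine Euler step approximating dX/dt = -grad f(X)

f_T = 0.5 * np.dot(x, x)          # f(X(T)); here f(x*) = 0 with x* = 0
bound = np.dot(x0, x0) / (2 * T)  # ||x0 - x*||^2 / (2T)
print(f"f(X(T)) - f* = {f_T:.6f} <= bound {bound:.6f}")
```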

10.2 Convergence of gradient descent with smoothness

We compare the gradient descent with the gradient flow. Assume the gradient descent iterates with a fixed step size $\eta$ and an initial point $x_0$, i.e.,
$$x_{t+1} = x_t - \eta\nabla f(x_t).$$
For the gradient descent, $\frac{x_{t+1} - x_t}{\eta} = -\nabla f(x_t)$. For the gradient flow, $\dot X(t) = -\nabla f(X(t))$. Intuitively we know that, if $\nabla f$ does not change too fast, the gradient descent approximates the gradient flow.

Theorem

Suppose $f$ is a convex and $L$-smooth function. Choose $\eta = 1/L$, and let the gradient descent iterate with a fixed step size $\eta$. Then it holds that
$$f(x_t) - f(x^*) \le \frac{L\|x_0 - x^*\|^2}{2t}.$$

Proof

Analogously to the gradient flow, we calculate $\|x_{t+1} - x^*\|^2 - \|x_t - x^*\|^2$. Since $x_{t+1} = x_t - \eta\nabla f(x_t)$,
$$\|x_{t+1} - x^*\|^2 - \|x_t - x^*\|^2 = -2\eta\langle\nabla f(x_t), x_t - x^*\rangle + \eta^2\|\nabla f(x_t)\|^2.$$
By convexity, $\langle\nabla f(x_t), x_t - x^*\rangle \ge f(x_t) - f(x^*)$, and by the descent lemma, $\eta^2\|\nabla f(x_t)\|^2 \le 2\eta\big(f(x_t) - f(x_{t+1})\big)$. Hence
$$\|x_{t+1} - x^*\|^2 - \|x_t - x^*\|^2 \le -2\eta\big(f(x_{t+1}) - f(x^*)\big).$$
Summing over the first $t$ steps and using that $f(x_s)$ is non-increasing in $s$, we get
$$\|x_t - x^*\|^2 - \|x_0 - x^*\|^2 \le -2\eta\sum_{s=1}^{t}\big(f(x_s) - f(x^*)\big) \le -2\eta t\big(f(x_t) - f(x^*)\big),$$
and therefore
$$f(x_t) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\eta t} = \frac{L\|x_0 - x^*\|^2}{2t}.$$
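
As a sanity check (an added illustration, assuming the theorem as reconstructed above), the following sketch runs gradient descent with $\eta = 1/L$ on a convex $L$-smooth quadratic and verifies the bound $f(x_t) - f(x^*) \le L\|x_0 - x^*\|^2/(2t)$ at every iteration.

```python
import numpy as np

# Illustration (not from the lecture): check the bound
# f(x_t) - f(x*) <= L * ||x0 - x*||^2 / (2t) on a convex L-smooth quadratic.
A = np.diag([1.0, 10.0])                 # f(x) = 0.5 * x^T A x, so L = 10, x* = 0
L = np.max(np.linalg.eigvalsh(A))
eta = 1.0 / L
x0 = np.array([5.0, 5.0])

x = x0.copy()
for t in range(1, 101):
    x = x - eta * (A @ x)                # gradient descent step
    gap = 0.5 * x @ A @ x                # f(x_t) - f(x*)
    bound = L * np.dot(x0, x0) / (2 * t)
    assert gap <= bound + 1e-12
print("the O(1/t) bound holds for all checked t")
```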

If we hope $f(x_t) - f(x^*) \le \epsilon$, we need to run the gradient descent $t = \frac{L\|x_0 - x^*\|^2}{2\epsilon}$ steps. If the initial point is far from $x^*$, and $\epsilon$ is sufficiently small, the gradient descent is slow. Unfortunately, consider the following function:

[Figure: plot of the example function.]
This function is convex and $L$-smooth, but the gradient descent still converges very slowly on it.

Question

Under which assumptions does the gradient descent converge rapidly?

10.3 Strongly convex functions

Recall that, if we run the gradient descent for a quadratic function $f(x) = \frac12 x^\top A x$ where $A \succ 0$, it gives that $x_{t+1} = (I - \eta A)x_t$ and thus $x_t = (I - \eta A)^t x_0$. Clearly $f(x_t)$ converges to the optimal value at an exponential rate (for a suitable step size $\eta$).
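
For illustration (an added sketch, not part of the lecture), the following code checks the closed form $x_t = (I - \eta A)^t x_0$ numerically and shows the geometric decay of $\|x_t\|$ for a small diagonal $A$.

```python
import numpy as np

# Illustration (not from the lecture): for f(x) = 0.5 * x^T A x, one gradient
# step is x_{t+1} = (I - eta * A) x_t, so x_t = (I - eta * A)^t x_0 decays
# exponentially when 0 < eta < 2 / lambda_max(A).
A = np.diag([1.0, 4.0])
eta = 0.2
x0 = np.array([1.0, 1.0])
x = x0.copy()
M = np.eye(2) - eta * A                      # iteration matrix I - eta*A
for t in range(1, 6):
    x = x - eta * (A @ x)                    # one gradient-descent step
    closed_form = np.linalg.matrix_power(M, t) @ x0
    assert np.allclose(x, closed_form)
    print(t, np.linalg.norm(x))              # ||x_t|| shrinks geometrically
```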

We now introduce the following definition, which requires the function to be a bit "better" than some quadratic function.

Definition (Strong convexity)

A function $f$ is strongly convex with parameter $\alpha > 0$ (or $\alpha$-strongly convex) if $f(x) - \frac{\alpha}{2}\|x\|^2$ is convex.

There are some other forms of quadratic functions. Why don’t we choose other functions such as $\frac{\alpha}{2}\|x - x_0\|^2$ or $\frac12 x^\top A x$ for some given $x_0$ and $A \succ 0$? In fact, these functions can only achieve a similar effect to $\frac{\alpha}{2}\|x\|^2$. For example, $\frac{\alpha}{2}\|x - x_0\|^2$ is almost equivalent to $\frac{\alpha}{2}\|x\|^2$, since they differ only by an affine function, which does not affect convexity. In addition,
$$\frac{\lambda_{\min}(A)}{2}\|x\|^2 \le \frac12 x^\top A x \le \frac{\lambda_{\max}(A)}{2}\|x\|^2.$$
Hence, all (positive definite) quadratic functions achieve similar effects to $\|x\|^2$.

Recall that a twice continuously differentiable function is convex iff its Hessian matrix is positive semidefinite everywhere. The Hessian matrix of $f(x) - \frac{\alpha}{2}\|x\|^2$ is $\nabla^2 f(x) - \alpha I$.

Lemma

Suppose $f$ is a twice continuously differentiable function. Then $f$ is $\alpha$-strongly convex iff $\nabla^2 f(x) \succeq \alpha I$ for all $x$. Namely, for all $x, v \in \mathbb{R}^n$, $v^\top \nabla^2 f(x)\, v \ge \alpha\|v\|^2$.

We also have the following lemma similar to the first order condition for convexity and smoothness.

Lemma

Suppose $f$ is a differentiable function. Then $f$ is $\alpha$-strongly convex iff for all $x, y$,
$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y - x\|^2.$$

Proof

Let $g(x) = f(x) - \frac{\alpha}{2}\|x\|^2$. By the first order condition for convexity, $g$ is convex iff for all $x, y$,
$$g(y) \ge g(x) + \langle\nabla g(x), y - x\rangle.$$
Note that $\nabla g(x) = \nabla f(x) - \alpha x$. So it gives that $g$ is convex iff for all $x, y$,
$$f(y) - \frac{\alpha}{2}\|y\|^2 \ge f(x) - \frac{\alpha}{2}\|x\|^2 + \langle\nabla f(x) - \alpha x, y - x\rangle.$$
The last inequality is equivalent to
$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y\|^2 - \frac{\alpha}{2}\|x\|^2 - \alpha\langle x, y - x\rangle.$$
Rearranging it, we obtain that
$$\frac{\alpha}{2}\|y\|^2 - \frac{\alpha}{2}\|x\|^2 - \alpha\langle x, y - x\rangle = \frac{\alpha}{2}\|y - x\|^2,$$
which gives the desired inequality.

As a corollary, the above lemma implies that $f(y) > f(x) + \langle\nabla f(x), y - x\rangle$ for any $y \ne x$. Hence, $f$ is strictly convex.

Example
  1. An affine function cannot be strongly convex since it is not strictly convex.
  2. $f(x) = x^4$ cannot be strongly convex since $f''(x) = 12x^2$ and $f''(0) = 0$, so we cannot find such an $\alpha > 0$ when $x = 0$.
  3. $f(x) = x^2$ is $2$-strongly convex.
  4. $f(x) = e^x$ is not strongly convex since $f''(x) = e^x \to 0$ as $x \to -\infty$, so we cannot find such an $\alpha > 0$.
  5. $f(x) = \frac12 x^\top A x$ where $A \succ 0$ is strongly convex. Because $\nabla^2 f(x) = A \succeq \lambda_{\min}(A) I$, $f$ is $\lambda_{\min}(A)$-strongly convex (see the numerical check after this list).
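
The following sketch (an added illustration) numerically checks the first-order strong-convexity inequality from the lemma above for a random positive definite quadratic, using $\alpha = \lambda_{\min}(A)$ as in example 5.

```python
import numpy as np

# Illustration (not from the lecture): numerically check the first-order
# strong-convexity inequality for f(x) = 0.5 * x^T A x with alpha = lambda_min(A).
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B @ B.T + np.eye(3)                      # a positive definite matrix
alpha = np.min(np.linalg.eigvalsh(A))

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * alpha * np.dot(y - x, y - x)
    assert lhs >= rhs - 1e-9                 # f(y) >= f(x) + <grad f(x), y-x> + (alpha/2)||y-x||^2
print("strong-convexity inequality verified on random samples")
```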

Recall the property of monotone gradient for convex functions. We have a similar corollary.

Corollary

Suppose $f$ is a differentiable function. Then $f$ is $\alpha$-strongly convex iff for all $x, y$,
$$\langle\nabla f(x) - \nabla f(y), x - y\rangle \ge \alpha\|x - y\|^2.$$

10.4 Convergence of gradient descent with strong convexity

We now establish the convergence of gradient descent with strong convexity. First consider the gradient flow again. By strong convexity (and $\nabla f(x^*) = 0$), we can bound the derivative as follows:
$$\frac{\mathrm{d}}{\mathrm{d}t}\|X(t) - x^*\|^2 = -2\langle\nabla f(X(t)), X(t) - x^*\rangle \le -2\alpha\|X(t) - x^*\|^2.$$
For a time-continuous non-negative process $h(t)$, if $h'(t) = -c\,h(t)$ for some $c > 0$, then we have $h(t) = h(0)e^{-ct}$. The same result holds if we replace the equality by an inequality.

Theorem (Gronwall’s lemma)

For a time-continuous non-negative process $h(t)$, if $h'(t) \le -c\,h(t)$ for some $c > 0$, then we have
$$h(t) \le h(0)e^{-ct}.$$

Applying Gronwall’s lemma, we immediately conclude that
$$\|X(t) - x^*\|^2 \le e^{-2\alpha t}\|x_0 - x^*\|^2,$$
which gives an exponential decay rate. Intuitively, as the discretized version of the gradient flow, the gradient descent for strongly convex functions should also follow the exponential decay.

Theorem

Suppose $f$ is an $L$-smooth and $\alpha$-strongly convex function. Choose $\eta = 1/L$, and let the gradient descent iterate with a fixed step size $\eta$. Then it holds that
$$\|x_t - x^*\|^2 \le \left(1 - \frac{\alpha}{L}\right)^t\|x_0 - x^*\|^2.$$

Proof

By strong convexity, $\langle\nabla f(x_t), x_t - x^*\rangle \ge f(x_t) - f(x^*) + \frac{\alpha}{2}\|x_t - x^*\|^2$. Hence
$$\begin{aligned}\|x_{t+1} - x^*\|^2 &= \|x_t - x^*\|^2 - 2\eta\langle\nabla f(x_t), x_t - x^*\rangle + \eta^2\|\nabla f(x_t)\|^2 \\ &\le (1 - \eta\alpha)\|x_t - x^*\|^2 - 2\eta\big(f(x_t) - f(x^*)\big) + \eta^2\|\nabla f(x_t)\|^2 \\ &\le (1 - \eta\alpha)\|x_t - x^*\|^2,\end{aligned}$$
where the second inequality is due to the descent lemma, which gives $\eta^2\|\nabla f(x_t)\|^2 \le 2\eta\big(f(x_t) - f(x_{t+1})\big) \le 2\eta\big(f(x_t) - f(x^*)\big)$. Taking $\eta = 1/L$ and iterating over $t$ completes the proof.
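
Here is a short numerical sketch (my own illustration) that verifies the contraction $\|x_t - x^*\|^2 \le (1 - \alpha/L)^t\|x_0 - x^*\|^2$ for a strongly convex quadratic with $\eta = 1/L$.

```python
import numpy as np

# Illustration (not from the lecture): check the contraction
# ||x_t - x*||^2 <= (1 - alpha/L)^t * ||x0 - x*||^2 for a strongly convex quadratic.
A = np.diag([1.0, 10.0])                     # f(x) = 0.5 * x^T A x: alpha = 1, L = 10, x* = 0
alpha, L = 1.0, 10.0
eta = 1.0 / L
x0 = np.array([3.0, -2.0])

x = x0.copy()
for t in range(1, 51):
    x = x - eta * (A @ x)                    # gradient descent step
    lhs = np.dot(x, x)                       # ||x_t - x*||^2
    rhs = (1 - alpha / L) ** t * np.dot(x0, x0)
    assert lhs <= rhs + 1e-12
print("linear (exponential) convergence bound holds for all checked t")
```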

The function value also has an exponential decay. Since $f$ is $L$-smooth and $\nabla f(x^*) = 0$, we have $f(x_t) - f(x^*) \le \frac{L}{2}\|x_t - x^*\|^2$, which gives the following corollary.

Corollary

Under the assumptions of the above theorem,
$$f(x_t) - f(x^*) \le \frac{L}{2}\left(1 - \frac{\alpha}{L}\right)^t\|x_0 - x^*\|^2.$$

10.5 Condition number

For a quadratic function $f(x) = \frac12 x^\top A x$ where $A \succ 0$, we already have the following facts:

  • $f$ is $\lambda_{\min}(A)$-strongly convex;
  • $f$ is $\lambda_{\max}(A)$-smooth.

Applying the above theorem, if we take the step size $\eta = 1/\lambda_{\max}(A)$, $\|x_t - x^*\|^2$ will converge at an exponential rate of $1 - \lambda_{\min}(A)/\lambda_{\max}(A)$ (since $\alpha = \lambda_{\min}(A)$ and $L = \lambda_{\max}(A)$). Recall that in the last lecture, we have shown that $x_t$ converges as long as $\eta < 2/\lambda_{\max}(A)$, which means that the choice $\eta = 1/\lambda_{\max}(A)$ is not necessary for convergence. Since we hope the contraction factor is as small as possible, we may choose a greater value of $\eta$ to obtain a better rate.

Now let us calculate the optimal convergence rate for the quadratic $f(x) = \frac12 x^\top A x$. Since $\nabla f(x) = Ax$, we have
$$x_{t+1} = x_t - \eta A x_t = (I - \eta A)x_t.$$
Apply the eigendecomposition $A = Q\Lambda Q^\top$, where $Q$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with $\lambda_1 \ge \dots \ge \lambda_n > 0$.
Then
$$x_t = (I - \eta A)^t x_0 = Q(I - \eta\Lambda)^t Q^\top x_0,$$
where $(I - \eta\Lambda)^t = \mathrm{diag}\big((1 - \eta\lambda_1)^t, \dots, (1 - \eta\lambda_n)^t\big)$. So
$$\|x_t - x^*\| = \|Q(I - \eta\Lambda)^t Q^\top x_0\| \le \max_i|1 - \eta\lambda_i|^t\,\|x_0\|.$$
Let $\rho(\eta) = \max_i|1 - \eta\lambda_i|$. We have $\|x_t - x^*\| \le \rho(\eta)^t\|x_0 - x^*\|$ (here $x^* = 0$).

Note that $\rho(\eta) = \max\{|1 - \eta\lambda_1|, |1 - \eta\lambda_n|\}$. Thus, we would like to choose $\eta$ to minimize $\rho(\eta)$, which means $1 - \eta\lambda_n = -(1 - \eta\lambda_1)$, i.e., $\eta = \frac{2}{\lambda_1 + \lambda_n}$. In this case,
$$\rho(\eta) = \frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n}.$$

Definition (Condition number)

Given a positive definite matrix $A$, its condition number is defined by
$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.$$

The argument above reveals that for quadratic functions, the convergence rate of gradient descent depends on $\kappa(A)$: the optimal contraction factor is $\frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n} = \frac{\kappa(A) - 1}{\kappa(A) + 1}$.
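
To illustrate the argument above (an added sketch with an arbitrarily chosen spectrum), the following code compares the contraction factor $\rho(\eta)$ for the step size $\eta = 1/\lambda_{\max}(A)$ with the optimal step size $\eta = 2/(\lambda_{\max}(A) + \lambda_{\min}(A))$.

```python
import numpy as np

# Illustration (not from the lecture): compare the contraction factor
# rho(eta) = max_i |1 - eta * lambda_i| for eta = 1/lambda_max and for the
# optimal eta = 2 / (lambda_max + lambda_min).
lams = np.array([1.0, 3.0, 10.0])            # eigenvalues of A (lambda_min = 1, lambda_max = 10)
rho = lambda eta: np.max(np.abs(1 - eta * lams))

eta_smooth = 1.0 / lams.max()                # step size 1/L
eta_opt = 2.0 / (lams.max() + lams.min())    # optimal step size for a quadratic
kappa = lams.max() / lams.min()              # condition number

print("rho(1/L)            =", rho(eta_smooth))       # 1 - 1/kappa = 0.9
print("rho(eta_opt)        =", rho(eta_opt))          # (kappa-1)/(kappa+1) = 9/11
print("(kappa-1)/(kappa+1) =", (kappa - 1) / (kappa + 1))
```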

Example
  • For a well-conditioned matrix, e.g. $\kappa(A) = 2$, the convergence rate is $\frac{\kappa - 1}{\kappa + 1} = \frac13$, so gradient descent converges quickly.
  • For an ill-conditioned matrix, e.g. $\kappa(A) = 100$, the convergence rate is $\frac{99}{101} \approx 0.98$, so gradient descent converges slowly.

These two examples show that the gradient descent may converge very slowly when the coefficient matrix has a large condition number.

For nonquadratic functions, we can approximate them locally (near the minimum point $x^*$) by the following Taylor expansion:
$$f(x) \approx f(x^*) + \langle\nabla f(x^*), x - x^*\rangle + \frac12(x - x^*)^\top\nabla^2 f(x^*)(x - x^*) = f(x^*) + \frac12(x - x^*)^\top\nabla^2 f(x^*)(x - x^*).$$
Hence, in a neighborhood of the minimum point $x^*$, the convergence rate depends on $\kappa(\nabla^2 f(x^*))$. If the condition number of $\nabla^2 f(x^*)$ is large, the results given by gradient descent with a fixed step size cannot converge rapidly.
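
As a concrete added example (not from the lecture), the Rosenbrock function $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$ is a standard nonquadratic test problem; the sketch below computes the condition number of its Hessian at the minimizer $(1, 1)$, which turns out to be large, so gradient descent with a fixed step size is slow near this minimum.

```python
import numpy as np

# Illustration (not from the lecture): the Rosenbrock function is a classic
# nonquadratic example; near its minimizer x* = (1, 1), the Hessian is very
# ill-conditioned, which is why fixed-step gradient descent is slow on it.
def rosenbrock_hessian(x, y):
    return np.array([[2 - 400 * (y - x**2) + 800 * x**2, -400 * x],
                     [-400 * x,                            200.0]])

H = rosenbrock_hessian(1.0, 1.0)             # Hessian at the minimizer (1, 1)
eigs = np.linalg.eigvalsh(H)
print("condition number kappa =", eigs.max() / eigs.min())   # roughly 2.5e3
```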

Here are some well-conditioned and ill-conditioned examples:
[Figures: contour plots of a well-conditioned example and an ill-conditioned example.]