At the end of the last lecture, we obtained the following lemma on gradient descent for smooth functions.
For an
However, it is still not easy to establish the convergence rate of gradient descent directly. Instead, we now introduce a continuous-time version of gradient descent, which is easier to analyse.
A gradient flow is a curve $x(t)$ that follows the negative gradient of $f$; that is, it satisfies the ordinary differential equation
$$\dot{x}(t) = -\nabla f(x(t)).$$
Applying the chain rule,
$$\frac{\mathrm{d}}{\mathrm{d}t} f(x(t)) = \big\langle \nabla f(x(t)),\, \dot{x}(t) \big\rangle = -\|\nabla f(x(t))\|^2 \le 0,$$
so the objective value is non-increasing along the gradient flow.
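To make this concrete, here is a minimal numerical sketch (not part of the lecture): it approximates the gradient flow by very small forward-Euler steps on an illustrative quadratic objective and checks that the objective value never increases along the trajectory. The matrix `A`, the starting point, and the step/horizon parameters are arbitrary choices made only for the demonstration.

```python
import numpy as np

# Illustrative objective: f(x) = 0.5 * x^T A x with a fixed symmetric matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

# Approximate the gradient flow  dx/dt = -grad f(x)  with small Euler steps.
dt = 1e-3                      # small time step (stands in for the continuous flow)
T = 5.0                        # time horizon
x = np.array([2.0, -1.0])      # arbitrary starting point

values = [f(x)]
for _ in range(int(T / dt)):
    x = x - dt * grad_f(x)
    values.append(f(x))

# The chain-rule computation above predicts that f(x(t)) is non-increasing.
print("monotone decrease:", all(b <= a + 1e-12 for a, b in zip(values, values[1:])))
print("f(x(0)) =", values[0], "  f(x(T)) =", values[-1])
```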
We now compare gradient descent with the gradient flow. Assume gradient descent iterates with a fixed step size $\eta > 0$:
$$x_{k+1} = x_k - \eta\, \nabla f(x_k).$$
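In fact, the iteration above is exactly a forward-Euler discretisation of the gradient flow with time step $\eta$. As a small sketch under the same illustrative setup as in the previous snippet (arbitrary quadratic and parameters), one can check that for a modest step size the $k$-th iterate of gradient descent stays close to the flow at time $t = k\eta$:

```python
import numpy as np

# Same illustrative quadratic as before: f(x) = 0.5 * x^T A x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad_f = lambda x: A @ x

x0 = np.array([2.0, -1.0])
eta = 0.05                    # fixed gradient-descent step size (arbitrary)
K = 40                        # number of gradient-descent steps

# Gradient descent: x_{k+1} = x_k - eta * grad f(x_k).
x_gd = x0.copy()
for _ in range(K):
    x_gd = x_gd - eta * grad_f(x_gd)

# Gradient flow up to time T = K * eta, approximated with much finer Euler steps.
dt = 1e-4
x_flow = x0.copy()
for _ in range(int(round(K * eta / dt))):
    x_flow = x_flow - dt * grad_f(x_flow)

# For small eta, gradient descent tracks the flow sampled at t = k * eta.
print("gradient descent iterate:", x_gd)
print("gradient flow at t=K*eta:", x_flow)
print("difference norm:", np.linalg.norm(x_gd - x_flow))
```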
Suppose
Analogously to the gradient flow, we calculate
If we hope
This function is convex and
Under which assumptions does gradient descent converge rapidly?
Recall that if we run gradient descent on a quadratic function
We now introduce the following definition, which requires the function to be a bit "better" than some quadratic function.
A function $f$ is $\mu$-strongly convex for a parameter $\mu > 0$ if the function $x \mapsto f(x) - \frac{\mu}{2}\|x\|^2$ is convex.
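For a concrete instance (a worked example added for illustration; the symmetric matrix $A$ here is generic and not taken from the notes), consider the quadratic $f(x) = \frac{1}{2} x^\top A x$. Then
$$f(x) - \frac{\mu}{2}\|x\|^2 = \frac{1}{2}\, x^\top (A - \mu I)\, x,$$
which is convex if and only if $A - \mu I \succeq 0$; hence this $f$ is $\mu$-strongly convex precisely when $0 < \mu \le \lambda_{\min}(A)$.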
There are other possible forms of quadratic functions. Why don't we choose other functions, such as
Recall that a twice-differentiable function is convex iff its Hessian matrix is positive semidefinite. The Hessian matrix of
Suppose
We also have the following lemma, analogous to the first-order conditions for convexity and smoothness.
Suppose
Let
As a corollary, the above lemma implies that
Recall the monotone-gradient property of convex functions; we have a similar corollary.
Suppose
We now establish the convergence of gradient descent under strong convexity. First, consider the gradient flow again. By strong convexity, we can bound the derivative as follows
For a time-continuous non-negative process $X(t)$ satisfying $\frac{\mathrm{d}}{\mathrm{d}t} X(t) \le -c\, X(t)$ for some constant $c > 0$, Gronwall's lemma states that $X(t) \le e^{-ct} X(0)$ for all $t \ge 0$.
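A short argument for this form of the lemma (stated under the implicit assumption that $X$ is differentiable): let $Y(t) = e^{ct} X(t)$. Then
$$\frac{\mathrm{d}}{\mathrm{d}t} Y(t) = e^{ct}\Big(\frac{\mathrm{d}}{\mathrm{d}t} X(t) + c\, X(t)\Big) \le 0,$$
so $Y$ is non-increasing, and therefore $X(t) = e^{-ct} Y(t) \le e^{-ct} Y(0) = e^{-ct} X(0)$.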
Applying Gronwall's lemma, we conclude
Suppose
By strong convexity,
The function value also decays exponentially. Since
For a quadratic function
Applying the above theorem, if we take the step size
Now let us calculate the optimal convergence rate of
Then
Note that
Given an
The argument above reveals that, for quadratic functions, the convergence rate of gradient descent depends on the condition number of the coefficient matrix.
These two examples show that gradient descent may converge very slowly when the coefficient matrix has a large condition number.
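To see this effect numerically, here is a small illustrative experiment (a sketch; the two diagonal matrices below are my own choices, not the examples discussed above): run gradient descent with step size $\eta = 1/L$ on $f(x) = \frac{1}{2} x^\top A x$ for a well-conditioned and an ill-conditioned $A$, and count the iterations needed to reach a fixed accuracy. The count grows roughly in proportion to the condition number $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$.

```python
import numpy as np

def gd_iterations(A, x0, tol=1e-6, max_iter=1_000_000):
    """Run gradient descent on f(x) = 0.5 * x^T A x with step size 1/L,
    where L is the largest eigenvalue of A, and return the number of
    iterations until ||x_k|| <= tol * ||x_0|| (the minimizer is x* = 0)."""
    L = np.linalg.eigvalsh(A).max()
    eta = 1.0 / L
    x = x0.astype(float).copy()
    target = tol * np.linalg.norm(x0)
    for k in range(max_iter):
        if np.linalg.norm(x) <= target:
            return k
        x = x - eta * (A @ x)          # gradient of f at x is A x
    return max_iter

x0 = np.array([1.0, 1.0])
A_well = np.diag([1.0, 2.0])           # condition number 2
A_ill  = np.diag([1.0, 1000.0])        # condition number 1000

for name, A in [("well-conditioned", A_well), ("ill-conditioned ", A_ill)]:
    kappa = np.linalg.cond(A)
    iters = gd_iterations(A, x0)
    print(f"{name}: condition number = {kappa:7.0f}, iterations to converge = {iters}")
```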
Non-quadratic functions can be approximated locally (near the minimum point
Here are some well-conditioned and ill-conditioned examples: