We have analysed the convergence (and the convergence rate) of gradient descent with a fixed step size, where we should choose the step size $\eta = \frac{1}{L}$ if the objective function is $L$-smooth. However, if the smoothness constant is difficult to determine, how can we set the step size?

Note that a convex function restricted to any line is also convex. A naive idea is to choose the step length so that this restricted function attains its minimum value. Specifically, since $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, where the step size $\eta_t$ is to be determined, we consider the function $h(\eta) = f(x_t - \eta \nabla f(x_t))$. It is also a convex function of $\eta$, and $f(x_{t+1}) = h(\eta_t)$. We can greedily make $f(x_{t+1})$ as small as possible, which is to set
$$\eta_t = \operatorname*{arg\,min}_{\eta \ge 0} h(\eta) = \operatorname*{arg\,min}_{\eta \ge 0} f(x_t - \eta \nabla f(x_t)).$$
This method is called exact line search.
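As a rough illustration, the Python sketch below runs gradient descent with an (approximate) exact line search, delegating the one-dimensional minimization to `scipy.optimize.minimize_scalar` over an arbitrary bounded bracket. The function names, the bracket $[0, 10^3]$, and the toy quadratic are illustrative choices, not part of the notes above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent_exact_line_search(f, grad, x0, num_iters=50):
    """Gradient descent where each step size is chosen by (approximately)
    minimizing the 1-D restriction h(eta) = f(x - eta * grad(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:      # already (nearly) stationary
            break
        h = lambda eta: f(x - eta * g)     # convex 1-D restriction
        eta = minimize_scalar(h, bounds=(0.0, 1e3), method="bounded").x
        x = x - eta * g
    return x

# Toy usage on f(x) = 1/2 x^T Q x - b^T x; the result should match Q^{-1} b.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
print(gradient_descent_exact_line_search(f, grad, np.zeros(2)),
      np.linalg.solve(Q, b))
```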

Example

Consider the quadratic function $f(x) = \frac{1}{2} x^\top Q x - b^\top x$ with $Q$ positive definite. Let $g_t = \nabla f(x_t) = Q x_t - b$. Then, by definition,
$$\eta_t = \operatorname*{arg\,min}_{\eta \ge 0} f(x_t - \eta g_t) = \frac{g_t^\top g_t}{g_t^\top Q g_t}.$$

Proposition

Applying the method of exact line search, successive gradient directions are always orthogonal.

Proof

Let $h(\eta) = f(x_t - \eta \nabla f(x_t))$; then $h'(\eta) = -\nabla f(x_t - \eta \nabla f(x_t))^\top \nabla f(x_t)$. By the first-order necessary condition, we have $h'(\eta_t) = 0$, which means
$$\nabla f(x_{t+1})^\top \nabla f(x_t) = 0.$$
Therefore, $\nabla f(x_{t+1})$ and $\nabla f(x_t)$ are orthogonal.
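A quick numerical check of this proposition, assuming a quadratic objective $f(x) = \frac{1}{2}x^\top Q x - b^\top x$ for which the exact step size has the closed form $g^\top g / (g^\top Q g)$; the matrix, vector, and starting point below are arbitrary illustrative choices.

```python
import numpy as np

# Exact line search on a quadratic: successive gradients should be orthogonal.
Q = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.array([5.0, -3.0])
for _ in range(5):
    g = Q @ x - b                    # gradient at the current point
    eta = (g @ g) / (g @ Q @ g)      # exact minimizer of f(x - eta * g)
    x = x - eta * g
    g_next = Q @ x - b
    print(np.dot(g, g_next))         # ~0 up to floating-point error
```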

Newton’s method to find zero points

In the method of exact line search, we need to find the minimum point of a convex function of one variable. This is equivalent to finding the zero points of $g = h'$ if we let $h(\eta) = f(x_t - \eta \nabla f(x_t))$.

Now we introduce some methods to find the zero points. Note that $g$ is an increasing function, since $h$ is convex. Thus, there is a simple method using binary search: if $g$ is continuous and we know two points $\ell < r$ such that $g(\ell) < 0$ and $g(r) > 0$, then we can check whether $g\!\left(\frac{\ell + r}{2}\right)$ is zero and keep partitioning the interval into two halves, updating the left and right endpoints, until we find the zero point.
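A minimal sketch of this bisection idea in Python, assuming we are handed an increasing continuous function together with a bracketing interval; the function name, tolerance, and iteration cap are illustrative.

```python
def bisect_increasing(g, lo, hi, tol=1e-10, max_iter=200):
    """Find a zero of an increasing continuous g on [lo, hi] by halving."""
    assert g(lo) <= 0 <= g(hi), "the zero must be bracketed"
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        val = g(mid)
        if abs(val) < tol or hi - lo < tol:
            return mid
        if val < 0:
            lo = mid          # the zero lies to the right of mid
        else:
            hi = mid          # the zero lies to the left of mid
    return 0.5 * (lo + hi)

# Usage on h'(eta) for h(eta) = (eta - 2)**2: the zero is eta = 2.
print(bisect_increasing(lambda eta: 2 * (eta - 2), 0.0, 10.0))
```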

A better method is to apply the so-called Newton's method. The idea is that, given a value $x_0$, we can approximate $g$ locally (near $x_0$):
$$g(x) \approx g(x_0) + g'(x_0)(x - x_0).$$
The zero point of the right-hand side is $x_0 - \frac{g(x_0)}{g'(x_0)}$, which should be a good approximation of the zero point of $g$ (if that zero point is near $x_0$).
Newton's method applies this approximation iteratively. Namely, we choose an arbitrary $x_0$ and let
$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}.$$
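A minimal sketch of this iteration in Python, assuming we can evaluate both $g$ and $g'$ and that $g'$ does not vanish along the iterates; the names and tolerances are illustrative.

```python
def newton_zero(g, g_prime, x0, tol=1e-12, max_iter=100):
    """Newton's method for a zero of g: iterate x <- x - g(x)/g'(x)."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)   # assumes g'(x) != 0 along the iterates
        x -= step
        if abs(step) < tol:
            break
    return x

# Usage: the zero of g(x) = x**2 - 2 starting from x0 = 1 is sqrt(2).
print(newton_zero(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))
```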

Given a positive real number $a$, how can we calculate $\frac{1}{\sqrt{a}}$?
Let $g(x) = \frac{1}{x^2} - a$. Then, for $x > 0$, $g(x) = 0$ iff $x = \frac{1}{\sqrt{a}}$. Applying Newton's method, we choose a magic initial value $x_0$ and let
$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)} = x_k\left(\frac{3}{2} - \frac{a}{2} x_k^2\right).$$
It converges to $\frac{1}{\sqrt{a}}$ rapidly.
The algorithm is best known for its implementation in 1999 in Quake III Arena.
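The sketch below mimics that trick in Python rather than reproducing the original C: it builds a bit-level initial guess using the well-known constant `0x5f3759df` and then applies the Newton update $x \leftarrow x\left(\frac{3}{2} - \frac{a}{2}x^2\right)$ derived above. The function name and the number of Newton steps are illustrative choices.

```python
import struct

def fast_inv_sqrt(a, newton_steps=1):
    """Approximate 1/sqrt(a): bit-level initial guess + Newton refinement."""
    # Reinterpret the 32-bit float's bits as an integer, apply the magic
    # constant, and reinterpret back; this gives a rough initial guess x0.
    i = struct.unpack('<I', struct.pack('<f', a))[0]
    i = 0x5f3759df - (i >> 1)
    x = struct.unpack('<f', struct.pack('<I', i))[0]
    # Newton iteration for g(x) = 1/x**2 - a refines the guess.
    for _ in range(newton_steps):
        x = x * (1.5 - 0.5 * a * x * x)
    return x

print(fast_inv_sqrt(4.0), 1 / 4.0 ** 0.5)   # both close to 0.5
```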

Theorem

Suppose $f$ is $\mu$-strongly convex and $L$-smooth. Let $(x_t)_{t \ge 0}$ be the sequence given by gradient descent with exact line search. Then
$$f(x_{t+1}) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)\left(f(x_t) - f(x^*)\right).$$

Proof

Let $g_t = \nabla f(x_t)$. Since $f$ is $L$-smooth, we have
$$f(x_t - \eta g_t) \le f(x_t) - \eta \|g_t\|^2 + \frac{L\eta^2}{2}\|g_t\|^2.$$
Denote the right-hand side by $u(\eta)$. The minimizer of $u$ is $\eta = \frac{1}{L}$. Since $f(x_{t+1}) = \min_{\eta \ge 0} f(x_t - \eta g_t) \le u\!\left(\frac{1}{L}\right)$, we have
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|g_t\|^2. \quad (1)$$
Moreover, $f$ is $\mu$-strongly convex, so it holds that for every $y$,
$$f(y) \ge f(x_t) + g_t^\top (y - x_t) + \frac{\mu}{2}\|y - x_t\|^2.$$
Denote the right-hand side by $\ell(y)$. The minimizer of $\ell$ is $y = x_t - \frac{1}{\mu} g_t$, since $\nabla \ell(y) = g_t + \mu (y - x_t)$. Thus,
$$f(x^*) \ge \min_y \ell(y) = f(x_t) - \frac{1}{2\mu}\|g_t\|^2. \quad (2)$$
Combining (1) and (2), we have
$$f(x_{t+1}) - f(x^*) \le f(x_t) - f(x^*) - \frac{1}{2L}\|g_t\|^2 \le \left(1 - \frac{\mu}{L}\right)\left(f(x_t) - f(x^*)\right).$$

If we know that $f$ is $L$-smooth and set the fixed step size $\eta = \frac{1}{L}$, it is easy to see that the resulting convergence rate is the same as that with exact line search. But the advantage of the exact line search method is that we do not need to know the smoothness constant of $f$ in advance.

In general, it is expensive to compute the step size in exact line search, and we usually do not need the exact minimizer: it is sufficient that the value of the objective function decreases by a sufficient amount at each step. So we consider the so-called backtracking line search method.

Definition (Armijo's rule)

Armijo’s rule is a well-known and widely applied backtracking rule to update the step size. Given a descent direction $d_t$ (i.e., $\nabla f(x_t)^\top d_t < 0$) and parameters $\alpha \in (0,1)$ and $\beta \in (0,1)$, we first initialize $\eta = \eta_0$ and then update $\eta$ to $\beta\eta$ as long as
$$f(x_t + \eta d_t) > f(x_t) + \alpha \eta \nabla f(x_t)^\top d_t.$$
In particular, if we set $d_t = -\nabla f(x_t)$, then the condition becomes $f(x_t - \eta \nabla f(x_t)) > f(x_t) - \alpha \eta \|\nabla f(x_t)\|^2$.
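A minimal Python sketch of this backtracking loop, written for the gradient-descent case $d_t = -\nabla f(x_t)$ mentioned above; the concrete parameter values `alpha=0.3` and `beta=0.5` and the toy quadratic in the usage snippet are illustrative choices, not values prescribed in these notes.

```python
import numpy as np

def backtracking_line_search(f, grad_x, x, d, eta0=1.0, alpha=0.3, beta=0.5):
    """Armijo backtracking: shrink eta by beta until the sufficient-decrease
    condition f(x + eta*d) <= f(x) + alpha*eta*<grad f(x), d> holds."""
    eta = eta0
    fx = f(x)
    slope = grad_x @ d        # directional derivative; negative for a descent direction
    while f(x + eta * d) > fx + alpha * eta * slope:
        eta *= beta
    return eta

# Usage within one gradient-descent step on f(x) = x^T x:
f = lambda x: x @ x
x = np.array([3.0, -4.0])
g = 2 * x
eta = backtracking_line_search(f, g, x, -g)
x_next = x + eta * (-g)
print(eta, f(x_next) < f(x))
```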

The intuition is that we know $f(x_t + \eta d_t) \ge f(x_t) + \eta \nabla f(x_t)^\top d_t$ by convexity. If $f(x_t + \eta d_t) \le f(x_t) + \alpha \eta \nabla f(x_t)^\top d_t$ for some $\alpha \in (0,1)$, we regard the decrease of $f$ as sufficient (then we can get an inequality similar to the descent lemma for gradient descent on $L$-smooth functions). Moreover, this requirement always holds if $\eta$ is sufficiently close to $0$, since $\lim_{\eta \to 0^+} \frac{f(x_t + \eta d_t) - f(x_t)}{\eta} = \nabla f(x_t)^\top d_t < \alpha \nabla f(x_t)^\top d_t$ when $\nabla f(x_t)^\top d_t < 0$. Therefore, the update process stops after a finite number of iterations.

Armijo originally chose particular values of these parameters; the textbook suggests practical ranges for $\alpha$ and $\beta$.

We first give a lower bound on the step size. Initially set $\eta = \eta_0$ and $d_t = -\nabla f(x_t)$. Since $f$ is $L$-smooth, we have
$$f(x_t - \eta \nabla f(x_t)) \le f(x_t) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla f(x_t)\|^2 \le f(x_t) - \alpha \eta \|\nabla f(x_t)\|^2$$
if $\eta \le \frac{2(1-\alpha)}{L}$. Thus the update process terminates once we have $\eta \le \frac{2(1-\alpha)}{L}$, which further implies that the returned step size satisfies $\eta_t \ge \min\left\{\eta_0, \frac{2(1-\alpha)\beta}{L}\right\}$.

Then, applying this lower bound on $\eta_t$, we obtain the following result on the convergence rate.

Theorem

Suppose $f$ is $\mu$-strongly convex and $L$-smooth. Let $(x_t)_{t \ge 0}$ be the sequence given by gradient descent with Armijo’s backtracking line search. Then
$$f(x_{t+1}) - f(x^*) \le \left(1 - 2\mu\alpha \min\left\{\eta_0, \frac{2(1-\alpha)\beta}{L}\right\}\right)\left(f(x_t) - f(x^*)\right).$$