Lecture 9. Descent Method


9.1 Unconstrained optimization problems

We now study general convex optimization problems. First, we consider the easiest case: no constraints. Namely, the optimization problem is
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is a convex function.

Recall that the optimality condition for convex functions is:

Theorem

Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a convex function. Then $x^*$ is a global minimum point of $f$ iff $0 \in \partial f(x^*)$. In particular, if $f$ is differentiable, then $x^*$ is a global minimum point iff $\nabla f(x^*) = 0$.

For convenience, we assume that the objective function $f$ is differentiable and has a finite minimum point $x^*$ (and the minimum value $f^* = f(x^*)$). For some simple cases, we can compute the minimum point by solving the equation $\nabla f(x) = 0$. However, in general we cannot expect that closed-form solutions always exist. So we introduce some algorithms to find optimal solutions.
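For example (a sketch with illustrative numbers of our own), for the strictly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ the equation $\nabla f(x) = Ax - b = 0$ is a linear system, solvable in closed form:

```python
# Minimize f(x) = 1/2 x^T A x - b^T x by solving grad f(x) = A x - b = 0.
# A is symmetric positive definite, so the solution is the unique minimizer.
# (2x2 system solved by Cramer's rule; the numbers are illustrative.)
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# The gradient vanishes at the minimizer.
grad = [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
        A[1][0] * x[0] + A[1][1] * x[1] - b[1]]
print(x, grad)
```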

9.2 Descent method

Analogously to the simplex method, we would like to move from a solution $x$ to a better "neighbor" $x + \eta d$. The convexity guarantees that
$$f(x + \eta d) \ge f(x) + \eta \langle \nabla f(x), d \rangle.$$
As we hope $x + \eta d$ is better, i.e., $f(x + \eta d) < f(x)$, it requires that $\langle \nabla f(x), d \rangle < 0$. Conversely, we know that if the directional derivative $\langle \nabla f(x), d \rangle < 0$, then there exists $\eta > 0$ such that $f(x + \eta d) < f(x)$. So $\langle \nabla f(x), d \rangle < 0$ is a reasonable requirement for the moving direction $d$.

This inspires the so-called descent method: start from a solution $x_0$ and move from $x_k$ to $x_{k+1} = x_k + \eta_k d_k$ iteratively, where $\eta_k > 0$ is the step size to be determined and $d_k$ is the moving direction satisfying $\langle \nabla f(x_k), d_k \rangle < 0$.

The first question is: when can we stop? Of course, the ideal stopping criterion is $\nabla f(x_k) = 0$ for some $k$. If so, we know that $x_k$ is indeed a minimum point. However, in practice, we cannot expect this to happen. So we usually use stopping criteria such as $\|\nabla f(x_k)\| \le \epsilon$, $f(x_k) - f(x_{k+1}) \le \epsilon$, or a maximum number of iterations.
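A minimal one-dimensional sketch of this scheme (names and test function are ours), using the gradient-norm stopping criterion and, for concreteness, the direction $d = -f'(x)$ with a fixed step size:

```python
# Minimal descent-method skeleton in one dimension (illustrative):
# repeat x <- x + eta * d until ||f'(x)|| <= eps or too many iterations.
def descent(grad, x0, eta=0.1, eps=1e-8, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) <= eps:      # stopping criterion ||grad f(x)|| <= eps
            break
        x = x - eta * g        # d = -grad f(x) satisfies <grad f(x), d> < 0
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_star = descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_star)  # close to the minimum point 3
```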

The next question is: does this algorithm converge to an optimal solution? In fact, we claim that if we assume that $\eta_k$ and $d_k$ only depend on $x_k$, and the choice of them satisfies $f(x_{k+1}) < f(x_k)$ for every non-optimal $x_k$ (note that the optimal solution may not be unique), then the values of the objective function generated by the descent method converge to the minimum value: $f(x_k) \to f^*$. (However, $(x_k)$ may not converge, and we will give an example later.)

We assume $f$ has a finite minimum value $f^*$, and $f(x_0) \ge f(x_1) \ge f(x_2) \ge \cdots \ge f^*$. So $f(x_k)$ has a limit as $k$ goes to infinity. Now we would like to show that the limit is $f^*$.


Let $c = \lim_{k \to \infty} f(x_k)$ and suppose towards a contradiction that $c > f^*$. Intuitively, as we hope $f(x_{k+1}) < f(x_k)$ as long as $x_k$ is not optimal, we can argue that $f(x_k)$ still decreases by a fixed amount per step even if $f(x_k)$ is sufficiently close to $c$, which is too fast to converge to $c$.

Rigorously, let $S = \{x : c \le f(x) \le f(x_0)\}$. Then $S$ is a compact set, if we assume the sublevel set $\{x : f(x) \le f(x_0)\}$ is bounded for convenience (otherwise $S$ may not necessarily be bounded). Let $\phi$ be a function defined by
$$\phi(x) = f(x) - f(x + \eta(x) d(x)),$$
where $\eta(x)$ and $d(x)$ are the step size and the direction we choose if the current iterate is $x$. That is, $\phi(x)$ measures the difference between $f(x_k)$ and $f(x_{k+1})$ if we set $x_k = x$.

By our assumption, $\phi(x) > 0$ as long as $x$ is not optimal, and noting that no $x \in S$ is optimal (since $f(x) \ge c > f^*$), we conclude that $\phi(x) > 0$ for all $x \in S$. Applying the extreme value theorem, there exists
$$\delta = \min_{x \in S} \phi(x) > 0,$$
which implies that $f(x_{k+1}) \le f(x_k) - \delta$ for every $k$ (as every $x_k \in S$). This contradicts our assumption that there exists $c > f^*$ such that $f(x_k) \ge c$ for all $k$, and thus completes the proof.

Tip

In fact, it is not necessary to define $\phi$ as the difference between the function values. Analogously to the amortised analysis of some data structures, we may define $\phi$ to measure the difference of some potential function. So the argument above is a simplified version of Lyapunov's global stability theorem in discrete time.
Suppose $x_{k+1} = g(x_k)$, where $g$ is a continuous function and $g(x^*) = x^*$. If there exists a continuous (Lyapunov) function $V$ such that

  1. $V(x^*) = 0$ and $V(x) > 0$ for all $x \ne x^*$, (positivity)
  2. $V(x) \to \infty$ as $\|x\| \to \infty$, (radial unboundedness)
  3. $V(g(x)) < V(x)$ for all $x \ne x^*$, (strict decrease)

then for all $x_0$, we have $x_k \to x^*$ as $k \to \infty$.
For our setting, just select an optimal solution $x^*$, and set $g(x) = x + \eta(x) d(x)$ and $V(x) = f(x) - f^*$.
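For intuition, the Lyapunov conditions can be checked along a descent trajectory on a toy quadratic (our own instance, with $g(x) = x - \eta f'(x)$ and $V(x) = f(x) - f^*$):

```python
# Toy check of the Lyapunov conditions for gradient descent on f(x) = x^2
# (so x* = 0, f* = 0), with g(x) = x - eta * f'(x) and V(x) = f(x) - f*.
eta = 0.3
g = lambda x: x - eta * 2.0 * x   # one descent step; here g(x) = 0.4 * x
V = lambda x: x * x               # Lyapunov function V(x) = f(x) - f*

x = 5.0
values = [V(x)]
for _ in range(20):
    x = g(x)
    values.append(V(x))

# Strict decrease V(g(x)) < V(x) away from x* = 0 forces x_k -> x*.
print(values[0], values[-1])
```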

9.3 Gradient descent

We now consider a specific descent method, the gradient descent, where we select $d_k = -\nabla f(x_k)$. Then trivially $\langle \nabla f(x_k), d_k \rangle = -\|\nabla f(x_k)\|^2 < 0$ whenever $\nabla f(x_k) \ne 0$.

There is an advantage to choosing $d = -\nabla f(x)$, since it is the direction of steepest descent, namely, the value of $f$ decreases most rapidly: for any unit-length vector $d$, the directional derivative satisfies
$$\langle \nabla f(x), d \rangle \ge -\|\nabla f(x)\|$$
by the Cauchy–Schwarz inequality, and the equality holds iff $d = -\nabla f(x) / \|\nabla f(x)\|$.

Applying this choice of directions, we obtain the gradient descent method:
$$x_{k+1} = x_k - \eta_k \nabla f(x_k).$$
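A minimal sketch of gradient descent in $\mathbb{R}^n$ (plain Python lists; the diagonal quadratic test function is our own illustration):

```python
# Gradient descent x_{k+1} = x_k - eta * grad f(x_k) in R^n, with plain lists.
def gradient_descent(grad, x0, eta, steps):
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Test function f(x) = 1/2 x^T A x with A = diag(1, 4): grad f(x) = (x1, 4 x2).
grad = lambda x: [x[0], 4.0 * x[1]]
x = gradient_descent(grad, [1.0, 1.0], eta=0.4, steps=200)
print(x)  # approaches the minimizer (0, 0), since 0 < eta < 2 / lambda_max
```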

We now consider how to choose the step size $\eta_k$. Intuitively, the choice of step size can affect the convergence rate of the algorithm.

Let's start from an easy example: $f(x) = \frac{a}{2} x^2$, where $a > 0$. The update is $x_{k+1} = x_k - \eta a x_k = (1 - \eta a) x_k$. Since we hope $|x_{k+1}| < |x_k|$, it requires that $|1 - \eta a| < 1$, which is equivalent to $0 < \eta < 2/a$. So $\eta \in (0, 2/a)$ suffices.
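For the quadratic $f(x) = \frac{a}{2} x^2$, gradient descent multiplies the iterate by $1 - \eta a$ each step, so $|x_k| \to 0$ iff $|1 - \eta a| < 1$. A quick numerical check (with the illustrative value $a = 2$, so the threshold is $2/a = 1$):

```python
# For f(x) = (a/2) x^2 the update is x <- (1 - eta * a) x, so the iterates
# shrink iff |1 - eta * a| < 1, i.e. 0 < eta < 2 / a.  Here a = 2.
a = 2.0

def run(eta, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - eta * a * x
    return x

good = run(eta=0.6)  # |1 - 1.2| = 0.2 < 1: converges to 0
bad = run(eta=1.1)   # |1 - 2.2| = 1.2 > 1: diverges
print(good, bad)
```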

Next, consider the multivariate function $f(x) = \frac{1}{2} x^\top A x$, where $A$ is symmetric positive definite. Now $\nabla f(x) = A x$. So
$$x_{k+1} = x_k - \eta A x_k = (I - \eta A) x_k.$$
It is sufficient to find a value of $\eta$ such that $|1 - \eta \lambda_i| < 1$ for all eigenvalues $\lambda_i$ of $A$, so that $\|x_{k+1}\| < \|x_k\|$. We need the following lemma.

Lemma (Rayleigh quotient)

Let $A \in \mathbb{R}^{n \times n}$ be a positive semi-definite matrix, and let $\lambda_{\min}$ and $\lambda_{\max}$ be its minimum and maximum eigenvalues, respectively. Then for all $x \ne 0$, we have
$$\lambda_{\min} \le \frac{x^\top A x}{x^\top x} \le \lambda_{\max}.$$

Proof

Since $A$ is symmetric, consider its eigendecomposition $A = Q \Lambda Q^\top$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ is the diagonal matrix consisting of $A$'s eigenvalues, and $Q$ consists of the corresponding unit-length eigenvectors. It is easy to see that $Q^\top Q = Q Q^\top = I$.
Let $y = Q^\top x$ (i.e. $x = Q y$). Then
$$\frac{x^\top A x}{x^\top x} = \frac{y^\top \Lambda y}{y^\top y} = \frac{\sum_i \lambda_i y_i^2}{\sum_i y_i^2}.$$
So clearly we have $\frac{x^\top A x}{x^\top x} \ge \lambda_{\min}$. Moreover, we have
$$\frac{\sum_i \lambda_i y_i^2}{\sum_i y_i^2} \le \frac{\sum_i \lambda_{\max} y_i^2}{\sum_i y_i^2} = \lambda_{\max},$$
which completes the proof.

Note that in this proof we do not really need $A \succeq 0$: the lemma holds for all symmetric $A$. Applying this lemma, it gives that $\eta \in (0, 2/\lambda_{\max})$ suffices in the gradient descent method for quadratic functions.

However, for general cases, we cannot expect a universal condition for $\eta$. For example, consider the function $f(x) = |x|^{3/2}$. If we choose $\eta$ to be a constant, then no matter what value $\eta$ takes, the algorithm does not work as long as $x_0 \ne 0$: near $0$ the gradient $\frac{3}{2}\sqrt{|x|}\,\operatorname{sgn}(x)$ shrinks too slowly, so a constant step overshoots and the iterates oscillate around $0$ instead of converging.
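This failure can be observed numerically. The sketch below uses $f(x) = |x|^{3/2}$ (our illustrative choice of a function whose derivative is not Lipschitz at $0$): for every constant step size, the iterates settle into an oscillation of magnitude about $(3\eta/4)^2$ instead of converging.

```python
import math

def fprime(x):
    # Derivative of f(x) = |x|^(3/2): 1.5 * sqrt(|x|) * sgn(x).
    # It shrinks only like sqrt(|x|) near 0, so it is not Lipschitz there.
    if x == 0:
        return 0.0
    return 1.5 * math.sqrt(abs(x)) * (1.0 if x > 0 else -1.0)

def gd(eta, x0=1.0, steps=10_000):
    x = x0
    for _ in range(steps):
        x = x - eta * fprime(x)
    return x

# A constant step overshoots near 0: the iterates settle into a 2-cycle
# of magnitude (3 * eta / 4)^2 rather than converging to the minimum 0.
for eta in (0.1, 0.01, 0.001):
    print(eta, abs(gd(eta)))
```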

Question

Under which assumptions can we choose a constant $\eta$ as the step size?

9.4 $L$-smooth functions

We would like to avoid functions similar to $f(x) = |x|^{3/2}$, where $f'$ changes too drastically near $0$.

Definition (Lipschitz continuity)

A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is $L$-Lipschitz, if for all $x, y$,
$$\|f(x) - f(y)\| \le L \|x - y\|.$$

We usually use the $\ell_2$-norm, unless otherwise specified.

An $L$-Lipschitz function is continuous, but may not be differentiable. Intuitively, for a Lipschitz continuous function, there exists a double cone (white) whose origin can be moved along the graph so that the whole graph always stays outside the double cone.
[Figure: a double cone sliding along the graph of a Lipschitz function]

Example
  • $f(x) = |x|$ (from $\mathbb{R}$ to $\mathbb{R}$) is $1$-Lipschitz.
  • $f(x) = a^\top x + b$, where $a \in \mathbb{R}^n$, is $\|a\|$-Lipschitz.
  • $f(x) = A x$, where $A \in \mathbb{R}^{m \times n}$, is $\sqrt{\lambda_{\max}(A^\top A)}$-Lipschitz, since
$$\|A x - A y\|^2 = (x - y)^\top A^\top A (x - y) \le \lambda_{\max}(A^\top A) \|x - y\|^2$$
by the bound for the Rayleigh quotient. In particular, if $A$ is symmetric, the Lipschitz constant is $\max_i |\lambda_i(A)|$.
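A numerical sanity check of the third example (a $2 \times 2$ symmetric matrix of our own): the stretch $\|Ax\| / \|x\|$ never exceeds $\max_i |\lambda_i|$.

```python
import math, random

# A symmetric 2x2 matrix with eigenvalues 5 and 1 (eigenvectors (1,1), (1,-1)):
A = [[3.0, 2.0],
     [2.0, 3.0]]

# The map x -> A x is Lipschitz with constant max|lambda_i| = 5: the stretch
# ||A x|| / ||x|| never exceeds 5, no matter which direction x points in.
random.seed(0)
ratios = []
for _ in range(1000):
    x = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
    Ax = [A[0][0] * x[0] + A[0][1] * x[1],
          A[1][0] * x[0] + A[1][1] * x[1]]
    nx = math.hypot(x[0], x[1])
    if nx > 1e-12:
        ratios.append(math.hypot(Ax[0], Ax[1]) / nx)

print(max(ratios))  # at most 5, approached when x is near the direction (1, 1)
```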

Recall that we hope $\nabla f$ does not change too rapidly. So we define the following notion of "smoothness".

Definition (Smoothness)

A differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz, i.e., for all $x, y$,
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.$$

Example

$f(x) = \frac{1}{2} x^\top A x + b^\top x + c$ with symmetric $A$ is $L$-smooth ($L = \max_i |\lambda_i(A)|$), since $\nabla f(x) = A x + b$.

We use the notation $A \preceq B$ if $B - A$ is positive semi-definite. Then we have the following equivalent definitions.

Lemma

Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a twice differentiable function. Then $f$ is $L$-smooth iff $-L I \preceq \nabla^2 f(x) \preceq L I$ for all $x$, where $I$ is the identity matrix. Namely, for all $x$ and all $i$, $|\lambda_i| \le L$, where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $\nabla^2 f(x)$.

Note that if $n = 1$, we can easily prove the "$\Leftarrow$" direction, since the mean value theorem gives that $f'(x) - f'(y) = f''(\xi)(x - y)$ for some $\xi$ between $x$ and $y$. However, there is no such theorem for vector-valued functions (such as $\nabla f$ when $n > 1$).

Proof
  • "$\Leftarrow$" direction. We would like to restrict the vector-valued function $\nabla f$ to a line. Fix any $x, y$ with $\nabla f(x) \ne \nabla f(y)$ (otherwise there is nothing to prove), and let $u = \frac{\nabla f(y) - \nabla f(x)}{\|\nabla f(y) - \nabla f(x)\|}$. Let $h: [0, 1] \to \mathbb{R}$ be a function defined by
$$h(t) = \langle \nabla f(x + t(y - x)), u \rangle.$$
Then, $h(1) - h(0) = \|\nabla f(y) - \nabla f(x)\|$. By the mean value theorem, there exists $\xi \in (0, 1)$ such that $h(1) - h(0) = h'(\xi) = u^\top \nabla^2 f(x + \xi(y - x)) (y - x)$. Note that
$$u^\top \nabla^2 f(x + \xi(y - x)) (y - x) \le \|\nabla^2 f(x + \xi(y - x)) (y - x)\|$$
by the Cauchy–Schwarz inequality (as $\|u\| = 1$). It implies that
$$\|\nabla f(y) - \nabla f(x)\| \le \|\nabla^2 f(x + \xi(y - x)) (y - x)\|,$$
which further gives that $\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|$. The last inequality follows from the third example of Lipschitz functions, since all eigenvalues of the symmetric matrix $\nabla^2 f(x + \xi(y - x))$ lie in $[-L, L]$.
  • "$\Rightarrow$" direction. Fix any $x$ and any vector $v$. Let $h$ be a function defined by
$$h(t) = \langle \nabla f(x + t v), v \rangle.$$
Then, by the Cauchy–Schwarz inequality and the $L$-smoothness, we have
$$|h(t) - h(0)| = |\langle \nabla f(x + t v) - \nabla f(x), v \rangle| \le L |t| \|v\|^2,$$
which further gives that $\left| \frac{h(t) - h(0)}{t} \right| \le L \|v\|^2$. Taking the limit $t \to 0$ on both sides, and applying the chain rule, we obtain that
$$|h'(0)| = |v^\top \nabla^2 f(x) v| \le L \|v\|^2.$$
Thus, $-L I \preceq \nabla^2 f(x) \preceq L I$.

An $L$-smooth function may not be convex. If $f$ is further convex, the absolute values in the following lemma are not necessary.

Lemma

Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a differentiable function. Then $f$ is $L$-smooth iff for all $x, y$,
$$\left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right| \le \frac{L}{2} \|y - x\|^2.$$

Recall that $f$ is convex iff $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$ for all $x, y$, which shows that $f$ is underestimated by an affine function. Now, if $f$ is $L$-smooth, it is overestimated by a quadratic function:
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2.$$
[Figure: an affine underestimate and a quadratic overestimate of $f$ at $x$]

Proof
  • "$\Rightarrow$" direction. Fix $x, y$. Define $h(t) = f(x + t(y - x))$, so that $h'(t) = \langle \nabla f(x + t(y - x)), y - x \rangle$. Note that for all $t \in [0, 1]$,
$$|h'(t) - h'(0)| = |\langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle| \le L t \|y - x\|^2$$
by the Cauchy–Schwarz inequality and the $L$-smoothness. Since $f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 (h'(t) - h'(0)) \, dt$, we conclude that
$$\left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right| \le \int_0^1 L t \|y - x\|^2 \, dt = \frac{L}{2} \|y - x\|^2.$$
  • "$\Leftarrow$" direction. Fix $x$ and a vector $v$, and let $y = x + t v$ for $t > 0$. Applying the assumed inequality to the pair $(x, y)$ and to the pair $(y, x)$, and adding the two bounds, it holds that
$$|\langle \nabla f(x + t v) - \nabla f(x), t v \rangle| \le L t^2 \|v\|^2.$$
Note that dividing by $t^2$ and letting $t \to 0$ gives $|v^\top \nabla^2 f(x) v| \le L \|v\|^2$ when $f$ is twice differentiable. We now have $-L I \preceq \nabla^2 f(x) \preceq L I$, so the previous lemma shows that $f$ is $L$-smooth, which completes the proof.
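As a numerical illustration of the quadratic bound (our own example: $f = \sin$ is $1$-smooth, since $|f''| = |{\sin}| \le 1$):

```python
import math

# Check |f(y) - f(x) - f'(x)(y - x)| <= (L/2)(y - x)^2 for f = sin, L = 1,
# over a grid of pairs (x, y) in [-5, 5] x [-5, 5].
L = 1.0
pts = [i * 0.1 - 5.0 for i in range(101)]
worst = 0.0
for x in pts:
    for y in pts:
        lhs = abs(math.sin(y) - math.sin(x) - math.cos(x) * (y - x))
        rhs = (L / 2.0) * (y - x) ** 2
        worst = max(worst, lhs - rhs)

print(worst)  # never positive: the quadratic bound holds at every grid pair
```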

Recall that we hope to find a value of the step size $\eta$ such that $f(x_{k+1}) < f(x_k)$. Now we assume that $f$ is $L$-smooth. Then the quadratic overestimate gives
$$f(x - \eta \nabla f(x)) \le f(x) - \eta \left(1 - \frac{L \eta}{2}\right) \|\nabla f(x)\|^2 < f(x)$$
if we set $\eta \in (0, 2/L)$ (and $\nabla f(x) \ne 0$). In particular, if we choose $\eta = 1/L$, it gives the following descent lemma.

Lemma (Descent lemma)

For an $L$-smooth differentiable function $f$ (not necessarily convex), and $x^+ = x - \frac{1}{L} \nabla f(x)$, we have
$$f(x^+) \le f(x) - \frac{1}{2L} \|\nabla f(x)\|^2.$$
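A numeric check of this inequality on our own test function $f(x) = \frac{1}{2}(x_1^2 + 4 x_2^2)$, which is $L$-smooth with $L = 4$ (the largest eigenvalue of its Hessian $\operatorname{diag}(1, 4)$):

```python
# Descent lemma: with x+ = x - (1/L) grad f(x), one step decreases f by
# at least ||grad f(x)||^2 / (2L).  Here f(x) = 0.5 * (x1^2 + 4 x2^2), L = 4.
L = 4.0

def f(x):
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)

def grad(x):
    return [x[0], 4.0 * x[1]]

x = [1.0, -2.0]
g = grad(x)
x_plus = [x[0] - g[0] / L, x[1] - g[1] / L]        # x+ = x - (1/L) grad f(x)
lhs = f(x_plus)
rhs = f(x) - (g[0] ** 2 + g[1] ** 2) / (2.0 * L)   # f(x) - ||grad f||^2 / (2L)
print(lhs, rhs)  # lhs <= rhs, as the lemma guarantees
```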