Lecture 13. Proximal Gradient Descent
...


13.1 Proximal operator and proximal gradient descent
...

So far we have only considered minimizing differentiable functions. If the objective function is not differentiable, clearly neither gradient descent nor Newton's method works. In this lecture, we focus on how to solve the optimization problem for a much larger family of convex functions. We will generalize gradient descent to a method called proximal gradient descent, which handles nondifferentiable functions.

Recall the gradient descent iteration $x_{k+1} = x_k - \eta_k \nabla f(x_k)$, where we let $x_{k+1}$ be the minimum point of
$$\hat f(x) = f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2\eta_k}\|x - x_k\|^2.$$
Now we assume $f$ is not differentiable, but can be divided into two parts:
$$f(x) = g(x) + h(x),$$
where $g$ is convex and differentiable, and $h$ is convex but not necessarily differentiable. Then we define
$$\hat f(x) = g(x_k) + \nabla g(x_k)^\top (x - x_k) + \frac{1}{2\eta_k}\|x - x_k\|^2 + h(x)$$
to approximate $f$ and let $x_{k+1} = \arg\min_x \hat f(x)$ to approximate the minimum point of $f$.

Note that
$$\hat f(x) = \frac{1}{2\eta_k}\big\|x - (x_k - \eta_k \nabla g(x_k))\big\|^2 + h(x) + C,$$
where $C$ is a constant not depending on $x$ (assuming that $x_k$ and $\eta_k$ are fixed). So we obtain
$$x_{k+1} = \arg\min_x \left\{ \frac{1}{2\eta_k}\big\|x - (x_k - \eta_k \nabla g(x_k))\big\|^2 + h(x) \right\}.$$
Here $x_k - \eta_k \nabla g(x_k)$ is the gradient descent iteration if we would like to optimize only $g$. So roughly speaking, after adding $h$, we hope $x_{k+1}$ stays close to the gradient descent point $x_k - \eta_k \nabla g(x_k)$ while not making $h(x_{k+1})$ large.
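To check the completing-the-square step explicitly (a short verification):
$$\frac{1}{2\eta_k}\big\|x - (x_k - \eta_k \nabla g(x_k))\big\|^2 = \frac{1}{2\eta_k}\|x - x_k\|^2 + \nabla g(x_k)^\top (x - x_k) + \frac{\eta_k}{2}\|\nabla g(x_k)\|^2,$$
so the constant is $C = g(x_k) - \frac{\eta_k}{2}\|\nabla g(x_k)\|^2$.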

Now we define the proximal operator as follows.

Definition (Proximal operator)

Given a convex function $h$ and $t > 0$, let
$$\operatorname{prox}_{h,t}(x) = \arg\min_{z} \left\{ \frac{1}{2t}\|z - x\|^2 + h(z) \right\}.$$

Then we can rewrite the proximal gradient descent method as
$$x_{k+1} = \operatorname{prox}_{h,\eta_k}\!\big(x_k - \eta_k \nabla g(x_k)\big).$$
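As a concrete illustration, here is a minimal Python sketch of this iteration (not from the lecture): the callables `grad_g` and `prox_h` and the fixed step size are placeholders assumed to be supplied by the user.

```python
import numpy as np

def proximal_gradient_descent(grad_g, prox_h, x0, step, num_iters=100):
    """Minimize f = g + h via x_{k+1} = prox_{h, step}(x_k - step * grad g(x_k)).

    grad_g : callable, returns the gradient of the smooth part g at x
    prox_h : callable (v, t) -> argmin_z (1/(2t)) * ||z - v||^2 + h(z)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - step * grad_g(x), step)  # gradient step on g, then prox step on h
    return x
```

With $h = 0$ the proximal operator is the identity, and this reduces to plain gradient descent.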

Tip

Another viewpoint of the proximal operator is as a discretization of the gradient flow. Recall the gradient flow
$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -\nabla f(x(t)).$$
The forward discretization (the forward Euler method) is gradient descent, where
$$\frac{x_{k+1} - x_k}{\eta_k} = -\nabla f(x_k).$$
Similarly, we can also try to discretize it in a backward form (the backward Euler method):
$$\frac{x_{k+1} - x_k}{\eta_k} = -\nabla f(x_{k+1}).$$
However, the iteration becomes difficult since we need to find the point $x_{k+1}$ satisfying the above equation. Actually, this is what the proximal operator is doing. Let
$$x_{k+1} = \operatorname{prox}_{f,\eta_k}(x_k) = \arg\min_z \left\{ \frac{1}{2\eta_k}\|z - x_k\|^2 + f(z) \right\}.$$
Then we have $\frac{x_{k+1} - x_k}{\eta_k} = -\nabla f(x_{k+1})$, since the first-order optimality condition of this minimization (for differentiable $f$) reads
$$\frac{1}{\eta_k}(x_{k+1} - x_k) + \nabla f(x_{k+1}) = 0.$$
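For example (a quick sanity check of this viewpoint), take $f(x) = \frac{1}{2}\|x\|^2$. Minimizing $\frac{1}{2\eta_k}\|z - x_k\|^2 + \frac{1}{2}\|z\|^2$ over $z$ gives $x_{k+1} = \frac{x_k}{1 + \eta_k}$, which is exactly the solution of the backward Euler equation $\frac{x_{k+1} - x_k}{\eta_k} = -x_{k+1}$.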

13.2 LASSO
...

The key is that we need to decompose $f = g + h$ properly. Clearly we can set $g = 0$ and $h = f$. But this decomposition is meaningless since we do not know how to compute the proximal operator of $h = f$ at all.

Fortunately, for some important problems we have a "good" decomposition. For example, consider the problem of linear regression while avoiding overfitting. Suppose we have a data set $\{(a_i, b_i)\}_{i=1}^n$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$, and we assume that $b_i \approx a_i^\top x^*$ for some coefficient vector $x^* \in \mathbb{R}^d$.

Let $A = (a_1, \dots, a_n)^\top \in \mathbb{R}^{n \times d}$ and $b = (b_1, \dots, b_n)^\top \in \mathbb{R}^n$. We can compute the coefficients by solving the following optimization problem (the least squares method):
$$\min_x \frac{1}{2}\|Ax - b\|_2^2.$$
However, there are usually some redundant coefficients that may cause overfitting. So we hope to add some constraints, such as: the number of nonzero entries in $x$ is at most $s$. Unfortunately, the problem is no longer a convex optimization problem after adding this constraint (why?). We still need to approximate it. One idea is to use the following approximation:
$$\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1,$$
where $\lambda > 0$ is a parameter. This method is called LASSO (least absolute shrinkage and selection operator).

The objective function has a clear decomposition: $g(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $h(x) = \lambda \|x\|_1$. The advantage is that the proximal operator is easy to compute with respect to this $h$.

Given $x$, the proximal mapping is
$$\operatorname{prox}_{h,t}(x) = \arg\min_z \left\{ \frac{1}{2t}\|z - x\|^2 + \lambda \|z\|_1 \right\}.$$
Note that in this optimization, all entries of $z$ are independent, so we can solve it by solving the following one-dimensional problem for each entry:
$$\min_{z_i} \left\{ \frac{1}{2t}(z_i - x_i)^2 + \lambda |z_i| \right\},$$
whose solution is $z_i = S_{\lambda t}(x_i)$ with
$$S_{\lambda t}(x_i) = \begin{cases} x_i - \lambda t, & x_i > \lambda t, \\ 0, & |x_i| \le \lambda t, \\ x_i + \lambda t, & x_i < -\lambda t, \end{cases}$$
where $S_{\lambda t}$ is called the soft thresholding operator.
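To verify this formula (a short check): for $z_i > 0$ the one-dimensional objective is differentiable with derivative $\frac{1}{t}(z_i - x_i) + \lambda$, which vanishes at $z_i = x_i - \lambda t$; this stationary point is indeed positive exactly when $x_i > \lambda t$. Symmetrically, $z_i = x_i + \lambda t$ is the minimizer when $x_i < -\lambda t$. In the remaining case $|x_i| \le \lambda t$, neither branch yields a valid stationary point and the minimum is attained at $z_i = 0$.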

Finally, recall that $\nabla g(x) = A^\top (Ax - b)$, so the iteration of the proximal gradient descent is
$$x_{k+1} = S_{\lambda \eta_k}\big(x_k - \eta_k A^\top (A x_k - b)\big).$$
This algorithm is called ISTA (the iterative soft-thresholding algorithm).
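As an illustration, here is a minimal ISTA sketch in Python based on the formulas above (the synthetic data `A`, `b`, the parameter `lam`, and the constant step size $1/L$ with $L = \|A\|_2^2$ are assumptions for the example, not part of the lecture).

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft thresholding operator S_tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    """Solve min_x 0.5 * ||Ax - b||^2 + lam * ||x||_1 by proximal gradient descent."""
    step = 1.0 / np.linalg.norm(A, ord=2) ** 2  # 1/L with L = ||A||_2^2, the smoothness constant of g
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                         # gradient of g(x) = 0.5 * ||Ax - b||^2
        x = soft_threshold(x - step * grad, lam * step)  # prox of lam * ||.||_1 with parameter step
    return x

# A tiny synthetic example (for illustration only)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista(A, b, lam=0.1)
```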

13.3 Correctness and convergence
...

We now show the correctness and convergence of the proximal gradient descent.

Assume that $g$ is $L$-smooth and set the step size $t = \eta_k = 1/L$. Sometimes we write the proximal gradient descent in the following form:
$$x_{k+1} = x_k - t\, G_t(x_k), \qquad \text{where} \quad G_t(x) = \frac{1}{t}\Big(x - \operatorname{prox}_{h,t}\big(x - t \nabla g(x)\big)\Big).$$
We first show that $G_t(x) = 0$ if and only if $x$ is a minimum point of $f$. A trivial example is $h = 0$. Then we have $\operatorname{prox}_{h,t}(x) = x$ and thus $G_t(x) = \nabla g(x)$. So $G_t(x) = 0$ if and only if $x$ is a minimizer of $f = g$. The following theorem asserts the general case.

Theorem

In the proximal gradient descent iterations, $f(x_k) = f^*$ iff $x_{k+1} = x_k$ (equivalently, $G_t(x_k) = 0$), where $f^*$ is the minimum value of $f$.

The "only if" direction is easy. Since $g$ is $L$-smooth, we have
$$g(x) \le g(x_k) + \nabla g(x_k)^\top (x - x_k) + \frac{L}{2}\|x - x_k\|^2,$$
and hence $f(x) \le \hat f(x)$ for all $x$, where (recall $t = 1/L$)
$$\hat f(x) = g(x_k) + \nabla g(x_k)^\top (x - x_k) + \frac{L}{2}\|x - x_k\|^2 + h(x).$$
It yields that
$$f^* \le f(x_{k+1}) \le \hat f(x_{k+1}) \le \hat f(x_k) = f(x_k).$$
If $f(x_k) = f^*$, all inequalities are tight, so $\hat f(x_{k+1}) = \hat f(x_k)$. However, because $g(x_k) + \nabla g(x_k)^\top (x - x_k) + h(x)$ is convex and $\frac{L}{2}\|x - x_k\|^2$ is strictly convex, $\hat f$ is also strictly convex, and thus has a unique minimizer, namely $x_{k+1}$. Since $\hat f(x_k) = \hat f(x_{k+1}) = \min_x \hat f(x)$, the point $x_k$ is also a minimizer, which gives that $x_{k+1} = x_k$.

Now we would like to show that if $x_{k+1} = x_k$ then $f(x_k) = f^*$. In other words, for all $y$, $f(y) \ge f(x_k)$. How can we show this? A naive idea is to show that $g(y) \ge g(x_k)$ and $h(y) \ge h(x_k)$. However, this idea looks too good to be true.

A reasonable method is to use convexity. Note that $g(y) \ge g(x_k) + \nabla g(x_k)^\top (y - x_k)$ since $g$ is convex. So we hope
$$h(y) \ge h(x_k) - \nabla g(x_k)^\top (y - x_k).$$
This inequality looks like the first-order condition for $h$, where $-\nabla g(x_k)$ plays the role of the gradient of $h$ at $x_k$. Since $h$ is not differentiable, we introduce the notion of subgradients.

Definition

Let $h$ be a convex function and $x \in \operatorname{dom}(h)$. We say $u$ is a subgradient of $h$ at $x$, denoted by $u \in \partial h(x)$, if for all $y$, $h(y) \ge h(x) + u^\top (y - x)$.

If $h$ is convex, then subgradients always exist (but may not be unique). Just consider a supporting hyperplane of the epigraph of $h$.
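For example, for $h(x) = |x|$ on $\mathbb{R}$ we have $\partial h(x) = \{\operatorname{sign}(x)\}$ for $x \ne 0$ and $\partial h(0) = [-1, 1]$: every line through the origin with slope in $[-1, 1]$ lies below the graph of $|x|$.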

Lemma

If $z = \operatorname{prox}_{h,t}(x)$, then $\frac{1}{t}(x - z) \in \partial h(z)$.

This lemma immediately implies that if $x_{k+1} = \operatorname{prox}_{h,t}(x_k - t \nabla g(x_k)) = x_k$, then $-\nabla g(x_k) \in \partial h(x_k)$. Thus, for all $y$,
$$f(y) = g(y) + h(y) \ge g(x_k) + \nabla g(x_k)^\top (y - x_k) + h(x_k) - \nabla g(x_k)^\top (y - x_k) = f(x_k),$$
so $x_k$ is a minimizer of $f$ and $f(x_k) = f^*$.

Proof of the Lemma

By the definition of $z = \operatorname{prox}_{h,t}(x)$, we have that for all $y$,
$$\frac{1}{2t}\|z - x\|^2 + h(z) \le \frac{1}{2t}\|y - x\|^2 + h(y).$$
Our goal is to show that for all $y$, $h(y) \ge h(z) + \frac{1}{t}(x - z)^\top (y - z)$.
For the sake of contradiction, assume that there exists $y_0$ such that $h(y_0) < h(z) + \frac{1}{t}(x - z)^\top (y_0 - z)$. Define
$$\delta = h(z) + \frac{1}{t}(x - z)^\top (y_0 - z) - h(y_0).$$
Then $\delta > 0$.
Let $\epsilon \in (0, 1]$ and $y_\epsilon = (1 - \epsilon) z + \epsilon y_0$. By convexity we have
$$h(y_\epsilon) \le (1 - \epsilon) h(z) + \epsilon h(y_0).$$
Since
$$\frac{1}{2t}\|y_\epsilon - x\|^2 = \frac{1}{2t}\|z - x\|^2 + \frac{\epsilon}{t}(z - x)^\top (y_0 - z) + \frac{\epsilon^2}{2t}\|y_0 - z\|^2,$$
it follows that
$$\frac{1}{2t}\|y_\epsilon - x\|^2 + h(y_\epsilon) \le \frac{1}{2t}\|z - x\|^2 + h(z) - \epsilon \delta + \frac{\epsilon^2}{2t}\|y_0 - z\|^2 < \frac{1}{2t}\|z - x\|^2 + h(z)$$
if $\epsilon$ is sufficiently small (e.g., $\epsilon < 2t\delta / \|y_0 - z\|^2$). This contradicts the definition of $z = \operatorname{prox}_{h,t}(x)$.

Finally, we present the convergence rate results.

Theorem

Suppose $g$ is $L$-smooth and we set $\eta_k = 1/L$. Then we have
$$f(x_k) - f^* \le \frac{L \|x_0 - x^*\|^2}{2k}.$$
If $g$ is further $\mu$-strongly convex, then
$$\|x_k - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x^*\|^2.$$
If the smoothness constant $L$ is unknown, we can use the exact/backtracking line search, and the resulting convergence rates are the same as those of gradient descent.