Lecture 20. Bregman Divergence and Mirror Descent


20.1 Mirror descent: the proximal point view

Recall the gradient descent method, where we optimize the local model
$$\hat{f}(x) = f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2\eta} \|x - x_t\|_2^2$$
and let $x_{t+1} = \arg\min_x \hat{f}(x) = x_t - \eta \nabla f(x_t)$. A natural question is: can we use functions other than quadratics to approximate $f$? Clearly, we hope the approximating function is easy to optimize and somehow adapts to the "geometry" of the problem.

The mirror descent framework allows us to do precisely this. Specifically, given an objective function $f$, we choose a convex function $h$ to build the approximation. Then we use the Bregman divergence with respect to $h$ to replace the squared Euclidean norm in $\hat{f}$ and still let $x_{t+1}$ be the minimizer of
$$f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{\eta} D_h(x, x_t),$$
where the Bregman divergence is defined by
$$D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle,$$
and thus can be viewed as the gap between $h$ and its linear approximation at $y$.

[Figure: the Bregman divergence $D_h(x, y)$ as the vertical gap between $h(x)$ and the tangent of $h$ at $y$.]
Dropping the constant terms (which depend only on $x_t$ but not on $x$), the update step of mirror descent is given by
$$x_{t+1} = \arg\min_x \left\{ \langle \eta \nabla f(x_t), x \rangle + D_h(x, x_t) \right\},$$
or equivalently,
$$x_{t+1} = \arg\min_x \left\{ \eta \langle \nabla f(x_t), x - x_t \rangle + D_h(x, x_t) \right\}.$$

Remark

What is the "right" choice of $h$ to minimize the function $f$? A little thought shows that the "best" $h$ should equal $f$ itself, because adding $D_f(x, x_t)$ to the linear approximation of $f$ at $x_t$ gives us back exactly $f(x)$. Of course, the update now requires us to minimize $f$, which is the original problem. So we should choose a function $h$ that is somehow "similar" to $f$, and makes the update step tractable.

Bregman divergence

We now discuss the Bregman divergence in more detail.

Definition (Bregman divergence)

Let $h : \Omega \to \mathbb{R}$ be a continuously differentiable and strictly convex function on a convex set $\Omega$. Then the Bregman divergence from $y$ to $x$ with respect to the function $h$ is defined by
$$D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle.$$

Here are some examples.

Example
  1. Euclidean distance. Let $h(x) = \frac{1}{2}\|x\|_2^2$. Then the Bregman divergence with respect to $h$ is $$D_h(x, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x - y \rangle = \frac{1}{2}\|x - y\|_2^2.$$
  2. Negative entropy. Let $\Omega = \Delta_n = \{x \in \mathbb{R}_{>0}^n : \sum_i x_i = 1\}$ be the (open) standard simplex, and $h(x) = \sum_{i=1}^n x_i \log x_i$ be the negative entropy function over $\Omega$. Then the Bregman divergence with respect to $h$ is $$D_h(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i}.$$ This is called the relative entropy, or Kullback-Leibler divergence (KL divergence), between probability distributions $x$ and $y$, measuring the expected number of extra bits required to code samples from distribution $x$ using a code optimized for $y$ rather than the code optimized for $x$. (Both cases are verified numerically in the sketch after this list.)
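
To make these formulas concrete, here is a minimal numerical sketch (our own illustration; helper names such as `bregman` are not from the lecture) that evaluates $D_h$ straight from the definition and checks it against the two closed forms above:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman divergence D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Example 1: h(x) = (1/2)||x||_2^2 gives the squared Euclidean distance.
h_euc, grad_euc = lambda x: 0.5 * np.dot(x, x), lambda x: x

# Example 2: negative entropy gives the KL divergence on the simplex.
h_ent, grad_ent = lambda x: np.dot(x, np.log(x)), lambda x: 1.0 + np.log(x)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.1, 0.3])

assert np.isclose(bregman(h_euc, grad_euc, x, y), 0.5 * np.sum((x - y) ** 2))
assert np.isclose(bregman(h_ent, grad_ent, x, y), np.sum(x * np.log(x / y)))
# Asymmetry: D_h(x, y) != D_h(y, x) in general (see the remark below).
print(bregman(h_ent, grad_ent, x, y), bregman(h_ent, grad_ent, y, x))
```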

Since $h$ is a strictly convex function, for any fixed $y$, we know that $x \mapsto D_h(x, y)$ is also a (strictly) convex function in the first argument $x$ (it differs from $h$ only by an affine function). But $D_h(x, y)$ is not convex in the second argument $y$ in general.

Remark

It is clear that $D_h(x, x) = 0$ for all $x$. Since $h$ is strictly convex, by the first-order condition for convexity, we know that $D_h(x, y) > 0$ if $x \neq y$. Furthermore, if $h$ is $\alpha$-strongly convex, then $D_h(x, y) \geq \frac{\alpha}{2}\|x - y\|^2$ by definition. So the Bregman divergence somehow measures the (squared) distance from $x$ to $y$. But we should note that in general the Bregman divergence is NOT symmetric: $D_h(x, y) \neq D_h(y, x)$. For example, see the KL divergence.

Consider a well-known puzzle: given points $x_1, \dots, x_n$, the goal is to find a point $y$ minimizing the total squared distance $\sum_{i=1}^n \|x_i - y\|_2^2$. A natural idea is to choose the mean of $x_1, \dots, x_n$. For example, in a triangle, the centroid is the point that minimizes the sum of the squared distances to the three vertices. The Bregman divergence encodes exactly the kind of (squared) distance for which the mean of the distribution works.

Lemma

Suppose $X$ is a random variable over an open set $\Omega$ with distribution $\mu$. Then $\mathbb{E}_{X \sim \mu}[D_h(X, y)]$ is minimized at $y = \mathbb{E}[X]$.

Proof

For any $y \in \Omega$, we have
$$\mathbb{E}[D_h(X, y)] - \mathbb{E}[D_h(X, \mathbb{E}[X])] = h(\mathbb{E}[X]) - h(y) - \langle \nabla h(y), \mathbb{E}[X] - y \rangle = D_h(\mathbb{E}[X], y),$$
where the first equality uses $\mathbb{E}\big[\langle \nabla h(\mathbb{E}[X]), X - \mathbb{E}[X] \rangle\big] = 0$ and the fact that the $h(X)$ terms cancel. This must be nonnegative, and equals zero if and only if $y = \mathbb{E}[X]$.
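
As a quick numerical sanity check of the lemma (our own illustration, not part of the lecture), one can sample points from the simplex and compare $\mathbb{E}[D_h(X, y)]$ at $y = \mathbb{E}[X]$ against random alternatives for the KL divergence:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(x, y):
    return np.sum(x * np.log(x / y))

X = rng.dirichlet(np.ones(4), size=1000)   # random points in the simplex
mean = X.mean(axis=0)

def expected_div(y):
    return np.mean([kl(x, y) for x in X])

# By the lemma (applied to the empirical distribution), the mean minimizes
# E[D_h(X, y)], and the gap equals exactly D_h(mean, y).
for _ in range(5):
    y = rng.dirichlet(np.ones(4))
    assert expected_div(y) >= expected_div(mean) - 1e-12
    assert np.isclose(expected_div(y) - expected_div(mean), kl(mean, y))
```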

Perhaps a surprising result is that the Bregman divergence is an exhaustive notion for such (squared) distances. In other words, if a distance satisfies the above lemma, then it must be a Bregman divergence. See, e.g., [1] or [2] for proof details.

The Bregman divergence is also the right way to describe the (squared) distance from a point to a convex set. Recall that, in Lecture 4, we showed the following lemma, which says that the angle at the projection point $\bar{x}$ between $x - \bar{x}$ and $y - \bar{x}$ is obtuse.

Lemma

Let $C \subseteq \mathbb{R}^n$ be a nonempty, closed and convex set. Given $x \in \mathbb{R}^n$ and its projection $\bar{x} = \Pi_C(x) = \arg\min_{z \in C} \|x - z\|_2$, for any $y \in C$, it holds that $\langle x - \bar{x}, y - \bar{x} \rangle \leq 0$.
[Figure: projection onto a convex set $C$; for any $y \in C$, the angle at $\bar{x} = \Pi_C(x)$ between $x - \bar{x}$ and $y - \bar{x}$ is obtuse.]

We now establish a similar result using the Bregman divergence. If $\bar{x}$ is the Bregman projection of $x$ onto a closed convex set $C \subseteq \Omega$, namely,
$$\bar{x} = \arg\min_{z \in C} D_h(z, x),$$
then for all $y \in C$, it holds that
$$D_h(y, x) \geq D_h(y, \bar{x}) + D_h(\bar{x}, x).$$
In the Euclidean case, it also means that the angle at $\bar{x}$ is obtuse, by the law of cosines
$$\|y - x\|_2^2 = \|y - \bar{x}\|_2^2 + \|\bar{x} - x\|_2^2 + 2\langle y - \bar{x}, \bar{x} - x \rangle;$$
for this reason the inequality is often called the generalized Pythagorean theorem. The proof is a simple application of the law of cosines for Bregman divergences. Since $\bar{x}$ minimizes the convex function $z \mapsto D_h(z, x)$ over $C$, the first-order optimality condition gives
$$\big\langle \nabla_z D_h(z, x)\big|_{z = \bar{x}}, \, y - \bar{x} \big\rangle \geq 0 \quad \text{for all } y \in C.$$
Note that $\nabla_z D_h(z, x) = \nabla h(z) - \nabla h(x)$. So the above inequality is equivalent to
$$\langle \nabla h(\bar{x}) - \nabla h(x), y - \bar{x} \rangle \geq 0.$$
Then the proof concludes with the following lemma (by setting $x := y$, $y := \bar{x}$, and $z := x$).

Lemma (Law of cosines for Bregman divergence)

For all $x, y, z$, it holds that
$$D_h(x, y) + D_h(y, z) = D_h(x, z) + \langle \nabla h(z) - \nabla h(y), x - y \rangle,$$
which follows by expanding each divergence from the definition.
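
For $h$ the negative entropy over the positive orthant, the Bregman (KL) projection of a positive vector onto the simplex turns out to be plain normalization, which makes the generalized Pythagorean inequality easy to check numerically. A minimal sketch (our own illustration; for this affine constraint the inequality in fact holds with equality):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(x, y):
    # Unnormalized KL: the Bregman divergence of h(x) = sum x_i log x_i
    # over the positive orthant (the correction terms vanish on the simplex).
    return np.sum(x * np.log(x / y)) - np.sum(x) + np.sum(y)

x = rng.uniform(0.1, 2.0, size=5)   # a positive point outside the simplex
x_bar = x / x.sum()                 # its Bregman projection onto the simplex

for _ in range(5):
    y = rng.dirichlet(np.ones(5))   # an arbitrary point of the simplex
    assert kl(y, x) >= kl(y, x_bar) + kl(x_bar, x) - 1e-12
```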
20.2 Mirror descent: the mirror map view

A different view of the mirror descent framework is the one originally presented by Arkadi Nemirovski and David Yudin. Recall that in gradient descent, we update the iterate by $x_{t+1} = x_t - \eta \nabla f(x_t)$. However, the gradient is actually defined as a linear functional on $\mathbb{R}^n$ (a linear map from the vector space $\mathbb{R}^n$ into its underlying field $\mathbb{R}$). Hence, $\nabla f(x_t)$ naturally belongs to the dual space of $\mathbb{R}^n$. The fact that we represent this functional as a vector is a matter of convenience, and highly depends on the choice of coordinates. In fact, that is why gradient descent is not affinely invariant.

In the vanilla gradient descent method, we only consider $\mathbb{R}^n$ with the $\ell_2$-norm, and this normed space is self-dual, so it is perhaps reasonable to combine points in the primal space (the iterates $x_t$) with objects in the dual space (the gradients $\nabla f(x_t)$). But when working with other normed spaces, adding a linear map to a vector might not be the right thing to do.

Instead, Nemirovski and Yudin propose the following:

  1. We map our current point $x_t$ to a point $\theta_t$ in the dual space using a mirror map.
  2. Next, we take the gradient step $\theta_{t+1} = \theta_t - \eta \nabla f(x_t)$.
  3. We map $\theta_{t+1}$ back to a point $x_{t+1}'$ in the primal space using the inverse of the mirror map from Step 1.
  4. If we are in the constrained case, this point $x_{t+1}'$ might not be in the convex feasible region $C$, so we still need to project it back to a close point $x_{t+1} \in C$.

How do we choose these mirror maps? Again, this comes down to understanding the geometry of the problem, the kinds of functions and feasible sets we care about. We usually choose a proper differentiable and strongly convex function $h$, and define the mirror map by the gradient of $h$, that is, $\theta = \nabla h(x)$. Since $h$ is differentiable and strongly convex, its gradient is strictly monotone, and thus the inverse mirror map $(\nabla h)^{-1}$ exists. We can use these maps in the Nemirovski-Yudin process, namely, we set
$$\theta_t = \nabla h(x_t), \qquad \theta_{t+1} = \theta_t - \eta \nabla f(x_t), \qquad x_{t+1}' = (\nabla h)^{-1}(\theta_{t+1}), \qquad x_{t+1} = \arg\min_{z \in C} D_h(z, x_{t+1}').$$

[Figure: the mirror descent scheme, moving between the primal space (iterates, projection) and the dual space (gradient step) via $\nabla h$ and $(\nabla h)^{-1}$.]

The name of the process comes from thinking of the dual space as being a mirror image of the primal space.

But why do this view and the proximal point view give the same algorithm? Consider the update rule in the proximal point view (in the unconstrained case),
$$x_{t+1} = \arg\min_x \left\{ \eta \langle \nabla f(x_t), x \rangle + D_h(x, x_t) \right\},$$
and consider the gradient of the Bregman divergence, $\nabla_x D_h(x, x_t) = \nabla h(x) - \nabla h(x_t)$. Since the objective is a convex function of $x$, we obtain that the minimizer $x_{t+1}$ satisfies
$$\eta \nabla f(x_t) + \nabla h(x_{t+1}) - \nabla h(x_t) = 0,$$
which is
$$\nabla h(x_{t+1}) = \nabla h(x_t) - \eta \nabla f(x_t).$$
Rearranging terms, this is exactly a step of the update in the dual space: $\theta_{t+1} = \theta_t - \eta \nabla f(x_t)$ with $\theta_t = \nabla h(x_t)$.
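
For a concrete instance, take $h$ to be the negative entropy and $C$ the simplex. The proximal view then has the closed-form multiplicative update $x_{t+1} \propto x_t e^{-\eta \nabla f(x_t)}$, while the mirror map view walks through the dual space and KL-projects back (which on the simplex is just normalization). A minimal numpy sketch (our own illustration) confirming that the two routes produce the same iterate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eta = 5, 0.1
x = np.full(n, 1.0 / n)     # current iterate x_t (uniform)
g = rng.normal(size=n)      # stand-in for the gradient of f at x_t

# Proximal point view: closed-form multiplicative update.
x_prox = x * np.exp(-eta * g)
x_prox /= x_prox.sum()

# Mirror map view: primal -> dual -> gradient step -> primal -> projection.
theta = 1.0 + np.log(x)      # mirror map: grad h(x) = 1 + log x
theta -= eta * g             # gradient step in the dual space
x_mir = np.exp(theta - 1.0)  # inverse mirror map: (grad h)^{-1}(theta)
x_mir /= x_mir.sum()         # KL projection onto the simplex = normalization

assert np.allclose(x_prox, x_mir)
```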

Dual space and dual norm

Given any vector space $V$ over a field $\mathbb{F}$, the (algebraic) dual space is defined as the set of all linear maps $\varphi : V \to \mathbb{F}$ (linear functionals). Since linear maps are vector space homomorphisms, the dual space may be denoted $V^* = \operatorname{Hom}(V, \mathbb{F})$. The dual space $V^*$ itself becomes a vector space over $\mathbb{F}$ when equipped with the addition and scalar multiplication satisfying $(\varphi + \psi)(x) = \varphi(x) + \psi(x)$ and $(a \varphi)(x) = a \varphi(x)$ for all $\varphi, \psi \in V^*$, $x \in V$, and $a \in \mathbb{F}$.

If $V$ is finite-dimensional, then $V^*$ has the same dimension as $V$. In particular, if $\mathbb{R}^n$ is interpreted as the space of columns of $n$ real numbers, its dual space is typically written as the space of rows of $n$ real numbers. Such a row acts on $\mathbb{R}^n$ as a linear functional by ordinary matrix multiplication. This is because a functional maps every $n$-vector $x$ into a real number $y$. Seeing this functional as a matrix $M$, with $x$ an $n \times 1$ matrix and $y$ a $1 \times 1$ matrix (trivially, a real number), if $Mx = y$ then, by dimension reasons, $M$ must be a $1 \times n$ matrix, that is, a row vector. So there is an isomorphism between $\mathbb{R}^n$ (and any finite-dimensional vector space $V$) and its dual space.

However, it is not a canonical isomorphism. Informally, an isomorphism is a map that preserves sets and relations among elements. When this map or correspondence can be established with no choices involved, it is called a canonical isomorphism. When we defined the isomorphism from $V$ to $V^*$ we did so by picking a special basis (the dual basis); therefore the isomorphism from $V$ to $V^*$ is not canonical. But for the double dual $V^{**}$ of a finite-dimensional vector space $V$ (the dual of the vector space $V^*$), there is a canonical isomorphism. Indeed, the map $\Psi : V \to V^{**}$ defined as follows is a canonical isomorphism: for any $x \in V$, $\Psi(x)$ is a map from $V^*$ to $\mathbb{F}$ given by
$$\Psi(x)(\varphi) = \varphi(x).$$

Given a norm $\|\cdot\|$ on a vector space $V$, its dual norm, denoted $\|\cdot\|_*$, is a function (a norm) of a linear functional $\varphi$ belonging to $V^*$ defined by
$$\|\varphi\|_* = \sup\{\varphi(x) : x \in V, \ \|x\| \leq 1\}.$$
In particular, for $V = \mathbb{R}^n$, a linear functional can be represented by a vector $y$ with the inner product. Thus, the dual norm is given by
$$\|y\|_* = \sup\{\langle y, x \rangle : \|x\| \leq 1\}.$$
By the Cauchy-Schwarz inequality, the dual norm of the $\ell_2$-norm is again the $\ell_2$-norm. In general, the dual of the $\ell_p$-norm is the $\ell_q$-norm, where $\frac{1}{p} + \frac{1}{q} = 1$ and we assume $\frac{1}{\infty} = 0$ for convenience.

Similar to the double dual space, for a finite-dimensional space $V$ with norm $\|\cdot\|$, we have $\|\cdot\|_{**} = \|\cdot\|$.
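
The $\ell_p$-$\ell_q$ duality is easy to probe numerically: Hölder's inequality gives $\langle y, x \rangle \leq \|y\|_q \|x\|_p$, and the supremum in the dual-norm definition is attained at an explicit extremal vector. A small sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 1.5, 3.0                            # Hölder conjugates: 1/p + 1/q = 1
y = rng.normal(size=6)
dual = np.sum(np.abs(y) ** q) ** (1 / q)   # claimed dual norm ||y||_q

# <y, x> never exceeds ||y||_q on the unit l_p sphere (Hölder's inequality) ...
for _ in range(1000):
    x = rng.normal(size=6)
    x /= np.sum(np.abs(x) ** p) ** (1 / p)
    assert np.dot(y, x) <= dual + 1e-12

# ... and the bound is attained at x_i = sign(y_i)|y_i|^{q-1} / ||y||_q^{q-1}.
x_star = np.sign(y) * np.abs(y) ** (q - 1) / dual ** (q - 1)
assert np.isclose(np.sum(np.abs(x_star) ** p), 1.0)   # on the unit l_p sphere
assert np.isclose(np.dot(y, x_star), dual)
```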

20.3 Convex conjugate

We now focus on how to implement mirror descent. We need to show that the inverse mirror map $(\nabla h)^{-1}$ can be computed efficiently.

For any convex function $f$ with domain $\mathbb{R}^n$, the gradient of $f$ at some point $x$ is a vector (actually a covector) $\theta$ satisfying
$$f(y) \geq f(x) + \langle \theta, y - x \rangle \quad \text{for all } y.$$
More generally, the subdifferential of $f$ at $x$ is the set of all such vectors, namely,
$$\partial f(x) = \{\theta : f(y) \geq f(x) + \langle \theta, y - x \rangle \text{ for all } y\}.$$
Rearranging terms, we obtain $\langle \theta, x \rangle - f(x) \geq \langle \theta, y \rangle - f(y)$ for all $y$. Note that the inequality holds with equality at $y = x$. It gives that
$$\langle \theta, x \rangle - f(x) = \sup_y \left\{ \langle \theta, y \rangle - f(y) \right\}.$$
Thus we can rewrite the subdifferential as
$$\partial f(x) = \Big\{\theta : \langle \theta, x \rangle - f(x) = \sup_y \{\langle \theta, y \rangle - f(y)\}\Big\}.$$
We can now introduce the convex conjugate of a function.

Definition (Convex conjugate)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Its convex conjugate is the function $f^*$ given by
$$f^*(\theta) = \sup_x \left\{ \langle \theta, x \rangle - f(x) \right\}.$$

Note that for any fixed $x$, $\langle \theta, x \rangle - f(x)$ is an affine function of $\theta$. Thus $f^*$ is a convex function of $\theta$ (by the convexity of the pointwise supremum of convex functions).

In fact, $f^*$ is defined on the dual space of $\mathbb{R}^n$. Roughly speaking, for each $\theta$, one can think of it as the hyperplane $y = \langle \theta, x \rangle$ with normal vector $\theta$. Then $f^*(\theta)$ gives the largest (directed) vertical distance between this hyperplane and the graph of $f$. In other words, $f^*(\theta)$ is how far down you can translate the hyperplane so that the entire hyperplane lies just below the graph of $f$, namely, $y = \langle \theta, x \rangle - f^*(\theta)$ becomes a supporting hyperplane of the epigraph. So this definition can be interpreted as an encoding of the convex hull of the function's epigraph in terms of its supporting hyperplanes.
[Figure: the convex conjugate $f^*(\theta)$ as the largest vertical gap between the hyperplane $y = \langle \theta, x \rangle$ and the graph of $f$.]

Example

We now see some examples.

  1. Let $f(x) = \langle a, x \rangle + b$ be an affine function. Its convex conjugate is $$f^*(\theta) = \begin{cases} -b & \text{if } \theta = a, \\ +\infty & \text{otherwise.} \end{cases}$$
  2. Let $f(x) = \frac{1}{2} x^\top Q x$ be a quadratic function with $Q$ positive definite. Its convex conjugate is $$f^*(\theta) = \frac{1}{2} \theta^\top Q^{-1} \theta.$$ (This case is checked numerically in the sketch after this list.)
  3. Let $f(x) = e^x$. Its convex conjugate is $$f^*(\theta) = \theta \log \theta - \theta \quad (\theta > 0),$$ with $f^*(0) = 0$ and $f^*(\theta) = +\infty$ for $\theta < 0$.
  4. Let $f(x) = \sum_{i=1}^n x_i \log x_i$ be the negative entropy over the simplex $\Delta_n$. Its convex conjugate is $$f^*(\theta) = \log \sum_{i=1}^n e^{\theta_i}.$$
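
The quadratic case (example 2) can be verified numerically by solving the inner supremum with a generic optimizer. A minimal sketch (our own illustration, using `scipy.optimize.minimize`):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Q = A @ A.T + np.eye(3)          # a positive definite matrix
theta = rng.normal(size=3)

f = lambda x: 0.5 * x @ Q @ x
# f*(theta) = sup_x { <theta, x> - f(x) }: minimize the negated objective.
numeric = -minimize(lambda x: f(x) - theta @ x, np.zeros(3)).fun
closed = 0.5 * theta @ np.linalg.solve(Q, theta)

assert np.isclose(numeric, closed, atol=1e-6)
```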

It is easy to see that $f^*(\theta) = \langle \theta, x \rangle - f(x)$ (in particular, for $\theta = \nabla f(x)$ if $f$ is differentiable at $x$) if and only if $\theta \in \partial f(x)$. Otherwise ($\theta \notin \partial f(x)$) we have $f^*(\theta) > \langle \theta, x \rangle - f(x)$, which gives the following Fenchel's inequality.

Theorem (Fenchel’s inequality)

For all $x$ and $\theta$, we have
$$f(x) + f^*(\theta) \geq \langle \theta, x \rangle.$$
The equality holds if and only if $\theta = \nabla f(x)$ (or $\theta \in \partial f(x)$ in general).

It is still not easy to compute $(\nabla h)^{-1}$ from Fenchel's inequality alone. We need the following theorem, together with its direct corollary below.

Theorem (Fenchel-Moreau theorem)

For any closed convex function $f$, we have $f^{**} = f$.

Proving the theorem in full generality (where the domain of $f$ is some $\operatorname{dom} f \subseteq \mathbb{R}^n$) requires a bit of care. But it is relatively straightforward to show that the result holds on the interior of $\operatorname{dom} f$. For simplicity, we only consider the case where $\operatorname{dom} f = \mathbb{R}^n$. The proof consists of two parts: (1) proving that $f^{**}(x) \leq f(x)$ for all $x$; (2) proving that $f^{**}(x) \geq f(x)$.

Proof

By definition we have
$$f^{**}(x) = \sup_\theta \left\{ \langle \theta, x \rangle - f^*(\theta) \right\}.$$
Note that $f^*(\theta) \geq \langle \theta, x \rangle - f(x)$ for every $\theta$, by the definition of $f^*$. In particular,
$$\langle \theta, x \rangle - f^*(\theta) \leq f(x) \quad \text{for every } \theta.$$
Thus, $f^{**}(x) \leq f(x)$.

For any $x$, let $\theta = \nabla f(x)$ (or $\theta \in \partial f(x)$ for general non-differentiable $f$). Then by the equality case of Fenchel's inequality we have $f^*(\theta) = \langle \theta, x \rangle - f(x)$. So
$$f^{**}(x) \geq \langle \theta, x \rangle - f^*(\theta) = f(x).$$

Now we can show the following corollary.

Corollary

If $f$ is differentiable and strictly convex, then
$$\nabla f^* = (\nabla f)^{-1}.$$
More generally, if the domain of $f$ is $\mathbb{R}^n$, then $x \in \partial f^*(\theta)$ if and only if $\theta \in \partial f(x)$.

Proof

Let $\theta = \nabla f(x)$. By the equality case of Fenchel's inequality we have
$$f(x) + f^*(\theta) = \langle \theta, x \rangle.$$
By the Fenchel-Moreau theorem, it is equivalent to
$$f^{**}(x) + f^*(\theta) = \langle \theta, x \rangle,$$
which gives $x = \nabla f^*(\theta)$ if we apply Fenchel's inequality (now to the pair $f^*$ and $f^{**}$) again.

Example

Consider the negative entropy $h(x) = \sum_{i=1}^n x_i \log x_i$ on the simplex $\Delta_n$. We know that $h^*(\theta) = \log \sum_{i=1}^n e^{\theta_i}$. If we take the gradient of $h^*$, then
$$(\nabla h^*(\theta))_i = \frac{e^{\theta_i}}{\sum_{j=1}^n e^{\theta_j}},$$
the softmax function, which inverts the mirror map $\nabla h(x) = (1 + \log x_i)_i$ on the simplex.
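
A quick numerical check of this inversion (our own illustration): applying the softmax to $\nabla h(x) = 1 + \log x$ recovers $x$ for any point of the simplex.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(theta):
    z = np.exp(theta - theta.max())   # shift by the max for numerical stability
    return z / z.sum()

x = rng.dirichlet(np.ones(6))          # a random point of the simplex
theta = 1.0 + np.log(x)                # the mirror map: grad h(x)
assert np.allclose(softmax(theta), x)  # grad h* inverts grad h on the simplex
```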

Now we put the convex conjugate together with the Bregman divergence. Let $h$ be a differentiable and strictly convex function. Then $h^*$ is also differentiable and strictly convex. The Bregman divergences with respect to $h$ and $h^*$ are
$$D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle$$
and
$$D_{h^*}(\theta, \phi) = h^*(\theta) - h^*(\phi) - \langle \nabla h^*(\phi), \theta - \phi \rangle.$$
Let $\theta = \nabla h(y)$ and $\phi = \nabla h(x)$ in the Bregman divergence with respect to $h^*$. Then we have $\nabla h^*(\phi) = x$, $h^*(\theta) = \langle \theta, y \rangle - h(y)$, and $h^*(\phi) = \langle \phi, x \rangle - h(x)$. Thus $D_{h^*}(\theta, \phi)$ simplifies to
$$D_{h^*}(\theta, \phi) = \langle \theta, y \rangle - h(y) - \langle \phi, x \rangle + h(x) - \langle x, \theta - \phi \rangle = h(x) - h(y) - \langle \nabla h(y), x - y \rangle,$$
which gives the following result.

Theorem

Let $h$ be a differentiable and strictly convex function. Then for any $x, y$ it holds that
$$D_h(x, y) = D_{h^*}(\nabla h(y), \nabla h(x)).$$
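
A quick check of the theorem on the quadratic pair $h(x) = \frac{1}{2} x^\top Q x$, $h^*(\theta) = \frac{1}{2} \theta^\top Q^{-1} \theta$, where $\nabla h(x) = Qx$ and $\nabla h^*(\theta) = Q^{-1}\theta$ (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))
Q = A @ A.T + np.eye(3)               # positive definite
Qinv = np.linalg.inv(Q)
x, y = rng.normal(size=3), rng.normal(size=3)

D_h = 0.5 * x @ Q @ x - 0.5 * y @ Q @ y - (Q @ y) @ (x - y)
tx, ty = Q @ x, Q @ y                 # grad h(x), grad h(y)
D_hstar = 0.5 * ty @ Qinv @ ty - 0.5 * tx @ Qinv @ tx - (Qinv @ tx) @ (ty - tx)

assert np.isclose(D_h, D_hstar)       # D_h(x, y) = D_{h*}(grad h(y), grad h(x))
```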

20.4 Convergence of mirror descent

We now consider the convergence analysis of mirror descent. Similar to the analysis for gradient descent, we hope to establish the connection between $f(x_t)$ and $f(x^*)$ in terms of Bregman divergences. The basic ingredient is the first-order optimality condition used for the generalized Pythagorean theorem. In general, given any convex function $\varphi$, let $z$ be the following minimizer:
$$z = \arg\min_{x \in C} \left\{ \varphi(x) + D_h(x, y) \right\}.$$
Then for all $u \in C$, it holds that
$$\varphi(z) + D_h(z, y) \leq \varphi(u) + D_h(u, y) - D_h(u, z).$$
Recall the mirror descent update
$$x_{t+1} = \arg\min_{x \in C} \left\{ \eta \langle \nabla f(x_t), x \rangle + D_h(x, x_t) \right\}.$$
It gives that for all $u \in C$,
$$\eta \langle \nabla f(x_t), x_{t+1} \rangle + D_h(x_{t+1}, x_t) \leq \eta \langle \nabla f(x_t), u \rangle + D_h(u, x_t) - D_h(u, x_{t+1}).$$
Rearranging terms, we obtain that
$$\eta \langle \nabla f(x_t), x_t - u \rangle \leq \eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) + D_h(u, x_t) - D_h(u, x_{t+1}).$$
Note that $f(x_t) - f(u) \leq \langle \nabla f(x_t), x_t - u \rangle$ by convexity. Hence we have
$$\eta \big(f(x_t) - f(u)\big) \leq \eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) + D_h(u, x_t) - D_h(u, x_{t+1})$$
for all $u \in C$. Now we can give the following theorem.

Theorem

Let $f$ be a convex function that is $L$-Lipschitz with respect to the Bregman divergence $D_h$, in the sense that $\langle \nabla f(x), x - y \rangle \leq L \sqrt{2 D_h(y, x)}$ for all $x, y \in C$. Suppose $D_h(x^*, x_1)$ can be bounded by $R^2$. Then by selecting
$$\eta = \frac{R}{L} \sqrt{\frac{2}{T}},$$
it holds that
$$f\Big(\frac{1}{T} \sum_{t=1}^T x_t\Big) - f(x^*) \leq \frac{1}{T} \sum_{t=1}^T f(x_t) - f(x^*) \leq RL \sqrt{\frac{2}{T}}.$$

In other words, if we would like to obtain an (approximate) answer with error at most $\epsilon$, it is sufficient to run mirror descent for $T = O(R^2 L^2 / \epsilon^2)$ steps.

Proof

By the previous analysis we have
$$\eta \big(f(x_t) - f(x^*)\big) \leq \eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) + D_h(x^*, x_t) - D_h(x^*, x_{t+1}).$$
Summing over both sides from $t = 1$ to $T$, it implies that
$$\eta \sum_{t=1}^T \big(f(x_t) - f(x^*)\big) \leq \sum_{t=1}^T \Big( \eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) \Big) + D_h(x^*, x_1).$$
The remaining part is to bound $\eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t)$.
Since $f$ is $L$-Lipschitz with respect to $D_h$, we have
$$\eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) \leq \eta L \sqrt{2 D_h(x_{t+1}, x_t)} - D_h(x_{t+1}, x_t) \leq \frac{\eta^2 L^2}{2}.$$
Thus we obtain that
$$\frac{1}{T} \sum_{t=1}^T f(x_t) - f(x^*) \leq \frac{\eta L^2}{2} + \frac{R^2}{\eta T} = RL \sqrt{\frac{2}{T}},$$
and the claim follows by the convexity of $f$ (Jensen's inequality).

Note that this result holds even for non-differentiable $f$. We only need to replace $\nabla f(x_t)$ by some subgradient $g_t \in \partial f(x_t)$ in the previous analysis.

To see the advantage of mirror descent, suppose $f$ is $G$-Lipschitz with respect to some norm $\|\cdot\|$ (which means the gradient of $f$ can be bounded by $G$ with respect to its dual norm $\|\cdot\|_*$), and $h$ is $\alpha$-strongly convex with respect to the same norm. Then $f$ is $(G/\sqrt{\alpha})$-Lipschitz with respect to the Bregman divergence $D_h$, since $\langle \nabla f(x), x - y \rangle \leq G \|x - y\| \leq G \sqrt{2 D_h(y, x)/\alpha}$. We can choose a particular norm and a particular Bregman divergence to capture the geometry of the problem.

We now give an example. Suppose $C = \Delta_n$ is the (open) probability simplex, and we use the KL divergence for $D_h$, which is $1$-strongly convex with respect to the $\ell_1$-norm (Pinsker's inequality). The dual norm of the $\ell_1$-norm is the $\ell_\infty$-norm. Then we can bound $R^2 = D_h(x^*, x_1)$ by using the KL divergence, and it is at most $\log n$ if we set $x_1 = (1/n, \dots, 1/n)$, since $x^*$ lies in the probability simplex. Suppose the objective function $f$ is $G$-Lipschitz with respect to the $\ell_1$-norm, i.e., $\|\nabla f(x)\|_\infty \leq G$ (and thus $f$ is only $\sqrt{n} G$-Lipschitz with respect to the $\ell_2$-norm in the worst case). So mirror descent requires $T = O(G^2 \log n / \epsilon^2)$ steps to approximate $f(x^*)$ within $\epsilon$, which is smaller than the $O(n G^2 / \epsilon^2)$ steps of subgradient descent by an order of $n / \log n$. Note the saving of the factor $n$ comes from the norm of the gradient, by replacing the $\ell_2$-norm with the $\ell_\infty$-norm (decreasing $L^2$ by an order of $n$), at a slight cost of increasing $R^2$ by a $\log n$ factor.
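
To see the bound in action, here is a small runnable demo (our own illustration) of entropic mirror descent on a linear objective $f(x) = \langle c, x \rangle$ over the simplex, using the step size suggested by the theorem with $\alpha = 1$, $L = G = \|c\|_\infty$, and $R^2 = \log n$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 50, 2000
c = rng.uniform(size=n)              # gradient of f(x) = <c, x> is constantly c
R, L = np.sqrt(np.log(n)), np.max(np.abs(c))
eta = (R / L) * np.sqrt(2.0 / T)

x = np.full(n, 1.0 / n)              # start at the uniform distribution
avg = np.zeros(n)
for _ in range(T):
    avg += x / T
    x *= np.exp(-eta * c)            # multiplicative (entropic mirror) update
    x /= x.sum()

print(c @ avg - c.min())             # observed optimality gap ...
print(R * L * np.sqrt(2.0 / T))      # ... vs. the theoretical bound RL*sqrt(2/T)
```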

Furthermore, if $f$ is $\beta$-smooth with respect to some norm $\|\cdot\|$ (the gradient of $f$ is $\beta$-Lipschitz continuous), namely,
$$\|\nabla f(x) - \nabla f(y)\|_* \leq \beta \|x - y\| \quad \text{for all } x, y,$$
then the convergence rate can be better.

Theorem

Let $f$ be a convex and $\beta$-smooth function with respect to some norm $\|\cdot\|$, and $h$ be an $\alpha$-strongly convex function with respect to the same norm. Suppose $D_h(x^*, x_1)$ can be bounded by $R^2$. Then by selecting $\eta = \alpha/\beta$, it holds that
$$f\Big(\frac{1}{T} \sum_{t=1}^T x_{t+1}\Big) - f(x^*) \leq \frac{\beta R^2}{\alpha T}.$$

This result gives an $O(1/T)$ convergence rate; equivalently, $T = O(\beta R^2 / (\alpha \epsilon))$ steps suffice to obtain an $\epsilon$-approximate optimal value.

Proof

We start again from
$$\eta \big(f(x_t) - f(x^*)\big) \leq \eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) + D_h(x^*, x_t) - D_h(x^*, x_{t+1}).$$
Now we bound $\eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t)$ by $\eta \big(f(x_t) - f(x_{t+1})\big)$. Since $f$ is $\beta$-smooth and $h$ is $\alpha$-strongly convex, we have
$$f(x_{t+1}) \leq f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{\beta}{2} \|x_{t+1} - x_t\|^2$$
and
$$D_h(x_{t+1}, x_t) \geq \frac{\alpha}{2} \|x_{t+1} - x_t\|^2.$$
Thus it follows that
$$\eta \langle \nabla f(x_t), x_t - x_{t+1} \rangle - D_h(x_{t+1}, x_t) \leq \eta \big(f(x_t) - f(x_{t+1})\big) + \frac{\eta \beta - \alpha}{2} \|x_{t+1} - x_t\|^2 \leq \eta \big(f(x_t) - f(x_{t+1})\big)$$
by selecting $\eta = \alpha / \beta$. Plugging this into the inequality above, it gives that
$$\eta \big(f(x_{t+1}) - f(x^*)\big) \leq D_h(x^*, x_t) - D_h(x^*, x_{t+1}).$$
The remaining part is the same as the previous proof. Summing over both sides from $t = 1$ to $T$, we obtain that
$$\frac{1}{T} \sum_{t=1}^T f(x_{t+1}) - f(x^*) \leq \frac{D_h(x^*, x_1)}{\eta T} \leq \frac{\beta R^2}{\alpha T}.$$

References

[1] A. Banerjee, X. Guo, and H. Wang, "On the optimality of conditional expectation as a Bregman predictor," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2664-2669, July 2005.

[2] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, and J. Lafferty, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, no. 10, 2005.