Recall the gradient descent method, where we optimize $\min_x f(x)$ by the update $x_{t+1} = x_t - \eta_t \nabla f(x_t)$. In the proximal point view, this update can equivalently be written as
$x_{t+1} = \arg\min_x \left\{ \eta_t \langle \nabla f(x_t), x \rangle + \tfrac{1}{2}\|x - x_t\|_2^2 \right\},$
i.e., we move in the direction suggested by the linearized objective while the squared Euclidean distance keeps us close to the current iterate. The Euclidean term hard-codes a particular geometry, and one may hope to replace it by a distance better adapted to the problem at hand.
The mirror descent framework allows us to do precisely this. Specifically, given an objective function $f$ and a strictly convex, differentiable "distance-generating" function $h$, mirror descent performs the update
$x_{t+1} = \arg\min_x \left\{ \eta_t \langle \nabla f(x_t), x \rangle + D_h(x, x_t) \right\},$
where $D_h$ is the Bregman divergence associated with $h$ (defined below).
Dropping the constant terms (which depend only on $x_t$), the update is equivalent to
$x_{t+1} = \arg\min_x \left\{ \eta_t \langle \nabla f(x_t), x \rangle + h(x) - \langle \nabla h(x_t), x \rangle \right\}.$
What is the "right" choice of $h$? As we will see, this comes down to the geometry of the feasible set and of the gradients: $h = \frac{1}{2}\|\cdot\|_2^2$ recovers gradient descent, while for the probability simplex the negative entropy turns out to be the right choice.
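As a quick sanity check (a small Python sketch, not part of the notes; it assumes NumPy and SciPy), we can verify numerically that with $h = \frac{1}{2}\|\cdot\|_2^2$ the proximal update above reduces to the plain gradient step:

```python
import numpy as np
from scipy.optimize import minimize

# With h(x) = ||x||^2 / 2, the Bregman divergence is the squared Euclidean
# distance, so the proximal update
#     argmin_x  eta * <g, x> + 0.5 * ||x - x_t||^2
# should coincide with the gradient step x_t - eta * g.
rng = np.random.default_rng(0)
x_t, g, eta = rng.normal(size=3), rng.normal(size=3), 0.1

obj = lambda x: eta * g @ x + 0.5 * np.sum((x - x_t) ** 2)
x_next = minimize(obj, x_t).x

assert np.allclose(x_next, x_t - eta * g, atol=1e-4)
```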
We now discuss the Bregman divergence in more detail.
Definition. Let $h$ be a differentiable convex function. The Bregman divergence (with respect to $h$) from $y$ to $x$ is
$D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle,$
the gap at $x$ between $h$ and its linearization around $y$.
Here are some examples.
- Squared Euclidean distance: for $h(x) = \frac{1}{2}\|x\|_2^2$ we have $\nabla h(y) = y$. Since $\frac{1}{2}\|x\|^2 - \frac{1}{2}\|y\|^2 - \langle y, x - y \rangle = \frac{1}{2}\|x - y\|^2$, the divergence is exactly the squared Euclidean distance.
- KL divergence: for the negative entropy $h(x) = \sum_i x_i \log x_i$ on the probability simplex, $D_h(x, y) = \sum_i x_i \log(x_i / y_i)$, the Kullback–Leibler divergence.
It is clear that $D_h(x, y) \ge 0$ by convexity of $h$, with $D_h(x, y) > 0$ for $x \neq y$ when $h$ is strictly convex; note, however, that $D_h$ is in general neither symmetric nor a metric.
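To make the examples concrete, here is a small Python sketch (illustrative only, assuming NumPy) that evaluates the definition for both choices of $h$ and compares against the closed forms:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

sq = lambda v: 0.5 * v @ v                 # h(x) = ||x||^2 / 2
neg_ent = lambda v: np.sum(v * np.log(v))  # negative entropy

x = np.array([0.2, 0.3, 0.5])              # both points on the simplex
y = np.array([0.1, 0.6, 0.3])

# squared Euclidean distance and KL divergence, respectively
print(bregman(sq, lambda v: v, x, y), 0.5 * np.sum((x - y) ** 2))
print(bregman(neg_ent, lambda v: np.log(v) + 1, x, y),
      np.sum(x * np.log(x / y)))
```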
Consider a well-known puzzle: given $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$, which single point $y$ minimizes the average squared Euclidean distance $\frac{1}{n}\sum_i \|x_i - y\|^2$? The answer is the mean $\frac{1}{n}\sum_i x_i$.
Suppose we now replace the squared Euclidean distance by an arbitrary Bregman divergence. Remarkably, the answer stays the same.
Lemma. For any random variable $X$ and any Bregman divergence $D_h$, $\arg\min_y \mathbb{E}[D_h(X, y)] = \mathbb{E}[X]$. Indeed, writing $\mu = \mathbb{E}[X]$, a direct computation gives $\mathbb{E}[D_h(X, y)] - \mathbb{E}[D_h(X, \mu)] = D_h(\mu, y) \ge 0$.
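The lemma is easy to check numerically. The sketch below (illustrative, assuming NumPy/SciPy) uses the divergence generated by negative entropy on the positive orthant, $D_h(x, y) = \sum_i [x_i \log(x_i/y_i) - x_i + y_i]$, and finds the minimizer of the average divergence:

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of the lemma for the divergence generated by
# h(v) = sum v log v on the positive orthant (generalized KL).
rng = np.random.default_rng(1)
pts = rng.uniform(0.1, 1.0, size=(20, 4))     # 20 random positive points

D = lambda x, y: np.sum(x * np.log(x / y) - x + y)
avg_div = lambda y: np.mean([D(x, y) for x in pts])

best = minimize(avg_div, np.full(4, 0.5), bounds=[(1e-6, None)] * 4).x
print(best)              # numerically matches the coordinate-wise mean:
print(pts.mean(axis=0))
```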
Perhaps surprisingly, Bregman divergences are an exhaustive class for such (squared) distances. In other words, if a notion of distance satisfies the above lemma, then it must be a Bregman divergence. See, e.g., [1] or [2] for proof details.
The Bregman divergence is also the right way to describe the (squared) distance from a point to a convex set. Recall that, in Lecture 4, we showed the following lemma, which says that the angle at $\Pi_K(x)$ between $x$ and any point of $K$ is obtuse.
Lemma. Let $K$ be a closed convex set and let $\Pi_K(x) = \arg\min_{z \in K} \|x - z\|_2$ be the Euclidean projection of $x$ onto $K$. Then for every $y \in K$, $\langle x - \Pi_K(x),\, y - \Pi_K(x) \rangle \le 0$; consequently $\|x - y\|^2 \ge \|x - \Pi_K(x)\|^2 + \|\Pi_K(x) - y\|^2$.
We now establish a similar result using Bregman divergence. If $\Pi_K^h(x) := \arg\min_{z \in K} D_h(z, x)$ denotes the Bregman projection of $x$ onto $K$, then for every $y \in K$,
$D_h(y, x) \ge D_h(y, \Pi_K^h(x)) + D_h(\Pi_K^h(x), x),$
a "generalized Pythagorean theorem".
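Here is a small numerical illustration (not from the notes, assuming NumPy): for $h$ the negative entropy, the Bregman projection of a positive vector onto the simplex is simply renormalization, and the generalized Pythagorean inequality can be checked at random points:

```python
import numpy as np

# Generalized KL divergence on the positive orthant.
D = lambda x, y: np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(2)
x = rng.uniform(0.1, 2.0, size=5)      # a point outside the simplex
proj = x / x.sum()                     # its KL projection onto the simplex

for _ in range(100):                   # generalized Pythagoras check
    y = rng.dirichlet(np.ones(5))      # random point of the simplex
    assert D(y, x) >= D(y, proj) + D(proj, x) - 1e-9
```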
A different view of the mirror descent framework is the one originally presented by Arkadi Nemirovski and David Yudin. Recall that in gradient descent we update the iterate by $x_{t+1} = x_t - \eta_t \nabla f(x_t)$. Conceptually, the point $x_t$ and the gradient $\nabla f(x_t)$ are objects of different types: the gradient is a linear functional on the space where the iterates live, so it naturally belongs to the dual space.
In the vanilla gradient descent method, we only consider the Euclidean setting and silently identify the space with its dual, which is why subtracting a (scaled) gradient from a point appears to make sense.
Instead, Nemirovski and Yudin propose the following: map the current point to the dual space using a "mirror map" $\nabla h$, take the gradient step there, and map back:
$\theta_t = \nabla h(x_t), \qquad \theta_{t+1} = \theta_t - \eta_t \nabla f(x_t), \qquad x_{t+1} = (\nabla h)^{-1}(\theta_{t+1}),$
followed, if there is a constraint set $K$, by the Bregman projection of $x_{t+1}$ onto $K$.
How do we choose these mirror maps? Again, this comes down to understanding the geometry of the problem, the kinds of functions and feasible sets we want to handle: $h = \frac{1}{2}\|\cdot\|_2^2$ (so that $\nabla h$ is the identity) recovers gradient descent, while the negative entropy is the right choice for the probability simplex.
The name of the process comes from thinking of the dual space as being a mirror image of the primal space.
But why do this view and the proximal point view give the same algorithm? Consider the update rule in the proximal point view, $x_{t+1} = \arg\min_x \{\eta \langle \nabla f(x_t), x \rangle + D_h(x, x_t)\}$. Setting the gradient with respect to $x$ to zero gives $\eta \nabla f(x_t) + \nabla h(x_{t+1}) - \nabla h(x_t) = 0$, i.e., $\nabla h(x_{t+1}) = \nabla h(x_t) - \eta \nabla f(x_t)$, which is exactly the Nemirovski–Yudin update.
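We can confirm the equivalence numerically (a sketch under the entropy setup on the positive orthant, where $(\nabla h)^{-1}(\theta) = e^{\theta - 1}$; assumes NumPy/SciPy):

```python
import numpy as np
from scipy.optimize import minimize

# h = negative entropy on the positive orthant:
# grad h(x) = log(x) + 1, so (grad h)^{-1}(theta) = exp(theta - 1).
rng = np.random.default_rng(3)
x_t, g, eta = rng.uniform(0.1, 1, 4), rng.normal(size=4), 0.2

# Nemirovski-Yudin view: step in the dual space, then map back.
theta = (np.log(x_t) + 1) - eta * g
x_mirror = np.exp(theta - 1)           # equals x_t * exp(-eta * g)

# Proximal view: minimize eta*<g, x> + D_h(x, x_t) directly.
D = lambda x, y: np.sum(x * np.log(x / y) - x + y)
x_prox = minimize(lambda x: eta * g @ x + D(x, x_t), x_t,
                  bounds=[(1e-9, None)] * 4).x

assert np.allclose(x_mirror, x_prox, atol=1e-4)
```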
Given any vector space $V$ over the reals, its dual space $V^*$ is the vector space of all linear functionals $\ell : V \to \mathbb{R}$.
If $V$ is finite-dimensional, then $V^*$ has the same dimension, so $V$ and $V^*$ are isomorphic: fix a basis $e_1, \dots, e_n$ of $V$ and map $e_i$ to the functional $e_i^*$ defined by $e_i^*(e_j) = 1$ if $i = j$ and $0$ otherwise.
However, it is not a canonical isomorphism. Informally, an isomorphism is a map that preserves the relevant structure and relations among elements; when this map or correspondence can be established with no arbitrary choices involved, it is called a canonical isomorphism. When we defined the isomorphism $V \cong V^*$ above we had to choose a basis, and different bases give different isomorphisms. By contrast, the map sending $x \in V$ to the evaluation functional $\ell \mapsto \ell(x)$ is a canonical isomorphism between $V$ and its double dual $V^{**}$.
Similar to the double dual space, for a finite-dimensional space with norm $\|\cdot\|$, the dual space carries the dual norm $\|\theta\|_* := \sup_{\|x\| \le 1} \langle \theta, x \rangle$, and the dual of the dual norm is the original norm. For example, $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are dual to each other, and $\|\cdot\|_2$ is its own dual.
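A quick numerical illustration of dual norms (a sketch assuming NumPy): sampling points on the $\ell_1$ unit sphere, the supremum of $\langle \theta, x \rangle$ approaches $\|\theta\|_\infty$:

```python
import numpy as np

# Sanity check: the dual of the l1 norm is the l_infinity norm.
rng = np.random.default_rng(4)
theta = rng.normal(size=6)

xs = rng.normal(size=(10000, 6))
xs /= np.abs(xs).sum(axis=1, keepdims=True)   # points with ||x||_1 = 1
approx_sup = (xs @ theta).max()

assert approx_sup <= np.abs(theta).max() + 1e-12
print(approx_sup, np.abs(theta).max())        # close; equal in the limit
```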
We now focus on how to implement mirror descent. We need to show that the inverse gradient map $(\nabla h)^{-1}$, required by the Nemirovski–Yudin view, exists and can be computed. The key tool is the convex conjugate.
Definition. For any convex function $h$, its convex conjugate (Fenchel conjugate) is
$h^*(\theta) := \sup_x \{\langle \theta, x \rangle - h(x)\}.$
Let us record a first property. Note that for any fixed $x$, the map $\theta \mapsto \langle \theta, x \rangle - h(x)$ is affine in $\theta$; hence $h^*$, as a pointwise supremum of affine functions, is always convex.
In fact, $h^*$ is convex even when $h$ is not.
We now see some examples.
It is easy to see that $h(x) = \frac{1}{2}\|x\|_2^2$ is its own conjugate: for each $\theta$, the supremum $\sup_x \{\langle \theta, x \rangle - \frac{1}{2}\|x\|_2^2\}$ is attained at $x = \theta$, so $h^*(\theta) = \frac{1}{2}\|\theta\|_2^2$.
For the negative entropy $h(x) = \sum_i x_i \log x_i$ on the positive orthant, the supremum decomposes coordinate-wise; maximizing $\theta_i x_i - x_i \log x_i$ over $x_i > 0$ gives $x_i = e^{\theta_i - 1}$ and hence $h^*(\theta) = \sum_i e^{\theta_i - 1}$.
It is still not easy, for a general $h$, to compute $(\nabla h)^{-1}$ directly. The theorem below shows that we never have to invert anything: the gradient of the conjugate is exactly the inverse of the gradient.
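The entropy computation can be double-checked numerically (a sketch assuming NumPy/SciPy; the inner supremum is computed by a bounded minimization of its negation):

```python
import numpy as np
from scipy.optimize import minimize

# Check: for h(x) = sum x log x on the positive orthant,
# h*(theta) = sum exp(theta - 1).
h = lambda x: np.sum(x * np.log(x))
theta = np.array([0.3, -0.7, 1.2])

conj = -minimize(lambda x: h(x) - theta @ x, np.ones(3),
                 bounds=[(1e-9, None)] * 3).fun
print(conj, np.sum(np.exp(theta - 1)))   # agree up to solver tolerance
```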
Theorem. For any convex function $h$ that is closed (i.e., has a closed epigraph), we have $h^{**} = h$; moreover, if $h$ is differentiable and strictly convex, then $\nabla h^* = (\nabla h)^{-1}$.
Proving the theorem in full generality (the domain of $h$ may be a strict subset of $\mathbb{R}^n$, and $h$ need not be differentiable) requires some care, so we only sketch the argument under the simplifying assumption that $h$ is differentiable and strictly convex on all of $\mathbb{R}^n$, with $\nabla h$ surjective.
By definition we have $h^*(\theta) \ge \langle \theta, x \rangle - h(x)$ for every pair $(x, \theta)$; rearranged, this is the Fenchel–Young inequality $h(x) + h^*(\theta) \ge \langle \theta, x \rangle$.
For any fixed $\theta$, the map $x \mapsto \langle \theta, x \rangle - h(x)$ is strictly concave, so the supremum defining $h^*(\theta)$ is attained at the unique point $x(\theta)$ satisfying the first-order condition $\nabla h(x(\theta)) = \theta$; in particular, $\nabla h$ is a bijection with inverse $\theta \mapsto x(\theta)$.
Now we can show that $\nabla h^* = (\nabla h)^{-1}$. If we differentiate $h^*(\theta) = \langle \theta, x(\theta) \rangle - h(x(\theta))$ with respect to $\theta$, the chain rule gives $\nabla h^*(\theta) = x(\theta) + (Dx(\theta))^\top (\theta - \nabla h(x(\theta)))$. Substituting the first-order condition $\nabla h(x(\theta)) = \theta$, the second term vanishes, leaving $\nabla h^*(\theta) = x(\theta) = (\nabla h)^{-1}(\theta)$, as claimed.
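For the entropy example the theorem can be verified directly (a small NumPy sketch): $\nabla h(x) = \log x + 1$ and $\nabla h^*(\theta) = e^{\theta - 1}$ are inverse maps:

```python
import numpy as np

# Check grad h* = (grad h)^{-1} for h = negative entropy.
grad_h = lambda x: np.log(x) + 1
grad_h_star = lambda t: np.exp(t - 1)

x = np.array([0.2, 1.5, 3.0])
assert np.allclose(grad_h_star(grad_h(x)), x)   # one composition
t = np.array([-1.0, 0.0, 2.0])
assert np.allclose(grad_h(grad_h_star(t)), t)   # and the other
```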
Now we put the convex conjugate together with the Bregman divergence. Let $h$ be differentiable and strictly convex, with conjugate $h^*$.
Lemma. For all $x, y$ we have $D_h(x, y) = D_{h^*}(\nabla h(y), \nabla h(x))$: the divergence with respect to $h^*$ between the mirror images, with the arguments swapped. The proof is a direct computation using $h^*(\nabla h(x)) = \langle \nabla h(x), x \rangle - h(x)$, the equality case of Fenchel–Young.
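A numerical check of the lemma in the entropy case (illustrative, assuming NumPy), with $h^*(\theta) = \sum_i e^{\theta_i - 1}$ from the earlier example:

```python
import numpy as np

# Check D_h(x, y) = D_{h*}(grad h(y), grad h(x)) for h = negative entropy.
h      = lambda x: np.sum(x * np.log(x))
h_star = lambda t: np.sum(np.exp(t - 1))
g      = lambda x: np.log(x) + 1       # grad h
g_star = lambda t: np.exp(t - 1)       # grad h*

def breg(f, grad_f, a, b):
    return f(a) - f(b) - grad_f(b) @ (a - b)

x, y = np.array([0.2, 0.5, 1.1]), np.array([0.4, 0.3, 0.9])
assert np.isclose(breg(h, g, x, y), breg(h_star, g_star, g(y), g(x)))
```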
We now consider the convergence analysis of mirror descent. Similar to the analysis for gradient descent, we hope to establish the connection between the error $f(x_t) - f(x^*)$ and the decrease of a potential function; here the natural potential is the Bregman divergence $D_h(x^*, x_t)$.
Theorem. Let $h$ be $\alpha$-strongly convex with respect to a norm $\|\cdot\|$, and suppose $\|\nabla f(x_t)\|_* \le G$ for all $t$. Running mirror descent for $T$ steps with the fixed step size $\eta = \sqrt{2\alpha D_h(x^*, x_1)} / (G\sqrt{T})$, the average iterate $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$ satisfies
$f(\bar{x}) - f(x^*) \le G \sqrt{\frac{2 D_h(x^*, x_1)}{\alpha T}}.$
In other words, if we would like to obtain an (approximate) answer that is at most $\varepsilon$ away from optimal, it suffices to run $T = \frac{2 G^2 D_h(x^*, x_1)}{\alpha \varepsilon^2}$ iterations.
By the previous analysis (the optimality conditions of the proximal step combined with the three-point identity for Bregman divergences), each step satisfies
$\eta \langle \nabla f(x_t), x_t - x^* \rangle \le D_h(x^*, x_t) - D_h(x^*, x_{t+1}) + \frac{\eta^2 \|\nabla f(x_t)\|_*^2}{2\alpha}.$
Since $f$ is convex, $f(x_t) - f(x^*) \le \langle \nabla f(x_t), x_t - x^* \rangle$; summing over $t = 1, \dots, T$, the divergence terms telescope, and dividing by $\eta T$ (together with Jensen's inequality applied to $\bar{x}$) proves the theorem.
Note that this result holds even for non-differentiable $f$: the proof only uses the subgradient inequality, so $\nabla f(x_t)$ may be replaced by any subgradient $g_t \in \partial f(x_t)$.
To see the advantage of mirror descent, suppose the feasible set is the probability simplex and the gradients are bounded coordinate-wise, $\|\nabla f(x)\|_\infty \le G_\infty$. In the Euclidean norm such gradients can be as large as $G_\infty \sqrt{n}$, so the gradient descent bound degrades polynomially with the dimension, whereas mirror descent with an $h$ that is strongly convex with respect to $\|\cdot\|_1$ only pays the $\ell_\infty$ bound $G_\infty$ (recall that $\|\cdot\|_\infty$ is the dual of $\|\cdot\|_1$).
We now give an example. Suppose $K = \Delta_n$ is the probability simplex, and take $h$ to be the negative entropy $h(x) = \sum_i x_i \log x_i$, so that $D_h$ is the KL divergence and the update becomes the multiplicative-weights rule $x_{t+1, i} \propto x_{t, i}\, e^{-\eta \nabla_i f(x_t)}$ (the Bregman projection onto $\Delta_n$ is just renormalization).
Furthermore, if we start from the uniform distribution $x_1 = (1/n, \dots, 1/n)$, then $D_h(x^*, x_1) = \log n - H(x^*) \le \log n$ for every $x^* \in \Delta_n$, where $H$ denotes the Shannon entropy.
Lemma (Pinsker). Let $x, y \in \Delta_n$. Then $D_h(x, y) = D_{\mathrm{KL}}(x \| y) \ge \frac{1}{2}\|x - y\|_1^2$; in other words, the negative entropy is $1$-strongly convex on the simplex with respect to $\|\cdot\|_1$.
This result gives $f(\bar{x}) - f(x^*) \le G_\infty \sqrt{2 \log n / T}$ whenever $\|\nabla f\|_\infty \le G_\infty$: the dependence on the dimension is only $\sqrt{\log n}$, in contrast with the polynomial factors gradient descent would incur here.
We start again from the bound in the theorem and simply substitute $\alpha = 1$ (by Pinsker's inequality), $D_h(x^*, x_1) \le \log n$, and the step size $\eta = \sqrt{2 \log n} / (G_\infty \sqrt{T})$.
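Putting everything together, here is a minimal end-to-end sketch (not from the notes; it assumes NumPy and a toy linear objective $f(x) = \langle c, x \rangle$, whose minimum over the simplex puts all mass on $\arg\min_i c_i$) of mirror descent with negative entropy on the simplex, using the step size from the theorem with $\alpha = 1$ and $D_h(x^*, x_1) \le \log n$:

```python
import numpy as np

# Mirror descent with negative entropy on the probability simplex
# (exponentiated gradient / multiplicative weights) on f(x) = <c, x>.
rng = np.random.default_rng(5)
n, T = 10, 5000
c = rng.uniform(-1, 1, size=n)    # gradient of f is constant: c
G = np.abs(c).max()               # l_infinity gradient bound

x = np.ones(n) / n                # uniform start
eta = np.sqrt(2 * np.log(n)) / (G * np.sqrt(T))
avg = np.zeros(n)
for _ in range(T):
    x = x * np.exp(-eta * c)      # dual step, mapped back to the primal
    x /= x.sum()                  # KL (Bregman) projection onto simplex
    avg += x / T

print(c @ avg - c.min())          # optimality gap, roughly G*sqrt(2 ln n / T)
```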
[1] A. Banerjee, X. Guo, and H. Wang, "On the optimality of conditional expectation as a Bregman predictor," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2664–2669, July 2005.
[2] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, and J. Lafferty, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.