Lecture 12. Newton’s Method
...


12.1 Newton’s method for optimization
...

Recall that gradient descent converges slowly if the condition number is large. For example, consider the function $f(x) = \frac{1}{2}(x_1^2 + M x_2^2)$ with $M \gg 1$. At a point $x$, $\nabla f(x) = (x_1, M x_2)^\top$. It locally decreases rapidly but not globally. The ideal descending direction is $x^* - x$, which can also be written as $-[\nabla^2 f(x)]^{-1} \nabla f(x)$. In general, if $f(x) = \frac{1}{2} x^\top A x - b^\top x$ where $A \succ 0$, the ideal direction at $x$ is $-A^{-1} \nabla f(x) = x^* - x$ since $x^* = A^{-1} b$.

More generally, recall Newton's method introduced before for finding roots, where we use a Taylor series to estimate the objective function. When $f$ is not quadratic, consider its second-order Taylor approximation $$f(y) \approx f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2} (y - x)^\top \nabla^2 f(x) (y - x).$$ Denote by $\hat f(y)$ the right-hand side. Note that $\nabla \hat f(y) = \nabla f(x) + \nabla^2 f(x)(y - x) = 0$ has the unique solution $y = x - [\nabla^2 f(x)]^{-1} \nabla f(x)$ if $\nabla^2 f(x)$ is invertible. Thus the minimizer of $\hat f$ is $x - [\nabla^2 f(x)]^{-1} \nabla f(x)$. Then, we let $x_{k+1}$ be the minimizer of the second-order Taylor series at $x_k$: $$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k).$$ This method is called Newton's method for optimization.
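The iteration above can be sketched in a few lines of Python; the helper name `newton` and the quadratic test problem are illustrative assumptions, not from the lecture:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton's method: x <- x - [hess(x)]^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve the Newton system, don't invert
    return x

# Quadratic test problem f(x) = 1/2 x^T A x - b^T x: the second-order Taylor
# series is exact, so Newton's method reaches x* = A^{-1} b in a single step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```

For quadratics the method terminates immediately; the interesting behavior, discussed next, is what happens when $f$ is not quadratic.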

Note that if the objective function $f$ is strictly convex, then $\nabla^2 f(x) \succ 0$, which implies that $\nabla^2 f(x)$ is invertible and $[\nabla^2 f(x)]^{-1} \succ 0$. This is because if $\nabla^2 f(x)$'s eigenvalues are all positive, their inverses are also positive. Therefore, the Newton iterate $x_{k+1}$ is well defined.

Recall our requirement for a descending direction $d$: we hope $\langle d, \nabla f(x) \rangle < 0$. Clearly, if $\nabla^2 f(x) \succ 0$, then $d = -[\nabla^2 f(x)]^{-1} \nabla f(x)$ is a descending direction, since $\langle d, \nabla f(x) \rangle = -\nabla f(x)^\top [\nabla^2 f(x)]^{-1} \nabla f(x) < 0$ unless $\nabla f(x) = 0$.

12.2 Convergence rate of Newton's method
...

Question

Does Newton's method always work well?

Intuitively this is not true since we use the second order Taylor series to approximate the function, but the Taylor series only works locally.

If the second-order Taylor series estimates the value of the function well, then $x_{k+1}$ given by Newton's method is the minimum point of the Taylor series, and it should be close to the minimum point of $f$. But what happens if the Taylor series doesn't approximate $f$ well? Now we consider some "bad" examples.

Example

Consider the function $f(x) = \sqrt{1 + x^2}$.

  • Its first-order derivative is $f'(x) = \dfrac{x}{\sqrt{1 + x^2}}$;
  • Its second-order derivative is $f''(x) = (1 + x^2)^{-3/2}$.


If we set $x_{k+1} = x_k - f'(x_k)/f''(x_k) = -x_k^3$, then we can show that $|x_k| \to \infty$ as long as $|x_0| > 1$. Thus, Newton's method does not converge in this example.

The reason why Newton's method does not converge in this example is that the second-order derivative of $f$ is close to zero when $|x|$ is large, which yields that the second-order Taylor series is not a good approximation near the minimum point. The Taylor series approximates $f$ well in a neighborhood of $x_k$, but loses control at the minimizer $x^* = 0$ if $|x_k|$ is large.
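This blow-up is easy to reproduce numerically. The sketch below assumes the classic example $f(x) = \sqrt{1 + x^2}$, for which the Newton step simplifies to $x_{k+1} = -x_k^3$:

```python
import math

# Assumed divergence example: f(x) = sqrt(1 + x^2), with
# f'(x) = x / sqrt(1 + x^2) and f''(x) = (1 + x^2)^(-3/2), so the
# Newton step simplifies to x - f'(x)/f''(x) = -x**3.
def newton_step(x):
    fp = x / math.sqrt(1 + x * x)
    fpp = (1 + x * x) ** (-1.5)
    return x - fp / fpp

x = 1.1                    # any |x0| > 1 makes the iterates blow up
traj = [x]
for _ in range(5):
    x = newton_step(x)
    traj.append(x)
# |x_k| grows like |x0|**(3**k); for |x0| < 1 the same recursion converges to 0.
```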

However, just keeping $|x_0 - x^*|$ small is not enough yet. Here is another "bad" example.

Example

Consider the function $f(x) = |x|^{3/2}$.

  • Its first-order derivative is $f'(x) = \frac{3}{2} \operatorname{sign}(x) \, |x|^{1/2}$;
  • Its second-order derivative is $f''(x) = \frac{3}{4} |x|^{-1/2}$.


In this case, $x_{k+1} = x_k - f'(x_k)/f''(x_k) = -x_k$. Clearly, Newton's method does not produce convergent iterates.

In this example, the reason for failure is that $f''$ changes rapidly near the minimizer $x^* = 0$, so the second-order Taylor series cannot approximate $f$ well even in a neighborhood of $x^*$.
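The oscillation can be checked numerically. The sketch assumes the standard example $f(x) = |x|^{3/2}$, for which the Newton step sends $x$ to $-x$:

```python
# Assumed oscillation example: f(x) = |x|^{3/2}, with
# f'(x) = 1.5 * sign(x) * |x|^{1/2} and f''(x) = 0.75 * |x|^{-1/2},
# so the Newton step is x - f'(x)/f''(x) = x - 2x = -x.
def newton_step(x):
    sign = 1.0 if x > 0 else -1.0
    fp = 1.5 * sign * abs(x) ** 0.5
    fpp = 0.75 * abs(x) ** (-0.5)
    return x - fp / fpp

x = 0.5
seq = [x]
for _ in range(4):
    x = newton_step(x)
    seq.append(x)
# seq oscillates between 0.5 and -0.5 forever, never converging to x* = 0.
```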

We now give some conditions that guarantee the convergence of Newton's method iterates.

Definition

Given a twice continuously differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, we say $\nabla^2 f$ is $L$-Lipschitz if for all $x, y$, $$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|.$$

Theorem

Suppose $f$ is a $\mu$-strongly convex function, and $\nabla^2 f$ is $L$-Lipschitz. Let $\{x_k\}$ be the iterates generated by Newton's method. Then $$\|x_{k+1} - x^*\| \le \frac{L}{2\mu} \|x_k - x^*\|^2.$$

Remark

Let $\delta_k = \frac{L}{2\mu} \|x_k - x^*\|$. Then we have $\delta_{k+1} \le \delta_k^2$. So if $\delta_0 < 1$, the iterates given by Newton's method converge rapidly (much more rapidly than what we get for gradient descent).
This is called quadratic convergence, or convergence of order $2$.
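A minimal numerical sketch of this error squaring, assuming the illustrative strongly convex function $f(x) = x^2 + e^x$ (not from the lecture):

```python
import math

# Newton's method on the assumed test function f(x) = x**2 + exp(x):
# f is 2-strongly convex, and f''' = exp(x) is bounded on bounded sets.
fp = lambda x: 2 * x + math.exp(x)   # f'(x)
fpp = lambda x: 2 + math.exp(x)      # f''(x)

x = 1.0
errs = []
for _ in range(6):
    x = x - fp(x) / fpp(x)           # Newton update
    errs.append(abs(fp(x)))          # |f'(x_k)| as a proxy for the error
# Near x*, the error is roughly squared at every iteration.
```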

Proof

Fix $k$. Let $h = x_k - x^*$. Then $\nabla f(x^*) = 0$. So applying the Newton–Leibniz formula, we have $$\nabla f(x_k) = \nabla f(x_k) - \nabla f(x^*) = \int_0^1 \nabla^2 f(x^* + t h) \, h \, dt.$$ Therefore, $$x_{k+1} - x^* = h - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) = [\nabla^2 f(x_k)]^{-1} \int_0^1 \left( \nabla^2 f(x_k) - \nabla^2 f(x^* + t h) \right) h \, dt.$$ By strong convexity, $\|[\nabla^2 f(x_k)]^{-1}\| \le 1/\mu$, and by the Lipschitz condition, $\|\nabla^2 f(x_k) - \nabla^2 f(x^* + t h)\| \le L (1 - t) \|h\|$. Hence $$\|x_{k+1} - x^*\| \le \frac{1}{\mu} \int_0^1 L (1 - t) \|h\|^2 \, dt = \frac{L}{2\mu} \|x_k - x^*\|^2.$$

Recall the gradient descent iteration $x_{k+1} = x_k - \eta \nabla f(x_k)$, where we actually calculate the minimum point of the function $$g(y) = f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \frac{1}{2\eta} \|y - x_k\|^2$$ and let $x_{k+1}$ be its minimizer to approximate the minimum point of $f$. In particular, if $\eta = 1/L$, $g$ is an upper estimate of $f$ by $L$-smoothness.
The second-order Taylor series approximates $f$ well locally if $f$ satisfies some conditions, so Newton's method converges rapidly near $x^*$. However, if $x_k$ is far from $x^*$, Newton's method loses control of $f(x_{k+1})$. In contrast, $L$-smoothness guarantees a global upper bound, so gradient descent iterations (with a fixed step size $\eta = 1/L$) always converge, although not as fast as Newton's method in the neighborhood of $x^*$.

Norm of matrices
...

Note that $\nabla^2 f(x)$ is an $n \times n$ matrix if $f$ is a function mapping $\mathbb{R}^n$ to $\mathbb{R}$. To describe the "rapid change" of $\nabla^2 f$, we need to define a norm of matrices.

A simple idea is to view an $m \times n$ matrix as an $mn$-dimensional vector and apply the $\ell_p$-norms. If $p = 2$, such a norm is called the Frobenius norm.

Definition (Frobenius norm)

The Frobenius norm, sometimes also called the Euclidean norm, of an $m \times n$ matrix $A$ is defined as the square root of the sum of the absolute squares of its elements, namely, $$\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2}.$$

However, a more natural way is to consider the following definition. We may view an $m \times n$ matrix as a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. So we can define its operator norm as follows.

Definition

Given a norm $\|\cdot\|_{(n)}$ on $\mathbb{R}^n$ and a norm $\|\cdot\|_{(m)}$ on $\mathbb{R}^m$, the operator norm of an $m \times n$ matrix $A$ is given by $$\|A\| = \max_{x \neq 0} \frac{\|A x\|_{(m)}}{\|x\|_{(n)}}.$$ In particular, if $\|\cdot\|_{(n)}$ and $\|\cdot\|_{(m)}$ are both $\ell_2$-norms, the operator norm is also called the spectral norm.

Unless specified in context, we use $\|A\|$ to denote the spectral norm. Why do we call it the "spectral norm"?

Proposition

If we use $\lambda_{\max}(\cdot)$ to denote the maximum eigenvalue of a symmetric matrix, then $$\|A\| = \sqrt{\lambda_{\max}(A^\top A)}.$$ In particular, if $A$ is symmetric, then $\|A\| = \max_i |\lambda_i(A)|$.

The spectrum of a matrix is the set of all its eigenvalues. This proposition shows why this norm is called the "spectral" norm.

Proof

We have $$\|A\|^2 = \max_{x \neq 0} \frac{\|A x\|_2^2}{\|x\|_2^2} = \max_{x \neq 0} \frac{x^\top (A^\top A) x}{x^\top x} = \lambda_{\max}(A^\top A),$$ since the Rayleigh quotient of the symmetric matrix $A^\top A$ is maximized at its largest eigenvalue.

The advantage of using the operator norm is that we often need a Cauchy–Schwarz-type inequality, which is trivially true (by definition) under the operator norm: for all $A$ and $x$, it holds that $$\|A x\| \le \|A\| \cdot \|x\|.$$
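These norms are easy to compare numerically; the sketch below uses NumPy's built-in matrix norms (the random test matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))      # a generic 3x4 test matrix
x = rng.standard_normal(4)

fro = np.linalg.norm(A)              # Frobenius norm (NumPy's default for matrices)
spec = np.linalg.norm(A, 2)          # spectral norm = largest singular value
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))  # lambda_max(A^T A)
# Proposition: ||A|| = sqrt(lambda_max(A^T A)); and ||Ax|| <= ||A|| * ||x||.
```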

12.3 Damped Newton’s method
...

Unfortunately, Newton's method does not guarantee descent of the function values even when the Hessian matrix is positive definite. Similar to gradient descent with a step size, we can modify Newton's method to take $$x_{k+1} = x_k - t_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$ with a step size $t_k \in (0, 1]$ instead of $t_k = 1$, where the step size is chosen by a certain line search. This is called the damped Newton's method.

Since $d_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ is a descending direction (by convexity), we claim that there exists $t_k > 0$ such that $$f(x_k + t_k d_k) \le f(x_k) + \alpha t_k \langle \nabla f(x_k), d_k \rangle$$ with parameter $\alpha \in (0, 1/2)$. Again, applying the backtracking line search, we can find such $t_k$ by starting from an initial value (usually $t = 1$) and repeating $t \leftarrow \beta t$ for some $\beta \in (0, 1)$ until the above sufficient decrease condition is satisfied.
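A sketch of the damped iteration with backtracking; the helper name `damped_newton` and the smooth convex test function are illustrative assumptions, not from the lecture:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=100):
    """Damped Newton's method with backtracking line search:
    sufficient-decrease parameter alpha in (0, 1/2), shrink factor beta in (0, 1)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)   # Newton direction
        t = 1.0
        # Backtrack until f(x + t d) <= f(x) + alpha * t * <grad f(x), d>.
        while f(x + t * d) > f(x) + alpha * t * (g @ d):
            t *= beta
        x = x + t * d
    return x

# Test function f(x) = log(e^{x1} + e^{-x1}) + x2^2: pure Newton with step 1
# overshoots badly from x1 = 3, but the damped iteration still converges to 0.
f = lambda x: np.log(np.exp(x[0]) + np.exp(-x[0])) + x[1] ** 2
grad = lambda x: np.array([np.tanh(x[0]), 2 * x[1]])
hess = lambda x: np.diag([1 / np.cosh(x[0]) ** 2, 2.0])
x_min = damped_newton(f, grad, hess, np.array([3.0, 1.0]))
```

The line search only shortens the step far from the minimizer; once the iterates enter the quadratically convergent phase, the full step $t = 1$ is accepted.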

Convergence analysis
...

The convergence of the damped Newton's method has two phases: the damped Newton phase and the quadratically convergent phase. We can show that there exist $\eta > 0$ and $\gamma > 0$ such that the following holds. Specifically, assuming $\mu I \preceq \nabla^2 f(x) \preceq M I$ and $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|$ for any $x, y$, we have

  • (damped Newton phase) if $\|\nabla f(x_k)\|_2 \ge \eta$, then $f(x_{k+1}) - f(x_k) \le -\gamma$;
  • (quadratically convergent phase) if $\|\nabla f(x_k)\|_2 < \eta$, then the backtracking line search condition is satisfied by selecting $t_k = 1$, and $$\frac{L}{2\mu^2} \|\nabla f(x_{k+1})\|_2 \le \left( \frac{L}{2\mu^2} \|\nabla f(x_k)\|_2 \right)^2.$$

12.4 Self-concordant functions
...

Another way to control $\nabla^2 f$ is to compute the third derivative. For simplicity, we consider a univariate function $f: \mathbb{R} \to \mathbb{R}$. If $f'''$ is bounded, then $f''$ is Lipschitz. The previous analysis involves the bounds on $f''$ and $f'''$ separately. We now introduce another assumption on functions, which takes both $f''$ and $f'''$ into consideration simultaneously.

Moreover, Newton's method is affinely invariant. Suppose $T \in \mathbb{R}^{n \times n}$ is nonsingular, and define $\bar f(y) = f(T y)$. If we use Newton's method (with the same backtracking parameters) to minimize $\bar f$, starting from $y_0 = T^{-1} x_0$, then we have $y_k = T^{-1} x_k$ for all $k$. However, the previous convergence analysis is not affinely invariant: the constants in it change under a change of coordinates. In contrast, the following assumption does not depend on affine changes of coordinates.

Definition (Self-concordant function)

A convex function $f: \mathbb{R} \to \mathbb{R}$ is self-concordant if $$|f'''(x)| \le 2 f''(x)^{3/2},$$ or, equivalently, $f$ satisfies $$\left| \frac{d}{dx} \left( f''(x)^{-1/2} \right) \right| \le 1$$ wherever $f''(x) > 0$ and satisfies $f'''(x) = 0$ elsewhere.
More generally, a multivariate convex function $f: \mathbb{R}^n \to \mathbb{R}$ is self-concordant if it is self-concordant along every line in its domain, i.e., the function $g(t) = f(x + t v)$ is a self-concordant function of $t$ for all $x \in \operatorname{dom} f$ and $v \in \mathbb{R}^n$. Equivalently, $f$ is self-concordant if $$\left| \nabla^3 f(x)[v, v, v] \right| \le 2 \left( v^\top \nabla^2 f(x) \, v \right)^{3/2}.$$

The self-concordant functions include many of the logarithmic barrier functions that play an important role in the barrier method and interior-point methods for solving convex optimization problems.

In fact, the coefficient $2$ in the definition is not essential, and it can be replaced by any constant $\kappa > 0$. The standard choice $\kappa = 2$ guarantees that $f(x) = -\log x$ is self-concordant.
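A quick numerical check of this tightness for $f(x) = -\log x$ (the grid of sample points is arbitrary):

```python
import numpy as np

# For f(x) = -log(x) on x > 0: f''(x) = 1/x**2 and f'''(x) = -2/x**3,
# so |f'''(x)| = 2 * f''(x)**1.5 holds with equality -- the constant 2 is tight.
xs = np.linspace(0.1, 10.0, 1000)    # arbitrary grid of positive sample points
fpp = 1.0 / xs ** 2
fppp = -2.0 / xs ** 3
ok = np.all(np.abs(fppp) <= 2.0 * fpp ** 1.5 * (1 + 1e-9))
```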

Example
  • $f(x) = -\log x$ on $\mathbb{R}_{++}$ is self-concordant, since $f''(x) = 1/x^2$ and $|f'''(x)| = 2/x^3 = 2 f''(x)^{3/2}$.
  • Every linear or convex quadratic function is self-concordant, since its third derivative vanishes identically.
  • (log-barrier for linear inequalities) $f(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x)$ on $\{x : a_i^\top x < b_i,\ i = 1, \dots, m\}$ is self-concordant.
  • (log-determinant) $f(X) = -\log \det X$ on $\mathbb{S}^n_{++}$ is self-concordant.
  • $f(x, y) = -\log(y^2 - x^\top x)$ on $\{(x, y) : \|x\|_2 < y\}$ is self-concordant.
  • If $f$ and $g$ are both self-concordant, then $f + g$ is also self-concordant.

Convergence analysis
...

For a strictly convex self-concordant function $f$, we obtain bounds in terms of the Newton decrement $$\lambda(x) = \left( \nabla f(x)^\top [\nabla^2 f(x)]^{-1} \nabla f(x) \right)^{1/2}.$$ There exist constants $\eta > 0$ and $\gamma > 0$ (depending only on the backtracking line search parameters $\alpha$ and $\beta$) such that the following holds.

  • (damped Newton phase) If $\lambda(x_k) > \eta$, then $f(x_{k+1}) - f(x_k) \le -\gamma$.
  • (quadratically convergent phase) If $\lambda(x_k) \le \eta$, then the backtracking line search condition is satisfied by selecting $t_k = 1$, and $$2 \lambda(x_{k+1}) \le \left( 2 \lambda(x_k) \right)^2.$$
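The Newton decrement is straightforward to compute via a linear solve. A minimal sketch; the helper name and the quadratic test problem are illustrative, and for $f(x) = \frac{1}{2} x^\top A x$ one can check that $\lambda(x)^2 = x^\top A x$, i.e., twice the suboptimality gap:

```python
import numpy as np

def newton_decrement(grad, hess, x):
    # lambda(x) = sqrt( grad^T [hess]^{-1} grad ), computed via a linear solve
    g = grad(x)
    return float(np.sqrt(g @ np.linalg.solve(hess(x), g)))

# Illustrative quadratic f(x) = 1/2 x^T A x: here grad = A x, hess = A, so
# lambda(x)^2 = x^T A^T A^{-1} A x = x^T A x = 2 (f(x) - f(x*)).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -2.0])
lam = newton_decrement(lambda z: A @ z, lambda z: A, x)
```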