Review of CG method: III

by Yi Zhang

This Mathematica script generates the following picture, which shows the typical behavior of the steepest descent method: a zigzag between two perpendicular directions. In fact, in 2D this is the only possible behavior, because one has only two directions to choose from, determined by the starting direction, as I showed at the beginning. It is natural to ask whether one can reach the minimizer in only two steps, with a single change of direction. This question leads to the conjugate direction method. (A Python sketch reproducing the zigzag follows the figure.)

[Figure: steepest_descent_example, the zigzag path of steepest descent]
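For readers without Mathematica, the picture can be reproduced with a short Python/NumPy sketch of steepest descent with exact line search on f(x)=\frac{1}{2}x^TAx-b^Tx; the matrix A, right-hand side b, and starting point are my own illustrative choices, not taken from the original script:

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])   # an SPD example matrix (my choice)
b = np.array([2.0, -8.0])

x = np.array([-2.0, -2.0])   # starting point
for k in range(10):
    r = b - A @ x                        # residual = steepest descent direction
    alpha = (r @ r) / (r @ (A @ r))      # exact line search step length
    x = x + alpha * r
    print(k, x)
# Successive residuals are orthogonal, so in 2D the iterates
# zigzag between the same two perpendicular directions.
```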

Assume x\in R^n and the iteration starts with direction d_0. The conjugate direction method tries to choose the point along d_i so that, in the rest of the iteration, descent along the \pm d_i direction never happens again, and likewise for every subsequent step. To make sure this happens, we can require that the iterations after d_i are always performed in the subspace perpendicular to d_i. Remembering that the exact solution x^{\ast} should always lie in the subspace in which the coming iterations are to be performed, we come up with this scheme:

x_{i+1}=x_i+l_id_i, \quad d_i\perp (x_{i+1}-x^{\ast})=e_{i+1}=e_i+l_id_i\Longleftrightarrow d_i^T(e_i+l_id_i)=0

If we view the above as two equations in three unknowns (x_{i+1}, l_i, x^{\ast}), we can see that it is not going to work as written. However, since A is an SPD matrix, by the Cholesky decomposition we have A=L^TL, in which L is an upper-triangular matrix. Then if we consider the orthogonality between Ld_i and Le_{i+1}, instead of between d_i and e_{i+1}, we have

Ld_i\perp Le_{i+1}\Rightarrow (Ld_i)^T Le_{i+1}=d_i^T L^T L e_{i+1}=d_i^T A e_{i+1}=d_i^T A(e_i+l_i d_i)=d_i^T(Ax_i-b+l_i Ad_i)=d_i^T(-r_i+l_i Ad_i)=0
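To see concretely that the Cholesky factor converts A-orthogonality into ordinary orthogonality, here is a small NumPy check; the random SPD matrix and vectors are my own illustration, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)        # a random SPD matrix

C = np.linalg.cholesky(A)          # NumPy gives lower-triangular C with A = C C^T
L = C.T                            # so L = C^T is the upper-triangular factor, A = L^T L
d = rng.standard_normal(4)
e = rng.standard_normal(4)
# d^T A e equals (L d)^T (L e): A-orthogonality is orthogonality after applying L
print(np.isclose(d @ A @ e, (L @ d) @ (L @ e)))   # True
```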

Finally we have

l_i=\frac{d_i^Tr_i}{d_i^TAd_i}

Of course this can be obtained from the requirement \frac{df(x_{i+1})}{d l_i}=0 as well, but the derivation I used here gives a better geometric explanation. In order to exploit the fact that we do not know e_i but we do know r_i=b-Ax_i=-Ae_i, we impose the orthogonality requirement d_i\perp e_{i+1} in the space transformed by L. By now one can see that r_i, the effect of A applied to e_i, can often serve the same purpose as the error e_i itself; that is why r_i gets a similar name: the residual.
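As a sanity check on the formula, here is a one-step sketch (the matrix, starting point, and search direction are hypothetical values of my own) verifying that with this l_i the new error is indeed A-orthogonal to the search direction:

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])             # an SPD example matrix (my choice)
b = np.array([2.0, -8.0])
x_star = np.linalg.solve(A, b)          # exact solution, used only for the check

x = np.array([-2.0, -2.0])              # current iterate
d = np.array([1.0, 0.0])                # an arbitrary search direction
r = b - A @ x                           # residual r_i
l = (d @ r) / (d @ (A @ d))             # step length l_i = d_i^T r_i / d_i^T A d_i
e_next = (x + l * d) - x_star           # e_{i+1}
print(np.isclose(d @ A @ e_next, 0.0))  # True: d_i^T A e_{i+1} = 0
```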

The orthogonality condition Ld_i\perp Le_{i+1}\Longleftrightarrow d_i^TAe_{i+1}=0 is also called A-orthogonality. And since d_i is A-orthogonal to every following e_j, one can easily see that d_i is A-orthogonal to every following d_j as well. Now that we have the formula for l_i, all that remains is a set of A-orthogonal directions d_j, j=0,1,\cdots,n-1, and this set can be found by Gram-Schmidt conjugation, an A-orthogonal version of the Gram-Schmidt orthogonalization process. With these tools in hand, it is not hard to show that this conjugate direction method, whose name comes from the fact that the directions d_j are A-orthogonal, or A-conjugate, to each other, indeed reaches x^{\ast} within n steps. In fact, the goal achieved in each iteration of the conjugate direction method is to eliminate one dimension of the space in which the solution lies and to confine the remaining search to a lower-dimensional hyperplane. A sketch of the whole procedure is given below.
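Here is a minimal Python/NumPy sketch of the full conjugate direction method as described above: Gram-Schmidt conjugation of the standard basis, followed by one exact step per direction. The test matrix, right-hand side, and choice of the standard basis as the starting vectors are my own illustrative assumptions:

```python
import numpy as np

def conjugate_direction_solve(A, b, x0):
    """Conjugate direction method: n A-orthogonal steps reach the exact solution."""
    n = len(b)
    # Gram-Schmidt conjugation: turn the standard basis into A-orthogonal directions
    ds = []
    for k in range(n):
        d = np.eye(n)[k].copy()
        for dj in ds:
            d -= (d @ A @ dj) / (dj @ A @ dj) * dj   # subtract A-projections
        ds.append(d)
    x = x0.astype(float)
    for d in ds:                        # one step per direction
        r = b - A @ x                   # residual r_i = -A e_i
        l = (d @ r) / (d @ (A @ d))     # step length l_i = d_i^T r_i / d_i^T A d_i
        x = x + l * d
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x = conjugate_direction_solve(A, b, np.zeros(2))
print(np.allclose(x, np.linalg.solve(A, b)))   # True: exact after n = 2 steps
```

The final allclose check confirms the n-step termination claimed above: with two A-orthogonal directions in 2D, the second step lands exactly on x^{\ast}.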
