Mesh Mess

Classical iterative algorithms

Posted in Numerical by Yi Zhang on 07/30/2009

For this topic, many excellent textbooks are available. Most of the modern numerical PDE solvers rely on one or another iterative algorithm taking care of the linear system given by discretization ( I know very little of direct methods). In this scenario, one has linear system

Ax=b

and tries to come up with some iterative scheme

x^{(k+1)}=Tx^{(k)}+c

with

\lim_{k\to\infty}{x^{(k)}}=x^{\ast}

meaning the convergence to the “real” solution, or, to be exact, the solution by direct solver if we disregard the round-off error. By this setup, one can see the solution is the fixed point of operator T. So the convergence of the algorithm is equivalent to two proposals: the spectral radius \rho(A)<1, and there exists operator norm induced by some norm of the linear space to which x belongs that this norm of T is less than 1. The convergence speed is measured by average rate of convergence of A (for m iterations) defined by

R(A^m)=-\log[\|A^m\|^{1/m}]=-\log[\|A^m\|]/m

which comes from the fact that \lim_{k\to\infty}(\|T^k\|^{1/k})=\rho(T). In order to put the original problem into this fixed point setup, A is splitted:

A=M-N\longrightarrow x^{(k+1)}=M^{-1}Nx^k+M^{-1}b

If A is splitted according to entries’ positions as A=L+D+U, one has the classical algorithms:

Jacobi: M=D,N=-(L+U)

Gauss-Seidel: M=D+L,N=-U

SOR: x_i^{(k+1)}=\omega \bar{x}_i^{(k+1)}+(1-\omega)x_i^{(k)}

The last one is in the component wise form in which \bar{x}^{(k+1)} is the Gauss-Seidel result, and the relaxation parameter is between zero and two (even though the name “overrelaxation” comes when it is greater than one). Each of those algorithms has direct(but different in implementation) analogical block version. All these methods are stationary, i.e. in each iteration T is unchanged. If this is not the case, we have unstationary methods, in which after each iteration information is re-gathered in order to determine a new T for next iteration. When A is a SPD matrix, a large family of unstationary methods can be formed by the solving a minimization problem(minimal residual method (MINRES)). Classical CG method I wrote about before belongs to this family. A modern member of this family is GMRES by Saad and Schultz in 1986.

review of CG method:IV

Posted in Math by Yi Zhang on 07/03/2009

I have shown that in method of conjuate direction, the searching directions are in some sense orthogonal to each other:

d_i^TAd_j=0,\quad i\neq j

Which makes more sense when the whole thing is considered under a new coordinate system, a system obtained by impose tranform of L, in which L is the upper-triangle matrix from A‘s Cholesky decomposition.

d_i^TAd_j=0\Longleftrightarrow (Ld_i)^T(Ld_j)=0\Longleftrightarrow Ld_i\perp Ld_j

Similar conclusion includes that d_i is A-orthogonal to e_{i+1}, resulting in that d_i is orthogonal to r_{i+1} the residual:

d_i^Tr_{i+1}=0\Rightarrow d_i^Tr_j=0,\quad i\neq j
This condition actually implies that one can use the residual r_i processed by  Gram-Schimdt conjugation to obtain the direction d_i, and this is the conjugate gradient (CG) method. The advantage of CG is that, instead of finding a set of directions d_i at the beginning, r_i is obtained along with the iteration, because the process of Gram-Schimdt conjugation is “triagonal”, i.e., one variable is determined by its predecessors. In particular, starting from r_0=b-Ax_0 (the first step is the same as steepest descent method), the following searching directions d_i is determined by

d_i=r_i+\sum_{k=0}^{i-1}\alpha_{ik}d_k

Coefficient \alpha_{ik} is obtained by conjugate condition:

d_i^TAd_j=0\Longrightarrow r_i^TAd_j+\sum_{k=0}^{i-1}\alpha_{ik}d_k^TAd_j=r_i^TAd_j+\alpha_{ij}d_j^TAd_j=0

Then we have \alpha_{ij}=-r_i^TAd_j/(d_j^TAd_j)

In order to simplify this expression, remember that during this process of generating d_i from r_i, for any i,

\text{span}\{d_0,d_1,\cdots,d_i\}=\text{span}\{r_0,r_1,\cdots,r_i\}

so d_i^Tr_j=0\Longrightarrow r_i^Tr_j=0,\quad i\neq j

The Gram-Schimdt process also indicates that d_i^Tr_i=r_i^Tr_i (prove it). The above two identities can be used to simplify \alpha_{ij}, and prove that

\alpha_{ij}=r_i^Tr_i/(l_{i-1}d_{i-1}^TAd_{i-1}),\quad j=i-1

\alpha_{ij}=0,\quad j<i-1

and eventually

\alpha_{i,i-1}=r_i^Tr_i/(r_{i-1}^Tr_{i-1})

Finally, a complete iteration, when d_i, x_i,r_i is known, consists of

l_i=r_i^Tr_i/d_i^TAd_i\longrightarrow x_{i+1}=x_i+l_id_i\longrightarrow r_{i+1}=r_i-l_iAd_i

\alpha_{i+1}\equiv\alpha_{i+1,i}=r_{i+1}^Tr_{i+1}/(r_i^Tr_i)\longrightarrow d_{i+1}=r_{i+1}+\alpha_{i+1}d_i

review of CG method:III

Posted in Math by Yi Zhang on 07/01/2009

This Mathematica script generates the following picture indicating a typical behavior of steepest descent method: the zigzag in two perpendicular directions. In fact, in 2D this is the only case because one has only two choices, determined by starting direction, as I showed at the beginning. It is natural to ask whether one can use only two step with single change of direction to reach the minimization point. This introduces conjugate direction method.

steepest_descent_example

Assume x\in R^n and the iteration starts with direction d_0. The method of conjugate direction tries to reach a point along d_i with condition that in the rest of iteration descent along \pm d_o direction would never happen again, so does for any other following steps. To make sure this happen, we can require that the iterations after d_i always are performed on the subspace perpendicular to d_i. Remembering that exact solution x^{\ast} should always be in the subspace in which the coming iterations are to be performed, we come up this scheme:

x_{i+1}=x_i+l_id_i, \quad d_i\perp (x_{i+1}-x^{\ast})=e_{i+1}=e_i+l_id_i\Longleftrightarrow d_i^T(e_i+l_id_i)=0

If we view the above as two equations with three unknowns (x_{i+1},l_i,x^{\ast}), we know that it’s not going to work. However, since A is SPD matrix, by Cholesky decomposition we have A=L^TL in which L is an upper-triagonal matrix. Then if we consider the orthogonality between Ld_i and Le_{i+1}, instead of d_i and e_{i+1}, we have

Ld_i\perp Le_{i+1}\Rightarrow (Ld_i)^T Le_{i+1}=0\Rightarrow d_i^TL^TLe_{i+1}=d_i^TAe_{i+1}=d_i^TA(e_i+l_id_i)=d_i^T(Ax_i-b+Al_id_i)=d_i^T(-r_i+Al_id_i)=0

Finally we have

l_i=d_i^Tr_i/d_i^TAd_i

Of course this can be obtained by using the \frac{df(x_{i+1})}{d l_i}=0 requirement as well, but derivation I used here gives better geometric explanation. In order to utilize the fact that we do not know e_i but we know r_i=b-Ax_i=-Ae_i, we impose the orthogonality requirement d_i\perp e_{i+1} in the space transformed by L. By now one can see that to consider r_i as the effect of A applied on e_i can often serve the similar purpose of e_i the error itself, that’s why sometimes we call r_i a similar name: residual.

The orthogonalilty condition Ld_i\perp Le_{i+1}\Longleftrightarrow d_i^TAe_{i+1} is also called A-orthogonality. And by d_i is A-orthogonal to the following e_j one can easily see that d_i is A-orthogonal to the following d_j. Since now we have the formular for l_i, with a set of A-orthogonal d_j, j=0,1,\cdots,n-1 we will be done, and this set can be found by Gram-Schimdt conjugation, a A-orthogonal version of Gram-Schimdt orthogonazation process. With those tools in hand, it is not hard to show that this conjugate direction method, whose name comes from that the directions d_j are A-orthogonal, or, A-conjugate to each other, can indeed reach x^{\ast} within n steps. In fact,  the goal in each interation in conjugate direction method achieved is to eliminate one dimension of the space in which the solution lies and confine the rest of search to a lower dimension hyperplane.

review of CG method:II

Posted in Math by Yi Zhang on 07/01/2009

Now let’s examine the convergence property of steepest descent method. For this purpose, define the error against exact solution x^{\ast} at each iteration step

e_i\equiv x_i-x^{\ast}

By iteration scheme there is

e_{i+1}=e_i+l_ir_i

Since at every step matrix A would be multiplied, and the iterative property of a matrix is largely determined by its eigenvalues, one’s first reaction should be look at the decomposition of the vector space under discussion into its standard forms by

e_i=\sum_j{\xi_{(i)j}v_j}

where \{v_j\} is the orthonormal basis of eigenvector space of A. By this defitino, there are

r_i=b-Ax_i=Ax^{\ast}-Ax_i=-Ae_i

e_{i+1}=e_i+l_ir_i=e_i+\frac{r_i^Tr_i}{r_i^TAr_i}r_i=e_i+\frac{\sum_j\xi_{(i)j}^2\lambda_j^2}{\sum_j\xi_{(i)j}^2\lambda_j^3}r_i

In which \lambda_j is eigenvalue corresponding to v_j. One can see if the normal/spectral condition number \kappa(A)\equiv\lambda_{\max}(A)/\lambda_{\min}(A)=0, i.e. all the eigenvalues of A are same, the error at next step immediately drops to zero. Geometricly this simply means contours are spheres and the steepest descent direction is toward the centre. Generally, there is

e_{i+1}=(I-l_iA)e_i\Rightarrow \|e_{i+1}\|_A^2=\|e_i\|_A^2\omega_i^2

where the norm is energy norm induced by A

\|x\|_A=\sqrt{\frac{1}{2}x^TAx}

and \omega_i (subscript emphasizes the dependence on each step) is determined by spectral condition number \kappa(A) and slope \mu_i=\xi_{(i)\max}/\xi_{(i)\min}:

\omega_i^2=1-\frac{(\kappa^2+\mu_i^2)^2}{(\kappa+\mu_i^2)(\kappa^3+\mu_i^2)}

Picture below shows the dependence of \omega to two variables. One can see that the slowest convergence happens when both condition number and slope are large. A is referred to as ill-conditioned when \kappa is large.

omega_SDconvergence

review of CG method:I

Posted in Math by Yi Zhang on 06/30/2009

Most popular method in solving linear system, conjugate gradient is simply a few connected elegant ideas whose geometric meanning is almost intuitive. In solving system

Ax=b

where A is SPD matrix, it is obvious that the solution x of the equation is equal to the minimizer of quadratic form

f(x)=\frac{1}{2}x^TAx-x^Tb+c

If the positive definite condition is removed, then solution of Ax=b is simply one stationary point of f.

Pictures below show 3D plot of a quadratic form and its contour with minus gradient -\nabla f(x)=b-Ax. In order to solve the linear system, one need find the minimizer, and in turn steepest descent method is the way to follow the direction of arrow in the picture and move toward the goal, in another word,  like a ball sliding on the 3D paraboloid toward the lowest point. For iteration algorithm, we obtain point x_{i+1} in the domian by x_{i+1}=x_i+l_ir_i, where r_i is the descent direction at point x_i

r_i=-\nabla f(x_i)=b-Ax_i

and l_i is the distance to move toward that direction, which is determined by maximizing the “profit” in present step, i.e. to move to the lowest value of x in direction of r_i. This can be achieved, intuitively, move to the “most inner” contour in that direction. Now one can see by “most inner” I mean when the contour is tangential to vector line of r_i. One corollary of this method is that in next step we are going to move perpendicular to the direction of last move: r_{i+1}\perp r_i

Another view to see this is that by “maximizing profit” we mean to minimize the value of f at present step, so we are looking for step size l_i satisfying

\frac{d f(x_{i+1})}{d l_i}=f'(x_{i+1})^Tr_i=(Ax_i-b)^Tr_i=-r_{i+1}^Tr_i=0

By this we can obtain l_i=r_i^Tr_i/(r_i^TAr_i), and algorithm of steepest descent: \cdots\rightarrow r_i=b-Ax_i \rightarrow l_i\rightarrow x_{i+1}=x_i+l_ir_i\rightarrow\cdots

Follow

Get every new post delivered to your Inbox.