
Preasymptotic convergence of randomized Kaczmarz method

Yuling Jiao, Bangti Jin and Xiliang Lu

Published 23 November 2017 © 2017 IOP Publishing Ltd
Citation: Yuling Jiao et al 2017 Inverse Problems 33 125012, DOI 10.1088/1361-6420/aa8e82


Abstract

Kaczmarz method is one popular iterative method for solving inverse problems, especially in computed tomography. Recently, it was established that a randomized version of the method enjoys an exponential convergence for well-posed problems, and the convergence rate is determined by a variant of the condition number. In this work, we analyze the preasymptotic convergence behavior of the randomized Kaczmarz method, and show that the low-frequency error (with respect to the right singular vectors) decays faster during first iterations than the high-frequency error. Under the assumption that the initial error is smooth (e.g. sourcewise representation), the result explains the fast empirical convergence behavior, thereby shedding new insights into the excellent performance of the randomized Kaczmarz method in practice. Further, we propose a simple strategy to stabilize the asymptotic convergence of the iteration by means of variance reduction. We provide extensive numerical experiments to confirm the analysis and to elucidate the behavior of the algorithms.


1. Introduction

Kaczmarz method [20], named after the Polish mathematician Stefan Kaczmarz, is one popular iterative method for solving linear systems. It is a special form of the general alternating projection method. In the computed tomography (CT) community, it was rediscovered in 1970 by Gordon, Bender and Herman [9], under the name algebraic reconstruction techniques. It was implemented in the very first medical CT scanner, and since then it has been widely employed in CT reconstructions [14, 15, 28].

The convergence of Kaczmarz method for consistent linear systems is not hard to show. However, the theoretically very important issue of convergence rates of Kaczmarz method (or the alternating projection method for linear subspaces) is very challenging. There are several known convergence rates results, all relying on (spectral) quantities of the matrix A that are difficult to compute or verify in practice (see [7] and the references therein). This challenge is well reflected by the fact that the convergence rate of the method depends strongly on the ordering of the equations.

It has been observed numerically, several times independently in the literature, that using the rows of the matrix A in Kaczmarz method in a random order rather than the given order, a variant called the randomized Kaczmarz method (RKM) below, can often substantially improve the convergence [15, 28]. Thus RKM is quite appealing for practical applications. However, the convergence rate analysis was given only very recently. In an influential paper [34], in 2009, Strohmer and Vershynin established the exponential convergence of RKM for consistent linear systems, and the convergence rate depends on (a variant of) the condition number. This result was then extended and refined in various directions [1, 3, 10, 26, 29, 33, 35], including inconsistent or underdetermined linear systems. Recently, Schöpfer and Lorenz [33] showed the exponential convergence of RKM for sparse recovery with elastic net. We recall the result of Strohmer and Vershynin and its counterpart for noisy data in theorem 2.1 below. It is worth noting that all these estimates involve the condition number, and for noisy data, the estimate contains a term inversely proportional to the smallest singular value of the matrix A.

These important and interesting existing results do not fully explain the excellent empirical performance of RKM for solving linear inverse problems, especially in the case of noisy data, where the term due to noise is amplified by a factor of the condition number. In practice, one usually observes that the iterates first converge quickly to a good approximation to the true solution, and then start to diverge slowly. That is, it exhibits the typical 'semiconvergence' phenomenon for iterative regularization methods, e.g. Landweber method and conjugate gradient methods [13, 21]. This behavior is not well reflected in the known estimates given in theorem 2.1; see section 2 for further comments.

The purpose of this work is to study the preasymptotic convergence behavior of RKM. This is achieved by carefully analyzing the evolution of the low- and high-frequency errors during the randomized Kaczmarz iteration, where the frequency is divided according to the right singular vectors of the matrix A. The results indicate that during the first iterations, the low-frequency error decays much faster than the high-frequency one, see theorems 3.1 and 3.2. Since the inverse solution (relative to the initial guess x0) is often smooth in the sense that it consists mostly of low-frequency components [13], this explains the good convergence behavior of RKM, thereby shedding new insights into its excellent practical performance. This condition on the inverse solution is akin to the sourcewise representation condition in classical regularization theory [5, 16]. Further, based on the fact that RKM is a special case of the stochastic gradient method [31], we propose a simple modified version using the idea of variance reduction by hybridizing it with the Landweber method, inspired by [19]. This variant enjoys both good preasymptotic and asymptotic convergence behavior, as indicated by the numerical experiments.

Last, we note that in the context of inverse problems, Kaczmarz method has received much recent attention, and has demonstrated very encouraging results in a number of applications. The regularizing property and convergence rates in various settings have been analyzed for both linear and nonlinear inverse problems (see [2, 4, 12, 17, 18, 22–24] for an incomplete list). However, these interesting works all focus on a fixed ordering of the linear system, instead of the randomized variant under consideration here, and thus they do not cover RKM.

The rest of the paper is organized as follows. In section 2 we describe RKM and recall the basic tool for our analysis, i.e. singular value decomposition, and a few useful notations. Then in section 3 we derive the preasymptotic convergence rates for exact and noisy data. Some practical issues are discussed in section 4. Last, in section 5, we present extensive numerical experiments to confirm the analysis and shed further insights.

2. Randomized Kaczmarz method

Now we describe the problem setting and RKM, and also recall known convergence rates results for both consistent and inconsistent data. The linear inverse problem with exact data can be cast into

$Ax = b, \qquad\qquad (2.1)$

where the matrix $A\in \mathbb{R}^{n\times m}$ and the data $b\in\mathbb{R}^n$ satisfies $b\in {\rm range}(A)$. We denote the ith row of the matrix A by $a_i^t$, with $a_i\in\mathbb{R}^m$ being a column vector, where the superscript t denotes the vector/matrix transpose. The linear system (2.1) can be either over-determined or under-determined.

The classical Kaczmarz method [20] proceeds as follows. Given the initial guess x0, we iterate

$x_{k+1} = x_k + \frac{b_{i_k} - \langle a_{i_k}, x_k\rangle}{\Vert a_{i_k}\Vert^2}\,a_{i_k}, \qquad i_k = (k\ \mathrm{mod}\ n)+1, \qquad\qquad (2.2)$

where $\langle\cdot, \cdot\rangle$ and $\Vert \cdot\Vert $ denote the Euclidean inner product and norm, respectively. Thus, Kaczmarz method sweeps through the equations in a cyclic manner, and n iterations constitute one complete cycle.
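To make the iteration concrete, here is a minimal NumPy sketch of the cyclic sweep (2.2); the helper name and the small test system are ours and purely illustrative, not taken from the paper.

```python
import numpy as np

def kaczmarz_cyclic(A, b, x0, n_sweeps):
    """Classical (cyclic) Kaczmarz: sweep through the rows in the given order."""
    x = x0.copy()
    n = A.shape[0]
    for _ in range(n_sweeps):
        for i in range(n):                      # one sweep = n successive row projections
            ai = A[i]
            x += (b[i] - ai @ x) / (ai @ ai) * ai
    return x

# illustrative consistent system
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x = kaczmarz_cyclic(A, b, np.zeros(20), n_sweeps=200)
print(np.linalg.norm(x - x_true))               # error is small after many sweeps
```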

In contrast to the cyclic choice of the index i in Kaczmarz method, RKM randomly selects i. There are several different variants, depending on the specific random choice of the index i. The variant analyzed by Strohmer and Vershynin [34] is as follows. Given an initial guess x0, we iterate

$x_{k+1} = x_k + \frac{b_{i} - \langle a_{i}, x_k\rangle}{\Vert a_{i}\Vert^2}\,a_{i}, \qquad\qquad (2.3)$

where the index i is drawn independently and identically distributed (i.i.d.) from the index set $\{1, 2, \ldots, n\}$, with the probability $p_i$ of picking the ith row given by

$p_i = \frac{\Vert a_i\Vert^2}{\Vert A\Vert_F^2}, \qquad\qquad (2.4)$

where $\Vert \cdot\Vert _F$ denotes the matrix Frobenius norm. This choice of the probability distribution pi lends itself to a convenient convergence analysis [34]. In this work, we shall focus on the variant (2.3)–(2.4).
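A matching sketch of the randomized variant (2.3)–(2.4), again with an illustrative test system of our own; rows are sampled i.i.d. with probability proportional to their squared norms.

```python
import numpy as np

def rkm(A, b, x0, n_iter, rng):
    """Randomized Kaczmarz: sample row i with probability p_i = ||a_i||^2 / ||A||_F^2."""
    x = x0.copy()
    row_sq = np.sum(A**2, axis=1)
    p = row_sq / row_sq.sum()                   # sampling probabilities (2.4)
    for _ in range(n_iter):
        i = rng.choice(A.shape[0], p=p)
        ai = A[i]
        x += (b[i] - ai @ x) / row_sq[i] * ai   # projection step (2.3)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x = rkm(A, b, np.zeros(20), n_iter=5000, rng=rng)
print(np.linalg.norm(x - x_true))
```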

Similarly, the noisy data $b^\delta$ is given by

$b^\delta = b + \eta, \qquad \Vert \eta\Vert \leqslant \delta, \qquad\qquad (2.5)$

where δ is the noise level. RKM reads: given the initial guess x0, we iterate

$x_{k+1} = x_k + \frac{b_{i}^\delta - \langle a_{i}, x_k\rangle}{\Vert a_{i}\Vert^2}\,a_{i},$
where the index i is drawn i.i.d. according to (2.4).

The following theorem summarizes typical convergence results of RKM for consistent and inconsistent linear systems [29, 34, 36] (see [26] for in-depth discussions), under the condition that the matrix A is of full column-rank. For a rectangular matrix $A\in\mathbb{R}^{n\times m}$ , we denote by $A^{\dagger}\in\mathbb{R}^{m\times n}$ the pseudoinverse of A, $\Vert A\Vert _2$ denotes the matrix spectral norm, and $\sigma_{\rm min}(A)$ the smallest singular value of A. The error $\Vert x_k-x^*\Vert $ of the RKM iterate xk is stochastic due to the random choice of the index i. Below $ \newcommand{\E}{{\mathbb{E}}} \E[\cdot]$ denotes expectation with respect to the random row index selection. Note that $\kappa_A$ differs from the usual condition number [8].

Theorem 2.1. Let xk be the solution generated by RKM (2.3)–(2.4) at iteration k, and $\kappa_A=\Vert A\Vert _F\Vert A^{\dagger}\Vert _2$ be a (generalized) condition number. Then the following statements hold.

  • (i)  
    For exact data, there holds
    $\mathbb{E}[\Vert x_k-x^*\Vert^2]\leqslant \left(1-\kappa_A^{-2}\right)^k\Vert x_0-x^*\Vert^2.$
  • (ii)  
    For noisy data, there holds
    $\mathbb{E}[\Vert x_k-x^*\Vert^2]\leqslant \left(1-\kappa_A^{-2}\right)^k\Vert x_0-x^*\Vert^2 + \frac{\delta^2}{\sigma^2_{\min}(A)}.$

Theorem 2.1 gives error estimates (in expectation) for any iterate xk, $k\geqslant 1$ : the convergence rate is determined by $\kappa_A$ . For ill-posed linear inverse problems (e.g. CT), bad conditioning is characteristic and the condition number $\kappa_A$ can be huge, and thus the theorem predicts a very slow convergence. However, in practice, RKM converges rapidly during the initial iteration. The estimate is also deceptive for noisy data: due to the presence of the term $\delta^2/\sigma^2_{\min}(A)$ , it implies blowup at the very first iteration, which is however not the case in practice. Hence, these results do not fully explain the excellent empirical convergence of RKM for inverse problems.

The next example compares the convergence rates of Kaczmarz method and RKM.

Example 2.1. Given $n\geqslant 2$ , let $\theta=\frac{2\pi}{n}$ . Consider the linear system with $A\in\mathbb{R}^{n\times 2}$ , $a_i=(\cos(i-1)\theta,\ \sin(i-1)\theta)^t$ and the exact solution $x^*=0$ , i.e. $b=0$ . Then we solve it by Kaczmarz method and RKM. For any $e_0=(x_0, y_0)$ , after one Kaczmarz iteration, $e_1=(0,y_0)$ , and generally, after k iterations,

$\Vert e_k\Vert = |\cos\theta|^{k-1}\,|y_0|.$

For large n, the decreasing factor $| \cos\theta|$ can be very close to one, and thus each Kaczmarz iteration can only decrease the error slowly. Thus, the convergence rate of Kaczmarz method depends strongly on n: the larger is n, the slower is the convergence. Similarly, for RKM, there holds

$\mathbb{E}[\Vert e_{k+1}\Vert^2\,|\,e_k] = \tfrac{1}{2}\Vert e_k\Vert^2$

and

$\mathbb{E}[\Vert e_{k}\Vert^2] = 2^{-k}\Vert e_0\Vert^2.$

For RKM, the convergence rate is independent of n. Further, for any $n> 8$ , we have $0<\theta<\frac{\pi}{4}$ , and $\cos \theta> \cos \frac{\pi}{4}= 2^{-1/2}$ , i.e. the per-iteration contraction factor of the cyclic method exceeds that of RKM (in the mean squared sense). This shows the superiority of RKM over the cyclic one.
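The example is easy to check numerically; the following short script (ours, not from the paper) runs both methods on the equiangular system above and shows the much faster error decay of RKM for a moderately large n.

```python
import numpy as np

n = 64                                          # larger n slows the cyclic method down
theta = 2 * np.pi / n
A = np.stack([np.cos(np.arange(n) * theta), np.sin(np.arange(n) * theta)], axis=1)
rng = np.random.default_rng(0)

def run(select_row):
    x = np.array([1.0, 1.0])                    # initial error e_0 = (1, 1); b = 0, x* = 0
    errs = []
    for k in range(200):
        ai = A[select_row(k)]
        x -= (ai @ x) / (ai @ ai) * ai          # project onto the hyperplane <a_i, x> = 0
        errs.append(np.linalg.norm(x))
    return errs

err_km = run(lambda k: k % n)                   # cyclic order
err_rkm = run(lambda k: rng.integers(n))        # uniform sampling = (2.4), since all rows have unit norm
print(err_km[-1], err_rkm[-1])                  # RKM error is many orders of magnitude smaller
```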

Last we recall singular value decomposition (SVD) of the matrix A [8], which is the basic tool for the convergence analysis in section 3. We denote SVD of $A\in\mathbb{R}^{n\times m}$ by

$A = U\Sigma V^t,$

where $U\in \mathbb{R}^{n\times n}$ and $V\in \mathbb{R}^{m\times m}$ are orthonormal matrices, whose column vectors are known as the left and right singular vectors, respectively, and $\Sigma \in\mathbb{R}^{n\times m}$ is diagonal with the diagonal elements ordered nonincreasingly, i.e. $\sigma_1\geqslant \ldots\geqslant \sigma_r> 0$ , with $r\leqslant\min(m, n)$ . The right singular vectors vi span the solution space, i.e. $x\in {\rm span}(v_i)$ . We shall write V columnwise, i.e. $V=(v_1\ \ldots \ v_m)$ . Note that for inverse problems, empirically, as the index i increases, the right singular vectors vi are increasingly more oscillatory, capturing more high-frequency components [13]. The behavior is analogous to that of the inverse of a Sturm–Liouville operator. For a general class of convolution integral equations, such oscillating behavior was established in [6]. For many practical applications, the linear system (2.1) can be regarded as a discrete approximation to the underlying continuous problem, and thus inherits the corresponding spectral properties.

Given a frequency cutoff number $1\leqslant L\leqslant m$ , we define two (orthogonal) subspaces of $\mathbb{R}^m$ by

${{\mathcal L}} = {\rm span}\{v_1, \ldots, v_L\} \quad\text{and}\quad {{\mathcal H}} = {\rm span}\{v_{L+1}, \ldots, v_m\},$

which denote the low- and high-frequency solution spaces, respectively. This is motivated by the observation that in practice one only looks for smooth solutions that are spanned/well captured by the first few right singular vectors [13]. This condition is akin to the concept of sourcewise representation in regularization theory, e.g. $x= A^*w$ for some $w\in\mathbb{R}^n$ or its variants [5, 16], which is needed for deriving convergence rates for the regularized solution. Throughout, we always assume that the truncation level L is fixed. Then for any vector $z\in\mathbb{R}^m$ , there exists a unique decomposition $z=P_Lz+P_Hz$ , where PL and PH are orthogonal projection operators into ${{\mathcal L}}$ and ${{\mathcal H}}$ , respectively, which are defined by

$P_Lz = \sum_{i=1}^{L}\langle z, v_i\rangle v_i \quad\text{and}\quad P_Hz = \sum_{i=L+1}^{m}\langle z, v_i\rangle v_i.$
These projection operators will be used below to analyze the preasymptotic behavior of RKM.
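In computations the two projections are conveniently formed from the SVD factors; a small illustrative sketch (with an arbitrary test matrix of our own):

```python
import numpy as np

def freq_projections(A, L):
    """Return functions applying P_L and P_H, the orthogonal projections onto the
    span of the first L and of the remaining right singular vectors, respectively."""
    _, _, Vt = np.linalg.svd(A, full_matrices=True)   # rows of Vt are v_1^t, ..., v_m^t
    VL, VH = Vt[:L].T, Vt[L:].T
    return (lambda z: VL @ (VL.T @ z)), (lambda z: VH @ (VH.T @ z))

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
P_L, P_H = freq_projections(A, L=5)
z = rng.standard_normal(20)
# unique decomposition z = P_L z + P_H z with mutually orthogonal parts
print(np.allclose(z, P_L(z) + P_H(z)), abs(P_L(z) @ P_H(z)) < 1e-10)
```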

3. Preasymptotic convergence analysis

In this section, we present a preasymptotic convergence analysis of RKM. Let $x^*$ be one solution of the linear system (2.1). Our analysis relies on decomposing the error $e_k=x_k - x^*$ of the kth iterate xk into low- and high-frequency components (according to the right singular vectors). We aim at bounding the conditional error $\mathbb{E}[\Vert e_{k+1}\Vert ^2\,|\, e_k]$ (conditional on ek, where the expectation $\mathbb{E}[\cdot]$ is with respect to the random choice of the index i, cf. (2.4)) by analyzing separately $\mathbb{E}[\Vert P_Le_{k+1} \Vert ^2\,|\, e_k]$ and $\mathbb{E}[\Vert P_He_{k+1}\Vert ^2\,|\, e_k]$ . This is inspired by the fact that the inverse solution consists mainly of the low-frequency components, which is akin to the concept of source condition in regularization theory [5, 16]. Our error estimates help explain the excellent empirical performance of RKM in the context of inverse problems.

We shall discuss the preasymptotic convergence for exact and noisy data separately.

3.1. Exact data

First, we analyze the case of noise free data. Let $x^*$ be one solution to the linear system (2.1), and $e_{k}=x_k-x^*$ be the error at iteration k. Upon substituting the identity $b=Ax^*$ into the RKM iteration (2.3), we deduce that for some $i\in\{1, \ldots, n\}$ , there holds

$e_{k+1} = \Big(I-\frac{a_ia_i^t}{\Vert a_i\Vert ^2}\Big)e_k. \qquad\qquad (3.1)$

Note that $I-\frac{a_ia_i^t}{\Vert a_i\Vert ^2}$ is an orthogonal projection operator. We first give two useful lemmas.

Lemma 3.1. For any $e_L\in{{\mathcal L}}$ and $e_H\in {{\mathcal H}}$ , there hold

$\Vert Ae_L\Vert^2\geqslant \sigma_L^2\Vert e_L\Vert^2 \quad\text{and}\quad \Vert Ae_H\Vert^2\leqslant \sigma_{L+1}^2\Vert e_H\Vert^2.$

Proof. The assertions follow directly from simple algebra, and hence the proof is omitted. □

Lemma 3.2. For $i=1, \ldots, n$ , there holds $\Vert P_Ha_i\Vert^2\leqslant\sigma_{L+1}^2$ ; moreover,

$\sum_{i=1}^{n}\Vert P_Ha_i\Vert^2\leqslant \sum_{j=L+1}^{r}\sigma_j^2.$

Proof. By definition, $P_H a_i = \sum_{j=L+1}^m\langle a_i, v_j\rangle v_j.$ Since $a_i^t = u_i^t\Sigma V^t$ , there holds $ \langle a_i, v_j\rangle = u_i^t\Sigma V^tv_j = \langle u_i, \sigma_je_j\rangle = \sigma_j(u_i)_j. $ Hence, $\Vert P_Ha_i\Vert ^2 = \sum_{j=L+1}^m \langle a_i, v_j\rangle^2 = \sum_{j=L+1}^m \sigma_j^2 (u_i)_j^2\leqslant \sigma_{L+1}^2$ . The second estimate follows similarly. □

The next result gives a preasymptotic recursive estimate on $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_Le_{k+1}\Vert ^2| e_k]$ and $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_He_{k+1}\Vert ^2 | e_k]$ for exact data $b\in {\rm range}(A)$ . This represents our first main theoretical result.

Theorem 3.1. Let $c_1=\frac{\sigma^2_L}{\Vert A\Vert _F^2}$ and $c_2=\frac{\sum^{r}_{i=L+1}\sigma^2_i}{\Vert A\Vert _F^2}$ . Then there hold

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k]\leqslant (1-c_1)\Vert P_Le_k\Vert^2 + c_2\Vert P_He_k\Vert^2,$

$\mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k]\leqslant c_2\Vert P_Le_k\Vert^2 + (1+c_2)\Vert P_He_k\Vert^2.$

Proof. Let eL and eH be the low- and high-frequency errors of ek, respectively, i.e. $e_L=P_Le_k$ and $e_H=P_He_k$ . Then by the identities $P_Le_{k+1} =e_L - \frac{1}{\Vert a_i\Vert ^2} \langle a_i, e_k\rangle P_La_i$ and $\langle P_La_i, e_L\rangle = \langle a_i, e_L\rangle$ , together with the bound $\Vert P_La_i\Vert\leqslant\Vert a_i\Vert$ , we have

$\Vert P_Le_{k+1}\Vert^2 \leqslant \Vert e_L\Vert^2 - \frac{2}{\Vert a_i\Vert^2}\langle a_i, e_k\rangle\langle a_i, e_L\rangle + \frac{1}{\Vert a_i\Vert^2}\langle a_i, e_k\rangle^2.$

Upon noting the identity $ \sum_{i=1}^n a_ia_i^t = A^t A$ , taking expectation on both sides yields

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k] \leqslant \Vert e_L\Vert^2 - \frac{2}{\Vert A\Vert_F^2}\langle Ae_k, Ae_L\rangle + \frac{1}{\Vert A\Vert_F^2}\Vert Ae_k\Vert^2.$

Now substituting the splitting $e_k = e_L + e_H$ and rearranging the terms give

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k] \leqslant \Vert e_L\Vert^2 - \frac{1}{\Vert A\Vert_F^2}\Vert Ae_L\Vert^2 + \frac{1}{\Vert A\Vert_F^2}\Vert Ae_H\Vert^2.$

Thus the first assertion follows from lemma 3.1 and the bound $\sigma_{L+1}^2\leqslant \sum_{i=L+1}^r\sigma_i^2 = c_2\Vert A\Vert_F^2$ .

The high-frequency component $P_He_{k+1}$ satisfies $P_He_{k+1} =e_H - \frac{1}{\Vert a_i\Vert ^2}\langle a_i, e_k\rangle P_Ha_i.$ We appeal to lemma 3.2 and the inequality $\langle a_i, e_k\rangle^2\leqslant \Vert a_i\Vert ^2\Vert e_k\Vert ^2 =\Vert a_i\Vert ^2(\Vert e_L\Vert ^2+\Vert e_H\Vert ^2)$ to get

$\Vert P_He_{k+1}\Vert^2 \leqslant \Vert e_H\Vert^2 - \frac{2}{\Vert a_i\Vert^2}\langle a_i, e_k\rangle\langle a_i, e_H\rangle + \frac{\Vert P_Ha_i\Vert^2}{\Vert a_i\Vert^2}\big(\Vert e_L\Vert^2+\Vert e_H\Vert^2\big).$

Taking expectation yields

$\mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k] \leqslant \Vert e_H\Vert^2 - \frac{2}{\Vert A\Vert_F^2}\langle Ae_k, Ae_H\rangle + c_2\big(\Vert e_L\Vert^2+\Vert e_H\Vert^2\big),$

where we have used the second estimate of lemma 3.2. Since $V^te_L$ and $V^te_H$ have disjoint supports, $\langle Ae_L, Ae_H\rangle=0$ , and hence $\langle Ae_k, Ae_H\rangle = \Vert Ae_H\Vert^2\geqslant 0$ . Thus we obtain the second assertion and complete the proof. □

Remark 3.1. By theorem 3.1, the decay of the error $\mathbb{E}[\Vert P_Le_{k+1}\Vert ^2\,|\, e_k]$ is largely determined by the factor $1-c_1$ , and it is only mildly affected by $\Vert P_He_k\Vert ^2$ , which enters through the factor c2. The factor c2 is very small in the presence of a gap in the singular value spectrum at $\sigma_L$ , i.e. $\sigma_{L}\gg \sigma_{L+1}$ , showing clearly the role of the gap.
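The predicted behavior is easy to observe in a small Monte Carlo experiment. The sketch below (an illustrative construction of our own, not one of the paper's test problems) builds a matrix with known SVD, runs RKM on data from a predominantly low-frequency solution, and averages the low- and high-frequency errors over repeated runs; the low-frequency error shrinks by a far larger factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 200, 100, 5

# test matrix with orthonormal factors and decaying singular values, so that
# A = U_m diag(s) V^t is its own SVD (illustrative choice)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = np.exp(-0.5 * np.arange(1, m + 1))
A = (U[:, :m] * s) @ V.T

# solution dominated by the first L right singular vectors (smooth initial error)
x_true = V[:, :L] @ rng.standard_normal(L) + 0.05 * V[:, L:] @ rng.standard_normal(m - L)
b = A @ x_true

row_sq = np.sum(A**2, axis=1)
p = row_sq / row_sq.sum()

def rkm_frequency_errors(n_iter, n_runs=20):
    """Average ||P_L e_k||^2 and ||P_H e_k||^2 over independent RKM runs."""
    eL2, eH2 = np.zeros(n_iter), np.zeros(n_iter)
    for _ in range(n_runs):
        x = np.zeros(m)
        for k in range(n_iter):
            i = rng.choice(n, p=p)
            x += (b[i] - A[i] @ x) / row_sq[i] * A[i]
            e = x - x_true
            cL, cH = V[:, :L].T @ e, V[:, L:].T @ e
            eL2[k] += cL @ cL
            eH2[k] += cH @ cH
    return eL2 / n_runs, eH2 / n_runs

eL2, eH2 = rkm_frequency_errors(400)
print(eL2[0] / eL2[-1], eH2[0] / eH2[-1])   # the low-frequency error decays far more
```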

Remark 3.2. Theorem 3.1 also covers the rank-deficient case, i.e. $\sigma_{L+1}=0$ (so that $c_2=0$ ), and it yields

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k]\leqslant (1-c_1)\Vert P_Le_k\Vert^2 \quad\text{and}\quad \mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k]\leqslant \Vert P_He_k\Vert^2.$

If $L=m$ , it recovers theorem 2.1(i) for exact data. The rank-deficient case was analyzed in [11].

Remark 3.3. By taking expectation of both sides of the estimates in theorem 3.1, we obtain

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2]\leqslant (1-c_1)\mathbb{E}[\Vert P_Le_k\Vert^2] + c_2\mathbb{E}[\Vert P_He_k\Vert^2] \quad\text{and}\quad \mathbb{E}[\Vert P_He_{k+1}\Vert^2]\leqslant c_2\mathbb{E}[\Vert P_Le_k\Vert^2] + (1+c_2)\mathbb{E}[\Vert P_He_k\Vert^2].$
Then the error propagation is given (componentwise) by

$\begin{pmatrix}\mathbb{E}[\Vert P_Le_{k}\Vert^2]\\ \mathbb{E}[\Vert P_He_{k}\Vert^2]\end{pmatrix}\leqslant D^k\begin{pmatrix}\Vert P_Le_{0}\Vert^2\\ \Vert P_He_{0}\Vert^2\end{pmatrix}, \qquad D=\begin{pmatrix}1-c_1 & c_2\\ c_2 & 1+c_2\end{pmatrix}.$
The pairs of eigenvalues $\lambda_\pm $ and (orthonormal) eigenvectors $v_\pm$ of D are given by

and

For the case $c_2\ll c_1 <1$ , i.e. $\alpha = \frac{c_2}{c_1}\ll 1$ , we have

and

With $V=[v_+ \ v_-]$ , we have the approximate eigendecomposition if $k = O(1)$ :

Thus, for $c_1\gg c_2$ , we have the following approximate error propagation for $k = O(1)$ :

3.2. Noisy data

Next we turn to the case of noisy data $b^\delta$ , cf. (2.5); we use the superscript δ to indicate the noisy case. Since $b_i^\delta=b_i+\eta_i$ , the RKM iteration reads

$x_{k+1} = x_k + \frac{b_{i}^\delta - \langle a_{i}, x_k\rangle}{\Vert a_{i}\Vert^2}\,a_{i},$

and thus the random error $e_{k+1} = x_{k+1} - x^*$ satisfies

$e_{k+1} = \Big(I-\frac{a_ia_i^t}{\Vert a_i\Vert ^2}\Big)e_k + \frac{\eta_i}{\Vert a_i\Vert^2}\,a_i. \qquad\qquad (3.2)$

Now we give our second main result, i.e. bounds on the errors $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_Le_{k+1}\Vert ^2| e_k]$ and $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_He_{k+1}\Vert ^2| e_k]$ for the noisy data.

Theorem 3.2. Let $c_1=\frac{\sigma_L^2}{\|A\|_F^2}$ , and $c_2=\frac{\sum_{i=L+1}^r\sigma_i^2}{\|A\|_F^2}$ . Then there hold

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k]\leqslant (1-c_1)\Vert P_Le_k\Vert^2 + c_2\Vert P_He_k\Vert^2 + \frac{\delta^2}{\Vert A\Vert_F^2} + \frac{2\delta}{\Vert A\Vert_F}c_2^{\frac{1}{2}}\Vert e_k\Vert,$

$\mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k]\leqslant c_2\Vert P_Le_k\Vert^2 + (1+c_2)\Vert P_He_k\Vert^2 + \frac{\delta^2}{\Vert A\Vert_F^2} + \frac{2\delta}{\Vert A\Vert_F}c_2^{\frac{1}{2}}\Vert e_k\Vert.$

Proof. By the recursive relation (3.2), we have the splitting

where the terms are given by

The first term ${\rm I}_1$ can be bounded directly by theorem 3.1. Clearly, ${\rm I}_2\leqslant \frac{\delta^2}{\Vert A\Vert _F^2}$ . For the third term ${\rm I}_3$ , we note the splitting

Thus, by Cauchy–Schwarz inequality, we have

Direct computation yields

Consequently, by lemma 3.2, we obtain

These estimates together show the first assertion. For the high-frequency component eH, we have

where the terms are given by

The term ${\rm I}_4$ is already bounded by theorem 3.1. Further, clearly, ${\rm I}_5 \leqslant \frac{\delta^2}{\Vert A\Vert _F^2}$ . For the term ${\rm I}_6$ , note the splitting

and thus ${\rm I}_6 = -{\rm I}_3$ . This shows the second assertion, and completes the proof of the theorem. □

Remark 3.4. Recall the following estimate for RKM [36, theorem 3.7]

The estimate in theorem 3.2 is free from $\kappa_A$ , but introduces an additional term $\frac{2\delta}{\|A\|_F}c_2^\frac{1}{2}\|e_k\|$ . Since $c_2$ is generally small, this extra term is comparable with $\Vert A\Vert _F^{-2}\delta^2$ . Theorem 3.2 extends theorem 3.1 to the noisy case: if $\delta=0$ , it recovers theorem 3.1. It indicates that if the initial error $e_0=x_0-x^*$ concentrates mostly on low frequency, the iterate will first decrease the error. The smoothness assumption on the initial error e0 is realistic for inverse problems, notably under the standard source type conditions (for deriving convergence rates) [5, 16]. Nonetheless, the deleterious noise influence will eventually kick in as the iteration proceeds.

Remark 3.5. One can discuss the evolution of the iterates for noisy data, similar to remark 3.3. By Young's inequality $ \newcommand{\e}{{\rm e}} 2ab\leqslant \epsilon a^2+\epsilon^{-1}b^2$ , the error satisfies (with $\bar c_1 = {(1-\epsilon)c_1}$ and $\bar c_2 = {(1+\epsilon)c_2})$

Then it follows that

In the case $\bar c_2\ll\bar c_1<1$ and $\alpha = \frac{\bar c_2}{\bar c_1}\ll 1$ (by choosing a reasonably small $\epsilon$ ), for $k=O(1)$ , repeating the analysis in remark 3.3 yields

Thus, during the initial iterations, the presence of data noise influences the error of the RKM iterates only mildly, through an additive term of order $k\delta^2$ .

4. RKM with variance reduction

When equipped with a proper stopping criterion, Kaczmarz method is a regularization method [21, 23]. Naturally, one would expect that this assertion holds also for RKM (2.3)–(2.4). This however remains to be proven due to the lack of a proper stopping criterion. To see the delicacy, consider one natural choice, i.e. Morozov's discrepancy principle [27]: choose the smallest integer k such that

$\Vert Ax_k - b^\delta\Vert \leqslant \tau\delta, \qquad\qquad (4.1)$

where $\tau>1$ is fixed [5, 16]. Theoretically, it is still unclear whether (4.1) can be satisfied within a finite number of iterations for every noise level $\delta>0$ . In practice, computing the residual $\Vert Ax_k-b^\delta\Vert $ at each iteration is undesirable since its cost is of the order of evaluating the full gradient, whereas avoiding the latter is the very motivation for RKM! Below we propose one simple remedy by drawing on its connection with stochastic gradient methods [31] and the vast related developments.

First we note that solving (2.1) is equivalent to minimizing the least-squares problem

$\min_{x\in\mathbb{R}^m} f(x) := \frac{1}{2n}\Vert Ax - b\Vert^2. \qquad\qquad (4.2)$

Next we recast RKM as a stochastic gradient method for problem (4.2), as noted earlier in [30]. We include a short proof for completeness.

Proposition 4.1. The RKM iteration (2.3)–(2.4) is a (weighted) stochastic gradient update with a constant stepsize $n/\Vert A\Vert _F^2$ .

Proof. With the weight $w_i=\Vert a_i\Vert ^2$ , we rewrite problem (4.2) into

$\min_{x\in\mathbb{R}^m} f(x) = \sum_{i=1}^{n}\frac{w_i}{\Vert A\Vert_F^2}\,f_i(x), \qquad\text{with } f_i(x) = \frac{\Vert A\Vert_F^2}{2nw_i}\big(\langle a_i, x\rangle - b_i\big)^2.$

Since $\sum_{i=1}^n w_i=\Vert A\Vert _F^2$ , we may interpret $p_i=w_i/\Vert A\Vert _F^2$ as a probability distribution on the set $\{1, \ldots, n\}$ , i.e. (2.4). Next we apply the stochastic gradient method. Since $g_i(x):=\nabla f_i(x) =\frac{\Vert A\Vert _F^2}{nw_i}(\langle a_i, x\rangle-b_i)a_i$ , with a fixed step length $\eta=n\Vert A\Vert _F^{-2}$ , we get

$x_{k+1} = x_k - \eta\, g_i(x_k) = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\Vert a_i\Vert^2}\,a_i,$
where $i\in \{1, \ldots, n\}$ is drawn i.i.d. according to (2.4). Clearly, it is equivalent to RKM (2.3)–(2.4). □
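The equivalence is a one-line computation and can also be checked numerically; in the following sketch (illustrative data of our own), a single weighted stochastic gradient step with stepsize $n/\Vert A\Vert_F^2$ reproduces the RKM projection exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 15
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)
i = 7                                               # any fixed row index

w = np.sum(A**2, axis=1)                            # weights w_i = ||a_i||^2
fro2 = w.sum()                                      # ||A||_F^2
g_i = fro2 / (n * w[i]) * (A[i] @ x - b[i]) * A[i]  # weighted stochastic gradient g_i(x)
x_sgd = x - (n / fro2) * g_i                        # SGD step with stepsize n / ||A||_F^2

x_rkm = x + (b[i] - A[i] @ x) / w[i] * A[i]         # RKM update (2.3)
print(np.allclose(x_sgd, x_rkm))                    # True
```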

Now we give the mean and covariance of the stochastic gradient $g_i(x)$ .

Proposition 4.2. Let $g(x)=\nabla f(x)$ . Then the gradient $g_i(x)$ satisfies

$\mathbb{E}[g_i(x)] = g(x) \quad\text{and}\quad {\rm Cov}[g_i(x)] = \frac{\Vert A\Vert_F^2}{n^2}\sum_{i=1}^{n}\frac{(\langle a_i, x\rangle-b_i)^2}{\Vert a_i\Vert^2}\,a_ia_i^t - g(x)g(x)^t.$

Proof. The full gradient $g(x):=\nabla f(x)$ at x is given by $g(x)= \frac{1}{n}A^t(Ax-b)$ . The mean $ \newcommand{\E}{{\mathbb{E}}} \E[g_i(x)]$ of the (partial) gradient $g_i(x)$ is given by

Next, by bias-variance decomposition, the covariance ${\rm Cov}[g_i(x)]$ of the gradient $g_i(x)$ is given by

This completes the proof of the proposition. □

Thus, the single gradient $g_i(x)$ is an unbiased estimate of the full gradient $g(x)$ . For consistent linear systems, the covariance ${\rm Cov}[g_i(x)]$ is asymptotically vanishing: as $x_k\to x^*$ , both terms in the variance expression tend to zero. However, for inconsistent linear systems, the covariance ${\rm Cov}[g_i(x)]$ generally does not vanish at the optimal solution $x^*$ :

${\rm Cov}[g_i(x^*)] \approx \frac{\Vert A\Vert_F^2}{n^2}\sum_{i=1}^{n}\frac{(\langle a_i, x^*\rangle-b_i^\delta)^2}{\Vert a_i\Vert^2}\,a_ia_i^t \neq 0,$

since one might expect $A^t(Ax^*-b^\delta)\approx 0$ , so that the term $g(x^*)g(x^*)^t$ is negligible. Further, ${\rm Cov}[g_i(x^*)]$ is of the order $\delta^2$ in a neighborhood of $x^*$ . One may predict the (asymptotic) dynamics of RKM via a stochastic modified equation derived from the covariance [25]. Due to the nonvanishing covariance, the RKM iteration eventually deteriorates, and its asymptotic convergence slows down.
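A small numerical check of this point (illustrative code of our own): over the exact sampling distribution (2.4), the trace of ${\rm Cov}[g_i(x^*)]$ vanishes for consistent data but stays strictly positive once noise is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 30
A = rng.standard_normal((n, m))
x_star = rng.standard_normal(m)
b = A @ x_star                                      # consistent data
b_noisy = b + 1e-2 * rng.standard_normal(n)         # inconsistent (noisy) data

row_sq = np.sum(A**2, axis=1)
fro2 = row_sq.sum()
p = row_sq / fro2

def trace_cov_at_solution(rhs):
    """Trace of Cov[g_i(x*)] computed exactly over the sampling distribution p."""
    G = (fro2 / (n * row_sq))[:, None] * (A @ x_star - rhs)[:, None] * A   # rows are g_i(x*)
    mean = p @ G                                    # E[g_i(x*)] = g(x*)
    second = np.einsum('i,ij,ij->', p, G, G)        # E[||g_i(x*)||^2]
    return second - mean @ mean

print(trace_cov_at_solution(b), trace_cov_at_solution(b_noisy))   # 0 versus strictly positive
```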

These discussions motivate the use of variance reduction techniques developed for stochastic gradient methods to reduce the variance of the gradient estimate. There are several possible strategies, e.g. stepsize reduction, stochastic variance reduced gradient (SVRG), averaging and mini-batch (see e.g. [19, 32]). We only adapt SVRG [19] to RKM, termed RKM with variance reduction (RKMVR); see algorithm 1 for details. It hybridizes the stochastic gradient with the (occasional) full gradient to achieve variance reduction. Here, s is the epoch length, which determines the frequency of full gradient evaluation and was suggested to be n in [19], and K is the maximum number of iterations. In view of Step 2, within the first epoch, it performs only the standard RKM, and at the end of the epoch, it evaluates the full gradient. In RKMVR, the residual $\Vert Ax_k-b^\delta\Vert $ is a direct by-product of the full gradient evaluation and is computed only at the end of each epoch, and thus it does not incur additional computational effort.

The update at Step 9 of algorithm 1 can be rewritten as (for $k\geqslant s$ )

$x_{k+1} = x_k - \frac{\langle a_i, x_k-\tilde x\rangle}{\Vert a_i\Vert^2}\,a_i - \frac{1}{\Vert A\Vert_F^2}A^t(A\tilde x - b^\delta),$

and, as the iteration proceeds, $\tilde x-x_k\to 0$ , so that the first correction vanishes and the update recovers the Landweber method. With this choice, the variance of the gradient estimate is asymptotically vanishing [19]. Numerically, algorithm 1 converges rather steadily. That is, it combines the strengths of RKM and the Landweber method: it enjoys the fast initial convergence of the former and the excellent stability of the latter.

Algorithm 1. Randomized Kaczmarz method with variance reduction (RKMVR).

1: Specify A, b, x0, K, and s.
2: Initialize $g_i(\tilde x) = 0$ , and $\tilde g = 0$ .
3: for $k=1, \ldots, K$ do
4:   if $k\ {\rm mod}\ s=0$ then
5:     Set $\tilde x = x_k$ and $\tilde g= g(x_k)$ .
6:     Check the discrepancy principle (4.1).
7:   end if
8:   Pick an index i according to (2.4).
9:   Update xk by
          $x_{k+1} = x_k - \frac{n}{\Vert A\Vert _F^2}(g_i(x_k)-g_i(\tilde x) + \tilde g).$
10: end for
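For reference, a compact NumPy sketch of algorithm 1 (an illustrative implementation of ours; the function name, the optional stopping argument and the test problem are not from the paper):

```python
import numpy as np

def rkmvr(A, b, x0, n_iter, epoch, rng, tau_delta=None):
    """RKM with variance reduction: hybridize RKM steps with an epoch-wise snapshot
    of the full gradient g(x) = (1/n) A^t (A x - b), in the spirit of SVRG."""
    n = A.shape[0]
    row_sq = np.sum(A**2, axis=1)
    fro2 = row_sq.sum()
    p = row_sq / fro2                               # sampling probabilities (2.4)
    eta = n / fro2                                  # constant stepsize n / ||A||_F^2

    def g_single(x, i):                             # weighted stochastic gradient g_i(x)
        return fro2 / (n * row_sq[i]) * (A[i] @ x - b[i]) * A[i]

    x, x_snap = x0.copy(), x0.copy()
    g_snap = np.zeros_like(x0)                      # zero snapshot: first epoch is plain RKM
    have_snap = False
    for k in range(1, n_iter + 1):
        if k % epoch == 0:                          # end of an epoch: full gradient snapshot
            r = A @ x - b
            x_snap, g_snap, have_snap = x.copy(), A.T @ r / n, True
            if tau_delta is not None and np.linalg.norm(r) <= tau_delta:
                break                               # discrepancy principle (4.1)
        i = rng.choice(n, p=p)
        corr = g_single(x_snap, i) if have_snap else 0.0
        x = x - eta * (g_single(x, i) - corr + g_snap)
    return x

# illustrative use on a small consistent system
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))
x_true = rng.standard_normal(40)
b = A @ x_true
x = rkmvr(A, b, np.zeros(40), n_iter=5000, epoch=100, rng=rng)
print(np.linalg.norm(x - x_true))
```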

5. Numerical experiments and discussions

Now we present numerical results for RKM and RKMVR to illustrate their distinct features. All the numerical examples, i.e. phillips, gravity and shaw, are taken from the public domain MATLAB package Regutools. They are Fredholm integral equations of the first kind, with the first example being mildly ill-posed and the last two severely ill-posed. Unless otherwise stated, the examples are discretized with a dimension $n=m=1000$ . The noisy data $b^\delta$ is generated from the exact data b as

where δ is the relative noise level, and the random variables $\xi_i$ follow the i.i.d. standard Gaussian distribution. The initial guess x0 for the iterative methods is $x_0=0$ . We present the squared error ek and/or the squared residual rk, i.e.

$e_k = \mathbb{E}[\Vert x_k - x^*\Vert^2] \quad\text{and}\quad r_k = \mathbb{E}[\Vert Ax_k - b^\delta\Vert^2]. \qquad\qquad (5.1)$

The expectation $\mathbb{E}[\cdot]$ with respect to the random choice of the rows is approximated by the average of 100 independent runs. All the computations were carried out on a personal laptop with 2.50 GHz CPU and 8.00 G RAM by MATLAB 2015b.

5.1. Benefit of randomization

First we compare the performance of RKM with the cyclic Kaczmarz method (KM) to illustrate the benefit of randomization. Overall, the randomization can substantially improve the convergence of KM; see the results in figures 1–3 for the examples with different noise levels.


Figure 1. Numerical results (ek and rk) for phillips by KM and RKM. (a) $\delta=0$ , (b) $\delta=10^{-3}$ , (c) $\delta = 10^{-2}$ and (d) $\delta=5\times10^{-2}$ .


Figure 2. Numerical results (ek and rk) for gravity by KM and RKM. (a) $\delta=0$ , (b) $\delta=10^{-3}$ , (c) $\delta = 10^{-2}$ and (d) $\delta=5\times10^{-2}$ .


Figure 3. Numerical results (ek and rk) for shaw by KM and RKM. (a) $\delta=0$ , (b) $\delta=10^{-3}$ , (c) $\delta = 10^{-2}$ and (d) $\delta=5\times10^{-2}$ .


Next we examine the convergence more closely. The (squared) error ek of the Kaczmarz iterate xk undergoes a sudden drop at the end of each cycle, whereas within the cycle, the drop after each Kaczmarz iteration is small. Intuitively, this can be attributed to the fact that neighboring rows of the matrix A are highly correlated with each other, and thus each single Kaczmarz iteration reduces the (squared) error ek only very little, since it roughly repeats the previous projection. The strong correlation between neighboring rows is the culprit of the slow convergence of the cyclic KM. The randomization ensures that any two rows chosen by two consecutive RKM iterations are less correlated, and thus the iterations are far more effective for reducing the error ek, leading to a much faster empirical convergence. These observations hold for both exact and noisy data. For noisy data, the error ek first decreases and then increases for both KM and RKM, and the larger is the noise level δ, the earlier does the divergence occur. That is, both exhibit a 'semiconvergence' phenomenon typical for iterative regularization methods. Thus a suitable stopping criterion is needed. Meanwhile, the residual rk tends to decrease, but for both methods, it oscillates wildly for noisy data and the oscillation magnitude increases with δ. This is due to the nonvanishing variance; see the discussions in section 4. One surprising observation is that a fairly reasonable inverse solution can be obtained by RKM within one cycle of iterations. That is, ignoring all other costs, RKM can solve the inverse problems reasonably well at a cost less than one full gradient evaluation!

5.2. Preasymptotic convergence

Now we examine the convergence of RKM. Theorems 3.1 and 3.2 predict that during the first iterations, the low-frequency error $e_L=\mathbb{E}[\Vert P_Le_k\Vert ^2]$ decreases rapidly, but the high-frequency error $e_H=\mathbb{E}[\Vert P_He_k\Vert ^2]$ can at best decay mildly. For all examples, the first five singular vectors can capture the majority of the energy of the initial error $x^*-x_0$ . Thus, we choose a truncation level $L=5$ , and plot the evolution of the low-frequency and high-frequency errors eL and eH, and the total error $e=\mathbb{E}[\Vert e_k\Vert ^2]$ , in figure 4.


Figure 4. The error decay for the examples with two noise levels: $\delta=10^{-2}$ (top) and $\delta = 5\times 10^{-2}$ (bottom), with a truncation level $L=5$ . (a) phillips, (b) gravity and (c) shaw.


Numerically, the low-frequency error eL decays much more rapidly during the initial iterations, and since the low-frequency modes are dominant, the total error e also enjoys a very fast initial decay. Intuitively, this behavior may be explained as follows. The rows of the matrix ${A}$ mainly contain low-frequency modes, and thus each RKM iteration tends to mostly decrease the low-frequency error eL of the initial error $x^*-x_0$ . The high-frequency error eH experiences a similar but slower decay during the iteration, and then levels off. These observations fully confirm the preasymptotic analysis in section 3. For noisy data, the error ek can be highly oscillatory, so is the residual rk. The larger is the noise level δ, the larger is the oscillation magnitude. However, the degree of ill-posedness of the problem seems not to affect the convergence of RKM, so long as $x^*$ is mainly composed of low-frequency modes.

To shed further insights, we present in figure 5 the decay behavior of the low- and high-frequency errors for the example phillips with a random solution whose entries follow the i.i.d. standard normal distribution. Then the source type condition is not satisfied by the initial error. Now with a truncation level $L=5$ , the low-frequency error eL only constitutes a small fraction of the initial error e0. The low-frequency error eL decays rapidly, exhibiting a fast preasymptotic convergence as predicted by theorem 3.2, but the high-frequency error eH stagnates during the iteration. Thus, in the absence of the smoothness condition on e0, RKM is ineffective, thereby supporting theorems 3.1 and 3.2.


Figure 5. The error decay for phillips with a random solution, with a truncation level $L=5$ . (a) $\delta=10^{-2}$ and (b) $\delta=5\times10^{-2}$ .


Naturally, one may divide the total error e into more than two frequency bands. The empirical behavior is similar to the case of two frequency bands; see figure 6 for an illustration on the example phillips, with four frequency bands. The lowest-frequency error e1 decreases fastest, the next band e2 slightly slower, etc. These observations clearly indicate that even though RKM does not employ the full gradient, the iterates are still mainly concerned with the low-frequency modes during the first iterations, like the Landweber method, in the sense that the low-frequency modes are much easier to recover than the high-frequency ones. However, the cost of each RKM iteration is only one-nth of that of one Landweber iteration, and thus it is computationally much more efficient.


Figure 6. The error decay for phillips. The total error e is divided into four frequency bands: 1–3, 4–6, 7–9, and the remaining, denoted by ei, $i=1, \ldots, 4$ . (a) $\delta=10^{-2}$ and (b) $\delta=5\times10^{-2}$ .


5.3. RKM versus RKMVR

The nonvanishing variance of the gradient $g_i(x)$ slows down the asymptotic convergence of RKM, and the iterates eventually tend to oscillate wildly in the presence of data noise; see the discussion in section 4. This is expected: the iterate converges to the least-squares solution, which is known to be highly oscillatory for ill-posed inverse problems. Variance reduction is one natural strategy to decrease the variance of the gradient estimate, thereby stabilizing the evolution of the iterates. To illustrate this, we compare the evolution of RKM with RKMVR in figure 7. We also include the results by the Landweber method (LM). To compare the iteration complexity only, we count one Landweber iteration as n RKM iterations. The epoch of RKMVR is set to n, the total number of data points, as suggested in [19]. Thus n RKMVR iterations include one full gradient evaluation, and it amounts to 2n RKM iterations. The full gradient evaluation is indicated by flat segments in the plots.


Figure 7. Numerical results for the examples by RKM, RKMVR and LM. (a) phillips, $\delta=10^{-2}$ , (b) phillips, $\delta=5\times10^{-2}$ , (c) gravity, $\delta=10^{-2}$ , (d) gravity, $\delta=5\times10^{-2}$ , (e) shaw, $\delta=10^{-2}$ and (f) shaw, $\delta=5\times10^{-2}$ .


As the noise level δ increases, the semiconvergence of RKM becomes more pronounced: the error ek first decreases and then increases, which is especially visible at $\delta = 5\times 10^{-2}$ . This is well reflected by the large oscillations of the iterates. RKMVR tends to stabilize the iteration greatly by removing the large oscillations, and thus its asymptotic behavior resembles closely that of LM. That is, RKMVR inherits the good stability of LM, while retaining the fast initial convergence of RKM. Thus, the stopping criterion, though still needed, is less critical for RKMVR, which is very beneficial from the practical point of view. In summary, the simple variance reduction scheme in algorithm 1 can combine the strengths of both worlds.

Last, we numerically examine the regularizing property of RKMVR with the discrepancy principle (4.1). In figure 8, we present the number of iterations for several noise levels for RKMVR (one realization) and LM. For both methods, the number of iterations by the discrepancy principle (4.1) appears to decrease with the noise level δ, and RKMVR consistently terminates much earlier than LM, indicating the efficiency of RKMVR. The reconstructions in figure 8(d) show that the error increases with the noise level δ, indicating a regularizing property. In contrast, in the absence of the discrepancy principle, the RKMVR iterates eventually diverge as the iteration proceeds, cf. figure 7.


Figure 8. The residual rk and the recoveries for phillips (top), gravity (middle), shaw (bottom) by RKMVR and LM with the discrepancy principle (4.1) with $\tau=1.1$ . (a) $\delta=10^{-3}$ , (b) $\delta=10^{-2}$ , (c) $\delta=5\times10^{-2}$ and (d) solutions.


6. Conclusions

We have presented an analysis of the preasymptotic convergence behavior of the randomized Kaczmarz method. Our analysis indicates that the low-frequency error decays much faster than the high-frequency one during the initial randomized Kaczmarz iterations. Thus, when the low-frequency modes are dominant in the initial error, as typically occurs for inverse problems, the method enjoys very fast initial error reduction. This result sheds insights into the excellent practical performance of the method, which is also numerically confirmed. Next, by interpreting it as a stochastic gradient method, we proposed a randomized Kaczmarz method with variance reduction by hybridizing it with the Landweber method. Our numerical experiments indicate that the strategy is very effective in that it can combine the strengths of both the randomized Kaczmarz method and the Landweber method.

Our work represents only a first step towards a complete theoretical understanding of the randomized Kaczmarz method and related stochastic gradient methods (e.g. variable step size, and mini-batch version) for efficiently solving inverse problems. There are many important theoretical and practical questions awaiting further research. Theoretically, one outstanding issue is the regularizing property (e.g. consistency, stopping criterion and convergence rates) of the randomized Kaczmarz method from the perspective of classical regularization theory.

Acknowledgments

The authors are grateful to the constructive comments of the anonymous referees, which have helped improve the quality of the paper. In particular, the remark by one of the referees has led to much improved results as well as more concise proofs. The research of Y Jiao is partially supported by National Science Foundation of China (NSFC) No. 11501579 and National Science Foundation of Hubei Province No. 2016CFB486, B Jin by EPSRC grant EP/M025160/1 and UCL Global Engagement grant (2016–2017), and X Lu by NSFC Nos. 11471253 and 91630313.
