
Preasymptotic convergence of randomized Kaczmarz method

Yuling Jiao, Bangti Jin and Xiliang Lu

Published 23 November 2017 © 2017 IOP Publishing Ltd
Citation: Yuling Jiao et al 2017 Inverse Problems 33 125012, DOI 10.1088/1361-6420/aa8e82


Abstract

Kaczmarz method is one popular iterative method for solving inverse problems, especially in computed tomography. Recently, it was established that a randomized version of the method enjoys an exponential convergence for well-posed problems, and the convergence rate is determined by a variant of the condition number. In this work, we analyze the preasymptotic convergence behavior of the randomized Kaczmarz method, and show that the low-frequency error (with respect to the right singular vectors) decays faster during first iterations than the high-frequency error. Under the assumption that the initial error is smooth (e.g. sourcewise representation), the result explains the fast empirical convergence behavior, thereby shedding new insights into the excellent performance of the randomized Kaczmarz method in practice. Further, we propose a simple strategy to stabilize the asymptotic convergence of the iteration by means of variance reduction. We provide extensive numerical experiments to confirm the analysis and to elucidate the behavior of the algorithms.


1. Introduction

Kaczmarz method [20], named after the Polish mathematician Stefan Kaczmarz, is one popular iterative method for solving linear systems. It is a special form of the general alternating projection method. In the computed tomography (CT) community, it was rediscovered in 1970 by Gordon, Bender and Herman [9], under the name algebraic reconstruction techniques. It was implemented in the very first medical CT scanner, and since then it has been widely employed in CT reconstructions [14, 15, 28].

The convergence of Kaczmarz method for consistent linear systems is not hard to show. However, the theoretically very important issue of convergence rates of Kaczmarz method (or the alternating projection method for linear subspaces) is very challenging. There are several known convergence rates results, all relying on (spectral) quantities of the matrix A that are difficult to compute or verify in practice (see [7] and the references therein). This challenge is well reflected by the fact that the convergence rate of the method depends strongly on the ordering of the equations.

It has been observed numerically, several times independently in the literature, that using the rows of the matrix A in Kaczmarz method in a random order rather than the given order, a variant called the randomized Kaczmarz method (RKM) below, can often substantially improve the convergence [15, 28]. Thus RKM is quite appealing for practical applications. However, the convergence rate analysis was given only very recently. In an influential paper [34], in 2009, Strohmer and Vershynin established the exponential convergence of RKM for consistent linear systems, and the convergence rate depends on (a variant of) the condition number. This result was then extended and refined in various directions [1, 3, 10, 26, 29, 33, 35], including inconsistent or underdetermined linear systems. Recently, Schöpfer and Lorenz [33] showed the exponential convergence of RKM for sparse recovery with elastic net. We recall the result of Strohmer and Vershynin and its counterpart for noisy data in theorem 2.1 below. It is worth noting that all these estimates involve the condition number, and for noisy data, the estimate contains a term inversely proportional to the smallest singular value of the matrix A.

These important and interesting existing results do not fully explain the excellent empirical performance of RKM for solving linear inverse problems, especially in the case of noisy data, where the term due to noise is amplified by a factor of the condition number. In practice, one usually observes that the iterates first converge quickly to a good approximation to the true solution, and then start to diverge slowly. That is, it exhibits the typical 'semiconvergence' phenomenon for iterative regularization methods, e.g. Landweber method and conjugate gradient methods [13, 21]. This behavior is not well reflected in the known estimates given in theorem 2.1; see section 2 for further comments.

The purpose of this work is to study the preasymptotic convergence behavior of RKM. This is achieved by carefully analyzing the evolution of the low- and high-frequency errors during the randomized Kaczmarz iteration, where the frequency is divided according to the right singular vectors of the matrix A. The results indicate that during the first iterations, the low-frequency error decays much faster than the high-frequency one, see theorems 3.1 and 3.2. Since the inverse solution (relative to the initial guess x0) is often smooth in the sense that it consists mostly of low-frequency components [13], this explains the good convergence behavior of RKM, thereby shedding new insights into its excellent practical performance. This condition on the inverse solution is akin to the sourcewise representation condition in classical regularization theory [5, 16]. Further, based on the fact that RKM is a special case of the stochastic gradient method [31], we propose a simple modified version using the idea of variance reduction by hybridizing it with the Landweber method, inspired by [19]. This variant enjoys both good preasymptotic and asymptotic convergence behavior, as indicated by the numerical experiments.

Last, we note that in the context of inverse problems, Kaczmarz method has received much recent attention, and has demonstrated very encouraging results in a number of applications. The regularizing property and convergence rates in various settings have been analyzed for both linear and nonlinear inverse problems (see [2, 4, 12, 17, 18, 22–24] for an incomplete list). However, these interesting works all focus on a fixed ordering of the linear system, instead of the randomized variant under consideration here, and thus they do not cover RKM.

The rest of the paper is organized as follows. In section 2 we describe RKM and recall the basic tool for our analysis, i.e. singular value decomposition, and a few useful notations. Then in section 3 we derive the preasymptotic convergence rates for exact and noisy data. Some practical issues are discussed in section 4. Last, in section 5, we present extensive numerical experiments to confirm the analysis and shed further insights.

2. Randomized Kaczmarz method

Now we describe the problem setting and RKM, and also recall known convergence rates results for both consistent and inconsistent data. The linear inverse problem with exact data can be cast into

$Ax = b, \qquad\qquad (2.1)$

where the matrix $A\in \mathbb{R}^{n\times m}$ and the data $b\in\mathbb{R}^n$ satisfies $b\in {\rm range}(A)$. We denote the ith row of the matrix A by $a_i^t$, with $a_i\in\mathbb{R}^m$ being a column vector, where the superscript t denotes the vector/matrix transpose. The linear system (2.1) can be either over-determined or under-determined.

The classical Kaczmarz method [20] proceeds as follows. Given the initial guess x0, we iterate

$x_{k+1} = x_k + \frac{b_{i_k} - \langle a_{i_k}, x_k\rangle}{\Vert a_{i_k}\Vert^2}\,a_{i_k}, \qquad i_k = (k\ \mathrm{mod}\ n)+1, \qquad\qquad (2.2)$

where $\langle\cdot, \cdot\rangle$ and $\Vert \cdot\Vert $ denote the Euclidean inner product and norm, respectively. Thus, Kaczmarz method sweeps through the equations in a cyclic manner, and n iterations constitute one complete cycle.
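To make the iteration concrete, here is a minimal NumPy sketch of the cyclic sweep (2.2); the helper name and the small test system are ours and purely illustrative, not taken from the paper.

```python
import numpy as np

def kaczmarz_cyclic(A, b, x0, n_sweeps):
    """Classical (cyclic) Kaczmarz: sweep through the rows in the given order."""
    x = x0.copy()
    n = A.shape[0]
    for _ in range(n_sweeps):
        for i in range(n):                      # one sweep = n successive row projections
            ai = A[i]
            x += (b[i] - ai @ x) / (ai @ ai) * ai
    return x

# illustrative consistent system
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x = kaczmarz_cyclic(A, b, np.zeros(20), n_sweeps=200)
print(np.linalg.norm(x - x_true))               # error is small after many sweeps
```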

In contrast to the cyclic choice of the index i in Kaczmarz method, RKM randomly selects i. There are several different variants, depending on the specific random choice of the index i. The variant analyzed by Strohmer and Vershynin [34] is as follows. Given an initial guess x0, we iterate

$x_{k+1} = x_k + \frac{b_{i} - \langle a_{i}, x_k\rangle}{\Vert a_{i}\Vert^2}\,a_{i}, \qquad\qquad (2.3)$

where the index i is drawn independently and identically distributed (i.i.d.) from the index set $\{1, 2, \ldots, n\}$, with the probability $p_i$ of picking the ith row given by

$p_i = \frac{\Vert a_i\Vert^2}{\Vert A\Vert_F^2}, \qquad\qquad (2.4)$

where $\Vert \cdot\Vert _F$ denotes the matrix Frobenius norm. This choice of the probability distribution pi lends itself to a convenient convergence analysis [34]. In this work, we shall focus on the variant (2.3)–(2.4).
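A matching sketch of the randomized variant (2.3)–(2.4), again with an illustrative test system of our own; rows are sampled i.i.d. with probability proportional to their squared norms.

```python
import numpy as np

def rkm(A, b, x0, n_iter, rng):
    """Randomized Kaczmarz: sample row i with probability p_i = ||a_i||^2 / ||A||_F^2."""
    x = x0.copy()
    row_sq = np.sum(A**2, axis=1)
    p = row_sq / row_sq.sum()                   # sampling probabilities (2.4)
    for _ in range(n_iter):
        i = rng.choice(A.shape[0], p=p)
        ai = A[i]
        x += (b[i] - ai @ x) / row_sq[i] * ai   # projection step (2.3)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
x = rkm(A, b, np.zeros(20), n_iter=5000, rng=rng)
print(np.linalg.norm(x - x_true))
```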

Similarly, the noisy data $b^\delta$ is given by

$b^\delta = b + \eta, \qquad \Vert \eta\Vert \leqslant \delta, \qquad\qquad (2.5)$

where δ is the noise level. RKM reads: given the initial guess x0, we iterate

$x_{k+1} = x_k + \frac{b_{i}^\delta - \langle a_{i}, x_k\rangle}{\Vert a_{i}\Vert^2}\,a_{i},$
where the index i is drawn i.i.d. according to (2.4).

The following theorem summarizes typical convergence results of RKM for consistent and inconsistent linear systems [29, 34, 36] (see [26] for in-depth discussions), under the condition that the matrix A is of full column-rank. For a rectangular matrix $A\in\mathbb{R}^{n\times m}$ , we denote by $A^{\dagger}\in\mathbb{R}^{m\times n}$ the pseudoinverse of A, $\Vert A\Vert _2$ denotes the matrix spectral norm, and $\sigma_{\rm min}(A)$ the smallest singular value of A. The error $\Vert x_k-x^*\Vert $ of the RKM iterate xk is stochastic due to the random choice of the index i. Below $ \newcommand{\E}{{\mathbb{E}}} \E[\cdot]$ denotes expectation with respect to the random row index selection. Note that $\kappa_A$ differs from the usual condition number [8].

Theorem 2.1. Let xk be the solution generated by RKM (2.3)–(2.4) at iteration k, and $\kappa_A=\Vert A\Vert _F\Vert A^{\dagger}\Vert _2$ be a (generalized) condition number. Then the following statements hold.

  • (i)  
    For exact data, there holds
    $\mathbb{E}[\Vert x_k-x^*\Vert^2]\leqslant \left(1-\kappa_A^{-2}\right)^k\Vert x_0-x^*\Vert^2.$
  • (ii)  
    For noisy data, there holds
    $\mathbb{E}[\Vert x_k-x^*\Vert^2]\leqslant \left(1-\kappa_A^{-2}\right)^k\Vert x_0-x^*\Vert^2 + \frac{\delta^2}{\sigma^2_{\min}(A)}.$

Theorem 2.1 gives error estimates (in expectation) for any iterate xk, $k\geqslant 1$ : the convergence rate is determined by $\kappa_A$ . For ill-posed linear inverse problems (e.g. CT), bad conditioning is characteristic and the condition number $\kappa_A$ can be huge, and thus the theorem predicts a very slow convergence. However, in practice, RKM converges rapidly during the initial iteration. The estimate is also deceptive for noisy data: due to the presence of the term $\delta^2/\sigma^2_{\min}(A)$ , it implies blowup at the very first iteration, which is however not the case in practice. Hence, these results do not fully explain the excellent empirical convergence of RKM for inverse problems.

The next example compares the convergence rates of Kaczmarz method and RKM.

Example 2.1. Given $n\geqslant 2$ , let $\theta=\frac{2\pi}{n}$ . Consider the linear system with $A\in\mathbb{R}^{n\times 2}$ , $a_i=(\cos(i-1)\theta,\ \sin(i-1)\theta)^t$ and the exact solution $x^*=0$ , i.e. $b=0$ . Then we solve it by Kaczmarz method and RKM. For any $e_0=(x_0, y_0)$ , after one Kaczmarz iteration, $e_1=(0,y_0)$ , and generally, after k iterations,

$\Vert e_k\Vert = |\cos\theta|^{k-1}\,|y_0|.$

For large n, the decreasing factor $| \cos\theta|$ can be very close to one, and thus each Kaczmarz iteration can only decrease the error slowly. Thus, the convergence rate of Kaczmarz method depends strongly on n: the larger is n, the slower is the convergence. Similarly, for RKM, there holds

$\mathbb{E}[\Vert e_{k+1}\Vert^2\,|\,e_k] = \tfrac{1}{2}\Vert e_k\Vert^2$

and

$\mathbb{E}[\Vert e_{k}\Vert^2] = 2^{-k}\Vert e_0\Vert^2.$

For RKM, the convergence rate is independent of n. Further, for any $n> 8$ , we have $0<\theta<\frac{\pi}{4}$ , and $\cos \theta> \cos \frac{\pi}{4}= 2^{-1/2}$ , i.e. the per-iteration contraction factor of the cyclic method exceeds that of RKM (in the mean squared sense). This shows the superiority of RKM over the cyclic one.
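The example is easy to check numerically; the following short script (ours, not from the paper) runs both methods on the equiangular system above and shows the much faster error decay of RKM for a moderately large n.

```python
import numpy as np

n = 64                                          # larger n slows the cyclic method down
theta = 2 * np.pi / n
A = np.stack([np.cos(np.arange(n) * theta), np.sin(np.arange(n) * theta)], axis=1)
rng = np.random.default_rng(0)

def run(select_row):
    x = np.array([1.0, 1.0])                    # initial error e_0 = (1, 1); b = 0, x* = 0
    errs = []
    for k in range(200):
        ai = A[select_row(k)]
        x -= (ai @ x) / (ai @ ai) * ai          # project onto the hyperplane <a_i, x> = 0
        errs.append(np.linalg.norm(x))
    return errs

err_km = run(lambda k: k % n)                   # cyclic order
err_rkm = run(lambda k: rng.integers(n))        # uniform sampling = (2.4), since all rows have unit norm
print(err_km[-1], err_rkm[-1])                  # RKM error is many orders of magnitude smaller
```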

Last we recall singular value decomposition (SVD) of the matrix A [8], which is the basic tool for the convergence analysis in section 3. We denote SVD of $A\in\mathbb{R}^{n\times m}$ by

$A = U\Sigma V^t,$

where $U\in \mathbb{R}^{n\times n}$ and $V\in \mathbb{R}^{m\times m}$ are orthonormal matrices, whose column vectors are known as the left and right singular vectors, respectively, and $\Sigma \in\mathbb{R}^{n\times m}$ is diagonal with the diagonal elements ordered nonincreasingly, i.e. $\sigma_1\geqslant \ldots\geqslant \sigma_r> 0$ , with $r\leqslant\min(m, n)$ . The right singular vectors vi span the solution space, i.e. $x\in {\rm span}(v_i)$ . We shall write V columnwise, i.e. $V=(v_1\ \ldots \ v_m)$ . Note that for inverse problems, empirically, as the index i increases, the right singular vectors vi are increasingly more oscillatory, capturing more high-frequency components [13]. The behavior is analogous to that of the inverse of a Sturm–Liouville operator. For a general class of convolution integral equations, such oscillating behavior was established in [6]. For many practical applications, the linear system (2.1) can be regarded as a discrete approximation to the underlying continuous problem, and thus inherits the corresponding spectral properties.

Given a frequency cutoff number $1\leqslant L\leqslant m$ , we define two (orthogonal) subspaces of $\mathbb{R}^m$ by

${{\mathcal L}} = {\rm span}\{v_1, \ldots, v_L\} \quad\text{and}\quad {{\mathcal H}} = {\rm span}\{v_{L+1}, \ldots, v_m\},$

which denote the low- and high-frequency solution spaces, respectively. This is motivated by the observation that in practice one only looks for smooth solutions that are spanned/well captured by the first few right singular vectors [13]. This condition is akin to the concept of sourcewise representation in regularization theory, e.g. $x= A^*w$ for some $w\in\mathbb{R}^n$ or its variants [5, 16], which is needed for deriving convergence rates for the regularized solution. Throughout, we always assume that the truncation level L is fixed. Then for any vector $z\in\mathbb{R}^m$ , there exists a unique decomposition $z=P_Lz+P_Hz$ , where PL and PH are orthogonal projection operators into ${{\mathcal L}}$ and ${{\mathcal H}}$ , respectively, which are defined by

$P_Lz = \sum_{i=1}^{L}\langle z, v_i\rangle v_i \quad\text{and}\quad P_Hz = \sum_{i=L+1}^{m}\langle z, v_i\rangle v_i.$
These projection operators will be used below to analyze the preasymptotic behavior of RKM.
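In computations the two projections are conveniently formed from the SVD factors; a small illustrative sketch (with an arbitrary test matrix of our own):

```python
import numpy as np

def freq_projections(A, L):
    """Return functions applying P_L and P_H, the orthogonal projections onto the
    span of the first L and of the remaining right singular vectors, respectively."""
    _, _, Vt = np.linalg.svd(A, full_matrices=True)   # rows of Vt are v_1^t, ..., v_m^t
    VL, VH = Vt[:L].T, Vt[L:].T
    return (lambda z: VL @ (VL.T @ z)), (lambda z: VH @ (VH.T @ z))

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
P_L, P_H = freq_projections(A, L=5)
z = rng.standard_normal(20)
# unique decomposition z = P_L z + P_H z with mutually orthogonal parts
print(np.allclose(z, P_L(z) + P_H(z)), abs(P_L(z) @ P_H(z)) < 1e-10)
```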

3. Preasymptotic convergence analysis

In this section, we present a preasymptotic convergence analysis of RKM. Let $x^*$ be one solution of the linear system (2.1). Our analysis relies on decomposing the error $e_k=x_k - x^*$ of the kth iterate xk into low- and high-frequency components (according to the right singular vectors). We aim at bounding the conditional error $\mathbb{E}[\Vert e_{k+1}\Vert ^2\,|\, e_k]$ (conditional on ek, where the expectation $\mathbb{E}[\cdot]$ is with respect to the random choice of the index i, cf. (2.4)) by analyzing separately $\mathbb{E}[\Vert P_Le_{k+1} \Vert ^2\,|\, e_k]$ and $\mathbb{E}[\Vert P_He_{k+1}\Vert ^2\,|\, e_k]$ . This is inspired by the fact that the inverse solution consists mainly of the low-frequency components, which is akin to the concept of source condition in regularization theory [5, 16]. Our error estimates help explain the excellent empirical performance of RKM in the context of inverse problems.

We shall discuss the preasymptotic convergence for exact and noisy data separately.

3.1. Exact data

First, we analyze the case of noise free data. Let $x^*$ be one solution to the linear system (2.1), and $e_{k}=x_k-x^*$ be the error at iteration k. Upon substituting the identity $b=Ax^*$ into the RKM iteration (2.3), we deduce that for some $i\in\{1, \ldots, n\}$ , there holds

$e_{k+1} = \Big(I-\frac{a_ia_i^t}{\Vert a_i\Vert ^2}\Big)e_k. \qquad\qquad (3.1)$

Note that $I-\frac{a_ia_i^t}{\Vert a_i\Vert ^2}$ is an orthogonal projection operator. We first give two useful lemmas.

Lemma 3.1. For any $e_L\in{{\mathcal L}}$ and $e_H\in {{\mathcal H}}$ , there hold

$\Vert Ae_L\Vert^2\geqslant \sigma_L^2\Vert e_L\Vert^2 \quad\text{and}\quad \Vert Ae_H\Vert^2\leqslant \sigma_{L+1}^2\Vert e_H\Vert^2.$

Proof. The assertions follow directly from simple algebra, and hence the proof is omitted. □

Lemma 3.2. For $i=1, \ldots, n$ , there holds $\Vert P_Ha_i\Vert^2\leqslant\sigma_{L+1}^2$ ; moreover,

$\sum_{i=1}^{n}\Vert P_Ha_i\Vert^2\leqslant \sum_{j=L+1}^{r}\sigma_j^2.$

Proof. By definition, $P_H a_i = \sum_{j=L+1}^m\langle a_i, v_j\rangle v_j.$ Since $a_i^t = u_i^t\Sigma V^t$ , there holds $ \langle a_i, v_j\rangle = u_i^t\Sigma V^tv_j = \langle u_i, \sigma_je_j\rangle = \sigma_j(u_i)_j. $ Hence, $\Vert P_Ha_i\Vert ^2 = \sum_{j=L+1}^m \langle a_i, v_j\rangle^2 = \sum_{j=L+1}^m \sigma_j^2 (u_i)_j^2\leqslant \sigma_{L+1}^2$ . The second estimate follows similarly. □

The next result gives a preasymptotic recursive estimate on $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_Le_{k+1}\Vert ^2| e_k]$ and $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_He_{k+1}\Vert ^2 | e_k]$ for exact data $b\in {\rm range}(A)$ . This represents our first main theoretical result.

Theorem 3.1. Let $c_1=\frac{\sigma^2_L}{\Vert A\Vert _F^2}$ and $c_2=\frac{\sum^{r}_{i=L+1}\sigma^2_i}{\Vert A\Vert _F^2}$ . Then there hold

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k]\leqslant (1-c_1)\Vert P_Le_k\Vert^2 + c_2\Vert P_He_k\Vert^2,$

$\mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k]\leqslant c_2\Vert P_Le_k\Vert^2 + (1+c_2)\Vert P_He_k\Vert^2.$

Proof. Let eL and eH be the low- and high-frequency errors of ek, respectively, i.e. $e_L=P_Le_k$ and $e_H=P_He_k$ . Then by the identities $P_Le_{k+1} =e_L - \frac{1}{\Vert a_i\Vert ^2} \langle a_i, e_k\rangle P_La_i$ and $\langle P_La_i, e_L\rangle = \langle a_i, e_L\rangle$ , together with the bound $\Vert P_La_i\Vert\leqslant\Vert a_i\Vert$ , we have

$\Vert P_Le_{k+1}\Vert^2 \leqslant \Vert e_L\Vert^2 - \frac{2}{\Vert a_i\Vert^2}\langle a_i, e_k\rangle\langle a_i, e_L\rangle + \frac{1}{\Vert a_i\Vert^2}\langle a_i, e_k\rangle^2.$

Upon noting the identity $ \sum_{i=1}^n a_ia_i^t = A^t A$ , taking expectation on both sides yields

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k] \leqslant \Vert e_L\Vert^2 - \frac{2}{\Vert A\Vert_F^2}\langle Ae_k, Ae_L\rangle + \frac{1}{\Vert A\Vert_F^2}\Vert Ae_k\Vert^2.$

Now substituting the splitting $e_k = e_L + e_H$ and rearranging the terms give

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k] \leqslant \Vert e_L\Vert^2 - \frac{1}{\Vert A\Vert_F^2}\Vert Ae_L\Vert^2 + \frac{1}{\Vert A\Vert_F^2}\Vert Ae_H\Vert^2.$

Thus the first assertion follows from lemma 3.1 and the bound $\sigma_{L+1}^2\leqslant \sum_{i=L+1}^r\sigma_i^2 = c_2\Vert A\Vert_F^2$ .

The high-frequency component $P_He_{k+1}$ satisfies $P_He_{k+1} =e_H - \frac{1}{\Vert a_i\Vert ^2}\langle a_i, e_k\rangle P_Ha_i.$ We appeal to lemma 3.2 and the inequality $\langle a_i, e_k\rangle^2\leqslant \Vert a_i\Vert ^2\Vert e_k\Vert ^2 =\Vert a_i\Vert ^2(\Vert e_L\Vert ^2+\Vert e_H\Vert ^2)$ to get

$\Vert P_He_{k+1}\Vert^2 \leqslant \Vert e_H\Vert^2 - \frac{2}{\Vert a_i\Vert^2}\langle a_i, e_k\rangle\langle a_i, e_H\rangle + \frac{\Vert P_Ha_i\Vert^2}{\Vert a_i\Vert^2}\big(\Vert e_L\Vert^2+\Vert e_H\Vert^2\big).$

Taking expectation yields

$\mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k] \leqslant \Vert e_H\Vert^2 - \frac{2}{\Vert A\Vert_F^2}\langle Ae_k, Ae_H\rangle + c_2\big(\Vert e_L\Vert^2+\Vert e_H\Vert^2\big),$

where we have used the second estimate of lemma 3.2. Since $V^te_L$ and $V^te_H$ have disjoint supports, $\langle Ae_L, Ae_H\rangle=0$ , and hence $\langle Ae_k, Ae_H\rangle = \Vert Ae_H\Vert^2\geqslant 0$ . Thus we obtain the second assertion and complete the proof. □

Remark 3.1. By theorem 3.1, the decay of the error $\mathbb{E}[\Vert P_Le_{k+1}\Vert ^2\,|\, e_k]$ is largely determined by the factor $1-c_1$ , and it is only mildly affected by $\Vert P_He_k\Vert ^2$ , which enters through the factor c2. The factor c2 is very small in the presence of a gap in the singular value spectrum at $\sigma_L$ , i.e. $\sigma_{L}\gg \sigma_{L+1}$ , showing clearly the role of the gap.
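The predicted behavior is easy to observe in a small Monte Carlo experiment. The sketch below (an illustrative construction of our own, not one of the paper's test problems) builds a matrix with known SVD, runs RKM on data from a predominantly low-frequency solution, and averages the low- and high-frequency errors over repeated runs; the low-frequency error shrinks by a far larger factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 200, 100, 5

# test matrix with orthonormal factors and decaying singular values, so that
# A = U_m diag(s) V^t is its own SVD (illustrative choice)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = np.exp(-0.5 * np.arange(1, m + 1))
A = (U[:, :m] * s) @ V.T

# solution dominated by the first L right singular vectors (smooth initial error)
x_true = V[:, :L] @ rng.standard_normal(L) + 0.05 * V[:, L:] @ rng.standard_normal(m - L)
b = A @ x_true

row_sq = np.sum(A**2, axis=1)
p = row_sq / row_sq.sum()

def rkm_frequency_errors(n_iter, n_runs=20):
    """Average ||P_L e_k||^2 and ||P_H e_k||^2 over independent RKM runs."""
    eL2, eH2 = np.zeros(n_iter), np.zeros(n_iter)
    for _ in range(n_runs):
        x = np.zeros(m)
        for k in range(n_iter):
            i = rng.choice(n, p=p)
            x += (b[i] - A[i] @ x) / row_sq[i] * A[i]
            e = x - x_true
            cL, cH = V[:, :L].T @ e, V[:, L:].T @ e
            eL2[k] += cL @ cL
            eH2[k] += cH @ cH
    return eL2 / n_runs, eH2 / n_runs

eL2, eH2 = rkm_frequency_errors(400)
print(eL2[0] / eL2[-1], eH2[0] / eH2[-1])   # the low-frequency error decays far more
```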

Remark 3.2. Theorem 3.1 also covers the rank-deficient case, i.e. $\sigma_{L+1}=0$ (so that $c_2=0$ ), and it yields

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k]\leqslant (1-c_1)\Vert P_Le_k\Vert^2 \quad\text{and}\quad \mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k]\leqslant \Vert P_He_k\Vert^2.$

If $L=m$ , it recovers theorem 2.1(i) for exact data. The rank-deficient case was analyzed in [11].

Remark 3.3. By taking expectation of both sides of the estimates in theorem 3.1, we obtain

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2]\leqslant (1-c_1)\mathbb{E}[\Vert P_Le_k\Vert^2] + c_2\mathbb{E}[\Vert P_He_k\Vert^2] \quad\text{and}\quad \mathbb{E}[\Vert P_He_{k+1}\Vert^2]\leqslant c_2\mathbb{E}[\Vert P_Le_k\Vert^2] + (1+c_2)\mathbb{E}[\Vert P_He_k\Vert^2].$
Then the error propagation is given (componentwise) by

$\begin{pmatrix}\mathbb{E}[\Vert P_Le_{k}\Vert^2]\\ \mathbb{E}[\Vert P_He_{k}\Vert^2]\end{pmatrix}\leqslant D^k\begin{pmatrix}\Vert P_Le_{0}\Vert^2\\ \Vert P_He_{0}\Vert^2\end{pmatrix}, \qquad D=\begin{pmatrix}1-c_1 & c_2\\ c_2 & 1+c_2\end{pmatrix}.$
The pairs of eigenvalues $\lambda_\pm $ and (orthonormal) eigenvectors $v_\pm$ of D are given by

and

For the case $c_2\ll c_1 <1$ , i.e. $\alpha = \frac{c_2}{c_1}\ll 1$ , we have

and

With $V=[v_+ \ v_-]$ , we have the approximate eigendecomposition if $k = O(1)$ :

Thus, for $c_1\gg c_2$ , we have the following approximate error propagation for $k = O(1)$ :

3.2. Noisy data

Next we turn to the case of noisy data $b^\delta$ , cf. (2.5); we use the superscript δ to indicate the noisy case. Since $b_i^\delta=b_i+\eta_i$ , the RKM iteration reads

$x_{k+1} = x_k + \frac{b_{i}^\delta - \langle a_{i}, x_k\rangle}{\Vert a_{i}\Vert^2}\,a_{i},$

and thus the random error $e_{k+1} = x_{k+1} - x^*$ satisfies

$e_{k+1} = \Big(I-\frac{a_ia_i^t}{\Vert a_i\Vert ^2}\Big)e_k + \frac{\eta_i}{\Vert a_i\Vert^2}\,a_i. \qquad\qquad (3.2)$

Now we give our second main result, i.e. bounds on the errors $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_Le_{k+1}\Vert ^2| e_k]$ and $ \newcommand{\E}{{\mathbb{E}}} \E[\Vert P_He_{k+1}\Vert ^2| e_k]$ for the noisy data.

Theorem 3.2. Let $c_1=\frac{\sigma_L^2}{\|A\|_F^2}$ , and $c_2=\frac{\sum_{i=L+1}^r\sigma_i^2}{\|A\|_F^2}$ . Then there hold

$\mathbb{E}[\Vert P_Le_{k+1}\Vert^2\,|\,e_k]\leqslant (1-c_1)\Vert P_Le_k\Vert^2 + c_2\Vert P_He_k\Vert^2 + \frac{\delta^2}{\Vert A\Vert_F^2} + \frac{2\delta}{\Vert A\Vert_F}c_2^{\frac{1}{2}}\Vert e_k\Vert,$

$\mathbb{E}[\Vert P_He_{k+1}\Vert^2\,|\,e_k]\leqslant c_2\Vert P_Le_k\Vert^2 + (1+c_2)\Vert P_He_k\Vert^2 + \frac{\delta^2}{\Vert A\Vert_F^2} + \frac{2\delta}{\Vert A\Vert_F}c_2^{\frac{1}{2}}\Vert e_k\Vert.$

Proof. By the recursive relation (3.2), we have the splitting

where the terms are given by

The first term ${\rm I}_1$ can be bounded directly by theorem 3.1. Clearly, ${\rm I}_2\leqslant \frac{\delta^2}{\Vert A\Vert _F^2}$ . For the third term ${\rm I}_3$ , we note the splitting

Thus, by Cauchy–Schwarz inequality, we have

Direct computation yields

Consequently, by lemma 3.2, we obtain

These estimates together show the first assertion. For the high-frequency component eH, we have

where the terms are given by

The term ${\rm I}_4$ is already bounded by theorem 3.1. Further, clearly, ${\rm I}_5 \leqslant \frac{\delta^2}{\Vert A\Vert _F^2}$ . For the term ${\rm I}_6$ , note the splitting

and thus ${\rm I}_6 = -{\rm I}_3$ . This shows the second assertion, and completes the proof of the theorem. □

Remark 3.4. Recall the following estimate for RKM [36, theorem 3.7]

The estimate in theorem 3.2 is free from $\kappa_A$ , but introduces an additional term $\frac{2\delta}{\|A\|_F}c_2^\frac{1}{2}\|e_k\|$ . Since $c_2$ is generally small, this extra term is comparable with $\Vert A\Vert _F^{-2}\delta^2$ . Theorem 3.2 extends theorem 3.1 to the noisy case: if $\delta=0$ , it recovers theorem 3.1. It indicates that if the initial error $e_0=x_0-x^*$ concentrates mostly on low frequency, the iterate will first decrease the error. The smoothness assumption on the initial error e0 is realistic for inverse problems, notably under the standard source type conditions (for deriving convergence rates) [5, 16]. Nonetheless, the deleterious noise influence will eventually kick in as the iteration proceeds.

Remark 3.5. One can discuss the evolution of the iterates for noisy data, similar to remark 3.3. By Young's inequality $ \newcommand{\e}{{\rm e}} 2ab\leqslant \epsilon a^2+\epsilon^{-1}b^2$ , the error satisfies (with $\bar c_1 = {(1-\epsilon)c_1}$ and $\bar c_2 = {(1+\epsilon)c_2})$

Then it follows that

In the case $\bar c_2\ll\bar c_1<1$ and $\alpha = \frac{\bar c_2}{\bar c_1}\ll 1$ (by choosing a reasonably small $\epsilon$ ), for $k=O(1)$ , repeating the analysis in remark 3.3 yields

Thus, during the initial iterations, the presence of data noise influences the error of the RKM iterates only mildly, through an additive term of order $k\delta^2$ .

4. RKM with variance reduction

When equipped with a proper stopping criterion, Kaczmarz method is a regularization method [21, 23]. Naturally, one would expect that this assertion holds also for RKM (2.3)–(2.4). This however remains to be proven due to the lack of a proper stopping criterion. To see the delicacy, consider one natural choice, i.e. Morozov's discrepancy principle [27]: choose the smallest integer k such that

$\Vert Ax_k - b^\delta\Vert \leqslant \tau\delta, \qquad\qquad (4.1)$

where $\tau>1$ is fixed [5, 16]. Theoretically, it is still unclear whether (4.1) can be satisfied within a finite number of iterations for every noise level $\delta>0$ . In practice, computing the residual $\Vert Ax_k-b^\delta\Vert $ at each iteration is undesirable since its cost is of the order of evaluating the full gradient, whereas avoiding the latter is the very motivation for RKM! Below we propose one simple remedy by drawing on its connection with stochastic gradient methods [31] and the vast related developments.

First we note that solving (2.1) is equivalent to minimizing the least-squares problem

$\min_{x\in\mathbb{R}^m} f(x) := \frac{1}{2n}\Vert Ax - b\Vert^2. \qquad\qquad (4.2)$

Next we recast RKM as a stochastic gradient method for problem (4.2), as noted earlier in [30]. We include a short proof for completeness.

Proposition 4.1. The RKM iteration (2.3)–(2.4) is a (weighted) stochastic gradient update with a constant stepsize $n/\Vert A\Vert _F^2$ .

Proof. With the weight $w_i=\Vert a_i\Vert ^2$ , we rewrite problem (4.2) into

$\min_{x\in\mathbb{R}^m} f(x) = \sum_{i=1}^{n}\frac{w_i}{\Vert A\Vert_F^2}\,f_i(x), \qquad\text{with } f_i(x) = \frac{\Vert A\Vert_F^2}{2nw_i}\big(\langle a_i, x\rangle - b_i\big)^2.$

Since $\sum_{i=1}^n w_i=\Vert A\Vert _F^2$ , we may interpret $p_i=w_i/\Vert A\Vert _F^2$ as a probability distribution on the set $\{1, \ldots, n\}$ , i.e. (2.4). Next we apply the stochastic gradient method. Since $g_i(x):=\nabla f_i(x) =\frac{\Vert A\Vert _F^2}{nw_i}(\langle a_i, x\rangle-b_i)a_i$ , with a fixed step length $\eta=n\Vert A\Vert _F^{-2}$ , we get

$x_{k+1} = x_k - \eta\, g_i(x_k) = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\Vert a_i\Vert^2}\,a_i,$
where $i\in \{1, \ldots, n\}$ is drawn i.i.d. according to (2.4). Clearly, it is equivalent to RKM (2.3)–(2.4). □
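The equivalence is a one-line computation and can also be checked numerically; in the following sketch (illustrative data of our own), a single weighted stochastic gradient step with stepsize $n/\Vert A\Vert_F^2$ reproduces the RKM projection exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 15
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)
i = 7                                               # any fixed row index

w = np.sum(A**2, axis=1)                            # weights w_i = ||a_i||^2
fro2 = w.sum()                                      # ||A||_F^2
g_i = fro2 / (n * w[i]) * (A[i] @ x - b[i]) * A[i]  # weighted stochastic gradient g_i(x)
x_sgd = x - (n / fro2) * g_i                        # SGD step with stepsize n / ||A||_F^2

x_rkm = x + (b[i] - A[i] @ x) / w[i] * A[i]         # RKM update (2.3)
print(np.allclose(x_sgd, x_rkm))                    # True
```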

Now we give the mean and covariance of the stochastic gradient $g_i(x)$ .

Proposition 4.2. Let $g(x)=\nabla f(x)$ . Then the gradient $g_i(x)$ satisfies

$\mathbb{E}[g_i(x)] = g(x) \quad\text{and}\quad {\rm Cov}[g_i(x)] = \frac{\Vert A\Vert_F^2}{n^2}\sum_{i=1}^{n}\frac{(\langle a_i, x\rangle-b_i)^2}{\Vert a_i\Vert^2}\,a_ia_i^t - g(x)g(x)^t.$

Proof. The full gradient $g(x):=\nabla f(x)$ at x is given by $g(x)= \frac{1}{n}A^t(Ax-b)$ . The mean $ \newcommand{\E}{{\mathbb{E}}} \E[g_i(x)]$ of the (partial) gradient $g_i(x)$ is given by

Next, by bias-variance decomposition, the covariance ${\rm Cov}[g_i(x)]$ of the gradient $g_i(x)$ is given by

This completes the proof of the proposition. □

Thus, the single gradient $g_i(x)$ is an unbiased estimate of the full gradient $g(x)$ . For consistent linear systems, the covariance ${\rm Cov}[g_i(x)]$ is asymptotically vanishing: as $x_k\to x^*$ , both terms in the variance expression tend to zero. However, for inconsistent linear systems, the covariance ${\rm Cov}[g_i(x)]$ generally does not vanish at the optimal solution $x^*$ :

${\rm Cov}[g_i(x^*)] \approx \frac{\Vert A\Vert_F^2}{n^2}\sum_{i=1}^{n}\frac{(\langle a_i, x^*\rangle-b_i^\delta)^2}{\Vert a_i\Vert^2}\,a_ia_i^t \neq 0,$

since one might expect $A^t(Ax^*-b^\delta)\approx 0$ , so that the term $g(x^*)g(x^*)^t$ is negligible. Further, ${\rm Cov}[g_i(x^*)]$ is of the order $\delta^2$ in a neighborhood of $x^*$ . One may predict the (asymptotic) dynamics of RKM via a stochastic modified equation derived from the covariance [25]. Due to the nonvanishing covariance, the RKM iteration eventually deteriorates, and its asymptotic convergence slows down.
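A small numerical check of this point (illustrative code of our own): over the exact sampling distribution (2.4), the trace of ${\rm Cov}[g_i(x^*)]$ vanishes for consistent data but stays strictly positive once noise is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 30
A = rng.standard_normal((n, m))
x_star = rng.standard_normal(m)
b = A @ x_star                                      # consistent data
b_noisy = b + 1e-2 * rng.standard_normal(n)         # inconsistent (noisy) data

row_sq = np.sum(A**2, axis=1)
fro2 = row_sq.sum()
p = row_sq / fro2

def trace_cov_at_solution(rhs):
    """Trace of Cov[g_i(x*)] computed exactly over the sampling distribution p."""
    G = (fro2 / (n * row_sq))[:, None] * (A @ x_star - rhs)[:, None] * A   # rows are g_i(x*)
    mean = p @ G                                    # E[g_i(x*)] = g(x*)
    second = np.einsum('i,ij,ij->', p, G, G)        # E[||g_i(x*)||^2]
    return second - mean @ mean

print(trace_cov_at_solution(b), trace_cov_at_solution(b_noisy))   # 0 versus strictly positive
```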

These discussions motivate the use of variance reduction techniques developed for stochastic gradient methods to reduce the variance of the gradient estimate. There are several possible strategies, e.g. stepsize reduction, stochastic variance reduced gradient (SVRG), averaging and mini-batch (see e.g. [19, 32]). We only adapt SVRG [19] to RKM, termed RKM with variance reduction (RKMVR); see algorithm 1 for details. It hybridizes the stochastic gradient with the (occasional) full gradient to achieve variance reduction. Here, s is the epoch length, which determines the frequency of full gradient evaluation and was suggested to be n in [19], and K is the maximum number of iterations. In view of Step 2, within the first epoch, it performs only the standard RKM, and at the end of the epoch, it evaluates the full gradient. In RKMVR, the residual $\Vert Ax_k-b^\delta\Vert $ is a direct by-product of the full gradient evaluation and is computed only at the end of each epoch, and thus it does not incur additional computational effort.

The update at Step 9 of algorithm 1 can be rewritten as (for $k\geqslant s$ )

$x_{k+1} = x_k - \frac{\langle a_i, x_k-\tilde x\rangle}{\Vert a_i\Vert^2}\,a_i - \frac{1}{\Vert A\Vert_F^2}A^t(A\tilde x - b^\delta),$

and, as the iteration proceeds, $\tilde x-x_k\to 0$ , so that the first correction vanishes and the update recovers the Landweber method. With this choice, the variance of the gradient estimate is asymptotically vanishing [19]. Numerically, algorithm 1 converges rather steadily. That is, it combines the strengths of RKM and the Landweber method: it enjoys the fast initial convergence of the former and the excellent stability of the latter.

Algorithm 1. Randomized Kaczmarz method with variance reduction (RKMVR).

1: Specify A, b, x0, K, and s.
2: Initialize $g_i(\tilde x) = 0$ , and $\tilde g = 0$ .
3: for $k=1, \ldots, K$ do
4:   if $k\ {\rm mod}\ s=0$ then
5:     Set $\tilde x = x_k$ and $\tilde g= g(x_k)$ .
6:     Check the discrepancy principle (4.1).
7:   end if
8:   Pick an index i according to (2.4).
9:   Update xk by
          $x_{k+1} = x_k - \frac{n}{\Vert A\Vert _F^2}(g_i(x_k)-g_i(\tilde x) + \tilde g).$
10: end for
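For reference, a compact NumPy sketch of algorithm 1 (an illustrative implementation of ours; the function name, the optional stopping argument and the test problem are not from the paper):

```python
import numpy as np

def rkmvr(A, b, x0, n_iter, epoch, rng, tau_delta=None):
    """RKM with variance reduction: hybridize RKM steps with an epoch-wise snapshot
    of the full gradient g(x) = (1/n) A^t (A x - b), in the spirit of SVRG."""
    n = A.shape[0]
    row_sq = np.sum(A**2, axis=1)
    fro2 = row_sq.sum()
    p = row_sq / fro2                               # sampling probabilities (2.4)
    eta = n / fro2                                  # constant stepsize n / ||A||_F^2

    def g_single(x, i):                             # weighted stochastic gradient g_i(x)
        return fro2 / (n * row_sq[i]) * (A[i] @ x - b[i]) * A[i]

    x, x_snap = x0.copy(), x0.copy()
    g_snap = np.zeros_like(x0)                      # zero snapshot: first epoch is plain RKM
    have_snap = False
    for k in range(1, n_iter + 1):
        if k % epoch == 0:                          # end of an epoch: full gradient snapshot
            r = A @ x - b
            x_snap, g_snap, have_snap = x.copy(), A.T @ r / n, True
            if tau_delta is not None and np.linalg.norm(r) <= tau_delta:
                break                               # discrepancy principle (4.1)
        i = rng.choice(n, p=p)
        corr = g_single(x_snap, i) if have_snap else 0.0
        x = x - eta * (g_single(x, i) - corr + g_snap)
    return x

# illustrative use on a small consistent system
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))
x_true = rng.standard_normal(40)
b = A @ x_true
x = rkmvr(A, b, np.zeros(40), n_iter=5000, epoch=100, rng=rng)
print(np.linalg.norm(x - x_true))
```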

5. Numerical experiments and discussions

Now we present numerical results for RKM and RKMVR to illustrate their distinct features. All the numerical examples, i.e. phillips, gravity and shaw, are taken from the public domain MATLAB package Regutools. They are Fredholm integral equations of the first kind, with the first example being mildly ill-posed and the last two severely ill-posed. Unless otherwise stated, the examples are discretized with a dimension $n=m=1000$ . The noisy data $b^\delta$ is generated from the exact data b as

where δ is the relative noise level, and the random variables $\xi_i$ follow the i.i.d. standard Gaussian distribution. The initial guess x0 for the iterative methods is $x_0=0$ . We present the squared error ek and/or the squared residual rk, i.e.

$e_k = \mathbb{E}[\Vert x_k - x^*\Vert^2] \quad\text{and}\quad r_k = \mathbb{E}[\Vert Ax_k - b^\delta\Vert^2]. \qquad\qquad (5.1)$

The expectation $\mathbb{E}[\cdot]$ with respect to the random choice of the rows is approximated by the average of 100 independent runs. All the computations were carried out on a personal laptop with 2.50 GHz CPU and 8.00 G RAM by MATLAB 2015b.

5.1. Benefit of randomization

First we compare the performance of RKM with the cyclic Kaczmarz method (KM) to illustrate the benefit of randomization. Overall, the randomization can substantially improve the convergence of KM; see the results in figures 1–3 for the examples with different noise levels.


Figure 1. Numerical results (ek and rk) for phillips by KM and RKM. (a) $\delta=0$ , (b) $\delta=10^{-3}$ , (c) $\delta = 10^{-2}$ and (d) $\delta=5\times10^{-2}$ .


Figure 2. Numerical results (ek and rk) for gravity by KM and RKM. (a) $\delta=0$ , (b) $\delta=10^{-3}$ , (c) $\delta = 10^{-2}$ and (d) $\delta=5\times10^{-2}$ .


Figure 3. Numerical results (ek and rk) for shaw by KM and RKM. (a) $\delta=0$ , (b) $\delta=10^{-3}$ , (c) $\delta = 10^{-2}$ and (d) $\delta=5\times10^{-2}$ .


Next we examine the convergence more closely. The (squared) error ek of the Kaczmarz iterate xk undergoes a sudden drop at the end of each cycle, whereas within the cycle, the drop after each Kaczmarz iteration is small. Intuitively, this can be attributed to the fact that neighboring rows of the matrix A are highly correlated with each other, and thus each single Kaczmarz iteration reduces the (squared) error ek only very little, since it roughly repeats the previous projection. The strong correlation between neighboring rows is the culprit of the slow convergence of the cyclic KM. The randomization ensures that any two rows chosen by two consecutive RKM iterations are less correlated, and thus the iterations are far more effective for reducing the error ek, leading to a much faster empirical convergence. These observations hold for both exact and noisy data. For noisy data, the error ek first decreases and then increases for both KM and RKM, and the larger is the noise level δ, the earlier does the divergence occur. That is, both exhibit a 'semiconvergence' phenomenon typical for iterative regularization methods. Thus a suitable stopping criterion is needed. Meanwhile, the residual rk tends to decrease, but for both methods, it oscillates wildly for noisy data and the oscillation magnitude increases with δ. This is due to the nonvanishing variance; see the discussions in section 4. One surprising observation is that a fairly reasonable inverse solution can be obtained by RKM within one cycle of iterations. That is, ignoring all other costs, RKM can solve the inverse problems reasonably well at a cost less than one full gradient evaluation!

5.2. Preasymptotic convergence

Now we examine the convergence of RKM. Theorems 3.1 and 3.2 predict that during the first iterations, the low-frequency error $e_L=\mathbb{E}[\Vert P_Le_k\Vert ^2]$ decreases rapidly, but the high-frequency error $e_H=\mathbb{E}[\Vert P_He_k\Vert ^2]$ can at best decay mildly. For all examples, the first five singular vectors can capture the majority of the energy of the initial error $x^*-x_0$ . Thus, we choose a truncation level $L=5$ , and plot the evolution of the low-frequency and high-frequency errors eL and eH, and the total error $e=\mathbb{E}[\Vert e_k\Vert ^2]$ , in figure 4.


Figure 4. The error decay for the examples with two noise levels: $\delta=10^{-2}$ (top) and $\delta = 5\times 10^{-2}$ (bottom), with a truncation level $L=5$ . (a) phillips, (b) gravity and (c) shaw.


Numerically, the low-frequency error eL decays much more rapidly during the initial iterations, and since the low-frequency modes are dominant, the total error e also enjoys a very fast initial decay. Intuitively, this behavior may be explained as follows. The rows of the matrix ${A}$ mainly contain low-frequency modes, and thus each RKM iteration tends to mostly decrease the low-frequency error eL of the initial error $x^*-x_0$ . The high-frequency error eH experiences a similar but slower decay during the iteration, and then levels off. These observations fully confirm the preasymptotic analysis in section 3. For noisy data, the error ek can be highly oscillatory, so is the residual rk. The larger is the noise level δ, the larger is the oscillation magnitude. However, the degree of ill-posedness of the problem seems not to affect the convergence of RKM, so long as $x^*$ is mainly composed of low-frequency modes.

To shed further insights, we present in figure 5 the decay behavior of the low- and high-frequency errors for the example phillips with a random solution whose entries follow the i.i.d. standard normal distribution. Then the source type condition is not satisfied by the initial error. Now with a truncation level $L=5$ , the low-frequency error eL only constitutes a small fraction of the initial error e0. The low-frequency error eL decays rapidly, exhibiting a fast preasymptotic convergence as predicted by theorem 3.2, but the high-frequency error eH stagnates during the iteration. Thus, in the absence of the smoothness condition on e0, RKM is ineffective, thereby supporting theorems 3.1 and 3.2.


Figure 5. The error decay for phillips with a random solution, with a truncation level $L=5$ . (a) $\delta=10^{-2}$ and (b) $\delta=5\times10^{-2}$ .


Naturally, one may divide the total error e into more than two frequency bands. The empirical behavior is similar to the case of two frequency bands; see figure 6 for an illustration on the example phillips, with four frequency bands. The lowest-frequency error e1 decreases fastest, the next band e2 slightly slower, etc. These observations clearly indicate that even though RKM does not employ the full gradient, the iterates are still mainly concerned with the low-frequency modes during the first iterations, like the Landweber method, in the sense that the low-frequency modes are much easier to recover than the high-frequency ones. However, the cost of each RKM iteration is only one-nth of that of one Landweber iteration, and thus it is computationally much more efficient.


Figure 6. The error decay for phillips. The total error e is divided into four frequency bands: 1–3, 4–6, 7–9, and the remaining, denoted by ei, $i=1, \ldots, 4$ . (a) $\delta=10^{-2}$ and (b) $\delta=5\times10^{-2}$ .


5.3. RKM versus RKMVR

The nonvanishing variance of the gradient $g_i(x)$ slows down the asymptotic convergence of RKM, and the iterates eventually tend to oscillate wildly in the presence of data noise; see the discussion in section 4. This is expected: the iterate converges to the least-squares solution, which is known to be highly oscillatory for ill-posed inverse problems. Variance reduction is one natural strategy to decrease the variance of the gradient estimate, thereby stabilizing the evolution of the iterates. To illustrate this, we compare the evolution of RKM with RKMVR in figure 7. We also include the results by the Landweber method (LM). To compare the iteration complexity only, we count one Landweber iteration as n RKM iterations. The epoch of RKMVR is set to n, the total number of data points, as suggested in [19]. Thus n RKMVR iterations include one full gradient evaluation, and it amounts to 2n RKM iterations. The full gradient evaluation is indicated by flat segments in the plots.


Figure 7. Numerical results for the examples by RKM, RKMVR and LM. (a) phillips, $\delta=10^{-2}$ , (b) phillips, $\delta=5\times10^{-2}$ , (c) gravity, $\delta=10^{-2}$ , (d) gravity, $\delta=5\times10^{-2}$ , (e) shaw, $\delta=10^{-2}$ and (f) shaw, $\delta=5\times10^{-2}$ .


As the noise level δ increases, the semiconvergence of RKM becomes more pronounced: the error ek first decreases and then increases, which is especially visible at $\delta = 5\times 10^{-2}$ . This is well reflected by the large oscillations of the iterates. RKMVR tends to stabilize the iteration greatly by removing the large oscillations, and thus its asymptotic behavior resembles closely that of LM. That is, RKMVR inherits the good stability of LM, while retaining the fast initial convergence of RKM. Thus, the stopping criterion, though still needed, is less critical for RKMVR, which is very beneficial from the practical point of view. In summary, the simple variance reduction scheme in algorithm 1 can combine the strengths of both worlds.

Last, we numerically examine the regularizing property of RKMVR with the discrepancy principle (4.1). In figure 8, we present the number of iterations for several noise levels for RKMVR (one realization) and LM. For both methods, the number of iterations by the discrepancy principle (4.1) appears to decrease with the noise level δ, and RKMVR consistently terminates much earlier than LM, indicating the efficiency of RKMVR. The reconstructions in figure 8(d) show that the error increases with the noise level δ, indicating a regularizing property. In contrast, in the absence of the discrepancy principle, the RKMVR iterates eventually diverge as the iteration proceeds, cf. figure 7.


Figure 8. The residual rk and the recoveries for phillips (top), gravity (middle), shaw (bottom) by RKMVR and LM with the discrepancy principle (4.1) with $\tau=1.1$ . (a) $\delta=10^{-3}$ , (b) $\delta=10^{-2}$ , (c) $\delta=5\times10^{-2}$ and (d) solutions.


6. Conclusions

We have presented an analysis of the preasymptotic convergence behavior of the randomized Kaczmarz method. Our analysis indicates that the low-frequency error decays much faster than the high-frequency one during the initial randomized Kaczmarz iterations. Thus, when the low-frequency modes are dominant in the initial error, as typically occurs for inverse problems, the method enjoys very fast initial error reduction. This result sheds insights into the excellent practical performance of the method, which is also numerically confirmed. Next, by interpreting it as a stochastic gradient method, we proposed a randomized Kaczmarz method with variance reduction by hybridizing it with the Landweber method. Our numerical experiments indicate that the strategy is very effective in that it can combine the strengths of both the randomized Kaczmarz method and the Landweber method.

Our work represents only a first step towards a complete theoretical understanding of the randomized Kaczmarz method and related stochastic gradient methods (e.g. variable step size, and mini-batch version) for efficiently solving inverse problems. There are many important theoretical and practical questions awaiting further research. Theoretically, one outstanding issue is the regularizing property (e.g. consistency, stopping criterion and convergence rates) of the randomized Kaczmarz method from the perspective of classical regularization theory.

Acknowledgments

The authors are grateful to the constructive comments of the anonymous referees, which have helped improve the quality of the paper. In particular, the remark by one of the referees has led to much improved results as well as more concise proofs. The research of Y Jiao is partially supported by National Science Foundation of China (NSFC) No. 11501579 and National Science Foundation of Hubei Province No. 2016CFB486, B Jin by EPSRC grant EP/M025160/1 and UCL Global Engagement grant (2016–2017), and X Lu by NSFC Nos. 11471253 and 91630313.
