On the Regularizing Property of Stochastic Gradient Descent

Stochastic gradient descent is one of the most successful approaches for solving large-scale problems, especially in machine learning and statistics. At each iteration, it employs an unbiased estimator of the full gradient computed from one single randomly selected data point. Hence, it scales well with the problem size and is very attractive for truly massive datasets, and it holds significant potential for solving large-scale inverse problems. In the recent machine learning literature, it was empirically observed that, when equipped with early stopping, it has a regularizing property. In this work, we rigorously establish its regularizing property (under an \textit{a priori} early stopping rule), and also prove convergence rates under the canonical sourcewise condition, for minimizing the quadratic functional for linear inverse problems. This is achieved by combining tools from classical regularization theory and stochastic analysis. Further, we analyze the preasymptotic weak and strong convergence behavior of the algorithm. The theoretical findings shed insights into the performance of the algorithm, and are complemented with illustrative numerical experiments.


Introduction
In this paper, we consider the following finite-dimensional linear inverse problem:

Ax = y^†, (1.1)

where A ∈ R^{n×m} is a matrix representing the data formation mechanism, x ∈ R^m is the unknown signal of interest, and y^† ∈ R^n is the exact data formed by y^† = Ax^†, with x^† being the true solution. Note that in practice, the matrix A is not necessarily of full rank, and equation (1.1) may have a multitude of solutions. The exact solution x^† will be identified with the unique minimum-norm solution; see (2.1) for the definition. In practice, we can only access the noisy data y^δ ∈ R^n defined by

y^δ = y^† + ξ,

where the vector ξ ∈ R^n is the noise in the data, with noise level δ = ‖ξ‖ (and δ̄ = n^{−1/2}δ). The noise ξ is assumed to be a realization of an independent identically distributed (i.i.d.) mean-zero Gaussian random vector. Throughout, we denote the ith row of the matrix A by a column vector a_i ∈ R^m, and the ith entry of the vector y^δ by y^δ_i. The model (1.1) is representative of many discrete linear inverse problems, including linearized (sub)problems of nonlinear inverse problems. Hence, the stable and efficient numerical solution of the model (1.1) has been the topic of many research works, and plays an important role in developing practical inversion techniques (see, e.g. [8, 9]).
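The data model above can be sketched numerically. The following is a minimal NumPy sketch, with illustrative sizes and a synthetic matrix A with decaying singular values (not one of the paper's test problems); the names `A`, `x_true`, `y_exact`, `y_noisy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete model: A is a (possibly rank-deficient) n x m matrix,
# x_true plays the role of x^dagger, and y_exact = A @ x_true is the exact data.
n, m = 200, 100
A = rng.standard_normal((n, m)) @ np.diag(1.0 / np.arange(1, m + 1))  # decaying spectrum
x_true = np.ones(m)
y_exact = A @ x_true

# Noisy data y^delta = y_exact + xi, with noise level delta = ||xi||
# and scaled level delta_bar = n^(-1/2) * delta, as in the text.
xi = 1e-3 * rng.standard_normal(n)
y_noisy = y_exact + xi
delta = np.linalg.norm(xi)
delta_bar = delta / np.sqrt(n)
```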
Stochastic gradient descent (SGD), dated at least back to Robbins and Monro [23], represents an extremely popular solver for large-scale least-squares type problems and statistical inference, and its accelerated variants represent state-of-the-art solvers for training (deep) neural networks [4,5,14]. Such methods hold significant potential for solving large-scale inverse problems. For example, the randomized Kaczmarz method (RKM) [25], which has been very popular and successful in computed tomography [18], can be viewed as SGD with weighted sampling (see, e.g. [19] and [12, proposition 4.1]). See [6,11] for several experimental evaluations of SGD for inverse problems. Hence, it is important to understand the theoretical properties of such stochastic reconstruction methods, which, to the best of our knowledge, have not been addressed in the context of ill-posed inverse problems.
In this work, we contribute to the theoretical analysis of SGD for inverse problems. The starting point is the following optimization problem:

min_{x∈R^m} F(x) := (1/n) Σ_{i=1}^n f_i(x), with f_i(x) = ½((a_i, x) − y^δ_i)²,

where (·, ·) denotes the Euclidean inner product on R^m. For a rank-deficient A, the functional F is not strictly convex, and does not have a unique minimizer. The basic version of SGD reads: given an initial guess x_1 ∈ R^m, update the iterate x^δ_{k+1} by

x^δ_{k+1} = x^δ_k − η_k((a_{i_k}, x^δ_k) − y^δ_{i_k}) a_{i_k}, (1.2)

where the index i_k is drawn i.i.d. uniformly from the set {1, …, n}, and η_k > 0 is the step size at the kth iteration. The update (1.2) can be derived by computing a gradient estimate ∂f_{i_k}(x) = ((a_{i_k}, x) − y^δ_{i_k}) a_{i_k} from the functional f_{i_k}(x) for a randomly sampled single datum {a_{i_k}, y^δ_{i_k}}, instead of the full gradient ∂F(x). Thus, the SGD iteration (1.2) is a randomized version of the Landweber iteration

x^δ_{k+1} = x^δ_k − η_k n^{−1} A^t(Ax^δ_k − y^δ). (1.3)

Compared with the Landweber iteration (1.3), SGD requires evaluating only one datum {a_{i_k}, y^δ_{i_k}} per iteration, and thus the per-iteration cost is drastically reduced, which is especially attractive for large-scale problems. In theory, the Landweber method is known to be regularizing [8, chapter 6]. However, the regularizing property of SGD remains to be established, even though it was conjectured and empirically examined (see, e.g. [10,24,28]). Numerically, one observes a semiconvergence phenomenon for SGD: the iterate x^δ_k first converges to the true solution x^†, and then diverges as the iteration further proceeds. Semiconvergence is characteristic of (deterministic) iterative regularization methods, and early stopping is often employed [8,15]. Below we describe the main theoretical contributions of this work, which are complemented with numerical experiments in section 6.
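The two updates above can be sketched side by side. This is a minimal NumPy sketch; the helper names `sgd_step` and `landweber_step` are hypothetical, and generic A, y are assumed.

```python
import numpy as np

def sgd_step(x, A, y, eta, i):
    """One SGD update (1.2): x <- x - eta * ((a_i, x) - y_i) * a_i,
    where a_i is the ith row of A."""
    a_i = A[i]
    return x - eta * (a_i @ x - y[i]) * a_i

def landweber_step(x, A, y, eta):
    """One Landweber update (1.3): x <- x - eta * n^{-1} * A^T (A x - y),
    i.e. the full-gradient counterpart of the SGD step."""
    n = A.shape[0]
    return x - eta / n * A.T @ (A @ x - y)
```

Averaging the SGD update over all row indices i recovers the Landweber step, reflecting that (1.2) uses an unbiased estimator of the full gradient.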
The first contribution is to analyze SGD with a polynomially decaying sequence of step sizes (see assumption 2.1) through the lens of regularization theory. In theorems 2.1 and 2.2, we prove that SGD is regularizing in the sense that the iterate x^δ_k converges to the exact solution x^† in the mean squared norm as the noise level δ tends to zero, under an a priori early stopping rule, and also that x^δ_k converges to x^† at certain rates under the canonical source condition. To the best of our knowledge, this is the first result on the regularizing property of a stochastic iteration method. The analysis relies on decomposing the error into three components: the approximation error due to early stopping, the propagation error due to the presence of data noise, and the stochastic error due to the random index i_k. The first two parts are deterministic and can be analyzed in a manner similar to the Landweber method [8, chapter 6]; see theorems 3.1 and 3.2. The last part, on the variance of the iterate, constitutes the main technical challenge in the analysis. It is overcome by relating the iterate variance to the expected squared residuals and analyzing the evolution of the latter; see theorems 3.3 and 3.4.
The second contribution is to analyze the preasymptotic convergence in both the weak and the strong sense. In practice, it is often observed that SGD can decrease the error very fast during the initial iterations. We provide one explanation of this phenomenon by means of preasymptotic convergence, which extends the recent work on RKM [12]. It is achieved by dividing the error into low- and high-frequency components according to the right singular vectors, and studying their dynamics separately. In theorems 2.3 and 2.4, we prove that the low-frequency error can decay much faster than the high-frequency one in either the weak or the strong norm. In particular, if the initial error is dominated by the low-frequency components, then SGD decreases the total error very effectively during the first iterations. The analysis sheds insights into the practical performance of SGD. Further, under a source type condition, the low-frequency error is indeed dominating; see proposition 5.1.

Now we situate this work in the existing literature in two related areas: inverse problems with random noise, and machine learning. Inverse problems with random noise have attracted much attention over the last decade. Hohage and his collaborators [1-3] studied various regularization methods, e.g. Tikhonov and iterative regularization, for solving linear and nonlinear inverse problems with random noise. For example, Bissantz et al [2] analyzed Tikhonov regularization for nonlinear inverse problems, and established consistency and convergence rates. In these works, randomness enters into the problem formulation via the data y^δ directly as a Hilbert space valued process, which is fixed (though random) when applying regularization techniques. Thus, it differs greatly from SGD, for which randomness arises due to the random row index i_k and changes at each iteration. Handling the iteration noise requires different techniques from those in these works.
There are also a few relevant works in machine learning [7,17,26,27]. Ying and Pontil [27] studied an online least-squares gradient descent algorithm in a reproducing kernel Hilbert space (RKHS), and derived bounds on the generalization error. Tarres and Yao [26] analyzed the convergence of a (regularized) online learning algorithm closely related to SGD. Lin and Rosasco [17] analyzed the influence of the batch size on the convergence of mini-batch SGD. See also the recent work [7] on SGD with averaging for nonparametric regression in RKHS. All these works analyze the method in the framework of statistical learning, where the noise arises mainly from the finite sampling of the (unknown) underlying data distribution, whereas for inverse problems, the noise arises from the imperfect data acquisition process and enters into the data y^δ directly. Further, the main focus of these works is to bound the generalization error, instead of error estimates for the iterate. Nonetheless, our proof strategy of decomposing the total error into three different components shares similarities with these works.
The rest of the paper is organized as follows. In section 2, we present and discuss the main results, i.e. the regularizing property and the preasymptotic convergence. In section 3, we derive bounds on the three parts of the total error. Then in section 4, we analyze the regularizing property of SGD with an a priori stopping rule, and prove convergence rates under the classical source condition. In section 5, we discuss the preasymptotic convergence of SGD. Some numerical results are given in section 6. In an appendix, we collect some useful inequalities. We conclude this section with some notation. We use the superscript δ in x^δ_k to indicate SGD iterates for noisy data y^δ, and denote by x_k those for the exact data y^†. The notation ‖·‖ denotes the Euclidean norm for vectors and the spectral norm for matrices, and [·] denotes the integral part of a real number. {F_k}_{k≥1} denotes a sequence of increasing σ-fields generated by the random indices up to the kth iteration. The notation c, with or without a subscript, denotes a generic constant that is always independent of the iteration index k and the noise level δ.

Main results and discussions
Now we present the main results of the work, i.e. the regularizing property of SGD and the preasymptotic convergence results. The detailed proofs are deferred to sections 4 and 5, which in turn rely on technical estimates derived in section 3. Throughout, we consider a polynomially decaying step size schedule η_k = c_0 k^{−α} (see assumption 2.1), which is commonly employed for SGD. Due to the stochasticity of the row index i_k, the iterate x^δ_k is random. We measure the approximation error x^δ_k − x^† to the true solution x^† by the mean squared error E[‖x^δ_k − x^†‖²], where the expectation E[·] is with respect to the random index i_k. The reference solution x^† is taken to be the unique minimum-norm solution (relative to the initial guess x_1):

x^† = arg min_{x∈R^m} {‖x − x_1‖ : Ax = y^†}. (2.1)

Now we can state the regularizing property of SGD (1.2) under an a priori stopping rule: the mean squared error E[‖x^δ_{k(δ)} − x^†‖²] tends to zero as the noise level δ → 0, if the stopping index k(δ) is chosen properly in relation to the noise level δ. Thus, SGD equipped with a suitable a priori stopping rule is a regularization method. Note that condition (2.2) is analogous to that for classical regularization methods.
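The two ingredients above, the polynomially decaying step size schedule and the minimum-norm solution relative to x_1, can be sketched as follows. This is a minimal sketch; the helper names `step_size` and `min_norm_solution` are hypothetical, and the pseudoinverse formula assumes a consistent system Ax = y^†.

```python
import numpy as np

def step_size(k, c0, alpha):
    """Polynomially decaying schedule eta_k = c0 * k^(-alpha) (cf. assumption 2.1)."""
    return c0 * k ** (-alpha)

def min_norm_solution(A, y_exact, x1):
    """Minimum-norm solution relative to the initial guess x1 (cf. (2.1)):
    among all solutions of A x = y_exact, the one closest to x1 in the
    Euclidean norm, i.e. x1 + A^+ (y_exact - A x1)."""
    return x1 + np.linalg.pinv(A) @ (y_exact - A @ x1)
```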
To derive convergence rates, we employ the source condition from classical regularization theory [8,9]. Recall that the canonical source condition reads: there exists some w ∈ R^m such that

x^† − x_1 = B^p w, (2.3)

where the symmetric and positive semidefinite matrix B ∈ R^{m×m} is defined in (3.4) below, and B^p denotes the usual fractional power (via spectral decomposition). Condition (2.3) represents a type of smoothness of the initial error x^† − x_1, and the exponent p determines the degree of smoothness: the larger the exponent p is, the smoother the initial error x^† − x_1 becomes. It controls the approximation error due to early stopping (see theorem 3.1 below for the precise statement). The source type condition is one of the most classical approaches to derive convergence rates in regularization theory [8,9]. Next we state convergence rates under an a priori stopping index.

Theorem 2.2. Let assumption 2.1 and the source condition (2.3) be fulfilled. Then there holds

E[‖x^δ_{k+1} − x^†‖²] ≤ c k^{−min(2α, min(1,2p)(1−α))} ln² k + c′ k^{1−α} δ̄²,

where the constants c and c′ depend on α, p, ‖w‖, ‖Ax_1 − y^δ‖ and ‖A‖.
Remark 2.1. Theorem 2.2 indicates a semiconvergence behavior for the iterate x^δ_k: the first term is decreasing in k and depends on the regularity index p and the step size parameter α ∈ (0, 1), and the second term k^{1−α}δ̄² is increasing in k and depends on the noise level. The first term k^{−min(2α, min(1,2p)(1−α))} ln² k contains both the approximation error (indicated by p) and the stochastic error. By properly balancing the two terms in the estimate, one can obtain a convergence rate. The best possible convergence rate depends on both the decay rate α and the regularity index p in (2.3), and it is suboptimal for any p > 1/2 when compared with the Landweber method. That is, vanilla SGD suffers from saturation, due to the stochasticity induced by the random row index i_k. Saturation is also observed in the context of statistical learning theory [27].
In practice, it is often observed that SGD decreases the error rapidly during the initial iterations. This phenomenon cannot be explained by the regularizing property. Instead, we analyze the preasymptotic convergence by means of SVD, in order to explain the fast initial convergence.
Let n^{−1/2}A = UΣV^t be the singular value decomposition (SVD), where U ∈ R^{n×n} and V ∈ R^{m×m} are orthonormal, and Σ = diag(σ_1, …, σ_r, 0, …, 0) ∈ R^{n×m} is diagonal, with the diagonal entries ordered nonincreasingly and r the rank of A. For any fixed truncation level 1 ≤ L ≤ r, we define the low- and high-frequency solution spaces L and H respectively by L = span{v_1, …, v_L} and H = span{v_{L+1}, …, v_m} (with v_i being the columns of V). Let P_L and P_H be the orthogonal projections onto L and H, respectively. The analysis relies on decomposing the error e^δ_k = x^δ_k − x^† into the low- and high-frequency components P_L e^δ_k and P_H e^δ_k, respectively, in order to capture their essentially different dynamics.
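The projectors P_L and P_H can be built directly from the SVD. A minimal sketch; the helper name `frequency_projectors` is hypothetical.

```python
import numpy as np

def frequency_projectors(A, L):
    """Orthogonal projectors onto the low- and high-frequency solution spaces,
    built from the right singular vectors of n^(-1/2) A, with truncation level L."""
    n = A.shape[0]
    _, _, Vt = np.linalg.svd(A / np.sqrt(n))
    V_low = Vt[:L].T              # first L right singular vectors span the space L
    P_L = V_low @ V_low.T
    P_H = np.eye(A.shape[1]) - P_L
    return P_L, P_H
```

By construction, P_L and P_H are complementary orthogonal projections, so any error vector e splits as e = P_L e + P_H e.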
We have the following preasymptotic weak and strong convergence results, which characterize the one-step evolution of the low-and high-frequency errors. The proofs are given in section 5.

Remark 2.2.
It is noteworthy that in theorems 2.3 and 2.4, the step size η_k is not required to be polynomially decaying. Theorems 2.3 and 2.4 indicate that the low-frequency error can decrease much faster than the high-frequency error, in either the weak or the mean squared norm sense. Thus, if the initial error e_1 consists mostly of low-frequency modes, SGD can decrease the low-frequency error, and thus also the total error, rapidly, resulting in fast initial convergence.

Preliminary estimates
In this part, we provide several technical estimates for the SGD iteration (1.2). By the bias-variance decomposition and the triangle inequality, we have

E[‖x^δ_k − x^†‖²] = ‖E[x^δ_k] − x^†‖² + E[‖x^δ_k − E[x^δ_k]‖²], with ‖E[x^δ_k] − x^†‖ ≤ ‖E[x_k] − x^†‖ + ‖E[x^δ_k − x_k]‖, (3.1)

where x_k is the random iterate for the exact data y^†. Thus, the total error is decomposed into three components: the approximation error due to early stopping, the propagation error due to noise, and the stochastic error due to the random index i_k. The objective below is to derive bounds on the three terms in (3.1), which are crucial for proving theorems 2.1 and 2.2 in section 4. The approximation and propagation errors are bounded in theorems 3.1 and 3.2, respectively. The stochastic error is analyzed in section 3.2: first in terms of the expected squared residuals in theorem 3.3, and then a bound on the latter in theorem 3.4. The analysis of the stochastic error represents the main technical challenge.
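The decomposition (3.1) rests on the bias-variance identity E[‖x − x^†‖²] = ‖E[x] − x^†‖² + E[‖x − E[x]‖²], which also holds exactly for empirical averages. A quick numerical check under a hypothetical random-iterate model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical random iterates: 10^5 samples of a random vector in R^3,
# biased away from x_dagger by 0.5 in each coordinate.
x_dagger = np.array([1.0, 2.0, 3.0])
samples = x_dagger + 0.5 + 0.1 * rng.standard_normal((100_000, 3))

total = np.mean(np.sum((samples - x_dagger) ** 2, axis=1))        # E||x - x^dagger||^2
bias = np.sum((samples.mean(axis=0) - x_dagger) ** 2)             # ||E[x] - x^dagger||^2
variance = np.mean(np.sum((samples - samples.mean(axis=0)) ** 2, axis=1))  # E||x - E[x]||^2

# The identity holds exactly (up to floating point) for the empirical measure.
assert abs(total - (bias + variance)) < 1e-8
```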

Approximation and propagation errors
For the analysis, we first introduce auxiliary iterations. Let e^δ_k = x^δ_k − x^† and e_k = x_k − x^† be the errors for the SGD iterates x^δ_k and x_k, for y^δ and y^†, respectively. They satisfy the following recursions:

e_{k+1} = e_k − η_k(a_{i_k}, e_k)a_{i_k}, (3.2)
e^δ_{k+1} = e^δ_k − η_k((a_{i_k}, e^δ_k) − ξ_{i_k})a_{i_k}. (3.3)

Then we introduce two auxiliary matrices: B = n^{−1}A^tA ∈ R^{m×m} (3.4) and Ā = n^{−1}A, and, for any vector b ∈ R^n, we write b̄ = n^{−1}b. The cases s = 0 and s = 1/2 will be used for bounding the approximation error and the residual, respectively.

Proof. It follows from (3.2) and the identity y
Taking the full expectation yields Repeatedly applying the recursion (3.6) and noting that e 1 is deterministic give From the source condition (2.3), we deduce By lemmas A.1 and A.2, we arrive at This completes the proof of the theorem. □ with c r,α given by

Proof. By the recursions (3.2) and (3.3), the propagation error
Applying the recursion repeatedly yields Thus, by the triangle inequality, we have Under assumption 2.
By balancing the two terms, one can derive a convergence rate in terms of δ̄ (instead of δ), and this balance is achieved quickest with α = 0. Such an estimate is known as a weak error in the literature on stochastic differential equations. By the bias-variance decomposition, it is weaker than the mean squared error.

Stochastic error
The next result gives a bound on the variance E[‖x^δ_k − E[x^δ_k]‖²] of the iterate, which arises from the random index i_k in SGD (1.2). Theorem 3.3 relates the variance to the past mean squared residuals E[‖Ax^δ_j − y^δ‖²]. The extra exponent 1/2 follows from the quadratic structure of the least-squares functional.

Theorem 3.3. For the SGD iteration (1.2), there holds
with z 1 = 0. Upon rewriting, z k satisfies where the iteration noise M k is defined by (3.8) Then taking full expectation yields (3.8). Applying the recursion (3.7) repeatedly gives Then it follows from (3.8) that Since a i = A t e i (with e i being the ith Cartesian vector), we have (with ȳ δ = n −1 y δ )

This and the identity
By the measurability of x^δ_j with respect to F_{j−1}, we obtain the stated bound, where the inequality is due to the identity E[((a_{i_j}, x^δ_j) − y^δ_{i_j})e_{i_j} | F_{j−1}] = Āx^δ_j − ȳ^δ and the bias-variance decomposition. Thus, by taking the full expectation, we obtain the bound. Combining the preceding bounds yields the desired assertion.
where the constants c_α and c′_α depend on α, p, ‖w‖, ‖Ax_1 − y^δ‖ and ‖A‖.
Next, we bound the variance I_3 by theorem 3.3 with s = 1/2 and lemma A.1, with c_1 = e^{−1}‖A‖² and c_2 = c_0‖A‖². Combining these estimates yields the assertion (with c_3 = 4c_p). Due to the presence of the factor ln k in theorem 3.4, the upper bound is not uniform in k for noisy data, but the growth is very mild. For exact data y^†, there holds an analogous bound without this factor, where the constant c depends on α, p, ‖Ax_1 − y^†‖ and ‖A‖. The proof also indicates that the condition c_0 max_i ‖a_i‖² ≤ 1 in assumption 2.1 may be replaced with c_0‖B‖ ≤ 1.

Regularizing property
In this section, we analyze the regularizing property of SGD with early stopping, and prove convergence rates under an a priori stopping rule. First, we show the convergence of the SGD iterate x_k for exact data to the minimum-norm solution x^† defined in (2.1), for any α ∈ (0, 1).

By theorem 3.4 (and remark 3.3), the sequence {E[
The desired assertion then follows from the bias-variance decomposition. It is well known that the minimum-norm solution is characterized by x^† − x_1 ∈ range(A^t). By the construction of the SGD iterate x_k, x_k − x_1 always belongs to range(A^t), and thus the limit is the unique minimum-norm solution x^†. □

Next we analyze the convergence of the SGD iterate x^δ_k for noisy data y^δ as δ → 0.
where the constants c and c′ depend on α, p, ‖w‖, ‖Ax_1 − y^δ‖ and ‖A‖.
Proof. Let r k = E[ Ax δ k − y δ 2 ] be the expected squared residual at the kth iteration. Then theorem 3.3 with s = 0 and lemma A.

The last two inequalities and lemma A.3 imply the desired bound. □
Now we can prove the regularizing property of SGD in theorem 2.1.
Proof of theorem 2.1. We appeal to the bias-variance decomposition (3.1): By the proof of theorem 4.1 and condition (2.2), we have

Thus, it suffices to analyze the errors
. By theorem 3.2 and the choice of k(δ) in condition (2.2), there holds

Last, by lemma 4.1 and condition (2.2), we can bound the variance
Combining the last three estimates completes the proof. □

Remark 4.1. The consistency condition (2.2) in theorem 2.1 requires α ∈ (0, 1). The constant step size, i.e. α = 0, is not covered by the theory, for which the bootstrapping argument does not work.
Last, we give the proof of theorem 2.2 on the convergence rate of SGD under a priori stopping rule.

Proof of theorem 2.2. By bias-variance decomposition, we have
It follows from lemma 4.1 that Meanwhile, by the triangle inequality and theorems 3.1 and 3.2, These two estimates together give the desired rate. □

Remark 4.2.
The a priori parameter choice in theorem 2.2 requires knowledge of the regularity index p, and thus is infeasible in practice. The popular discrepancy principle also does not work directly, due to the expensive residual evaluation; further, it induces complex dependence between the iterates, which requires different techniques for the analysis. Thus, it is of much interest to develop purely data-driven rules that avoid residual evaluation while automatically adapting to the unknown solution regularity, e.g. the quasi-optimality criterion and the balancing principle [13,21].

Preasymptotic convergence
In this part, we present the proofs of theorems 2.3 and 2.4 on the preasymptotic weak and strong convergence, respectively. First, we briefly discuss the low-frequency dominance of the initial error e_1 under the source condition (2.3): if the singular values σ_i of n^{−1/2}A decay fast, e_1 is indeed dominated by P_L e_1, i.e. ‖P_L e_1‖ ≫ ‖P_H e_1‖. We illustrate this with a simple probabilistic model: the sourcewise representer w ∈ R^m follows the standard Gaussian distribution N(0, I_m).

Proof. Under condition (2.3), we have e_1 = B^p w = VΣ^{2p}V^t w. Since w ∼ N(0, I_m) and the matrix V is orthonormal, (V^t w)_i ∼ N(0, 1) and E[(V^t w)_i²] = 1, from which the assertion on E[‖P_L e_1‖²] follows; the other estimate follows similarly. □
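The dominance can be checked numerically. A minimal sketch, assuming hypothetical polynomially decaying singular values σ_i = i^{−β} (as in remark 5.1 below) and using the identities E[‖P_L e_1‖²] = Σ_{i≤L} σ_i^{4p} and E[‖P_H e_1‖²] = Σ_{i>L} σ_i^{4p} from the proof; the values of m, β, p, L are illustrative.

```python
import numpy as np

# Hypothetical spectrum with polynomially decaying singular values sigma_i = i^(-beta).
m, beta, p, L = 1000, 2.0, 0.5, 5
sigma = np.arange(1, m + 1, dtype=float) ** (-beta)

# Under e_1 = B^p w with w ~ N(0, I_m) and E[(V^t w)_i^2] = 1:
#   E||P_L e_1||^2 = sum_{i<=L} sigma_i^{4p},
#   E||P_H e_1||^2 = sum_{i>L}  sigma_i^{4p}.
low = np.sum(sigma[:L] ** (4 * p))
high = np.sum(sigma[L:] ** (4 * p))
fraction_low = low / (low + high)  # close to 1 when 4*p*beta is large
```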

Remark 5.1. For polynomially decaying singular values
Hence, for a truncation level L ≪ m and 4pβ ≥ 1, E[‖P_L e_1‖²] is dominating. The condition 4pβ ≥ 1 holds for either severely ill-posed problems (large β) or highly regular solutions (large p). Now we give the proof of the preasymptotic weak convergence in theorem 2.3.
Proof of theorem 2.3. By applying P_L to the SGD iteration (3.3) and taking the conditional expectation with respect to F_{k−1}, we obtain, using the splitting e^δ_k = P_L e^δ_k + P_H e^δ_k, a recursion for E[P_L e^δ_{k+1} | F_{k−1}]. By the construction of P_L and P_H, P_L B P_H e^δ_k = 0, and then taking the full expectation yields the first assertion.

Next, appealing again to the SGD iteration (3.3) gives
Thus the conditional expectation E[P H e δ k+1 |F k−1 ] is given by Then, taking full expectation and appealing to the triangle inequality yield the second estimate. □

Remark 5.2.
For exact data y^†, the expressions simplify, and the low-frequency error always decreases faster than the high-frequency one in the weak sense. Further, there is no interaction between the low- and high-frequency errors in the weak error.
Next we analyze the preasymptotic strong convergence of SGD, first for exact data y^†. The argument is needed for the proof of theorem 2.4.
Proof. It follows from the SGD iteration (3.2) that P_L e_{k+1} = P_L e_k − η_k(a_{i_k}, e_k)P_L a_{i_k}. This and the condition c_0 max_i ‖a_i‖² ≤ 1 imply the claimed bound, and the conditional expectation with respect to F_{k−1} follows. With the splitting e_k = P_L e_k + P_H e_k and the construction of P_L and P_H, we obtain (P_L e_k, P_L B e_k) = (P_L e_k, P_L B P_L e_k) and (e_k, B e_k) = (P_L e_k, P_L B P_L e_k) + (P_H e_k, P_H B P_H e_k).

Substituting the last two identities leads to
This shows the first estimate. Next, appealing again to the SGD iteration (3.2), and using the condition c_0 max_i ‖a_i‖² ≤ 1 together with the Cauchy-Schwarz inequality, we bound the conditional expectation E[‖P_H e_{k+1}‖² | F_{k−1}]. This proves the second estimate and completes the proof of the lemma. □

Remark 5.3. The proof gives a slightly sharper estimate on the low-frequency error. Now we can present the proof of theorem 2.4 on the preasymptotic strong convergence.

Proof of theorem 2.4. It follows from the SGD iteration (3.3) that
and upon expansion, we obtain It suffices to bound the three terms I i . The term I 1 can be bounded by the argument in lemma 5.1 as For the term I 2 , by assumption 2.1, there holds . For the third term I 3 , by the identity (a i , e δ k ) = (P L a i , P L e δ k ) + (P H a i , P H e δ k ), we have It suffices to bound I 3,i . By the condition on η k , we deduce and consequently, Combining these two estimates with the Cauchy-Schwarz inequality leads to The bounds on I 1 , I 2 and I 3 together show the first assertion. For the high-frequency part P H e δ k , we have and upon expansion, we obtain The term I 4 can be bounded by the argument in lemma 5.1 as Clearly, I 5 c −1 0 η 2 kδ 2 . For I 6 , simple computation yields with I 6,i given by where the last line is due to the identity P H B . This estimate together with the Cauchy-Schwarz inequality gives These estimates together show the second assertion, and complete the proof. □

Numerical experiments
Now we present numerical experiments to complement the theoretical study. All the numerical examples, i.e. phillips, gravity and shaw, are taken from the public-domain MATLAB package Regutools. They are Fredholm integral equations of the first kind, with the first example being mildly ill-posed, and the other two severely ill-posed. Unless otherwise stated, the examples are discretized with a dimension n = m = 1000. The noisy data y^δ is generated from the exact data y^† by adding Gaussian noise scaled by the relative noise level δ, where the random variables ξ_i follow the standard Gaussian distribution. The initial guess x_1 is fixed at x_1 = 0. We present the mean squared error e_k and/or the residual r_k.
The expectation E[·] with respect to the random index i_k is approximated by the average of 100 independent runs. The constant c_0 in the step size schedule is always taken to be c_0 = 1/max_i ‖a_i‖², and the exponent α is taken to be α = 0.1, unless otherwise stated. All the computations were carried out on a personal laptop with a 2.50 GHz CPU and 8.00 GB RAM using MATLAB 2015b.
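The experimental setup (uniform row sampling, step sizes η_k = c_0 k^{−α} with c_0 = 1/max_i ‖a_i‖², and averaging over independent runs) can be sketched as follows. This is a minimal sketch in NumPy rather than MATLAB, with hypothetical helper names `run_sgd` and `mean_squared_error`; the Regutools test problems are not reproduced, so a generic A, y is assumed.

```python
import numpy as np

def run_sgd(A, y, x1, n_iter, alpha):
    """SGD (1.2) with uniform row sampling and step sizes eta_k = c0 * k^(-alpha),
    where c0 = 1 / max_i ||a_i||^2, as in the experiments."""
    rng = np.random.default_rng()
    n = A.shape[0]
    c0 = 1.0 / np.max(np.sum(A ** 2, axis=1))
    x = x1.copy()
    for k in range(1, n_iter + 1):
        i = rng.integers(n)
        x -= c0 * k ** (-alpha) * (A[i] @ x - y[i]) * A[i]
    return x

def mean_squared_error(A, y, x1, x_dagger, n_iter, alpha, n_runs=100):
    """Approximate E[||x_k - x^dagger||^2] by averaging independent runs."""
    errs = [np.sum((run_sgd(A, y, x1, n_iter, alpha) - x_dagger) ** 2)
            for _ in range(n_runs)]
    return np.mean(errs)
```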

The role of the exponent α
The convergence of SGD depends essentially on the parameter α. To examine its role, we present in figures 1-3 the numerical results for the examples with different noise levels, computed using different α values, for 10 000 iterations. The smaller the α value is, the quicker the algorithm converges, and the sooner the iterate diverges for noisy data. This agrees with the analysis in section 4. Hence, a smaller α value is desirable for convergence. However, in the presence of large noise, a too small α value may sacrifice the attainable accuracy (see figures 1(c) and 2(c) for illustrations), and the oscillation magnitudes of the iterates and the residual also tend to be larger. This is possibly due to the intrinsic variance for large step sizes, and it would be interesting to precisely characterize the dynamics, e.g. with stochastic differential equations [16]. In practice, the fluctuations may cause problems with a proper stopping rule (especially with only one single trajectory). Further, it is noteworthy that within 10 000 iterations, the iterate may or may not reach convergence, depending on the noise level δ and the regularity index p of the exact solution x^†; see section 6.4 for further discussions.

Comparison with Landweber method and randomized Kaczmarz method
Since SGD is a randomized version of the Landweber method, in figure 4, we compare their performance. To compare the iteration complexity only, we count one Landweber iteration as n SGD iterations, and the full gradient evaluation is indicated by flat segments in the plots.
For all examples, the error e_k and the residual r_k first experience fast reduction, and then the error starts to increase, which is especially pronounced at δ = 5 × 10^{−2}, exhibiting the typical semiconvergence behavior. During the initial stage, SGD is much more effective than the Landweber method: indeed, one single loop over all the data can already significantly reduce the error e_k and produce an acceptable approximation. This interesting observation will be further examined in section 6.3 below. However, the nonvanishing variance of the stochastic gradient slows down the asymptotic convergence of SGD, and the error e_k and the residual r_k eventually tend to oscillate for noisy data, before finally diverging.
One popular variant of the randomized Kaczmarz method (RKM) [25] reads

x^δ_{k+1} = x^δ_k + (y^δ_{i_k} − (a_{i_k}, x^δ_k)) a_{i_k}/‖a_{i_k}‖²,

where the ith row is chosen with probability ‖a_i‖²/‖A‖²_F, and the step size is such that each step is an orthogonal projection onto the hyperplane defined by (a_{i_k}, x) = y^δ_{i_k}. It is known that RKM is SGD with specialized algorithmic parameters, i.e. a weighted sampling and a special step size schedule [12,19]. In figure 5 we present comparative results between SGD (with polynomially decaying step sizes and uniform sampling) and RKM. It is observed that, when compared with SGD, RKM converges faster at the initial stage, but also diverges faster and suffers from larger oscillations in both the residual r_k and the error e_k. This shows clearly the crucial role of the algorithmic parameters in SGD for achieving a good balance between stability and accuracy. It is of much interest to quantify their influence on the convergence in both the preasymptotic and asymptotic regimes, and their interplay with other problem parameters, and then to adapt these parameters for optimized performance.
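The RKM variant above can be sketched as follows. A minimal sketch with hypothetical helper names `rkm_step` and `rkm`, assuming a generic consistent system; the row-sampling probabilities ‖a_i‖²/‖A‖²_F and the projection step size follow the description above.

```python
import numpy as np

def rkm_step(x, A, y, i):
    """One randomized Kaczmarz update: orthogonal projection of x onto the
    hyperplane (a_i, z) = y_i."""
    a_i = A[i]
    return x + (y[i] - a_i @ x) / (a_i @ a_i) * a_i

def rkm(A, y, x1, n_iter, rng):
    """RKM with the ith row sampled with probability ||a_i||^2 / ||A||_F^2."""
    probs = np.sum(A ** 2, axis=1) / np.sum(A ** 2)
    x = x1.copy()
    for _ in range(n_iter):
        i = rng.choice(A.shape[0], p=probs)
        x = rkm_step(x, A, y, i)
    return x
```

After each `rkm_step`, the iterate satisfies the sampled equation exactly, which explains both the fast initial convergence and the larger sensitivity to noise compared with the damped SGD steps.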

Preasymptotic convergence
Now we examine the preasymptotic strong convergence of SGD (note that the weak error satisfies a Landweber type iteration). Theorem 2.4 (and lemma 5.1) predicts that, during the first iterations, the low-frequency error e_L := E[‖P_L e_k‖²] decreases rapidly, but the high-frequency error e_H := E[‖P_H e_k‖²] can at best decay mildly. For all examples, the first five singular vectors can well capture the total energy of the initial error e_1 = x^† − x_1, which suggests a truncation level L = 5 for the numerical illustration. We plot the low- and high-frequency errors e_L and e_H and the total error e = E[‖e_k‖²] in figure 6. The low-frequency error e_L decays much more rapidly during the initial iterations, and since, under the source condition (2.3), e_L is indeed dominant, the total error e also enjoys a fast initial decay. Intuitively, this behavior may be explained as follows. The rows of the matrix A mainly contain low-frequency modes, and thus each SGD iteration tends to mostly remove the low-frequency component e_L of the error. The high-frequency component e_H experiences a similar but much slower decay. Eventually, both components level off and oscillate, due to the deleterious effect of the noise. These observations confirm the preasymptotic analysis in section 5. For noisy data, the error e_k can be highly oscillating, and so is the residual r_k. The larger the noise level δ is, the larger the oscillation magnitude becomes.
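The frequency-resolved errors above can be tracked along one trajectory. A minimal sketch; the helper name `track_frequency_errors` is hypothetical, and P_L is assumed to be a projector onto leading right singular vectors as in section 2. The example in the test uses a toy diagonal A, for which the low-frequency error visibly decays while the high-frequency error barely moves.

```python
import numpy as np

def track_frequency_errors(A, y, x1, x_dagger, P_L, n_iter, alpha, rng):
    """Run SGD (1.2) and record ||P_L e_k||^2 and ||P_H e_k||^2 along a
    single trajectory, with eta_k = c0 * k^(-alpha), c0 = 1/max_i ||a_i||^2."""
    c0 = 1.0 / np.max(np.sum(A ** 2, axis=1))
    x, n = x1.copy(), A.shape[0]
    e_low, e_high = [], []
    for k in range(1, n_iter + 1):
        e = x - x_dagger
        e_low.append(np.sum((P_L @ e) ** 2))       # low-frequency error
        e_high.append(np.sum((e - P_L @ e) ** 2))  # high-frequency error
        i = rng.integers(n)
        x -= c0 * k ** (-alpha) * (A[i] @ x - y[i]) * A[i]
    return np.array(e_low), np.array(e_high)
```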

Asymptotic convergence
To examine the asymptotic convergence (with respect to the noise level δ), we present in table 1 the smallest error e along the trajectory and the number of iterations needed to reach it, for several different noise levels. It is observed that for all three examples, the minimal error e increases steadily with the noise level δ, whereas the required number of iterations decreases dramatically, which agrees qualitatively with remark 2.1; see also figures 1-3 for graphical illustrations. Thus, SGD is especially efficient in the regime of high noise levels, for which one or two epochs can already give very good approximations, due to the fast preasymptotic convergence. This agrees with the common belief that SGD is most effective for finding an approximate solution that is not highly accurate. At low noise levels, the example shaw takes far more iterations to reach the smallest error. This might be attributed to the fact that the exponent p in the source condition (2.3) for shaw is much smaller than that for phillips or gravity, so that the low-frequency modes are less dominating, as roughly indicated by the red curves in figure 6. Interestingly, for all examples, the error e undergoes a sudden change when the noise level δ increases from 1 × 10^{−2} to 3 × 10^{−2}. This might be related to the exponent α in the step size schedule, which probably should be adapted to the noise level δ in order to achieve an optimal balance between the computational efficiency and the statistical errors.

Concluding remarks
In this work, we have analyzed the regularizing property of SGD for solving linear inverse problems, by extending deterministic inversion theory. The study indicates that, with proper early stopping and a suitable step size schedule, it is regularizing in the sense that the iterates converge to the exact solution in the mean squared norm as the noise level tends to zero. Further, under the canonical source condition, we prove error estimates, which depend on the noise level and the schedule of step sizes. We also analyzed the preasymptotic convergence behavior of SGD, and proved that the low-frequency error can decay much faster than the high-frequency error. This helps explain the fast initial convergence of SGD typically observed in practice. The findings are complemented by extensive numerical experiments.

There are many interesting questions related to stochastic iteration algorithms that deserve further research. One outstanding issue is the stopping criterion, for which rigorous yet computationally efficient criteria have to be developed. In practice, the performance of SGD can be sensitive to the exponent α in the step size schedule [20]. Promising strategies for overcoming this drawback include averaging [22] and variance reduction [14]. It is of much interest to analyze such schemes in the context of inverse problems, including nonlinear inverse problems and penalized variants.

Appendix

For r = 1, it can be derived directly. Simple computation gives the bound, where we slightly abuse $k^{-\max(0,0)}$ for $\ln k$, and the constants $c_{\alpha,\beta,r}$ and $c'_{\alpha,\beta,r}$ are given by

$$c_{\alpha,\beta,r} = c_0^{2-r}\begin{cases} 2^{r}(2\alpha+\beta-1)^{-1}, & 2\alpha+\beta > 1,\\ 2, & 2\alpha+\beta = 1,\\ 2^{r-1+2\alpha+\beta}(1-2\alpha-\beta)^{-1}, & 2\alpha+\beta < 1,\end{cases} \qquad c'_{\alpha,\beta,r} = 2^{2\alpha+\beta}c_0^{2-r}\begin{cases} (r-1)^{-1}, & r > 1,\\ 1, & r = 1,\\ 2^{r-1}(1-r)^{-1}, & r < 1.\end{cases}$$

Proof. It follows from the inequality (A.3) that
Collecting terms shows the first estimate. The second estimate follows similarly. □

Last, we give a technical lemma on recursive sequences.
If b_j is nondecreasing, then for some c(α, c_i) depending on α and the c_i, there holds a_{k+1} ≤ c(α, c_i) k^{−min(α,1−α)} ln k + 2b_k.
Proof. Let c_α = c(α, 0, 1) + c′(α, 0, 1) from lemma A.3. Take k_* ∈ N such that c_1 c_α k^{−min(1−α,α)} ln k + c_2 k^{−2α} < 1/2 for any k ≥ k_*. The existence of a finite k_* is due to the monotonicity of f(t) = t^{−min(1−α,α)} ln t for large t > 0 and lim_{t→∞} f(t) = 0. Now we claim that there exists a_* > 0 such that a_k ≤ a_* + 2b_k for any k ∈ N. Let a_* = max_{1≤k≤k_*} a_k. The claim is trivial for k ≤ k_*. Suppose it holds for some k ≥ k_*. Then, by lemma A.3 and the monotonicity of b_j, the claim follows for k + 1, and thus for all k by mathematical induction. Next, by lemma A.3, for any k > k_*, we have a_{k+1} ≤ (a_* + 2b_k)(c_1 c_α k^{−min(α,1−α)} ln k + c_2 k^{−2α}) + b_k ≤ c(α, c_i) k^{−min(α,1−α)} ln k + 2b_k.
This completes the proof of the lemma. □

Remark A.1. By the argument in lemma A.4 and a standard bootstrapping argument, we deduce the following assertions. If sup_j b_j < ∞, then {a_j}_{j≥1} is bounded by a constant depending on α, sup_j b_j and the c_i. Further, if b_j ≤ c_3 j^{−γ}, j ∈ N, with γ > 0, then for some c(α, γ, c_i) depending on α, γ and the c_i, there holds a_{k+1} ≤ c(α, γ, c_i) k^{−min(α,1−α,γ)} ln k.