An entropic Landweber method for linear ill-posed problems

The aim of this paper is to investigate the use of a Landweber-type method involving the Shannon entropy for the regularization of linear ill-posed problems. We derive a closed form solution for the iterates and analyze their convergence behaviour both in the case of reconstructing general nonnegative unknowns and in that of recovering probability distributions. Moreover, we discuss several variants of the algorithm and relations to other methods in the literature. The effectiveness of the approach is studied numerically in several examples.


Introduction
This work deals with linear ill-posed equations $Au = y$ with $A : X \to Y$ acting between a Banach space $X$ and a Hilbert space $Y$, for which solutions with specific properties (such as positivity) are sought. In this respect, we consider iterative regularization methods of the following type:
$$u^{k+1} \in \arg\min_{u}\ \langle Au, Au^k - y\rangle_Y + c\,d(u, u^k), \qquad (1)$$
where $d = D_f$ denotes the Bregman distance [10] associated with a convex functional $f : X \to \mathbb{R} \cup \{+\infty\}$ which is nonnegative, and $c$ is some positive number. The term $d(u, u^k)$ acts as a penalty enforcing the desired features for the solutions. Note that we can rewrite (1) as
$$u^{k+1} \in \arg\min_{u}\ \tfrac12\|Au - y\|^2 - \tfrac12\|Au - Au^k\|^2 + c\,d(u, u^k),$$
which shows that the scheme can also be obtained by linearizing the quadratic data-fitting term $\tfrac12\|Au - y\|^2$ at the current iterate $u^k$. This class of methods incorporates several procedures that have been proposed so far in the literature. For instance, the classical case when $f$ is quadratic in Hilbert spaces reduces to the Landweber method, as emphasized by [17] and as investigated for nonlinear operator equations by means of surrogate functionals in [27]; see also the discussion in [30]. The case of quadratic $f$ in reflexive Banach spaces has been studied by [31]. The setting where $f$ is the total variation functional smoothed by a quadratic term has been analyzed by [4], requiring fine analysis tools due to the bounded variation function space context. The case of $\ell^1$-penalties has been treated in [12,32], resulting in the so-called linearized Bregman algorithm. In all those cases, however, some quadratic term had to be part of $f$ to guarantee even well-definedness of the iterates and subsequently convergence. We are interested here in the Shannon entropy setting without any quadratic term, i.e.
$$f(u) = \int_\Omega u(t)\ln u(t)\,\mathrm{d}t.$$
We mention that one can alternatively consider the linear shift
$$\tilde f(u) = \int_\Omega \big(u(t)\ln u(t) - u(t) + 1\big)\,\mathrm{d}t,$$
which is a nonnegative functional inducing the same Bregman distance. This setting raises challenges for analyzing the problem in $L^1$ without quadratic terms in the functional, but provides a simple closed-form iterative method which preserves the sign of the starting point (function) along the iterations. The latter formulation involves entropic projections, as one can see in the following section.
Moreover, we shall also be interested in the solution of inverse problems whose unknowns are probability densities, i.e. we minimize on the domain of $f$ subject to the constraint $\int_\Omega u(t)\,\mathrm{d}t = 1$, which again results in a simple closed-form iteration. The advantages of using Bregman projections for solving variational problems with unknown probability densities have been exploited by several authors before, e.g. in optimal transport (see [8,26]).
In order to write both problems in a closed form, we will use the equivalent formulation
$$u^{k+1} \in \arg\min_{u}\ \langle Au, Au^k - y\rangle_Y + c\,d(u, u^k) + \chi_m(u), \qquad (4)$$
where $\chi_0 \equiv 0$ corresponds to the original problem without integral constraint, and
$$\chi_1(u) = \begin{cases} 0 & \text{if } \int_\Omega u(t)\,\mathrm{d}t = 1,\\ +\infty & \text{else,}\end{cases}$$
is employed for enforcing probability densities. The minimization is taken here over the domain of the entropy functional.
One finds the above entropy-based algorithm in the finite dimensional optimization literature, as well as in the machine learning one, under quite different names. It is precisely the mirror descent algorithm with the negative entropy as the mirror map—see [7,25], as well as [16] and chapter 9 in [6] for a detailed analysis. It is also called the exponentiated gradient descent method for linear predictions—see [24]. The work [22] investigated three versions of the so-called approximate (linearized) proximal point methods for optimization, in combination with line search strategies. The reader is referred to sections 6.6–6.9 in [14] for other iterative optimization methods employing the Shannon entropy.
The main contribution of our work is the convergence of the iterates (4) to a solution of the equation $Au = y$ in an infinite dimensional setting for such a nonquadratic penalty version, including error estimates in the sense of a distance between the solution and the iterates, as opposed to the classical situation encountered in optimization, where the error is measured in terms of objective function values. This contribution will be emphasized in the next section (section 2.3), where more details are provided on the existing results on the mirror descent and other related methods.
The manuscript is organized as follows. Section 2 provides the necessary background on entropy functionals, as well as on the well-definedness of the proposed iterative procedure. Section 3 analyzes (weak) convergence of the method when both a priori and a posteriori stopping rules are considered, while section 4 deals with error estimates for the a priori rule only. Section 5 explores a version of the entropic Landweber method for nonquadratic data fidelity terms. The theoretical results are tested in section 6 on several integral equation examples, in comparison with the expectation maximization (EM) algorithm and the projected Landweber method—see [15] for an overview of regularization methods for nonnegative solutions of ill-posed equations.

Preliminaries
In the following we collect some basic results and assumptions needed for the analysis below. We start with properties of the entropy and then proceed to the operator A.

Entropy
Let $\Omega$ be an open and bounded subset of $\mathbb{R}^d$. The negative of the Boltzmann–Shannon entropy is the function $f : L^1(\Omega) \to (-\infty, +\infty]$ given by
$$f(u) = \begin{cases} \int_\Omega u(t)\ln u(t)\,\mathrm{d}t & \text{if } u \in L^1_+(\Omega) \text{ and } u\ln u \in L^1(\Omega),\\ +\infty & \text{else.}\end{cases}$$
Here and in what follows $L^p_+(\Omega)$, $p \in [1, \infty]$, stands for the set $\{u \in L^p(\Omega) : u(t) \geq 0 \text{ a.e.}\}$, while $\|\cdot\|_p$ denotes, as usual, the norm of the space $L^p(\Omega)$.
The Kullback–Leibler functional, that is, the Bregman distance with respect to the Boltzmann–Shannon entropy, can be defined as $d : \operatorname{dom} f \times \operatorname{dom} f \to [0, +\infty]$,
$$d(v, u) = f(v) - f(u) - f'(u, v - u), \qquad (7)$$
where $f'(u, \cdot)$ is the directional derivative at $u$.
It coincides with the Kullback–Leibler divergence
$$d(v, u) = \int_\Omega \Big(v\ln\frac{v}{u} - v + u\Big)\,\mathrm{d}t$$
if $d(v, u)$ is finite, as one can see below. Some properties of the entropy functionals $f$ and $d$ are recalled below (see, e.g. [28,29]); in particular, lemma 2.1 (iv) states that the directional derivative of the function $f$ is given by
$$f'(u, v) = \int_\Omega \big(1 + \ln u(t)\big)\,v(t)\,\mathrm{d}t$$
whenever the integral is well defined. Based on lemma 2.1 (iii) we define in the following
$$\operatorname{dom}\partial f = \{u \in L^1(\Omega) : u \text{ bounded and bounded away from zero a.e.}\}. \qquad (9)$$
Lemma 2.2. The statements below hold true:
(ii) The function $d(\cdot, u^*)$ is lower semicontinuous with respect to the weak topology of $L^1(\Omega)$, whenever $u^* \in \operatorname{dom} f$;
(iii) For any $C > 0$ and any nonnegative $u \in L^1(\Omega)$, the sublevel sets $\{v \in L^1_+(\Omega) : d(v, u) \leq C\}$ are weakly compact in $L^1(\Omega)$;
(iv) The set $\partial d(\cdot, u^*)(u)$ is nonempty for $u^* \in \operatorname{dom} f$ if and only if $u$ belongs to $L^\infty_+(\Omega)$ and is bounded away from zero. Moreover, $\partial d(\cdot, u^*)(u) = \{\ln u - \ln u^*\}$.
A key observation for obtaining well-definedness of the iterative scheme as well as an explicit form for the iterates is the following result on the entropic projection.
Proposition 2.3. Let $v \in \operatorname{dom}\partial f$ and $w \in L^\infty(\Omega)$. Then the problem
$$\min_{u \in \operatorname{dom} f}\ \langle w, u\rangle + d(u, v) + \chi_m(u)$$
has a unique solution in the cases $m = 0$ and $m = 1$, respectively, given by
$$u^0 = v\,\mathrm{e}^{-w}, \qquad u^1 = \frac{v\,\mathrm{e}^{-w}}{\int_\Omega v\,\mathrm{e}^{-w}\,\mathrm{d}t},$$
which satisfies $u^m \in \operatorname{dom}\partial f$.
Proof. We simply rewrite the functional, up to additive terms independent of $u$, using
$$d(u, v\,\mathrm{e}^{-w}) = d(u, v) + \langle w, u\rangle + \int_\Omega \big(v\,\mathrm{e}^{-w} - v\big)\,\mathrm{d}t,$$
and noting that on $\operatorname{dom}\chi_1$ replacing $v\,\mathrm{e}^{-w}$ by its normalization only changes the functional by a constant. Hence, the problem is equivalent to minimizing $d(u, u^m) + \chi_m(u)$. Since both terms are nonnegative and vanish for $u = u^m$, we see that $u^m$ is indeed a minimizer in $\operatorname{dom} f$. Strict convexity of $d$ implies uniqueness, and since $u^m$ is the product of $v$ with a function strictly bounded away from zero, it also satisfies $u^m \in \operatorname{dom}\partial f$. □
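For a computational view of this result, the following minimal sketch evaluates the two closed-form projections from proposition 2.3 on a uniform grid; the function name and the plain Riemann-sum quadrature are our own illustrative choices, not part of the text above.

```python
import numpy as np

def entropic_projection(v, w, dt, m=0):
    """Minimizer of <w, u> + d(u, v) + chi_m(u) in closed form (proposition 2.3).

    v  -- positive grid samples of the reference density (element of dom(df))
    w  -- grid samples of the bounded function w
    dt -- grid spacing, approximating integrals over Omega by Riemann sums
    m  -- 0: unconstrained; 1: renormalize so that the result has unit mass
    """
    u = v * np.exp(-w)            # multiplicative form keeps u positive
    if m == 1:
        u = u / (np.sum(u) * dt)  # entropic projection onto probability densities
    return u
```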

Forward operators and entropy
In this paper we always assume that $A : L^1(\Omega) \to Y$ is a linear and bounded operator with $Y$ being a Hilbert space. In addition to the norm boundedness of $A$, we assume a continuity property in terms of the Bregman distance. More precisely, we assume that
$$\|Au - Av\|_Y \leq \gamma\,\sqrt{d(u, v)} \qquad (12)$$
holds on $\operatorname{dom}(f + \chi_m)$ in the respective cases $m = 0$ or $m = 1$, for some positive number $\gamma$. It is easy to see that the latter is already implied by the boundedness of $A$ in case $m = 1$:

Lemma 2.4. Let $A$ be as above with $\|A\|$ denoting its operator norm, let $u, v \in \operatorname{dom}(f + \chi_1)$, and $v \in \operatorname{dom}\partial f$. Then (12) is satisfied with $\gamma = \sqrt{2}\,\|A\|$.

Proof. The boundedness of $A$ yields $\|Au - Av\|_Y \leq \|A\|\,\|u - v\|_1$. Lemma 2.1 (v) further implies, for $u$ and $v$ of unit mass,
$$\|u - v\|_1^2 \leq 2\,d(u, v),$$
which yields the assertion. □

We define the nonlinear functional
$$D(u, v) = c\,d(u, v) - \tfrac12\|Au - Av\|_Y^2,$$
which will be useful for further analysis. Note that $D(u, v) \geq 0$ for $u \in \operatorname{dom}(f + \chi_1)$ and $v \in \operatorname{dom}\partial f$, whenever $c \geq \frac{\gamma^2}{2}$ (see lemma 2.4). In case $m = 0$, we restrict the analysis to the class of operators $A$ for which $D(u, v) \geq 0$ for $u \in \operatorname{dom} f$ and $v \in \operatorname{dom}\partial f$.

The entropic Landweber method versus the mirror descent and other related methods
The mirror descent algorithm (MDA) was introduced in [25] for minimizing nonsmooth convex functions $f$ over closed convex sets $X \subset \mathbb{R}^n$:
$$x^{k+1} \in \arg\min_{x \in X}\ t_k\,\langle g^k, x\rangle + B_\psi(x, x^k),$$
where $g^k$ is a subgradient of $f$ at $x^k$, the parameters $t_k$ are positive and $B_\psi$ is the Bregman distance associated with a well-behaving function $\psi$. It is known that the algorithm converges in the sense of the objective function values; this holds also in infinite dimensional spaces—see, e.g. chapter 9 in [6]. As shown in [16], even the iterates converge to a solution when the auxiliary function $\psi$ is strongly convex, that is $B_\psi(z, x) \geq \frac{\mu}{2}\|z - x\|^2$, for some $\mu > 0$ and for any $z, x$ belonging to a set which is large enough. This property is crucial in the analysis performed in spaces where the sublevel sets of the norms are (weakly) compact, implying boundedness of the iterates from the boundedness of the sequence with terms $B_\psi(x^*, x^k)$. It is however not helpful in infinite dimensions for the entropy setting, since strong convexity of the Kullback–Leibler distance yields boundedness of the iterates with respect to the $L^1$ norm only, which is not powerful enough to guarantee convergence on subsequences. The entropic Landweber method is shown to converge by means of the $L^1$ weak compactness of the sublevel sets of the Shannon entropy—see, e.g. the use of lemma 2.2 (iii) in the proof of proposition 3.3 (iii). Another difference to the setting studied in our paper is that we use a constant parameter $t_k = 1/c$, which would not satisfy the assumption typically employed for the mirror descent algorithm: $\sum_{k\in\mathbb{N}} t_k^2 < \infty$. The first order methods analyzed by [5] in a finite dimensional setting are also closely related to the MDA and thus to the finite dimensional counterpart of the entropic Landweber algorithm. A Lipschitz-like condition as in section 2.2 becomes an essential tool in that study. However, the so-called assumption H, requiring convergence $x^k \to x^*$ from convergence of a Bregman distance sequence as above, does not hold in the entropy infinite dimensional framework, as mentioned in the previous paragraph.
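To make the comparison concrete, here is a minimal finite dimensional sketch of the MDA with the negative entropy as mirror map; the function names, the simplex constraint set and the supplied step sizes $t_k$ are assumptions for illustration.

```python
import numpy as np

def mirror_descent_entropy(grad, x0, steps):
    """Entropic mirror descent on the probability simplex:
    x^{k+1} is proportional to x^k * exp(-t_k g^k), with g^k = grad(x^k)."""
    x = x0.copy()
    for t_k in steps:
        x = x * np.exp(-t_k * grad(x))  # mirror step with entropy mirror map
        x /= x.sum()                    # Bregman projection back onto the simplex
    return x

# Typical usage with square-summable steps t_k = 1/(k+1), in contrast to the
# constant parameter t_k = 1/c employed by the entropic Landweber method:
# x = mirror_descent_entropy(grad, x0, [1.0 / (k + 1) for k in range(100)])
```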
Last but not least, we provide additionally error estimates for the iterates under a solution regularity condition.

Convergence of the entropic Landweber method
In the following we consider the iterative method defined by (4), where d is the Kullback-Leibler divergence given by (7).
Noticing that $A^*$ maps to $L^\infty(\Omega)$, we can equivalently rewrite the minimization in (4) in the form of proposition 2.3, which implies the following result.

Proposition 3.1. Let $u^0 \in \operatorname{dom}\partial f$. Then there exists a unique minimizer in (4) for any $k \geq 0$, given by
$$u^{k+1} = \frac{1}{c_k^m}\,u^k\exp\Big(-\frac1c\,A^*(Au^k - y)\Big), \qquad (14)$$
with $c_k^0 = 1$ and $c_k^1 = \int_\Omega u^k\exp\big(-\frac1c A^*(Au^k - y)\big)\,\mathrm{d}t$, which further satisfies $u^{k+1} \in \operatorname{dom}\partial f$.

Note that, from pointwise manipulation of (14), we rigorously obtain the first-order optimality condition for the variational problem in each step, i.e.
$$c\,\big(\ln u^{k+1} - \ln u^k\big) + A^*(Au^k - y) + c\ln c_k^m = 0, \qquad (15)$$
where $\ln c_k^0 = 0$ and $\ln c_k^1$ is to be interpreted as a Lagrange multiplier for the integral constraint.
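A discrete sketch of the update (14) may be helpful at this point; the matrix discretization of $A$ and the folding of quadrature weights into it are simplifying assumptions of ours.

```python
import numpy as np

def entropic_landweber(A, y, u0, c, n_iter, dt, m=0):
    """Iterates u^{k+1} = u^k exp(-A^*(A u^k - y)/c) / c_k^m, cf. (14).

    A is a matrix discretizing the forward operator (quadrature weights
    folded into its columns), so A.T plays the role of A^*.
    """
    u = u0.copy()
    for _ in range(n_iter):
        grad = A.T @ (A @ u - y)       # discrete counterpart of A^*(Au^k - y)
        u = u * np.exp(-grad / c)      # multiplicative, hence sign-preserving
        if m == 1:
            u /= np.sum(u) * dt        # c_k^1: renormalization to unit mass
    return u
```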

Remark 3.2.
In the latter case, this constant term is orthogonal to all functions of the form $z - \tilde z$ with $z, \tilde z \in \operatorname{dom}\chi_1$, since such differences have zero mean. Since most estimates below for the iterates will be based on taking duality products of (15) with such functions, they can be carried out in the same way for $m = 0$ and $m = 1$.
The analysis of the above method resembles the one for proximal point methods, which is apparent from rewriting (4) as
$$u^{k+1} \in \arg\min_u\ \tfrac12\|Au - y\|^2 + D(u, u^k) + \chi_m(u).$$
However, the quantity $D$ is neither a metric distance nor necessarily a Bregman distance of a convex function, but rather a weighted difference of Bregman distances. This and the involved Kullback–Leibler divergence in an infinite dimensional setting thus require a careful investigation. We choose $u^0 \in \operatorname{dom}\partial f$ such that $\xi^0 := 1 + \ln u^0 = A^*w^0$ for some $w^0 \in Y$, and set $\xi^k := 1 + \ln u^k$ for $k \geq 1$. Then (15) can be expressed in the case $m = 0$ as
$$c\,(\xi^k - \xi^{k+1}) = A^*(Au^k - y). \qquad (17)$$
For instance, one can consider $u^0 = \mathrm{e}^{-1}$ (constant function), $\xi^0 = 0$ and $w^0 = 0$. We show next that the entropic Landweber method converges in the exact data case.

Proposition 3.3. Let $A$ satisfy (12) with $c \geq \frac{\gamma^2}{2}$, let the equation $Au = y$ have a solution $z \in \operatorname{dom} f$ with $\chi_m(z) = 0$ if $m = 1$, and let $u^0$ be as above. Then: (i) the sequence $\{D(z, u^k)\}_{k\in\mathbb{N}}$ is nonincreasing and the residuals $\|Au^k - y\|$ are square-summable; (ii) the residuals decrease monotonically; (iii) there exists a subsequence of $\{u^k\}_{k\in\mathbb{N}}$ converging weakly in $L^1(\Omega)$ to a solution of $Au = y$.

Proof. We will use the proximal point method techniques in order to prove the statements, taking care of the fact that $D$ is a nonnegative functional satisfying $D(u, u) = 0$ for any $u$ in its domain.
(i) We have for all $k \in \mathbb{N}$, by the three-point identity for the Bregman distance,
$$c\,\big(d(z, u^{k+1}) + d(u^{k+1}, u^k) - d(z, u^k)\big) = c\,\langle \xi^k - \xi^{k+1}, z - u^{k+1}\rangle.$$
By using (15), one has for all $k \in \mathbb{N}$:
$$c\,\langle \xi^k - \xi^{k+1}, z - u^{k+1}\rangle = \langle Au^k - y, Az - Au^{k+1}\rangle = -\langle Au^k - y, Au^{k+1} - y\rangle.$$
This implies the typical inequality for a proximal-like method:
$$D(z, u^{k+1}) + D(u^{k+1}, u^k) + \|Au^{k+1} - y\|^2 \leq D(z, u^k), \qquad (18)$$
which, after summation over $k$, also yields
$$\sum_{k=0}^{\infty}\|Au^{k+1} - y\|^2 \leq D(z, u^0), \qquad (19)$$
and thus the conclusion.
(iii) We first establish boundedness of $\{f(u^k)\}_{k\in\mathbb{N}}$. To this end, due to nonnegativity of $D(z, u^k)$, $k \in \mathbb{N}$, and to (18), one has
$$c\,d(z, u^k) \leq D(z, u^0) + \tfrac12\|A(z - u^k)\|^2.$$
The right hand side is bounded by (20) and (21), thus ensuring boundedness of $\{f(u^k)\}_{k\in\mathbb{N}}$. Consequently, there exists a subsequence $\{u^l\}_{l\in\mathbb{N}}$ in $\operatorname{dom}\partial f$ which is $L^1$-weakly convergent to some $u \in \operatorname{dom} f$, see lemma 2.2. Then one has $Au^l \to Au$ weakly in $Y$, and moreover $Au^l \to y$ in the $Y$-norm, since the residuals $\|Au^k - y\|$ vanish asymptotically due to inequality (19) and to the monotonicity of $\{D(z, u^k)\}_{k\in\mathbb{N}}$. Hence, $u$ satisfies $Au = y$.
The proof of the statements above for the case $m = 1$ is similar, the main difference being the optimality condition (15) with $c_k^m$ satisfying $c\,(\xi^k - \xi^{k+1}) = A^*(Au^k - y) + c\ln c_k^m$. In more detail, the term $\langle c\ln c_k^m, z - u^{k+1}\rangle$ vanishes for $z, u^{k+1} \in \operatorname{dom}\chi_1$ when evaluating $a_k$ and does not influence further calculations, while the other terms containing $c_k^m$ behave similarly in the remaining argumentation (see also remark 3.2). □

Let us consider now the iterative method based on the noisy data $y^\delta$, that is
$$u^{k+1} = \frac{1}{c_k^m}\,u^k\exp\Big(-\frac1c\,A^*(Au^k - y^\delta)\Big). \qquad (22)$$
We propose first a discrepancy principle for stopping the algorithm in this case: the iteration is stopped at the first index $k_*(\delta)$ for which
$$\|Au^k - y^\delta\|^2 > \tau\,\delta^2 \qquad (27)$$
is violated, for a fixed $\tau \geq 1$. Before detailing how it works, denote again $\xi^k := 1 + \ln u^k$ for $k \geq 1$. Then the optimality condition for (22) yields
$$c\,(\xi^k - \xi^{k+1}) = A^*(Au^k - y^\delta) + c\ln c_k^m. \qquad (24)$$

Proposition 3.4. Assume that $A : L^1(\Omega) \to Y$ is a bounded linear operator which satisfies (12) and such that the operator equation $Au = y$ has a positive solution $z$ verifying $\chi_m(z) = 0$ if $m = 1$. Let $y^\delta \in Y$ be noisy data satisfying $\|y - y^\delta\| \leq \delta$, for some noise level $\delta$. Let $u^0 \in \operatorname{dom}\partial f$ be an arbitrary starting element with the properties $1 + \ln u^0 \in R(A^*)$ and $\chi_m(u^0) = 0$ if $m = 1$. Then
(i) the residual $\|Au^k - y^\delta\|$ decreases monotonically and the inequalities (25) and (26) below hold;
(ii) the term $D(z, u^k)$ decreases as long as $\|y^\delta - Au^k\|^2 > \delta^2$;
(iii) the stopping index $k_*(\delta)$ is finite;
(iv) $\{f(u^{k_*(\delta)})\}_{\delta>0}$ is bounded for $\delta$ small enough, and every $L^1$-weak cluster point of $\{u^{k_*(\delta)}\}_{\delta>0}$ as $\delta \to 0$ is a solution of $Au = y$.

Proof.
We consider only the case $m = 0$, since for $m = 1$ one can use similar arguments, as explained in remark 3.2. The first part of (i) follows by the definition of the iterative procedure. For proving the remaining inequalities in (i) and (ii), we proceed as in the previous proof and obtain
$$D(z, u^{k+1}) + D(u^{k+1}, u^k) + \|Au^{k+1} - y^\delta\|\,\big(\|Au^{k+1} - y^\delta\| - \delta\big) \leq D(z, u^k). \qquad (25)$$
Inequality (26) can be obtained by writing (25) for $k = 0, \ldots, n-1$ and computing the telescoping sum:
$$D(z, u^n) + \sum_{k=0}^{n-1} D(u^{k+1}, u^k) + \sum_{k=0}^{n-1}\|Au^{k+1} - y^\delta\|\,\big(\|Au^{k+1} - y^\delta\| - \delta\big) \leq D(z, u^0). \qquad (26)$$
Moreover, (ii) follows from (25) by neglecting the nonnegative term $D(u^{k+1}, u^k)$.
(iii) follows from (26) and the definition of $k_*(\delta)$: for all $k < k_*(\delta)$ one has $\|Au^{k+1} - y^\delta\|^2 > \tau\delta^2$, so that (26) implies
$$k_*(\delta)\,\sqrt{\tau}\,\big(\sqrt{\tau} - 1\big)\,\delta^2 \leq D(z, u^0),$$
which implies finiteness of $k_*(\delta)$ whenever $\tau > 1$.
(iv) can be shown similarly to proposition 3.3 (iii). Due to nonnegativity of $D(z, u^k)$ for any $k \in \mathbb{N}$ and to (24), one has
$$c\,d(z, u^k) \leq D(z, u^0) + \tfrac12\|A(z - u^k)\|^2.$$
The right hand side written for $k = k_*(\delta)$ is bounded by (27) and (28), the monotonicity of the residual and by (29), thus ensuring boundedness of $\{f(u^{k_*(\delta)})\}_{\delta>0}$ for $\delta$ small enough. The conclusion follows then as in the proof of proposition 3.3 (iii). □
A convergence result can be established also in the case of an a priori stopping rule with $k_*(\delta) \sim \frac1\delta$, by following the lines of proposition 3.4 (iv).

Proposition 3.5. Assume that $A : L^1(\Omega) \to Y$ is a bounded linear operator which satisfies (12) and such that the operator equation $Au = y$ has a positive solution $z$ verifying $\chi_m(z) = 0$ if $m = 1$. Let $y^\delta \in Y$ be noisy data satisfying $\|y - y^\delta\| \leq \delta$, for some noise level $\delta$. Let $u^0 \in \operatorname{dom}\partial f$ be an arbitrary starting element with the properties $1 + \ln u^0 \in R(A^*)$ and $\chi_m(u^0) = 0$ if $m = 1$. Let the stopping index $k_*(\delta)$ be chosen of order $1/\delta$. Then $\{f(u^{k_*(\delta)})\}_\delta$ is bounded and hence, as $\delta \to 0$, there exists a weakly convergent subsequence $\{u^{k_*(\delta_n)}\}_n$ in $L^1(\Omega)$ whose limit is a solution of $Au = y$, belonging to $\operatorname{dom}\chi_m$ when $m = 1$. Moreover, if the solution of the equation is unique, then $\{u^{k_*(\delta)}\}_{\delta>0}$ converges weakly to the solution as $\delta \to 0$.
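The discrepancy principle of proposition 3.4 and the a priori rule of proposition 3.5 can both be wrapped around the same loop; the following sketch implements the former, where the function name and the choice $\tau > 1$ are our own assumptions.

```python
import numpy as np

def entropic_landweber_dp(A, y_delta, u0, c, delta, tau=1.5,
                          max_iter=100000, dt=1.0, m=0):
    """Run the noisy-data iteration (22) and stop at the first index k with
    ||A u^k - y_delta||^2 <= tau * delta^2, i.e. when (27) is violated."""
    u = u0.copy()
    for k in range(max_iter):
        residual = A @ u - y_delta
        if np.dot(residual, residual) <= tau * delta**2:
            return u, k                          # k plays the role of k_*(delta)
        u = u * np.exp(-(A.T @ residual) / c)
        if m == 1:
            u /= np.sum(u) * dt
    return u, max_iter
```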

Error estimates
In this section we derive error estimates under a specific source condition (on a solution) for the entropy type penalty. We proceed first with the case of exact data on the right-hand side of the operator equation and then with the noisy data case, by employing an a priori rule for stopping the algorithm.

Exact data case
We assume the following source condition on a solution $z$ of $Au = y$: there exists $v \in Y$ such that
$$c\,\xi = A^*v, \qquad \xi := 1 + \ln z. \qquad (30)$$

Proposition 4.1. Let the assumptions of proposition 3.3 hold and let $z$ satisfy the source condition (30). Then one has
$$d(z, u^k) = O(1/k) \quad \text{as } k \to \infty.$$

Proof. We consider only the case $m = 0$ (similar arguments apply for the other case, see remark 3.2). First, we symmetrize $D$ by considering $D_s(x, y) = D(x, y) + D(y, x)$. One can use similar techniques as in [11] for deriving the announced error estimates, by carefully dealing with the setting of the $D$ distance penalty. Based on (17), one obtains a recursive inequality for the iterates which, written also for $k-1, k-2, \ldots, 1$ and summed up, combined with the monotonicity of $\{D(z, u^k)\}$, bounds $k\,D(z, u^k)$ in terms of $D(z, u^0)$ and the summed residuals $\sum_j \|Au^j - y\|^2$; thus, due to (21), the estimate above holds. □

A corresponding convergence rate in the $L^1$-norm (of order $1/\sqrt{k}$) holds in case $m = 1$ by lemma 2.1 (v).

Proposition 4.2. In the setting of proposition 3.5, let $z$ additionally satisfy the source condition (30) and let the stopping index $k_*(\delta)$ be chosen of order $1/\delta$. Then
$$d(z, u^{k_*(\delta)}) = O(\delta). \qquad (32)$$

Proof. Note that $c\,\xi - A^*Az = A^*q$ for $q = v - Az$. With this notation, one can show an estimate for $D(z, u^k)$ in terms of $\delta$, $\|q\|$ and $1/k$ as in theorem 4.3 in [11]; written for $k = k_*(\delta) \sim 1/\delta$ and combined with (26), this yields (32). □

Establishing convergence rates by means of a discrepancy rule remains an open issue.

General data fidelities
Before we conclude with numerical examples, we want to emphasize that problem (1) can easily be generalised to
$$u^{k+1} \in \arg\min_u\ \langle \nabla F_{y^\delta}(Au^k), Au\rangle_Y + c\,d(u, u^k) + \chi_m(u). \qquad (33)$$
Here $F_{y^\delta} : Y \to [0, +\infty)$ is a more general data fidelity term that is assumed to be convex and Fréchet-differentiable. Note that (33) is an instance of the Bregman proximal method [13,20]. The update for (33) can be written, in analogy to (14), as
$$u^{k+1} = \frac{1}{c_k^m}\,u^k\exp\big(-\lambda\,A^*\nabla F_{y^\delta}(Au^k)\big) \qquad (34)$$
for $\lambda = 1/c$ and $m \in \{0, 1\}$. We want to emphasise that a more general data fidelity term that satisfies the assumptions mentioned above together with
$$F_{y^\delta}(Au) \leq F_{y^\delta}(Av) + \langle \nabla F_{y^\delta}(Av), A(u - v)\rangle + c\,d(u, v) \qquad (35)$$
for all $u, v \in \operatorname{dom} f$, is no restriction in terms of Fejér monotonicity. In analogy to [9, lemma 6.11] we can conclude
$$d(z, u^{k+1}) \leq d(z, u^k)$$
for all $k < k_*(\delta)$, with $k_*(\delta)$ chosen according to a modified version of (27) that reads as
$$F_{y^\delta}(Au^k) > \tau\,F_{y^\delta}(y).$$
However, we can also derive a monotonicity result for $d$ directly. First of all we observe that (35) implies
$$F_{y^\delta}(Au^{k+1}) \leq F_{y^\delta}(Au^k) + \langle \nabla F_{y^\delta}(Au^k), A(u^{k+1} - u^k)\rangle + c\,d(u^{k+1}, u^k).$$

Inserting (34) into the inequality above then yields
$$F_{y^\delta}(Au^{k+1}) \leq F_{y^\delta}(Au^k) - c\,d(u^k, u^{k+1}). \qquad (36)$$

As mentioned earlier in section 3, we either have $\ln(c_k^m) = 0$ for $m = 0$, or orthogonality of $\ln(c_k^m)$ to all functions of the form $z - \tilde z$ with $z, \tilde z \in \operatorname{dom}\chi_1$. With the three-point identity we then observe
$$d(z, u^{k+1}) - d(z, u^k) = -d(u^{k+1}, u^k) + \frac1c\,\langle \nabla F_{y^\delta}(Au^k), Az - Au^{k+1}\rangle.$$
Together with (36) and the convexity of $F_{y^\delta}$ we can then conclude
$$d(z, u^{k+1}) \leq d(z, u^k) + \frac1c\,\big(F_{y^\delta}(y) - F_{y^\delta}(Au^{k+1})\big) \leq d(z, u^k)$$
for $k < k_*(\delta)$.
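A sketch of one step of (34) for a general fidelity may clarify the structure; the callable interface for the fidelity gradient and the example Kullback–Leibler fidelity are illustrative assumptions, not prescribed above.

```python
import numpy as np

def bregman_proximal_step(A, u, grad_F, lam, dt, m=0):
    """One update of (34): u^{k+1} = u^k exp(-lam * A^* grad_F(A u^k)) / c_k^m,
    where grad_F(v) returns the gradient of the fidelity F_{y_delta} at v."""
    u_new = u * np.exp(-lam * (A.T @ grad_F(A @ u)))
    if m == 1:
        u_new /= np.sum(u_new) * dt
    return u_new

# One possible fidelity: F(v) = KL(y_delta, v), whose gradient at v is
# 1 - y_delta / v (defined for positive v); eps guards against division by zero.
def kl_fidelity_gradient(y_delta, eps=1e-12):
    return lambda v: 1.0 - y_delta / np.maximum(v, eps)
```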

Example problems
We finally discuss several types of problems that satisfy the conditions used in the analysis and present numerical illustrations for some of these situations.

Integral equations
Let $\Omega \subset \mathbb{R}^d$ and $\Omega' \subset \mathbb{R}^{d'}$ be open and bounded sets and let $k \in L^\infty(\Omega' \times \Omega)$. Then the integral operator
$$(Au)(x) = \int_\Omega k(x, t)\,u(t)\,\mathrm{d}t$$
is a well-defined and bounded linear operator. Thus, the convergence analysis is applicable due to lemma 2.4. We mention that in the case of $k$ being a nonnegative function, and hence $A$ and $A^*$ preserving nonnegativity, standard schemes preserving nonnegativity are available. In particular for kernels $k$ taking also negative values, the entropic Landweber scheme offers a straightforward alternative, since it does not depend on the positivity preservation of $A$, respectively of its adjoint. For comparison we consider the EM algorithm
$$u^{k+1} = \frac{u^k}{A^*\mathbf{1}}\,A^*\Big(\frac{y}{Au^k}\Big)$$
and the projected Landweber iteration
$$u^{k+1} = P_+\big(u^k - \omega\,A^*(Au^k - y)\big),$$
with $P_+$ the projection onto the nonnegative cone and $\omega > 0$ a step size. We implement the forward operator by discretization of $u$ on a uniform grid and a trapezoidal rule for integration. We use the following examples of kernels and initial values, all on $\Omega = (0, 1)$, the first two being standard test examples used in the literature on maximum entropy methods (see [1]); the third is given by the kernel $k_3(x, y) = 1$ if $x \geq y$ and $k_3(x, y) = 0$ else, with exact solution $z_3(x) = \mathrm{e}^{-x^2/(2\sigma^2)}$.
In all examples, we have chosen $\sigma^2 = 0.01$ and a constant initial value $u^0$. In order to illustrate the behaviour of the iteration methods we plot the error $\|u^k - z\|_{L^1}$ versus the iteration number $k$ in figure 1.
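As an indication of how such a comparison can be set up, the following self-contained sketch discretizes the third example with the trapezoidal rule and runs the entropic and the projected Landweber iterations side by side; the step parameter c and the grid size are our own choices, and the EM update is only noted in a comment since $k_3$ happens to be nonnegative here.

```python
import numpy as np

# Trapezoidal discretization of (Au)(x) = int_0^1 k_3(x, t) u(t) dt.
n = 200
t = np.linspace(0.0, 1.0, n)
h = t[1] - t[0]
w = np.full(n, h); w[0] = w[-1] = h / 2          # trapezoidal weights
K = (t[:, None] >= t[None, :]) * w               # k_3(x, t) = 1 for t <= x

sigma2 = 0.01
z = np.exp(-t**2 / (2.0 * sigma2))               # exact solution z_3
y = K @ z

c = np.linalg.norm(K, 2)**2                      # step parameter, an assumption
u_ent = np.ones(n)                               # constant initial value
u_pl = np.ones(n)
for k in range(500):
    u_ent = u_ent * np.exp(-(K.T @ (K @ u_ent - y)) / c)       # entropic Landweber
    u_pl = np.maximum(u_pl - (K.T @ (K @ u_pl - y)) / c, 0.0)  # projected Landweber
    # EM (usable since k_3 >= 0): u = u * (K.T @ (y / (K @ u))) / (K.T @ np.ones(n))

print("L1 errors:", np.sum(np.abs(u_ent - z)) * h, np.sum(np.abs(u_pl - z)) * h)
```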
We observe that the entropic Landweber method is at least competitive with the other schemes in all examples; it outperforms the EM and the projection method in the first example, which combines severe ill-posedness with an exact solution having many values close to zero (a particularly difficult case for the EM algorithm).
In the second case, again severely ill-posed, the projected Landweber iteration performs better, mainly due to the strong initial decrease when the solution is positive and no projection is applied.
The third case, corresponding to numerical differentiation, i.e. a very mildly ill-posed problem, is characterized by fast convergence of the schemes, but again the projection method converges significantly more slowly. For comparison we also include the stochastic version of the entropic Landweber method, with only one block of equations used in each iteration step, hence a highly efficient computation. That is, the operator $A$ is divided into $M$ blocks, $A = (A_1, A_2, \ldots, A_M)^T$, and the data $y$ are partitioned in the same way: $y = (y_1, y_2, \ldots, y_M)$. With $J(k)$ a discrete uniform random variable in $\{1, \ldots, M\}$, we compute the iterates
$$u^{k+1} = \frac{1}{c_k^m}\,u^k\exp\Big(-\frac1c\,A_{J(k)}^*\big(A_{J(k)}u^k - y_{J(k)}\big)\Big).$$
The initial convergence behaviour is similar to that of the deterministic method, at much lower computational effort, while the asymptotic convergence close to the exact solution becomes significantly slower. Hence, it might be very attractive to use the stochastic version at least for the first phase of the reconstruction.
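Continuing the sketch above (reusing K, y, n and c from it), a block-stochastic variant can be obtained by sampling one row block per sweep; the number of blocks M and the random seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10
row_blocks = np.array_split(np.arange(K.shape[0]), M)   # A = (A_1, ..., A_M)^T
u_sto = np.ones(n)
for k in range(2000):
    J = row_blocks[rng.integers(M)]                     # J(k) uniform in {1,...,M}
    A_J, y_J = K[J, :], y[J]
    u_sto = u_sto * np.exp(-(A_J.T @ (A_J @ u_sto - y_J)) / c)
```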

Discrete sampling of continuous probability densities
Suppose that our forward operator is the Fourier integral of a real-valued function on a compact domain $\Omega \subset \mathbb{R}^d$, evaluated at discrete samples $\xi_1, \ldots, \xi_n$, i.e.
$$(Au)_j = \frac{1}{(2\pi)^{d/2}}\int_\Omega u(t)\,\mathrm{e}^{-\mathrm{i}\,t\cdot\xi_j}\,\mathrm{d}t, \qquad j = 1, \ldots, n.$$
Then the adjoint operator $A^*$, which satisfies $\mathrm{Re}\sum_{j=1}^n (Au)_j\,\overline{v_j} = \int_\Omega u\,(A^*v)\,\mathrm{d}t$, is given by
$$(A^*v)(t) = \frac{1}{(2\pi)^{d/2}}\,\mathrm{Re}\sum_{j=1}^n v_j\,\mathrm{e}^{\mathrm{i}\,t\cdot\xi_j},$$
where $\mathrm{Re}$ denotes the real part of a complex number. For this choice of $A$ the iterates of (4) read
$$u^{k+1}(t) = \frac{1}{c_k^m}\,u^k(t)\exp\Big(-\frac1c\,\big(A^*(Au^k - y)\big)(t)\Big),$$
where $c_k^m$ is defined as in (14). Note that this update can also be written as
$$u^{k+1}(t) = \frac{1}{\tilde c_k^m}\,u^0(t)\exp\Big(-\frac1c\,\Big(A^*\big(\tilde y^{(k+1)} - (k+1)\,y\big)\Big)(t)\Big), \qquad \tilde y^{(k+1)} := \sum_{l=0}^{k} Au^l,$$
with a suitable normalization constant $\tilde c_k^m$. Due to $\tilde y_j^{(k+1)} = \tilde y_j^{(k)} + (2\pi)^{-\frac d2}\int_{\mathbb{R}^d} u^k(t)\,\mathrm{e}^{-\mathrm{i}\,t\cdot\xi_j}\,\mathrm{d}t$, this formulation has the advantage that the numerical cost for evaluating the integrals remains constant.
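A discrete sketch of this sampled Fourier setting is given below; folding the quadrature weight into the matrix E and the renormalization for m = 1 are our own implementation choices.

```python
import numpy as np

a, n_grid, n_samp = 10.0, 400, 16
t = np.linspace(-a, a, n_grid)
h = t[1] - t[0]
xi = 2.0 * np.pi * np.arange(n_samp) / n_samp
# (Au)_j ~ E @ u, with the quadrature weight h folded into E
E = np.exp(-1j * np.outer(xi, t)) * h / np.sqrt(2.0 * np.pi)

def fourier_entropic_step(u, y, c):
    """One iterate of (4) for the sampled Fourier operator, with m = 1."""
    grad = np.real(E.conj().T @ (E @ u - y))   # discrete counterpart of A^*(Au^k - y)
    u_new = u * np.exp(-grad / c)
    return u_new / (np.sum(u_new) * h)         # renormalize to unit mass
```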
In the following we consider a one-dimensional setting ($d = 1$) with $\Omega = [-a, a]$ for $a = 10$, where we measure $n = 16$ samples $\{y_j\}_{j=1}^n$ of the Fourier integral at coordinates $\xi_j = 2\pi(j-1)/n$, $j = 1, \ldots, n$. We assume that these measurements are of the form
$$y_j = (Az)_j + n_j$$
for a function $z \in L^1([-10, 10])$, where $n_j \sim \mathcal{N}(0, \sigma^2)$ are normally distributed random variables with mean zero and variance $\sigma^2$, for all $j \in \{1, \ldots, n\}$. We consider numerical experiments for two choices of $z$. The first choice $z_1$ is a Gaussian-mixture model, constructed as a linear combination of three normalised Gaussians.
In the following we run the entropic Landweber method (14), with an initial function of amplitude $9/(10\sqrt{2\pi})$, until the discrepancy principle (27) is violated (for $\tau = 1$) or until we reach a certain maximum number of iterations. We first investigate the algorithm for the function $z_1$ as seen in figure 2(a), for perfect data ($\sigma = 0$) and for noisy data ($\sigma = 1/500$). For perfect data we run the algorithm for 201 iterations and observe that we are converging towards $z_1$, as can be seen in figure 3(a) as well as in figure 5(a), which is a numerical confirmation of proposition 4.1. For the non-trivial noise level $\sigma = 1/500$ the algorithm stops after 65 iterations according to the discrepancy principle (figure 3(c)).
To conclude, we run the same numerical experiments for $z_2$, shown in figure 2(b). As mentioned earlier, (30) is violated in this case, and even for perfect data (i.e. $\sigma = 0$) we cannot expect the results of proposition 4.1 to hold true. It can be seen in figure 4(a) that $u^k$ does not seem to converge towards $z_2$, despite a decrease of the objective to values of the order of $10^{-5}$. In fact, if we monitor the $L^1$-norm of the difference $u^k - z_2$, we also see in figure 5(b) that $u^k$ does not seem to converge towards $z_2$. For noisy data with $\sigma = 1/500$ the discrepancy principle is violated after 46 iterations, with the result visualised in figure 4(c).

Initial densities for stochastic differential equations
An interesting problem in several applications, e.g. in data assimilation scenarios (see, e.g. [21]), is the reconstruction of the initial density for a system evolving via stochastic differential equations with drift $b$ and volatility $a$. The density evolves via the Fokker–Planck equation (see [19])
$$\partial_t \rho = \nabla\cdot\big(\nabla(a\rho) - b\rho\big) \quad \text{in } \Omega\times(0, T)$$
with no-flux boundary conditions. Under appropriate smoothness conditions on $a$ and $b$, as well as positivity of $a$, it is well-known that the Fokker–Planck equation has a unique nonnegative solution $\rho \in C(0, T; L^1(\Omega))$ for nonnegative initial values $u \in L^1(\Omega)$ such that $\int_\Omega u\ln u\,\mathrm{d}x < \infty$. In problems related to reconstructing $u$ it is hence rather natural to use methods penalizing its entropy. The forward operator $A$ maps the initial density to indirect measurements of the density $\rho$ over time, e.g. moments or local integrals. Parametrizing the measurements by values $\sigma$ in a bounded set $\Sigma$, we obtain
$$(Au)(\sigma, t) = \int_\Omega k(\sigma, x)\,\rho(x, t)\,\mathrm{d}x.$$
It is well-known that Fokker–Planck equations satisfy an $L^1$-contractivity property on the domain of the entropy functional (see [23]), i.e. for $\rho_i$ denoting the solution with initial value $u_i$,
$$\|\rho_1(\cdot, t) - \rho_2(\cdot, t)\|_1 \leq \|u_1 - u_2\|_1$$
for almost all $t \in (0, T)$. Thus, the map $u \mapsto \rho$ is Lipschitz continuous with unit modulus when considered as a map into $L^\infty(0, T; L^1(\Omega))$ on the domain of the entropy. Hence, if $k \in L^\infty(\Sigma \times \Omega)$ we can easily verify that the operator $A$ satisfies (12). We finally mention that in the case of stationary coefficients $a$ and $b$, the Fokker–Planck equation has a unique stationary solution $\rho_\infty$ among nonnegative functions with unit mass (see [18]), to which the solution converges with exponential speed in the relative entropy (see [2,3]), i.e. the Bregman distance related to the entropy functional. Hence, it is natural to use $\rho_\infty$ as an initial value for the reconstruction of $u$, since we may expect them to be close, in particular in the relative entropy.
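As a rough indication of how the forward operator can be realized numerically, the sketch below evolves the density with an explicit finite-difference scheme for the one-dimensional Fokker–Planck equation and records weighted integrals; the concrete discretization, the no-flux handling and all names are illustrative assumptions (stability requires a sufficiently small time step).

```python
import numpy as np

def fokker_planck_measurements(u0, b, a, x, T, n_steps, kernels):
    """Forward map u -> (Au)(sigma, t): evolve rho by an explicit scheme for
    d_t rho = d_x( d_x(a rho) - b rho ) and record int k_sigma(x) rho(x, t) dx."""
    h, dt = x[1] - x[0], T / n_steps
    rho = u0.copy()
    measurements = []
    for _ in range(n_steps):
        flux = b * rho - np.gradient(a * rho, h)  # J = b rho - d_x(a rho)
        flux[0] = flux[-1] = 0.0                  # no-flux boundary conditions
        rho = rho - dt * np.gradient(flux, h)     # d_t rho = -d_x J
        measurements.append([np.sum(k * rho) * h for k in kernels])
    return np.asarray(measurements)
```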

Conclusions and remarks
In this study we have investigated a multiplicative entropic-type method for ill-posed equations, which preserves nonnegativity of the iterates. Historically, the underlying strategy has deep roots in the inverse problems literature: Landweber iterates, surrogate functionals and linearized Bregman iterations, to name a few approaches. In parallel, it has been treated in different contexts in finite dimensional optimization, e.g. as a mirror descent or as a steepest descent (linearized proximal) algorithm with generalized distances, and in machine learning as an exponentiated gradient descent algorithm for online prediction via linear models.
The closed form algorithm is shown to converge weakly in $L^1$ to a solution of the ill-posed problem, and convergence rates are obtained by means of the Kullback–Leibler (KL) distance. All the results are quite naturally established when imposing the unit mass restriction on the unknown, while the case without restrictions relies on a Lipschitz condition combining the norm and the KL distance, for which classes of operators satisfying it remain to be identified. Methods of this type involving other interesting fidelity terms, nonlinear operators and possibly stochastic versions and line search strategies might be considered in more detail in future research.