Coupled regularization with multiple data discrepancies

We consider a class of regularization methods for inverse problems where a coupled regularization is employed for the simultaneous reconstruction of data from multiple sources. Applications for such a setting can be found in multi-spectral or multi-modality inverse problems, but also in inverse problems with dynamic data. We consider this setting in a rather general framework and derive stability and convergence results, including convergence rates. In particular, we show how parameter choice strategies adapted to the interplay of different data channels allow to improve upon convergence rates that would be obtained by treating all channels equally. Motivated by concrete applications, our results are obtained under rather general assumptions that allow to include the Kullback-Leibler divergence as data discrepancy term. To simplify their application to concrete settings, we further elaborate several practically relevant special cases in detail. To complement the analytical results, we also provide an algorithmic framework and source code that allows to solve a class of jointly regularized inverse problems with any number of data discrepancies. As concrete applications, we show numerical results for multi-contrast MR and joint MR-PET reconstruction.


Introduction
In many classical fields of application of inverse problems, rather than measuring a single data channel, the simultaneous or sequential acquisition of multiple channels has become increasingly important recently. Besides having different sources of information available, an advantage of multiple measurements is that correlations between different data channels can be exploited in the inversion process, often leading to a significant improvement for each channel.
Multi-modality and multi-contrast imaging techniques for instance deal with the exploration of such joint structures for improved reconstruction. Applications of such techniques can be found for instance in biomedical imaging [2,11,12,13,26,42,36,38,42,46,21], geosciences [45], electron microscopy [20] and beyond. Also spatio-temporal imaging techniques can be interpreted in this way, regarding the measurement at each time-point as a different channel, and we exemplary refer to [41,32,17] for applications.
The present work focuses on a general framework for coupled regularization of inverse problems where the data is split into multiple channels. In an abstract setting, we consider the following problem setup.
where the T i denote forward operators, the f i the given data, the D i data discrepancy terms and R a regularization functional. Our main interpretation of this setup is that u = (u 1 , . . . , u N ) denotes some unknown multi-channel quantity of interest, each u i corresponds to some measured data f i such that T i u = T i u i ≈ f i , and R realizes a coupled regularization of all channels. A first particular, example for such a setting is the joint reconstruction of magnetic resonance (MR) and positron emission tomography (PET) images via solving min u=(u 1 ,u 2 ) R((u 1 , u 2 )) + λ 1 T 1 u 1 − f 1 as has for instance been considered in [26,13]. Here, the first data fidelity reflects Gaussian noise in MR, while the second one is the Kullback-Leibler divergence that arises since the noise in PET imaging is typically assumed to be Poisson distributed. A second example is the situation that the components (f i ) i result from a sequential measurement process which is such that at different measurement times, the quality of the measured data is different. A practical application of this is spatio-temporal regularization for dynamic magnetic resonance imaging (dMRI) [41,32], where the f i would correspond to measurements of the same object at different time points. There, one aims to solve where now u = (u i ) N i=1 is a time series of measurements and R employs a spatio-temporal regularization. Since in dMRI, there is always a trade-off between the signal quality and measurement time, it is reasonable to adapt the measurement time to the underlying dynamics of the object, leading to measurements with different resolution and noise properties.
We highlight two particular situations of the setting (1) that are captured by standard theory: The first one is the situation when the regularization decouples, i.e., R(u) = N i=1 R i (u i ), which reduces the reconstruction to N decoupled problems. Application-wise, however, it has been shown in many recent works that, in case the different channels share some structure, a coupled regularization is highly beneficial in terms of reconstruction quality. The second particular case is when all data terms are weighted the same, i.e, λ i = λ j for all i, j. In this situation, all data terms can be grouped to a single discrepancy and, again, standard theory applies. We believe, however, that there are various good reasons to consider an individual weighting of all data terms: i) The noise type of the involved measurements might be different and the different data discrepancies might hence scale differently. ii) The noise level will in general be different, e.g., the SNR for one channel might be significantly better, allowing to choose a very high value for one of the λ i . iii) The degree of ill-posedness might be very different for each channel, as is the case for instance with MR-PET reconstruction, and hence again an appropriate adaption of the regularization parameters is necessary.
Allowing different parameters for different data discrepancies, the following questions arise.
• What is the convergence behavior of the method if the data of some channels converges to a ground truth? Is there a limit problem, and if so, what does it look like?
• What about convergence rates in case the noise on all channels approaches zero? How is the interplay of the different data terms with respect to convergence rates?
• Taking into account different types of data terms, what is the best parameter choice?
We will see that those questions all have rather natural answers. In particular, we will show that in case the data discrepancies have different continuity properties, which is true for instance for MR-PET reconstruction, an appropriate parameter choice allows to partially compensate for less regular discrepancies.
To the best knowledge of the authors, a convergence theory for coupled regularization approach to inverse problems with multiple data discrepancies has not been developed so far. Regarding classical Tikhonov regularization with a single regularization and a single data term we refer to the books [15,40,43,28] for convergence results in Hilbert and Banach spaces. In particular, we highlight the seminal work [22] that provides appropriate source conditions and convergence rates in Banach spaces that have also been the basis for this work.
While [22] and classical works consider a setting that allows for powers of norms as data discrepancies, there are far less works that also cover discrepancies that do not satisfy a (generalized) triangle inequality, as is the case for the Kullback-Leibler divergence that is included in the present work. For the latter, we refer to [23,37,39,33].
While multi-discrepancy regularization has not been considered in depth so far, there has been a lot of recent research on multi-penalty regularization and convergence behavior for multiple parameters and functionals. We refer to the works [29,25,18] for an additive combination of multiple regularization terms, to [31,19,16] for an infimal-convolution-type combination and to [5] for stability and convergence results for multiple-parameter-TGV regularization.

Outline of the paper
The paper is organized as follows: In Section 2 we define the variational problem setting under consideration and provide stability and convergence results in the general setting. This is done in a way that captures both powers of norms as well as the Kullback-Leibler divergence as data discrepancy terms. In Section 3 we consider particular classes of data fidelities. We show how our results can be applied to powers of norms and obtain corresponding convergence rates. After some preliminary results on the Kullback-Leibler divergence, we conclude stability and convergence rates for mixed L 2 norm and Kullback-Leibler discrepancies, which is particularly motivated by applications to joint MR-PET reconstruction. In Section 4 we exemplary work out the application of our results to two choices of coupled regularization. In Section 5 we first outline an algorithmic framework that allows to realize coupled TGV regularization with any combination of L 2 and Kullback-Leibler divergence discrepancies. Then we show exemplary numerical results for coupled regularization in the context of multi-contrast MR and PET imaging.
2 General stability and convergence results

Notation and variational problem setting
We will, throughout this work and up to further specification, consider the following problem setting: Let (X, · X ) be a Banach space and R : X → [0, ∞] be a regularization term.
where, for λ i = ∞, we define We consider the problem We will further use τ X and τ Y i , τ D i , i = 1, . . . , N , to denote different topologies on X and Y i , i = 1, . . . , N , respectively, and assume that τ X and τ Y i satisfy the T2 separation axiom. For τ being any topology, we say that a sequence (z n ) n τ -converges to z as n → ∞, and write z n for Poisson noise. Regarding their applicability, we note the following: Assumptions 1 and 2 are rather weak and standard, e.g. are fulfilled in case the D i are powers of norms and τ Y is stronger than the corresponding weak topology. Assumptions 3 and 4 are also rather standard. Assumption 5 essentially requires pre-compactness of sublevel sets, which is a standard condition for existence.
It will also be convenient to use the following notation: For a set I ⊂ {1, . . . , N } we define the reduced energy functional Here, the complement I c is always taken in {1, . . . , N }, i.e., we define I c = {1, . . . , N } \ I. Further, writing I = {i 1 , . . . , i |I| }, we will use the notation For two sequences (x n ) n and (y n ) n we write x n ∼ y n if there exist constants c, C > 0 such that cy n ≤ x n ≤ Cy n for all n. Further, we use the standard little o() and big O() terminology, that is, we say x n = o(y n ) and x n = O(y n ) if lim n→∞ x n /y n = 0 or x n /y n is uniformly bounded in n, respectively. For the functional R and ξ ∈ ∂R(u) ⊂ X * , an element of the subdifferential of R at u ∈ X, we define the Bregman distance D ξ R :

Results in the general setting
Using Assumption (G), the following existence result can be obtained by standard arguments.
Proposition 2.2. Let f ∈ Y and λ ∈ (0, ∞) N be such that J λ (·, f ) is proper, i.e., finite in at least one point, and assume that Assumption (G) holds. Then there exists a solution to P (λ, f ).
Proof. Take (u n ) n to be a minimizing sequence for which we can assume that J λ (u n , f ) is bounded. Then by assumption (u n ) n admits a τ X -convergent subsequence and, by the lower semi-continuity assumptions on R and D i and by continuity of T i we obtain that the limit is a minimizer.
The following proposition shows existence of what we will refer to as J λ,I c minimal solution to T I u = f † I . This notion will be relevant later on when we study limit problems in multi-discrepancy regularization for vanishing noise level.
Then there exist an u † such that Remark 2.4. Note that J λ,I c is again a multi-data Tikhonov functional for indices in I c , and so J λ,I c -minimal solutions to T i u = f † i can be understood as solutions to a Tikhonov problem for I c , using T I u = f † I as prior.
Proof. It is obvious that the two minimization problems are equivalent, hence we show existence for the second one. Due to our assumptions, the set of solutions is not empty and the infimum is greater equal zero. Consequently, an infimising sequence exists, which due to the coercivity assumption admits a τ X -convergent subsequence with limit u † . Due to continuity of the T i and lower semi-continuity of the D i in the first component w.r.t. τ Y i we get that the limit u † is a solution of (6) and the assertion follows.
Now we consider the situation of varying data, which is relevant for investigating stability of the solution method with respect to data perturbations. To this aim, we will need the following continuity assumption at a particular pair (u, f † ) ∈ X × Y and index set I ⊂ {1, . . . , N }.
We remark that this assumption is in particular always satisfied if the data fidelity is a power of a norm, as can be easily deduced from the inverse triangle inequality for norms.
The next theorem considers the rather general situation that the given data converges and we choose the parameters to either converge to infinity, which corresponds to assuming vanishing noise, or to a fixed positive number. From this, simplified results on stability and convergence for vanishing noise level will then be obtained as corollaries. Note further that we allow for powers p i of the data terms, which will be relevant for the particular case of norms later on.
Theorem 2.5 (General convergence result). Suppose that Assumption (G) holds. Let Choose the parameters λ n = (λ n 1 , . . . , λ n N ) ∈ (0, ∞) N such that, for a given subset I ⊂ {1, . . . , N }, and denote λ † = lim n→∞ λ n ∈ (0, ∞] N . Assume that there exists a J λ † ,I c -minimal solution to T I u = f † I , denoted by u 0 , such that S 1 (u 0 , f † , I c ) holds. Then every sequence (u n ) n of solutions to the problems admits a τ X -convergent subsequence again denoted by (u n ) n with limit u † such that and every limit of a τ X -convergent subsequence of solutions is a solution to Proof. Let (u n ) n be any sequence of solutions and let u 0 be a J λ † ,I c -minimal solution to . By minimality, one obtains . Setting λ 0 i = inf n∈N λ n i > 0 for all i, this implies boundedness of J λ 0 (u n , f n ) and consequently, by coercivity, (u n ) n admits a convergent subsequence.
Assume now that (u n ) n is any subsequence of a sequence of solutions converging to some u † ∈ X. The estimate in (9) implies for each i ∈ I lim sup and hence, by lower semi-continuity of D i and since λ i → ∞ we obtain T i u † = f † i . Further, we get from lower semi-continuity and convergence of the Combining this with the estimate in (9) we obtain, since u 0 ∈ X solves (8), that also u † is also a J λ † ,I c -minimizing solution of T I u = f † I . Now assume that there is one i 0 ∈ I such that λ n i 0 D i 0 (T i 0 u n , f n ) does not converge to zero. This implies that lim sup n→∞ i∈I which, by (9), yields and hence a contradiction. Consequently, for all i ∈ I, . Finally, we assume that either there is an In either case this implies that there is a c > 0 such that But as before this implies that which is a contradiction to u 0 being optimal. Together with lower semi-continuity, this for all i ∈ I c as well as lim n→∞ R(u n ) = R(u † ).
Remark 2.6. We note that, by choosing λ n i for i ∈ I such that with i ∈ (0, p i ), the assumptions of Theorem 2.5 concerning the choice of λ n i hold and we obtain the rate which comes arbitrary close to the rate o(δ p i ). The rate O(δ p i ) will be obtained under additional source conditions later on.
As first consequence of Theorem 2.5 we obtain a basic stability result. That is, letting all regularization parameters in Theorem 2.5 fixed, it follows that the solutions of P (λ, f † ) are stable with respect to perturbations of the data. Note that this could have also been obtained from standard theory (e.g. [22]) without handling the N data terms individually.
be such that f n D → f † as n → ∞ and take λ † ∈ (0, ∞) N . Assume that for u 0 being a solution to P (λ, f † ), Assumption S 1 (u 0 , f † , {1, . . . , N }) holds. Then every sequence (u n ) n of solutions to min admits a τ X -convergent subsequence again denoted by (u n ) n with limit u † such that and every limit of a τ X -convergent subsequence of solutions is a solution to Proof. Choose I = ∅, λ n = λ † and apply Theorem 2.5.
The next corollary is specific to the multi-data-discrepancy setting and handles the situation when the noise on some components vanishes while the noise on other components remains the same. It shows that in this situation, using an appropriate parameter choice, the limit problem comprises the minimization of the remaining terms subject to hard constraints for the channels with ground truth data. To deduce this result, we set the regularization parameters corresponding to components with a fixed noise level to be constant and assume an appropriate parameter choice for the others. Remarkably, the result holds without Assumption S 1 (u, f † , I c ).
Choose the parameters λ n = (λ n 1 , . . . , λ n N ) ∈ (0, ∞) N such that λ n i = λ † i ∈ (0, ∞) for i ∈ I c and λ n i → ∞, as well as λ n i (δ n i ) p i → 0 for i ∈ I, and denote λ † = lim n→∞ λ n ∈ (0, ∞] N . Assume that there exists a J λ † ,I c -minimal solution to T I u = f † I . Then every sequence (u n ) n of solutions to min u∈dom(T ) J λ n (u, f n ), (13) admits a τ X -convergent subsequence again denoted by (u n ) n with limit u † such that and every limit of a τ X -convergent subsequence of solutions is a solution to Proof. Apply Theorem 2.5 and note that Assumption S 1 (u 0 , f † , I) was only needed in its proof to show the convergence Remark 2.9. A particular consequence of the previous corollary is that fixing f n i = f † i also for i ∈ I allows for any parameter choice λ n i such that λ n i → ∞ and yields still In the context of joint regularization, this analytically confirms that, letting the parameter λ n i for some particular, more well-conditioned modalities or channels increase approximates the situation that the corresponding components T i u are fixed to hit the data exactly and only enter as prior information in the joint regularization functional R.
In summary, the previous results answer the first question posed in the introduction: Multi-data-fidelity regularization is stable with respect to convergence of the data for some channels and, using an appropriate parameter choice strategy, the limit problem in case the data for some channels converges to a ground truth comprises the minimization of the regularization and data discrepancies for the other channels subject to hard constraints for the channels with ground truth data.
Next we consider convergence rates. To this aim, we will need a slightly stronger version of Assumption S 1 (u, f † , I c ), which we denote by S 2 (U, f † ). Essentially we will require that the continuity of Assumption S 1 (u, f † , I c ) also holds for all u in a set U and that the modulus of continuity can be estimated by functions ψ i . The typical situation we have in mind for applications is that Remark 2.10. We note that Assumption S 2 (U, f † ) is satisfied trivially in case the data discrepancy is a norm or, more generally, whenever the data terms satisfy an inverse triangle inequality of the type The reason for using a slightly weaker assumption is that the weaker version as above is satisfied by norms to the power p with p > 1, for which an inverse triangle inequality does not hold. We also remark that the assumption S 2 (U, f † ) could be further relaxed for those indices i ∈ I where λ n i → ∞, by allowing arbitrary constants as factors for ) which goes to one. The fact that we can even get the constants to converge to one, which is true but less trivial for powers of norms, is only needed for indices in I c where λ n i converges to a finite value. Hence the latter is a particularity of the multi-data fidelity setting when the noise in some component goes to zero but in others not.
The next theorem provides the main estimates for obtaining convergence rates for all specifications of the data terms that are considered later on. It partially relies on a rather abstract source condition that will be discussed below and simplified in particular applications of the theorem.
Let (u n ) n be a sequence of solutions to P (λ n , f n ) and u † be a J λ † ,f † -minimizing solution of Then If we assume further that there exists for some > 0, then, after at most finitely many indices n we have Proof. Take (u n ) n a sequence of solutions and u † a J λ † ,f † -minimizing solution of T I u = f † I and, without loss of generality, assume that S 2 (U 0 , f † ) holds. At first we note that, from optimality and the fact that Now we aim to change from f n to f † and from λ n to λ † in the functionals J λ n ,I c of last line above. To this aim, we first note that, by assumption S 2 (U 0 , f † ), where the last inequality follows from estimating λ n where the last inequality is obtained by boundedness of D i (T i u n , f † i ), which follows by estimating it above with D i (T i u n , f n i ) (up to constants) using S 2 (U 0 , f † ), which in turn is bounded by the assertion of Theorem 2.5. Now in order to change from λ n to λ † we sum the two equations above and further estimate where the last inequality is again due to boundedness of D(T i u n , f † i ). Combining the last estimate with (16) which already shows the estimate on the extended objective functional J λ † ,I c (·, f † ) as in (15). Now by the second inequality in This, together with the fact that C < λ n i for i ∈ I and n sufficiently large allows to conclude from (17) that, up to finitely many indices, Hence the source condition holds for u = u n and the difference of the reduced objective functionals as above can be estimated by Plugging this estimate in Equation (17), the assertion follows.
As mentioned above, the parameters p i ≥ 1 are introduced for the case that the data discrepancies are powers of norms. Setting all p i = 1, an abstract estimate on convergence rates follows immediately from the previous theorem.
Proposition 2.12. With the assumptions of Theorem 2.11, set p i = 1 for i ∈ I. Then, with (u n ) n a sequence of solutions to P (λ n , f n ) we obtain for j ∈ I Proof. First note that, by Assumption S 2 (U 0 , f † ) we can again estimate, for i ∈ I, It then follows from the assertion of Theorem 2.11 that after finitely many indices Using non-negativity of all involved terms (for sufficiently large n), the assertion follows as direct consequence.
Before considering a more concrete result for convergence rates, we discuss the source condition (SC), which is a direct extension of a standard source condition for single, norm discrepancy terms in the context of non-linear inverse problems in Banach spaces, see [22]. The reason for introducing the power 1/p i in the condition is to obtain compatibility with existing source conditions in the case that Obviously, when letting the noise on all data components convergence to zero (still potentially at different rates), i.e., choosing I = {1, . . . , N }, J λ † ,I c (·, f † ) reduces to R and the source condition coincides with the one of [22]. As discussed for instance there or in [40], under some additional assumptions, this source condition then is equivalent to more classical source conditions [15].
The following remark, whose proof is direct, allows for an interpretation of the source condition also in the most general setting.
The second equation can be interpreted such that if u deviates from the equality constraints T i u = f † i for i ∈ I, the cost of the respective data term needs to increase at least as fast as the Bregman distance of u to u † with respect to the reduced objective functional.
We now consider more concrete assertions on convergence rates. It is easy to see that, in the assertion of Proposition 2.12, the choice f n i = f † i and λ n i = λ † i for i ∈ I c allows to choose φ i ≡ 0 and improves the convergence rates accordingly. More particularly, in the case that the noise on every component goes to zero, i.e., I c = ∅, we get the following corollary.
Corollary 2.14. With the assumptions of Theorem 2.11, set I = {1, . . . , N }, p i = 1 for all i and let (u n ) n be a sequence of solutions to P (λ n , f n ) and u † be an R-minimizing solution of T u = f † such that S 2 (U n 0 , f † ) holds for some n 0 . Then, after at most finitely many n ≥ 0, If we assume further that there exists Proof. This follows from Theorem 2.11 and Proposition 2.12 by choosing I = {1, . . . , N }.
The previous result provides, in a general setting, an answer to the second question posed in the introduction: Under appropriate assumptions, convergence rates can be obtained for the multi-discrepancy setting and the estimates above provide explicit information on the interplay of the different channels and discrepancy terms with respect to convergence rates. Next, we will discuss a first example in the general setting how this interplay can be exploited with appropriate parameter choice strategies in order to improve convergence rates. This is a first answer to the thirst question of the introduction, for which a more detailed discussion with concrete examples of data discrepancies is provided in the subsequent section.
Consider the particular case that the noise level for all components is given as δ n i ∼ (δ n ) µ i for some δ n → 0 and µ i ≥ 1 and that ψ i (x) ∼ x for all i. It is immediate that using the same δ n i -dependent parameter choice for all λ n i , say λ n i ∼ (δ n i ) −(1− ) for ∈ (0, 1), yields the rates That is, the rates for both D ξ R and D j are affected by the worst of all µ i . In addition, choosing → 1, gives the best rate for D ξ R , however worsens the rate for D j whenever µ j > min i {µ i }.
As will be shown in the following corollary, adapting the rates of each parameter to the interplay of the µ i allows to improve upon this. In particular, it allows to obtain the rate (δ n ) µ j for each µ j and a slightly improved rate for D ξ R . This is achieved in the more general setting that ψ i (x) ∼ x ν i with potentially different ν i ∈ (0, 1], which will be relevant for applications with different noise characteristics yielding different discrepancy terms. Corollary 2.15. With the assumptions of Corollary 2.14, suppose that the ψ i in Assumption S 2 (U, f † ) are such that ψ i (x) ∼ x ν i and that δ n i ∼ (δ n ) µ i with a sequence (δ n ) n in (0, ∞) such that δ n → 0, µ i ≥ 1 and ν i ∈ (0, 1]. Define η i = µ i ν i and η min = min i {η i } and take one i 0 such that µ i 0 = min{µ i | η i = η min }. Assume that ν i 0 < 1. Then, with i = η min µ i for all i it follows that i ∈ (0, 1) and with the parameter choice If we again assume that the source condition (SC) of Corollary 2.14 holds, then we obtain for any j ∈ {1, . . . , N }, This choice of the i is optimal for the estimates on D j and D ξ R given in Corollary 2.14 in the sense that they cannot be improved by any other parameter choice of the form λ n i = (δ n i ) −(1−˜ i ) with˜ i ∈ (0, 1). Remark 2.16. Before providing the proof, we note that, for the sake of simplicity, we have left out the case that ν i 0 = 1 with ν i 0 as above. In this case, one has to choose i = η min µ i with ∈ (0, 1), which allows to approximate the same rates in the limit → 1.
Proof. At first we show that indeed i < 1: In case i is such that η i > η min this follows In the other case, we note that necessarily ν i < 1 hence i < η min η i ≤ 1. Plugging in the proposed parameter choice in the rates of Corollary 2.14 the claimed rates follow by direct computation. Regarding their optimality, we note that for any choice of˜ i , the rate for D ξ R is given as which cannot be better that (δ n ) η min . Similar, by considering the jth summand in the rate for D j it is immediate that the rate cannot be better than (δ n ) µ j , which is achieved with the proposed parameter choice.
3 Specification of the data term

Power-of-norm discrepancies
In this subsection, we consider the particular case that all discrepancy terms D i are powers of norms, i.e., In this particular case, we choose the topologies τ D i to be the trivial topologies and hence D i -convergence is equivalent to convergence in the norm · Y i . As we will see, assumption (G) then reduces as follows.
Assumption (N) (Norm discrepancies, τ D the trivial topology). as 1. · Y i is sequentially lower semi-continuous w.r.t. τ Y i , and addition and scalar multiplication are continuous in the topology τ Y i .

The operators
4. For any λ ∈ (0, ∞) N , f ∈ Y and any sequence (u n ) n we have that J λ (u n , f ) < c implies that (u n ) n admits a τ X -convergent subsequence.
, and define τ D to be the trivial topology. Then, if Assumption (N) holds, also Assumption (G) is satisfied.
Proof. We consider the particular points of Assumption (G) with the numbering given there. The Point 1. obviously holds. For two sequences (a n i ) n , (f n i ) n in Y i converging to a i , f i ∈ Y i with respect to τ Y i and norm convergence, respectively, we get that and hence, by monotonicity and continuity of the mapping x → x p i on [0, ∞) Point 2. in Assumption (G) holds. The Points 3. and 4. remain unchanged. Regarding Point 5., take λ > 0, (f n ) n to be sequence D-converging to f ∈ Y and take (u n ) n to be a sequence such that J λ (u n , f n ) < c, with c > 0. By the triangle-inequality we get that which is bounded, hence by assumption (u n ) n admits a τ X -convergent subsequence.
By the inverse triangle inequality and continuity of x → x p i it is also clear that Assumption S 1 (u, f † , I c ) holds for any u ∈ X. Hence, in summary, for the case of power-of-norm discrepancies, we get the following simplified version of the general result in Theorem 2.5, covering existence, stability and convergence for vanishing noise. Obviously, the Corollaries 2.7 and 2.8 then also hold in simplified versions.
Choose the parameters λ n = (λ n 1 , . . . , λ n N ) ∈ (0, ∞) N such that, for a given subset I ⊂ {1, . . . , N }, λ n i → ∞, as well as λ n i (δ n i ) p i → 0 for i ∈ I, Further denote λ † = lim n→∞ λ n ∈ (0, ∞] N . If dom(R) ∩ ( n i=1 dom(T i )) = ∅, then there exists a sequence (u n ) n of solutions to P N (λ, f n ), every such sequence of solutions possesses a τ X -convergent subsequence again denoted by (u n ) n with limit u † such that and every limit of a τ X -convergent subsequence of solutions is a solution to Regarding convergence rates results, we remember that in the general case they rely on Assumption S 2 (U, f † ). For power-of-norm discrepancies, we now show that also this assumption is always satisfied. To this aim, we first need the following basic facts from analysis.
This allows us to obtain the following estimates, which are a particular case of the inequalities required by Assumption S 2 (U, f † ).
Proposition 3.4. Let (Y, · ) be a normed space and f, f † , v ∈ Y and p, q ∈ (1, ∞) with Then, there exits α > 0 and constants c, C, d, D > 0 such that, whenever δ < α, Proof. Obviously, the results holds for δ = 0, hence we assume δ > 0. First we use Lemma 3.3 to estimate for a general λ ∈ Now we set λ = δ 1 q and get for δ sufficiently small
The previous proposition shows, in case all D i -terms are powers of norms, Assumption S 2 (U, f † ) holds again for any U ⊂ X, f † with ψ i (x) = x 1 p i . From this, we obtain the following convergence rates.
Theorem 3.5. With the assumptions of Theorem 3.2, decompose the index set I such that Let the parameter choice for the indices i ∈ I be such that with i ∈ (0, 1). Let (u n ) n be a sequence of solutions to P N (λ, f n ) and u † be a J λ † ,f †minimizing solution of T I u = f † I . Then Assume further that there exists and constants 0 < β 1 < 1, 0 < β 2 such that and, in case p j = 1, and, in case p j > 1, Proof. At first we note that, by Theorem 2.11, for all but finitely many indices n we get Using that, by Young's inequality, for i ∈ J p , Estimating the left hand side below by the individual terms, the assertion follows.
We now consider the particular case that the noise on all components vanishes, i.e., I = {1, . . . , N }, and that p i = 2 for all i, which is typically chosen when the norms · Y i are two-norms and the noise is Gaussian. Assume again that the noise level of the different components is related as (δ n i ) = (δ n ) µ i with δ n → 0, δ n ≥ 0 and µ i ≥ 1 with µ i 0 = 1 for one i 0 (the slowest rate). As a direct consequence of the previous theorem we get the rate δ n for D ξ R and (δ n ) µ j +1 for T j u n − f j 2 Y j . The following corollary shows that, again, by adapting the parameter choice to the interplay of the different noise levels, we can improve the rate for T j u n −f j 2 Y j without worsening the rate for D ξ R .
Let (u n ) n be a sequence of solutions to P N (λ, f ) and u † be an R-minimizing solution of Assume further that there exists ξ ∈ ∂R(u † ) and constants 0 < β 1 < 1, 0 < β 2 such that Then, for any j, it holds that D ξ R (u n , u † ) = O(δ n ) (36) and This parameter choice is optimal for the estimates given in Equation (33) in the sense that they cannot be improved by choosing λ n i = (δ n i ) −(2− i ) with i ∈ (0, 2).
Proof. First note that the first estimate in Equation (33) of Theorem 3.2 holds true for any admissible parameter choice. It is also easy to see that, for the particular case of this corollary, the rate for D ξ R cannot be better than claimed. Also for the data discrepancy T j u n − f n j 2 Y j we see by considering the jth summand, that the rate cannot be better than (δ n ) 2µ j , which is the claimed rate for the proposed choice. Plugging in the proposed parameter choice, the claimed rates follow by direct computation.

Mixed two-norm and Kullback-Leibler-Divergence Discrepancies
In inverse problems where the measurement process essentially corresponds to counting a number of events, such as photon measurements in PET or electrons in electron tomography, the noise is typically assumed to follow a Poisson distribution. Interpreting the regularization term as a prior in a Bayesian approach, the data fidelity for computing a maximum a posteriori (MAP) estimate for Poisson-noise-corrupted data is the Kullback-Leibler divergence (see for instance [23,39]).
Since important applications of joint regularization approaches, such as MR-PET reconstruction or multi-spectral electron tomography, include this type of measurements we also consider the application of the previous results to the Kullback-Leibler divergence as data fidelity. To this aim, we first establish some basic properties of this type of data fidelity.

Basic properties of the Kullback-Leibler Divergence
We define the Kullback-Leibler divergence as follows. Let (Σ, B µ , µ) with Σ ⊂ R m be a finite measure space. We consider functions in Then we define D KL : where we set 0 log a 0 = 0 for a ≥ 0 and −b log( 0 b ) = ∞ for b > 0. Note that D KL is well-defined since, as an easy computation shows, the integrand is always non-negative, but potentially takes the value infinity also for u ≥ 0.
The following properties of D KL are well-known and can for instance be found in [3,37] for the case of Lebesgue measures and directly extended more general measures since they rely exclusively on pointwise assertions.  (Σ, µ), the Kullback-Leibler divergence satisfies the following.

The Kullback-Leibler divergence can be estimated as follows
Further, we will require suitable continuity results for the Kullback-Leibler divergence as stated in the following proposition. 1. For sequences (v n ) n , (f n ) n ∈ L 1 (Σ, µ) such that (f n ) n is bounded in L 1 (Σ, µ), the Kullback-Leibler divergence is jointly lower semi-continuous in the sense that 2. Let f, v ∈ L 1 (Σ, µ) and the sequence (f n ) n be given such that (f n ) n is bounded in Then 3. In the setting of Assertion 2, defining Σ v+ = {x ∈ Σ | v(x) > 0}. Then there exists C > 0 such that Proof. Ad 1. We show that D KL is lower semi-continuous with respect to L 1 convergence in both components. The desired statement then follows since convergence in D KL (f, f n ) and uniform boundedness of (f n ) n in L 1 implies convergence in L 1 due to Lemma 3.7 and convexity of the function yields lower semi-continuity in weak L 1 convergence. Without loss of generality we assume that (v n ) n and (f n ) n are such that v n ≥ 0, f n ≥ 0 pointwise almost everywhere and lim n→∞ D KL (v n , f n ) = lim inf n→∞ D KL (v n , f n ). The sequences (v n ) n , (f n ) n are bounded in L 1 , and thus contain subsequences (v n i ) i , (f n i ) i converging pointwise to v and f , respectively. We show that the non-negative integrand . Indeed, for a n → a, b n → b, the case a > 0, b > 0 is trivial since g is continuous on (0, ∞) × (0, ∞). If a = 0 and b = 0, then g(a, b) = 0 ≤ g(a n , b n ), if a = 0, b > 0, then −b n log a n b n → ∞ implying lim g(a n , b n ) = g(a, b) = ∞ . In the case a > 0, b = 0, the value b n log a n b n → 0 resulting in g(a, b) = lim g(a n , b n ). Hence,

e. and application of Fatou's Lemma yields
Ad 2. We assume D KL (f, f n ) < ∞ and D KL (v, f ) < ∞, since the case D KL (v, f ) = ∞ is trivial due to lower semi-continuity. Define Σ v+ = {x ∈ Σ | v(x) > 0} and note that f (x) > 0 for x ∈ Σ v+ by assumption. Using the conventions for a log( a b ) for the case that a = 0 or b = 0, direct computation shows Indeed, note that the above computations are also valid in the special cases that one of the occurring functions attains the value 0: Since both D KL (f, f n ) < ∞ and D KL (v, f ) < ∞ are finite we get that, up to sets of measure zero, f (x) = 0 implies f n (x) = 0 as well as v(x) = 0 implies f (x) = 0. This means that all log terms in the second line above are finite and a simple case study shows that the estimate above also captures all degenerate cases. The claimed convergence then follows immediately. Ad 3. This follows directly from (43) and Lemma 3.7.
Our goal is now to identify when the general Assumptions (G) as well Assumptions S 1 (u, f † , I c ), S 2 (U, f † ) can be verified for data fidelities that include D KL . To this aim, we summarize the corresponding assertions in the following proposition, which is an immediate consequence of Lemma 3.7 and Proposition 3.8. Proposition 3.9 (Properties of D KL ). Define by τ D KL to be the L 1 topology on L 1 (Σ, µ) Then the following holds for each f, v ∈ L 1 (Σ, µ) and each sequence (f n ) n in L 1 (Σ, µ).

f n D KL
→ f if and only if D KL (f, f n ) → 0 and f n 1 is uniformly bounded.
3. D KL is lower semi-continuous w.r.t weak-L 1 convergence in the first and D KL convergence in the second component.

Stability and convergence for mixed two-norm and Kullback-Leibler discrepancies
In the previous section we have derived properties of the Kullback-Leibler divergence that allow to incorporate it in the general framework of Section 2. In this section, motivated by concrete applications, we will work out how these results can be employed to the particular case of mixed L 2 and Kullback-Leibler data discrepancy terms. That is, we decompose the index set {1, . . . , N } as the disjoint union of L nr and L kl and consider the minimization problem where p ∈ (1, 2] and Ω ⊂ R d is a bounded Lipschitz domain, R : L p (Ω) N → [0, ∞] is a regularization term and λ ∈ (0, ∞) N . Note that we have further specified the problem by evaluating the data discrepancies only on the components u i of u = (u 1 , . . . , u N ), which is a particular case of the previous setting and corresponds to the situation that all components of u are obtained from independent measurements. Also, we now assume the operators T i to be bounded linear operators, that is, for i ∈ L nr we assume T i ∈ L(L p (Ω), L p (Σ i )) with Σ i ⊂ R m i bounded and Lebesgue measurable and for i ∈ L kl we assume T i ∈ L(L p (Ω), L 1 (Σ i , µ i )) with Σ i ⊂ R m i and (Σ i , B µ i , µ i ) a finite measure space. The (f i ) denote the given data in the respective spaces and, for indices i ∈ L kl , we have included additional additive terms c i that we assume to be given and such that c i ∈ L 1 (Σ i ), c i ≥ 0 pointwise. The reason for including the c i is that in PET imaging they are typically used to model some (a-priory estimated) measurements resulting from scattering and random events (see for instance [26]). The data discrepancies D KL are the Kullback-Leibler divergence as defined in Equation (38). For the norm discrepancies we extend the 2-norm by duality for v ∈ L p (Σ i ) as Note that, as can be easily seen by duality and density arguments that u 2 < ∞ if and only if u ∈ L 2 (Σ i ) and · 2 is lower semi-continuous w.r.t weak L 1 convergence as being the pointwise supremum of a set of continuous functions [14]. The reason for using L p (Σ i ) as image space and extending the 2-norm instead of defining T i to be continuously mapping in L 2 is that, in typical applications we need to choose p ≤ d d−1 with d ∈ {2, 3, 4} to obtain continuous embeddings and hence allowing T i to map continuously to L p instead of L 2 is less restrictive. On the other hand, the 2-norm is the natural choice for measurements corrupted by Gaussian noise.
For the particular setup of Equation P NKL (λ, f ), choosing the involved topologies appropriately, the general Assumption (G) can be simplified as follows.
Assumption (NKL) (τ X and τ Y i the weak topologies in the respective spaces, τ D i the trivial topolgy for i ∈ L nr and τ D i the L 1 -topology for i ∈ L kl ). ds 1. R is lower semi-continuous w.r.t weak convergence in L p (Ω) N 2. T i ∈ L(L p (Ω), L p (Σ i )) for i ∈ L nr and T i ∈ L(L p (Ω), L 1 (Σ i , µ i )) for i ∈ L kl .
3. For each C > 0 the set Proposition 3.10. If assumption (NKL) holds for the particular choices of regularization and data fidelity used Equation P NKL (λ, f ), and if we choose τ X the topology corresponding to weak L p convergence, τ D i the trivial topology for i ∈ L nr and τ D i the topology induced by convergence in · 1 for i ∈ L kl and τ Y i the topology corresponding to weak L p and weak L 1 convergence for i ∈ L nr and i ∈ L kl , respectively, then Assumption (G) holds.
Proof. First note that, when considering Assumption (G) we include the additional terms c i appearing in the Kullback-Leibler discrepancies in the forward operators T i , that is, for i ∈ L kl we consider the affine operatorsT i u = T i u i + c i . Given the properties of D KL and the 2-norm, the only thing that is then left to show is that the sequential compactness as in Point 3. above implies sequential compactness as in Assumption (G), Point 5. To this aim take λ ∈ (0, ∞) N and sequences (u n ) n , (f n ) n , and assume that f n D → f and J λ (u n , f n ) < c. Obviously this implies that R(u n ) is bounded. For i ∈ L nr , also, T i u n i − f n i 2 is bounded and we can estimate and consequently T i u i p is bounded. Hence, if we can show that T i u n i 1 is bounded for i ∈ L kl the assertion follows. Assume that (up to subsequences) T i u n i 1 → ∞ which implies that T i u n i + c i 1 → ∞. Using Lemma 3.7 we get for large enough n Now estimating all bounded quantities with constants, this implies that for C 1 , C 2 > 0 which contradicts to T i u n i + c i 1 → ∞. Hence the assertion follows. Now the continuity assumption S 1 (u, f † , I c ) that was required for stability translates to the following assumption on the data.
Noting that Assumption S 1 (u, f † , I c ) was only used for u such that D i (T i u, f † i ) < ∞, the following lemma allows to relate the Assumption S 1 (u, f † , I c ) and S kl 1 (u, f † , I c ∩ L kl ).
Lemma 3.11. Assume that (NKL) holds and that u is such that Proof. For indices i ∈ I c ∩ L nr we note that Hence the inverse triangle inequality can be employed to show continuity of f i → T i u i − f i 2 w.r.t D iconvergence. For i ∈ I c ∩ L kl , this is the assertion of Proposition 3.9.
Remark 3.12. Remember that Assumption S kl 1 (u, f † , I c ∩ L kl ) will be applied with u being a solution to P NKL (λ, f † ) and hence it appears not too restrictive. In particular, it is not stronger that standard assumptions in the context of inverse problems with Poisson noise even with a single data discrepancy [37,39].
Using Assumption S kl 1 (u, f † , I c ∩ L kl ), a basic existence and stability result follows as direct consequence of Proposition 2.2 and Corollary 2.7.
Proposition 3.13 (Stability). Assume that Assumption (NKL) holds. Let (f n ) n in Y and f † ∈ Y be such that f n D → f † as n → ∞ and take λ ∈ (0, ∞) N . In case J λ (·, f n ) is proper, there exists a solution to P NKL (λ, f n ). If we further assume that, for u 0 being a solution to P NKL (λ, f ), Assumption S kl 1 (u 0 , f † , {1, . . . , N } ∩ L kl ) holds, then every sequence (u n ) n of solutions to min admits a τ X -convergent subsequence again denoted by (u n ) n with limit u † such that and every limit of a τ X -convergent subsequence of solutions is a solution to min u∈X J λ (u, f † ). (46) It is immediate that, under Assumption (NKL) and even without the Assumption S kl 1 (u, f † , I c ∩ L kl ), the counterpart of the convergence result of Corollary 2.8 holds for the particular setting of this section. In order to obtain rates, we need to translate Assumption S 2 (U, f † ) appropriately as follows.
There exist β 0 , β 1 > 0 such that, for each i ∈ L kl and v ∈ U, Again we note that, when Assumption S 2 (U, f † ) will be applied in the particular setting of this section, we set U = {(u n ) n | n ≥ n 0 }∪{u † } with (u n ) n being a sequence of minimizers and u † being a minimizer for data f n and f † , respectively. Hence, as can be easily deduced by the triangle inequality, it will hold that T u Taking this into account, the relation of Assumption S kl 2 (U, f † ) and Assumption S 2 (U, f † ) is given as follows.
Lemma 3.14. Assume that (NKL) holds and that U is such that for each u ∈ U , Proof. Given that all involved quantities are finite, we can transfer the result of Proposition 3.4 also to the extended 2-norm, which implies that S 2 (U, f † ) holds with ψ i (x) = x 1/2 for i ∈ L nr . For i ∈ L kl , it is the assertion of Proposition 3.9 that S 2 (U, f † ) holds with ψ i (x) = x 1/2 .
Using Assumption S kl 2 (U, f † ), we get the following convergence rates result for P NKL (λ, f ).
and denote λ † = lim n→∞ λ n ∈ (0, ∞] N . Let (u n ) n be a sequence of solutions to P NKL (λ n , f n ) and let u † be an R-minimizing solution of T i u i = f † . Set U n 0 = {u n | n ≥ n 0 } ∪ {u † } and assume that S kl 2 (U n 0 , f † ) holds for some n 0 . Then If we assume further that there exists ξ ∈ ∂R(u † ), for some > 0, then we obtain for any j ∈ {1, . . . , N }, Proof. The estimate on R is an immediate consequence of Theorem 2.11 and the parameter choice. In case the source condition holds, the theorem also implies that Now using Young's inequality we can estimate for i ∈ L nr and C > 0 a generic constant Further, by S kl 2 (U n 0 , f † ) we can estimate Together, this implies that Plugging in the parameter choice this implies the rates as claimed.
We note that the rates for the data terms above are optimal in the sense that, even for single data term regularization with the corresponding data term, the rate would be the same. Since the Bregman distance involves all components of the unknown, we naturally obtain the worst-case rate amongst all data terms.

Particular choices for coupled regularization
In this subsection, we show how the previous results on mixed L 2 and Kullback-Leibler discrepancies can be applied for concrete regularization functionals. First we consider a rather simple example of joint wavelet-sparsity regularization, where the regularization term is coercive in L 2 . As second application, we consider coupled second order Total Generalized Variation (TGV) [8] regularization, where coercivity can only be established up to finite dimensional subspaces, and show how our results can also be applied in this situation.

Joint wavelet sparsity
Wavelets are well-known and heavily used in image processing and beyond for their property of sparse approximation of data. This is exploited for example in image compression and denoising applications [30]. When dealing with the regularization of multi-spectral data where the individual components are correlated, a natural approach for coupled regularization is to enforce joint sparsity of wavelet coefficients, see for instance [27]. This is realized by the regularization functional where W N u = (W u 1 , . . . , W u N ) with W : L 2 (Ω) → 2 a biorthogonal wavelet transform, Ω a bounded set and, for z = (z 1 , . . . , With the definitions of Section 3.2.2, we then consider Then, with R(u) = W N u 2,1 , it is easy to see that R is lower semi-continuous w.r.t weak L 2 convergence and, setting p = 2, that also the compactness requirement of Assumption (NKL) is satisfied since c u L 2 ≤ W N u 2 ≤ W N u 2,1 for all u ∈ L 2 (Ω) N and c > 0 (see [30]).
Hence if all the forward operators T i are continuous, Assumption (NKL) holds and all results of Section 3.2.2 apply.

Joint Total Generalized Variation regularization
The Total Generalized Variation functional has been introduced in 2010 [8] with the aim of overcoming the well-known staircasing effect of the Total Variation (TV) functional while still allowing for jump discontinuities in the reconstruction. It has been analyzed in [5,9] in the context of regularization of linear inverse problems and has been heavily used in diverse applications, including the joint reconstruction of magnetic resonance (MR) and positron emission tomography (PET) images [26]. The latter employs an extension of second order TGV for vector-valued data where the coupling of the individual terms is carried out via a Nuclear and Frobenius norm at the level of first and second order derivatives, respectively. In this section, we show how the general results of Section 3.2.2 can be applied to an extended version of this setting. Numerical examples for concrete applications to multi-contrast MR and PET-MR regularization will then be considered in the next section.
For parameters α 0 , α 1 > 0 and u = (u 1 , . . . , u N ) ∈ BV(Ω) N with Ω a bounded Lipschitz domain we define the second order TGV functional as noting that by the results in [5] this is equivalent to the original definition of TGV 2 α as in [8]. where | · | can be any pointwise norm on Z N . We note that in fact this pointwise norm defines the coupling of the different components of Du and Ew, e.g., choosing | · | to be a Frobenius norm results in a Frobenius-norm-type coupling of Du and Ew. In [26], | · | was chosen to be the spectral norm for Z = R d , yielding a nuclear-norm coupling of the components of Du.
With the definitions of Section 3.2.2 and p ∈ (1, d/(d − 1)], we now consider the minimization problem Here, I KL + (u) is the indicator function of the set KL + := {u ∈ L p (Ω) N | u i ≥ 0 a.e. for all i ∈ L kl } and constrains all u i for i ∈ L kl to be non-negative, i.e., Our goal is to show that Assumption (NKL) holds for this setting. However, since TGV 2 α itself is coercive only up to a finite dimensional subset, we have to employ a slight modification of this setup to obtain coercivity (without imposing further assumptions on the T i ), which is equivalent in the sense that solutions to the modified problem are still solutions to the original problem. That is, with P 1 (Ω) N the set of R N -valued polynomials of order less or equal to one, we define the finite dimensional subspace where T = (T 1 , . . . , T N ). Further, we define where (·, ·) denotes the standard inner product in R N , and note that, as can be easily shown, We then consider the modified problem P NKL-TGV (λ, f ) which is equivalent to the original one in the sense that any solution of (P NKL-TGV (λ, f )) is a solution of the original one and, with u = u 1 + u 2 ∈ Z ⊥ + Z being a solution of the original problem, u 1 is a solution of P NKL-TGV (λ, f ).
For this setting, we get the following assertion.
Proof. First note that TGV 2 α , I KL + and I Z ⊥ are lower semi-continuous w.r.t weak L p convergence (see for instance [6] for TGV in the vector-valued case). Hence we are left to show that any sequence (u n ) n in L p (Ω) N such that admits an L p -weakly convergent subsequence. To this aim, we first define the operator P P 1 : L p (Ω) → L p (Ω) as It is easy to see that P P 1 is well-defined and a continuous linear projection. Then, by results in [5,6], there exists C > 0 such that Hence, in order to obtain existence of a weakly convergent subsequence in L p (Ω) N , we are left to show boundedness of P P 1 (u n ). For this, note that since u n ∈ Z ⊥ , also P P 1 (u n ) ∈ Z ⊥ for all n and since T is injective on the finite dimensional space P 1 (Ω) N ∩ Z ⊥ we get from equivalence of norms in finite dimensions that and boundedness of u n i − P P 1 (u n ) i in L p for some C > 0 and the proof is complete.

Numerical realization and examples
In this section we sketch how coupled second order TGV regularization with multiple data discrepancies can be realized numerically and provide some examples at the end. Source code that realizes the outlined algorithmic framework for any number of data terms in dimensions two and three is provided online [24].
With Ω h being a discretized pixel grid in R d and the discrete vector spaces U := {u : Ω h → R N } an V := {v : Ω h → (R d ) N }, we consider the following minimization problem. min u∈U,v∈V where we now simultaneously minimize over the unknown u and the balancing variable v appearing in the definition of TGV 2 α . We equip the discrete vector spaces with the standard inner products (u, s) : for u, s ∈ U and similarly for V , and assume that all involved operators and functionals are now appropriately discretized (see for instance [7,4] for a detailed explanation on a possible discretization of TGV 2 α ). In particular, ∇ : U → V is a discrete gradient operator and E : V → W := (R d×d ) N a discrete symmetrized gradient operator, and the terms z 1 for z ∈ V or z ∈ W denote discrete L 1 norms such that z 1 = x∈Ω h |z(x)| with | · | appropriate pointwise norms on (R d ) N or (R d×d ) N that define the coupling of the individual components.
For simplicity, we assume that T = (T 1 , . . . , T N ) does not vanish on affine functions, hence existence can be ensured without the additional constraint on Z ⊥ . Re-writing the discrete minimization problem (52)  Here, at any point x ∈ Ω h in the discrete spatial grid and for i ∈ {0, 1}, the mapping prox α i can be given explicitly as (prox α i (ẑ))(x) = proj {z : |z|≤α i } (ẑ(x)), that is, it reduces to a pointwise projection with respect to the pointwise norm. In case | · | is the Frobenius norm (i.e. the root of the squared sum of all entries on (R d ) N or (S d×d ) N ), the projection is given as proj {z : |z|≤α i } (ẑ) =ẑ max{1,|ẑ|/α i } . In the case that | · | is the nuclear-norm on (R d ) N the computation requires a pointwise SVD of 2 × 2 or 3 × 3 matrices (depending on the spatial dimensions and the number of channels) at every point and efficient code to compute this is provided in the source code [24]. The other proximal In the first experiment, we test the reconstruction of an 2D in-vivo coronal knee MRI exam from a clinical MR system. This is an interesting test case for our approach because parts of the clinical protocol consists of two acquisition sequences that obtain data for the same orientation but with two different contrasts. In the first sequence, the spin signal from fat-tissue is suppressed, in the second sequence, fat-tissue is not suppressed. We employ the setting of (52) with N = 2, L nr = {1, 2} and L kl = ∅.
A reduced number of k-space lines, at a rate of 4 below the Nyquist limit, was acquired to accelerate the duration of the scan. Figure 1 shows results from conventional CG-SENSE parallel imaging [34] and joint TGV regularized reconstructions. Since no ground truth is available, the parameters for the methods were chosen according to visual inspection. As can be seen in the figure, the regularized reconstruction suffers from less noise and artifacts and anatomical details are better visible.
In the next experiment, we consider the situation of combining data from two different imaging modalities, which is an interesting problem in the context of recent developments in hybrid PET-MRI systems [35]. This is an interesting case because the different imaging physics lead to different noise properties. MR data is corrupted by Gaussian noise, while PET data comes from a Poisson process. We first performed a numerical simulation study using a human brain phantom [1], where we have a ground truth available to evaluate our results. The PET acquisition was simulated such that the total number of counts corresponded to a 10min FDG PET head scan. Radial MR acquisitions with 16 projections were simulated for three commonly used contrasts in brain imaging (MPRAGE, FLAIR, T2 weighted). Acquisition with an 8 channel receive coil was simulated, and MR k-space data were corrupted by adding complex Gaussian noise. The simulated matrix size was 176×176×30.
We reconstructed the combined dataset by solving (52) in a three-dimensional setting with N = 4, L nr = {1, 2, 3} and L kl = {4}, and using nuclear norm coupling [26] of the four channels.
Results of the simulation are shown in Figure 2. For reference, we also present results using the currently used standard methods to reconstruct such data, expectation maximization [44] for PET and iterative conjugate gradient (CG) SENSE [34] for MR. Since PET is a quantitative imaging modality, it is important that the quantitative signal values are preserved when coupling the different contrasts and modalities. Table 1 presents the PET signal values (in Bq/cm 3 ) of the ground truth, the conventional and the joint reconstruction for gray matter, white matter and cerebrospinal fluid. Our results demonstrate that coupling improves image quality in terms of RMSE to the ground truth and the fidelity of the quantitative signal values.
Finally, we performed an experiment with in-vivo PET-MR data from a clinical PET-MR system (Figure 3). PET scan duration was 10 minutes, the MR exam was performed using an MPRAGE contrast at an acceleration factor of 4 below the Nyquist limit. The variational reconstruction was obtained by solving (52) again in a three-dimensional setting with N = 2, L nr = {1} and L kl = {2}. Reference reconstructions using EM and CG-SENSE Figure 1: In vivo multi-contrast knee measurement with a data reduction factor of 4. Joint TGV reconstruction is compared to conjugate gradient SENSE.   are again shown for reference. Again no ground truth is available, but one can see that the joint reconstruction method generally reduces noise and obtains a sharper reconstruction of the PET image.