A primal-dual hybrid gradient method for non-linear operators with applications to MRI

We study the solution of minimax problems $\min_x \max_y G(x) + \langle K(x),y\rangle - F^*(y)$ in finite-dimensional Hilbert spaces. The functionals $G$ and $F^*$ we assume to be convex, but the operator $K$ we allow to be non-linear. We formulate a natural extension of the modified primal-dual hybrid gradient method (PDHGM), originally for linear $K$, due to Chambolle and Pock. We prove the local convergence of the method, provided various technical conditions are satisfied. These include in particular the Aubin property of the inverse of a monotone operator at the solution. Of particular interest to us is the case arising from reformulation of regularisation problems $\min_x \|f-T(x)\|^2/2 + \alpha R(x)$ with the operator $T$ non-linear. For such problems, we show that our general local convergence result holds when the noise level of the data $f$ is low, and the regularisation parameter $\alpha$ is correspondingly small. We verify the numerical performance of the method by applying it to problems from magnetic resonance imaging (MRI) in chemical engineering and medicine. The specific applications are in diffusion tensor imaging (DTI) and MR velocity imaging. These numerical studies show very promising performance.


Introduction
Let us be given convex, proper, lower semicontinuous functionals $G : X \to \overline{\mathbb{R}}$ and $F^* : Y \to \overline{\mathbb{R}}$ on finite-dimensional Hilbert spaces $X$ and $Y$. We then wish to solve the minimax problem
$$\min_x \max_y\; G(x) + \langle K(x), y\rangle - F^*(y). \eqno{(1.1)}$$

For a complex image $v = r\exp(i\phi)$ with amplitude map $r$ and phase map $\phi$, it may be better to regularise $r$ and $\phi$ differently. This leads us to the problem
$$\min_{r,\phi}\; \tfrac{1}{2}\|f - T(r,\phi)\|^2 + \alpha_r R_r(r) + \alpha_\phi R_\phi(\phi).$$
Here $R_r$ and $R_\phi$ are suitable regularisation functionals for the amplitude map $r$ and phase map $\phi$, respectively. Correspondingly, we define the operator $T$ by $T(r, \phi) := SF\bigl(x \mapsto r(x)\exp(i\phi(x))\bigr)$.
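As a concrete illustration, a discretisation of this forward operator can be sketched as follows; the 8×8 grid, the random half-sampling mask standing in for $S$, and the unitary 2D FFT standing in for $F$ are our own illustrative assumptions, not the paper's actual MRI setup.

```python
import numpy as np

def forward_T(r, phi, mask):
    """Sketch of T(r, phi) = S F (r * exp(i*phi)): form the complex image,
    take its unitary 2D FFT, and keep the frequencies selected by the
    boolean sampling mask."""
    v = r * np.exp(1j * phi)              # complex image v = r e^{i phi}
    k = np.fft.fft2(v, norm="ortho")      # Fourier transform F
    return k[mask]                        # sampling operator S

# toy 8x8 example with a random half-sampling mask
rng = np.random.default_rng(0)
r = rng.random((8, 8))
phi = rng.uniform(-np.pi, np.pi, (8, 8))
mask = rng.random((8, 8)) < 0.5
data = forward_T(r, phi, mask)
print(data.shape)  # one complex sample per selected frequency
```

Note that $T$ is non-linear in $(r, \phi)$ even though $S$ and $F$ are linear, which is the source of all the difficulties addressed below.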
Observe that we may rewrite the problem in the saddle-point form (1.1). Generally, also the regularisation terms can be written in terms of an indicator function and a bilinear part, in the form
$$\alpha_r R_r(r) = \max_\psi\; \langle K_r r, \psi\rangle - \delta_{C_r}(\psi).$$
Observe that $F^*$ is strongly convex in the range of the non-linear part of $K$, corresponding to $T$. Under exactly this kind of structural assumptions, along with strict complementarity and non-degeneracy assumptions at the solution, we can show in Section 4 that $H_{\hat x}^{-1}$ possesses the Aubin property required for the general convergence theorem, Theorem 3.2, to hold. Moreover, in this case the condition that $\hat y$ be small in the non-linear range of $K$ corresponds to $f - T(\hat r, \hat\phi)$ being small. This can be achieved under low noise and a small regularisation parameter.
Computationally (1.3) is significantly more demanding than (1.2), as it is no longer a convex problem due to the non-linearity of $T$. One option for locally solving problems of the form (1.3) is the Gauss–Newton scheme. In this, one linearises $T$ at the current iterate, solves the resulting convex problem, and repeats until a stopping criterion is satisfied. Computationally such schemes combining inner and outer iterations are expensive, unless one can solve the inner problems to a very low accuracy. Moreover, the Gauss–Newton scheme is not guaranteed to converge even locally, a fact that we did occasionally observe when performing the numerical experiments for Section 5. The scheme is however very useful when combined with iterative regularisation, and behaves well in that case for almost linear $K$ [4,28]. It can even be combined with Bregman iterations for contrast enhancement [2]. Unfortunately, our operators of interest are not almost linear. Another possibility for the numerical solution of (1.3) would be an infeasible semismooth Newton method, along the lines of [15], extended to non-linear operators. However, second-order methods quickly become prohibitively expensive as the image size increases, unless one can employ domain decomposition techniques, something that to our knowledge has not yet been done for semismooth Newton methods relevant to total variation type regularisation. Based on this, we find it desirable to start developing more efficient and provably convergent methods for non-convex problems on large data sets. We now study one possibility.
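To make the inner–outer structure of the Gauss–Newton scheme concrete, here is a minimal sketch; for illustration we use a quadratic regulariser in place of $R$ so that each linearised inner problem has a closed-form solution. The toy operator $T$ and all parameter values are our own assumptions.

```python
import numpy as np

def gauss_newton(T, dT, f, x0, alpha=1e-2, iters=20):
    """Schematic Gauss-Newton loop for min_x ||f - T(x)||^2/2 + alpha*||x||^2/2.
    Each outer iteration linearises T at the current iterate and solves the
    resulting convex (here: linear least squares) inner problem exactly."""
    x = x0.copy()
    for _ in range(iters):
        J = dT(x)                              # Jacobian of T at the iterate
        r = f - T(x)                           # current residual
        # inner problem: min_d ||r - J d||^2/2 + alpha*||x + d||^2/2
        A = J.T @ J + alpha * np.eye(x.size)
        d = np.linalg.solve(A, J.T @ r - alpha * x)
        x = x + d                              # repeat until a stopping criterion
    return x

# toy non-linear operator T(x) = (x1^2, x2) fitted to data f
T = lambda x: np.array([x[0] ** 2, x[1]])
dT = lambda x: np.array([[2.0 * x[0], 0.0], [0.0, 1.0]])
f = np.array([4.0, 1.0])
x = gauss_newton(T, dT, f, x0=np.array([1.0, 0.0]), alpha=1e-8)
print(x)  # close to (2, 1) at low regularisation
```

The cost concern raised above is visible even here: every outer step requires an exact inner solve, which for total variation type $R$ would itself be an iterative procedure.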

The basics
We describe the proposed method, Algorithm 2.1, in Section 2.1 below. To begin its analysis, we study in Section 2.2 the application of the Chambolle–Pock method to linearisations of our original problem (1.1). We then derive in Section 2.3 basic descent estimates that motivate a general convergence result, Theorem 2.1, stated in Section 2.4. This result will form the basis of the proof of convergence of Algorithm 2.1. Our task in the following Section 3 will be to derive the estimates required by Theorem 2.1. We follow the theorem with a collection of remarks in Section 2.5.

The proposed method
Let $X$ and $Y$ be finite-dimensional Hilbert spaces. Suppose we are given two convex, proper, lower semicontinuous functionals $G : X \to \overline{\mathbb{R}}$ and $F^* : Y \to \overline{\mathbb{R}}$, and a possibly non-linear operator $K \in C^2(X; Y)$. We are interested in solving the problem
$$\min_x \max_y\; G(x) + \langle K(x), y\rangle - F^*(y). \eqno{(P)}$$
The first-order optimality conditions for $(\hat x, \hat y)$ to solve (P) may be formally derived as
$$-[\nabla K(\hat x)]^* \hat y \in \partial G(\hat x), \eqno{(2.1a)}$$
$$K(\hat x) \in \partial F^*(\hat y). \eqno{(2.1b)}$$
Under a constraint qualification, which is satisfied for example when $G$ is $C^1$ and either $[\nabla K(\hat x)]^*$ has trivial nullspace or $\operatorname{dom} F = X$, these conditions can be seen to be necessary; cf. [27, 10.8]. For linear $K$ in particular, the conditions are necessary and reduce to
$$-K^* \hat y \in \partial G(\hat x), \eqno{(2.2a)}$$
$$K\hat x \in \partial F^*(\hat y). \eqno{(2.2b)}$$
The modified primal-dual hybrid gradient method (PDHGM) due to Chambolle and Pock [8] solves this problem by iterating, for $\sigma, \tau > 0$ satisfying $\sigma\tau\|K\|^2 < 1$, the system
$$x^{i+1} := (I + \tau\partial G)^{-1}(x^i - \tau K^* y^i), \eqno{(2.3a)}$$
$$x^{i+1}_\omega := x^{i+1} + \omega(x^{i+1} - x^i), \eqno{(2.3b)}$$
$$y^{i+1} := (I + \sigma\partial F^*)^{-1}(y^i + \sigma K x^{i+1}_\omega). \eqno{(2.3c)}$$
As such, the method is closely related to a large class of methods including in particular the Uzawa method and the alternating direction method of multipliers (ADMM). For an overview, we recommend [11].
In case the reader is wondering, the order of the primal ($x$) and dual ($y$) updates in (2.3) is reversed from the original presentation in [8]. The reason is that, with the updates reordered, the method can, as discovered in [14], be easily written in a proximal point form. We will exploit this. Indeed, (2.3) already contains two proximal point sub-problems, specifically the computation of the resolvents $(I + \tau\partial G)^{-1}$ and $(I + \sigma\partial F^*)^{-1}$. We recall that they may be written as
$$(I + \tau\partial G)^{-1}(z) = \arg\min_x\; \frac{1}{2\tau}\|x - z\|^2 + G(x).$$
For the good performance of (2.3), it is crucial that these sub-problems can be solved efficiently. Usually in applications, they turn out to be simple projections or linear operations. Resolvents reducing to small pointwise quadratic semidefinite problems have also been studied [31,29].
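For instance (these are our own illustrative choices, not tied to the MRI problems later in the paper), for $G = \|\cdot\|_1$ the resolvent is componentwise soft-thresholding, and when $F^*$ is the indicator function of a Euclidean ball, the resolvent is independent of the step length and reduces to a projection:

```python
import numpy as np

def prox_l1(x, t):
    """Resolvent (I + t*dG)^{-1} for G = ||.||_1: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proj_ball(y, radius=1.0):
    """Resolvent (I + s*dF*)^{-1} when F* is the indicator of a Euclidean
    ball: independent of s, it is simply the projection onto the ball."""
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

x = prox_l1(np.array([3.0, -0.5, 1.0]), t=1.0)
y = proj_ball(np.array([3.0, 4.0]), radius=1.0)
print(x, y)  # [2. -0. 0.] and [0.6 0.8]
```

Both operations are pointwise and cheap, which is exactly the efficiency property the text asks of the sub-problems.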
Observe the correspondence between the (merely necessary) optimality conditions (2.1) for the problem (P) with non-linear $K$ and the optimality conditions (2.2) for the linear case. It suggests that we could obtain a numerical method for solving (2.1) by replacing the applications $K^* y^i$ and $K x^{i+1}_\omega$ in (2.3) by $[\nabla K(x^i)]^* y^i$ and $K(x^{i+1}_\omega)$. We would thus linearise the dual application, but keep the primal application non-linear. We do exactly that and propose the following method.

Algorithm 2.1. For given $\tau, \sigma > 0$ and $\omega \in [0, 1]$, iterate
$$x^{i+1} := (I + \tau\partial G)^{-1}\bigl(x^i - \tau[\nabla K(x^i)]^* y^i\bigr), \eqno{(2.4a)}$$
$$x^{i+1}_\omega := x^{i+1} + \omega(x^{i+1} - x^i), \eqno{(2.4b)}$$
$$y^{i+1} := (I + \sigma\partial F^*)^{-1}\bigl(y^i + \sigma K(x^{i+1}_\omega)\bigr). \eqno{(2.4c)}$$
In practice we require $\omega = 1$. Exact conditions on the step length parameters $\tau$ and $\sigma$ will be derived in Section 3 along the course of the proof of local convergence; in the numerical experiments of Section 5, we make $\tau$ and $\sigma$ depend on the iteration, choosing $\tau_i$ and $\sigma_i$ to satisfy the requisite step length bound with the ratio $\sigma_i/\tau_i$ unaltered. We discuss the justification for this kind of strategy in Remark 3.6 after the convergence proof.
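A compact sketch of this iteration might look as follows. The toy saddle-point instance below (a scalar $K(x) = x^2$ with a quadratic $F^*$) is our own illustrative construction, chosen so that the method's behaviour can be checked by hand; it is not one of the paper's MRI problems.

```python
import numpy as np

def nl_pdhgm(K, dK, prox_G, prox_Fs, x0, y0, tau, sigma, iters=300, omega=1.0):
    """Sketch of iteration (2.4): the primal step uses the adjoint of the
    linearisation [grad K(x^i)]^*, while the dual step applies the
    non-linear K itself. prox_G and prox_Fs stand for the resolvents
    (I + tau*dG)^{-1} and (I + sigma*dF*)^{-1}, supplied by the caller."""
    x, y = x0.astype(float).copy(), y0.astype(float).copy()
    for _ in range(iters):
        x_new = prox_G(x - tau * dK(x).T @ y, tau)     # (2.4a)
        x_bar = x_new + omega * (x_new - x)            # (2.4b), omega = 1
        y = prox_Fs(y + sigma * K(x_bar), sigma)       # (2.4c)
        x = x_new
    return x, y

# toy instance: min_x ||x^2 - f||^2/2 as a saddle problem with K(x) = x^2
# and F*(y) = ||y||^2/2 + <f, y>, whose resolvent is (z - sigma*f)/(1 + sigma)
f = 4.0
K = lambda x: x ** 2
dK = lambda x: np.array([[2.0 * x[0]]])
prox_G = lambda z, t: z                                # G = 0
prox_Fs = lambda z, s: (z - s * f) / (1.0 + s)
x, y = nl_pdhgm(K, dK, prox_G, prox_Fs,
                x0=np.array([1.5]), y0=np.array([0.0]),
                tau=0.2, sigma=0.2)
print(x)  # approaches 2, where x^2 = f and the dual variable vanishes
```

Note how close the code is to plain PDHGM: only the two applications of $K$ differ, exactly as described in the text above.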
We will base the convergence proof on the following fully linearised method, where we replace the application K(x i+1 ω ) also by a linearisation.

Linearised problem and proximal point formulation
To start the convergence analysis of Algorithm 2.1, we study the application of the standard Chambolle–Pock method (2.3) to linearisations of problem (P) at a base point $\bar x \in X$. Specifically, we define
$$K_{\bar x} := \nabla K(\bar x), \quad\text{and}\quad c_{\bar x} := K(\bar x) - K_{\bar x}\bar x.$$
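The affine model $K_{\bar x} x + c_{\bar x}$ agrees with $K$ to first order at the base point, with a second-order error; a quick numerical check on a toy operator (our own example, purely illustrative) confirms this:

```python
import numpy as np

# Linearisation at a base point xb: K_xb = grad K(xb) and
# c_xb = K(xb) - K_xb @ xb, so K_xb @ x + c_xb is the affine model of K at xb.
K = lambda x: np.array([x[0] * x[1], np.sin(x[0])])   # toy non-linear K
dK = lambda x: np.array([[x[1], x[0]], [np.cos(x[0]), 0.0]])

xb = np.array([0.5, 2.0])
K_xb = dK(xb)
c_xb = K(xb) - K_xb @ xb

x = xb + 1e-4 * np.array([1.0, -1.0])                 # point near the base point
err = np.linalg.norm(K(x) - (K_xb @ x + c_xb))
print(err)  # O(||x - xb||^2), i.e. on the order of 1e-8 here
```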
Then we consider the linearised problem
$$\min_x \max_y\; G(x) + \langle K_{\bar x} x + c_{\bar x}, y\rangle - F^*(y). \eqno{(2.6)}$$
This problem is of the form required by the method (2.3) if we write $F^*_{\bar x}(y) = F^*(y) - \langle c_{\bar x}, y\rangle$. Indeed, we may write the updates (2.3) for this problem as
$$x^{i+1} := (I + \tau\partial G)^{-1}(x^i - \tau K_{\bar x}^* y^i), \eqno{(2.7a)}$$
$$x^{i+1}_\omega := x^{i+1} + \omega(x^{i+1} - x^i), \eqno{(2.7b)}$$
$$y^{i+1} := (I + \sigma\partial F^*)^{-1}\bigl(y^i + \sigma(K_{\bar x} x^{i+1}_\omega + c_{\bar x})\bigr). \eqno{(2.7c)}$$
Observe how (2.5c) corresponds to (2.7c) with $\bar x = x^i$.
From now on we use the general notation u = (x, y),

and define
$$H_{\bar x}(u) := \begin{pmatrix} \partial G(x) + K_{\bar x}^* y \\ \partial F^*(y) - K_{\bar x} x - c_{\bar x} \end{pmatrix},$$
as well as
$$M_{\bar x} := \begin{pmatrix} I/\tau & -K_{\bar x}^* \\ -\omega K_{\bar x} & I/\sigma \end{pmatrix}.$$
With these operators, $0 \in H_{\bar x}(\hat u)$ characterises solutions $\hat u$ to (2.6), and $u^{i+1}$ computed by (2.7) is according to [14] characterised as the unique solution to the proximal point problem
$$0 \in H_{\bar x}(u^{i+1}) + M_{\bar x}(u^{i+1} - u^i). \eqno{(2.8)}$$
In fact, returning to the original problem (P), the optimality conditions (2.1) may be written in terms of $H_{\bar x}$ and a discrepancy term $D_{\bar x}$, defined in (2.10), with $x_\omega := x + \omega(x - \bar x)$. We therefore study next basic estimates that can be obtained from (2.8) with an additional general discrepancy term $\nu^i$. These form the basis of our convergence proof.

Basic descent estimate
We now fix $\omega = 1$ in order to make $M_{\bar x}$ symmetric, and study solutions $u^{i+1}$ to the general system
$$0 \in H_{\bar x}(u^{i+1}) + \nu^i + M_{\bar x}(u^{i+1} - u^i). \eqno{(2.11)}$$
In Lemma 2.1 below, we show that $u^{i+1}$ is better than $u^i$ in terms of distance to the "perturbed local solution" $\widetilde u^i$ solving $0 \in H_{\bar x}(\widetilde u^i) + \nu^i$. Here we use the word perturbation to refer to $\nu^i$, and local to refer to the linearisation point $\bar x$. Observe that $\widetilde u^i$ depends on both $u^i$ and $u^{i+1}$ in case of Algorithm 2.1, resp. (2.9). In Section 3 we will lessen these dependencies, and convert the statement to be in terms of local (unperturbed) optimal solutions $\hat u^i$, satisfying $0 \in H_{\bar x}(\hat u^i)$.
For the statement of the lemma, we use the notation $\langle a, b\rangle_M := \langle a, Mb\rangle$ and $\|a\|_M := \sqrt{\langle a, a\rangle_M}$, and denote by $P_V$ the linear projection operator onto a subspace $V$ of $Y$. We also say that $F^*$ is strongly convex on the subspace $V$ with constant $\gamma > 0$ if
$$F^*(y') \ge F^*(y) + \langle z, y' - y\rangle + \frac{\gamma}{2}\|P_V(y' - y)\|^2 \quad\text{for all } y, y' \text{ and } z \in \partial F^*(y).$$
This is equivalent to saying that the operator $\partial F^*$ is strongly monotone on $V$, in the sense that
$$\langle z - z', y - y'\rangle \ge \gamma\|P_V(y - y')\|^2 \quad\text{for } z \in \partial F^*(y),\; z' \in \partial F^*(y').$$
If $F^*$ is additionally strongly convex on a subspace $V$ of $Y$ with constant $\gamma > 0$, then we have the stronger estimate ( D 2 -loc-γ). Proof. Since the operator $H_{\bar x}$ is monotone, it follows from (2.11), (2.12), and the symmetry of $M_{\bar x}$ that (2.14) holds. This yields ( D 2 -loc). The strong convexity estimate ( D 2 -loc-γ) is proved analogously, using the stronger monotonicity estimate instead of (2.13). Following [18], see also [26], if $\nu^i = 0$, then it is not difficult to show from ( D 2 -loc) the convergence of the iterates $\{u^i\}_{i=0}^\infty$ generated by (2.11) to a solution of the linearised problem (2.6). The estimate ( D 2 -loc) also forms the basis of our proof of local convergence of Algorithm 2.1 and Algorithm 2.2. However, we have to improve upon it to take into account that $\bar x = x^i$ changes on each iteration in (2.4), and that the dual update in (2.4c) is not linearised. The consequence of these changes is that also the weight operator $M_{x^i}$ of the local norm $\|\cdot\|_{M_{x^i}}$ changes on each iteration, as do the local perturbed solution $\widetilde u^i$ and the local (unperturbed) solution $\hat u^i$. Taking these differences into account, it turns out that the correct estimate that we have to derive is ( D) in the next theorem. There we have improved ( D 2 -loc) by these changes and additionally, for proof-technical reasons, by the removal of the squares on the norms.

Idea of convergence proof
Theorem 2.1. Suppose that the operator $K \in C^1(X; Y)$ and the constants $\sigma, \tau > 0$ satisfy, for some $\Theta > \theta > 0$, the bounds (C-M). Suppose, moreover, that for some constant $\zeta > 0$ and reference points the conditions ( D) and (C-ν i ) hold. Then the iterates $u^i \to \hat u$ for some $\hat u = (\hat x, \hat y)$ that solves (2.1).
Proof. It follows from ( D) and an application of (C-M) that $\{u^i\}_{i=1}^\infty$ is a Cauchy sequence, and hence converges to some $\hat u$. Let $z^i := \nu^i + M_{x^i}(u^{i+1} - u^i)$. Since $\nu^i \to 0$ by (C-ν i ), and $u^{i+1} - u^i \to 0$, it follows that $z^i \to 0$. By (2.11), we moreover have $-z^i \in H_{x^i}(u^{i+1})$. Using $K \in C^1(X; Y)$ and the outer semicontinuity of the subgradient mappings $\partial G$ and $\partial F^*$, we may pass to the limit. Here the lim sup is in the sense of an outer limit [27], consisting of the limits of all converging subsequences of elements $v^i \in H_{x^i}(u^{i+1})$. As by (2.11) we have $-z^i \in H_{x^i}(u^{i+1})$, it follows in particular that $0 \in H_{\hat x}(\hat u)$. This says precisely that (2.1) holds.

A few remarks
Remark 2.1 (Reference points). Observe that we did not need to assume the reference points in the above proof to solve anything. We did not even need to assume boundedness, which follows from ( D) and (C-M).

Remark 2.2 (Gauss-Newton). Let
In fact, we can even replace $M_{x^i}$ by the identity $I$. Then $u^{i+1}$ solves the local linearised problem (2.6), and Lemma 2.1 together with Theorem 2.1 shows the convergence of the Gauss–Newton method to a critical point of (P), provided $\nu^i \to 0$. That is, either the iterates diverge, or the Gauss–Newton method converges to a solution.
Remark 2.3 (Varying over-relaxation parameter). We have assumed that $\omega = 1$. It is however possible to accommodate a varying parameter $\omega_i \to 1$ through $\nu^i$. In particular, one can easily show the convergence (albeit merely sublinear) of the accelerated algorithm of [8] this way. In this variant, dependent on the strong convexity of $F^*$ or $G$, one updates the parameters at each step as $\omega_i = 1/\sqrt{1 + 2\gamma\tau_i}$, $\tau_{i+1} = \omega_i\tau_i$, and $\sigma_{i+1} = \sigma_i/\omega_i$, for $\gamma$ the factor of strong convexity of either $F^*$ or $G$.
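The parameter schedule of this accelerated variant can be sketched as follows; we use the standard Chambolle–Pock update $\sigma_{i+1} = \sigma_i/\omega_i$, which keeps the product $\tau_i\sigma_i$ invariant (the particular values of $\gamma$, $\tau_0$, $\sigma_0$ below are our own illustrative choices).

```python
import math

def accelerated_steps(gamma, tau0, sigma0, iters):
    """Step parameter schedule of the accelerated method of [8]:
    omega_i = 1/sqrt(1 + 2*gamma*tau_i), tau_{i+1} = omega_i * tau_i,
    sigma_{i+1} = sigma_i / omega_i; tau_i decreases and omega_i -> 1."""
    tau, sigma = tau0, sigma0
    schedule = []
    for _ in range(iters):
        omega = 1.0 / math.sqrt(1.0 + 2.0 * gamma * tau)
        schedule.append((tau, sigma, omega))
        tau, sigma = omega * tau, sigma / omega
    return schedule

s = accelerated_steps(gamma=1.0, tau0=0.5, sigma0=0.5, iters=50)
print(s[-1][2])  # omega_i tends to 1 as tau_i -> 0
```

Since $\tau_i \sigma_i$ is invariant, a step length condition of the form $\sigma\tau\|K\|^2 < 1$ that holds initially keeps holding along the schedule.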
Remark 2.5 (An alternative update). It is also interesting to consider using $[\nabla K(x^i_\omega)]^*$ instead of $[\nabla K(x^i)]^*$ in (2.4a). From the point of view of the convergence proof, this however introduces major difficulties, as $(x^{i+1}, y^{i+1})$ no longer depends on just $(x^i, y^i)$, but also on $x^{i-1}$ through $x^i_\omega$. This kind of dependence also makes analysis using the original ordering of the PDHGM updates in [8] difficult, but is avoided by the reordering due to [14] that we employ in (2.3) and (2.4).

Detailed analysis of the non-linear method
We now proceed to verifying the assumptions of Theorem 2.1 for Algorithm 2.1 and Algorithm 2.2, provided our initial iterate is close enough to a solution which satisfies certain technical conditions, to be derived along the course of the proof. We will begin with the formal statement of our running assumptions in Section 3.1, after which we prove some auxiliary results in Section 3.2. Our first task in verifying the assumptions of Theorem 2.1 is to show that the discrepancy term $\nu^i = D_{x^i}(u^{i+1}) \to 0$. This we do in Section 3.3. Then in Section 3.4 we begin deriving the estimate ( D) by analysing the switch to the new local norm at the next iterate. In Section 3.5 we introduce and study Lipschitz type estimates on $H_{\bar x}^{-1}$. We then use these in Section 3.6 and Section 3.7, respectively, to remove the squares from the estimate ( D 2 -loc-γ) and to bridge from one local solution to the next. The Lipschitz type estimates themselves we will derive in Section 4 to follow. We state and prove our main convergence theorem, combining all the above-mentioned estimates, in Section 3.8. This we follow by a collection of remarks in Section 3.9.

General assumptions
We take $G : X \to \overline{\mathbb{R}}$ and $F^* : Y \to \overline{\mathbb{R}}$ to be convex, proper, lower semicontinuous functionals on finite-dimensional Hilbert spaces $X$ and $Y$, satisfying $\operatorname{int}\operatorname{dom} G \neq \emptyset$ and $\operatorname{int}\operatorname{dom} F^* \neq \emptyset$. We fix $\omega = 1$ and study the sequence of iterates generated by solving the system (2.11) with $\bar x = x^i$, where we expect the discrepancy functional to satisfy, for any fixed $C, \epsilon > 0$, the existence of $\rho > 0$ such that (A-D i ) holds. We then aim to take $\widetilde u^i$ as a solution of the perturbed linearised problem $0 \in H_{x^i}(\widetilde u^i) + \nu^i$, and $\hat u^i$ as a solution of the linearised problem $0 \in H_{x^i}(\hat u^i)$. Note that these solutions do not necessarily exist. We will later prove that they exist near a solution $\hat u$ for which $H_{\hat x}^{-1}$ satisfies the Aubin property. As Theorem 2.1 makes no particular requirement on $\widetilde u^i$, we will begin with $\widetilde u^i$ and $\hat u^i$ arbitrary elements of $X \times Y$, and state the above inclusions explicitly when we require or have them.
Regarding the operator $K : X \to Y$ and the step length parameters $\sigma, \tau > 0$, we require that (A-K) holds. Here we fix $R > 0$ such that there exists a solution $\hat u$ to (2.1) with $\|\hat u\| \le R/2$. We then denote by $L_2$ the Lipschitz factor of $x \mapsto \nabla K(x)$ on the closed ball $\bar B(0, R) \subset X$, in operator norm; by (A-K), this factor is finite. Finally, we define the "linear" and "non-linear" subspaces
$$Y_L := \{z \in Y \mid x \mapsto \langle z, K(x)\rangle \text{ is linear}\}, \quad\text{and}\quad Y_{NL} := Y_L^\perp,$$
and denote by $P_{NL}$ the orthogonal projection onto $Y_{NL}$.

Auxiliary results
The assumption (A-K) guarantees in particular that the weight operators $M_{x^i}$ are uniformly bounded, as we state in the next lemma. For $\Theta$ and $\theta$ as in the lemma, we also define the uniform condition number $\kappa := \Theta/\theta$. Observe that $\kappa \to 1$ as $\tau, \sigma \to 0$.
Proof. This follows immediately from the fact that ∇K is bounded on B(0, R).
In case of Algorithm 2.2, showing (A-D i ) is trivial because $\nu^i = 0$. For Algorithm 2.1, we show it in the next lemma. Proof. We recall the definition of $D_{\bar x}$ from (2.10). To show (A-D i ), we observe that, thanks to $K$ being twice continuously differentiable, for any $C > 0$ there exist $L > 0$ and $\rho_1 > 0$ such that the requisite local second-order bound holds. Choosing $\rho \in (0, \rho_2)$ small enough, (A-D i ) follows.
We will occasionally use the auxiliary mapping E defined in the following lemma. The motivation behind it is that if v ∈ Hx (u) then v + E(u;ū,ū ) ∈ Hx(u).
Then E is Lipschitz with Lipschitz factor E ≤ L 2 ū −ū , and Proof. The fact that E is Lipschitz with the claimed factor is easy to see; indeed Since (A-K) holds and x , x ≤ R, as we have assumed, we have It follows that That is, the claimed Lipschitz estimate holds.
Regarding (3.3), we may rewrite As in the proof of Lemma 3.2, the quantity In fact, since x , x ≤ R, we may choose L 2 = L 2 . We then observe that Thus Using (3.4), we get (3.3).

Convergence of the discrepancy term
The convergence ν i → 0 required by Theorem 2.1 follows from the following simple result that we will also use later on.

Switching local norms: estimates from strong convexity
We finally begin deriving the descent inequality ( D) that we have so far assumed. As a first step, we set $\bar x = x^i$ in ( D 2 -loc-γ), and use the extra slack that strong convexity gives to switch to the local norm at the next iterate. Proof. Using (3.10) and $\|\hat u\| \le R/2$, together with (3.11), we obtain the first bounds. As both $\|u^i\|, \|u^{i+1}\| \le R$, by (A-K) we again have a uniform bound, which allows us to deduce the key estimate. Using Young's inequality, and applying Lemma 3.1 and ( D 2 -loc-γ), we then obtain the remaining bound. Applying this to (3.13) yields ( D 2 -M).

Aubin property of the inverse
In [26], the Lipschitz continuity of the equivalent of the map $H_{\bar x}^{-1}$ was used to prove strong convergence properties of basic proximal point methods for maximal monotone operators. In order to remove the squares from ( D 2 -M), and to bridge with $\widetilde u^i$ between the local solutions $\hat u^i$ and $\hat u^{i+1}$, we follow the same rough ideas. We however replace the basic form of Lipschitz continuity by a weaker version that is localised in the graph of $H_{\hat x}$. Namely, with $0 \in H_{\hat x}(\hat u)$, we assume that the map $H_{\hat x}^{-1}$ has the Aubin property at $0$ for $\hat u$ [27,1]. This is also called metric regularity.
Generally, the inverse $S^{-1}$ of a set-valued map $S : X \rightrightarrows Y$ having the Aubin property at $\bar w$ for $\bar u$ means that $\operatorname{Graph} S$ is locally closed, and there exist $\rho, \delta, \ell > 0$ such that
$$S^{-1}(w) \cap \bar B(\bar u, \delta) \subset S^{-1}(w') + \ell\|w - w'\|\bar B(0, 1) \quad\text{for all } w, w' \in \bar B(\bar w, \rho). \eqno{(3.14)}$$
We denote the infimum over valid constants $\ell$ by $|S^{-1}|(\bar w|\bar u)$, or $|S^{-1}|$ for short, when there is no ambiguity about the point $(\bar w, \bar u)$. For bijective single-valued $S$, the Aubin property reduces to Lipschitz continuity of the inverse $S^{-1}$. For single-valued Lipschitz $S$, we also use the notation $|S|$ for the (local) Lipschitz factor.
We can translate the Aubin property of H −1 x to H −1 x i , based on the following general lemma. This is needed in order to perform estimation at an iterate u i , only assuming the Aubin property of H x at a known solution u.
Then $T^{-1}$ has the Aubin property at $\Delta(\bar u)$ for $\bar u$. The proof is a minor modification of the proof of the following result. Lemma 3.8. Suppose $S^{-1}$ has the Aubin property at $\bar w$ for $\bar u$, and let $\Delta : X \to Y$ be a single-valued Lipschitz map that is strictly stationary at $\bar u$. This means that for every $\epsilon > 0$, there exists $\delta > 0$ such that $\|\Delta(u) - \Delta(u')\| \le \epsilon\|u - u'\|$ for all $u, u' \in B(\bar u, \delta)$. Let $T(u) = S(u) + \Delta(u)$. Then $T^{-1}$ has the Aubin property at $\Delta(\bar u)$ for $\bar u$, and $|T^{-1}| = |S^{-1}|$.
Proof. This result is the main theorem of [10] combined with the remark following the theorem.
Proof of Lemma 3.7. We show how to modify the proof of Lemma 3.8 in [10] into a proof of Lemma 3.7. There are two differences in assumptions between the lemmas. The first is the apparently different closedness condition on S. But, obviously, S −1 has locally closed values if Graph S is locally closed, which our definition of the Aubin property includes. Therefore that part of the assumptions of Lemma 3.8 is satisfied.
The second difference is the strict stationarity condition on $\Delta$. In Lemma 3.7, this is replaced by the weaker condition $|S^{-1}|\,|\Delta| < 1$. We however observe that the strict stationarity condition is only used in the proof of Lemma 3.8 to show that $\Delta$ is Lipschitz with a given constant $\epsilon > 0$ in a neighbourhood of $\bar u$ [10, (3)]. With our assumptions, we may only take $\epsilon > |\Delta|$. This gives us the weaker result (3.15) instead of the equality $|T^{-1}| = |S^{-1}|$. We conclude that the proof of Lemma 3.8 in [10] is easily modified into a proof of Lemma 3.7.
Lemma 3.9. Suppose $H_{\hat x}^{-1}$ has the Aubin property at $0$ for $\hat u$, and that (A-K) holds. Given $\ell^* > |H_{\hat x}^{-1}|$, there exist $\delta \in (0, R/2)$ and $\rho > 0$ such that the estimate (3.16) holds. Remark 3.1. The property (3.16) is formally the Aubin property of $H_{x^i}^{-1}$ at $0$ for $\hat u$, but cannot strictly be called that, because generally $0 \notin H_{x^i}(\hat u)$.
Clearly the same condition then holds with $\bar\epsilon$ replaced by any $\epsilon \in (0, \bar\epsilon)$. We exploit this. By Lemma 3.7, the corresponding Aubin estimate holds for some $\rho', \delta' > 0$; referring to Lemma 3.3, we obtain the required bound. The next lemma bounds step lengths near a solution. Lemma 3.10. Suppose $H_{\hat x}^{-1}$ has the Aubin property at $0$ for $\hat u$. Given $\epsilon > 0$, there exists $\delta > 0$ such that the following holds. If (3.18) is satisfied, then with the specific choice (3.19) we have the estimates (3.20). Proof. Observe that in each of the estimates (3.20), we may take $\epsilon > 0$ smaller than prescribed. Along the course of the proof, we will accordingly assume $\epsilon$ as small as required.
We begin by proving (3.20a), (3.20b) and (3.20f), using (3.18a) without (3.18b). Indeed, observe that since $\|\hat u\| \le R/2$, (3.20a) trivially holds by choosing $\delta \in (0, R/4)$. To bound $\|u^i - u^{i+1}\|$, we return to the algorithmic formulation for computing $u^{i+1}$, so that (3.22) holds. An application of Young's inequality gives (3.23). The right hand side of (3.23) can be made arbitrarily close to zero by application of (A-K) and (3.18a). That is, for any $\epsilon > 0$, there exists $\delta_\epsilon > 0$ such that if (3.18) holds for some $\delta \in (0, \delta_\epsilon)$, then (3.24) holds. We now have to bound $\|y^{i+1} - y^i\|$ through (3.21c). Similarly to (3.22), $y^{i+1}$ solves for $y$ the problem (3.25). Proceeding as above, using the fact that $K(\hat x) \in \partial F^*(\hat y)$, we get (3.26). By taking $\delta, \epsilon > 0$ small enough, which we may do, we can make the term $K(x^i) - K(\hat x)$ arbitrarily small by application of (A-K) and (3.18a). Likewise, we can make the term $K_{x^i}(x^{i+1} - x^i)$ approach zero by application of (A-K), (3.18a) and (3.24). Finally, the term $\nu^i$ can be made small by additionally using (A-D i ). Indeed, choosing $\epsilon > 0$ small enough and employing (A-D i ), we see that $\nu^i$ is small whenever $\|u^i - u^{i+1}\|$ is small enough, and this can be guaranteed by (3.24) above. Thus, taking $\epsilon > 0$ small enough, we can make $\nu^i$ arbitrarily small. This proves (3.20f), and shows that (3.26) can be made arbitrarily small. In summary, there exists $\delta' \in (0, \delta_\epsilon)$ such that if (3.18a) holds for $\delta \in (0, \delta')$, then (3.27) holds. This proves (3.20b).
To prove the remaining estimates in (3.20), we require (3.18b) as well as the existence of $\widetilde u^i$ and $\hat u^i$. In fact, since $\operatorname{int}\operatorname{dom} G \neq \emptyset$ and $\operatorname{int}\operatorname{dom} F^* \neq \emptyset$, we may find a point $u' \in (\operatorname{int}\operatorname{dom} G) \times (\operatorname{int}\operatorname{dom} F^*)$ arbitrarily close to $\hat u$. Minding that we have shown (3.20f), choosing $\epsilon > 0$ small enough and setting $w = 0$, resp. $w = -\nu^i$, shows the existence of $\hat u^i$, resp. $\widetilde u^i$. In fact, we have the existence of a minimiser $\widetilde u^i$ to (3.19). This follows simply from $v \mapsto \|\widetilde u^i - v\|$ being level bounded, $\partial G$ and $\partial F^*$ being outer semicontinuous set-valued mappings, and $K_{\bar x}$ being a continuous linear operator.
We may now move on to the proof of (3.20c). By Lemma 3.9, for any $\ell^* > |H_{\hat x}^{-1}|$, the estimate (3.16) applies, provided that $\|u^i - \hat u\| \le \bar\delta$ and $\|\nu^i\| \le \bar\rho$. The former follows from (3.18b) and taking $\delta$ small enough. The latter follows from (3.24), (3.27) and choosing $\epsilon > 0$ small enough. In fact, taking $\epsilon > 0$ small enough relative to $\ell^*$ and $\bar\rho$, we also obtain the corresponding bound on $\|\hat u^i - \widetilde u^i\|$. This completes the proof of (3.20c).
Finally, to show (3.20d) and (3.20e), we simply bound the relevant quantities by the triangle inequality. Then we use (3.18) and (3.20b)–(3.20c), assuming that these estimates hold to the higher accuracy $\epsilon/2$ instead of $\epsilon$. By the arguments above, this can be done by making $\delta > 0$ small enough.

Removing squares
With the Aubin property assumed, we are now able to remove the squares from ( D 2 -M). Later in Section 4 we will prove the Aubin property for important classes of G, F * and K.
Proof. Choosing $\delta > 0$ small enough and applying Lemma 3.9, we have (3.28) for some $\rho', \delta' > 0$. Lemma 3.10 provides a further upper bound on $\delta$ such that if (3.18) holds with such a $\delta$, then (3.29) holds, as can be seen from the strictly convex problems (3.22), (3.25) for calculating $u^{i+1}$. Thus $H_{x^i}^{-1}(w)$ is single-valued. By (3.29a), (A-K) and Lemma 3.1, we have $M_{x^i} \le \Theta^2 I$. Therefore, using (3.29b) and (3.29c), and minding (3.29d), we may apply (3.28) to obtain an unsquared estimate. Squaring this and applying it to ( D 2 -M), we get ( D-M) for the stated factor $\lambda$. By application of ( D 2 -loc), minding that $\lambda > 1$, we also obtain the final bound.

Bridging local solutions
In order to finalise the proof of ( D), we will now use the perturbed local linearised solution u i to bridge between the local linearised solutions u i and u i+1 . For this we again need the Aubin property of H −1 u .
Lemma 3.12. Let $\kappa$ and $L_2$ be as in (3.1). Assume that (A-D i ) and (A-K) hold, and that $H_{\hat x}^{-1}$ has the Aubin property at $0$ for $\hat u$. Suppose, moreover, that ( D-M) holds for any admissible choice of the constants involved. Under these conditions there exist $\delta > 0$ and $\zeta > 0$ such that the descent inequality ( D) holds.
We still have to show (3.35). Using Lemma 3.10, with $\delta > 0$ small enough, we may take the bounds (3.20) to hold. We pick $\ell^* > |H_{\hat x}^{-1}|$. Again with $\delta > 0$ small enough, an application of Lemma 3.9 yields, for some $\rho', \delta' > 0$, the estimate (3.36), provided that $\|u^i - \hat u\| \le \delta'$ and $\|\nu^i\| \le \rho'$. These conditions are guaranteed by Lemma 3.10 and choosing $\epsilon, \delta > 0$ small enough.
With $\epsilon, \delta > 0$ small, applying (3.18) and (3.20b), we can also make $\|\hat u - u^{i+1}\|$ small enough that another application of Lemma 3.9 guarantees the existence of $\hat u^{i+1}$. We may assume that $\rho', \delta' > 0$ are the same as in (3.36). With $w = 0$, $v = u^{i+1}$ and $u = \hat u^i$, we obtain the corresponding estimate. This condition was already verified for (3.37).
To start approximating $E$ and $\nu^i$, we use the Lipschitz estimates on $E$ derived above. Inserting these estimates back into (3.39), it follows for some constant $C > 0$ that (3.40) holds. Using (A-D i ), we can for any $\epsilon > 0$ find $\delta > 0$ such that $\|\nu^i\|$ is suitably small; the condition $\|u^i - u^{i+1}\| < \delta$ can be guaranteed through Lemma 3.10 and choosing $\delta > 0$ small enough. Thus, using (3.37), (3.38) and (3.40), we obtain the desired bound. In order to prove (3.35), we thus need to force $\eta := \ell^*\kappa L_2(A + \epsilon) < \zeta/2$. As $\epsilon > 0$ and $\ell^* > |H_{\hat x}^{-1}|$ were arbitrary, it suffices to show the corresponding bound for $|H_{\hat x}^{-1}|\kappa L_2 A$; by Lemma 3.10, this holds for $\delta > 0$ small enough. Thus (3.35) holds, and we may conclude the proof. Lemma 3.13. Suppose, moreover, that $0 \in H_{\hat x}(\hat u)$ and that $H_{\hat x}^{-1}$ has the Aubin property at $0$ for $\hat u$. In this case, given $\epsilon > 0$, there exists $\delta_1 > 0$, independent of $k$, such that the following holds if the initialisation is close enough to $\hat u$.
The following two theorems form our main convergence result.
Choosing $\ell^* > |H_{\hat x}^{-1}|$ close enough to $|H_{\hat x}^{-1}|$, (3.46) then guarantees (3.33). Inductively assuming that ( D) holds for $i = 1, \ldots, k-1$, we use Lemma 3.13 to show that the iterates remain close to $\hat u$, provided $\delta_1 > 0$ is small enough. We then use Lemma 3.6 to show that the squared local norm descent inequality ( D 2 -M) holds. Next we apply Lemma 3.11 to show that the unsquared local descent inequality ( D-M) holds. Finally, we employ Lemma 3.12 to derive ( D) for $i = k$. As $k$ was arbitrary, we may conclude the proof.
Theorem 3.2. Suppose the conditions of Theorem 3.1 hold for u = ( x, y) and u 1 = (x 1 , y 1 ). In that case there exists δ 1 > 0 such that provided the initialisation u 1 satisfies u 1 − u ≤ δ 1 , then the iterates (x i , y i ) produced by Algorithm 2.1 or Algorithm 2.2 converge to a solution u * = (x * , y * ) of (2.1), i.e., a critical point of the problem (P).
Proof. We pick δ 1 > 0 small enough that (3.47) is satisfied, and that Lemma 3.13 guarantees the assumption (3.5) of Lemma 3.5. Then Theorem 3.1 and Lemma 3.5 verify the assumptions of the general convergence result Theorem 2.1, from which the claim follows.

Some remarks
Remark 3.2 (Convergence to another solution). In principle the solution $u^*$ discovered in Theorem 3.2 may be distinct from $\hat u$, while also solving (2.1).

Remark 3.3 (Small non-linear dual). The condition (3.46) forces $P_{NL}\hat y$ to be small. As we will further discuss in Section 4.3, in the applications that we are primarily interested in, involving solving $\min_x \|f - T(x)\|^2/2 + \alpha R(x)$, this corresponds to $f - T(\hat x)$ being small. This says that the noise level of the data $f$ and the regularisation parameter $\alpha$ have to be low.
Since this involves significant additional technical detail that complicates the already very technical proof, we have opted not to include this generalisation.
Remark 3.5 (Switch of local metric). The shift to the new local metric $M_{x^{i+1}}$, done in Lemma 3.6 using the strong convexity of $F^*$ on $Y_{NL}$, can also be done similarly to the removal of squares in Lemma 3.11. This suggests that the strong convexity might not be necessary. In practice we however need the strong convexity for the Lipschitz estimates, so there is little benefit in avoiding it. Moreover, the required strong convexity exists in case of regularisation problems of the form discussed in Remark 3.3. As we are primarily interested in applying the method to such problems, assuming strong convexity is natural.
Remark 3.6 (Varying step lengths). Strictly speaking, we do not need the whole force of (A-K) for the bound (C-M). We can allow $\sigma, \tau$ to vary on each iteration, and can even allow preconditioning operators in place of simple step length parameters, cf. [24]. Indeed, with $\sigma_i, \tau_i$ dependent on $u^i$, we obtain an analogous bound. If now $\epsilon, \theta > 0$ are such that $\|K_{x^i}\|^2 > \theta^2/2 > 0$, we see that it suffices to have (3.48). This is the case with the operators in Section 5. If (3.49) is satisfied, to obtain (C-M) it thus suffices to choose $\sigma_i$ and $\tau_i$ such that (3.48) holds for fixed $\epsilon > 0$, and to choose $\theta$ accordingly. This alone is however not enough to show convergence of the method with varying step lengths. There is one further difficulty in the switch of local norms from $\|\cdot\|_{M_{x^i}}$ to $\|\cdot\|_{M_{x^{i+1}}}$ in Lemma 3.6. Specifically, the first equality in (3.12) does not hold. Using $\sigma = \sigma_i$ and $\tau = \tau_i$ in the definition of $M_{x^i}$ for each $i$, and calculating similarly to (3.12), we can however prove (3.50). Since $x \mapsto K_x$ is Lipschitz close to $\hat x$, if we start close enough to $\hat u$ that $\|x^i - x^{i+1}\|$ stays small, (3.50) can be made to hold for good choices of $\tau_i, \sigma_i$. Indeed, let us choose $\epsilon, \tau_0, \sigma_0 > 0$ such that $\tau_0\sigma_0 < 1/(1+\epsilon)$; then the step lengths given by (3.51) are decreasing, and (3.50) amounts to a condition that, by the above considerations, holds if we initialise $u^1$ close enough to $\hat u$.
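One hedged way to realise such iteration-dependent steps, in the spirit of this remark and of the choice announced after Algorithm 2.1, is to scale both $\tau_i$ and $\sigma_i$ by the local operator norm $\|\nabla K(x^i)\|$, so that $\tau_i\sigma_i\|K_{x^i}\|^2$ stays at a fixed value below 1 while the ratio $\sigma_i/\tau_i$ is unaltered. The particular scaling below is our own illustrative construction.

```python
import numpy as np

def local_steps(dK_x, prod=0.9, ratio=1.0):
    """Choose tau_i, sigma_i from the local Jacobian dK_x so that
    tau_i * sigma_i * ||grad K(x^i)||^2 = prod < 1, with sigma_i/tau_i = ratio
    kept fixed across iterations."""
    L = np.linalg.norm(dK_x, 2)            # spectral norm of the linearisation
    tau = np.sqrt(prod / ratio) / L
    sigma = ratio * tau
    return tau, sigma

J = np.array([[3.0, 0.0], [0.0, 4.0]])     # example Jacobian, spectral norm 4
tau, sigma = local_steps(J, prod=0.9, ratio=1.0)
print(tau * sigma * np.linalg.norm(J, 2) ** 2)  # 0.9 < 1 by construction
```

As the remark notes, such a local bound alone does not prove convergence; the decreasing-step safeguard (3.51) is still needed in the analysis.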

Lipschitz estimates
We now need to show the Aubin property of the inverse $H_{x^i}^{-1}$ of the set-valued map $H_{x^i}$, and to bound $P_{NL}\hat y$. We will calculate $|H_{x^i}^{-1}|(0|\hat u)$ through the Mordukhovich criterion, which brings us to the topic of graphical differentiation of set-valued maps. We will introduce the necessary tools in Section 4.1, following [27]; another treatment, also covering the infinite-dimensional case, can be found in [20]. Afterwards we derive bounds on $|H_{x^i}^{-1}|(0|\hat u)$ for a general class of maps in Section 4.2. These will then be used in the following Section 4.3 and Section 4.4 to study important special cases. These include in particular total variation (TV) and total generalised variation (TGV$^2$) [5] regularisation.

Differentials of set-valued maps
Let $S : X \rightrightarrows Y$ be a set-valued map between finite-dimensional Hilbert spaces $X, Y$. The graph of $S$ is $\operatorname{Graph} S := \{(x, y) \mid y \in S(x)\}$.
The outer limit at $x$ is defined as the set of limits of all converging selections, and the inner limit analogously with all selections required to converge. Pick $x \in X$ and $y \in S(x)$. The graphical derivative of $S$ at $x$ for $y$, denoted $DS(x|y) : X \rightrightarrows Y$, is defined through the tangent cone construction: geometrically, $\operatorname{Graph} DS(x|y)$ is the tangent cone to $\operatorname{Graph} S$ at $(x, y)$. If $S$ is single-valued and differentiable, then $DS(x|y) = \nabla S(x)$ for $y = S(x)$. There are also various other definitions of differentials for set-valued maps. In particular, the regular derivative of $S$ at $x$ for $y$, denoted $\widetilde DS(x|y) : X \rightrightarrows Y$, is defined through the corresponding regular tangent cone. The importance of the regular derivative to us lies in the following. The map $S$ is said to be graphically regular at $(x, y)$ if $DS(x|y) = \widetilde DS(x|y)$. We stress that this correspondence does not hold generally. If it does, we may express the coderivative $D^*S(x|y)$ through the adjoint relationship: geometrically, $\operatorname{Graph} D^*S(x|y)$ is a normal cone to $\operatorname{Graph} S$ at $(x, y)$, rotated such that it becomes adjoint to $DS(x|y)$ in the sense of (4.1). Without graphical regularity, the coderivative has to be defined through other means [27]; we will however always assume graphical regularity.
In our forthcoming analysis, we will occasionally employ the tangent and normal cones to a convex set A at y. These are denoted T_A(y) and N_A(y), respectively. We want to translate this result so that it is stated in terms of DS(x|y) instead of the coderivative of S⁻¹, since the graphical derivative is easier to obtain than the coderivative. Proposition 4.1. Suppose S is graphically regular and Graph S is locally closed at (x, y). By the symmetry of graphical differentials for S and S⁻¹, the conclusion is the same as ⟨z, q⟩ ≤ ⟨w, v⟩ whenever (q, v) satisfy q ∈ DS(x|y)(v).

Bounds on Lipschitz factors
We now want to estimate the local Lipschitz factor of H⁻¹_x at 0 for û. We apply Proposition 4.1, assuming that H_x is graphically regular and Graph H_x is locally closed at (û, 0). Then, writing u = (x, y) and v = (ξ, ν), we have (4.5) and (4.6). Suppose there exist self-adjoint linear maps G : X → X and F : Y → Y and (possibly trivial) subspaces V_G ⊂ X and V_F ⊂ Y with the properties required below. With P_V the orthogonal projection operator onto V, this allows us to derive the approximation (4.7). In the following lemma, we show that c > 0. To do so, we have to assume various forms of boundedness of the involved operators. We introduce the operator Ξ as a way to make the estimate hold for a range of regularisation parameters; the details of the procedure will follow the lemma, in Section 4.3.
Lemma 4.1. Let Ξ : X × Y → X × Y be a self-adjoint positive definite linear operator, Ξ(x, y) = (Ξ_G(x), Ξ_F(y)). Suppose that Ξ_G commutes with G and P_{V_G}, that Ξ_F commutes with F and P_{V_F}, and that one of the following conditions holds.
i) G ≥ γ̄Ξ_G and F ≥ γ̄Ξ_F for some γ̄ > 0. ii) G = 0, V_G = X, Ξ_G = I, and MΞ_F ≥ F ≥ γ̄Ξ_F for some M, γ̄ > 0, as well as (4.8). Then there exists a constant c = c(M, γ̄, a) such that (4.9) holds. Proof. We consider point (i) first. We write G_V = G_γ̄ + γ̄Ξ_{V,G} and F_V = F_γ̄ + γ̄Ξ_{V,F} for Ξ_{V,F} := Ξ_F P_{V_F} and Ξ_{V,G} := Ξ_G P_{V_G}. Observe that the self-adjointness and positivity of Ξ_{V,G} and the commutativity with G yield ⟨Ξ_{V,G}ζ, G_γ̄ζ⟩ ≥ 0, and analogously for the terms involving F. This shows that (4.9) holds with c = γ̄ in case (i).

Regularisation functionals with L 1 -type norms
Writing Y = Y_NL × Y_L and y = (λ, ϕ), we now restrict our attention to the case where F* and K are defined by (4.11), with F*_L,α as in (4.12) for some γ ≥ 0 and α > 0. Here the B(0, α) ⊂ R^{m_j} are closed Euclidean balls of radius α. We assume that T ∈ C²(X; Y_NL), and that the functionals G ∈ C²(X) and F*_NL ∈ C²(Y_NL) are strongly convex; we do, however, also allow G = 0.

Example 4.1 (Total variation, TV). Denoting by ∇_d a discrete gradient operator, we define the discrete total variation in the usual way; we may then write the regularisation term in the form (4.13). A non-zero γ > 0 in (4.12) corresponds to Huber-regularisation of the L¹ norm, that is, to the Huberised functional.

Example 4.2 (Second-order total generalised variation, TGV²). Let us now replace TV by TGV² in (4.13). This regularisation functional was introduced in [5] as a higher-order extension of TV that tends to avoid the stair-casing effect while still preserving edges. Parametrised by α = (β, α), it can, according to [7] (see also [6]), be written as a differentiation cascade with E the symmetrised gradient. The parameter α is the conventional regularisation parameter, whereas the ratio β/α controls the smoothness of the solution. With a large ratio, one obtains results similar to TV, but as the ratio becomes smaller, TGV² better reconstructs smooth features of the image.
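As a concrete illustration of TV and its Huber-regularised variant in Example 4.1, here is a small sketch under one common parameterisation of the Huber function (the exact constants in (4.12) may differ; the function names are ours):

```python
import numpy as np

def grad_d(u):
    """Forward-difference discrete gradient with Neumann boundary conditions."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def tv_huber(u, gamma=0.0):
    """Sum of pointwise gradient magnitudes; gamma = 0 gives plain TV,
    gamma > 0 replaces |t| by t^2/(2*gamma) for |t| <= gamma,
    and by |t| - gamma/2 otherwise (Huber regularisation)."""
    gx, gy = grad_d(u)
    m = np.sqrt(gx**2 + gy**2)  # pointwise Euclidean norm of the gradient
    if gamma <= 0:
        return m.sum()
    return np.where(m <= gamma, m**2 / (2 * gamma), m - gamma / 2).sum()
```

Note that the Huberised value never exceeds the plain TV value, and both vanish on constant images, matching the "no flat areas" discussion in Remark 4.2.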
In the discrete setting, on the two-dimensional domain Ω_d = {1, . . . , n₁} × {1, . . . , n₂}, we may write (4.14), where the operator K_L is defined by another discrete gradient operator. Instead of placing the ratio β/α in front of E_d, we could remove it and set the radii in (4.14) for ψ_{i,j} to β. The present approach, however, makes the forthcoming analysis simpler.

(4.15)
Moreover, ∂f is graphically regular and Graph ∂f is locally closed at (y, v) for v ∈ ∂f(y) whenever either ‖v‖ > 0 or ‖y‖ < α.
Remark 4.1. The condition in the final statement can be seen as a form of strict complementarity, commonly found in the context of primal-dual interior point methods [33].
Proof. We first show graphical regularity and local closedness assuming (4.15). Indeed, local closedness of Graph ∂f is a direct consequence of f being convex and lower semicontinuous [25]. With regard to graphical regularity, let (y^i, v^i) → (y, v). Then, for large enough i, ‖v^i‖ > 0. This forces ‖y^i‖ = α, because v^i ∈ ∂f(y^i) = {0} if ‖y^i‖ < α. Consequently we get from (4.15) the expression for the graphical derivative for any large enough index i and any w^i with ⟨y^i, w^i⟩ = 0. Choosing w with ⟨y, w⟩ = 0, and z ∈ ‖v‖w/α + Ry, we can find w^i → w and z^i → z with ⟨y^i, w^i⟩ = 0 and z^i ∈ D(∂f)(y^i|v^i)(w^i). The inclusion in the other direction is obvious from the definitions. This proves that D̂(∂f)(y|v) = D(∂f)(y|v), i.e., graphical regularity, in the case ‖v‖ > 0. The case ‖y‖ < α is trivial, because ∂f(y^i) = {0} for any y^i close enough to y.
Let us now prove (4.15). We do this by calculating the second-order subgradient d²f(y|v). In the present situation, writing C := B(0, α) = {y ∈ R^m | g(y) ∈ D}, D := (−∞, α²/2], g(y) = ‖y‖²/2, the latter is given by [27, 13.17] in terms of max_{x∈X(y,v)} ⟨w, x∇²g(y)w⟩, (4.16), provided the following constraint qualification is satisfied: x ∈ N_D(g(y)), x∇g(y) = 0 =⇒ x = 0. (4.17) Here N_D(g(y)) is the normal cone to D at g(y), and X(y, v) the associated multiplier set. The constraint qualification (4.17) is trivially satisfied: if 0 ≠ x ∈ N_D(g(y)), then necessarily ‖y‖ = α, so that x∇g(y) = xy ≠ 0. We may thus proceed to expand (4.16). We find for v ∈ N_C(y) that X(y, v) = {‖v‖/α}. Thus for v ∈ N_C(y) we have max_{x∈X(y,v)} ⟨w, x∇²g(y)w⟩ = ‖v‖‖w‖²/α. It follows from (4.16) that (4.18) holds. We still have to calculate D(∂f)(y|v). Since f is proper, convex, and lower semicontinuous, it is also prox-regular and subdifferentiably continuous [27, 13.30] in the senses defined in [27], which we introduce here by name just for the sake of binding various results from that book rigorously together. By [27, 13.17], f is also twice epi-differentiable. It therefore follows from the corresponding results of [27, Chapter 13] that D(∂f)(y|v) can be recovered from d²f(y|v). Applying this to (4.18), we easily calculate (4.15).

Lemma 4.3. Suppose (4.19) holds. Then ∂F*_L,α is graphically regular and Graph ∂F*_L,α is locally closed at (ϕ, v), and D(∂F*_L,α)(ϕ|v) has the corresponding expression.

Proof. We observe that ∂F*_L,α is graphically regular and its graph locally closed if ∂(f_j + g_j) satisfies the same for each j = 1, . . . , N. By sum rules for graphical differentiation [27, 10.43], we obtain the corresponding sum expression. Moreover, since g is twice continuously differentiable, ∂(f_j + g_j) is graphically regular if ∂f_j is. By Lemma 4.2, this is the case if ‖ϕ_j‖ < α, or ‖v_j − ∇g_j(ϕ_j)‖ > 0. Minding that ∇g_j(ϕ_j) = γϕ_j/α, and referring to Lemma 4.2 once again for the expression of D(∂f_j)(ϕ_j|v_j − γϕ_j/α)(w_j), the claim of the present lemma follows.
We now have the necessary results to bound H −1 x (0| u).
Proposition 4.2. Suppose F*_NL is twice continuously differentiable and strongly convex, and that G = 0 or G is twice continuously differentiable and strongly convex. Define F* and K by (4.11) for some α > 0. Suppose 0 ∈ H_x̂(û). If ϕ̂ = (ϕ̂₁, . . . , ϕ̂_N) and x̂ satisfy for some a, b > 0 the conditions (4.20) and (4.21), then H⁻¹_x̂ has the Aubin property at 0 for û. In particular, given ᾱ > α, there exists a constant c = c(a, b + γ, R, ᾱ) > 0 such that (4.22) holds. Proof. We calculate D(∂F*_L,α)(ϕ̂|K_L x̂) using Lemma 4.3, whose condition (4.19) at v = K_L x̂ is guaranteed by (4.20). Then we use (4.5) and (4.6) to obtain the desired expression. Here A_L,α and V_L,α are given by Lemma 4.2. The lemma together with (4.20) also shows that ∂F_L is graphically regular and Graph ∂F_L is locally closed at (ϕ̂, K_L x̂). From our standing assumptions, both F* and G are convex and lower semicontinuous. Therefore Graph ∂F* and Graph ∂G are locally closed. Moreover, ∂F*_NL and ∂G are graphically regular, F*_NL and G being twice continuously differentiable. It follows that H_x̂ is graphically regular and Graph H_x̂ is locally closed at (0, û).
The operator Ξ_{α,F} commutes with F and P_{V_F}. Moreover, using (4.20) and the strong convexity of F*_NL, we see that there exist M = M(R) > 0 and γ̄ = γ̄(ᾱ, b + γ) such that the conditions of Lemma 4.1 hold. The dependence of M on R comes through sup_{‖λ‖≤R} ‖∇²F*_NL(λ)‖. Since commutativity with G = 0 is automatic, and (4.21) guarantees (4.8), we may therefore apply Lemma 4.1 to derive the bound with constant c = c(M, γ̄, a) = c(a, b + γ, R, ᾱ). Recalling (4.7), this proves (4.22).

Remark 4.2 (Huber regularisation). Condition (4.20) is difficult to satisfy without Huber regularisation. Indeed, with γ = 0, the condition becomes ‖[K_L x̂]_j‖ > 0 for each j = 1, . . . , N. In the case of total variation regularisation in Example 4.1, this says that we cannot have [∇_d x̂]_j = 0; there can be no flat areas in the image x̂. This kind of requirement is, however, almost to be expected: the dual variable ϕ̂_j, solving max_{ϕ_j∈B(0,α)} ⟨[K_L x̂]_j, ϕ_j⟩, is not uniquely defined when [K_L x̂]_j = 0. Any small perturbation of x̂ can send it anywhere on the boundary ∂B(0, α). This oscillation is avoided by Huber regularisation, i.e., γ > 0. In this case the optimal ϕ_j for x̂ solves a strictly concave problem over B(0, α). Even with γ > 0, we do, however, have a problem when ‖[K_L x̂]_j‖ = γ: the solution is not necessarily strictly complementary in the sense of (4.19). A way to avoid this theoretical problem would be to replace the squared 2-norm cost in (4.12) by a barrier function of B(0, α). This would, however, cause the resolvent (I + σ∂F*)⁻¹(y) to become very expensive to calculate. We therefore do not advise this in practice.
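In implementations, the dual resolvent (I + σ∂F*_L,α)⁻¹ is cheap precisely because of the ball structure: assuming the pointwise form F*_L,α(ϕ_j) = δ_{B(0,α)}(ϕ_j) + (γ/(2α))‖ϕ_j‖² suggested by (4.12), it reduces to a scaling followed by a projection onto B(0, α). A sketch under that assumption (naming ours):

```python
import numpy as np

def resolvent_dual(z, alpha, gamma, sigma):
    """(I + sigma*dF*)^{-1}(z) for F*(y) = delta_{B(0,alpha)}(y)
    + (gamma/(2*alpha))*||y||^2, applied pointwise to the vectors z_j
    stored along the last axis: shrink by the Huber quadratic, then
    project onto the ball B(0, alpha)."""
    y = z / (1.0 + sigma * gamma / alpha)  # quadratic (Huber) part
    nrm = np.linalg.norm(y, axis=-1, keepdims=True)
    scale = np.minimum(1.0, alpha / np.maximum(nrm, 1e-12))
    return y * scale  # projection onto B(0, alpha)
```

With γ = 0 this is the plain projection; points already inside the ball are left untouched, which is what makes the non-uniqueness at [K_L x̂]_j = 0 visible numerically.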

Squared L 2 cost functional with L 1 -type regularisation
We now make further assumptions on our problem, and limit ourselves to reformulations of the problem (4.23). Here we assume that T ∈ C²(X; R^m), and that the regularisation term αR(x), for some linear operator K_L : X → Y_L and F*_L,α as in (4.12), can be written in the form αR(x) = max_ϕ ⟨K_L x, ϕ⟩ − F*_L,α(ϕ). This covers in particular Example 4.1 and Example 4.2.
Setting F*_NL(λ) := ½‖λ‖² + ⟨f, λ⟩, observe that F*_NL is strongly convex. Thus, with y = (λ, ϕ), we may reformulate (4.23) in the saddle point form (4.24). The optimality conditions (2.1) presently expand to û_α = (x̂_α, λ̂_α, ϕ̂_α) satisfying (4.25). Since Proposition 4.2 only shows the Aubin property of H⁻¹_{x̂_α} without any guarantees of smallness of the Lipschitz factor, we have to make ‖P_NL ŷ_α‖ small in order to satisfy (3.46) for Theorem 3.2. As P_NL ŷ_α = λ̂_α, a solution to (4.25) necessarily satisfies λ̂_α = T(x̂_α) − f. We will discuss how to make this small after the next proposition, which rewrites Theorem 3.2 for the present setting. Suppose the conditions of Proposition 4.2 hold, as well as the non-degeneracy condition. Suppose, moreover, that K and the step lengths τ, σ > 0 satisfy (A-K). Pick ᾱ > α, and let c = c(a, b + γ, R, ᾱ) be given by Proposition 4.2. Then there exists δ₁ > 0 such that Algorithm 2.1 applied to the saddle point form (4.24) of (4.23) converges provided (4.28) holds. Proof. We verify the assumptions of Theorem 3.2. By Lemma 3.2, it only remains to prove (3.46). Verifying this is easy if ∇T(x̂_α) has full range, but otherwise it can be quite unwieldy thanks to the projection P_{V_{L,α}(ϕ̂_α)}. Unfortunately, we have found no way to avoid this condition or (4.28).
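A natural choice consistent with the strong convexity asserted here is F*_NL(λ) = ½‖λ‖² + ⟨f, λ⟩ (our reconstruction); the conjugacy can be checked directly:

```latex
\max_{\lambda}\Big(\langle T(x),\lambda\rangle - \tfrac12\|\lambda\|^2
  - \langle f,\lambda\rangle\Big) = \tfrac12\|T(x)-f\|^2,
\qquad \text{attained at } \lambda = T(x) - f,
```

so that the fidelity ½‖f − T(x)‖² is recovered, and the maximiser matches the identity P_NL ŷ_α = λ̂_α used in the surrounding discussion.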
We conclude our theoretical study with a simple exemplary result on the satisfaction of (4.28). The problem with simply letting α ↘ 0 in order to get ‖f − T(x̂_α)‖ → 0 is that the constant c might blow up, depending on û_α through a and b. We therefore need to study the uniform satisfaction of these conditions. In order to keep the present paper at a reasonable length, we limit ourselves to a very simple result that assumes the existence of a convergent sequence {û_α} of solutions to (4.25) as α ↘ 0. The entire topic of the existence of such a sequence merits an independent study involving set-valued implicit function theorems (e.g., [9]) and source conditions on the data (cf., e.g., [28]). Besides the existence of the minimising sequence, we assume the existence of x* such that f = T(x*). It is not difficult to formulate and prove analogous results for the noisy case, where we only have the bound inf_x ‖f − T(x)‖ ≤ σ for small σ.

Applications and computational experience
With convergence theoretically studied, and total variation type regularisation problems reformulated into the minimax form, we now move on to studying the numerical performance of NL-PDHGM. We do this by applying the method to two problems from magnetic resonance imaging: velocity imaging in Section 5.1, and diffusion tensor imaging in Section 5.2.
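To fix ideas before the experiments, the following is a minimal sketch of an exact NL-PDHGM-style iteration on a toy separable problem min_x ½‖f − T(x)‖² + α‖x‖₁ with T(x) = x² componentwise. The ordering (primal step, over-relaxation with θ = 1, dual step) and the dynamic estimate of L follow the description in this section, but the code is our illustration of the general pattern, not the paper's exact algorithm statement:

```python
import numpy as np

def nl_pdhgm(f, alpha=0.01, iters=10000):
    """Sketch of an exact NL-PDHGM-type iteration for
    min_x 0.5*||f - T(x)||^2 + alpha*||x||_1 with T(x) = x**2 componentwise.

    Saddle-point form: min_x max_y <T(x), y> - F*(y) + G(x), where
    F*(y) = 0.5*||y||^2 + <f, y> and G(x) = alpha*||x||_1."""
    x = np.ones_like(f)
    y = np.zeros_like(f)
    for _ in range(iters):
        L = max(np.max(2.0 * np.abs(x)), 1e-8)  # dynamic estimate of ||grad K(x)||
        tau = sigma = 0.95 / L                  # so that sigma*tau*L^2 < 1
        # Primal step: x_new = prox_{tau G}(x - tau * grad K(x)^* y).
        grad = 2.0 * x * y                      # grad T(x)^* y = diag(2x) y
        z = x - tau * grad
        x_new = np.sign(z) * np.maximum(np.abs(z) - tau * alpha, 0.0)
        xbar = 2.0 * x_new - x                  # over-relaxation, theta = 1
        # Dual step: y = (I + sigma dF*)^{-1}(y + sigma * K(xbar)).
        y = (y + sigma * (xbar**2 - f)) / (1.0 + sigma)
        x = x_new
    return x
```

Only local convergence is to be expected, in line with Theorem 3.2: the sketch is initialised near a solution with positive components, and the dynamic update of L mirrors the step length rule used in the experiments below.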

Phase reconstruction for velocity-encoded MRI
As our first application, we consider the phase reconstruction problem (1.3) for magnetic resonance velocity imaging. We are given complex sub-sampled k-space data f = SFu* + ν ∈ C^m corrupted by noise ν. Here S is the sparse sampling operator, F the discrete two-dimensional Fourier transform, and u* ∈ L¹(Ω_d; C) the noise-free complex image in the discrete spatial domain Ω_d = {0, . . . , n − 1}². We seek to find a complex image u = r exp(iϕ) approximating u*. Motivated by the discoveries in [3], we wish to regularise r and ϕ separately. Introducing non-linearities into the reconstruction problem, we therefore define the forward operator T(r, ϕ) := SF[x ↦ r(x) exp(iϕ(x))]. For the phase ϕ, we choose to use second-order total generalised variation regularisation, and for the magnitude r, total variation regularisation. Choosing regularisation parameters α_r, α_ϕ, β_ϕ > 0 appropriate for the data at hand, we then seek to solve the resulting problem (5.1). For our experiments, due to trouble obtaining real data, we use a simple synthetic phantom on Ω = [−1, 1]², depicted in Figure 1a. It simulates the speed along the y axis of a fluid rotating in a ring. Specifically, r(x, y) = χ_{0.3 < √(x²+y²) < 0.9}(x, y), and ϕ(x, y) = x/√(x² + y²).
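For concreteness, a minimal sketch of this forward operator and of the backprojection used below for initialisation (a plain binary mask stands in for the paper's Gaussian sampling pattern; the function names are ours):

```python
import numpy as np

def fwd(r, phi, mask):
    """T(r, phi) = S F (r exp(i phi)): masked 2D FFT of the complex image."""
    return mask * np.fft.fft2(r * np.exp(1j * phi))

def backproject(f, mask):
    """Zero-filling backprojection: keep the sampled frequencies,
    set the rest to zero, and invert the FFT."""
    return np.fft.ifft2(mask * f)
```

With full sampling (mask of ones) the backprojection recovers the image exactly; with 15% coverage, as in the experiment, it only provides the rough initial iterate shown in Figure 1b.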
The image is discretised on an n × n grid with n = 256. To the discrete Fourier-transformed k-space image, we add pointwise Gaussian noise of standard deviation σ = 0.2. For the sub-sampling operator S, we choose a centrally distributed Gaussian sampling pattern with variance 0.15 · 128 and 15% coverage of the n × n image in k-space. For the regularisation parameters, which we do not claim to have chosen optimally in the present algorithmic paper, we choose α_ϕ = 0.15 · (2/n), β_ϕ = 0.20 · (2/n)², and α_r = 2/n. (The factors 2/n are related to the spatial step size 1/n of the discrete differential on the [−1, 1]² domain. When the step size is omitted, i.e., implicitly taken as 1, the factors disappear.) We perform computations with Algorithm 2.1 (Exact NL-PDHGM), Algorithm 2.2 (Linearised NL-PDHGM), and the Gauss-Newton method. As we recall, the latter is based on linearising the non-linear operator T at the current iterate, solving the resulting convex problem, and repeating until convergence is, hopefully, attained. We solve each of the inner convex problems by the PDHGM (2.3). We initialise each method with the backprojection (SF)*f of the noisy sub-sampled data f. We use two different choices for the PDHGM step length parameters σ and τ. The first one has equal primal and dual parameters, σ = τ = 0.95/L. Recalling Remark 3.6, we update L = sup_{k=1,...,i} ‖∇K(x^k)‖ dynamically for NL-PDHGM; for linear PDHGM, used within Gauss-Newton, this simply reduces to L = ‖K‖. This choice of τ and σ is somewhat naïve, and it has been observed that choosing σ and τ in different proportions can sometimes improve the performance of PDHGM for linear operators [24]. Therefore, as our second step length parameter choice, we use τ = 0.5/L and σ = 1.9/L. We limit the number of PDHGM iterations (within each Gauss-Newton iteration) to 100000, and the number of Gauss-Newton iterations to 100. As the primary stopping criterion for the NL-PDHGM methods, we use ‖x^i − x^{i+1}‖ < ρ for ρ = 1e-4.
The same criterion is used to stop the outer Gauss-Newton iterations. As the stopping criterion of PDHGM within each Gauss-Newton iteration, we use the decrease of the pseudo-duality gap to less than ρ₂ = ρ/10; higher accuracy than this would penalise the computational times of the Gauss-Newton method too much in comparison to NL-PDHGM. Much lower accuracy would almost reduce it to NL-PDHGM, if convergence were observed at all; we will return to this in our next application.
The pseudo-duality gap is discussed in detail in [30]. We use it to work around the fact that in reformulations of the linearised problem into the form (P), G = 0. This causes the duality gap to be, in practice, infinite. This could be avoided if we could make G non-zero, for example by taking G(x) = δ_{B(0,M)}(x) for M a bound on ‖x‖. Often the primal variable x can indeed be shown to be bounded; the problem is that the exact bound is not known. The idea of the pseudo-duality gap, therefore, is to update the bound M dynamically. Assuming M large enough, it does not affect the PDHGM itself; it only affects the duality gap, and is updated whenever the bound is violated. For the artificially dynamically bounded problem the duality gap is finite, and is called the pseudo-duality gap.
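To make the dynamic-bound idea concrete, here is a minimal sketch for the toy problem min_x ½‖Kx − f‖², where G = δ_{B(0,M)} gives the conjugate G*(z) = M‖z‖; the naming and the simplifications are ours, not those of [30]:

```python
import numpy as np

def pseudo_duality_gap(x, y, K, f, M_state):
    """Pseudo-duality gap for min_x 0.5*||K x - f||^2 with the artificial
    bound G = delta_{B(0,M)}. The bound M (kept in the one-element list
    M_state) is grown whenever the current iterate violates it, so it
    never actively constrains the iteration itself."""
    M_state[0] = max(M_state[0], np.linalg.norm(x))  # update bound if violated
    M = M_state[0]
    # Primal value: G(x) = 0 since ||x|| <= M by construction.
    primal = 0.5 * np.linalg.norm(K @ x - f) ** 2
    # Dual value: G*(-K^T y) = M*||K^T y||, F*(y) = 0.5*||y||^2 + <f, y>.
    dual = M * np.linalg.norm(K.T @ y) + 0.5 * np.linalg.norm(y) ** 2 + f @ y
    return primal + dual
```

At a saddle point the gap vanishes, and away from it the gap is positive, so thresholding it against ρ₂ gives the stopping criterion used above.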
We perform the computations with OpenMP parallelisation on 6 cores of an Intel Xeon E5-2630 CPU at 2.3GHz, with 64GB (that is, enough!) of random access memory available. The results are displayed in Table 1 for the first choice of σ and τ, and in Table 2 for the second choice. In the tables we report the number of PDHGM and Gauss-Newton iterations taken, along with the computational time in seconds, as well as the PSNR of both the magnitude and the phase for the reconstructions. For the calculation of the PSNR of ϕ, we have only included the ring 0.3 < √(x²+y²) < 0.9, as the phase is meaningless outside this region, the magnitude being zero. The results of the second parameter choice are also visualised in Figure 1. It shows the original noise-free synthetic data, the backprojection of the noisy sub-sampled data, and the obtained reconstructions, which differ little between the methods, validating the results. Also, because there is no visible difference between Linearised and Exact NL-PDHGM, we only display the results for the latter.
The main observation from Table 1, with equal τ and σ, is that Exact NL-PDHGM, Algorithm 2.1, is significantly faster than Gauss-Newton, taking only around two minutes in comparison to almost an hour for Gauss-Newton. Gauss-Newton nevertheless appears to converge for the present problem, having taken 13 iterations to reach the stopping criterion. Another observation is that although Linearised NL-PDHGM, Algorithm 2.2, takes exactly as many iterations as Exact NL-PDHGM, it requires far more computational time, as the linearisation of K is more expensive to calculate than K itself. Observing Table 2, we see that choosing τ and σ in unequal proportion significantly improves the performance of NL-PDHGM, which now takes less than a minute. In the case of Gauss-Newton the improvement is merely marginal. Finally, studying Figure 1, we may verify that the method has converged to a reasonable solution.
A full and fair comparison of the non-linear model (5.1) against the linear model (1.2) is outside the scope of the present paper. In particular, considering that we employed in [3] Bregman iterations [23] for contrast enhancement, without the introduction of a contrast enhancement technique for the non-linear model as in [2], it is not fair to compare directly against the results therein. Nevertheless, in order to justify the non-linear model, we report a simple comparison in Table 3 and Figure 2. We computed solutions to the linear model for varying α in the interval [0.01, 3.0] · (2/n) (spacing 0.005 below α = 1.0 · (2/n), and 0.1 above), with β = 1.1α · (2/n), and chose the optimal α by three different criteria. These are Morozov's discrepancy principle [21], i.e., the largest α such that ‖f − SFu_α‖ is below the noise level, as well as the optimal PSNR for the amplitude r and for the phase ϕ. For the non-linear model, we chose the parameters manually. We find this reasonable, because for multi-dimensional parameters we do not at this time have available anything practical like the discrepancy principle, and a scan of the parameter space is not doable in practical applications. Moreover, in order to show that the non-linear model improves upon the linear model, it is only necessary to pick the parameters for the linear model optimally. With this in mind, we specifically picked α_r = 1 · (2/n), α_ϕ = 0.20 · (2/n), and β_ϕ = 1.1α_ϕ · (2/n) for the non-linear model. We also tightened the stopping threshold of NL-PDHGM to ρ = 1e-5 for better quality solutions. For the linear model, we set the pseudo-duality gap threshold to ρ₂ = 1e-4. This appears to be enough for the comparison, keeping in mind that the stopping criteria are not comparable. What we can draw from Table 3 is that for most criteria, the linear model fails to balance between PSNR(ϕ) and PSNR(r) as well as the non-linear model does.
With the discrepancy principle as the parameter choice criterion, and TV as the regulariser, the linear model, however, beats the non-linear model in terms of both PSNRs. Inspecting Figure 2c, we nevertheless observe extensive stair-casing in both the amplitude and the phase. This is avoided by the excellent reconstruction by the non-linear model in Figure 2b. With the higher α in Figure 2d, the phase reconstruction is good even by the linear model, but the amplitude has become smoothed out. This, too, is avoided by the non-linear model.

Diffusion tensor imaging
As a continuation of our earlier work on TGV² denoising of diffusion tensor MRI (DTI) [30, 31, 32] using linear models, we now consider an improved model. Here the purpose of the non-linear operator T is to model the so-called Stejskal-Tanner equation (5.2). Here v : Ω → Sym²(R³), Ω ⊂ R³, is a mapping to symmetric second-order tensors (representable as symmetric 3 × 3 matrices). Each v(x) models the covariance of a Gaussian probability distribution at x for the diffusion of water molecules. Each of the diffusion-weighted MRI measurements s_j (j = 1, . . . , N) is obtained with a different non-zero diffusion sensitising gradient b_j, while s_0 is obtained with a zero gradient. After correcting the original k-space data for coil sensitivities, each s_j is assumed real. As a consequence, s_j has in effect a Rician noise distribution [13].
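In its common form, which we assume here, the Stejskal-Tanner equation reads s_j(x) = s_0(x) exp(−⟨b_j, v(x)b_j⟩), with the b-value absorbed into the sensitising gradients. A minimal sketch of this forward model, together with the log-linearised least-squares tensor fit mentioned later for comparison (function names are ours):

```python
import numpy as np

def stejskal_tanner(s0, v, b):
    """s_j = s0 * exp(-b_j^T v b_j) for each sensitising gradient b_j (rows of b)."""
    return np.array([s0 * np.exp(-g @ v @ g) for g in b])

def fit_tensor(s0, s, b):
    """Log-linearised least-squares fit: -log(s_j / s0) is linear in the
    6 independent components of the symmetric tensor v."""
    # Row for gradient g: coefficients of (vxx, vyy, vzz, vxy, vxz, vyz).
    A = np.array([[g[0]**2, g[1]**2, g[2]**2,
                   2*g[0]*g[1], 2*g[0]*g[2], 2*g[1]*g[2]] for g in b])
    rhs = -np.log(s / s0)
    c, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return np.array([[c[0], c[3], c[4]],
                     [c[3], c[1], c[5]],
                     [c[4], c[5], c[2]]])
```

This linear-regression fit is exactly the kind of tensor-space preprocessing that the linear models of [30, 31] rely on, and against which the direct non-linear model is compared below.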
Our goal is to denoise v. We therefore consider the problem (5.3), with a fidelity term of L² type. Due to the Rician noise of s_j, the Gaussian noise model implied by the L²-norm is not entirely correct. However, the L² model is not necessarily too inaccurate, as for suitable parameters the Rician distribution is not too far from a Gaussian distribution. For correct modelling, there would be two alternative approaches.

Figure 1: Non-linear phase/magnitude reconstruction using Exact NL-PDHGM and the Gauss-Newton method. Also pictured are the original noise-free test data and the backprojection of the noisy sub-sampled data. The upper image in each column is the magnitude r, and the lower image the phase ϕ.

Figure 2: Phase/magnitude reconstruction using the linear and non-linear models. Also pictured are the original noise-free test data; the backprojection of the noisy data may be found in Figure 1b. The upper image in each column is the magnitude r, and the lower image the phase ϕ. Only results for TV regularisation are shown for the linear model.
One would be to modify the fidelity term to model Rician noise, as was done in [12, 19] for single MR images. The second option would be to include the coil sensitivities in an L² model, either by knowing them, or by estimating them simultaneously, as was done in [17] for parallel MRI. These kinds of models with direct tensor reconstruction will be the subject of a future study. For the present work, we are content with the simple L² model, which already presents computational challenges through the non-linearities of the Stejskal-Tanner equation.
As our test data set, we have an in vivo measurement of the human brain. The data set is three-dimensional, with 25 slices of size 128 × 128 for 21 different diffusion sensitising gradients, including the zero gradient. Moreover, 9 measurements were taken, from which the ground truth depicted in Figure 3a was constructed by averaging. Only the first of the measurements is used for the backprojection in Figure 3b and the reconstructions with (5.3). The data set has about 8.5 million data points. The diffusion tensor image v is correspondingly 128 × 128 × 25 with 6 components for each symmetric tensor element v(x) ∈ R^{3×3}. In addition we have the additional variable w, and the dual variables. This gives altogether 42 values per voxel in the reconstruction space, or 17 million values. Considering the size of a double data type (8 bytes), and the need for copies and temporary variables, the (C language) program solving this problem has a rather significant memory footprint of about one gigabyte.

Table 4: DTI reconstruction using Exact NL-PDHGM, Linearised NL-PDHGM, and the Gauss-Newton method: equal primal and dual step length parameters. For Gauss-Newton, PDHGM is used in the inner iterations, and the number of PDHGM iterations reported is the total over all Gauss-Newton iterations. The stopping criterion for NL-PDHGM and Gauss-Newton is ‖x^i − x^{i+1}‖ < ρ. The stopping criterion for linear PDHGM within Gauss-Newton iterations is a pseudo-duality gap of less than ρ₂. The number of PDHGM iterations (per Gauss-Newton iteration) is limited to 100000, and the number of Gauss-Newton iterations to 100. ρ = 1e-3, τ = 0.95/L, σ = 0.95/L for L = sup_i ‖∇K(x^i)‖.

Table 5: DTI reconstruction using Exact NL-PDHGM, Linearised NL-PDHGM, and the Gauss-Newton method: primal step length smaller than dual. For Gauss-Newton, PDHGM is used in the inner iterations, and the number of PDHGM iterations reported is the total over all Gauss-Newton iterations. The stopping criterion for NL-PDHGM and Gauss-Newton is ‖x^i − x^{i+1}‖ < ρ. The stopping criterion for linear PDHGM within Gauss-Newton iterations is a pseudo-duality gap of less than ρ₂. The number of PDHGM iterations (per Gauss-Newton iteration) is limited to 100000, and the number of Gauss-Newton iterations to 100. ρ = 1e-4, τ = 0.5/L, σ = 1.9/L for L = sup_i ‖∇K(x^i)‖.

As in Section 5.1, we evaluate all three methods: Algorithm 2.1 (Exact NL-PDHGM), Algorithm 2.2 (Linearised NL-PDHGM), and the Gauss-Newton method. The parametrisation and method setup are the same as in the previous section, except that for the step length parameter choice τ = σ = 0.95/L we use the lower accuracy ρ = 1e-3; for the choice τ = 0.5/L, σ = 1.9/L we use ρ = 1e-4 as before. Also, in both cases, in addition to ρ₂ = ρ/10, we perform Gauss-Newton computations with the higher accuracy ρ₂ = ρ for the inner PDHGM iterations. The regularisation parameters are also naturally different: we choose α = 0.0006/√(128 · 128 · 25), and β = 0.00066/(128 · 128 · 25).
The results of the computations are reported in Table 4, Table 5, and Figure 4. They confirm our observations in the velocity imaging example, but we do have some convergence issues. In the case of Gauss-Newton, we do not observe convergence with accuracy ρ₂ = ρ/10 for the inner PDHGM iterations, and have to use ρ₂ = ρ. We find this quite interesting, since NL-PDHGM gives better convergence results, and is essentially Gauss-Newton with just a single step in the inner iteration. Although not reported in the tables, we also observed that when using the more accurate stopping threshold ρ = 1e-4, instead of 1e-3, for the computations in Table 4, with equal σ and τ, NL-PDHGM did not appear to converge before reaching the maximum iteration count of 100000. This may be due to starting too far from the solution, or due to the fact that convergence of even the linear PDHGM can become very slow in the limit. Nevertheless, with the stopping threshold ρ = 1e-3, we quickly obtained satisfactory solutions, as can be observed by comparing the PSNRs between Table 4 and Table 5, as well as Figure 4, which visualises the results from the latter. Minding the large scale of the problem, we also had reasonably quick convergence to a greater accuracy with the unequal parameter choice. This gives us confidence for further application and study of NL-PDHGM in future work.
Finally, studying Table 4, the reader may observe that the PSNR of the unconverged solution produced by the Gauss-Newton method is better than that of the other solutions; this is simply because we did not choose the regularisation parameters optimally, being interested only in studying the convergence of the methods for the present paper. Therefore we also do not compare the non-linear model completely rigorously against our earlier efforts with linear models; this will be the topic of future research. However, in order to justify the non-linear model, we report a simple comparison in Table 6, Figure 4, and Figure 5. In order to facilitate comparison, we have zoomed the plots into the rectangle indicated in Figure 3a. We compare the non-linear model (5.3) against the linear model of [30] with fidelity term ‖f − u‖² in tensor space, where we first solve for the noisy tensor field f by linear regression from (5.2). Observe that we may linearise the equation by taking the logarithm on both sides. We also include in our comparison the model of [31], which involves this linearisation in the fidelity term. We calculated solutions to all of these models for α ∈ [0.0002, 0.00110]/√(128 · 128 · 25), spacing 0.00005, and again picked the optimal α by both the PSNR and the discrepancy principle. Reading Table 6, we see that the non-linear model performs clearly the best when α is chosen by the discrepancy principle. When α is chosen by optimal PSNR, the tensor-space linear model is better. However, studying Figure 4c, the result is significantly smoothed, reminding us that the PSNR is a poor quality criterion for imaging problems. In the case of selection of α by the discrepancy principle in Figure 5, the situation is not so dire, but the linear models still exhibit a level of smoothing. The non-linear model (5.3) performs visually significantly better.

Figure 3: Visualisation of one slice of the DTI reconstruction using Exact NL-PDHGM and the Gauss-Newton method. Also pictured are the ground truth and the backprojection reconstruction. The data displayed is the colour-coded principal eigenvector. The area outside the brain has been masked out. The rectangle in (a) indicates the region displayed in Figure 4 and Figure 5.

Figure 4: DTI reconstruction using linear and non-linear models, α selected by best PSNR: (a) ground truth, (b) backprojection, (c) linear model, tensor space [30], (d) linear model, DWI space [31], (e) non-linear model (5.3). The region indicated by the rectangle in Figure 3a is plotted.
Figure 5: DTI reconstruction using linear and non-linear models, α selected by the discrepancy principle: (a) ground truth, (b) backprojection, (c) linear model, tensor space [30], (d) linear model, DWI space [31], (e) non-linear model (5.3). The region indicated by the rectangle in Figure 3a is plotted.