Locally Sparse Reconstruction Using the $\ell^{1,\infty}$-Norm

This paper discusses the incorporation of local sparsity information, e.g. in each pixel of an image, via minimization of the $\ell^{1,\infty}$-norm. We discuss the basic properties of this norm when used as a regularization functional and associated optimization problems, for which we derive equivalent reformulations either more amenable to theory or to numerical computation. Further focus of the analysis is put on the locally 1-sparse case, which is well motivated by some biomedical imaging applications. Our computational approaches are based on alternating direction methods of multipliers (ADMM) and appropriate splittings with augmented Lagrangians. Those are tested for a model scenario related to dynamic positron emission tomography (PET), which is a functional imaging technique in nuclear medicine. The results of this paper provide insight into the potential impact of regularization with the $\ell^{1,\infty}$-norm for local sparsity in appropriate settings. However, it also indicates several shortcomings, possibly related to the non-tightness of the functional as a relaxation of the $\ell^{0,\infty}$-norm.


Introduction
Sparse reconstructions based on minimizing 1 -norms have gained huge attention in signal and image processing, inverse problems, and compressed sensing recently. Their main feature of delivering sparse reconstructions, in some cases provably the same as with minimizing the nonconvex 0 -functional, is attractive for many applications and has led to remarkable development in theory and numerics. However, the overall sparsity enforced by minimal 1norm is not the only kind of prior information available in practice. Strong recent directions of research are related to unknowns being matrices, with prior information being e.g. low rank incorporated via nuclear norm minimization or block sparsity (or collaborative sparsity) incorporated by minimization of p,1 -norms with p ∈ (1, ∞). Such regularizations have been studied for instance by [11] and [26] under the name of joint sparsity.
Speaking of p,q -norms it is worth mentioning that there exist several different definitions in the literature (cf. [19,27] and others). We use the definition of mixed norms as proposed in [19,Definition 1], i.e.
where w is bounded and summable with w ij > 0 for all i, j. Note that the cases p = ∞ and q = ∞ can be obtained by replacing the corresponding norm by the supremum. In this paper we want to discuss such type of sparsity-functionals, namely 1,∞ -norms Our motivation is a local sparsity that frequently appears in inversion with some spatial dimensions (related to the index i) and at least one additional dimension such as time or spectral information in imaging (related to the index j). The main ingredients in the reconstruction problems we want to consider are a dictionary B ∈ R T ×N (encoding basis elements), with N basis vectors, T e.g. time steps and N > T , a forward operator A ∈ R L×M , with M pixels, L depending on the application (typically L < M being a number of detectors), and the measured data W ∈ R L×T , which leads to an equation of the form The unknowns U ∈ R M ×N are the collection of the coefficients with respect to the basis or dictionary encoded in B. Frequently a good dictionary for the local behaviour in the additional dimension can be found such that the vector u i• is expected to be sparse, which means we want to minimize with 0 0 := 0 subject to (2). A natural relaxation is to consider the minimization of (1) subject to (2) instead. Examples of applications with such kind of information are: • Dynamic Positron Emission Tomograpy (PET, and similar problems in SPECT, cf. [30,21,14,13]), where i refers to the pixel number of the image to be reconstructed and B is a local dictionary of (discretized) time-basis functions. The operator A is the PET matrix (roughly a sampled Radon-transforms with some corrections) and the basis functions are generated by kinetic modelling, i.e. as solutions of simple linear ODE systems with unknown parameters. The dictionary is generated indirectly by a dictionary for the parameters in the ODE. Clearly one often looks for a unique parameter value, i.e. a representation by only one basis function in each pixel, an ultimate kind of sparsity.
• Fluorescence-lifetime imaging microscopy (FLIM, cf. [8,20]) where A is a convolution in space or the identity and B contains different functions, which characterize the photon decay and are also convolved in time. Considering different basis functions for different fluorophores local sparsity may enhance the unmixing process.
• ECG Cardiac Activation Time Reconstruction (cf. [23,17], where i refers to a grid point on the epicardial surface and B is a dictionary of step functions parametrized by the activation time. Again one looks for a single activation time in each grid point, i.e. an ultimately sparse local representation usually not formulated this way. • Spectral-and Hyperspectral Imaging (cf. [1,10]), where the operator A is often a convolution and B is a dictionary of spectral signatures of expected elements. Currently resolution is hardly small enough that pixels resolve pure materials, but one may easily assume that only very few materials are contained in each pixel, which corresponds also to the above local sparsity prior.
Our analysis below will demonstrate that it may be advantageous to consider a combination of minimizing (1) with classical 1 sparsity. We will therefore also investigate the more general problem min Besides the constrained model (3) we shall also investigate the unconstrained model which is suited to deal with data noise. Here we use the Frobenius norm for the first part and γ T γ is a positive definite weighting matrix (in a statistical formulation the inverse covariance matrix of the noise). Since in basically all practical applications one only looks for positive combinations of basis elements, we shall put a particular emphasis on the case of an additional nonnegativity constraint on U in (3) respectively (4). We will investigate some basic properties of the model as well as reformulations with additional inequality constraints. This will make the problem more easily accessible to detailed analysis and numerical methods. Based on the latter we also investigate the potential to exactly reconstruct locally one-sparse signals by convex optimization techniques.

Basic Properties and Formulations
In this section we are going to introduce some equivalent formulations of the main problems (3) and (4), which we are going to use for the analysis later on. Furthermore we are going to point out some basic properties like existence, convexity and potential uniqueness. Finally we are going to prove the equivalence of another reformulation, which will improve the accessibility of the problem for numerical analysis.

Problem Formulations
Since in most applications a nonnegativity constraint is reasonable, we first restrict (3) and (4) to this case. For the sake of simplicity we define for the constrained problem and for the unconstrained problem. In order to make these problems more easily accessible, we reformulate the 1,∞ -term in (5) and (6) via a linear constraint: is equivalent to Proof.
Now consider the inequality-constrained problem is not a minimizer, since we can choosev < v, which is still feasible and reduces the objective. Thus in the optimal case the inequality constraint yields the equality constraint and the problems (7) and (8) coincide.
Using Proposition 1 and defining F as the sum of the characteristic function for the data constraint and the non-negative 1,1 -term we are able to reformulate problem (5) as Likewise we obtain the unconstrained problem from (6) as by using the sum of the data fidelity and the non-negative 1,1 -term as F . In order to understand the potential exactness of sparse reconstructions we will focus on the analysis of (9), but of course (10) is more useful in practical situations when data are not exact, hence it will be the basis for computational investigations.
However, we first propose another formulation, which shall make the problem more easily accessible for the numerical solution. For the algorithm we shall then use the newly reformulated unconstrained problem (10) to deal with noisy data as well.
For this reformulation we will prove in Subsection 2.4 that max i=1,...,M N j=1 u ij (α) depends continuously on α in problem (7). Then we will show that α → max i=1,...,M N j=1 u ij (α) is monotonic and finally we will analyze its limits for α going to zero and infinity. We can then show that under certain circumstances the support of the minimizers of (7) and coincide for a certain fixedṽ.

Basic Properties
Now we want to discuss some basic properties of our 1,∞ -regularized variational problems.
We are going to show that there exists a minimizer for these problems and discuss potential uniqueness.

Existence
Since functional (1) is a norm it is also convex. We want to show that this holds for the non-negative formulation (12) as well.
is convex.

Proof.
Since F (U ) is convex, we can conclude that (7) is convex by using Proposition 1. Due to Theorem 1 we deduce that (8) is convex as well.
Our problem is finite dimensional and hence all norms are equivalent. In addition it contains only linear inequalities. Therefore we can deduce lower semi-continuity directly from convexity, which we already have.
Let us now analyze the existence of minimizers of the different problems.
Theorem 2 (Existence of a Minimizer of the Constrained Problem). Let there be at least one U ∈ G that satisfies A U B T = W . Then there exists a minimizer of the constrained problems (5) and (9).

Proof.
Since we only have linear parts, we see that is convex. Then Proposition 2 leads to lower semi-continuity of (5) and (9). We still need to show that there exists a ξ such that the sublevel set is compact and not empty. With U we have a feasible element and we can define Due to U ∈ S ξ we see that S ξ is not empty and since for all U ∈ S ξ holds that and u ij ≥ 0 for all i ∈ {1, . . . , M } and j ∈ {1, . . . , N } the sublevel set S ξ is bounded. Our functional is finite dimensional hence S ξ is bounded in all norms and furthermore boundedness of S ξ yields with lower semi-continuity of (5) and (9) compactness of S ξ . Finally we obtain the existence of a minimizer of the constrained problems (5) and (9).

Proof. Obviously
u ij is convex. Using Proposition 2 we obtain that (6) and (10) are lower semi-continuous. The sublevel set is not empty since we obviously have 0 ∈ S ξ . Analogously to the proof of Theorem 2 we see that S ξ is bounded. Due to the finite dimensionality of the problem we have compactness of the sublevel set S ξ . Together with semi-continuity we obtain existence of a minimizer of the unconstrained problems (6) and (10).
Furthermorev is unique.

Proof.
Obviouslyv can be defined as v := inf {v | (U, v) is a minimizer of (8)} with F as in (9), (10) respectively. Due to Theorem 2 and 3 it isv < ∞. We proof the assumption via contradiction. Assume there exists noŪ with (Ū ,v) being a minimizer of (9), (10) respectively. We can find a sequence of minimizers (U k , v k ) with v k →v. U k is bounded, since U k ∈ S ξ for all k. Thus there exists a subsequence (U k l , v k l ). Finally lower semi-continuity provides us with the limit (Ū ,v) being a minimizer, which is a contradiction to the assumption.
Sincev ∈ R + it is obviously unique.

Uniqueness
In Theorem 4 we have seen that we can reduce the solution set to those solutions with optimal v, i.e.
with F as in (9), (10) respectively. Of course there always exists a uniquev ∈ R + , but in general we can not deduce uniqueness for (U,v) ∈ G × R + .

The Subdifferential of 1,∞
In this section we compute the subdifferential of the 1,∞ -norm (1). Furthermore we establish a connection to a source condition of (2).
Theorem 5 (Subdifferential of the 1,∞ -Functional). Let U, P ∈ R M ×N . The subdifferential of J(U ) = U 1,∞ can be characterized as follows: Let I be the set of indices where U attains its maximum row-1 -norm, i.e.
Then the following equivalence holds: with weights w i ≥ 0 such that Proof.
Let P be chosen according to the characterization on the right hand side of (13). Since U 1,∞ is 1-homogeneous, it is sufficient to verify the following two conditions: Regarding 2., we have Thus we see that a P chosen in this way belongs to the subdifferential of U 1,∞ . Now let P ∈ ∂ U 1,∞ , such that conditions 1. and 2. hold. We will prove the characterization of the subdifferential in five steps: 1. We show that p mn = 0 for all m / ∈ I. Let m / ∈ I and n be arbitrary. Consider V with v ij = u ij for all (i, j) except v mn = u mn + εsign(p mn ). Since m / ∈ I, we can choose ε > 0 small enough such that U 1,∞ = V 1,∞ . Thus we have p ij v ij ≥ 0 holds by condition 2., we have shown that p mn = 0 for all m / ∈ I.
2. Furthermore we show that u mn p mn ≥ 0 for m ∈ I, u mn = 0.
For an arbitrary (m, n) with m ∈ I, u mn = 0 let V be an element with v ij = u ij for all (i, j) = (m, n) and v mn = −u mn . Clearly we have U 1,∞ = V 1,∞ , which leads to and hence u mn p mn ≥ 0.
3. Now we show that for m ∈ I, we have |p mn | = |p md | = const for all (m, n) and (m, d) with u mn = 0 and u md = 0.
Let V be an element with v ij = u ij for all (i, j) / ∈ {(m, n), (m, d)} and v mn = u mn − sign(u mn )ε, v md = u md + sign(u md )ε. Let ε = 0 be chosen small enough such that sign(u mn ) = sign(v mn ) and sign(u md ) = sign(v md ). Clearly U 1,∞ = V 1,∞ holds such that and thus (sign(u md )p md − sign(u mn )p mn )ε ≤ 0. Since the sign of ε is arbitrary, the reverse inequality holds as well such that we obtain sign(u mn )p mn = sign(u md )p md . Since u mn = 0 and u md = 0 we have |sign(u mn )| = |sign(u md )| = 1 and thus we obtain |p mn | = |p md | = const.
In the following we will denote this constant (depending on k) as w m := |p mn | ∀ n ∈ {1, . . . , N }, m ∈ I .

Let us now show that
For this purpose we compute to see that the above claim holds.
Note that from 2. we obtain sign(u mn ) = sign(p mn ) and therefore we have p mn = w m sign(u mn ) for m ∈ I.

This shows that
We can see that the collection of the five results show that P indeed meets all requirements on the right hand side of (13).
One particular thing we can see from the proof of Theorem 5 is that P ∈ ∂ U 1,∞ for an arbitrary U meets Thus we see that the ∞,1 -norm is the dual norm to the 1,∞ -norm, which has already been observed by Tropp in [27].
Let us now consider the nonnegative 1,∞ -Functional.
The subdifferential of the nonnegative 1,∞ -functional can be characterized as and p 1,∞ ij are the entries of P 1,∞ , which is the subdifferential of 1,∞ characterized as before in (13).
Proof. J(U ) can be written using the characteristic function, i.e.
In case that the subdifferential is additive, we have and directly obtain In order to prove that in this case the subdifferential is additive, we have to show that This is quite easy to see: 1. Since U 1,∞ is a norm, it is obviously proper and convex. Furthermore we see that It is also convex, because the characteristic function of a convex set is also convex. Both functionals are lower semi-continuous, since we are in a finite dimensional setting.
Thus we see that the subdifferential is additive and we obtain the assumption.

Remark 1.
Obviously P ∈ ∂J(U ) can be characterized as follows: Knowing the characterization of the subgradient we can state a condition, which allows us to determine whether a certain solution to AU B T = W is 1,∞ -minimizing. We will call this condition a source condition as used in the inverse problem and error estimate literature e.g. in [9,3,25]. However, we would like to point out that similar conditions have been called dual certificate in the compressed sensing literature (c.f. [31,5,4,6]).

Definition 1.
We say that a solution U of A U B T = W meets a source condition with respect to a proper, convex regularization functional J, if there exists a Q such that P = A T QB ∈ ∂J( U ).
The source condition of some U with respect to J is nothing but the optimality condition for U being a J-minimizing solution to A U B T = W : The next natural question is to use our characterization of the subdifferential to ask the question what kind of solutions U to A U B T = W are likely to meet a source condition for 1,∞ -regularization? Particularly, we are interested in investigating how likely 0,∞ -minimizing solutions are to meet a source condition.
Due to the similarity between the subgradient of the 1 -norm and the subgradient of 1,∞ -norm at rows with index i ∈ I, we can make the following simple observation Proof. The 1 -subgradient divided by the number of rows is an 1,∞ -subgradient.
The above lemma particularly shows that exact recovery criteria for the properties of the sensing matrix kron(B, A) (like the Restricted Isometry Property, the Null Space Property, or the Mutual Incoherence Property), are sufficient for the exact recovery of sparse solutions with the same 1 -norm in each row.
Of course, we do expect to recover more 0,∞ -minimizing solutions than just the ones with the same 1 -row-norm. Looking at the characterization of the subdifferential (Remark 1) we can see that there are two cases that pose much more severe restrictions, i.e. equality constraints, than the two other cases (which only lead to inequality constraints). Thus we generally expect solutions, which require only a few of the equality constraints to be more likely to meet a source condition. As we can see equality constraints need to be met for nonzero elements, such that the 1,∞ -regularization prefers sparse solutions. Additionally the constraints are less restrictive, if the corresponding row has maximal 1 -norm. Thus we expect those solutions to be likely to meet a source condition that reach a maximal 1 -norm in as many rows as possible while being sparse. Naturally, these solutions will be row-sparse, which further justifies the idea that 1,∞ can be used as a convex approximation of the 0,∞ -problem.

Equivalence of Formulations
In this subsection we want to show that the minimizers of (7) and (11) coincide under certain circumstances. In order to do so we examine how the regularization parameter α > 0 is connected to the minimizer of (7). For this purpose we first prove the continuity and monotonicity of Afterwards we shall investigate the meaning ofṽ in (11) and its connection to the minimizer of F (U ) with minimal 1,∞ -norm. Analyzing the limits of (14) finally leads us to the main result of this subsection and the connection between the to problems (7) and (11).

Remark 2.
In most of the proofs in this subsection we require F (U ) to be continuous. Thus most of the results are not useful for the constrained problem (5) since is not continuous. However, in reality we have to deal with noisy data anyways and thus we only want to implement the unconstrained problem (6). This will become easier since we can simply use its reformulation (11), which we will summarize in Theorem 7.
For this subsection we define the functional J α : Lemma 3 (Continuity of (14)). Proof.
Let U k be a minimizer of J α k . Let α k → α be a sequence of regularization parameters. Due to boundedness we can find subsequences U k l → U . Due to convergence of the subsequences, the limits of all subsequences are equal and we obtain convergence of the whole sequence U k → U . We prove by contradiction and claim that U is not a minimizer of J α . Then there would exist a U with Defining Since J is continuous we obtain for k → ∞ Using (16) it also has to hold that which is a contradiction to the assumption that U k is a minimizer of J α k . Hence U has to be a minimizer of J α and we see that α → α max i∈{1,...,M } N j=1 u ij (α) is continuous. Thus we also now that (14) is continuous on (0, ∞). (14)).

Lemma 4 (Monotonicity of
Let F : R M ×N → R + be a convex continuous functional with bounded sublevel sets. Let U (α) ∈ G be a minimizer of J α with minimal U (α) 1,∞ .
On the other hand we have Inserting this in (17) yields u ij (α) depends monotonically on α.

Lemma 5.
Let F : R M ×N → R + be a convex continuous functional with bounded sublevel sets. LetŪ ∈ G be a solution of (11) such that Then it holds thatṽ > U 1,∞ , where U ∈ G is a minimizer of F (U ) with U 1,∞ minimal and vice versa.

Proof.
Letṽ > U 1,∞ , then U is feasible for (11) and obviously a minimizer as well.
ij <ṽ for all i we obviously haveŪ = U and since U is a minimizer of F with U 1,∞ minimal. Due to convexity we obtain and for ε = 0 small we have Thus we have found an element with smaller value of F as the minimizerŪ , which is a contradiction.
Let F : R M ×N → R + be a convex continuous functional with bounded sublevel sets. Then Proof.
Let U (α) be a minimizer of J α as proposed in (15). 1. Consider the case of α → ∞. U = 0 is feasible for J α , thus we obtain 2. Now consider the case of α → 0. We can find a subsequence α k → 0 such that Hence we obtain Since U is a minimizer of F , it has to hold that F (U (α k )) ≥ F ( U ) and thus we obtain Obviously U (α) is feasible for (11) with Using Lemma 5 then leads us to Due to the lower semi-continuity of the norm we obtain lim inf u ij (α) = 0 already for α < ∞ but large enough. We see that by considering p 0 ∈ ∂F (0) and then selecting p = − 1 α p 0 . Here we choose α large enough that we have p ∞,1 < 1 and obtain Thus p is a subgradient at U (α) = 0 and we see that the optimality condition is fulfilled.
We can now finally conclude Theorem 7 (Connection of the Solutions of (8) and (11)).
Then there exists an α > 0 such thatŪ is a solution of Remark 4. If U is a solution of (11) we can directly decide, whether there exists an α for this problem, i.e. in the case ofṽ Theorem 8 (Existence of a Solution of (11)). Let F : R M ×N → R + be a convex continuous functional with bounded sublevel sets. Then there exists a minimizer of (11).

Proof.
We can write (11) using the characteristic function, i.e.
Consider the sublevel sets Since F is continuous and χ(0) = 0 we have 0 ∈ S ξ and thus S ξ is not empty. Furthermore S ξ is bounded, since the norm of U is bounded. Additionally the functional stays lower semicontinuous and we obtain the existence of a minimizer of (11).
Using Theorem 7 for problem (10) we obtain Note that we need to look for a suitable regularization parameter in the implementation anyway. Thus we can instead determine a suitableṽ and obtain an easier optimization problem.
We observe that we indeed obtain a special kind of sparsity in every row in the case that the regularizing parameterṽ becomes very small. For this reason we consider the rescaling X :=ṽ −1 U ⇔ U =ṽX and obtain the new variational problem Let us now analyze what happens in (20) forṽ → 0.

Theorem 9.
Let k i be the number of maxima in the ith row of G := A T W B and let X(ṽ) be a minimizer of (20). Then the ith row ofX is at most k i -sparse.

Proof.
After simplifying the norm and dividing byṽ in (20) we can equivalently consider For the case thatṽ → 0 the first summand tends to zero and thus we have With the above definition of G we shall now consider Let J i be the column index set at which the maximum of the ith row of G is reached, i.e. g ij x ij ≤ g in ∀ n ∈ J i for every i ∈ {1, . . . , M }. Hence we see that for a solution of (21) has to hold n∈J i x in = 1 and which means that the ith row of the solution of (21) has at most k i nonzero entries. Thus the ith row ofX is at most k i -sparse, maybe even sparser.

Remark 5.
In case that the ith row ofX is k i -sparse, the asymptotic solutionX has nonzero entries at the same positions as G = A T W B has its maxima in each row. Note that the row-maxima are not necessarily unique. However, in the case that J i contains only one element for every i ∈ {1, . . . , M } the rows ofX are 1-sparse.
Theorem 9 raises the question, whether there exists a small regularization parameterṽ for which X(ṽ) is already k i -sparse. In this case we could apply this knowledge to the original problem (19), which is not possible in the limit case since thenŪ := lim v→0 U (ṽ) would be equal to zero. Then there exists a regularization parameterṽ > 0 such that the solution of (19) has nonzero entries at the same positions as G := A T W B has row-maxima.

Proof.
Consider again the rescaled problem (20). After again simplifying the norm and dividing bỹ v we consider equivalently where we defined again G := A T W B. The Lagrange functional reads as follows Let us now consider the optimality condition where h ij denotes the sum of the mixed terms resulting from the data term, which are independent from x ij , i.e.
Let nowṽ > 0 hold and let J i be the index set for which the entries of the ith row of the solution of (19) are nonzero. We show that g ij > g in for all j ∈ J i and for all n / ∈ J i . In order to do so we consider We have x ij > 0 and µ ij = 0 since j ∈ J i and furthermore x in = 0 and µ in ≥ 0 since n / ∈ J i . Thus we obtain due to the fact thatṽ, x ij , a ·i 2 2 and b ·j 2 2 are positive. Hence we are left to show that g ij > g in +ṽd ijn (22) holds, where we define d ijn := h ij − h in . This statement is obvious for d ijn ≥ 0. Let now d ijn < 0 hold. For n / ∈ J i we assume that g in = max ν∈{1,...,N } g iν and in addition g in ≥ g ij + 2ṽ|d ijn | for all j ∈ J i . This is always possible since we can chooseṽ > 0 small enough. With (22) we have thus we obtain which is a contradiction. Hence we finally obtain and we see that the solution X has nonzero entries at the same positions like A T W B has row-maxima, even forṽ > 0 but small enough. Then obviously the same holds for the solution of (19).

Exact Recovery of Locally One-Sparse Solutions
In this section we discuss the question of exact recovery for our model. There already exist several conditions, which provide information about exact reconstruction using linearly independent subdictionaries, see for instance [12,28]. Unlike the case where the basis vectors are linearly independent, we consider the operator to be coherent, i.e. the mutual incoherence parameter (cf. [18, p. 3 for b i , b j being distant basis vectors, is large. In other words, the vectors are very similar. This is a reasonable assumption for many applications, see for instance Section 5.1. In Appendix A we gain some understanding of necessary scaling conditions recovering locally 1-sparse solutions considering only one spacial dimension plus time dimension using problem (4). We learn that, if the solution is 1-sparse in one spacial dimension plus time dimension, the matrix B has to meet the scaling condition γb n 2 = 1 and | γb n , γb m | ≤ 1 for n = m in order to recover all 1-sparse solutions which result from the data W .

Lagrange Functional and Optimality Conditions
In this subsection we want to introduce the Lagrange functional and optimality conditions of problem (9), which we will need in the further analysis. We equivalently rewrite problem (9) by writing the data constraint for every l and k, i.e.
for every l and k. For this problem the Lagrange functional reads as follows: where λ, µ and η are Lagrange parameters. Now we are able to state the optimality conditions µ ij ≥ 0 and µ ij u ij = 0 .

Scaling Conditions for Exact Recovery of the Constrained Problem
Now we want to examine under which assumptions a 1-sparse solution of the constrained 0,∞ -problem can be reconstructed exactly by using the constrained 1,∞ -1,1 -minimization (24).
We will see that the scaling condition (23) in a slightly reformulated way is a sufficient condition for exact recovery.

Proof.
In order to proof Theorem 11 we have to show that there exist Lagrange parameters λ ∈ R M , µ ∈ R M ×N and η ∈ R L×T such that U fulfills the optimality conditions (OPT1) and (OPT2) with respect to the complimentary conditions (26) and (27). Choose the Lagrange parameters for all i ∈ {1, . . . , M } as follows: with v α = max p c p and m being the number of indices for which holds c i = v α , and η as solution of Note that (30) is solvable, since A T is surjective.

Let us show that (OPT1) and (26) hold for U a) Consider
Thus (OPT1) is fulfilled.
b) In case that c i < v α we see that (26) is trivially fulfilled. Hence let c i = v α hold. Consider and thus (26) is fulfilled as well. (27)  Since (29) has to hold, we obtain µ ij ≥ 0 and hence (27) is fulfilled again.

Let us now show that (OPT2) and
b) Let again be j = J(i). Then we have using the definitions of η and µ and the scaling condition (29). In this case (OPT2) is fulfilled.
Let us now consider j = J(i). Then we obtain where we used the definition of µ. Thus we see that in this case (OPT2) is fulfilled as well.
In summary we see that there exist Lagrange parameters such that U fulfills the optimality conditions and complementary conditions of (24) and thus is a solution of (24).
Finally we have found a condition for the exact recovery of solutions of the constrained 0,∞ -problem, which contain 1-sparse rows, using the constrained problem (24) for the reconstruction, i.e. (29) has to hold.

Remark 6.
We need to assume that A T is surjective in order to solve (30). Unfortunately if A T is surjective, then A is injective and thus we could easier consider U B T = A † W , where A † is the pseudoinverse of A.
Let us now consider an example of the extremest under-determined case, i.e. where L = 1.  In case the exact solution U contains a row-vector u i , which has its nonzero entry at the same position as u m , i.e. J(i) = J(m), but their entries differ, i.e. c i < c m , then exact recovery of U using the nonnegative 1,∞ -problem (24) for the reconstruction is not possible.

Proof.
Let us suppose that exact recovery were possible. Then there exist a λ, which fulfills (OPT1) and (26), a µ, which fulfills (27) and an η such that (OPT2) is fulfilled. Considering the complementary condition (26)  This is a contradiction, since we have λ i = 0, λ m > 0 and a i and a m are unequal to zero. Thus we see that the 1-sparse 0,∞ -solutionû ij can not be the solution of the 1,∞ -problem (24) and in this case exact recovery is not possible.

Remark 7.
In the case of Theorem 12 there always exists a solution of (24) and (28), which has a nonzero element in just one row, i.e. U itself is 1-sparse.
Note that Theorem 12 does not state that the reconstructed support is wrong. Hence we could still obtain important information from the nonnegative 1,∞ -reconstruction. Furthermore Theorem 12 does not apply for the case where β > 0, since in that case we obtain (αλ i + β)a m = (αλ m + β)a i and thus we do not obtain a contradiction in the last step of the proof. Thus Theorem 12 suggests the use of β > 0.

Algorithms
The numerical minimization of p,q -related regularization problems is usually done using a modified FOCUSS algorithm, if the problem includes an exact reconstruction, or in the noisy case a thresholded Landweber iteration (cf. [19]). However, those algorithms are designed for 0 < p, q ≤ 2 and since in our case we have q = ∞ they are not suitable for minimizing 1,∞ -related problems.
In order to solve problem (4) numerically we use a different approach and develop an algorithm for the solution of its reformulated problem, i. e.
For the sake of simplicity we exclude the weight γ for now. In order to solve this problem numerically we want to use alternate direction method of multipliers (ADMM). For computing reasonably simple sub-steps we split the problem twice. Doing this we obtain Using the Lagrange functional, which reads as follows we obtain the unscaled augmented Lagrangian L λ,µ un U, D, Z; P , Q = with Lagrange parameters λ, µ and dual variables P and Q. Since its handling is much easier we also want to state the scaled augmented Lagrangian, i.e.
L λ,µ sc (U, Z, D; P, Q) = with the new scaled dual variables P := P λ and Q := Q µ . Using ADMM the algorithm reads as follows: , which we will adapt to our problem. Another advantage of this extension is that the performance becomes less dependent on the initial choice of the penalty parameter. In order to do so we first propose the optimality conditions.

Optimality Conditions
We obtain the following primal feasibility conditions and the dual feasibility conditions Since U k+1 minimizes L λ,µ sc (U, Z k , D k ; P k , Q k ) by definition, we obtain using the definitions of P k+1 and Q k+1 . This is equivalent to where the right hand side is the first dual feasibility condition (34). Therefore can be seen as a dual residual for (34). Analogically we consider 0 ∈ ∂ D L λ,µ sc which yields that P k+1 and D k+1 always satisfy (35). The same applies for 0 ∈ ∂ Z L λ,µ sc where we see that Q k+1 and Z k+1 always satisfy (36). In addition we will refer to and (39) as the primal residuals at iteration k + 1.

Stopping Criteria
Analog to [2, Section 3.3.1] we derive the stopping criteria for the algorithm. As shown in Appendix C the primal and dual residuals can be related to a bound on the objective suboptimality of the current point Y * . Hence we obtain We see that the residuals should be small in order to obtain small objective suboptimality.
Since we want to obtain a stopping criterion but U * is unknown we estimate that U k − U * F ≤ d. Thus we obtain It stands to reason that the primal and dual residual must be small, i.e.
where ε rel = 10 −3 or 10 −4 is a relative tolerance and the absolute tolerance ε abs depends on the scale of the typical variable values. Note that the factors √ mn and √ mt result from the fact that the Frobenius norms are in R m×n and R m×t , respectively.

Adaptive Parameter Choice
In order to extend the standard ADMM and to improve its convergence rate we vary the penalty parameters λ k and µ k in each iteration as proposed in [2,Section 3.4.1]. This extension has been analyzed in [22] in the context of the method of multipliers. There it has been shown that if the penalty parameters go to infinity superlinear convergence may be reached. If we consider λ and µ to become fixed after a finite number of iterations, the fixed penalty parameter theory still applies, i.e. we obtain convergence of the ADMM.
The following scheme is proposed in [15], [29] and others and often works well.

Application to Dynamic Positron Emission Tomography
Positron Emission Tomography (PET) is an imaging technique used in nuclear medicine that visualizes the distribution of a radioactive tracer, which was applied to the patient. Compared to computer tomography (CT), PET has the advantage of being a functional rather than a morphological imaging technique. Using radioactive water (H 15 2 O) as a tracer it is possible to visualize blood flow. H 15 2 O has the advantage of being highly diffusible and the radiation exposure is low. Even dynamic images are possible. On the other hand the reconstructed images have poor quality due to the short radioactive half-life of H 15 2 O. Consider the inverse problem of dynamic PET, i.e.
where the operator A linking the dynamic image Z with the measured data W is usually the Radon operator, but could also be another operator depending on the application. Using kinetic modeling (cf. [30, Chapter 23, p. 499 et seqq.]) we are able to describe the unknown image Z as the tracer concentration in the tissue C T , i.e.
where C A (τ ) is the arterial tracer concentration, also known as input curve, F (x) refers to the perfusion and λ is the ratio between the tracer concentration in tissue and the venous tracer concentration resulting from Fick's principle.
Expression (43) is an integral equation including the exponential factor e − F (x) λ (t−τ ) , which depends on both input arguments, i.e. time t and space x. This expression is highly nonlinear and thus not easy to handle especially in combination with inverse problems. Due to the fact that we have prior knowledge about F (x) λ , we are able to provide a big pool of given perfusion values for this expression. Subsequently, we can consider a linearization, i.e.
in whichb j are the prior known perfusion values and the integral is now independent of space. Expression (44) is reasonable, if there is at most one u j = 0 for j ∈ {1, ..., N }, i.e. the coefficient u j corresponding to the correct perfusion valueb j . In order to further simplify the work with this operator, we assume that the input curve C A is predetermined.
Hereby we obtain the linear kinetic modeling operator which we use to describe the unknown image Z. The advantage of (45) over (43) is that we are able to compute the basis functions b j (t) in advance and thus we can provide many of those for the reconstruction.
Note that there exists another deduction of (45) by [21].
Discretization of (45) yields where i denotes the pixel and k the time step. After discretizing A as well we can insert (46) for the image Z in (42) and obtain hence (10) can be used for the reconstruction of the spatio-discretized coefficients u ij .

Results
In this section we present some numerical results. For now we are going to work on synthetic data to investigate the effectiveness of the approach. In order to do so we use a simple 3D matrix U containing the exact coefficients as ground truth, i.e. two spatial dimensions and one extra dimension referring to the number of basis vectors. Defining two regions where for only one basis vector the coefficients are nonzero yields the fact that the corresponding coefficients for most of the basis vectors are zero. Obviously our ground truth fulfills the prior knowledge, which we would like to promote in the reconstruction, i.e. there is only one coefficient per pixel, which is unequal to zero. In Figure 1 we see that the exact coefficients for the most basis vectors are zero. Only some coefficients corresponding to the second and seventh basis vectors are nonzero. In order to obtain the artificial data W ∈ R L×T we have to apply the matrices A ∈ R L×M and B T ∈ R N ×T to the ground truth U ∈ R M ×N .  Figure 2 we see that those basis functions are very similar, i.e. they are very coherent. For verifying our model we use a simple 2D convolution in space as matrix A.
Using now Algorithm 1 on the so computed W including a strong 1,∞ -regularization, i.e. v = 0.1, we obtain a very good reconstruction of the support. Figure 3 only shows the coefficient matrices to those basis functions, which include reconstructed nonzero coefficients. For simplicity we do not show the other reconstructed coefficient matrices, which are completely zero. Obviously we obtain a very good reconstruction of the We observe that every value larger thanṽ is projected down toṽ and we make a systematic error. This is due to the inequality constraint in problem (18) and because of the fact that we choseṽ smaller than the maximal value of the exact data U (compare for instance Subsection 2.4 and especially Theorem 7). Thus we are not really close to the exact data. In order to overcome this problem we first reconstruct the support including the 1,∞ -and 1,1 -regularization and then perform a second run without regularization only on the known support to reduce the distance to the exact data. In Figure 4 we see that this approach leads to very good results. We additionally reconstructed an example including some Gaussian noise. In Figure 5 we observe that the algorithm performs quite nicely.
Let us now evaluate Algorithm 1 with respect to the quality of the reconstructed support. In order to do so we compare the reconstructed support after the first run (including both regularizations) with the support of our ground truth and state how much percent of the true support is reconstructed wrongly depending on the 1,∞ -regularization parameterṽ. We also include the distance of the wrongly picked basis vector in each pixel, for instance if the support of the ground truth picks basis vector number 7 and the reconstructed support picks basis vector number 5 instead, we double the influence of the error in this pixel, if the  Table 1: Evaluation of Algorithm 1 with β = 0.1, λ = 0.5 and µ = 0.1 In Table 1 we see the evaluation of Algorithm 1 applied to the noiseless data W . Whenṽ becomes smaller than 0.01 we observe that there is no further improvement. As we have seen in Figure 3 the boundary of the region is reconstructed wrongly and the algorithm selects the sixth instead of the seventh basis function. However, the prior knowledge is already fulfilled, i.e. in every pixel there is only one basis vector active. This is the reason why there are still 0.1772% wrong pixel and we do not obtain further improvement.
In Table 2 and 3 we have the same error measures for differentṽ as in Table 1. But this time we included Gaussian noise on the data W with standard deviation 0.01, and 0.05 respectively. At first the error drops quickly. However, whenṽ becomes smaller the error stagnates in a certain range similar to the noise free case.
In order to smartly chooseṽ we have to find a good tradeoff between a small error and a small number of iterations. Choosingṽ ∈ 10 −4 , . . . , 10 −3 seems to be a good choice.

A Exact 1 -Reconstruction of 1-Sparse Signals in 1D
In order to gain some understanding into suitable and necessary scaling conditions recovering locally 1-sparse solutions, we first consider the simplest case, namely M = 1, when the problem reduces to standard 1 -minimization: Theorem 13 (Exact Reconstruction of a One-Sparse Signal in 1D). Let the vector w := e T j B T = b T j be the jth basis vector and c = 1−(α + β) for (α + β) ∈ (0, 1). Ifû = ce T j is the solution of (4), then the matrix B has to meet the scaling condition γb n 2 = 1 and | γb n , γb m | ≤ 1 for n = m . Proof.
Then it follows Subsequently we insertû = ce T j and w = e T j B T to obtain Since p n ∈ ∂ |û n | has to be satisfied, we need to ensure that which is true under the assumptions mentioned above.
For this reason we know that we have to normalize our basis vectors with respect to the 2 -norm to reconstruct at least a δ-peak exactly in one dimension. Note that in the one-dimensional case the 1,∞ -regularization and the 1,1 -regularization reduce to a single 1 -regularization with regularization parameter α + β.
We further want to analyze the special case of the Kullback-Leibler approximation (cp.

Proof.
We first compute the optimality condition of (4) with γ = 1 √ w as It follows that Then we insertû and w and conclude With the assumptions mentioned above we obtain again p n ∈ ∂ |û n |.
It is worth mentioning that in this case every positive solution of Bu = w meets the optimality condition (48), in particular every non-sparse solution.
B Solving the Positive 1,∞ − 1,1 -Projection-Problem We want to solve the following problem In order to do so we reformulate the first part of the problem, i.e.
Since the last part of the sum is independent of d ij we can consider with the constraints with d (i) denoting the ith transposed row of D, and for u (i) , p (i) respectively.
To minimize this problem we first consider (50) only under (Constr1). In this case the solution is given by To include (Constr2) we have to do a case-by-case-analysis: Case a: Let (51) satisfy (Constr2). In this case the solution of (50) under (Constr1) and (Constr2) is given by Once we know ϑ we can compute the optimal d (i) as We can see this by computing the optimality condition of (52) with the complementary conditions µ (i) j ≥ 0 and µ (i) j d (i) j = 0 ∀j .
If d (i) j = 0, then µ (i) j = 0 and thus we obtain from (54) that end if 12: end for 13: return D Solution of (49)

C Inequality for Stopping Criteria
In order to proof the inequality which is needed in Subsection 4.2, we are going to adapt the proof of [2, Appendix A] to the case of our double splitting.
We now have to examine the optimality conditions.

OPT1:
Starting with 0 ∈ ∂ U L λ,µ un U k+1 , D k , Z k ; P k , Q k we insert the Lagrange updates and obtain Thus we see that U k+1 minimizes OPT2: with J as defined in (37). Inserting P k from (56) yields 0 ∈ β1 M ×N − P k+1 + ∂J(D k+1 ) , and hence we see that D k+1 minimizes OPT3: Inserting Q k from (56) yields Therefore Z k+1 minimizes All in all follows that and Adding equations (57), (58) and (59) together leads to Using the definitions of R k+1 1,2 and S k+1 (see for instance (39),(40) and (38)) and the fact that U * = D * and U * B T = Z * we finally obtain