Dynamical Programming for off-the-grid dynamic Inverse Problems

In this work we consider algorithms for reconstructing time-varying data into a finite sum of discrete trajectories or, equivalently, an off-the-grid sparse-spikes decomposition which is continuous in time. Recent work showed that this decomposition is possible by minimising a convex variational model which combines a quadratic data fidelity with dynamical Optimal Transport. We generalise this framework and propose new numerical methods which leverage efficient classical algorithms for computing shortest paths on directed acyclic graphs. Our theoretical analysis confirms that these methods converge to globally optimal reconstructions which represent a finite number of discrete trajectories. Numerically, we show new examples for unbalanced Optimal Transport penalties and, for balanced examples, our methods are 100 times faster than the previously known method.


Introduction
The signal-processing task of dynamical super-resolution involves retrieving fine-scale features, in space and time, from a signal which evolves over time. A convex variational model was recently proposed for such tasks using Optimal Transport (OT) to regularise the associated inverse problem [11,12]. This new approach allows the decomposition of a signal into a finite sum of smooth curves, for example to track the centres of multiple particles in time with smooth trajectories. Similar ideas were explored for a specific example in [1], where the shape of curves is built into the model, without appealing to OT.
In this work we focus on dynamical super-resolution problems regularised by OT. Potential models can be partitioned into two classes, depending on whether the particles have constant mass/brightness in time or whether mass is allowed to vary. These classes are referred to as balanced and unbalanced problems respectively, mathematically encoded in the choice of OT cost. Current literature provides analysis for the balanced Benamou-Brenier (BB) [11] and the unbalanced Wasserstein-Fisher-Rao (WFR) [9] energies, both of which are shown to reconstruct data into a finite number of smooth curves with constant or smoothly varying mass. Initial numerical experiments for the Benamou-Brenier model have also been carried out showing great promise [10]; however, current methods are too slow for large-scale applications.
Let Ω be an open bounded spatial domain. At the heart of the analysis of Bredies et al. is the interplay between measures ρ(t, x) defined on the time-space cylinder [0, 1] × Ω, and measures σ(γ) defined on the space of curves γ = (h, ξ) with mass h ∈ C([0, 1]) and trajectory ξ ∈ C([0, 1]; Ω). We can think of ρ as representing the evolving physical volume that can be observed with a microscope, whereas σ more efficiently represents a collection of (trajectories of) particles that we wish to reconstruct. The structure of the problem as seen from this second viewpoint closely resembles the Beurling-LASSO, which is now well-understood [4,13,22,26]. Its main advantage is that it paves the way for "off-the-grid" numerical methods when solving such dynamical inverse problems. Recent works in the field of sparse spike recovery [6,13,19] have demonstrated that it is possible to design efficient numerical solvers without reconstructing the unknown on a grid, by exploiting a conditional gradient descent / Frank-Wolfe approach together with a good knowledge of the regularising term. Indeed, the Frank-Wolfe minimisation algorithm and its variants (see the review [25]) build iterates that are convex combinations of the extreme points of level sets of the regulariser; being able to easily encode and handle such extreme points makes it possible to solve variational problems in a continuous (or up to floating point) setting. Moreover, having iterates that are convex combinations of a few extreme points of the level sets of the regulariser is particularly relevant, as it is known in inverse problems that some solutions have precisely that structure when the observation consists of a finite number of linear measurements [7,8,30].

Motivating example
To make these observations and the contribution of this work more concrete, we will refer to the motivating example described in [11], with numerical examples in [10]. We describe this problem briefly here and postpone precise assumptions and details of function spaces to Section 6, where it arises as the δ = +∞ case. For α, β > 0, the Benamou-Brenier penalty W is a map of non-negative space-time measures ρ defined by

W(ρ) def.= inf_v { ∫_{[0,1]×Ω} (α + β|v(t, x)|²) dρ(t, x) : ∂_t ρ + div(ρv) = 0 },   (1.1)

where the continuity equation is satisfied in the weak sense which will be clarified in (1.14). The function with α = 0, β = 1 is referred to as the Benamou-Brenier energy, whose main properties can be found in [28, Sec. 5.3.1]. It was shown (cf. [11,28]) that whenever W(ρ) < +∞, there exists a "disintegration" into spatial measures {ρ_t}_{t∈[0,1]} such that the curve t ↦ ρ_t is continuous in the narrow topology and such that

∀ψ ∈ L¹([0, 1] × Ω; ρ),   ∫_{[0,1]×Ω} ψ(t, x) dρ(t, x) = ∫₀¹ ∫_Ω ψ(t, x) dρ_t(x) dt.   (1.2)

Thanks to this property, given times 0 = t₀ < t₁ < . . . < t_T = 1, it is justified to consider "slices" ρ_{t_j} and to assume that we are given data b_j ∈ R^m (for simplicity assume m does not change with j) at time t_j, corresponding to linear observations of ρ_{t_j}, given by narrowly continuous linear operators A_j. The more involved case of continuous-time observations is handled in [12], although we do not discuss it further in this work. To solve the corresponding inverse problem, [11, (43)] proposes the minimisation of

E(ρ) = (1/2) Σ_{j=0}^T ‖A_j ρ_{t_j} − b_j‖² + W(ρ).   (1.3)

It is then shown in [11, Theorem 10] that there is a minimiser ρ* which is a finite sum of extreme points of the Benamou-Brenier unit ball, i.e.
ρ* = Σ_{i=1}^{m(T+1)} a_i dt ⊗ δ_{ξ_i(t)} for some a_i ≥ 0 and ξ_i ∈ AC²([0, 1]; Ω), where AC² is the set of absolutely continuous functions whose (a.e. defined) pointwise derivative is square-integrable (see also [3, Sec. 1.1]). The sparse structure of ρ* motivates us to look at problems like (1.3) from the viewpoint of measures on paths, i.e. to have a minimiser σ* = Σ_{i=1}^{m(T+1)} a_i δ_{ξ_i}. This will enable us to greatly generalise the types of regulariser W for which minimisers are sparsely supported on curves, and to develop a new specialised tool for "off-the-grid" numerical methods in this setting.

Outline of the paper
In Section 2 we gather and present several results on the measure representation of the solutions of the continuity equation that are useful to understand the perspective of measures on paths. Our first main theoretical contribution is found in Section 3, where we introduce our main variational inverse problem and provide several theoretical results on the existence of sparse minimisers and their "geodesic" structure. Algorithmic details are discussed in Section 4, where we analyse a new stochastic Frank-Wolfe variant which is practically computable and proven to converge almost surely. This analysis is our other main theoretical contribution.
Numerical work on the Benamou-Brenier example was recently presented in [10]. The most time-consuming step is to compute new extreme points, i.e. curves γ, to add to the reconstruction. Even though only a small subset of L^∞([0, 1]; R × Ω) is involved, optimising over such a set is a challenging task, and the computation times are hardly compatible with practical applications. Our main numerical contribution is to reformulate this step into a shortest-path problem on an ordered, weighted, directed, acyclic graph in Section 5, which can be solved very efficiently using dynamical programming. Numerical experiments demonstrate the efficiency of this new technique in Section 7. In particular, we compare with the algorithm of [10] for the Benamou-Brenier penalty outlined in Section 1.1. Finally, we show new results for the unbalanced Wasserstein-Fisher-Rao penalty, which is described in detail in Section 6.

Notation
Convex sets and extreme points. Let V be a linear space. For all σ₀, σ₁ ∈ V, we define the closed line segment between σ₀ and σ₁ as [σ₀, σ₁] def.= {(1 − λ)σ₀ + λσ₁ : λ ∈ [0, 1]}, and the open segment ]σ₀, σ₁[ def.= {(1 − λ)σ₀ + λσ₁ : λ ∈ ]0, 1[} for σ₀ ≠ σ₁. We say that σ ∈ D is an extreme point (or atom) of D, and write σ ∈ Ext(D), if there are no points σ₀, σ₁ ∈ D such that σ ∈ ]σ₀, σ₁[. In other words, σ ∈ Ext(D) if and only if σ = (σ₀ + σ₁)/2 with σ₀, σ₁ ∈ D implies σ₀ = σ₁ = σ. Furthermore, it is possible to define the notion of face (and elementary face) of a convex set, which extends the notion of extreme point to higher-dimensional sets. We refer to [21] for more detail.

Measure spaces. For a separable metric space Γ and Banach space X, we define C_b(Γ; X) to be the set of continuous bounded functions from Γ to X. When X = R, we simply write C_b(Γ). Recall that for any Borel measure σ on Γ, we can define the non-negative Borel measure |σ| ∈ M⁺(Γ) by |σ|(A) def.= sup { Σ_i |σ(A_i)| : (A_i)_i a countable Borel partition of A } for all Borel measurable sets A ⊂ Γ. We denote by M(Γ) the space of signed Borel measures σ with finite total variation, i.e. ‖σ‖ def.= ∫_Γ d|σ| < +∞. The total variation ‖·‖ defines a norm on M(Γ), but it is sometimes more convenient to use the narrow topology, i.e. the weakest topology on M(Γ) which makes the integration against continuous bounded functions a continuous linear form. The narrow topology is equivalent to the weak-* topology on (C_b(Γ))'; in particular, a sequence σ_n converges narrowly to σ if and only if ∫_Γ φ dσ_n → ∫_Γ φ dσ for all φ ∈ C_b(Γ). The support of σ ∈ M⁺(Γ) is defined as supp(σ) def.= { γ ∈ Γ : σ(B) > 0 for every open neighbourhood B of γ }. This is a closed set satisfying σ(supp(σ)) = σ(Γ).

Function domains. In this work we consider two measure domains: the time-space cylinder [0, 1] × Ω, and a closed set Γ of continuous (weighted) curves, which are viewed as pairs γ = (h, ξ) where h(t) is the mass at time t and ξ(t) is the location. A formal definition will be given in Lemma 2.1. In the time-space cylinder we will denote measures ρ ∈ M([0, 1] × Ω) with test functions ψ ∈ C([0, 1] × Ω), and similarly on the space of curves σ ∈ M(Γ) and φ ∈ C_b(Γ).

Narrowly continuous measures. An important subspace of M([0, 1] × Ω) is the space of narrowly continuous measures. With a slight abuse of the standard notation, we will write ρ ∈ C_w([0, 1]; M(Ω)) if ρ admits a disintegration ρ = dt ⊗ ρ_t such that t ↦ ρ_t is continuous in the narrow topology. Given a measure on paths σ ∈ M(Γ) such that ∫_Γ ‖h‖_∞ dσ(h, ξ) < +∞, one may define the family of measures (e_t)#σ ∈ M(Ω), for t ∈ [0, 1], by ∫_Ω ψ d[(e_t)#σ] def.= ∫_Γ h(t) ψ(ξ(t)) dσ(h, ξ) for all ψ ∈ C_b(Ω). Formally, (e_t)#σ is the image measure of σ by the evaluation at time t. That family is narrowly continuous and, as we explain in Theorem 2.2 below, it is the evaluation (disintegration) of some measure Θ(σ) ∈ C_w([0, 1]; M(Ω)).
The continuity equation. In the rest of the paper, we use the following distributional definition of the continuity equation, formally

∂_t ρ + div(ρv) = gρ,   (1.13)

which expresses mass variation (or mass conservation if g = 0).

Definition 1.1. Let ρ ∈ M([0, 1] × Ω) be a measure. We say that ρ satisfies the continuity equation if there exist Borel measurable fields v : [0, 1] × Ω → R^d and g : [0, 1] × Ω → R, with ∫_{[0,1]×Ω} (|v|² + |g|²) dρ < +∞, such that for all ψ ∈ C¹_c(]0, 1[ × Ω),

∫_{[0,1]×Ω} ( ∂_t ψ + v · ∇ψ + g ψ ) dρ = 0.   (1.14)

Preliminary results
As previously mentioned, the main function space of this work is the space of measures on paths: in particular, M(Γ), where Γ ⊂ Γ₀ and Γ₀ is the set of continuous weighted paths in Ω, modelling particles with (varying) mass h(t) at location ξ(t), defined by

Γ₀ def.= { γ = (h, ξ) : h ∈ C([0, 1]), ξ : [0, 1] → Ω continuous at every t with h(t) ≠ 0 }.

For technical reasons we permit curves ξ which may not be continuous at points t where h(t) = 0. Intuitively, the location ξ(t) is not necessarily meaningful if the particle has no mass and cannot be observed.
In this section we review the necessary assumptions for M(Γ) to be a sufficiently well-behaved space. Firstly, we require Γ to be a complete separable metric space. We follow the suggestion of [9, Prop. 3.6], which uses the flat metric on the space of measures, and work with the isometric space (Γ₀, d_Γ), the properties of which are given by the following lemma.

Lemma 2.1. Let ∼ be the equivalence relation on Γ₀ identifying (h, ξ) ∼ (h', ξ') whenever h = h' and ξ(t) = ξ'(t) at every t with h(t) ≠ 0. Then (Γ₀/∼, d_Γ) is a complete separable metric space. Convergence of a sequence γ_n = (h_n, ξ_n) ∈ Γ₀ in the metric d_Γ can equivalently be stated as: h_n → h uniformly on [0, 1] and, for every ψ ∈ C_b([0, 1] × Ω),

lim_{n→∞} h_n(t)ψ(t, ξ_n(t)) = h(t)ψ(t, ξ(t)), uniformly in t.   (2.6)

The proof is elementary but given in Appendix A for completeness, as no specific reference could be found.
A key analytical tool in related prior works (cf. [9-12]) is a mapping between measures ρ ∈ M([0, 1] × Ω) which satisfy the continuity equation and measures on (weighted) paths σ ∈ M(Γ₀), making some structures become more apparent. We will now recap and expand on those previous results. The first theorem collects results from [3,9] and provides minor extensions for the scope of the current work. In particular, we remove the necessity for h ≥ 0 or elements (h, ξ) to satisfy further smoothness conditions.

Theorem 2.2. Let σ ∈ M(Γ₀). If ∫_{Γ₀} ‖h‖₁ d|σ|(h, ξ) < +∞, then there is a unique finite Borel measure Θ(σ) ∈ M([0, 1] × Ω) such that, for all ψ ∈ C([0, 1] × Ω),

∫_{[0,1]×Ω} ψ dΘ(σ) = ∫_{Γ₀} ∫₀¹ h(t) ψ(t, ξ(t)) dt dσ(h, ξ).   (2.7)

Moreover, the mapping Θ : M(Γ₀) → M([0, 1] × Ω) is linear and continuous in the appropriate topologies.

These results are mainly proved in [3,9]; we extend to signed measures σ in the appendix (Theorem A.2). The results achieve two key relations: characterising when the mapping from M([0, 1] × Ω) to M(Γ₀) is well-defined, and when Θ(σ) ∈ C_w([0, 1]; M(Ω)). Weak continuity is of practical importance in applications: without it, for example in the application of microscopy, we could not consider one frame of video to correspond to a single instance in time. Unfortunately, we have seen that not all σ ∈ M(Γ₀) satisfy this smoothness requirement. However, in the next lemma we will confirm that, if Γ ⊂ Γ₀ is sufficiently "small", then Θ(σ) ∈ C_w([0, 1]; M(Ω)) for all σ ∈ M(Γ), due to the implicit assumption that ‖σ‖ < +∞. A related property is the continuity of the operator Θ, which we also confirm.

Lemma 2.3. For suitable subsets Γ_p ⊂ Γ₀, the operator Θ : M(Γ_p) → C_w([0, 1]; M(Ω)) is well-defined and continuous. In particular, sequentially we have that, for any sequence σ_n ⇀ σ narrowly in M(Γ_p),

Θ(σ_n) ⇀ Θ(σ) and (e_t)#σ_n ⇀ (e_t)#σ narrowly in M(Ω), for every t ∈ [0, 1].   (2.15)

The proof of this lemma is found in Lemma A.4, relying heavily on the definition of continuity given by [31].
Remark 2.4. Note that the definition of Γ_p is consistent with the relation ∼, so (Γ_p/∼, d_Γ) is also a metric space. In the rest of the paper, we require that Γ is a closed subset of Γ_∞/∼ in order for it to be a complete separable metric space with Θ : M(Γ) → C_w([0, 1]; M(Ω)).
Summarising the results of this section: in the remainder of this work we want to use a domain D ⊂ M⁺(Γ₀) such that Θ|_D has nice properties with respect to the space C_w([0, 1]; M(Ω)). The combination of Theorem 2.2 and Lemma 2.3 shows that it is sufficient to consider either convention of Remark 2.4. Analytically we will always consider D ⊂ M⁺(Γ_∞) as it is more concise, although numerically either convention is equivalent. The generality of allowing any closed subset Γ ⊂ Γ_∞ allows us to treat different applications with the same analysis, for example:

Curves with constant mass: this enforces balanced transport (e.g. the Benamou-Brenier example [11]). If σ ∈ M⁺(Γ), then Θ(σ) ≥ 0 and mass is preserved on paths.

Curves with non-negative mass h ≥ 0: this allows unbalanced transport of non-negative mass (e.g. the Wasserstein-Fisher-Rao example [9]). We still have Θ(σ) ≥ 0, but t ↦ ∫_Ω dΘ(σ)_t is not (necessarily) constant. In particular, mass can be created or destroyed (continuously) at any time.

Γ ⊂ Γ_∞: in the general case Θ(σ) is a general signed measure, even when σ ≥ 0. In words, σ can give positive weight to curves with negative mass. The only constraint is that Θ(σ) ∈ C_w([0, 1]; M(Ω)) is continuous in time.
The variational inverse problem

Fix times 0 = t₀ < t₁ < . . . < t_T = 1 and, for each j, a linear observation operator A_j : M(Ω) → R^m of the form

(A_j ρ)_i = ∫_Ω a_i^j(x) dρ(x).   (3.1)

As stated at the end of Section 2, we work with a closed set of curves so that, for all t ∈ [0, 1], the map (e_t)# : M(Γ) → M(Ω) is narrowly continuous. We therefore choose a data fidelity F : M(Γ) → [0, +∞) of the form F(σ) = Σ_{j=0}^T F_j(A_j (e_{t_j})#σ), where each F_j : R^m → [0, +∞) is convex and smooth (encoding the data b_j). For lower semi-continuous w, ϕ : Γ → [0, +∞], define W : M⁺(Γ) → [0, +∞] by W(σ) = ∫_Γ w(γ) dσ(γ), and the constraint set D def.= { σ ∈ M⁺(Γ) : ∫_Γ ϕ(γ) dσ(γ) ≤ 1 }. We consider minimising the energy E : D → [0, +∞],

E(σ) def.= F(σ) + W(σ).   (3.5)
The motivation behind this in an Inverse Problems setting is that F represents a smooth data fidelity with linear observations recorded at times t_j, and the combination of W and D represents a regularisation of the problem. The choice of ϕ (hence of D) is often made to ensure the well-posedness of the model, but we are mostly interested in cases where the constraint ∫_Γ ϕ(γ) dσ ≤ 1 is not active, so that the choice of ϕ has no impact on the set of minimisers.

Existence of sparse minimisers
Theorem 3.1. Any choice of energy in this framework leads to a sparse reconstruction in the space of curves: there exists a minimiser σ* of E over D of the form σ* = Σ_{i=1}^s a_i δ_{γ_i} for some a_i ≥ 0, γ_i ∈ Γ, and s ≤ m(T + 1) + 1, with s ≤ m(T + 1) whenever ∫_Γ ϕ dσ* < 1.

The proof is given in Appendix B. The smaller value of s will often be valid in practice; for example, any choice ϕ(γ) ≤ w(γ)/(E(0) + 1) is always sufficient.

Remark 3.2. The Benamou-Brenier example from Section 1.1 can be formulated in this setting with

w(γ) = α + β ∫₀¹ |ξ'(t)|² dt for γ = (1, ξ).   (3.7)

More details are given in Section 6, where the Benamou-Brenier example is the limiting case δ → +∞. The choice of ϕ in [10] was equivalent to ϕ(γ) = w(γ)/E(0), whereas we suggest the default of ϕ(γ) = α/E(0), which is easier to analyse. Both functions are strictly positive but sufficiently small to ensure ∫_Γ ϕ dσ* < 1.

Discrete-time formulation
Until now we have formulated E as a function of measures σ ∈ M(Γ), but in this subsection we show that it can be thought of equivalently as a function Ẽ of σ̃ ∈ M(Γ̃), where Γ̃ collects the values of curves at the sample times t_j. For the remainder of this section, without loss of generality, we assume there is a minimiser of the energy considered in (3.5) which satisfies ∫ ϕ dσ* < 1. If that is not the case, one can use a Lagrange multiplier to form a modified energy with w ← w + λϕ for some λ ≥ 0. Our key observation is that σ* must be supported on "geodesics" of w (or w + λϕ) which interpolate the discrete-time curves.
Proof of Lemma 3.4. For γ̃ ∈ Γ̃, let G(γ̃) def.= argmin { w(γ) : γ ∈ Γ, γ(t_j) = γ̃_j for all j } denote the set of geodesics interpolating γ̃, and let the set U ⊂ Γ̃ × Γ be given by U def.= { (γ̃, γ) : γ ∈ G(γ̃) }. Note that both U and G(γ̃) are closed due to the lower semi-continuity of w. Finally, addressing the requirements of [14, Thm. 1], we know that (Γ̃/∼, d_Γ) is a complete separable metric space, U is a (closed) Borel set, and G(γ̃) is compact for each γ̃ ∈ Γ̃ (as it is either empty or a closed subset of a sub-levelset of w). We conclude the existence of a measurable selection g as described in the claim.
We can now return to the main result of this subsection.
This shows we can perform computations in the discrete-time space Γ̃ and later lift curves back to Γ using the geodesics G. This holds both pointwise between Γ̃ ↔ Γ and with measures M(Γ̃) ↔ M(Γ).

Remark 3.5. For the Benamou-Brenier example, w is given in Remark 3.2. If Ω is convex, then G(γ̃) = {γ} where γ is the unique piecewise linear interpolant of the points (t_j, γ̃_j), as commented in [10, Rem. 4.10]. Also,

w̃(γ̃) def.= min_{γ ∈ G(γ̃)} w(γ) = α + β Σ_{j=1}^T |γ̃_j − γ̃_{j−1}|² / (t_j − t_{j−1}).

We therefore know that all minimisers σ* are supported on the set of curves which are piecewise linear between the times t_j. Restricting to this domain of curves is much more computationally convenient without losing analytical accuracy.
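To make the lifting concrete, here is a minimal sketch in Python of the piecewise-linear lift and the induced discrete-time weight. The names `lift` and `w_discrete` are ours, and the weight follows the convention w̃(γ̃) = α + β Σ_j |γ̃_j − γ̃_{j−1}|²/(t_j − t_{j−1}) used above:

```python
import numpy as np

def lift(t, xi):
    """Return the piecewise-linear interpolant of the knots (t_j, xi_j),
    i.e. the unique element of G(xi) when Omega is convex."""
    t, xi = np.asarray(t), np.asarray(xi)
    def gamma(s):
        # locate the interval [t_j, t_{j+1}] containing s
        j = int(np.clip(np.searchsorted(t, s, side="right") - 1, 0, len(t) - 2))
        lam = (s - t[j]) / (t[j + 1] - t[j])
        return (1 - lam) * xi[j] + lam * xi[j + 1]
    return gamma

def w_discrete(t, xi, alpha, beta):
    """Discrete-time weight alpha + beta * sum |xi_j - xi_{j-1}|^2 / dt_j,
    which equals w(gamma) for the piecewise-linear lift gamma."""
    dxi = np.diff(np.asarray(xi), axis=0)
    dt = np.diff(np.asarray(t))
    return alpha + beta * np.sum(np.sum(dxi**2, axis=-1) / dt)
```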

Frank-Wolfe convergence
While Theorem 3.1 guarantees an analytical structure of minimisers, we must now choose a reconstruction algorithm which is capable of taking advantage of this structure. As previously stated, we will use a variant of the Frank-Wolfe algorithm to take advantage of the sparse structure of reconstructions.

The Frank-Wolfe algorithm
In order to derive a variant of the Frank-Wolfe algorithm with inexact or stochastic steps, we first need to highlight a few properties which, to the best of our knowledge, have not been stated in the literature. We refer the reader to [25] for a thorough introduction to the Frank-Wolfe algorithm. For the sake of generality and future reference, we work in an abstract setting, assuming that we want to minimise a convex function f : D → R, where D is a nonempty convex subset of a Hausdorff locally convex vector space X.
Standard results in convex analysis [2, Chap. 7] ensure that for all x ∈ D and all h ∈ (D − x), the directional derivative f'(x; h) def.= lim_{λ→0⁺} (f(x + λh) − f(x))/λ exists in R ∪ {−∞} and is a convex function of h. The main steps of the Frank-Wolfe algorithm are given in Algorithm 1. The standard choice [18,24] uses the Linear Minimisation Oracle defined as

LMO(x) ∈ argmin_{s ∈ D} f'(x; s − x).   (4.1)

In the large majority of cases the existence of the LMO is implied by f being Gâteaux-differentiable and D compact. The analysis in [25] stresses that Algorithm 1 with the linear minimisation oracle (4.1) yields a minimising sequence, i.e. f(x_n) → inf_D f.
In situations where the LMO is not easy to compute, we can instead choose any s_{n+1} ∈ D and measure its suitability using the primal-dual gap

∀x, s ∈ D,   gap(x; s) def.= f'(x; s − x) − min_{s' ∈ D} f'(x; s' − x).

In general gap ≥ 0 and gap(x; LMO(x)) = 0; suppose gap(x_n; s_{n+1}) ≤ ε_n for some controlled error ε_n ≥ 0. It has previously been shown in [25, Thm. 1] that x_n is a minimising sequence whenever ε_n decays sufficiently fast. In our context, even the guarantee ε_n = O(1/n) would be prohibitive for large n: the linear minimisation oracle consists of finding a curve γ ∈ Γ which minimises a certain energy, see Section 4.3 below. Instead, we propose to use random discretisations of the domain so that implicitly lim inf_{n→+∞} ε_n = 0, without a guaranteed rate. In comparison to the result of [25], the relaxed assumption on ε_n is accounted for by the linesearch in Line 4, which implicitly selects large/small steps when ε_n is correspondingly large/small, i.e. when we have a lucky/unlucky draw. Our only assumptions on f are that it is Gâteaux differentiable with bounded curvature (C_f < +∞) and gap(x; s) < +∞ for all x, s ∈ D. Then, in the case that lim inf_{n→+∞} ε_n = 0 (possibly almost surely), we show that x_n is a minimising sequence (almost surely).
One final aspect of our algorithm is the freedom which is granted in Line 5: one may choose any point which has lower energy than the one provided by the linesearch. This is now a standard addition to the Frank-Wolfe algorithm to enable much faster practical convergence, see for instance [6,13,19]. We exploit this in Section 7.
To ease the analysis of the randomised algorithm, we first analyse the deterministic version.
Proposition 1. Let f, D be as above with C_f < +∞, and assume that lim inf_{n→+∞} gap(x_n; s_{n+1}) = 0; then Algorithm 1 yields a minimising sequence.
Algorithm 1 Abstract Frank-Wolfe algorithm
1: Choose x₀ ∈ D
2: repeat
3:   Choose s_{n+1} ∈ D   (oracle)
4:   Choose λ_{n+1} ∈ argmin_{λ∈[0,1]} f(x_n + λ(s_{n+1} − x_n))   (linesearch)
5:   Choose x_{n+1} ∈ D with f(x_{n+1}) ≤ f(x_n + λ_{n+1}(s_{n+1} − x_n))
6:   n ← n + 1
7: until converged

Proof. First observe, from Lines 4 and 5 and by definition of the linesearch, that for all λ ∈ [0, 1],

f(x_{n+1}) ≤ f(x_n + λ(s_{n+1} − x_n)).   (4.6)

Note from the λ = 0 case we have f(x_{n+1}) ≤ f(x_n), so the limit ℓ def.= lim_{n→+∞} f(x_n) exists. We are required to show that ℓ = inf_{x∈D} f(x). The case ℓ = −∞ is trivial; otherwise we also have lim_{n→+∞} f(x_{n+1}) − f(x_n) = 0. Now, taking the lim inf on both sides of (4.6), expanding the right-hand side to first order in λ, and dividing by λ → 0⁺, we obtain lim sup_{n→+∞} min_D f'(x_n; · − x_n) ≥ 0. On the other hand, by convexity, f(x) ≥ f(x_n) + f'(x_n; x − x_n) for all x ∈ D. Since x ∈ D is arbitrary, we deduce that ℓ = lim_{n→∞} f(x_n) = inf_D f, as required.
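For orientation, the structure of Algorithm 1 can be sketched as follows in a finite-dimensional setting. This is an illustrative implementation under our own naming (`oracle`, `improve`), with a brute-force grid linesearch standing in for Line 4; it is not the solver used in the experiments.

```python
import numpy as np

def frank_wolfe(f, grad, x0, oracle, improve=None, max_iter=100):
    """Abstract Frank-Wolfe (Algorithm 1): inexact oracle + linesearch.

    oracle(x, g): returns any candidate s in D (Line 3);
    improve(x):   optional post-processing, accepted only if it lowers f (Line 5).
    """
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        s = oracle(x, g)                            # Line 3: any s in D
        # Line 4: linesearch on the segment [x, s] (brute-force grid here)
        lams = np.linspace(0.0, 1.0, 101)
        vals = [f(x + lam * (s - x)) for lam in lams]
        x_ls = x + lams[int(np.argmin(vals))] * (s - x)
        # Line 5: accept any point with lower energy than the linesearch point
        x_new = improve(x_ls) if improve is not None else x_ls
        if f(x_new) > f(x_ls):                      # safeguard monotonicity
            x_new = x_ls
        x = x_new
    return x
```

Passing oracle(x, g) = argmin_{s∈D} ⟨g, s − x⟩ recovers the standard method; substituting a minimiser over a random finite subset of D gives the stochastic variant analysed below.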

Stochastic variant
Now we may study the behaviour of the algorithm in a stochastic framework. We build a random process (X_n, S_n)_{n∈N*} in D × D by considering a random initialisation X₀ in D and applying Algorithm 1, picking a random point S_{n+1} in D at Line 3. Note that, at each step, S_{n+1} may depend on {X_k}_{k=0}^n and {S_k}_{k=1}^n. Typically, as in Section 4.3, we consider a setting where solving (4.1) exactly is too costly, and where one draws a random grid on which to perform the optimisation. The variable S_{n+1} is then a minimiser of f'(X_n; · − X_n) among a finite, small subset of D.
Let (𝒮_n)_{n∈N} be the filtration generated by that random process, i.e. 𝒮_n is the σ-algebra generated by {X_k}_{k=0}^n and {S_k}_{k=1}^n.

Proposition 2. Let (X_n)_{n∈N} and (S_n)_{n∈N*} be as described above, and let (𝒮_n)_{n∈N} be the filtration they generate. If, for all ε > 0 and all n ∈ N, there is some deterministic p_n(ε) > 0 such that

P( gap(X_n; S_{n+1}) ≤ ε | 𝒮_n ) ≥ p_n(ε)   and   Σ_{n∈N} p_n(ε) = +∞,

then (X_n)_{n∈N} is a minimising sequence almost surely.

The above proposition is a consequence of the following lemma, setting G_{n+1} = gap(X_n; S_{n+1}).

Lemma 4.1. Let (𝒮_n)_{n∈N} be a filtration, and (G_n)_{n∈N} a family of random variables such that for each n, m ∈ N with n ≤ m, G_n is 𝒮_m-measurable. If, for all ε > 0 and n ∈ N, there is some deterministic p_n(ε) ≥ 0 such that

P( G_{n+1} ≤ ε | 𝒮_n ) ≥ p_n(ε)   and   Σ_{n∈N} p_n(ε) = +∞,

then lim inf_{n→+∞} G_n ≤ 0 almost surely.

In the proof, letting M → +∞ in the conditional Borel-Cantelli estimate, we get P( ∩_{n≥N} {G_n ≥ ε} ) = 0, that is, P( ∪_{n≥N} {G_n < ε} ) = 1. As a result, lim inf_{n→+∞} G_n ≤ ε almost surely, and taking ε → 0 along a countable sequence concludes.

To summarise, the main convergence requirement of Proposition 1 is to guarantee lim inf_{n→+∞} G_n ≤ 0. Our solution to this is to use random discretisations so that the stochastic variant of Algorithm 1 still converges asymptotically (almost surely), but the complexity of each individual iteration remains low. A similar idea in the setting of stochastic Frank-Wolfe was pursued in [29], where the proof relies on what is called Assumption P.8; in our notation this requires the sum Σ_{n=0}^∞ λ_n G_n to be finite almost surely, where λ_n is chosen deterministically. To apply this algorithm in our setting we would therefore need to bound the magnitude of G_n relative to the a priori choice of λ_n. With Proposition 2 we overcome this limitation via the linesearch for λ, so the only remaining requirement is to guarantee a uniform probability of achieving a good search direction.

Back to the dynamic inverse problem
We consider again the setting of Section 3. The function E is convex on D and, for each σ ∈ D, its directional derivative is obtained by linearising the smooth fidelity F. In particular, we can write E'(σ; µ − σ) = ∫_Γ E'(σ) dµ − ∫_Γ E'(σ) dσ, where E'(σ) : Γ → R is a continuous "dual variable" combining w with the linearisation of F. This structure enables us to prove that E satisfies the major requirement of Section 4, that inf_{µ∈D} E'(σ; µ − σ) > −∞ for all σ ∈ D.
To do this, we show that the infimum is always achieved by an extreme point of D.
In both cases we see there is a choice LMO(σ) = σ* ∈ Ext(D). Finally, by Lemma B.5 we have the characterisation of the extreme points of D, so we deduce (4.15) as required.
This lemma is also useful for applying the stochastic Frank-Wolfe algorithm, Lemma 4.1. For each n ∈ N, let z_n ∼ Z be an independent random discretisation of the domain Γ̃ = ([−1, 1] × Ω)^{T+1} (following the discrete-time formulation in Section 3.2). Continuing with the notation E' = E'(σ_{n−1}) from (4.16), we will use the discrete minimiser, denoted µ_n, computed over z_n. The remaining requirements for applying both Frank-Wolfe variants simplify greatly in the Benamou-Brenier example, where ϕ > 0 is a constant, curves have constant mass, and Ω = ]0, 1[^d. An immediate consequence of this is uniform boundedness, as ‖σ‖ = ∫_Γ dσ ≤ ϕ₀^{−1} for all σ ∈ D. Because E is quadratic and the A_j are bounded (see (3.1), (A_jρ)_i = ∫_Ω a_i^j dρ), the curvature bound C_E < +∞ follows immediately. Finally, we must show that gap(σ_{n−1}; µ_n) is uniformly small, independently of σ_{n−1}. Ignoring the σ* = 0 case, this quantity can be written in terms of the values of E'(σ_{n−1}) on the grid z_n, where, from (3.7) and (3.16), for all ξ ∈ Ω^{T+1} the weight is w̃(ξ) = α + β Σ_j |ξ_j − ξ_{j−1}|²/(t_j − t_{j−1}) for some α, β > 0. If, moreover, the A_j are Lipschitz, then E'(σ_{n−1}) is also uniformly Lipschitz, independently of σ_{n−1}.
Combining this uniform Lipschitz property with, for example, uniformly sampled discrete grids z_n guarantees the uniform bound required for Lemma 4.1.
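To illustrate how this linearisation feeds the numerics, here is a sketch of assembling node weights η_j on a mass-location grid for the quadratic fidelity F_j(a) = ½‖a − b_j‖². The interface (`grids`, `masses`, `a`, `residuals`) and the assumed form η_j(h, x) = h ⟨A_j(e_{t_j})#σ − b_j, a^j(x)⟩ are our own simplifications, not the reference implementation:

```python
import numpy as np

def node_weights(grids, masses, a, residuals):
    """eta[j][p, q] = h_q * <residual_j, a^j(x_p)> for location x_p in grids[j]
    and mass h_q in masses[j]; the transport cost is handled by step_j.

    a(j, xs): callable returning the (len(xs), m) array of kernel values a^j_i(x_p)
    residuals[j]: the length-m residual A_j (e_{t_j})# sigma - b_j
    """
    eta = []
    for j, (xs, hs, r) in enumerate(zip(grids, masses, residuals)):
        inner = a(j, xs) @ r            # <residual_j, a^j(x_p)> per location
        eta.append(np.outer(inner, hs)) # scale by each candidate mass
    return eta
```

Flattening the position-mass pairs of each layer into a single node list Λ_j gives exactly the input expected by the dynamic program of Section 5.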

Choice of constraint
In this section we briefly discuss the main difference between our formulation of the Benamou-Brenier example and that implemented in [10]. Although the construction in [10] looks very different, it can also be seen as a Frank-Wolfe/Generalised Conditional Gradient approach to minimise the same function E (this equivalence is confirmed in Section 6). The parallel result to Lemma 4.2 is [10, Prop. 3.6], which shows that they too incrementally add new curves ξ to the support of the reconstruction σ at each iteration. Both implementations actually use a stochastic variant of Frank-Wolfe, but we will discuss the classical version for simplicity.
Recall from Remark 3.2 that the unconstrained energy is E(σ) = F(σ) + ∫_Γ w dσ, where w(ξ) = α + β ∫₀¹ |ξ'(t)|² dt. The constraint set of [10] is D₁ def.= { σ ∈ M⁺(Γ) : ∫_Γ w dσ ≤ E(0) }, which does contain the minimisers of (4.25), since they must satisfy ∫_Γ w dσ ≤ E(σ) ≤ E(0). By Lemma B.5, the set of extreme points of D₁ is {0} ∪ { (E(0)/w(ξ)) δ_ξ | ξ ∈ Γ }, which yields the descent direction

s_{n+1} ∈ argmin { E'(σ_n; s − σ_n) : s ∈ { (E(0)/w(ξ)) δ_ξ | ξ ∈ Γ } ∪ {0} }.   (4.28)

That is precisely the descent direction used in [10, (4.30)]. On the other hand, we propose to use the constraint ∫_Γ α dσ ≤ E(0), which also contains the minimisers of (4.25), since they must satisfy ∫_Γ α dσ ≤ ∫_Γ w dσ ≤ E(σ) ≤ E(0). Applying Lemma B.5, we see that the corresponding set of extreme points is {0} ∪ { (E(0)/α) δ_ξ | ξ ∈ Γ }, yielding the direction

s_{n+1} ∈ argmin { E'(σ_n; s − σ_n) : s ∈ { (E(0)/α) δ_ξ | ξ ∈ Γ } ∪ {0} }.   (4.30)

We found (4.30) more convenient than (4.28), as it is amenable to dynamic programming techniques (see Section 5). The whole minimisation algorithm is summarised in Algorithm 2.

Dynamical programming method
Throughout Section 4 we have shown that we can find a minimising sequence for E from Section 3 by repeatedly evaluating a simplified linear oracle. In particular, we compute the minimiser from Lemma 4.2, but over a discretised domain. In Section 3.2 we motivated discretising in time to Γ̃ = ([−1, 1] × Ω)^{T+1} without loss of precision; now we also discretise in space using the domain

Λ def.= Π_{j=0}^T Λ_j for some discrete sets Λ_j ⊂ [−1, 1] × Ω ("grids" in the mass-location space).

Whilst Λ is much smaller than Γ̃, the naive computation time for γ* still scales exponentially in T. We propose to compute this discrete minimiser efficiently using dynamical programming, for which we require two final simplifications:

(A1) ϕ(γ) = ϕ₀ for some constant ϕ₀ > 0, and
(A2) w(γ) = Σ_{j=1}^T step_j(γ(t_{j−1}), γ(t_j)) for some step_j : ([−1, 1] × Ω)² → [0, +∞].

Remark 5.1. For the Benamou-Brenier example, these assumptions are satisfied by the choice

step_j(γ(t_{j−1}), γ(t_j)) = α(t_j − t_{j−1}) + β |ξ(t_j) − ξ(t_{j−1})|² / (t_j − t_{j−1})

for all γ = (h, ξ) ∈ Γ. We will see later in Section 6.2 that these assumptions are also satisfied in the Wasserstein-Fisher-Rao example. In particular, the step function of the Benamou-Brenier penalty is Wasserstein optimal transport, and the step function of the dynamic Wasserstein-Fisher-Rao penalty is the static Wasserstein-Fisher-Rao penalty.
Under assumptions (A1) and (A2), the discretised optimisation problem (4.15) can be greatly simplified to

γ* ∈ argmin_{γ ∈ Λ} Σ_{j=0}^T η_j(γ_j) + Σ_{j=1}^T step_j(γ_{j−1}, γ_j),   (5.3)

where, for all j = 0, . . ., T, η_j(h_j, ξ_j) denotes the linearised data fidelity at time t_j evaluated at the node (h_j, ξ_j). This minimisation problem can now be formulated as computing the minimal path on a weighted, directed, acyclic graph. The vertices are the points (t_j, y) for y ∈ Λ_j, the edge weights are given by η_j and step_j, and the time index provides an ordering and prevents cycles in the graph. Algorithms for computing minimal paths on directed acyclic graphs are well studied; the complexity bound is given below.

Theorem 5.2. The minimiser γ* of (5.3) can be computed in

O( Σ_{j=0}^T |Λ_j| cost(η_j) + Σ_{j=1}^T |Λ_{j−1}| |Λ_j| cost(step_j) )   (5.4)

operations, where cost(η_j) is the cost of evaluating η_j once, etc.
Remark 5.3. In graph terminology, the two terms of (5.4) represent the number of vertices plus the number of edges respectively, asymptotically O(NT) and O(N²T) with N = max_j |Λ_j|. If, for example, the paths have a maximum velocity V, then the number of edges per vertex is reduced from O(N) to the number of grid points reachable at speed V, shrinking the quadratic term accordingly.

Finally, we outline how the minimal path can be computed efficiently in our specific example by using dynamic programming. To do so, define the truncated energies and minimal paths:

E*[y, J] def.= min { Σ_{j=0}^J η_j(γ_j) + Σ_{j=1}^J step_j(γ_{j−1}, γ_j) : γ_j ∈ Λ_j for j < J, γ_J = y },
Y*[y, J] def.= a minimising path in the problem above,

for each J = 0, . . ., T and y ∈ Λ_J. This generates γ* through the computation

γ* = Y*[y*, T] with y* ∈ argmin_{y ∈ Λ_T} E*[y, T].   (5.7)

Observe that for all J = 1, . . ., T and y ∈ Λ_J,

E*[y, J] = η_J(y) + min_{y' ∈ Λ_{J−1}} ( E*[y', J − 1] + step_J(y', y) ).   (5.9)

In particular, we can choose Y*[y, J] inductively to be

Y*[y, J] = (Y*[y', J − 1], y), where y' achieves the minimum in (5.9).   (5.10)

For each y and J, each of these steps requires O(N) computation, confirming the global complexity of O(N²T).
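The recursion (5.7)-(5.10) is only a few lines of code. The following is a minimal sketch (our naming: `eta[j]` holds the node weights on Λ_j and `step(j, p0, p1)` the edge weight between the p0-th node of Λ_{j−1} and the p1-th node of Λ_j); it returns the minimal energy and path in O(N²T) operations, matching Theorem 5.2:

```python
import numpy as np

def shortest_path(eta, step):
    """Dynamic programming on the layered DAG of Section 5.

    eta:  list of 1D arrays, eta[j][p] = eta_j(y_p) for y_p in Lambda_j
    step: step(j, p0, p1) = step_j(y_{p0}, y_{p1}) (indices into the grids)
    Returns (E_star, path) where path[j] indexes Lambda_j; cf. (5.7).
    """
    T = len(eta) - 1
    E = np.asarray(eta[0], dtype=float).copy()          # E*[., 0]
    back = []
    for j in range(1, T + 1):
        n_prev, n_cur = len(eta[j - 1]), len(eta[j])
        # inner term of (5.9): E*[y', j-1] + step_j(y', y)
        cost = np.array([[E[p0] + step(j, p0, p1) for p1 in range(n_cur)]
                         for p0 in range(n_prev)])
        best = cost.argmin(axis=0)                      # predecessor, (5.10)
        E = np.asarray(eta[j], dtype=float) + cost[best, np.arange(n_cur)]
        back.append(best)
    p = int(E.argmin())                                 # y* in (5.7)
    path = [p]
    for best in reversed(back):                         # unwind Y*[y*, T]
        p = int(best[p])
        path.append(p)
    return float(E.min()), path[::-1]
```

Each layer costs O(N²) edge evaluations and O(N) node evaluations, so the total matches the vertex-plus-edge count of (5.4).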

The unbalanced Wasserstein-Fisher-Rao example
The two numerical examples considered in this work use the Benamou-Brenier and Wasserstein-Fisher-Rao (WFR) penalties. The former has already been discussed in previous remarks and is a limiting case of the WFR penalty, so we do not discuss it in further detail here.
The penalty considered in [9] is

W(ρ) def.= inf_{(v,g)} { ∫_{[0,1]×Ω} ( α + β( |v(t, x)|² + δ² |g(t, x)|² ) ) dρ(t, x) : ∂_t ρ + div(ρv) = gρ }   (6.1)

for some α, β, δ > 0, where the continuity equation is satisfied in the sense of (1.14). This leads to the energy (6.8) of the form studied in Section 3, where the properties of A_j are as stated in (3.1). Much is already known about minimisers of this energy; see Theorem 6.1 below, where minimising curves γ = (h, ξ) are characterised. The only aspect not directly covered by [9] is the existence of the minimal pair (v, g). To confirm this, note that the set of admissible pairs (v, g) with bounded energy is bounded, hence compact in the weak topology of L²_ρ. There is therefore a weakly-convergent sequence converging to a point (v, g) which achieves the desired infimum in (6.1). As the weak form of the continuity equation is preserved under weak limits in (v, g), the triplet (ρ, v, g) also satisfies the continuity equation.

Reformulation in the space of measures on paths
The results of Theorem 6.1 highlight the close relationship between the representations ρ and σ, i.e. dynamical measures and measures on paths. We now reformulate the energy E : M⁺([0, 1] × Ω) → R ∪ {+∞} from (6.8) into an equivalent energy of the form in Section 3, defined on M⁺(Γ) with curve weight w given by (6.9).

Remark 6.2. Since, for all (h, ξ) ∈ Γ, √h and √h ξ are absolutely continuous, they are differentiable almost everywhere in [0, 1]; hence ξ is differentiable at a.e. t such that h(t) > 0, and the integrand in (6.9) makes sense.

We will use the operator Θ : M(Γ) → M([0, 1] × Ω) introduced in Section 2 to map between σ and ρ. The formula in (6.9) comes from [9, (3.9)], but there it was only shown that W(δ_γ) = W(Θ(δ_γ)). We cannot hope that this holds on the whole of M(Γ) because Θ is not a one-to-one mapping; M(Γ) is a much larger space than M([0, 1] × Ω). The next lemma is needed to confirm the equivalence of minimisers between the two energies.

Lemma 6.3. Choosing ϕ(γ) = α/E(0), the function E is lower semi-continuous with compact sub-levelsets and has sparse minimisers in the sense of Theorem 3.1 (with the convention min ∅ = +∞).

Proof. The only requirement of Theorem 3.1 which is not explicitly assumed is that the function w is lower semi-continuous with compact sub-levelsets. This is proved in the appendix, in Lemma B.4.

Discrete-time formulation
Problems of the form (6.8) also have a discrete-time structure inherited from discrete-time data. The same argument from Section 3.2 is valid for the WFR penalty, although it is very hard to find the explicit form of the induced step function. Analytically, the WFR penalty can be expressed in the form required for Assumption (A2); in particular, geodesics of the WFR penalty satisfy an Euler-Lagrange equation whose left-hand side only involves up to first-order derivatives, so only the zeroth-order constraints are needed to interpolate on the intervals ]t_{j−1}, t_j[. The exact shapes or formulae for the WFR geodesics are not so convenient as for the Benamou-Brenier penalty (δ → +∞), although for α = 0 they are known from [16, Thm. 5.6]. More details can be found in Corollary 4.1(i) of the preprint (https://arxiv.org/pdf/1506.06430v2.pdf) of [15]. In summary, the map γ̃ ↦ argmin_{γ∈Γ} { w(γ) | γ(t_j) = γ̃(t_j) } is continuous, h(t) is a simple quadratic on each ]t_j, t_{j+1}[, and ξ(t) follows a straight line between ξ(t_j) and ξ(t_{j+1}) with speed varying like arctan.

Numerical results
For numerical experiments we implement variants of Algorithm 1 using the linear-oracle strategy discussed in Section 5. The expanded form of this algorithm is given in Algorithm 3. The choices of A_j, F_j, t_j, and step_j define the optimisation problem; the final choices of Λ_j and k dictate the variant of the algorithm. We always choose the sliding step to select a local minimum of E (or E) as computed by an implementation of the L-BFGS algorithm. All code to reproduce the results and figures in this work can be found online¹.

¹ https://gitlab.inria.fr/rtovey/DP-for-dynamic-IPs
The classical sliding Frank-Wolfe algorithm (Algorithm 1) can be recovered by choosing Λ_j = [−1, 1] × Ω and k = 1. The potential for k > 1 was considered in [10, Sec. 5.1.5] as a "multistart" parameter. The idea is that the approximation of optimal curves (i.e. Y*[•, T]) is very expensive, so some of the other near-optimal curves should also be used to improve the efficiency of the algorithm. Suppose Λ_j is chosen to approximate [−1, 1] × Ω with a grid of M masses in [−1, 1] and N^d points in Ω. Assuming T is fixed, Theorem 5.2 states that the computational complexity of the linear oracle at every iteration is O((N^d M)²). If the "true curves" are easy to find, then we can hope to reduce this to O((N^d M)² k^{−1}) per curve (see [20,23]).
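A standard way to realise the multistart idea is to propagate the k smallest partial energies per vertex in the same recursion; the following sketch (our naming, not the implementation of [10] or of our released code) returns the k best discrete curves:

```python
import heapq

def k_best_paths(eta, step, k):
    """k-best variant of the DP of Section 5: labels[p] holds up to k
    (energy, path) pairs for paths ending at node p of the current layer."""
    labels = [[(float(e), [p])] for p, e in enumerate(eta[0])]
    for j in range(1, len(eta)):
        new = []
        for p1, e1 in enumerate(eta[j]):
            # merge the k-best lists of all predecessors, extended by one edge
            cands = [(en + step(j, p0, p1) + float(e1), path + [p1])
                     for p0, lab in enumerate(labels) for en, path in lab]
            new.append(heapq.nsmallest(k, cands, key=lambda c: c[0]))
        labels = new
    finals = [c for lab in labels for c in lab]
    return heapq.nsmallest(k, finals, key=lambda c: c[0])
```

Because the graph is layered, keeping k labels per vertex is exact, at an O(k) overhead per edge.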
We consider two families of algorithms implementing Algorithm 3, one stochastic and the other deterministic. Throughout this section we fix Ω def.= ]0, 1[² and only consider non-negative curves h ≥ 0, so the algorithms can be stated as:

(k, N, M)-random mesh: The multistart parameter is k. For each n and j we generate Λ_j from new independent uniformly random points: N locations in Ω and a set H of M masses. When M = 0 we take H = {1} (i.e. a balanced mesh).

(k, N, M)-uniform mesh: The multistart parameter is k. For each n we choose Λ₀ = . . . = Λ_T to be a uniform grid of N × N locations (and M masses) for some N ≥ 1. We start with N = 16 at n = 0, then increment N ← 2N whenever E(σ_{n+1}) = E(σ_n) (tested after line 13 of Algorithm 3). This process is terminated once N exceeds the maximal resolution. Again, if M = 0 then we only allow the balanced mass h = 1. (Both constructions are sketched in code below.)

For all problems, (1, +∞, +∞)-random and (1, +∞, +∞)-uniform mesh algorithms are equivalent to the exact Algorithm 1. For balanced problems, such as in Section 1.1, it is sufficient to use the triplet (1, +∞, 0). The random algorithms with N, M ≥ 1 are guaranteed to converge (eventually) to the exact minimiser by Lemma 4.1. On the other hand, the uniform mesh algorithms have nice computational properties which allow Y*[•, T] to be computed more efficiently at larger N. Whilst all results for random meshes are asymptotic, the uniform mesh algorithms also provide a clear stopping criterion: the algorithm stops at iteration n when E(σ_{n+1}) = E(σ_n), which often means that no candidate on the mesh improves the energy. Expanding on the linesearch computation: due to the convexity and smoothness of E, if the linesearch terminates with λ = 0, then clearly no descent direction exists on the current mesh. The optimality given by the dual gap (see [25]) can therefore be stated at the level of the mesh, and we conclude that σ_n is at least optimal up to a spatial resolution of 1/N.
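For concreteness, the two mesh families could be generated as follows (a sketch; the exact grid offsets and mass sets H are our assumptions, see the released code for the precise choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mesh(T, N, M):
    """(k, N, M)-random mesh: fresh uniform locations in Omega = ]0,1[^2
    and masses, drawn independently for each time t_j."""
    def layer():
        H = np.array([1.0]) if M == 0 else rng.uniform(0, 1, size=M)
        return rng.uniform(0, 1, size=(N, 2)), H
    return [layer() for _ in range(T + 1)]

def uniform_mesh(T, N, M):
    """(k, N, M)-uniform mesh: the same N x N grid (and M masses) at all t_j."""
    x = (np.arange(N) + 0.5) / N                      # cell-centred grid
    X = np.stack(np.meshgrid(x, x), axis=-1).reshape(-1, 2)
    H = np.array([1.0]) if M == 0 else np.arange(1, M + 1) / M
    return [(X, H)] * (T + 1)
```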

Benamou-Brenier example
First we compare directly with the numerical results presented in [10] for the model discussed in Section 1.1. In particular, the energy we seek to minimise is E from (7.5), the quadratic-fidelity energy of (3.5), where t_j = j/T for each j = 0, . . ., T, and A_j : M([0, 1]²) → R^m represents a finite number of smoothed Fourier samples. The precise details are given in [10, Sec. 6]. In view of the convexity of Ω, (7.5) reformulates to a discrete-time problem over piecewise linear curves (cf. Remark 3.5). The two phantoms are also from [10, Sec. 6]. In the notation of this work, phantom 1 is represented by σ = δ_{(h¹,ξ¹)} + δ_{(h²,ξ²)} where h¹ = h² = 1 and

ξ¹(t) = (0.2 + 0.6t, 0.2 + 0.6t),   ξ²(t) = (0.8 − 0.6t, 0.2 + 0.6t).   (7.9)

Similarly, phantom 2 is the sum of 3 Dirac masses; both phantoms are shown in Figure 1. In the setting of Section 5, we choose ϕ₀ = 0.1 (as 10 is much larger than 2 or 3, the masses of phantoms 1 and 2 respectively), and step_j as in Remark 5.1. For phantom 1 we have α = β = 0.5 and T = 21; for phantom 2, α = β = 0.1 and T = 51. The algorithm of [10], found online², is run with original parameters as a baseline, although with a time limit of 5 days when necessary. This is compared with multiple variants of the random and uniform algorithms described at the beginning of the section. The uniform algorithms are run to convergence and the random variants are run for 100 and 10,000 iterations for phantoms 1 and 2 respectively. Figure 2a shows the reconstructions of each algorithm with default parameters; each image is visually equivalent. The convergence behaviour is shown in Figure 2b. The energy plots confirm that each reconstruction has approximately the same energy, although we find that the random mesh algorithm finds the lowest energy, closely followed by the uniform mesh. Similarly, the sparsity of each final reconstruction is equal. The greatest difference between algorithms is run-time: the random and uniform mesh algorithms are over 100 times faster than that of [10] in both examples. We also replicate the noise scenarios for phantom 2 as tested in [10]. Our only modification of the baseline algorithm is to remove the early-stopping routine and run the algorithm for the minimum of 21 iterations or 5 days (cf. [10, Table 1]). We again use the (1, 25, 0)-random mesh and (1, 256, 0)-uniform mesh algorithms for comparison in each example. In the three noisy scenarios we add 20%, 60%, and 60% Gaussian white noise to the data respectively. The first two scenarios use α = β = 0.1 while the third uses α = β = 0.3. As seen in Figure 3, both the random and uniform mesh algorithms converge with similar rates, while the algorithm of Bredies et al. is still at least 100 times slower. We see the expected behaviour that the uniform variant converges to a (possibly non-optimal) energy. As predicted by Lemma 4.1, the random variant often finds an even lower energy, despite having a much smaller value of N.
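For reference, the curves of phantom 1 in (7.9) are straightforward to write down; a short sketch:

```python
import numpy as np

t = np.linspace(0, 1, 22)          # T = 21, so 22 sample times for phantom 1
xi1 = np.stack([0.2 + 0.6 * t, 0.2 + 0.6 * t], axis=1)   # diagonal trajectory
xi2 = np.stack([0.8 - 0.6 * t, 0.2 + 0.6 * t], axis=1)   # crossing trajectory
h1 = h2 = np.ones_like(t)          # balanced: constant unit mass
```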

Wasserstein-Fisher-Rao example
In this section we show numerical results for the unbalanced transport example presented in Section 6; the data fidelity is the same as in the Benamou-Brenier example. Ideally, we would use the exact function d_{α,β,δ} from (6.21) to define step_j, but it lacks a closed-form expression. Hence, for computational reasons, we use an approximation of step_j which is a C¹ function of the masses h_j. Note that this only affects the sliding step of the Frank-Wolfe algorithm; the remaining steps are unchanged.
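For orientation only: a closed form is known for the distance between two Dirac masses in the standard Wasserstein-Fisher-Rao normalisation with length-scale δ (the cone metric of the WFR literature); the d_{α,β,δ} of (6.21) additionally carries the α, β weighting, which is why the smooth surrogate above is used instead. A sketch, with the caveat that conventions for the factors of δ vary:

```python
import numpy as np

def wfr_dirac(h0, x0, h1, x1, delta):
    """Squared WFR distance between h0*delta_{x0} and h1*delta_{x1}, in the
    cone-metric normalisation (an assumption: other scalings appear in the
    literature).  Beyond angle pi/2, mass is purely destroyed and created."""
    d = np.linalg.norm(np.asarray(x1) - np.asarray(x0))
    angle = min(d / (2 * delta), np.pi / 2)
    return 4 * delta**2 * (h0 + h1 - 2 * np.sqrt(h0 * h1) * np.cos(angle))
```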

The transformation from the balanced to the unbalanced version of the second phantom is very similar, and the precise formulae (cf. (7.14)) can be found in the supplementary code. All curves h have been normalised; see the supplementary code for details. Again, we run the uniform-mesh algorithms to convergence, but now the random mesh is only run for 100 or 1,000 iterations for phantoms 1 and 2 respectively. The reconstructions are shown in Figure 4a with corresponding convergence plots in Figure 4b.

Observations on parameter choices
Both the random and uniform mesh algorithms have three parameters to choose: the multistart parameter k, spatial resolution N, and mass resolution M. The first phantom (balanced or unbalanced) nicely highlights characteristics of the reconstructions but is numerically too simple to compare different algorithm choices: all methods converge within a few iterations. For the second phantom we prioritised the trade-off between energy and computation time.
The classical choice of k = 1 was always competitive. Large k potentially enables the algorithm to find multiple atoms in one iteration, but slows down the sliding steps. We found that k ∈ [1, 5] was a reasonable range, depending on the sparsity of the signal to be recovered.
The choice of spatial resolution made the largest impact on performance. If the resolution is too small then the algorithm will not find new atoms, but the computation time of the linear oracle scales with N⁴. For the random mesh, we found that N = 25 was a good balance. For the uniform mesh the resolution is also a stopping criterion, so we used the more conservative N = 256 and 512 for the balanced and unbalanced examples respectively.
In the unbalanced experiments the value of M had little effect. Although we do not include this figure, we observed that the choice M = 0 was also competitive for our unbalanced phantom 2, achieving energies within 10⁻³% of the best observed energy. It is unclear whether this will generalise to more complicated examples; we chose M = 10 as a default. The computational complexity of the linear oracle also scales with M², so M should not be too large.
The random and uniform mesh algorithms each have their own advantages. The main advantages of a uniform mesh are computational: one can use a finer resolution (i.e. larger N), and there is a clear stopping criterion. In practice this was a very reliable method without parameter tuning and was as fast as the random algorithm. The random algorithms have analytical guarantees of converging asymptotically to the true minimiser, and indeed achieved the best observed energy in all but one of our experiments, performing noticeably better in the noisy case. The challenge is setting a stopping criterion: in phantom 2 of Figure 4b one sees several plateaux where the algorithm fails to find a descent direction for a number of iterations before continuing to descend.
Figure 5 shows the random variation of the random mesh algorithm with different values of N ∈ {5, 10, 25}. Increasing N both decreases the spread and improves the performance of the algorithm. In the balanced example, the three algorithms complete 1000 iterations in the same time, showing that the linear oracle step is a small part of the total time. On the other hand, with larger M the N = 25 algorithm is nearly 10 times slower than N = 10, indicating that the O(N⁴M²) cost is starting to dominate the computation time. It is possible that N = 5 is slower than N = 10 for the unbalanced example because the sliding step has to work much harder to find local minima.

Conclusion
The main contribution of this work was to extend a variational model proposed by Bredies et al. in order to accelerate the reconstruction of sparse measures in dynamical inverse problems. Using algorithms developed for computing shortest paths on graphs, we can improve the speed by a factor of 100 while still finding lower-energy solutions. This allows us to process new unbalanced examples where the mass of curves is not constant in time; our proposed algorithms still recover good reconstructions in a reasonable amount of time.
We also presented a new analysis of a stochastic variant of Frank-Wolfe which guarantees the convergence of our algorithm (in energy) to a globally optimal solution. This is supported by our experiments, where we see that the random-mesh algorithm achieves the lowest energy in almost every example.
One feature of the algorithm in [10] which we did not take advantage of is the idea of importance sampling. We chose points y ∈ Λ_j uniformly at random, whereas it is likely to be beneficial to choose y such that η_j(y) is small. As an example, if the step functions step_j satisfy the triangle inequality, then this bias could be implemented by choosing points ỹ which are local minima of

ỹ ↦ η_j(ỹ) + step_{j−1}(y, ỹ) + step_j(ỹ, y)   (8.1)

for neighbouring mesh points y. In Algorithm 3, implementing such a sliding step on Λ_j between lines 4 and 5 would preserve the analytical properties of the current algorithm whilst possibly improving practical performance. However, it is also possible that the sliding step already present (e.g. line 7) is powerful enough to find these improved mesh points after computing γ_i, without the extra help beforehand. It is likely that the benefits depend on the application, particularly on the smoothness of the operators A_j.
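A sketch of how the biased sampling suggested by (8.1) might look, assuming the interface of Section 5 (`eta_j`, `step_jm1`, `step_j` are callables; the use of Nelder-Mead is an arbitrary choice of derivative-free local search):

```python
import numpy as np
from scipy.optimize import minimize

def slide_mesh_point(y_prev, y_tilde, y_next, eta_j, step_jm1, step_j):
    """Replace the uniformly drawn candidate y_tilde by a nearby local
    minimiser of (8.1): y -> eta_j(y) + step_{j-1}(y_prev, y) + step_j(y, y_next)."""
    obj = lambda y: eta_j(y) + step_jm1(y_prev, y) + step_j(y, y_next)
    res = minimize(obj, np.asarray(y_tilde, dtype=float), method="Nelder-Mead")
    return res.x if res.fun < obj(np.asarray(y_tilde, dtype=float)) else y_tilde
```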

B.3. Lower semi-continuity of E
Recall that E : M⁺(Γ) → R is defined by E(σ) = F(σ) + W(σ). In Theorem 3.1 we assume ϕ, w : Γ → [0, +∞] are lower semi-continuous; therefore, Lemmas B.1 and B.5 confirm that W is lower semi-continuous and D is closed. It remains to show that F is lower semi-continuous. Lemma 2.3 shows that (e_t)# is narrowly continuous, therefore F inherits lower semi-continuity from each F_j. From now on we consider any closed subset Γ ⊂ Γ_∞ with the topology induced by d_Γ.

B.4. Compactness of sub-levelsets of E
We now prove that minimisers of E exist by showing that E has compact sub-levelsets. Note that in the case that ϕ has compact sub-levelsets, D is already compact (Lemma B.5), so the following theorem is not required. The function E is lower semi-continuous, so the left-hand side is closed and the right-hand side is compact by assumption. We conclude that the left-hand side is compact for each t.

B.5. E has sparse minimisers
Since the function F is of the form F(σ) = H(Aσ), it has linearly closed level sets. The observation operator A : V → (R^m)^{T+1} is linear, defined as σ ↦ (A_j (e_{t_j})#σ)_{0≤j≤T}, and H : (a₀, . . ., a_T) ↦ Σ_{j=0}^T F_j(a_j) is a convex "fidelity term". Set k def.= m(T + 1).
Case t = inf_V R: In the case t = inf_V R = 0, [7, Cor. 3.8] tells us that there exists a minimiser σ* which belongs to an elementary face of {R ≤ t} with dimension at most k, therefore also a face of D with dimension at most k. In particular, by Carathéodory's theorem, we can express σ* in the form of (B.17), but with µ_i ∈ Ext(D). Observe that for t = 0 we can equivalently write D in the required form, so (B.17) is satisfied.

Case t > inf_V R: In this case the minimiser σ* from [7, Cor. 3.8] belongs to an elementary face of {R ≤ t} with dimension at most (k − 1) (hence also for D).
Since { σ ∈ V : ⟨w, σ⟩ = t } is a hyperplane, it is possible to prove (see for instance [21]) that σ* belongs to a face of D with dimension at most k. The formulation of (B.17) is again given by Carathéodory's theorem. We may now deduce (B.14) from (B.17): from Lemma B.5, we know that the extreme points of D are either 0 or of the form ϕ(γ)⁻¹ δ_γ where ϕ(γ) < +∞, hence the general form of (B.14) holds with s ≤ k + 1.
In the special case of ∫ ϕ dσ* < 1, one of the atoms µ_i must be 0. As a result, we may remove it from the sum, so that s ≤ k.

Figure 1. Synthetic phantoms of the form σ = Σ_i δ_{(h_i, ξ_i)} for balanced (top row) and unbalanced (bottom row) examples. Colour indicates the time t, the solid line indicates the positions ξ_i(t), and the width of the overlaid band is proportional to h_i(t).
Figure 2. (a) Final reconstructions, visualised equivalently to those in Figure 1. (b) Convergence of energy and sparsity of reconstructions; for each phantom, every energy is translated by the smallest energy found by any method.
Figure 4. (a) Final reconstructions, visualised equivalently to those in Figure 1. (b) Convergence of energy and sparsity of reconstructions; for each phantom, the energies are translated by the smallest energy found by either method.

Figure 5. Convergence of different random realisations of the random mesh algorithms applied to the balanced and unbalanced phantom 2. The two algorithms are (1, N, 0)- and (1, N, 10)-random mesh, run for 1000 and 100 iterations respectively. 100 random instances of each algorithm are run, each drawn as a thin line. The median is drawn with a thick line, and the inter-quartile range is indicated by the shaded area.