Empirical evolution equations

Evolution equations comprise a broad framework for describing the dynamics of a system in a general state space: when the state space is finite-dimensional, they give rise to systems of ordinary differential equations; for infinite-dimensional state spaces, they give rise to partial differential equations. Several modern statistical and machine learning methods concern the estimation of objects that can be formalized as solutions to evolution equations, in some appropriate state space, even if not stated as such. The corresponding equations, however, are seldom known exactly, and are empirically derived from data, often by means of nonparametric estimation. This induces uncertainties on the equations and their solutions that are challenging to quantify, and moreover the diversity and the specifics of each particular setting may obscure the path for a general approach. In this paper, we address the problem of constructing general yet tractable methods for quantifying such uncertainties, by means of asymptotic theory combined with bootstrap methodology. We demonstrate these procedures in important examples including gradient line estimation, diffusion tensor imaging tractography, and local principal component analysis. The bootstrap perspective is particularly appealing as it circumvents the need to simulate from stochastic (partial) differential equations that depend on (infinite-dimensional) unknowns. We assess the performance of the bootstrap procedure via simulations and find that it demonstrates good finite-sample coverage.


Introduction
Given a Banach space X (a complete normed vector space) and a vector field v on X, an evolution equation (Walker, 1980) is a model of the form

γ̇ = v(γ) and γ(0) = x_0 ∈ X, (1.1)

specifying a (differentiable) flow γ : R → X by relating its time derivative γ̇ to the field v. A flow γ solving (1.1) is often called an integral curve of v, borrowing the established term from the case where X is Euclidean (solving a set of ordinary differential equations is colloquially referred to as "integrating" the system) (Lee, 1992). The abstract formulation (1.1) spans a broad and rich class of models in the physical sciences, encompassing ordinary as well as partial differential equations as special cases. In physics, integral curves are known as lines of force if v is a force field, or lines of flow if v is a velocity field of fluid flow (Zachmanoglou and Thoe, 1986, Chapter 2). In the study of dynamical systems, they are known as trajectories or orbits (Lee, 1992, Chapter 2). The typical treatment of evolution equations takes v to be known and studies questions related to existence, uniqueness, and regularity of the induced integral curves (Lee, 2012, Chapter 17), as well as their stability under perturbations of initial conditions (Bellman, 2008, Chapters 2, 4). Less explored is the setting where v is unknown and measured empirically under uncertainty. This setting can nevertheless be seen to encompass a surprising number of statistical and machine learning objects of interest, which can be formalized as solutions to empirical evolution equations. Important examples include filament estimation in galaxies (Genovese et al., 2009), diffusion tensor imaging tractography (Koltchinskii, Sakhanenko and Cai, 2007), fingerprint analysis (Huckemann, Hotz and Munk, 2008; Hill, Kendall and Thönnes, 2012), and structural economics (Vanhems, 2006).
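To fix ideas in the Euclidean case, "integrating" a vector field can be sketched numerically with explicit Euler steps. The code below is purely illustrative (the function names and the linear test field are not from the paper):

```python
import numpy as np

def integrate_field(v, x0, t_max, dt=1e-3):
    """Approximate the integral curve of a vector field v started at x0
    by explicit Euler steps (a crude numerical 'integration' of (1.1))."""
    steps = int(round(t_max / dt))
    path = np.empty((steps + 1, len(x0)))
    path[0] = x0
    for k in range(steps):
        path[k + 1] = path[k] + dt * v(path[k])
    return path

# Linear field v(x) = -x: the exact flow is gamma(t) = exp(-t) * x0.
path = integrate_field(lambda x: -x, np.array([1.0, 2.0]), t_max=1.0)
print(path[-1])  # approaches exp(-1) * x0 as dt -> 0
```

For the linear field the Euler iterates can be compared against the closed-form flow, which is a convenient sanity check for any such integrator.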
A less obvious example is the widely used stochastic gradient descent algorithm for solving large-scale machine learning tasks (Bottou, 2010), which can also be seen as a special case of integrating a (sampled) vector field. Given the empirical nature of the corresponding equations, one is compelled to consider methods for quantifying the induced uncertainty on their solutions. The diversity of applied settings and methodological constructions can lead to a multitude of corresponding problem-specific approaches, but the elegant general specification raises the question of whether a broad framework of uncertainty quantification, tractably portable to each specific context, is feasible.
This question motivates us to study general evolution equations of the form (1.1) through the lens of sampling variation, when an unknown vector field v is replaced by a (potentially nonparametric) estimate v̂_n depending on sampled data whose "sample size" (in a general sense) is n. This setting gives rise to what we term an empirical evolution equation:

γ̇ = v̂_n(γ) and γ(0) = x_0 ∈ X.
Its solution γ̂_n will be called the empirical integral curve. In order to establish valid general inference procedures based on the empirical integral curve, we investigate how the asymptotic (in n) behavior of the estimator v̂_n of v relates to that of the estimator γ̂_n of γ (Section 2) when the state space X is a (potentially infinite-dimensional) Banach space. We then exploit these results in order to provide asymptotically valid bootstrap procedures for the integral curve (Section 3). Such bootstrapping is particularly useful from a practical standpoint, as it allows us to bypass solving stochastic differential equations with unknown parameters to find the limiting distribution of the integral curve. Indeed, we develop bootstrap procedures that are valid uniformly over smoothing parameters involved in the construction of v̂_n, thus correctly accounting for any uncertainties arising from data-dependent regularization (Section 4). We show how our results can be applied in the context of several modern statistical and machine learning problems (Section 5). These include finite-dimensional evolution equations arising in gradient line estimation, diffusion tensor tractography, and principal curve estimation, and an infinite-dimensional setting describing anisotropic heat flow. Our work also elicits a noteworthy aspect of inference related to evolution equations: when the empirical evolution equation is based on a nonparametric estimate of the vector field, it appears necessary to adopt a scale-space view (Chaudhuri and Marron, 2000) if valid bootstrap procedures are desired (Section 4). The proofs of all formal statements are collected in Appendix B.

Asymptotic theory for the integral curve
We begin by investigating the limiting distribution of r_n(γ̂_n − γ_0) at some appropriate rate r_n, where the subscript "0" indicates the true parameter values under (1.1). We will need to distinguish between two regimes in the asymptotics, depending on the behavior of the vector field estimator v̂_n:

1. Regime 1: the process v̂_n converges weakly to a tight limiting process (Theorem 1).
2. Regime 2: v̂_n itself may fail to converge to a tight limit, but a related functional does (Theorem 2).
The limiting result under the first regime, derived in Section 2.2, can be interpreted as a delta method for the integral curve. We will show in Section 3 that the first regime is particularly amenable to the construction of valid bootstrap procedures. The second regime, treated in Section 2.3, also accommodates tractable asymptotics, albeit at the cost of more stringent conditions on the vector field v_0. Of greater consequence, though, is the possible lack of consistent bootstrap procedures for the integral curve under this second regime.

Notation
We will make heavy use of the following notation and key notions in our subsequent development.
Let X be a Banach space, and let U ⊂ X be open. Let v_0 ∈ C(U, X) be the vector field of interest. For well-definedness, we will assume henceforth that v_0 is locally Lipschitz on U. This implies, by the Picard–Lindelöf theorem (Nelson, 1969), that there is some interval I ⊂ R containing zero on which there exists a unique integral curve γ_0 : I → U solving (1.1). Throughout the paper, the asymptotic results hold for a fixed initial condition x_0 ∈ U and corresponding interval I.

Asymptotics in Regime 1
For parametric vector fields, one typically has a method for estimating the parameter defining v_0, along with a limit theorem for the estimator. For instance, Ramsay et al. (2007) and Brunel, Clairon and D'Alché-Buc (2014) treat precisely such a setting, proposing parametric vector field estimates for which they then derive asymptotics. Our results apply to the parametric setting too, though our eventual interest is in nonparametric vector field estimators. The current development lays the groundwork for the uniform-in-bandwidth result of Section 4, which is crucial to the nonparametric setting. We will assume that for some rate r_n it has been established that r_n(v̂_n − v_0) converges weakly to a tight limit, but will not circumscribe the type of estimator v̂_n can be, beyond this. Theorem 1 will require the following: Assumption A1. The vector field v_0 is Fréchet differentiable at γ_0(t) for all t ∈ I.
Note that when v_0 is C^1, it is both locally Lipschitz and Fréchet differentiable. The following result is derived using tools from Z-estimation theory (van der Vaart and Wellner, 1996).

Theorem 1. Suppose A1 holds. In addition, suppose the vector field estimator v̂_n is such that, for some sequence r_n ≥ 0 with r_n → ∞ and some tight process G,

r_n(v̂_n − v_0) ⇝ G in ℓ^∞(U, X). (2.1)

Then it holds that

r_n(γ̂_n − γ_0) ⇝ ξ in ℓ^∞(I, X), (2.2)

for a tight limiting process ξ. If G is a Gaussian process, then so is ξ.
Note the broadness of the result: X is assumed to be a general, potentially infinite-dimensional Banach space. The limiting distribution can then be interpreted as a stochastic partial differential equation governing the integral curve. The result simplifies considerably in the special case X = R^d. Since this case is prominent in statistical applications, we present it in more detail. Here, v : U → X = R^d corresponds to the familiar notion of a classical vector field on R^d. The associated integral curve γ has the property that the tangent vector of the curve at time t coincides with the value of the vector field v at position γ(t). The Fréchet derivative of v_0 at γ_0(t) is the familiar Jacobian matrix. Writing the vector field as v_0 = (v_1, ..., v_d), where each coordinate v_i is a map from U to R, define the Jacobian at time t to be

J_0(t) = [∂v_i/∂x_j (γ_0(t))]_{i,j=1,...,d}.

With this notation in place, Theorem 1 in conjunction with standard theory on systems of linear differential equations, reviewed in Appendix A, yields:

Corollary 1. Suppose the conditions of Theorem 1 are satisfied. If the state space is X = R^d, then the limiting distribution ξ in (2.2) can be equivalently expressed as

ξ(t) = V(t) ∫_0^t V(s)^{-1} G(γ_0(s)) ds,

where V solves the matrix equation V̇(t) = J_0(t)V(t) and V(0) = I_d.
If J = J_0 commutes, i.e. J(t_1)J(t_2) = J(t_2)J(t_1) for all t_1, t_2 ∈ I, then closed-form expressions for V and its inverse exist:

V(t) = exp(∫_0^t J(s) ds) and V(t)^{-1} = exp(−∫_0^t J(s) ds).

In the more typical situation where J does not commute, an approximation such as the Magnus expansion (Blanes et al., 2009) can be used to estimate V, as discussed in Appendix A.
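The representation in Corollary 1 can be checked numerically in a toy commuting case. The sketch below (illustrative code; a constant diagonal Jacobian and a deterministic stand-in for G are assumptions of the example, not the paper's setting) compares the variation-of-constants formula against direct integration of ξ̇ = Jξ + G(γ_0(t)):

```python
import numpy as np

# Constant (hence trivially commuting) Jacobian and a deterministic stand-in
# for the limit process G evaluated along gamma_0.
J = np.diag([-0.5, 1.0])
g = lambda t: np.array([1.0, t])

def xi_closed_form(t, n=5000):
    """V(t) * int_0^t V(s)^{-1} g(s) ds with V(t) = exp(tJ), midpoint rule."""
    ds = t / n
    s = (np.arange(n) + 0.5) * ds
    integral = sum(np.exp(-np.diag(J) * si) * g(si) * ds for si in s)
    return np.exp(np.diag(J) * t) * integral

def xi_euler(t, dt=1e-4):
    """Direct Euler integration of xi'(t) = J xi(t) + g(t), xi(0) = 0."""
    xi = np.zeros(2)
    for k in range(int(round(t / dt))):
        xi = xi + dt * (J @ xi + g(k * dt))
    return xi

print(xi_closed_form(1.0), xi_euler(1.0))  # the two representations agree
```

Since J is diagonal here, exp(tJ) reduces to coordinatewise scalar exponentials, which keeps the check transparent.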

Asymptotics in Regime 2
When v̂_n is a nonparametric estimate depending on some regularization parameter h, condition (2.1) in Theorem 1, requiring the weak convergence of r_n(v̂_n − v_0), may not be satisfied. For instance, Theorem 2.2.3 in Bierens (1987) shows that when v̂_n is the standard Nadaraya–Watson kernel regression function estimator, the collection of scaled errors √(nh_n^d)(v̂_n(x_i) − v(x_i)) at distinct points x_1, ..., x_k in R^d is asymptotically independent. Thus the process v̂_n cannot have a tight limiting distribution.
Our interest in nonparametric estimators v̂_n leads us to seek an alternative derivation of the limiting distribution of the empirical integral curve γ̂_n under a relaxation of condition (2.1). As we discuss at the end of this section, however, we are likely to lose the possibility of valid bootstrap procedures in the process. To remedy this, we point a way forward for nonparametric v̂_n by adopting a scale-space perspective in Section 4, redefining the targets to be the resolution-h vector field and the corresponding resolution-h integral curve.
Going back to the second regime, consider the following assumptions: Assumption A2. Dv 0 : U → L(X , X ) is uniformly continuous.
Assumption A3. The estimates v̂_n and Dv̂_n are uniformly consistent:

sup_{x∈U} ‖v̂_n(x) − v_0(x)‖ → 0 and sup_{x∈U} ‖Dv̂_n(x) − Dv_0(x)‖ → 0 in probability.

While Assumption A2 is largely made for technical reasons, Assumption A3 is reasonable, asking only that v̂_n and the plug-in estimator for the derivative, Dv̂_n, be consistent estimators of their respective theoretical counterparts.
When condition (2.1) does not hold, a valid bootstrap scheme for the integral curve is unlikely to be available. Heuristically speaking, since ξ results from a smooth operator applied to η, a valid bootstrap procedure for ξ would necessitate being able to bootstrap η. However, η̂_n is itself a smooth functional of r_n(v̂_n − v_0), and it is difficult to envision a valid bootstrap procedure for η if v̂_n does not have a tight weak limit. Thus, we opt to work under the setting of Theorem 1, which readily affords valid bootstrap procedures, the subject of the next section. We shall see that this naturally leads to the consideration of a scale-space perspective for nonparametric vector field estimators, and to related uniform-in-bandwidth bootstrap procedures.

Bootstrapping the integral curve
This section concerns the asymptotic validity of bootstrap procedures (Efron, 1979, 1982) for the integral curve under the first regime, i.e. the setting of Theorem 1. In particular, we demonstrate bootstrap consistency for two standard bootstrap weights and for vector field estimators that can be written as empirical processes relative to Donsker classes. Let {X_1, ..., X_n} ⊂ U be a random sample from a probability measure P on U. The empirical measure is P_n = n^{-1} Σ_{i=1}^n δ_{X_i}, where δ_x is the measure that assigns mass 1 at x and zero elsewhere. For a measurable function f, we write

P̂_n f = n^{-1} Σ_{i=1}^n W_{ni} f(X_i)

to denote either of two bootstrapped empirical processes. The first is the multinomial bootstrapped empirical process, in which (W_{n1}, ..., W_{nn}) is a multinomial vector with n trials and probabilities (1/n, ..., 1/n), independent of the data sequence (X_1, ..., X_n). The second is the multiplier bootstrapped empirical process, in which W_{ni} = ξ_i/ξ̄_n for i.i.d. nonnegative random variables ξ_1, ..., ξ_n, independent of the data, with positive mean, finite variance, and finite ‖ξ_1‖_{2,1} norm, where ξ̄_n denotes their sample average. The multinomial and multiplier bootstrapped empirical processes were first studied by Praestgaard and Wellner (1993).
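The two weighting schemes are easy to generate in practice. The following sketch is illustrative (exponential multipliers are one admissible choice, assumed here; they are not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
f = lambda x: x ** 2               # a test function (here E f(X) = 1)

# Multinomial bootstrap: the weights are multinomial counts with n trials
# and equal cell probabilities 1/n.
W_multi = rng.multinomial(n, np.full(n, 1.0 / n))

# Multiplier bootstrap: i.i.d. positive multipliers standardized by their
# sample mean (exponential multipliers are an illustrative choice).
xi = rng.exponential(size=n)
W_mult = xi / xi.mean()

Pn = np.mean(f(X))                  # empirical measure applied to f
Pn_multi = np.mean(W_multi * f(X))  # multinomial bootstrapped version
Pn_mult = np.mean(W_mult * f(X))    # multiplier bootstrapped version
print(Pn, Pn_multi, Pn_mult)
```

Note that both weight vectors average to one, so each bootstrapped process is a reweighting of the empirical process rather than a recentering.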
To rigorously define bootstrap consistency, we need to define the weak convergence of the conditional limit laws of bootstraps. For a metric space (D, d), let BL_1 denote the space of functions f : D → R with Lipschitz norm bounded by 1. If Ŷ_n is a sequence of processes in D involving random weights W, the notation Ŷ_n ⇝_W Y means that

sup_{f ∈ BL_1} |E_W f(Ŷ_n) − E f(Y)| → 0

in outer probability. The subscript in the expectation indicates conditional expectation over the weights W given the remaining data. The result below is an application of the theory of bootstrapped empirical processes presented in van der Vaart and Wellner (1996).
and B is the standard Brownian bridge in ℓ^∞(U, X).

Uniform-in-bandwidth asymptotics
In this section, we adopt a scale-space view (Chaudhuri and Marron, 2000). This bypasses the difficulty encountered for nonparametric estimators that fail to satisfy (2.1) and are thus not captured by Theorem 1. Specifically, rather than focusing on the one true underlying vector field v_0, we focus on E v̂_{n,h}(x) simultaneously for a range of h values in H ⊂ (0, ∞). Certainly, E v̂_{n,h}(x) may be biased for v_0, but examining it may still be attractive for data analysis, as different levels of smoothing invariably reveal different aspects of the truth.
Technical matters aside, there are first-principle advantages to the scale-space perspective. For example, in certain settings, v 0 may not be well defined to begin with, see e.g. Rinaldo and Wasserman (2010) and Chen, Genovese and Wasserman (2017a). In the gradient line example of Section 5.1, we further discuss this phenomenon in the context of functional density estimation. There is yet another advantage in the scale-space viewpoint: in many practical situations, a tuning parameter is chosen in a data-driven way or by means of "data snooping". This induces an additional layer of sampling variation that needs to be accounted for by bootstrap procedures that are valid uniformly in the choice of a non-trivial bandwidth (an issue discussed further after the statement of our result).
Let v ∈ C(U × H, X) be the vector field of interest and define the evolution equation

γ̇(t, h) = v(γ(t, h), h) and γ(0, h) = x_0 ∈ X.

We assume throughout that v is locally Lipschitz on U × H, so that the solution γ(t, h) is well-defined on some set D = {(t, h) ∈ R × H : t ∈ J_h ⊂ R}. Throughout, fix the initial condition x_0 and the set D. We are now ready to present the uniform-in-bandwidth analogue of Theorem 1.
Theorem 4. Suppose B1 holds. In addition, suppose the vector field estimator v̂_n is such that, for some sequence r_n ≥ 0 with r_n → ∞ and some tight process G,

r_n(v̂_n − v) ⇝ G in ℓ^∞(U × H, X).

Then r_n(γ̂_n − γ) converges weakly over D to a tight limiting process ξ. If G is a Gaussian process, then so is ξ.
The next result is analogous to Theorem 3, showing that vector fields that can be written as empirical processes of Donsker classes can be validly bootstrapped, uniformly in the bandwidth.
and B is the standard Brownian bridge in ℓ^∞(U × H, X).

Applications
We now demonstrate the use of our results in a series of concrete situations. The first three concern finite-dimensional evolution equations. In Section 5.1, the gradient line of a probability density function is formulated as a solution to a finite-dimensional evolution equation. In Section 5.2, white matter fiber tracts in the brain are modeled as integral curves of the eigenvector field derived from the underlying water molecule diffusion tensor field. In Section 5.3, we consider a variation of the local principal curve, formulated as the integral curve of a local covariance field, and demonstrate it on a traffic flow dataset. We conclude with an infinite-dimensional evolution equation related to anisotropic heat diffusion in Section 5.4. Based on the implications of Theorem 5, Algorithm 1 details the general construction of a bootstrapped confidence region for γ that is uniform simultaneously in time and bandwidth. This algorithm will be used throughout the first three examples in this section. Note that Theorem 3 can similarly be leveraged for a confidence region construction that is uniform only in time. A similar bootstrap algorithm has appeared in earlier work for constructing a confidence region that is uniform over a one-dimensional ridge set of interest, though not uniform in the smoothing bandwidth.
Algorithm 1 Confidence region for γ uniform in time and bandwidth.
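The generic construction behind Algorithm 1 can be sketched in code as follows. This is a hedged Python sketch of the procedure implied by Theorem 5 (the function names, array layout, and quantile rule are illustrative, not the paper's): bootstrap the vector field, re-integrate the evolution equation, and take the (1 − α) quantile of the sup-distance over the (t, h) grid as the radius of a uniform confidence tube.

```python
import numpy as np

def bootstrap_confidence_radius(curve_hat, curve_boot_fn, B=200, alpha=0.05,
                                rng=None):
    """Radius of a sup-norm confidence tube around the empirical curve.

    curve_hat:     array of shape (T, H, d): the empirical integral curve
                   gamma_hat(t, h) on a grid of times t and bandwidths h.
    curve_boot_fn: callable returning one bootstrapped curve of the same
                   shape (re-estimate the vector field with bootstrap
                   weights, then re-integrate the evolution equation).
    Returns r such that the tube of radius r around curve_hat, uniform over
    (t, h), has approximate coverage 1 - alpha.
    """
    if rng is None:
        rng = np.random.default_rng()
    sup_dists = np.empty(B)
    for b in range(B):
        curve_b = curve_boot_fn(rng)
        # sup over the (t, h) grid of the pointwise Euclidean distance
        sup_dists[b] = np.max(np.linalg.norm(curve_b - curve_hat, axis=-1))
    return np.quantile(sup_dists, 1.0 - alpha)
```

In the examples below, `curve_boot_fn` would draw multinomial or multiplier weights, form the bootstrapped field estimate, and integrate the resulting empirical evolution equation.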

Gradient lines
Call γ a gradient line of a probability density function f : R^d → [0, ∞) if it solves

γ̇ = ∇f(γ) and γ(0) = x_0. (5.1)

Gradient lines are of interest in applications such as modal clustering (Chen, Genovese and Wasserman, 2016), filament estimation (Genovese et al., 2009), and Morse–Smale complex estimation (Chen, Genovese and Wasserman, 2017b). They play an important role in the computer vision literature, where the ubiquitous mean-shift algorithm can be seen as a discrete approximation to the continuous system in (5.1) (see Arias-Castro, Mason and Pelletier (2016) for more details). Gradient lines are typically estimated using the plug-in principle. Consider the kernel density estimator

f̂_n(x, h) = (nh^d)^{-1} Σ_{i=1}^n K((x − X_i)/h),

based on a random sample X_1, ..., X_n from the density f_0, where K is a kernel function with bandwidth h ∈ H ⊂ (0, ∞). For simplicity, assume K is the standard multivariate Gaussian density. Following the scale-space view of Chaudhuri and Marron (2000), define the resolution-h theoretical counterpart f(x, h) = E f̂_n(x, h), and let

f̂_n^•(x, h) = (nh^d)^{-1} Σ_{i=1}^n W_i K((x − X_i)/h)

be the bootstrapped kernel density estimator, where the weights W_i correspond either to the multinomial bootstrap or the multiplier bootstrap defined in Section 3.
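The plug-in construction can be sketched as follows: for the Gaussian kernel the gradient of the kernel density estimate has a closed form, and the empirical gradient line is obtained by integrating it. This is a hedged sketch (the Euler step size and the demo sample are illustrative choices, not the paper's):

```python
import numpy as np

def kde_gradient(x, X, h):
    """Gradient of the Gaussian kernel density estimate at x. For the
    standard Gaussian kernel this reduces to
    grad f_n(x) = (n h^(d+2))^{-1} sum_i K((x - X_i)/h) (X_i - x)."""
    n, d = X.shape
    diff = X - x
    K = np.exp(-0.5 * np.sum((diff / h) ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return (K[:, None] * diff).sum(axis=0) / (n * h ** (d + 2))

def gradient_line(x0, X, h, steps=200, dt=0.05):
    """Euler approximation of the resolution-h empirical gradient line."""
    path = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] + dt * kde_gradient(path[-1], X, h))
    return np.array(path)

# Demo: for a standard bivariate Gaussian sample, the gradient line started
# at (1, 0) drifts toward the mode of the estimated density, near the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
path = gradient_line(np.array([1.0, 0.0]), X, h=0.5)
```

A bootstrapped gradient line is obtained by the same integration after multiplying each kernel term by a bootstrap weight W_i.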
Before moving on, a discussion is warranted regarding this scale-space view. In its absence, it would be natural to consider the sequence √(nh_n^d)(f̂_n(x, h_n) − f_0(x)), which converges in distribution for fixed x as n → ∞ and h_n → 0 at an appropriate rate. However, there is no equivalent statement for f̂_n(x, h_n) considered as a process over x. This is because, for a collection of distinct points {x_1, ..., x_k}, the Cramér–Wold device gives the joint convergence of the scaled errors at those points to a multivariate normal with a diagonal covariance matrix. Thus √(nh_n^d)(f̂_n(x, h_n) − f_0(x)) cannot converge weakly over x to a tight limit, which precludes the application of Theorem 3 to establish bootstrap consistency.
There is further reason to consider f(x, h), rather than f_0, as the theoretical counterpart to f̂_n(x, h). In certain state spaces X, it may not be possible to define a proper probability density function, e.g. when X is a space of functions as in functional data analysis. A common way to overcome this is to introduce a surrogate density or pseudo-density (Hall and Heckman, 2002; Delaigle and Hall, 2010; Ciollaro, Genovese and Wang, 2016), in much the same spirit as f(x, h) is defined above.
Fix the initial condition x_0, and let the gradient lines of f(x, h), f̂_n(x, h), and f̂_n^•(x, h) be denoted, respectively, γ(t, h), γ̂_n(t, h), and γ̂_n^•(t, h). Assume throughout that the vector field ∇f(x, h) is locally Lipschitz, so that there exists some D = {(t, h) ∈ R × H : t ∈ J_h ⊂ R} on which γ is well-defined. The following justifies bootstrapping for the resolution-h gradient line. Note that the requirement below that H be a compact subset of (0, ∞) excludes the case h → 0.
Theorem 6. Let U and H be compact subsets of R^d and (0, ∞), respectively. If f(x, h) is twice continuously differentiable, then √n(γ̂_n − γ) and √n(γ̂_n^• − γ̂_n) converge weakly over (t, h) ∈ D to the same mean-zero U-valued Gaussian process.
Fix the initial value x_0 = (0.4, 0.5) and I = [0, T], where T = 0.1 is the total length of the flow. Figure 1 shows the gradient field ∇f_0 overlaid with its gradient line γ_0, as well as the resolution-h gradient lines γ(t, h) for h ∈ H = {0.5, 0.6, ..., 0.9, 1}. We implement Algorithm 1 with the multinomial bootstrap. The number of bootstrap replications B was set to 200. We assessed coverage probabilities using 200 Monte Carlo iterations. The performance for different sample sizes n at a desired 95% coverage level is shown in Figure 2. We see that the nominal coverage is attained as n increases. There is a sign of under-coverage for n = 50, but already for n = 100 the coverage becomes adequate.

Noisy principal eigenvector fields
Diffusion tensor imaging is a magnetic resonance imaging technique that can be used to map fiber tract structures in the brain (Basser, Mattiello and LeBihan, 1994). A 3 × 3 symmetric positive-definite matrix is measured at each location in R^3, capturing the spatial covariation of water diffusion through tissue. The integral curves of the associated principal eigenvector field provide a continuous description of the diffusion tensor field. In fact, they form the basis of many tractography methods in diffusion tensor imaging (Mukherjee et al., 2008).
Since the diffusion tensor field is measured empirically with noise, so too is the principal eigenvector field derived from it. As in Koltchinskii, Sakhanenko and Cai (2007), we consider a model where the underlying principal eigenvector field v_0 is observed according to

V_i = v_0(X_i) + ε_i, i = 1, ..., n,

where the X_i are uniformly distributed in a bounded open set of R^d and the ε_i are mean-zero bounded random errors. We employ the standard kernel regression estimator

v̂_n(x, h) = (nh^d)^{-1} Σ_{i=1}^n K((x − X_i)/h) V_i

to generate a plug-in estimate for the integral curve. To simplify matters, suppose K is the standard multivariate Gaussian density. Our treatment now begins to diverge from that of Koltchinskii, Sakhanenko and Cai (2007). Their analysis utilizes a result similar to Theorem 2, but for a Euclidean state space, to derive the limiting distribution of the empirical integral curve. Importantly, this result concerns γ_0, the integral curve of v_0. We will instead take the scale-space perspective and make inference for a resolution-h integral curve, so that we can justify the bootstrapping procedure, which would otherwise be difficult to ascertain if γ_0 were the object of interest. Define the theoretical counterpart of v̂_n(x, h) to be v(x, h) = E v̂_n(x, h), and let

v̂_n^•(x, h) = (nh^d)^{-1} Σ_{i=1}^n W_i K((x − X_i)/h) V_i

be the bootstrapped field estimator, where the weights W_i correspond either to the multinomial bootstrap or the multiplier bootstrap defined in Section 3. Fix the initial condition x_0, and let the integral curves of the eigenvector fields v(x, h), v̂_n(x, h), and v̂_n^•(x, h) be denoted, respectively, γ(t, h), γ̂_n(t, h), and γ̂_n^•(t, h). Throughout, assume v is locally Lipschitz, so that there exists some D = {(t, h) ∈ R × H : t ∈ J_h ⊂ R} on which γ is well-defined. The following result justifies performing inference for the resolution-h integral curve via bootstrapping.
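The kernel-smoothed field is straightforward to evaluate. The sketch below (illustrative code; the data-generating choices mirror the circular-field example discussed shortly but the sample size, noise level, and bandwidth are assumptions of the demo) computes the kernel regression estimate of the field at a point:

```python
import numpy as np

def nw_field(x, X, V, h):
    """Kernel-smoothed vector field (n h^d)^{-1} sum_i K((x - X_i)/h) V_i,
    with K the standard Gaussian density."""
    d = X.shape[1]
    u = (x - X) / h
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K @ V / (len(X) * h ** d)

# Noisy observations of the circular field v0(x) = (-x2, x1)/|x| on [-1, 1]^2:
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(4000, 2))
norms = np.linalg.norm(X, axis=1, keepdims=True)
V = np.c_[-X[:, 1], X[:, 0]] / norms + 0.1 * rng.normal(size=(4000, 2))

v_hat = nw_field(np.array([0.5, 0.0]), X, V, h=0.15)
direction = v_hat / np.linalg.norm(v_hat)
print(direction)  # close to the true direction v0(0.5, 0) = (0, 1)
```

Integrating this field with any ODE stepper, restarted with bootstrap weights on the kernel terms, yields the empirical and bootstrapped integral curves.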
Theorem 7. Let U and H be compact subsets of R^d and (0, ∞), respectively. If v(x, h) is twice continuously differentiable, then √n(γ̂_n − γ) and √n(γ̂_n^• − γ̂_n) converge weakly over (t, h) ∈ D to the same mean-zero U-valued Gaussian process.
Consider the following example from Section 4.2 in Koltchinskii, Sakhanenko and Cai (2007). We have noisy observations of a circular vector field v_0 on R^2 given by v_0(x_1, x_2) = (−x_2/‖x‖, x_1/‖x‖), with mean-zero random errors ε_i. The number of bootstrap replications B was set to 200 and the number of Monte Carlo iterations to 200. The performance for different sample sizes n at a desired 95% coverage level is shown in Figure 4. We see that the nominal coverage is attained as the sample size increases, and the bootstrap method is not overly conservative until n = 400.

Hastie and Stuetzle (1989) introduced the concept of principal curves, smooth curves that pass through the "middle" of a multivariate point cloud, to describe data exhibiting nonlinear variation. Many variations on the principal curve have since followed, including the concept of local principal curves (sometimes referred to under the name ridge (Ozertem and Erdogmus, 2011)). Top-down methods for local principal curves start with the first principal component and successively bend it (Einbeck, Evers and Bailer-Jones, 2008), while bottom-up methods are exemplified by the local principal curve proposed in Einbeck, Tutz and Evers (2005), which we describe below.

Local principal flow
The local principal curve of Einbeck, Tutz and Evers (2005) is constructed point by point, using only local information at each iteration. Suppose we have a multivariate point cloud {X_1, ..., X_n} ⊂ R^d. Starting from an initial point, a local center of mass μ_x is calculated. Next, a local covariance matrix centered at μ_x is formed, of the form

Σ̂_x = (Σ_{i=1}^n w_i)^{-1} Σ_{i=1}^n w_i (X_i − μ_x)(X_i − μ_x)^T, w_i = K((X_i − μ_x)/h),

for some kernel function K and bandwidth h. From this, the principal eigenvector is extracted, and μ_x is moved in the direction of said eigenvector by some step size. This is repeated until μ_x stops changing. The local principal curve comprises this series of μ_x values.
In this illustration, we show how an adaptation of the local principal curve of Einbeck, Tutz and Evers (2005) can be formulated as a solution to an evolution equation. Rather than centering the local covariance matrix at a local center of mass, consider for each point x ∈ R^d a local covariance matrix centered directly there:

Ĉ_n(x, h) = (nh^d)^{-1} Σ_{i=1}^n K((x − X_i)/h) (X_i − x)(X_i − x)^T.

Let v̂_n(x, h) be the associated principal eigenvector of Ĉ_n(x, h). The corresponding empirical integral curve γ̂_n can be interpreted as an infinitesimal local principal curve-like object, which we will call a local principal flow, so named for its similarity to the (global) principal flow introduced in Panaretos, Pham and Yao (2014). Define the scale-space theoretical counterpart of Ĉ_n(x, h) to be C(x, h) = E Ĉ_n(x, h), and let Ĉ_n^•(x, h) be the bootstrapped local covariance matrix, for either the multinomial or multiplier bootstrap defined in Section 3. Denote by γ and γ̂_n^• the associated local principal flows of C and Ĉ_n^•, respectively.
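The building blocks of the local principal flow can be sketched as follows (illustrative code; the kernel and the sign-alignment rule are assumptions of the sketch, and any positive rescaling of the local covariance leaves its eigenvectors unchanged):

```python
import numpy as np

def local_cov(x, X, h):
    """Local covariance centered directly at x: Gaussian-kernel weighted
    second-moment matrix of (X_i - x). Positive rescalings (such as the
    (n h^d)^{-1} normalization) do not affect the eigenvectors."""
    w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    diff = X - x
    return (w[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(0) / w.sum()

def principal_direction(x, X, h, prev=None):
    """Unit leading eigenvector of the local covariance at x, with its sign
    aligned to the previous step's direction so the flow does not reverse."""
    vals, vecs = np.linalg.eigh(local_cov(x, X, h))
    v = vecs[:, -1]            # eigh sorts eigenvalues in ascending order
    if prev is not None and v @ prev < 0:
        v = -v
    return v
```

Stepping repeatedly in the direction returned by `principal_direction`, with the previous direction passed as `prev`, traces out a discrete approximation to the local principal flow.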
We fit the local principal flow for a speed-flow diagram obtained from the R package LPCM (Einbeck and Dwyer, 2011). The data consist of 444 observations of speed and flow recorded from 9am on 9 July 2007 to 10pm on 10 July 2007 on a particular lane of a Californian freeway. The nonlinear variation of the data is apparent from Figure 5. Rather than performing a standard principal component analysis, we fit the local principal flow with K the standard bivariate Gaussian density and bandwidth h = 0.22. The multinomial bootstrap (with B = 1000) was used to find an approximate 95% confidence region for the local principal flow γ(t, h), uniform in t. The confidence region appears as blue circles in Figure 5.

Heat flows
Now we consider an evolution equation on an infinite-dimensional state space, corresponding to a partial differential equation. The motivation for this example arises from novel tractography methods in diffusion tensor imaging that are probabilistic, rather than deterministic, in nature. We will examine specifically the tractography method described in Section D of Hageman et al. (2009), based on the anisotropic heat equation

∂γ(t)(x)/∂t = ∇ · (D(x) ∇γ(t)(x)), (5.2)

where D(x) is the diffusion tensor at x ∈ R^d. In a typical interpretation, γ(t)(x) represents the density of a diffusing material at location x and time t.
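To fix ideas, the heat equation can be discretized in its conservative (flux) form. The sketch below is a one-dimensional toy with a scalar diffusion coefficient (the grid, time step, and coefficient profile are illustrative choices; the paper's setting is the full anisotropic tensor field in R^d):

```python
import numpy as np

def heat_step(u, D, dt, dx):
    """One explicit finite-difference step of du/dt = d/dx(D du/dx) on a
    1-d grid, in conservative (flux) form; boundary cells are held fixed."""
    D_half = 0.5 * (D[1:] + D[:-1])         # coefficient at cell interfaces
    flux = D_half * (u[1:] - u[:-1]) / dx   # D * du/dx at interfaces
    du = np.zeros_like(u)
    du[1:-1] = (flux[1:] - flux[:-1]) / dx  # discrete divergence of the flux
    return u + dt * du

# Diffuse a sharp bump with a spatially varying coefficient:
x = np.linspace(0.0, 1.0, 201)
u = np.exp(-((x - 0.5) / 0.02) ** 2)
D = 0.5 + 0.4 * x                           # higher diffusivity to the right
for _ in range(2000):
    u = heat_step(u, D, dt=2e-6, dx=x[1] - x[0])
```

The flux form conserves total mass up to boundary effects, and the explicit time step is chosen below the stability threshold dx²/(2 max D).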
Let P_d denote the space of d × d symmetric positive-definite matrices. Suppose it is only possible to observe, with measurement error, the true diffusion coefficient D_0 at a finite number of positions. Specifically, for i.i.d. observations D_i ∈ P_d occurring at random locations X_i, consider the kernel estimate

D̂_n(x, h) = (nh^d)^{-1} Σ_{i=1}^n K((x − X_i)/h) D_i,

where K is the standard multivariate Gaussian density and h ∈ H ⊂ (0, ∞) is a bandwidth. Define the scale-space theoretical counterpart to be D(x, h) = E D̂_n(x, h). Next, let the bootstrapped diffusion coefficient be

D̂_n^•(x, h) = (nh^d)^{-1} Σ_{i=1}^n W_i K((x − X_i)/h) D_i,

where the weights W_i correspond either to the multinomial bootstrap or the multiplier bootstrap in Section 3.
Call γ a heat flow of the tensor field Q_h = Q(·, h), where Q : U × H → P_d, if γ solves the evolution equation

γ̇(t) = ∇ · (Q_h ∇γ(t)) and γ(0) = x_0,

for an initial condition x_0 ∈ C^1(U, R). Fix the initial condition x_0, and let the heat flows associated to each of D(x, h), D̂_n(x, h), and D̂_n^•(x, h) be denoted, respectively, γ(t, h), γ̂_n(t, h), and γ̂_n^•(t, h). Assume throughout that the vector field associated to the heat equation is locally Lipschitz, so that there exists some G = {(t, h) ∈ R × H : t ∈ J_h ⊂ R} on which the heat flow is well-defined. The following result justifies performing inference for the resolution-h heat flow via bootstrapping.
Theorem 8. Let U and H be compact subsets of R^d and (0, ∞), respectively. If vec(D(x, h)) is twice continuously differentiable, then √n(γ̂_n − γ) and √n(γ̂_n^• − γ̂_n) converge weakly over G to the same mean-zero U-valued Gaussian process.

Future work
We conclude with a discussion of possible future research directions. First, the uniform-in-bandwidth results in this paper could lead to methods of bandwidth selection based on the idea of stability, e.g. in the spirit of Rinaldo and Nugent (2012) and Ciollaro, Genovese and Wang (2016). In these works, a range of bandwidths is considered desirable if the features of interest persist across it.
Second, more refined statements in the current framework may be possible if we focus on specific classes of evolution equations. Gradient flow equations provide one such example. These equations, whose study forms an active research area, appear naturally in many real systems that decrease energy (or increase entropy at fixed energy) over time. Gradient flow equations take the form

γ̇ = −∇I(γ),

where I is a functional from a metric space X to R. Intuitively, the solution γ(t) "flows downhill" in the direction −∇I/‖∇I‖ with velocity proportional to ‖∇I‖. It is easy to see that the gradient line of Section 5.1 falls under this framework (taking I = −f), since the vector field is the gradient of a probability density function. That the anisotropic heat equation in Section 5.4 can also be encompassed in the gradient flow framework is truly surprising, however. The results of Lisini (2009) make apparent that the heat flow associated to Equation (5.2) is also the gradient flow of the functional I(u) = ∫_{R^d} F(u(x)) dx, where F(z) = z log z, with respect to the 2-Wasserstein distance between probability measures on R^d, endowed with the Riemannian distance induced by D^{-1}, the inverse of the diffusion tensor.
Finally, future directions of research might extend the current framework to more exotic evolution equations. For instance, time-dependent vector fields are not treated here but are important tools in the visualization field (Theisel et al., 2004). Also, the increasing interest in manifold-valued data leads to applications involving integral curves on manifolds (via charts (Lee, 2012)), which are not yet treated by the current work.

Appendix A: Linear differential equations in Banach spaces
The review in this section relies heavily on Krein (1971). Consider the equation

u̇(t) = A(t)u(t) + f(t), u(0) = u_0, (A.1)

where A(t) is a bounded operator in a Banach space E, f(t) is a given function taking values in E, and u(t) is an unknown function also taking values in E. The derivative u̇ is understood to be the limit of the difference quotient with respect to the norm of E.
If A(t) and f(t) are continuous (or, more generally, measurable and integrable on every finite interval), then the solution to the homogeneous counterpart of (A.1),

u̇(t) = A(t)u(t), u(0) = u_0, (A.2)

exists for any u_0 ∈ E and is given by u(t) = U(t, 0)u_0, where U(t, s) is known as the evolution operator of (A.2) and is given by the series

U(t, s) = I + ∫_s^t A(τ_1) dτ_1 + ∫_s^t A(τ_1) ∫_s^{τ_1} A(τ_2) dτ_2 dτ_1 + . . . , (A.3)

where I is the identity operator. Going back to the non-homogeneous equation (A.1), its solution is u(t) = U(t, 0)u_0 + ∫_0^t U(t, s)f(s) ds.

Example 1. Suppose E = R and A(t) = 1. Then U(t, 0) = 1 + t + t^2/2! + t^3/3! + . . . , i.e. the power expansion of exp(t).
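A quick numerical sanity check (an illustration, not from the paper): for E = R and A(t) ≡ 1, the partial sums of the series defining U(t, 0) are simply the Taylor polynomials of exp(t), so they should converge to the exponential.

```python
import math

def evolution_operator_scalar(t, n_terms=20):
    """Partial sum of the series U(t, 0) = 1 + t + t^2/2! + ...
    for the scalar equation with A(t) = 1."""
    return sum(t**k / math.factorial(k) for k in range(n_terms))

# The partial sums converge to exp(t):
approx = evolution_operator_scalar(1.0)
# approx agrees with math.exp(1.0) to high precision
```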

Example 2. Suppose E = R and A(t) = t. Then U(t, 0) = 1 + t^2/2 + (t^2/2)^2/2! + . . . , i.e. the power expansion of exp(t^2/2).
These examples suggest that in the Euclidean setting the solution to the linear differential equation (A.1) simplifies substantially, which is indeed the case. Suppose E = R^d; then u and f are vector-valued functions, and the operator A(t) is a d × d matrix. First, suppose the family A(t) commutes, i.e. A(t_1)A(t_2) = A(t_2)A(t_1) for all t_1, t_2. Then the evolution operator in (A.3) simplifies to

U(t, s) = exp(∫_0^t A(τ) dτ) exp(−∫_0^s A(τ) dτ) = exp(∫_s^t A(τ) dτ),

where, for a matrix A, the matrix exponential exp(tA) is formally defined as the convergent power series exp(tA) = I + tA + t^2 A^2/2! + . . . . The reader is referred to Moler and Van Loan (2003) for various ways to compute the exponential of a matrix.
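The power-series definition of the matrix exponential can be checked directly against a case with a closed form (an illustration, not from the paper): for the rotation generator A = [[0, 1], [−1, 0]] one has A^2 = −I, so exp(tA) = (cos t) I + (sin t) A. The truncated series below is adequate for small ‖A‖; see Moler and Van Loan for numerically robust alternatives.

```python
import numpy as np

def expm_series(A, n_terms=30):
    """Matrix exponential via the truncated power series
    exp(A) = I + A + A^2/2! + ... (fine for small ||A||)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, n_terms):
        term = term @ A / k  # term = A^k / k!
        result = result + term
    return result

# Rotation generator: exp(tA) = [[cos t, sin t], [-sin t, cos t]].
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
U = expm_series(1.0 * A)
# U matches [[cos 1, sin 1], [-sin 1, cos 1]] to high precision
```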
In general, however, the family A(t) will not commute. In that case the solution to (A.1) is still given by the evolution operator, and the Magnus expansion can be used for V(t) = U(t, 0): we first write V as a matrix exponential, V(t) = exp(Ω(t)), where Ω(0) = O. The Magnus expansion is a series expansion Ω(t) = Σ_{k≥1} Ω_k(t) for the matrix in the exponent. We refer the reader to Blanes et al. (2009) for derivation details of the expansion. To illustrate, the first three terms of the series are

Ω_1(t) = ∫_0^t A(t_1) dt_1,
Ω_2(t) = (1/2) ∫_0^t ∫_0^{t_1} [A(t_1), A(t_2)] dt_2 dt_1,
Ω_3(t) = (1/6) ∫_0^t ∫_0^{t_1} ∫_0^{t_2} ([A(t_1), [A(t_2), A(t_3)]] + [A(t_3), [A(t_2), A(t_1)]]) dt_3 dt_2 dt_1,

where [A, B] = AB − BA denotes the matrix commutator.
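The following sketch (an illustration, not from the paper) compares the second-order Magnus approximation exp(Ω_1(T) + Ω_2(T)) with a fine Runge-Kutta integration of U̇ = A(t)U for a hypothetical non-commuting family A(t) = B + tC; the particular B, C, horizon, and tolerance are all assumptions made for the example.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical non-commuting family: A(t) = B + t*C with [B, C] != 0.
B = np.array([[0.0, 1.0], [0.0, 0.0]])
C = np.array([[0.0, 0.0], [-1.0, 0.0]])
A = lambda t: B + t * C
comm = lambda X, Y: X @ Y - Y @ X

T = 0.5

# First two Magnus terms, in closed form for this linear-in-t family:
#   [A(t1), A(t2)] = (t2 - t1) [B, C], so
#   Omega1 = T*B + (T^2/2)*C   and   Omega2 = -(T^3/12) [B, C].
Omega1 = T * B + (T**2 / 2) * C
Omega2 = -(T**3 / 12) * comm(B, C)
V_magnus = expm(Omega1 + Omega2)

# Reference: integrate dU/dt = A(t) U with classical RK4.
n = 2000
dt = T / n
U = np.eye(2)
for i in range(n):
    t = i * dt
    k1 = A(t) @ U
    k2 = A(t + dt / 2) @ (U + dt / 2 * k1)
    k3 = A(t + dt / 2) @ (U + dt / 2 * k2)
    k4 = A(t + dt) @ (U + dt * k3)
    U = U + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# V_magnus agrees with U up to the Magnus truncation error.
```

Note that since tr A(t) = 0 here, the exact evolution operator has determinant one, which the Magnus form exp(Ω) preserves automatically for traceless Ω.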

Appendix B: Proofs of formal statements
Proof of Theorem 1. Define the parameter space Θ to be a closed subset of C^1(I, U) that contains an open neighborhood of γ_0. Additionally, suppose Θ is such that sup_{θ∈Θ} ‖θ‖_{C^1} < ∞ and θ(0) = x_0 for all θ ∈ Θ. We endow Θ with the C^1 norm, which makes it a Banach space. Also consider the Banach space L = C(I, X) equipped with the supremum norm. Consider the operators Ψ and Ψ_n, treated as elements of ℓ^∞(Θ, L), given by Ψ(γ) = γ̇ − v_0 ∘ γ and Ψ_n(γ) = γ̇ − v̂_n ∘ γ. The integral curve γ_0 ∈ C^1(I, U) of v_0 may now be formulated as a zero of Ψ. Viewing the extraction of the zero from Ψ and Ψ_n as a continuous mapping allows the application of a functional delta method to the process r_n(Ψ_n − Ψ) to derive the limiting distribution of the empirical integral curve. Heuristically, as Ψ_n approaches Ψ in an appropriate sense, so too should their respective roots. We will apply the master Z-estimator result, Corollary 13.7 of Kosorok (2008), which makes this idea rigorous.
We first check Ψ_n, Ψ ∈ ℓ^∞(Θ, L). For any γ ∈ Θ we have

‖Ψ(γ)‖_L = sup_{t∈I} ‖γ̇(t) − v_0(γ(t))‖
≤ sup_{t∈I} ‖γ̇(t)‖ + sup_{t∈I} ‖v_0(γ(t)) − v_0(γ_0(t))‖ + sup_{t∈I} ‖v_0(γ_0(t))‖
≤ k (‖γ‖_{C^1} + ‖γ − γ_0‖_{C^1} + ‖v_0 ∘ γ_0‖_∞) < ∞,

where we have, in the third line, used the fact that v_0 is locally Lipschitz and γ, γ_0 ∈ Θ, and taken the Lipschitz constant k to be greater than or equal to 1. We can show that Ψ_n ∈ ℓ^∞(Θ, L) in a similar way. Next, the unique existence of γ_0 is guaranteed by the fact that v_0 is locally Lipschitz; thus the identifiability condition in Corollary 13.7 of Kosorok (2008) is satisfied. To derive the Fréchet derivative of Ψ at γ_0, write

Ψ(γ_0 + h) − Ψ(γ_0) = ḣ − (v_0 ∘ (γ_0 + h) − v_0 ∘ γ_0) = ḣ − (Dv_0 ∘ γ_0)h + o(‖h‖_{C^1}),

so that Ψ̇_{γ_0}(h) = ḣ − (Dv_0 ∘ γ_0)h. The notation (Dv_0 ∘ γ_0) is understood to be the map from I to L(X, X); it is well defined by Assumption A1. We now show that Ψ̇_{γ_0} is continuously invertible. First, note that u = Ψ̇_{γ_0}^{−1}(f) is the solution to the following linear differential equation in the Banach space X:

u̇(t) = (Dv_0 ∘ γ_0)(t) u(t) + f(t), u(0) = 0. (B.1)

Since the linear operator (Dv_0 ∘ γ_0)(t) acting in the Banach space X is the Fréchet derivative of v_0 at γ_0(t), it is bounded and continuous. The general theory on linear differential equations reviewed in Appendix A includes (B.1) as a special case.
By that theory, the solution to (B.1) is Ψ̇_{γ_0}^{−1}(f)(t) = ∫_0^t U(t, s) f(s) ds, where U(t, s) is the evolution operator given by

U(t, s) = I + ∫_s^t (Dv_0 ∘ γ_0)(τ_1) dτ_1 + ∫_s^t (Dv_0 ∘ γ_0)(τ_1) ∫_s^{τ_1} (Dv_0 ∘ γ_0)(τ_2) dτ_2 dτ_1 + . . . ,

where I is the identity operator. That the inverse operator Ψ̇_{γ_0}^{−1} is continuous can be seen from this form. Now we check the conditions regarding the estimating equation Ψ_n. First, we check that Ψ_n → Ψ uniformly in ℓ^∞(Θ, L), in probability. Asymptotic normality of v̂_n implies v̂_n → v_0 uniformly in ℓ^∞(U, X), in probability. It then follows that the restriction v̂_n ∘ γ → v_0 ∘ γ uniformly in ℓ^∞(I, X), and thus Ψ_n → Ψ in ℓ^∞(Θ, L), in probability. The asymptotic normality of v̂_n also gives the weak convergence of r_n(Ψ_n − Ψ). Now all of the conditions of Corollary 13.7 in Kosorok (2008) are satisfied, and we have r_n(γ̂_n − γ_0) →_d −Ψ̇_{γ_0}^{−1} X(γ_0), i.e. the desired result in (2.2).
Finally, if G is a Gaussian process, then G ∘ γ_0 is a Gaussian process, because all finite-dimensional distributions of a Gaussian process are Gaussian. Since Ψ̇_{γ_0}^{−1} is linear and continuous, and continuous linear mappings preserve Gaussianity, ξ is itself a Gaussian process whenever G is.
Remark 1. The differential operator that appears in Ψ̇_{γ_0} is typically unbounded between two Banach spaces. For instance, consider the differential operator from C^1[0, 1] to C[0, 1] where both spaces are endowed with the uniform norm; if f_n(x) = x^n, then ‖f_n‖_∞ = 1 while ‖ḟ_n‖_∞ = n. However, with our particular choice of Θ and L and the norms they are equipped with, the differential operator is bounded and thus continuous.
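As a quick numerical illustration of the remark (not part of the proof), the sup norm of f_n(x) = x^n on [0, 1] stays at 1 while the sup norm of its derivative n x^{n−1} grows like n, so no single constant M can bound ‖ḟ_n‖_∞ by M‖f_n‖_∞:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100001)  # fine grid on [0, 1]

for n in (1, 10, 100):
    f = x**n                 # f_n(x) = x^n
    df = n * x**(n - 1)      # derivative n x^(n-1)
    # sup-norm of f stays 1; sup-norm of f' equals n (attained at x = 1)
    print(n, f.max(), df.max())
```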
Proof of Theorem 2. This is an adaptation of the proof of Theorem 1 in Koltchinskii, Sakhanenko and Cai (2007) to the Banach space setting. Without loss of generality, we take the interval I ⊂ R to be I = [0, T] for some T > 0. Let us begin by representing the integral curves γ_0 and γ̂_n in their integral form: γ_0(t) = x_0 + ∫_0^t v_0(γ_0(s)) ds and γ̂_n(t) = x_0 + ∫_0^t v̂_n(γ̂_n(s)) ds. These integrals are to be interpreted as Bochner integrals (Mikusiński, 1978), whose construction for Banach-space-valued functions is analogous to that of a Lebesgue integral for real-valued functions.
The local Lipschitz property of v_0 gives us

‖γ̂_n(t) − γ_0(t)‖ ≤ ∫_0^t ‖v̂_n(γ̂_n(s)) − v_0(γ̂_n(s))‖ ds + ∫_0^t ‖v_0(γ̂_n(s)) − v_0(γ_0(s))‖ ds
≤ T ‖v̂_n − v_0‖_∞ + k ∫_0^t ‖γ̂_n(s) − γ_0(s)‖ ds.

By the Grönwall-Bellman inequality, e.g. Lemma 1 of Chandra (1970), we have

sup_{t∈I} ‖γ̂_n(t) − γ_0(t)‖ ≤ T ‖v̂_n − v_0‖_∞ e^{kT}.

Uniform consistency now follows immediately from Assumption A3.
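The Grönwall-Bellman step can be sanity-checked numerically (an illustration only, with hypothetical constants a, k, T): if φ(t) ≤ a + k ∫_0^t φ(s) ds, then φ(t) ≤ a e^{kt}. Below, the extremal case with equality is integrated by Euler steps and compared against the exponential bound.

```python
import numpy as np

a, k, T, n = 1.0, 2.0, 1.0, 100000
dt = T / n

# Extremal case: phi(t) = a + k * ∫_0^t phi(s) ds (equality throughout),
# integrated by left-endpoint Euler steps.
phi = np.empty(n + 1)
phi[0] = a
integral = 0.0
for i in range(n):
    integral += phi[i] * dt
    phi[i + 1] = a + k * integral

t = np.linspace(0.0, T, n + 1)
bound = a * np.exp(k * t)
# Grönwall: phi never exceeds the exponential bound a * exp(k t).
```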

Lemma 2. Under the conditions of Theorem 2, r_n z_n(t) →_d ξ(t).

Proof. We proceed by showing that D is a continuous operator, which implies r_n z_n(t) = Dη_n(t) →_d ξ(t) = Dη(t). Now, D is a linear operator between two normed spaces; thus, by e.g. Theorem 2.2.4 in Atkinson and Han (2009), D is continuous if and only if it is bounded, i.e. there exists some M > 0 such that ‖DF‖ ≤ M‖F‖ for all F ∈ C_0^1(I, X). We can write u = DF in its integral representation: u(t) = F(t) + ∫_0^t Dv_0(γ_0(s)) u(s) ds. Then we have

‖u(t)‖ ≤ ‖F‖_∞ + ∫_0^t ‖Dv_0(γ_0(s))‖_op ‖u(s)‖ ds,

and hence

‖u‖_∞ ≤ ‖F‖_∞ exp(∫_0^T ‖Dv_0(γ_0(s))‖_op ds),

where the last line follows from an application of the Grönwall-Bellman inequality, which can be applied since the map t → ‖Dv_0(γ_0(t))‖_op is continuous on I by Assumption A2.