Nonparametric Clustering of Functional Data Using Pseudo-Densities

We study nonparametric clustering of smooth random curves on the basis of the L2 gradient flow associated to a pseudo-density functional and we show that the clustering is well-defined both at the population and at the sample level. We provide an algorithm to mark significant local modes, which are associated to informative sample clusters, and we derive its consistency properties. Our theory is developed under weak assumptions, which essentially reduce to the integrability of the random curves, and does not require to project the random curves on a finite-dimensional subspace. However, if the underlying probability distribution is supported on a finite-dimensional subspace, we show that the pseudo-density and the expectation of a kernel density estimator induce the same gradient flow, and therefore the same clustering. Although our theory is developed for smooth curves that belong to an infinite-dimensional functional space, we also provide consistent procedures that can be used with real data (discretized and noisy observations).


Introduction
In Functional Data Analysis (Ramsay and Silverman, 2005;Ferraty and Vieu, 2006;Ferraty and Romain, 2011;Horváth and Kokoszka, 2012;Zhang, 2013;Hsing and Eubank, 2015), henceforth FDA, we think of curves (and other functions) as the fundamental unit of measurement. Clustering is an important problem in FDA because it is often of critical interest to identify subpopulations based on the shapes of the measured curves. In this paper, we study the problem of functional clustering in a fully infinite-dimensional setting. We are motivated by recent work on modal clustering in finite dimensions (Chacón, 2015, and references therein) that, in contrast to many commonly-used clustering methods, has a population formulation, and by recent advances in clustering of functional data (Bongiorno and Goia, 2015). Specifically, we prove the existence of population clusters in the infinite-dimensional functional case, under mild conditions. We show that an analogue of the mean-shift algorithm (see, for example, Fukunaga and Hostetler, 1975;Cheng, 1995, and the more recent works of Comaniciu, Ramesh and Meer, 2001;Carreira-Perpiñán, 2006) can identify local modes of a "pseudo-density". We devise an algorithm to classify local modes as representatives of significant clusters, and under some regularity assumptions on the pseudo-density, we further show that the algorithm is consistent. We develop our theory assuming that the data are observed as continuous curves defined on some interval. Because in practice one does not observe continuous curves, we also show how to apply the procedures that we propose to real data (e.g. noisy measurements of random curves on a grid).
Modal clustering is typically a finite-dimensional problem, but motivated by the flourishing literature on FDA and by the increasing interest in developing sound frameworks and algorithms for clustering of random curves, we extend the idea of modal clustering to the case where X is a functional random variable valued in an infinite-dimensional space. In particular, we develop a theory of modal clustering for smooth random curves that are assumed to belong to the Hölder space H 1 ([0, 1]) of curves defined on the standard unit interval whose first weak derivative is square integrable. We focus on H 1 ([0, 1]) for concreteness, but our theory generalizes to any more general spaces. Furthermore, our theory is density-free and nonparametric, as no assumptions are made regarding the existence of a dominating measure for the law P of the functional data, nor it is assumed that P can be parametrized by a finite number of parameters.
In the finite-dimensional modal clustering problem, we have that p : X → R + is the probability density function associated to the law P of a random variable X valued in X ⊆ R d . If p is a Morse function (i.e. p is smooth and its Hessian is not singular at the critical points), then the local modes of p, μ 1 , . . . , μ k , induce an partition of the sample space X = C 1 ∪ C 2 ∪· · ·∪C k where the sets C i satisfy 1. P (C i ) > 0 ∀i = 1, . . . , k 2. P (C i ∩ C j ) = 0 if i = j 3. P (∪ k i=1 C i ) = 1 4. x ∈ C i ⇐⇒ the gradient ascent path on p that starts from x eventually converges to μ i .
Note that this framework characterizes C i as a high-density region surrounding the local mode μ i of p and each set C i ∈ C is thought of as a cluster at the population level. Unlike other approaches to clustering which define clusters exclusively at the sample level (consider k-means, for instance), modal clustering provides an inferential framework in which the essential partition C is a population parameter that one wants to infer from the data. In fact, as soon as an i.i.d. sample X 1 , . . . , X n ∼ P and an estimatorp of p are available, the goals of modal clustering are exactly • estimating the local modes of p by means of the local modes ofp • estimating the population clustering C = {C 1 , . . . , C k } by means of the empirical partitionĈ = {Ĉ 1 , . . . ,Ĉk} induced byp Thus, the typical output of a modal clustering procedure consists of the estimated clustering structureĈ and a set of cluster representativesμ 1 , . . . ,μk. At the sample level, each data point is then uniquely assigned to a clusterĈ i ∈Ĉ and represented by the corresponding local modeμ i . Because it is generally not possible to define a probability density function in infinite-dimensional Hilbert spaces, we instead focus on a surrogate notion of density which we call "pseudo-density". Generally, by pseudo-density we mean any suitably smooth functional which maps the sample space X into the positive reals R + = [0, ∞). In particular, we focus on a family of pseudo-densities P = {p h : X → R + ; h > 0} which is parametrized by a bandwidth parameter h and, more specifically, p h is the expected value of a kernel density estimator, where K is an appropriately chosen kernel function and h is the bandwidth parameter. Clusters of curves are then defined in terms of the L 2 gradient flow associated to p h . The gradient flow associated to p h ∈ P is the collection of the gradient ascent paths π x : R + → X corresponding to the solution of the initial value problem d dt π x (t) = ∇p h (π x (t)) π x (0) = x, (1.2) where ∇p h (x) is the L 2 functional gradient of p h at x ∈ X . In complete analogy with the finite-dimensional case, the gradient of p h induces a vector field and a gradient ascent path is a path in H 1 ([0, 1]), π x ⊂ H 1 ([0, 1]), that solves the initial value problem and flows along the direction of the vector field (at any time t ≥ 0, π x (t) is an element of H 1 ([0, 1])). The path π x has the property that, at any time t ≥ 0, the derivative of π x (t) corresponds to the gradient of p h evaluated at π x (t). If the trajectory π x converges to a local mode μ i = μ i (h) of p h as t → ∞, then x is said to belong to the i-th cluster of p h , C i = C i (h). Thus, the cluster C i is defined as the set where π x is a solution of the initial value problem of equation (1.2). According to the above definition, the i-th cluster of p h corresponds to the basin of attraction of the i-th local mode μ i of p h , and the collection of the clusters C i provides a summary of the subpopulations associated to the probability measure P . The main contribution of our work is to identify conditions under which 1. there exist population clusters in functional data, i.e. the population clusters defined in equation (1.3) exist and are well-defined 2. these clusters are estimable.
Additionally, we describe a procedure to estimate the clusters and assess their statistical significance. This procedure can also be used as a means to perform bandwidth selection in practice.
As we further discuss later in the paper, the most remarkable challenge arising in the infinite-dimensional setting is the lack of compactness. As opposed to the finite-dimensional setting, in the functional case it is hard to show the existence, the uniqueness, and the convergence of the gradient ascent paths described by the initial value problem of equation (1.2), unless the sample space X can be compactly embedded in another space. We show that we can overcome this challenge by exploiting the compact embedding of H 1 ([0, 1]) in L 2 ([0, 1]), the space of square-integrable functions on the unit interval, and by studying equation (1.2) using these two non-equivalent topologies. For convenience, we focus on functional data belonging to H 1 ([0, 1]) and on the gradient flow under the L 2 norm, but the exact same theory carries over to other function spaces, different norms and different pseudo-density functionals, as long as it is possible to compactly embed the sample space X in a larger space and the chosen pseudo-density functional is sufficiently smooth. In particular, we remark that the results of this paper can be straightforwardly generalized to arbitrary pairs of Sobolev spaces of integer order satisfying the compact embedding requirement.
The theory of clustering that we develop in this work is projection-free, since it does not involve projecting the random curves onto a finite-dimensional space. However, if the probability law P of the functional data is supported on a finitedimensional space and admits a proper density with respect to the Lebesgue measure, we show that the gradient flow on the pseudo-density p h and the gradient flow on the expectation of the kernel density estimator of the data coincide (and so coincide the corresponding population clusterings).
One of the most important practical tasks in modal clustering is to identify significant local modes, as these are associated to informative clusters. We provide an algorithm that • identifies the local modes of the population pseudo-density p h by analyzing its sample versionp h ; furthermore, all of the local modes ofp h identified by the algorithm converge asymptotically to their population correspondents of p h • is consistent (under additional regularity assumptions on p h ), in the sense that it establishes a one-to-one correspondence between the sample local modes that it identifies and their population equivalents.
While from a purely mathematical standpoint a sample of functional data {X i } n i=1 is thought of as a collection of continuous curves defined on an interval, we never observe such objects in practice. Rather, we typically only observe noisy measurements of the X i 's at a set of design points {t j } m j=1 . As an intermediate step, we therefore estimate the X i 's from these observations (which constitutes a typical regression problem), and then use the estimates as the input of our procedure. The remainder of the paper is organized as follows. Section 2 provides a concise literature review. Section 3 is devoted to the development of our theory of population clustering for smooth random curves. In particular, there we study in detail the L 2 gradient flow on the pseudo-density p h and establish that, in analogy to the finite-dimensional case, population clusters of smooth random curves can be defined in terms of the basins of attraction of the critical points of p h . Section 4 describes the behavior of the L 2 gradient flow of p h when the probability law P of the data is supported on a finite-dimensional subspace. Section 5 provides an algorithm to identify the significant local modes ofp h and shows that, under additional regularity assumptions, the algorithm is consistent. Section 6 extends the results of Section 5 to discretized and noisy functional data, and Section 7 provides examples of application of the clustering methodology described in this paper both to simulated and real functional data. Section 8 contains a discussion on the choice of the pseudo-density functional while Section 9 provides guidelines for the selection of the bandwidth parameter h. Section 10 summarizes the contributions of this paper. The proofs of the main results can be found in Appendix A, while auxiliary results (such as probability bounds for the estimation of the pseudo-density functional p h and its derivatives) are deferred to Appendix B.

Related literature
The difficulties associated to the lack of proper density functions in infinitedimensional spaces are well-known among statisticians. This has stimulated the introduction of various surrogate notions of density for functional spaces. The literature on pseudo-densities includes the work of Gasser, Hall and Presnell (1998), Hall and Heckman (2002), Delaigle and Hall (2010), and Ferraty, Kudraszow and Vieu (2012).
A population framework based on Morse theory for nonparametric modal clustering in the finite dimensional setting is presented in Chacón (2015). Whenever a proper density p : X → R d exists and it is a Morse function, the problem of equation (1.2) induces an essential partition of the sample space X ⊆ R d in the sense that each set C i in the partition of X such that P (C i ) > 0 corresponds to the basin of attraction of a local mode μ i of p, i.e. C i = {x ∈ X : lim t→∞ π x (t) = μ i }. Furthermore, if p has saddle points, the basin of attraction of each saddle is a null probability set (similarly, the basin of attraction of a local minimum is a singleton and hence negligible as well).
A number of gradient ascent algorithms have been developed to perform modal clustering in the finite-dimensional case. One of the most popular modefinding and modal clustering algorithms is the mean-shift algorithm (Fukunaga and Hostetler, 1975;Cheng, 1995). A version of the mean-shift algorithm for functional data is discussed in Ciollaro et al. (2014). A gradient ascent algorithm for functional data is proposed in Hall and Heckman (2002).
The international FDA community is very vibrant. The most recent literature includes the work of Cuevas (2014) and Goia and Vieu (2015), who provide a general overview and an account of some key advances in high and infinitedimensional statistics, as well as a book on functional ANOVA by Zhang (2013) and a book on the theoretical foundations of FDA by Hsing and Eubank (2015).
With particular focus on clustering, we would like to point out the recent work Bongiorno and Goia (2015) where the authors propose a clustering method for functional data based on the small ball probability function ϕ h (x) = P ( X − x 2 ≤ h) and on functional principal components. Finally, a recent overview of other clustering techniques for functional data can be found in Jacques and Preda (2013).

A population background for pseudo-density clustering of functional data
We denote by X ∼ P a functional random variable valued in L 2 ([0, 1]), the space of square integrable functions on the unit interval with its canonical inner product x, y L 2 = 1 0 x(s)y(s) ds and induced norm x L 2 = x, x L 2 . As we previously mentioned, it is not possible to define a proper probability density function for P . Instead, we study the L 2 gradient flow of equation (1.2) associated to the functional mapping L 2 ([0, 1]) into R + , where h > 0 is a bandwidth parameter, K : R + → R + is a kernel function, and P X−x 2 L 2 /h denotes the probability measure induced by P through the map X → X − x 2 L 2 /h. Note that p h is closely related to the so-called small-ball probability function ϕ h (x) = P ( X − x 2 L 2 ≤ h). It is easy to see that p h (x) = ϕ h (x) when K(s) = 1 [0,1] (s), therefore p h can be thought of as a smoother version of ϕ h .
Unless otherwise noted, we make the following assumptions throughout the paper: (H1) K : R + → R + is twice continuously differentiable and the following bounds hold on the derivatives of K h (·) = K(·/h): where the constants K 0 , K 1 , K may depend on h. (H2) K (t 2 ) + K(t 2 ) ≤ 0 for all t ∈ R + . (H3) X is P -almost surely absolutely continuous and its moments satisfy E P X L 2 ≤ M 1 < ∞ and E P X L 2 ≤ N 1 < ∞ for some constants M 1 and N 1 .
(H4) All the non-trivial critical points of p h are isolated under the L 2 norm, i.e. there exists an open L 2 neighborhood around each critical point x * of p h with p h (x * ) > 0 such that there are no other critical points of p h that also belong to that neighborhood.
Various kernels can be shown to satisfy assumptions (H1) and (H2). For instance, both the compactly supported kernel K(t) ∝ (1−t) 3 1 [0,1] (t) and the exponential kernel K(t) ∝ e −t 1 [0,∞) (t) satisfy our assumptions. (H3) is an assumption on the smoothness of the random curves. Intuitively, (H3) corresponds to assuming that the probablility law P does not favor curves that are too irregular or wiggly.
(H4) is a regularity assumption on the functional p h : essentially, under the above assumptions on K, (H4) corresponds to assuming that the functional p h does not have flat "ridges" in regions where it is positive.
Then, for h sufficiently small, all the critical points of p h in int(S c ) are nondegenerate and there are no non-trivial critical points outside of int(S c ).
In order to simplify the discussion, from now on we focus on the shifted random curves X −X(0); however, with a little abuse of notation, we will keep using the letter X to mean X − X(0). This choice is just made for convenience as it significantly simplifies the proofs of many of the results that we present. Following this notational convention, X thus belongs P -almost surely to the space H 1 0 ([0, 1]) = {x : [0, 1] → R such that x L 2 < ∞ and x(0) = 0}. Poincaré inequality ensures that the semi-norm x L 2 is in fact a norm on H 1 0 ([0, 1]) and x L 2 ≤ C p x L 2 with C p = 1 (i.e. H 1 0 ([0, 1]) can be continuously embedded in L 2 ([0, 1])). In the following, we denote x H 1 0 = x L 2 for x ∈ H 1 0 . Moreover, to alleviate the notation, from now on we denote L 2 = L 2 ([0, 1]) and . If the curves were not shifted so that X(0) = 0, then they would belong P -almost surely to The main goal of this section is to show that the L 2 gradient flow associated to p h is well-defined. In particular, we establish the following facts: 1. the L 2 gradient flow associated to p h is a flow in H 1 0 2. for any initial value in H 1 0 , there exists exactly one trajectory of such flow which is a solution to the initial value problem of equation (1.2) 3. for any initial value in H 1 0 , the unique solution of the initial value problem of equation (1.2) converges to a critical point of p h as t → ∞ and the convergence is with respect to the L 2 norm 4. all the non-trivial critical points of p h are in H 1 0 , the support of P . These facts guarantee that the clusters described in equation (1.3) exist and are well-defined. Remark 2. In general, in an infinite dimensional Hilbert space, the trajectory of the solution of an ordinary differential equation such as the one of equation (1.2) may not converge as t → ∞. In fact, such trajectory can be entirely contained in a closed and bounded set without converging to any particular point of that set. To guarantee the convergence of the gradient flow trajectories, one needs that (see Jost, 2011) 1. the trajectories satisfy some compactness property 2. the functional of interest (in our case p h ) is reasonably well-behaved: for instance it is smooth, with isolated critical points.
In L 2 , compactness is a delicate problem: no closed bounded ball in L 2 is compact. However, any closed and bounded H 1 ball is compact with respect to the L 2 norm (and so is any closed and bounded H 1 0 ball). In fact, H 1 can be compactly embedded in L 2 (see, for instance, Chapter 5.7 of Evans, 1998), which means that every bounded set in H 1 is totally bounded in L 2 and H 1 can be continuously embedded in L 2 . Since H 1 0 is a closed subspace of H 1 , H 1 0 can also be compactly embedded in L 2 . From a theoretical point of view, L 2 is strictly larger than H 1 . However, H 1 is dense in L 2 .
The remainder of our discussion focuses on the main results of this section, which concern the computation of the derivatives of p h and their properties, the existence, the uniqueness, and the convergence of the solution of the initial value problem of equation (1.2).
Before we state our results, let us recall that for a functional random variable X ∼ P valued in L 2 ([0, 1]) the expected value of X is defined as the element E P X ∈ L 2 ([0, 1]) such that E P X, y L 2 = E P X, y L 2 for all y ∈ L 2 ([0, 1]) (Horváth and Kokoszka, 2012). Furthermore, the expectation commutes with bounded operators. Also, recall that for a functional F mapping a Banach space B 1 into another Banach space B 2 , the Frechét derivative of F at a point a ∈ B 1 is defined, if it exists, as the bounded linear operator DF such that F (a + δ) − F (a)−DF (δ) B2 = o( δ B1 ). The most common case in this paper sets B 1 = L 2 , B 2 = R + , and F = p h . Because DF is a bounded linear operator, if B 1 is also an Hilbert space then the Riesz representation theorem guarantees the existence of an element ∇F (a) ∈ B 1 such that, for any The element ∇F (a) corresponds to the gradient of F at a ∈ B 1 . In this way, the gradient ∇F (a) and the first derivative operator DF at a ∈ B 1 can be identified. In the following, with a slight abuse of notation, we will use DF both to mean the functional gradient of the operator F (which is an element of B 1 ) and its Frechét derivative (which is a bounded linear operator from B 1 to B 2 ). It will be clear from the context whether we are referring to the derivative operator or to the functional gradient. Note that higher order Frechét derivatives can be similarly identified with multilinear operators on B 1 (see, for example, Ambrosetti and Prodi, 1995).
Recall that, by assumption, the function K h ( X − x 2 L 2 ) is bounded from above by a constant K 0 . Furthermore, it is three times differentiable and its first Frechét derivative at x is The second Frechét derivative at x corresponds to the symmetric bilinear operator Remark 3. Any bounded bilinear operator B on L 2 can be represented as a bounded linear operator from L 2 to L 2 . In fact, let z 1 be any element of L 2 ; then, B(z 1 , ·) is a bounded linear operator from L 2 to R. By the Riesz representation theorem, one can define B(z 1 ) ∈ L 2 by letting B(z 1 ), z 2 L 2 = B(z 1 , z 2 ) for any z 2 ∈ L 2 . The operator norm of B is then defined by It is straightforward to check that both derivatives correspond to bounded linear operators under assumption (H1). The following Lemma provides the first and the second Frechét derivatives of p h .

Lemma 1. Under assumption (H1) the Frechét derivative of p
The second Frechét derivative of p h at x corresponds to the symmetric bilinear operator Furthermore, both derivatives have bounded operator norm for any x ∈ L 2 ([0, 1]).
The following Proposition shows that the L 2 gradient of p h is an element of H 1 0 . Intuitively, this means that if the starting point of the initial value problem of equation (1.2) is in H 1 0 (and a solution exists for that starting point), then we should expect that the path π x only visits elements of H 1 0 , i.e. the L 2 gradient flow associated to p h is a H 1 0 flow. Proposition 2. For any x ∈ H 1 0 , the L 2 gradient of p h at x, Dp h (x), is an element of H 1 0 such that for any y ∈ L 2 , Proposition 2 also implies that the equation d dt π x (t) = Dp h (π x (t)) is meaningful when restricted to H 1 0 . The next Lemma, Lemma 3, establishes that Dp h is locally Lipschitz under the H 1 0 norm. The subsequent Lemma, Lemma 4, guarantees that a solution of the problem (if it exists) is necessarily bounded. These two Lemmas allow us to claim that if the starting point π x (0) = x is an element of H 1 0 , then the initial value problem of equation ( , π x (t) H 1 0 ≤ 0 as soon as π x (t) H 1 0 > M. The intutive interpretation of Lemma 4 is that a trajectory π x that is a solution to the initial value problem of equation (1.2) cannot wander too far from the origin in H 1 0 . In fact, if the H 1 0 norm of π x increases too much, then the path π x is eventually pushed back into the closed and bounded H 1 0 ball of radius K 2 N 1 /δ (or radius M if one makes the stronger assumption that the probability law P of the random curves is completely concentrated on the H 1 0 ball of radius M ). This "push-back" effect is captured by the condition Dp h (π x (t)), π x (t) H 1 0 ≤ 0. By combining Lemma 3 and Lemma 4, we obtain the following Proposition 3. Under assumptions (H1), (H2), and (H3), the initial value Remark 4. Proposition 3 establishes the existence and the uniqueness of a solution to the initial value problem of equation (1.2) in the H 1 0 topology. The initial value problem can be solved uniquely in the L 2 topology as well. In fact, it is easily verified that, because D 2 p h is bounded, then the first derivative of p h , Dp h : L 2 → L 2 , is uniformly Lipschitz with respect to the L 2 norm. Thus, one only has to show that the H 1 0 flow π x of Proposition 3 solved in the H 1 0 2932 M. Ciollaro et al. topology corresponds to the L 2 gradient flow associated to p h . To verify this, one needs to check that the H 1 0 solution also satisfies the initial value problem of equation (1.2) under the L 2 norm. Specifically, consider the H 1 0 solution π x of Proposition 3 with any π x (0) = x ∈ H 1 0 . The path π x is continuously differentiable as a map from R + to H 1 0 . It suffices to check that π x (t) is continuously differentiable as a map from R + to L 2 as well. This is easily established using Poincaré inequality since where the Poincaré constant is C p = 1 for the pair (L 2 , H 1 0 ). It is clear from equation (3.8) and the definition of Frechét derivative that the H 1 0 solution π x also satisfies the initial value problem of equation (1.2) under the L 2 norm. Thus, π x is the unique L 2 solution of the intial value problem of equation (1.2).
The following Theorem, based on Proposition 3, guarantees the convergence of π x to a critical point of p h as t → ∞. The statement about the convergence strongly relies on the compact embedding of H 1 0 in L 2 , the boundedness of the first two derivatives of p h , and assumption (H4).

Theorem 1. Assume (H1), (H2), (H3), and (H4) hold. Let π
The results above show that the L 2 gradient flow on p h is well-defined and its trajectories converge to critical points of p h that are in H 1 0 whenever the starting point x = π x (0) is an element of H 1 0 . We conclude this section with the following Lemma which states that all the non-trivial critical points of p h belong to H 1 0 : thus, even though the functional p h "spreads" the probability law P of the random curves outside of its support H 1 0 (in fact, it is easily seen that there exists points x ∈ L 2 that are not in H 1 0 with p h (x) > 0), all of its non-trivial critical points still lie in the support of P .

Lemma 5. Assume (H1), (H2), and (H3) hold. Let
. Note that the stronger assumption that P ( X H 1 0 ≤ M ) = 1 is a functional analogue of the boundedness assumption which is frequently made with finitedimensional data.

Finite-dimensional adaptivity
If X ∼ P is a functional random variable and P is supported in a finitedimensional subspace of L 2 of dimension d, there is a d-dimensional orthonor-mal system that one can use to represent X in terms of a d-dimensional vector of Fourier coefficients. In this case X is just a d-dimensional random vector, hence the statistical model is multivariate and not really a functional one. Suppose now that P admits a proper density p which is supported on the span of this d-dimensional orthonormal system. Then, in this case, one could use a traditional kernel density estimatorp KDE h of p and estimate its population clusters based on its gradient flow. This section clarifies that the gradient flow of Since the two gradient flows are equivalent, the population clustering induced by p h and E Pp KDE h is the same. This is essentially a consequence of the isometric isomorphism between any d-dimensional finite-dimensional subspace of L 2 and R d .
Keeping this in mind, we can now move on to the mathematical details. For the remainder of this section, we assume that the distribution of the random function X is supported on some compact subset S c of a finite dimensional vector space. In other words, where S c is a compact subset of a finite-dimensional subspace S ⊂ L 2 . We discuss two insightful facts.
1. Under some mild extra assumptions on the finite dimensional distribution of X, it is shown in Lemma 7 that p h , as a functional from L 2 to R + , is a Morse functional. This provides an important sufficient condition under which (H4) holds. 2. If the functional random variable X admits a finite-dimensional distribution on S c , it is natural to ask whether the L 2 gradient flow on p h corresponds to the finite-dimensional gradient flow associated to the expectation of a kernel density estimator of the density of X on S. This section provides a positive answer to this question. Furthermore, we show that such finite-dimensional gradient flow is entirely contained in S.
Suppose that the probability law P of the functional random variable X is supported on a compact subset S c of a finite-dimensional space S ⊂ L 2 . If this is the case, there exists δ > 0 such that if 0 < h ≤ δ, then p h is a Morse function on the interior of S c (see Remark 1 and Proposition 1). Moreover, as implied by Lemma 6 and Lemma 7 of this section, the trajectories of the L 2 gradient flow associated to p h are all contained in S and they end at critical points of p h that belong to S c . It is natural to ask whether the L 2 gradient flow on p h corresponds to the finite-dimensional gradient flow associated to some pseudodensity on S. This section answers this question and shows that, if X admits a density function p (when X is viewed as a finite-dimensional random vector in S c ), then the L 2 gradient flow associated to p h corresponds to the gradient flow associated to the expectation of a kernel density estimator of p with bandwidth h.
Let S = span{f 1 . . . f d } be a linear subspace of L 2 . Without loss of generality, assume that the f i 's form an orthonormal basis of S equipped with the L 2 norm and that X ∈ S c almost surely. Then, X admits the decomposition X = . . . , a d ] T and suppose that the distribution ofX has density p : R d → R + with respect to the Lebesgue measure. We have the following Lemma 6. Assume (H1) and (H2) hold and P (X ∈ S c ) = 1. If x = π x (0) ∈ S, then π x (t) ∈ S for any t ≥ 0. Furthermore, all the non-trivial critical points of p h belong to S.
For the rest of this section, let us replace assumption (H4) with (H4') X is an element of S c with probability 1, X ∼ P admits density p on S c , and p satisfies the assumptions of Proposition 1.
. To see the connection between the functional gradient Dp h (x) and ∇p h (x), the gradient ofp h atx, note that the random variable agrees with the i-th component of the gradient ofp h atx. This equivalence implies that the gradient flow (with starting points in the subspace S) on p h andp h coincide (note that scaling thep h by h −d does affect the associated gradient flow). Furthermore, there exists a δ > 0 depending on p such that p h (x) is a Morse function for 0 < h ≤ δ (see Remark 1). Therefore, all the non-trivial critical points ofp h are separated in R d . In light of Lemma 6, all the non-trivial critical points of p h are thus separated in S (and in L 2 ). Next, we have the following Lemma which guarantees that if p is a Morse density on S c , then the non-trivial critical points of p h are non-degenerate for h sufficiently small and they all belong to S c (a critical point In the finite-dimensional case considered in this section, we can say more about the behavior of the L 2 gradient flow on p h . In particular, we can characterize the solutions to the initial value problem of equation (1.2) also for the case in which the starting point x = π x (0) does not belong to the support of P (which is, in this case, S c ⊂ L 2 ). In fact, let x be an element of L 2 which does not belong to S. The Gram-Schmidt orthogonalization process guarantees that The following Lemma guarantees that the gradient ascent path originating from x is entirely contained in S . Its proof is identical to that of Lemma 6.

Lemma 8. Assume (H1) and (H2) hold and that
Remark 5. In the finite-dimensional setting of this section (in particular under assumption (H4')), and for h sufficiently small, the basin of attraction of a saddle point of p h is negligible: in fact, from the above disussion, it is clear that if the random function X ∼ P is valued in a compact subset S c of a finitedimensional linear subspace S of L 2 and P has a proper Morse density p on S c , then the basin of attraction of any saddle point of

Statistical relevance of the estimated local modes
The empirical counterpart of i.d functional random variables with probability law P . The critical points ofp h can be found, for example, by using a functional version of the mean-shift algorithm (see Hall and Heckman, 2002;Ciollaro et al., 2014). In this section, we provide a statistical algorithm to detect whether a critical point ofp h corresponds to a local maximum of p h . This algorithm provides two insights for functional mode clustering.
1. For finite-dimensional clustering problems, if the underlying density p is a Morse function, then the basin of attraction of a saddle point of p has null probability content as it corresponds to a manifold of lower dimension. In functional data clustering, however, the structure of the functional space is more complicated in the sense that there is no guarantee that the probability content of the basin of attraction of a saddle point of p h is negligible, even if p h is a Morse function. However, in analogy with the finite-dimensional case, clusters associated to non-degenerate local modes should generally be considered more informative as opposed to clusters associated with saddle points. 2. Several results in the previous section are derived under assumption (H4), which essentially states that the relevant critical points of p h are wellbehaved. Without assuming (H4), the algorithm provides a simple way to classify well-behaved local modes of p h by analyzingp h . Thus, informative clusters can still be revealed in a less restrictive setting.
Since the local modes ofp h that correspond to non-degenerate local modes of p h provide the greatest insight about the population clustering, we refer to these local modes as "significant" local modes. In the following, we derive a procedure that allows us to discriminate the significant local modes from the non-significant ones. Before giving the definition non-degeneracy for a critical point of a functional defined on an Hilbert space (L 2 in our case), it is convenient to adopt the convention that a linear operator from an Hilbert space to itself can be associated to a bilinear form on the Hilbert space and vice versa. For example if T : L 2 → L 2 is a linear operator, then it can be associated to a bilinear form by letting It is a known fact that for any x, the second derivative of f , D 2 f (x), is a selfadjoint linear operator. Furthermore, the following Lemma follows as a simple consequence of the fact that the second derivative of f at a non-degenerate local maximum is a self-adjoint negative-definite isomorphism. (5.1) Let now f 1 , f 2 : L 2 → R + be twice continuously differentiable with bounded third derivative. Consider the following abstract setting for f 1 and f 2 .
(C1) The non-trivial critical points of f 1 and f 2 are all in have H 1 0 solutions whose trajectories admit a convergent subsequence in where · stands for the appropriate norms. Also, for i = 1, 2 and k = 0, 1, 2, 3, let Remark 6. Of course, the results that we obtain here are most useful for the particular case where and X 1 , . . . , X n ∼ P are i.i.d. functional random variables valued in H 1 0 . In this case, Lemma 5 and Proposition 3 provide sufficient conditions for (C1) and (C2), respectively. The boundedness for β k is ensured by (H1), and the probability bounds in Appendix B guarantee that η l converges to 0 as the sample size n increases. (5.5). For any α ∈ (0, 1), we can derive a procedure based on Lemma 10 which allows us to classify nondegenerate local modes ofp h as significant and construct an L 2 neighbor around them with the property that the probability that each of such neighbors contains a non-degenerate local mode of p h is at least 1 − α for n large enough. The procedure is summarized in Display 1 and its statistical guarantees are described in Proposition 4.
Output: a setR of significant local modes ofp h .
1. Computep h and determine the set of non trivial local max ofp h ,Ĉ (here non-trivial then classifyx * as a significant local mode ofp h . Here β 3 = 12K 3 .
Display 1 . Assume (H1) and (H2) hold and P ( X H 1 0 ≤ M ) = 1 for some known M > 0. LetR denote the set of points classified by the algorithm of Display 1. Then, for large enough n, with probability 1 − α the following holds for allx * ∈R: ( 5.6) According to Proposition 4, with probability 1 − α, for everyx * ∈R, there exists a unique x * ∈ R contained in the right hand side of equation (5.6). In other words, with probability 1 − α, Φ is a well-defined map. Under suitable assumptions on p h (x), more can be said.
Proposition 5. Assume that (H1) and (H2) hold and that P ( X H 1 0 ≤ M ) = 1 for some known M > 0. Suppose further that p h has finitely many nondegenerate local modes. Let R denote the collection of non-trivial local maxima of p h . Then, with probability converging to 1 as n → ∞, every x * ∈ R has a unique preimage of Φ inR.
Remark 7. Under the assumptions of Proposition 4 and 5, one can conclude that with probability converging to 1 − α, the map Φ :R → R is bijective. In other words, the algorithm of Display 1 is consistent.

From theory to applications
So far, all the results have been developed in an infinite-dimensional functional space. In this section, we connect the theory that we developed to practical applications and, in particular, we address the following challenges.
1. Complete functional data can never be observed: a functional datum is always observed on a discrete grid. For example, let {X i } n i=1 be an i.i.d sample from a distribution P on H 1 0 and let {t j } m j=1 be a set of equally spaced design points. In practice, only noisy measurements of the X i 's at {t j } m j=1 are available. It is therefore important to design procedures that work with discretized curves. 2. While the theory is developed in an infinite-dimensional functional space, in practice any functional clustering method relies on the use of only finitely many basis functions. However, a flexible algorithm for functional data clustering should be asymptotically consistent with the infinite-dimensional theory.
One way to accomplish these two tasks at the same time is to apply a projection method. As shown later in this section, projections onto a linear space introduce small L 2 perturbations to the functional data and to the pseudo-density.
Nonetheless, the procedure that we describe is tolerant to such perturbations (see Corollary 1 for more details). Before turning to the technical arguments, let us describe the following simple example which motivates the projection approach.
Example 1. Consider the simple model y = X(t)+ , where X ∼ P is a random function and is a random variable independent of X. Instead of observing n complete random function samples {X i } n i=1 , one only observes the discrete noisy measurements j=1 is a set of equally spaced design points for the samples and the measurement errors ij ∼ N (0, σ) are independent of {X i } n i=1 . In Gasser and Müller (1984), for example, if one assumes further that the random function X is bounded in H 2 , i.e. it is shown that there exists a kernel W so that an approximation of X can be constructed asX If b is chosen to be of order m −1/5 , the above estimator also satisfies

4)
where C(M 2 , W ) is a constant only depending on M 2 and W , and the expectation is taken with respect to only.
As shown in the example, with noisy discrete measurements of the functional datum X, one can construct an approximatioñ where b is of order O(m −1/5 ). This approximation corresponds to a perturbed version of the underlying complete functional datum. The perturbation vanishes asymptotically as the number of discrete measurements m goes to infinity. This motivates the following assumption.
Recall that in our theoretical results, the sample version of the pseudo density takes the formp When the only available functional data are The following simple Lemma is useful to characterize the aforementioned L 2 perturbation and allows us to derive Corollary 1.
where · stands for the appropriate L 2 operator norm.

Corollary 1. Consider a modified version of the algorithm of Display 1, wherê
25K 2 2 log (4 log(n)/α) 8n (6.10) LetR be the significant local modes learned by this modified version of the algorithm. Then, the following statements are true.

Consider the map
where R denotes the collection of non-degenerate local modes of p h . Suppose further that p h has finitely many non-trivial local modes and that they are non-degenerate. Assume that {X i } n i=1 ∈ B H 1 0 (0, M). Then, with probability converging to 1 as n → ∞, every x * ∈ R has a unique preimage iñ R with respect to Φ.
It is desirable to have a method that does not use infinitely many L 2 basis functions to compute a non-degenerate local modex * ofp h and δ(x * ) := − sup u L 2 =1 Dp h (x * )(u, u). Lemma 6, together with the assumption that {X i } n i=1 ⊂S, ensures that all the non-trivial critical points ofp h (x) belong toS. Equation (A.31) of Appendix A shows that for any v ∈ (S) ⊥ , Therefore, in analogy to the results of Section 4, in order to classify the significant local modes ofp h it is not required to consider infinitely many L 2 basis functions.
Remark 9. The assumptions of Example 1 do not immediately guarantee that as we assume in the second claim of Corollary 1. However, simple computations show that for some positive constant C. Therefore, as long as n = o(m 2 5 ), the consistency result (the second claim) of Corollary 1 still holds. Projection using regression estimators introduces some bias to the pseudo density. However, as long as the observation grid is sufficiently fine (m is large enough), then the bias is of lower order. For example, if n = o(m 2/5 ), then φ(m) = O(m −4/5 ) = o(n −2 ), which is of lower order in (6.10). Thus, in this case, one can simply focus on the projected version of the observed functional data.

Simulations and applications
In this section, we apply the methodology that we discussed on two simulated functional datasets and on a real dataset. In the following examples, we describe two practical ways to select the bandwidth and we show that both lead to meaningful estimated clusters. We use the mean-shift algorithm (Fukunaga and Hostetler, 1975;Cheng, 1995) with the exponential kernel to identify the local modes and cluster the data.
The mean-shift algorithm is a recursive algorithm that is equivalent to gradient ascent with a particular choice of adaptive step-size. In particular, the mean-shift algorithm approximates the gradient ascent path ofp h starting at a point x = x 0 by means of the recursive update (7.1) In practice, one typically takes x 0 ∈ {X 1 , . . . , X n }. In what follows, the constants C 1 (α) and C 2 (α) of the algorithm of Display 1 are determined by using the nonparametric Bootstrap. Recall that C 1 (α) and C 2 (α) are simply the 1 − α quantiles of the random variables η 1 and η 2 of equation (5.3). The expressions of C 1 (α) and C 2 (α) reported in Display 1 allow us to deduce the convergence rate of the procedure, but they represent rather conservative bounds. In practice, it is advisable to determine this quantities in a data-driven way (for example, use non-parametric bootstrap to find the quantiles of equation (5.3)).

Integrated Brownian motion
DATA Let W i denote a realization of the standard Brownian motion process on t ∈ [0, 1]. We generate i.i.d. trajectories of the integrated Brownian motion process with drift The curves above are generated on a grid of 200 equally-spaced points in [0, 1] and they are displayed in Figure 1. Because of the variability of the process and the noise level σ, it is hard to distinguish the presence of the three distinct clusters in the observed trajectories.
PROCEDURE We run the mean-shift algorithm using the exponential kernel and a set of candidate bandwidths corresponding to the 10, 15, 20, 25, . . . 100 percentiles of the distribution of all pairwise L 2 distances between the trajectories X 1 , . . . , X 300 . Figure 2 depicts how the number of estimated local modes (hence the number of estimated clusters) varies as the bandwidth increases: when the bandwidth is too small (overclustering),p h has a large numbers of noninformative local modes, whereas when the bandwidth is too large (underclustering) all the data points are eventually merged in a unique cluster corresponding to the unique mode ofp h . Frequently, one observes a relatively large intermediate range of bandwidths where the clustering structure is stable. In Figure 2, we notice the presence of such range. There, the number of clusters is stable and equal to 3. This gives a heuristic to choose h in practice. Figure 3, depicts the estimated clustering structure when h is chosen within the aforementioned range. Note that the estimated local modes (continuous light blue lines) are very closed to the quadratic drift functions of equations (7.2) and (7.3) (broken light blue lines).

Fig 3. Estimated local modes (continuous light blue lines) and estimated clusters (red, grey, and black) in the integrated Brownian motion example. The broken lines (quadratic drifts) correspond to the true local modes.
These random coefficients are used to generate noiseless functional data according to Noisy measurements of these curves are then obtained from the model where {t j , j = 1, . . . , 500}, are equally-spaced points in [0, 1] and i,j are i.i.d. draws from N (0, 0.5 2 ). Figure 4 depicts two noisy observations of these curves while Figure 5 is a pairs scatterplot representing the coefficients of the projection of the Y i 's on an 8-dimensional cosine orthonormal basis. Because of the very low signal-to-noise ratio of the Y i 's, neither of these plots clearly provides conclusive evidence about the presence of two distinct clusters.

PROCEDURE
The goal is once again to choose h and recover the hidden clustering structure. We proceed as follows.
1. For all i = 1, . . . , 2000, we obtainX i , an estimate of X i , by using an orthogonal projection estimator on the canonical cosine orthonormal basis. We use Generalized Cross Validation to optimally pick the number of basis functions (which is 8 in this case). 2. We randomly split the set ofX i 's in two subsamples of size 1000 each. 3. We form a set of candidate values for h.  4. For each candidate value of h, the first subsample ofX i 's is used to estimate the functional modes and the corresponding clusters. 5. For each estimated clustering structure, the second subsample ofX i 's is used to test for the clusters' significance (α = 0.10).
Thanks to the availability of a test for the significance of the local modes, we have a principled way of selecting the bandwidth and an alternative to the heuristic proposed in the previous application. In particular, h should be chosen as the bandwidth which reveals the largest amount of statistically significant structure in the data, i.e. as the bandwidth that maximizes the number of significant clusters.

RESULTS
The final result of this procedure is summarized in Figure 6. In this case, the final bandwidth is h = 0.05 and, despite the low signal-to-noise ratio of the observed data, we detect the presence of two distinct significant clusters. Note once again that, for too large values of the bandwidth, we observe a single cluster corresponding to the entire sample.

Neural activity curves
Figure 7 100 out of 1000 curves which correspond to recordings of neural activity at 32 equally-spaced time points. These data come from a behavioral experiment performed at the Andrew Schwartz's Motorlab (University of Pittsburgh) on a macaque monkey. The monkey performs a center-out and out-center targetreaching task with 26 targets in a virtual 3D environment and these curves represent voltage changes over time in the monkey's neurons.
It is known that in this dataset there are three distinct clusters (see Taylor, Tillery Helms and Schwartz, 2002 for more details). In this example, we show that our procedure allows us to recover the true clustering structure, even if we did not have any a priori information about the true number of clusters.
We follow the same strategy outlined in the previous example regarding the two-dimensional linear space spanned by the sine functions. While we only show 100 curves in Figure 7, the clustering procedure is run on the entire  dataset (1000 curves). Figure 8 displays the number of clusters and the number of significant clusters as the bandwidth is varied. Once again, the bandwidth is chosen to be the value which maximizes the number of significant local modes (α = 0.10). In this case, this corresponds to h = 140 or h = 150, both of which produce three significant clusters. Finally, Figure 9 displays the same subset of curves of Figure 7, this time colored by the cluster membership, together with the three local modes associated to the clusters.

On the choice of the pseudo-density functional
It is well-known that the Lebesgue measure does not exist in infinite-dimensional spaces. As a consequence, a proper density function for a functional random variable cannot generally be defined (Delaigle and Hall, 2010). Developing a theory of modal clustering for functional data necessarily requires a choice of a surrogate notion of density that substitutes the probability density function associated with the data. A pseudo-density should satisfy some basic differentiability properties, so that one can study the associated gradient flow. While we explicitly choose to use the functional p h of equation (3.1), one could in principle work with a different functional. The choice of the pseudo-density is not an easy one, however.
First of all, with particular emphasis on the setting that we consider, we point out that, while tempting, one cannot naively assume that the pseudo-density is L 2 differentiable (or even just continuous) and also vanishes as the H 1 0 norm diverges. Proof. Consider the sequence of functions x n (t) = n −1 sin(n 2 t) for n ≥ 1 and t ∈ [0, 1]. Clearly, x n ∈ H 1 0 ([0, 1] for any n ≥ 1. Notice that x n L 2 → 0 and x n H 1 0 → ∞ as n → ∞. Thus, by assumption, p(x n ) → p(0) = 0 as n → ∞. Consider now z n = y + x n where y ∈ H 1 0 ([0, 1]). We have z n L 2 → y L 2 as n → ∞, hence p(z n ) → p(y) as n → ∞. However z n H 1 0 ≥ x n H 1 0 − y H 1 0 . Hence z n H 1 0 → ∞ as n → ∞. We thus have p(z n ) → p(y) = 0 as n → ∞ for any y ∈ H 1 0 ([0, 1]), implying that p is null on H 1 0 ([0, 1]).
The argument above shows that requiring a pseudo-density to be L 2 continuous and to vanish outside of H 1 0 necessarily leads to an uninteresting scenario for modal clustering, despite the fact that these two requirements apparently sound reasonable at first and carry some resemblance with the standard assumptions that are made on density functions in finite-dimensional problems.
Secondly, analyzing the asymptotic regime where h = h n → 0 makes little sense even in the most well-understood situations. In fact, let us consider the following two settings in which one typically chooses h = h n → 0 as n → ∞.
1. If the law P of X is supported on a finite-dimensional space and admits a density p, then the bias ofp h is easy to compute and one can choose h n → 0 to optimally balance the bias-variance trade-off as it is usually done in density estimation. However, if p is defined over a finite-dimensional vector space S, the gradient flow is not well-defined outside of S. As discussed in Section 6, all of the observed functional data are generally reconstructed from noisy discrete measurements. As a result, neither the observed discrete measurements nor the reconstructed functional data are in S, and the gradient flow with respect to p starting at any of these elements is not well-defined. 2. If the law of P of X is supported on an infinite-dimensional space (for example X is a diffusion process), then some authors (see, for instance, Gasser, Hall andPresnell, 1998 andFerraty, Kudraszow andVieu, 2012) suggest to implicitly define a pseudo-density by assuming a particular factorization of the small-probability function associated to P . In particular, it is assumed that as h → 0 for some pseudo-density functional p depending only on the center of the ball x and some function φ depending only on the radius of the ball h. In this second case, p is non-zero only on its domain S 1 , which is typically taken to be a compact subset of the infinite-dimensional functional space. Compact subsets of L 2 are singular in the sense that any L 2 closed ball is not compact. As a result, the pseudo-density p is also singular, hence not L 2 differentiable. One might then assume that p is differentiable with respect to norm induced by S 1 and study the gradient flow associated to p using the S 1 topology. In light of the first point just given, one should then assume that S 1 is the closure of an open set of an infinite-dimensional functional subspace. This leads to an even more serious problem: closed bounded balls in S 1 are not compact under the S 1 topology. The lack of compactness implies that the gradient ascent paths are not guaranteed to converge.
On the other hand, the pseudo-density functional p h (x) = E P K( with h > 0 is a natural candidate to develop a theory of modal clustering of smooth random curves in a density-free setting. Furthermore, the functional p h corresponds to the functional discussed by Hall and Heckman (2002), who proposed a mode-finding algorithm for functional data and had the intuition that their algorithm was approximating a gradient flow on the estimatorp h .
Of course, from a practical point of view, one still has to choose a value for h. The next section describes two bandwidth selection strategies.

Bandwidth selection
The choice of the bandwidth is generally a difficult task in nonparametric problems. Furthermore, the difficulty increases as the dimension of the model increases (see for instance Scott, 2015).
For multivariate data, the behavior of the topological structure ofp (the estimator of the underlying density function) often exhibits a phase-transition: for small values of h (undersmoothing) the estimated density generates many irrelevant clusters (overclustering), while for large values of h (undersmoothing) p generates few uninformative clusters (underclustering) which eventually merge into a single cluster once h becomes large enough. Interestingly, one can usually identify a relatively broad range of intermediate values of h for which the number of clusters associated top is stable (see for instance, Genovese et al., 2016).
As we illustrate in Section 7,p h tends to behave in a very similar way when h varies (see also Figure 2). This leads to a simple yet effective heuristic for bandwidth selection: h should be chosen in the range where the clustering structure is stable.
If a test for the significance of the estimated local modes ofp h is available, then one can use a complementary and more principled approach: h should be chosen so to reveal the largest amount of statistically significant structure in the data (Genovese et al., 2016). In particular, in the context of nonparametric modal clustering for functional data, h should be chosen to maximize the number of significant local modes ofp h using, for example, the test proposed in Section 5. This second approach to bandwidth selection is illustrated by means of two applications in Section 7 (see Figure 6 and Figure 9).

Discussion and conclusions
In this paper, we provide a general theoretical background for clustering of functional data based on pseudo-densities. We show that clusters of functional data can be characterized in terms of the basins of attraction of the critical points of a pseudo-density functional, both at the population and at the sample level. Our theory can be generalized to different functional spaces, as long as the chosen pseudo-density functional is sufficiently smooth and the range of the functional random variable X can be compactly embedded in a larger space to guarantee the compactness of the gradient flow trajectories. Because of the need of a compact embedding, one has to consider two non-equivalent topologies at the same time (in our case the L 2 and the H 1 topologies): from a statistical viewpoint, this means that the data need to be smoother than the ambient space in which they are embedded.
Besides compactness, there is another element that makes the theory of population clustering in the functional data setting more challenging when compared to the finite-dimensional case. This is the fact that the basin of attraction of a saddle point of p h is not necessarily negligible. While in the finite-dimensional setting the basin of attraction of a saddle point of the Morse density function p is a manifold whose dimension is strictly smaller than the dimension of the domain of p (and therefore its probability content is null), the same property is not necessarily satisfied by a pseudo-density functional in the infinite-dimensional and density-free setting that we consider. Nonetheless, in analogy to the finitedimensional case, one expects that clusters that are associated to the local modes of p h are more informative than those associated to the saddle points of the same functional.
It becomes natural to ask whether it is possible to derive a statistical procedure that marks a local mode ofp h (and its associated empirical cluster) as significant whenever it corresponds to a non-degenerate local mode of p h . We provide a consistent algorithm to achieve this task that can be applied to real data, such as noisy measurements of random curves on a grid. When only noisy and discretized versions of the functional data are observed, one typically projects the noisy observations onto a linear space and use the resulting projections (which correspond to perturbed versions of the partially observed underlying functional data) for subsequent statistical analyses. We show that the asymptotic properties of the proposed algorithm are preserved with noisy and discretized functional data, as long as the size of the discretization grid grows suitably fast with the sample size.
The algorithm that we propose can also be used to appropriately select the bandwidth parameter in practice: h should be chosen as the bandwidth that maximizes the number of significant local modes (and therefore the number of significant clusters) in the data. We demonstrate the this bandwidth selection criterion performs well both on simulated and real data in Section 7. The logic behind this rule is that one wants to uncover as much structure as possible based on the observed data, while at the same time being confident that the uncovered structure corresponds to actual population features.
Finally, it should be mentioned that kNN (k-Nearest Neighbors) bandwidth parameters have gained a growing popularity in nonparametric FDA. kNN bandwidths have the advantage of being more sensitive to local features of the empirical distribution of the data. The kNN bandwidth at location x ∈ L 2 is defined as i.e. H n,k (x) is the smallest positive real number h such that the L 2 ball of radius h centered at x ∈ L 2 contains exactly k sample observations (see, for instance, Kudraszow and Vieu, 2013). It is clear from the definition that H n,k is a random variable that depends on the sample and the sample size. Extending our theory to kNN bandwidths is a challenging task because the population functional becomes much more involved. Specifically, the population pseudo-density corresponding to a kNN bandwidth is Future work shall investigate the extent to which our theory can be generalized to these types of local bandwidths and the practical advantages that these may have in the context of nonparametric clustering based on local modes and gradient flows.

Appendix A: Proofs of the results
Proof of Proposition 1. For convenience, we prove the result under the additional assumption that K is compactly supported on [0, 1]. The proof that we present can be easily extended to exponentially decaying kernel functions and this extra assumption can be safely removed.
Consider the set of assumptions of the Proposition. Furthermore, letK 2 = sup x∈R d ∇ 2 p(x) 2 and K 1 = inf x∈∂Sc ∇p(x) 2 . Note that since ∂S c is compact, K 1 > 0. Consider now the set S = {x ∈ S c : d(x, ∂S c ) ≥ }. Then, there exists a set Ω such that S 2 ⊂ Ω ⊂ S and ∂Ω is also smooth. As a result, if x ∈ ∂Ω, then ≤ d(x, ∂S) ≤ 2 and inf x∈S∩Ω c ∇p(x) 2 ≥ K 1 /3. Since p is Morse on Ω and twice continuously differentiable on int(S c ), then standard mollification results guarantee that there exist η > 0 and h 1 > 0 such that if 0 < h ≤ h 1 , then sup x∈Ω ∇ (i) p h (x) − ∇ (i) p(x) 2 ≤ η for i = 0, 1, 2. Then, Lemma 16 of Chazal et al. (2014) 6K2 and let n(·) denote the outward normal vector to S c with unitary norm. We have Note that p(y) = 0 if y ∈ ∂S c and, since K is compactly supported on [0, 1], h). Hence, the last two integrals on the boundaries are null. Now, since ∇p isK 2 -Lipschitz, we have This shows that p h has no non-trivial critical points outside of Ω.
Proof of Lemma 1.
Now, using the bounds on the derivatives of K h and equation (3.3), we have Taking the expectation and applying Jensen's inequality in equation (A.3) yields Thus, by definition, equation (3.5) is established. It is clear from assumption (H1) that Dp h (x) L 2 ≤ 2K 1 . In order to derive D 2 p h (x), a similar computation is used. The Taylor expansion of By assumption (H1), sup s∈[0,1] D 2 F (x + sδ)(δ, δ) L 2 ≤ 6K 3 δ 2 L 2 . Thus, (A.10) and the claim then easily follows.

Proof of Proposition 2.
where the second equality holds by integration by parts. An application of Equivalently, one has to show that Dp h (x) − Dp h (y) L 2 ≤ C(L) x − y L 2 . By Lemma 2 and Proposition 2 we have that, for any v ∈ L 2 , is Lipschitz with Lipschitz constant not larger that 2K 3 . Therefore, (A.13) We have and (A.15) By putting together equations (A.14) and (A.15) we then have the following bound for equation (A.12): Proof of Lemma 4. Lemma 2 and Proposition 2 allow us to write where the last inequality follows because (H2) guarantees that K h (t 2 ) ≤ 0.

M. Ciollaro et al.
For the first claim, assumption (H2) and p h (π x (t)) ≥ p h (π x (0)) ≥ δ imply For the second part, equation (A.17) gives (A.20) Thus, Dp h (π x (t)), π x (t) H 1 0 ≤ 0 as soon as π x (t) L 2 = π x (t) H 1 0 > M. Proof of Proposition 3. Proposition 2 and Lemma 3 guarantee the existence and uniqueness of a local solution under the H 1 0 norm from the standard theory of ordinary differential equations. Some extra work is needed to extend the local solution to a global one. We provide a complete proof in three steps which builds on Theorem 3.10 of Hunter and Nachtergaele (2001) (their Theorem 3.10 holds more generally on Banach spaces, see for instance Schechter, 2004) and the authors' subsequent remark concerning the extension of the local solution to a global one.
Step 1. In this step, we show that if the solution π x (t) exists for any time interval [0, T ], then there exists C 1 > 0 such that π x (t) H 1 0 ≤ C 1 . If p h (x) = 0, then x is a trivial local minimum of p h (x). As a result, Dp h (π x (0)) = 0 and π x (t) = π x (0) for all t. Thus, in this case it suffices to take C 1 = R. Suppose instead that p h (x) = δ > 0. Consider Dp h (π x (t)) H 1 0 . Note that g(0) ≤ R 2 . Take C 1 = max{R, K 2 N 1 /δ}. Fix an arbitrary > 0 and suppose that there exists T such that 0 ≤ T ≤ T and g(T ) ≥ C 2 1 + . Then, there must exist 0 ≤ t * ≤ T such that and g(t * ) ∈ (C 2 1 , C 2 1 + ). This is a contradiction because, by Lemma 4, Step 2. Let π x : [0, T 1 ] → H 1 0 be the local solution of the ordinary differential equation π x (t) = Dp h (π x (t)) with π x (0) = x. Suppose that π x (t) H 1 0 ≤ C 1 if t ≤ T 1 . Given C 2 > C 1 , we show that there exists T 2 > 0 such that the solution can be uniquely extended to π x : [0, To see this, consider the ordinary Now, by the Picard-Lindelöf theorem on Banach spaces, if one takes T 2 = (C 2 − C 1 )/N then the solution φ exists on [0, T 2 ] and φ(t) ∈ B H 1 0 (π x (T 1 ), C 2 − C 1 ). Consider the extension π x (t) given by (A.23) The newly defined π x is well-defined and continuous. Since The uniqueness of the extended solution follows from the fact that Dp h is Lipschitz on Step 2 guarantees that the solution can be uniquely extended to [0, T 1 +T 2 ].
Step 1 then implies that such extended solution π x satisfies π x (t) H 1 0 ≤ C 1 for all t ∈ [0, T 1 +T 2 ]. By Step 2 again, the extended solution π x can be extended again to the larger time interval [0, T 1 +2T 2 ] and, once again, by Step 1 the extended solution is entirely contained in the H 1 0 ball of radius C 1 . By iterating this procedure, one sees that the unique solution π x can be extended to all of R + and π x (t) H 1 0 ≤ C 1 for all t ≥ 0. Proof of Theorem 1. Since p h is a bounded functional and both Dp h and D 2 p h are bounded operators on L 2 , it is clear that lim t→∞ Dp h (π x (t)) L 2 = 0 (A.25) (see Lemma 7.4.4 in Jost, 2011). Furthermore, since π x (t) H 1 0 ≤ C 1 for all t ≥ 0 and closed H 1 0 balls are compact with respect to the L 2 norm, there exist By the continuity of Dp h : L 2 → L 2 , one also has that Dp h (π x (∞)) = 0.
Recall that by assumption (H4), all the non-trivial critical points of p h are isolated. Hence, for any non-trivial critical point of p h , one can find a L 2 neighborhood around it in which there are no other critical points of p h . Let δ 1 > 0 be the radius of such neighborhood around π x (∞). Suppose now that the sequence {π x (t)} t≥0 does not converge to π x (∞) in the L 2 sense. Then, there exists δ 2 > 0 and a subsequence {π x (s k )} k≥1 such that π x (∞)−π x (s k ) L 2 ≥ δ 2 for all k ≥ 1.
Without loss of generality, one can assume that π x (t k ) − π x (∞) L 2 ≤ δ 1 /3 and that t k < s k < t k+1 for all k. But then, by the continuity of the path π x , there exists r k such that t k ≤ r k ≤ s k and π x (∞) − π x (r k ) L 2 = min{δ 1 , δ 2 }/2 for all k ≥ 1. Since π x (r k ) H 1 0 ≤ C 1 , {π x (r k )} k≥1 also has a subsequence which converges with respect to the L 2 norm as well. Without loss of generality assume that π x (r k ) →π x (∞) in L 2 sense. By the continuity of Dp h (x),π x (∞) is also a critical point of p h . But then, π x (∞) −π x (∞) L 2 = min{δ 1 , δ 2 }/2 < δ 1 , which is a contradiction. This establishes the uniqueness of π x (∞) and concludes the proof.
Proof of Lemma 5. By assumption, . (A.26) Note that, by assumption (H2), For the second claim of the Lemma, suppose that X H 1 0 ≤ M P-almost surely. Then, any x which is a non-trivial critical point of p(x) satisfies equation (A.26). As a result, for any v ∈ C ∞ c ([0, 1]), (A.28) By Lemma 2, it follows that x H 1 0 ≤ M and the proof is complete. Proof of Lemma 6. In light of Proposition 2, for the first claim it suffices to show that if x ∈ S, then Dp h (x) ∈ S. Note that S is a closed subspace of L 2 . As a result, there exists another subspace S ⊥ ⊂ L 2 which is the orthogonal complement of S. Let g ∈ S ⊥ , so that X, g L 2 = 0 almost surely. Then, (A.29) and thus Dp h (x) ∈ S. The second claim is established in a similar way as in Lemma 5.
Proof of Lemma 7. By Lemma 6, if x * is a non-trivial critical point then x * ∈ S. If one views D 2 p h (x * ) as a linear operator from L 2 to L 2 , it is sufficient to show that D 2 p h (x * ) is an isomorphism (i.e. a continuous map from L 2 to L 2 such that its inverse is also continuous). Note first that for any v ∈ L 2 Observe that One can use a similar computation as in equation (4.2) to show that Thus, S and S ⊥ are invariant subspaces of D 2 p h (x * ). In order to see that D 2 p h (x * ) is indeed an isomorphism, it is therefore enough to show that it is isomorphism on both S and S ⊥ separately. Under assumption (H4'), p is a Morse density on S c and there exists h > 0 small enough so that p h is also a Morse function on the interior of S c (see Remark 1). Then, x * is in S c by Proposition 1 and since L 2 )) ≤ −δ < 0. According to equation (A.31), D 2 p h (x * ) acts on S ⊥ by multiplying every vector in S ⊥ by 2E P K h ( X − x * L 2 ) and hence D 2 p h (x * ) is clearly an isomorphism on S ⊥ .
Proof of Lemma 9. Denote T = −D 2 f (x * ) for simplicity. Then, T is a positive definite isomorphism on L 2 . Thus, there exists C > 0 such that T −1 ≤ C where · here denotes the operator norm. Also, it is straightforward to check that T induces a well-defined inner product ·, · T on L 2 by v, w T = T v, w L 2 . Now, for any v ∈ L 2 we have Therefore, by taking δ = 1/C the claim of the Lemma follows.
Step 2. By condition (C2), π 1 admits a convergent subsequence in L 2 . Thus there is a subsequence {t k } ∞ k=1 and a critical point x * 1 such that π 1 (t k )− x * 1 L 2 → 0 as k → ∞ and x * 1 ∈ B L 2 (x * 2 , ). In order to show that x * 1 is a non-degenerate local maximum in B L 2 (x * 2 , ), consider η 2 ≤ δ/8. Given any u L 2 = 1, for any x ∈ B L 2 (x * 2 , ) one has (A.35) Therefore sup u L 2 =1 D 2 f 1 (x)(u, u) ≤ −3δ/8 and Df 1 (x * 1 ) is negative definite. If one views −Df 1 (x * 1 ) as a linear operator from L 2 to L 2 , then by the Lax-Milgram theorem, it is an isomorphism and hence x * 1 is a non-degenerate local maximum. Moreover, x * 1 is the unique maximum in B L 2 (x * 2 , ): suppose that y * 1 is another local maximum of f 1 in B L 2 (x * 2 , ); then, Df 1 (x * 1 ) = 0 and Df 1 (y * 1 ) = 0, and by equation (A.35), sup u L 2 =1 D 2 f 1 (y * 1 )(u, u) ≤ −3δ/8. A Taylor expansion shows that and by symmetry, Step 3. Now it is only left to show that x * 1 − x * 2 L 2 ≤ Cη 1 . Since Df 1 (x) is a twice continuously differentiable function, a Taylor expansion around x * 2 allows us to write Note that one can replace Df 1 (x * 1 ) by Df 2 (x * 2 ) as both of them are 0. Apply this identity to x * 2 − x * 1 , then This is equivalent to Taking C = 8 completes the step.
Proof of Proposition 4. First of all note that since P ( X H 1 0 ≤ M ) = 1, lemma 5 ensures that all the non trivial critical points of f 1 (x) = p h (x) and and Consider the events A = {η 1 ≤ C 1 (α)} and B = {η 2 ≤ C 2 (α)} where C 1 (α) and C 2 (α) are defined in Display 1. We can then use the uniform exponential inequalities on the first and second derivatives of Lemma 13 and Lemma 15 of ≤ α for n large enough (which will be justified later in the proof). For now, under the event A ∩ B, for each pointx * marked by the algorithm of Display 1, i.e,x * ∈R we have hence the assumptions of Lemma 10 are satisfied. Furthermore, Lemma 10 ensures that the ball B L 2 (x * , δ(x * )/(2β 3 )) contains a unique non-degenerate local mode x * of p h and that x * − x * To justify P (A c ) = P (η 1 ≥ C 1 (α)) ≤ α/2, consider the inequality of Lemma 13. We have With this particular choice of = C 1 (α) it then follows that An almost identical argument is used to justify P (B) = P (η 2 ≥ C 2 (α)) ≤ α/2.
Proof of Proposition 5. Taking f 1 (x) =p h (x) and f 2 (x) = p h (x), the goal is to apply Lemma 10 for all non-trivial critical points of p h . It is worth to mention that, under the given assumption, R is a finite set and R ⊂ B H 1 0 (0, M). As a result, there exists a γ such that According to Lemmas 13 and 15, for l = 1, 2 there exist constants 0 < H l < ∞ and 0 < h l < ∞ depending only on K 1 , K 2 and K 3 and M such that Let η l , l = 1, 2 be defined as in (C3). Let F n := {η 1 ≤ H1 n 1/3 } ∩ {η 2 ≤ H2 n 1/3 }. Then, P (F n ) → 1. The rest of the argument follows by assuming that F n holds.
Suppose that, for large n, H 1 n −1/3 ≤ γ 2 /(8β 3 ) and H 2 n −1/3 ≤ γ/8. Then, for all x * ∈ R, one has where as before One can apply Lemma 10 to all x * to conclude that there exists ax * such that The following three steps complete the proof. step 1. In this step, one shows thatx * ∈R. According to item 2. in the first because both C 1 (α) and C 2 (α) are of order O(n −1/3 ).
Proof of Lemma 11. We discuss the case l = 1. Only the constants differ in the remaining cases. For any x ∈ L 2 Thus,

E sup
where the last step uses Hoeffding's inequality.
Proof. The proof is very similar to that of the previous Lemma. Notice first that (B.6) By taking = /(10K 2 ) and using the same argument of the previous Lemma, we have (B.7) In oder to proceed, we need an Hoeffding-type inequality for Hilbert spaces. Specifically, using Lemma 1, one has Now, if one denotes Z i as Thus, E P Z i = 0 as a L 2 function and Z i L 2 ≤ 4K 1 . Finally, by using the exponential inequality of the Corollary of Lemma 4.3 in Yurinskiȋ (1976), and for sufficient small that 1 + 1.62 K1/5 ≤ 2, one gets the desired result.
Next we derive a similar result for the second derivative. Obtaining such a result is a little more difficult because the operator norm of a linear operator defined on a Hilbert space does not induce a Hilbert space structure. The following discussion and intermediate results are useful to circumvent this problem.
Definition 3. Let A : L 2 → L 2 be a linear operator. A is said to be a Hilbert-Schmidt operator on L 2 if is an orthonormal basis of L 2 . Remark 10. The above definition is independent of the choice of the orthonormal basis. Furthermore, Hilbert-Schmidt operators form a Hilbert space with the following inner product: for two Hilbert-Schmidt operators A and B, the Hilbert-Schmidt inner product between A and B is defined as where {e 1 } ∞ i=1 is any orthonormal basis of L 2 . Recall that the operator norm of bilinear operator A is defined as A(v) L 2 (B.14) A standard result guarantees that A ≤ A HS .
Proof. Let Y = 2 K h ( X − x 2 L 2 )(x − X), hence Y ∈ L 2 . It is easily seen that Y L 2 ≤ 2 √ K 2 P -almost surely by (H1). ConsiderB(v) = Y, v L 2 and let {e i } ∞ i=1 be an orthonormal basis. We can write Y = ∞ i=1 y i e i , where y i are random coefficients. Therefore, Y 2 L 2 = ∞ i=1 y 2 i ≤ 4K 2 P -almost surely. Finally, B : L 2 → L 2 can be expressed as B(v) =B(v)Y and

2970
M. Ciollaro et al. As a result,