Convergence and Rates for Fixed-Interval Multiple-Track Smoothing Using $k$-Means Type Optimization

We address the task of estimating multiple trajectories from unlabeled data. This problem arises in many settings: examples include the construction of maps of transport networks from passive observation of travellers, and the reconstruction of the behaviour of uncooperative vehicles from external observations. There are two coupled problems. The first is a data association problem: how to map data points onto individual trajectories. The second is, given a solution to the data association problem, to estimate those trajectories. We construct estimators as a solution to a regularized variational problem (to which approximate solutions can be obtained via the simple, efficient and widespread $k$-means method) and show that, as the number of data points, $n$, increases, these estimators exhibit stable behaviour. More precisely, we show that they converge in an appropriate Sobolev space in probability and with rate $n^{-1/2}$.


Introduction
Given observations from multiple moving targets we face two (coupled) problems. The first is associating observations to targets: the data association problem. The second is estimating the trajectory of each target given the appropriate set of observations. When there is exactly one target the data association problem is trivial. However, when the number of targets is greater than one (even when the number of targets is known) the set of data association hypotheses grows combinatorially with the number of data points. Very quickly it becomes infeasible to check every possibility. Hence fast approximate solutions are needed in practice.
In this paper we interpret the problem of estimating multiple trajectories with unknown data association (see Figure 1) in such a way that the $k$-means method [32] may be applied to find a solution. As in [42], this is a non-standard application of the $k$-means method in which we generalize the notion of a 'cluster center' to partition finite dimensional data using infinite dimensional cluster centers. In this paper the cluster centers are trajectories in some function space and the data are space-time observations. Let $\Theta \subset (H^s)^k$ where $H^s$ is the Sobolev space of degree $s$ (we consider the case $s \ge 1$; see Section 2.1 for a precise definition). We have a data set $\{(t_i, y_i)\}_{i=1}^n \subset [0,1] \times \mathbb{R}^d$ and a model for the observation process
$$ y_i = \mu^\dagger_{\varphi(i)}(t_i) + \epsilon_i, $$
where $\mu^\dagger = (\mu^\dagger_1, \dots, \mu^\dagger_k)$ is some unknown function, $\epsilon_i \overset{\mathrm{iid}}{\sim} \phi_0$ and $t_i \overset{\mathrm{iid}}{\sim} \phi_T$ for densities $\phi_0$ and $\phi_T$ on $\mathbb{R}^d$ and $[0,1]$ respectively.

Figure 1: Unlabeled data is generated from three targets and, using minimizers of (2), we can find a partitioning of the data set and nonparametrically estimate each trajectory using the $k$-means algorithm.

We assume that the index of the cluster responsible for any given observation is an independent random variable with a categorical distribution with parameter vector $p = (p_1, \dots, p_k)$, writing $\varphi(i) \sim \mathrm{Cat}(p)$ to mean $\mathbb{P}(\varphi(i) = j) = p_j$. These assumptions allow us to write the density of $y$ given $t$ (and, implicitly, the cluster centres), which we denote by $\phi_Y(y|t)$, as
$$ \phi_Y(y|t) = \sum_{j=1}^k p_j \, \phi_0\bigl(y - \mu^\dagger_j(t)\bigr). $$
We can summarize the stylized data generating process as follows. A cluster is selected at random, $\mathbb{P}(\varphi = j) = p_j$; the time and observation error are drawn independently from their respective distributions, $t \sim \phi_T$ and $\epsilon \sim \phi_0$; and we observe $(t, y = \mu^\dagger_\varphi(t) + \epsilon)$. The aim is to estimate $\mu^\dagger = (\mu^\dagger_1, \dots, \mu^\dagger_k) \in \Theta$ from observed data $\{(t_i, y_i)\}_{i=1}^n$. In particular the data association $\varphi : \{1, 2, \dots, n\} \to \{1, 2, \dots, k\}$ is unknown. With a single trajectory ($k = 1$) the problem is precisely the spline smoothing problem, see for example [46]. For $k > 1$ trajectories there is an additional data association problem coupled to the spline smoothing problem. We call this the smoothing-data association (SDA) problem. Although the estimator $\mu^n$ we propose is not necessarily a consistent estimator for $\mu^\dagger$ (we do not show $\mu^n \to \mu^\dagger$), we do consider our estimator a natural choice. We believe it is possible to bound the asymptotic error $\lim_{n\to\infty} \|\mu^n - \mu^\dagger\|_{(L^2)^k} \le C$, where $C$ depends on the distribution of the data points; however, it is beyond the scope of this work to show such a bound. We refer to [28, Section 4.5] for a bound of the type $\|\mu^\infty - \mu^\dagger\| \le C$, where $\mu^\infty = \lim_{n\to\infty} \mu^n$, for $k$-means in Hilbert spaces.
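The stylized data generating process described above can be sketched in a few lines. This is only an illustration for $d = 1$: the uniform choice of $\phi_T$, the Gaussian choice of $\phi_0$, the three example trajectories and the function name `simulate_sda_data` are all assumptions made for the sketch, not part of the paper.

```python
import numpy as np

def simulate_sda_data(n, trajectories, p, noise_std=0.1, seed=0):
    """Draw n unlabeled space-time observations (t_i, y_i) with d = 1.

    Assumptions made only for this sketch: phi_T is uniform on [0, 1] and
    phi_0 is a centred Gaussian with standard deviation noise_std.
    """
    rng = np.random.default_rng(seed)
    k = len(trajectories)
    labels = rng.choice(k, size=n, p=p)       # phi(i) ~ Cat(p), hidden in practice
    t = rng.uniform(0.0, 1.0, size=n)         # t_i ~ phi_T
    eps = rng.normal(0.0, noise_std, size=n)  # eps_i ~ phi_0
    y = np.empty(n)
    for j, mu_j in enumerate(trajectories):
        mask = labels == j
        y[mask] = mu_j(t[mask]) + eps[mask]   # y_i = mu_{phi(i)}(t_i) + eps_i
    return t, y, labels

# Three hypothetical, well-separated trajectories (k = 3).
trajectories = [
    lambda t: np.sin(2 * np.pi * t),
    lambda t: 3.0 + 0.5 * t,
    lambda t: -3.0 + t ** 2,
]
t, y, labels = simulate_sda_data(500, trajectories, p=[0.5, 0.3, 0.2])
```

Only `t` and `y` would be available to an estimator; `labels` is returned purely so that a recovered data association can be compared with the truth.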
We assume $k$ is fixed and known. The aim of this paper is to construct a sequence of estimators $\mu^n$ of $\mu^\dagger$ based upon increasing sets of observations $\{(t_i, y_i)\}_{i=1}^n$ and to study their asymptotic behavior as $n \to \infty$. For each $n$ our estimate is given as a minimizer of $f_n : \Theta \to \mathbb{R}$ defined by
$$ f_n(\mu) = \frac{1}{n} \sum_{i=1}^n \bigwedge_{j=1}^k |y_i - \mu_j(t_i)|^2 + \lambda \sum_{j=1}^k \|\nabla^s \mu_j\|_{L^2}^2, \tag{2} $$
where $|\cdot|$ is the Euclidean norm on $\mathbb{R}^d$, $\bigwedge_{j=1}^k z_j = \min\{z_1, \dots, z_k\}$ and $\lambda$ is a positive constant. Penalizing the $s$th derivative ensures that the problem is well posed. Optimizing this function can be interpreted as seeking a hard data association: given $\mu \in \Theta$, each observation $(t_i, y_i)$ is associated with the trajectory closest to it, so the corresponding data association solution is given by
$$ \varphi_\mu(i) = \operatorname*{argmin}_{j = 1, \dots, k} |y_i - \mu_j(t_i)|. $$
As with many ill-posed inverse problems with a data association component, recovering the 'true' values of the (infinite-dimensional) parameters is in general infeasible. Two approaches are possible: to impose strong parametric assumptions, reducing the problem to that of inferring a (finite-dimensional) collection of parameters (which will perform poorly when those assumptions are inappropriate), or to proceed nonparametrically, optimising a cost function which balances the trade-off between a good fit to the data and regularity of the solution (which requires the precise specification of the notion of regularity). In this paper we pursue the second route, showing that in the large data limit the proposed estimators behave well. The main contribution of this paper is to establish the stability of $k$-means like estimators for the SDA problem.
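To make the objective (2) and the induced hard data association concrete, the following sketch evaluates a discretized version for $d = 1$. The representation of each trajectory by its values on a uniform grid, the linear interpolation, the finite-difference surrogate for $\nabla^s$ and the names `hard_association` and `f_n` are assumptions of this illustration.

```python
import numpy as np

def hard_association(t, y, grid, mu):
    """Assign each observation to the closest trajectory at its time stamp.

    mu has shape (k, m): row j holds mu_j evaluated on the uniform grid.
    Returns the labels and the minimal squared distances.
    """
    # vals[j, i] approximates mu_j(t_i) by linear interpolation.
    vals = np.stack([np.interp(t, grid, mu_j) for mu_j in mu])
    sq_dist = (y[None, :] - vals) ** 2
    return np.argmin(sq_dist, axis=0), np.min(sq_dist, axis=0)

def f_n(t, y, grid, mu, lam, s=2):
    """Discretized version of the objective (2) for d = 1."""
    _, min_sq = hard_association(t, y, grid, mu)
    h = grid[1] - grid[0]
    # Finite-difference surrogate for the s-th derivative of each mu_j.
    deriv = np.diff(mu, n=s, axis=1) / h ** s
    penalty = np.sum(deriv ** 2) * h
    return np.mean(min_sq) + lam * penalty

grid = np.linspace(0.0, 1.0, 101)
mu = np.stack([np.zeros_like(grid), 2.0 + grid])  # two straight lines
t = np.array([0.0, 0.5, 1.0, 0.5])
y = np.array([0.1, 2.5, -0.2, 0.0])
labels, _ = hard_association(t, y, grid, mu)
value = f_n(t, y, grid, mu, lam=1.0)
```

Because both example trajectories are straight lines, the second-derivative penalty vanishes (up to rounding) and `value` reduces to the mean of the minimal squared distances.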
Although exact solution of the underlying optimization problem is NP-complete even in benign Euclidean settings [17], the computational cost of iterative numerical approximation has been shown to have a polynomial (smoothed) cost in certain Euclidean settings, e.g. [3], and in practice the performance is often much better than these bounds would suggest: it is accepted to be a numerically efficient method for obtaining approximate solutions (i.e. local minimizers). Our empirical experience is that this property holds also within the context considered by this paper. Our focus is upon the asymptotic properties of the ideal estimator and it is beyond the scope of this paper to upper bound the computational complexity of the numerical iteration scheme. We do however point out that a key advantage of the k-means method is that it reduces the problem of solving the multiple target problem (k > 1) to the problem of repeatedly solving the single target problem (k = 1) which can be done efficiently with, for example, splines.
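The reduction of the multiple target problem to repeated single target problems can be sketched as a Lloyd-type alternation. The sketch below is not the authors' implementation: it assumes $d = 1$, represents trajectories on a uniform grid, bins each $t_i$ to its nearest grid node, replaces the spline step by a discrete penalized least squares solve, and does not enforce the separation constraint defining $\Theta$. The name `k_means_smoothing` and the demonstration data are hypothetical.

```python
import numpy as np

def k_means_smoothing(t, y, grid, mu_init, lam, s=2, max_iter=50):
    """Alternate between hard association and per-target smoothing (d = 1)."""
    m, h = grid.size, grid[1] - grid[0]
    # Nearest grid node for each observation time.
    idx = np.clip(np.rint((t - grid[0]) / h).astype(int), 0, m - 1)
    # s-th order finite-difference operator and the resulting penalty matrix.
    D = np.diff(np.eye(m), n=s, axis=0) / h ** s
    P = lam * h * (D.T @ D)
    mu = np.array(mu_init, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Association step: each point goes to the closest trajectory.
        new_labels = np.argmin((y[None, :] - mu[:, idx]) ** 2, axis=0)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # the partition is stable
        labels = new_labels
        # Smoothing step: solve a single-target (k = 1) problem per cluster.
        for j in range(mu.shape[0]):
            mask = labels == j
            if np.unique(idx[mask]).size < s:
                continue  # too few distinct nodes: keep the previous centre
            counts = np.bincount(idx[mask], minlength=m)
            sums = np.bincount(idx[mask], weights=y[mask], minlength=m)
            A = np.diag(counts) / t.size + P
            mu[j] = np.linalg.solve(A, sums / t.size)
    return mu, labels

# Hypothetical demonstration: two well-separated linear trajectories.
rng = np.random.default_rng(1)
n = 400
t = rng.uniform(0.0, 1.0, n)
true_labels = rng.integers(0, 2, n)
y = np.where(true_labels == 0, t, 5.0 - t) + rng.normal(0.0, 0.1, n)
grid = np.linspace(0.0, 1.0, 51)
mu_init = np.stack([np.zeros(51), np.full(51, 5.0)])
mu_hat, labels = k_means_smoothing(t, y, grid, mu_init, lam=1e-3)
```

Each step does not increase the discretized objective, so the iteration terminates at a local minimizer; as noted above, there is no guarantee that this is a global minimizer.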
There are of course several variations of the $k$-means method, e.g. fuzzy C-means clustering [6] (a soft version of $k$-means closely related to the EM algorithm [19]), $k$-medians clustering [8] (an $L^1$ version of $k$-means) and Minkowski metric weighted $k$-means [18], for which the analysis, particularly the convergence result in Theorem 3.1, could be easily adapted. Indeed, for bounded noise, the weak convergence of $k$-medians clustering is a special case of [42], and to extend the result to unbounded noise one can follow the strategy given in the proof of Theorem 3.4. The strong convergence and rate of convergence will require a different approach as one loses differentiability when going from $L^2$ to $L^1$.
The choice of regularization scheme and, in particular, of $\lambda$ is not straightforward. For $k = 1$ there are many results in the spline literature on the selection of $\lambda = \lambda_n$ and the resulting asymptotic behavior as $n \to \infty$, see for example [1, 11-13, 29, 33, 37-40, 43-45, 47]. In this case one has $\lambda_n \to 0$ and can expect $\mu^n$ to converge to $\mu^\dagger$. Convergence is either with respect to a Hilbert scale, e.g. $L^2$, or in the dual space, i.e. weak convergence. Using a Hilbert scale in effect measures the convergence in a norm weaker than $H^s$. We remark that when $k > 1$ and $\lambda_n \to 0$ we would expect that minimizers $\mu^n$ converge to a minimizer $\mu^*$ of the corresponding unregularized ($\lambda = 0$) limiting functional. In particular we do not expect that $\mu^* = \mu^\dagger$; indeed, even $k$-means in Euclidean spaces is known to be asymptotically biased. In this paper we do not take $\lambda_n \to 0$, which adds a further bias. The approach we take, as is common in settings in which smooth solutions are expected, is to penalize the $s$th derivative. By Taylor's Theorem we can write $H^s = H_0 \oplus H_1$, where $H_0$ consists of the polynomials of degree at most $s - 1$ and $H_1$ contains the corresponding Taylor remainders. We use $\|\cdot\|_1 = \|\nabla^s \cdot\|_{L^2}$ as the norm on $H_1$ and denote the $H_0$ norm by $\|\cdot\|_0$, and therefore we use the norm $\|\cdot\|_{H^s} = \|\cdot\|_0 + \|\cdot\|_1$ on $H^s$ (which is equivalent to the usual Sobolev norm). Since $H_0$ is finite dimensional we are free to use any norm we choose without changing the topology. We can view $H^s = H_0 \oplus H_1$ as a multiscale decomposition of $H^s$. The polynomial component represents a coarse approximation. The regularization penalizes oscillations on the fine scale, i.e. in $H_1$.
In the case $k = 1$, $f_n$ is quadratic and one can find an explicit representation of $\mu^n$, i.e. there exists a random function $G_{n,\lambda}$ such that with probability one $\mu^n = G_{n,\lambda} \nu_n$ for some function $\nu_n$ which depends on the data. When $k > 1$ the problem is no longer convex and the methodology used in the $k = 1$ case fails.
The first result of this paper (Theorem 3.1) is a weak convergence result: we show that there exists $\mu^\infty \in \Theta$ such that (up to subsequences) $\mu^n \rightharpoonup \mu^\infty$ a.s. in $H^s$ and $\mu^\infty$ is a minimizer of $f_\infty$ defined by
$$ f_\infty(\mu) = \int_0^1 \int_{\mathbb{R}^d} \bigwedge_{j=1}^k |y - \mu_j(t)|^2 \, \phi_Y(y|t) \, \phi_T(t) \, \mathrm{d}y \, \mathrm{d}t + \lambda \sum_{j=1}^k \|\nabla^s \mu_j\|_{L^2}^2. \tag{3} $$
One should note that if $\mu^\infty = (\mu^\infty_1, \dots, \mu^\infty_k)$ is a minimizer of $f_\infty$ then so is $\tilde{\mu}^\infty = (\mu^\infty_{\rho(1)}, \dots, \mu^\infty_{\rho(k)})$ for any permutation $\rho : \{1, \dots, k\} \to \{1, \dots, k\}$, and therefore we do not expect uniqueness of the minimizer. In view of the law of large numbers the limit $f_\infty$ is natural. The functional $f_\infty$ can be seen as a limit of $f_n$, the nature of which will be made rigorous in Section 3. The second result is to go from almost sure weak convergence to strong convergence in probability. In other words, we obtain convergence of the minimizing sequence in a stronger topology at the expense of considering a weaker mode of stochastic convergence.
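The law of large numbers heuristic can be checked numerically for a fixed $\mu$. In the sketch below (with $d = 1$, $k = 2$, uniform times and Gaussian noise, all of which are assumptions of the illustration) we evaluate only the data-fidelity part of $f_n$ at $\mu = \mu^\dagger$; the regularization term is deterministic and does not depend on $n$. Because the two hypothetical trajectories are separated by many noise standard deviations, the limiting value is, up to a negligible error, the noise variance $\sigma^2$.

```python
import numpy as np

def fidelity(t, y, trajectories):
    """Empirical fidelity term (1/n) sum_i min_j |y_i - mu_j(t_i)|^2 for d = 1."""
    vals = np.stack([mu_j(t) for mu_j in trajectories])
    return np.mean(np.min((y[None, :] - vals) ** 2, axis=0))

rng = np.random.default_rng(0)
sigma = 0.5  # noise standard deviation (assumed Gaussian noise)
trajectories = [lambda t: np.sin(2 * np.pi * t), lambda t: 10.0 + t]

values = {}
for n in (100, 10_000, 200_000):
    t = rng.uniform(0.0, 1.0, n)
    labels = rng.integers(0, 2, n)  # equal cluster weights p = (1/2, 1/2)
    clean = np.where(labels == 0, trajectories[0](t), trajectories[1](t))
    y = clean + rng.normal(0.0, sigma, n)
    values[n] = fidelity(t, y, trajectories)
```

As $n$ grows the entries of `values` settle near $\sigma^2 = 0.25$, with fluctuations of order $n^{-1/2}$.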
We recall that one motivation for considering the minimization problem (2) is to embed the problem into a framework that allows the application of the $k$-means method. Large data limits for $k$-means have been studied extensively in finite dimensions, see for example [2, 5, 10, 25, 31, 34-36]. There are fewer results for the infinite dimensional case, with [4, 7, 14, 15, 22, 26-28, 30, 41, 42] the only results known to the authors. Of these, only [42] can be applied to finite dimensional data and infinite dimensional cluster centers, but it required bounded noise and, furthermore, its conclusions were limited to weak convergence. The first contribution of this paper is to extend this convergence result to unbounded noise for the SDA problem (Section 3). We point out that [4, 7, 26, 28] give results for the convergence and rates of convergence of the minimum $\min f_n$ (in infinite dimensional settings) and [27] gives results for the convergence of the minimizers.
The result of Theorem 4.1 is that, up to subsequences, the convergence is strong in $H^s$. The final result is to show that the rate of convergence is of order $\frac{1}{\sqrt{n}}$ in probability.
This is closely related to the central limit theorem first proved for the k-means method by Pollard [36] for Euclidean data. We extend his methodology to cluster centers in H s to prove our rate of convergence result and in doing so provide a theoretical justification for using this method in the more complex scenario which we consider and, in particular, for using such approaches to address post hoc tracking of multiple targets using k-means type algorithms. As with Pollard's finite dimensional result we require an assumption on the positive definiteness of the second derivative of the limiting function f ∞ .
In the next section we remind the reader of some preliminary material which underpins our main results. Section 3 contains the weak convergence result. In Section 4 we go from weak convergence to strong convergence with rates.

Notation
The Borel $\sigma$-algebra on $[0,1] \times \mathbb{R}^d$ is denoted $\mathcal{B}([0,1] \times \mathbb{R}^d)$ and the set of probability measures on $([0,1] \times \mathbb{R}^d, \mathcal{B}([0,1] \times \mathbb{R}^d))$ is denoted $\mathcal{P}([0,1] \times \mathbb{R}^d)$. Our main results concern sequences of data $\{(t_i, y_i)\}_{i=1}^\infty$ sampled independently with common law $P \in \mathcal{P}([0,1] \times \mathbb{R}^d)$, which is assumed to have a Lebesgue density $\phi((t, y)) = \phi_Y(y|t)\phi_T(t)$. We work throughout on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ rich enough to support a countably infinite sequence of such observations, $(t_i, y_i) : \Omega \to [0,1] \times \mathbb{R}^d$. All random elements are defined upon this common probability space and all stochastic quantifiers are to be understood as acting with respect to $\mathbb{P}$ unless otherwise stated. With a small abuse of notation we say We will define the space $\Theta \subset (H^s)^k$ in Section 3. The Sobolev space $H^s$ is given by $H^s = \{\mu : [0,1] \to \mathbb{R}^d \text{ such that } \nabla^i \mu \in L^2 \text{ for } i = 0, \dots, s\}$. Note that data is of the form A sequence of probability measures $P_n$ is said to weakly converge to $P$ if for all bounded and continuous functions $h$ we have $P_n h \to P h$.
Here we write $P h = \int h(x) \, P(\mathrm{d}x)$. If $P_n$ weakly converges to $P$ then we write $P_n \Rightarrow P$. We use the following standard definitions for rates of convergence.
Definition 2.1. We define the following.
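(i) For deterministic sequences $a_n$ and $r_n$, where $r_n$ are positive and real valued, we write $a_n = O(r_n)$ if $\frac{a_n}{r_n}$ is bounded. If $\frac{a_n}{r_n} \to 0$ as $n \to \infty$ we write $a_n = o(r_n)$.
(ii) For random sequences $a_n$ and $r_n$, where $r_n$ are positive and real valued, we write $a_n = O_p(r_n)$ if $\frac{a_n}{r_n}$ is bounded in probability: for all $\epsilon > 0$ there exist deterministic constants $M_\epsilon$ and $N_\epsilon$ such that $\mathbb{P}(|a_n| \ge M_\epsilon r_n) \le \epsilon$ for all $n \ge N_\epsilon$. If $\frac{a_n}{r_n} \to 0$ in probability, i.e. for all $\epsilon > 0$ we have $\mathbb{P}(|a_n| \ge \epsilon r_n) \to 0$ as $n \to \infty$, we write $a_n = o_p(r_n)$.
When $a = a(r)$ can be written as a function of $r$ we will often write $a = O(r)$ or $a = o(r)$ to mean, for any sequence $r_n \to 0$, that $a_n := a(r_n)$ satisfies $a_n = O(r_n)$ or $a_n = o(r_n)$ respectively.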
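As a simple illustration of the stochastic order notation (a textbook example, not part of the argument of this paper), the sample mean of centred, square-integrable random variables is $O_p(n^{-1/2})$:

```latex
Let $X_1, X_2, \dots$ be i.i.d. with mean $0$ and variance $\sigma^2 < \infty$,
and set $a_n = \frac{1}{n} \sum_{i=1}^n X_i$. By Chebyshev's inequality, for any $M > 0$,
\[
  \mathbb{P}\bigl( |a_n| \ge M n^{-1/2} \bigr)
  \le \frac{\operatorname{Var}(a_n)}{M^2 n^{-1}}
  = \frac{\sigma^2}{M^2},
\]
which is at most $\epsilon$ once $M \ge \sigma \epsilon^{-1/2}$.
Hence $a_n = O_p(n^{-1/2})$, and consequently $a_n = o_p(r_n)$
for any positive sequence $r_n$ with $n^{1/2} r_n \to \infty$.
```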

Γ-Convergence
Our proof of convergence will use a variational approach. In particular the natural convergence for a sequence of minimization problems is Γ-convergence. The Γ-limit can be understood as the 'limiting lower semi-continuous envelope'. It is particularly useful when studying highly oscillatory functionals, when there will often be no strong limit and the weak limit (if it exists) will average out oscillations and therefore change the behavior of the minimum and minimizers. See [9, 16] for an introduction to Γ-convergence and [23, 24, 42] for applications of Γ-convergence to problems in statistical inference. We will apply the following definition and theorem to $H = \Theta \subset (H^s)^k$.
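Definition 2.2 (Γ-convergence).
A sequence $f_n : H \to \mathbb{R} \cup \{\pm\infty\}$ on a Banach space $H$ is said to Γ-converge to $f_\infty$ with respect to weak convergence, written $f_\infty = \Gamma\text{-}\lim_n f_n$, if for all $\nu \in H$ we have
(i) (lim inf inequality) for every sequence $(\nu_n)$ weakly converging to $\nu$, $f_\infty(\nu) \le \liminf_{n\to\infty} f_n(\nu_n)$;
(ii) (recovery sequence) there exists a sequence $(\nu_n)$ weakly converging to $\nu$ such that $f_\infty(\nu) \ge \limsup_{n\to\infty} f_n(\nu_n)$.
When it exists the Γ-limit is always weakly lower semi-continuous [9, Proposition 1.31] and therefore achieves its minimum on any weakly compact set. An important property of Γ-convergence is that it implies the convergence of minimizers. In particular, we will make use of the following result which can be found in [9, Theorem 1.21].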

Theorem 2.3 (Convergence of Minimizers).
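Let $H$ be a Banach space, $\Theta \subset H$ be a weakly closed set and $f_n : \Theta \to \mathbb{R} \cup \{\pm\infty\}$ be a sequence of functionals. Assume there exists a weakly compact subset $K \subset \Theta$ with $\inf_\Theta f_n = \inf_K f_n$ for all $n \in \mathbb{N}$. If $f_\infty = \Gamma\text{-}\lim_n f_n$ and $f_\infty$ is not identically $\pm\infty$ then $\min_\Theta f_\infty = \lim_{n\to\infty} \inf_\Theta f_n$. Furthermore if $\mu^n \in K$ minimizes $f_n$ then any weak limit point is a minimizer of $f_\infty$.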
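The following standard one-dimensional example, which is not part of the argument of this paper, illustrates why the Γ-limit rather than a pointwise or weak limit is the relevant notion for minimization problems. In $\mathbb{R}$ weak and strong convergence coincide.

```latex
Let $f_n(x) = \sin(nx)$ for $x \in \mathbb{R}$. The sequence $f_n$ does not
converge pointwise on any interval, and $f_n \to 0$ weakly-$*$ in $L^\infty$. However
\[
  \Gamma\text{-}\lim_{n\to\infty} f_n = -1 .
\]
Indeed, $f_n \ge -1$ gives the lim inf inequality, and for any $x$ we can choose
$x_n$ with $\sin(n x_n) = -1$ and $|x_n - x| \le \frac{\pi}{n}$, which is a
recovery sequence. In agreement with Theorem 2.3,
$\min f_n = -1 = \min \bigl( \Gamma\text{-}\lim_n f_n \bigr)$, whereas the
weak-$*$ limit $0$ has the wrong minimum.
```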

The Gâteaux Derivative
As in Section 2.2 we will apply the following to H = Θ ⊂ (H s ) k .

Definition 2.4.
We say that f : exists. We may define second order derivatives by for µ, ν, ω ∈ H. In cases where the second derivative does not necessarily exist we will define ∂ 2 − f by To simplify notation, we write: for some t * ∈ [0, 1].
Proof. The theorem is only a slight generalisation of Taylor's theorem.
Hence we can equivalently show that and note that, by definition of J, > 0 for all t then F is strictly increasing, which contradicts F (1) = F (0), and so there must exist such a t * .

Weak Convergence
To show weak convergence we apply Theorem 2.3. The following two subsections prove that the conditions required to apply this theorem, i.e. that f ∞ is the Γ-limit of f n and that the minimizers µ n are uniformly bounded, hold with probability one.
For a fixed $\delta > 0$ we define the set $\Theta$ to be the set of functions in $(H^s)^k$ which have minimum separation distance of $\delta$:
$$ \Theta = \bigl\{ \mu \in (H^s)^k : |\mu_j(t) - \mu_l(t)| \ge \delta \text{ for all } t \in [0,1] \text{ and } j \ne l \bigr\}. \tag{5} $$
For $d = 1$ this is a strong assumption as we restrict ourselves to trajectories that do not intersect. When considering the tracking of real objects in two or more dimensions, the assumption is typically physically reasonable. For example, if the $\mu_j$ are to represent trajectories of extended objects by modelling the location of the centroid, it is natural to require a minimum separation of those centroids on a scale corresponding to the extent of the objects in question.
In practical implementations the constraint could be difficult to implement, but it is straightforward to check whether it is satisfied post hoc. For a wide range of distributions on the data it is reasonable to expect that any two cluster centers obtained by numerical procedures will not intersect and therefore have a minimum separation distance. Of course, this separation distance is only known with posterior knowledge and not prior knowledge as we assume here. We expect that one could improve this reasoning to state explicitly that with high probability any two cluster centers are at least δ * apart for some δ * that depends upon the distribution of the data. We do not attempt to prove any such statement here. Such a statement would imply that one could carry out the classification using a k-means method without directly imposing the constraint.
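A post hoc check of the separation constraint is indeed simple once the fitted trajectories are available on a common time grid. The sketch below assumes such a grid representation (so it only approximates the continuous constraint) and uses the hypothetical names `min_separation` and `in_theta`.

```python
import numpy as np

def min_separation(mu):
    """Smallest distance |mu_j(t) - mu_l(t)| over grid times t and pairs j != l.

    mu has shape (k, m, d): k trajectories sampled at m common times in R^d.
    Returns infinity when k = 1, since there is no pair to compare.
    """
    k = mu.shape[0]
    gaps = [
        np.linalg.norm(mu[j] - mu[l], axis=-1).min()
        for j in range(k)
        for l in range(j + 1, k)
    ]
    return min(gaps, default=np.inf)

def in_theta(mu, delta):
    """Check the grid version of the minimum separation constraint."""
    return min_separation(mu) >= delta

grid = np.linspace(0.0, 1.0, 201)
# Two hypothetical planar trajectories (d = 2) that are closest at t = 0.
mu = np.stack([
    np.column_stack([grid, np.zeros_like(grid)]),
    np.column_stack([grid, 1.0 + grid]),
])
separation = min_separation(mu)
```

Here the two trajectories are exactly one unit apart at $t = 0$, so `in_theta(mu, 0.5)` holds while `in_theta(mu, 1.5)` does not.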
We use the assumption in order to infer that the spatial partitioning induced by any set of cluster centers $\mu \in \Theta$ is such that every element of the partition is non-empty at every time $t$, i.e. for each $j = 1, \dots, k$ the set of points $y \in \mathbb{R}^d$ that are closer to $\mu_j(t)$ than to every other $\mu_l(t)$ is non-empty.
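First let us show that $\Theta$ is weakly closed in $(H^s)^k$. Take any sequence $\mu^n \in \Theta$ such that $\mu^n \rightharpoonup \mu \in (H^s)^k$. We have to show $\mu \in \Theta$. Pick $t \in [0,1]$, $j \ne l$ and define $F : (H^s)^k \to \mathbb{R}^d$ by $F(\nu) = \nu_j(t) - \nu_l(t)$. Since $s \ge 1$, point evaluation is continuous on $H^s$, so $F$ is a bounded linear map and $F(\mu^n) \to F(\mu)$; hence $|\mu_j(t) - \mu_l(t)| = \lim_{n\to\infty} |\mu^n_j(t) - \mu^n_l(t)| \ge \delta$ and $\mu \in \Theta$. Furthermore we can show that $f_n$, $f_\infty$ are weakly lower semi-continuous [42, Propositions 4.8 and 4.9], hence they attain their minima over weakly compact subsets of $\Theta$. We will show that minimizers are contained in a bounded, and hence weakly compact, set and therefore there exist minimizers of $f_n$ and $f_\infty$ on $\Theta$.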
We now state our assumptions. Assumptions.
1. We assume $\phi_0$ and $\phi_T$ are continuous densities with respect to the Lebesgue measure on $\mathbb{R}^d$ and $[0,1]$ respectively and use the same symbols to refer to these densities and to their associated measures.
2. The density $\phi_0$ is centered and has finite second moments.

For all
where the convergence is almost surely by the strong law of large numbers. Hence Assumption 2 implies that there exists N such that min µ∈Θ f n (µ) < α + 1 for n ≥ N and N < ∞ with probability one (although N could depend on the sequence {t i , y i } n i=1 and so we could have sup ω∈Ω N = ∞). To simplify our proofs we use Assumption 3 although the results of this paper can be proved without it. The assumption is used in bounding the minimizers of f n . Clearly if φ 0 has bounded support then each y i is uniformly bounded (a.s.) and one can show that |µ n (t)| is bounded uniformly in n and t (a.s.). Assumption 3 can be relaxed at the expense of some trivial but notationally messy modifications.
Assumption 4 is used in the next section to uniformly control the decay in the density $\phi_Y$. In particular the assumption allows us to bound the error due to restricting to bounded sets. Although Assumption 4 implies that $\phi_0$ has at least two moments, we include the second moment condition in Assumption 2 as the decay in density is not needed until later sections.
Note the second moment condition implies that $\phi_0$ decays as $|\epsilon| \to \infty$ and therefore, by continuity, $\phi_0$ is bounded in $L^\infty$.
We now state the main result for this section. The proof is an application of Theorem 2.3 once we have shown that $f_\infty$ is the Γ-limit (Theorem 3.2) and established the uniform bound on the set of minimizers (Theorem 3.4), which by reflexivity of the space $(H^s)^k$ implies weak compactness.

Theorem 3.1.
Define $f_n, f_\infty : \Theta \to \mathbb{R}$ by (2) and (3) respectively, where $\Theta \subset (H^s)^k$ for $s \ge 1$ is given by (5). Under Assumptions 1-3 any sequence of minimizers $\mu^n$ of $f_n$ is, with probability one, weakly compact and any weak limit $\mu^\infty$ is a minimizer of $f_\infty$.

Proof.
We are required to show that the two inequalities in Definition 2.2 hold with probability 1. In order to do this we follow [42] and consider a subset of Ω of full measure, Ω ′ , and show that both statements hold for every data sequence obtained from that set.
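For each $\omega \in \Omega$, let $P_n^{(\omega)}$ be the associated empirical measure arising from the particular elementary event $\omega$, which we define via its action on any continuous bounded function $h : [0,1] \times \mathbb{R}^d \to \mathbb{R}$:
$$ P_n^{(\omega)} h = \frac{1}{n} \sum_{i=1}^n h\bigl(t_i^{(\omega)}, y_i^{(\omega)}\bigr), $$
where the superscript $(\omega)$ emphasizes that these are the observations associated with elementary event $\omega$. Define $g_\mu(t, y) = \bigwedge_{j=1}^k |y - \mu_j(t)|^2$. To highlight the dependence of $f_n$ on $\omega$ we write $f_n^{(\omega)}$. We can write
$$ f_n^{(\omega)}(\mu) = P_n^{(\omega)} g_\mu + \lambda \sum_{j=1}^k \|\nabla^s \mu_j\|_{L^2}^2 \quad \text{and} \quad f_\infty(\mu) = P g_\mu + \lambda \sum_{j=1}^k \|\nabla^s \mu_j\|_{L^2}^2. $$
We define
$$ \Omega' = \bigl\{ \omega \in \Omega : P_n^{(\omega)} \Rightarrow P \bigr\} \cap \Bigl\{ \omega \in \Omega : \int_{(B(0,q))^c} |y|^2 \, P_n^{(\omega)}(\mathrm{d}(t,y)) \to \int_{(B(0,q))^c} |y|^2 \, P(\mathrm{d}(t,y)) \ \forall q \in \mathbb{N} \Bigr\}; $$
then $\mathbb{P}(\Omega') = 1$ by the almost sure weak convergence of the empirical measure [20] and the strong law of large numbers.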
And, as norms in Banach spaces are weakly lower semi-continuous, $\liminf_{n\to\infty} \|\nabla^s \mu^n_j\|_{L^2}^2 \ge \|\nabla^s \mu_j\|_{L^2}^2$, as required. We now establish the existence of a recovery sequence for every $\omega \in \Omega'$ and every $\mu \in \Theta$. Let $\mu^n = \mu \in \Theta$. Let $\zeta_q$ be a sequence of $C^\infty(\mathbb{R}^{d+1})$ functions such that $0 \le \zeta_q(t, y) \le 1$ for all $(t, y) \in \mathbb{R}^{d+1}$, $\zeta_q(t, y) = 1$ for $(t, y) \in B(0, q - 1)$ and $\zeta_q(t, y) = 0$ for $(t, y) \notin B(0, q)$. Then the function $\zeta_q(t, y) g_\mu(t, y)$ is continuous for all $q$. We also have, for any $(t, y) \in [0,1] \times \mathbb{R}^d$, so $\zeta_q g_\mu$ is a continuous and bounded function, hence by the weak convergence of $P_n^{(\omega)}$ to $P$ we have by the dominated convergence theorem. We now show that the right hand side of the above expression is equal to zero. We have where the last limit follows by the monotone convergence theorem and Assumption 2. We have shown $\lim_{n\to\infty} |P_n^{(\omega)} g_\mu - P g_\mu| = 0$.

Boundedness
The aim of this subsection is to show that the minimizers of $f_n$ are uniformly bounded in $n$ for almost every sequence of observations. We divide this into two parts: bounding each of the $H_0$ and $H_1$ norms. The $H_1$ bound follows easily from the regularization. For the $H_0$ bound we exploit the equivalence of norms on finite-dimensional vector spaces to choose a convenient norm on $H_0$.
By the argument which followed the assumptions we have, for $n$ sufficiently large and with probability one, $\min_{\mu \in \Theta} f_n(\mu) \le \alpha + 1 < \infty$. Now we let $\mu^n$ be a sequence of minimizers. Then there exists $\hat{\Omega} \subset \Omega$ such that $\mathbb{P}(\hat{\Omega}) = 1$ and for all $\omega \in \hat{\Omega}$ we have Therefore for all $\omega \in \hat{\Omega}$ there exists $N = N(\omega)$ such that for $n \ge N$ we have Therefore $\|\mu^n_j\|_1$ is bounded almost surely for each $j$. We are left to show the corresponding result for $\|\mu^n_j\|_0$.
The following lemma will be used to establish the main result of this subsection, Theorem 3.4. It shows that if $\nu_n \in H^s$ is a sequence with $\|\nabla^s \nu_n\|_{L^2} \le \sqrt{\alpha}$ and $\|\nu_n\|_0 \to \infty$ then, up to a subsequence, $|\nu_n(t)| \to \infty$ with the exception of at most finitely many $t \in [0,1]$. When applied to $\mu^n_j$ this will be used to show that in the limit, if any center is unbounded, then the minimization can be achieved over $k - 1$ clusters, and hence to provide a contradiction.
Proof. Let the norm on H 0 be given by By Taylor's theorem and the bound on ∇ s ν n L 2 we have Qn 0 . In particular Q n 0 = 1. Take any subsequence n m then since d iQ n dt i are uniformly bounded equi-continuous for all i = 0, 1, . . . , s − 1 so by the Arzelà-Ascoli theorem there exists a further subsequence (which we relabel) for which d iQ n dt i converges uniformly to d iQ dt i for someQ and all i = 0, 1, . . . s − 1. In particular d s−1Q dt s−1 is a constant and thereforeQ is a polynomial of degree at most s − 1. It follows that Q 0 = 1 and thereforeQ is not identically zero, henceQ has at most s − 1 roots. For any t that is not a root ofQ we have |Q nm (t)| = |Q nm (t)| Q nm 0 → ∞. This implies that |ν n (t)| → ∞.
Now pick t ∈ [0, 1] with |ν n (t)| → ∞ and assume t n → t. We assume that there exists a subsequence n m such that |Q nm (t nm )| is bounded. By going to a further subsequence (which we relabel) we assume thatQ nm →Q uniformly. Choose δ > 0 sufficiently small then there exists ǫ > 0 and N < ∞ such that for all s with |s − t| < ǫ and n m ≥ N then In particular |Q nm (t nm )| = Q n 0 |Q nm (t nm )| ≥ δ Qn m 0 2 → ∞. This contradicts the assumption that |Q nm (t nm )| is bounded. We have shown that |ν n (t n )| → ∞.
We proceed to the main result of this subsection. Proof. As in the proof of Theorem 3.2 we let ω ∈ Ω ′′ where where Ω ′ is defined in the proof of Theorem 3.2. We have P(Ω ′′ ) = 1. For the remainder of the proof we assume ω ∈ Ω ′′ . Then there exists N (ω) < ∞ such that f (ω) n (µ n ) ≤ α + 1 for all n ≥ N (ω) . Hence, for sufficiently large n, It remains to show the H 0 bound. The structure of the proof is similar to [27, Lemma 2.1]. We will argue by contradiction. In particular we argue that if a cluster center is unbounded then in the limit the minimum is achieved over the remaining k − 1 cluster centers.
Step 1: The minimization is achieved over k −1 cluster centers. We assume sup j µ n j 0 is unbounded, then there exists j * and a subsequence (which we relabel) such that µ n j * 0 → ∞. By Lemma 3.3 there exists a further subsequence (again relabelled) such that |µ n j * (t)| → ∞ for all but finitely many t. For any such t, by Lemma 3.3, we have This easily implies lim n→∞,(t ′ ,y ′ )→(t,y) for any y ∈ R d . Therefore Note that the above expression holds for P -almost every (t, y) ∈ [0, 1] × R d (as by Lemma 3.3 the collection of t for which |µ n j * (t)| → ∞ has Lebesgue measure zero). By Fatou's lemma for weakly converging measures [21, Theorem 1.1] and the above we have Hence Step 2: The contradiction. If we can show that there exists ǫ > 0 such that (i.e. we can do strictly better by fitting k centers than fitting k − 1 centers) then we can conclude the theorem. Now, for a constant c n . By definition, theμ n j must have a minimum separation distance of δ. For now we assume that we can choose c n such that this criterion is fulfilled. So if |y i − c n | ≤ δ 4 then Where (t i , y i ) ∼ n j means coordinate (t i , y i ) is associated to centerμ n j in the sense that (t, y) ∼ n j ⇔ j = argmin i=1,...,k |y −μ n i (t)| (and if the minimum is not uniquely achieved then we take the smallest j such that j ∈ argmin i=1,...,k |y −μ n i (t)|). If we can show that P is bounded away from zero, then the result follows.
Since we assumed ǫ 1 has unbounded support on R d if we can show that |c n | ≤ M for a constant M and n sufficiently large (a.s.) then we can infer the existence of a subsequence such that and c nm converges to some c. This implies (after applying Fatou's lemma for weakly converging measures [21, Theorem 1.1]) By Assumption 3 and the continuity in Assumption 1, there exists ǫ ′ > 0 such that φ Y (y|t) ≥ ǫ ′ for all y ∈ [−M, M ] d and t ∈ [0, 1]. Hence we may bound the final expression above by We are left to show such an M exists. Assume there exists M k−1 such that for all j = j * we have µ n j H s ≤ M k−1 . By the Sobolev embedding of H s into L ∞ there exists a constant C ′ such that µ L ∞ ≤ C ′ µ H s for all µ ∈ H s . And therefore |µ n j (t)| ≤ C ′ M k−1 for all j = j * and t ∈ [0, 1]. Let C = C ′ M k−1 + δ then it follows that there exists c n ∈ [0, C] d such thatμ n j * (t) = c n andμ n ∈ Θ. Now if no such M k−1 exists then there exists a second cluster such that µ n j * * H s → ∞ where j * * = j * . By the same argument for a constant c ′ n . By induction it is clear that we can find M k−l such that k − l cluster centers are bounded. The result then follows.

Remark 3.5.
Note that in the above theorem we did not need to assume a correct choice of $k$. If the true number of cluster centers is $k'$ and we incorrectly use $k \ne k'$, then the resulting cluster centers are still bounded. In fact for all the results of this paper the correct choice of $k$ is not necessary: although the minimizers of $f_\infty$ may no longer make physical sense, the problem is still robust in that the conclusions of Theorems 3.1 and 4.1 and Corollary 4.2 hold.

Weak to Strong Convergence
We now strengthen the result of the previous section and show that in fact (up to subsequences) convergence of minimizers is strong in $H^s$. Our proof is based on the methodology Pollard used for proving the central limit theorem for the $k$-means method in Euclidean spaces [36]. In his proof Pollard assumed a positive definiteness condition on the second derivative of what we call in this paper $f_\infty$. Under an analogous condition we are also able to give a rate of convergence on convergent sequences of minimizers. Whether this condition holds will depend on the interplay between the integral over the boundaries of each partition and the size of each partition.
We state the main results of this section now but leave the proofs to the end.
Theorem 4.1.
Define $f_n, f_\infty : \Theta \to \mathbb{R}$, where $\Theta$ is given by (5), by (2) and (3), respectively. Let $\{\mu^n\}_{n \in \mathbb{N}} \subset \Theta$ where $\mu^n$ minimizes $f_n$. Let $\mu^{n_m}$ be any subsequence that weakly converges almost surely to some $\mu^\infty$; then under Assumptions 1-4 we have that, after passing to a further subsequence, $\mu^{n_m}$ converges to $\mu^\infty$ strongly in $H^s$ and in probability.

Corollary 4.2.
If in addition to the conditions in Theorem 4.1 and where µ ∞ is a minimizer of f ∞ we assume that there exists ρ > 0 and κ > 0 such that Then any sequence µ n of minimizers with µ n → µ ∞ in H s obeys the rate of convergence For clarity, we will assume that the entire sequence µ n weakly converges in the remainder of this paper to avoid reference to subsequences. Relaxing this assumption is trivial, but notationally cumbersome.
We let Y n (µ) = √ n(f n (µ) − f ∞ (µ)) and then, by Taylor expanding around µ ∞ , we have In Lemma 4.6, using Chebyshev's inequality, we bound the Gâteaux derivative of Y n in probability. Similarly one can Taylor expand f ∞ around µ ∞ . After some manipulation of the Taylor expansion, where we leave the details until the proof of Theorem 4.1, one has We note that f n (µ n ) − f n (µ ∞ ) ≤ 0. We also show that 2λ ∇ s ν 2 The above expression allows us to convert weak convergence into strong convergence. Lemmata 4.3 and 4.5 provide the first Gâteaux derivative and a lower bound on the second Gâteaux derivatives of f ∞ , respectively.
Remark 4.4. Since the $\mu_j$ are continuous the boundary between each element of the resulting partition is itself continuous and has Lebesgue measure zero. The set on which $j(t, y)$ is not uniquely defined therefore has measure zero. Hence we will treat $j(t, y)$ as though it were uniquely defined.
More precisely consider two points y 1 , y 2 ∈ R d , with |y 1 − y 2 | ≥ δ and let B y 1 ,y 2 be the boundary defined by B y 1 ,y 2 = y ∈ B(0, M ) : |y − y 1 | = |y − y 2 | for a constant M > 0. Letỹ 1 ∈ B(y 1 , Cr) andỹ 2 ∈ B(y 2 , Cr). We will denote by d H the Hausdorff distance between sets in R d , in particular we wish to bound d H (B y 1 ,y 2 , Bỹ 1 ,ỹ 2 ). Elementary geometry implies that this can be bounded by the Euclidean distance between points on the boundary of each set, in particular where ∂B y 1 ,y 2 = y ∈ R d : |y| = M and |y − y 1 | = |y − y 2 | .
Without loss of generality assume that B y 1 ,y 2 ⊂ {x : x 1 = 0}. (All assumptions other than 4 are rotation and translation invariant, whilst 4 is rotation invariant it is not translation invariant as the constant c 1 could increase with the size of the translation. However the cluster centers are bounded in L ∞ , so in particular the size of the translation can be bounded. Therefore, up to redefining the constant c 1 , all the assumptions hold in the rotated and translated coordinate system. For d ≥ 3 we consider a cross section at x 3:d = a ∈ R d−2 , then there exists constants γ 1 , γ 2 ∈ R (depending on a) such that x 1 = γ 1 x 2 + γ 2 parametrizes the set {x ∈ Bỹ 1 ,ỹ 2 : x 3:d = a} (for a > M the set is empty and we have nothing to prove). Let θ a = | tan −1 γ 1 | ∈ [0, π 2 ] be the angle between the lines x 1 = 0 and x 1 = γ 1 x 2 + γ 2 . When d = 2 the set Bỹ 1 ,ỹ 2 is already a straight line in R 2 and it is unnecessary to take a cross section (i.e. x 3:d is null and θ a is independent of a). We will find θ * such that sin θ * = O(r) and sup a θ a ≤ θ * then we can bound the Hausdorff distance by the above bound holding as it is the maximum distance that can arise from rotation plus the maximum possible translation of the set ∂B y 1 ,y 2 . Let ℓ be the ray through y 1 and y 2 andl be the ray throughỹ 1 andỹ 2 . Let P be the point of intersection between ℓ andl. The point P exists if and only if the lines ℓ andl are not parallel. The lines ℓ andl are parallel if and only if θ = 0, trivially any choice of θ * ≥ 0 will bound this case. Therefore we assume that θ > 0 and therefore the point P exists.
One can easily show that $\angle \tilde{y}_1 P y_1 = \theta$ (the angle between the lines $\tilde{y}_1 P$ and $P y_1$ is $\theta$). There are two possibilities: either (1) $P$ is between $y_1$ and $y_2$ or (2) it is not.
We now consider Y n . In particular we want to bound ∂Y n (µ ∞ ; µ n − µ ∞ ).
Proof. Calculating the Gâteaux derivative is similar to Lemma 4.3 and is omitted. By linearity and continuity of ∂Y n we can write ∂Y n µ; ν n ν n (L 2 ) k = m (ν n , e m ) ν n (L 2 ) k ∂Y n (µ; e m ) where e m is the Fourier basis for (L 2 ) k (we assume e m = (ê m 1 , . . . ,ê m k ) whereê m is the Fourier basis for L 2 ). Let V m = E (∂Y n (µ; e m )) 2 and Z i = (y i − µ j(t i ,y i ) (t i )) ·ê m j (t i ,y i ) , then By Assumptions 1 and 2 and since µ ∈ (L ∞ ) k (by the embedding of (H s ) k into (L ∞ ) k )) C is finite. Therefore,