Consistent change-point detection with kernels

In this paper we study the kernel change-point algorithm (KCP) proposed by Arlot, Celisse and Harchaoui (2012), which aims at locating an unknown number of change-points in the distribution of a sequence of independent data taking values in an arbitrary set. The change-points are selected by model selection with a penalized kernel empirical criterion. We provide a non-asymptotic result showing that, with high probability, the KCP procedure retrieves the correct number of change-points, provided that the constant in the penalty is well-chosen; in addition, KCP estimates the change-points location at the optimal rate. As a consequence, when using a characteristic kernel, KCP detects all kinds of change in the distribution (not only changes in the mean or the variance), and it is able to do so for complex structured data (not necessarily in $\mathbb{R}^d$). Most of the analysis is conducted assuming that the kernel is bounded; part of the results can be extended when we only assume a finite second-order moment.


Introduction
In many situations, some properties of a time series change over time, such as the mean, the variance or higher-order moments. Change-point detection is the long-standing question of finding both the number and the locations of such changes. This is an important front-end task in many applications. For instance, detecting changes occurring in comparative genomic hybridization array data (CGH arrays) is crucial to the early diagnosis of cancer [34]. In finance, some intensively examined time series like the volatility process exhibit local homogeneity, and it is useful to be able to segment these time series both for modeling and forecasting [38,49]. Change-point detection can also be used to detect changes in the activity of a cell [45], in the structure of random Markov fields [43], or in a sequence of images [30,1]. Generally speaking, it is of interest to the practitioner to segment a time series in order to calibrate its model on homogeneous sets of data points.
Addressing the change-point problem in practice requires facing several important challenges. First, the number of changes cannot be assumed to be known in advance (in particular, it cannot be assumed to be equal to 0 or 1), hence a practical change-point procedure must be able to infer the number of changes from the data. Second, changes do not always occur in the mean or the variance of the data, as assumed by most change-point procedures. We need to be able to detect changes in other features of the distribution. Third, parametric assumptions, which are often made for building or for analyzing change-point procedures, are often unrealistic, so that we need a fully non-parametric approach. Fourth, data points in the time series we want to segment can be high-dimensional and/or structured. If the dimensionality is larger than the number of observations, a non-asymptotic analysis is mandatory for theoretical results to be meaningful. When data are structured (for instance, histograms, graphs or strings), taking their structure into account seems necessary for detecting the change-points efficiently.
We focus only on the offline problem in this article, that is, when all observations are given at once, as opposed to the situation where data come as a continuous stream. We refer to Tartakovsky, Nikiforov and Basseville [51] for an extensive review of sequential methods, which are adapted to the latter situation. Numerous offline change-point procedures have been proposed since the seminal works of Page [44], Fisher [21] and Bellman [9], which are mostly parametric in essence. We refer to Brodsky and Darkhovsky [13, Chapter 2] for a review of non-parametric offline change-point detection methods. Among recent works in this direction, we can mention the Wild Binary Segmentation (WBS, [22]) and the non-parametric multiple change-point detection procedure (NMCD, [56]). Some authors also consider the case of high-dimensional data when only a few coordinates of the mean change at each change-point [53, and references therein], or the problem of detecting gradual changes [52]; this paper does not address these slightly different problems.
To the best of our knowledge, no offline change-point procedure addressed simultaneously the four challenges mentioned above until the kernel change-point procedure (KCP) was proposed by Arlot, Celisse and Harchaoui [4]. In short, KCP mixes the penalized least-squares approach to change-point detection [17,39] with positive semidefinite kernels [5]. It is not the only procedure that uses positive semidefinite kernels to detect changes in a time series. Apart from Harchaoui and Cappé [27], who introduced KCP for a fixed number of change-points, and Arlot, Celisse and Harchaoui [4], who extended KCP to an unknown number of change-points, we are aware of several closely related works.
Maximum Mean Discrepancy [MMD, 24] has been used for building two-sample tests; a block average version of the MMD, named the M-statistic, has led to an online change-point detection procedure [41]. A kernel-based statistic, named kernel Fisher discriminant ratio, has been used by Harchaoui, Moulines and Bach [28] for homogeneity testing and for detecting one change-point. Sharipov, Tewes and Wendler [48] build an analogue of the CUSUM statistic for Hilbert-valued random variables in order to detect a single change in the mean; it could be applied in our setting to the images of the observations in the feature space. Kernel change detection [18] is an online procedure that uses a kernel to build a dissimilarity measure between the near past and future of a data point.
On the computational side, the KCP segmentation can be computed efficiently thanks to a dynamic programming algorithm [27,4], which can be made even faster [16]. An oracle inequality for KCP is proved by Arlot, Celisse and Harchaoui [4]; this is not exactly a result on change-point estimation, but a guarantee on the estimation of the "mean" of the time series in the RKHS associated with the chosen kernel. The good numerical performance of KCP in terms of change-point estimation is also demonstrated in several experiments.
So, a key theoretical question remains open: does KCP correctly estimate the number of change-points and their locations with large probability? If so, at which speed does KCP estimate the change-point locations?
This paper answers these questions, showing that KCP has good theoretical properties for change-point estimation with independent data, under a boundedness assumption (Theorem 3.1 in Section 3.1). This result is non-asymptotic, hence meaningful for high-dimensional or complex data. In the asymptotic setting (with a fixed true segmentation and more and more data points observed within each segment), Theorem 3.1 implies that KCP estimates consistently all changes in the "kernel mean" of the distribution of data, at speed $\log(n)/n$ with respect to the sample size $n$. Since we make no assumptions on the minimal size of the true segments, this matches minimax lower bounds [14]. We also provide a partial result under a weaker finite variance assumption (Theorem 3.2 in Section 3.3) and explain in Section 5 how our proofs could be extended to other settings, including the dependent case. These findings are illustrated by numerical simulations in Section 4.
An important case is when KCP is used with a characteristic kernel [23], such as the Gaussian or the Laplace kernel. Then, any change in the distribution of data induces a change in the "kernel mean". So, Theorem 3.1 implies that KCP then estimates consistently and at the minimax rate all changes in the distribution of the data, without any parametric assumption and without prior knowledge about the number of changes.
Our results are also interesting with regard to the theoretical understanding of least-squares change-point procedures. Indeed, when KCP is used with the linear kernel, it reduces to previously known penalized least-squares change-point procedures [54,17,39, for instance]. There are basically two kinds of results on such procedures in the change-point literature: (i) asymptotic statements on change-point estimation [54,55,6,37] and (ii) non-asymptotic oracle inequalities [17,39,4], which are based upon concentration inequalities and model selection theory [10] but do not directly provide guarantees on the estimated change-point locations. Our results and their proofs show how to reconcile the two approaches when we are interested in change-point locations, which is already new for the case of the linear kernel, and also holds for a general kernel.

Kernel change-point detection
This section describes the general change-point problem and the kernel change-point procedure [4].
An important example to have in mind is the following.
Example 2.1 (Asymptotic setting). Let $K \geq 1$, $0 = b_0 < b_1 < \dots < b_K < b_{K+1} = 1$ and $P_1, \dots, P_{K+1}$ some probability distributions on $\mathcal{X}$ be fixed. Then, for any $n$ and $i \in \{1, \dots, n\}$, we set $t_i := i/n$ and the distribution of $X_i$ is $P_{j(i)}$ where $j(i)$ is such that $t_i \in [b_j, b_{j+1})$. In other words, we have a fixed segmentation of $[0, 1]$, given by the $b_j$, a fixed distribution over each segment, given by the $P_j$, and we observe independent realizations from the distributions at discrete times $t_1, \dots, t_n$. The corresponding true change-points in $\{0, \dots, n\}$ are the $\lfloor n b_j \rfloor$, $j = 1, \dots, K$. For $n$ large enough, there are $K + 1$ segments. Figure 2 shows an example. Let us emphasize that in this setting, $n$ going to infinity does not mean that new observations are observed over time. Recall that we consider the change-point problem a posteriori: a larger $n$ means that we have been able to observe the phenomenon of interest with a finer time discretization. Also note that this asymptotic setting is restrictive in the sense that segment sizes are asymptotically of order $n$; we do not make this assumption in our analysis, which also covers asymptotic settings where some segments have a smaller size.
Figure 2: Illustration of the asymptotic setting (Example 2.1) in the case of changes in the mean of the $X_i$. Here, $\mathcal{X} = \mathbb{R}$, $X_i = f(t_i) + \varepsilon_i$ with $\varepsilon_1, \dots, \varepsilon_n$ i.i.d. and centered, and $f : [0, 1] \to \mathbb{R}$ is a (fixed) piecewise constant function (shown in red). The goal is to recover the number of abrupt changes of $f$ (here, 2) and their locations ($b_1 = 0.5$ and $b_2 = 0.7$). Note that other kinds of changes in the distribution of the $X_i$ can be considered, see Section 4.
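The sampling scheme of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `segment_labels` and `sample_piecewise` are hypothetical, segments are labelled from 0 (so label $j$ means $t_i \in [b_j, b_{j+1})$ with $b_0 = 0$), and the last time $t_n = 1$ is assigned to the last segment.

```python
import numpy as np

def segment_labels(n, breakpoints):
    """0-based label of the segment [b_j, b_{j+1}) containing t_i = i / n, i = 1..n.

    `breakpoints` lists the interior points b_1 < ... < b_K only.
    """
    t = np.arange(1, n + 1) / n
    # side="right" counts the breakpoints b_j with b_j <= t_i.
    return np.searchsorted(np.asarray(breakpoints, dtype=float), t, side="right")

def sample_piecewise(n, breakpoints, samplers, rng):
    """Draw independent X_1..X_n, with X_i produced by samplers[label of t_i]."""
    labels = segment_labels(n, breakpoints)
    values = np.array([samplers[j](rng) for j in labels])
    return values, labels

# Three segments with breakpoints 0.5 and 0.7: a mean shift, then a variance shift.
rng = np.random.default_rng(0)
samplers = [
    lambda r: r.normal(0.0, 1.0),
    lambda r: r.normal(2.0, 1.0),
    lambda r: r.normal(2.0, 3.0),
]
x, labels = sample_piecewise(200, [0.5, 0.7], samplers, rng)
```

Increasing `n` here refines the time grid on a fixed segmentation of $[0, 1]$; it does not append new observations at the end of the series.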

Kernel change-point procedure (KCP)
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite kernel, that is, a measurable function such that the matrix $(k(x_i, x_j))_{1 \le i,j \le m}$ is positive semidefinite for any $m \ge 1$ and $x_1, \dots, x_m \in \mathcal{X}$ [46]. Classical examples of kernels are given by [4, section 3.2], among which:
- the linear kernel: $k^{\mathrm{lin}}(x, y) = \langle x, y \rangle_{\mathbb{R}^p}$ for $x, y \in \mathbb{R}^p$;
- the polynomial kernel of order $d \ge 1$: $k^{\mathrm{poly}}_d(x, y) = (\langle x, y \rangle_{\mathbb{R}^p} + 1)^d$ for $x, y \in \mathbb{R}^p$.
As done by Harchaoui and Cappé [27] and Arlot, Celisse and Harchaoui [4], for a given segmentation $\tau \in \mathcal{T}_n^D$, we assess the adequacy of $\tau$ with the kernel least-squares criterion
$$\widehat{R}_n(\tau) := \frac{1}{n} \sum_{i=1}^{n} k(X_i, X_i) - \frac{1}{n} \sum_{\ell=1}^{D_\tau} \left[ \frac{1}{\tau_\ell - \tau_{\ell-1}} \sum_{i,j=\tau_{\ell-1}+1}^{\tau_\ell} k(X_i, X_j) \right].$$
Elementary algebra shows that, when $\mathcal{X} = \mathbb{R}^p$ and $k = k^{\mathrm{lin}}$, $\widehat{R}_n$ is the usual least-squares criterion. Minimizing this criterion over the set of all segmentations always outputs the segmentation with $n$ segments reduced to a point, that is $[0, \dots, n]$; this is a well-known overfitting phenomenon. To counteract this, a classical idea [36, for instance] is to minimize a penalized criterion $\mathrm{crit}(\tau) := \widehat{R}_n(\tau) + \mathrm{pen}(\tau)$, where $\mathrm{pen} : \mathcal{T}_n \to \mathbb{R}_+$ is called the penalty. Formally, the kernel change-point procedure (KCP) of Arlot, Celisse and Harchaoui [4] selects the segmentation
$$\widehat{\tau} \in \operatorname{argmin}_{\tau \in \mathcal{T}_n} \{ \mathrm{crit}(\tau) \}. \tag{2.2}$$
In this paper, we focus on the classical choice of a penalty proportional to the number of segments, similarly to AIC, BIC and $C_p$ criteria. Namely, we consider
$$\mathrm{pen}(\tau) = \mathrm{pen}_\ell(\tau) := \frac{C M^2 D_\tau}{n}, \tag{2.3}$$
where $C$ is a positive constant and $M$ is specified in Assumption 1 later on. As mentioned in the Introduction, slightly different penalty shapes can be considered, as suggested by Arlot, Celisse and Harchaoui [4]. Our results could be extended to the penalty of Arlot, Celisse and Harchaoui [4], but we choose to consider the linear penalty (2.3) only for simplicity.
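To make the criterion concrete, here is a minimal Python sketch that evaluates the kernel least-squares criterion and a penalized version of it from a Gram matrix. It rests on assumptions that are not taken from the text: a segmentation is encoded as the list `[0, tau_1, ..., n]`, the Gaussian kernel is parameterized as $\exp(-(x-y)^2/(2h^2))$, the linear penalty is taken to be $C M^2 D_\tau / n$, and all function names are hypothetical.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of exp(-(x - y)^2 / (2 h^2)) for 1-D data (assumed parameterization)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))

def kernel_risk(gram, tau):
    """Kernel least-squares criterion of the segmentation tau = [0, tau_1, ..., n]."""
    n = gram.shape[0]
    total = np.trace(gram)  # sum_i k(X_i, X_i)
    for start, end in zip(tau[:-1], tau[1:]):
        # Segment {start + 1, ..., end} (1-based) is the 0-based slice start:end.
        total -= gram[start:end, start:end].sum() / (end - start)
    return total / n

def penalized_criterion(gram, tau, C, M):
    """Risk plus the linear penalty C * M^2 * D_tau / n (assumed form of the penalty)."""
    n = gram.shape[0]
    n_segments = len(tau) - 1
    return kernel_risk(gram, tau) + C * M ** 2 * n_segments / n

x = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
gram = gaussian_gram(x)
risk_true = kernel_risk(gram, [0, 3, 6])          # one change-point after the third point
risk_none = kernel_risk(gram, [0, 6])             # no change-point
risk_finest = kernel_risk(gram, list(range(7)))   # every point is its own segment
```

On this toy sample the unpenalized risk vanishes both for the correct segmentation and for the finest one, which is the overfitting phenomenon described above; with the penalty added (for instance `C=1`, `M=1`), the correct segmentation has the smallest criterion of the three.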

The reproducing kernel Hilbert space
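Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) associated with $k$ [5], together with the canonical feature map $\Phi : \mathcal{X} \to \mathcal{H}$, $x \mapsto k(x, \cdot)$. We write $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ (resp. $\| \cdot \|_{\mathcal{H}}$) for the inner product (resp. the norm) of $\mathcal{H}$. For any $i \in \{1, \dots, n\}$, define $Y_i := \Phi(X_i) \in \mathcal{H}$. In the case where $k = k^{\mathrm{lin}}$, $Y_i = \langle \cdot, X_i \rangle_{\mathbb{R}^p}$ and the empirical risk $\widehat{R}_n$ reduces to the least-squares criterion
$$\widehat{R}_n(\tau) = \frac{1}{n} \sum_{\ell=1}^{D_\tau} \sum_{i=\tau_{\ell-1}+1}^{\tau_\ell} \left\| X_i - \overline{X}_\ell \right\|^2,$$
where $\overline{X}_\ell$ is the empirical mean of the $X_i$ over the segment $\{\tau_{\ell-1} + 1, \dots, \tau_\ell\}$.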
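The reduction to least squares under the linear kernel can be checked numerically. The sketch below is an illustration with hypothetical helper names; it encodes a segmentation as `[0, tau_1, ..., n]` and compares the Gram-matrix form of the criterion with the within-segment sum of squares.

```python
import numpy as np

def kernel_risk(gram, tau):
    """(1/n) * [trace(K) - sum over segments of (block sum of K) / (segment length)]."""
    n = gram.shape[0]
    total = np.trace(gram)
    for start, end in zip(tau[:-1], tau[1:]):
        total -= gram[start:end, start:end].sum() / (end - start)
    return total / n

def least_squares_risk(X, tau):
    """(1/n) * sum over segments of squared distances to the segment mean."""
    n = X.shape[0]
    total = 0.0
    for start, end in zip(tau[:-1], tau[1:]):
        segment = X[start:end]
        total += ((segment - segment.mean(axis=0)) ** 2).sum()
    return total / n

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))      # 12 observations in R^3
tau = [0, 4, 9, 12]               # three segments
gram_linear = X @ X.T             # linear kernel: k(x, y) = <x, y>

risk_kernel = kernel_risk(gram_linear, tau)
risk_ls = least_squares_risk(X, tau)
```

The two values agree up to floating-point error, because $\sum_i \|x_i\|^2 - \frac{1}{m}\|\sum_i x_i\|^2 = \sum_i \|x_i - \bar{x}\|^2$ on each segment of length $m$.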
It is well-known that penalized least-squares procedures detect changes in the mean of the observations $X_i$, see Yao [54]. Hence the kernelized version of this least-squares procedure, KCP, should detect changes in the "mean" of the $Y_i = \Phi(X_i)$, which are a nonlinear transformation of the $X_i$. More precisely, assume that $\mathcal{H}$ is separable and that
$$\forall i \in \{1, \dots, n\}, \quad \mathbb{E}\,[k(X_i, X_i)] < +\infty.$$
Then $\mu^\star_i$, the Bochner integral of $Y_i$, is well-defined [40]. The condition above is satisfied in our setting (when either Assumption 1 or Assumption 2 holds true, see Section 2.5), and $\mathcal{H}$ is separable in most cases [20]. The Bochner integral commutes with continuous linear operators, hence the following property holds, which will be of common use:
$$\forall g \in \mathcal{H}, \quad \langle \mu^\star_i, g \rangle_{\mathcal{H}} = \mathbb{E}\,\langle Y_i, g \rangle_{\mathcal{H}} = \mathbb{E}\,[g(X_i)].$$
We now define the "true segmentation" $\tau^\star \in \mathcal{T}_n$ by
$$\mu^\star_{\tau^\star_{\ell-1}+1} = \dots = \mu^\star_{\tau^\star_\ell} \ \text{ for } 1 \le \ell \le D_{\tau^\star} \quad \text{and} \quad \mu^\star_{\tau^\star_\ell} \neq \mu^\star_{\tau^\star_\ell + 1} \ \text{ for } 1 \le \ell \le D_{\tau^\star} - 1.$$
We call the $\tau^\star_\ell$ the true change-points. It should be clear that it is always possible to define $\tau^\star$. A kernel is said to be characteristic if the mapping $P \mapsto \mathbb{E}_{X \sim P}[\Phi(X)]$ is injective, for $P$ belonging to the set of Borel probability measures on $\mathcal{X}$ [50]. In simpler terms, when $k$ is a characteristic kernel, $X_i$ and $X_{i+1}$ have the same distribution if and only if $\mu^\star_i = \mu^\star_{i+1}$, and $\tau^\star$ indeed corresponds to the set of changes in the distribution of the $X_i$. For instance, all strictly positive definite kernels are characteristic, including the Gaussian kernel, see Sriperumbudur et al. [50]. Therefore, in the setting of Example 2.1, for $n$ large enough, $D_{\tau^\star} = K + 1$ and $\tau^\star_\ell = \lfloor n b_\ell \rfloor$ for $\ell = 1, \dots, K$.
For a general kernel, some changes of $P_{X_i}$, the distribution of $X_i$, might not appear in $\tau^\star$. For instance, with the linear kernel, $\tau^\star$ only corresponds to changes of the mean of the $X_i$. In most cases, a characteristic kernel is known and we can choose to use KCP with a characteristic kernel; then, as we prove in the following, KCP eventually detects any change in the distribution of the observations. But one can also choose a non-characteristic kernel on purpose, hence focusing only on some changes in the distribution of the $X_i$. For instance, the polynomial kernel of order $d$ is not characteristic and leads to the detection of changes in the first $d$ moments of the distribution; with the linear kernel, KCP detects changes in the mean of the $X_i$.
From now on, we focus on the problem of detecting the changes of $\tau^\star$ only, whether the kernel is characteristic or not.

Rewriting the empirical risk
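It is convenient to see the images of the observations by the feature map as an element of $\mathcal{H}^n$. To this end, we define $Y := (Y_1, \dots, Y_n) \in \mathcal{H}^n$, as well as $\mu^\star := (\mu^\star_1, \dots, \mu^\star_n) \in \mathcal{H}^n$ and $\varepsilon := Y - \mu^\star \in \mathcal{H}^n$. We identify the elements of $\mathcal{H}^n$ with the set of maps $\{1, \dots, n\} \to \mathcal{H}$, naturally endowed with the inner product and norm given by
$$\forall x, y \in \mathcal{H}^n, \quad \langle x, y \rangle := \sum_{j=1}^{n} \langle x_j, y_j \rangle_{\mathcal{H}} \quad \text{and} \quad \| x \|^2 := \langle x, x \rangle = \sum_{j=1}^{n} \| x_j \|_{\mathcal{H}}^2.$$
We now rewrite the empirical risk as a function of $\tau$ and $Y$. For any segmentation $\tau \in \mathcal{T}_n$, define $F_\tau$ as the set of maps $\{1, \dots, n\} \to \mathcal{H}$ that are constant over the segments of $\tau$; $F_\tau$ is a linear subspace of $\mathcal{H}^n$. For $f \in \mathcal{H}^n$, we define $\Pi_\tau f$ as the orthogonal projection of $f$ onto $F_\tau$ with respect to $\| \cdot \|$. It is shown by Arlot, Celisse and Harchaoui [4] that for any $f \in \mathcal{H}^n$ and any $\ell \in \{1, \dots, D_\tau\}$,
$$\forall i \in \{\tau_{\ell-1} + 1, \dots, \tau_\ell\}, \quad (\Pi_\tau f)_i = \frac{1}{\tau_\ell - \tau_{\ell-1}} \sum_{j=\tau_{\ell-1}+1}^{\tau_\ell} f_j. \tag{2.5}$$
We are now able to write the empirical risk as
$$\widehat{R}_n(\tau) = \frac{1}{n} \left\| Y - \widehat{\mu}_\tau \right\|^2,$$
where $\widehat{\mu}_\tau = \Pi_\tau Y$, following [27,4].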
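As a consistency check, the norm form of the empirical risk can be expanded back into the kernel least-squares criterion. The sketch below uses only three facts: $\Pi_\tau$ is an orthogonal projection, it acts by averaging over each segment, and $\langle Y_i, Y_j \rangle_{\mathcal{H}} = k(X_i, X_j)$ by the reproducing property.

```latex
\begin{align*}
\frac{1}{n}\,\bigl\| Y - \Pi_\tau Y \bigr\|^2
  &= \frac{1}{n}\Bigl( \|Y\|^2 - \|\Pi_\tau Y\|^2 \Bigr)
     && \text{($\Pi_\tau$ is an orthogonal projection)} \\
  &= \frac{1}{n}\sum_{i=1}^{n} k(X_i, X_i)
     - \frac{1}{n}\sum_{\ell=1}^{D_\tau} (\tau_\ell - \tau_{\ell-1})
       \Bigl\| \frac{1}{\tau_\ell - \tau_{\ell-1}}
       \sum_{j=\tau_{\ell-1}+1}^{\tau_\ell} Y_j \Bigr\|_{\mathcal{H}}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n} k(X_i, X_i)
     - \frac{1}{n}\sum_{\ell=1}^{D_\tau} \frac{1}{\tau_\ell - \tau_{\ell-1}}
       \sum_{i,j=\tau_{\ell-1}+1}^{\tau_\ell} k(X_i, X_j).
\end{align*}
```

The last line depends on the data only through kernel evaluations, which is why the criterion can be computed from the Gram matrix without ever representing elements of $\mathcal{H}$.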

Assumptions
A key ingredient of our analysis is the concentration of $\varepsilon$. Intuitively, the performance of KCP is better when $\varepsilon$ concentrates strongly around its mean, since without noise we are simply given the task of segmenting a piecewise-constant signal.
It is thus natural to make assumptions on $\varepsilon$ in order to obtain concentration results. We actually formulate assumptions on the kernel $k$, which translate automatically into assumptions on $\varepsilon$.
As done by Arlot, Celisse and Harchaoui [4], the main hypothesis used in our analysis is the following.
Assumption 1. There exists a constant $M \in (0, +\infty)$ such that, for every $i \in \{1, \dots, n\}$, $k(X_i, X_i) \le M^2$ almost surely.
Under Assumption 1, Arlot, Celisse and Harchaoui [4] show that $\| \varepsilon_i \|_{\mathcal{H}} \le 2M$ almost surely. Assumption 1 holds for a large class of commonly used kernels, such as the Gaussian, Laplace and $\chi^2$ kernels.
Note that Assumption 1 is weaker than assuming $k$ to be bounded, that is, $k(x, x) \le M^2$ for any $x \in \mathcal{X}$, which is equivalent to $k(x, x') \le M^2$ for any $x, x' \in \mathcal{X}$ since $k$ is positive definite. For instance, if $\mathcal{X} = \mathbb{R}^p$ and the data $X_i$ are bounded almost surely, Assumption 1 holds true for the linear kernel and all polynomial kernels, which are not bounded on $\mathbb{R}^p$.
It is sometimes possible to weaken Assumption 1 into a finite variance assumption.
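Assumption 2. There exists a constant $V \in (0, +\infty)$ such that, for every $i \in \{1, \dots, n\}$, $\mathbb{E}\,\| \varepsilon_i \|_{\mathcal{H}}^2 \le V$.
Since $\mathbb{E}\,\| \varepsilon_i \|_{\mathcal{H}}^2 \le \mathbb{E}\,[k(X_i, X_i)]$, Assumption 1 implies Assumption 2 with $V = M^2$. Note that Assumption 2 is satisfied for the polynomial kernel of order $d$ provided that $\forall i \in \{1, \dots, n\}$, $\mathbb{E}\,\| X_i \|^{2d} < +\infty$.
In the setting of Example 2.1, Assumption 2 holds true with
$$V = \max_{1 \le j \le K+1} \mathbb{E}_{X \sim P_j} \left\| \Phi(X) - \mu^\star_{P_j} \right\|_{\mathcal{H}}^2,$$
provided this maximum is finite.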
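The implication from Assumption 1 to Assumption 2 can be verified in a few lines. The sketch below reads Assumption 1 as $k(X_i, X_i) \le M^2$ almost surely and Assumption 2 as $\mathbb{E}\,\|\varepsilon_i\|_{\mathcal{H}}^2 \le V$, and writes $\mu^\star_i$ for the mean element (Bochner integral) of $Y_i$, with $\varepsilon_i = Y_i - \mu^\star_i$.

```latex
\begin{align*}
\mathbb{E}\,\|\varepsilon_i\|_{\mathcal{H}}^2
  &= \mathbb{E}\,\|Y_i - \mu^\star_i\|_{\mathcal{H}}^2 \\
  &= \mathbb{E}\,\|Y_i\|_{\mathcal{H}}^2
     - 2\,\mathbb{E}\,\langle Y_i, \mu^\star_i\rangle_{\mathcal{H}}
     + \|\mu^\star_i\|_{\mathcal{H}}^2 \\
  &= \mathbb{E}\,[k(X_i, X_i)] - \|\mu^\star_i\|_{\mathcal{H}}^2
     && \text{(since $\mathbb{E}\,\langle Y_i, g\rangle_{\mathcal{H}}
        = \langle \mu^\star_i, g\rangle_{\mathcal{H}}$ for $g \in \mathcal{H}$)} \\
  &\le \mathbb{E}\,[k(X_i, X_i)] \;\le\; M^2 .
\end{align*}
```

The same identity shows that the variance term is the gap between the second moment $\mathbb{E}\,[k(X_i, X_i)]$ and the squared norm of the mean element, exactly as in the scalar case.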

Theoretical guarantees for KCP
We are now able to state our main results. In Section 3.1, we state the main result of the paper, Theorem 3.1, which provides simple conditions under which KCP recovers the correct number of segments and localizes the true change-points with high probability, under the bounded kernel Assumption 1. Then, Section 3.2 details a few classical losses between segmentations which can be considered in addition to the one used in Theorem 3.1. Corollary 3.1 formulates a result on $\widehat{\tau}$ in terms of the Frobenius loss. Finally, Section 3.3 states a partial result on KCP, requiring the number of change-points $D_{\tau^\star}$ to be known, under the weaker Assumption 2.

Main result
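We first need to define some quantities. The size of the smallest jump of $\mu^\star$ in $\mathcal{H}$ is
$$\Delta := \min_{i \,:\, \mu^\star_i \neq \mu^\star_{i+1}} \left\| \mu^\star_i - \mu^\star_{i+1} \right\|_{\mathcal{H}}.$$
Intuitively, the higher $\Delta$ is, the easier it is to detect the smallest jump with our procedure. The quantity $\| \mu^\star_i - \mu^\star_{i+1} \|_{\mathcal{H}}$ is often called the (population) maximum mean discrepancy [MMD, 24] between the distributions of $X_i$ and $X_{i+1}$.
In the scalar setting (with the linear kernel), the ratio $\Delta/\sigma$ (where $\sigma^2$ is the variance of the noise) is called the signal-to-noise ratio [7] and is often used as a measure of the magnitude of a change in the signal. In Example 2.1, for $n$ large enough, $\Delta$ is the smallest nonzero value of $\| \mu^\star_{P_j} - \mu^\star_{P_{j+1}} \|_{\mathcal{H}}$ over $1 \le j \le K$, where $\mu^\star_{P_j}$ denotes the (Bochner) expectation of $\Phi(X)$ when $X \sim P_j$.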
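The jump size between two distributions can be estimated from samples with the plug-in (biased) MMD estimator. The sketch below is an illustration only: the function names are hypothetical, the data are one-dimensional, and the Gaussian kernel is parameterized as $\exp(-(x-y)^2/(2h^2))$. It shows a case where the linear kernel sees no jump because the two samples share the same mean, while the Gaussian kernel does.

```python
import numpy as np

def gram(x, y, kernel):
    """Matrix of kernel evaluations k(x_i, y_j) for 1-D samples x and y."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return kernel(x, y)

def linear(x, y):
    return x * y

def gaussian(x, y, bandwidth=1.0):
    # Assumed parameterization: exp(-(x - y)^2 / (2 h^2)).
    return np.exp(-(x - y) ** 2 / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, kernel):
    """Plug-in (biased) estimate of ||mu_P - mu_Q||_H^2 from x ~ P and y ~ Q."""
    return (gram(x, x, kernel).mean()
            + gram(y, y, kernel).mean()
            - 2.0 * gram(x, y, kernel).mean())

x = [-1.0, 1.0]   # empirical mean 0
y = [0.0, 0.0]    # empirical mean 0, but a different distribution
jump_linear = mmd_squared(x, y, linear)
jump_gaussian = mmd_squared(x, y, gaussian)
```

With the linear kernel the estimate equals the squared difference of empirical means, hence zero here, whereas the Gaussian kernel returns a strictly positive value.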
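For any $\tau \in \mathcal{T}_n$, we denote the (normalized) sizes of its smallest and of its largest segment by
$$\Lambda_\tau := \frac{1}{n} \min_{1 \le \ell \le D_\tau} (\tau_\ell - \tau_{\ell-1}) \quad \text{and} \quad \overline{\Lambda}_\tau := \frac{1}{n} \max_{1 \le \ell \le D_\tau} (\tau_\ell - \tau_{\ell-1}). \tag{3.2}$$
It should be clear that the smaller $\Lambda_{\tau^\star}$ is, the harder it is to detect the segment that achieves the minimum in equation (3.2). For instance, in the particular case of Example 2.1, $\Lambda_{\tau^\star}$ converges, as $n \to +\infty$, to the length of the smallest segment of the fixed segmentation of $[0, 1]$ given by the $b_j$.
For any $\tau^1$ and $\tau^2 \in \mathcal{T}_n$, we define
$$d_\infty^{(1)}(\tau^1, \tau^2) := \max_{1 \le i \le D_{\tau^1} - 1} \ \min_{1 \le j \le D_{\tau^2} - 1} \left| \tau^1_i - \tau^2_j \right|,$$
which is a loss function (a measure of dissimilarity) between the segmentations $\tau^1$ and $\tau^2$. Note that $d_\infty^{(1)}$ is not a distance; other possible losses between segmentations and their relationship with $d_\infty^{(1)}$ are discussed in Section 3.2.
Theorem 3.1. Suppose that Assumption 1 holds true. For any $y > 0$, an event $\Omega$ of probability at least $1 - e^{-y}$ exists on which the following holds true. For any $C > 0$, let $\widehat{\tau}$ be defined as in Eq. (2.2) with pen defined by Eq. (2.3). Set [...]. Then, if
$$C_{\min} < C < C_{\max}, \tag{3.3}$$
on $\Omega$, we have
$$D_{\widehat{\tau}} = D_{\tau^\star} \quad \text{and} \quad \frac{1}{n}\, d_\infty^{(1)}(\tau^\star, \widehat{\tau}) \le v_1(y).$$
We delay the proof of Theorem 3.1 to Section 6.4. Some remarks follow.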
Theorem 3.1 is a non-asymptotic result: it is valid for any $n \ge 1$ and there is nothing hidden in $o(1)$ remainder terms. The latter point is crucial for complex data (for instance, $\mathcal{X} = \mathbb{R}^p$ with $p > n$), since in this case, assuming $\mathcal{X}$ fixed while $n \to +\infty$ is not realistic.
Nevertheless, it is useful to write down what Theorem 3.1 becomes in the asymptotic setting of Example 2.1. As previously noticed, $D_{\tau^\star}$, $\Lambda_{\tau^\star}$, $\Delta^2$ and $M^2$ then converge to positive constants as $n \to +\infty$. Therefore, $C_{\min}$ is of order $\log(n)$, $C_{\max}$ is of order $n$ and we always have $C_{\min} < C_{\max}$ for $n$ large enough. The upper bound on $C$ matches classical asymptotic conditions for variable selection [47]. The necessity of taking $C$ of order at least $\log(n)$ is shown by Birgé and Massart [11] in a variable selection setting, which includes change-point detection as a particular example; Birgé and Massart [11] and Abramovich et al. [2] provide several arguments for the optimality of taking a constant $C$ of order $\log(n)$. When $C$ satisfies (3.3), the result of Theorem 3.1 implies that $\mathbb{P}(D_{\widehat{\tau}} = D_{\tau^\star}) \to 1$. For the linear kernel in $\mathbb{R}^d$, this is a well-known result when the distribution of the $X_i$ changes only through its mean. The first result dates back to Yao [54, Section 2] for a Gaussian noise, later extended by Liu, Wu and Zidek [42] and Bai and Perron [6, Section 3.1] under mixingale hypotheses on the error, and by Lavielle and Moulines [37] under very mild assumptions satisfied for a large family of zero-mean processes [for the precise statement of the hypothesis, see 37, Section 2.1]. Theorem 3.1 also shows that the normalized estimated change-points of $\widehat{\tau}$ converge towards the normalized true change-points at speed at least $\log(n)/n$.
Up to a logarithmic factor, this speed matches the minimax lower bound $n^{-1}$ which has been obtained previously for various change-point procedures [32, 12, 33, for instance] including least-squares [37], assuming that $\Lambda_{\tau^\star} \ge \kappa > 0$. When $D_{\tau^\star} \ge 3$ and the assumption on $\Lambda_{\tau^\star}$ is removed (that is, segments of length much smaller than $n$ are allowed, which is compatible with Theorem 3.1 since it is non-asymptotic), Brunel [14, Theorem 6] shows a minimax lower bound of order $\log(n)/n$. Therefore, in this setting, KCP achieves the minimax rate. We do not know whether KCP remains minimax optimal (without the log factor) under the assumption $\Lambda_{\tau^\star} \ge \kappa > 0$.
Note finally that KCP also performs well for finite samples, according to the simulation experiments of Arlot, Celisse and Harchaoui [4]. Theorem 3.1 emphasizes the key role of $\Delta^2/M^2$, which can be seen as a generalization of the signal-to-noise ratio, for the change-point detection performance of KCP. The larger this ratio, the easier it is to satisfy Eq. (3.3) and the smaller $v_1(y)$ is. This suggests choosing $k$ (theoretically at least) by maximizing $\Delta^2/M^2$, as we discuss in Section 5. Note that $\Delta^2/M^2$ is invariant by a rescaling of $k$, hence the result of Theorem 3.1 is unchanged when $k$ is rescaled.
The hypothesis in Eq. (3.3) is actually three-fold. First, we use that $C > C_{\min}$ to get $D_{\widehat{\tau}} \le D_{\tau^\star}$. We have to assume $C$ large enough since a too small penalty leads to selecting (with KCP or any other penalized least-squares procedure) the segmentation with $n$ segments, that is, $D_{\widehat{\tau}} = n$. Second, $C < C_{\max}$ is used to get $D_{\widehat{\tau}} \ge D_{\tau^\star}$. Such an assumption is required since taking a penalty function too large in Eq. (2.2) would result in selecting the segmentation with only one segment, that is, $D_{\widehat{\tau}} = 1$. Third, $C_{\max}$ has to be greater than $C_{\min}$ for providing a non-empty interval of possible values for $C$. This inequality is also used in the proof of the upper bound on $d_\infty^{(1)}(\tau^\star, \widehat{\tau})$ when we already know that $D_{\widehat{\tau}} = D_{\tau^\star}$. In Example 2.1, the $C_{\min} < C_{\max}$ hypothesis translates into $\Lambda_{\tau^\star} \gtrsim \log(n)/n$. That is, the (normalized) size of the smallest segment has to be at least of order $\log(n)/n$. This is known to be a necessary condition to obtain the minimax rate in multiple change-point detection [14, section 2].
Theorem 3.1 helps to choose $C$, which is a key parameter of KCP, as in any penalized model selection procedure. However, in practice, we do not recommend using equation (3.3) directly for choosing $C$, for two reasons: $C_{\min}$ and $C_{\max}$ depend on the unknown quantities $D_{\tau^\star}$, $\Lambda_{\tau^\star}$ and $\Delta$, and the exact values of the constants in $C_{\min}$ and $C_{\max}$ might be pessimistic compared to what we can observe from simulation experiments. We rather suggest using a data-driven method for choosing $C$, see Section 5.
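If we know $D_{\tau^\star}$, we can replace $\widehat{\tau}$ by
$$\widehat{\tau}(D_{\tau^\star}) \in \operatorname{argmin}_{\tau \in \mathcal{T}_n^{D_{\tau^\star}}} \widehat{R}_n(\tau).$$
Then, assuming that $\Lambda_{\tau^\star} > v_1(y)$ (which is weaker than assuming $C_{\min} < C_{\max}$), the proof of Theorem 3.1 shows that, on $\Omega$, we have [...]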

Loss functions between segmentations
Theorem 3.1 shows that $\widehat{\tau}$ is close to $\tau^\star$ in terms of $d_\infty^{(1)}$. Several other loss functions (measures of dissimilarity) can be defined between segmentations [29]. We here consider a few of them, which are often used or natural for the change-point problem.
Let us first consider losses related to the Hausdorff distance. For any $\tau^1$ and $\tau^2 \in \mathcal{T}_n$, we define
$$d_H^{(1)}(\tau^1, \tau^2) := \max\left\{ d_\infty^{(1)}(\tau^1, \tau^2),\ d_\infty^{(1)}(\tau^2, \tau^1) \right\}$$
and, when $D_{\tau^1} = D_{\tau^2}$,
$$d_\infty^{(2)}(\tau^1, \tau^2) := \max_{1 \le i \le D_{\tau^1} - 1} \left| \tau^1_i - \tau^2_i \right|.$$
Note that $d_\infty^{(2)}$ is symmetric, thus there is no need to define $d_H^{(2)}$. One could also define $d_H^{(1)}$ as the Hausdorff distance between the subsets $\{\tau^1_1, \dots, \tau^1_{D_{\tau^1}-1}\}$ and $\{\tau^2_1, \dots, \tau^2_{D_{\tau^2}-1}\}$ with respect to the distance $\delta(x, y) = |x - y|$ on $\mathbb{R}$. These definitions are illustrated by Figure 3.
(ii) For any $\tau^1, \tau^2 \in \mathcal{T}_n$ such that [...]
Lemma 3.1 is proved in Section B.1. As a direct application of Lemma 3.1, we see that the statement of Theorem 3.1 holds true with $d_\infty^{(1)}$ replaced by any of the loss functions that we defined above, at least for $n$ large enough.
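The Hausdorff-type losses are straightforward to compute from the lists of change-points. The Python sketch below is an illustration with hypothetical function names. It assumes that a segmentation is encoded as `[0, tau_1, ..., n]`, that the one-sided loss is oriented as "largest distance from a change-point of the first segmentation to the closest change-point of the second", and, as a convention of this sketch only, that the loss is 0 when the first segmentation has no change-point and infinite when only the second has none.

```python
def one_sided_loss(tau1, tau2):
    """max over change-points of tau1 of the distance to the closest change-point of tau2."""
    inner1, inner2 = tau1[1:-1], tau2[1:-1]   # drop the endpoints 0 and n
    if not inner1:
        return 0
    if not inner2:
        return float("inf")
    return max(min(abs(a - b) for b in inner2) for a in inner1)

def hausdorff_loss(tau1, tau2):
    """Hausdorff distance between the two sets of change-points, for |x - y| on R."""
    return max(one_sided_loss(tau1, tau2), one_sided_loss(tau2, tau1))

tau_a = [0, 50, 70, 100]   # change-points at 50 and 70
tau_b = [0, 48, 100]       # a single change-point at 48
a_to_b = one_sided_loss(tau_a, tau_b)
b_to_a = one_sided_loss(tau_b, tau_a)
both = hausdorff_loss(tau_a, tau_b)
```

Here the change-point at 70 has no close counterpart in `tau_b`, so the loss from `tau_a` to `tau_b` is large while the reverse loss is small: the one-sided loss is not symmetric, and the Hausdorff loss takes the larger of the two.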
Another loss between segmentations is the Frobenius loss [35], which is defined as follows. For any $\tau^1, \tau^2 \in \mathcal{T}_n$,
$$d_F(\tau^1, \tau^2) := \left\| \Pi_{\tau^1} - \Pi_{\tau^2} \right\|_F,$$
where $\Pi_\tau$ is the orthogonal projection onto $F_\tau$, as defined in Section 2.4, and $\| \cdot \|_F$ denotes the Frobenius norm of a matrix: $\| A \|_F^2 := \sum_{i,j} A_{i,j}^2$. A closed-form formula for $d_F$ can be derived from the matrix representation of $\Pi_\tau$ that is given by (2.5): for any $i, j \in \{1, \dots, n\}$,
$$(\Pi_\tau)_{i,j} = \begin{cases} 1/|\lambda| & \text{if } i \text{ and } j \text{ belong to the same segment } \lambda \text{ of } \tau, \\ 0 & \text{otherwise.} \end{cases}$$
An interesting feature of the Frobenius loss is that it is smaller than one only when $\tau^1$ and $\tau^2$ have the same number of segments, whereas Hausdorff distances can be small with very different numbers of segments. Indeed, we prove in Section B.2 that [...] (3.4)
The next proposition shows that there is an equivalence (up to constants) between the Hausdorff and Frobenius losses between segmentations, provided that they are close enough.
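The Frobenius loss can be computed directly from the block structure of the projection matrices. The sketch below is an illustration with hypothetical function names; it assumes that the loss is the Frobenius norm of the difference of the two projection matrices and that a segmentation is encoded as `[0, tau_1, ..., n]`.

```python
import numpy as np

def projection_matrix(tau):
    """n x n matrix with entries 1/|segment| inside each diagonal block, 0 elsewhere."""
    n = tau[-1]
    P = np.zeros((n, n))
    for start, end in zip(tau[:-1], tau[1:]):
        P[start:end, start:end] = 1.0 / (end - start)
    return P

def frobenius_loss(tau1, tau2):
    """Frobenius norm of the difference between the two projection matrices."""
    diff = projection_matrix(tau1) - projection_matrix(tau2)
    return np.linalg.norm(diff, ord="fro")

one_segment = [0, 4]
two_segments = [0, 2, 4]
loss = frobenius_loss(one_segment, two_segments)
```

Each projection matrix has trace equal to its number of segments $D$, so the squared loss equals $D_1 + D_2 - 2\,\mathrm{tr}(\Pi_1 \Pi_2)$; since $\mathrm{tr}(\Pi_1 \Pi_2) \le \min(D_1, D_2)$, the squared loss is at least $|D_1 - D_2|$. In the example above it is exactly 1.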
Proposition 3.1 was first stated and proved by [35, Theorem B.2]. We prove it in Section B.2 for completeness.
As a corollary of Theorem 3.1 and Proposition 3.1, we get the following guarantee on the Frobenius loss between $\tau^\star$ and the segmentation $\widehat{\tau}$ estimated by KCP.
Corollary 3.1. Under the assumptions of Theorem 3.1, on the event $\Omega$ defined by Theorem 3.1, for any $\widehat{\tau}$ satisfying (2.2) with pen defined by (2.3), we have: [...]
Note that Corollary 3.1 gives a better result (at least for large $n$) than the obvious bound [...]
Proof. On the event $\Omega$, we have $\frac{1}{n}\, d_\infty^{(1)}(\tau^\star, \widehat{\tau}) < \Lambda_{\tau^\star}/(D_{\tau^\star} + 1)$ and $D_{\widehat{\tau}} = D_{\tau^\star}$. Therefore, according to Proposition 3.1, [...]
Up to this point, we assessed the quality of the segmentation $\widehat{\tau}$ by considering the proximity of $\widehat{\tau}$ with $\tau^\star$. Another natural idea is to measure the distance between $\mu^\star$ and $\widehat{\mu}_{\widehat{\tau}}$ in $\mathcal{H}^n$. It is closely related to the oracle inequality proved by Arlot, Celisse and Harchaoui [4], which implies an upper bound on $\| \mu^\star - \widehat{\mu}_{\widehat{\tau}} \|^2$.
We can also observe that there is a simple relationship between $\| \mu^\star - \widehat{\mu}_{\widehat{\tau}} \|^2$ and the Frobenius distance between $\widehat{\tau}$ and $\tau^\star$. Indeed, [...] (3.5)
Equation (6.9) in the proof of Theorem 3.1 shows that on $\Omega$, under the assumptions of Theorem 3.1, [...], which is slightly better than (but similar to) what Corollary 3.1, equation (3.5) and the bound $\| \mu^\star \|^2 \le M^2 n$ imply.

Extension to the finite variance case
Theorem 3.1 is valid under a boundedness assumption (Assumption 1). What happens under the weaker Assumption 2? As a first step, we provide a result for
$$\widehat{\tau}(D_{\tau^\star}, \delta_n) \in \operatorname{argmin}_{\tau \in \mathcal{T}_n^{D_{\tau^\star}},\ \Lambda_\tau \ge \delta_n} \widehat{R}_n(\tau)$$
for some $\delta_n > 0$. In other words, we restrict our search to segmentations $\tau$ of the correct size (hence $D_{\tau^\star}$ must be known a priori) and having no segment with less than $n\delta_n$ observations. We discuss how to relax this restriction right after the statement of Theorem 3.2. Note that the dynamic programming algorithm of Harchaoui and Cappé [27] can be used for computing $\widehat{\tau}(D_{\tau^\star}, \delta_n)$ efficiently.
We postpone the proof of Theorem 3.2 to Section 6.5. Let us make a few remarks.
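A constrained minimizer of this kind can be computed by dynamic programming over segment end points. The Python sketch below is a straightforward textbook recursion written for illustration; it is not the optimized algorithm of the cited references. The function names are hypothetical, a segmentation is encoded as `[0, tau_1, ..., n]`, and `min_len` is assumed to play the role of $\lceil n\delta_n \rceil$.

```python
import numpy as np

def segment_cost_matrix(gram):
    """cost[s, e] = sum_i k(x_i, x_i) - (block sum of gram) / (e - s) on the slice s:e."""
    n = gram.shape[0]
    # 2-D prefix sums: S[a, b] is the sum of gram[:a, :b].
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = gram.cumsum(axis=0).cumsum(axis=1)
    diag = np.concatenate(([0.0], np.cumsum(np.diag(gram))))
    cost = np.full((n + 1, n + 1), np.inf)
    for s in range(n):
        for e in range(s + 1, n + 1):
            block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
            cost[s, e] = diag[e] - diag[s] - block / (e - s)
    return cost

def kcp_fixed_size(gram, D, min_len=1):
    """Segmentation with D segments, each of length >= min_len, minimizing the kernel risk."""
    n = gram.shape[0]
    if D * min_len > n:
        raise ValueError("no segmentation satisfies the length constraint")
    cost = segment_cost_matrix(gram)
    best = np.full((D + 1, n + 1), np.inf)   # best[d, e]: optimal cost of x_1..x_e in d segments
    arg = np.zeros((D + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, D + 1):
        for e in range(d * min_len, n + 1):
            # Candidate end points of the previous segment.
            starts = np.arange((d - 1) * min_len, e - min_len + 1)
            values = best[d - 1, starts] + cost[starts, e]
            j = int(np.argmin(values))
            best[d, e], arg[d, e] = values[j], starts[j]
    tau = [n]
    for d in range(D, 0, -1):                # backtrack from the last segment
        tau.append(int(arg[d, tau[-1]]))
    return tau[::-1]
```

For instance, with a Gaussian Gram matrix built from a noiseless piecewise-constant sample, `kcp_fixed_size(gram, 3)` returns the true segmentation. The cost matrix takes $O(n^2)$ time and memory and the recursion $O(D n^2)$ time, which is fine for a few thousand points.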
As for Theorem 3.1, our result is non-asymptotic.However, it is interesting to write it down in the setting of Example 2.1.If n goes to infinity, then the assumption Λ τ ≥ δ n is satisfied whenever δ n → 0. If we furthermore require that nδ n → ∞, then Eq. (3.7) implies that by taking a well-chosen y of order √ n + √ nδ n .In the particular case of the linear kernel, this result is known under various hypothesis [37, for instance]; it is new for a general kernel.
More precisely, if we take goes to zero at least as fast as n / √ n, where ( n ) n≥1 is any sequence tending to infinity, for instance n = log(n).This speed seems suboptimal compared to previous results [37, for instance] -which do not consider the case of a general kernel -, but we have not been able to prove tight enough deviation bounds for getting the localization rate log(n)/n under Assumption 2.
How does Theorem 3.2 compare to Theorem 3.1? First, as noticed by Remark 6.4 in Section 6.4, the result of Theorem 3.1 also holds true for $\widehat{\tau}(D_{\tau^\star}, \delta_n)$ as long as $\Lambda_{\tau^\star} \ge \delta_n$. Second, $v_1(y)$ is usually smaller than $v_2(y, \delta_n)$ (its order of magnitude is smaller when $n \to +\infty$), and the lower bound on the probability of $\Omega$ is better than the one for $\Omega_2$. There is no surprise here: the stronger Assumption 1 helps us prove a stronger result for $\widehat{\tau}(D_{\tau^\star}, \delta_n)$. Nevertheless, these are only upper bounds, so we do not know whether the performance of $\widehat{\tau}(D_{\tau^\star}, \delta_n)$ actually changes much depending on the noise assumption. For instance, as already noticed, we do not believe that the localization speed $\log(n)/n$ requires a boundedness assumption; in particular cases at least, it has been obtained for unbounded data [37,12].
The dependency on $k$ of the speed of convergence of $\widehat{\tau}(D_{\tau^\star}, \delta_n)$ is slightly less clear than in Theorem 3.1. The signal-to-noise ratio appears through $\Delta^2/V$, as expected, but the size $\overline{\Delta}$ of the largest true jump also appears in $v_2$. At the very least, it is clear that $\Delta^2/V$ should not be too small.
As noted by Lavielle and Moulines [37], it may be possible to get rid of the minimal segment length $\delta_n$, either by imposing stronger conditions on $\varepsilon$ (which are not met in our setting) or by constraining the values of $\mu^\star$ to lie in a compact subset $\Theta \subset \mathcal{H}^{D_{\tau^\star}+1}$.

Numerical simulations
One consequence of our main result, Theorem 3.1, is that for a bounded kernel, the KCP procedure is consistent in the asymptotic setting presented in Example 2.1. We now illustrate this fact by a simulation study.

Detecting changes in the mean with the Gaussian kernel
Let us consider the archetypal change-point detection problem (finding changes in the mean of a sequence of independent random variables) and show how these changes are localized more precisely when more data are available.
We define three functions $\mu_m : [0, 1] \to \mathbb{R}$, $1 \le m \le 3$, previously used by Arlot and Celisse [3], which cover a variety of situations (see Fig. 4). For each $m \in \{1, 2, 3\}$ and several values of $n$ between $10^2$ and $10^3$, we repeat $10^3$ times the following:
- Sample $n$ independent Gaussian random variables $g_i \sim \mathcal{N}(0, 1)$;
- Set $X_i = \mu_m(i/n) + g_i$ (Fig. 4 shows one sample for each $m \in \{1, 2, 3\}$);
- Perform KCP with Gaussian kernel and linear penalty on $X_1, \dots, X_n$; the penalty constant is chosen as indicated in Section 5, the bandwidth is set to 0.1, and the maximum number of change-points is set to 30;
- Compute $d_H^{(1)}(\tau^\star, \widehat{\tau}_n)$.
The results are collected in Fig. 5, where each graph corresponds to a regression function $\mu_m$. We represent in logarithmic scale the mean distance between the true segmentation and the estimated segmentation for each value of $n$. The error bars are $\pm \sigma/\sqrt{N}$, where $\sigma$ is the empirical standard deviation over $N = 10^3$ repetitions. We want to emphasize that, though these experiments illustrate our main result Theorem 3.1, they are carried out in a slightly different setting since the penalty constant $C$ is not chosen according to equation (3.3), but using the dimension jump heuristic [8].
Figure 5: Convergence of $\frac{1}{n} d_H^{(1)}(\tau^\star, \widehat{\tau}_n)$ towards 0 when the number of data points $n$ is increasing. A linear regression between $\log n$ and $\frac{1}{n} d_H^{(1)}(\tau^\star, \widehat{\tau}_n)$ for $n \ge 300$ yields slope estimates $-0.97$, $-1.04$ and $-1.00$, respectively.
The three segmentation problems considered here are quite different in nature, but all lead to a linear convergence rate (slopes close to −1 on the graphs of Figure 5) with different constants (different values for the intercept on the graphs of Figure 5).Recall that Theorem 3.1 combined with Lemma 3.1 states that, with high probability, Hence, whenever D τ , ∆ and M are fixed, 1 n d H (τ , τ n ) converges to 0 at rate at least log n/n when the number of data points increases.In our experimental setting, these quantities are fixed, and the observed convergence rate matches our theoretical upper bound.The performance of KCP still depends on the regression function µ m experimentally, by a constant multiplicative factor, like the theoretical bound v 1 .
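The slope estimates quoted for these experiments come from a linear regression in logarithmic scale. A minimal sketch of such a fit is given below, assuming that both the sample size and the mean normalized loss are log-transformed before the regression; the function name and the synthetic values are hypothetical and are not the paper's data.

```python
import numpy as np

def loglog_slope(ns, losses):
    """Least-squares slope of log(loss) against log(n)."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(losses), deg=1)
    return slope

# Synthetic illustration: a loss decaying exactly like c / n has slope -1.
ns = np.array([300, 400, 500, 700, 1000])
synthetic_losses = 5.0 / ns
slope = loglog_slope(ns, synthetic_losses)
```

A slope close to $-1$ corresponds to a loss of order $1/n$; a logarithmic factor in the rate, as in the bound $\log(n)/n$, only bends the line slightly over one decade of sample sizes and is hard to distinguish from such a fit.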
We test KCP with various kernels assuming that the number of change-points ($D_{\tau^\star} = 3$) is known; this simplification avoids possible artifacts linked to the choice of the penalty constant. Results are shown in Figure 6. The $X_i$ all have zero mean and unit variance, hence a classical penalized least-squares procedure (KCP with the linear kernel) is expected to detect the changes in the distribution of the $X_i$ poorly, as confirmed by Figure 6 (for instance, according to the right panel, it is not consistent). On the contrary, a Gaussian kernel with well-chosen bandwidth yields much better performance according to the middle and right panels of Figure 6 (with a rate of order $1/n$).
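When the number of segments is fixed, the kernel least-squares criterion can be minimized exactly by dynamic programming. The sketch below is a minimal version under stated assumptions: the cost of a segment is taken to be $\sum_i k(X_i, X_i) - \frac{1}{|\lambda|} \sum_{i,j \in \lambda} k(X_i, X_j)$, the Gaussian kernel is parametrized as $\exp(-|s-t|^2/(2h^2))$, and all function names are hypothetical. It is not the implementation used for Figure 6, and the toy data has mean shifts only so that the expected answer is obvious.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gram matrix of k(s, t) = exp(-|s - t|^2 / (2 h^2)) (assumed convention)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def segment_costs(gram):
    """cost[a, b] = kernel least-squares cost of the segment {a, ..., b - 1}."""
    n = gram.shape[0]
    prefix = np.zeros((n + 1, n + 1))
    prefix[1:, 1:] = gram.cumsum(axis=0).cumsum(axis=1)   # 2-D prefix sums
    diag = np.concatenate(([0.0], np.cumsum(np.diag(gram))))
    cost = np.full((n + 1, n + 1), np.inf)
    for a in range(n):
        for b in range(a + 1, n + 1):
            block = prefix[b, b] - prefix[a, b] - prefix[b, a] + prefix[a, a]
            cost[a, b] = diag[b] - diag[a] - block / (b - a)
    return cost

def best_segmentation(cost, n_segments):
    """Minimize the total cost over segmentations with exactly n_segments segments."""
    n = cost.shape[0] - 1
    best = np.full((n_segments + 1, n + 1), np.inf)
    arg = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, n_segments + 1):
        for b in range(d, n + 1):
            candidates = best[d - 1, :b] + cost[:b, b]
            arg[d, b] = int(np.argmin(candidates))
            best[d, b] = candidates[arg[d, b]]
    change_points, b = [], n
    for d in range(n_segments, 1, -1):     # backtrack the segment boundaries
        b = int(arg[d, b])
        change_points.append(b)
    return sorted(change_points), float(best[n_segments, n])

rng = np.random.default_rng(0)
x = np.concatenate([np.zeros(20), 5.0 * np.ones(20), np.zeros(20)])
x = x + 0.1 * rng.standard_normal(60)
change_points, value = best_segmentation(segment_costs(gaussian_gram(x, 1.0)), 3)
print(change_points)
```

Replacing `gaussian_gram` by the Gram matrix of the linear kernel, `np.outer(x, x)`, turns the same routine into a classical least-squares segmentation, which is the comparison made in this experiment.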
$\frac{1}{n} d_H(\tau^\star, \hat{\tau}_n)$ vs. $n$ in log scale, for KCP with a Gaussian kernel with bandwidth 0.01 (blue solid line; estimated slope $-1.05$) and with the linear kernel (red dashed line; estimated slope $0.16$).

Discussion
Before proving our main results, let us discuss some of their consequences regarding the KCP procedure.
Fully non-parametric consistent change-point detection. We have proved that for any kernel satisfying some reasonably mild hypotheses, the KCP procedure outputs a segmentation close to the true segmentation with high probability.
An important particular example is the "asymptotic setting" of Example 2.1, where we have a fixed true segmentation $\tau^\star$ and fixed distributions $P_1, \ldots, P_{K+1}$ from which more and more points are sampled. How fast can KCP recover $\tau^\star$, without any prior information on the number of segments $D_{\tau^\star}$ or on the distributions $P_1, \ldots, P_{K+1}$?
Let us take a bounded characteristic kernel (for instance the Gaussian or the Laplace kernel if $\mathcal{X} = \mathbb{R}^d$), so that Assumption 1 holds true. Then, Theorem 3.1 shows that KCP consistently detects all changes in the distribution of the $X_i$, and localizes them at speed $\log(n)/n$. This speed also depends on how well the kernel $k$ matches the differences between the $P_j$, through the ratio $\Delta^2 / M^2$. Such a fully non-parametric result for multiple change-points with a general set $\mathcal{X}$ (we only need to know a bounded characteristic kernel on $\mathcal{X}$) has never been obtained before. To the best of our knowledge, non-parametric consistency results for the detection of arbitrary changes in the distribution of the data have only been obtained for real-valued data [56] or for the case of a single change-point [15,13].
Choice of $k$. An important question remains: how to choose the kernel $k$? In Theorem 3.1, $k$ only appears through the "signal-to-noise ratio" $\Delta^2 / M^2$, leading to better theoretical guarantees when this signal-to-noise ratio is larger: a larger value for $C_{\max}$ and a smaller bound $v_1$ on $d_\infty(\tau^\star, \hat{\tau})$. Therefore, a simple strategy for choosing the kernel is to pick the $k$ that maximizes $\Delta^2 / M^2$, at least among a family of kernels, for instance Gaussian kernels. This first idea requires knowing the distributions of the $X_i$, or at least having prior information on them. Interestingly, when the change-point locations are known, $\Delta^2$ corresponds to the maximum mean discrepancy (MMD) [24] between the distributions of the $X_i$ over contiguous segments. In this particular setting, it is feasible to estimate and to maximize $\Delta^2$ with respect to the kernel $k$, as done by Gretton et al. [25]. An interesting future development would be to build an estimator of $\Delta^2$ without knowing the change-point locations and to maximize this estimator with respect to the kernel $k$. We refer to Arlot, Celisse and Harchaoui [4, section 7.2] for a complementary discussion about the choice of $k$ for KCP.
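To make the role of the kernel concrete, the sketch below estimates a squared MMD between two samples that share the same mean and variance, once with the linear kernel and once with a Gaussian kernel. It is only an illustration under stated assumptions: it uses the standard biased (V-statistic) estimator, the two distributions and the bandwidth 0.5 are arbitrary choices, and the function names are hypothetical. It is not the kernel-selection procedure of [25].

```python
import numpy as np

def mmd2_biased(x, y, kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def linear_kernel(s, t):
    """k(s, t) = s * t for real-valued data."""
    return np.outer(s, t)

def gaussian_kernel(bandwidth):
    """k(s, t) = exp(-(s - t)^2 / (2 h^2)) for real-valued data (assumed convention)."""
    def k(s, t):
        return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * bandwidth ** 2))
    return k

rng = np.random.default_rng(0)
m = 1000
x = rng.standard_normal(m)              # N(0, 1): mean 0, variance 1
y = rng.choice([-1.0, 1.0], size=m)     # symmetric +-1 values: mean 0, variance 1

mmd_linear = mmd2_biased(x, y, linear_kernel)        # equals (mean(x) - mean(y))^2
mmd_gauss = mmd2_biased(x, y, gaussian_kernel(0.5))
print(mmd_linear, mmd_gauss)
```

With the linear kernel the estimate reduces to the squared difference of empirical means, so it stays close to zero here, while the Gaussian kernel separates the two samples clearly.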
Choice of $C$. Another important parameter of the KCP procedure is the constant $C$ that appears in the penalty function. As mentioned below Theorem 3.1, our theoretical guarantees provide some guidelines for choosing $C$, but these are not sufficient to choose $C$ precisely in practice. We recommend following the advice of [4, section 6.2] on this point, which is to choose $C$ from data with the "slope heuristic" [8].
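One common data-driven variant of this idea is the dimension jump: scan a grid of constants, record the number of segments selected for each constant, locate the largest drop, and use twice the constant at which it occurs. The sketch below illustrates this on synthetic risk values. It rests on several assumptions: the penalty is taken of the form $C D / n$, the risks are made up, and the function names, the grid and the factor 2 follow the usual description of the heuristic rather than the exact implementation of [4] or [8].

```python
import numpy as np

def selected_dimension(risks, dims, c, n):
    """Number of segments minimizing risk(D) + c * D / n."""
    return int(dims[int(np.argmin(risks + c * dims / n))])

def dimension_jump(risks, dims, n, c_grid):
    """Return (2 * c_jump, selected D), where c_jump is located at the largest drop."""
    chosen = np.array([selected_dimension(risks, dims, c, n) for c in c_grid])
    drops = chosen[:-1] - chosen[1:]
    c_jump = c_grid[int(np.argmax(drops)) + 1]   # first constant after the jump
    c_final = 2.0 * c_jump
    return c_final, selected_dimension(risks, dims, c_final, n)

# Synthetic empirical risks: a bias term vanishing at D = 4 segments, then a
# linear decrease of slope -1/n (so the minimal penalty constant is 1).
n = 100
dims = np.arange(1, 31)
risks = 0.5 * np.maximum(4 - dims, 0) - dims / n + 2.0
c_grid = np.arange(0.13, 5.0, 0.25)

c_final, d_hat = dimension_jump(risks, dims, n, c_grid)
print(c_final, d_hat)
```

On real data the selected dimension as a function of the constant is noisier, and the location of the jump should be inspected visually before trusting the result.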

Modularity of the proofs and possible extensions. Finally, we would like to emphasize what we believe to be an important contribution of this paper. The structure of the proofs of Theorems 3.1 and 3.2 (which follow the same strategy) is modular, so that one can easily adapt it to different sets of assumptions.
Our proof strategy is not fully new, since it is similar to the one of almost all previous papers analyzing the consistency of least-squares change-point detection procedures. In particular, we adapted some ideas of the proofs of Lavielle and Moulines [37] to the Hilbert space setting. Nevertheless, these papers formulate their main results in asymptotic terms, which can be seen as a limitation, especially when $n$ is small or $\mathcal{X}$ is of large dimension. Another approach is the one of Lebarbier [39], Comte and Rozenholc [17], and Arlot, Celisse and Harchaoui [4], where non-asymptotic oracle inequalities (using concentration inequalities and following the model selection results of Birgé and Massart [10]) are provided as theoretical guarantees on some penalized least-squares change-point procedures. Up to now, these two approaches seemed difficult to combine. The proofs of Theorems 3.1 and 3.2 show how they can be reconciled, which allows us to combine their strengths. Indeed, the assumptions on the distributions of the $X_i$ (Assumptions 1 and 2) are only used for proving bounds on two quantities, a linear term $L_\tau$ and a quadratic term $Q_\tau$, uniformly over $\tau \in \mathcal{T}_n$. Under Assumption 1, this is done thanks to concentration inequalities (Lemmas 6.7 and 6.8) which were first proved by Arlot, Celisse and Harchaoui [4] in order to get an oracle inequality. Under Assumption 2, this is done by generalizing the method of Lavielle and Moulines [37] to Hilbert-space valued data, through two deterministic bounds (Lemmas 6.5 and 6.6) and a deviation inequality for the noise terms $\varepsilon_j \in \mathcal{H}$ (Lemma 6.10). The rest of the proofs does not use anything about the distribution of $X_1, \ldots, X_n$.
As a consequence, if one can generalize these bounds to another setting, then a result similar to Theorem 3.1 or 3.2 holds true for the KCP procedure in this new setting. In particular, this could be used for dealing with the case of dependent data $X_1, \ldots, X_n$. We could also consider an intermediate assumption between Assumption 2 and Assumption 1, of the form:

max

for some $\alpha \in (1, +\infty)$.

Proofs
Let us start by describing our general strategy for proving our main results. Our goal is to build a large probability event on which any $\hat{\tau} \in \arg\min_{\tau \in \mathcal{T}_n} \mathrm{crit}(\tau)$ belongs to some subset $E$ of $\mathcal{T}_n$. For proving this, we use the key fact that $\mathrm{crit}(\tau^\star) \ge \mathrm{crit}(\hat{\tau})$, together with a lower bound on $\mathrm{crit}(\tau)$ holding simultaneously for all $\tau \in \mathcal{T}_n$, hence for $\tau = \hat{\tau}$. In order to get such a lower bound on the empirical penalized criterion, we start by decomposing it in Section 6.1 into terms that are simpler to control individually: two random terms (a linear function of $\varepsilon$ and a quadratic function of $\varepsilon$), and two deterministic terms (the approximation error and the penalty). Then, we control these terms thanks to deterministic bounds (Section 6.2) and deviation/concentration inequalities (Section 6.3). Finally, we prove Theorem 3.1 in Section 6.4 and Theorem 3.2 in Section 6.5.

Decomposition of the empirical risk
The first step in the proofs of Theorems 3.1 and 3.2 is to decompose the empirical risk (2.6).

Lemma 6.1. Let $\tau \in \mathcal{T}_n$ be a segmentation. Define $\mu^\star_\tau = \Pi_\tau \mu^\star$. Then we can write

Proof. First, recall that $\hat{\mu}_\tau = \Pi_\tau Y$ and that $Y = \mu^\star + \varepsilon$, hence

Since $\Pi_\tau$ is an orthogonal projection,

Since each term of Eq. (6.1) behaves differently and is controlled via different techniques depending on the result to be proven, we name each of these terms:

(6.2)

It should be clear that $L$ stands for "linear", $Q$ stands for "quadratic" and $A$ stands for "approximation error". We also define

Therefore a reformulation of Lemma 6.1 is

Notice that $L_{\tau^\star} = A_{\tau^\star} = 0$ and $Q_{\tau^\star} \ge 0$, hence $\psi_{\tau^\star} \le 0$. Also note that $\psi$, $L$ and $Q$ are random quantities depending on $\varepsilon$.
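The orthogonal-projection identity behind this kind of decomposition can be checked numerically in the simplest case $\mathcal{H} = \mathbb{R}$ (linear kernel), where $\Pi_\tau$ averages within each segment. The sketch below verifies that $\|Y - \Pi_\tau Y\|^2 = \|\mu - \Pi_\tau \mu\|^2 + 2\langle \mu - \Pi_\tau \mu, \varepsilon\rangle - \|\Pi_\tau \varepsilon\|^2 + \|\varepsilon\|^2$ when $Y = \mu + \varepsilon$. The normalization and the exact definitions of the three terms are assumptions of this sketch and may differ from those of Eq. (6.1) and (6.2) by constant factors; `projection_matrix` is a hypothetical helper.

```python
import numpy as np

def projection_matrix(n, change_points):
    """Orthogonal projection onto vectors that are constant on each segment."""
    bounds = [0, *change_points, n]
    proj = np.zeros((n, n))
    for a, b in zip(bounds[:-1], bounds[1:]):
        proj[a:b, a:b] = 1.0 / (b - a)      # averaging over the segment {a, ..., b-1}
    return proj

rng = np.random.default_rng(1)
n = 12
mu = np.repeat([0.0, 3.0, -1.0], 4)         # true mean, change-points after 4 and 8
eps = rng.standard_normal(n)
y = mu + eps

proj = projection_matrix(n, [5, 9])         # a segmentation different from the true one
residual_mu = mu - proj @ mu

lhs = np.sum((y - proj @ y) ** 2)           # squared norm of the empirical residual
approx = np.sum(residual_mu ** 2)           # approximation error (deterministic)
linear = 2.0 * residual_mu @ eps            # linear in eps
quadratic = np.sum((proj @ eps) ** 2)       # quadratic in eps
rhs = approx + linear - quadratic + np.sum(eps ** 2)
print(lhs, rhs)

# For the true segmentation, the approximation error and the linear term vanish.
proj_true = projection_matrix(n, [4, 8])
approx_true = np.sum((mu - proj_true @ mu) ** 2)
```

The same computation goes through in a general Hilbert space, since it only uses that $\Pi_\tau$ is self-adjoint and idempotent.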

Deterministic bounds
In this section, we provide some deterministic bounds that are used in the proofs of Theorems 3.1 and 3.2.

Approximation error $A_\tau$
We begin with the following result, which is the reason for the $\Lambda_{\tau^\star} \Delta^2$ term in Theorem 3.1.
Lemma 6.2. Let $\tau \in \mathcal{T}_n$ be a segmentation such that $D := D_\tau < D_{\tau^\star}$. Then

The proof of Lemma 6.2 can be found in Section B.3.2.

We next state an analogous result, valid for any $\tau \in \mathcal{T}_n$, which plays a key role in the proofs of Theorems 3.1 and 3.2.

Lemma 6.3. For any $\tau \in \mathcal{T}_n$,

Lemma 6.3 is proved in Section B.4.

Linear term $L_\tau$ and quadratic term $Q_\tau$
The proof of Theorem 3.2 relies on some deterministic bounds on $L_\tau$ and $Q_\tau$. We start with a preliminary lemma.

Lemma 6.4. For any $\varepsilon_1, \ldots, \varepsilon_n \in \mathcal{H}$,

Proof. For every $a < b$, we have:

The following result is a deterministic bound on $Q_\tau$ in terms of $M_n$.

Lemma 6.5. Let $\tau \in \mathcal{T}_n$ be a segmentation. Then

Proof. By Eq. (2.5),

where we used Lemma 6.4 for the last inequality.

The following result is a deterministic bound on $L_\tau$.

Lemma 6.6. For any $\tau \in \mathcal{T}_n$,

Lemma 6.6 is proved in Section B.5.

Concentration
In this subsection, we present concentration results on $Q_\tau$ and $L_\tau$, and deviation bounds for $M_n$ (which will imply deviation bounds on $Q_\tau$ and $L_\tau$ by Lemmas 6.5 and 6.6). For any $j \in \{1, \ldots, n\}$, $\tau \in \mathcal{T}_n$ and $\ell \in \{1, \ldots, D_\tau\}$, we define

Concentration under Assumption 1. The first result takes care of the linear term $L_\tau$ when Assumption 1 is satisfied.

Lemma 6.7 (Prop. 3 of Arlot, Celisse and Harchaoui [4]). Suppose that Assumption 1 holds true. Then for any $x > 0$, with probability at least $1 - 2e^{-x}$, for any $\theta > 0$,

The next result deals with the quadratic term $Q_\tau$ when Assumption 1 is satisfied.

Lemma 6.8 (Prop. 1 of Arlot, Celisse and Harchaoui [4]). Suppose that Assumption 1 holds true. Then for any $x > 0$, with probability at least $1 - e^{-x}$,

We merge Lemmas 6.7 and 6.8 for convenience.

Concentration under Assumption 2. Lemmas 6.5 and 6.6 directly translate upper bounds on $M_n$ into controls of $L_\tau$ and $Q_\tau$. Under Assumption 2, such an upper bound is obtained via the following lemma, a Kolmogorov-like inequality for the noise in the RKHS. This result is a straightforward generalization of the inequality obtained by Kolmogorov [31] to the Hilbert setting. A more precise result (for real random variables only) can be found in [26], whose proof we follow. The scheme of Hájek and Rényi [26] adapts well to our setting even though we do not need the full result.
Remark 6.3. We can reformulate Lemma 6.10 as follows. For any $y > 0$, there exists an event of probability at least $1 - y^{-2}$ on which $M_n < y \sqrt{nV}$. Equivalently, for any $z \ge 0$, there exists an event of probability at least $1 - e^{-z}$ such that $M_n < e^{z/2} \sqrt{\sum_{j=1}^{n} v_j} \le e^{z/2} \sqrt{nV}$.
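The content of this remark can be illustrated by a small Monte Carlo experiment in the real-valued case, where the statement reduces to Kolmogorov's classical maximal inequality: for independent centered variables with variances $v_j$, the probability that $\max_k |\varepsilon_1 + \cdots + \varepsilon_k|$ exceeds $y \sqrt{\sum_j v_j}$ is at most $y^{-2}$. The sketch below assumes that $M_n$ denotes this maximum of partial sums and uses standard Gaussian noise; both are illustrative choices, not taken from the statement of Lemma 6.10.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep, y = 100, 20000, 3.0

variances = np.ones(n)                                   # v_j = 1 for all j
eps = rng.standard_normal((n_rep, n)) * np.sqrt(variances)

# Maximum over k of |eps_1 + ... + eps_k|, for each repetition.
max_partial_sum = np.abs(np.cumsum(eps, axis=1)).max(axis=1)

threshold = y * np.sqrt(variances.sum())
frequency = np.mean(max_partial_sum >= threshold)        # empirical exceedance rate
bound = 1.0 / y ** 2                                     # Kolmogorov-type bound
print(frequency, bound)
```

For Gaussian noise the empirical frequency is far below the bound, which reflects that a second-moment assumption alone gives a polynomial rather than exponential deviation bound.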

Proof of Theorem 3.1
We follow the strategy described at the beginning of Section 6.
Definition of $\Omega$. Let us define $\Omega := \bigcap_{\tau \in \mathcal{T}_n} \Omega_{\tau,\lambda}$ with $\lambda = y + \log n + 1 > 1$, where we recall that $\Omega^{(0)}_{\tau,\lambda}$ is defined in Lemma 6.9. By the union bound, and since the $\Omega$

where the last inequality uses that $n \ge 2$. From now on we work exclusively on $\Omega$.

Loss between $\tau^\star$ and $\hat{\tau}$
We have proved that $D_{\hat{\tau}} = D_{\tau^\star}$ on $\Omega$, therefore, Eq. (6.8) can be rewritten

By Lemma 6.3 and the definition of $\lambda$, we get

Remark that assumption (3.3) implies that

Therefore, Eq. (6.10) can be simplified into

instead of (2.2), for any $\delta_n \ge 0$ such that $\Lambda_{\tau^\star} \ge \delta_n$. Indeed, this assumption allows us to write $\mathrm{crit}(\tau^\star) \ge \mathrm{crit}(\hat{\tau})$ in the key argument, and the rest of the proof can stay unchanged (with the same event $\Omega$). More generally, any constraint can be added in the argmin defining $\hat{\tau}$, provided that $\tau^\star$ satisfies this constraint.

Proof of Theorem 3.2
We follow the strategy described at the beginning of Section 6. Throughout the proof, we write τ 2 as a shortcut for τ (D τ , δ n ).
Since we assumed $\frac{1}{n} d$

By the triangle inequality,

hence $j = k$. Next, we show that $\varphi$ is increasing. Take $i, j \in \{1, \ldots, D_1 - 1\}$ such that $i < j$. Recall that $\tau^k_\bullet$ is increasing ($k = 1, 2$). Then

Hence $\varphi(i) < \varphi(j)$, so $\varphi$ is increasing. As a consequence, $\varphi$ is injective and we get $D_1 \le D_2$. The same argument, exchanging $\tau^1$ and $\tau^2$, shows that $D_2 \le D_1$. Therefore, $D_1 = D_2$ and $\varphi$ is an increasing permutation of $\{1, \ldots, D_1 - 1\}$, hence it is the identity. As a consequence, $d_\infty(\tau^1, \tau^2)$. Finally, since $d$

Take $i$ and $j$ two distinct elements of $\{1, \ldots, D - 1\}$, and suppose that $\varphi(i) \cap \varphi(j)$ is non-empty. Let $k$ be any element of $\varphi(i) \cap \varphi(j)$. By the triangle inequality and the definition of $d$

Hence, the $\varphi(i)$ are disjoint and we can write

From now on, we identify $\varphi(i)$ with its unique element. Let us show that $\varphi$ is increasing, similarly to what we have done for proving (i). Take $i, j \in \{1, \ldots, D - 1\}$ such that $i < j$. We showed that

thus according to the definition of $\Lambda_{\tau^1}$, and our assumption,

Hence $\varphi(i) < \varphi(j)$: $\varphi$ is increasing. As a consequence, $d_H(\tau$

We start by proving a general formula for $d_F$, which is stated by Lajugie, Arlot and Bach [35]; we prove it here for completeness:

Indeed, by definition, we have

and $\mathrm{Tr}(\Pi$

, where we denoted by $\lambda^k(i)$ the segment of $\tau^k$ to which $i \in \{1, \ldots, n\}$ belongs.
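A trace computation of this kind can be checked numerically. The sketch below assumes that $d_F(\tau^1, \tau^2)$ is the Frobenius norm of $\Pi_{\tau^1} - \Pi_{\tau^2}$, where $\Pi_\tau$ averages within each segment, and compares it with the expression $D_1 + D_2 - 2\,\mathrm{Tr}(\Pi_{\tau^1} \Pi_{\tau^2})$, the trace being computed from the sizes of the segment overlaps. This follows from $\Pi_\tau^2 = \Pi_\tau$ and $\mathrm{Tr}(\Pi_\tau) = D_\tau$, but it is not claimed to be written in the same form as Eq. (B.1); the helper names are hypothetical.

```python
import numpy as np

def projection_matrix(n, change_points):
    """Orthogonal projection onto vectors that are constant on each segment."""
    bounds = [0, *change_points, n]
    proj = np.zeros((n, n))
    for a, b in zip(bounds[:-1], bounds[1:]):
        proj[a:b, a:b] = 1.0 / (b - a)
    return proj

def frobenius_distance(n, cp1, cp2):
    """Direct computation: Frobenius norm of the difference of projections."""
    diff = projection_matrix(n, cp1) - projection_matrix(n, cp2)
    return np.linalg.norm(diff, ord="fro")

def frobenius_distance_from_overlaps(n, cp1, cp2):
    """Same quantity via D1 + D2 - 2 Tr(Pi_1 Pi_2), using segment overlaps."""
    b1, b2 = [0, *cp1, n], [0, *cp2, n]
    trace = 0.0
    for a1, e1 in zip(b1[:-1], b1[1:]):
        for a2, e2 in zip(b2[:-1], b2[1:]):
            overlap = max(0, min(e1, e2) - max(a1, a2))
            trace += overlap ** 2 / ((e1 - a1) * (e2 - a2))
    n_seg1, n_seg2 = len(cp1) + 1, len(cp2) + 1
    return np.sqrt(max(n_seg1 + n_seg2 - 2.0 * trace, 0.0))

d_direct = frobenius_distance(10, [3, 7], [4, 7])
d_overlap = frobenius_distance_from_overlaps(10, [3, 7], [4, 7])
print(d_direct, d_overlap)
```

The overlap version avoids building $n \times n$ matrices, which matters when $n$ is large.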
B.2.2. Proof of Eq. (3.4)

Eq. (3.4) is stated by Lajugie, Arlot and Bach [35]. The upper bound is a straightforward consequence of Eq. (B.1). We prove the lower bound here for completeness. We remark that

The lower bound follows since $\tau^1$ and $\tau^2$ play symmetric roles.

B.3. Lower bounds on the approximation error
This section provides the proofs of Lemmas 6.2 and 6.3.

B.3.1. Preliminary lemma
We start with a lemma useful in the two proofs.
Suppose now that the result is proved for all $D_{\tau^\star} \in \{2, \ldots, p\}$ and consider a change-point problem $(\tau^\star, \mu^\star)$ with $D_{\tau^\star} = p + 1$ and $n \ge p + 1$. Let $D < p + 1$ and some segmentation $\tau \in \mathcal{T}_n^D$ be fixed. Then one of these two scenarios occurs: (i) there exists $\lambda^\star_i$ with $2 \le i \le D_{\tau^\star} - 1$ that does not contain any change-point of $\tau$, or (ii) $\lambda^\star_2, \ldots, \lambda^\star_{D_{\tau^\star} - 1}$ all contain a change-point of $\tau$.

Case (i). Suppose that there exists an inner segment $\lambda^\star_i$ of $\tau^\star$, $2 \le i \le D_{\tau^\star} - 1$, that does not contain any change-point of $\tau$ (see Figure 7). Therefore, there exists $k \in \{1, \ldots, D\}$ such that $\lambda^\star_i \subset \lambda_k$. By definition, there are $i - 1$ change-points of $\tau^\star$ to the left of $\lambda^\star_i$ and $k - 1$ change-points of $\tau$ to the left of $\lambda^\star_i$. Suppose that $k < i$. We define $\tau^\bullet$ as the segmentation obtained by adding $\tau^\star_i$ to $\tau$ (see Figure 7). Then

Restricting $\tau^\bullet$ to a segmentation of $\{1, 2, \ldots, \tau^\star_i\}$ in $k$ segments and $\tau^\star$ to a segmentation of $\{1, 2, \ldots, \tau^\star_i\}$ in $i$ segments, and defining $\mu = (\mu^\star_1, \ldots, \mu^\star_{\tau^\star_i}) \in \mathcal{H}^{\tau^\star_i}$, we get back to a situation covered by the induction since $i \le D_{\tau^\star} - 1$ and $k < i$. So,

and we get the result since $\mu^\star - \mu^\star_\tau$

A symmetric reasoning can be applied if $k \ge i$, considering change-points to the right of $\lambda^\star_i$ and using that $D - k + 1 < D_{\tau^\star} - i + 1$ since $D < D_{\tau^\star}$.
First, notice that

We recognize the right-hand side of Eq. (6.7) up to $1/x^2$. For any $r > 1$, let us denote by $A_r$ the event

and by $A_1$ the event $\|\varepsilon_1\|_{\mathcal{H}} \ge x$. These events are disjoint, thus we can write

Finally, let $\ell \le r < k$ be integers. Since $\varepsilon_\ell$ is independent from $\varepsilon_k$ conditionally on $\sigma(\varepsilon_1, \ldots, \varepsilon_r)$, $\varepsilon_\ell$ is independent from $\varepsilon_k$ conditionally on $A_r$. Furthermore, $\varepsilon_k$ is independent from $A_r$ and

Because of this relation and the positivity of the (real) conditional expectation, for any integers $r \le k \le j$,

Therefore, $\mathbb{E}[\zeta \mid A_r] \ge x^2$, which gives $\mathbb{E}[\zeta] \ge x^2 \mathbb{P}(A_r)$. This concludes the proof, thanks to Eq. (B.8).

Remark B.1. The independence between $\varepsilon_j$ and $\varepsilon_k$ for $j \neq k$ yields $\mathbb{E}[\langle \varepsilon_j, \varepsilon_k \rangle_{\mathcal{H}}] = 0$. Indeed, a conditional expectation is available on $\mathcal{H}$ [19, chapter 5], which satisfies the same properties as the conditional expectation with real random variables. Hence we can write

Note that the expectation of each $\varepsilon_j$ vanishes by hypothesis.

Fig 4. In red, the three piecewise constant functions used in the simulations of Section 4. In blue, a noisy version of these functions. Both $\mu_1$ and $\mu_2$ have 4 jumps; $\mu_3$ has 9 jumps.

Fig 7. Proof of Lemma 6.2, Case (i): $\lambda^\star_i$ is a segment of $\tau^\star$ that is included in a segment of $\tau$. The segmentation $\tau^\bullet$ is obtained by joining $\tau^\star_i$ to the segmentation $\tau$.

The law of total expectation and the positivity of $\zeta$ yield

$\mathbb{E}[\zeta] \ge \sum_{r=1}^{n} \mathbb{E}[\zeta \mid A_r] \, \mathbb{P}(A_r)$. (B.8)