On change-point estimation under Sobolev sparsity

Abstract: In this paper, we consider the estimation of a change-point for possibly high-dimensional data in a Gaussian model, using a maximum likelihood method. We are interested in how dimension reduction can affect the performance of the method. We provide an estimator of the change-point that has a minimax rate of convergence, up to a logarithmic factor. The minimax rate is in fact composed of a fast rate (dimension-invariant) and a slow rate (increasing with the dimension). Moreover, it is proved that, considering the case of sparse data with a Sobolev regularity, there is a bound on the separation of the regimes above which there exists an optimal choice of dimension reduction, leading to the fast rate of estimation. We propose an adaptive dimension reduction procedure based on Lepski's method and show that the resulting estimator attains the fast rate of convergence. Our results are then illustrated by a simulation study. In particular, practical strategies are suggested to perform dimension reduction.


The model
An important problem in the vast domain of statistical learning is the unsupervised classification of high-dimensional data. Many examples fall into this category, such as the classification of curves or images. Here, we will address a framework where the change between classes occurs on a time scale, which casts the problem as a change-point estimation problem.
We consider, for the sake of simplicity, a change-point problem with exactly two classes: we assume that there exists a change-point τ: before nτ, the observations are in a certain state; after nτ, they are in another state. More precisely, we observe independent random vectors Y_1, . . . , Y_n with values in R^d whose mean changes at index nτ. In practice, such a model is obtained for instance in the monitoring of patients, where the variables Y_i are a collection of d biological, chemical and/or clinical measurements collected, say, every ten minutes on a patient, and nτ reflects a time of change in the patient's condition.
Our aim is to estimate the change-point τ . Using the maximum likelihood approach, also known in this context as CUSUM method, we derive rates of convergence for the estimation of τ , under conditions specified below.
For high-dimensional data, from a computational point of view, there is an obvious need for dimension reduction when estimating τ. Without such a step, the segmentation algorithm might be unstable or even not work at all. Here, we will consider the dimension reduction problem from a theoretical point of view (as opposed to the algorithmic point of view). From a theoretical point of view, one might suspect that it should always be better to keep the whole data, to get the best precision on the estimation of the change-point. In fact, we show that this intuition is not correct. Addressing this dimension reduction problem can require sophisticated tools directly connected to smoothing questions in nonparametric estimation. In particular, as will be seen throughout the paper, sparsity assumptions, as well as adaptive smoothing methods, can be directly borrowed from nonparametric statistical inference and fruitfully applied in this context.
To sum up, our goal is to answer the following questions: (a) Setting aside technical feasibility, is there a theoretical gain in reducing the dimension, which somehow reduces to the choice between two options: ignoring a part of the data, versus keeping all the data? (b) If the data is high-dimensional but "sparse", is there a way to use this sparsity to get better results? (c) If dimension reduction proves to be theoretically more efficient, how could it be performed? Do usual nonparametric smoothing methods work well in a change-point problem? (d) Does on-line (signal by signal) dimension reduction perform as well as off-line dimension reduction (using a preprocessing involving all the signals)?

Literature review and related work
The change-point problem has a long history, going back at least to [49]. For an introduction to the domain, the reader may refer for instance to the monographs and articles [53], [51], [46], [7], [10], [12], [17] or [27]. Change-point detection has many practical applications, ranging from genetics [47] or health [55] to the aerospace industry [25]. Note that high-dimensional change-point problems may occur in a wide range of areas. This is the case for instance in the analysis of traffic network data ([43, 44]), in bioinformatics, when studying copy-number variation ([8, 60]), in functional magnetic resonance imaging (fMRI) ([2]), in astrostatistics ([9, 56, 45]) or in multimedia indexing ([22]). In these practical applications, the number of observations is relatively small compared to their dimension, with the change-point possibly occurring only in a few components.
Various change-point methods and settings have been developed in the literature, in particular in the univariate case. For instance, [33] introduced the pruned exact linear time method, [21] wild binary segmentation, and [20] a simultaneous multiscale change-point estimator, whereas change-point estimation based on resampling has been investigated in [19] and [1]. Some multivariate extensions are described among others in [28], [48], [3] and [34]. Dependent sequences are considered for instance in [23], [24], [4], [6] and [38]. Regarding the high-dimensional context, [31] considers several dependent change-point tests and studies the behavior of the maximum over all test statistics as both the sample size and the number of tests tend to infinity. [15] propose a sparse version of binary segmentation. [5], [26] and [14] are interested in the high-dimensional context of panel data. The high-dimension problem is addressed through changes in cross-covariance in [39], [3], [11], [50], [16]. In [54], convex optimization is used to perform regularization for solving the high-dimensional change-point problem. In [13, 52], graph-based approaches which are efficient in high dimension are designed. [30] proposes a method based on a statistic inspired by Hotelling's T² statistic. [18] consider high-dimensional change-point detection from the testing point of view.
In this paper, we will study the performance of the procedure from a minimax point of view, in a high-dimensional context. This approach provides an evaluation of the best achievable performance in a particular framework, and the aim is then to provide a procedure attaining this performance.
Minimax estimation was considered already in [36], in the Gaussian white noise model. High-dimensional change-point problems are also studied in [35], which proposes an asymptotically minimax estimator of the change-point location when the Euclidean norm of the gap tends to infinity as the dimension d goes to infinity.
Our approach is deeply connected to this paper and can be considered as a continuation of this project. The main difference is that in [35], the authors do not address the dimension reduction problem and do not consider the same estimation method. Moreover, the authors of [35] assume that the change-point only occurs after a known number of observations and before another known number of observations. This is a crucial difference, since knowing that some observations are in the first or last state makes it possible to estimate this state efficiently. Without this assumption, we do not have this opportunity, which adds a difficulty to the problem.
Another related reference is the paper by [59], who proposed a two-stage procedure based on a projection followed by a univariate change-point estimation algorithm applied to the projected data, providing rates of convergence for the estimator of the change-point location.

Outline of the paper
The paper is organized as follows. We begin Section 2 by introducing the change-point model. We also present the problem of dimension reduction and the maximum likelihood estimator of the change-point. We prove that, for a fixed dimension, the maximum likelihood method attains the minimax rate of convergence up to a logarithmic term. Let us point out that we do not know whether this logarithmic term is necessary or not. Indeed, as explained earlier, in [35] "the edges are known" (and the estimation method uses this knowledge), meaning that the minimax rate is established in the case where the change-point cannot occur before or after a known proportion of the observations. Our method is agnostic to this knowledge, which creates obvious additional difficulties. Moreover, we show that if the data is sparse, in a Sobolev sense, there exists an optimal dimension reduction, depending on the sparsity constants.
Of course, these constants are not known in practice. The aim of Section 3 is to provide a procedure which behaves as well as if the sparsity constants were known. To attain this optimal projection dimension in an adaptive way, we provide a method relying on Lepski's method. The Lepski method (see for instance [40, 41, 42] and Section 3.1 for more details) is one of the best-known methods for obtaining adaptivity in various functional estimation settings, such as the white noise model, regression, or density estimation. In these models, minimax optimality is linked with the regularity assumptions imposed on the functions being estimated. Adaptation methods provide ways to dispense with this knowledge and still perform optimally. Note that the proposed method has the advantage of being performed off-line, before the main segmentation step.
Numerical experiments are provided in Section 4. Section 5 is devoted to the proofs.

Change-point model and assumptions
Let n ≥ 3. We observe n independent signals Y_1, . . . , Y_n. We assume that each signal Y_i, i = 1, . . . , n, is a d-dimensional vector: for every i, Y_i = (Y_{i1}, . . . , Y_{id}) ∈ R^d. We suppose that there exist a change-point 0 < τ < 1 and two vectors θ_- and θ_+ of R^d, such that the model is given by

Y_i = θ_- 1_{i ≤ nτ} + θ_+ 1_{i > nτ} + η_i,    i = 1, . . . , n,    (1)

where the η_i's are i.i.d. Gaussian vectors N(0, σ² I_d).

Remark 1.
More than one change. Dealing with a finite and known number N ≥ 2 of change-points would not change the theoretical results but would add unnecessary complexity to the proofs.
Connection to Functional Data Analysis: White Noise model. Theoretically, Model (1) is directly connected to the model in [35], where the observation is a sequence of white noise models:

dX_i(u) = μ_i(u) du + σ_0 dZ_i(u),    u ∈ [0, 1],    i = 1, . . . , n,

where the Z_i's are i.i.d. Gaussian white noises. The goal is to locate a change occurring in the functions μ_i (i.e. ∀i ≤ nτ, μ_i(u) = μ_-(u), and ∀i > nτ, μ_i(u) = μ_+(u), for all u ∈ [0, 1]). To connect this model to (1), we simply project the observations on an orthonormal basis (Φ_ℓ)_{ℓ≥1} of L²([0, 1]), limiting the observation to the first d projections, with the following formula: Y_{iℓ} = ∫_0^1 Φ_ℓ(u) dX_i(u), ℓ = 1, . . . , d. The white noise model may seem a bit theoretical, but it is standardly used in nonparametric statistics as an approximation to more refined models, such as functional regression models (as detailed in the following paragraph) or even density estimation models.

Connection to Functional Data Analysis: Regression analysis. Suppose that our data consist of n independent curves discretely observed on [0, 1] on a grid of step size 1/d:

Y_{ij} = μ_i(j/d) + σ_0 Z_{ij},    j = 1, . . . , d,
where the Z_{ij}'s are i.i.d. Gaussian N(0, 1) variables. The goal again is to locate a change occurring in the functions μ_i (i.e. ∀i ≤ nτ, μ_i(u) = μ_-(u), and ∀i > nτ, μ_i(u) = μ_+(u), ∀u ∈ [0, 1]). Then, again, this model can be connected to (1) with the help of an orthonormal basis (Φ_ℓ)_{ℓ≥1} of L²([0, 1]), writing

Y'_{iℓ} = (1/d) Σ_{j=1}^d Y_{ij} Φ_ℓ(j/d) = θ_{iℓ} + r_{iℓ} + η_{iℓ},

where θ_{iℓ} = ∫_{[0,1]} μ_i(u) Φ_ℓ(u) du, the variables η_{iℓ} are i.i.d. N(0, σ_0²/d), and the r_{iℓ} are deterministic quantities describing the difference between the Riemann sum and the integral, which become negligible for d large enough under standard regularity assumptions (not discussed here) on the functions μ_i and Φ_ℓ.

Heteroscedasticity. For the sake of simplicity, the covariance matrix of the noise η_i in (1) is chosen proportional to the identity. In various examples, it could be reasonable to choose a covariance of the form σ² J, where J is a known matrix different from the identity. Similarly, in the FDA example, if the Z_i's are i.i.d. Gaussian processes but not white noise, then taking an orthonormal basis of L²([0, 1]) would not necessarily lead to a covariance matrix proportional to the identity. Then, a simple change of variables like J^{-1/2} Y_i would lead to a similar behavior, provided appropriate regularity assumptions are made on the parameters J^{-1/2} θ_+ and J^{-1/2} θ_-.
Variance. We suppose here σ² to be known. Note that σ² may depend on d or on n. For instance, as in one of the previous examples, it may be of the form σ_0²/d, where σ_0² is an absolute and known constant. Gaussianity. Considering Gaussian noise is a useful simplification, but this assumption is not crucial. We essentially need concentration inequalities, and similar results may likely be obtained under sub-Gaussian hypotheses on the errors. In that case, the considered estimator of τ is no longer a maximum likelihood estimator, but simply a CUSUM estimator.
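To fix ideas, Model (1) can be simulated as follows. This is a minimal Python sketch; the particular values of n, d, τ, the means and σ below are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

def simulate_model1(n, d, tau, theta_minus, theta_plus, sigma, rng):
    """Draw n independent d-dimensional signals from Model (1):
    mean theta_minus up to index n*tau, mean theta_plus afterwards,
    plus i.i.d. N(0, sigma^2 I_d) noise."""
    assert theta_minus.shape == (d,) and theta_plus.shape == (d,)
    k = int(n * tau)  # last index of the first regime
    means = np.vstack([np.tile(theta_minus, (k, 1)),
                       np.tile(theta_plus, (n - k, 1))])
    return means + sigma * rng.standard_normal((n, d))

rng = np.random.default_rng(0)
d = 5
Y = simulate_model1(n=100, d=d, tau=0.3, theta_minus=np.zeros(d),
                    theta_plus=np.ones(d), sigma=1.0, rng=rng)
print(Y.shape)  # (100, 5)
```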

Estimation method
We are interested in the behavior of the maximum likelihood estimator, also called in this context the CUSUM estimator:

τ̂(p) ∈ argmin_{t ∈ {2/n, ..., (n−2)/n}} min_{x^-, x^+ ∈ R^p} [ Σ_{i ≤ nt} ||Y_i^{(p)} − x^-||² + Σ_{i > nt} ||Y_i^{(p)} − x^+||² ],

where Y_i^{(p)} denotes the vector of the first p coordinates of Y_i. To prove some of our results, we will need the following sparsity conditions.
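In practice, this least-squares/CUSUM estimator can be computed by scanning all candidate change points. Here is a brute-force Python sketch (variable names are ours; a real implementation would update the segment means incrementally rather than recomputing them):

```python
import numpy as np

def tau_hat(Y, p):
    """Least-squares / CUSUM change-point estimator restricted to the
    first p coordinates: for each candidate k, fit constant means before
    and after k, keep the k minimizing the residual sum of squares."""
    n = Y.shape[0]
    Z = Y[:, :p]
    best_k, best_rss = 2, np.inf
    for k in range(2, n - 1):            # candidate change at t = k/n
        m_before = Z[:k].mean(axis=0)    # minimizer x^- (sample mean)
        m_after = Z[k:].mean(axis=0)     # minimizer x^+ (sample mean)
        rss = ((Z[:k] - m_before) ** 2).sum() + ((Z[k:] - m_after) ** 2).sum()
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k / n

# toy check: clear change at tau = 0.3 affecting all 10 coordinates
rng = np.random.default_rng(1)
n, d = 200, 10
Y = np.where(np.arange(n)[:, None] < int(0.3 * n), 0.0, 2.0) \
    + rng.standard_normal((n, d))
print(tau_hat(Y, p=10))
```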

Condition on the means
For s > 0 and L > 0, we define

Θ(s, L) = { θ ∈ R^d : Σ_{j=1}^d j^{2s} θ_j² ≤ L² }.

We will suppose that θ_- and θ_+ are in Θ(s, L).

Remark 2.
This assumption expresses a form of sparsity of the coefficients which is standard in nonparametric settings. It corresponds to conditions directly connected to the regularity of the function to be estimated in nonparametric estimation (see [57] for a general introduction), so this condition is easily interpretable in the FDA cases mentioned above. In the more general setting of a high-dimensional physical observation, it is commonly accepted to solve learning problems by introducing sparsity constraints (see for instance [32] or [29], among many others). These constraints can take various forms: we deliberately chose here the Sobolev type, which is among the simplest to handle technically. Note that it reflects an ordering: the first coefficients are supposed to be more important than the last ones. This is quite a reasonable assumption since, generally, modeling high-dimensional or functional data via a basis expansion results in such a situation. However, other types of structural sparsity could be investigated (such as coefficients with finite support), which would lead to different methods, in particular for the adaptivity part.
Note that there are possible extensions to other kinds of sparsity, considering for instance coefficients belonging to ℓ_q-type bodies with q < 1. Note that this choice requires more sophisticated smoothing algorithms.
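Concretely, assuming the Sobolev-ellipsoid form Σ_j j^{2s} θ_j² ≤ L² for Θ(s, L) (the standard choice, consistent with the bias bound of order L² p^{-2s} used later in the paper), membership can be checked numerically:

```python
import numpy as np

def in_theta(theta, s, L):
    """Check the Sobolev-type constraint sum_j j^(2s) theta_j^2 <= L^2
    (coordinates indexed from 1); the exact form of Theta(s, L) is an
    assumption here."""
    j = np.arange(1, len(theta) + 1)
    return float(np.sum(j ** (2.0 * s) * theta ** 2)) <= L ** 2

theta = 1.0 / np.arange(1, 21) ** 2      # rapidly decaying coefficients
print(in_theta(theta, s=1.0, L=2.0))     # True: sum_j j^2 * j^-4 < 4
```

The ordering built into the constraint is what makes projection on the first p coordinates a sensible dimension reduction: the discarded tail carries at most L² p^{-2s} of squared signal.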
To end this section, we introduce the following important parameter:

ε = min(τ, 1 − τ).

As we are in a non-degenerate case (there is a change), ε is a strictly positive quantity, which measures the potential lack of information at the border of the interval [0, 1]. It is important to notice that the theoretical performances of the procedures will depend on ε. However, the procedure is agnostic to ε, which therefore will not be supposed to have a known lower bound.

Dimension reduction for the estimation of τ
Our aim is to determine whether or not it is efficient to perform a dimension reduction when estimating the change-point τ. More specifically, we will investigate the effect of replacing the vectors Y_i by their projections Y_i^{(p)} = (Y_{i1}, . . . , Y_{ip}) on the first p coordinates, for p ≤ d. For each projection dimension p, we define the estimator τ̂(p) obtained by applying the maximum likelihood method of the previous section to the projected data Y_1^{(p)}, . . . , Y_n^{(p)}.

In the sequel, we will use the notation Δ² = Σ_{j=1}^d (θ_j^+ − θ_j^-)². We also define, for p ≤ d,

Δ_p² = Σ_{j=1}^p (θ_j^+ − θ_j^-)²   and   Ψ_n(p, Δ_p) = σ²/(nΔ_p²) + σ⁴ p/(nΔ_p²)².

The rate Ψ_n is plotted as a function of n and p in Figure 1. The next result describes the behavior of the estimated change-point τ̂(p).

Remark 3.
No sparsity required. Note that no condition on the sparsity of θ + and θ − is needed for this result.
Minimaxity. Thanks to [35], one can observe that Ψ_n(p, Δ_p) is the minimax rate in this framework. Compared to their result, we are apparently losing a logarithmic factor. However, it is important to stress that in [35], a fixed lower bound on ε is supposed to be known, whereas our estimator τ̂(p) is adaptive in ε.
In other words, the procedure τ̂(p) attains the minimax rate, possibly up to a logarithmic factor. About this logarithmic factor, it is also worth noticing that ln(n) could be substituted by any sequence r_n, provided that the factor n^{-γ} is simultaneously replaced by exp(−c r_n) in Proposition 1. (Here, c is a constant which can be made explicit on closer inspection of the proof.) ε-dependence. Looking carefully at the proofs, the constants c(ε, γ) and κ(ε, γ) can be taken proportional to (γ + 1)/ε². This remark proves that no condition is required on the proximity of the change-point to one extremity of the interval of observation. It also shows that the dependence on ε is of polynomial form. An interesting point, beyond the aim of this paper, would be to investigate whether ε² is the optimal rate. One comment that can be made is that our proofs are especially adequate when ε is fixed (does not depend on n).
The rate is composed of two different regimes: a "fast" one, σ² ln(n)/(nΔ²), which does not depend on the dimension d, and a "slow" one, σ⁴ ln(n) d/(nΔ²)², which deteriorates rapidly with the dimension. From the results above, we deduce that if nΔ² ≥ c(γ, ε) σ² d, the slow term is dominated and the rate reduces to σ² ln(n)/(nΔ²). This last rate is obviously much better, and with this latter condition on Δ, taking p = d (so raw data) allows one to obtain the best rate σ² ln(n)/(nΔ²). Taking a smaller p could lead to a reduction of Δ_p, damaging the rate.
However, this latter condition is quite restrictive on Δ when d is large. In the next paragraph, we will try to refine this condition, gaining on the size p of the projection. Without assumptions on the behavior of the parameters θ_+ and θ_-, there is not much to hope for about the way Δ_p increases with p. At this stage, it is fruitful to introduce sparsity assumptions. If we assume that the means θ_- and θ_+ belong to Θ(s, L), then, for p such that Δ² ≥ 8L² p^{-2s}, Δ_p and Δ are comparable, in the sense that Δ_p² ≥ Δ²/2. This is precisely what is exploited in the first part of Theorem 1 below. Let us observe that if Δ_p and Δ are comparable (in the sense above), then Ψ_n(p, Δ_p) ∼ Ψ_n(p, Δ) becomes much easier to analyse. In particular, we see that, again, it is composed of two regimes (a slow one and a fast one), and the dependence on p becomes easily understandable: the fast term σ² ln(n)/(nΔ²) is dimension-free, while the slow term grows linearly in p. This corresponds to what may be observed in practical applications: when the dimension p increases, one first observes an improved convergence of τ̂, then the rate remains stable for a while, and then convergence deteriorates again.
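The comparability step can be checked numerically: once the tail Σ_{j>p} δ_j² falls below Δ²/2, the truncated gap Δ_p² stays within a factor 2 of Δ². A toy example with a hypothetical polynomially decaying gap:

```python
import numpy as np

delta2 = 1.0 / np.arange(1, 101) ** 1.2   # squared gap coordinates (toy decay)
Delta2 = delta2.sum()                      # full squared gap Delta^2
Delta2_p = np.cumsum(delta2)               # truncated gaps Delta_p^2, p = 1..100
# smallest p with Delta_p^2 >= Delta^2 / 2, i.e. tail below Delta^2 / 2
p_comp = int(np.argmax(Delta2_p >= Delta2 / 2)) + 1
print(p_comp)
```

Beyond this threshold, projecting costs at most a factor 2 in Δ² while removing the dimension penalty of the slow term.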
Note that two different convergence rates have also been highlighted in other change-point settings, for instance in [58].
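The two regimes can also be seen numerically. Reading the two terms of the rate as fast = σ² ln(n)/(nΔ²) and slow = σ⁴ d ln(n)/(nΔ²)², the fast term dominates exactly when nΔ² ≥ σ² d; the values below are arbitrary illustrations:

```python
import numpy as np

sigma2, d, n = 1.0, 1000, 500
for Delta2 in (0.5, 2.5, 8.0):
    fast = sigma2 * np.log(n) / (n * Delta2)
    slow = sigma2 ** 2 * d * np.log(n) / (n * Delta2) ** 2
    # the slow term is dominated exactly when n * Delta2 >= sigma2 * d
    print(Delta2, fast >= slow, n * Delta2 >= sigma2 * d)
```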

Minimax convergence rate under sparsity condition
The following theorem is an immediate consequence of Proposition 1 in the case where one assumes the sparsity condition on the means.

Theorem 1. We consider Model (1), with the means θ_+ and θ_- in Θ(s, L). For any γ > 0, there exist constants κ(γ, ε) and c(γ, ε) such that, if Δ² ≥ 8L² p^{-2s} and nΔ²/σ² ≥ c(γ, ε) ln(n), then, with probability at least 1 − c n^{-γ},

|τ̂(p) − τ| ≤ κ(γ, ε) ln(n) Ψ_n(p, Δ).

If, now,

Δ² ≥ 16 L² p^{-2s} + 2 c(γ, ε) σ² ln(n) p / n,    (3)

then, with the same probability,

|τ̂(p) − τ| ≤ κ(γ, ε) σ² ln(n) / (nΔ²).

Here, c is an absolute constant.
It is important to notice that a big difference with Proposition 1 is that the rate is Ψ_n(p, Δ) instead of Ψ_n(p, Δ_p), which is more "honest" in a sense. The price to pay is then, as is intuitive, that p should be large enough. In the second statement, we look at the condition on Δ and p required to obtain the fast rate. For Δ fixed, we see that p must be neither too large nor too small. Condition (3) contains two terms: one increasing in p, one decreasing. Hence it can be optimized, leading to the choice

p_s ∼ ( L² n / (σ² ln(n)) )^{1/(1+2s)},    (4)

up to constants depending on γ and ε. We obtain the next corollary, corresponding to this projection dimension p_s.

A. Fischer and D. Picard
The interpretation is that the quantity 2 c(γ, ε) σ² ln(n) p_s / n is the minimal squared gap Δ² between the two regimes ensuring that the "fast" rate σ² ln(n)/(nΔ²) can be obtained, with an appropriate projection dimension.

Remark 4.
1. We see here that there is an obvious advantage in reducing the dimension, since it allows one to obtain the best rate under less restrictive conditions on the gap Δ.
2. We observe that the greater Δ, the faster the rate of convergence of τ̂, which is quite natural.
3. At first sight, the rate of convergence and the conditions could seem quite unsatisfactory, but observe that very often σ² is of the form σ_0²/d (as in the FDA examples above). In this case, the fast rate is of the order σ_0² ln(n)/(ndΔ²), which improves with the dimension d.
4. Formula (4) indicates that the optimal p depends on the sparsity constant s, which is rarely known.
5. If we now look for a procedure searching for an optimal p in an adaptive way (without knowing the regularity s), some comments can be made before proposing a solution. In particular, one may ask whether it is possible to optimize individually (on each signal Y_i of R^d), or if it is necessary to perform an off-line preprocessing (requiring the use of all the signals).
The form of the optimal projection dimension p_s ∼ (nd)^{1/(1+2s)} (up to constants and logarithmic factors, when σ² = σ_0²/d) allows us to answer this question. Indeed, any adaptive smoothing performed individually on each signal Y_i (such as thresholding, lasso...) would lead at best to a dimension of the form p_opt = (d/σ_0²)^{1/(1+2s)}, which would induce a loss of a polynomial factor in n in the rates. This means that it is obviously more efficient to find a procedure performing the smoothing globally (off-line).

Fast rate of convergence: Adaptive choice of p
The message of the section above is the following: for a multichannel signal with sparsity conditions, there is a lower bound on Δ above which there exists an efficient choice of p, leading to the fast rate of estimation σ² ln(n)/(nΔ²) for the parameter τ. However, in Corollary 1, this choice depends on the knowledge of the regularity parameter s. An essential question then is to construct an adaptive procedure, that is, to design a strategy still performing optimally without knowledge of the regularity.
There are several ways to give an answer to this question and one can for instance look at the procedure introduced in [59], which proves adaptivity under slightly different conditions.
Our preference here will be to draw inspiration from nonparametric statistics, which provides many adaptive procedures, and in particular from Lepski's method. This will lead to a procedure which is quite simple, and interesting in itself in this context.

Lepski's procedure
Let us recall, as already mentioned in the Introduction, that the Lepski method ([40, 41, 42]) is a strategy allowing one to obtain adaptivity in various functional estimation settings, such as the white noise model, regression or density estimation. In these models, minimax optimality is linked with the regularity assumptions imposed on the functions being estimated. In these nonparametric problems, a balance has to be struck between a "variance" term, typically of the form p/n, and a "bias" term, typically of the form p^{-2s}. The Lepski procedure proposes to choose the minimal p among those such that an estimated version of the bias is below a bound.
For the sake of clarity, let us first recall the classical Lepski procedure in the standard Gaussian white noise model. Note that it will not be described in the original form presented in the first papers, which corresponds to kernel estimation methods. Here, the procedure is adapted to orthogonal series estimation methods, which is more suitable for a transposition to our case. Consider the following model:

Z_j = β_j + ε_j,    j = 1, . . . , d,    (5)

where the ε_j's are i.i.d. N(0, ν²). The Lepski procedure for choosing the optimal projection dimension p consists in defining p̂ as the smallest p such that an estimated version of the residual bias Σ_{j>p} β_j² lies below a noise-driven bound, where C_L is a tuning constant of the procedure. In our change-point setting, a transformation of the data is necessary to fall into the frame of Model (5). We will apply the Lepski method to a surrogate data vector built on the whole observation.
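As an illustration, here is a generic Lepski-type selector for the sequence model (5) in Python. The precise threshold used below (C_L ν² (q − p) ln d) is one common calibration, stated here as an assumption; the paper's exact rule may differ in its normalization.

```python
import numpy as np

def lepski_dimension(Z, nu2, C_L):
    """Generic Lepski-type rule for Z_j = beta_j + eps_j: return the
    smallest p such that, for every q > p, the empirical energy of
    coordinates p+1..q is compatible with pure noise, up to the
    threshold C_L * (q - p) * nu2 * ln(d) (one plausible calibration)."""
    d = len(Z)
    Z2 = Z ** 2
    for p in range(1, d + 1):
        acc, ok = 0.0, True
        for q in range(p + 1, d + 1):
            acc += Z2[q - 1]
            # noise alone contributes about (q - p) * nu2 in expectation
            if acc > C_L * (q - p) * nu2 * np.log(d):
                ok = False
                break
        if ok:
            return p
    return d

rng = np.random.default_rng(0)
d = 50
beta = np.concatenate([np.full(5, 2.0), np.zeros(d - 5)])  # 5 big coefficients
Z = beta + 0.1 * rng.standard_normal(d)                    # nu^2 = 0.01
print(lepski_dimension(Z, nu2=0.01, C_L=10.0))             # 5
```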

Preprocessing
Using the complete data set (so off-line), we define a surrogate data vector, which will be used to find the selected dimension p̂. We assume, for the sake of simplicity, that n is even; otherwise, the modifications are elementary.
We set, for j = 1, . . . , d,

Z_j = (2/n) ( Σ_{i=1}^{n/2} Y_{ij} − Σ_{i=n/2+1}^{n} Y_{ij} ),

up to normalization. This vector Z = (Z_j)_{1≤j≤d} is a special case of Model (5), where β_j is proportional to θ_j^- − θ_j^+ (with a factor depending on τ) and the noise level ν² is of order σ²/n.
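A plausible implementation of this preprocessing (our reading, stated as an assumption: the exact normalization in the paper may differ) contrasts the two halves of the sample:

```python
import numpy as np

def surrogate_vector(Y):
    """Half-sample contrast: difference of coordinatewise means over the
    first and second halves of the sample.  If tau is bounded away from
    the boundary, E Z_j is proportional to theta^-_j - theta^+_j, and the
    noise level is of order sigma^2 / n, as in the sequence model (5)."""
    half = Y.shape[0] // 2
    return Y[:half].mean(axis=0) - Y[half:].mean(axis=0)

# toy check with a change at tau = 1/4: here E Z_j = (theta^-_j - theta^+_j)/2
rng = np.random.default_rng(2)
n, d = 400, 6
theta_plus = np.array([2.0, 2.0, 0.0, 0.0, 0.0, 0.0])
Y = np.where(np.arange(n)[:, None] < n // 4, 0.0, 1.0) * theta_plus \
    + 0.5 * rng.standard_normal((n, d))
Z = surrogate_vector(Y)
print(np.round(Z, 2))
```

The point of working off-line is visible here: Z aggregates all n signals, so its noise level decreases with n, which an individual per-signal smoothing could not achieve.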

Adaptive convergence rate
We consider the Lepski procedure applied to the vector Z, producing a projection dimension p̂. This parameter is then simply plugged into the maximum likelihood procedure for estimating τ. It is well known that estimating the regularity of a signal is impossible without strong extraneous assumptions, but Lepski's procedure provides a projection dimension p̂ which, with overwhelming probability, is smaller than the optimal p_s (defined in (4) above) and such that the bias Δ² − Δ_p̂² is controlled, which is precisely what is needed here.
The following theorem states that the method leads to an optimal selection, up to logarithmic terms. As announced, Lepski's method allows an adaptive choice of p: though the optimal p s is unknown, we are able to achieve the same convergence rate as in Corollary 1 with p = p s .

Remark 5.
A thorough examination of the proof in Section 5.2 shows that the constant C L only needs to be "large enough" (see (10)). Obviously, there is no point in pretending that the bound in (10) is optimal. Hence, C L is to be considered as a tuning constant of the method. The theorem proves that if the constant is large enough, then an optimal result is obtained.

Numerical study
In this section, we provide some simulations illustrating our theoretical results.

Rate of convergence
In this experiment, we study the rate of convergence of the estimator τ̂. Let d = 20, p = 10, σ = 1, τ = 0.3. Let us consider data generated from Model (1) with the means θ_- and θ_+ obtained from the following distribution: θ_j^- ∼ N(0, 1/(20j²)), θ_j^+ ∼ N(−θ_j^-, 10^{-4}). To get a first insight about the rate of convergence, we simulate 1000 times a sample of length n, for n chosen between 20 and 4000, and plot in Figure 2 the mean and median of the error |τ − τ̂| over the 1000 trials as a function of n, together with the function n ↦ ln(n) Ψ_n(p, Δ_p) corresponding to the theoretical rate of convergence obtained in Proposition 1. Note that the rate of convergence of |τ − τ̂| is given in the proposition up to a constant κ(γ, ε). Nevertheless, the figure provides an appropriate illustration of the result as soon as n is large enough. Then, simulating 1000 samples for each value of the sample size n between 500 and 4000, we estimate the rate of convergence by regressing ln|τ − τ̂| on ln(n): omitting the logarithmic factor, an exponent −1 is expected, corresponding to the rate of convergence 1/n. Figure 3 provides an illustration of this linear regression, considering again the mean and the median over the 1000 trials. In this example, the estimated slope of the regression line is −1.172 for the mean and −1.098 for the median.

Selection of p
In Theorem 2, we suggest to select p using Lepski's method. Before introducing a practical procedure for the selection of p, let us illustrate the fact that the performance of the estimatorτ may indeed vary a lot as a function of p, so that selecting the right p is a crucial issue in the estimation of τ .

We simulated 5000 data sets according to Model (1) in each of the two cases. Figures 4 and 5 show the mean and median error |τ̂ − τ| over the 5000 trials as a function of p. In the first case, the best result is obtained already with p = 1, whereas for the second, taking p around 30 is a good choice.
Theorem 2 provides a theoretical way to select p. However, the statement depends on a tuning constant C_L. In practice, it is simpler to select p directly. In the sequel, two such procedures are investigated, yielding two estimators p̂_1 and p̂_2.
• Method 1. This method is often used to search for tuning constants in adaptive methods. The idea is to find a division of the set {1, . . . , d} into {1, . . . , p̂_1} and its complement, where the two subsets correspond to two "regimes" for the data, one with "big coefficients", the other with small ones.
This quantity V is computed for every p = 1, . . . , d, and the value p̂_1 is chosen accordingly. Indeed, this procedure, by searching for a change-point along Z_1, . . . , Z_d, should separate the first, most significant differences θ_j^- − θ_j^+, j = 1, . . . , p̂_1, from the remaining ones, expected to be less significant for estimating τ, so that keeping all components up to p̂_1 for the estimation seems a reasonable choice.
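The excerpt does not reproduce the exact statistic V; one plausible reading of Method 1, sketched below with hypothetical details, runs a one-dimensional least-squares change-point search along the squared surrogate coordinates Z_1², . . . , Z_d²:

```python
import numpy as np

def select_p_method1(Z):
    """Split Z_1^2, ..., Z_d^2 into a 'big coefficients' head and a
    'small coefficients' tail via a least-squares change-point fit, and
    return the head length as the selected dimension p_hat_1."""
    x = Z ** 2
    d = len(x)
    best_p, best_rss = 1, np.inf
    for p in range(1, d):
        m1, m2 = x[:p].mean(), x[p:].mean()
        rss = ((x[:p] - m1) ** 2).sum() + ((x[p:] - m2) ** 2).sum()
        if rss < best_rss:
            best_p, best_rss = p, rss
    return best_p

Z = np.concatenate([np.full(5, 2.0), np.full(45, 0.05)])  # 5 big coordinates
print(select_p_method1(Z))  # 5
```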
• Method 2. The second idea is more computationally involved and based on subsampling. When performing subsampling, the indices drawn at random are sorted, so that the parameter of interest τ remains approximately unchanged. For each p = 1, . . . , d, we compute τ̂(p) for a collection of subsamples. Then, p̂_2 is set to the value of p minimizing the variance of τ̂ over all subsamples. Here, 100 subsamples are built, each of them containing 80% of the initial sample.
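Method 2 can be sketched as follows, a simplified implementation under stated assumptions: τ̂(p) is the least-squares scan estimator, the subsample indices are sorted as described, and the (small) sizes below are chosen only to keep the example fast.

```python
import numpy as np

def tau_hat(Y, p):
    """Least-squares change-point estimate on the first p coordinates."""
    n = Y.shape[0]
    Z = Y[:, :p]
    best_k, best_rss = 2, np.inf
    for k in range(2, n - 1):
        rss = ((Z[:k] - Z[:k].mean(axis=0)) ** 2).sum() \
            + ((Z[k:] - Z[k:].mean(axis=0)) ** 2).sum()
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k / n

def select_p_method2(Y, n_sub=20, frac=0.8, rng=None):
    """For each p, recompute tau_hat on sorted random subsamples and
    return the p whose estimates have the smallest variance."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = Y.shape
    m = int(frac * n)
    var_p = np.empty(d)
    for p in range(1, d + 1):
        taus = [tau_hat(Y[np.sort(rng.choice(n, size=m, replace=False))], p)
                for _ in range(n_sub)]
        var_p[p - 1] = np.var(taus)
    return int(np.argmin(var_p)) + 1

rng = np.random.default_rng(3)
n, d = 80, 4
Y = np.where(np.arange(n)[:, None] < 24, 0.0, 1.0) * np.array([3.0, 0, 0, 0]) \
    + rng.standard_normal((n, d))
p2 = select_p_method2(Y, rng=np.random.default_rng(4))
print(p2)
```

Sorting the subsample indices is the design choice that keeps τ approximately invariant across subsamples, so that the variance of τ̂(p) reflects the stability of the projection dimension rather than a shift of the target.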

Remark 6.
Proportions of data from 50% to 90% have also been tried, with quite similar results. Observe that picking a rather small proportion of data for subsampling could be interesting, since it provides more variability between the subsamples; but, at the same time, the fact that the ratio between the dimension d and the sample size is modified may be problematic when the aim is to select p. We also considered a version of subsampling where a different subsampling index set is drawn for every p = 1, . . . , d: again, this provides more variability in the subsamples, but τ may also vary more than in the classical version. The results were not significantly different.
The performance of the two methods is compared with the result obtained using the value of p minimizing the average value of |τ − τ̂(p)| over a large number of trials, called hereafter the oracle p (here, p = 30, as obtained above for 5000 trials). Of course, the oracle p is not available in practice, since it depends on the true τ. However, it is introduced as a benchmark. The results, corresponding to 1000 trials for Model B, are shown in Figure 6 and Table 1. The performance of the proposed methods could seem unsatisfactory in absolute terms. Nevertheless, the data has deliberately been chosen difficult to segment. Indeed, to illustrate the selection of p, it seems more appropriate to consider a high-dimensional, hard situation, rather than an easy one where the true τ is always found exactly. Observe that the two methods perform very similarly, with a slight advantage of Method 2 over Method 1. However, Method 2 is based on subsampling and, as such, consumes more CPU time.

Proof of Proposition 1
Our proof will heavily rely on standard concentration inequalities for Gaussian and chi-square distributions, detailed in the Appendix (see Section A).
In the sequel, for the sake of simplicity, we will assume that nτ ∈ {2, . . . , n − 2}. This will not have any consequence on the result but will save us the repeated use of integer parts. Also, in this proof, Ψ_n(p, Δ) will be replaced by Ψ_n to lighten notation. The proof uses several lemmas, whose proofs are given at the end of the section. First, Lemma 1 shows that τ̂ may be written using Gaussian and chi-square random variables.

Lemma 1. Under Model (1), the estimator τ̂(p) may be written as follows:

The expression obtained for τ̂(p) is then used in the sequel for building an upper bound for P(|τ̂(p) − τ| ≥ λΨ_n). We will only evaluate the probability

P( inf_{k : k/n − τ ≥ λΨ_n} K_p(k/n) < K_p(τ) )

in what follows, since the second term can be treated in a symmetrical way.

Lemma 4 provides a control of the term P N . For P V,W , 3 different cases will be considered, addressed in Lemmas 5, 6, and 7 below. Lemma 4 (Control of P N ). The following inequality holds: To get an upper bound for the term P V,W , let us begin with the case where nΔ 2 p ≤ 32pσ 2 /ε 2 . Moreover, the situation where σ 4 p (nΔ 2 p ) 2 ≥ 1 λ will be addressed first.
The more intricate situation where σ 4 p is considered in the next lemma.
Lemma 6 (Control of P_{V,W}, case 2). Assume that nΔ_p² ≤ 32pσ²/ε² and that σ_p⁴/(nΔ_p²)² < 1/λ. Then,

Lemma 7 provides an upper bound for P_{V,W} in the case where nΔ_p² ≥ 32pσ²/ε².

Lemma 7 (Control of P_{V,W}, case 3). Assume that nΔ_p² ≥ 32pσ²/ε². Then,

End of the proof of Proposition 1. Collecting the results of the different lemmas, we see that P(|τ̂ − τ| ≥ λΨ_n) may be upper bounded by a sum of terms, all of the form

where c_1 denotes an absolute constant and c_2 is polynomial in ε. Recalling that nΔ_p²/σ² ≥ c(γ, ε) ln(n), and taking λ = κ(γ, ε) ln(n), this proves Proposition 1.
Proof of Lemma 1. Let us denote by P_{(τ,θ⁺,θ⁻)} the probability distribution associated with Model (1). We will consider the behavior of our estimators under the probability P_{(τ,θ⁺,θ⁻)}. Using the notation x⁺ = (x⁺_1, . . . , x⁺_p), and similarly x⁻ = (x⁻_1, . . . , x⁻_p), observe that τ̂ may be defined in the following way:

Here, the function L is given (for t ∈ {2/n, . . . , (n−2)/n}) by

As an aside, not used in the sequel, note that writing

highlights the fact that we are actually searching for a maximum likelihood estimator.

Let us consider the case t ≥ τ; the other case can be treated in a symmetrical way. For t ≥ τ, and under the distribution P_{(τ,θ⁺,θ⁻)}, we may write

where δ = (δ_1, . . . , δ_p) is the vector θ⁺ − θ⁻. Now, we have to minimize the previous expression in (x⁻, x⁺). By differentiation, we obtain that the minimum is attained, for every j, at the values given in (7) and (8). Plugging the minimizers (7) and (8) into expression (6) leads to the minimum

Under P_{(τ,θ⁺,θ⁻)}, K_p(t) can be written in the following way:

where

Proof of Lemma 2. We write

Proof of Lemma 3. We have
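As a hedged illustration of the profiling step in this proof (the exact centering and normalization used in (6)–(8) may differ from what is shown here), minimizing the least-squares criterion coordinate-wise yields the segment means:

```latex
\min_{x^-,\,x^+}\;\Big\{\sum_{i\le nt}\,\sum_{j\le p}\big(Y_{i,j}-x_j^-\big)^2
  \;+\;\sum_{i> nt}\,\sum_{j\le p}\big(Y_{i,j}-x_j^+\big)^2\Big\},
\qquad
\widehat{x}_j^{\,-} = \frac{1}{nt}\sum_{i\le nt} Y_{i,j},
\quad
\widehat{x}_j^{\,+} = \frac{1}{n(1-t)}\sum_{i> nt} Y_{i,j},
```

so that K_p(t) is the corresponding residual sum of squares, and minimizing K_p(t) over t coincides with maximizing the profiled Gaussian likelihood.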
Proof of Lemma 4. Using the Gaussian concentration inequality (12) recalled in the Appendix, we may write

Proof of Lemma 5. We have

Using concentration results for the chi-square distribution recalled in Corollary 2 in the Appendix, we obtain

The last-but-one inequality follows from the assumption σ_p⁴/(nΔ_p²)² ≥ 1/λ.

Proof of Lemma 6. We have

Let us compute

Similarly,

As a consequence, we get

In the last two bounds, we have first applied Corollary 2, then used the fact that we are in the case nΔ_p² ≤ 32pσ²/ε².

Thus,

For P_{W_2}, let us denote by F the σ-algebra spanned by the variables {η_{i,j}, i ≤ nτ, j ≤ p}. We write

Conditionally on F, the random variable

Hence,

We used here nΔ_p² ≤ 32pσ²/ε² together with Corollary 2. To end the proof, we investigate the term P_{V_2}. For k ∈ {nτ + nλΨ_n, . . . , n − 2}, let F_k denote the σ-algebra spanned by the variables {η_{i,j}, i > k, j ≤ p}.
Conditionally on F_k, the random variable

follows a centered normal distribution with variance proportional to σ²(k − nτ). Using Gaussian concentration (12), we write:

Hence,

Again, we used the assumption nΔ_p² ≤ 32pσ²/ε² and Corollary 2.

Proof of Lemma 7. The proof is very similar to that of Lemma 6. Recall that

To bound P_V and P_W, we will use the next inequalities:

and

As a consequence, we get

Likewise,

Lemma 8. We consider Model (1), and assume that θ⁺ and θ⁻ belong to Θ(s, L). We suppose that there exists a constant α > 0 such that

Then, for any γ, if C_L is large enough (see condition (10) below), there exists a constant R = R(γ, L, C_L, ε) (see condition (11)) such that, if

then we have

where c is an absolute constant.
The proof is based on an intermediate lemma, stating that, with large probability, p̂ ≤ p_s.

Lemma 9. Under the conditions above, for any γ, if we have

then

Recall that Z is defined by

Lemma 10. The following inequality holds, as soon as x ≥ 10(k − ℓ + 1)σ²/n:
Proof of Lemma 10. Observe that Σ_{j=ℓ}^{k} ε_j β_j follows a Gaussian distribution N(0, (σ²/n) Σ_{j=ℓ}^{k} (β_j)²), so that, using the concentration of the Gaussian distribution (see (12) in the Appendix) and the fact that

Using Corollary 2 in the Appendix, we get

as soon as x ≥ 10(k − ℓ + 1)σ²/n.
Proof of Lemma 9. We have

Thus, if C_L ≥ 4L², we have Σ_{j=ℓ}^{k} (β_j)² ≤ (C_L/2) k (σ²/n) ln(d ∨ n). We get, with 2x := C_L k (σ²/n) ln(d ∨ n), the following inequality:

Equipped with Lemma 9, let us go back to the proof of Lemma 8.
Proof of Lemma 8. To simplify the exposition, let us suppose that τ ≥ 1/2; the other case can be treated similarly, with elementary modifications.
Now, we again use Gaussian and chi-square concentration results, as in the proof of Lemma 10.

Now, nΔ²/σ² ≥ R ln(d ∨ n) p_s. Hence, for R large enough, the right-hand terms admit an upper bound of the order n^{−γ}. Combining these bounds with Lemma 9, we get the desired result, as soon as Conditions (10) and (11) are satisfied.
Proof of Theorem 2. We use Lemma 8, the definition of p_s, and Proposition 1. For any γ, γ′, we have

as soon as R is large enough, which proves the theorem.

Appendix A: Concentration inequalities
Gaussian concentration

If N ∼ N(0, 1), then it is well known that, for x > 0,

Exponential inequality for the chi-square distribution

The next lemma is proved in [37].
Lemma 11. Let k be a positive integer and let U have a χ² distribution with k degrees of freedom. For z > 0,

In this paper, the following form of the result is used.

Corollary 2. Let k be a positive integer and let U have a χ² distribution with k degrees of freedom. For 0 < x ≤ 4k,

For x > 0,

Consequently,
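As a numerical sanity check (not part of the proofs), tail bounds of this kind can be verified by Monte Carlo. Since the displays above are not reproduced here, the specific forms checked below are standard statements assumed for illustration: P(N ≥ x) ≤ e^{−x²/2} for the Gaussian bound, and the Laurent–Massart form P(U ≥ k + 2√(kz) + 2z) ≤ e^{−z} for the chi-square upper tail.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 500_000  # Monte Carlo sample size

# Gaussian tail: P(N >= x) <= exp(-x^2 / 2) for x > 0 (assumed form of (12)).
N = rng.standard_normal(m)
for x in (1.0, 2.0, 3.0):
    assert (N >= x).mean() <= np.exp(-x * x / 2)

# Chi-square upper tail (Laurent-Massart form, assumed for Lemma 11):
# P(U >= k + 2*sqrt(k*z) + 2*z) <= exp(-z).
k, z = 20, 3.0
U = rng.chisquare(k, m)
assert (U >= k + 2 * np.sqrt(k * z) + 2 * z).mean() <= np.exp(-z)
print("empirical tails respect the stated bounds")
```

Both bounds hold with a comfortable margin at these parameter values, which is consistent with their repeated use in the proofs above.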