Change-point detection in high-dimensional covariance structure

In this paper we introduce a novel approach to the important problem of break detection. Specifically, we are interested in the detection of an abrupt change in the covariance structure of a high-dimensional random process -- a problem with applications in many areas, e.g., neuroimaging and finance. The developed approach is essentially a testing procedure requiring the choice of a critical level. To that end a non-standard bootstrap scheme is proposed and theoretically justified under mild assumptions. The theoretical study features a result providing guarantees for break detection. All the theoretical results are established in a high-dimensional setting (dimensionality $p \gg n$). The multiscale nature of the approach allows for a trade-off between sensitivity of break detection and localization accuracy. The approach can be naturally employed in an on-line setting. A simulation study demonstrates that the approach matches the nominal level of false alarm probability and exhibits high power, outperforming a recent approach.


Introduction
The analysis of high-dimensional time series is crucial for many fields including neuroimaging and financial engineering. There, one often has to deal with processes involving abrupt structural changes which necessitate a corresponding adaptation of a model and/or a strategy. Structural break analysis comprises determining whether an abrupt change is present in the given sample and, if so, estimating the change-point, namely the moment in time when it takes place. In the literature both problems may be referred to as change-point or break detection. In this study we use the terms break detection and change-point localization respectively in order to distinguish between them. The majority of approaches to the problem consider only a univariate process [14] [2]. However, in recent years the interest in multi-dimensional approaches has increased. Most of them cover the case of fixed dimensionality [31] [29] [1] [41] [42]. Some approaches [11,28,12] feature high-dimensional theoretical guarantees, but only the case of dimensionality growing polynomially in the sample size is covered. The case of exponential growth has not been considered so far. In order to detect a break, a test statistic is usually computed for each point t (e.g. [31]). A break is detected if the maximum of these values exceeds a certain threshold. A proper choice of the latter may be a tricky issue. Consider a pair of plots (Figure 1) of the statistic A(t) defined in Section 2. It is rather difficult to see how many breaks there are, if any. The classic approach to the problem is based on the asymptotic behaviour of the statistic [14] [2] [1] [28] [7] [42]. As an alternative, permutation [28] [31] or parametric bootstrap [28] may be used. Clearly, it is attractive to choose the threshold in a purely data-driven way, as suggested in the recent paper [11], but a careful bootstrap validation is still an open question.
In the current study we are interested in a particular kind of a break -- an abrupt transformation of the inverse covariance matrix -- which is motivated by applications to neuroimaging. The covariance structure of data in functional Magnetic Resonance Imaging has recently drawn a lot of interest, as it encodes so-called functional connectivity networks [40], which refer to the explicit influence among neural systems [22]. A rather popular approach to inferring these networks is based on estimating inverse covariance or precision matrices [21]. The technique generally makes use of the observation that functional connectivity networks are of small-world type [40], which makes sparsity assumptions feasible. Analysing the dynamics of these networks is particularly important for research on neural diseases and also in the context of brain development, with emphasis on characterizing the re-configuration of the brain during learning [4].
A similar problem is found in finance: the dynamics of the covariance structure of a high-dimensional process modelling exchange rates and market indexes is crucial for a proper asset allocation in a portfolio [13,6,16,33].
One approach to change-point localization is developed in [29]; the corresponding significance testing problem is considered in [1]. However, neither of these papers addresses the high-dimensional case.
A widely used break detection approach (named CUSUM) [12,1,28] suggests computing a statistic at a point t as a distance between estimators of some parameter of the underlying distributions obtained using all the data before and after that point. This technique requires the whole sample to be known in advance, which prevents it from being used in an on-line setting. In order to overcome this drawback we propose a method ideologically similar to MOSUM [5] [15]: choose a window size n ∈ N and compute parameter estimators using only n points before and n points after the central point t (see Section 2 for a formal definition). The window size n is an important parameter and its choice is case-specific (see Section 4 for a theoretical treatment of this issue). Using a small window results in high variability and low sensitivity, while a large window implies higher uncertainty in change-point localization, yielding the issue of a proper choice of window size. The multiscale nature of the proposed method enables us to incorporate the advantages of narrower and wider windows by considering multiple window sizes at once: wider windows provide higher sensitivity while narrower ones improve change-point localization. Moreover, the local nature of the proposed statistic allows for detection of multiple change-points if the change-points are not too close to each other (see Section 4). The contribution of our study is the development of a novel break detection approach which is
• high-dimensional, allowing for up to exponential growth of the dimensionality with the window size
• suitable for an on-line setting
• suitable for detection of multiple change-points
• multiscale, attaining a trade-off between break detection sensitivity and change-point localization accuracy
• using a fully data-driven threshold selection algorithm rigorously justified under mild assumptions
• featuring formal sensitivity guarantees in a high-dimensional setting
We consider the following setup. Let $X_1, ..., X_N \in \mathbb{R}^p$ denote a sample of independent zero-mean vectors (the on-line setting is discussed in Section 3). We want to test the hypothesis of no break versus an alternative suggesting the existence of a break, and to localize the change-point τ as precisely as possible or (in the on-line setting) to detect a break as soon as possible.
V. Avanesov, N. Buzun/Change-point detection in high-dimensional covariance structure

The approach proposed in the paper focuses on applications in neuroimaging. Independence of the vectors X_i holds only approximately in practice, but functional connectivity network analysis, which assumes temporal independence, has proven to be very successful and validated [35].
In the current study it is also assumed that some subset of indices $I_s \subseteq 1..N$ of size s (possibly s = N) is chosen. The thresholds are chosen relying on the sub-sample $\{X_i\}_{i \in I_s}$, while the test statistic is computed based on the whole sample.
To this end we define a family of test statistics in Section 2.1, which is followed by Section 2.2 describing a data-driven (bootstrap) calibration scheme and Section 2.3 describing the change-point localization procedure. The theoretical part of the paper justifies the proposed procedure in a high-dimensional setting. The result justifying the validity of the proposed calibration scheme is stated in Section 3. Section 4 is devoted to the sensitivity result yielding a bound for the window size n necessary to reliably detect a break of a given extent, hence bounding the uncertainty of the change-point localization (or the delay of detection in the on-line setting). The theoretical study is supported by a comparative simulation study (described in Section 5) demonstrating the conservativeness of the proposed test and its higher sensitivity compared to the other algorithms, and by an analysis of real-world datasets (Section 6). Appendix A contains a finite-sample version of the sensitivity result along with the proofs. Appendix B provides a finite-sample version of the bootstrap validity result, which is followed by the proofs. Finally, Appendix H lists results which were essential for our theoretical study.

Proposed approach
This section describes the proposed approach along with a data-driven calibration scheme. Informally the proposed statistic can be described as follows. Provided that the break may happen only at a moment t, one could estimate some parameter of the distribution using n data points to the left of t, estimate it again using n data points to the right, and use the norm of their difference as a test statistic A_n(t). Yet in practice one does not usually possess such knowledge; therefore we propose to maximize these statistics over all possible locations t, yielding A_n. Finally, in order to attain a trade-off between break detection sensitivity and change-point localization accuracy, we build a multiscale approach: consider a family of test statistics $\{A_n\}_{n \in \mathcal{N}}$ for multiple window sizes $n \in \mathcal{N} \subset \mathbb{N}$ at once.
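The informal recipe above can be sketched in a few lines. The covariance estimator below is only a simplified stand-in for the de-sparsified precision estimator defined in Section 2.1, and all names are illustrative:

```python
import numpy as np

def window_stat(X, t, n, estimate):
    """A_n(t): distance between parameter estimates computed over the n
    points to the left and to the right of the central point t."""
    left = estimate(X[t - n:t])          # indices {t-n, ..., t-1}
    right = estimate(X[t:t + n])         # indices {t, ..., t+n-1}
    return np.max(np.abs(left - right))  # sup-norm of the difference

def multiscale_stats(X, windows, estimate):
    """A_n = max over central points t of A_n(t), for each window size n."""
    N = len(X)
    return {n: max(window_stat(X, t, n, estimate)
                   for t in range(n, N - n + 1))
            for n in windows}

# Stand-in parameter estimator: the empirical covariance of the window.
est = lambda W: W.T @ W / len(W)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
A = multiscale_stats(X, [20, 50], est)   # one statistic per window size
```

Under H0 the values A_n fluctuate around a noise level; the calibration scheme of Section 2.2 decides how large they must be before a break is declared.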

Definition of the test statistic
Now we present a formal definition of the test statistic. In order to detect a break we consider a set of window sizes N ⊂ N. Denote the size of the widest window as n + and of the narrowest as n − . Given a sample of length N , for each window size n ∈ N define a set of central points T n := {n + 1, ..., N − n + 1}.
Next, for all n ∈ N define the set of indices belonging to the window on the left side of the central point t ∈ T_n as $I^l_n(t) := \{t - n, ..., t - 1\}$ and, correspondingly, for the window on the right side define $I^r_n(t) := \{t, ..., t + n - 1\}$. Denote the sum of the numbers of central points over all window sizes n ∈ N by $T_\Sigma := \sum_{n \in \mathcal{N}} |T_n|$. For each window size n ∈ N, each central point t ∈ T_n and each side S ∈ {l, r} we define a de-sparsified estimator of the precision matrix [26] [27] as
$$\hat T^S_n(t) := \hat\Theta^S_n(t) + \hat\Theta^S_n(t)^T - \hat\Theta^S_n(t)^T \hat\Sigma^S_n(t) \hat\Theta^S_n(t),$$
where $\hat\Sigma^S_n(t)$ is the empirical covariance matrix of the corresponding window and $\hat\Theta^S_n(t)$ is a consistent estimator of the precision matrix, which can be obtained by graphical lasso [37] or the node-wise procedure [27] (see Definition 3.1 and Appendix H.5 for details). Note that symmetry of $\hat\Theta^S_n(t)$ is not required, yet $\hat T^S_n(t)$ is symmetric by construction. Now define a matrix of size p × p with elements $Z_{i,uv}$, where $\Theta^* := (\mathbb{E}[X_i X_i^T])^{-1}$ for i ≤ τ and $\Theta^*_u$ stands for the u-th row of $\Theta^*$. Denote the variances of these elements by $\sigma^2_{uv} := \mathrm{Var}[Z_{1,uv}]$ and introduce the diagonal matrix $S = \mathrm{diag}(\sigma_{1,1}, \sigma_{1,2}, ..., \sigma_{p,p-1}, \sigma_{p,p})$. Denote a consistent estimator (see Definition 3.1 for details) of the precision matrix $\Theta^*$ obtained from the sub-sample $\{X_i\}_{i \in I_s}$ by $\hat\Theta$. In practice the variances $\sigma^2_{uv}$ are unknown, but under a normality assumption one can plug in $\hat\sigma^2_{uv} := \hat\Theta_{uu}\hat\Theta_{vv} + \hat\Theta^2_{uv}$, which have been proven to be consistent (uniformly over all u and v) estimators of $\sigma^2_{uv}$ [26] [3]. If the node-wise procedure is employed, the uniform consistency of an empirical estimate of $\sigma^2_{uv}$ has been shown under some mild assumptions (not including normality) [27].
For each window size n ∈ N and a central point t ∈ T_n we define a statistic
$$A_n(t) := \sqrt{n}\, \big\| S^{-1} \vec{\big(\hat T^l_n(t) - \hat T^r_n(t)\big)} \big\|_\infty,$$
where we write $\vec{M}$ for a vector composed of the stacked columns of a matrix M. Finally, we define our family of test statistics for all n ∈ N as
$$A_n := \max_{t \in T_n} A_n(t).$$
Our approach heavily relies on the following expansion under H_0:
$$\hat T^S_n(t) - \Theta^* = \frac{1}{n} \sum_{i \in I^S_n(t)} Z_i + R^S_n(t),$$
where the residual term $R^S_n(t)$ is negligible. The main reason why we prefer to use the de-sparsified estimators $\hat T^S_n(t)$ over the ℓ1-penalized estimators $\hat\Theta^S_n(t)$ is that the former allow for the expansion (2.1), which underpins the bootstrap procedure we suggest and makes the theoretical analysis possible.
This expansion might have been used in order to investigate the asymptotic properties of A n and obtain the threshold, however we propose a data-driven scheme.
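Putting the pieces together, a minimal numpy sketch of the statistic may look as follows. The ridge-regularized inverse is only a stand-in for a proper ℓ1-penalized precision estimator, and the normalization is schematic:

```python
import numpy as np

def desparsify(Theta_hat, Sigma_hat):
    """De-sparsified (de-biased) precision estimator:
    T = Theta + Theta^T - Theta^T Sigma Theta; symmetric by construction."""
    return Theta_hat + Theta_hat.T - Theta_hat.T @ Sigma_hat @ Theta_hat

def A_stat(X_left, X_right, theta_est, S_inv):
    """Normalized sup-norm difference of the two de-sparsified estimators."""
    n = len(X_left)
    Sl = X_left.T @ X_left / n           # empirical covariance, left window
    Sr = X_right.T @ X_right / n         # empirical covariance, right window
    Tl = desparsify(theta_est(Sl), Sl)
    Tr = desparsify(theta_est(Sr), Sr)
    return np.sqrt(n) * np.max(np.abs(S_inv * (Tl - Tr)))

# Stand-in for an l1-penalized precision estimator: shrunk inverse.
theta_est = lambda S: np.linalg.inv(S + 0.1 * np.eye(len(S)))

rng = np.random.default_rng(1)
Xl, Xr = rng.normal(size=(50, 4)), rng.normal(size=(50, 4))
a = A_stat(Xl, Xr, theta_est, S_inv=np.ones((4, 4)))
```

Here `S_inv` plays the role of the elementwise variance normalization by $S^{-1}$; in the actual procedure it would be built from the estimates $\hat\sigma_{uv}$.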
Remark 2.1. A different test statistic can be defined as the maximum distance between elements of the empirical covariance matrices $\hat\Sigma^l_n(t)$ and $\hat\Sigma^r_n(t)$. Such a method would be computationally less burdensome, since it does not involve precision matrix estimation. However, there are extremely efficient implementations of the latter, which makes the computational cost of the calibration procedure dominate. In turn, the calibration complexity is the same for both of the approaches, since matrix inversion is not involved therein (see Section 2.2). Therefore, the computational gain would be negligible. Furthermore, the application to neuroimaging motivates the search for a structural change in a functional connectivity network, which is encoded by the structure of the corresponding precision matrix. Clearly, a change in the precision matrix also means a change in the covariance matrix, though we believe that the definition (2.1) increases the sensitivity to this kind of alternative (see Remark 4.1 for more details).
Remark 2.2. The estimator $\hat T^S_n(t)$ is indeed a de-biased estimator. We can easily rearrange the definition (2.1) as
$$\hat T^S_n(t) = \hat\Theta^S_n(t) - \hat\Theta^S_n(t)^T \big( \hat\Sigma^S_n(t) \hat\Theta^S_n(t) - I \big)$$
in order to represent it as the difference of an ℓ1-penalized estimator and a bias-correcting term.

Bootstrap calibration
Our approach rejects H_0 in favor of H_1 if at least one of the statistics A_n exceeds the corresponding threshold $x^\flat_n(\alpha)$, or formally if $\exists n \in \mathcal{N} : A_n > x^\flat_n(\alpha)$.
In order to properly choose the thresholds, we define bootstrap statistics $A^\flat_n$ in the following non-standard way. Note that we cannot use an ordinary scheme with replacement or weighted bootstrap, since in a high-dimensional case ($|I_s| \le p$) the covariance matrix of the bootstrap distribution would be singular, which would make inverse covariance matrix estimation procedures meaningless.
First, draw with replacement a sequence $\{\kappa_i\}_{i=1}^N$ of indices from $I_s$, where $\mathbb{E}_s$ stands for averaging over values of the index belonging to $I_s$, e.g. $\mathbb{E}_s[X_j] = \frac{1}{|I_s|} \sum_{j \in I_s} X_j$. Denote the measure the $X^\flat_i$ are distributed with respect to as $P^\flat$. In accordance with (2.1) we define the bootstrap counterparts of the statistics $A_n(t)$, which is intuitively reasonable due to expansion (2.1). Finally we define the bootstrap counterpart of $A_n$ as
$$A^\flat_n := \max_{t \in T_n} A^\flat_n(t).$$
Now for each given x ∈ [0, 1] we can define quantile functions $z^\flat_n(x)$ such that
$$P^\flat\big(A^\flat_n > z^\flat_n(x)\big) \le x.$$
Next, for a given significance level α we apply a multiplicity correction, choosing α* as
$$\alpha^* := \sup\big\{ x : P^\flat\big(\exists n \in \mathcal{N} : A^\flat_n > z^\flat_n(x)\big) \le \alpha \big\},$$
and finally choose the thresholds as $x^\flat_n(\alpha) := z^\flat_n(\alpha^*)$.

Remark 2.3. In order to detect multiple breaks we suggest repeating the calibration procedure after each detected break using a portion of the data acquired after the break.
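The calibration scheme (quantile functions $z^\flat_n$ and the multiplicity-corrected level α*) can be sketched as follows, assuming the bootstrap statistics $A^\flat_n$ have already been simulated; the grid search and the Gumbel stand-in draws are purely illustrative:

```python
import numpy as np

def calibrate(A_boot, alpha):
    """Given bootstrap draws A_boot[b, k] of A^b_n (one column per window
    size), return multiplicity-corrected thresholds x^b_n(alpha)."""
    def z(x):                      # per-window quantile functions z^b_n(x)
        return np.quantile(A_boot, 1 - x, axis=0)

    def joint_exceed(x):           # P^b(exists n in N: A^b_n > z^b_n(x))
        return np.mean((A_boot > z(x)).any(axis=1))

    # alpha* = sup{x : joint exceedance prob <= alpha}, crude grid search
    grid = np.linspace(1e-4, alpha, 200)
    ok = [x for x in grid if joint_exceed(x) <= alpha]
    alpha_star = max(ok) if ok else grid[0]
    return z(alpha_star)

rng = np.random.default_rng(2)
A_boot = rng.gumbel(size=(2000, 3))     # stand-in bootstrap statistics
thr = calibrate(A_boot, alpha=0.05)     # one threshold per window size
```

By construction the thresholds are at least as large as the per-window marginal (1 - α)-quantiles, which is exactly the price of the multiplicity correction over the family N.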
Remark 2.4. One can choose $I_s = \{1, 2, ..., N\}$ and use the whole given sample for calibration as well as for detection. In fact, this would improve the bounds in Theorem 3.1 and Theorem 4.1, since it effectively means s = N. However, in practice such a decision might lead to a reduction of sensitivity due to overestimation of the thresholds.

Change-point localization
In order to localize a change-point we have to assume that $I_s \subseteq 1..\tau$. Consider the narrowest window detecting a change-point,
$$\hat n := \min\big\{ n \in \mathcal{N} : A_n > x^\flat_n(\alpha) \big\}, \quad (2.8)$$
and the central point where this window detects a break for the first time,
$$\hat\tau := \min\big\{ t \in T_{\hat n} : A_{\hat n}(t) > x^\flat_{\hat n}(\alpha) \big\}.$$
Clearly, if a non-multiscale version of the approach is employed, i.e. $\mathcal{N} = \{n\}$, then $\hat n = n$ and the precision of localization (the delay of detection in the on-line setting) equals n.
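The localization rule can be sketched directly, with the values A_n(t) precomputed per window size (index 0 of each array corresponding to the first central point t = n + 1):

```python
import numpy as np

def localize(A, thr):
    """A[n]: array of A_n(t) over central points T_n = {n+1, ...}; thr[n]:
    bootstrap threshold x^b_n(alpha). Returns (n_hat, tau_hat) or None."""
    detecting = [n for n in sorted(A) if A[n].max() > thr[n]]
    if not detecting:
        return None                    # no window detects: keep H0
    n_hat = detecting[0]               # narrowest detecting window (2.8)
    first = int(np.nonzero(A[n_hat] > thr[n_hat])[0][0])
    tau_hat = first + n_hat + 1        # map array index back to t in T_n
    return n_hat, tau_hat

# Toy values: both windows detect, the narrower one localizes the break
A = {20: np.array([0.1, 0.3, 2.0, 0.2]), 50: np.array([1.5, 2.5, 0.4, 0.1])}
thr = {20: 1.0, 50: 1.0}
```

The preference for the narrowest detecting window is what turns the multiscale family into a localization gain: any wider window detecting the same break pins it down only up to its own width.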

Bootstrap validity
This section states and discusses the theoretical result demonstrating the validity of the proposed bootstrap scheme, i.e. the fact that the false alarm probability asymptotically attains the prescribed level α.
Our theoretical results require the tails of the underlying distributions to be light. Specifically, we impose

Assumption 3.1 (Sub-Gaussian vector condition). There exists L such that for all i ∈ 1..N the vectors $X_i$ are uniformly sub-Gaussian: $\sup_{\|a\|_2 \le 1} \|\langle a, X_i \rangle\|_{\psi_2} \le L$.

Naturally, in order to establish a theoretical result we have to assume that a method featuring theoretical guarantees was used for estimating the precision matrices. Such methods include graphical lasso [37], adaptive graphical lasso [43] and the thresholded de-sparsified estimator based on the node-wise procedure [27]. These approaches overcome the high dimensionality of the problem by imposing a sparsity assumption, specifically bounding the maximum number of non-zero elements in a row: $d := \max_i |\{j \mid \Theta^*_{ij} \neq 0\}|$. These approaches are guaranteed to yield a root-n consistent estimate revealing the sparsity pattern of the precision matrix [37,3,27], or formally

Definition 3.1. Consider an i.i.d. sample $x_1, x_2, ..., x_n \in \mathbb{R}^p$. Denote their precision matrix as $\Theta^* = (\mathbb{E}[x_1 x_1^T])^{-1}$. Let p and d grow with n. A positive-definite matrix $\hat\Theta^n$ is a consistent estimator of the high-dimensional precision matrix if $\|\hat\Theta^n - \Theta^*\|_\infty = O_P\big(\sqrt{\log p / n}\big)$ and $\forall i, j \in 1..p:\ \Theta^*_{ij} = 0 \Rightarrow \hat\Theta^n_{ij} = 0$.

Graphical lasso and its adaptive versions impose an assumption common for ℓ1-penalized approaches.

Assumption 3.2 (Irrepresentability condition). Denote the active set as $S := \{(i, j) : \Theta^*_{ij} \neq 0\}$.
The interpretation of the irrepresentability condition under a normality assumption is given in [26] [37]. Particularly, Assumption 3.2 requires low correlation between the elements of the empirical covariance matrix from the active set S and from its complement. The higher the constant ψ is, the stricter the assumed upper bound.
While Assumption 3.2 allows for the recovery of the active set, the test statistic does not explicitly require it, since it is based on the de-sparsified estimators. But this assumption is still essential for demonstrating the consistency of estimation of the non-zero elements $\Theta^*_S$ by either graphical lasso or its adaptive versions. Alternatively, one can use the thresholded de-sparsified estimator, for which the theoretical guarantees can be established in the absence of Assumption 3.2.
These observations give rise to the two following assumptions.
Assumption 3.3.A. Suppose either graphical lasso or its adaptive version was used with regularization parameter $\lambda_n \asymp \sqrt{\log p / n}$, and also impose Assumption 3.2.
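Both variants below fix only the scale of the penalty; a small sketch of that scale and of a thresholding step (the constant c and the threshold choice are hypothetical, not the authors' tuning):

```python
import numpy as np

def lam(n, p, c=1.0):
    """Penalty on the theoretical scale lambda_n ~ sqrt(log p / n);
    the constant c is a hypothetical tuning choice."""
    return c * np.sqrt(np.log(p) / n)

def threshold_estimator(T_hat, n, p):
    """Thresholding step for a de-sparsified estimator: zero out entries
    below the lambda_n scale, restoring a sparsity pattern."""
    tau = lam(n, p)
    return np.where(np.abs(T_hat) > tau, T_hat, 0.0)

# Small entries well below the lambda_n scale are removed
out = threshold_estimator(np.array([[1.0, 0.01], [0.01, 1.0]]), n=100, p=50)
```

Note that quadrupling the window size n halves the penalty scale, which is the usual root-n trade-off between variance and bias of the ℓ1 estimate.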
Assumption 3.3.B. Suppose the thresholded de-sparsified estimator based on the node-wise procedure was used with regularization parameter $\lambda_n \asymp \sqrt{\log p / n}$.

Now we are ready to establish a result which guarantees that the suggested bootstrap procedure yields proper thresholds.
Theorem 3.1. Assume H_0 holds and, furthermore, let $X_1, X_2, ..., X_N \in \mathbb{R}^p$ be i.i.d. Let Assumption 3.1 and either Assumption 3.3.A or Assumption 3.3.B hold. Also assume the spectrum of $\Theta^*$ is bounded. Allow the maximal number d of non-zero elements in a row of the matrix $\Theta^*$, the size s of the set $I_s$, the dimensionality p, the number |N| of window sizes being considered, and the maximum and minimum window sizes $n_+$ and $n_-$ to grow with the sample size N. Further let $N > 2n_+$, $n_+ \ge n_-$, and also impose the sparsity assumption. Then the thresholds chosen by the bootstrap procedure asymptotically attain the prescribed false alarm probability:
$$P\big(\exists n \in \mathcal{N} : A_n > x^\flat_n(\alpha)\big) \to \alpha.$$
The finite-sample version of this result, namely Theorem B.1, is given in Appendix B along with the proofs.
Bootstrap validity result discussion Theorem 3.1 guarantees, under mild assumptions (Assumption 3.2 seems to be the most restrictive one, yet it may be dropped if the node-wise procedure is employed), that the first-type error rate meets the nominal level α if the narrowest window size $n_-$ and the set $I_s$ are large enough. Clearly, the dependence on the dimensionality p is logarithmic, which establishes the applicability of the approach in a high-dimensional setting. It is worth noticing that, unusually, the sparsity bound gets stricter with N, but the dependence is only logarithmic. Indeed, we gain nothing from longer samples, since we use only 2n data points each time.
On-line setting As one can easily see, the theoretical result is stated in the off-line setting, where the whole sample of size N is acquired in advance. In the on-line setting we suggest controlling the probability α of raising a false alarm for at least one central point t among N data points (which differs from the classical techniques controlling the mean distance between false alarms [38]). Having chosen α and N, one should acquire s data points (the set $I_s$) and use the proposed bootstrap scheme with bootstrap samples of length N in order to obtain the thresholds. Next the approach can be naturally applied in the on-line setting, and Theorem 3.1 guarantees the capability of the proposed bootstrap scheme to control the aforementioned probability of raising a false alarm.
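The on-line usage can be sketched as a streaming loop; the covariance-difference statistic and the fixed threshold below are simplified stand-ins for A_n(t) and the bootstrap threshold x♭_n(α):

```python
import numpy as np

def online_monitor(stream, n, threshold, stat):
    """Feed points one by one; once 2n points are buffered, test the newest
    central point (n points on each side) and alarm on exceedance."""
    buf = []
    for i, x in enumerate(stream):
        buf.append(x)
        if len(buf) >= 2 * n:
            w = np.asarray(buf[-2 * n:])
            if stat(w[:n], w[n:]) > threshold:
                return i               # index at which the alarm is raised
    return None

def cov_diff(L, R):
    """Stand-in statistic: sup-norm distance of window covariances."""
    return np.max(np.abs(L.T @ L / len(L) - R.T @ R / len(R)))

rng = np.random.default_rng(3)
pre = rng.normal(size=(100, 3))
post = 3.0 * rng.normal(size=(100, 3))      # covariance break at i = 100
alarm = online_monitor(np.vstack([pre, post]), n=30, threshold=2.0,
                       stat=cov_diff)
```

The gap between the alarm index and the true break position illustrates the detection delay, which Theorem 4.1 bounds by the sufficient window size.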
Proof discussion The proof of the bootstrap validity result, presented in Appendix B, mostly relies on the high-dimensional central limit theorems obtained in [10], [9]. These papers also present bootstrap justification results, yet do not include a comprehensive bootstrap validity result. The theoretical treatment is complicated by the randomness of $x^\flat_n(\alpha)$. We overcome it by applying the so-called "sandwiching" proof technique (see Lemma C.1), initially used in [39]. The high-level structure of the proof can be summarized as follows:
1. Approximate the statistics $A_n$ by norms of sub-vectors $\eta_n$ of a high-dimensional Gaussian vector η up to the residual $R_A$, using the high-dimensional central limit theorem by [10].
2. Similarly, approximate the bootstrap counterparts $A^\flat_n$ of the statistics by norms of sub-vectors $\zeta_n$ of a high-dimensional Gaussian vector ζ up to the residual $R_{A^\flat}$.
3. Prove that the covariance matrix Var[ζ] is concentrated in the ball of radius $\Delta_Y$ centered at its real-world counterpart Var[η].
4. By employing the Gaussian comparison result provided by [10] and [9], obtain similarity of the joint distributions of the norms of ζ and η, which in combination with steps 1 and 2 yields similarity of the joint distributions of $A_n$ and $A^\flat_n$.
5. Finally, obtain the bootstrap validity result using the sandwiching Lemma C.1.
The rigorous treatment of steps 1 through 4 is presented in Sections E, F, G and D respectively, while step 5 is formalized in Sections B and C.

Sensitivity and Consistency results
Consider the following setting. Let there be an index τ such that the precision matrix of the $X_i$ equals $\Theta_1$ for i ≤ τ and $\Theta_2$ for i > τ. Define the break extent ∆ as
$$\Delta := \|\Theta_1 - \Theta_2\|_\infty.$$
The question is how large the window size n should be in order to reliably reject H_0, and how precisely we can localize the change-point.

Theorem 4.1. Let Assumption 3.1 and either Assumption 3.3.A or Assumption 3.3.B hold. Also assume the spectra of $\Theta_1$ and $\Theta_2$ are bounded. Allow the maximal number d of non-zero elements in a row of the matrix $\Theta^*$, the size s of the set $I_s$, the dimensionality p, the number |N| of window sizes being considered, and the maximum and minimum window sizes $n_+$ and $n_-$ to grow with the sample size N, and let the break extent ∆ decay with N. Further let $N > 2n_+$, $n_+ \ge n_-$, and let there exist $n^* \in \mathcal{N}$ large enough relative to ∆. Then H_0 will be rejected with probability approaching 1.
This result is a direct corollary of the finite-sample sensitivity result established and discussed in Appendix A.
The assumption I s ⊆ 1..τ is only technical. A similar result may be proven without relying on it by methodologically the same argument. See Remark A.1 for more details.
Next we formulate a trivial corollary of Theorem 4.1 establishing consistency of the change-point estimator $\hat\tau$ defined by (2.3).
Corollary 4.1. Let the assumptions of Theorem 4.1 hold. Moreover, assume a suitable window is being considered.

Sensitivity and consistency results discussion The assumptions of Theorem 4.1 are essentially a sparsity bound and a bound for the sufficient window size $n^*$. Clearly, they do not yield a particular value $n^*$ necessary to detect a break, since it depends on the underlying distributions. The more restrictive assumption of Corollary 4.1 suggests a particular choice of $n^*$ which is suitable in the asymptotic setting. Note, the results include the dimensionality p only under the sign of the logarithm, which guarantees high sensitivity of the test and proper localization of the change-point in a high-dimensional setting.
Online setting Theorem 4.1 and Corollary 4.1 are established in the off-line setting. In the on-line setting they guarantee that the proposed approach can reliably detect a break of an extent not less than ∆ with a delay of at most $n^*$.
Multiple change-point detection Consider a setting which allows for multiple breaks: let there be $n_b$ change-points $\tau_1 < \tau_2 < ... < \tau_{n_b}$ and precision matrices $\{\Theta_j\}_{j=0}^{n_b}$, one for each region between a pair of consecutive change-points. Define the minimal extent of the breaks as $\Delta := \min_j \|\Theta_j - \Theta_{j+1}\|_\infty$. Then, under the assumption that two change-points are not too close to each other, $\min_j |\tau_j - \tau_{j+1}| \ge n^* + s$, we can apply Theorem 4.1 and conclude that the method detects all the change-points with probability approaching 1.
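The multiple-break procedure of Remark 2.3 (re-calibrating on post-break data after each detection) can be sketched as a simple scan; the scalar statistic stream and the trivial re-calibration are illustrative:

```python
def detect_all(stats, thr, recalibrate, s):
    """Scan statistics left to right; after each detected break skip s
    points, reserved for re-estimating the threshold (Remark 2.3)."""
    breaks, t = [], 0
    while t < len(stats):
        if stats[t] > thr:
            breaks.append(t)
            t += s                 # data used to re-calibrate
            thr = recalibrate()    # new threshold from post-break data
        else:
            t += 1
    return breaks

# Two well-separated exceedances in a toy statistic stream
stream = [0.1] * 50 + [3.0] + [0.2] * 60 + [4.0] + [0.1] * 20
found = detect_all(stream, thr=1.0, recalibrate=lambda: 1.0, s=30)
```

The separation condition of the theorem corresponds exactly to the skipped s points plus the window needed to detect the next break.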
Remark 4.1. The sensitivity and consistency results heavily depend on the break extent ∆ defined by (4). As Remark 2.1 suggests, a test statistic based on the ℓ∞ norms of the covariance matrices, not of their inverses, may also be considered. For such a statistic a similar result could be established, but the break extent would have to be correspondingly redefined as an ℓ∞ distance between the covariance matrices instead of between $\Theta_1$ and $\Theta_2$, which would unnecessarily restrict the application of this approach in the field of neuroimaging, where the precision matrix is the main object of interest.

Design
In our simulation we test the hypothesis of no break versus an alternative with a break in the covariance structure. The alternative covariance matrix $\Sigma_1$ was generated in the following way: first we draw $k \sim Poiss(3)$. Graphical lasso with penalization parameter $\lambda_n$ was used in order to obtain $\hat\Theta^S_n(t)$; in the same way, graphical lasso with penalization parameter $\lambda_s$ was used in order to obtain $\hat\Theta$.
We assess the performance of the proposed approach in both on-line and off-line settings. Multiple break detection is left out of the scope because the suggested method attacks the problem by repeatedly detecting the breaks one-by-one in an on-line fashion. In the on-line setting the method is calibrated in order to raise a false alarm with probability α = 0.05 on N data points, using for calibration the set $\{X_i\}_{i \in I_s}$ which is known in advance (unlike the rest of the dataset).

Experiment results
We have also come up with an approach to the same problem not involving bootstrap. The paper [30] defines a high-dimensional two-sample test for the equality of covariance matrices. Moreover, the authors prove asymptotic normality of their statistic, which makes computing p-values possible. We suggest running this test for every t ∈ T_n and every n ∈ N, adjusting the obtained p-values using the Holm method [23] and eventually comparing them against α. This example is considered in order to demonstrate the inapplicability of a straightforward application of a two-sample test to change-point detection.
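The Holm step-down adjustment used for this baseline can be sketched as follows (the p-values are illustrative):

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value with
    alpha / (m - k) (0-based k) and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject

# p-values as would be produced by running the test at every (n, t) pair
decisions = holm_reject([0.001, 0.20, 0.012, 0.9], alpha=0.05)
```

With the number of (n, t) pairs in the thousands, the Holm denominators become severe, which is one intuitive reason why this baseline loses power compared to the bootstrap-calibrated thresholds.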
The paper [31] suggests an approach based on comparing characteristic functions of random variables. The critical values were chosen by a permutation test, as proposed by the authors. In our experiments the method was allowed to consider the whole sample at once. The R-package ecp [25] was used.
In [12] a high-dimensional approach aiming to detect a change-point in the second-order structure of a time series is suggested. In order to investigate its performance in our setting we use the implementation kindly provided by the authors of the paper.
Our approach is implemented as an R-package covcp, which is available on GitHub 1 .
Below we report and discuss results for a particular alternative matrix $\Sigma_1$ generated in the way suggested in Section 5.1. All the methods being considered exhibit comparable performance for other matrices $\Sigma_1$ drawn from the same distribution.
The first-type error rate and power of our approach are reported in Table 1. As one can see, our approach properly controls the first-type error rate in both off-line and on-line settings. In fact, the test is conservative, and we believe this is caused by the ≤ signs entering the definitions of $z^\flat_n(\cdot)$ (2.2) and the multiplicity correction (2.2). As expected, the power of the test is higher for larger windows, and it is decreased by adding narrower windows into consideration, which is the price to be paid for better localization of a change-point. The power of the test is rather similar in the on-line and off-line settings, which is due to its local nature.
In our study the approaches proposed in [31], [12] and the one based on the two-sample test [30] turned out to be conservative, but none of them exhibited power above 0.1.
In order to justify the application of the multiscale approach (i.e. |N| > 1) for the sake of better change-point localization in the off-line setting, we report the distribution of the narrowest detecting window $\hat n$ (defined by (2.3)) over N in Figure 2. Table 1 reports the average precision of change-point localization for various choices of the set of window sizes N. One can see that the multiscale approach significantly improves the precision of localization.
Similarly, in the on-line setting we analyze the delay of detection: the average number of data points after the break which have to be considered before the method detects it. As Table 1 shows, the delay of detection is significantly decreased by the use of the multiscale approach, which justifies its use in the on-line setting as well.

Analysis of real-world datasets
The paper [36] presents a functional magnetic resonance imaging study on human subjects, who had to learn the relation between different auditory stimuli and a monetary reward while being scanned. Based on their recorded performance (e.g. success rate), those of them who have managed to learn the relation within the time course of the experiment are considered learners. The authors have investigated the activity of 4 brain regions-of-interest (ROI) which are believed to be involved in solving problems like the suggested one. Among the learners 3 of the ROIs exhibit a statistically significant change in the Blood Oxygenation Level Dependent (BOLD) response between the first and the last quarters of the experiment.

Table 1: First-type error rate and power of the proposed approach for various sets of window sizes N. Localization precision and delay of detection are also reported for the off-line and on-line settings respectively.
In this paper we analyze the BOLD responses of 18 subjects who were classified as learners. We have considered a finer-grained brain atlas with p = 256 ROIs [19]. For each subject and for each ROI a time course of length N = 1680 was acquired. The datasets for each subject were analyzed separately. We have applied the proposed approach to the residuals of the linear modeling usual in fMRI experiments [35]. The residuals are publicly available 2. We have used graphical lasso with the penalty parameter $\lambda_n = \sqrt{\log p / n}$, a single window size n = 50 was considered, and the first 200 data points were used for bootstrap simulations ($I_s = \{1...200\}$). The proposed approach has detected a change-point in the covariance structure of the residuals at significance level α = 0.05 for each of the subjects.
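The reported analysis settings can be collected in a short sketch; the variable names are ours, and only the numbers come from the text:

```python
import numpy as np

# Settings reported for the per-subject fMRI analysis (names are ours)
N, p = 1680, 256                 # time points, ROIs in the atlas [19]
n = 50                           # single window size
I_s = range(200)                 # first 200 points for bootstrap calibration
lam_n = np.sqrt(np.log(p) / n)   # graphical-lasso penalty lambda_n
centers = range(n, N - n + 1)    # central points scanned by the test
```

With these settings the test scans 1581 central points per subject, so the bootstrap multiplicity correction over central points is doing substantial work even with a single window size.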

Appendix A: Proof of sensitivity result
Proof of Theorem 4.1. The proof consists in applying the finite-sample Theorem A.1. Its applicability is guaranteed by the consistency results given in the papers [37,3,27] and by the results from [26,27,3] bounding the term $R_{\hat T}$. High probability of the set $T^\leftrightarrow$ is ensured by Lemma G.1.

Theorem A.1. Suppose the estimation error is bounded by $R_{\hat T}$ for all S ∈ {l, r} and $t \in T_{n_+}$ on some set $T^\leftrightarrow$. Moreover, let the residual $R_{A^\flat}$ defined in Lemma F.2 be bounded, where $\Delta_Y$ is defined in Lemma G.2. Then on the set $T^\leftrightarrow$, with probability at least a level expressed via $p_{\Sigma_Y^s}(x, q)$ defined in Lemma G.2, H_0 will be rejected.

Discussion of finite-sample sensitivity result
The assumption of Theorem A.1 is rather complicated. Here we note that if either graphical lasso [37], adaptive graphical lasso [43] or the thresholded de-sparsified estimator based on the node-wise procedure [27] with penalization parameter chosen as $\lambda_s \asymp \sqrt{\log p / n}$ was used, then given $d, s, p, N, n_-, n_+ \to \infty$, $N > 2n_+$, $n_+ \ge n_-$, $s \ge n_-$ and $d = o(\sqrt{n_+})$, it boils down to a bound with some positive constant $D_6$ independent of N, N, p, d, s, while the parameters q, γ and x may be chosen as in Appendix B (high probability of $T^\leftrightarrow$ is ensured by Lemma G.1). At the same time the remainder $R_{A^\flat}$ can be bounded as in Appendix B. As expected, the bound for the sufficient window size decreases with growth of the break extent ∆ and of the size of the set $I_s$, but increases with the dimensionality p. It is worth noticing that the latter dependence is only logarithmic. And again, in the same way as with Theorem 3.1, the bound increases with the sample size N (only logarithmically) since we use only 2n data points.
Proof of Theorem A.1. The strategy of the proof is straightforward.

1. Bound the probability of a large deviation of the Gaussian approximation $\|\zeta_{n_+}\|_\infty$ of $A^\flat_{n_+}$ using the Chernoff bound.
2. Bound the corresponding critical level $x^\flat_{n_+}(\alpha)$ up to the remainder term $R_{A^\flat}$, using the approximation guarantees provided by Lemma F.2.
3. Conclude that $A_{n_+} > x^\flat_{n_+}(\alpha)$, and therefore $H_0$ is rejected, by construction of the test statistic $A_{n_+}$.
Lemma A.2 applies here and yields a deviation bound for all positive $q$, where $T_{n_+}$ is the number of central points for the window of size $n_+$. Applying Lemma G.2, on a set of probability at least $1 - p^{\Sigma_Y}_s(x, q)$ the mis-tie is at most $\Delta_Y$; hence, due to the fact that $\|\Sigma^*_Y\|_\infty = 1$ by construction, the bootstrap covariance is bounded by $1 + \Delta_Y$. Due to Lemma F.2 and the continuity of the Gaussian c.d.f., and due to the fact that $T_{n_+} < N$, choosing $q$ as proposed in Theorem A.1 we ensure that $x^\flat_{n_+}(\alpha) \le q$. Finally, by the assumption of the theorem and by construction of the test statistic $A_{n_+}$, we notice that $A_{n_+} > q$, and therefore $H_0$ will be rejected.
Remark A.1. The assumption $I_s \subseteq \{1, \ldots, \tau\}$ is only technical. A similar result may be proven without relying on it by methodologically the same argument. Indeed, if we drop the assumption, the matrix $S$ (depending only on the distribution before the break) will fail to normalize the vectors $Y^{n\flat}_{\cdot j}$, and therefore $\|\Sigma_Y\|_\infty$ will significantly deviate from 1. Yet a bound (omitted for brevity) of the sort $\|\Sigma_Y\|_\infty < C$, where $C$ would depend on $\Delta_Y$, on the distributions before and after the break, as well as on the portions of the data points before and after the break included in $I_s$, can be established. The term $C$ would then enter the definition of $q$ in Theorem A.1 in place of $(1 + \Delta_Y)$.

V. Avanesov, N. Buzun/Change-point detection in high-dimensional covariance structure 18
Lemma A.1. Consider a centered Gaussian random vector $\xi \in \mathbb{R}^p$ with an arbitrary covariance matrix $\Sigma$ and denote $\sigma^2 := \max_i \Sigma_{ii}$. For any positive $q$ it holds that
$$\mathbb{P}\left\{\max_{1 \le i \le p} \xi_i \ge q\right\} \le \exp\left(-\frac{q^2}{2\sigma^2} + \log p\right).$$
Proof. By convexity we obtain the following chain of inequalities for any $t > 0$:
$$\exp\left(t \max_i \xi_i\right) = \max_i \exp(t \xi_i) \le \sum_{i=1}^p \exp(t \xi_i).$$
The Chernoff bound yields for any $t > 0$
$$\mathbb{P}\left\{\max_i \xi_i \ge q\right\} \le e^{-tq}\, \mathbb{E} \exp\left(t \max_i \xi_i\right) \le \exp\left(-tq + \frac{t^2 \sigma^2}{2} + \log p\right).$$
Finally, optimization over $t$ (attained at $t = q/\sigma^2$) yields the claim.
As a trivial corollary (applying Lemma A.1 to both $\xi$ and $-\xi$), one obtains

Lemma A.2. Consider a centered Gaussian random vector $\xi \in \mathbb{R}^p$ with an arbitrary covariance matrix $\Sigma$ and $\sigma^2 := \max_i \Sigma_{ii}$. For any positive $q$ it holds that
$$\mathbb{P}\left\{\|\xi\|_\infty \ge q\right\} \le 2\exp\left(-\frac{q^2}{2\sigma^2} + \log p\right).$$
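Numerically, the Chernoff-type tail bound $2\exp(-q^2/(2\sigma^2) + \log p)$ with $\sigma^2 = \max_i \Sigma_{ii}$ (a standard form of this estimate for the sup-norm of a Gaussian vector) can be checked by simulation; the covariance matrix below is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 50
# an arbitrary (random Wishart-type) covariance matrix
M = rng.standard_normal((p, p)) / np.sqrt(p)
Sigma = M @ M.T
sigma2 = Sigma.diagonal().max()           # sigma^2 = max_i Sigma_ii

xi = rng.multivariate_normal(np.zeros(p), Sigma, size=200_000)
sup = np.abs(xi).max(axis=1)              # ||xi||_inf for each draw

qs = (2.0, 3.0, 4.0)
tails = {q: (sup >= q).mean() for q in qs}                          # empirical tail
bounds = {q: 2 * np.exp(-q**2 / (2 * sigma2) + np.log(p)) for q in qs}  # Chernoff bound
```

For small $q$ the bound exceeds 1 and is trivial; it becomes informative once $q^2 \gtrsim 2\sigma^2 \log p$, and in every case the empirical tail stays below it.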

Appendix B: Proof of bootstrap validity result
Proof of Theorem 3.1. The proof consists in applying the finite-sample Theorem B.1. Its applicability is guaranteed by the consistency results given in the papers [37, 3, 27] and by the results from [26, 27, 3] bounding the term $\mathtt{R}_T$. The high probability of the set $\mathcal{T}_T$ is ensured by Lemma G.1.
Theorem B.1. Assume $H_0$ holds and, furthermore, let $X_1, X_2, \ldots, X_N$ be i.i.d. Let $\hat\Theta$ denote a symmetric estimator of $\Theta^*$ such that, for some positive $r$, a consistency bound holds and $\Theta^*_{ij} = 0 \Rightarrow \hat\Theta_{ij} = 0$. Suppose Assumption 3.1 holds and there exists $\mathtt{R}_T$ such that $\sqrt{n}\,\|r^S_n(t)\|_\infty \le \mathtt{R}_T$ for all $S \in \{l, r\}$, $n \in \mathfrak{N}$ and $t \in \mathbb{T}_n$ on some set $\mathcal{T}_T$. Moreover, let the remainders $R_A$, $R_{A^\flat}$ and $R^\pm_\Sigma$ (defined in Lemma E.1, Lemma F.2 and Lemma C.1 respectively) be bounded, where the mis-tie $\Delta_Y$ involved in the definition of $R^\pm_\Sigma$ comes from Lemma G.2. Then the claim holds on the set $\mathcal{T}_T$. The required bound $\mathtt{R}_T$ is provided by [27] (for the node-wise procedure) and by [26] (for graphical lasso); for the case of graphical lasso an explicit form of $\mathtt{R}_T$ is given in [3].
Here we just note that if $\hat\Theta$ is a root-$n$ consistent estimator recovering the sparsity pattern (graphical lasso [37], adaptive graphical lasso [43] or the thresholded de-sparsified estimator based on the node-wise procedure [27]), then for $d, s, p, N, n_-, n_+ \to \infty$, $N > 2n_+$, $n_+ \ge n_-$, $s \ge n_-$ and $d^2/n_- = o(1)$, given that the spectrum of $\Theta^*$ is bounded, the assumption simplifies. If either graphical lasso, adaptive graphical lasso or the node-wise procedure [32] is used with $\lambda_n \asymp \sqrt{\log p / n}$ in order to obtain $\hat\Theta^S_n(t)$, then the required bound holds on the set $\mathcal{T}_T$. The high probability of $\mathcal{T}_T$ may be ensured by means of Lemma G.1, e.g. by a proper choice of the parameters involved.

Appendix C: Sandwiching lemma

Lemma C.1. Consider a normal multivariate vector $\eta$ with a deterministic covariance matrix and a normal multivariate vector $\zeta$ with a possibly random covariance matrix close to it, where $\eta_n$ and $\zeta_n$ denote sub-vectors of $\eta$ and $\zeta$ respectively. Then the corresponding quantiles are sandwiched.

Proof. Let us introduce some notation. Denote the multivariate cumulative distribution functions of $A_n$, $A^\flat_n$, $\|\eta_n\|_\infty$ and $\|\zeta_n\|_\infty$ as $P, P^\flat, N, N^\flat : \mathbb{R}^{|\mathfrak{N}|} \to [0, 1]$ respectively. Define the corresponding sets for all $\delta \in [0, \alpha]$, and define a set of thresholds satisfying the confidence level; here and below, comparison of vectors should be understood element-wise. Notice that, due to the continuity of the multivariate normal distribution and the assumption of the lemma, for all $z^- \in \partial Z^-$ and all $z^\flat \in Z^\flat$ a chain of inequalities holds, where we have successively used the conditions of the lemma. In the same way one obtains an analogous chain for all $z^+ \in \partial Z^+$ and all $z^\flat \in Z^\flat$, which implies that $Z^\flat \subset Z^- \cap Z^+$. Now denote the quantile function of $\|\eta_n\|_\infty$ as $z_N : [0, 1] \to \mathbb{R}^{|\mathfrak{N}|}$, and in exactly the same way define the quantile function $z_{N^\flat} : [0, 1] \to \mathbb{R}^{|\mathfrak{N}|}$ of $\|\zeta_n\|_\infty$; the corresponding bounds clearly hold for all $x \in [0, 1]$.

Using a Taylor expansion with Lagrange remainder term, we obtain the corresponding bound for some $0 \le \kappa \le 2\delta$. Next, successively using Lemma C.2 and the fact that the quantile function is the inverse of the c.d.f., we obtain the first quantile bound; in the same way one obtains the second. Next, by the argument used in the beginning of the proof, we obtain $z_N(\alpha^* + \delta),\, z_N(\alpha^* - \delta) \in Z^-\big(\delta (3 + 2 |\mathfrak{N}|)\big) \cap Z^+\big(\delta (3 + 2 |\mathfrak{N}|)\big)$.
As the final ingredient, we need to choose deterministic $\alpha^+$ and $\alpha^-$ satisfying the required identities (which is possible due to continuity), so that $\alpha^- \le \alpha^* \le \alpha^+$; the claim then follows by monotonicity.

Lemma C.2. Consider a random variable $\xi$ and an event $A$ defined on the same probability space. Let the c.d.f.s $\mathbb{P}\{\xi \le x\}$ and $\mathbb{P}\{\xi \le x \,\&\, A\}$ be differentiable. Then
$$\frac{d}{dx}\,\mathbb{P}\{\xi \le x \,\&\, A\} \le \frac{d}{dx}\,\mathbb{P}\{\xi \le x\}.$$
Proof. Indeed, denoting the complement of the set $A$ as $\bar A$, we obtain the decomposition $\mathbb{P}\{\xi \le x\} = \mathbb{P}\{\xi \le x \,\&\, A\} + \mathbb{P}\{\xi \le x \,\&\, \bar A\}$. Using the fact that the derivative of a c.d.f. is non-negative, we finalize the proof.
Appendix E: Gaussian approximation result for $A_n$

Lemma E.1. Suppose there exists $\mathtt{R}_T$ such that $\sqrt{n}\,\|r^S(t)\|_\infty \le \mathtt{R}_T$ for all $S$ and $t$ on some set $\mathcal{T}$. Then on the set $\mathcal{T}$ the Gaussian approximation bound holds, where $F$ and $\eta_n$ are defined in Lemma E.2.
Proof. Substituting (2.1) into (2.1) yields the decomposition. Now denote the vector stacking $S^n_Z(t)$ for all $t$ as $S^n_Z$, and the vector stacking these for all $n \in \mathfrak{N}$ as $S_Z$.
Now notice that $\forall i : (\Sigma^*_Y)_{ii} = 1$ and bound the latter two terms by means of Lemma H.2, with $\gamma$, $\beta$ and $Y$ defined in Appendix E and an independent constant $C_A$.
Proof. Consider a matrix $Y^n$ with $2n_+$ columns. Clearly, the columns of the matrix are independent. Next, define a block matrix composed of the $Y^n$ matrices; clearly, the vectors $Y_{\cdot l}$ are independent. In order to complete the proof we make use of Lemma H.1. By means of Lemma G.3 one shows that the assumptions of Lemma E.3 hold for the components of $Z^S_i$, with a bound involving $\Lambda(\Theta^*)$, the maximal eigenvalue of $\Theta^*$. Therefore, condition (Assumption H.2) holds with $B_n$ defined in Appendix E.
Hence, Assumption H.1 is fulfilled with $b = 1$. Next, notice the corresponding bounds for the $k$-th component of $Z^S_i$ and the central point $t$ (both defined by $j$), and, in the same way, the analogous bounds.

Proof. Integration by parts yields the claim. By the same technique the following lemma can be proven.

Appendix F: Gaussian approximation result for $A^\flat_n$

Lemma F.1.
which is a bootstrap counterpart of $Y^n$ from the proof of Lemma E.2; construct a block matrix $Y^\flat$. Clearly, the vectors $Y^\flat_{\cdot l}$ are independent.

Proof. Direct application of Lemma G.4 yields the bound which, in combination with the fact provided by Lemma G.6, finalizes the argument; $\Delta_{\Sigma_Z}$ along with the probabilities $p^{\Sigma_Z 1}_s(x, q)$ and $p^{\Sigma_Z 2}_s(x)$ are defined in Lemma G.5 and Lemma G.7 respectively.
Proof. Notice that the matrices $\hat\Sigma_Y$ and $\Sigma^*_Y$ are composed of the blocks $S^{-1}\hat\Sigma_Z S^{-1}$ and $S^{-1}\Sigma^*_Z S^{-1}$ respectively, each block multiplied by some positive value not greater than 1 (which can be verified by simple algebra).
By Lemma G.7 and Lemma G.5, the bound holds with probability at least the stated level.

Lemma G.3. Under Assumption 3.1 the stated tail bound holds for arbitrary $1 \le u, v \le p$ and positive $x$.

Proof. Re-write the definition (2.1) of an element $Z_{i,uv}$ for arbitrary $1 \le u, v \le p$. The first term is clearly a value of a quadratic form defined by the matrix $B = \Theta^*_u (\Theta^*_v)^T$. Note that $\operatorname{rank} B = 1$, which implies that it is either positive semi-definite or negative semi-definite. Next we apply Lemma H.4 and obtain a bound for all positive $x$. Again, due to the fact that $B$ is a rank-1 matrix, and by the construction of the matrix $B$, substitution of the obtained expressions yields the intermediate bound.

And since the involved quantities are controlled, we finally obtain a bound for $Z_{i,uv}$; a union-bound correction over all $i$, $u$ and $v$ establishes the following result.

Proof. Since the summands $W^{(i)}_{kl}$ are i.i.d., bounded and centered, the Bernstein inequality applies, where $\sigma^2_W$ is the smallest variance of the components of $W^{(i)}$; the claim follows. The following lemma bounds the mis-tie between $Z_i$ and $\hat Z_i$.
and $\Theta^*_{ij} = 0 \Rightarrow \hat\Theta_{ij} = 0$. Then for positive $x$ the bound holds for all $i \in I_s$.

Now consider the mis-tie of arbitrary elements $Z_{i,uv}$ and $\hat Z_{i,uv}$. Note that, due to the previously obtained bound and the assumptions imposed on $\Theta^*$, the mis-tie is controlled, and hence the claim follows.

Lemma G.7. Assume Assumption 3.1 holds. Let $\hat\Theta$ be a symmetric estimator of $\Theta^*$ s.t.
$\Delta_{\Sigma_Z} = \Delta_Z(x)\,(2 \mathtt{Z}_s(x) + \Delta_Z(x))$.

Proof. By Lemma G.4, with probability at least $1 - p^Z_s(x)$ we have $\|Z_i\|_\infty \le \mathtt{Z}_s(x)$, and in combination with Lemma G.6 we obtain $\|\hat Z_i\|_\infty \le \mathtt{Z}_s(x) + \Delta_Z(x)$ with probability at least $1 - p^Z_s(x) - p^X_s(x)$. Now denote the corresponding differences and deliver the bound.

Appendix H: Known results

H.1. Gaussian approximation result
In this section we briefly describe the result obtained in [10]. Throughout this section consider an independent sample $x_1, \ldots, x_n \in \mathbb{R}^p$ of centered random variables. Define their Gaussian counterparts $y_i \sim \mathcal{N}(0, \operatorname{Var}[x_i])$ and denote the scaled sums as $S^x_n := \frac{1}{\sqrt{n}} \sum_{i=1}^n x_i$ and $S^y_n := \frac{1}{\sqrt{n}} \sum_{i=1}^n y_i$. The constant $C$ in the lemma below depends only on $b$.
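The phenomenon behind this Gaussian approximation can be observed empirically: maxima of scaled sums of non-Gaussian data are close in distribution to maxima of their Gaussian counterparts. The sketch below uses our own choice of data (centered exponential entries) and measures a Kolmogorov-type distance between the two empirical distributions; it illustrates the effect rather than computing the exact bound of [10].

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 100, 50, 2000

def maxima(sampler):
    """Draw `reps` samples of max_j (1/sqrt(n)) * sum_i x_ij."""
    X = sampler((reps, n, p))
    return (X.sum(axis=1) / np.sqrt(n)).max(axis=1)

# non-Gaussian data: centered exponential entries (mean 0, variance 1)
mx = maxima(lambda size: rng.exponential(1.0, size) - 1.0)
# Gaussian counterparts with the matching (identity) covariance
my = maxima(lambda size: rng.standard_normal(size))

# Kolmogorov-type distance between the two empirical c.d.f.s of the maxima
grid = np.linspace(1.5, 4.0, 50)
kd = max(abs((mx <= t).mean() - (my <= t).mean()) for t in grid)
```

Even at moderate $n$ the distance `kd` is small, which is exactly what makes the bootstrap calibration of the maximum-type statistics feasible.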

H.2. Anti-concentration result
Lemma H.2 (Nazarov's inequality [34]). Consider a normal $p$-dimensional vector $X \sim \mathcal{N}(0, \Sigma)$ and let $\forall i : \Sigma_{ii} = 1$. Then for any $y \in \mathbb{R}^p$ and any positive $a$
$$\mathbb{P}\{X \le y + a\} - \mathbb{P}\{X \le y\} \le C a \sqrt{\log p},$$
where $C$ is an independent constant.
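A quick numerical check of this anti-concentration effect. The explicit constant $a(\sqrt{2\log p} + 2)$ used below comes from one common statement of Nazarov's inequality (with unit diagonal, i.e. $b = 1$); the threshold vector `y` is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 100, 50_000
X = rng.standard_normal((reps, p))        # N(0, I): unit diagonal as required

y = np.full(p, 2.5)                        # chosen so the event has non-trivial mass
checks = []
for a in (0.05, 0.1, 0.2):
    lhs = (X <= y + a).all(axis=1).mean() - (X <= y).all(axis=1).mean()
    rhs = a * (np.sqrt(2 * np.log(p)) + 2)  # explicit form of the constant
    checks.append(lhs <= rhs)
```

The left-hand side grows roughly linearly in $a$, matching the shape of the bound.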

H.4. Tail inequality for quadratic forms
The following result is a direct corollary of Theorem 1 in [24].

Lemma H.4. Consider a positive semi-definite or negative semi-definite matrix $B$ and suppose Assumption 3.1 holds. Then the corresponding tail bound holds for all $t > 0$.
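For intuition, the underlying bound can be probed numerically in the form given by Theorem 1 of [24] (Hsu, Kakade and Zhang) for a standard Gaussian vector $x$ and a positive semi-definite $B$: $\mathbb{P}\{x^T B x > \operatorname{tr} B + 2\sqrt{t \operatorname{tr}(B^2)} + 2t\|B\|\} \le e^{-t}$. The threshold below follows their statement, not the exact display of Lemma H.4.

```python
import numpy as np

rng = np.random.default_rng(4)
p, reps = 20, 100_000
M = rng.standard_normal((p, p))
B = M @ M.T                                # a positive semi-definite matrix
trB = np.trace(B)
trB2 = np.trace(B @ B)                     # tr(B^2)
opB = float(np.linalg.eigvalsh(B).max())   # operator norm of B

x = rng.standard_normal((reps, p))
quad = np.einsum('ij,jk,ik->i', x, B, x)   # x^T B x for each draw

ok = all(
    (quad > trB + 2 * np.sqrt(trB2 * t) + 2 * opB * t).mean() <= np.exp(-t)
    for t in (1.0, 2.0, 4.0)
)
```

The empirical tail sits well below $e^{-t}$, reflecting the slack in the exponential bound.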

H.5. High-dimensional precision matrix estimation
In order to address the problem of high-dimensional precision matrix estimation one has to assume its sparsity. Below we describe two approaches exploiting this assumption. In both of them we assume that an i.i.d. sample $X_1, \ldots, X_n \in \mathbb{R}^p$ is supplied.

H.5.1. Graphical lasso
In [20] the graphical lasso approach was suggested. An estimate may be obtained as the solution of the following optimization problem over the positive-definite cone $S^p_{++}$ of $p \times p$ matrices:
$$\hat\Theta := \operatorname*{argmin}_{\Theta \in S^p_{++}} \left( \operatorname{tr}(\hat\Sigma \Theta) - \log\det \Theta + \lambda \|\Theta\|_1 \right),$$
where $\hat\Sigma$ denotes the sample covariance matrix and $\|\Theta\|_1$ the element-wise $\ell_1$ norm.
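A minimal sketch of the estimator using scikit-learn's `GraphicalLasso`, whose `alpha` plays the role of $\lambda$. The toy sparse precision matrix and the choice $\lambda = \sqrt{\log p / n}$ (mirroring the tuning used elsewhere in the paper) are ours.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(5)
p, n = 8, 500

# sparse ground truth: tridiagonal, positive-definite precision matrix
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(p), Sigma, n)

gl = GraphicalLasso(alpha=float(np.sqrt(np.log(p) / n))).fit(X)
Theta_hat = gl.precision_

# entries outside the true band should be (near-)zeroed by the l1 penalty,
# while the band itself should remain clearly non-zero
dist = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
off_band = np.abs(Theta_hat[dist > 1])
band = np.abs(np.diag(Theta_hat, 1))
```

The $\ell_1$ penalty trades a small bias on the true non-zero entries for (near-)exact zeros elsewhere, which is what makes the sparsity-pattern recovery assumed in the theorems above attainable.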

H.5.2. Node-wise lasso
This section describes the node-wise lasso approach which was suggested in [32].
For each $1 \le j \le p$ define a vector $\hat\Gamma_j := (\hat\gamma^j_1, \ldots, \hat\gamma^j_{j-1}, 1, \hat\gamma^j_{j+1}, \ldots, \hat\gamma^j_p)$, where $\hat\gamma^j$ is defined as a solution of the lasso regression of $X_j$ against the remaining coordinates,
$$\hat\gamma^j := \operatorname*{argmin}_{\gamma \in \mathbb{R}^{p-1}} \left( \frac{1}{n}\|X_j - X_{-j}\gamma\|_2^2 + 2\lambda_j \|\gamma\|_1 \right),$$
with $\hat\tau^2_j := \frac{1}{n}\|X_j - X_{-j}\hat\gamma^j\|_2^2 + \lambda_j \|\hat\gamma^j\|_1$. Finally, the $j$-th column of the estimator is defined as $\hat\Theta^{MB}_j := \hat\Gamma_j / \hat\tau^2_j$. Note that this estimator might not be symmetric, so one cannot use it as an estimator $\hat\Theta$ based on the sub-sample $\{X_i\}_{i \in I_s}$. The paper [27] suggests constructing a de-sparsified estimator $\hat T(\hat\Theta^{MB})$.
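The node-wise construction can be sketched with an off-the-shelf lasso solver. This is a toy illustration under our own choices: scikit-learn's `alpha` corresponds to the penalty $\lambda_j$ only up to a scaling convention, and the sign convention for $\hat\Gamma_j$ varies across papers (the sketch places $-\hat\gamma^j$ in the off-diagonal positions, which recovers the sign of the true precision entries in this toy model).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
p, n, lam = 8, 500, 0.05

# same tridiagonal toy precision matrix as above
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), n)

Theta_mb = np.zeros((p, p))
for j in range(p):
    rest = np.delete(np.arange(p), j)
    # lasso regression of X_j on the remaining coordinates
    fit = Lasso(alpha=lam).fit(X[:, rest], X[:, j])
    gamma = fit.coef_
    resid = X[:, j] - X[:, rest] @ gamma - fit.intercept_
    tau2 = resid @ resid / n + lam * np.abs(gamma).sum()
    col = np.zeros(p)
    col[j] = 1.0
    col[rest] = -gamma                     # one common sign convention
    Theta_mb[:, j] = col / tau2            # j-th column of the node-wise estimate
```

As the text notes, `Theta_mb` is generally not exactly symmetric, which is why a symmetrized or de-sparsified version is used downstream.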

H.5.3. Bounds for r
While the graphical lasso and the node-wise estimate are point estimates, de-sparsified estimators have been suggested in order to obtain confidence intervals [26] [27]. The analysis of these estimators relies on bounds for the residual term $r$. The next two lemmas bound the remainder $r$ for the cases of the graphical lasso and of the node-wise estimator.
Lemma H.6 (by [26]) provides the bound for the graphical lasso; a finite sample-size bound for $r$, along with its adaptations for the case of the adaptive graphical lasso, may be found in [3].

Lemma H.7 (by [27]). Let $\hat\Theta$ be yielded by the node-wise procedure with $\lambda_n \asymp \sqrt{\log p / n}$.