High dimensional efficiency with applications to change point tests

Abstract: This paper rigorously introduces the asymptotic concept of high dimensional efficiency, which quantifies the detection power of different statistics in high dimensional multivariate settings. It allows for comparisons of different high dimensional methods with different null asymptotics and even different asymptotic behavior, such as extremal-type asymptotics. The concept will be used to understand the power behavior of different test statistics, as the performance greatly depends on the assumptions made, such as sparseness or denseness of the signal. The effect of misspecification of the covariance on the power of the tests is also investigated, because in many high dimensional situations estimation of the full dependency (covariance) between the multivariate observations in the panel is often either computationally or even theoretically infeasible. The theoretic quantification is accompanied by simulation results which confirm the theoretic (asymptotic) findings for surprisingly small samples. The development of this concept was motivated by, but is by no means limited to, high-dimensional change point tests. It is shown that the concept of high dimensional efficiency is indeed suitable to describe small sample power. MSC 2010 subject classifications: 62F05, 62M10, 62G10


Introduction
There has recently been a renaissance in research on statistical methods for change point problems (Horváth and Rice, 2014). This has been driven by applications where non-stationarities in the data can often be best described as change points in the data generating process (Eckley et al., 2011; Frick et al., 2014; Aston and Kirch, 2012b). However, data sets are now routinely considerably more complex than the univariate time series classically studied in change point problems (Page, 1954; Robbins et al., 2011; Aue and Horváth, 2013; Horváth and Rice, 2014), and as such, methodology for detecting and estimating change points in a wide variety of settings, such as multivariate (Horváth et al., 1999; Ombao et al., 2005; Aue et al., 2009b), functional (Berkes et al., 2009; Aue et al., 2009a; Hörmann and Kokoszka, 2010; Aston and Kirch, 2012a; Torgovitski, 2015) and high dimensional settings (Bai, 2010; Chan et al., 2012; Enikeeva and Harchaoui, 2013; Cho and Fryzlewicz, 2015), has recently been proposed. In panel data settings, these include methods based on taking maxima statistics across panels coordinate-wise (Jirak, 2015), scan statistic approaches (Enikeeva and Harchaoui, 2013), sparsified binary segmentation for multiple change point detection (Cho and Fryzlewicz, 2015), double CUSUM procedures (Cho, 2015), as well as those based on structural assumptions such as sparsity (Wang and Samworth, 2016).
In this paper, we develop a theoretic framework to understand and compare the power behavior of simple mean change tests in high dimensional settings. As benchmarks we investigate a class of tests based on projections, where the optimal (oracle) projection test is closely related to the likelihood ratio test under knowledge of the direction of the change, giving an upper benchmark. As a lower benchmark we consider a projection onto a random direction. Secondly, we closely examine the power behavior of a universal change point test in this setting that has been introduced by Horváth and Hušková (2012). Here, we take universal to mean that its power behaviour does not depend on how sparse or dense the change is across the multivariate vector. The results and techniques we introduce can subsequently be extended to more complex change point settings as well as different statistical frameworks, such as two sample tests. In fact, Cho (2015) has already extended the findings from a preprint of this paper to some additional change point tests that have recently been proposed. We make use of the following two key concepts: Firstly, we consider contiguous changes where the size of the change tends to zero as the sample size, and with it the number of dimensions, increases, leading to the notion of high dimensional efficiency. This concept is closely related to asymptotic relative efficiency (ARE) (see Lehmann (1999, Sec. 3.4) and Lopes et al. (2011), where ARE is used in a high dimensional setting).
Optimal power in the sense of the oracle projection is only achieved if information about the direction of the change is known, where known can include assumptions such as sparse or balanced changes, the latter meaning that there exists a small change of similar magnitude in each component. However, such procedures typically break down to the power of a random projection, henceforth called tolerable power, if those assumptions are not met. In addition, inherent misspecification in other parts of the model, such as the covariance structure, will have a detrimental effect on detection, which can result in procedures having no better than tolerable power.
We will consider a simple setup for our analysis, although one which is inherently the base for most other procedures, and one which can easily be extended to complex time dependencies and change point definitions using corresponding results from the literature (Kirch and Kamgaing, 2015, 2016). For a set of observations X_{i,t}, 1 ≤ i ≤ d = d_T, 1 ≤ t ≤ T, the change point model is defined to be

X_{i,t} = μ_i + δ_{i,T} g(t/T) + e_{i,t},   1 ≤ i ≤ d = d_T, 1 ≤ t ≤ T,   (1.1)

where E e_{i,t} = 0 for all i and t with 0 < σ_i² = var e_{i,t} < ∞, and g : [0, 1] → R is a Riemann-integrable function. Here δ_{i,T} indicates the size of the change in each component. This setup incorporates a wide variety of possible changes by suitable selection of the function g, as will be seen below. For simplicity, for now it is assumed that {e_{i,t} : t ∈ Z} are independent, i.e. we assume independence across time but not across components. If the number of dimensions d is fixed, the results readily generalise to situations where a multivariate functional limit theorem exists, as is the case for many weakly dependent time series. If d can increase to infinity with T, then generalizations are possible if the {e_{i,t} : 1 ≤ t ≤ T} form a linear process in time but the errors are independent between components (dependency between components will be discussed in detail in the next section). Existence of moments strictly larger than two is needed in all cases.
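As a concrete illustration, model (1.1) can be simulated directly. The following sketch (with illustrative parameter choices that are not taken from the paper) generates a panel with an at-most-one-change signal g(x) = 1{x > ϑ} and i.i.d. standard normal errors:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(T=100, d=200, delta=0.5, theta=0.5):
    """Simulate model (1.1) with an AMOC signal g(x) = 1{x > theta},
    mu_i = 0 and the same change size delta in every component."""
    g = (np.arange(1, T + 1) / T > theta).astype(float)  # g(t/T)
    X = rng.standard_normal((d, T))                      # errors e_{i,t}
    X += delta * g                                       # add delta_i * g(t/T)
    return X

X = simulate_panel()
```

Other alternatives follow by swapping in a different g, e.g. an epidemic indicator.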
The change (direction) is given by Δ_d = (δ_{1,T}, . . . , δ_{d,T})′ and the type of alternative is given by the function g in rescaled time. While g is defined in a general way, it includes as special cases most of the usual change point alternatives, for example,

• At most one change (AMOC): g(x) = 1_{{x > ϑ}} for some 0 < ϑ < 1, i.e. a mean change at time ⌊ϑT⌋.
• Epidemic change: g(x) = 1_{{ϑ_1 < x ≤ ϑ_2}} for some 0 < ϑ_1 < ϑ_2 < 1, i.e. a change that later reverts to the original mean.

The form of g will influence the choice of test statistic to detect the change point. As in the above two examples, in the typical definition of change points the function g is modelled by a step function (which can approximate many smooth functions well). In such situations, test statistics based on partial sums of the observations have been well studied (Csörgő and Horváth, 1997). We focus on test statistics for the AMOC situation and show these statistics are robust (in the sense of still having non-zero power) to a wide variety of g. We derive the asymptotic theory for these partial sum processes, and the results readily carry over to other change point statistics such as the ones for epidemic change points. The model in (1.1) is defined for univariate (d = 1), multivariate (d fixed) or panel data (d → ∞). The panel data (also known as "small n large p" or "high dimensional low sample size") setting is able to capture the small sample properties very well in situations where d is comparable to or even larger than T using asymptotic considerations. In this asymptotic framework the detection ability or efficiency of various tests can be defined by the rates at which vanishing alternatives can still be detected. However, many of our results, particularly for the proposed projection tests, are also qualitatively valid in the multivariate or d fixed setting.
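For the AMOC case, a max-type CUSUM statistic built from partial sums can be sketched as follows; the normalisation by an estimated standard deviation is one common choice and not necessarily the exact weighting used later in the paper:

```python
import numpy as np

def cusum_max(y):
    """Max-type CUSUM: max_k |S_k - (k/T) S_T| / (sigma * sqrt(T))
    for a univariate series y, with sigma estimated by the sample std."""
    T = len(y)
    S = np.cumsum(y)
    stat = np.abs(S - (np.arange(1, T + 1) / T) * S[-1])
    sigma = y.std(ddof=1)
    return stat.max() / (sigma * np.sqrt(T))

rng = np.random.default_rng(1)
y_null = rng.standard_normal(200)                         # no change
y_alt = y_null + np.where(np.arange(200) >= 100, 2.0, 0.0)  # AMOC at mid-point
s_null, s_alt = cusum_max(y_null), cusum_max(y_alt)
```

Under a step-function g the statistic peaks near the change point, which is what makes partial-sum statistics natural here.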
The paper proceeds as follows. In Section 2, the concept of high dimensional efficiency as a way of comparing the power of high dimensional tests is introduced. Also in Section 2, we derive the high dimensional efficiency for the panel based change point statistics suggested by Horváth and Hušková (2012). This will be done for a correctly specified covariance structure as well as for the case where the covariance assumptions are violated. In Section 3 we develop the asymptotic theory for projection statistics, which will act as a lower and upper benchmark for the panel change point test from the previous section. Here, too, misspecification will be taken into account. In Section 3.4 we summarize and interpret the high dimensional efficiency for a wide range of high dimensional change point tests recently proposed in the literature, based on results obtained by Cho (2015). Section 4 provides a short illustrative example with respect to multivariate market index data. Section 5 concludes with some discussion of the different statistics proposed. The proofs, in addition to some further illustrative material, are given in an appendix. In addition, rather than in a separate simulation section, simulations will be interspersed throughout the theory. They complement the theoretic results, confirming that the conclusions are already valid for small samples, thus verifying that the concept of high-dimensional efficiency is indeed suitable to understand the power behavior of different test statistics. In all cases the simulations are based on 1000 repetitions of i.i.d. normally distributed data for each set of situations, and unless otherwise stated the number of time points is T = 100 with the change (if present) occurring half way through the series. Except in the simulations concerning size itself, all results are empirically size corrected to account for the size issues for the multivariate (panel) statistic that will be seen in Figure 3.1.

High dimensional efficiency and a universal panel mean change test
In this section, we will first derive a theoretic framework called high dimensional efficiency, an asymptotic concept to compare the power of several high dimensional tests. Secondly, we will calculate this high dimensional efficiency for the universal panel CUSUM tests (with d → ∞) introduced by Horváth and Hušková (2012), extending a multivariate setting with d fixed (Horváth et al., 1999). Since we do not assume Gaussianity, in order to obtain the corresponding limits it is necessary to assume independence between components, because the proofs are based on a central limit theorem across components. As such they cannot be generalized to uncorrelated (but dependent) data except in the Gaussian case. For this reason, we cannot easily derive the asymptotic theory after standardization of the data. This is different from the multivariate situation, where this can easily be achieved. This test is related to a test in the high-dimensional two-sample situation by Srivastava and Du (2008), who consider some kind of misspecification of the covariance structure but under the stronger assumption of Gaussianity of the data. We are interested in a comparison of the high dimensional efficiency under different covariance structures, which yield weighting matrices A, for example, the correctly specified covariance, i.e. A = Σ^{−1}, in addition to a comparison in the misspecified case, A = M^{−1}, for some M not equal to the true covariance. The latter has already been discussed in one particular case by Horváth and Hušková (2012). To be precise, a common factor is introduced and the limit of the statistic (with A = Λ^{−1}) under the assumption that the components are independent (i.e. Λ being a diagonal matrix) is considered. Because of the necessity to estimate the unknown covariance structure for practical purposes, the same qualitative effects as discussed here can be expected if a statistic and corresponding limit distribution were available for the covariance matrix Σ.

High dimensional efficiency
As the main focus of this paper is to compare several test statistics with respect to their detection power, we introduce a new asymptotic concept that allows us to understand this detection power in a high dimensional context. In the subsequent sections, simulations accompanying the theoretic results will show that this concept is indeed able to give insight into the small sample detection power. Thus this concept provides a theoretic tool for a power comparison which, unlike simulation studies, gives simultaneous insight into a large variety of situations.
Consider a typical testing situation where (possibly after reparametrization) we test for some parameter vector v_d ∈ R^{l_d}. In this paper, this vector will be the change, i.e. v_d = Δ_d = (δ_{1,T}, . . . , δ_{d,T})′. However, it could also be the mean vector in a one-sample location problem (l_d = d), or the difference of the mean vectors in a two-sample model (l_d = d). For change point testing in parametric time series models it could be the difference of the corresponding parameter vectors, where l_d is the effective dimension given by the number of unknown parameters in the situation where d-dimensional data is observed.
To understand the small sample power of different statistics we consider local or contiguous alternatives with v_d = v_{d,T} → 0 (as T → ∞). For a panel setting this leads to the definition of the high dimensional efficiency E(v_d) (Definition 2.1). Obviously, E is only defined up to multiplicative constants, and has to be understood as a representative of the class for all sequences of alternatives v_d and some constants c, C depending only on E_1 and E_2.
Remark 2.1. The above definition has the following connection to minimax optimality: if the high dimensional efficiency is equal to the minimax separation rate in the sense of Ingster and Suslina (2012) (for a given norm), then the corresponding test is in fact minimax optimal in that sense (with respect to that norm). Additionally, the above notion allows us to compare the power behavior for particular types of alternatives (such as e.g. sparse alternatives) of different test statistics, leading to the notion of relative high dimensional efficiency, where it is not the constants that are of interest but the dependence on d. For example, a high dimensional efficiency of 2‖v_d‖ is a factor √d better than one of 10‖v_d‖/√d, resulting in a relative high dimensional efficiency of √d for the first test versus the second one. As in the classical interpretation, this means that the magnitude of v_d can be a factor 1/√d (again up to constants) smaller for the first test while still giving the same detection power as for the second test.
In particular, a test has asymptotic power one for a sequence of alternatives with √T E(v_d) → ∞, while a bounded sequence √T E(v_d) usually results in an asymptotic power strictly between the level and one. In the classic notion (with d constant) of asymptotic relative efficiency (ARE, or Pitman efficiency) for test statistics with a standard normal limit, it is the additive shift between L(α) and L (see Lehmann, 1999, Sec. 3.4) that shows power differences for different statistics. Consequently, this shift has been used to define asymptotic efficiency. This idea has been considered in high dimensional settings by Lopes et al. (2011) as well as Wang et al. (2015) in a two-sample setting and by Srivastava and Du (2008) in the one-sample setting, where the tests considered all converge to a standard normal limit, an assumption that is not true for most change point tests.
It turns out that the distinction made by the rates as captured by the high dimensional efficiency is already sufficient to compare the power behavior of the change point tests in this paper. In fact, if those rates differ, then the classic asymptotic relative efficiency is not defined (or rather yields 0 or ∞). It is only in situations where the high dimensional efficiency of two tests is equal, as e.g. in Lopes et al. (2011), Wang et al. (2015) and Srivastava and Du (2008), that the constants, as in the classic notion of efficiency, become important for understanding the differences in efficiency.
For standard test statistics exhibiting the usual distributional asymptotics, the above definition guarantees that the high dimensional efficiency E(v_d) only depends on the type of alternative and the dimension d but not on the sample size T. The reason is that the rate with respect to the sample size of contiguous alternatives is the same as in classical testing, namely √T. Nevertheless, the above concept also allows the investigation of test statistics exhibiting e.g. extreme-value behavior, which is often the case in change point analysis and appears for high dimensional change point tests, see e.g. Chan et al. (2012) or Jirak (2015). In these examples, the high dimensional efficiency will typically also depend on T, as due to the extremal behavior the rate will no longer be √T but √(T/ log log T). However, since the logarithm grows very slowly, the dependence on d may be much more important. As an illustrative example, the maximum-likelihood CUSUM statistic, which is an extreme-value-type change point test, will be compared to a differently weighted CUSUM statistic with standard distributional asymptotics for d = 1 in Section A in the appendix.
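To see numerically how weak the extra log log T factor is (a quick sketch, not from the paper):

```python
import math

# log log T grows extremely slowly, so the sqrt(T / log log T) rate for
# extreme-value-type statistics is barely worse than sqrt(T) in practice
vals = {T: math.log(math.log(T)) for T in (10**2, 10**6, 10**12)}
```

Even increasing T from 10² to 10¹², i.e. by ten orders of magnitude, barely doubles log log T, so differences in the d-dependence dominate comparisons between such tests.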

Illustrative examples of spatial dependencies
In order to be able to prove asymptotic results for change point statistics even if d → ∞, we need to make the following assumptions on the underlying error structure. This is much weaker than the independence assumption for the universal panel statistic from the previous section as considered by Horváth and Hušková (2012). Furthermore, we do not need to restrict the rate with which d grows. If restrictions on the growth rate are imposed, in particular in the multivariate setting with d fixed, these assumptions can be relaxed and more general error sequences can be allowed.
Assumption A.1. Let η_{1,t}(d), η_{2,t}(d), . . . be independent with E η_{i,t}(d) = 0, var η_{i,t}(d) = 1 and E |η_{i,t}(d)|^ν ≤ C < ∞ for some ν > 2 and all i and d. For t = 1, . . . , T we additionally assume for simplicity that (η_{1,t}(d), η_{2,t}(d), . . .) are identically distributed (leading to data which is identically distributed across time). The errors within the components are then given as linear processes of these innovations,

e_{l,t}(d) = Σ_{j≥1} a_{l,j}(d) η_{j,t}(d),   with Σ_{j≥1} a_{l,j}(d)² < ∞,

or equivalently in vector notation e_t(d) = Σ_{j≥1} a_j(d) η_{j,t}(d), where e_t(d) = (e_{1,t}(d), . . . , e_{d,t}(d))′ and a_j(d) = (a_{1,j}(d), . . . , a_{d,j}(d))′. These assumptions allow us to consider many varied dependency relationships between the components. (We concentrate on between-component dependency at this point, as temporal dependency adds multiple layers of notational difficulty but little in the way of insight, since almost all results generalise simply to weakly dependent and linear processes, including the particular cases we will now discuss.)
The following three cases of different spatial dependency structures are very helpful in understanding the effect of misspecification of the covariance structure on the high dimensional efficiency. They will be used as examples throughout the paper:

Case C.1 (Independent Components). The components are independent, i.e. a_j = (0, . . . , s_j, . . . , 0)′, the vector which is s_j > 0 at position j and zero everywhere else, for j ≤ d, and a_j = 0 for j ≥ d + 1. In particular, each channel has variance σ_j² = s_j².

Case C.2 (Fully Dependent Components). There is one common factor to all components, leading to completely dependent components, i.e. a_1 = Φ_d = (Φ_1, . . . , Φ_d)′ and a_j = 0 for j ≥ 2. In this case σ_j² = Φ_j². This case, while being somewhat pathological, is useful for gaining intuition into the effects of possible dependence and also helps with understanding the next case.
Case C.3 (Mixed Components). The components contain both an independent and a dependent term. Let a_j = (0, . . . , s_j, . . . , 0)′, the vector which is s_j > 0 at position j and zero everywhere else, for j ≤ d, a_{d+1} = Φ_d = (Φ_1, . . . , Φ_d)′, and a_j = 0 for j ≥ d + 2. Then σ_j² = s_j² + Φ_j². This mixed case allows consideration of dependency structures between cases C.1 and C.2. It is used in the simulations with Φ_d = Φ(1, . . . , 1)′, where Φ = 0 corresponds to C.1 and Φ → ∞ corresponds to C.2. We also use this particular example for the universal panel statistic in this section to quantify the effect of misspecification.
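The mixed case C.3 with the parameterization used in the simulations (s_j = s, Φ_j = φ for all j) can be sketched as follows; the function name and defaults are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def mixed_errors(T=100, d=200, phi=1.0, s=1.0):
    """Case C.3 errors: e_{j,t} = s * eta_{j,t} + phi * eta_{d+1,t},
    i.e. idiosyncratic noise plus one common factor shared by all
    components. phi = 0 recovers the independent case C.1."""
    idio = s * rng.standard_normal((d, T))   # independent part (C.1)
    common = phi * rng.standard_normal(T)    # common factor (C.2 part)
    return idio + common                     # broadcasts over components

e = mixed_errors(phi=1.0)
# any two distinct components have covariance phi**2 and variance s**2 + phi**2
```

As φ grows, the cross-sectional correlation φ²/(s² + φ²) tends to one, interpolating between C.1 and C.2.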
Of course, many other dependency structures are possible, but these three cases give insight into the cases of no, complete and some dependency respectively.

Efficiency for universal change point test for independent panels
Multivariate CUSUM statistics have been adapted to the panel data setup under the assumption of independent components by Bai (2010) for estimation as well as Horváth and Hušková (2012) for testing. Those statistics are obtained as weighted maxima or sums of a (univariate) partial sum process V_{d,T}(x) based on X_d(t) = (X_{1,t}, . . . , X_{d,t})′, where σ_i² = var e_{i,1}. Theorem B.1 in the appendix gives a central limit theorem for errors as in Case C.1 for this partial sum process (under the null) from which null asymptotics of the corresponding statistics can be derived if d/T² → 0. This was proven by Horváth and Hušková (2012, Theorem 1), who did allow for a linear process structure across time. However, the independence across components cannot be dropped, which has some effects on the high dimensional efficiency that we will investigate in Section 2.4. For the corresponding Darling–Erdős-type theorem as discussed in Chan et al. (2012), the quite restrictive assumption d/T² → 0 can be dropped. The corresponding test is related to the weighted CUSUM test M2 as discussed in Appendix A for the univariate case, which also exhibits Darling–Erdős-type asymptotics. As in the discussion there, this Darling–Erdős test has similar high dimensional efficiency as max_{0≤x≤1} V_{d,T}(x) up to an additional log-term, but will not be discussed here in detail.
The following theorem derives the high dimensional efficiency in this setting for universal panel statistics such as max_{0≤x≤1} V_{d,T}(x), which we use in simulations with both known and estimated standard deviations. The high dimensional efficiency of the universal panel statistic tests is then given by

E(Δ_d) = d^{−1/4} ‖(δ_{1,T}/σ_1, . . . , δ_{d,T}/σ_d)‖,

where ‖·‖ refers to the Euclidean norm.
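A panel CUSUM in the spirit of Horváth and Hušková (2012) can be sketched as below: sum the squared, variance-standardised componentwise CUSUMs, centre by the Brownian-bridge second moment x(1 − x), and scale by √d. This is an illustrative sketch; the exact standardisation in the paper may differ.

```python
import numpy as np

def panel_cusum(X):
    """Max over x of |V_{d,T}(x)| for a d-by-T panel X: squared
    componentwise CUSUMs, standardised by estimated variances,
    centred by x(1-x) and scaled by sqrt(d). Illustrative sketch."""
    d, T = X.shape
    k = np.arange(1, T + 1)
    S = np.cumsum(X, axis=1)
    C = (S - (k / T) * S[:, -1:]) / np.sqrt(T)   # componentwise CUSUMs
    sig2 = X.var(axis=1, ddof=1, keepdims=True)
    V = (C**2 / sig2 - (k / T) * (1 - k / T)).sum(axis=0) / np.sqrt(d)
    return np.abs(V).max()

rng = np.random.default_rng(5)
X0 = rng.standard_normal((50, 200))   # null: no change
X1 = X0.copy()
X1[:, 100:] += 0.5                    # dense change in all components
s_null, s_alt = panel_cusum(X0), panel_cusum(X1)
```

Because the statistic aggregates squared CUSUMs, it reacts to the magnitude of the standardised change regardless of how the change is spread across components, matching the "universal" behaviour described above.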
Most notably, the efficiency of this test statistic, unlike test statistics particularly developed for sparse changes as discussed in Section 3.4, only depends on the magnitude of the covariance scaled change but not on how the mass of the change is placed within the vector (i.e. whether it is balanced across the vector or concentrated sparsely in a few components). Proposition 1 in Baraud (2002) shows that the minimax separation rate in the L²-norm for the signal detection problem (which also provides lower bounds for the present change point problem) is given by d^{1/4}/√T, i.e. no uniform test exists in the above change point situation when the magnitude of the change is of smaller order. Consequently, the test by Horváth and Hušková (2012) achieves L²-minimax optimality in the sense of Ingster and Suslina (2012) and cannot be improved uniformly over all Δ.
For a change of constant magnitude the efficiency is proportional to d^{−1/4}, and as such will turn out to be a factor d^{1/4} worse than the oracle efficiency but a factor d^{1/4} better than the tolerable efficiency (as obtained by a random projection).
We can see the finite sample nature of this phenomenon in Figure 3.2 (a).

Efficiency of universal change point tests under dependence between components
We now turn again to the misspecified situation, where the above statistic is used although the components are not uncorrelated. Following Horváth and Hušková (2012), we consider the mixed case C.3 for illustration. The next proposition derives the null limit distribution for that special case. It turns out that the limit as well as the convergence rates depend on the strength of the contamination by the common factor.

Here each component is standardized by its true variance σ_i², but the rest of the dependency structure is not taken into account. The asymptotic behavior of V_{d,T}(x) then depends on the behavior of a quantity A_d driven by the common factor.
Because A_d in the above theorem cannot feasibly be estimated, this result cannot be used to derive critical values for panel test statistics. Consequently, the exact shape of the limit distribution in the above lemma is not important. However, the lemma is necessary to derive the high dimensional efficiency of the panel statistics in this misspecified case. Furthermore, it indicates that using the limit distribution from the previous section to derive critical values will result in asymptotically wrong sizes if a strong contamination by a common factor is present. The simulations in Figure 3.1 confirm this fact and show that the size distortion can be enormous. It does not matter whether the variance of the components in the panel statistic takes the dependency into account or simply uses the noise variance (Figure 3.1(a)), or whether a change is accounted for or not in the estimation (Figure 3.1(b)-(c)). This illustrates that the full panel statistic is very sensitive, in terms of size, to deviations from the assumed underlying covariance structure.
In the situations of a) and b) above, the dependency structure introduced by the common factor is still small enough asymptotically not to change the high dimensional efficiency as given in Theorem 2.1; the proof is analogous to that of Theorem 2.1. Therefore, we will now concentrate on situation c), which is the case where the noise coming from the common factor does not disappear asymptotically.

then the corresponding panel statistics have high dimensional efficiency
Corollary 3.8 below will show that the efficiency of the universal panel test becomes as bad as the tolerable efficiency if A d /d → A > 0, which is typically the case if the dependency is non-sparse and non-negligible.

Projections
We now describe how projections can be used to obtain change point statistics in high dimensional settings; these will serve as an upper benchmark (in the form of an oracle projection) and a lower benchmark (in the form of a random projection) for other change point tests.
In model (1.1), the change Δ_d = (δ_{1,T}, . . . , δ_{d,T})′ (as a direction) is always a rank one (vector) object no matter the number of components d. This observation suggests that knowing the direction of the change Δ_d in addition to the underlying covariance structure can significantly increase the signal-to-noise ratio. Furthermore, for μ and Δ_d/‖Δ_d‖ (but not ‖Δ_d‖) known, with i.i.d. normal errors, one can easily verify that the corresponding likelihood ratio statistic is obtained as a projection statistic with projection vector Σ^{−1}Δ_d, which can also be viewed as an oracle projection. Under (1.1) it holds that

⟨X_t, p_d⟩ = ⟨μ, p_d⟩ + ⟨Δ_d, p_d⟩ g(t/T) + ⟨e_t, p_d⟩,

where μ = (μ_1, . . . , μ_d)′ and e_t = (e_{1,t}, . . . , e_{d,t})′. The projection vector p_d plays a crucial role in the following analysis and will be called the search direction. Because multiplicative constants change neither the signal-to-noise ratio nor the high dimensional efficiency, we will always use the normalized vector with ‖p_d‖ = 1 for simplicity. This representation shows that the projected time series exhibits the same change point behavior as before as long as the change is not orthogonal to the projection vector. Furthermore, the power is better the larger ⟨Δ_d, p_d⟩ and the smaller the variance of ⟨e_t, p_d⟩. Consequently, an optimal projection in terms of power depends on Δ_d as well as Σ = var e_1.
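The projection idea can be sketched numerically. Assuming Σ = I for this illustration (so that the oracle direction Σ^{−1}Δ_d is proportional to Δ_d itself), the oracle projection concentrates the whole change into one univariate series, while a random direction captures only a small component of it:

```python
import numpy as np

def projection_cusum(X, p):
    """Max-type CUSUM applied to the projected series <X_t, p>,
    with the search direction normalised so that ||p|| = 1."""
    p = p / np.linalg.norm(p)
    y = p @ X                                  # projected univariate series
    T = y.size
    S = np.cumsum(y)
    C = np.abs(S - (np.arange(1, T + 1) / T) * S[-1])
    return C.max() / (y.std(ddof=1) * np.sqrt(T))

rng = np.random.default_rng(6)
d, T = 200, 100
delta = np.ones(d) / np.sqrt(d)                # dense, balanced change direction
X = rng.standard_normal((d, T))                # Sigma = I, so oracle = delta
X[:, T // 2:] += 2.0 * delta[:, None]          # change at the mid-point
s_oracle = projection_cusum(X, delta)
s_random = projection_cusum(X, rng.standard_normal(d))
```

The oracle projection reduces the d-dimensional problem to the univariate CUSUM of Appendix A, which is what drives the upper-benchmark efficiency below.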

Efficiency of change point tests based on projections
In this section, we derive the efficiency of change point tests based on projections under rather general assumptions. Furthermore, we will see that the size behavior is very robust with respect to deviations from the assumed underlying covariance structure. The power, on the other hand, turns out to be less robust, though more so than for statistics taking the full multivariate information into account. As special cases, we then obtain our benchmark efficiencies, the oracle efficiency and the tolerable efficiency (obtained from random projections), in Section 3.3.
Standard statistics such as the CUSUM statistic are based on partial sum processes, so in order to quantify the possible power gain from the use of projections we consider the partial sum process U_{d,T}(x) of the projected observations, where Z_{T,i} is as in (2.3). Different test statistics can be defined for a range of g (see Section C.1 in the appendix for more details). One popular test statistic, designed for the at-most-one-change situation (but with power against any non-constant g), is the max-type statistic, analogous to that for the universal panel test given above.
In this section we first derive a functional central limit theorem for the process U d,T (x), which implies the asymptotic null behavior for these tests. Then, we derive the asymptotic behavior of the partial sum process under contiguous alternatives to obtain the high dimensional efficiency for projection statistics.

Null asymptotics
As a first step towards the efficiency of projection statistics, we derive the null asymptotics. This is also of independent interest if projection statistics are applied to a given data set in order to find appropriate critical values. In the following theorem d can be fixed, but it is also allowed that d = d_T → ∞, where no restrictions on the rate of convergence are necessary.

Fig 3.1: Size of tests as the degree of dependency between the components increases. As can be seen, all the projection methods (Oracle, Quasi-Oracle, Pre-Oracle and Random projections, defined in Section 3.3) maintain the size of the tests. Those based on using the full information, i.e. the universal test (indicated as H&H) as described in Section 2, have size problems as the degree of dependency increases. The simulations correspond to Case C.3 with s_j = 1, Φ_j = φ, j = 1, . . . , d, with d = 200, where φ is given on the x-axis.
Assumption A.1 can be replaced by a different assumption which is always fulfilled in the multivariate case but often too restrictive in the panel situation (see Theorem C.1 in the appendix for more details). Lemma C.2 shows that standard variance estimators fulfill (3.3); the second of these is typically used in an at-most-one-change setting, as it is also consistent under the at-most-one-change alternative and hence usually leads to a small-sample power improvement for the corresponding tests.
From Theorem 3.1 one can easily derive the null asymptotics for standard change point tests such as the max-type and sum-type tests (see Corollary C.3 in the appendix for more details). As can be seen in Figure 3.1, regardless of whether the variance is known or estimated, the projection methods all maintain the correct size even when there is a high degree of dependence between the different components (the specific projection methods will be characterised in Section 3.3 below).

Absolute high dimensional efficiency
We are now ready to derive the high dimensional efficiency of projection statistics. Furthermore, we show that a related estimator for the location of the change is asymptotically consistent.
Theorem 3.2. Let the assumptions of Theorem 3.1 be fulfilled. Then the max-type statistic based on (3.1) has the following absolute high dimensional efficiency:

E_3(Δ_d, p_d) = |⟨Δ_d, p_d⟩| / τ(p_d) = ‖Δ_d‖ |cos α_{Δ_d, p_d}| / τ(p_d),

where τ²(p_d) is as in Theorem 3.1 and α_{u,v} is the (smallest) angle between u and v. In addition, the asymptotic power increases with increasing multiplicative constant.

The assertion remains true under the assumptions of Theorem C.1 as well as for the max- and sum-type statistics with a weight function w(·) as in Corollary C.3. In the following, E_3(Δ_d, p_d) is fixed to the above representative of the class, so that different projection procedures with the same rate but different constants can be compared.
Remark 3.1. For random projections the high dimensional efficiency is a random variable. The convergence in Definition 2.1 is understood given the projection vector p_d, where we get either a.s. convergence or P-stochastic convergence depending on whether √T E_3(Δ_d, p_d) converges a.s. or in a P-stochastic sense (in the latter case the assertion follows from the subsequence principle).
The above result shows in particular that sufficiently large changes (as defined by the high dimensional efficiency) are detected asymptotically with power one. For such changes in the at-most-one-change situation, one can easily derive that the corresponding change point estimator is consistent in rescaled time (see Corollary C.4 in the appendix).
Remark 3.2. The proof shows in particular that all deviations from a stationary mean are detectable with asymptotic power one. It is this g function which determines which weight function gives the best power.

Remark 3.3. We derive the high dimensional efficiency for a given g and disappearing magnitude of the change Δ_d. For an epidemic change situation with g(x) = 1_{ϑ_1 < x ≤ ϑ_2} for some 0 < ϑ_1 < ϑ_2 < 1, this means that the duration of the change is relatively large but the magnitude relatively small with respect to the sample size. Alternatively, one could also consider the situation where the duration gets smaller asymptotically (see e.g. Frick et al. (2014)), resulting in a different high dimensional efficiency, which is equal for the projection as well as the multivariate or panel statistic as long as the same weight function and the same type of statistic (sum/max) is used. Some preliminary investigations suggest that in this case using projections based on principal component analysis similar to Aston and Kirch (2012a) can be advantageous; however, this is not true for the setting discussed in this paper.
In the next section we will further investigate the high dimensional efficiency and see that the power depends essentially on the angle between Σ^{1/2} p_d and the 'standardized' change Σ^{-1/2} Δ_d if Σ is invertible: the smaller the angle, the larger the power. Some interesting insight can also be gained from the situation where Σ is not invertible by considering Case C.2 above (see Section C.4 of the appendix).

High dimensional efficiency of oracle and random projections
In this section, we will further investigate the high dimensional efficiency of certain particular projections that can be viewed as benchmark projections. In particular, we will see that the efficiency depends only on the angle between the projection and the change both properly scaled with the underlying covariance structure.
The highest efficiency is obtained by o = Σ^{-1} Δ_d, as the next theorem shows; this will be called the oracle projection. The oracle is equivalent to a projection, after first standardizing the data, onto the 'new' change Σ^{-1/2} Δ_d. The corresponding test is related to the likelihood ratio statistic for i.i.d. normal innovations, where both the original mean and the direction (but not the magnitude) of the change are known. This oracle is effectively the optimal linear classifier as proposed by Delaigle and Hall (2012). As a lower (worst-case) benchmark we consider a scaled random projection r_{d,Σ} = Σ^{-1/2} r_d, where r_d is a random projection on the d-dimensional unit sphere. This is equivalent to a random projection onto the unit sphere after standardizing the data. Both projections depend on Σ, which is usually not known and therefore needs to be estimated. The latter is rather problematic, in particular in high dimensional settings without additional parametric or sparsity assumptions (see Zou et al. (2006), Bickel and Levina (2008) and Fan et al. (2013) including the related discussion, and Cai and Liu (2011) for a case where the assumption of sparsity can be used to facilitate direct estimation of the vector of interest without full covariance estimation). Furthermore, it is actually the inverse that needs to be estimated, which results in additional numerical problems if d is large. For this reason we check the robustness of the procedure with respect to not knowing or misspecifying Σ in the second part of this section.

Correctly scaled projections
In this section we characterize which projection yields the optimal high dimensional efficiency, associated with the highest power, if the covariance matrix Σ is invertible. In Section C.4 in the appendix, we look at the situation where Σ is not invertible.

Proposition 3.3. If Σ is invertible, then (3.5) holds.

Proposition 3.3 shows in particular that, after standardizing the data, i.e. for Σ = I_d, the power depends solely on the cosine of the angle between the oracle and the projection (see Figure 3.2 (a)).
From the representation in this proposition it follows immediately that the 'oracle' choice for the projection to maximize the high dimensional efficiency is o = Σ^{-1} Δ_d, as it maximizes the only term which involves the projection, namely cos(α_{Σ^{-1/2} Δ_d, Σ^{1/2} p_d}). Therefore, we define: Since the projection procedure is invariant under multiplication of the projection vector by non-zero constants, all non-zero multiples of the oracle have the same properties, so that they correspond to a class of projections.
By Proposition 3.3 the oracle choice leads to the maximal high dimensional efficiency. Another way of understanding the oracle projection is the following: if we first standardize the data, then for a projection onto a (w.l.o.g.) unit vector the variance of the noise is constant and the signal is given by the scalar product of Σ^{-1/2} Δ_d and the (unit) projection vector. This is obviously maximized by a projection onto Σ^{-1/2} Δ_d / ‖Σ^{-1/2} Δ_d‖, which is equivalent to using p_d = Σ^{-1} Δ_d as a projection vector for the original non-standardized data.
So, if we know Σ and want to maximize the efficiency, respectively the power, close to a particular search direction s_d of interest, we should use the scaled search direction s_{Σ,d} = Σ^{-1} s_d as a projection.
Because the cosine decays very slowly near an angle of zero, the efficiency will be good if the search direction is not too far from the true change. From this, one could get the impression that even a scaled random projection r_{Σ,d} = Σ^{-1/2} r_d may not do too badly, where r_d is a uniform random projection on the unit sphere. This is equivalent to using a random projection on the unit sphere after standardizing the data, which also explains the different scaling as compared to the oracle or the scaled search direction, where the change Δ_d is also transformed to Σ^{-1/2} Δ_d by the standardization. However, since for increasing d the space covered by the far-away angles is also increasing, the high dimensional efficiency of the scaled random projection is worse than that of the oracle by a factor √d, and worse than that of the universal statistic discussed in Section 2 by a factor d^{1/4}.
The following theorem shows the high dimensional efficiency of the scaled random projection.
Comparing the high dimensional efficiency of the scaled random projection with the one obtained for the oracle projection (confer Proposition 3.3), it becomes apparent that we lose an order √d. The universal panel statistic, which takes the full multivariate information into account, has a high dimensional efficiency between those two, losing d^{1/4} in comparison to the oracle but gaining d^{1/4} in comparison to a scaled random projection. From these results one obtains a cone around the search direction such that the projection statistic has higher power than the universal panel statistic if the true change falls within this cone. Figure 3.2 (b) shows the results of simulations in which a change that can be detected by the oracle with constant power as d increases rapidly loses power for the panel statistic, as predicted by its high dimensional efficiency in Section 2, as well as for the random projection. This and the following simulations show clearly that the concept of high dimensional efficiency is indeed capable of explaining the small sample power of a statistic very well.
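The √d gap between the oracle and a random projection can be seen numerically. The sketch below is illustrative (it assumes standardized data, Σ = I_d, and a dense unit-norm change; the signal-to-noise ratio |p'Δ|/√(p'Σp) is used as a proxy for the per-projection efficiency):

```python
import numpy as np

rng = np.random.default_rng(1)

def snr(p, delta, Sigma):
    """Signal-to-noise ratio |p'Δ| / sqrt(p'Σp) of the projection p."""
    return abs(p @ delta) / np.sqrt(p @ Sigma @ p)

ratios = []
for d in (10, 1000):
    Sigma = np.eye(d)                        # standardized data
    delta = np.ones(d) / np.sqrt(d)          # dense change with norm 1
    oracle = np.linalg.solve(Sigma, delta)   # oracle o = Sigma^{-1} delta
    # average SNR over 200 uniform random projections on the unit sphere
    R = rng.normal(size=(200, d))
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    rand_snr = np.mean([snr(r, delta, Sigma) for r in R])
    ratios.append(snr(oracle, delta, Sigma) / rand_snr)
# the oracle/random ratio grows like sqrt(d), matching the theory
```

For d = 10 versus d = 1000 the ratio grows roughly tenfold, i.e. at the rate √d.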

Misscaled projections with respect to the covariance structure
The analysis in the previous section requires the knowledge or a precise estimate of the inverse (structure) of Σ. However, in many situations such an estimate may not be feasible or may be too imprecise due to one or several of the reasons below, where the problems are exacerbated by the necessity of inversion.
• If d is large in comparison to T, statistical estimation errors can accumulate and identification may not even be possible (Bickel and Levina, 2008).
• The theory can be generalized to time series errors, but in this case the covariance matrix has to be replaced by the long-run covariance (which is proportional to the spectrum at 0) and is much more difficult to estimate (Aston and Kirch, 2012b; Kirch and Tadjuidje Kamgaing, 2012).
• Standard covariance estimators will be inconsistent under alternatives as they are contaminated by the change points. Consequently, possible changes have to be taken into account, but even in a simple at-most-one-change situation it is unclear how best to generalize the standard univariate approach as in (C.6) (as opposed to (C.5)) to a multivariate situation, since the estimation of a joint location already requires an initial weighting for the projection (or the multivariate statistic). Alternatively, component-wise univariate estimation of the change points could be done, but this requires a careful asymptotic analysis, in particular in a setting with d → ∞.
• If d is large, additional numerical errors may arise when inverting the matrix (Higham, 2002, Ch. 14).
We will now investigate the influence of misspecification or estimation errors on the high dimensional efficiency of a misscaled oracle o_M = M^{-1} Δ_d in comparison to the misscaled random projection r_{M,d} = M^{-1/2} r_d, where we only assume that the working covariance structure M is symmetric and positive definite and that Assumption A.1 is fulfilled.
Theorem 3.5. Let the alternative hold, i.e. Δ_d ≠ 0. Let r_d be a random projection on the d-dimensional unit sphere and r_{M,d} = M^{-1/2} r_d be the misscaled random projection. Then, for all ε > 0 there exist constants c, C > 0 such that where tr denotes the trace.
We are now ready to prove the main result of this section stating that the high dimensional efficiency of a misscaled oracle can never be worse than the corresponding misscaled random projection.
where tr denotes the trace and equality holds iff there is only one common factor which is weighted proportionally to Δ_d.

Because it is often assumed that components are independent and it is usually feasible to estimate the variances of each component, we consider the correspondingly misscaled oracles, which are scaled with the identity matrix (pre-oracle) respectively with the diagonal matrix of variances (quasi-oracle). The quasi-oracle is of particular importance as it uses the same type of misspecification as the universal panel statistic discussed in Section 2.

As with the oracle, these projections should be seen as representatives of a class of projections.
The following proposition shows that in the important special case of uncorrelated components, the (quasi-)oracle and the pre-oracle have efficiencies of the same order if the variances of all components are bounded and bounded away from zero. The latter assumption is also needed for the panel statistic below and means that all components are on similar scales. In addition, the efficiency of the quasi-oracle is, even in the misspecified situation, always better than that of an unscaled random projection.

b) Under Assumption A.1, it holds
The next corollary shows that the efficiency of the quasi-oracle (which is scaled with the diagonal analogously to the panel statistic) is always at least as good as the efficiency of the universal panel statistic. Additionally, the efficiency of the universal panel statistic becomes as bad as the efficiency of the corresponding (diagonally) scaled random projection (tolerable efficiency) if A_d/d → A > 0, which is typically the case if the dependency is non-sparse and non-negligible.

Corollary 3.8. Let the assumptions of Lemma 2.2 on the errors be fulfilled, then the following assertions hold:
a) The high dimensional efficiency of the quasi-oracle is always at least as good as the one of the misspecified panel statistic, i.e. with Σ = diag(σ_1^2, …, σ_d^2),

then the high dimensional efficiency of the panel statistic
is as bad as that of a randomly scaled projection.
In particular, for A_d/d → A > 0 the efficiency of the misscaled panel statistic is always as bad as the efficiency of the random projection, while this only holds for the misscaled (quasi-)oracle projection if Δ_d ∼ Φ_d. This effect can be clearly seen in Figures 3.3 and 3.4, where in all cases H&H Sigma refers to the panel statistic using the known variance and H&H Var uses an estimated variance, showing again that this concept of efficiency is very well suited to understanding the small sample power behavior of the corresponding statistics. Additionally, the following assertions are confirmed by the simulations:
1) The power of the pre- and quasi-oracle is always better than the power of the misscaled random projection (the random projection assumes an identity covariance structure).
2) The power of the (correctly scaled) oracle can become as bad as the power of the (misscaled) random projection, but only if Φ_d ∼ Δ_d. In this case the power of the misscaled panel statistic (i.e. where the statistic but not the critical values are constructed under the wrong assumption of independence between components) is equally bad.
3) While the power of the (misscaled) panel statistic becomes as bad as the power of the (misscaled) random projection for φ → ∞ irrespective of the angle between Δ_d and Φ_d, it can be significantly better for the pre- and quasi-oracle.
In fact, we saw above that the high dimensional efficiency of the misspecified panel statistic will be of the same order as that of a random projection for any choice of Φ_d with Φ_d'Φ_d ∼ d, irrespective of the direction of any change that might be present. We will now have a closer look at the three standard examples in order to better understand the behavior in the simulations (Case C.1 is included in the simulations for Φ = 0, while C.3 is the limiting case for Φ → ∞).
Case C.1 (Independent components). If the components are uncorrelated, each with variance σ_i^2, i.e. Σ_1 = diag(σ_1^2, …, σ_d^2), we get which is of order d if 0 < c ≤ σ_j^2 ≤ C < ∞. Proposition 3.7, Theorem 3.4 and Theorem 3.5 show that in this situation the high dimensional efficiencies of both the pre- and the (quasi-)oracle are of an order √d better than those of the correctly scaled and unscaled random projections.
The second case shows that the high dimensional efficiency of misscaled oracles can indeed become as bad as that of a random projection and helps in understanding the mixed case.

Case C.2 (Fully dependent components). As noted in Section C.4 of the appendix, we have to distinguish two cases: (i) If Δ_d is not a multiple of Φ_d, then the power depends on the angle of the projection with Φ_d, with maximal power for an orthogonal projection. So the goodness of the oracles depends on their angle with the vector Φ_d. (ii) If Δ_d is a multiple of Φ_d, the pre- and quasi-oracle are not orthogonal to the change; hence they share the same high dimensional efficiency with any scaled random projection, as all random projections are not orthogonal to Φ_d with probability 1.
We can now turn to the mixed case that is also used in the simulations.
Case C.3 (Mixed case). Let a_j = (0, …, s_j, …, 0)' be the vector whose j-th entry is s_j > 0 and which is zero everywhere else. The high dimensional efficiency of the pre-oracle can become as bad as that of the random projection if the change Δ_d is a multiple of the common factor Φ_d and there is a substantial common effect. This is similar to Case C.2 (which can be seen as a limiting case for increasing Φ_d). Intuitively, the problem is the following: by projecting onto the change, we want to maximize the signal, i.e. the change in the projected sequence, while minimizing the noise. In this situation, however, the common factor dominates the noise in the projection, as it essentially adds up in a linear manner, while the uncorrelated components add up only at the order of √d (CLT). Now, projecting onto Δ_d = Φ_d maximizes not only the signal but also the noise, which is why we cannot gain anything (but this also holds true for other procedures such as the universal panel tests).
A different interpretation is the following: in situation C.3, each component contains a common factor {η_t} weighted according to Φ_d plus some independent noise. If a change occurs in sync with the common factor, it will be difficult to detect: in order to maintain the correct size, we have to allow for the random movements of {η_t}, thus increasing the critical values in that direction. In directions orthogonal to it, we only have to take the independent noise into account, which yields comparably smaller noise in the projection. In an economic setting, this driving factor could, for example, be thought of as an economic factor behind certain companies (e.g. ones in the same industry). If a change occurs in those companies proportionally to this driving factor, it will be difficult to distinguish a different economic state of this driving factor from a mean change that is proportional to the influence of this factor.
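The linear accumulation of the common-factor noise can be illustrated directly. In the hypothetical sketch below, data follow X_t = Φ_d η_t + e_t with loadings Φ_d = (1, …, 1)' (an assumption for illustration): projecting along Φ_d inflates the noise variance to order d, while a projection orthogonal to Φ_d keeps it at order one.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 500, 200
phi = np.ones(d)                         # common-factor loadings Phi_d
eta = rng.normal(size=T)                 # common factor eta_t
X = np.outer(eta, phi) + rng.normal(size=(T, d))   # X_t = Phi_d eta_t + e_t

u = phi / np.linalg.norm(phi)            # unit projection along Phi_d
var_along = (X @ u).var(ddof=1)          # about d * var(eta) + 1: grows with d

q = np.zeros(d); q[0], q[1] = 1.0, -1.0
q /= np.linalg.norm(q)                   # unit projection orthogonal to Phi_d
var_orth = (X @ q).var(ddof=1)           # about 1: the factor cancels
```

A change proportional to Φ_d thus competes against noise of order d, whereas any orthogonal component of a change only competes against noise of order one.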
A mathematical analysis is given in Section C.5 in the appendix.

Data driven projections and high dimensional efficiency of some sparse change point tests
When using data-driven projections one has to be very careful as the projection will typically have an effect on the null asymptotic of the projection test requiring larger critical values. The reason is that in high-dimensional settings there are always directions in which the CUSUM statistic of the projected time series will become very large by chance. This effect can be made smaller by requiring additional assumptions such as sparsity.
In fact, most current change point tests for high dimensional data assume sparsity of the change point (Jirak, 2015; Cho and Fryzlewicz, 2015; Wang and Samworth, 2016), as well as possibly likelihood-based considerations (Chan and Walther, 2015). Some of these tests are effectively based on projections. For example, Jirak (2015) uses the maximum (resulting in an extreme-value behavior) of all projections on unit vectors consisting of all zeros except for a single one. Cho and Fryzlewicz (2015) use thresholding, which can also be viewed as a data-driven projection into a lower dimensional space. Most notably, Wang and Samworth (2016) use a data-driven projection based on a sparse singular value decomposition of the high-dimensional CUSUM matrix. Due to the sparseness assumption, the noise level of the projection can be kept at bay, which is no longer the case if an unconstrained singular value decomposition is used.
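For intuition, a coordinate-wise max statistic in the spirit of Jirak (2015) can be sketched as follows (an illustrative simplification, not that paper's exact statistic: each component is a projection onto a canonical unit vector, and the maximum over the d resulting CUSUM statistics calls for extreme-value-type critical values):

```python
import numpy as np

def coordinatewise_max_cusum(X):
    """Maximum over components of standardized univariate CUSUM maxima.

    Each component corresponds to a projection onto a canonical unit
    vector; the maximum over all d components requires extreme-value
    critical values, as noted in the text.
    """
    T, d = X.shape
    S = np.cumsum(X - X.mean(axis=0), axis=0)[:-1]   # (T-1, d) partial sums
    tau = X.std(axis=0, ddof=1)                      # componentwise scale
    return (np.abs(S) / (np.sqrt(T) * tau)).max()

rng = np.random.default_rng(4)
T, d = 200, 100
X = rng.normal(size=(T, d))
stat_null = coordinatewise_max_cusum(X)    # no change anywhere
X[T // 2:, 7] += 2.0                       # sparse change in one component
stat_alt = coordinatewise_max_cusum(X)
```

A sparse change in a single component is picked up clearly, which is exactly the regime in which such statistics achieve near-oracle efficiency.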
Furthermore, using a preprint version of this paper, Cho (2015) derived the high dimensional efficiency for a number of tests, including the tests by Jirak (2015), Enikeeva and Harchaoui (2013), Cho and Fryzlewicz (2015), as well as the double CUSUM statistic introduced in that paper (see Table 1 in Cho (2015)). It turns out that the tests by Jirak (2015) and Cho and Fryzlewicz (2015), as well as the scan statistic by Enikeeva and Harchaoui (2013), achieve oracle efficiency (up to a log term) for sparse changes but only tolerable efficiency for balanced changes. The linear test statistic introduced by Enikeeva and Harchaoui (2013) has the same power behavior as the universal panel test statistic discussed in this paper. The efficiency of the double CUSUM statistic introduced in Cho (2015) depends on the number of components with a mean change in addition to a parameter choice of the statistic. Depending on the combination of this parameter choice and the number of contaminated components, it can achieve both oracle efficiency and tolerable efficiency.
This discussion shows that considering the high dimensional efficiency reveals for which change alternatives a given test has particularly good power, and at what cost this comes with respect to other changes.

Data example
As an illustrative example showing that the small sample behaviour of the statistics illustrated above also applies to real data, we examine the stability of change points detected by different methods in several world stock market indices. More specifically, the Fuller Log Squared Returns (Fuller, 1996, p. 496) of the FTSE, NASDAQ, DAX, NIKKEI, Hang Seng and CAC indices for the year 2015 were examined for change points. Tests based on the multivariate statistics using full covariance estimates, a multivariate statistic using only variance estimates (i.e., a diagonal covariance structure), a projection statistic in the average direction (1, 1, 1, 1, 1, 1)', and a projection statistic in the direction of European countries vs non-European countries (1, −1, 1, −1, −1, 1)' (orthogonal to the average direction) were carried out. Given the considerable dependence between the different components, we would expect economies to likely rise and fall together, justifying the use of the former projection direction. However, we think it unlikely that there will be changes of the kind that when European markets go up, non-European markets go down, and vice versa, so we take this projection as an example of a direction where no change is likely.

(Figure 4.1: red vertical lines indicate changes deemed to be significant at the 5% level.)

It should be noted at this point that the multivariate statistic treats both of these alternatives as equally likely. As there are possibly multiple change points in this data, we examine stability by performing binary segmentation using the proposed tests, firstly on data from January to November 2015, and then subsequently adding in the data from December 2015.
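The binary segmentation used here can be sketched generically. The following illustrative implementation works on a projected (univariate) series and uses an assumed asymptotic 5% critical value of 1.36 for the standardized CUSUM statistic; the function names and the minimum segment length are choices of this sketch, not of the paper:

```python
import numpy as np

def cusum_max(y):
    """Standardized CUSUM maximum and its argmax for a univariate series."""
    T = len(y)
    S = np.cumsum(y - y.mean())[:-1]
    k = int(np.argmax(np.abs(S))) + 1
    stat = np.abs(S[k - 1]) / (np.sqrt(T) * y.std(ddof=1))
    return stat, k

def binary_segmentation(y, threshold, offset=0, min_len=10):
    """Recursively split a (projected) series at significant CUSUM maxima."""
    if len(y) < 2 * min_len:
        return []
    stat, k = cusum_max(y)
    if stat <= threshold:
        return []
    return (binary_segmentation(y[:k], threshold, offset, min_len)
            + [offset + k]
            + binary_segmentation(y[k:], threshold, offset + k, min_len))

rng = np.random.default_rng(5)
y = rng.normal(size=300)
y[100:200] += 3.0                     # an epidemic-type pair of changes
cps = binary_segmentation(y, threshold=1.36)
```

On this simulated series the procedure recovers both change locations, illustrating the segmentation strategy applied to the projected index data.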
As can be seen in Figure 4.1, the multivariate test statistic is considerably less robust than the average projection based statistic, both to the length of the data and to the choice of the covariance estimate. The major cause of this instability was that the CUSUM statistic over time had two peaks, but the location of the maximal peak differed from one to the other when further data was added. This caused knock-on effects in the entire binary segmentation. Here, in all cases, independence in time was assumed, as once the changes were accounted for there was little evidence of temporal dependence in the data. However, even if time series dependence is accounted for by using an estimate of the long run covariance in place of the independent covariance estimate, there is no difference in the qualitative conclusions (although the change points themselves varied considerably in all cases depending on the parameters chosen in the long run covariance estimation procedure (Politis, 2005)). In addition, the projection estimate was robust to whether the direction was scaled by the full covariance, the diagonal of the covariance, or not scaled at all, as well as to increasing the length of the data. The p-values for the changes on the full year's data are given in Table 4.1. While it can be seen that the projection p-values are larger for the two common change points than in the multivariate case, the same changes are detected with all methods. However, additional changes are found with the projection method, and the p-values are well below the critical value of 5%. This shows that having knowledge of the likely direction of change can allow further changes to be found beyond those in an unrestricted multivariate search. As expected, though, using an unlikely direction does not find change points, with the hypothesis that there are no changes which affect European markets differently to non-European markets being accepted (p = 0.18).

Conclusions
The primary aim of this paper was to introduce a theoretic method for comparing the small sample behavior of different high dimensional tests by asymptotic means. The new concept of high dimensional efficiency allows a comparison of the magnitude of changes that can be detected asymptotically as the number of dimensions increases. Both the simulations and the data example confirm the assertions obtained from this theoretic concept, indicating that it is in fact a useful tool for analysing high dimensional tests.
As benchmarks, projection tests were investigated, including an oracle projection as an upper benchmark and a random projection as a lower benchmark.
In summary, the following assertions were obtained: The panel statistic (Bai, 2010) works well in situations where the panels are independent across components, in particular if there is little to no information about the direction or properties of the change, such as whether it is sparse or balanced. However, as soon as dependency is present, the size properties of these statistics become problematic and their high dimensional efficiencies mimic those of random projections. Unfortunately, this problem cannot be solved even if the covariance structure is completely known, except under normality assumptions. Misspecification of the covariance structure can be problematic for all tests, even projection tests with the correct change structure. Nevertheless, misscaled oracle tests (if accessible) are preferable to the misscaled panel statistic.
An investigation by Cho (2015), based on a preprint version of the present paper, indicates that change point tests constructed for sparse alternatives will achieve approximately oracle power if the sparseness assumption is correct. However, they will achieve only tolerable power if the change is balanced (i.e. not sparse), so that both benchmarks are in fact important for understanding the power behavior of recent change point tests in high dimensional settings.
The results in this paper raise many questions for future work. It would be of considerable interest to determine whether projections can be derived using data driven techniques, such as sparse PCA, for example, and whether such projections would be better than random projections. Preliminary work suggests that this may be so in some situations but not others, and a nice procedure by Wang and Samworth (2016) investigates a related technique.
While it is very unlikely that data-driven methods will be able to improve upon the behavior of the panel statistic without additional structural assumptions on the change, the question remains whether one can get close to the universal panel statistic's power properties while at the same time being more robust with respect to size. However, the framework here allows this question to be rigorously posed, and different approaches to be compared.

Appendix A: Comparing the power of two univariate CUSUM tests
We will illustrate how the concept of 'high dimensional' efficiency can be used even if very different asymptotics are involved. To this end we consider the following univariate change point setup with E e_1 = 0, var e_1 = 1 (merely for simplicity) and E|e_1|^ν < ∞ for some ν > 0. For k*_T = θT we have the (univariate) AMOC situation from the present paper, but here we allow for arbitrary changes k*_T. The goal is now to compare the power behavior of the following two CUSUM statistics. Both statistics are well known in the change point community and are very often accompanied by statements such as 'statistic M_2 detects early and late changes better, while statistic M_1 detects changes in the middle of the observation period better'.
We will now demonstrate that the use of efficiency as defined in the present paper helps to make this statement precise. To this end, we adapt Definition 2.1 slightly by considering E(k*_T, δ_T), which will now depend on T, k*_T and δ_T (and obviously drop the assumption d → ∞ as we consider the univariate case d = 1). Furthermore, we identify the efficiency of M_2 with the one of M̃_2 (as they yield the same test), defined by the corresponding limit under H_0, where G_2 has a Gumbel extreme value distribution, i.e. P(G_2 ≤ x) = exp(−2 exp(−x)) (see e.g. the book by Csörgő and Horváth (1997)). The statistic M_1, on the other hand, has the following standard null asymptotics (which follow immediately from the functional central limit theorem). Consequently, (i) of Definition 2.1 is fulfilled, but with very different limit distributions (and in fact a very different limit behavior).
We will now show that the following efficiencies hold. To see this, note that Assumption (ii) of Definition 2.1 with E_{M_1} and E_{M̃_2} as in (A.1) follows from this decomposition by using the partial sum process at k = k* as a lower bound for the statistics. On the other hand, Assumption (iii) follows from this decomposition because, uniformly in k, From (A.1) we can see that the efficiency of M_1 is an order √(log log T) better than the efficiency of M_2 if k* = λT, but a lot worse for early and late changes such as k* = log T or k* = T − log T.
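The comparison can be made concrete numerically. The sketch below assumes the standard forms M_1 = max_k |S_k|/√T and M_2 = max_k |S_k|/√(T·(k/T)(1−k/T)) for centred partial sums S_k with unit variance; these particular normalizations are an assumption of the sketch, not a quotation of the paper's formulas:

```python
import numpy as np

def cusum_stats(y):
    """Unweighted (M1) and standardized (M2) CUSUM maxima of a series."""
    T = len(y)
    k = np.arange(1, T)
    S = np.cumsum(y - y.mean())[:-1]          # centred partial sums
    M1 = np.abs(S).max() / np.sqrt(T)
    M2 = (np.abs(S) / np.sqrt(T * (k / T) * (1 - k / T))).max()
    return M1, M2

rng = np.random.default_rng(6)
T = 500
y_mid = rng.normal(size=T); y_mid[T // 2:] += 1.0    # change in the middle
y_late = rng.normal(size=T); y_late[T - 15:] += 3.0  # very late change

M1_mid, M2_mid = cusum_stats(y_mid)
M1_late, M2_late = cusum_stats(y_late)
# relative to M1, the standardized M2 reacts much more strongly to the
# late change, consistent with the folklore statement quoted above
```

The weighting in M_2 boosts partial sums near the boundaries, which is precisely where M_1 pays the price of its uniform normalization.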
where w_0 is continuous (which can be relaxed) and fulfills (C.7) (confer e.g. the book by Csörgő and Horváth (1997)). The choice of weight function w(·) can increase power for certain locations of the change points.
For the epidemic change, typical test statistics are given by

In the next section we first derive a functional central limit theorem for the process U_{d,T}(x), which implies the asymptotic null behavior of the above tests. Then, we derive the asymptotic behavior of the partial sum process under contiguous alternatives to obtain the high dimensional efficiency for projection statistics.
Similarly, a multivariate change point statistic (using the full multivariate information and no additional knowledge about the change) for the at-most-one mean change is given as a weighted maximum or sum of the following quadratic form, where Z_T(x) = (Z_{T,1}(x), …, Z_{T,d}(x))' is defined as in (2.3). The usual choice is A = Σ^{-1}, where Σ is the covariance matrix of the multivariate observations. The weighting with Σ^{-1} has the advantages that (a) it leads to a pivotal limit and (b) the statistic can detect all changes no matter what the direction. The second remains true for any positive definite matrix A; the first also remains true for lower-rank matrices with a decorrelation property of the errors, where this latter approach is essentially a projection (into a lower-dimensional space) as discussed in the previous sections. For an extensive discussion of this issue for the example of changes in the autoregressive structure of time series we refer to . The choice A = Σ^{-1} corresponds to the correctly scaled case, while the misscaled case corresponds to the choice A = M^{-1}. However, this multivariate setup is not very suitable for the theoretic power comparison we are interested in, because the limit distribution (a sum of d squared Brownian bridges with covariance matrix Σ^{1/2} A Σ^{1/2}) still depends on d as well as on the possible misspecification. Therefore, a comparison needs to take the rates, the additive term and the noise level (which also depends on the misspecification of the covariance) in the limit distribution into account. The panel data setting, on the other hand, allows for an analysis by means of the high dimensional efficiency as introduced in this paper. Furthermore, the panel statistic is strongly related to the multivariate statistic, so that the same qualitative statements can be expected, which is confirmed by simulations (results not shown).

C.2. Null asymptotics
In this section, we give some additional insights for projection statistics under the null hypothesis.
(C.3) is always fulfilled for the multivariate situation with d fixed, or if d grows sufficiently slowly with respect to T, as the left-hand side of (C.3) is always bounded by √d if p_d' cov(e) p_d / ‖p_d‖^2 is bounded away from zero. Otherwise, the assumption may hold for certain projections but not others. However, in this case it is possible to put stronger assumptions on the error sequence, such as in a), which are still much weaker than the usual assumption for panel data that components are independent.
The following lemma shows that the two different estimators for τ(p_d) below are both consistent under the null hypothesis. The second one is typically still consistent in the presence of one mean change, which usually leads to a power improvement of the test in small samples. An analogous version can be defined for the epidemic change situation. However, it is much harder to obtain an equivalent correction in the multivariate setting, because the covariance matrix determines how the different components are weighted, which in turn has an effect on the location of the maximum. This problem does not arise in the univariate situation, because there the location of the maximum does not depend on the variance estimate.
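The two estimators can be sketched as follows (an illustrative univariate version, assuming the first estimator is the plain variance of the projected series and the second subtracts segment means at the CUSUM-estimated change location, as is standard in the AMOC setting; the exact forms (C.5) and (C.6) are in the appendix):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
y = rng.normal(size=T)                 # projected series, true tau^2 = 1
y[T // 2:] += 3.0                      # one mean change of size 3

tau2_plain = y.var(ddof=1)             # contaminated by the change

# change-adjusted estimator: demean each segment at the CUSUM argmax
S = np.cumsum(y - y.mean())
k_hat = int(np.argmax(np.abs(S[:-1]))) + 1
resid = np.concatenate([y[:k_hat] - y[:k_hat].mean(),
                        y[k_hat:] - y[k_hat:].mean()])
tau2_adj = resid.var(ddof=1)           # close to the true value 1
```

The plain estimator is inflated by the squared change magnitude, which deflates the test statistic under the alternative; the change-adjusted version avoids this, explaining the small-sample power improvement noted above.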

Lemma C.2. Consider
as well as where $k_{d,T} = \arg\max$ under the assumption, while estimator (C.6) fulfills it under the assumption. The following theorem gives the null asymptotics for the simple CUSUM statistic for the at most one change; other statistics as given in Section C.1 can be dealt with along the same lines.
for any continuous weight function w(·) with

C.3. Consistency of the AMOC change point estimator
The following corollary shows that the location of the maximum is a consistent estimator for the change point in rescaled time in the at-most-one-change situation.
Corollary C.4. Let the assumptions of Theorem 3.2 hold and additionally: Under the alternative of one abrupt change, i.e. $g(x) = 1_{\{x > \vartheta\}}$ for some $0 < \vartheta < 1$, the estimator is consistent for the change point in rescaled time.
An analogous statement holds if $w^2(k/T)\, U_2^2$ has a unique maximum at $\vartheta$, which is the case for many standard weight functions such as $w(t) = (t(1-t))^{-\beta}$ for some $0 \le \beta < 1/2$.
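A minimal sketch of this AMOC estimator (our own illustration; function name and interface are not from the paper) with the weight family $w(t) = (t(1-t))^{-\beta}$:

```python
import numpy as np

def change_point_argmax(y, beta=0.0):
    """AMOC change-point estimator: arg max over k of w(k/T)^2 * CUSUM(k)^2
    with weight w(t) = (t(1-t))^{-beta}, 0 <= beta < 1/2.
    Returns the estimated change point in rescaled time, i.e. k/T."""
    T = len(y)
    k = np.arange(1, T)                    # interior time points only
    x = k / T
    csum = np.cumsum(y - y.mean())[:-1]    # CUSUM process at k = 1, ..., T-1
    stat = (x * (1 - x)) ** (-2 * beta) * csum ** 2
    return k[np.argmax(stat)] / T
```

For a clear mean shift the estimate lands near the true rescaled location for any $\beta$ in the admissible range.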

C.4. The oracle in the case of non-invertibility
Let us now consider the situation where $\Sigma$ is not invertible, so that the above oracle does not exist. To this end, consider Case C.2 above; other non-invertible dependent situations can essentially be viewed in a very similar fashion but become a combination of the two scenarios below.
Case C.2 (Fully dependent components). In this case $\Sigma = \Phi_d \Phi_d^T$ is a rank-one matrix and not invertible. Consequently, the oracle as in Definition 3.1 does not exist. To understand the situation better, we have to distinguish two scenarios: (i) If $\Phi_d$ is not a multiple of $\Delta_d$, we can transform the data into a noise-free sequence that only contains the signal by projecting onto a vector that is orthogonal to $\Phi_d$ (cancelling the noise term) but not to $\Delta_d$. All such projections are in principle equivalent, as they yield the same signal up to a different scaling, which is not important if there is no noise present. Consequently, all such transformations could be called oracle projections. (ii) If, on the other hand, $\Delta_d$ is a multiple of $\Phi_d$, then any projection cancelling the noise will also cancel the signal. Projections that are orthogonal to $\Phi_d$, hence by definition also to $\Delta_d$, lead to a constant deterministic sequence and thus to a degenerate situation. All other projections lead to the same (non-degenerate) time series up to multiplicative constants and different means (under which the proposed change point statistics are invariant by definition), so all of them could be called oracles.
The following interpretation also explains the above mathematical findings: In this situation, all components are obtained from one common factor $\{\eta_t\}$ with different weights according to $\Phi_d$, i.e. they move in sync with those weights. If a change is proportional to $\Phi_d$, it could be attributed either to the noise coming from $\{\eta_t\}$ or to an actual change, so it will be difficult to detect: we are essentially back in a duplicated rank-one situation and no additional information about the change can be obtained from the multivariate observations. However, if the change is not proportional to $\Phi_d$, then it is immediately clear (with probability one) that a change in mean must have occurred, as the underlying time series no longer move in sync. This can be seen to some extent in Figure 3.3, where the different panels in the figure mimic the different scenarios as outlined above (with a large value of $\varphi$ being close to the non-invertible situation).
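Scenario (i) can be made concrete with a small toy simulation (our own construction, not from the paper): when $\Sigma = \Phi_d \Phi_d^T$, projecting onto any vector orthogonal to $\Phi_d$ but not to $\Delta_d$ removes the noise exactly, leaving only the change signal.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
phi = np.array([1.0, 2.0, -1.0, 0.5])             # factor loadings Phi_d
delta = np.array([1.0, 0.0, 0.0, 0.0])            # change direction, not a multiple of phi

eta = rng.standard_normal(T)                       # common factor {eta_t}
signal = np.outer(np.arange(T) >= T // 2, delta)   # mean change after T/2
X = np.outer(eta, phi) + signal                    # fully dependent components, Sigma = phi phi'

# Any vector orthogonal to phi but not to delta cancels the noise entirely:
p = np.array([2.0, -1.0, 0.0, 0.0])                # p . phi = 0, p . delta = 2 != 0
y = X @ p                                          # exactly noise-free projected series
```

The projected series is deterministic: zero before the change and constant afterwards, so the change is detectable with probability one.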

C.5. Misscaled projections for the mixed case
In this section we derive some mathematical theory for the mixed Case C.3 under misspecification, explaining the intuition and the simulation results already given in Section 3.3.2.
If additionally $\Delta_d = k\Phi_d$ for some $k > 0$, we get the following high dimensional efficiency for the pre-oracle by (3.4). The high dimensional efficiency for the unscaled random projection is given by (confer Theorem 3.5 and (3.6)). As soon as $s_j$ and $\Phi_j$ are of the same order, i.e. $0 < c \le s_j, \Phi_j \le C < \infty$ for all $j$, the pre-oracle behaves as badly as the unscaled random projection. The same holds for the quasi-oracle under the same assumptions. Interestingly, however, in this particular situation even the oracle is of the same order as the random projection if the $s_j$ are of the same order, i.e. $0 < c \le s_j \le C < \infty$. More precisely we get (for a proof we refer to Section D). On the other hand, if $\Delta_d$ is orthogonal to $\Phi_d$, then the noise from $\Phi_d$ cancels for the pre-oracle projection and we get a rate of the order $\|\Delta_d\|^2$ if the $s_j$ are all of the same order. Anything between those two cases is possible, depending on the angle between $\Delta_d$ and $\Phi_d$ (again see Figures 3.3 and 3.4 for finite sample simulations).
Similar assertions can be obtained along the same lines for $\max_{1 \le k \le T} \frac{1}{T} \sum_{j=1}^{k} p_d^T e_j(d)$ as well as $\max_{1 \le k \le T} \frac{1}{T} \sum_{j=k+1}^{T} p_d^T e_j(d)$, which imply the assertion for $\hat\tau^2_{2,d,T}(p_d)$.
Proof of Corollary C.3. By an application of the continuous mapping theorem and Theorem 3.1 we get the assertions for the truncated maxima resp. the sums over $[\tau T, (1-\tau) T]$ for any $\tau > 0$ towards equivalently truncated limit distributions. Because we assume independence across time (with existing second moments), the Hájek-Rényi inequality yields for all $\epsilon > 0$ that the probability of the corresponding maximum exceeding $\epsilon$ vanishes as $\tau \to 0$ uniformly in $T$, where the notation of the proof of Theorem 3.1 has been used. This, in addition to an equivalent argument for the limit process, shows that the truncation is asymptotically negligible, proving the desired results.
Proof of Theorem 3.2. We consider the situation where $\sqrt{T}\, E_3(\Delta_d, p_d)$ converges a.s. Under alternatives it holds, where $U_{d,T}(x; e)$ is the corresponding functional of the error process. By Theorem 3.1 it holds. Furthermore, by the Riemann integrability of $g(\cdot)$ it follows. For any $\tau > 0$, $\max_{\tau \le k/T \le 1-\tau}$ almost surely, where $P_{p_d}$ denotes the conditional probability given $p_d$. By assumption $\sup_{\tau \le x \le 1-\tau} w^2(x) \big( \int_0^x g(t)\,dt - x \int_0^1 g(t)\,dt \big)^2 > 0$ for some $\tau > 0$, so that the above term becomes asymptotically unbounded. This gives the assertion for the max statistics; similar arguments give the assertion for the sum statistic.
Proof of Corollary C.4. Similarly to the proof of Theorem 3.2 it follows (where the uniformity at 0 and 1 follows by the assumptions on the rate of divergence of $w(\cdot)$ at 0 and 1), which implies the assertion by standard arguments on noting that
Proof of Proposition 3.3. The assertion follows from
Proof of Theorem 3.4. Let $X_d = (X_1, \dots, X_d)^T$ be $N(0, I_d)$; then by Marsaglia (1972) it holds that $r_d \overset{\mathcal{L}}{=} (X_1, \dots, X_d)^T / \|(X_1, \dots, X_d)\|$ and it follows by (3.4). Since the numerator has a $\chi^2_1$ distribution (not depending on $d$), there exist for any $\epsilon > 0$ constants $0 < c_1 < C_1 < \infty$ such that. Furthermore, the denominator is a $\chi^2_d$-distributed random variable divided by its expectation; consequently an application of the Markov inequality yields for any $\epsilon > 0$ the existence of $0 < C_2 < \infty$ such that. By integration by parts we get $E\, X_d\, X_d^{-1}\; 2/d$ for $d \ge 3$, so that another application of the Markov inequality yields that for any $\epsilon > 0$ there exists $c_2 > 0$ such that, completing the proof of the theorem by standard arguments.