Inference For High-Dimensional Split-Plot-Designs: A Unified Approach for Small to Large Numbers of Factor Levels

Statisticians increasingly face the problem to reconsider the adaptability of classical inference techniques. In particular, divers types of high-dimensional data structures are observed in various research areas; disclosing the boundaries of conventional multivariate data analysis. Such situations occur, e.g., frequently in life sciences whenever it is easier or cheaper to repeatedly generate a large number $d$ of observations per subject than recruiting many, say $N$, subjects. In this paper we discuss inference procedures for such situations in general heteroscedastic split-plot designs with $a$ independent groups of repeated measurements. These will, e.g., be able to answer questions about the occurrence of certain time, group and interactions effects or about particular profiles. The test procedures are based on standardized quadratic forms involving suitably symmetrized U-statistics-type estimators which are robust against an increasing number of dimensions $d$ and/or groups $a$. We then discuss its limit distributions in a general asymptotic framework and additionally propose improved small sample approximations. Finally its small sample performance is investigated in simulations and the applicability is illustrated by a real data analysis.


Introduction
In our current century of data, statisticians increasingly face the problem to reconsider the adaptability of classical inferential techniques. In particular, divers types of high-dimensional data structures are observed in various research areas; disclosing the boundaries of conventional multivariate data analysis. Here, the curse of high dimensionality or the large d small N problem is especially encountered in life sciences whenever it is easier (or cheaper) to repeatedly generate a large number d of observations per subject than recruiting many, say N, subjects. Similar observations can be made in industrial sciences with subjects replaced by units. Such designs, where experimental units are repeatedly observed under different conditions or at different time points, are called repeated measures designs or (if two or more groups are observed) split-plot designs. In these trials, one likes to answer questions about the occurrence of certain group or time effects or about particular profiles. Conventionally, for d < N, corresponding null hypotheses are inferred with Hotelling's T 2 (one or two sample case) or Wilks's Λ, see e.g. [13][ Section 4.3] or [21] [Section 6.8]. Besides normality, these procedures heavily rely on the assumption of equal covariance matrices and particularly break down in high-dimensional settings with N < d. While there exist several promising approaches to adequately deal with the problem of covariance heterogeneity in the classical case with d < N (see e.g. [6,16,17,20,27,37,1,24,9,32,35,26,18,15]) most procedures for high-dimensional repeated measures designs rely on certain sparsity conditions (see e.g. [2,11,23,30,34,10,19] and the references cited therein). In particular, in an asymptotic (d, N) → ∞ framework, typical assumptions restrict the way the sample size N and/or various powers of traces of the underlying covariances increase with respect to d. These type of sparsity conditions guarantee central limit theorems that lead to approximations of underlying test statistics by a fixed limit distribution. However, as illustrated in [31] for one-sample repeated measures these conditions can in general not be regarded as regularity assumptions. In particular, they may even fail for classical covariance structures. To this end, the authors proposed a novel approximation technique that showed considerably accurate results and investigated its asymptotic behavior in a flexible and non-restrictive (d, N) → ∞ framework. Here, no assumptions regarding the dependence between d and N or the covariance matrix were made. In the current paper, we follow this approach and extend the results of [31] to general heteroscedastic split-plot designs with a independent groups of repeated measurements. To even allow for a large number of groups as in [3,4] or [39], we do not only consider the case with a fixed number a ∈ N of samples but additionally allow for situations with a → ∞. The latter case is of particular interest if most groups are rather small (as in screening trials) such that a classical test would essentially possess no power for fixed a. Here increasing the number of groups implies increasing the total sample size from which a power increase might be expected as well. This leads to one of the following asymptotic frameworks which we handle simultaneously in the sequel. For all considerations, the adequate and dimensionstable estimation of traces of certain powers of combined covariances turned out to be a major problem. It is tackled by introducing novel symmetrized estimates of U-statistics-type which possess nice asymptotic properties under all asymptotic frameworks given above.
The paper is organized as follows. The statistical model together with the considered hypotheses of interest are introduced in Section 2. The test statistic and its asymptotic behavior is investigated in Section 3, where also novel dimension-stable trace estimators are introduced.
Additional approximations for small sample sizes are theoretically discussed in Section 4 and their performance is studied in simulations in Section 5. Afterwards, the new methods will be applied to analyze a high-dimensional data set from a sleep-laboratory trial in Section 6. The paper closes with a discussion and an outlook. All proofs in this paper are shifted to the supplementary material.

Statistical Model and Hypotheses
We consider a split-plot design given by a independent groups of d-dimensional random vectors with mean vectors E(X i,1 ) = µ i = (µ i,t ) d t=1 ∈ R d and positive definite covariance matrices Cov(X i,1 ) = Σ i . Here j = 1, . . . , n i denotes the individual subjects or units in group i = 1, . . . , a, a, n i ∈ N, where no specific structure of the group-specific covariance matrices Σ i is assumed. In particular, they are even allowed to differ completely. Altogether we have a total number of N = a i=1 n i random vectors representing observations from independent subjects. Within this framework, a factorial structure on the factors group or time can be incorporated by splitting up indices. Also, a group-specific random subject effect can be incorporated as outlined in [31][Equation H ab 0 : (P a ⊗ P d ) µ = 0, where J d is the d-dimensional matrix only containing 1s and P d := I d − 1/d · J d is the centring matrix. For interpretational purposes it is sometimes helpful to decompose the componentwise means as µ i,t = µ + α i + β t + (αβ) it , i = 1, . . . , a; t = 1, . . . , d, where α i ∈ R represents the i-th group effect, β t ∈ R the time effect at time point t and (αβ) it ∈ R the (i, t)-interaction effect between group and time with the usual side conditions i α i = t β t = i,t (αβ) it = 0. With this notation the above null hypothesis can be rewritten as (a) H a 0 : α i ≡ 0 for all i, (b) H b 0 : β t ≡ 0 for all t and (c) H ab 0 : (αβ) it ≡ 0 for all i, t, respectively. These and other hypotheses will be utilized in the data analysis Section 6.

The Test Statistic and its Asymptotics
We derive appropriate inference procedures for H 0 (T ) and analyze their asymptotic properties under the following asymptotic frameworks a ∈ N fixed and min(d, n 1 , . . . , n a ) → ∞, d ∈ N fixed and min(a, n 1 , . . . , n a ) → ∞, or min(a, d, n 1 , . . . , n a ) → ∞, as N → ∞. Here, no dependency on how the dimension d = d(N) in (3) and (5) or the number of groups a = a(N) in (4)-(5) converges to infinity with respect to the sample sizes n i and N is postulated. In particular, we cover high-dimensional (d > n i or even d > N) as well as lowdimensional settings. For a lucid presentation of subsequent results and proofs we additionally assume throughout that However, by turning to convergent subsequences, all results can be shown to hold under the more general condition 0 < lim inf n i /N lim sup n i /N < 1, (i = 1, . . . , a).
It is convenient to measure deviations from the null hypothesis H 0 (T ) : T µ = 0 by means of the quadratic form where X = (X 1 , . . . X a ) with X i = n −1 i n i j=1 X i,j , i = 1, . . . , a, denotes the vector of pooled group means. Since Q N is in general asymptotically degenerated under (3)-(5) we study its standardized version. To this end, note that under the null hypothesis it holds that due to assumption (1). Thus, it follows from classical theorems about moments of quadratic forms, see e.g. [29] or Theorem A.4 in the supplement, that its mean and variance under the null hypothesis can be expressed as Henceworth we investigate the asymptotic behaviour (under H 0 (T )) of the standardized quadratic form N n i Σ i the inversely weighted combined covariance matrix the representation theorem for quadratic forms [29][p.90], implies that Here ' D =' denotes equality in distribution, λ s are the eigenvalues of T V N T in decreasing order, and (C s ) s is a sequence of independent χ 2 1 -distributed random variables. Note, that the eigenvalues λ s also depend on the dimension d and the sample sizes n i . Transferring the results of [31] for the one-group design with a = 1 to our general setting, we obtain the subsequent asymptotic null distributions of the standardized quadratic form for all asymptotic settings (3)-(5).
It is worth to note that the influence of the different asymptotic frameworks are hidden in the corresponding conditions on the sequence of standardized eigenvalues (β s ) s , which depend on both, a and d.
Since these quantities are unknown in general we cannot apply the result directly. In particular, we are not even able to calculate the test statistic W N , not to mention to choose its correct limit distribution. To this end, we first introduce novel unbiased estimates of the unknown traces involved in (8)- (9) and discuss their mathematical properties. Plugging them into (8)- (9) leads to the calculation of adequately standardized test statistics. Finally, the choice of proper critical values is discussed in Section 4.

Symmetrized Trace Estimators
Here we derive unbiased and ratio-consistent estimates for the unknown traces tr (T S Σ i ) , tr ((T S Σ i ) 2 ) and tr (T S Σ i T S Σ r ) , i = r, given in (8)- (9). Since it is not obvious that the usual plug-in estimates that are based on empirical covariance matrices are useful in high-dimensional settings we follow the approach of [8,31] and directly estimate the traces. Different, to the one-sample design studied therein we face the problem of additional nuisance parameters -the mean vectors µ i . To avoid their estimation we adopt Tyler's symmetrization trick from M-estimates of scatter (see e.g. [12], [14] or [36]) to the present situation, see also [7]. In particular, we consider differences of observation pairs ( 1 , 2 ), 1 = 2 , from the same group which fulfill and introduce the following novel estimators for i = 1, . . . , a : Here and throughout the paper expressions of the kind a = b = c mean that the indices are pairwise different. In this sense all estimators (11)- (14) are symmetrized U-statistics, where the kernel is given by a specific quadratic or bilinear form. Their properties are analyzed below.

Lemma 3.1:
For any µ ∈ R ad and i = 1, . . . , a it holds that is an unbiased and ratio-consistent estimator for E H 0 (Q N ).

2.
A 4 is an unbiased and ratio-consistent estimator for tr (T V N ) 2 .

3.
A i,1 , A i,r,2 and A i,3 are unbiased and ratio-consistent estimators for tr ( and tr (T S Σ i ) 2 , respectively. Remark 3.1: (a) Recall that an R-valued estimator θ N is ratio-consistent for a sequence of real parameters θ N iff θ N /θ N → 1 in probability as N → ∞. Here the estimators and parameters may depend on a = a(N) and/or d = d(N).
(b) Studying the proof of Lemma 3.1 given in the supplementary material in detail, we see that all estimators are even (dimension-)stable in the sense of [8], i.e. they fulfill |E( θ N /θ N − 1)| b N and Var( θ N /θ N ) c N for sequences b N , c N ↓ 0 not depending on a and d.
It follows from Lemma 3.1 that is an unbiased estimator of Var H 0 (Q N ). This motivates to study the standardized quadratic form for testing H 0 (T ). Its asymptotic behaviour under H 0 (T ) : T µ = 0 ad is summarized below.

Theorem 3.2:
Under H 0 (T ) : T µ = 0 ad and one of the frameworks (3)-(5) the statistic W N has the same asymptotic limit distributions as W N , if the respective conditions (a)-(c) from Theorem 3.1 are fulfilled.
The result shows that it is not reasonable to approximate the unknown distribution of the test statistic with a fixed distribution to obtain a valid test procedure. For example, choosing z 1−α , the (1 − α)-quantile of the standard-normal distribution (α ∈ (0, 1)), as critical value would lead to a valid asymptotic level α test ψ z = 1{W N > z 1−α } in case of β 1 → 0, i.e. E H 0 (ψ z ) → α. However, for β 1 → 1 we would obtain E H 0 (ψ z ) → P(χ 2 1 > √ 2z 1−α + 1) which may lead to an asymptotically liberal (α = 0.01 or 0.05) or conservative (α = 0.1) test decision, see Table 1. Contrary, choosing c 1−α = (χ 2 1;1−α − 1)/ √ 2 as critical value (where χ 2 1;1−α denotes the (1 − α)quantile of the χ 2 1 -distribution) for the test where Φ denotes the cumulative distribution function of N(0, 1). Again we obtain an asymptotically liberal (α = 0.1) or extremely conservative (α = 0.05 or 0.01) test decision, see the last column of Table 1.  Hence, an indicator (i.e. estimator) for whether β 1 → 0, β 1 → 1 or betwixt would be desirable. Nevertheless, even if the tests with fixed critical values are asymptotically correct (ψ z in case of β 1 → 0 or ψ χ in case of β 1 → 1), their true type-I-error control may be poor for small sample sizes, see the simulations in Section 5.1. Thus, in any case it seems more appropriate to approximate W N by a sequence of standardized distributions as already advocated in [31] for the case of a = 1. We will propose such approximations in the next Sections, where also a check criterion for β 1 → 0 or β 1 → 1 is presented.

Better Approximations
To motivate the subsequent approximation, recall from (10) that W N is of weighted χ 2 -form. Following [40] it is reasonable to approximate statistics of this from by a standardized (χ 2 f − 1)/ √ 2-distribution such that the first three moments coincide. Straightforward calculations show that this is achieved by approximating with In case of a = 1 this simplifies to the method presented in [31]. There it has already been seen that the approximation (15) performs much better for smaller sample sizes and/or dimensions than the above approaches with a fixed distribution. We will later rediscover this observation in Section 5 for our present design with general a. The next theorem gives a mathematical reason for this approximation.
Its properties together with a consistent estimator for f P are summarized below. (b) Suppose that a ∈ N is fixed. Then τ P := C 2 5 /A 3 4 is a consistent estimator for τ P = 1/f P as min(d, n 1 , . . . , n d ) → ∞, i.e. we have convergence in probability (c) Now suppose that a → ∞ and that there exists some p > 1 such that min(n 1 , . . . , n a ) = O (a p ). Then (18) even holds under the asymptotic frameworks (4) -(5).

Theorem 4.2:
Suppose (18). Then, Theorem 4.1 remains valid if we replace f P by its estimator f P = 1/ τ P .

Remark 4.2:
(a) Using similar arguments as in the proof of Lemma 8.1. of [31] we obtain the equivalences β 1 → 0 ⇔ τ P → 0 and β 1 → 1 ⇔ τ P → 1. Thus, τ P can also be used as check criterion for these two cases.
(b) It is also possible to derive a consistent estimator for τ CQ = tr((T V N ) 4 )/tr 2 ((T V N ) 2 ) = 1/f CQ , a key quantity in [11], see the supplement for details concerning the estimator. The corresponding approximation by the sequence K f CQ even shares the same asymptotic properties of the Pearson approximation (15) stated in Theorem 4.1 and Theorem 4.2. However, it only provides a two moment approximation which turned out to perform worse in simulations (results not shown).
(c) In the supplement, we additionally present an unbiased estimator C 7 for tr((T V N ) 3 ) such that C 2 7 /A 3 4 is consistent for τ P in all asymptotic frameworks (3) - (5). Particularly, the extra condition min(n 1 , . . . , n a ) = O (a p ) is not needed. However, it is computationally more expensive compared to C 5 and thus omitted here.
In practical applications, the computation costs for C 5 are nevertheless rather high. This leads to disproportional waiting times for p-values of the corresponding approximate test ϕ N = 1{W N > K f P ;1−α }, where the critical value is given as (1 − α)-quantile of K f P . Therefore, we propose a certain subsampling-type method. Since the unbiasedness of C 5 clearly stems from (16) it seems reasonable to proceed as follows: For each i = 1, . . . , a and b = 1, . . . , B we independently draw random subsamples {σ 1i (b), . . . , σ 6i (b)} of length 6 from {1, . . . , n i } and store them in a joint random vector σ(b) = (σ 11 (b), . . . , σ 6a (b)). Then, a subsampling-version of the estimator C 5 is given by Letting B = B(N) → ∞ as N → ∞ it is easy to see (see the supplement for details), that C 5 has the same asymptotic properties as C 5 . In particular, it is stated in the supplement that τ P := 1/f P := C 2 5 /A 3 4 is a consistent estimator for τ P and that the approximation Kf P has the same weak limits as Kf P stated in Theorem 4.2. This leads to ϕ N = 1{W N > Kf P ;1−α } which is an asymptotically exact test whenever β 1 → γ ∈ {0, 1}. The finite sample, dimension and group size performance of this approximation are investigated in the subsequent section.

Simulations
In the previous sections we considered the asymptotic properties of the proposed inference methods which are valid for large sample and fixed or possibly large dimension and/or group sizes. Here we investigate the small sample properties of our proposed approximation procedure ϕ N = 1{W N > Kf P ;1−α } in comparison to the statistical tests ψ z and ψ χ based on fixed critical values. In particular, we compare these procedures in simulation studies with respect to (a) their type I error rate control under the null hypothesis (Section 5.1) and (b) their power behaviour under various alternatives (Section 5.2).
All simulations were performed with the help of the R computing environment (R Development Core Team, 2013), each with n sim = 10 4 simulation runs.

Asymptotic distribution and Type I error control
First we study the speed of convergence, i.e. type I error control, of the three different tests under the null hypothesis. To be in line with the simulation results presented in [31] for the case a = 1 we also multiplied the statistic W N by N/(N − 1) to avoid a slightly liberal behaviour. Due to the abundance of different split-plot designs and the more methodological focus of the paper, we restrict our simulation study to two specific null hypotheses and a high dimensional and heteroscedastic two-sample setting. In particular, we investigate the type-I-error behaviour of all three tests for the null hypotheses In both cases sample sizes were chosen from n 1 ∈ {10, 20, 50} and n 2 ∈ {15, 30, 75} combined with various choices of dimensions d ∈ {5, 10, 20, 40, 70, 100, 150, 200, 300, 450, 600, 800}. For the covariance matrices a heteroscedastic setting with autoregressive structures (Σ 1 ) i,j = 0.6 |i−j| and (Σ 2 ) i,j = 0.65 |i−j| was chosen and for each simulation run B(N) = 500 · N, N = n 1 + n 2 , subsamples were drawn. Note that these settings imply β 1 → 1 for H a 0 and β 1 → 0 for H b 0 , see the supplement for details. Thus, ϕ N is asymptotically exact in both cases while ψ χ and ψ z posses the asymptotic behaviour given in Table 1. In particular, the z-test ψ z should be rather liberal for testing for H a 0 and ψ χ strongly conservative for H b 0 . All these theoretical findings can be recovered in our simulations: The results for H a 0 , displayed in Figure 1, show an inflated type I error level control of ψ z around 8% for smaller samples sizes (N = 25). For larger sample sizes (N = 125) it stabilizes in the region of its asymptotic level of 6.8% ± 0.2%. Moreover, the error control is only slightly effected by the varying dimensions under investigation. In comparison, the two asymptotically correct tests ϕ N and ψ χ are slightly liberal for smaller sample sizes and more or less asymptotically correct for moderate (N = 50) to larger sample sizes. Here, it is astonishing that both procedures are nearly superposable, suggesting a fast convergence of the degrees of freedom estimator f P . The results for H b 0 , presented in Figure 2, are slightly different. In particular, both the tests ψ χ and ψ z depending on fixed critical values are more effected by the underlying dimension: For smaller d < 100 the true level is considerably larger than their asymptotic level given in Table 1; resulting in a rather liberal behaviour of ψ z and close to exact type I error control for ψ χ . This effect is decreased with increasing sample sizes. Moreover, for larger dimension (d 200) both tests approach their asymptotic level. In comparison, the procedure ϕ N based on the Kf approximation shows a fairly good α level control through all dimension and sample size settings. Making this the method of choice.

Power Performance
We examined the power of the three procedures. Again a heteroscedastic two group splitplot design with autoregressive covariance structures ( (Σ 1 ) i,j = 0.6 |i−j| and (Σ 2 ) i,j = 0.65 |i−j| ) was selected. The alpha level (5%) and the null hypotheses were chosen as above (H a 0 : • a trend alternative for both hypotheses with µ 2 = 0 d and µ 1, and additionally • a shift alternative for H a 0 with µ 2 = 0 d and µ 1 = 1 · δ and • a one-point alternative for H b 0 , with µ 2 = 0 d and µ 1 = e 1 · δ, Power curves for a trend alternative and H a 0 : Power curves for a trend alternative and It can be readily seen that the power depends on the type of alternative: For the trend (Figure 3) and the shift alternative (left panel of Figure 4) the power gets larger with increasing dimension. This is essentially apparent for the shift alternative, where the power increases considerably from d = 10 to d = 40. Contrary, for the one-point alternative the power becomes smaller for higher dimensions d (right panel of Figure 4). However, this is as expected since a difference in one single component can be detected more easily for smaller d.
Power curves for a shift alternative and H a 0 : Power curves for an one-point alternative and

Analysis of a sleep laboratory data set
Finally, the new methods are exemplified on the sleep laboratory trial reported in [22]. In this two-armed repeated measures trial, the activity of prostaglandin-D-synthase (β-trace) was measured every 4 hours over a period of 4 days. The grouping factor was gender and the above d = 24 repeated measures were observed on n i = 10 young healthy men (group i = 2) and women (group i = 1). Since each day presented a certain sleep condition the repeated measures are structured by two crossed fixed factors: • intervention (with 4 levels: normal sleep, sleep deprivation, recovery sleep and REM sleep deprivation) and • time (with the 6 levels/time points 24h, 4h, 8h, 12h, 16h and 20h).
Due to d > n i we are thus dealing with a high-dimensional split-plot design with a = 2 groups and d = 24 repeated measures. The time profiles of each subject are displayed in Figure 5 (for the female group 1) and Figure 6 (for the male group 2). We note, that groupspecific profile analysis could already be performed by the methods given in [31]. In particular, they found a significant intervention and a borderline time effect for the male group. For the current two-sample design additional questions concern (1) whether there is a gender effect, i.e. the time profiles of the groups differ, and if so (2) whether they differ with respect to certain interventions. Moreover, investigations regarding (3) a general effect of time and (4) interactions between the different factors are of equal interest. Utilizing the notation from Section 2, the corresponding null hypotheses can be formalized via adequate contrast matrices.
In particular, we are interested in testing the null hypotheses : where e = (δ j ) j denotes the Kronecker delta. Applying the test ϕ N based on the standardized quadratic form W N as test statistic and the proposed Kf P -approximation with B = 50000 · N = 100, 000 subsamples we obtain the results summarized in Table 2 . There it can be readily seen that most hypotheses cannot be rejected at level α = 5%. In particular, there is no evidence for an overall gender effect, so that we have not performed post-hoc analyses on the interventions. Only a highly significant time effect, as well as a significant effect between the first two interventions (normal sleep and sleep deprivation), could be detected. However, applying a multiplicity adjustment (Bonferroni or Holm) only the time effect remained significant.

Conclusion & Outlook
In this paper we have investigated inference procedures for general split-plot models, allowing for unbalanced and/or heteroscedastic covariance settings as well as a factorial structure on the whole-and sub-plot factors. Inspired by the work of [31] for one group repeated measures designs the test statistics were based on standardized quadratic forms. However, different to their work novel symmetrized U-statistics were introduced to adequately handle the problem of additional nuisance parameters in the multiple sample case. To jointly cover low and highdimensional models as well as situations with a small or large number of groups we conducted an in-depth study of their asymptotic behaviour under a unified asymptotic framework. In particular, the number of groups a and dimensions d may be fixed as in classical asymptotic settings, or even converge to infinity. Here we do neither postulate any assumptions on how d and/or a and the underlying sample sizes converge to infinity nor any sparsity conditions on the covariance structures since such assumptions are usually hard to check for a practical data set at hand. As a consequence it turned out that the test statistic posses a whole continuum of asymptotic limits that depend on the eigenvalues of the underlying covariances. We thus argued that an approximation by a fixed critical value is not adequate and proposed an approximation by a sequence of standardized χ 2 -distributions with estimated degrees of freedom. For computational efficiency we additionally provided a subsampling-type version of the degrees of freedom estimator. Our approach provides a reasonably good three moment approximation of the test statistic and is even asymptotically exact if the influence of the largest eigenvalue is negligible (leading to a standard normal limit) or decisive (leading to a standardized χ 2 1 limit). Apart from these asymptotic considerations we evaluated the finite sample and dimension performance of our approximation technique. In particular, for varying combinations of sample sizes and dimensions, we compared its power and type I error control with test procedures based on fixed critical values. In all designs it showed a quite accurate error control over all low-(d 10) to highdimensional situations (with up to d = 800). In comparison, its per-formance was considerably better than that of the other two tests which partially disclosed a rather liberal or conservative behaviour. In future research we like to extend the current results to general highdimensional MANOVA designs, where we also like to relax the involved assumption of multivariate normality and/or even test simultaneously for mean and covariance effects as recently proposed in [28]. These investigations, however, require completely different (e.g., martingale) techniques and estimators of the involved traces. Moreover, we also plan to conduct more detailed simulations (especially for larger group sizes a and other covariance matrices) in a more applied paper.

'Inference For High-Dimensional Split-Plot-Designs: A Unified Approach for Small to Large Numbers of Factor Levels'
Paavo Sattler 1 and Markus Pauly 1 1 University of Ulm, Institute of Statistics Abstract. In this supplement we present all theoretical derivations and computations that were omitted in the paper for lucidity.

A Appendix
We start with some preliminary results and Lemmatas.

A.1 Basics
In Section 2 of the main paper we claimed that the unique projection matrix T to the hypothesis matrix H = H S ⊗ H W that equivalently describes the null is given by the product of two projection matrices T S ⊗ T W . We start with the proof of this claim: For each hypothesis Hµ = 0 with such a matrix H exist projectors T ∈ R ad×ad , T W ∈ R a×a , T S ∈ R d×d which can be used to formulate the same null hypothesis T µ = 0 with T = T W ⊗ T S .
Proof: It is known that the projector T = H [HH ] − H fulfills T µ = 0 ⇔ Hµ = 0. For this reason and utilizing well known rules ( see for example [33] ) for generalized inverses we obtain For proofing our main results we have to compare various traces of powers of combinations underlying covariance matrices. To this end, we will particularly apply the following inequalities: For positive real numbers a,b and a symmetric matrix A ∈ R d×d it holds For A ∈ R d×d symmetric with eigenvalues λ 1 , . . . , λ d 0 it holds that tr A 2 tr 2 (A) .
If Σ i ∈ R d×d is positive definite and symmetric and T ∈ R d×d is idempotent and symmetric it holds for every k ∈ N that Proof: The first part is an application of the Cauchy-Bunyakovsky-Schwarz inequality, with the Frobenius inner product. Therefore The second part just uses the binomial theorem together with the condition λ t 0 for t = 1, . . . , d: Finally, the last inequality follows from the second one, if we show that all conditions are fulfilled. With idempotence of T and invariance of the trace under cyclic permutations, it follows for all k ∈ N that Thus, it is sufficient to consider this term. Since T Σ i T is symmetric all powers are symmetric too and it follows with k = k/2 that ∀x ∈ R d : since Σ i and I d are positive definite and k − 2k ∈ {0, 1}. So both conditions of the second inequation are shown and Furthermore, an inequality for traces which contain Σ i and Σ r is needed.

Lemma A.3:
Let Σ i , Σ r ∈ R d×d be positive definite and symmetric matrices and suppose that T ∈ R d×d is idempotent and symmetric. Then it holds for i = r that tr (T Σ i T Σ r ) 2 tr 2 (T Σ i T Σ r ) .
Proof: As shown before T Σ i T and T Σ r T are symmetric and positive semidefinite. For this reason, a symmetric matrix W exists with WW = T Σ r T . Due the fact that all matrices are symmetric it holds This allows to use the inequalities from above for this matrix, and again utilizing the invariance of the trace under cyclic permutations we obtain To standardize the quadratic form we also have to calculate its moments. Here, the following theorem helps: with g (k) = 2 k k! tr (T Σ) k+1 + (k + 1) µ X (T Σ) k T µ X for k ∈ N and g (0) = tr (T Σ X )+µ X T µ X .

Proof:
The proof can be found on page 53 in [29].
Proof: Using the inequalities for traces and with the bilinear form written as all equations follows with the previous theorem.
Lemma A.6: Let X n ∈ L 2 be a real random variable with E(X n ) = µ, b n,d a sequence with lim n,d→∞ b n,d = 0, and c a,d,n min a sequence with lim a,d,n min →∞ c n,d = 0 then it holds • Var (X n ) b n,d ⇒ X n is an consistent estimator for µ, if n, d → ∞, • Var (X n ) c a,d,n min ⇒ X n is an consistent estimator for µ, if a, d, n min → ∞.
For µ = 0 they are especially ratio-consistent.
Proof: For arbitrary > 0 the Tschebyscheff inequality leads to Consider the limit for n, d → ∞ justifies the consistency and using this for X n /µ leads to ratioconsistency. The second part follows identically.
This result is especially true if b n,d or c a,d,n min only depends on n resp. n min . For completeness we state a straightforward application of the Cauchy-Bunyakovsky-Schwarz inequality: Lemma A.7: For real random variables X, Y ∈ L 2 it holds and so for X, Y identically distributed Cov (X, Y) Var (X).
The next result gives equivalent conditions for β 1 → a ∈ {0, 1}: Lemma A.8: Let be λ again the eigenvalues of T V N T sorted so that λ 1 is the biggest one. Then it follows Moreover we know 0 tr 2 ((TVN) 3 ) tr 3 ((TVN) 2 ) = τ P 1. This Lemma also holds if lim N,d→∞ is replaced by lim a,N → ∞ or lim a,d,N→∞ .
Proof: This follows from Lemma 8.1 given in the supplement in [31][page 21] since their result does not depend on the concrete matrix, i.e. can be directly applied for V N . Moreover, the different asymptotic frameworks do not influence the proof since they are hidden within the above convergences.
To prove the properties of the subsampling-type estimators some auxiliaries are needed. In particular, the following lemma allows us to decompose the variances and to use conditional terms for the calculation.
Lemma A.9: Let X be a real random variable and denote by F a σ-field. Then it holds that Var(X) = E (Var (X|F)) + Var (E (X|F)) .

Proof: With the rules for conditional expectations we calculate
The result follows by sum up this both parts.
Lemma A.10: Let M(B, σ(b, m)) be the amount of pairs (k, ) ∈ N B × N B , which fulfill σ(k, m) and σ( , m) have totally different elements and analogue M(B, σ i (b, m)). As long as m n i for all i ∈ N a , it holds where | · | denotes the number of elements.
Let M(B, (σ i (b, m), σ r (b, m))) be the amount of pairs (k, ) ∈ N B × N B fulfilling σ i (k, m) and σ i ( , m) and moreover σ r (k, m) and σ r ( , m) have totally different elements. If m n i it holds The number of totally different pairs can be seen as a binomial distribution with B 2 − B elements, and to calculate the necessary probability independence is used. With the fact that all combinations in this situation have the same probability it follows that . For M(B, (σ i (b, m), σ r (b, m))) and M(B, σ i (b, m)) less multiplications are needed, so the results follow.
If B(N) → ∞ (for example B could be chosen proportional to N) these terms converge to zero, disregarding the number of groups or of m.

A.2 Proofs of Section 3
Proof of Theorem 3.1 (p.5): The proof of this lemma is very similar to the one from [31][Theorem 2.1]. Due to the fact that a finite sum of multivariate normally distributed random variables is again multivariate normally distributed, the representation theorem can be used to (distributionally equivalently) express the quadratic form as The only differences to [31][Theorem 2.1] are that in the case of more groups the eigenvalues do not only depend on d but also on the n i and a and that there are more terms to sum. The first point has only an influence on the limit of the β s . The higher number of summands does not matter because we observe the asymptotic under the asymptotic frameworks (4)- (5), for which at least a or d converge to infinity. The proofs from [31][Theorem 2.1] only need the representation from above, a number of summations which goes to infinity and the conditions on the limits of the β s . Since these are fulfilled the proof can be conducted in the same way.

Proof of Lemma 3.1 (p.6):
Remember that with Y i, ,k := X i, − X i,k and i = r ∈ N a , a > 1 trace estimators were defined by For = k we know Y i, ,k ∼ N (0 d , 2Σ i ) and for totally different indices the Y i, ,k are statistically independent. So the previous lemmata can be used to calculate the moments. The unbiasedness can be shown by calculating the expectation values for each estimator The following argument will be used several times in this work with small differences, so incidentally it will be more detailed.
To check the variance we recognize first that Cov combinations remain. Instead of calculating the covariances of the remaining quadratic forms it is easier to use lemmata from above. By using the fact that all quadratic forms are identically distributed, we can calculate the variances which are all the same so it is just the number of remaining combinations multiplied with the variances. This leads to: With these values we know for So the conditions for an unbiased and ratio-consistent estimator are fulfilled.
The same steps with a different number of remaining combinations leads to Finally, the conditions for A 4 have to be checked. With the expectation values from above we calculate To calculate the variances the following additional inequalities are needed: Together this leads to and therefore A 4 is an unbiased and ratio-consistent estimator of tr (T V N ) 2 .
Moreover, we want to stress that the zero sequences used as upper border for E H 0 (Q N ) and A 4 do not depend on the number of groups or dimensions, so this estimators can be also used for increasing number of groups.
With the expectation values and variances from the beginning it follows directly that A i,1 , A i,r,2 , A i,3 are unbiased, ratio-consistent estimators of tr(T S Σ i ), tr (T S Σ i T S Σ r ) and tr (T S Σ i ) 2 .
It is worth to note that all of this estimators also consistent estimators which are even dimensionstable in the sense of [8].
For A i,r,2 there exists a alternative form which can be implemented substantially more efficient and was considered in [9]. It is based on matrices of the form M i,r = P n i (T S X i,1 , . . . , T S X i,n i ) · (T S X r,1 , . . . , T S X r,n r ) P n r . Recalling that 1 n is the vector of ones and # denotes the Hadamard-Schur-Product, it can be seen that For A i,3 there also exists an alternative formula, which expands much longer, but is more effi-29 cient: .
To finally prove Theorem 3.2 (p.7) we need another lemma.
Lemma A.11: For the previously defined estimators it holds for n min → ∞ that Thus, In the last step we used the fact that all terms are non-negative and applied the binomial theorem in the last inequality. It is a zero sequence which only depends on n min , so again with Lemma A.6 (p.23) the result is proved.
Thus, we can finally calculate the standardized quadratic form as The last two parts converge in probability to zero, so also in distribution and with Slutzky W N · o p (1) converge in distribution to zero if one of the conditions of Theorem 3.1 is fulfilled. Thereby W N has asymptotical the same distribution as W N .
For large numbers of groups many estimators A i,1 , A i,r,2 and A i,3 and have to be calculated which leads to long computation time. In this cases it is better to again use subsamling-type estimators which leads to A i,1 , A i,r,2 , A i,3 and therefore to A 4 .
Lemma A.12: With the definitions from above let be If B(N) → ∞, this estimators and a i=1 A i,1 have the same properties as A i,1 , A i,r,2 , A i,3 , A 4 and a i=1 A i,1 which were defined in Lemma 3.1 (p.6) .
Proof: For A i,1 (B), this lemma will be proved in detail. For all other terms only the major steps are shown.
The unbiasedness is clear because the random variables σ i1 (b), σ i2 (b) have no influence on the number of terms of the sum and also the terms are identically distributed. Hence, The second part is more complicated. Let F(σ i (B, m)) be the smallest σ-field which contains σ i (b, m) ∀b ∈ B, so obvious M(B, σ i (b)) is F(σ i (B)) -measurable. Identical for F(σ i (B, m), σ r (B, m)) and F (σ(B, m)). Similar to the previous part, the distribution of the bilinear form does not de-pend on the index combination. Together with the independence of the normally distributed vectors and σ i1 (b), σ i2 (b) this leads to With Lemma A.9 (p.24) we thus obtain (σ i (B, 2)) .
For the calculation of the conditional variance of the sum, it would be useful finding an upper bound that is based on the variance instead of calculate the covariances. To achieve this, we calculate the number of index combinations which leads to a covariance that is zero. This amount is non-deterministic and we recognize it contains the amount M(B, σ i (b, 2)) which was considered before.
Again not the amount is important but the number of elements which are contained in M(B, σ i (b, 2)) since the bilinear forms are identically distributed. Therefore the condition of the variance of the bilinear form disappears since the random indices have no influence on the variance. With the F(σ i (B, 2))-measurability of M(B, σ i (b, 2)) it thus follows that The other values are calculated in a similar way. .
For B(N) → ∞ the first factor is a zero sequence and therefore a i=1 N n i (T W ) ii A i,1 a ratioconsistent, unbiased estimator of tr So again this is a zero sequence, and A 4 is an unbiased and dimensional stable (i.e. also ratio consistent) estimator of tr (T V N ) 2 .

A.3 Proofs of Section 4
Lemma A.13: For  Λ 1 ( 1,1 , . . . , 6,a ) = Z ( 1,1 , 2,1 ,..., 1,a , 2,a ) T Z ( 3,1 , 4,1 ,..., 3,a , 4,a ) , With this notation it follows that Proof: Set With the rules for conditional expectation and the involved independence it follows that Due to the fact that all X i,j are identically distributed we can neglect the concrete indices, as long as we maintain the structure of dependence of the bilinear forms. The last term fulfills the requirements from Korollar A.5 (p.22) with Z (1,2) ∼ N (0 ad , 2V N ) and the matrix T V N T T V N T .
For the calculation of the variance it is useful to diagonalize the matrix V 1/2 N T V 1/2 N : It exists an orthogonal matrix P with PV We define E i := P Z (i,j) so with the properties of the standard normal distribution E i ∼ N ad (0 ad , I ad ), where the E i are independent for different indices. Thus, we can rewrite So we can control the variance by With this result, we can construct an estimator for τ P step by step: Lemma A.14: For C 5 as previously defined, it holds for fixed a that C 5 It even holds in the asymptotic frameworks (4)-(5) if p > 1 exists with n min = O(a p ).
Proof: From the previous lemma, we know that .
For fixed a this is a zero sequence. If we consider a → ∞ we need the existence of p > 1 and n min = O(a p ) to guarantee that the upper border is a zero sequence. So in both cases Lemma A.6 (p.23) can be used.
Lemma A.15: Moreover C 5 holds for fixed a If p > 1 exists with n min = O(a p ), the convergence even holds in the asymptotic frameworks (4)- (5).
Proof: With the last lemma it follows for both cases that For the last step we used that τ P ∈ [0, 1] which is known from Lemma A.8 (p.24) and hence 1]. As a product of a bound term and a term which converges to zero in probability, it also converges to zero in probability and with Slutzky's Lemma the result follows. With these limits in both cases we can calculate As in the previous lemma we used τ P ∈ [0, 1] and Slutzky.
Although n min = O(a p ) with p > 0 is not too critical in most settings we additionally developed an estimator which can be used without any restrictions.
If these values were summed up and divided by the number of rearrangements we get an alternative for C 5 which is shown in the following lemma.
Lemma A.17: For C 7 as defined before it holds Simulations (not shown here) show that higher values for w lead to better estimations.
Lemma A.18: For C 7 as previously defined, it holds independent of a or d. Therefore this holds for the asymptotic frameworks (3)-(5).
Proof: With the previous lemma we know So exactly the same steps as in the proof of Lemma 4.1 , which in this case uses that the zero sequence not depends on a or d, leads to the result.

64
A.13 With these values we can consider the whole estimator Λ 4 (j;σ 0 (b,6))·Λ 5 (j;σ 0 (b,6))·Λ 6 (j;σ 0 (b,6)) 8·B = tr (T V N ) 3 , Var B b=1 Λ 4 (j;σ 0 (b,6))·Λ 5 (j;σ 0 (b,6))·Λ 6 (j;σ 0 (b,6)) 8B Proof: For the proofs of the classical estimators from the first paragraph, only the expectation values are used together with upper bounds for the variances which are zero sequences. With random indices, the expectation is the same and for the variance, all traces are the same but the zero sequence changes. So the proofs of the subsampling-type estimators work identically. For the second paragraph, only some convergences are necessary, which the subsampling-type estimators also fulfills.

A.4 On the asymptotic distribution in our simulation designs
To chose the convenient test for our simulation the limit of β 1 has to be considered. Instead of this we calculate the value of τ P = tr 2 (T V N ) 3 tr 3 (T V N ) 2 and because V N is known no estimation is needed. The ratio n 1 /N and n 2 /N are the same for all our sample sizes, so the different numbers n 1 , n 2 has no influence on the values of τ P . The results can be seen in Table 3 and Table 4 which leads to the assumption τ P → 1 for H 0 a and τ P → 0 for H 0 b . With Lemma A.8 (p.24) this is equivalent to β 1 → 1 under H a 0 resp. β 1 → 0 under H b 0 .

A.5 On the Chen-Qui-Condition
We can also develop an estimator for τ CQ = tr (T V N ) 4 / tr 2 (T V N ) 2 = 1/f CQ on an analogical way as before. This leads to: Lemma A.21: Let be If p > 1 exists with n min = O(a p ), the convergence even holds in the asymptotic frameworks (4)- (5).
Proof: Again we first consider the parts: So with Lemma A.6 (p.23) for fixed a and d, n min → ∞ and moreover if the additional condition is fulfilled even for the asymptotic frameworks (4)- (5), it follows Together this leads to Then it holds E (C 6 (B)) = tr (T V N ) 4 , · O tr 4 (T V N ) 2 .
Proof: By using the same steps as before it holds With Lemma A. 19 we get an estimator for τ CQ with τ CQ (C 6 , A 4 ) = C 6 /A 2 4 and once more for a large number of groups A 4 should be used.
Lemma A.24: Theorem 4.1 is also valid if f P is replaced by f CQ or by ( τ CQ (C 6 , A 4 )) −1 . Using C 6 or A 4 also doesn't change the result. Identical the result of Lemma A.22 remains true if one or all estimators are replaced by their subsampling version.
Proof: With Lemma A.8 we know f p → 1 ⇔ f CQ → 1 and f p → 0 ⇔ f CQ → 0 so in both cases K f P is asymptotically identic with K f CQ .
From Lemma A.22 we know that τ CQ − τ CQ converges in probability to zero so this result follows identically to Theorem 4.1. At last the subsampling versions have the same properties like the standard estimators. Therefore this is a second way to test the hypotheses and moreover, it provides an indicator for the choice of the limit distribution, because of Lemma A.8. For situation c) from Theorem 3.1 there is no proof that this approach can be used but in the case of just one group it leads to good results.