Inference for the mean of large $p$ small $n$ data: A finite-sample high-dimensional generalization of Hotelling’s theorem

We provide a generalization of Hotelling's Theorem that enables inference (i) for the mean vector of a multivariate normal population and (ii) for the comparison of the mean vectors of two multivariate normal populations, when the number $p$ of components is larger than the number $n$ of sample units and the (common) covariance matrix is unknown. In particular, we extend some recent results presented in the literature by finding the (finite-$n$) p-asymptotic distribution of the Generalized Hotelling's $T^2$, enabling the inferential analysis of large-p small-n normal data sets under mild assumptions.


Introduction
The advent and development of high-precision data-acquisition technologies in active fields of research (e.g., medicine, engineering, climatology, economics), which are able to capture real-time and/or spatially-referenced measurements, have provided the scientific community with large amounts of data that challenge the classical approach to data analysis.
Data sets are indeed increasingly becoming characterized by a number of random variables that is much larger than the number of sample units (large p small n data sets) in contrast to the "familiar" data sets where the number of sample units is often much larger than the number of random variables (small p large n data sets). This makes many classical inferential tools (e.g. Hotelling's Theorem) almost useless in many fields at the forefront of scientific research and raises the demand for new inferential tools able to efficiently deal with this new kind of data.
The work of Srivastava (2007) is pioneering in this direction. In it, a generalization of Hotelling's Theorem is proposed: a generalized $T^2$ test statistic is found and its distribution law is computed for $p \ge n$ under the assumptions of normality and of proportionality of the covariance matrix to the identity matrix, with the proportionality constant unknown (Theorem 2.2, Srivastava (2007)); this assumption implies independence among components (and among univariate test statistics as well), enabling also other classical inference procedures. We shall show that, without relying on the latter assumption, it is possible to generalize this work in a much less stringent framework. In Srivastava (2007), some inferential results not depending on strong assumptions on the covariance structure are presented as well, but, being asymptotic in both $p$ and $n$, they are not suitable for the inferential statistical analysis of large p small n data (Theorem 2.3, Srivastava (2007)).

p-asymptotic generalized Hotelling's theorem
The classical approach to inference for the mean $\mu_p$ of a $p$-variate normal random vector with unknown full-rank covariance matrix $\Sigma_p$ relies on a famous corollary of Hotelling's Theorem that holds when the number $n$ of sample units is larger than the number $p$ of random vector components.
Theorem 1 (Hotelling's Theorem). For $m \ge 1$ and $p \ge 1$, assume that:
(i) $X \sim N_p(0_p, \Sigma_p)$;
(ii) $W \sim W_p(m, \Sigma_p)$;
(iii) $X$ and $W$ are independent;
(iv) $\Sigma_p$ is positive definite.
Then, for $m \ge p$:
$$\frac{m-p+1}{p}\, X'\,W^{-1}\,X \;\sim\; F(p,\; m-p+1).$$

Corollary 2 (Hotelling's $T^2$ Distribution Law). For $n \ge 2$ and $p \ge 1$, assume that:
(i′) $X_1,\dots,X_n$ i.i.d. $N_p(\mu_p, \Sigma_p)$;
(ii′) $\Sigma_p$ is positive definite.
Then, for $n > p$:
$$\frac{n-p}{p(n-1)}\; n\,(\bar X - \mu_p)'\,S^{-1}\,(\bar X - \mu_p) \;\sim\; F(p,\; n-p),$$
with $\bar X$ and $S$ being the sample mean and the sample covariance matrix, respectively.
The quantity $n(\bar X - \mu_p)'S^{-1}(\bar X - \mu_p)$ is known as Hotelling's $T^2$, due to its analogy with the square of the univariate Student's $t$ test statistic. Corollary 2 makes possible the development of inferential tools for the mean value of a $p$-variate normal random vector (e.g., confidence ellipsoidal regions or hypothesis testing) when the number $n$ of sample units is larger than the number $p$ of random vector components; there are no assumptions on the covariance matrix $\Sigma_p$, which is only required to be positive definite. Proofs of Theorem 1 and Corollary 2 can be found, for instance, in Anderson (2003).
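As a point of reference for what follows, a minimal numerical sketch of the classical procedure of Corollary 2 (Python with NumPy/SciPy; the data, sample sizes, and the hypothesized mean are illustrative, not taken from the paper):

```python
import numpy as np
from scipy import stats

def hotelling_test(X, mu0):
    """Classical one-sample Hotelling T^2 test (requires n > p)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # sample covariance (n-1 denominator)
    d = xbar - mu0
    T2 = n * d @ np.linalg.solve(S, d)   # T^2 = n (xbar-mu0)' S^{-1} (xbar-mu0)
    F_stat = (n - p) / (p * (n - 1)) * T2
    return T2, stats.f.sf(F_stat, p, n - p)

# illustrative data: n = 30 units, p = 4 components
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=30)
print(hotelling_test(X, mu0=np.zeros(4)))
```

It is exactly this F calibration that becomes unavailable as soon as $p \ge n$, since $S$ is then singular.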
Theorem 1 and Corollary 2 become useless in applications where the covariance matrix is unknown and the number $p$ of random vector components is larger than $n-1$, with $n$ being the number of sample units. Indeed, in these cases, $T^2$ is not defined, since $S$ is not invertible because $\operatorname{rank}(S) = \min(n-1, p)$ a.s. Analogously to Srivastava (2007), we decide to suitably generalize the inverse of $S$ in order to obtain a suitable generalization of $T^2$. We consider the Moore-Penrose generalized inverse (Rao and Mitra (1971)) of a rectangular matrix, whose general definition can be found in Appendix A, since it always exists, it is unique, and it is equal to the inverse matrix when the latter is square and invertible. Moreover, in the special case of a square real positive semi-definite matrix $A$, the Moore-Penrose generalized inverse $A^+$ can be proved (see Appendix A) to be equal to
$$A^+ = \sum_{i:\,\lambda_i > 0} \lambda_i^{-1}\, e_i e_i',$$
with $\lambda_i$ and $e_i$, $i = 1,\dots,p$, being the eigenvalues and eigenvectors of $A$, respectively. We now present a generalization of Hotelling's Theorem that can be used to make inference for the mean of a multivariate normal random vector when the sample size $n$ is finite, the number of components $p$ goes to infinity, and the covariance matrix is unknown.
Theorem 3 (p-asymptotic Generalized Hotelling's Theorem). For $m \ge 1$ and $p \ge 1$, assume that:
(i) $X \sim N_p(\mu_p, \Sigma_p)$, with $\Sigma_p$ positive definite;
(ii) $W \sim W_p(m, \Sigma_p)$;
(iii) $X$ and $W$ are independent;
(iv) $0 < \lim_{p\to\infty} \operatorname{tr}(\Sigma_p)/p = \bar\sigma < +\infty$ and $0 < \lim_{p\to\infty} \operatorname{tr}(\Sigma_p^2)/p = \overline{\sigma^2} < +\infty$.
Then, for $p \to \infty$:
$$\frac{\bar\sigma^2}{\overline{\sigma^2}}\; p\,(X-\mu_p)'\,W^+\,(X-\mu_p) \;\xrightarrow{\;d\;}\; \chi^2(m).$$
The proof of Theorem 3 is based on the p-asymptotic distribution of three auxiliary random matrices $Y$, $L$, $H$ that provide alternative useful representations of the random matrix $W$ appearing in (ii). In particular, $Y$ is a $p \times m$ random matrix whose $m$ columns are independent $N_p(0,\Sigma_p)$, so that we may represent $W = YY'$ since the two random matrices have the same law; $L$ is a random diagonal matrix whose diagonal elements are the $m$ non-zero ordered eigenvalues of $W$; and $H$ is an $m \times p$ random matrix whose rows are the corresponding $m$ eigenvectors (i.e. $W = H'LH$ with $HH' = I_m$). The random matrices $Y$, $L$, and $H$ exist almost surely since $W$ is a Wishart random matrix with $m$ degrees of freedom. Also the diagonal matrix $\Lambda_p = \operatorname{diag}(\lambda_1(\Sigma_p),\dots,\lambda_p(\Sigma_p))$, with $\lambda_1(\Sigma_p) \ge \dots \ge \lambda_p(\Sigma_p) > 0$ being the ordered eigenvalues of $\Sigma_p$, always exists thanks to the positive definiteness of $\Sigma_p$. Theorem 3 relies, among others, on the fact that, under assumptions (ii) and (iv) of the same theorem,
$$\frac{1}{p}\,L \;\xrightarrow{\;P\;}\; \bar\sigma\, I_m, \qquad H\Lambda_p H' \;\xrightarrow{\;P\;}\; \frac{\overline{\sigma^2}}{\bar\sigma}\, I_m, \qquad \text{as } p\to\infty.$$
These p-asymptotic convergences are presented in Lemma A.2 (a), (b), and (d) of Srivastava (2007); their proofs rely on Chebyshev's Inequality, Prokhorov's Theorem, Slutsky's Theorem, and on algebraic relations supported by the properties of Moore-Penrose generalized inverses.
Proof of Theorem 3. Without loss of generality, assume $\Sigma_p = \Lambda_p$. Let us define two auxiliary matrices $A = (H\Lambda_p H')^{-1/2}$ and $Z = AH(X - \mu_p)$. The conditional distribution of $Z$ given $H$ is
$$Z \mid H \;\sim\; N_m\!\left(0_m,\; A\,(H\Lambda_p H')\,A'\right) = N_m(0_m, I_m),$$
since $X$ is distributed as $N_p(\mu_p, \Lambda_p)$. The conditional distribution of $Z$ given $H$ does not depend on $H$: therefore, $Z$ and $H$ are independent and $Z \sim N_m(0_m, I_m)$. Thanks to Proposition 9 in Appendix A, the following equalities in distribution hold:
$$p\,(X-\mu_p)'\,W^+\,(X-\mu_p) \;\stackrel{d}{=}\; p\,(X-\mu_p)'\,H'L^{-1}H\,(X-\mu_p) \quad (2.1)$$
$$\phantom{p\,(X-\mu_p)'\,W^+\,(X-\mu_p)} \;\stackrel{d}{=}\; p\,Z'\,(H\Lambda_p H')^{1/2}\,L^{-1}\,(H\Lambda_p H')^{1/2}\,Z. \quad (2.2)$$
Because of Lemma A.2 in Srivastava (2007) and the continuity of the maps $B \mapsto B^{-1/2}$ and $B \mapsto B^{1/2}$ over the set of positive definite matrices, we have that:
$$(H\Lambda_p H')^{1/2} \;\xrightarrow{\;P\;}\; \left(\frac{\overline{\sigma^2}}{\bar\sigma}\right)^{1/2} I_m, \qquad \left(\frac{1}{p}\,L\right)^{-1} \;\xrightarrow{\;P\;}\; \frac{1}{\bar\sigma}\, I_m.$$
Thus, Slutsky's Theorem (e.g., Serfling (2002)) implies that:
$$p\,Z'\,(H\Lambda_p H')^{1/2}\,L^{-1}\,(H\Lambda_p H')^{1/2}\,Z \;\xrightarrow{\;d\;}\; \frac{\overline{\sigma^2}}{\bar\sigma^2}\, Z'Z.$$
Finally, since the Euclidean squared norm function on $\mathbb{R}^m$ is continuous and $Z'Z = \|Z\|^2 \sim \chi^2(m)$, we obtain
$$\frac{\bar\sigma^2}{\overline{\sigma^2}}\; p\,(X-\mu_p)'\,W^+\,(X-\mu_p) \;\xrightarrow{\;d\;}\; \chi^2(m).$$
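A minimal Monte Carlo sketch of the convergence stated in Theorem 3, for a diagonal $\Sigma_p$ whose traces (and hence $\bar\sigma$ and $\overline{\sigma^2}$) are known exactly; the dimension, degrees of freedom, and number of replicates are illustrative. To keep the computation light, $W^+$ is evaluated through the identity $W^+ = Y(Y'Y)^{-2}Y'$, which holds since $Y$ almost surely has full column rank.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, p, reps = 5, 1000, 2000
lam = np.tile([0.5, 1.5], p // 2)          # eigenvalues of a diagonal Sigma_p
sigma_bar = lam.mean()                     # tr(Sigma_p)/p   -> bar{sigma}
sigma2_bar = (lam ** 2).mean()             # tr(Sigma_p^2)/p -> overline{sigma^2}
sd = np.sqrt(lam)

stat = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=p) * sd                    # X ~ N_p(0, Sigma_p)
    Y = rng.normal(size=(p, m)) * sd[:, None]      # m columns i.i.d. N_p(0, Sigma_p)
    v = Y.T @ X
    G = Y.T @ Y
    # W = Y Y' ~ W_p(m, Sigma_p) and W^+ = Y (Y'Y)^{-2} Y' (Y of full column rank)
    stat[r] = sigma_bar ** 2 / sigma2_bar * p * v @ np.linalg.solve(G @ G, v)

# for large p the empirical law of `stat` should be close to chi^2(m)
print(np.quantile(stat, [0.50, 0.95]))
print(stats.chi2.ppf([0.50, 0.95], df=m))
```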

Remarks about Theorem 3
1. The proof of Theorem 3 recalls the proof of Theorem 2.3 in Srivastava (2007), where an n-p-asymptotic result is proven (i.e., a result for both $n$ and $p$ going to infinity). That result is reported in Section 3 (paragraph Remarks about Corollary 4). The key point that allows us to obtain a (finite-$n$) p-asymptotic result for the distribution of $p(X-\mu_p)'W^+(X-\mu_p)$ is the use of the distributional equality (2.2) in place of the distributional equality (2.1), which is instead used in Srivastava (2007). Indeed, this new representation of the random quantity $p(X-\mu_p)'W^+(X-\mu_p)$ (identity (2.2)) makes Lemma 2.1 of Srivastava (2007) (where an n-asymptotic result is proven) unnecessary for identifying the distribution of $p(X-\mu_p)'W^+(X-\mu_p)$, leaving space for a finite-$n$ result.
2. Note that in Theorem 3 the practical meaning of "$p \to \infty$" (i.e., p-asymptoticity) is very general. For instance, we might consider the situation where we add extra components to a random normal vector $X$ and extra rows and columns to a random Wishart matrix $W$. This is for instance the case of discrete-time series when time goes to infinity, or micro-array expressions when the number of genes goes to infinity. But "$p \to \infty$" can also be relevant in more complex situations where a sequence of random vectors $X$ and of random matrices $W$ of increasing dimensionality is investigated without any "nesting" property as $p$ increases. This is for instance the case of sequential finite-dimensional representations of functional data by means of sequential, not necessarily nested, bases (e.g., B-splines) whose dimension goes to infinity.
3. It is easy to prove that if the eigenvalues of $\Sigma_p$ are uniformly bounded away from 0 and $+\infty$, i.e.,
$$0 < \underline{\lambda} \le \lambda_p(\Sigma_p) \le \dots \le \lambda_1(\Sigma_p) \le \overline{\lambda} < +\infty \quad \text{for all } p,$$
and at least one of the limits $\lim_{p\to\infty}\operatorname{tr}(\Sigma_p)/p$ and $\lim_{p\to\infty}\operatorname{tr}(\Sigma_p^2)/p$ exists, assumption (iv) of Theorem 3 is satisfied.
3. Inference for the mean of large p small n data

Provided that one can evaluate the ratio $\bar\sigma^2/\overline{\sigma^2}$ (this issue is tackled in subsection 3.3), Theorem 3 can be straightforwardly used to make inference for the mean of a multivariate normal distribution when the number $p$ of components is larger than the number $n$ of sample units. Indeed, its natural consequence is the following:

Corollary 4 (Generalized Hotelling's $T^2$ p-asymptotic distribution law). For $n \ge 2$ and $p \ge 1$, assume that:
(i′) $X_1,\dots,X_n$ i.i.d. $N_p(\mu_p, \Sigma_p)$, with $\Sigma_p$ positive definite;
(iv) $0 < \lim_{p\to\infty}\operatorname{tr}(\Sigma_p)/p = \bar\sigma < +\infty$ and $0 < \lim_{p\to\infty}\operatorname{tr}(\Sigma_p^2)/p = \overline{\sigma^2} < +\infty$.
Then, for $p\to\infty$:
$$\frac{\bar\sigma^2}{\overline{\sigma^2}}\,\frac{np}{n-1}\,(\bar X - \mu_p)'\,S^+\,(\bar X - \mu_p) \;\xrightarrow{\;d\;}\; \chi^2(n-1),$$
where $\bar X$ and $S$ are the sample mean and the sample covariance matrix, respectively.
The random quantity
$$T^2 = n\,(\bar X - \mu_p)'\,S^+\,(\bar X - \mu_p) \quad (3.1)$$
can be naturally denoted as Generalized Hotelling's $T^2$, since it is defined for any $n$ and $p$ such that $n \ge 2$ and $p \ge 1$ and coincides with the classical Hotelling's $T^2 = n(\bar X - \mu_p)'S^{-1}(\bar X - \mu_p)$ when $p < n$. Despite the simplicity of this generalization, important differences occur between the new framework $p \ge n$ and the classical framework $p < n$. These differences involve:
• the connection between $T^2$ and the univariate Student's $t$ test statistic (subsection 3.1);
• the invariance properties of $T^2$ (subsection 3.2);
• the distribution law of $T^2$ (subsection 3.3);
• the geometrical characteristics of the confidence regions and of the critical regions for the mean that can be derived from Corollaries 2 and 4 (subsection 3.4).

Remarks about Corollary 4
1. The p-asymptotic distribution law of $T^2$ strongly depends on assumptions (i′) and (iv), involving the normal distribution of the observations and the p-asymptotic behavior of the covariance matrix, respectively. In particular, while the latter assumption could probably be relaxed (this issue is still under investigation by the authors), the former assumption cannot, since there is no Central Limit Theorem for $p \to \infty$ guaranteeing that $\sqrt{n}(\bar X - \mu_p)$ is approximately normal and $(n-1)S$ is approximately a Wishart.
2. Corollary 4 (and similarly Corollary 5) generalizes to the finite-sample framework the result presented in Theorem 2.3 of Srivastava (2007), where, under the same assumptions (i′) and (iv), the distribution of a transformation of $T^2$ is presented for both $p$ and $n \to \infty$. In detail, Theorem 2.3 of Srivastava (2007) states that, under assumptions (i′) and (iv), for $p$ and $n \to \infty$:
$$\frac{\dfrac{\bar\sigma^2}{\overline{\sigma^2}}\,\dfrac{np}{n-1}\,(\bar X-\mu_p)'\,S^+\,(\bar X-\mu_p) \;-\; (n-1)}{\sqrt{2(n-1)}} \;\xrightarrow{\;d\;}\; N(0,1).$$
In the light of Corollary 4, Theorem 2.3 of Srivastava (2007) turns out to be a special case which is obtained when $n \to \infty$ and the n-asymptotic normal approximation of the $\chi^2(n-1)$ distribution is used. Even if our result covers a wider variety of real applications, it does not cover the analysis of small p small n data with $p > n$. This latter scenario is still an open issue except for the case of homoscedastic and independent components. This case is indeed fully developed in Srivastava (2007), where Theorem 2.2 states that, under assumption (i′) and $\Sigma_p = \gamma I_p$, for $p \ge n$:
$$\frac{p-n+2}{(n-1)^2}\; n\,(\bar X - \mu_p)'\,S^+\,(\bar X - \mu_p) \;\sim\; F(n-1,\; p-n+2).$$

Connections between the generalized Hotelling's $T^2$ and the Student's t test statistic
Student's $t$ statistic arises naturally in multivariate statistics if the $\mathbb{R}^p$-representations of the $n$ sample units are projected along a certain direction $a \in \mathbb{R}^p \setminus \{0_p\}$. Along this direction, the usual Student's $t$ statistic can be computed,
$$t_a = \frac{a'\bar X - a'\mu_p}{\sqrt{a'Sa/n}},$$
and univariate inference can be carried out along that direction.
Note that for all $a \in \mathbb{R}^p\setminus\{0_p\}$, $t_a$ is almost surely defined. Indeed, since $\ker(S)$ has null Lebesgue measure in $\mathbb{R}^p$ and since the $X_i$, $i = 1,\dots,n$, are absolutely continuous with respect to the Lebesgue measure, the probability that $a \in \ker(S)$ is equal to zero.
The maximization lemma for quadratic forms points out a strong relation between the $T^2$ defined in (3.1) and the univariate $t_a$. Indeed, one can show that
$$T^2 = \max_{a \in \operatorname{Im}(S)\setminus\{0_p\}} t_a^2. \quad (3.2)$$
This means that making multivariate inference using $T^2$ at a certain confidence (significance) level is formally the same as making simultaneous univariate inference along any direction belonging to the "variability space explored by the data" (i.e., any direction $a \in \operatorname{Im}(S)\setminus\{0_p\}$), while controlling the overall joint confidence (significance) level, and ignoring all orthogonal directions (i.e., any direction $a \in \ker(S)\setminus\{0_p\}$). Note that $\mathbb{R}^p = \operatorname{Im}(S)\oplus\ker(S)$ for $n \ge 2$ and $p \ge 1$, with $\operatorname{Im}(S) = \mathbb{R}^p$ and $\ker(S) = \{0_p\}$ almost surely if and only if $n > p$. Thus, when $n > p$, $T^2$ can be more simply defined as $\max_{a\in\mathbb{R}^p\setminus\{0_p\}} t_a^2$; this is actually the most common way through which $T^2$ is introduced in the classical framework $n > p$. In the general framework the latter definition does not hold, since $t_a^2$ is not uniformly bounded on $\mathbb{R}^p\setminus\{0_p\}$ when $n \le p$.
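A small numerical check of characterization (3.2): by the standard maximization lemma the maximum over $\operatorname{Im}(S)$ is attained, up to scaling, at $a = S^+(\bar X - \mu_p)$; the sizes and data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 40                                  # p > n: S is singular
X = rng.normal(size=(n, p))
mu = np.zeros(p)
xbar, S = X.mean(axis=0), np.cov(X, rowvar=False)
Sp = np.linalg.pinv(S)                        # Moore-Penrose inverse S^+
d = xbar - mu

T2 = n * d @ Sp @ d                           # generalized Hotelling's T^2

def t2(a):                                    # squared Student's t along direction a
    return n * (a @ d) ** 2 / (a @ S @ a)

# the maximum over Im(S) is attained (up to scaling) at a = S^+ d ...
print(T2, t2(Sp @ d))
# ... while other directions in Im(S) never exceed it
A = S @ rng.normal(size=(p, 5))               # five random directions in Im(S)
print(max(t2(A[:, j]) for j in range(5)) <= T2 + 1e-9)
```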

Invariance properties of the generalized Hotelling's $T^2$
The Generalized Hotelling's $T^2$ is invariant under similarity transformations of the components (affine transformations preserving angles), i.e., transformations $x \mapsto Ax + b$ with $A = aO \in \mathbb{R}^{p\times p}$, where $a > 0$, $O$ is an orthogonal matrix, and $b \in \mathbb{R}^p$. Indeed, under these assumptions we have that:
$$n\,\big(A\bar X + b - (A\mu_p + b)\big)'\,(ASA')^+\,\big(A\bar X + b - (A\mu_p + b)\big) = n\,(\bar X - \mu_p)'\,A'(ASA')^+A\,(\bar X - \mu_p) = n\,(\bar X - \mu_p)'\,S^+\,(\bar X - \mu_p).$$
The previous result relies on the fact, proven in Appendix A, that $(a^2\,O S O')^+ = a^{-2}\,O S^+ O'$. Similarity transformations also do not affect assumption (iv) of Theorem 3 nor the value of the constant $\bar\sigma^2/\overline{\sigma^2}$. On the contrary, they might affect the sparsity of both the covariance matrix $\Sigma_p$ and the mean vector $\mu_p$. Thus, as pointed out by the referees, no extra standard assumption on sparsity is required to carry out inference based on the Generalized Hotelling's $T^2$. Indeed, for instance, given a covariance structure, it is always possible by means of suitable orthogonal transformations to obtain inferentially equivalent scenarios in which the covariance matrix is very sparse (i.e., diagonal) or completely full. Thus, sparsity of the covariance matrix is not an issue for inference based on the Generalized Hotelling's $T^2$.
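The similarity invariance can be verified numerically; in the following sketch the scale factor, the orthogonal matrix (obtained from a QR factorization), and the shift are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 30
X = rng.normal(size=(n, p))
mu0 = np.zeros(p)

def gen_T2(X, mu0):
    n = X.shape[0]
    d = X.mean(axis=0) - mu0
    return n * d @ np.linalg.pinv(np.cov(X, rowvar=False)) @ d

a = 2.7                                          # positive scale factor
O, _ = np.linalg.qr(rng.normal(size=(p, p)))     # orthogonal matrix
b = rng.normal(size=p)

X_new = X @ (a * O).T + b                        # similarity transformation x -> a O x + b
mu0_new = a * O @ mu0 + b

print(gen_T2(X, mu0), gen_T2(X_new, mu0_new))    # the two values coincide (up to rounding)
```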
Finally, it is easy to show that for $n > p$, $T^2$ is invariant under the wider class of affine transformations of the components, i.e., transformations $x \mapsto Ax + b$ with $A \in \mathbb{R}^{p\times p}$ invertible and $b \in \mathbb{R}^p$. This is due to the fact that $(ASA')^{-1} = (A')^{-1}S^{-1}A^{-1}$ for any invertible $A$. Lehmann and Romano (2005) proved that invariance under generic affine transformations cannot be achieved in the framework $p \ge n$.

On the p-asymptotic law of the generalized Hotelling's $T^2$
For Corollary 4 to have some impact for inferential purposes, the constant $\bar\sigma^2/\overline{\sigma^2}$ needs to be known or at least efficiently estimated. Two cases may occur in practical situations when $\Sigma_p$ is not known: (a) the constant $\bar\sigma^2/\overline{\sigma^2}$ is known even if $\Sigma_p$ is not completely known (this case may occur when partial knowledge of $\Sigma_p$ is available); (b) the constant $\bar\sigma^2/\overline{\sigma^2}$ is not known and thus needs to be estimated.
Case (a) covers a few practical situations. For instance, the constant $\bar\sigma^2/\overline{\sigma^2}$ is known if the unknown covariance matrix $\Sigma_p$ is equal to $\Sigma_p = \widetilde\Sigma_p + \gamma V_p$, with $\widetilde\Sigma_p$ an unknown positive definite (or even semi-definite) matrix such that $\lim_{p\to\infty}\operatorname{tr}(\widetilde\Sigma_p) < +\infty$, $\gamma$ an unknown positive constant, and $V_p$ a known positive definite matrix satisfying (iv). Indeed, in this case it can be proven that
$$\frac{\bar\sigma^2}{\overline{\sigma^2}} = \frac{\big(\lim_{p\to\infty}\operatorname{tr}(V_p)/p\big)^2}{\lim_{p\to\infty}\operatorname{tr}(V_p^2)/p};$$
the proof comes straightforwardly once it is noticed that, without loss of generality, $V_p$ can be assumed to be diagonal. A covariance matrix of this form occurs, for instance, in all applications where the observed $p$-variate random vectors are assumed to be generated by the sum of two independent terms: a structural term whose variability is concentrated on a finite number of components (or even infinitely many but with finite total variance) and a zero-mean nuisance term (due to background noise or measurement errors), satisfying (iv) and acting on all components. If the covariance matrix of the nuisance term is assumed to be proportional to the identity matrix (as often happens), we have $\bar\sigma^2/\overline{\sigma^2} = 1$. This assumption may hold, for instance, in genetics, where long arrays of genes are observed on a small number of patients: the variability of the array can indeed be assumed to be generated by two independent terms, an informative variability concentrated on a reduced number of positively/negatively correlated genes and a nuisance homoscedastic error variability acting independently on each gene. Spectral data present another situation where the latter assumption may hold; indeed, spectral data are characterized by the presence of nuisance background variability along the entire set of observed frequencies plus a series of independent sources of variability at some specific frequencies (bands) associated with the spectral signatures of different molecules.
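A small numerical illustration of the structural-plus-noise situation just described, with a rank-3 structural term and $V_p = I_p$ (both the rank and the noise variance $\gamma$ are illustrative): the ratio computed from the finite-p traces approaches 1 as $p$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.7                                      # illustrative noise variance

for p in (100, 500, 2000):
    B = rng.normal(size=(p, 3))
    Sigma_tilde = B @ B.T / p                    # rank-3 structural term, tr(Sigma_tilde) bounded in p
    Sigma = Sigma_tilde + gamma * np.eye(p)      # Sigma_p = Sigma_tilde_p + gamma * V_p with V_p = I_p
    a1 = np.trace(Sigma) / p                     # finite-p version of bar{sigma}
    a2 = np.sum(Sigma * Sigma) / p               # tr(Sigma_p^2)/p, finite-p version of overline{sigma^2}
    print(p, a1 ** 2 / a2)                       # approaches 1 as p grows
```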
Case (b) is the case where the information about the covariance structure is sufficient to know that $\Sigma_p$ satisfies (iv), but not sufficient to know the value of the constant $\bar\sigma^2/\overline{\sigma^2}$. For instance, referring to the previous examples, we might know that $V_p$ has a block-diagonal structure with $\ell\times\ell$ blocks all equal to an unknown positive definite matrix $B$. In this case we know that $\bar\sigma$ and $\overline{\sigma^2}$ are for sure positive and finite without knowing their actual values $\bar\sigma = \operatorname{tr}(B)/\ell$ and $\overline{\sigma^2} = \operatorname{tr}(B^2)/\ell$.
In this second case, having a good estimate of the constant $\bar\sigma^2/\overline{\sigma^2}$ becomes of primary importance. First of all, one can rely on some natural bounds for the constant based on the ratio between the maximal and minimal eigenvalues of $\Sigma_p$: for instance,
$$1 - \left(\frac{\lambda_{\max}}{\lambda_{\min}} - 1\right)^2 \;\le\; \frac{\bar\sigma^2}{\overline{\sigma^2}} \;\le\; 1,$$
useful when all eigenvalues are known to be similar (i.e., $\Sigma_p$ is well-conditioned), or
$$\left(\frac{\lambda_{\min}}{\lambda_{\max}}\right)^2 \;\le\; \frac{\bar\sigma^2}{\overline{\sigma^2}} \;\le\; 1,$$
useful when some eigenvalues are known to be very different (i.e., $\Sigma_p$ is ill-conditioned).
A better estimate of $\bar\sigma^2/\overline{\sigma^2}$ can be obtained by using estimates of $\operatorname{tr}(\Sigma_p)/p$ and $\operatorname{tr}(\Sigma_p^2)/p$. Indeed, for $p\to\infty$ these quantities converge by definition to $\bar\sigma$ and $\overline{\sigma^2}$, and thus any unbiased estimator of $\operatorname{tr}(\Sigma_p)/p$ (or $\operatorname{tr}(\Sigma_p^2)/p$) for a given $p$ is also a p-asymptotically unbiased estimator of $\bar\sigma$ (or $\overline{\sigma^2}$). The following estimators (introduced in Srivastava (2005)), defined for all $n \ge 3$ and $p \ge 1$, can be proven to satisfy this property:
$$\hat{\bar\sigma} = \frac{\operatorname{tr}(S)}{p}, \qquad \hat{\overline{\sigma^2}} = \frac{(n-1)^2}{(n-2)(n+1)}\,\frac{1}{p}\left[\operatorname{tr}(S^2) - \frac{(\operatorname{tr} S)^2}{n-1}\right].$$

3.4. p-asymptotic confidence region and hypothesis test for the mean of a normal population when p ≫ n

Corollary 4 turns out to be a useful tool for the construction of confidence regions and hypothesis tests for the mean in all practical situations where the number $p$ of random vector components is far larger than the number $n$ of sample units, or even virtually infinite (e.g., functional data), and the data can be assumed normally distributed.
A p-asymptotic Confidence Region for the mean $\mu_p$ can be defined as follows:
$$CR_{1-\alpha}(\mu_p) = \left\{\mu\in\mathbb{R}^p :\; \frac{\bar\sigma^2}{\overline{\sigma^2}}\,\frac{np}{n-1}\,(\bar X-\mu)'\,S^+\,(\bar X-\mu) \;\le\; \chi^2_\alpha(n-1)\right\}, \quad (3.4)$$
with $\chi^2_\alpha(n-1)$ being the upper $\alpha$-quantile of a $\chi^2(n-1)$ random variable and $1-\alpha$ being the p-asymptotic confidence level.
Equivalently, a p-asymptotic Hypothesis Test for $H_0: \mu_p = \mu_{0p}$ versus $H_1: \mu_p \ne \mu_{0p}$ with p-asymptotic significance level $\alpha$ has the following rejection region: reject $H_0$ in favor of $H_1$ if
$$\frac{\bar\sigma^2}{\overline{\sigma^2}}\,\frac{np}{n-1}\,(\bar X-\mu_{0p})'\,S^+\,(\bar X-\mu_{0p}) \;>\; \chi^2_\alpha(n-1). \quad (3.5)$$
The confidence region $CR_{1-\alpha}(\mu_p)$ is not of practical use for graphical purposes, since a clear visual representation of it is not straightforward due to the large value of $p$. Similarly to the traditional multivariate framework, univariate projections of the confidence region along some directions (i.e., $T^2$-simultaneous confidence intervals) can give a rough idea of the location and shape of the confidence region, providing, in the case of rejection of $H_0$, also some help in detecting the directions that led to the rejection of $H_0$. From Corollary 4 and characterization (3.2) we have that, for $p \to \infty$:
$$P\!\left(a'\mu_p \in \left[a'\bar X \mp \sqrt{\tfrac{\overline{\sigma^2}}{\bar\sigma^2}\,\tfrac{n-1}{np}\,\chi^2_\alpha(n-1)\; a'Sa}\,\right],\ \forall\, a \in \operatorname{Im}(S)\setminus\{0_p\}\right) \;\to\; 1-\alpha.$$
Thus, given a direction $a \in \operatorname{Im}(S)\setminus\{0_p\}$, the corresponding $T^2$-simultaneous confidence interval with p-asymptotic family-wise confidence $1-\alpha$ can be defined as follows:
$$\left[\,a'\bar X - \sqrt{\tfrac{\overline{\sigma^2}}{\bar\sigma^2}\,\tfrac{n-1}{np}\,\chi^2_\alpha(n-1)\; a'Sa}\;,\;\; a'\bar X + \sqrt{\tfrac{\overline{\sigma^2}}{\bar\sigma^2}\,\tfrac{n-1}{np}\,\chi^2_\alpha(n-1)\; a'Sa}\,\right].$$
If $a \notin \operatorname{Im}(S)\setminus\{0_p\}$, then the corresponding $T^2$-simultaneous confidence interval is not bounded, i.e., equal to $(-\infty, +\infty)$.
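Putting the pieces of this subsection together, a sketch of the one-sample procedure (3.5) with the constant estimated from the data through the trace-based estimators reported above; the function names and the simulated data set are illustrative.

```python
import numpy as np
from scipy import stats

def estimate_constants(S, n):
    """p-asymptotically unbiased estimators of bar{sigma} and overline{sigma^2} (n >= 3)."""
    p = S.shape[0]
    sigma_bar = np.trace(S) / p
    sigma2_bar = (n - 1) ** 2 / ((n - 2) * (n + 1) * p) * (
        np.sum(S * S) - np.trace(S) ** 2 / (n - 1))   # np.sum(S*S) = tr(S^2) for symmetric S
    return sigma_bar, sigma2_bar

def p_asymptotic_test(X, mu0, alpha=0.05):
    """One-sample p-asymptotic Generalized Hotelling test, rejection rule (3.5)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    sigma_bar, sigma2_bar = estimate_constants(S, n)
    d = X.mean(axis=0) - mu0
    stat = sigma_bar ** 2 / sigma2_bar * n * p / (n - 1) * d @ np.linalg.pinv(S) @ d
    return stat, stat > stats.chi2.ppf(1 - alpha, df=n - 1)

# illustrative large-p small-n data generated under H0
rng = np.random.default_rng(5)
n, p = 10, 1024
X = rng.normal(size=(n, p)) * np.sqrt(np.tile([0.5, 1.5], p // 2))
print(p_asymptotic_test(X, mu0=np.zeros(p)))
```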
Confidence region (3.4) and rejection region of test (3.5) present some peculiar features that are worth a little discussion.
Because $S^+$ is positive semi-definite, the confidence region $CR_{1-\alpha}(\mu_p)$, which for $n > p$ is an ellipsoidal subset of $\mathbb{R}^p$, turns out to be a cylinder in $\mathbb{R}^p$ generated by the orthogonal extension in $\ker(S)$ of an $(n-1)$-dimensional ellipsoid contained in $\operatorname{Im}(S)$. As illustrative examples, three confidence regions for the mean vector when $p = 3$ and $n = 2, 3, 4$, respectively, are reported in Figure 1. In particular, as shown by the analytic expression of the generalized $T^2$-simultaneous confidence intervals, $CR_{1-\alpha}(\mu_p)$ is bounded in all directions belonging to the random space $\operatorname{Im}(S)$. These directions are easily identifiable, since the first $n-1$ sample principal components provide an orthonormal basis for $\operatorname{Im}(S)$.
Due to the non-null dimension of the random space $\ker(S)$ and to the orthogonality between $\ker(S)$ and $\operatorname{Im}(S)$, the statistic $\frac{\bar\sigma^2}{\overline{\sigma^2}}\frac{np}{n-1}(\bar X - \mu_{0p})'S^+(\bar X - \mu_{0p})$ appearing in the hypothesis test (3.5) does not change if $\mu_{0p}$ is replaced by $\mu_{0p} + m_{\ker(S)}$, with $m_{\ker(S)}$ being any vector belonging to $\ker(S)$. This implies that $H_0$ might not be rejected even for values of the sample mean $\bar X$ that are "really very far" from $\mu_{0p}$ in some direction within $\ker(S)$. This is not surprising, because the use of $S^+$ implies an exclusive focus on the space $\operatorname{Im}(S)$ (the variability space explored by the data), neglecting all $p - n + 1$ directions associated with $\ker(S)$ (the space orthogonal to the variability space explored by the data).
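This insensitivity to $\ker(S)$ can be checked directly; in the sketch below the $\ker(S)$ vector is obtained by projecting an arbitrary vector onto the orthogonal complement of $\operatorname{Im}(S)$, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 5, 50
X = rng.normal(size=(n, p))
xbar, S = X.mean(axis=0), np.cov(X, rowvar=False)
Sp = np.linalg.pinv(S)

mu0 = np.zeros(p)
# S S^+ is the orthogonal projector onto Im(S), so v - S S^+ v lies in ker(S)
v = rng.normal(size=p)
m_ker = v - S @ Sp @ v

stat = lambda mu: (xbar - mu) @ Sp @ (xbar - mu)
print(stat(mu0), stat(mu0 + 10.0 * m_ker))    # identical up to rounding
```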

p-asymptotic pooled confidence region and hypothesis test for comparing the means of two normal populations when p ≫ n

Theorem 3 can also be used to tackle the problem of comparing the means of two normal populations when the number $p$ of components is larger than the number $n$ of sample units. Indeed, under the same assumptions of the classical multivariate analysis of variance, we have:

Corollary 5 (Generalized Pooled Hotelling's $T^2_{pooled}$ p-asymptotic distribution law). For $n_a \ge 1$, $n_b \ge 1$, and $p \ge 1$, assume that:
(i) $X_{a1},\dots,X_{an_a}$ i.i.d. $N_p(\mu_{pa},\Sigma_p)$ and $X_{b1},\dots,X_{bn_b}$ i.i.d. $N_p(\mu_{pb},\Sigma_p)$, with $\Sigma_p$ positive definite;
(ii) the two samples are independent;
(iii) $0 < \lim_{p\to\infty}\operatorname{tr}(\Sigma_p)/p = \bar\sigma < +\infty$ and $0 < \lim_{p\to\infty}\operatorname{tr}(\Sigma_p^2)/p = \overline{\sigma^2} < +\infty$.
Then, for $n_a + n_b \ge 3$ and $p\to\infty$:
$$\frac{\bar\sigma^2}{\overline{\sigma^2}}\,\frac{p}{n_a+n_b-2}\,\frac{n_a n_b}{n_a+n_b}\,\big((\bar X_a-\bar X_b)-(\mu_{pa}-\mu_{pb})\big)'\,S_{pooled}^+\,\big((\bar X_a-\bar X_b)-(\mu_{pa}-\mu_{pb})\big) \;\xrightarrow{\;d\;}\; \chi^2(n_a+n_b-2),$$
where $\bar X_a$ and $\bar X_b$ are the two sample means, and $S_{pooled}$ is the pooled sample covariance matrix.
It is natural to denote the following quantity as Generalized Pooled Hotelling's $T^2_{pooled}$:
$$T^2_{pooled} = \frac{n_a n_b}{n_a+n_b}\,\big((\bar X_a-\bar X_b)-(\mu_{pa}-\mu_{pb})\big)'\,S_{pooled}^+\,\big((\bar X_a-\bar X_b)-(\mu_{pa}-\mu_{pb})\big). \quad (3.6)$$
Indeed, it is defined for any $n_a$, $n_b$, and $p$ such that $n_a+n_b\ge 3$, $n_a\ge 1$, $n_b\ge 1$, and $p\ge 1$, and coincides with the classical definition of the Pooled Hotelling's $T^2_{pooled}$ when $p \le n_a+n_b-2$. The similarities and the differences between the framework $p > n_a+n_b-2$ and the classical framework $p \le n_a+n_b-2$ are analogous to the ones presented in Section 3 for the Generalized Hotelling's $T^2$.
In particular, also in the two-population framework we obtain a confidence region for estimating the difference of the two means and a rejection region for testing it.
A p-asymptotic Confidence Region for the difference of the means $\mu_{pa}-\mu_{pb}$ can be defined as follows:
$$CR_{1-\alpha}(\mu_{pa}-\mu_{pb}) = \left\{\delta\in\mathbb{R}^p :\; \frac{\bar\sigma^2}{\overline{\sigma^2}}\,\frac{p}{n_a+n_b-2}\,\frac{n_a n_b}{n_a+n_b}\,(\bar X_a-\bar X_b-\delta)'\,S_{pooled}^+\,(\bar X_a-\bar X_b-\delta) \;\le\; \chi^2_\alpha(n_a+n_b-2)\right\}, \quad (3.7)$$
with $1-\alpha$ being the p-asymptotic confidence level. Equivalently, a p-asymptotic Hypothesis Test for $H_0: \mu_{pa}-\mu_{pb} = \Delta\mu_{0p}$ versus $H_1: \mu_{pa}-\mu_{pb} \ne \Delta\mu_{0p}$ with p-asymptotic significance level $\alpha$ has the following rejection region: reject $H_0$ in favor of $H_1$ if
$$\frac{\bar\sigma^2}{\overline{\sigma^2}}\,\frac{p}{n_a+n_b-2}\,\frac{n_a n_b}{n_a+n_b}\,(\bar X_a-\bar X_b-\Delta\mu_{0p})'\,S_{pooled}^+\,(\bar X_a-\bar X_b-\Delta\mu_{0p}) \;>\; \chi^2_\alpha(n_a+n_b-2). \quad (3.8)$$
Also the analytical expression of the $T^2_{pooled}$-simultaneous confidence intervals for the difference of the means comes naturally.
Also in the two-population framework, the unknown constant $\bar\sigma^2/\overline{\sigma^2}$ can be estimated by means of the following p-asymptotically unbiased and consistent estimators:
$$\hat{\bar\sigma} = \frac{\operatorname{tr}(S_{pooled})}{p}, \qquad \hat{\overline{\sigma^2}} = \frac{(n_a+n_b-2)^2}{(n_a+n_b-3)(n_a+n_b)}\,\frac{1}{p}\left[\operatorname{tr}(S_{pooled}^2) - \frac{(\operatorname{tr} S_{pooled})^2}{n_a+n_b-2}\right]. \quad (3.9)$$
Recently, Srivastava and Yanagihara (2010) and Pigoli et al. (2012) proposed tests for the equality of two covariance matrices in the large p small n framework, making it possible to check the homoscedasticity assumption on which the previous results rely.
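A sketch of the two-sample procedure (3.8), using pooled trace-based estimators of $\bar\sigma$ and $\overline{\sigma^2}$ in the form given in (3.9); the sample sizes, dimension, and mean shift are illustrative.

```python
import numpy as np
from scipy import stats

def pooled_test(Xa, Xb, delta0, alpha=0.05):
    """Two-sample p-asymptotic Generalized Hotelling test, rejection rule (3.8)."""
    na, p = Xa.shape
    nb = Xb.shape[0]
    m = na + nb - 2                                          # pooled degrees of freedom
    Sa, Sb = np.cov(Xa, rowvar=False), np.cov(Xb, rowvar=False)
    S_pooled = ((na - 1) * Sa + (nb - 1) * Sb) / m
    # pooled analogues of the trace-based estimators of bar{sigma} and overline{sigma^2}
    sigma_bar = np.trace(S_pooled) / p
    sigma2_bar = m ** 2 / ((m - 1) * (m + 2) * p) * (
        np.sum(S_pooled * S_pooled) - np.trace(S_pooled) ** 2 / m)
    d = Xa.mean(axis=0) - Xb.mean(axis=0) - delta0
    T2_pooled = na * nb / (na + nb) * d @ np.linalg.pinv(S_pooled) @ d
    stat = sigma_bar ** 2 / sigma2_bar * p / m * T2_pooled
    return stat, stat > stats.chi2.ppf(1 - alpha, df=m)

rng = np.random.default_rng(7)
Xa = rng.normal(size=(10, 512))
Xb = rng.normal(size=(10, 512)) + 0.4                        # mean shift on every component
print(pooled_test(Xa, Xb, delta0=np.zeros(512)))
```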

Simulation study
In this section, we estimate, by means of MC simulations, the power and the actual level of significance of the new test presented in (3.8); from now on we will refer to it as the p-asymptotic Generalized Hotelling's test, being based on the (finite-$n$) p-asymptotic distribution of the Generalized Hotelling's $T^2$. In particular, we estimate the probability of rejecting the null hypothesis $H_0: \mu_a = \mu_b$ in favor of the alternative hypothesis $H_1: \mu_a \ne \mu_b$ in twelve different cases and for increasing values of the number $p$ of components, ranging between $2^0$ and $2^{10}$ (i.e., 1 and 1024), with $n_a = n_b = 10$, where: $I$ is the identity matrix; $D$ is a diagonal matrix whose diagonal alternately assumes the values 0.5 and 1.5; $R$ is a block matrix whose blocks are equal to the matrix $\begin{pmatrix}1 & 0.5\\ 0.5 & 1\end{pmatrix}$; $S$ is a block matrix whose blocks are equal to the matrix $\begin{pmatrix}1 & -0.5\\ -0.5 & 1\end{pmatrix}$ (the covariance matrices $R$ and $S$ can be simply obtained from $D$ by means of an orthogonal transformation: 45° anticlockwise and 45° clockwise pairwise rotations, respectively); and $L$ is a diagonal matrix whose diagonal alternately assumes the values 0.001 and 1.999. The values for $n_a$ and $n_b$ are the same used in the simulation study presented in Pesarin and Salmaso (2010, 2009). The values for $\mu_a$ and $\mu_b$ used in cases I0, D0, R0, S0, and L0 are the ones under the null hypothesis; the ones used in cases I1, D1, R1, S1, and L1 are the same used in the simulation study presented in Pesarin and Salmaso (2010, 2009); the ones used in cases I2, D2, R2, S2, and L2 investigate an alternative-hypothesis scenario where (for the same value of the non-centrality parameter $\|\mu_a - \mu_b\|$) the mean difference between the two populations is concentrated on just one component (cases I2, D2, R2, S2, and L2) rather than uniformly spread over all components (cases I1, D1, R1, S1, and L1). The values for $\Sigma$ used in cases I0, I1, and I2 are once again the same used in Pesarin and Salmaso (2010, 2009), while the values used in the remaining cases are meant to provide less trivial situations where assumption (iv) still holds. Note that in cases I0, I1, and I2 the constant $\bar\sigma^2/\overline{\sigma^2} = 1$, in cases L0, L1, and L2 the constant $\bar\sigma^2/\overline{\sigma^2} \approx 1/2$, while in the remaining ones $\bar\sigma^2/\overline{\sigma^2} = 4/5$. All simulations have been performed twice: first assuming the constant $\bar\sigma^2/\overline{\sigma^2}$ known, to avoid confounding effects due to its estimation, and then estimating the constant $\bar\sigma^2/\overline{\sigma^2}$ by means of (3.9), to evaluate the bias due to its estimation. Results of the latter group of simulations are briefly discussed at the end of subsection 4.2. The rest of this section refers to the former group of simulations.
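The covariance structures used above can be reproduced, and the corresponding values of $\bar\sigma^2/\overline{\sigma^2}$ checked, with a few lines ($p$ even; the 45° rotations act pairwise on consecutive components, as described):

```python
import numpy as np

def constant(Sigma):
    """Finite-p value of (tr(Sigma)/p)^2 / (tr(Sigma^2)/p)."""
    p = Sigma.shape[0]
    return (np.trace(Sigma) / p) ** 2 / (np.sum(Sigma * Sigma) / p)

p = 1024
D = np.diag(np.tile([0.5, 1.5], p // 2))
L = np.diag(np.tile([0.001, 1.999], p // 2))

theta = np.pi / 4                                    # 45-degree pairwise rotation
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.kron(np.eye(p // 2), G)                       # block-diagonal rotation matrix
R = Q.T @ D @ Q                                      # 2x2 blocks [[1, 0.5], [0.5, 1]]
S = Q @ D @ Q.T                                      # 2x2 blocks [[1, -0.5], [-0.5, 1]]

print(constant(np.eye(p)), constant(D), constant(R), constant(S), constant(L))
# expected values: 1, 4/5, 4/5, 4/5, ~1/2
```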
In detail, for each case and for each value of the number $p$ of components, 1000 synthetic data sets have been randomly generated according to the corresponding model and, for each one of them, the p-asymptotic Generalized Hotelling's test has been performed at a nominal level of significance $\alpha = 0.05$. The relative number of times the null hypothesis has been rejected provides the estimate of either the actual level of significance of the p-asymptotic Generalized Hotelling's test (cases I0, D0, R0, S0, and L0) or its power (all remaining cases). The same data sets have also been used to perform three other tests that recently appeared in the literature: the one presented in Pesarin and Salmaso (2010, 2009), the one presented in Theorem 2.2 of Srivastava (2007), and the one presented in Theorem 2.3 of Srivastava (2007). In this section we will refer to them as the Pesarin-Salmaso's test, the Identity-matrix Generalized Hotelling's test, and the n-p-asymptotic Generalized Hotelling's test, respectively.
Analogously to the p-asymptotic Generalized Hotelling's test, also the n-p-asymptotic and the Identity-matrix Generalized Hotelling's tests are based on the generalized $T^2_{pooled}$. The three tests differ in the distribution used to build the corresponding rejection region: the p-asymptotic Generalized Hotelling's test uses a rejection region built from its p-asymptotic distribution under assumption (iv) (equation (3.8)), the n-p-asymptotic Generalized Hotelling's test uses a rejection region built from its n-p-asymptotic distribution under assumption (iv) (Theorem 2.3 of Srivastava (2007)), and the Identity-matrix Generalized Hotelling's test uses a rejection region built from its exact distribution under the assumption of independent and homoscedastic components (Theorem 2.2 of Srivastava (2007)).
The Pesarin-Salmaso's test is not a model-based test but a permutation test; the implementation used here is the same used in the simulation study presented in Pesarin and Salmaso (2010, 2009), i.e., the statistic used is actually a random weighted sum of the $p$ univariate Student's $t^2_{pooled}$ statistics, which can be written as $(\bar X_a - \bar X_b - \Delta\mu_0)'\,S_{diag}^{-1}\,(\bar X_a - \bar X_b - \Delta\mu_0)$, and its conditional distribution over the values observed within each data set is estimated by sampling 1000 random permutations of the $n_a + n_b = 20$ p-dimensional vectors composing each data set. $S_{diag}$ is the diagonal matrix whose diagonal elements are the $p$ (non-pooled) sample variances of the $p$ components.
The results of the simulation study are summarized in Figure 2. For completeness, in the cases in which $p \le n_a + n_b - 2$, a classical Hotelling's test has been implemented.

Comparison between the p-asymptotic generalized Hotelling's test and the identity-matrix generalized Hotelling's test
In case I0, where $H_0$ is true and the hypotheses supporting the Identity-matrix Generalized Hotelling's test hold, the observed rate of rejection of the Identity-matrix Generalized Hotelling's test clearly matches its nominal level of significance of 5%; on the contrary, in cases D0, R0, S0, and L0, where $H_0$ is still true but the hypotheses supporting the Identity-matrix Generalized Hotelling's test do not hold, the observed rate of rejection of the Identity-matrix Generalized Hotelling's test significantly exceeds its nominal level of significance, yielding a strongly non-conservative test.
The assumptions on which the p-asymptotic Generalized Hotelling's test is based hold instead in all cases; indeed, for $p$ "large enough" (in these cases 1024 seems to be a large enough value for $p$), the observed rate of rejection matches the nominal level of significance of 5%. The almost identical patterns shown for cases D0, R0, and S0 confirm the invariance of the p-asymptotic Generalized Hotelling's test under orthogonal transformations of the components. Further simulations, not reported here, for different values of $n_a$ and $n_b$ show a quicker (slower) convergence to the nominal level of significance for smaller (larger) values of $n_a$ and $n_b$. For instance, for $n_a = n_b = 2$ (i.e., the smallest sample size we tested), the nominal value is already reached for $p = 64$. Note that, although small sample sizes increase the reliability of the p-asymptotic Generalized Hotelling's test, they of course also reduce the power of the same test, as expected.
Fortunately, the same simulations also suggest that the convergence rate is independent of the value of the constant $\bar\sigma^2/\overline{\sigma^2}$. This fact enables an a priori empirical assessment, for a given sample size, of the minimal number $p$ of random vector components (or, for a given $p$, of the maximal sample size) that is necessary to make the p-asymptotic Generalized Hotelling's test reliable.
Turning to the power under the alternative hypothesis $\Delta\mu = 0.4\cdot 1$, in case I1 the superiority of the p-asymptotic Generalized Hotelling's test is only apparent and is due to the mismatch between its actual and its nominal level of significance for too-small values of $p$. For large values of $p$ (i.e., values for which the actual level of significance reaches its nominal value), the powers of the two tests appear almost identical, confirming the p-asymptotic inferential equivalence of the two under the more stringent hypotheses of the Identity-matrix Generalized Hotelling's test.
In cases D1, R1, S1, and L1, the mismatch between the actual and the nominal level of significance completely affects the Identity-matrix Generalized Hotelling's test, providing meaningless power curves for this test. The only values of interest in these plots are the estimated powers of the p-asymptotic Generalized Hotelling's test for $p = 1024$, that is, the only case in which the nominal level of significance equals the actual one. Different values of that power are achieved in the four cases. In particular, a comparison of the p-asymptotic Generalized Hotelling's test and the classical Hotelling's test across these four cases points out an opposite behavior of the two: the power of the p-asymptotic Generalized Hotelling's test is higher when the power of the classical Hotelling's test is lower, and vice versa. In more detail, the power of the p-asymptotic Generalized Hotelling's test is enhanced (reduced) and the power of the classical Hotelling's test reduced (enhanced) for alternative hypotheses providing a difference of the means with large (small) components in the directions of important (in terms of eigenvalues) principal components and small (large) components in the directions of minor (in terms of eigenvalues) principal components. Indeed, the classical Hotelling's test is based on the Mahalanobis distance (induced by the inverse of the sample covariance matrix) between the sample difference of the means and the $H_0$ difference of the means; thus, in the classical Hotelling's test, the effect of differences occurring in the directions of the minor sample principal components is enlarged with respect to similar differences occurring in the directions of the important sample principal components. On the contrary, in the framework of the p-asymptotic Generalized Hotelling's test, the directions associated with the minor principal components are expected to be close to directions contained in $\ker(S)$, and thus any difference in these directions has a high probability of being annihilated by the Mahalanobis semi-distance induced by the generalized inverse of the sample covariance matrix.

4.2. Comparison between the p-asymptotic generalized Hotelling's test and the n-p-asymptotic generalized Hotelling's test

The actual significance level and power functions of the n-p-asymptotic Generalized Hotelling's test resemble those of the p-asymptotic Generalized Hotelling's test, except for a permanent bias, observed in all simulated cases, due to the use of the n-asymptotic approximation in a finite-n scenario. Indeed, differently from the p-asymptotic Generalized Hotelling's test, the actual significance level of the n-p-asymptotic Generalized Hotelling's test does not seem to converge to the nominal significance level of 5% for $p$ going to infinity, providing non-conservative inference in the large p small n case. Given the actual (finite-$n$) p-asymptotic distribution of the Generalized Hotelling's $T^2$ under the null hypothesis (equation (3.8)), it is straightforward to compute the actual p-asymptotic significance level of the n-p-asymptotic Generalized Hotelling's test, which does not depend on the covariance matrix. For the convenience of the reader, in the right panel of Figure 3, for the nominal significance levels $\alpha$ = 1%, 5%, and 10%, the actual p-asymptotic significance level of the n-p-asymptotic Generalized Hotelling's test is reported as the sample size $n_a + n_b$ grows from 3 to 1024; in the left panel of Figure 3 the same plot is reported for the one-population test. As one can see, the rate of convergence to the nominal value is quite slow, providing non-conservative inference in most real-world scenarios. In particular, the highest bias is observed for small values of the nominal significance level and small sample sizes, strongly discouraging the use of the n-p-asymptotic approximation in the finite-n framework.

Both the n-p-asymptotic and the p-asymptotic Generalized Hotelling's tests require the knowledge of the constant $\bar\sigma^2/\overline{\sigma^2}$. If the constant is not known, one can perform both tests estimating the constant by means of the estimators (3.9). In this case, all simulations, as expected, point out a positive bias for both tests (which indeed do not take into account the variability introduced by the estimators (3.9)). In detail, for $p = 1024$, the actual level of significance of the p-asymptotic Generalized Hotelling's test is estimated to be close to 7% in all simulated cases. This bias is expected to decrease (increase) as the sample sizes increase (decrease). Moreover, by comparing cases I0, D0, and L0, the bias does not seem to be affected by the condition number of $\Sigma_p$.
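Reading the n-p-asymptotic test as the rule that replaces the $\chi^2(n_a+n_b-2)$ quantile with its normal approximation $(n_a+n_b-2) + z_\alpha\sqrt{2(n_a+n_b-2)}$, its actual p-asymptotic level can be computed directly; the following is a sketch of this computation as we read it (the grid of sample sizes is illustrative):

```python
import numpy as np
from scipy import stats

alpha = 0.05
for n_tot in (3, 10, 40, 160, 1024):                    # n_a + n_b
    df = n_tot - 2                                      # degrees of freedom of the chi^2 law
    z = stats.norm.ppf(1 - alpha)
    threshold = df + z * np.sqrt(2 * df)                # normal-approximation critical value
    actual = stats.chi2.sf(threshold, df=df)            # actual p-asymptotic significance level
    print(n_tot, round(actual, 4))
```

For small degrees of freedom the actual level exceeds the nominal one, consistently with the non-conservative behavior discussed above.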

Comparison between the p-asymptotic generalized Hotelling's test and the Pesarin-Salmaso's test
The p-asymptotic Generalized Hotelling's test has also been compared with the Pesarin-Salmaso's test (Pesarin and Salmaso (2010, 2009)) by means of MC simulations. The aim of this comparison is to see to what extent the model-based approach, pioneered by Srivastava (2007) and further developed in this work, can compete with another promising and less traditional approach to the analysis of large p small n data: the multivariate permutation test (Pesarin and Salmaso (2010, 2009)). The Pesarin-Salmaso's test presents some very interesting features (proven in Pesarin and Salmaso (2010, 2009)): first of all, it does not require the normality of the data (testing for multivariate normality is still an open problem); secondly, its actual level of significance resembles the nominal level in all simulated scenarios (cases I0, D0, R0, S0, and L0) and for any value of $p$ (i.e., it is not p-asymptotic). The Pesarin-Salmaso's test also presents a few drawbacks due to the discrete nature of the permutational distribution, to the factorial growth of the number of permutations with respect to the sample size, and to the nonparametric nature of the permutational distribution. Indeed, given sample sizes $n_a$ and $n_b$, we have in general $(n_a+n_b)!$ possible permutations of the data associated, under the null hypothesis, with $\frac{(n_a+n_b)!}{n_a!\,n_b!}$ equiprobable different values of the test statistic.
The discrete nature of the permutational distribution is particularly evident for small sample sizes: in these cases, test randomization becomes mandatory in order to attain the nominal significance level.

Appendix A

The Moore-Penrose generalized inverse $A^+$ of an $\ell\times m$ matrix $A$ is the unique $m\times\ell$ matrix satisfying the four properties $AA^+A = A$, $A^+AA^+ = A^+$, $(AA^+)^\star = AA^+$, and $(A^+A)^\star = A^+A$. The first two properties let $A^+$ be a generalized inverse of $A$. The last two properties confer on $A^+$ its uniqueness. The symbol "$\star$" indicates the conjugate transpose of a matrix. For our purposes, all matrices have real entries and thus it is equivalent to the symbol "$'$" indicating the transposed matrix.
The proof of the uniqueness of $A^+$ can be found, for instance, in Rao and Mitra (1971).
Moreover, it can be proven, by means of simple computations, that if $A$ is a $p\times p$ symmetric matrix with real entries and rank $m \le p$, then $A^+ = \sum_{i=1}^{m}\lambda_i^{-1}e_ie_i'$, with $\lambda_1,\dots,\lambda_m$ being the $m$ non-zero eigenvalues of $A$ and $e_1,\dots,e_m$ the corresponding eigenvectors. An immediate consequence of this result is that, if $A$ is of full rank, then $A^+ = A^{-1}$.
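A quick numerical confirmation of this spectral formula (the matrix below is an illustrative rank-deficient positive semi-definite matrix):

```python
import numpy as np

rng = np.random.default_rng(8)
p, m = 20, 6
Y = rng.normal(size=(p, m))
A = Y @ Y.T                                       # symmetric PSD with rank m < p

lam, E = np.linalg.eigh(A)                        # eigenvalues (ascending) and eigenvectors
nz = lam > 1e-10 * lam.max()                      # indices of the non-zero eigenvalues
A_plus = (E[:, nz] / lam[nz]) @ E[:, nz].T        # sum_i lambda_i^{-1} e_i e_i'

print(np.allclose(A_plus, np.linalg.pinv(A)))     # True
```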
Hereby, we report some results necessary for the proof of Theorem 3; their proofs can be found in Rao and Mitra (1971).
Proposition 9. Let $A$ be an $\ell\times m$ matrix. Then $A^+ = (A'A)^+A' = A'(AA')^+$. Two particular cases are of interest:
• if $A$ is of full column rank $m$, then $A'A$ is invertible and we get $A^+ = (A'A)^{-1}A'$;
• if $A$ is of full row rank $\ell$, then $AA'$ is invertible and we get $A^+ = A'(AA')^{-1}$.
The proof can be found in Rao and Mitra (1971).