A new nonparametric test for two sample multivariate location problem with application to astronomy

This paper provides a nonparametric test for the identity of two multivariate continuous distribution functions (d.f.'s) when they differ in locations. The test uses Wilcoxon rank-sum statistics on distances between observations for each of the components and is unaffected by outliers. It is numerically compared with two existing procedures in terms of power. The simulation study shows that its power is strictly increasing in the sample sizes and/or in the number of components. The applicability of this test is demonstrated by use of two astronomical data sets on early-type galaxies.


Introduction
Astronomical data, coming from different sources and collected by different telescopes, are often needed to be combined in a complete data set for study. In this situation, it is always very important to test compatibility of two data sets, collected in different surveys or measured with different resolutions, before pooling them together and they can only be combined when they are compatible (see, for example, De et al., 2014;Modak et al., 2017). That means, they should have approximately the same amount of observational error on an average. One possible way to deal with this situation is to carry out the following hypothesis testing problem under multivariate set up.
Let (X 1 , ..., X n 1 ) and (Y 1 , ..., Y n 2 ) be two independent samples from p−variate (p ≥ 2) populations with continuous distribution functions (d.f.'s) F and G respectively, where G(x) = F (x − △) for all x ∈ R p . We consider the problem of testing the null hypothesis H 0 : △ = 0 against the alternative H 1 : △ = 0. Our test can be employed to solve the above stated problem and indicates compatibility only when the null hypothesis is accepted.
In this context, the Hotelling T 2 test (HT ) is optimal and unbiased when F is a p−variate normal d.f. However, for non-normal population, its finite sample unbiasedness is not certain (Seber, 2004). It performs poorly for highdimensional data (Bai and Saranadasa, 1996) and the observations which are affected by outliers. Moreover, it is incomputable when p > n 1 + n 2 − 2.
In astronomy, data collection on celestial bodies is often obscured by bad weather conditions, obstruction by another celestial objects, instrumental restrictions, etc., and it cannot be repeated. So, we often get data which are contaminated with noise, affected by outliers or sparsely distributed (see, for example, Feigelson and Babu, 2013). In such situations, the asymptotic distribution of HT based on approximating the population dispersion matrix by the sample dispersion matrix fails to attain the desired size of the test. Because the sample variance-covariance matrix is affected by the outliers and is not anymore a consistent estimator of its population version. This problem is resolved by considering rank tests based on mutual distances of observations in each component (see, for example, Jurečková and Kalina, 2012;Marozzi, 2015Marozzi, , 2016. In the present work, we apply this concept on Wilcoxon rank-sum statistic obtained from mutual distances between a first sample observation and the other observations in each component, and find the maximum of all the componentwise Wilcoxon rank-sum statistics. Thus we obtain n 1 such statistics for each of the first sample observations and combine them in an appropriate manner (see, for example, Jurečková and Kalina, 2012) to define the ultimate test statistic. Our simulation study shows significant improvement in terms of the efficacy of the test, which is measured by empirical power.
Missions like Galaxy Evolution Explorer, Kepler Space Telescope, Hubble Space Telescope collect terabytes of astronomical data, which are preserved in virtual archives like Sloan Digital Sky Survey, Multi-mission Archive at STSCI, NASA Extragalactic Data base, Chandra (see, for example, Chattopadhyay and Chattopadhyay, 2014). They provide multivariate data sets of considerably large sizes. So our aim is to develop a test which is distributionfree asymptotically under certain conditions and we concentrate on the situations where p << n 1 + n 2 (see, for example, De et al., 2014;Modak et al., 2017). Simulation study shows that the power of the proposed test is strictly increasing in the sample sizes and/or in the number of components, and emphasizes the usability of the test in checking compatibility of two multivariate large sample astronomical data sets. Moreover, our test, as it is based on ranks, performs robustly in the presence of outliers. We con-sider two competitors, viz., the Wilcoxon rank-sum based test in Jurečková and Kalina (2012), abbreviated as the JK test and its analogous test, using L 1 −norm instead of L 2 −norm, abbreviated as the JK a test.
The paper is organized as follows. The proposed test and its properties are discussed in Section 2. Section 3 contains simulation study. Application to astronomical data sets is given in Section 4. Section 5 concludes.

Test
.., n 2 are two samples, and Z = (X 1 , X 2 , ..., X n 1 , Y 1 , Y 2 , ..., Y n 2 ) ′ is the pooled sample of size N = n 1 + n 2 , in which Z j = (Z 1j , Z 2j , ..., Z N j ) ′ represents the j−th column of Z, j = 1, 2, ..., p. Let R ij be the rank of Z ij in Z 1j , Z 2j , ..., Z N j , j = 1, 2, ..., p. Then, the componentwise Wilcoxon rank-sum statistics are The corresponding rank matrix is R = R ij , in which each of the p columns represents a permutation of {1, 2, ..., N}. The matrix R can be transformed to a matrix R * by permuting the rows of R so that the first column of R * becomes in the natural order from 1 to N. Let the conditional probability distribution of R under H 0 , given R * , be P . Then R has uniform distribution over N! permutations of {1, 2, ..., N} under P . Next, we consider where E(W j |P ) = n 2 2 (N + 1) and V (W j |P ) = n 1 n 2 12 (N + 1), j = 1, 2, ..., p.
.., W o p ) ′ has zero mean vector and covariance matrix H with elements h jj ′ = 1 for j = j ′ and Let us now assume that for each N, there are n 1 = n 1 (N) and n 2 = n 2 (N) such that n 1 + n 2 = N and as N → ∞, Then, given P , the distribution of W o under H 0 converges to N p (0, Γ), where the elements of Γ are γ jj ′ = 1 for j = j ′ and γ jj ′ = the grade correlation coefficient between component j and component j ′ for j = j ′ (see, for example, Chatterjee and Sen, 1964;Puri and Sen, 1971). Thus, defining the statistic T = max{W o j , 1 ≤ j ≤ p}, the conditional distribution of T given P can be computed empirically under H 0 when Γ is known. In practice, Γ is unknown and is replaced by its consistent estimator H. Note that this criterion based on T is equivalent to the Tippet criterion (see, Tippett, 1931), where the minimum of the p−values over several tests is considered, which is also used as T IP JK in Marrozi (2016).
Note that {d j (i, l), l( = i) = 1, 2, ..., n 1 } represents sample corresponding to a distribution function F ij (x|X ij ), whereas {d j (i, l), l = n 1 +1, n 1 +2, ..., N} represents another sample corresponding to a distribution function G ij (x|X ij ). So, we can frame n 1 testing problems in which the null and the corresponding alternative hypotheses are, respectively, H 0j : F ij (x|X ij ) and G ij (x|X ij ) are identical and H 1j : G ij (x|X ij ) is stochastically larger than F ij (x|X ij ) , i = 1, 2, ..., n 1 . Then, for each i, H 0 is true if H 0j holds for all j and H 0 is false if H 0j does not hold for at least one j (i = 1, 2, ..., n 1 ). To test these, assuming continuity of the distribution functions, we proceed in the following way. Let for each i and given reduced rank collection matrix, P (i) be the conditional probability distribution of the rank collection matrix under H 0 , i = 1, 2, ..., n 1 . Then P (i)'s are equiprobable on (N−1)! permutations of {1, 2, ..., N− 1}, and hence we compute , j = 1, 2, ..., p.
Let r(i) denote the Spearman's rank correlation matrix for given i. Then, under (1) and given r(i), P (i) , the conditional distribution of T (i) = max{W o j (i), 1 ≤ j ≤ p} is asymptotically distribution-free. LetT be such that for any i (= 1, 2, ..., n 1 ), under both H 0 and H 1 . Here we also have random matrixr and conditional probability distributionP over the same probability measure space as that of P (i) such that for i = 1, 2, ..., n 1 . Hence, (2) implies ThenT can be taken as our test statistic, and our level α (0 < α < 1) critical

Simulation study
We draw two independent random samples of sizes n 1 and n 2 from p−variate Dependence among the components can be described by the following models: (a) Independence among the components of the parent d.f. (b) t copula (Nelsen, 2006) is an elliptical copula corresponding to a multivariate t distribution wherein the Sklar's theorem establishes dependence structure among the components. Let Ψ be the d.f. of the p−variate t distribution and Ψ i be the d.f. of the i−th component with inverse function Ψ −1 i , i = 1, 2, ..., p. Then t copula determined by Ψ is We consider t copula corresponding to the p−variate t distribution with 2 degrees of freedom and correlation matrix with all the off diagonal elements 0.15. (c) Frank copula (Nelsen, 2006) is an Archimedean copula established using the generator In particular, we choose Frank copula parameterized by β = 0.90. Location shift, denoted by µ, is added to the second sample and we consider the same shift (µ) to each component as all the marginal distributions are the same. Hence we compute size and power of the tests by taking µ = 0 and µ = 0, respectively. We take the nominal level α = 0.05 and set n 1 = n 2 = 50 and 100, as we consider the large sample situation, and p = 2, 4, 10, as we are concerned with the situation p << N. For normal parent the exact Snedecor's F distribution of HT is used, while the asymptotic χ 2 distribution is applied to non-normal parents. Asymptotic normality for JK and JK a is adopted and asymptotic distribution forT is computed empirically. The outcomes are computed using 10000 replications.
In simulation study, we estimate the error as follows (see, Marrozi, 2016). Let RH i be the random variable denoting the rejection of H 0 in the i−th replication of the simulation and the probability of rejecting H 0 is P RH, then RH i iid ∼ Ber(1, P RH), i = 1, 2, . . . , REP , where REP is the number of replications. The power function is estimated from So, the error in simulation can be estimated as the standard deviation of the estimated power function given by Now, for fixed REP , ER is increasing in P RH for 0 < P RH < 0.5 and decreasing in P RH for 0.5 < P RH ≤ 1. The maximum of ER is attained at REP . In our case, REP = 10000 and α = 0.05 so that ER ≤ 0.005, and under H 0 , ER = 0.00218.
We provide the empirical power study for the tests (see , Tables 1-3). All the tests satisfy the desired size condition except HT , which fails to attain the nominal level for non-normal distributions because of the effect of outliers in the distributions, and hence its power should not be taken under consideration. In all the situations, power ofT always increases in the sample sizes and/ or in the number of components. The same happens for all the tests in distribution (i) (see, Table 1). As expected, being the optimal test, HT outperforms the others for normal d.f.T is the second best test, except being slightly outperformed by JK for µ = 0.2 with n 1 = n 2 = 50 and p = 2, 4 under model (b), and by JK, JK a for µ = 0.2 with n 1 = n 2 = 50 and p = 10 under model (c). These exceptions can be ignored as computational error, however, when n 1 = n 2 = 100,T prominently shows its superiority over JK and JK a .
For normal distribution,T has considerable high power for large sample sizes with increasing p. Showing similar performances in terms of power, JK generally performs little better than JK a (except µ = 1, n 1 = n 2 = 100, p = 10, model (c)). Powers of both JK and JK a are increasing in p.
For distribution (ii), Table 2  Also, for distribution (iii),T has the best performance in terms of power for all the situations considered ( Table 3). Power of JK a decreases in p under models (a), (b) and also under model (c) (except for µ = 4 with n 1 = n 2 = 100). For JK, the increase in p takes its toll on the decrease of power so much that slight change in the location shift cannot show up the change in power function correctly. For that reason, despite JK being an unbiased test, the estimated power gets less than the estimated size for p > 2. As a side effect the estimated power is a little less for n 1 = n 2 = 100 than that for n 1 = n 2 = 50 in model (a) with p = 10, µ = 0.5 and models (b) and (c) with p = 4, 10, µ = 0.5. This problem can affect the application of the test to multivariate data with p > 2, while our test being maximized over components does not suffer from such problems. Now, in real life situations, the data sets may have different sizes, so we study the performance of our proposed test with significantly different sample sizes n 1 = 50, n 2 = 100 and p = 2, 4, 10 (as considered before). Since we have seen (see, Tables 1-3) that the tests' relative performance remains the same under all the three models, we consider the effect of unequal sample sizes under model (a) (see, Table 4). In Table 4, we also study the empirical power for unequal location shifts in different components. Under all the situations, Table 4 shows that the relative performances of the tests in terms of power remains the same. Since the power ofT is strictly increasing in the sample sizes, we observe that the power for n 1 = 50, n 2 = 100 lies between those for n 1 = 50, n 2 = 50 and for n 1 = 100, n 2 = 100. Also, it is strictly increasing in p, provided the total of the shifts differs. Otherwise the power increases with the average of the shifts.

Application
We have n 1 dependent p-values, of T (i)'s, where t i is the observed value for T (i), i = 1, 2, ..., n 1 . To take decision by the test we need to combine these p-values. There are various methods in the literature to combine independent p-values (see, for example, Tippett, 1931;Fisher, 1948;Liptak, 1958)  (T , 6.25, 4.39). As HT is supposed to be misleading in such situation, we can ignore it; and the p−values of the other tests support compatibility of the two data sets.

Conclusion
We propose a nonparametric test using Wilcoxon rank-sum test statistics on distances between observations for each of the components. The test is asymptotically distribution-free under certain conditions. The simulation study shows that the test is unbiased and its power is strictly increasing in the sample sizes and/or in the number of components, provided p << N, which encourages its applicability to multivariate large sample astronomical data sets. In the presence of outliers or sparsely distributed data where HT fails, the performance of our proposed test, measured in terms of power, is the best among the possible competitors. For distribution (i), HT is optimal but under all the models power ofT becomes very close to 1 for n 1 = n 2 = 100 with p = 10. It guarantees its good performance for the parent distributions like multivariate normal when the sample sizes are large. It is to be noted that in greater effect of outliers as in distribution (iii) than in distribution (ii),T performs better than JK, JK a with higher efficacy. It indicates the proposed test's robustness under the presence of unusual observations in the parent distributions. JK and JK a not only get outperformed in the above stated situations but also their powers may become significantly worse for increasing values of p. While our test being maximized over the components gets only better with increasing p, provided p << N. Unlike HT , our test is computable for p > N, but not suitable for use, since the test depends on the central limit theorem, which does not hold for large p. As our objective is to provide a test for large sample data with p << N, we do not concern this problem here, while our future project is to provide tests for large-dimensional data.