A nonparametric multivariate multisample test based on data depth

In this paper, we construct a family of nonparametric multivariate multisample tests based on depth rankings. These tests are of Kruskal-Wallis type in the sense that the pooled observations are ranked. However, unlike the Kruskal-Wallis test, these tests are based upon a depth ranking using a statistical depth function such as the halfspace depth or the Mahalanobis depth. The tests we propose can be adapted to the depth function that is most appropriate for the application. Under the null hypothesis that all samples come from the same distribution, we show that the test statistic asymptotically has a chi-square distribution. Some comparisons of power are made with Hotelling's T², and with the test of Choi and Marden (1997). Our test is particularly recommended when the data are of unknown distribution type and there is some evidence that the density contours are not elliptical. However, even when the data are normally distributed, we often obtain high relative power.

In Section 2 we briefly discuss different types of existing multivariate multisample tests based on different types of rank vectors. Section 3 is devoted to a discussion of the different notions of data depth. The depth ranking based on data depth is given in Section 4. In Section 5 we construct a new multivariate multisample test based on the depth ranking discussed in Section 4 and investigate its properties. In Section 6.1, via simulation, we investigate the asymptotic distribution of the test statistic, and in Section 6.2, we compare the power of the test with that of some other tests, including Hotelling's generalized T². Finally, in Section 7, we apply the proposed test to a real data set.

Multivariate Kruskal-Wallis type tests
Consider comparing t absolutely continuous multivariate distributions F_k, k = 1, 2, . . . , t. The hypothesis to be tested, say H_0, specifies that F_1(x) = F_2(x) = · · · = F_t(x) for all x.
Under H_0, the common distribution function shall be represented by F. The alternative hypothesis H_1 to H_0 says that H_0 does not hold. For j = 1, . . . , t, assume that X_ij, i = 1, . . . , n_j, denotes a random sample of size n_j from a d-variate population F_j. Under the assumption that the distributions are d-variate normal with common unknown covariance matrix Σ and different mean vectors, Lawley (1938) and Hotelling (1951) introduced an affine equivariant test statistic which is well known as the generalized Hotelling's T². Hotelling (1951) showed that the asymptotic null distribution of Hotelling's T² is chi-square with d(t − 1) degrees of freedom.
The standard univariate nonparametric test for one-way analysis of variance is the Kruskal-Wallis test (Kruskal (1952), Kruskal and Wallis (1952)). Puri and Sen (1971) proposed a multivariate extension of the Kruskal-Wallis test based on a component-wise ranking. However, unlike Hotelling's T², it is not affine equivariant; it is only invariant under coordinate-wise monotone transformations. Thus its performance depends on the form of the covariance matrix and the direction of the shift, and the efficiency of the test based on component-wise ranks becomes very poor when the components are highly correlated. See Bickel (1965) for a general discussion of procedures based on component-wise ordering. Choi and Marden (1997) developed a multivariate multi-sample test based on average gradient vectors of the Euclidean norm. For each d-variate observation X_i, its rank vector is defined by R(X_i) = (1/n) Σ_{j=1}^n (X_i − X_j)/||X_i − X_j||, the average of the unit vectors pointing from the other observations toward X_i. Choi and Marden's test statistics based on gradients of the Euclidean norm are not affine equivariant, but they are equivariant under orthogonal transformations. As discussed in Brown (1983) and Choi and Marden (1997), the efficiency of a test based on these gradients increases with the dimension in the case of multivariate spherical distributions. Re-scaling one of the components, i.e. moving from a spherical case to an elliptical case, may however greatly reduce the efficiency (see Brown (1983), and Chakraborty et al. (1998)). Hettmansperger et al. (1998) constructed an affine equivariant, asymptotically distribution free multivariate multi-sample test by introducing a multivariate centered rank function based on the Oja (1983) criterion function and discussed its properties. For each d-variate observation X_i, Oja's centered rank vector is defined by R(X_i) = ave{∇ volume(X_i, X_{i_1}, . . . , X_{i_d})}, the average being taken over all subsets of d distinct indices, where ∇ is the gradient operator with respect to the first argument, and volume(x, x_1, . . . , x_d) is the volume of the simplex with vertices x, x_1, . . . , x_d in R^d.
Um and Randles (1998) developed an affine equivariant nonparametric multivariate multisample test based on interdirections. Interdirections were introduced in Randles (1989), Peters and Randles (1990), and Randles and Peters (1990) to construct distribution free multivariate signed, signed-rank, and two sample tests. Interdirections measure the angular distance between two observation vectors relative to the rest of the data. The tests based on interdirections are distribution free over the class of all elliptically symmetric distributions.
When there are only two populations, say F_1 and F_2, Liu and Singh (1993) introduced a quality index, Q(F_1, F_2), based on data depth to measure the overall "outlyingness" of population F_2 relative to population F_1. They showed that the empirical quality index Q(F̂_1, F̂_2) produces a Wilcoxon type rank sum testing procedure to test equality of F_1 and F_2 when they belong to a location and scale family. The asymptotic distribution of this rank sum statistic, which was conjectured in Liu and Singh (1993), is proved in Zuo and He (2006).
The tests listed above are distribution free for large samples. In small sample cases, the tests are conditionally distribution free, or distribution free over a class of elliptically symmetric distributions. In Section 5, we will construct another asymptotically, as well as conditionally distribution free multivariate multi-sample test. In addition, this test can be constructed so as to be affine equivariant. The test statistic has an asymptotic chi-square distribution with t − 1 degrees of freedom in many standard settings. A graphical technique to compare two distributions known as the depth-depth plot, or DD-plot in short, is introduced in Liu et al. (1999).
In view of the large number of available tests, the introduction of another one needs some explanation. Our multisample test can be viewed as a formal multisample extension of the DD-plots of Liu et al. (1999). Our family of tests also has several desirable features. First, it allows the researcher to flexibly choose a depth function for the analysis that most appropriately combines the advantages of equivariance, robustness and computational convenience. In other words, there is no one prescription for all contexts, but rather an overall approach that can be tailored to the context. Secondly, the test is based upon ranking multidimensional data. Rank statistics have the advantage that they are essentially dimensionless, being positive integer values. This property of ranking is especially useful if a data set is pre-processed before analysis by a dimension reduction technique such as PCA. The effects of the PCA on the ranks of the data can be studied directly because they are dimension-free. Vector ranks are not dimensionless, making the effects of PCA and other dimension-changing methods more difficult to interpret. Like other generalizations of the Kruskal-Wallis test, the proposed multivariate depth rank based tests are sensitive to location shifts and also sensitive to other departures in varying degrees.

Statistical depth functions
Let F be the class of all (Borel measurable) distributions on R^d. In the definitions of some depth functions below, it will be assumed that certain moments exist. In these cases, an additional appropriate restriction must be placed on F to make the definition meaningful.
A typical depth function will be a bounded nonnegative function of the form D : R^d × F → R. For any random vector Y, let F_Y denote the distribution of Y. A depth function D will be said to be affine equivariant if D(Ax + b, F_{AX+b}) = D(x, F_X) holds for any random vector X in R^d, any d × d nonsingular matrix A, and any vectors x and b in R^d. In most, but not all cases, we shall expect D(x, F) to be a quasi-concave function of x, maximised somewhere in the center of the distribution F and vanishing at infinity. The sample version of D( · , F) is denoted by D( · , F̂), where F̂ is the empirical distribution based on the sample. Let X = {X_1, X_2, . . . , X_n} be a random sample from a d-dimensional distribution F, and F̂ its empirical distribution. We use P_F to represent the probability measure corresponding to the distribution F, and E_F to represent expectation with respect to that probability measure. A typical depth function will be continuous in the sense that D(x, F̂) → D(x, F) as the sample size goes to infinity under random sampling from F. However, if empirical convergence is simply replaced by weak convergence, many depth functions are not continuous in this (now stronger) sense. We may loosely define the robustness of a depth function to be a measure of its continuity under various types of convergence of F̂ to F. The more that D is continuous under a variety of types of convergence, the more robust D is.
The Mahalanobis depth is perhaps the most common depth function associated with multivariate normal theory. Following Rao (1988), for a positive definite d × d matrix M, we shall define the Mahalanobis norm (Mahalanobis (1936)) || · ||_M by ||x||²_M = x′M⁻¹x. (1) The Mahalanobis depth is then defined as MD(x, F) = [1 + ||x − µ(F)||²_{Σ(F)}]⁻¹, where F is a given distribution and µ(F) and Σ(F) are any corresponding location and covariance measures, respectively. The Mahalanobis depth MD( · , F) is not robust. Robust alternatives to the usual mean vector and covariance matrix are available, such as Rousseeuw's minimum covariance determinant (MCD) or minimum volume ellipsoid (MVE) estimates (see Rousseeuw (1983) and Rousseeuw and Leroy (1987)). If robustness of the Mahalanobis depth is desired, we can define a robust version RMD( · , F) by using these robust estimates.
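As an illustration of how such depth values might be computed, the following is a minimal pure-Python sketch of the Mahalanobis depth MD(x, F) = [1 + ||x − µ||²_Σ]⁻¹ (the function name is ours; the location vector and inverse covariance matrix are assumed to be supplied as plug-in estimates, e.g. the sample mean and covariance, or robust MCD/MVE estimates for the robust version):

```python
def mahalanobis_depth(x, mu, sigma_inv):
    """Mahalanobis depth MD(x, F) = 1 / (1 + (x - mu)' Sigma^{-1} (x - mu)).

    x, mu   : point and location vector, as sequences of floats
    sigma_inv : inverse covariance matrix, as a list of rows (plug-in estimate)
    """
    d = [xi - mi for xi, mi in zip(x, mu)]
    # quadratic form d' Sigma^{-1} d
    q = sum(d[i] * sigma_inv[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return 1.0 / (1.0 + q)
```

The depth is 1 at the center µ and decreases toward 0 as x moves away, in agreement with the center-outward ordering a depth function should induce.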
The spatial depth (Gower (1974), Brown (1983)) of a point x ∈ R^d with respect to F is defined as SD(x, F) = 1 − ||E_F[(x − X)/||x − X||]||. It is easy to see that SD(x, F) is invariant under orthogonal transformations. A modification of the Euclidean norm used to define the spatial depth function yields an affine invariant version ASD( · , F), obtained by replacing the Euclidean norm with the Mahalanobis norm || · ||_{Σ(F)} given by (1), where Σ(F) is the covariance matrix of F. Once again, if robustness of the affine spatial depth is desired, we can define a robust version RASD(x, F) by replacing Σ with a robust estimate such as the MCD or MVE. Another depth function which is used in this paper is the halfspace depth. The halfspace depth (Hodges (1955), Tukey (1975), Donoho (1982)) at a point x ∈ R^d with respect to F is defined to be HD(x, F) = inf_u P_F(H_{x,u}), the infimum being taken over all unit vectors u, where H_{x,u} = {y ∈ R^d ; u′y ≥ u′x} is a closed halfspace. Several properties of the halfspace depth are discussed by Donoho and Gasko (1992), Rousseeuw and Ruts (1999), Small (1987), and Zuo and Serfling (2000). Computational issues are discussed in Rousseeuw and Ruts (1996), Rousseeuw and Ruts (1998), Rousseeuw and Struyf (1998), and Ruts and Rousseeuw (1996).
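The empirical spatial depth is especially easy to compute: it is one minus the norm of the average unit vector pointing from the sample points toward x. A minimal sketch (function name is ours; sample points coinciding with x are skipped to avoid division by zero):

```python
import math

def spatial_depth(x, sample):
    """Empirical spatial depth:
    SD(x, F_hat) = 1 - || average over i of (x - X_i)/||x - X_i|| ||."""
    d = len(x)
    acc = [0.0] * d          # running sum of unit vectors
    m = 0                    # number of points actually used
    for p in sample:
        diff = [xi - pi for xi, pi in zip(x, p)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm == 0.0:      # skip points equal to x
            continue
        for k in range(d):
            acc[k] += diff[k] / norm
        m += 1
    avg = [a / m for a in acc]
    return 1.0 - math.sqrt(sum(a * a for a in avg))
```

At the center of a symmetric cloud the unit vectors cancel and the depth is close to 1; far outside the cloud they all point the same way and the depth approaches 0.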
For an overview of other depth functions, we refer the reader to Small (1990) and Zuo and Serfling (2000).

Depth based ranking
For a univariate sample the ranking of data is clear and unambiguous. If X_1, . . . , X_n is a random sample from a continuous random variable X, we can place them in increasing order as X_(1) < · · · < X_(n). The rank of X_i is the number of data points less than or equal to X_i, that is R_i = #{j : X_j ≤ X_i}, where #A represents the cardinality, or number of elements, of the set A. These are linear ordering and ranking, respectively. An alternative method of ordering and ranking the sample is to order the observations in relation to their absolute deviation, or distance, from some reference point M. If M < X_(1), the ordering is the same as that just described. If M is in the body of the data, e.g. at the median, quite a different ordering arises. Such distance ranking (Barnett (1976)) is seldom directly considered for univariate data, but it has an appeal in the multivariate context.
For a d-variate (d ≥ 2) sample X_1, . . . , X_n, there is no natural linear ranking, but one can use the notion of data depth to introduce a depth ranking. Given a depth function D and a distribution function F, one can compute the depths of all the sample points X_1, . . . , X_n, namely D(X_1, F), . . . , D(X_n, F), and order them according to increasing depth values. This gives a depth ranking of the sample points. The rank of X_i is R_i = #{j : D(X_j, F) ≤ D(X_i, F)}. (4) The implication of (4) is that a larger rank is associated with a more central position with respect to the data cloud.
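Once the depth values are in hand, the depth ranking of (4) is a one-liner; a minimal sketch (function name is ours):

```python
def depth_ranks(depths):
    """Depth ranking as in (4): the rank of point i is the number of sample
    points whose depth is <= its own depth. Larger rank = more central point.
    Tied depths all receive the same (maximal) rank under this convention."""
    return [sum(1 for d in depths if d <= di) for di in depths]
```

For example, the point with the largest depth (the deepest, most central point) receives rank n, and the shallowest outlying point receives rank 1.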
With the exception of the Mahalanobis depth and the spatial depth, many empirical depth functions can have ties in ranks. Traditional solutions to ties in ranks are available here and will not affect the asymptotic properties which will be investigated later. It is necessary that, for large sample sizes, the proportion of data tied at a given value goes to zero. We may choose to break ties at random or to assign an average rank simultaneously to all tied values. Both methods have merits and difficulties that we shall not pursue here. We refer the reader to Lehmann and D'Abrera (2006), and Hollander and Wolfe (1999) for the theory of averaged ranks.
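For completeness, a small sketch of the averaged-rank convention mentioned above (each group of tied values receives the mean of the 1-based positions it would occupy in the sorted order; function name is ours):

```python
def average_ranks(values):
    """Assign the average of the tied positions to equal values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the block of indices tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0   # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks
```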
Obviously, different distribution functions yield different rankings. Of particular interest is the depth-based ranking using the empirical depth D( · , F̂). Suppose X_1, . . . , X_n is a random sample from some unknown distribution F. We shall write R_i and R*_i for the ranks of X_i based upon the empirical depth D( · , F̂) and the theoretical depth D( · , F), respectively. We naturally expect R_i to be an empirical approximation to R*_i. However, this will not be true without some additional regularity, which we shall discuss later. Liu et al. (1999) introduced the depth-depth (DD) plot as a two dimensional graphical technique for visual comparison of two d-dimensional distributions based on their respective samples. They showed that different distributional differences, such as differences in location, scale, skewness and kurtosis, can be associated with different patterns in DD-plots. Koshevoy (2001), and Struyf and Rousseeuw (1999) showed that some depth functions characterize distributions. These results motivate formal depth based test statistics for comparing two or more multivariate distributions.

A depth based Kruskal-Wallis type test
We can tabulate the data as in Table 1. As earlier, under H_0 we shall let the common distribution be F. Let F̂_j denote the empirical distribution of the j-th sample, and let F̂ denote the empirical distribution of the pooled sample. Computing the depth of every observation with respect to F̂ and ranking the pooled observations by increasing depth, as in (4), yields ranks R_ij, i = 1, . . . , n_j, j = 1, . . . , t. These ranks are summarized in Table 2.
Under the null hypothesis H_0 : F_1 = · · · = F_t = F, assuming no ties, we have E(R̄_j) = (n + 1)/2, where R̄_j = (1/n_j) Σ_{i=1}^{n_j} R_ij, so it can be easily shown that Var(R̄_j) = (n − n_j)(n + 1)/(12 n_j). Similar to the univariate Kruskal-Wallis test, we consider the test statistic to be the weighted average of the squared deviations K_j = (R̄_j − (n + 1)/2)², (6) i.e. K = 12/(n(n + 1)) Σ_{j=1}^t n_j K_j. (7)
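Given the per-sample depth ranks, the Kruskal-Wallis form of the statistic can be sketched in a few lines of pure Python (function name is ours; no ties are assumed, i.e. the pooled ranks are a permutation of 1, . . . , n):

```python
def kruskal_wallis_K(rank_groups):
    """Kruskal-Wallis-type statistic from per-sample lists of (depth) ranks:
    K = 12/(n(n+1)) * sum_j n_j * (Rbar_j - (n+1)/2)^2, assuming no ties."""
    n = sum(len(g) for g in rank_groups)
    return (12.0 / (n * (n + 1))) * sum(
        len(g) * (sum(g) / len(g) - (n + 1) / 2.0) ** 2
        for g in rank_groups)
```

For instance, two groups of three with ranks {1, 2, 3} and {4, 5, 6} (maximal separation) give K = 27/7 ≈ 3.857, while a balanced split gives K near 0.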
Note that, under the null hypothesis K is distribution free. Unfortunately, the test based on K is not powerful against location shift alternatives. See Chenouri (2004), Liu and Singh (2006), and Chenouri et al. (2011). In order to construct a powerful test for location shift alternatives, we must consider ranking methods which have better sensitivity when samples are separated by a location shift. As an alternative, let us consider the depths D ij (k) = D(X ij , F k ) and ranks R ij (k) for i = 1, 2, . . . , n j and j, k = 1, 2, . . . , t, as described above. Table 3 summarizes the ranks R ij (k), for any k = 1, . . . , t.
Unlike the ranks R_ij, the ranks R_ij(k) have some power to distinguish the observations in sample k from those in sample j where j ≠ k. Thus it seems reasonable to construct a test statistic H(k), say, based upon the ranks R_ij(k) in a manner analogous to the construction of K in formulas (6) and (7) above. We can then try to pool the test statistics H(1), . . . , H(t) together so as to create some overall test statistic for the null hypothesis that all samples have the same distribution. This suggests that we define K_j(k) = (R̄_j(k) − (n + 1)/2)², by analogy with formula (6), where R̄_j(k) = (1/n_j) Σ_{i=1}^{n_j} R_ij(k). Then H(k) = 12/(n(n + 1)) Σ_{j=1}^t n_j K_j(k) could be considered as a test statistic, but it depends on the choice of k. To overcome this problem, define H = (1/t) Σ_{k=1}^t H(k). The null hypothesis is to be rejected when the value of the test statistic H is very large. Now we shall investigate the asymptotic behavior of the test statistic H under the null hypothesis. We will need some regularity conditions, which we shall now consider.
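Putting the pieces together, the construction can be sketched as follows: for each reference sample k, the depths of all pooled observations are computed with respect to sample k, the pooled observations are ranked by increasing depth, a Kruskal-Wallis-type statistic H(k) is formed, and the H(k) are averaged. This is a simplified pure-Python sketch under our reading of the construction (names are ours; `depth(x, sample)` is any empirical depth function; tied depths are given the maximal rank here for brevity rather than averaged):

```python
def H_statistic(samples, depth):
    """H = (1/t) * sum over k of H(k), where H(k) is a Kruskal-Wallis-type
    statistic on ranks of depths taken w.r.t. the empirical distribution of
    sample k. `samples` is a list of t samples; each sample is a list of
    points; `depth(x, sample)` returns the empirical depth of x."""
    pooled = [x for s in samples for x in s]
    n, t = len(pooled), len(samples)
    total = 0.0
    for ref in samples:                                  # reference sample k
        depths = [depth(x, ref) for x in pooled]
        ranks = [sum(1 for d in depths if d <= di) for di in depths]
        # split pooled ranks back into the t samples, in order
        h, i = 0.0, 0
        for s in samples:
            g = ranks[i:i + len(s)]
            i += len(s)
            h += len(g) * (sum(g) / len(g) - (n + 1) / 2.0) ** 2
        total += 12.0 / (n * (n + 1)) * h                # H(k)
    return total / t
```

With two well-separated samples, the observations of the reference sample are deep (high ranks) and the others shallow (low ranks) for every choice of k, so H is large; under the null the rank averages are all near (n + 1)/2 and H is small.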
Assumption 1. Let G be any distribution, and let Ĝ be the empirical distribution based upon a random sample of size m from G. As m goes to infinity, sup_{x ∈ R^d} |D(x, Ĝ) − D(x, G)| → 0 almost surely. In particular, we may have m = n or n_k, and G = F or F_k, respectively. This assumption holds for many depth functions, such as the spatial, Mahalanobis, halfspace, simplicial, and projection depths.
Recall that as n_k → ∞, k = 1, . . . , t (and thus n → ∞), the Glivenko-Cantelli theorem (see Pollard (1984)) gives sup_x |F̂(x) − F(x)| → 0 almost surely, and similarly for each F̂_k. Therefore, under the null hypothesis, from Assumption 1 we have sup_{x ∈ R^d} |D(x, F̂_k) − D(x, F̂)| → 0 almost surely, for each k = 1, . . . , t. This motivates the following assumption.
In Section 6 we will check Assumption 2 through simulations. Assumptions 1 and 2 together imply that H − K →p 0. Since K is asymptotically χ²_{t−1} distributed, we in turn have the following proposition, which states that the asymptotic distribution of H is χ² with t − 1 degrees of freedom, i.e. H is asymptotically distribution free.
Proposition 1. Suppose that n_i/n → λ_i as n → ∞ for some numbers λ_i > 0, i = 1, . . . , t. Then, under the null hypothesis and Assumptions 1 and 2, H has a large-sample chi-square distribution with t − 1 degrees of freedom.
In small samples, although the test statistic H is not distribution free under the null hypothesis, we can still perform Fisher's permutation test to approximate p-values (see Efron and Tibshirani (1993), and Ernst (2004)) based on the test statistic H. In what follows we explain how to use this procedure.
1. Compute H based on the data X_1, X_2, . . . , X_t and denote it by H_obs, where X_j = {X_1j, X_2j, . . . , X_{n_j j}}, j = 1, . . . , t.
2. Permute the combined sample Y = X_1 ∪ X_2 ∪ · · · ∪ X_t a total of B times, for B sufficiently large.
3. For each of the B permutations, treat the first n_1 vectors as the X_1 sample, the next n_2 vectors as the X_2 sample, . . . , and the last n_t vectors as the X_t sample, and compute the corresponding value of the test statistic; denote these values by H*_1, . . . , H*_B.
Note that to approximate the p-value P_{H_0}(H ≥ H_obs) one can use p̂ = (1/B) Σ_{b=1}^B I(H*_b ≥ H_obs). Thus the power functional of this α-level permutation test is Π(G) = P_G(p̂ ≤ α), where G is the joint distribution of the t populations. Under H_0, Π(G) approximates the type I error, and for a specific alternative, Π(G) provides an approximation of the power at G. For a given G one can apply a Monte Carlo method to estimate Π(G).
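The three steps above can be sketched generically (names are ours; `statistic` plays the role of H and takes a list of groups, so the same sketch works for any group-comparison statistic):

```python
import random

def permutation_pvalue(samples, statistic, B=999, seed=0):
    """Approximate p-value P(statistic >= observed) by B random relabelings
    of the pooled sample into groups of the original sizes. The +1 in
    numerator and denominator makes the estimate conservative."""
    rng = random.Random(seed)
    sizes = [len(s) for s in samples]
    pooled = [x for s in samples for x in s]
    obs = statistic(samples)                    # step 1: observed value
    count = 0
    for _ in range(B):                          # step 2: B permutations
        rng.shuffle(pooled)
        groups, i = [], 0                       # step 3: re-form the groups
        for m in sizes:
            groups.append(pooled[i:i + m])
            i += m
        if statistic(groups) >= obs:
            count += 1
    return (count + 1) / (B + 1)
```

For two clearly separated samples the p-value is small (bounded below by the fraction of relabelings reproducing the separation), while exchangeable data give p-values spread roughly uniformly on (0, 1).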
To end this section, it is worthwhile to mention that H is affine invariant if the underlying depth function is affine equivariant. The test statistic H is also robust against outliers, especially when the underlying depth function is robust.

Investigating the asymptotic distribution of H
In Table 4 we present simulation studies of Assumption 2. We have considered two-sample problems with sample sizes n_1 = n_2 = 25, 50, 100, 200, . . . , 1000. The following five bivariate models have been considered:
• Model (I) refers to the case that both samples have been drawn from the standard bivariate normal distribution N_2((0, 0)′, I), where I refers to the identity matrix.
• Model (IV) is the bivariate t distribution t((0, 0)′, I, 3), Johnson and Kotz (1972).
• Model (V) is the bivariate t distribution t((0, 0)′, I, 1), i.e. the bivariate Cauchy C((0, 0)′, I).
The underlying depth functions in our simulation study are the robust Mahalanobis depth (RMD), the robust affine spatial depth (RASD), and the halfspace or Tukey depth (HSD). Figure 1 shows that the logarithm of E_{H_0}|H − K| goes to −∞ roughly linearly in the logarithm of the sample size, indicating that E_{H_0}|H − K| roughly obeys a power law in n. Table 4 provides the exponent in this power law for each model and depth function. As we see in this table, as the sample sizes increase, E_{H_0}|H − K| → 0. This supports the claim that H − K →p 0.

Monte Carlo power study
In this section we display results from a Monte Carlo power study for the cases d = 2, 6, 10. The underlying depth functions in our simulation study are RMD, RASD, and HSD. In addition to our proposed statistic H of Section 5, we examine the performance of Hotelling's T² statistic, the spatial statistic CM suggested by Choi and Marden (1997), and K. Note that the test statistics H, K, and T² are affine invariant, but CM is not. All the simulations are done for the two-sample case with equal numbers of multivariate observations in the two groups. In Tables 5 to 8, we present some Monte Carlo results based on 1,000 trials. Table 5 reports the results of permutation tests for d = 2 and small sample sizes. Tables 6, 7, and 8 report the results using the asymptotic distribution for large sample sizes in dimensions d = 2, 6, 10, respectively. For each trial, we generate a data set under the null hypothesis and a data set under the alternative hypothesis. Each data set consists of two samples which are generated from two d-dimensional distributions. For the null hypothesis these distributions are identical, and for the alternative hypothesis they are different. Each sample from every distribution is of size n = 25 for the permutation tests and n = 50, 100 for the asymptotic tests in d = 2. The sample sizes for d = 6, 10 are given in Tables 7 and 8, respectively. The nominal significance level α = 0.05 is used. In each case displayed in Tables 6, 7, and 8, the cutoff point for the nominal significance level is calculated using the asymptotic distribution under the null hypothesis. So, for our proposed test H and also for K we use the chi-square cutoff point χ²_{1,α}, and for CM we use χ²_{d,α}, where χ²_{df,α} is the upper αth point of the χ²_{df} distribution. For the Hotelling test, we use the cutoff value ((n − 2)d/(n − 1 − d)) F_{d,n−1−d,α}, where F_{a,b,α} is the upper αth point of the F_{a,b} distribution. This cutoff value is exact under the multivariate normal assumption.
The entries in each table are the proportions of times each statistic exceeded its asymptotic critical value.
In Tables 5 to 8, different scenarios are considered.
• Scenario (1) refers to a multivariate normal distribution, where the first sample is taken from the standard multivariate normal distribution N(0, I) and the second sample is taken from the multivariate normal N(c1, I) with mean vector c1 and identity covariance matrix I, for c ∈ R. Here 0 = (0, . . . , 0)′ and 1 = (1, . . . , 1)′ are the d-dimensional vectors of 0's and 1's, respectively. The null hypothesis corresponds to c = 0.
• Scenario (2) refers again to a multivariate normal distribution, but this time the first sample is taken from the standard multivariate normal distribution N(0, I). The second sample is taken from the multivariate normal N(0, cI) with the same mean 0 but covariance matrix cI, for c > 0. The null hypothesis corresponds to c = 1.
• Scenario (3) refers to a bivariate normal distribution, where the first sample is taken from the standard bivariate normal distribution N((0, 0)′, I). The second sample is taken from the bivariate normal with the same mean (0, 0)′ but covariance matrix with unit variances and off-diagonal entries c.
The null hypothesis corresponds to c = 0.
• Scenario (5) refers to a mixture of two multivariate normal distributions with mixing probability ε = 0.8. The first sample is taken from 0.8 N(0, I) + 0.2 N(61, I) and the second sample from 0.8 N(c1, I) + 0.2 N(61, I). The null hypothesis corresponds to c = 0.
• Scenario (6) refers to a mixture of two multivariate normal distributions with mixing probability ε = 0.6. The first sample is taken from 0.6 N(0, I) + 0.4 N(61, I) and the second sample from 0.6 N(c1, I) + 0.4 N(61, I). The null hypothesis corresponds to c = 0.
• Scenario (7) refers to the multivariate t distribution, where the first sample is taken from t(0, I, 1), which is multivariate Cauchy, and the second sample is from t(c1, I, 1). The null hypothesis corresponds to c = 0.
• Scenario (8) refers to the multivariate t distribution, where the first sample is taken from t(0, I, 1) and the second sample is from t(0, cI, 1). The null hypothesis corresponds to c = 1.
• In scenario (9), the first sample is taken from the standard multivariate normal and the second sample is from t(0, I, c). The null hypothesis corresponds to c = ∞.
• In scenario (10), the first sample is taken from t(0, I, 1) and the second sample is from t(0, I, c). The null hypothesis corresponds to c = 1.
In Table 5, which reports the permutation test, the entries are calculated from 1000 random shuffles. Therefore, the entries under the null are typically close to, but not exactly at, the nominal value of 0.05. In Table 6, under the null hypothesis, the asymptotic approximations to the distributions of the test statistics are satisfactory except for HHSD. For this particular case, the ranks of data points with tied depths have been averaged rather than randomly broken, which may affect the asymptotic calculations, as seen. Therefore, we have chosen to omit HHSD from Tables 7 and 8. All tests, except those based on K (namely, KRMD, KRASD, KHSD), which are designed for scale shifts, perform well in the case of scenario (1) for all dimensions studied. In this scenario, Hotelling's T² and CM, which are designed for location shifts, perform extremely well. As we see, the depth-based tests using H (HRMD, HRASD, HHSD) perform well with different depth functions under a variety of differently shaped distributions. While H does not always have the power that tests tailored to specific assumptions may have, it performs reasonably well against location, shape, and many other alternatives. Of course, this is true provided the central regions (as defined by the depth contours) do not overlap too much.

Example
A data set given by Johnson and Wichern (1988, pp. 261-262) is examined here using Hotelling's T² and our proposed depth-based test H with the Mahalanobis depth (MD). The data set was originally taken from Jolicoeur and Mosimann (1960), who studied the relationship of size and shape for painted turtles. Table 9 contains their measurements on the carapaces of n_1 = 24 female and n_2 = 24 male turtles. This data set was partially analyzed in Hettmansperger et al. (1998).
We treat the observations as two independent samples from trivariate distributions and consider a test of H_0 : F_1 = F_2, where F_1 and F_2 are the respective trivariate distributions. To see whether trivariate normality is a reasonable assumption for the female and male populations, we generate two samples of size 100 from trivariate normal distributions with mean vectors and covariance matrices estimated by the sample means and covariance matrices of the female and male data. We use the Mahalanobis depth to compute the depth values of the pooled data points. Figure 2 shows the depth-depth plots (see Liu et al. (1999)) and also the ranked depth-depth plots for the female and male samples. In these plots, the x and y axes represent the depth values of the pooled data with respect to the generated samples and with respect to the female and male data points, respectively. We see that all plots are concentrated along the diagonal line. This indicates that normality is a reasonable assumption for both the female and male samples. The elliptical structures of the two distributions are also clear from the pairwise scatter plots in Figure 3. Hence it is reasonable to use the Mahalanobis depth and not worry about tied ranks. An eyeball investigation of the pairwise scatter plots also reveals that, in addition to a location shift, the two populations have different dispersion parameters.
To carry out the test, we compute the depth of each data point in the pooled data set with respect to the female and then the male samples, obtaining H(1) and H(2), and finally H = (H(1) + H(2))/2 = 20.22215. Comparing with the percentiles of a chi-square distribution with t − 1 = 1 degree of freedom, we see that the asymptotic P-value is 7 × 10⁻⁶. We easily reject the null hypothesis H_0 : F_1 = F_2.
Based on 1,000,000 random permutations of the numbers 1, 2, . . . , 48, the P-value is estimated to be less than 10⁻⁶. Figure 4 shows the depth-depth plot comparing the female and male populations, which visually supports our finding based on the test statistic. Assuming multivariate normality, which seems reasonable from the depth-depth plots in Figure 2, Hotelling's T² gives a P-value that is essentially zero.