A studentized permutation test for the nonparametric Behrens-Fisher problem in paired data

: We consider nonparametric ranking methods for matched pairs, whose distributions can have diﬀerent shapes even under the null hypothesis of no treatment eﬀect. Although the data may not be exchangeable under the null, we investigate a permutation approach as a valid procedure for ﬁnite sample sizes. In particular, we derive the limit of the studentized permutation distribution under alternatives, which can be used for the construction of (1 − α )-conﬁdence intervals. Simulation studies show that the new approach is more accurate than its competitors. The procedures are illustrated using a real data set.


Introduction
In many psychological, biological, or medical trials, data are collected in terms of a matched pairs design, e.g. when each subject is observed repeatedly under two different treatments or time points. When the normality assumption of the data (or of the differences of the paired observations) is violated, e.g. in case of skewed distributions or even ordered categorical data, nonparametric ranking procedures, which use ranks over all dependent and independent observations, are preferred for making statistical inferences (Munzel, 1999b). Most of these approaches, however, are restricted to testing problems and cannot be used for the computation of confidence intervals for the treatment effects. In randomized clinical trials, the construction of confidence intervals is consequently required by regulatory authorities: "Estimates of treatment effects should be accompanied by confidence intervals, whenever possible. . . " (ICH, 1998, E 9 Guideline, ch. 5.5, p. 25). Particularly, different variances of the paired observations occur in a natural way, e.g. when data are collected over time. The derivation of nonparametric procedures, which allow the data to have different variances or shapes even under the null hypothesis, is a challenge.
We consider n independent and identically distributed random vectors X k = (X k,1 , X k,2 ) T , k = 1, . . . , n, (1.1) with marginal distributions F i , i.e. X k,i ∼ F i , i = 1, 2; k = 1, . . . , n. To allow for continuous and discontinuous distributions in a unified way, F i (x) = 1/2(F + i (x) + F − i (x)) denotes the normalized version of the distribution function (Akritas, Arnold and Brunner, 1997;Munzel, 1999a;Ruymgaart, 1980). Hereby, F + i (x) = P (X 1,i ≤ x) denotes the rightcontinuous and F − i (x) = P (X 1,i < x) denotes the left-continuous version of the distribution function, respectively. The general model (1.1) does not entail any parameters by which a difference between the distributions could be described. Therefore, the marginal distributions F 1 and F 2 are used to define a treatment effect by p = F 1 dF 2 = P (X 1,1 < X 2,2 ) + 1/2P (X 1,1 = X 2,2 ), (1.2) which is known as relative marginal effect (Brunner, Domhof and Langer, 2002;Brunner and Puri, 1996;Konietschke et al., 2010, p. 38). Note that p is not equivalent to the sign effect p = P (X 1,1 < X 1,2 )+ 1/2P (X 1,1 = X 1,2 ), i.e. to the probability of the first observation being smaller than the second observation. The relative marginal effect p uses more information from the data. In particular, compared to the Wilcoxon-signed-rank test, inference methods for p do not need the assumption of symmetric distributed differences (Munzel, 1999b). For independent ordered categorical data, p is also known as ordinal effect size measure (Ryu, 2009). If p > 1/2, the observations with distribution F 2 tend to be larger than those with distribution F 1 . In case of p = 1/2, the observations in neither one of the two marginal samples tend to be smaller or larger than in the other sample. Thus, we characterize the case of "no treatment effect" as p = 1/2. Clearly, if F 1 = F 2 , then p = 1/2 is fulfilled. The other implication, however, is not true in general. This can be seen by the counterexample that p = 1/2 if X 1,1 and X 2,2 are both normally distributed with a common mean µ and possibly heteroscedastic variances σ 2 1 and σ 2 2 , respectively. Therefore, testing the null hypothesis H 0 : p = 1/2 is known as the nonparametric Behrens-Fisher problem (Brunner and Munzel, 2000). Munzel (1999b) proposed a test procedure for H 0 : p = 1/2 and a small sample approximation in matched pairs. His test is widely used by practitioners (Krone et al., 2008;Obenauer et al., 2002, among others). Simulation studies show that the procedure tends to maintain the type-I error level quite accurately when n ≥ 20. For smaller n < 20, however, the test tends to be very liberal or conservative, depending on whether X 1,1 and X 1,2 are negatively or positively correlated.
In this paper, we investigate the conditional studentized permutation distribution of Munzel's linear rank statistic to achieve a valid procedure for finite sample sizes. Although the data may not be exchangeable in model (1.1), an accurate and (asymptotically) valid level α permutation test for H 0 : p = 1/2 can be derived, if (i) the permutation distribution of the statistic is asymptotically independent from the distribution of the data; (ii) the permutation distribution has a limit; and (iii) if the distribution of the test statistic and the conditional permutation distribution (asymptotically) coincide (Janssen and Pauls, 2003). The conditions (i)-(iii) are also known as the invariance property of permutation tests (Neubert and Brunner, 2007). Moreover, our proposed permutation test has the additional advantage that it is even exact for finite sample sizes if the pairs are exchangeable.
For independent observations, Janssen (1997), Janssen and Pauls (2003) and Janssen (2005) investigate studentized permutation approaches for the parametric Behrens-Fisher problem, whereas Janssen (1999b) and Neubert and Brunner (2007) also consider ranking approaches for H 0 : p = 1/2. The theoretical results obtained in these papers, however, are not valid in our model and the permutation scheme carried out for independent observations (i.e. to permute all data X 1,1 , X 1,2 , . . . , X n,1 , X n,2 ) needs to be changed. Following Janssen (1999a) and Munzel and Brunner (2002), we permute the sample units X k,1 and X k,2 within each matched pair X k = (X k,1 , X k,2 ) T to protect the dependency structure of the data. Based on the 2 n possible permutations within the sample units, conditional central limit theorems as well as test consistency results will be derived in this paper. It will be shown that the items (i)-(iii) mentioned above are fulfilled. In particular, the studentized permutation distribution under an arbitrary alternative p = 1/2 will be investigated, which can be used for the computation of approximate (1 − α)-confidence intervals for p. Munzel and Brunner (2002) suggest an exact paired rank test for the null hypothesis of exchangeability. However, this approach cannot be used for the computation of confidence intervals, and particularly, it is not valid under the assumption of heteroscedasticity.

Munzel's (1999b) unconditional procedure
To estimate the relative marginal effect size p defined in (1.2), let denote the marginal empirical distribution functions. Here, F i (x) denotes the normalized version of the empirical distribution function (Ruymgaart, 1980). An estimator p of p is then obtained by replacing F 1 , and F 2 with F 1 , and F 2 , respectively. The estimator can easily be calculated by using the (mid-)ranks R k,2 and R k,1 from X k,2 and X k,1 among all 2n observations, respectively. Here, R i· = n −1 n k=1 R k,i denotes their means within marginal sample i, i = 1, 2. Brunner, Puri and Sun (1995), , Brunner and Puri (1996) and Munzel (1999b) have shown that the linear rank statistic T n = n 1/2 ( p − p) follows, asymptotically, as n → ∞, a normal distribution with expectation 0 and variance σ 2 = var(F 2 (X 1,1 ) − F 1 (X 1,2 )). To estimate the unknown variance σ 2 consistently, let R (i) k,i denote the rank of X k,i among all n observations in marginal sample i and define the normed placements (Orban and Wolfe, 1982) k,1 ). Finally, the unknown variance σ 2 can be consistently estimated by the empirical variance Under the null hypothesis H 0 : p = 1/2, the linear rank statistic follows, asymptotically, as n → ∞, a standard normal distribution if σ 2 > 0 holds. Thus, an asymptotic unconditional two-sided test is given by quantile from the standard normal distribution. A one-sided test for the null hypothesis H 0 : p ≤ 1/2 is given by ϕ n,1 = 1{M n ≥ z 1−α }. Without loss of generality we only consider two-sided testing problems throughout this paper.
Here we like to point out that the theoretical results in this paper, however, are not restricted to one or two-sided testing problems. Asymptotic (1 − α)confidence intervals for p are given by Munzel (1999b) suggests to approximate the distribution of M n by a central tdistribution with n − 1 degrees of freedom. For inferences, the quantiles z 1−α/2 used above are replaced by the (1 − α/2)-quantile from the t n−1 -distribution. Simulation studies show, however, that the quality of the approximation depends on the dependency structure in the data which is, of course, not a desirable property of a procedure for matched pairs. Therefore, as a finite correction, the studentized permutation distribution from Munzel's linear rank statistic M n defined in (2.3) will be considered in the next section.
It turns out, that the distribution of T n (X) = n 1/2 ( p − 1/2) and the conditional permutation distribution of T τ n (X τ ) = n 1/2 ( p τ − 1/2) differ under heteroscedasticity, and a valid level α test can not be achieved in this setup. Therefore, we consider the distribution of the test statistic M n (X) = n 1/2 ( p − 1/2)/ σ defined in (2.3) and the conditional studentized permutation distribution of the statistic i.e. of the studentized quantity T τ n (X τ ). In the next steps, we will investigate the invariance property of the conditional distribution of M τ n . The limiting distribution of M τ n will be derived in the next theorem. Theorem 1. Let M τ n as given in (3.1) and denote by Φ the standard normal distribution function. If σ 2 > 0, then we have convergence under the null as well as under the alternative Theorem 1 states that the limiting standard normal distribution of M τ n does not depend on the distribution of the data, particularly, it is achieved for arbitrary p, i.e. it even holds under the alternative p = 1/2. Let ϕ τ n = 1{M n ≤ z τ α/2 } + 1{M n ≥ z τ 1−α/2 }, where z τ 1−α/2 denotes the (1 − α/2)-quantile from the conditional studentized permutation distribution. Note that for notational convenience we only focus on this non-randomized version. However, the following results also hold for a randomized version of the permutation test. In the next theorem we will show that Munzel's unconditional test ϕ n and the conditional permutation test ϕ τ n are asymptotically equivalent.
The permutation test ϕ τ n is consistent, i.e. for any choice of p we have convergence Theorem 1 and Theorem 2 show that the studentized permutation test ϕ τ n fulfills the invariance property, thus, it is an appropriate level α test procedure for H 0 : p = 1/2. The numerical algorithm for the computation of the p-value is as follows 1. Given the data X, compute Munzel's statistic M n as given in (2.3).
2. For each of the 2 n possible permutations, compute the values M τ n and save them in A 1 , . . . , A 2 n . 3. Estimate the two-sided p-value by In particular, Theorem 1 states that the distributions of the pivotal quantity M p n = n 1/2 ( p − p)/ σ and of the studentized permutation statistic M τ n asymptotically coincide. This means, approximate (1 − α)-two-sided confidence intervals for p can be obtained from Remark 1. For larger n > 10 the number of permutations increases rapidly and the numerical computation of the p-value can be cumbersome. We recommend to use random permutations of the data in those cases. Simulation studies show that 10, 000 random permutations are sufficient for an adequate p-value estimation.
Remark 2. It may occur that σ or σ τ are 0. We recommend to replace them by 1/n in those cases (Neubert and Brunner, 2007).

Monte-Carlo simulations
For testing the null hypothesis H 0 : p = 1/2 formulated above, we consider the unconditional Munzel test ϕ n based on the t n−1 -approximation of the statistic M n in (2.3) and the conditional permutation test ϕ τ n as described above, respectively. The simulation studies are performed to investigate their behaviour with regard to (i) maintaining the pre-assigned type-I error level under the hypothesis, (ii) the power of the statistics under alternatives, and (iii) maintaining the pre-assigned coverage probability of the corresponding confidence intervals. The observations X k = (X k,1 , X k,2 ) T , k = 1, . . . , n, were generated using marginal distributions F i and varying correlations ρ ∈ (−1, 1). We hereby generate exchangeable matched pairs having a bivariate normal, exponential, log-normal, or a contaminated normal distribution (where we have rounded to one decimal) each with correlation ρ ∈ (−1, 1), as well as non-exchangeable data by simulating F 1 = 0.7N (4, 1) + 0.3N (8, 1) and F 2 = 0.3N (2.07, 2) + 0.7N (6.21, 2); F 1 = N (2.5745, 2) and F 2 = χ 2 3 ; F 1 = N (0, 1) and F 2 = N (0, 2); and F 1 = N (0, 1) and F 2 = N (0, 4), each with correlation ρ, respectively. Routine calculations show that H 0 : p = 1/2 is fulfilled in all of these considerations (Neubert and Brunner, 2007). We only consider the small sample sizes n = 7 and n = 10 throughout this paper. All simulations were conducted with the help of R-computing environment, version 2.13.2 (R Development Core Team, 2010), each with nsim = 10, 000 runs. The simulation results for normally, exponentially, log-normally, and contaminated normally distributed matched pairs are displayed in Figure 1.
It can be readily seen from Figure 1, that Munzel's test does not control the type-I error level constantly over the range of correlations ρ in the data in case of small sample sizes. It is very liberal when the data are negatively correlated, and vice versa quite conservative in case of positive correlated data. For very small sample sizes (n = 7), the test never rejects the null hypothesis in case of strongly positive correlations. This behaviour of the test does not depend on the underlying distributions and is identical for all considered setups. The studentized permutation test, however, is a (nearly) exact level α test in these cases and it is not affected by the dependency structure. For n = 7, however, only 2 7 = 128 permutations are possible and the simulation results indicate an maximal estimated type-I error level of 5.6%. For n = 10, 1024 permutations can be computed and the exactness of the procedure is apparent. We note that a randomized version of the permutation test would be an exact level α test, because the underlying distributions are exchangeable.
In the next step, type-I error simulations for the case of non-exchangeable data under the null hypothesis H 0 : p = 1/2 will be considered. The simulation results are displayed in Figure 2. For non-exchangeable data, the studentized permutation test exhibits a much better control of the pre-assigned type-I error level than Munzel's test. In case of very small sample sizes (n = 7) the procedure gets slightly liberal. This can be explained by the fact that only 2 7 = 128 permutations are possible. With an increasing sample size (n = 10), the procedure is accurate and a valid testing procedure for the null hypothesis H 0 : p = 1/2. The powers of the two competing procedures were investigated for homoscedastic bivariate normal distributions with correlation ρ ∈ (−1/2, 0, 1/2), varying expectations µ = (0, δ) T and moderate sample sizes n = 20 and n = 30. We have chosen these sample sizes for a fair power comparison. The simulation results are displayed in Figure 3.
It can be seen that both procedures have a comparable power, which is in line with Theorem 2. The power simulations for non-exchangeable data did not show any significant difference and are therefore not shown here. Finally, we investigate the maintaining of the pre-assigned coverage probability of the unconditional confidence interval C for p as given in (2.4) as well as of the conditional confidence interval C τ given in  we generate a bivariate normal distribution with correlation ρ = 0 and estimate the coverage probabilities for varying true underlying treatment effects p. The simulation results are displayed in Table 1. It can be seen from Table 1 that the confidence intervals based on the studentized permutation distribution C τ tend to maintain the pre-assigned coverage probability better than the unconditional version C. For large effects p ≥ 0.8 both types of confidence intervals tend to be liberal. This occurs, because the distribution of M n is very skewed in terms of high effects and small sample sizes.

Analysis of the panic disorder longitudinal trial
We reconsider the panic disorder longitudinal trial which was performed to observe the effect of a specific physical exercise therapy for n = 15 patients (Munzel and Brunner, 2002). The response variable is the patient rated global impression (PGI) score being observed at baseline and after 4 weeks of treatment. The lower the score, the better the clinical impression. The original data can be found in the appendix from (Munzel and Brunner, 2002), who already evaluated this trial with an exact paired rank test for the null hypothesis of exchangeability. This approach, however, cannot be used for the computation of confidence intervals for the treatment effect and is not robust against variance heterogeneity. Here, we will analyze the data with the studentized permutation approach. Since the data are observed on an ordinal scale, mean based approaches are inappropriate for this study. Using Spearman's rank correlation coefficient we estimate the correlation and obtain r = 0.61, thus, a positive statistical dependence. We obtain an estimated treatment effect of p = 0.29. This means, the PGI scores tend to be smaller after 4 weeks of treatment. The null hypothesis H 0 : p = 1/2 is significantly rejected at 5% level of significance (p-value=0.006 H(X k,2 ) − H(X k,1 ) . Since is not affected by the permutation we can rewrite where (W k ) k is a sequence of independent and identically distributed Rademacher variables with distribution 1 2 (δ 1 +δ −1 ) that are independent from the data. Here D = means that both expression are equal in distribution. Now, given the data X, E k,n = n −1/2 W k ( H(X k,2 ) − H(X k,1 )), k = 1, . . . , n, defines an array of row-wise independent random variables. It fulfills E(E k,n |X) = 0 and var(E k,n |X) = n −1 ( H(X k,2 ) − H(X k,1 )) 2 .
Hence we can expand the conditional variance of T W n as H(X k,2 ) H(X k,1 ) = V n,1 + V n,2 − 2V n,3 .
In the following the limit behavior of V n,1 , V n,2 and V n,3 will be analyzed separately. For V n,1 we have ( H(X k,2 ) − H(X k,2 ))H(X k,2 ).
for i = 1, 2. In the same way it follows that (again conditioned on X) sup t∈R | F − i,τ (t) − 1 n n k=1 G − i,k (t) |→ 0 in probability, so that an application of the triangular inequality gives convergence in probability (conditioned on X) for i = 1, 2 sup t∈R | F i,τ (t) − H(t) |→ 0. (A.4) Applying (A.4) on σ 2 (X τ ) we obtain using similar arguments as used for the convergence (A.2) above (again conditioned on X) that n − 1 n σ 2 (X τ ) = 1 n n k=1 F 1,τ (X k,τ k (2) ) − F 2,τ (X k,τ k (1) ) 2 − 1 n n k=1 F 1,τ (X k,τ k (2) ) − F 2,τ (X k,τ k (1) ) 2 is asymptotically equivalent to where again W k are independent and identically distributed Rademacher variables. By (A.3) the first summand of the last expression converges in probability to σ 2 τ . Hence the proof is completed if we show that the second summand converges in probability to zero. But this follows from Altogether this proves σ 2 (X τ ) → σ 2 τ in probability and therefore the desired result.