The nonparametric Behrens‐Fisher problem with dependent replicates

Purely nonparametric methods are developed for general two‐sample problems in which each experimental unit may have an individual number of possibly correlated replicates. In particular, equality of the variances, or higher moments, of the distributions of the data is not assumed, even under the null hypothesis of no treatment effect. Thus, a solution for the so‐called nonparametric Behrens‐Fisher problem is proposed for such models. The methods are valid for metric, count, ordered categorical, and even dichotomous data in a unified way. Point estimators of the treatment effects as well as their asymptotic distributions will be studied in detail. For small sample sizes, the distributions of the proposed test statistics are approximated using Satterthwaite‐Welch‐type t‐approximations. Extensive simulation studies show favorable performance of the new methods, in particular, in small sample size situations. A real data set illustrates the application of the proposed methods.


INTRODUCTION
Statistical comparisons of two independent groups are among the most frequently occurring inference problems in scientific research, eg, in the biomedical or social sciences. Many different statistical methods are available for making inferences, eg, t-test type statistics for testing the equality of the means of normal samples (assuming equal or unequal variances), χ²-tests for binary data, Wilcoxon-Mann-Whitney (WMW) tests for testing H0^F: F1 = F2, the equality of the two distribution functions of skewed or even ordered categorical data (assuming equal variances or shapes of the distributions), or the Brunner-Munzel tests for testing the hypothesis formulated in terms of the WMW effect

p = P(X1 < X2) + (1/2) P(X1 = X2) = ∫ F1 dF2, (1)

allowing for different variances or shapes of the distributions. [1][2][3][4][5] Here, X1 and X2 denote two independent random variables having distribution functions F1 and F2, respectively. The correct method to use depends on the shapes of the data distributions and their scales. If Xi ~ N(μi, σi²), then p = Φ((μ2 − μ1)/√(σ1² + σ2²)) and, thus, p = 1/2 if μ1 = μ2 even if the variances σ1² and σ2² are different. Here, Φ(x) denotes the standard normal cumulative distribution function. Testing the hypothesis H0: p = 1/2 is therefore called the "nonparametric Behrens-Fisher problem" because inference methods based on the relative effect p allow for heteroscedastic variances or shapes of the distributions even under the null hypothesis. 2,6,7 Statistical methods that do not rely on the assumption of equal variances are especially meaningful when the distribution of a statistic under the alternative hypothesis is important, eg, for the computation of confidence intervals for the effect of interest.
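The normal-distribution relation above is easy to verify numerically; a minimal sketch (the function name is ours, not from the paper):

```python
from math import sqrt
from statistics import NormalDist

def relative_effect_normal(mu1, sigma1, mu2, sigma2):
    # For independent X1 ~ N(mu1, sigma1^2) and X2 ~ N(mu2, sigma2^2),
    # X2 - X1 ~ N(mu2 - mu1, sigma1^2 + sigma2^2), hence
    # p = P(X1 < X2) = Phi((mu2 - mu1) / sqrt(sigma1^2 + sigma2^2)).
    return NormalDist().cdf((mu2 - mu1) / sqrt(sigma1 ** 2 + sigma2 ** 2))

print(relative_effect_normal(0, 1, 0, 3))  # 0.5: equal means, unequal variances
```

Equal means give p = 1/2 regardless of the variances, which is exactly the heteroscedastic null situation the Behrens-Fisher methods are designed for.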
All of these methods, however, are not applicable when measurements are taken with dependent replicates, eg, when visual acuity or blood parameters of mice sharing the same cage are measured. In all of these scenarios, the replicates (ie, the observations coming from all mice sharing the same cage) should neither be assumed to be independent nor be seen as observations coming from different subjects. Furthermore, using a summary measure (eg, means or medians) of the replicates as a single observation would decrease the precision of the effect estimates and thus decrease the powers of the test procedures (see the illustrative simulation results in Section 6). Therefore, there is a need for statistical procedures that allow the specific modeling of the dependent replicates. Under a normality assumption on the data, the dependent replicates can be modeled using a linear mixed model, and the hypothesis of the equality of the means can be tested using appropriate F-test statistics, eg, using SAS PROC MIXED. 8,9 Replicated binary data can be analyzed using χ²-tests for R × C contingency tables with clustered data. 10,11 Dutta and Datta, 12 Rosner et al (RGL), 13 as well as Datta and Satten (DS) 14 generalized the WMW test to clustered data, and their methods can also be used to analyze two independent groups with dependent replicates to test the hypothesis of the equality of the distribution functions H0^F: F1 = F2 of the two groups. This formulation of the null hypothesis is rather strict because (1) variances are assumed to be identical under the null hypothesis and (2) the test statistics cannot be inverted into confidence intervals for the WMW effect p given in (1). The computation of confidence intervals, however, is a rather important task in practice and is even required in clinical trials by regulatory authorities: "Estimates of treatment effects should be accompanied by confidence intervals, whenever possible..." (ICH E9 Guideline 1998, ch. 5.5, p. 25). 15
The only known available inference methods that can be used for the computation of confidence intervals for the relative effect p are the Brunner-Munzel test and its generalizations. 1,[3][4][5]16 Therefore, it is the aim of the present paper to generalize the applicability of the Brunner-Munzel test to situations in which data are observed with (possibly) dependent replications.
When such data are observed, the numbers of replications may or may not play an important role for the scientists. Therefore, weighted as well as unweighted versions of the estimators of the treatment effects will be investigated and their asymptotic distributions will be derived in closed form. The results achieved in this paper generalize the ideas of previous attempts for testing the rather strict hypothesis H0: F1 = F2 7,17 or even for testing H0: p = 1/2. 7,[18][19][20] In comparison to these pioneering works, differently weighted estimators of the treatment effect p as well as unbiased variance estimators will be proposed in the current paper. Furthermore, major attention will be given to the accuracy of the tests in terms of controlling the nominal type-I error level as well as to their powers to detect alternatives when sample sizes are rather small. Here, it will be shown that the distributions of the tests can be approximated using t-distributions with approximated Satterthwaite-Welch degrees of freedom. The degrees of freedom are estimated in such a way that the new methods coincide with the Brunner-Munzel test when single measurements are observed. Recently, Larocque et al 21 developed asymptotic weighted and unweighted tests for the nonparametric Behrens-Fisher problem and proposed two different estimators of the variance of the effect estimator that are consistent (1) only under the null hypothesis or (2) also under the alternative. However, extensive simulation studies show that the tests based upon them tend to be either far too conservative or too liberal and, therefore, (3) the use of a linear combination of them is recommended by Larocque et al 21 in practical applications. Nevertheless, the resulting test cannot be inverted into confidence intervals for the underlying effect because the linearly combined variance estimator is only consistent under the null hypothesis.
The test procedures proposed by Larocque et al 21 are explained in detail in Section 5.1 and rigorously compared with the new approach in extensive simulation studies.
The remainder of this paper is organized as follows. In Section 2, an example that motivated the research reported in this paper is described. The statistical model and the quantity of inferential interest (nonparametric effect) are formally introduced in Section 3. In Section 4, two estimators for the effect size are given and their asymptotic properties are derived. The theories developed in Section 4 are applied in Section 5 to derive tests and confidence intervals. The finite-sample fidelity of the asymptotic theories is evaluated in Section 6 via simulation studies. In addition, in Section 6, the performance of the new methods is evaluated in comparison with existing methods. The analysis of the motivating data using the new methods is carried out in Section 7. Some remarks pertaining to data analysis strategies in light of the new methods are discussed in Section 8. All technical details and proofs are placed in the Appendix.

MOTIVATING EXAMPLE
This research is motivated by a toxicological study involving small sample sizes and different numbers of dependent replicates per unit. The data are obtained from the National Toxicology Program study number C20536, which investigates the effect of "specular hematite" on the body weights of male HSD rats. * Several rats share the same cage and thus the cage is seen as the experimental unit, with the replications being the body weights of the rats. We consider the two dose groups "vehicle control" and "30 mg/m³" of the active treatment and select the body weights of the rats after four weeks of treatment. In total, n = 26 cages are involved in the trial, where the vehicle control group consists of n1 = 13 cages and the remaining n2 = 13 cages are assigned to the active treatment group. We assume that the measurements obtained under the active treatment are independent of those in the vehicle control group. It is evident from the boxplots displayed in Figure 1 (right) that the medians of the two groups are different. Therefore, it is of major interest to estimate the treatment effect and to test whether there is any significant difference between these two groups, along with the computation of a confidence interval. The data also show slightly different variances. Furthermore, since the sample sizes are very small, the data should be modeled with a "general" statistical model without restrictive assumptions. Note that the raw data (individual replicates) are displayed in the boxplots.
Next, a general nonparametric model that allows for arbitrary distributions, different numbers of replicates per unit, as well as arbitrary dependency patterns among the replications will be discussed. In particular, no linear relationship between the response variables and the treatment effects is assumed. This will be explained in the next section.

STATISTICAL MODEL AND HYPOTHESES
We consider two independent samples with replicated observations that can be modeled by independent random vectors

X_ik = (X_ik1, … , X_ik m_ik)′, X_iks ~ F_i, i = 1, 2; k = 1, … , n_i. (2)

Here, m_ik denotes the number of replicates of subject k under treatment i. The replicates X_iks may be arbitrarily correlated. We note that the numbers of replicates may not be under experimental control and may be different for each subject involved in the study. The total number of subjects (units) involved in the study is given by n = n1 + n2 and the total number of observations by N = ∑_{i=1}^{2} ∑_{k=1}^{n_i} m_ik. In order to allow for metric, discrete, dichotomous, as well as ordered categorical data in a unified way, we use the normalized version of the distribution function of X_iks, which is the average F_i(x) = (1/2)(F_i^(−)(x) + F_i^(+)(x)) of the left-continuous, F_i^(−)(x) = P(X_iks < x), and the right-continuous, F_i^(+)(x) = P(X_iks ≤ x), versions of the distribution function, respectively. The normalized version of the distribution function was first used by Lévy 22 and later in other works 20,[23][24][25] to derive asymptotic results for rank statistics including the case of ties. The statistical model considered here does not entail any parameters by which adequate treatment effects could be described. Therefore, the distributions F1 and F2 are used to define the treatment effect

p = P(X_11s < X_21s′) + (1/2) P(X_11s = X_21s′) = ∫ F1 dF2,

which generalizes the WMW effect p introduced in (1) to dependent replicates. If p < 1/2, then the observations coming from distribution F1 tend to be larger than those coming from F2. If p = 1/2, then the observations coming from the two distributions tend neither to smaller nor to larger values. Thus, the effect p can be interpreted as a measure of a tendency to larger or smaller values.
*https://tools.niehs.nih.gov/cebs3/views/index.cfm?action=main.download&bin_id=1600&library_id=4877&fileIdsSelected=1de2c1a6578948500157908016d60027 accessed on December 16, 2018.
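To illustrate, the normalized distribution function can be estimated empirically via the count function c(x); a small sketch (function names are ours):

```python
def count_fn(x):
    # normalized count function: 0, 1/2, 1 according as x < 0, x = 0, x > 0
    return 0.5 if x == 0 else float(x > 0)

def normalized_edf(sample, x):
    # average of the left-continuous (strict <) and right-continuous (<=)
    # empirical distribution functions, ie, the mean of c(x - X) over the sample
    return sum(count_fn(x - obs) for obs in sample) / len(sample)

print(normalized_edf([1, 2, 2, 3], 2))  # 0.5 = ((1/4) + (3/4)) / 2, ties handled
```

The averaging of the strict and non-strict versions is what makes the same formulas work for ties, and hence for ordered categorical and dichotomous data.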
As indicated in the introduction, the effect p = 1/2 does not imply that the distributions F1 and F2 are identical; indeed, inference methods based on p allow for heteroscedastic variances, skewness, or other shapes of the distributions even under the null hypothesis H0: p = 1/2. The derivation of appropriate inference methods requires (1) the consistent estimation of the effect p in model (2) and (2) the computation of the asymptotic distribution of the estimators along with the consistent estimation of its unknown parameters. Unbiased and consistent estimators of p as well as their asymptotic normality will be established in the next section.

POINT ESTIMATORS AND THEIR ASYMPTOTIC DISTRIBUTIONS
When no replicates are observed, the relative effect p can be estimated by plugging the empirical versions F̂1(x) and F̂2(x) of the distribution functions F1 and F2 into the integral representation p = ∫ F1 dF2 given in (1). In our situation (2), however, replicates of the measurements per unit may be present and, thus, the traditional estimators of the cumulative distribution functions may not be applicable. Furthermore, different weighting schemes to incorporate the information from the numbers of replicates may play an important role in the definition of a reasonable effect estimator. Here, we investigate two different versions of the empirical distribution functions and study their impact on the interpretation of the resulting estimator as well as their asymptotic behavior in detail. As weighting factors we use the sizes of the clusters and define estimators of the empirical distribution functions such that (1) larger clusters add more weight to the estimator than smaller ones and (2) each cluster adds the same weight to the estimator regardless of its size. Throughout this paper, the resulting estimators will be called weighted and unweighted estimators, respectively. Let c(x) = 0, 1/2, 1 according as x < 0, x = 0, x > 0 denote the normalized version of the count function and consider the two different versions of empirical distribution functions

F̂_g^(u)(x) = (1/n_g) ∑_{k=1}^{n_g} (1/m_gk) ∑_{s=1}^{m_gk} c(x − X_gks) and F̂_g^(w)(x) = (1/m_g) ∑_{k=1}^{n_g} ∑_{s=1}^{m_gk} c(x − X_gks), m_g = ∑_{k=1}^{n_g} m_gk. (4)

Here, both F̂_g^(u)(x) and F̂_g^(w)(x) represent estimators of the normalized distribution function F_g(x): the counts within each cluster are first averaged and then the mean of these cluster averages is computed in F̂_g^(u)(x), while the counts obtained from all observations are averaged in F̂_g^(w)(x), for all x ∈ R. Thus, F̂_g^(w)(x) is basically the standard version of the empirical distribution function of a random sample.
The impact of these two different weighting versions becomes noticeable when they are plugged into the integral representation of p given in (1), yielding the unweighted and weighted versions of the estimator

p̂(u) = ∫ F̂_1^(u) dF̂_2^(u) and p̂(w) = ∫ F̂_1^(w) dF̂_2^(w) = (1/m1)(R̄_2·· − (m2 + 1)/2),

where R̄_g·· = m_g^(−1) ∑_{k=1}^{n_g} ∑_{s=1}^{m_gk} R_gks and R_gks is the (mid)rank of X_gks among all N observations. Both estimators are means of the counts c(X_2ks − X_1k′s′); however, p̂(u) is an unweighted and p̂(w) a weighted mean of the normed placements F̂_1^(u)(X_2ks) and F̂_1^(w)(X_2ks), respectively. The main difference between these two estimators lies in their interpretation when applied to data, ie, subjects/units with larger clusters contribute more weight to p̂(w), while every cluster contributes the same weight to p̂(u). The answer to "which estimator to use?" depends on the specific research question and experiment and, therefore, the choice has to be made on a case-by-case basis. Both estimators are identical in the case of balanced clusters (m_gk ≡ M) and identical to the standard rank-based estimator of the relative treatment effect when single observations are measured (m_gk ≡ 1). Note that weighted means play an important role in the statistical sciences, and the weighted estimator is also often chosen in count data analysis with offset variables. 26 Both of the estimators p̂(u) and p̂(w) are unbiased and strongly consistent for p if the sample sizes (ie, the numbers of experimental units) are reasonably large. The unbiasedness follows from E(c(X_2ks − X_1k′s′)) = ∫ F1 dF2, because X_2ks and X_1k′s′ are independent. The consistency is outlined in the Appendix. Next, the asymptotic distributions of the estimators will be established. Note that each estimator is a sum of dependent random variables; thus, standard central limit theorems do not apply and their asymptotic normality is not obvious at first hand.
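The two weighting schemes can be sketched directly from the pairwise counts; a minimal illustration (the data layout and function names are ours, not the paper's):

```python
def count_fn(x):
    # normalized count function: 0, 1/2, 1 according as x < 0, x = 0, x > 0
    return 0.5 if x == 0 else float(x > 0)

def effect_estimate(data1, data2, weighted):
    # data_g: list of clusters; each cluster is a list of replicates
    if weighted:
        # weighted version: every single observation carries the same weight
        obs1 = [x for cl in data1 for x in cl]
        obs2 = [y for cl in data2 for y in cl]
        return sum(count_fn(y - x) for x in obs1 for y in obs2) / (len(obs1) * len(obs2))
    # unweighted version: counts are averaged within cluster pairs first,
    # so every cluster carries the same weight regardless of its size
    total = 0.0
    for cl1 in data1:
        for cl2 in data2:
            total += sum(count_fn(y - x) for x in cl1 for y in cl2) / (len(cl1) * len(cl2))
    return total / (len(data1) * len(data2))

d1, d2 = [[1, 3], [2, 5]], [[4, 6], [0, 2]]
print(effect_estimate(d1, d2, False), effect_estimate(d1, d2, True))  # equal: balanced clusters
```

With balanced clusters the two estimates coincide, as stated above; with unbalanced clusters they generally differ.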
Brunner et al 6 derived the asymptotic normality of the estimator in the case of single observations by exploiting the so-called asymptotic equivalence theorem, which introduces sums of independent random variables that have the same asymptotic distribution as the estimator. Here, we adapt their results to dependent replicates.
Define the unobservable random variables Y_1ks = F2(X_1ks) and Y_2ks = F1(X_2ks). Then, under mild conditions on the sample and cluster sizes, it can be shown that

√n (p̂(u) − p) ≐ √n (Ȳ_2··^(u) − Ȳ_1··^(u) + 1 − 2p), (7)
√N (p̂(w) − p) ≐ √N (Ȳ_2··^(w) − Ȳ_1··^(w) + 1 − 2p), (8)

where Ȳ_g··^(u) = (1/n_g) ∑_{k=1}^{n_g} Ȳ_gk· and Ȳ_g··^(w) = (1/m_g) ∑_{k=1}^{n_g} ∑_{s=1}^{m_gk} Y_gks denote unweighted and weighted means of the (unobservable) cluster means Ȳ_gk· = (1/m_gk) ∑_{s=1}^{m_gk} Y_gks for g = 1, 2. This means that both √n(p̂(u) − p) and √N(p̂(w) − p) have the same asymptotic distributions as the sums of independent random variables given on the right-hand sides of (7) and (8), respectively. Note, however, that the Ȳ_gk· are independent but not identically distributed, because each variable Ȳ_gk· has its own variance σ²_gk = Var(Ȳ_gk·). For the derivation of the asymptotic normality, assumptions on the sample and replication sizes are necessary that ensure that the variance components of the limiting distributions exist (assumptions A1-A5 in the following; all hold for g = 1, 2). It then follows that √n(p̂(u) − p) has, asymptotically, a N(0, σ²(u)) distribution under A1, A3, and A5, and √N(p̂(w) − p) a N(0, σ²(w)) distribution under the remaining assumptions, respectively. Here,

σ²(u) = σ²_1^(u) + σ²_2^(u) and σ²(w) = σ²_1^(w) + σ²_2^(w) (11)

denote the sums of the variance components of the two samples, respectively. The variances are, however, unknown and must be estimated in real data applications. Consistent estimators will be developed in the next section.
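That the cluster means Ȳ_gk· are not identically distributed is easy to see under an exchangeable (compound-symmetry) dependence; a small sketch under that assumed correlation structure, which is narrower than the paper's arbitrary dependence:

```python
def var_cluster_mean(m, rho, sigma2=1.0):
    # variance of the mean of m equicorrelated variables with common
    # variance sigma2 and pairwise correlation rho:
    # Var = (m*sigma2 + m*(m-1)*rho*sigma2) / m^2 = sigma2 * (1 + (m-1)*rho) / m
    return sigma2 * (1 + (m - 1) * rho) / m

for m in (1, 2, 5):
    print(m, var_cluster_mean(m, 0.5))  # 1.0, 0.75, 0.6: depends on the cluster size m
```

Under independence (ρ = 0) the variance decays like 1/m, while for ρ > 0 it is bounded below by ρσ², so clusters of different sizes m_gk necessarily produce heteroscedastic cluster means.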

Estimation of the variances
In the previous section, the asymptotic normality of the quantities √n(p̂(u) − p) and √N(p̂(w) − p) has been established. It turns out that the limiting distributions of both random variables exist, and their variances are given by σ²(u) and σ²(w) defined in (11), respectively. Both of the variances σ²(u) and σ²(w), however, do not simply consist of a sum of two variance constants; each is rather a mean of variances. The arising task is the consistent estimation of these sums of variances in this nonparametric framework. One solution to this problem is to first estimate the variances of the asymptotically equivalent sums in (7) and (8) using the unobservable random variables Y_gks and, in a second step, to replace them with observable random variables that are close enough to the Y_gks in an appropriate norm. We first derive estimators of the variances σ²_g^(u). Computing the variance of the mean Ȳ_g··^(u) on the right-hand side of (7), we obtain, by the independence of Ȳ_gk· and Ȳ_gk′·, k ≠ k′, Var(Ȳ_g··^(u)) = (1/n_g²) ∑_{k=1}^{n_g} σ²_gk. Thus, an unbiased and consistent estimator of σ²_g^(u) = lim_{n→∞} (n/n_g²) ∑_{k=1}^{n_g} σ²_gk is given by the scaled empirical variance of the cluster means,

σ̃²_g^(u) = (n / (n_g(n_g − 1))) ∑_{k=1}^{n_g} (Ȳ_gk· − Ȳ_g··^(u))², g = 1, 2. (12)

The variance estimation for the weighted estimator p̂(w) is a more challenging task. Analogously to the above, computing the variance of Ȳ_2··^(w) given in (8) yields a weighted sum of the variance components σ²_gk^(w) = Var(Y_gk·), where Y_gk· = ∑_{s=1}^{m_gk} Y_gks. Thus, the variance components σ²_gk^(w) represent variances of the sums of the variables Y_gks. In contrast to the investigations with respect to the variance of the unweighted estimator, here the variables Y_gk· may have different expectations when cluster sizes are different. Therefore, the variance estimator is derived by considering the squared deviation of Y_gk· from its estimated expectation m_gk Ȳ_g··^(w), along with a bias correction. To this end, define the known weight K_g = m_g^(−1) ∑_{k=1}^{n_g} m²_gk (m_g − 2m_gk)^(−1) for g = 1, 2 and consider the resulting estimator σ̃²_g^(w) given in (13). It is explained in the Appendix that σ̃²_g^(w) is an unbiased and consistent estimator of σ²_g^(w).
Both of the quantities σ̃²_g^(u) and σ̃²_g^(w) given in (12) and (13) are, however, not observable in real data applications. Therefore, the unobservable random variables Y_1ks = F2(X_1ks) and Y_2ks = F1(X_2ks) are replaced by the observable random variables Ŷ_1ks = F̂_2^(c)(X_1ks) and Ŷ_2ks = F̂_1^(c)(X_2ks), c ∈ {u, w}, where F̂_g^(c) denotes the corresponding empirical distribution function of sample g = 1, 2 defined in (4), respectively. Finally, these variables replace the Y_gks used in σ̃²_g^(u) and σ̃²_g^(w), and thus yield the observable estimators σ̂²_g^(u) and σ̂²_g^(w) in (14), respectively. Combining these results, consistent estimators of the limiting variances σ²(u) and σ²(w) displayed in (11) are given by σ̂²(u) = σ̂²_1^(u) + σ̂²_2^(u) and σ̂²(w) = σ̂²_1^(w) + σ̂²_2^(w). It is shown in the Appendix that both estimators σ̂²_g^(u) and σ̂²_g^(w) are consistent. Based on the asymptotic distributions of the effect estimators and their consistent variance estimators, test procedures for testing the hypothesis H0: p = 1/2 as well as confidence intervals for p can be derived. This will be explained in the next section.
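For the unweighted case, the plug-in steps can be sketched end to end. The empirical-variance form used here follows the scaled cluster-mean variance described above, so treat this as an illustration under that assumption rather than the paper's exact estimator:

```python
def count_fn(x):
    return 0.5 if x == 0 else float(x > 0)

def edf(sample, x):
    # normalized empirical distribution function of the pooled other-group sample
    return sum(count_fn(x - s) for s in sample) / len(sample)

def sigma2_u(data1, data2):
    # data_g: list of clusters (lists of replicates);
    # placements Yhat_1ks = F2hat(X_1ks), Yhat_2ks = F1hat(X_2ks), then per group
    # the empirical variance of the cluster means of the placements, scaled by n/n_g
    obs1 = [x for cl in data1 for x in cl]
    obs2 = [x for cl in data2 for x in cl]
    n1, n2 = len(data1), len(data2)
    n = n1 + n2
    total = 0.0
    for clusters, other, ng in ((data1, obs2, n1), (data2, obs1, n2)):
        means = [sum(edf(other, x) for x in cl) / len(cl) for cl in clusters]
        mbar = sum(means) / ng
        total += n * sum((mk - mbar) ** 2 for mk in means) / (ng * (ng - 1))
    return total

d1, d2 = [[1, 3], [2, 5], [4]], [[2, 6], [0], [3, 7]]
print(sigma2_u(d1, d2))
```

The returned value is the sum of the two group components, mirroring the decomposition of the limiting variance into per-sample parts.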

TEST STATISTICS
The asymptotic normality of the estimators p̂(c), c ∈ {u, w}, along with the consistent estimators of their variances can now be used for the derivation of appropriate test statistics for testing the null hypothesis H0: p = 1/2. To this end, define the scaling factors √n(u) = √n and √n(w) = √N and consider

T(c) = √n(c) (p̂(c) − 1/2) / σ̂(c), (16)

where the superscript (c) refers to the unweighted and weighted estimation approaches, respectively. It follows from the above that, under the hypothesis H0: p = 1/2, the variables T(c) follow asymptotically a standard normal distribution. For large sample sizes, the null hypothesis H0: p = 1/2 will therefore be rejected at level α if |T(c)| ≥ z_{1−α/2}. One-sided test results can be obtained in the obvious way. Extensive simulation studies show, however, that the test tends to be liberal and to over-reject the hypothesis when sample sizes are rather small. In order to provide an approximate version of the tests that controls the nominal type-I error rate in small sample size situations, the idea from the work of Brunner et al 6 motivates us to approximate the distribution of T(c) by a central t-distribution and to estimate its approximate degrees of freedom using Satterthwaite-Welch equations. This type of approximation is also known as a Box-type approximation. 27 The problem that arises here is that each of the variables Ȳ_gk· represents a mean of the variables Y_gks; thus, each variable may have a different variance σ²_gk. Computing the variance of the variance estimators σ̂²_g^(c) involves sums of σ⁴_gk and σ³_gk, quantities that are rather difficult to estimate in this setup due to overfitting issues. Therefore, we define the approximate degrees of freedom of the resulting t-distribution such that the methods coincide with the Brunner-Munzel test when cluster sizes are equal to 1:

ν̂(c) = (σ̂²_1^(c) + σ̂²_2^(c))² / ( (σ̂²_1^(c))²/(n1 − 1) + (σ̂²_2^(c))²/(n2 − 1) ). (18)
For small sample sizes, the null hypothesis H0: p = 1/2 will be rejected at level α if

|T(c)| ≥ t_{1−α/2}(ν̂(c)), (19)

where t_{1−α/2}(ν̂(c)) denotes the (1 − α/2)-quantile of the central t-distribution with estimated degrees of freedom ν̂(c) given in (18). Approximate (1 − α)-confidence intervals for p are given by

CI(c) = [ p̂(c) − t_{1−α/2}(ν̂(c)) σ̂(c)/√n(c), p̂(c) + t_{1−α/2}(ν̂(c)) σ̂(c)/√n(c) ]. (20)

One-sided confidence intervals and tests can be computed in the usual way by using (1 − α)-quantiles and setting the lower or upper bound of the confidence interval to 0 or 1, depending on the direction. We note that ν̂(c) → ∞ as n → ∞ and, therefore, the approximation is asymptotically correct. Furthermore, ν̂(u) and ν̂(w) are identical when clusters are equally sized (ie, m_gk ≡ M), and both are identical to the Brunner-Munzel degrees of freedom when m_gk ≡ 1.
Remark 1. When the number of replicates of one unit is much larger than those of the others, it may happen that the weighted variance estimator σ̂²_g^(w) given in (14) becomes negative and, thus, the test statistic T(w) cannot be computed. In this case, we propose to replace σ̂²_g^(w) by the asymptotically unbiased version given in (21), both in T(w) and in the confidence intervals given in (20). Throughout this manuscript, the resulting test will be denoted as TW.
Remark 2. Note that the confidence intervals CI(c) given in (20) are not necessarily range-preserving, ie, the lower bound may be smaller than zero and/or the upper bound may be larger than one. Range-preserving confidence intervals for the effect p can be derived using the delta method and an appropriate transformation, eg, the logit(x) = log(x/(1 − x)) or probit(x) = Φ^(−1)(x) transformation. For example, the logit-type confidence intervals for p are given by

expit( logit(p̂(c)) ∓ t_{1−α/2}(ν̂(c)) σ̂(c) / (√n(c) p̂(c)(1 − p̂(c))) ).

Here, expit(x) = exp(x)/(1 + exp(x)) denotes the inverse of the logit function.
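A sketch of the logit-transformed interval from Remark 2, with the delta-method standard error σ̂/(√n · p̂(1 − p̂)) and the normal quantile standing in for the t-quantile (illustrative inputs, not from the data set):

```python
from math import exp, log, sqrt
from statistics import NormalDist

def logit(p):
    return log(p / (1 - p))

def expit(x):
    return exp(x) / (1 + exp(x))

def logit_ci(p_hat, sigma2_hat, n, alpha=0.05):
    # delta method: SE(logit(p_hat)) = sigma_hat / (sqrt(n) * p_hat * (1 - p_hat));
    # back-transforming with expit keeps the interval inside (0, 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(sigma2_hat / n) / (p_hat * (1 - p_hat))
    return expit(logit(p_hat) - z * se), expit(logit(p_hat) + z * se)

lo, hi = logit_ci(0.72, 0.8, 30)
print(round(lo, 3), round(hi, 3))  # bounds stay inside (0, 1)
```

Unlike the untransformed interval, both bounds are guaranteed to lie strictly between 0 and 1 because expit maps the real line into (0, 1).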
The quality of the proposed tests in terms of controlling the nominal type-I error rate and their powers to detect alternatives will be investigated in extensive simulation studies in the next section.

Approach from the work of Larocque et al
Recently, Larocque et al 21 proposed solutions for the nonparametric Behrens-Fisher problem with clustered data, where clusters may contain observations from both groups. Their methods are also valid in our model (2) and shall be briefly explained as follows. The methods are intended to test the null hypothesis H0^(L): E(s(X111 − X211)) = 0, where s(x) denotes the sign function; this is basically identical to testing H0: p = 1/2. In order to estimate the treatment effect E(s(X111 − X211)), weighted sums of signs S_ik are defined, where w_ik are nonnegative weights associated with S_ik. For the computation of the estimator of Var(S), let S⁰_ik and S¹_ik denote the noncentered and centered versions of the sums of signs S_ik given above. For an easy representation of the quite involved computation of the variance estimator, let S^h = (S^h_ik)_{n1×n2} denote the matrices of the S^h_ik for h = 0, 1, respectively. Note that, in the third term of the variance estimator in the work of Larocque et al, 21 p759, one of w_ik or w_ri must be zero in our setting. To see this, suppose i is an index value in the first group. Then w_ri = 0 for any r because m_i2 = 0. Therefore, the expression of the variance estimator of the aforementioned authors 21 p759 can be written in a closed matrix form using the vector operator Vec(·), which stacks the columns of a matrix on top of each other, and the trace tr(·) of a matrix. Based on the previous calculations, Larocque et al 21 proposed three different estimators of the variance Var(S) as follows:
• σ̂²_S0, which uses the S⁰_ik given in (22) and is only consistent under the null hypothesis,
• σ̂²_S1, which uses the S¹_ik given in (22) and is also consistent under the alternative hypothesis, and
• a third estimator, underlying T(F)_L, that is a linear combination of the former two.
These three variance estimators lead to three different versions T(0)_L, T(1)_L, and T(F)_L of test statistics for testing H0^(L), given in (23), respectively. Under the null hypothesis H0^(L), all three versions have, asymptotically, a standard normal distribution.
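The equivalence of H0^(L) and H0: p = 1/2 rests on the pointwise identity s(x) = 1 − 2c(−x), so that E(s(X1 − X2)) = 1 − 2p; a quick numerical check on made-up data:

```python
def sign(x):
    # sign function: -1, 0, 1 according as x < 0, x = 0, x > 0
    return (x > 0) - (x < 0)

def count_fn(x):
    # normalized count function: 0, 1/2, 1 according as x < 0, x = 0, x > 0
    return 0.5 if x == 0 else float(x > 0)

x1, x2 = [1, 4, 4, 7], [2, 3, 6, 6]
pairs = [(a, b) for a in x1 for b in x2]

theta_hat = sum(sign(a - b) for a, b in pairs) / len(pairs)  # estimate of E s(X1 - X2)
p_hat = sum(count_fn(b - a) for a, b in pairs) / len(pairs)  # estimate of p

print(theta_hat, 1 - 2 * p_hat)  # identical by the identity theta = 1 - 2p
```

Hence E(s(X111 − X211)) = 0 holds exactly when p = 1/2, which is why the two null hypotheses coincide.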
Note that only the test statistic T(1)_L can be inverted into a confidence interval for the treatment effect, because its variance estimator σ̂²_S1 is consistent under the alternative hypothesis. However, all three versions of the tests will be used as competing procedures in the simulation studies, which will be explained and discussed in detail in the next section.

SIMULATIONS
All of the methods proposed in the previous sections are valid for large sample sizes. The arising questions are (1) "How accurately do they control the nominal type-I error rate under the null hypothesis?" and (2) "How much power do the procedures have to detect alternatives when sample sizes are small?" Extensive simulation studies were conducted to find answers to these questions in different scenarios involving very small and moderate sample sizes in balanced and unbalanced situations with different settings for the numbers of replications (Settings 1-4). Throughout the simulations, a two-sample design was investigated. To investigate the impact of the shape of the data distributions on the quality of the procedures, three different types of distributions were considered, namely:
• Normal data: X_ik ~ N(0, Σ_ik), where the covariance matrices were chosen as Σ_ik = σ²_i [I_{m_ik×m_ik} + ρ(J_{m_ik×m_ik} − I_{m_ik×m_ik})] with correlation values ρ ∈ {0, 0.5, 0.9}. Besides the homoscedastic case σ²_1 = σ²_2 = 1, two heteroscedastic scenarios with σ²_1 = 1, σ²_2 = 2 and with σ²_1 = 1, σ²_2 = 3 were investigated. Thus, both positive (the larger sample has the larger variance) and negative (the larger sample has the smaller variance) pairing situations are covered within these settings;
• Lognormal data: exponentially transformed multivariate normal vectors N(0, Σ_ik), with Σ_ik = I_{m_ik×m_ik} + ρ(J_{m_ik×m_ik} − I_{m_ik×m_ik}) and ρ ∈ {0, 0.5, 0.9} (note that the actual correlation coefficients of the resulting lognormal distributions are different 28 );
• Ordinal data: rounded multivariate normal vectors N(0, Σ_ik), with Σ_ik = I_{m_ik×m_ik} + ρ(J_{m_ik×m_ik} − I_{m_ik×m_ik}), ρ ∈ {0, 0.5, 0.9}, where the symbol [·] represents the rounding operator, respectively.
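The compound-symmetry covariance Σ_ik = σ²[I + ρ(J − I)] used above can be built, and normal clusters drawn, via a shared random effect; a sketch of the normal setting (lognormal and ordinal data then follow by exponentiating or rounding the draws; function names are ours):

```python
import random
from math import sqrt

def cs_cov(m, rho, sigma2=1.0):
    # Sigma = sigma2 * [I + rho * (J - I)]: variance sigma2 on the diagonal,
    # covariance rho * sigma2 off the diagonal
    return [[sigma2 if i == j else rho * sigma2 for j in range(m)] for i in range(m)]

def cs_normal_cluster(m, rho, rng=random):
    # X_s = sqrt(rho) * Z0 + sqrt(1 - rho) * Z_s with iid standard normal Z's
    # gives Var(X_s) = 1 and Corr(X_s, X_t) = rho for s != t (rho >= 0)
    z0 = rng.gauss(0, 1)
    return [sqrt(rho) * z0 + sqrt(1 - rho) * rng.gauss(0, 1) for _ in range(m)]

print(cs_cov(3, 0.5))
print(len(cs_normal_cluster(4, 0.9)))
```

The shared-effect construction works for the nonnegative ρ values used in the study; general covariance matrices would instead require a Cholesky factorization.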
All simulations were conducted in the R computing environment, version 3.4.3 (www.r-project.org), each with nsim = 10 000 simulation runs. Throughout the simulations, the newly developed tests T(u), T(w), and TW proposed in (16) and (21) were implemented using the corresponding t-approximation proposed in (19). They were compared with the methods T(0)_L, T(1)_L, and T(F)_L proposed by Larocque et al 21 given in (23). Another competitor is the Brunner-Munzel test T_BM, for the application of which the first observation X_ik1 within each cluster X_ik, i = 1, 2; k = 1, … , n_i, was used. The aim of simulating the Brunner-Munzel test is to explore its difference from the Larocque test T(F)_L when single observations were observed (see Setting 1) as well as to investigate whether the new methods increase its power when replicates were observed. Furthermore, we simulated the behavior of the rank-based methods for testing the hypothesis H0^F: F1 = F2 with clustered data proposed by RGL 13 and DS. 14 These tests were computed using the clusrank R package. 29 The simulation results are summarized for all four settings (Settings 1-4), the three different correlation values, and the sample size configurations for each of the different data distributions separately. Type-I errors are multiplied by a factor of 100 for ease of visualization.
The type-I error simulation results using homoscedastic normal distributions are displayed in Table 1; those for heteroscedastic normal distributions with variances σ²_1 = 1 and σ²_2 = 2 are displayed in Table 2; those for heteroscedastic normal distributions with variances σ²_1 = 1 and σ²_2 = 3 are displayed in Table 3; results for lognormally distributed data are given in Table 4; and empirical type-I error rates for ordered categorical data are displayed in Table 5, respectively. First, it can be seen from Tables 1 to 5 that the methods T(0)_L and T(1)_L given in (23) tend to be either far too conservative or far too liberal throughout all of the investigated settings and data distributions. The test procedure T(F)_L tends to control the type-I error level conservatively when sample sizes are rather small. The conservatism decreases when the sample sizes n1 and n2 get larger. The WMW-type test statistics tend to control the nominal type-I error rate reasonably well, even under heteroscedasticity. When sample sizes are unbalanced and variances are heteroscedastic, the methods tend to be slightly liberal or conservative, depending on the sample size and variance allocations. However, neither the methods proposed by Larocque et al 21 nor the WMW-type tests can be inverted into confidence intervals for the effect p. The newly developed methods T(w) and T(u) tend to control the nominal type-I error reasonably well in all the investigated scenarios. When sample sizes are very small and the correlation is very high, the tests tend to be slightly liberal. The liberality decreases with increasing sample sizes. Furthermore, it can be seen from Tables 1 to 5 that the weighting scheme of the estimators does not markedly impact the behavior of the tests. Sometimes, T(w) tends to be slightly more liberal than T(u). This occurs because the standard error (SE) of p̂(w) is usually a bit "harder" to estimate, as variances of sums are estimated rather than variances of means.
In all of the investigated scenarios, the weighted variance estimators σ̂² (w) i did not become negative, and all tests T (w) could be computed. Nevertheless, to investigate the behavior of TW motivated in (21), this test was simulated in all scenarios as well. It turns out that TW controls the nominal size well, sometimes even better than T (w) . When sample sizes are small and unbalanced, however, the test may become conservative. Summarizing the findings discussed above, the new methods are quite accurate, and their use is recommended when the sample sizes satisfy n i ≥ 7. In the case of extremely small samples, an accurate control of the type-I error rate using asymptotic normal or t-quantiles for these rank statistics cannot be expected. Simulation studies using negatively correlated data show very similar results to those reported above and are therefore omitted.
Next, the powers of all of the methods to detect the alternative H 1 ∶ p ≠ 1∕2 were simulated. Two different balanced designs were considered, with moderately large sample sizes (n i = 20) in both. The power simulation results (multiplied by a factor of 100) are displayed in Tables 6 and 7 with α = 5%. First, it can be readily seen that all of the methods that take the replications into account have a higher power than the Brunner-Munzel test. It should, however, be noted that the comparisons have to be confined to the newly proposed methods and the methods of Larocque et al, 21 ie, T (0) L , T (1) L , and T (F) L . More specifically, RGL and DS are designed for the null hypothesis H 0 ∶ F 1 = F 2 ; heteroscedastic settings are therefore alternatives for these two tests, and, strictly speaking, their powers are not directly comparable and appear low. When comparing the new methods with those in the work of Larocque et al, 21 the power simulations (especially Design 2 in Table 7) clearly show the advantage of the new methods when the within-cluster correlation is negative. Even for positive correlation, the new methods have slightly better power than the methods of Larocque et al. 21 Furthermore, the type of weighting slightly impacts the powers of the methods T (w) and T (u) and seems to depend on the correlations within the clusters. Comparing the two weighted tests TW and T (w) , the power of T (w) is slightly lower in some scenarios. This may occur because the variance estimators used are based on σ̂² i given in (21) along with a bias correction that results in an estimator with larger variance.
All of the results and conclusions made here are, however, based on a few selected designs of replications. Overall, all of the methods appear to have substantial power to detect departures from the null hypothesis H 0 ∶ p = 1∕2. A general conclusion cannot be made due to the abundance of possible designs and sample size configurations. Additional theoretical power and efficiency investigations of rank tests can be found in the work of Janssen 1 and the references therein.

Remark 3. The question of whether the weighted or the unweighted estimator should be used has not been answered yet. A helpful selection criterion is the asymptotic relative efficiency (ARE) of the corresponding weighted and unweighted estimators of p. It follows from the asymptotic normal distributions of √ n(p̂ (u) − p) in (7) and √ N(p̂ (w) − p) in (8) that the ARE of these two sequences can be defined as the ratio of their asymptotic variances (see, eg, Boos and Stefanski 30 ). Thus, in our case, the ARE indicates which of the two estimators has the smaller SE: if ARE < 1, the weighted estimator is more efficient; if ARE > 1, the unweighted estimator is more efficient; and both are of equal quality if ARE = 1. For example, let n 1 = n 2 = n 0 and assume that Var(Y iks ) = σ² and Cov(Y iks , Y iks ′ ) = ρσ².
Since both the unweighted and weighted estimators are identical in the case of equally sized clusters, consider a scenario in which all cluster sizes are equal except one, ie, let m 11 = … = m 1n 1 = m 21 = … = m 2,n 2 −1 = 1 and let m 2n 2 = m 0 > 1. Routine calculations show that ARE ⋛ 1 according as the expression in (24) is smaller than, equal to, or larger than 1, where m 2 = n 0 − 1 + m 0 . For example, if n 0 = 10 and m 0 = 2, an intraclass correlation value of about ρ = 0.35 leads to ARE = 1 (see Figure 2). Thus, the powers of both the weighted and unweighted tests T (w) and T (u) are expected to be identical (for large sample sizes) in this specific scenario. If ρ = 0, it follows that T (w) has a higher power than T (u) , and thus, the weighted estimator is preferred; in the case of large correlations, the unweighted estimator should be used instead. The aforementioned findings are numerically justified in a simulation study with n 1 = n 2 = 30 and m 0 = 10. The data X 11 , … , X 2,n 2 −1 were generated from normal distributions with variance 1, whereas the single larger cluster X 2n 2 was generated with intraclass correlation ρ chosen according to (25). The results are displayed in Table 8.
It can be seen from Table 8 that the powers of the three tests are almost identical if ARE = 1. Otherwise, if ARE < 1, the weighted test T (w) has a higher power than T (u) , and vice versa if ARE > 1. Roughly speaking, T (w) is more efficient than T (u) if the clustered data are weakly or mildly correlated. Further investigations of the ARE of the tests will be part of future research. 31,32
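The interpretation of the ARE as a comparison of SEs can be checked numerically. The sketch below computes a crude Monte Carlo analogue of the ARE, namely the ratio of the simulated variances of the unweighted and weighted estimators, in the scenario of Remark 3. This is a Python sketch under illustrative assumptions: the normal data generator, the exchangeable correlation construction, and the function names are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_effect(a, b):
    """Mean of I(a < b) + 0.5*I(a == b) over all observation pairs."""
    a = np.asarray(a)[:, None]
    b = np.asarray(b)[None, :]
    return np.mean((a < b) + 0.5 * (a == b))

def p_unweighted(g1, g2):
    """Each pair of clusters contributes equally."""
    return np.mean([[pairwise_effect(c1, c2) for c2 in g2] for c1 in g1])

def p_weighted(g1, g2):
    """Each pair of observations contributes equally (pooled data)."""
    return pairwise_effect(np.concatenate(g1), np.concatenate(g2))

def empirical_are(nsim=1000, n0=10, m0=2, rho=0.35):
    """Monte Carlo variance ratio Var(p_u)/Var(p_w) in the Remark 3
    scenario: all clusters of size 1 except one of size m0 in group 2."""
    pu, pw = [], []
    for _ in range(nsim):
        g1 = [rng.normal(size=1) for _ in range(n0)]
        g2 = [rng.normal(size=1) for _ in range(n0 - 1)]
        # the single larger cluster, with exchangeable correlation rho
        shared = rng.normal() * np.sqrt(rho)
        g2.append(shared + rng.normal(size=m0) * np.sqrt(1 - rho))
        pu.append(p_unweighted(g1, g2))
        pw.append(p_weighted(g1, g2))
    return np.var(pu) / np.var(pw)

print(empirical_are())
```

With n 0 = 10, m 0 = 2, and ρ = 0.35, the ratio should be close to 1, in line with the ARE = 1 point mentioned above; it tends below 1 for ρ near 0 and above 1 for large ρ. Note that this ignores the √n versus √N scalings in the formal definition, which are fixed constants for given sample sizes.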

DATA EVALUATIONS
In this section, the data set introduced in Section 2 is analyzed. We test the null hypothesis H 0 ∶ p = 1∕2 using the two new methods T (u) and T (w) as given in (19) and also compute the modified version TW of T (w) as indicated in (21). For comparative purposes, we add the Brunner-Munzel test T BM , for the computation of which the first observation within each cluster was used. The three methods T (0) L , T (1) L , and T (F) L for testing H 0 ∶ E(S) = 0 as given in (23) were also computed and compared with the aforementioned methods; note that these differ only in the estimation of the asymptotic variance. Furthermore, for testing the null hypothesis H 0 ∶ F 1 = F 2 , the methods proposed by RGL 13 and DS 14 were computed. For all of the methods, point estimators of the treatment effect, their SEs, test statistics, 95% confidence intervals, and p-values are reported. We used α = 0.05 for all data evaluations and interpretations.

Since several rats share the same cage, the cage is regarded as the experimental unit and the rats represent the replicates. The numbers of cages in the two dose groups are identical (n 1 = n 2 = 13), while the numbers of replicates differ. The statistical analysis of the data is displayed in Table 9. First, it can be seen that the unweighted estimator p̂ (u) and the weighted estimator p̂ (w) differ only slightly, and their estimated SEs are also about the same; this occurs because the replicates have a medium correlation. Furthermore, the data do not provide evidence to reject the null hypothesis H 0 ∶ p = 1∕2 at the 5% level of significance. All of the results are, however, borderline, and a remarkable improvement can be detected. The test decisions for testing H 0 ∶ F 1 = F 2 differ slightly; however, both p-values indicate a difference in terms of the distributions.
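For readers who wish to reproduce the style of analysis reported in Table 9, the following sketch computes a point estimate of p, an SE, a 95% confidence interval, and a p-value for H 0 ∶ p = 1∕2 from clustered data. It is a Python sketch on toy data; the leave-one-cluster-out jackknife SE is a generic stand-in for the paper's variance estimators in (19), not the proposed method itself.

```python
import numpy as np
from scipy import stats

def effect(g1, g2):
    """Unweighted estimate of p = P(X1 < X2) + 0.5*P(X1 = X2):
    average the pairwise effect over all pairs of clusters."""
    def pair(c1, c2):
        a, b = np.asarray(c1)[:, None], np.asarray(c2)[None, :]
        return np.mean((a < b) + 0.5 * (a == b))
    return np.mean([[pair(c1, c2) for c2 in g2] for c1 in g1])

def jackknife_test(g1, g2):
    """Test H0: p = 1/2 with a leave-one-cluster-out jackknife SE
    (a generic stand-in for the paper's variance estimators)."""
    est = effect(g1, g2)
    var = 0.0
    for which in (0, 1):
        g = (g1, g2)[which]
        n = len(g)
        reps = np.array([
            effect(g[:k] + g[k + 1:], g2) if which == 0
            else effect(g1, g[:k] + g[k + 1:])
            for k in range(n)
        ])
        var += (n - 1) / n * np.sum((reps - reps.mean()) ** 2)
    se = np.sqrt(var)
    z = (est - 0.5) / se
    pval = 2 * stats.norm.sf(abs(z))
    ci = (est - 1.96 * se, est + 1.96 * se)
    return est, se, ci, pval

# toy clustered data: the cage would be the cluster, rats the replicates
rng = np.random.default_rng(3)
g1 = [rng.normal(size=m) for m in (2, 3, 1, 4, 2, 3, 2)]
g2 = [rng.normal(loc=1.0, size=m) for m in (3, 2, 2, 1, 4, 3, 2)]
est, se, ci, pval = jackknife_test(g1, g2)
print(est, se, ci, pval)
```

The estimator here is the unweighted one; a pooled (weighted) variant would average over observation pairs instead of cluster pairs.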

DISCUSSION
Dependent replications are observed in many experiments, and there is a need for adequate statistical procedures for modeling them. Reducing the replications to a single observation, by using their means or medians or by strictly using the first observation, cannot be recommended because much of the information provided by the replications is not used effectively; this strategy results in a loss of power. Furthermore, data often follow a skewed distribution or are even observed on ordinal scales. Thus, there is a need for purely nonparametric, flexible methods that can be used for analyzing such data in a unified way. Ranking procedures are known to be robust and powerful statistical analysis tools when parametric distributional assumptions are doubtful. Rosner et al 13 and DS 14 proposed rank-based WMW-type test procedures for testing H F 0 ∶ F 1 = F 2 , formulated in terms of the distribution functions of the data. This hypothesis implies that the variances of the data in the two groups are identical. In particular, confidence intervals for the underlying effect cannot be computed in general nonparametric models. In this paper, purely nonparametric methods for testing hypotheses formulated in terms of the WMW effect p given in (1) have been introduced. None of the methods requires that the variances or the distributional shapes of the data be identical, even under the null hypothesis. Thus, the nonparametric Behrens-Fisher problem with dependent replications has been investigated.
Different weighting schemes (weighted and unweighted) for estimating the treatment effect p have been investigated from a theoretical as well as an empirical point of view. The choice of weighting scheme depends on the specific study question, and neither can be recommended over the other in all situations. When, for example, an exchangeable correlation structure can be assumed, the strength of the intraclass correlation (ICC) present in the data could be used as a guiding factor. We conducted a small-scale simulation to shed some light on the effect of the ICC on the precision of the estimation of the relative treatment effect, examining two distributions. Figures 2 and 3 show boxplots constructed from the SEs computed from all combinations of sample sizes (n 1 , n 2 ∈ {7, 10, 20}) and cluster sizes (m ik generated from Binomial(10, 0.3) + 1 and Binomial(4, 0.3) + 1), 18 combinations in total, for different values of the ICC.
It can be seen from the figures that, for both distributions and effect sizes, the weighted analysis is preferable for low ICC values (−0.4 to 0.4), whereas the unweighted analysis is preferable otherwise. It is also interesting to see that high positive correlations show widely varying SEs across sample size and cluster size combinations compared with high negative correlations. As one would expect, estimation is best, in terms of low SEs, for both the weighted and unweighted analyses when the correlation is near zero.
In this paper, methods for specific clustered data were investigated; clusters involving data from both treatment groups were not considered. The generalization of the estimation approaches to general clustered data designs will be part of future research. Furthermore, we restricted ourselves to two different weighting schemes. These can be generalized to "arbitrarily weighted" estimators by using a general weighting framework similar to that in the work of Larocque et al. 21 The motivating example represents part of a complex study involving more than two treatment groups. The generalization of the methods to the several-sample case will be considered in future investigations, and appropriate global testing as well as multiple contrast test procedures for testing hypotheses formulated in terms of relative effects, as in the work of Konietschke et al, 33 will be explored theoretically as well as empirically. Resampling methods to approximate the distribution of the tests in both the two- and several-sample cases appear to be an intriguing option for small sample sizes. Since the data are not exchangeable in the general setup considered here, studentized resampling (permutation or bootstrap) methods shall be explored. The aim of these methods is to mimic the asymptotic distribution of the test statistic (which turned out to be standard normal). A general theory for resampling methods is provided in other works 1,34-36 and will serve as an excellent basis for the development of such methods. Another important issue to tackle is allowing the cluster size to be informative. 12
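A cluster-level bootstrap of the kind alluded to above can be sketched as follows. This is a simplified, non-studentized percentile version in Python: resampling whole clusters preserves the within-cluster dependence, but the function names and toy data are illustrative assumptions, not the resampling scheme the paper proposes.

```python
import numpy as np

rng = np.random.default_rng(2)

def effect(g1, g2):
    """Unweighted estimate of p = P(X1 < X2) + 0.5*P(X1 = X2):
    average the pairwise effect over all pairs of clusters."""
    def pair(c1, c2):
        a, b = np.asarray(c1)[:, None], np.asarray(c2)[None, :]
        return np.mean((a < b) + 0.5 * (a == b))
    return np.mean([[pair(c1, c2) for c2 in g2] for c1 in g1])

def cluster_bootstrap_ci(g1, g2, B=500, level=0.95):
    """Percentile CI for p; clusters (not single observations) are
    resampled within each group, keeping the dependence structure."""
    n1, n2 = len(g1), len(g2)
    boot = np.empty(B)
    for b in range(B):
        s1 = [g1[i] for i in rng.integers(n1, size=n1)]
        s2 = [g2[i] for i in rng.integers(n2, size=n2)]
        boot[b] = effect(s1, s2)
    lo, hi = np.quantile(boot, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# toy example: 10 clusters per group, second group shifted upward
g1 = [rng.normal(size=m) for m in rng.integers(1, 5, size=10)]
g2 = [rng.normal(loc=0.8, size=m) for m in rng.integers(1, 5, size=10)]
print(effect(g1, g2), cluster_bootstrap_ci(g1, g2))
```

A studentized version, as suggested in the text, would bootstrap the standardized statistic rather than the raw estimator.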

A.1 Underlying model
Given are two independent samples of dependent replicated data that can be modeled by independent random vectors X ik = (X ik1 , … , X ikm ik ) ′ , i = 1, 2; k = 1, … , n i , (A1) with marginal distributions X iks ∼ F i , i = 1, 2. Here, m ik denotes the number of replicates of subject k under treatment i, N i = ∑ n i k=1 m ik denotes the total number of observations in group i, i = 1, 2, and N = N 1 + N 2 denotes the overall number of observations.

A.3 Proof of asymptotic equivalence for the unweighted estimator in (5)
In a first step, we decompose the estimator as follows:
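In standard empirical-process notation, this first step is presumably the usual decomposition of the two-sample rank statistic. The following is a sketch: the averaged empirical distribution functions F̄ i and the exact form of the remainder A n are assumptions reconstructed from the surrounding proof.

```latex
% Sketch: \hat{F}_{ik} denotes the (normalized) empirical distribution
% function of cluster k in group i, and
% \bar{F}_i = n_i^{-1} \sum_{k=1}^{n_i} \hat{F}_{ik}.
\sqrt{n}\,\bigl(\hat{p}^{(u)} - p\bigr)
  = \sqrt{n}\int F_1 \, d\bigl(\bar{F}_2 - F_2\bigr)
  + \sqrt{n}\int \bigl(\bar{F}_1 - F_1\bigr) \, dF_2
  + \sqrt{n}\,A_n,
\qquad
A_n = \int \bigl(\bar{F}_1 - F_1\bigr) \, d\bigl(\bar{F}_2 - F_2\bigr).
```

The first two integrals are sums of independent, mean-zero random variables over the clusters, so asymptotic normality follows once the remainder satisfies E(nA 2 n ) = o(1).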

Now,
and, thus, it remains to consider the term A n . Notice that E(A n ) = 0, which follows by applying Fubini's theorem and the fact that F̂ 1k is an unbiased estimator of F 1 . Then, to complete the proof, it suffices to show that E(nA 2 n ) = o(1) as n tends to infinity. To show this, we consider two cases.
Case 1: k ≠ k ′ or l ≠ l ′ . For example, if k ≠ k ′ , then X 1k and X 1k ′ are independent. Again, by applying Fubini's theorem and the fact that F̂ 1k is an unbiased estimator of F 1 , it can be seen that the inner expectation is zero, and therefore the corresponding terms vanish. By symmetry, we get the same expectation when l ≠ l ′ .
Combining the two cases, the assertion follows, since, by Assumption A1, n∕n i = O(1) as n → ∞ for i = 1, 2.
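The order argument that combines the two cases can be sketched as follows, assuming the representation A n = (n 1 n 2 ) −1 ∑ k ∑ l A kl with A kl = ∫ (F̂ 1k − F 1 ) d(F̂ 2l − F 2 ), which is suggested by the case distinction over (k, k ′ , l, l ′ ):

```latex
% Only the diagonal terms (k = k', l = l') survive, by Case 1; each of
% them is bounded by a constant C because the integrands are bounded by 1.
\mathrm{E}\bigl(n A_n^2\bigr)
  = \frac{n}{n_1^2 n_2^2}
    \sum_{k,k'=1}^{n_1} \sum_{l,l'=1}^{n_2}
    \mathrm{E}\bigl(A_{kl} A_{k'l'}\bigr)
  = \frac{n}{n_1^2 n_2^2}
    \sum_{k=1}^{n_1} \sum_{l=1}^{n_2}
    \mathrm{E}\bigl(A_{kl}^2\bigr)
  \le \frac{n\,C}{n_1 n_2}
  = O\bigl(n^{-1}\bigr) = o(1),
```

since n∕n i = O(1) by Assumption A1, so that n 1 n 2 grows at the rate n 2 .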

A.4 Proof of asymptotic equivalence for the weighted estimator in (6)
With the same arguments as above, we first decompose the random variable √ N(p (w) − p) in the following way:

Furthermore,
Here, it can be similarly shown that E(A N ) = 0, and thus, it remains to show that NE(A 2 N ) = o(1). Case 1: k ≠ k ′ or l ≠ l ′ . The proof in this case is similar to the one for the unweighted estimator. Case 2: k = k ′ and l = l ′ . Here also, assuming that the cluster sizes are uniformly bounded, ie, m ik ≤ M 0 < ∞ for all i = 1, 2 and k = 1, … , n i , the remaining diagonal terms can be bounded. Combining the two cases, as in the proof for the unweighted estimator, the assertion follows.