Comparison of profile-likelihood-based confidence intervals with other rank-based methods for the two-sample problem in ordered categorical data

ABSTRACT For ordered categorical data from randomized clinical trials, the relative effect, the probability that observations in one group tend to be larger, has been considered an appropriate measure of effect size. Although the Wilcoxon–Mann–Whitney test is widely used to compare two groups, its null hypothesis is not just a relative effect of 50%, but identical distributions in the two groups. The null hypothesis of the Brunner–Munzel test, another rank-based method used for arbitrary types of data, is just a relative effect of 50%. In this study, we compared the actual type I error rates (or 1 − coverage probability) of the profile-likelihood-based confidence intervals for the relative effect and other rank-based methods in simulation studies at a relative effect of 50%. The profile-likelihood method, as with the Brunner–Munzel test, does not require any assumptions on distributions. The actual type I error rates of the profile-likelihood method and the Brunner–Munzel test were close to the nominal level in large or medium samples, even under unequal distributions. Those of the Wilcoxon–Mann–Whitney test differed largely from the nominal level under unequal distributions, especially under unequal sample sizes. In small samples, the actual type I error rates of the Brunner–Munzel test were slightly larger than the nominal level and those of the profile-likelihood method were even larger. We provide a paradoxical numerical example: only the Wilcoxon–Mann–Whitney test was significant under equal sample sizes, but after changing only the allocation ratio, it was no longer significant, whereas the profile-likelihood method and the Brunner–Munzel test were significant. This phenomenon might reflect the behavior of the Wilcoxon–Mann–Whitney test in the simulation study, namely that the actual type I error rate falls above or below the nominal level depending on the allocation ratio.


Introduction
In some randomized clinical trials, primary endpoints are ordered categorical data. As an example, we provide pain score data from the shoulder tip pain trial reported by Lumley (1996). We consider the subset of the data measured on the third time after a surgery in 25 female patients, following the reanalyses by Brunner and Munzel (2000) and Neubert and Brunner (2007). The pain score takes values from 1 (low pain) to 5 (high pain). The patients were randomly assigned to two treatments: 14 patients received an active treatment and 11 patients received a control treatment. The numbers of patients with pain scores of 1, 2, 3, 4, and 5 were 11, 2, 0, 1, and 0, respectively, for the active treatment and 3, 1, 4, 2, and 1 for the control treatment.
For ordered categorical data, the relative effect has been considered to be an appropriate measure of an effect size. The relative effect is also called the probabilistic index, the Mann-Whitney parameter, and the win probability (Brunner et al. 2021). This is the probability that observations in one group tend to be larger compared with observations in the other group. This measure can be used for various response variables, including continuous, count, and ordered categorical data. As a statistical test, the Wilcoxon-Mann-Whitney test is widely used to compare two treatment groups. The test statistic is based on the relative effect, which seems to be the natural effect measure of the Wilcoxon-Mann-Whitney test. However, the null hypothesis of this test is not just the relative effect of 50%, but the identical distribution between groups. Note that, in general, the Wilcoxon-Mann-Whitney test is not appropriate for comparing means or medians (Divine et al. 2018).
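As a minimal sketch of this measure, assuming the relative effect is taken as θ = P(X > Y) + 0.5 P(X = Y) for Group 1 scores X and Group 2 scores Y (the direction that reproduces the estimate of 0.21 reported later for the pain score data), the plug-in estimate can be computed directly from two frequency vectors. The function name is illustrative only.

```python
def relative_effect_from_counts(x, y):
    """Plug-in estimate of P(X > Y) + 0.5 * P(X = Y) from category counts."""
    n1, n2 = sum(x), sum(y)
    greater = 0.0  # pairs in which the Group 1 score exceeds the Group 2 score
    ties = 0.0     # pairs falling in the same category
    below = 0      # running count of Group 2 observations in lower categories
    for xi, yi in zip(x, y):
        greater += xi * below
        ties += xi * yi
        below += yi
    return (greater + 0.5 * ties) / (n1 * n2)


# Pain score frequencies (scores 1 to 5) quoted in the Introduction.
active = [11, 2, 0, 1, 0]
control = [3, 1, 4, 2, 1]
theta_hat = relative_effect_from_counts(active, control)
print(round(theta_hat, 2))  # 0.21
```

Of the 14 × 11 = 154 pairs, 14 have the active score above the control score and 37 are tied, so the estimate is (14 + 18.5) / 154.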
As other rank-based methods, the Brunner-Munzel test and the Fligner-Policello test have been proposed for heteroscedastic cases (Brunner and Munzel 2000;Fligner and Policello 1981). The term 'heteroscedastic' means different variances between groups. Although the Fligner-Policello test assumes continuous distributions, the Brunner-Munzel test can be used for arbitrary types of data, such as continuous, count, and ordered categorical data (Brunner and Munzel 2000). For small samples, the Brunner-Munzel test uses the t-distribution and Satterthwaite degrees of freedom. For small or very small samples, Neubert and Brunner (2007) proposed a permutation test based on the studentized rank statistic of the Brunner-Munzel test. The null hypothesis of the Brunner-Munzel test is just the relative effect of 50%, and this test does not require any assumptions on distributions. The test statistics of the Wilcoxon-Mann-Whitney test and the Brunner-Munzel test are both based on the relative effect but the null hypotheses differ, and the estimators of the variances in the test statistics differ (see Appendix A). The Brunner-Munzel test can provide a confidence interval for the relative effect. Apart from the P-value, it is recommended that the effect size and precision of estimates, such as the confidence interval, be reported in biomedical publications (International Committee of Medical Journal Editors 2017).
For analysis of continuous data, simulation studies under unequal variances and unequal skewness have demonstrated that small differences in variances and moderate degrees of skewness can produce large deviations from the nominal type I error rate of the Wilcoxon-Mann-Whitney test when comparing means or medians (Sandvik 2009a, 2009b). Rank-based methods such as the Brunner-Munzel test and the Fligner-Policello test perform similarly, although the Brunner-Munzel test is generally better. For continuous data, Govindarajulu (1968) proposed a confidence interval for the relative effect.
For ordered categorical data, some methods for estimating the variances or confidence intervals for the relative effect have been proposed (Halperin et al. 1989; Ryu and Agresti 2008). The profile-likelihood method, as with the Brunner-Munzel test, does not require any assumptions on distributions, and it is considered appropriate to compare its performance with that of other rank-based methods.
For analysis of ordered categorical data, a common approach is to assign scores to the categories and use methods for comparing means (Ryu and Agresti 2008), and the Wilcoxon-Mann-Whitney test uses the rank as a score. A modeling approach is also used. In this study, we additionally evaluated the proportional-odds model with cumulative probabilities, which is also called the proportional-odds version of the cumulative logit model (Agresti 2010), because it is widely known and sometimes used to analyze data from randomized clinical trials. The model assumes that the treatment effect measured by the odds ratio is the same across the cumulative probabilities of the different cut points, and this is called a proportional-odds assumption. The measure of the effect size of this model is a common odds ratio rather than a relative effect, and the Wald confidence interval is widely used. The common odds ratio provides a relative effect size as opposed to an absolute effect size, but the term 'relative effect' used in this paper is a name of an effect measure.
Some researchers have suggested that variances should be the same between groups under a null hypothesis, because the variances are the same if the treatments are identical (Brownie et al. 1990; Fisher 1939). When two distributions are identical, the relative effect is 50%, the variances are equal, and the proportional-odds assumption is valid. However, in this paper, we consider a null hypothesis that the relative effect is 50% without assuming equal distributions. For example, even if two distributions are different, if both distributions are symmetric over all categories, the relative effect is 50% (Agresti 2010). However, in this case, the variances are unequal, and the proportional-odds assumption is violated. In general, equal variances do not necessarily imply proportional odds, and proportional odds do not necessarily imply equal variances.
In this paper, we compared the profile-likelihood-based confidence intervals for the relative effect and other rank-based methods for ordered categorical data through simulation studies and numerical examples. The structure of this paper is as follows: In Section 2, we explain the profile-likelihood method. We describe the simulation studies and provide a numerical example and the analysis of clinical data in Section 3. We discuss our results in Section 4.

Profile-likelihood-based confidence intervals
Suppose a $2 \times K$ contingency table, where the $K$ categories are ordered for two groups, as shown in Table 1. The contingency table arises from product multinomial models with sample sizes of $n_1$ and $n_2$ for Groups 1 and 2, respectively (Ryu and Agresti 2008). The $n_1$ observations in Group 1 and the $n_2$ observations in Group 2 are each independent and identically distributed. Let $x_i$ and $y_i$ denote the frequencies in the $i$th ($i = 1, \ldots, K$) category of Groups 1 and 2, respectively, and let $p_i$ and $q_i$ denote the corresponding cell probabilities.
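Let $X$ and $Y$ be random variables of the two groups that take ordered values for each category, such as $1, \ldots, K$. Let $X$ and $Y$ be independent. The relative effect, $\theta$, is expressed as
$$\theta = P(X > Y) + \frac{1}{2} P(X = Y).$$
In the above setting, $\theta$ is expressed by the following equation:
$$\theta = \sum_{i=2}^{K} p_i \sum_{j=1}^{i-1} q_j + \frac{1}{2} \sum_{i=1}^{K} p_i q_i .$$
In the case of three categories, $K = 3$, $\theta$ is
$$\theta = p_2 q_1 + (1 - p_1 - p_2)(q_1 + q_2) + \frac{1}{2}\left\{ p_1 q_1 + p_2 q_2 + (1 - p_1 - p_2)(1 - q_1 - q_2) \right\}.$$
Table 1. $2 \times K$ contingency table.
Group 1, frequency: $x_1, x_2, \ldots, x_K$; total $n_1$
Group 1, cell probability: $p_1, p_2, \ldots, p_K = 1 - \sum_{i=1}^{K-1} p_i$; total 1
Group 2, frequency: $y_1, y_2, \ldots, y_K$; total $n_2$
Group 2, cell probability: $q_1, q_2, \ldots, q_K = 1 - \sum_{i=1}^{K-1} q_i$; total 1
Except for a constant term, the likelihood and log-likelihood are expressed as follows:
$$l(p_1, \ldots, p_{K-1}, q_1, \ldots, q_{K-1}) = \prod_{i=1}^{K} p_i^{x_i} \prod_{i=1}^{K} q_i^{y_i}, \qquad \log l = \sum_{i=1}^{K} x_i \log p_i + \sum_{i=1}^{K} y_i \log q_i ,$$
where $p_K = 1 - \sum_{i=1}^{K-1} p_i$ and $q_K = 1 - \sum_{i=1}^{K-1} q_i$. In the case of $K = 3$, the likelihood is
$$l(p_1, p_2, q_1, q_2) = p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{x_3} q_1^{y_1} q_2^{y_2} (1 - q_1 - q_2)^{y_3}.$$
If we express $p_1$ by $\theta, p_2, \ldots, p_{K-1}, q_1, \ldots, q_{K-1}$, the likelihood $l(p_1, \ldots, p_{K-1}, q_1, \ldots, q_{K-1})$ can be expressed by $l(\theta, p_2, \ldots, p_{K-1}, q_1, \ldots, q_{K-1})$. In the case of $K = 3$, $p_1$ and the likelihood are expressed by $\theta$, $p_2$, $q_1$, and $q_2$ as follows:
$$p_1 = \frac{1 + q_1 + q_2 - p_2 (1 - q_1) - 2\theta}{1 + q_2}, \qquad l(\theta, p_2, q_1, q_2) = p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{x_3} q_1^{y_1} q_2^{y_2} (1 - q_1 - q_2)^{y_3}.$$
We are interested in the parameter $\theta$, and the others are nuisance parameters. We obtain the confidence interval for $\theta$ by calculating the profile-likelihood. The main value of profile-likelihood is in constructing confidence regions for nonlinear functions of parameters (Aitkin 2005).
Let $P(\theta)$ denote the profile-likelihood, that is, the likelihood maximized over the nuisance parameters for a fixed $\theta$, and let $\hat\theta$ be the maximum likelihood estimate of $\theta$. The likelihood ratio statistic asymptotically follows
$$2\left\{ \log P(\hat\theta) - \log P(\theta_0) \right\} \sim \chi^2_1 ,$$
where 1 is the dimension of $\theta$, under the hypothesis $\theta = \theta_0$ (Sen 1998). The $100(1 - \alpha)$% confidence interval based on the profile-likelihood is the acceptance region for the hypothesis and is expressed as
$$\left\{ \theta : \log P(\hat\theta) - \log P(\theta) \le 0.5 \times \chi^2_{1,(1-\alpha)} \right\}, \qquad (1)$$
where $\chi^2_{1,(1-\alpha)}$ is the $100(1 - \alpha)$ percentile of the $\chi^2$ distribution with one degree of freedom. Figure 1 shows the profile-likelihood of the relative effect and the 95% confidence interval for pain score data in Section 1. For profile-likelihood, see Aitkin (2005).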
To search for intervals that satisfy Equation (1), we calculate $\log P(\hat\theta)$ and $\log P(\theta)$ around $\hat\theta$. In principle, the points of $\theta$ should be searched continuously, and iterative methods exist (Venzon and Moolgavkar 1988), but this is difficult and computationally unstable. Hence, we calculate $\log P(\theta)$ at discrete points. The discrete points for the confidence limits are $\hat\theta \pm \delta$, $\hat\theta \pm 2\delta$, $\hat\theta \pm 3\delta, \ldots$, where $\delta$ is a positive value. We calculate the differences between $\log P(\hat\theta)$ and $\log P(\theta)$ at these points until the differences are bigger than $0.5 \times \chi^2_{1,(1-\alpha)}$. The maximization needed to obtain $\log P(\theta)$ is made easy by transforming the nuisance parameters. In the case of $K = 3$, $p_2$, $q_1$, and $q_2$ are transformed to $z_{p_2}$, $z_{q_1}$, and $z_{q_2}$, whose ranges are $(-\infty, \infty)$. We join these points, then search for intervals that satisfy Equation (1). We obtain more accurate intervals by taking smaller values for $\delta$; however, this increases the computational burden. Since the relative effect $\theta$ is between 0 and 1, a $\delta$ of 0.001 can be considered sufficiently small. To speed up the calculation, we use a larger $\delta$ at first, then switch to a $\delta$ of 0.001.
Figure 1. Profile-likelihood of the relative effect and the 95% confidence interval for pain score data.
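The grid search described above can be sketched as follows for K = 3. This is only an illustration and not the authors' SAS/IML program. It assumes θ = P(X > Y) + 0.5 P(X = Y), positive counts in every cell, and a coarser step of δ = 0.005. In place of the transformation of the nuisance parameters, it uses a small hand-written Nelder-Mead routine that rejects infeasible probabilities. All function names and the example counts are hypothetical.

```python
import math

def relative_effect(p, q):
    """theta = P(X > Y) + 0.5 * P(X = Y) from cell probabilities."""
    greater = sum(p[i] * sum(q[:i]) for i in range(len(p)))
    return greater + 0.5 * sum(pi * qi for pi, qi in zip(p, q))

def neg_loglik(params, theta, x, y):
    """Negative log-likelihood for K = 3 with p1 determined by theta."""
    p2, q1, q2 = params
    q3 = 1 - q1 - q2
    # p1 solved from the K = 3 expression of theta
    p1 = (1 + q1 + q2 - p2 * (1 - q1) - 2 * theta) / (1 + q2)
    probs = (p1, p2, 1 - p1 - p2, q1, q2, q3)
    if min(probs) <= 0:
        return math.inf  # infeasible point
    return -sum(c * math.log(pr) for c, pr in zip(list(x) + list(y), probs))

def nelder_mead(f, x0, step=0.02, iters=200):
    """Minimal Nelder-Mead minimizer (reflect, expand, contract, shrink)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        v = list(x0)
        v[i] += step
        simplex.append(v)
    vals = [f(v) for v in simplex]
    for _ in range(iters):
        order = sorted(range(n + 1), key=vals.__getitem__)
        simplex = [simplex[i] for i in order]
        vals = [vals[i] for i in order]
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        reflected = [2 * c - w for c, w in zip(centroid, worst)]
        f_ref = f(reflected)
        if f_ref < vals[0]:
            expanded = [3 * c - 2 * w for c, w in zip(centroid, worst)]
            f_exp = f(expanded)
            if f_exp < f_ref:
                simplex[-1], vals[-1] = expanded, f_exp
            else:
                simplex[-1], vals[-1] = reflected, f_ref
        elif f_ref < vals[-2]:
            simplex[-1], vals[-1] = reflected, f_ref
        else:
            contracted = [(c + w) / 2 for c, w in zip(centroid, worst)]
            f_con = f(contracted)
            if f_con < vals[-1]:
                simplex[-1], vals[-1] = contracted, f_con
            else:  # shrink every vertex toward the best one
                best = simplex[0]
                simplex = [best] + [[(b + c) / 2 for b, c in zip(best, v)]
                                    for v in simplex[1:]]
                vals = [vals[0]] + [f(v) for v in simplex[1:]]
    k = min(range(n + 1), key=vals.__getitem__)
    return simplex[k], vals[k]

def profile_ci(x, y, delta=0.005, chi2_crit=3.841458820694124):
    """Grid-search 95% profile-likelihood interval for theta (K = 3)."""
    n1, n2 = sum(x), sum(y)
    p_hat = [c / n1 for c in x]
    q_hat = [c / n2 for c in y]
    theta_hat = relative_effect(p_hat, q_hat)
    max_ll = sum(c * math.log(pr)
                 for c, pr in zip(list(x) + list(y), p_hat + q_hat))
    limits = []
    for sign in (-1, 1):
        theta, start = theta_hat, [p_hat[1], q_hat[0], q_hat[1]]
        while 0 < theta + sign * delta < 1:
            cand = theta + sign * delta
            best, nll = nelder_mead(lambda z: neg_loglik(z, cand, x, y), start)
            if max_ll + nll > 0.5 * chi2_crit:  # log P(theta_hat) - log P(cand)
                break
            theta, start = cand, best  # warm start for the next grid point
        limits.append(theta)
    return theta_hat, limits[0], limits[1]

theta_hat, lower, upper = profile_ci([12, 8, 10], [6, 14, 10])
print(round(theta_hat, 3), round(lower, 3), round(upper, 3))
```

Each limit is reported as the last grid point still inside the acceptance region, so the resolution is δ. Warm-starting each maximization from the previous grid point keeps the search close to a feasible optimum.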

Simulation study
In simulation studies, we examined four settings of the cell probabilities (Step 1), with four sample sizes and three patterns of the allocation ratio (Step 2), and analyzed by five methods (Step 3) to show actual Type I error rates and other indicators (Step 4). We describe each step below.
We examined four patterns of the cell probabilities of three ordered categories in the two groups, shown in Figure 2: (I) equal symmetric distributions, (II) equal asymmetric distributions, (III) unequal symmetric distributions with unequal dispersion, and (IV) unequal distributions with equal dispersion. Note that the relative effects are 50% in settings I-III and 36% in setting IV. The Brunner-Munzel test uses the variance of the placements for each group. See Brunner and Munzel (2000) and Result 3.21 in Brunner et al. (2018) for details. For ordered categorical data, the asymptotic variance for each group expressed by the cell probabilities is given in Munzel and Hauschke (2003) (see Appendix B). The asymptotic variances of the two groups in each setting are given in Figure 2. In setting III, the variance is larger in Group 1 than in Group 2. The variances in the other settings are the same between groups. The odds ratios of the cumulative probabilities for the two cut points are (1, 1), (1, 1), (8/3, 3/8), and (8/3, 8/3) in settings I-IV, respectively. Therefore, in setting III, the proportional-odds assumption is violated, though the estimated common odds ratio is 1 if we analyze data with these proportions using a proportional-odds model. In the other settings, the proportional-odds assumption is valid, and the common odds ratio is 1 in settings I and II and 8/3 in setting IV.
Figure 2. Cell probabilities in the simulation settings I-IV. RE, relative effect; OR, odds ratio; aVars, asymptotic variances.
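The properties stated for setting III can be checked with a short sketch. It assumes that setting III uses the cell probabilities (0.4, 0.2, 0.4) and (0.2, 0.6, 0.2) quoted in the numerical example section, that θ = P(X > Y) + 0.5 P(X = Y), and that the group-wise variance is the population variance of the normalized placements. The helper names are illustrative.

```python
def relative_effect(p, q):
    """theta = P(X > Y) + 0.5 * P(X = Y) from cell probabilities."""
    greater = sum(p[i] * sum(q[:i]) for i in range(len(p)))
    return greater + 0.5 * sum(pi * qi for pi, qi in zip(p, q))

def cumulative_odds_ratios(p, q):
    """Odds ratio of the cumulative probabilities at each cut point."""
    ratios, cp, cq = [], 0.0, 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        cp += pi
        cq += qi
        ratios.append((cp / (1 - cp)) / (cq / (1 - cq)))
    return ratios

def placement_variance(p, q):
    """Variance of the placement of a p-distributed score within distribution q."""
    cum, placements = 0.0, []
    for qi in q:
        placements.append(cum + 0.5 * qi)  # mid-distribution function of q
        cum += qi
    mean = sum(pi * v for pi, v in zip(p, placements))
    return sum(pi * (v - mean) ** 2 for pi, v in zip(p, placements))

p = (0.4, 0.2, 0.4)  # Group 1
q = (0.2, 0.6, 0.2)  # Group 2
theta = relative_effect(p, q)
odds_ratios = cumulative_odds_ratios(p, q)
var1 = placement_variance(p, q)
var2 = placement_variance(q, p)
print(theta, odds_ratios, var1, var2)
```

Under these assumptions the relative effect is 0.5, the two cumulative odds ratios are 8/3 and 3/8, and the Group 1 variance exceeds the Group 2 variance, in line with the description of setting III.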
We compared the profile-likelihood-based confidence intervals for the relative effect with the Brunner-Munzel test with t-approximation and Satterthwaite degrees of freedom, the Wilcoxon-Mann-Whitney test with z-approximation, and the proportional-odds model for two-group comparison of ordered categorical data. For small and very small sample sizes, we also examined the Neubert-Brunner test, which is the permutation version of the Brunner-Munzel test. The test statistics are given in Appendix A. For the Wilcoxon-Mann-Whitney test, we used the empirical variance of the ranks and a continuity correction. In the proportional-odds model, the cumulative logits are a linear function of intercepts corresponding to the cut points and a treatment effect. Let the coefficient of the treatment effect be $\beta$; then the common odds ratio is $e^{\beta}$. We performed the Wald test for $\beta = 0$. In the Neubert-Brunner test, we used 10,000 randomly selected permutations.
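As an illustration of the Wilcoxon-Mann-Whitney test with z-approximation, the sketch below works directly on category counts, using midranks, the empirical variance of the ranks, and a continuity correction of 0.5 on the rank sum. It is not the SAS procedure used in the study; the function name and the exact form of the correction are assumptions.

```python
import math

def wmw_test(x, y, continuity=True):
    """Two-sided WMW z-test from category counts x (Group 1) and y (Group 2)."""
    n1, n2 = sum(x), sum(y)
    n_total = n1 + n2
    rank_sum = 0.0  # sum of the midranks in Group 1
    ss = 0.0        # squared deviations of all midranks from (N + 1) / 2
    lower = 0       # number of observations in lower categories
    for xi, yi in zip(x, y):
        t = xi + yi
        midrank = lower + (t + 1) / 2
        rank_sum += xi * midrank
        ss += t * (midrank - (n_total + 1) / 2) ** 2
        lower += t
    expected = n1 * (n_total + 1) / 2
    variance = n1 * n2 * ss / (n_total * (n_total - 1))
    diff = abs(rank_sum - expected)
    if continuity:
        diff = max(diff - 0.5, 0.0)
    z = diff / math.sqrt(variance)  # fails if all data fall in one category
    return z, math.erfc(z / math.sqrt(2))

# Pain score frequencies from the Introduction.
z, p_value = wmw_test([11, 2, 0, 1, 0], [3, 1, 4, 2, 1])
print(round(z, 3), round(p_value, 4))
```

For the pain score data this gives a two-sided P-value of about 0.0077.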
For settings I-III with relative effects of 50%, we compared the actual type I error rates of the statistical methods with a two-sided nominal type I error rate of 5%. For the profile-likelihood method, which provides only confidence intervals, statistical significance is determined according to whether the 95% confidence interval of the relative effect contains 50%, and 1 − coverage probability is used. For setting IV, we examined the powers under large and medium sample sizes. Under small and very small sample sizes, we did not examine the powers, because the type I error rates tend to differ from the nominal level and the interpretation of the power is difficult. The 1 − coverage values are also provided for each method, except for the Wilcoxon-Mann-Whitney test.
We used SAS 9.4 for the simulation studies and the numerical examples presented in the next subsections. We used the IML procedure for the profile-likelihood method, the perm_bf macro for the Brunner-Munzel test with t-approximation and the Neubert-Brunner test (Neubert and Brunner 2007), the npar1way procedure for the Wilcoxon-Mann-Whitney test, and the logistic procedure for the proportional-odds model.
Tables 2 and 3 show the results of the simulation studies with large and medium sample sizes, respectively. The following is a summary of the results for the actual type I error rates or 1 − coverage values. If these differ largely from the nominal level, we do not recommend using the methods. In setting IV of unequal distributions with equal dispersion, the powers of all the tests were similar.
Table 4 shows the results of the simulation studies with small and very small sample sizes. In the very small sample sizes of settings I and II, data sets with one-point distributions, such as $(x_1, x_2, x_3, y_1, y_2, y_3) = (0, 9, 0, 0, 9, 0)$, occurred 13 or 9 times among 100,000 data sets, and these were excluded from the comparison for all the methods. The actual type I error rates of the profile-likelihood method and the Brunner-Munzel test were larger than the nominal level. The Brunner-Munzel test was slightly liberal and the profile-likelihood method was even more liberal. The term 'liberal' means that the actual type I error rate is larger than the nominal level, and the term 'conservative' means that it is smaller than the nominal level. The actual type I error rates of the Neubert-Brunner test, the Wilcoxon-Mann-Whitney test, and the proportional-odds model were smaller than the nominal level in the case of equal distributions, and the tests were conservative. The Wilcoxon-Mann-Whitney test was the least conservative among them.
Similar to the results in large and medium sample sizes, the actual type I error rates of the Wilcoxon-Mann-Whitney test and the proportional-odds model were largely different from the nominal level under unequal dispersion, especially under unequal sample sizes.
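The liberal behavior of the Wilcoxon-Mann-Whitney test under unequal dispersion can be illustrated with a small Monte Carlo sketch. It assumes the setting III cell probabilities (0.4, 0.2, 0.4) and (0.2, 0.6, 0.2) with sample sizes (60, 120), and uses 4,000 replicates rather than the 100,000 data sets of the study, so the estimate is only approximate. The function names and the seed are arbitrary choices.

```python
import math
import random

def wmw_p_value(x, y):
    """Two-sided WMW P-value from category counts, with continuity correction."""
    n1, n2 = sum(x), sum(y)
    n_total = n1 + n2
    rank_sum, ss, lower = 0.0, 0.0, 0
    for xi, yi in zip(x, y):
        t = xi + yi
        midrank = lower + (t + 1) / 2
        rank_sum += xi * midrank
        ss += t * (midrank - (n_total + 1) / 2) ** 2
        lower += t
    variance = n1 * n2 * ss / (n_total * (n_total - 1))
    if variance == 0:
        return 1.0  # all observations in one category: no evidence either way
    diff = max(abs(rank_sum - n1 * (n_total + 1) / 2) - 0.5, 0.0)
    return math.erfc(diff / math.sqrt(2 * variance))

def simulate_rate(p, q, n1, n2, reps=4000, seed=1):
    """Fraction of simulated data sets rejected at the two-sided 5% level."""
    rng = random.Random(seed)
    cats = range(len(p))
    rejections = 0
    for _ in range(reps):
        xs = rng.choices(cats, weights=p, k=n1)
        ys = rng.choices(cats, weights=q, k=n2)
        x = [xs.count(c) for c in cats]
        y = [ys.count(c) for c in cats]
        if wmw_p_value(x, y) < 0.05:
            rejections += 1
    return rejections / reps

# The group with the larger variance (Group 1) has the smaller sample size.
rate = simulate_rate((0.4, 0.2, 0.4), (0.2, 0.6, 0.2), 60, 120)
print(rate)
```

With the relative effect equal to 50% in this setting, every rejection is a type I error, so a rate well above 0.05 reflects the liberal behavior described above.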

Numerical example
For interpretation of P-values and confidence intervals in the case of $\theta \ne 50\%$, we produced a paradoxical numerical example under settings similar to the simulation study of unequal distributions with $(p_1, p_2, p_3, q_1, q_2, q_3) = (0.4, 0.2, 0.4, 0.2, 0.6, 0.2)$. For this setting in Tables 2 and 3, the actual type I error rates of the profile-likelihood method and the Brunner-Munzel test were close to the nominal level. However, those of the Wilcoxon-Mann-Whitney test and the proportional-odds model were far from the nominal level. They were liberal, strongly liberal, or conservative depending on the allocation ratio. In this setting, the two groups had unequal variances and the proportional-odds assumption was violated.
The results of the numerical example are shown in Table 5. The confidence intervals of the profile-likelihood method and the Brunner-Munzel test were almost the same. When $n_1 = n_2$, these were not significant. When $n_1 < n_2$ (the group with a larger variance had a smaller sample size), these were not significant and showed wider confidence intervals and larger P-values compared with those under $n_1 = n_2$. When $n_1 > n_2$ (the group with a larger variance had a larger sample size), these were significant and showed narrower confidence intervals and smaller P-values compared with those under $n_1 = n_2$. On the contrary, when $n_1 = n_2$, the Wilcoxon-Mann-Whitney test and the proportional-odds model were significant, unlike the other methods. When $n_1 < n_2$ (the group with a larger variance had a smaller sample size), the proportional-odds model was still significant and showed an even smaller P-value compared with that under $n_1 = n_2$, unlike the other methods. The Wilcoxon-Mann-Whitney test was not significant and showed a larger P-value compared with that under $n_1 = n_2$, but the P-value was clearly smaller than that of the Brunner-Munzel test. When $n_1 > n_2$ (the group with a larger variance had a larger sample size), the Wilcoxon-Mann-Whitney test and the proportional-odds model were not significant, unlike the other methods, and they showed larger P-values compared with those under $n_1 = n_2$.
The results of the numerical example for the Wilcoxon-Mann-Whitney test and the proportional-odds model corresponded to the results of the simulation studies. Note that the profile-likelihood method and the Brunner-Munzel test kept the nominal type I error rate in the simulation studies. When $n_1 = n_2$, the Wilcoxon-Mann-Whitney test and the proportional-odds model were liberal in the simulation studies and significant in the numerical example, although the other methods were not significant. When $n_1 < n_2$ (the group with a larger variance had a smaller sample size), the Wilcoxon-Mann-Whitney test and the proportional-odds model were strongly liberal in the simulation studies, and the proportional-odds model was still significant and both methods showed smaller P-values than those of the profile-likelihood method and the Brunner-Munzel test in the numerical example. When $n_1 > n_2$ (the group with a larger variance had a larger sample size), the Wilcoxon-Mann-Whitney test and the proportional-odds model were conservative in the simulation studies and not significant in the numerical example, although the other methods were significant because the larger sample size was allocated to the group with the larger variance.
By changing the allocation ratio to increase the sample size of the larger variance group, the results of the profile-likelihood method and the Brunner-Munzel test changed from non-significance to significance. On the contrary, the results of the Wilcoxon-Mann-Whitney test and the proportional-odds model changed from significance to non-significance, which is completely opposite to the other methods. Although this is one numerical example, the results generally correspond to those of the simulation studies.
Table 5. 95% confidence intervals (CIs) of relative effects (REs) or odds ratios (ORs) and P-values obtained from the numerical example with cell probabilities $(p_1, p_2, p_3, q_1, q_2, q_3) = (0.45, 0.2, 0.35, 0.15, 0.6, 0.25)$ and relative effect $\theta = 0.43$.
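The relative effect of 0.43 quoted for the numerical example can be reproduced by hand, assuming the relative effect is defined as θ = P(X > Y) + 0.5 P(X = Y) with X from Group 1 (cell probabilities p) and Y from Group 2 (cell probabilities q):

```latex
\begin{aligned}
P(X > Y) &= p_2 q_1 + p_3 (q_1 + q_2) = 0.2 \times 0.15 + 0.35 \times 0.75 = 0.2925, \\
P(X = Y) &= p_1 q_1 + p_2 q_2 + p_3 q_3 = 0.0675 + 0.12 + 0.0875 = 0.275, \\
\theta &= 0.2925 + 0.5 \times 0.275 = 0.43 .
\end{aligned}
```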

Analysis of clinical data
We analyzed the pain score data in Section 1, and Table 6 shows the results. The estimated variances of the two groups in the Brunner-Munzel test are 0.035 and 0.066. This is a case of unequal distributions with unequal dispersion. Because the data show unequal dispersion, we do not recommend using the Wilcoxon-Mann-Whitney test. The relative effect, the probability that the observations in the active treatment group tend to be larger (higher pain score), was estimated to be 0.21, and the 95% confidence intervals were 0.08-0.41 for the profile-likelihood method and 0.02-0.40 for the Brunner-Munzel test. For the proportional-odds model, we performed the score test for the proportional-odds assumption, and the test statistic was 23.2 compared to the chi-square distribution with 3 degrees of freedom. The test was clearly significant, and the proportional-odds assumption was unreasonable. We do not recommend using the proportional-odds model. All of the tests provided very small P-values.
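The quantities reported for the Brunner-Munzel test in this analysis can be recomputed from the frequencies with a short sketch. It assumes the relative effect is P(X > Y) + 0.5 P(X = Y) with X the active treatment group, and that the group variances are the empirical variances of the normalized placements. The t-quantile needed for the confidence interval is omitted because it is not in the Python standard library, and the function name is illustrative.

```python
import math

def brunner_munzel_from_counts(x, y):
    """Estimate, placement variances, statistic, and Satterthwaite df."""
    n1, n2 = sum(x), sum(y)

    def placements(other, n_other):
        """Normalized placement of each category within the other group."""
        below, out = 0, []
        for c_other in other:
            out.append((below + 0.5 * c_other) / n_other)
            below += c_other
        return out

    def mean_var(counts, values, n):
        mean = sum(c * v for c, v in zip(counts, values)) / n
        var = sum(c * (v - mean) ** 2 for c, v in zip(counts, values)) / (n - 1)
        return mean, var

    theta_hat, v1 = mean_var(x, placements(y, n2), n1)
    _, v2 = mean_var(y, placements(x, n1), n2)
    a, b = v1 / n1, v2 / n2
    se = math.sqrt(a + b)  # standard error of theta_hat
    w = (theta_hat - 0.5) / se
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return theta_hat, v1, v2, w, df

theta_hat, v1, v2, w, df = brunner_munzel_from_counts(
    [11, 2, 0, 1, 0], [3, 1, 4, 2, 1]
)
print(round(theta_hat, 2), round(v1, 3), round(v2, 3), round(w, 2), round(df, 1))
```

The estimate of 0.21 and the variances of 0.035 and 0.066 agree with the values given in the text. The statistic is about −3.14 on roughly 17.7 degrees of freedom.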

Discussion
In the simulation studies, the Brunner-Munzel test performed as well as the computer-intensive profile-likelihood method in medium and large sample sizes.
• The actual type I error rates (or 1 − coverage probability) of the profile-likelihood method and the Brunner-Munzel test were almost nominal and similar.
• The actual type I error rates of the Wilcoxon-Mann-Whitney test and the proportional-odds model were far from the nominal level under unequal distributions with the relative effect of 50%.
The null hypothesis of the Wilcoxon-Mann-Whitney test is the identical distribution between groups, not just the relative effect of 50%.
• Therefore, P-values of the Wilcoxon-Mann-Whitney test cannot be considered as the pure signal of differences from the relative effect of 50% under unequal dispersion and are difficult to interpret.
• The proportional-odds model is also difficult to interpret when the proportional-odds assumption is violated.
When two distributions are different and both symmetric over all categories, the relative effect is 50%, the variances are unequal, and the proportional-odds assumption is violated. This is setting III in the simulation studies, and the Wilcoxon-Mann-Whitney test and the proportional-odds model do not perform well. When the variances are equal and the proportional-odds assumption holds, the powers of these four methods were almost the same in medium or large sample sizes. In small and very small samples, the slightly liberal nature of the Brunner-Munzel test was seen in other simulation studies with continuous data (Neubert and Brunner 2007) and ordinal data (Delaney and Vargha 2002). In our simulation studies with small and very small samples, the profile-likelihood method was more liberal than the Brunner-Munzel test. This may be partially explained by the use of t-approximation in the Brunner-Munzel test instead of z-approximation for small samples.
Table 6. 95% confidence intervals (CIs) of relative effects (REs) or odds ratios (ORs) and P-values of the actual data.
Frequency: $(x_1, x_2, x_3, x_4, x_5) = (11, 2, 0, 1, 0)$; $(y_1, y_2, y_3, y_4, y_5) = (3, 1, 4, 2, 1)$
Profile-likelihood: RE = 0.21, CI = .08-.41
Brunner-Munzel: RE = 0.21, CI = .02-.40, P = .0058
Neubert-Brunner: P = .0076
Wilcoxon-Mann-Whitney: P = .0077
Proportional-odds model: OR = 11.1, CI = 1.9-64.7, P = .0074
For analysis of continuous data, it is known that Student's t-test, which assumes equal variances between groups, has the following property. In the case of unequal sample sizes and unequal variances, the actual type I error rate is not at the nominal level. When a group with a larger variance has a smaller sample size, the actual rate is over the nominal level and the test is liberal. When a group with a larger variance has a larger sample size, the actual rate is under the nominal level and the test is conservative (Algina 2005). Similarly, Brunner and Munzel (2000) reported that the Wilcoxon-Mann-Whitney test is liberal when a group with a larger variance has a smaller sample size and is conservative when a group with a larger variance has a larger sample size. In our simulation studies of ordered categorical data, the Wilcoxon-Mann-Whitney test has this property.
Therefore, in the case of unequal sample sizes and unequal variances, tests assuming unequal variances, such as the t-test with Satterthwaite's approximation (Satterthwaite 1946), which is also called Welch's t-test, are widely used. It is known that the power of the t-test with Satterthwaite's approximation is maximized when the ratio of the sample sizes is approximately equal to the ratio of the standard deviations (Dette and O'Brien 2004). Then, the power increases to some extent if more subjects are allocated to the group with a larger variance, and the power decreases if fewer subjects are allocated to the group with a larger variance. In our numerical example of ordered categorical data with unequal variances, the profile-likelihood method and the Brunner-Munzel test showed this property. These tests were significant when the group with a larger variance had a larger sample size, and not significant when the group with a larger variance had a smaller sample size.
In the case of equal sample sizes, Student's t-test for continuous data keeps a nominal type I error rate asymptotically even under unequal variances between groups. Student's t-test and the t-test with Satterthwaite's approximation use the same test statistic in equal sample sizes. The only difference is the degrees of freedom, and there are no essential differences in large sample sizes. However, even if the sample sizes are large and equal, the Wilcoxon-Mann-Whitney test and the proportional-odds model for ordered categorical data do not keep a nominal type I error rate when the distributions are unequal with a relative effect of 50% and the proportional-odds assumption is violated. Based on the asymptotic formula (Pratt 1964), in equal sample sizes, the farther the ratio of the variances is from 1, the more liberal the Wilcoxon-Mann-Whitney test is. Furthermore, if the ratio of the sample sizes is not far from 1, there are cases in which the test is not conservative, but liberal, even when a group with a larger variance has a larger sample size.
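The dependence on the allocation ratio can be illustrated with a large-sample sketch. This is not Pratt's formula itself but a simplified calculation in the same spirit: it compares the variance that the Wilcoxon-Mann-Whitney statistic assumes (from the pooled ranks) with the variance of the estimated relative effect under unequal distributions, and converts the ratio into an asymptotic rejection probability at a relative effect of 50%. The setting III cell probabilities, the neglect of the continuity correction, and the function names are assumptions.

```python
import math

def placement_variance(p, q):
    """Variance of the placement of a p-distributed score within distribution q."""
    cum, placements = 0.0, []
    for qi in q:
        placements.append(cum + 0.5 * qi)
        cum += qi
    mean = sum(pi * v for pi, v in zip(p, placements))
    return sum(pi * (v - mean) ** 2 for pi, v in zip(p, placements))

def wmw_asymptotic_rate(p, q, n1, n2, z_crit=1.959963984540054):
    """Approximate two-sided rejection rate of the WMW z-test when theta = 0.5."""
    n_total = n1 + n2
    # Large-sample variance of the estimated relative effect.
    true_var = placement_variance(p, q) / n1 + placement_variance(q, p) / n2
    # Variance implied by the pooled ranks, as used by the WMW statistic.
    pooled = [(n1 * pi + n2 * qi) / n_total for pi, qi in zip(p, q)]
    assumed_var = n_total * placement_variance(pooled, pooled) / (n1 * n2)
    return math.erfc(z_crit * math.sqrt(assumed_var / true_var) / math.sqrt(2))

p = (0.4, 0.2, 0.4)  # Group 1, larger variance
q = (0.2, 0.6, 0.2)  # Group 2
rates = {alloc: wmw_asymptotic_rate(p, q, *alloc)
         for alloc in [(90, 90), (60, 120), (120, 60)]}
for alloc, rate in rates.items():
    print(alloc, round(100 * rate, 2))
```

Under these assumptions the approximate rates are about 6.4%, 9.3%, and 3.9% for (90, 90), (60, 120), and (120, 60), which is close to the simulated rates quoted in this Discussion for the Wilcoxon-Mann-Whitney test in large samples.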
We used the Wald test for the proportional-odds model because the Wald confidence interval of the odds ratio is widely used. The score test of the treatment effect for the proportional-odds model is known to be equivalent to the discrete version of the Wilcoxon-Mann-Whitney test (Agresti 2010), and both tests can be expressed in terms of the difference between the mean ranks for two groups. In the simulation with unequal dispersion, the actual type I error rates of the score test were 6.38%, 9.32%, and 3.95% in large samples of (90, 90), (60, 120), and (120, 60), respectively, compared with 6.27%, 9.20%, and 3.86% for the Wilcoxon-Mann-Whitney test. The P-values of the score test for the numerical example in Section 3.2 were 0.046, 0.056, and 0.060 compared with 0.046, 0.057, and 0.061 for the Wilcoxon-Mann-Whitney test. Under these settings of unequal dispersion, the results of the score test were more similar to those of the Wilcoxon-Mann-Whitney test than to those of the Wald test of the same model.
In our numerical example, under equal sample sizes, the Wilcoxon-Mann-Whitney test and the proportional-odds model were significant. However, this might be due to the liberal nature of these tests. Under equal sample sizes, the actual type I error rates were over 6% for both tests in the corresponding simulation studies. When the group with a larger variance had a smaller sample size, a setting in which the power generally decreases (e.g. for the t-test with Satterthwaite's approximation), the Wilcoxon-Mann-Whitney test was not significant, but the P-value was about half that of the Brunner-Munzel test (0.057 vs. 0.103), and this might be due to the liberal nature of the Wilcoxon-Mann-Whitney test. In the simulation studies, the actual type I error rates were over 9% for the Wilcoxon-Mann-Whitney test, although the profile-likelihood method and the Brunner-Munzel test kept the nominal level. The proportional-odds model was significant, and this might be due to the liberal nature of the test. In the simulation studies, the actual type I error rates were over 13% for this model. When the group with a larger variance had a larger sample size, a setting in which the power generally increases, the Wilcoxon-Mann-Whitney test and the proportional-odds model were not significant, and this might be due to the conservative nature of these tests. In the simulation studies, the actual type I error rates were less than 4% for the Wilcoxon-Mann-Whitney test and less than 3% for the proportional-odds model.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by JSPS KAKENHI Grant Number JP17K00066 and the ISM Cooperative Research Program (2021-ISMCRP-1016). The authors contributed equally to this work.

Appendices
Appendix A. Test statistics
We provide the test statistics of the Brunner-Munzel test with t-approximation and Satterthwaite degrees of freedom, the Wilcoxon-Mann-Whitney test, and the Neubert-Brunner test following Brunner et al. (2018) and Neuhäuser and Ruxton (2009). Let $R_{lm}$ ($l = 1, 2$ and $m = 1, \ldots, n_l$) be the rank of $X_m$ (for $l = 1$) or $Y_m$ (for $l = 2$) among all $N = n_1 + n_2$ observations. Let $R_{\mathrm{int},1m}$ be the internal rank of $X_m$ among the $n_1$ observations $X_1, \ldots, X_{n_1}$. Let $R_{\mathrm{int},2m}$ be the internal rank of $Y_m$ among the $n_2$ observations $Y_1, \ldots, Y_{n_2}$. Let $\bar{R}_l = \frac{1}{n_l} \sum_{m=1}^{n_l} R_{lm}$ ($l = 1, 2$) be the average rank. The estimator of the relative effect in this notation is
$$\hat\theta = \frac{1}{n_2} \left( \bar{R}_1 - \frac{n_1 + 1}{2} \right) = \frac{\bar{R}_1 - \bar{R}_2}{N} + \frac{1}{2}.$$
The test statistic of the Brunner-Munzel test is
$$W_{BM} = \frac{\sqrt{N} \left( \hat\theta - 1/2 \right)}{\hat\sigma_{BM}} = \frac{\bar{R}_1 - \bar{R}_2}{\sqrt{N \sum_{l=1}^{2} S_l^2 / (N - n_l)}} \sqrt{\frac{n_1 n_2}{N}} .$$
The variance estimator is
$$\hat\sigma_{BM}^2 = \frac{N}{n_1 n_2} \sum_{l=1}^{2} \frac{S_l^2}{N - n_l},$$
where
$$S_l^2 = \frac{1}{n_l - 1} \sum_{m=1}^{n_l} \left( R_{lm} - R_{\mathrm{int},lm} - \bar{R}_l + \frac{n_l + 1}{2} \right)^2 .$$
$S_l^2 / (N - n_l)^2$ is the empirical variance of the normalized placements $(R_{lm} - R_{\mathrm{int},lm}) / (N - n_l)$. Under the null hypothesis of $\theta = 0.5$, $W_{BM}$ can be approximated by a $t_f$ distribution, where the degrees of freedom $f$ are estimated by
$$f = \frac{\left[ \sum_{l=1}^{2} S_l^2 / (N - n_l) \right]^2}{\sum_{l=1}^{2} \left[ S_l^2 / (N - n_l) \right]^2 / (n_l - 1)} .$$
A two-sided $(1 - \alpha)$-confidence interval is given by
$$\left[ \hat\theta - t_{f;1-\alpha/2} \, \hat\sigma_{BM} / \sqrt{N}, \;\; \hat\theta + t_{f;1-\alpha/2} \, \hat\sigma_{BM} / \sqrt{N} \right],$$
where $t_{f;1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of a central t-distribution with $f$ degrees of freedom. The test statistic of the Wilcoxon-Mann-Whitney test is
$$W_{WMW} = \frac{\bar{R}_1 - \bar{R}_2}{\hat\sigma_R} \sqrt{\frac{n_1 n_2}{N}},$$
with the variance estimator
$$\hat\sigma_R^2 = \frac{1}{N - 1} \sum_{l=1}^{2} \sum_{m=1}^{n_l} \left( R_{lm} - \frac{N + 1}{2} \right)^2 .$$
Under the null hypothesis of identical distributions in the two groups, it asymptotically follows the standard normal distribution. The Neubert-Brunner test calculates $W_{BM}$ for the actual data and for permutations of it. Each permutation assigns the $N$ values to the two groups at random, subject to the restriction that the sample sizes, $n_1$ and $n_2$, are the same as in the actual data.
The two-sided P-value of the test is the fraction of permutations for which the absolute value of $W_{BM}$ is equal to or larger than that actually observed.
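The permutation procedure can be sketched as follows for category counts. This is an illustration rather than the perm_bf macro: it assumes the studentized statistic is computed from normalized placements, uses 2,000 random permutations instead of 10,000, and treats a zero variance estimate as an infinite statistic unless the estimated relative effect is exactly 0.5. The function names and the seed are arbitrary.

```python
import math
import random

def bm_statistic(x, y):
    """Studentized Brunner-Munzel statistic from category counts."""
    n1, n2 = sum(x), sum(y)

    def moments(own, other, n_own, n_other):
        below, total, total_sq = 0, 0.0, 0.0
        for c_own, c_other in zip(own, other):
            v = (below + 0.5 * c_other) / n_other  # normalized placement
            total += c_own * v
            total_sq += c_own * v * v
            below += c_other
        mean = total / n_own
        return mean, (total_sq - n_own * mean * mean) / (n_own - 1)

    theta_hat, v1 = moments(x, y, n1, n2)
    _, v2 = moments(y, x, n2, n1)
    se2 = v1 / n1 + v2 / n2
    if se2 < 1e-12:  # degenerate permutation: no within-group variation
        return 0.0 if theta_hat == 0.5 else math.copysign(math.inf, theta_hat - 0.5)
    return (theta_hat - 0.5) / math.sqrt(se2)

def permutation_p_value(x, y, n_perm=2000, seed=1):
    """Two-sided P-value from random reassignments of the pooled scores."""
    rng = random.Random(seed)
    k, n1 = len(x), sum(x)
    pooled = [c for c in range(k) for _ in range(x[c] + y[c])]
    observed = abs(bm_statistic(x, y))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        g1, g2 = pooled[:n1], pooled[n1:]
        px = [g1.count(c) for c in range(k)]
        py = [g2.count(c) for c in range(k)]
        if abs(bm_statistic(px, py)) >= observed - 1e-12:
            hits += 1
    return hits / n_perm

# Pain score frequencies from the Introduction.
p_perm = permutation_p_value([11, 2, 0, 1, 0], [3, 1, 4, 2, 1])
print(p_perm)
```

Because the permutations are sampled at random, the P-value varies slightly with the seed and with the number of permutations.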