A Three-arm Non-inferiority Test for Heteroscedastic Data

In this paper, we consider the three-arm non-inferiority trial in the statistical testing framework established by Hida and Tango (2011). In contrast to existing methods, we allow the data to be heteroscedastic. Several new test statistics are developed. Numerical simulations illustrate the performance of the proposed methods and compare them with some existing methods. We find that a recently proposed procedure may not control the type I error well when the data are heteroscedastic. Among the three new methods, the Improved Score test has the best numerical performance.


Introduction
Two-arm non-inferiority clinical trials with an active control have been widely used in the pharmaceutical industry. However, this design has several problems. It cannot assess assay sensitivity, because it lacks a placebo arm against which the efficacy of the active control could be judged. Moreover, it is sensitive to the choice of the non-inferiority margin ∆, which must be determined from the effect size of the reference treatment (CPMP, 1999). The choice of ∆ has long been controversial (Lange et al., 2005) and is still debated today.
To address these problems, a three-arm study including a placebo group is recommended, which avoids the strict quality requirements imposed by external validation in two-arm studies (ICH, 2000). For three-arm trials, Pigeot et al. (2003) proposed formulating non-inferiority as a fraction of the trial sensitivity. This resulted in hypotheses based on the ratio of differences in means. For a given threshold θ, the alternative hypothesis states that the efficacy of the experimental drug is more than θ × 100 percent of the efficacy of the reference compound, both measured relative to placebo. For this ratio hypothesis, a t-distributed test statistic was derived, assuming normality and variance homogeneity (Pigeot et al., 2003). Schwartz and Denne (2006) described a two-stage procedure for sample size optimization. Hasler et al. (2008) gave an extension to the case of heterogeneous group variances, and Munzell (2009) presented a non-parametric version, suggesting a rank-based test for a three-arm non-inferiority trial using relative treatment effects. These studies belong to the so-called fraction methods, which all formulate the non-inferiority margin as a fraction of the trial sensitivity. Other studies, including Koch and Tangen (1999), Koch and Röhmel (2004), Röhmel (2005), Kieser and Friede (2007), and Ng (2008), follow the same line.
Hida and Tango (2011) proposed another statistical test procedure for three-arm non-inferiority trials to assess assay sensitivity, with the margin ∆ defined as a pre-specified difference between treatments, for the situation in which the primary endpoints are normally distributed with a common but unknown variance. To be precise, assume that the primary clinical outcomes under the experimental, reference, and placebo treatments, $X_{Ei}$, $X_{Rj}$, $X_{Pk}$, are mutually independent and normally distributed with unknown variances.
That is, $X_{Ei} \sim N(\mu_E, \sigma_E^2)$, $i = 1, \ldots, n_E$; $X_{Rj} \sim N(\mu_R, \sigma_R^2)$, $j = 1, \ldots, n_R$; and $X_{Pk} \sim N(\mu_P, \sigma_P^2)$, $k = 1, \ldots, n_P$, where the sample sizes $n_E$, $n_R$, and $n_P$ are not necessarily equal. Under the assumption that $\sigma_E^2 = \sigma_R^2 = \sigma_P^2$, Hida and Tango (2011) constructed two sets of hypotheses:
$$H_{01}: \mu_E \le \mu_R - \Delta \quad \text{versus} \quad \mu_E > \mu_R - \Delta,$$
$$H_{02}: \mu_R \le \mu_P + \Delta \quad \text{versus} \quad \mu_R > \mu_P + \Delta.$$
[...] proposed a test procedure to improve the non-inferiority test under non-normal distributions. Many other scholars have developed methods for three-arm non-inferiority testing with binary endpoints. For example, Tang and Tang (2014) proposed two asymptotic approaches for testing three-arm non-inferiority via rate difference based [...] the variances of $X_{Ei}$, $X_{Rj}$, $X_{Pk}$ are the same. However, this may not be true in practice. In this paper, we develop several procedures for simultaneously testing $H_{01}$ and $H_{02}$ without the common variance assumption.
The paper is organized as follows. In Section 2, we develop the test statistics. Simulation studies are reported in Section 3 where we demonstrate the superior performance of our proposed tests over existing methods. We end the paper with a discussion in Section 4.

Test statistics construction
In the following, we consider three test methods: Welch's t-test, the Score test, and the Improved Score test.

Welch's t-test
Welch's t-test, first proposed by Welch (1938), gives an approximate solution to the Behrens-Fisher problem, that is, the problem of comparing the means of two normal populations when the ratio of the population variances is unknown. Compared with Student's t-test, Welch's t-test is more reliable when the two samples have unequal variances.
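The basic idea of Welch's t-test is the same as that of Student's t-test. The main difference is that Welch's t-test uses $S_1^2/n_1 + S_2^2/n_2$ to estimate the variance of the difference in sample means, instead of the pooled variance used in Student's t-test. Here $S_1^2$ and $S_2^2$ are the sample variances of the two samples, and $n_1$ and $n_2$ are the sample sizes. For testing the null hypotheses $H_{01}$ and $H_{02}$ in our paper, the Welch's t-test statistics $T_1$ and $T_2$ are
$$T_1 = \frac{\bar{X}_E - \bar{X}_R + \Delta}{\sqrt{S_E^2/n_E + S_R^2/n_R}}, \qquad T_2 = \frac{\bar{X}_R - \bar{X}_P - \Delta}{\sqrt{S_R^2/n_R + S_P^2/n_P}}.$$
Under the null hypothesis, according to the Welch-Satterthwaite equation, $T_1$ approximately follows a t-distribution with degrees of freedom
$$df_1 = \frac{\left(S_E^2/n_E + S_R^2/n_R\right)^2}{\dfrac{(S_E^2/n_E)^2}{n_E - 1} + \dfrac{(S_R^2/n_R)^2}{n_R - 1}},$$
and $T_2$ approximately follows another t-distribution with degrees of freedom
$$df_2 = \frac{\left(S_R^2/n_R + S_P^2/n_P\right)^2}{\dfrac{(S_R^2/n_R)^2}{n_R - 1} + \dfrac{(S_P^2/n_P)^2}{n_P - 1}},$$
where $\bar{X}_E$, $\bar{X}_R$, $\bar{X}_P$ are the sample means and $S_E^2$, $S_R^2$, $S_P^2$ are the sample variances of the experimental, reference and placebo treatments, respectively.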
When the nominal level α is given, we reject both null hypotheses $H_{01}$ and $H_{02}$ if and only if $T_1 > t_{\alpha/2}(df_1)$ and $T_2 > t_{\alpha/2}(df_2)$, where $t_{\alpha/2}(\upsilon)$ is the upper $100 \times \alpha/2$ percentile of the t-distribution with $\upsilon$ degrees of freedom. In that case, we can say that the experimental treatment is not inferior to the reference treatment.
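The Welch-type statistics and their Welch-Satterthwaite degrees of freedom can be sketched in a few lines of Python. This is only a minimal illustration, assuming the margin ∆ enters through the boundary conditions $\mu_E = \mu_R - \Delta$ and $\mu_R = \mu_P + \Delta$; the function names (`welch_stat`, `three_arm_welch`, `reject_both`) and the toy data are hypothetical and not taken from the paper. The critical values $t_{\alpha/2}(df)$ are passed in by the caller, since the Python standard library has no t-quantile function.

```python
from math import sqrt
from statistics import mean, variance


def welch_stat(x, y, shift):
    """Welch statistic and Welch-Satterthwaite df for H0: mean(x) - mean(y) + shift <= 0."""
    vx = variance(x) / len(x)  # S_x^2 / n_x (unbiased sample variance)
    vy = variance(y) / len(y)  # S_y^2 / n_y
    t = (mean(x) - mean(y) + shift) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def three_arm_welch(x_e, x_r, x_p, delta):
    """Return (T1, df1), (T2, df2) for the two null hypotheses."""
    t1, df1 = welch_stat(x_e, x_r, delta)    # H01: mu_E <= mu_R - delta
    t2, df2 = welch_stat(x_r, x_p, -delta)   # H02: mu_R <= mu_P + delta
    return (t1, df1), (t2, df2)


def reject_both(t1, t2, crit1, crit2):
    """Reject H01 and H02 only if both statistics exceed their critical values.

    crit1 and crit2 are the upper alpha/2 t-quantiles for df1 and df2,
    supplied by the caller (for example from a t table).
    """
    return t1 > crit1 and t2 > crit2


# Toy data, made up purely for illustration.
x_e = [3.0, 4.0, 5.0]
x_r = [2.0, 3.0, 4.0]
x_p = [0.0, 1.0, 2.0]
(t1, df1), (t2, df2) = three_arm_welch(x_e, x_r, x_p, delta=0.5)
```

Keeping the critical values as inputs separates the computation of the statistics from the choice of quantile source, so the same helper can be reused with any t-quantile routine.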

The type I error and the power of Welch's t-test can be calculated as follows: [...] Tamhane ([...] Welch's t-test to represent Tamhane's test procedure in the following discussion.
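For orientation, both quantities can be expressed through the probability of the joint rejection event. The following is a sketch that assumes only the rejection rule stated earlier (both $T_1$ and $T_2$ must exceed their critical values); the symbol $\pi$ is introduced here for illustration and does not come from the original text.

```latex
% \pi is a hypothetical name for the joint rejection probability.
\[
\pi(\mu_E, \mu_R, \mu_P)
  = P\bigl(T_1 > t_{\alpha/2}(df_1),\; T_2 > t_{\alpha/2}(df_2)\bigr)
\]
% Type I error: \pi at a parameter point where at least one null holds.
\[
\text{type I error} = \pi(\mu_E, \mu_R, \mu_P)
  \quad \text{for } \mu_E \le \mu_R - \Delta \ \text{ or } \ \mu_R \le \mu_P + \Delta
\]
% Power: \pi at a parameter point where both alternatives hold.
\[
\text{power} = \pi(\mu_E, \mu_R, \mu_P)
  \quad \text{for } \mu_E > \mu_R - \Delta \ \text{ and } \ \mu_R > \mu_P + \Delta
\]
```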

Score test
The Score test is a widely used statistical test of a simple null hypothesis that a parameter of interest θ equals some particular value $\theta_0$. A normal approximation based on the Score test is another solution to the Behrens-Fisher problem. The key to constructing the Score statistic is the calculation of the score, which is defined as the first-order derivative of the log-likelihood function, and of the Fisher information matrix. In our problem, we use the boundary conditions to construct the statistics.
That is, we construct the Score statistics under $\mu_E = \mu_R - \Delta$ and $\mu_R = \mu_P + \Delta$.
According to Jin (2009), we can obtain the Score test statistics $T_1^*$ and $T_2^*$ for testing the null hypotheses $H_{01}$ and $H_{02}$: [...]
The only difference between the constructions of $T_1^*$ and $T_2^*$ is that $T_1^*$ uses data from the experimental and reference treatments, whereas $T_2^*$ uses data from the reference and placebo treatments. Thus, we explain only the parameters in $T_1^*$; those in $T_2^*$ follow analogously. Here, σ² [...], where $S_E^2$ and $S_R^2$ are the sample variances of the experimental and reference treatments, and $t_0$ is the solution in the interval (0, 1) to the following cubic equation: [...]
Under the null hypothesis, both $T_1^*$ and $T_2^*$ asymptotically follow the standard normal distribution. Thus, when the nominal level α is given, we reject both null hypotheses $H_{01}$ and $H_{02}$ if and only if $T_1^* > Z_{\alpha/2}$ and $T_2^* > Z_{\alpha/2}$, where $Z_{\alpha/2}$ is the upper $100 \times \alpha/2$ percentile of the standard normal distribution. If we reject $H_{01}$ and $H_{02}$ simultaneously, non-inferiority with assay sensitivity can be claimed. The type I error and the power of the Score test can be calculated as follows: [...]
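The decision rule for the Score test can be sketched as follows, given values of $T_1^*$ and $T_2^*$ that have already been computed. The sketch does not implement the Score statistics themselves; the function name `score_decision` and the default `alpha=0.05` are assumptions made for illustration.

```python
from statistics import NormalDist


def score_decision(t1_star, t2_star, alpha=0.05):
    """Apply the two-part rejection rule to precomputed Score statistics.

    Returns (reject, z), where z is the upper alpha/2 quantile of the
    standard normal distribution and reject is True only if both
    statistics exceed z.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (t1_star > z and t2_star > z), z


# Hypothetical statistic values, for illustration only.
reject, z = score_decision(2.3, 2.1)
```

Because both null hypotheses must be rejected, a single statistic falling below the critical value is enough to withhold the non-inferiority claim.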

Simulation studies
In this section, we present the results of simulation studies that compare the type I errors and powers of the three methods in our paper with Hida and Tango [...]
Many factors may influence the statistical power of our hypothesis tests, and we study some of them. The first factor is the ratio of $\sigma_E^2$, $\sigma_R^2$ and $\sigma_P^2$. We let $\sigma_R^2 = 1$ and consider six variance ratios $\sigma_E^2 : \sigma_R^2 : \sigma_P^2$: 0.2:1:5, 0.25:1:4, 0.4:1:2.5, 0.5:1:2, 0.8:1:1.25, and 1:1:1. The second factor is the sample size and the allocation ratio of the three groups. Statistical power may differ substantially between small and large sample sizes, and it may change greatly when the ratio of the three group sample sizes varies. Thus, we consider six settings of $n_E : n_R : n_P$: 20:20:20, 30:30:30, 50:50:50, 100:100:100, 120:120:120 and 150:100:50. Another factor is the distance between $\mu_E$ and $\mu_P - \Delta$ and the distance between $\mu_P - \Delta$ and $\mu_R$. Recall that our statistics are constructed under the boundary conditions, so it is natural to expect the statistical power to change as either distance becomes larger. For convenience, we fix ∆ = 0.5, $\mu_E = 3$, and $\mu_R = 2.6$, and vary $\mu_P$ to examine the influence of these two distances. We set five different values of $\mu_P$: 2, 1.6, 1. [...]
According to the demonstration above, the Improved Score test performs well whether the data are homoscedastic or heteroscedastic. Thus, when the variances of the experimental, reference and placebo treatments are not the same, we may choose the Improved Score test to test whether non-inferiority with assay sensitivity can be claimed.
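The simulation design described in this section can be sketched as a small Monte Carlo routine. The sketch uses the quoted design values (∆ = 0.5, $\mu_E = 3$, $\mu_R = 2.6$, the six variance ratios and the six sample-size settings). As a loudly labeled assumption, it applies Welch-type statistics with a normal critical value as a stand-in test. It is not the Score or Improved Score test, and the names `rejection_rate` and `mean_var` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

# Design values quoted in the text.
DELTA, MU_E, MU_R = 0.5, 3.0, 2.6
VAR_RATIOS = [(0.2, 1, 5), (0.25, 1, 4), (0.4, 1, 2.5),
              (0.5, 1, 2), (0.8, 1, 1.25), (1, 1, 1)]
SAMPLE_SIZES = [(20, 20, 20), (30, 30, 30), (50, 50, 50),
                (100, 100, 100), (120, 120, 120), (150, 100, 50)]


def mean_var(x):
    """Sample mean and unbiased sample variance."""
    n = len(x)
    m = sum(x) / n
    return m, sum((v - m) ** 2 for v in x) / (n - 1)


def rejection_rate(mu_p, variances, sizes, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo estimate of P(reject H01 and H02) for one configuration.

    ASSUMPTION: Welch-type statistics are compared with a normal critical
    value; this is a stand-in, not one of the tests studied in the paper.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    mus = (MU_E, MU_R, mu_p)
    hits = 0
    for _ in range(reps):
        stats = []
        for mu, var, n in zip(mus, variances, sizes):
            # random.gauss takes a standard deviation, hence sqrt(var).
            sample = [rng.gauss(mu, sqrt(var)) for _ in range(n)]
            m, s2 = mean_var(sample)
            stats.append((m, s2 / n))
        (m_e, v_e), (m_r, v_r), (m_p, v_p) = stats
        t1 = (m_e - m_r + DELTA) / sqrt(v_e + v_r)
        t2 = (m_r - m_p - DELTA) / sqrt(v_r + v_p)
        hits += (t1 > z) and (t2 > z)
    return hits / reps


# One configuration inside the alternative (mu_R > mu_P + DELTA).
power_example = rejection_rate(1.0, (1, 1, 1), (100, 100, 100), reps=200)
```

Looping `rejection_rate` over `VAR_RATIOS` and `SAMPLE_SIZES` gives the 36 design cells for each value of $\mu_P$; setting $\mu_P = \mu_R - \Delta = 2.1$ places the configuration on the boundary of $H_{02}$, where the estimated rate approximates the type I error.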

Conclusions and discussions
In this paper, we consider the three-arm non-inferiority trial in the statistical testing framework established by Hida and Tango (2011). Unlike existing methods, we allow the data to be heteroscedastic. We propose several test statistics.
Numerical studies are used to illustrate the performance of our proposed methods and to compare them with some existing methods. When the data are heteroscedastic, the method of Lu et al. (2017) cannot control the type I error rates, and the method of Hida and Tango (2011), Welch's t-test and the Score test are conservative in controlling the type I error rates. The Improved Score test overcomes this problem and performs well in controlling the type I error rates. Further, the Improved Score test has the largest power among all the methods discussed in the paper under almost all conditions. Thus, we can say that the Improved Score test is a valid test for the three-arm non-inferiority problem.
In this paper, we discussed only normally distributed endpoints; other endpoints, such as binary data, were not considered. In the future, we may assess the validity of our method in three-arm non-inferiority trials with a placebo arm and other types of endpoints.