Homogeneity Test of Many-to-One Risk Differences for Correlated Binary Data under Optimal Algorithms

In clinical studies, it is important to investigate the effectiveness of different therapeutic designs, especially the comparison of multiple treatment groups with one control group. This paper studies the homogeneity test of many-to-one risk differences for correlated binary data under optimal algorithms. Under Donner's model, several algorithms for obtaining global and constrained MLEs are compared in terms of accuracy and efficiency. Further, likelihood ratio, score, and Wald-type statistics are proposed, based on the optimal algorithms, to test whether the many-to-one risk differences are equal. Monte Carlo simulations evaluate the performance of these algorithms through the total averaged estimation error, SD, MSE, and convergence rate. The score statistic is more robust and has satisfactory power. Two real examples are given to illustrate the proposed methods.


Introduction
Binary data are often encountered for paired organs (e.g., eyes and ears) in medical clinical studies. The responses of each patient are collected and recorded as paired data at the end of the study. The outcome for a patient can be no cure, unilateral cure, or bilateral cure. Data from all patients can be summarized in a contingency table. The correlation between responses from paired parts should be taken into account to avoid biased or misleading results. Some probability models have been proposed for analyzing correlated paired data. Rosner introduced a constant R model under the assumption that the conditional probability of a response at one side of the paired body parts, given a response at the other side, was R times the unconditional probability [1]. Under Rosner's model, asymptotic and exact tests were discussed [2][3][4][5]. However, Dallal pointed out that Rosner's model could give a poor fit if the characteristic was almost certain to occur bilaterally with widely varying group-specific prevalence [6]. He assumed that each group had a constant conditional probability c and derived the likelihood ratio statistic to test prevalence equality. M'lan and Chen presented three objective Bayesian methods for bilateral data under Dallal's model [7]. Donner proposed an alternative model by assuming that the correlation coefficient was a fixed constant ρ in each of the groups [8]. Thompson proved that Donner's model could make full use of single and two-organ data to optimize the power of a study [9]. Pei et al. applied the model to stratified paired data and assumed that the correlation coefficients of responses were the same for all subjects in the two groups of each stratum [10].
Homogeneity testing has received considerable attention for bilateral binary data. In ophthalmologic studies, Rosner [1] proposed two statistics to test the equality of rate difference in (in)dependence models. Tang et al. [2] developed exact and approximate unconditional procedures for the aforementioned statistics in small sample designs or sparse data structures. Further, Tang et al. [11] developed several statistics for testing the equality of cure rates, including likelihood ratio, score, and two Wald-type statistics in the (in)dependence models. Ma et al. [3] extended these tests to multigroup cases and investigated whether the response rates of the g groups (g ≥ 2) were identical under Rosner's model. From the above results, we note that it is crucial to derive the global and constrained maximum likelihood estimates (MLEs) under the hypotheses. However, there are usually no closed-form solutions for the MLEs. Under Donner's model, Ma and Liu [12] used a two-step algorithm to obtain MLEs and developed several tests for the proportion equality among g groups (g ≥ 2). Liu et al. [13] also used the method for constrained MLEs. Peng et al. [14] constructed confidence intervals (CIs) of the proportion ratio under Rosner's model. They introduced the Fisher scoring algorithm for constrained MLEs. Many algorithms have thus been proposed to obtain MLEs for correlated binary data. However, few studies compare different algorithms for MLEs in a multigroup binary design.
Under Donner's model, this paper aims to provide several algorithms for calculating global and constrained MLEs and extends the homogeneity tests of Tang et al. [11] to the many-to-one case under optimal algorithms. The Fisher scoring algorithm, the two-step method, and the generalized expectation-maximization (GEM) algorithm are taken into account, since they are widely used in calculating MLEs. The optimal algorithms for the MLEs required by the test of interest can be found by comparing these algorithms. The rest of this article is organized as follows. In Section 2, we review the data structure and establish Donner's model for multigroup correlated binary data. Global and constrained MLEs are derived by various algorithms in Section 3. Based on the optimal algorithms, the likelihood ratio, score, and Wald-type statistics are constructed for testing the equality of many-to-one risk differences. The performance of the algorithms is compared by the total averaged estimation error, SD of the averaged estimation error, MSE, and convergence rate in Section 4. Monte Carlo simulations show the empirical type I error rate and power of these tests. Two real examples are provided to illustrate the proposed methods in Section 5. Conclusions and further work are given in Section 6.

Preliminaries
Suppose there are $g$ groups involving $M$ individuals in the clinical trial, where the first group is the control group and the other $g - 1$ groups are treatment groups. Let $m_{li}$ be the number of patients with $l$ responses ($l = 0, 1, 2$) in the $i$-th group ($i = 1, 2, \ldots, g$) and $N_i = \sum_{l=0}^{2} m_{li}$ be the total number of patients in the $i$-th group, which is assumed to be fixed. The data structure is shown in Table 1.
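Let $p_{li}$ denote the probability that a patient in the $i$-th group has $l$ responses. Thus, the probability density of $m_i = (m_{0i}, m_{1i}, m_{2i})^T$ is expressed as follows:

$$f(m_i \mid p_{0i}, p_{1i}, p_{2i}) = \frac{N_i!}{m_{0i}!\, m_{1i}!\, m_{2i}!}\, p_{0i}^{m_{0i}}\, p_{1i}^{m_{1i}}\, p_{2i}^{m_{2i}}.$$

Let $Z_{hik}$ be an indicator of the $k$-th organ's response ($k = 1, 2$) for the $h$-th patient ($h = 1, \ldots, N_i$) in the $i$-th group. If there is a response, then $Z_{hik} = 1$, and $Z_{hik} = 0$ otherwise. Suppose that $\Pr(Z_{hik} = 1) = \pi_i$ ($0 \le \pi_i \le 1$) and $\mathrm{Corr}(Z_{hik}, Z_{hi(3-k)}) = \rho$ ($0 \le \rho \le 1$) under Donner's model. Thus, the probabilities $p_{li}$ can be obtained by

$$p_{0i} = \rho(1 - \pi_i) + (1 - \rho)(1 - \pi_i)^2, \qquad p_{1i} = 2\pi_i(1 - \pi_i)(1 - \rho), \qquad p_{2i} = \rho\pi_i + (1 - \rho)\pi_i^2,$$

for $i = 1, 2, \ldots, g$. Based on the observed data $m = (m_1, m_2, \ldots, m_g)$, the log-likelihood function can be given by

$$l(\pi_1, \ldots, \pi_g, \rho) = \sum_{i=1}^{g} \sum_{l=0}^{2} m_{li} \log p_{li} + \log C,$$

where $C$ is a constant that does not depend on the parameters. Let $\delta_i = \pi_i - \pi_1$ ($i = 2, \ldots, g$), which is the risk difference between the first group and the $i$-th group. We are interested in testing the hypotheses

$$H_0: \delta_2 = \delta_3 = \cdots = \delta_g \triangleq \delta \quad \text{versus} \quad H_a: \delta_2, \ldots, \delta_g \text{ are not all equal}.$$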
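As an illustration of the model in this section, the following minimal Python sketch evaluates the three cell probabilities implied by a marginal response rate $\pi_i$ and a within-patient correlation $\rho$, and sums the log-likelihood kernel over groups. The function names and the toy counts are hypothetical, and the closed forms used for the cell probabilities are the ones that follow from $\Pr(Z_{hik} = 1) = \pi_i$ and $\mathrm{Corr}(Z_{hi1}, Z_{hi2}) = \rho$; they should be checked against the paper's own display equations.

```python
import math


def cell_probs(pi, rho):
    """Probabilities of 0, 1, and 2 responses for one patient (assumed Donner form)."""
    p0 = rho * (1 - pi) + (1 - rho) * (1 - pi) ** 2
    p1 = 2 * pi * (1 - pi) * (1 - rho)
    p2 = rho * pi + (1 - rho) * pi ** 2
    return p0, p1, p2


def log_likelihood(m, pis, rho):
    """Log-likelihood kernel; m[i] = (m0, m1, m2) for group i, constant term omitted."""
    total = 0.0
    for counts, pi in zip(m, pis):
        for count, p in zip(counts, cell_probs(pi, rho)):
            if count > 0:  # empty cells contribute nothing
                total += count * math.log(p)
    return total


def risk_differences(pis):
    """delta_i = pi_i - pi_1 for the treatment groups i = 2, ..., g."""
    return [pi - pis[0] for pi in pis[1:]]


# Toy data (hypothetical): one control group and two treatment groups.
m = [(60, 25, 15), (40, 30, 30), (35, 30, 35)]
pis = [0.30, 0.45, 0.50]
rho = 0.4

print(cell_probs(0.30, rho))
print(log_likelihood(m, pis, rho))
print(risk_differences(pis))
```

Two quick sanity checks on the parameterization: with $\rho = 0$ the cells reduce to the binomial probabilities for two independent organs, and with $\rho = 1$ the unilateral cell has probability zero.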

Test Methods
In this section, the global and constrained MLEs are first derived by various algorithms. Then, likelihood ratio, score, and Wald-type tests are constructed based on the optimal algorithms.

Global MLEs.
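Let $\hat{\pi}_i$ and $\hat{\rho}$ be the global MLEs of $\pi_i$ and $\rho$. For the unknown parameters $\pi_i$ ($i = 1, 2, \ldots, g$) and $\rho$, the global MLEs are the solutions of the following equations:

$$\frac{\partial l}{\partial \pi_i} = 0 \quad (i = 1, 2, \ldots, g), \qquad \frac{\partial l}{\partial \rho} = 0.$$

However, there are no closed-form solutions for the above equations. Thus, we need to obtain the global MLEs $\hat{\pi}_i$ ($i = 1, 2, \ldots, g$) and $\hat{\rho}$ numerically by different algorithms.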
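For reference, the score functions behind these equations can be written out explicitly. The block below is a sketch derived under the assumption that the cell probabilities take the Donner-model form $p_{0i} = \rho(1 - \pi_i) + (1 - \rho)(1 - \pi_i)^2$, $p_{1i} = 2\pi_i(1 - \pi_i)(1 - \rho)$, and $p_{2i} = \rho\pi_i + (1 - \rho)\pi_i^2$. It is a derivation aid and may differ in arrangement from the authors' own expressions.

```latex
\begin{aligned}
\frac{\partial l}{\partial \pi_i}
  &= -\frac{m_{0i}\,[\rho + 2(1-\rho)(1-\pi_i)]}{p_{0i}}
     + \frac{m_{1i}\,(1 - 2\pi_i)}{\pi_i(1-\pi_i)}
     + \frac{m_{2i}\,[\rho + 2(1-\rho)\pi_i]}{p_{2i}},
     \qquad i = 1, \ldots, g, \\
\frac{\partial l}{\partial \rho}
  &= \sum_{i=1}^{g} \left[ \pi_i(1-\pi_i)
     \left( \frac{m_{0i}}{p_{0i}} + \frac{m_{2i}}{p_{2i}} \right)
     - \frac{m_{1i}}{1-\rho} \right].
\end{aligned}
```

Each equation in $\pi_i$ involves only the counts of group $i$ once $\rho$ is fixed, while the equation in $\rho$ couples all groups. This makes it natural to update the $\pi_i$ and $\rho$ in separate steps.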

Global MLEs Based on Two-Step Method.
The two-step method is described by a third-order polynomial and the Newton-Raphson algorithm. The detailed procedure is provided below.

[Table 1 (data structure): rows are indexed by the number of responses ($l$) and columns by the group ($i$).]

M Step. Update $\pi_1, \pi_2, \ldots, \pi_g$ and $\rho$ successively to maximize the expected value of the log-likelihood function obtained in the E step. The new approximation of each parameter is obtained by maximizing $Q(\pi, \rho \mid m, \pi^{(t)}, \rho^{(t)})$ while the other parameters are held at their latest approximations. Repeat the E and M steps until the result converges.

Constrained MLEs Based on Fisher Scoring Algorithm.
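The initial values of $\rho$ and $\pi_1$ are defined in equation (9), and $\delta^{(0)} = 1$. The $(t + 1)$-th updates of $\delta$, $\pi_1$, and $\rho$ can be calculated by the Fisher scoring algorithm as follows:

$$\begin{pmatrix} \delta^{(t+1)} \\ \pi_1^{(t+1)} \\ \rho^{(t+1)} \end{pmatrix} = \begin{pmatrix} \delta^{(t)} \\ \pi_1^{(t)} \\ \rho^{(t)} \end{pmatrix} + I_2^{-1}\left(\delta^{(t)}, \pi_1^{(t)}, \rho^{(t)}\right) \left. \begin{pmatrix} \partial l / \partial \delta \\ \partial l / \partial \pi_1 \\ \partial l / \partial \rho \end{pmatrix} \right|_{\delta = \delta^{(t)},\, \pi_1 = \pi_1^{(t)},\, \rho = \rho^{(t)}},$$

where $I_2$ is a $3 \times 3$ Fisher information matrix (see Appendix B).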
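The update above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the cell probabilities use the Donner form $p_{0i} = \rho(1-\pi_i) + (1-\rho)(1-\pi_i)^2$, $p_{1i} = 2\pi_i(1-\pi_i)(1-\rho)$, $p_{2i} = \rho\pi_i + (1-\rho)\pi_i^2$; the information matrix is computed from the generic multinomial formula $\sum_i N_i \sum_l \nabla p_{li} \nabla p_{li}^T / p_{li}$ rather than from Appendix B; a step-halving safeguard is added; and the starting values and all function names are hypothetical.

```python
import math


def cell_probs(pi, rho):
    """Probabilities of 0, 1, 2 responses (assumed Donner form)."""
    return [rho * (1 - pi) + (1 - rho) * (1 - pi) ** 2,
            2 * pi * (1 - pi) * (1 - rho),
            rho * pi + (1 - rho) * pi ** 2]


def cell_grads(pi, rho):
    """Derivatives of (p0, p1, p2) with respect to pi and to rho."""
    d_pi = [-rho - 2 * (1 - rho) * (1 - pi),
            2 * (1 - rho) * (1 - 2 * pi),
            rho + 2 * (1 - rho) * pi]
    d_rho = [pi * (1 - pi), -2 * pi * (1 - pi), pi * (1 - pi)]
    return d_pi, d_rho


def valid(theta):
    delta, pi1, rho = theta
    return 0 < pi1 < 1 and 0 < pi1 + delta < 1 and 0 <= rho < 1


def loglik(m, theta):
    """Constrained log-likelihood kernel with pi_i = pi_1 + delta for i >= 2."""
    delta, pi1, rho = theta
    total = 0.0
    for i, counts in enumerate(m):
        pi = pi1 if i == 0 else pi1 + delta
        total += sum(c * math.log(p) for c, p in zip(counts, cell_probs(pi, rho)))
    return total


def score_and_info(m, theta):
    """Score vector and expected information for theta = (delta, pi_1, rho)."""
    delta, pi1, rho = theta
    score = [0.0] * 3
    info = [[0.0] * 3 for _ in range(3)]
    for i, counts in enumerate(m):
        pi = pi1 if i == 0 else pi1 + delta
        n_i = sum(counts)
        d_pi, d_rho = cell_grads(pi, rho)
        for l, p in enumerate(cell_probs(pi, rho)):
            # The control group (i == 0) does not depend on delta.
            grad = [0.0 if i == 0 else d_pi[l], d_pi[l], d_rho[l]]
            for a in range(3):
                score[a] += counts[l] * grad[a] / p
                for b in range(3):
                    info[a][b] += n_i * grad[a] * grad[b] / p
    return score, info


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fisher_scoring(m, theta0, tol=1e-5, max_iter=100):
    """Iterate theta <- theta + I^{-1} score, halving steps that leave the
    parameter space or lower the log-likelihood (a safeguard added here)."""
    theta = list(theta0)
    for iteration in range(1, max_iter + 1):
        score, info = score_and_info(m, theta)
        step = solve(info, score)
        scale = 1.0
        while scale > 1e-8:
            new = [t + scale * s for t, s in zip(theta, step)]
            if valid(new) and loglik(m, new) >= loglik(m, theta):
                break
            scale /= 2
        else:
            new = theta  # no acceptable step found
        converged = max(abs(n - t) for n, t in zip(new, theta)) < tol
        theta = new
        if converged:
            return theta, iteration
    return theta, max_iter


# Noise-free expected counts, so the maximizer is known: (0.2, 0.3, 0.4).
sizes, true_pis = [100, 120, 80], [0.3, 0.5, 0.5]
m = [[n * p for p in cell_probs(pi, 0.4)] for n, pi in zip(sizes, true_pis)]
theta_hat, n_iter = fisher_scoring(m, [0.1, 0.4, 0.2])  # hypothetical start
print(theta_hat, n_iter)
```

The expected-count data are used only so that the answer is known in advance; with real counts the same call applies, and the default tolerance and iteration cap mirror the $\epsilon = 1 \times 10^{-5}$ and $k = 100$ settings used in the simulations.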

Constrained MLEs Based on Two-Stage Procedure.
The two-stage procedure is different from the two-step method in Section 3.1.2. Firstly, the MLE $\tilde{\delta}$ of $\delta$ is given by the Newton-Raphson algorithm. Then, the estimates of $\pi_1$ and $\rho$ are obtained by the Fisher scoring algorithm given $\tilde{\delta}$. The detailed process is described as follows.

Likelihood Ratio Test.
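The likelihood ratio test statistic can be constructed from the global and constrained MLEs as follows:

$$T_L = 2\left[\, l(\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_g, \hat{\rho}) - l(\tilde{\pi}_1, \tilde{\pi}_1 + \tilde{\delta}, \ldots, \tilde{\pi}_1 + \tilde{\delta}, \tilde{\rho}) \,\right],$$

where $\hat{\pi}_i$ ($i = 1, 2, \ldots, g$) and $\hat{\rho}$ are the global MLEs and $\tilde{\pi}_1$, $\tilde{\delta}$, and $\tilde{\rho}$ are the constrained MLEs.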

Monte Carlo Simulations
In this section, the performance of several algorithms is compared with respect to the average errors of the MLEs, the number of iterations, and the time cost. For convenience, in tables and figures we denote the Fisher scoring algorithm, two-step method, and GEM algorithm for global MLEs as FSA, TSM, and GEM, and the Fisher scoring algorithm, two-stage procedure, and GEM algorithm for constrained MLEs as FSA, TSP, and GEM. Then, we investigate the type I error rates (TIEs) and power of the likelihood ratio, score, and Wald-type tests. In the simulations, $g$ and $N = (N_1, N_2, \ldots, N_g)^T$ are arranged as shown in Table 2, where scenarios 4, 8, and 12 are unbalanced designs.

Selection of Algorithms.
Under $H_a$ or $H_0$, we randomly select 1000 sets of $\pi_i$ ($i = 1, 2, \ldots, g$) and $\rho$ for each scenario in Table 2. Further, 10,000 samples are randomly produced for each parameter setting. The global MLEs are obtained by the Fisher scoring algorithm, two-step method, and GEM algorithm, and the constrained MLEs are obtained by the Fisher scoring algorithm, two-stage procedure, and GEM algorithm. The random samples for the former are generated under $H_a$ and those for the latter under $H_0$. The convergence accuracy $\epsilon$ is defined by the difference between two successive iterations and is fixed at $1 \times 10^{-5}$. The MSEs of the three algorithms for global MLEs show no appreciable difference, as shown in Tables 3-5. That is to say, these algorithms yield essentially identical global MLEs. In Tables 6-8, the values of $e$, SD, and MSE from the Fisher scoring algorithm are usually smaller than those from the other two algorithms for constrained MLEs. So, the Fisher scoring algorithm has higher accuracy for constrained MLEs. All MSEs become smaller and closer to each other as the sample size increases. The algorithms achieve better MSEs in balanced designs than in unbalanced designs.

Tables 9 and 10 show the number of times these algorithms fail to converge when $\epsilon = 1 \times 10^{-5}, 1 \times 10^{-4}$ within $k = 100, 200$ iterations. The Fisher scoring algorithm has a lower failure rate when calculating global MLEs. The GEM algorithm for global MLEs is sensitive to $\epsilon$ and $k$: both relaxing $\epsilon$ and increasing $k$ markedly improve its chance of convergence. The Fisher scoring and GEM algorithms for constrained MLEs rarely fail to converge within 100 iterations. As $g$ increases, the number of failures decreases.
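The sampling step of such a simulation can be sketched as follows. This is a minimal illustration only: the cell probabilities assume the Donner form used for the model, the parameter values and seed are arbitrary, the function names are hypothetical, and the paper's scheme for randomly selecting the 1000 parameter sets is not reproduced.

```python
import random


def cell_probs(pi, rho):
    """Probabilities of 0, 1, 2 responses (assumed Donner form)."""
    return [rho * (1 - pi) + (1 - rho) * (1 - pi) ** 2,
            2 * pi * (1 - pi) * (1 - rho),
            rho * pi + (1 - rho) * pi ** 2]


def simulate_counts(n_patients, pi, rho, rng):
    """Draw (m0, m1, m2) for one group: each patient falls into one cell."""
    draws = rng.choices([0, 1, 2], weights=cell_probs(pi, rho), k=n_patients)
    return [draws.count(l) for l in range(3)]


def simulate_table(sizes, pis, rho, rng):
    """One simulated data set: a list of (m0, m1, m2) rows, one per group."""
    return [simulate_counts(n, pi, rho, rng) for n, pi in zip(sizes, pis)]


def has_converged(old, new, eps=1e-5):
    """Stopping rule: largest change between two successive iterates below eps."""
    return max(abs(a - b) for a, b in zip(old, new)) < eps


rng = random.Random(2024)      # arbitrary seed, for reproducibility only
sizes = [30, 30, 30]           # hypothetical balanced design with g = 3
pis_h0 = [0.2, 0.3, 0.3]       # satisfies H0: delta_2 = delta_3 = 0.1
table = simulate_table(sizes, pis_h0, 0.5, rng)
print(table)
```

Repeating `simulate_table` 10,000 times per parameter setting and fitting each table with the algorithm under study gives the replicated estimates from which error summaries and failure counts can be tabulated.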

Number of Iterations.
Since the algorithms for global MLEs have identical accuracy in terms of MSEs, we further compare their efficiency by calculating the number of iterations and the time cost when the convergence accuracy is $\epsilon = 1 \times 10^{-5}$. The average number of iterations is recorded for every parameter setting in Figure 1.
As shown in Figure 1, the GEM algorithm takes more iterations to converge than the Fisher scoring algorithm and the two-step method. The number of iterations is mostly concentrated between 20 and 30, and the upper limit increases as $N_i$ ($i = 1, 2, \ldots, g$) grows. The Fisher scoring algorithm needs slightly fewer iterations than the two-step method.

Time Cost.
The time required to achieve convergence also reflects the convergence rate of an algorithm. Figure 2 shows that the GEM algorithm has the worst performance in terms of time cost. The Fisher scoring algorithm always takes less time, especially when $g = 6, 9$. As $N_i$ ($i = 1, 2, \ldots, g$) increases, the time distribution becomes more clustered for the Fisher scoring algorithm and the two-step method. When $g$ is larger, the difference in time cost between these two algorithms is bigger.
Based on the above results, the Fisher scoring algorithm, two-step method, and GEM algorithm produce the same global MLEs when the number of iterations is large enough. However, the Fisher scoring algorithm for global MLEs has a better convergence rate in terms of the number of iterations and time cost. Constrained MLEs from the Fisher scoring algorithm have smaller MSEs. Therefore, it is advisable to choose Fisher scoring algorithms to obtain the global and constrained MLEs required by the homogeneity test in this paper.

Evaluation of Test Statistics.
Since the Fisher scoring algorithm is more efficient for global MLEs and has higher accuracy for constrained MLEs, we construct the test statistics $T_L$, $T_{SC}$, and $T_W$ from its estimates and evaluate them when the significance level is $\alpha = 0.05$.

[Table 6: Errors of the constrained MLEs for $g = 3$.]
[Table 7: Errors of the constrained MLEs for $g = 6$.]

Power is the probability of correctly rejecting a null hypothesis when it is false in a statistical test. A good test should not only be robust but also have power as high as possible. 1,000 random parameter settings involving $\pi_i$ and $\rho$ ($i = 1, 2, \ldots, g$) are generated under $H_0$, where $g$ and $N$ are shown in Table 2. 10,000 samples are randomly produced for every parameter setting. The empirical TIEs are computed by dividing the number of rejections of $H_0$ by 10,000. Figure 3 further reflects the comprehensive performance of $T_L$, $T_{SC}$, and $T_W$. The empirical TIEs of $T_{SC}$ are smaller and close to 0.05, which means the score test is more robust. The Wald-type test statistic tends to be liberal when $N_i$ ($i = 1, 2, \ldots, g$) is small (i.e., $N_i = 30, 70$) and the design is unbalanced. The Wald-type test statistic is more liberal when $g$ is bigger. The empirical TIEs of the three tests move closer to 0.05 as $N_i$ increases. The test statistics perform better in balanced designs than in unbalanced designs.

Power.
Let $N_1 = N_2 = \cdots = N_g = m$, $g = 3, 6, 9$, $\pi_1 = 0.2$, $\rho = 0.2, 0.5, 0.8$, $\delta_p = 0.1, 0.2, 0.3$, $\pi_i = \pi_1 + \delta_p$ when $i = 2k$, and $\pi_i = \pi_1$ when $i = 2k + 1$ ($k = 1, 2, \ldots$). 10,000 samples are randomly generated under every parameter setting as $m$ changes from 10 to 160 at intervals of 10. The empirical power is computed by dividing the number of rejections of $H_0$ by 10,000. Figure 4 shows how the empirical power changes with $m$. $T_W$ has higher power, while $T_L$ and $T_{SC}$ have similar power. The empirical powers of the three test statistics approach each other as $m$ increases. Increasing the number of groups produces higher power, which means that all test statistics work well in multigroup cases. The moderately and highly correlated data (i.e., $\rho = 0.5$ and $\rho = 0.8$) have lower power than the mildly correlated data (i.e., $\rho = 0.2$).
According to the above results, the score test is more robust and has satisfactory power. Thus, it is recommended for the homogeneity test of many-to-one risk differences.

Real Examples
In this section, we introduce two real examples to illustrate the aforementioned methods. The first example was presented by Rosner [1] to illustrate the methods proposed there. As shown in Table 11, 216 persons aged 20-39 with retinitis pigmentosa (RP) were classified into four genetic types: autosomal dominant RP (DOM), autosomal recessive RP (AR), sex-linked RP (SL), and isolate RP (ISO). The results from the four groups were assessed by the Snellen visual acuity (VA). An eye was considered affected if VA was 20/50 or worse and normal if VA was 20/40 or better.
Through the Fisher scoring algorithm, two-step method, and GEM algorithm, the global MLEs $\hat{\pi}_1$, $\hat{\pi}_2$, $\hat{\pi}_3$, $\hat{\pi}_4$, and $\hat{\rho}$ are listed in Table 12. We observe that the three algorithms produce the same results. According to the global MLEs of the proportions in the AR, SL, and ISO groups, the estimated risk differences are $\hat{\delta}_2 = 0.5455 - 0.3625 = 0.1830$, $\hat{\delta}_3 = 0.7926 - 0.3625 = 0.4301$, and $\hat{\delta}_4 = 0.4658 - 0.3625 = 0.1033$. The constrained MLEs $\tilde{\pi}_1$, $\tilde{\delta}$, and $\tilde{\rho}$ vary slightly among the Fisher scoring algorithm, two-stage procedure, and GEM algorithm, as shown in Table 12. Using the Fisher scoring algorithms, the values of $T_L$, $T_{SC}$, and $T_W$ are 9.4332, 8.7076, and 13.0135, respectively, all larger than the 95th percentile of the chi-square distribution with 2 degrees of freedom. All three tests have p values smaller than 0.05 (see Table 13). Thus, there is strong evidence to reject the null hypothesis $H_0: \delta_2 = \delta_3 = \delta_4 \triangleq \delta$ at a significance level of 0.05.

The second example is a cross-sectional population-based study of avoidable blindness in Iran by Rajavi et al. [15]. Nearly 3000 persons were examined, and blindness was assessed for the seven age groups presented in Table 14. Table 15 provides the global and constrained MLEs. The common risk difference $\delta$ is estimated by the three algorithms to be 0.0419, 0.0433, and 0.0462. Table 16 shows that the p values of the test statistics $T_L$, $T_{SC}$, and $T_W$ are $1.4474 \times 10^{-21}$, $8.9818 \times 10^{-26}$, and $3.4369 \times 10^{-14}$. So we have enough evidence to reject $H_0: \delta_2 = \delta_3 = \delta_4 = \delta_5 = \delta_6 = \delta_7 \triangleq \delta$ at a significance level of 0.05.
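The arithmetic in the first example can be checked with a short Python sketch. It uses only the estimates and statistic values quoted in the text, together with the fact that a chi-square variable with 2 degrees of freedom has upper-tail probability $e^{-x/2}$. The variable names are hypothetical, the group order (DOM as the reference group) is inferred from the text, and the computed p values are not claimed to reproduce Table 13 digit for digit.

```python
import math

# Global MLEs quoted in the text; order assumed to be DOM, AR, SL, ISO.
pi_hat = [0.3625, 0.5455, 0.7926, 0.4658]

# Estimated risk differences relative to the first (reference) group.
delta_hat = [round(p - pi_hat[0], 4) for p in pi_hat[1:]]
print(delta_hat)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


# 95th percentile for 2 degrees of freedom: solve exp(-x / 2) = 0.05.
critical_value = -2 * math.log(0.05)

statistics = {"T_L": 9.4332, "T_SC": 8.7076, "T_W": 13.0135}
p_values = {name: chi2_sf_df2(value) for name, value in statistics.items()}

print(critical_value)
for name, value in statistics.items():
    print(name, value > critical_value, p_values[name])
```

The closed-form tail probability applies only to 2 degrees of freedom, which matches the four-group example; the seven-group example has more degrees of freedom and would need a general chi-square routine.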

Conclusion
This paper mainly studies the many-to-one comparison of risk differences for correlated binary data. Three [...] The results of simulations show that the score test statistic is robust and has high empirical power.
In particular, the Fisher scoring algorithm may not work well with a high-dimensional Fisher information matrix when the number of groups is very large. In that case, the two-step method or the two-stage procedure can be adopted to handle the high-dimensional problem. As modifications of the Fisher scoring or Newton-Raphson algorithm, they update only a small subset of the parameters at every step, which reduces the dimension. The GEM algorithm is also an alternative approach if time cost is not a concern. The biggest obstacle for the GEM algorithm may occur in the M step when the local optimal solution does not exist or cannot be found. Furthermore, the convergence behavior of the GEM algorithm is sensitive to the convergence accuracy and the iteration number.
In this article, we investigate the homogeneity test of many-to-one risk differences in a multigroup design. If samples can be stratified by some control variables (e.g., age and gender), the treatment-by-stratum interaction should be considered. Thus, it is of great significance to develop many-to-one comparisons for stratified correlated binary data. Other indexes, such as the risk ratio, can be adopted instead of the risk difference for evaluating differences between proportions.