Empirical likelihood based tests for stochastic ordering under right censorship

This paper develops an empirical likelihood (EL) approach to testing for stochastic ordering between two univariate distributions under right censorship. The proposed test is based on a maximally selected local EL statistic. The asymptotic null distribution is expressed in terms of a Brownian bridge. The new procedure is shown via a simulation study to have superior power to the log-rank and weighted Kaplan–Meier tests under crossing hazard alternatives. The approach is illustrated using data from a randomized clinical trial involving the treatment of severe alcoholic hepatitis. MSC 2010 subject classifications: 62G30, 62G10, 62N03, 62G20.


Introduction
When comparing survival patterns between two treatment groups in a randomized clinical trial (RCT), it is often of interest to examine whether there is a uniformly higher survival function in one of the groups.For example, in a recent RCT involving patients with severe alcoholic hepatitis, the objective is to compare a combination therapy of prednisolone plus N-acetylcysteine with prednisolone alone.Testing whether the combination therapy has a consistently higher/lower survival probability (throughout the follow-up period) addresses the issue directly, as opposed to the practice of using an omnibus alternative (i.e., any difference between the survival functions), or using the log-rank test which detects ordered hazards instead of ordered survival functions.This paper develops such a testing procedure that allows us to establish an ordering between two survival curves uniformly over time.
Our approach is framed in terms of the classical notion of stochastic ordering.A survival function S 1 is said to be stochastically larger than another survival function S 2 if S 1 (t) ≥ S 2 (t) for all t ≥ 0, and denoted S 1 ≽ S 2 .When in addition there is strict inequality for some t, it is denoted S 1 ≻ S 2 .Here and in the sequel we implicitly assume that t ≥ 0 is further restricted to a given follow-up period [t 1 , t 2 ], a common practice in simultaneous inference of survival functions in censored data [see, e.g., 3,33].
Notice that the parameter space for (S 1 , S 2 ) can be split into four separate hypotheses: H 0 : S 1 = S 2 , H 1 : S 1 ≻ S 2 , H ′ 1 : S 2 ≻ S 1 , and H c : S 1 (u) > S 2 (u) for some u ≥ 0, S 1 (v) < S 2 (v) for some v ≥ 0, i.e., crossing survival functions.The problem then is to test the null hypothesis H 0 ∪ H c versus the alternative H 1 ∪ H ′ 1 of stochastic ordering based on independent right-censored random samples from S 1 and S 2 .Our proposed approach is to combine a test that the survival functions do not cross, with a test of stochastic ordering, under the assumption of no crossing.If the first test does not reject H c , we terminate the procedure and conclude no evidence for H 0 ∪ H 1 ∪ H ′ 1 , the cases of equal and stochastically ordered survival functions.If the first test rejects H c , we proceed to the second test of stochastic ordering.The composite procedure concludes the alternative of stochastic ordering if both of these tests reject.We will show in Section 2.3 that the family-wise error rate of the combined procedure can be controlled at the same alpha level as the individual tests.Also, if prior information precludes the possibility of a crossing (e.g., when comparing survival probability of early-and late-stage cancer patients), one can skip testing (1).
A feature of our proposed composite procedure is that it can be adapted easily to utilize prior information on the direction of stochastic ordering.This can result in more powerful testing than simply reading a simultaneous (twosided) confidence band for the difference S 1 − S 2 on [t 1 , t 2 ].For example, when comparing survival curves between patients with early-stage (S 1 ) and late-stage (S 2 ) cancer diagnoses, it is not biologically plausible to consider S 1 ≺ S 2 .We could utilize such prior information by testing only a one-sided alternative (see (3)) as H ′ 1 is not reasonable, providing greater power.In contrast, the usual confidence bands correspond to two-sided tests and can have lower power in testing one-sided alternatives.We could construct a one-sided confidence band for this purpose.But then the band alone cannot test for absence of a crossing, and a two-step method as our composite procedure is still needed.
Our test for the absence of a crossing (1) is a straightforward adaptation of a simultaneous confidence band for the difference S 1 − S 2 on [t 1 , t 2 ], which crosses the time axis when the survival functions cross.The most challenging part of the problem is to produce a test of (2), or even more simply for the one-sided alternative This test is tractable, however, given that S 1 and S 2 do not cross, since in that case it suffices to detect whether S 1 (t) > S 2 (t) at some t.Developing this second test constitutes the main part of the paper.A test of the two-sided alternative in (2) is then easily constructed using the union-intersection principle applied to the two one-sided test statistics for H 1 and H ′ 1 (i.e., (3) in each direction).Commonly used two-sample tests for censored data include the log-rank test and weighted Kaplan-Meier (WKM) tests [34], and these tests can be one-sided or two-sided.The log-rank test is based on an integrated weighted difference between hazard functions, and is thus designed to detect ordered hazards instead of more general stochastic ordering.Other tests based on weighted differences between hazard functions, such as the K-class of weighted log-rank statistics [17,16], also share this property.The WKM class of tests targets stochastically ordered alternatives by estimating an integrated weighted difference between survival functions, but such test statistics depend on an ad hoc weight function that needs to be specified throughout follow-up.
We derive our procedure for testing (3) using the empirical likelihood (EL) method.EL involves forming a ratio of two nonparametric likelihoods subject to constraints on the parameters of interest.The method originates with Thomas and Grunkemeier [35], who constructed pointwise confidence intervals for survival functions from right-censored data.EL has also been used to provide confidence regions for parameters defined by estimating equations [29,30], in numerous censored and uncensored settings.EL enjoys many appealing properties: highly accurate confidence regions, self-studentization and the possibility of Bartlett correctability.There is also evidence that EL-based tests have optimal power [see, e.g., 21].On the other hand, order restricted inference is known to be challenging for EL [see, e.g., 30, Ch.10], and much less has been done in this direction.El Barmi [12] explored EL tests for order-restricted hypotheses of the form g(η) ≤ 0, where g is some smooth function and η is a finite-dimensional parameter specified by estimating equations [see also 38].Other recent contributions in this direction have been made by Andrews and Guggenberger [2] and Canay [5].As for order restrictions on distribution functions, El Barmi and McKeague [13] studied EL-based tests for stochastic ordering, while Davidov et al. [7] investigated EL-based tests for likelihood ratio ordering under a semiparametric biased sampling model.However, these tests are limited to uncensored data.
As already mentioned, the main part of our proposed procedure is an ELtest for one-sided stochastic ordering (3).The idea is to construct a localized EL statistic for H t 0 : S 1 (t) = S 2 (t) versus H t 1 : S 1 (t) > S 2 (t) at each given t .The key step in this construction is to recast the stochastic ordering constraint into an inequality involving a single Lagrange multiplier.Then the proposed test rejects H 0 for large values of the maximally selected EL statistic.A maximally selected test statistic is used (as opposed to integral-type) because it is more sensitive to local differences between the survival functions.Kolmogorov-Smirnov type test statistics (not based on EL) for stochastic ordering have been proposed by El Barmi and Mukerjee [14] and Davidov and Herman [8].Besides localization, another possible approach might be to use the full nonparametric likelihood [10,31] and compute its ratio under S 1 ≻ S 2 versus S 1 = S 2 .However, we find the localization approach to be much more tractable.The localization approach has been used in Einmahl and McKeague [11], Davidov and Herman [9] and El Barmi and McKeague [13] for testing various nonparametric hypotheses, except they considered an integral type test statistic and restricted attention to uncensored data.Park et al. [32] proposed a localized NPMLE under stochastic ordering (for right-censored data), but its asymptotic distribution is not known, so it is unclear how a formal test could be developed using their approach.
Various ways of formulating EL in right-censored data settings have been proposed.The standard approach for censored data [35,23] maximizes the censored data likelihood subject to contraint(s) on the parameter of interest.Wang and Jing [37] instead used the nonparametric likelihood for uncensored data and plug-in of the Kaplan-Meier (KM) estimator of the censoring distribution.We use the former approach as it is tractable and more natural in our setting.There are in fact two different versions of EL for censored data, namely the binomial and Poisson versions [see, e.g., 26].We utilize the binomial version.
The paper is organized as follows.In Section 2.1 we set up the general framework and notation to be used throughout the paper.The main focus of our procedure, the two-sample test for stochastic ordering, is developed in Section 2.2.The first part of our procedure, the test that the survival functions do not cross, is then discussed in Section 2.3.Related tests are discussed in Section 2.4: stochastic ordering in the null hypothesis, a test for crossing survival functions, and an integral type statistic.Section 3 presents the results of a simulation study: the proposed two-sample EL test is shown to outperform the log-rank and WKM tests under different stochastically ordered alternatives, including alternatives with crossing hazards.We have also shown the effectiveness of our combined procedure in ruling out H c when testing for stochastic ordering.Application of the proposed test to the RCT mentioned earlier is given in Section 4, and some concluding remarks are placed in Section 5.

Preliminaries
We introduce notation for the one-sample setup, then add a further subscript j indicating the j-th sample (j = 1, 2) for the two-sample case in the corresponding notation.Let X i and C i for i = 1, . . ., n be i.i.d.from unknown survival functions S and G, respectively; only min (X i , C i ) and I(X i ≤ C i ) are observed.The lifetimes X i and the censoring times C i are assumed to be independent.Also, S(0) = G(0) = 1.Order the uncensored lifetimes as 0 < T 1 < . . .< T m < ∞.For each T i (i = 0, . . ., m) , let r i be the number alive just before T i , d i be the number of deaths at T i and h i be the hazard at T i .Let N (t) be the number of observed lifetimes that are less than or equal to t.Then the nonparametric likelihood (depending on the unknown survival function) supported on the observed lifetimes is proportional to for This variance can be consistently estimated by the well-known Greenwood formula, Ŝ2 (t)σ 2 (t), where σ2 For the two-sample case, the nonparametric likelihood (denoted as L(S 1 , S 2 )) is proportional to L(S 1 )L(S 2 ) by independence between the two samples.The sample proportion n j /n is assumed to converge to some p j > 0, where n = n 1 + n 2 .The σ2 (t) now equals the weighted average n{σ

Two-sample test for stochastic ordering
Now we develop the main part of our combined procedure, the two-sample test for stochastic ordering.As described in the Introduction, we focus on the one-sided test for (3), then construct a test of the two-sided alternative in (2) using the union-intersection principle applied to the two one-sided test statistics for H 1 and H ′ 1 (i.e., (3) in each direction).Consider the "local" hypotheses H t 0 : S 1 (t) = S 2 (t) versus H t 1 : S 1 (t) > S 2 (t) for a given t, and the EL ratio where we use the conventions sup ∅ = 0 and 0/0 = 1.Note that the numerator and denominator of R(t) maximize L(S 1 )L(S 2 ) over (h 11 , . . ., h m11 , h 12 , . . ., respectively.We solve this constrained maximization problem using the Karush-Kuhn-Tucker (KKT) method [4], a generalization of the Lagrange method that allows inequality constraints.As the constraints are placed only on the lifetimes up to t, the terms after t turn out to be the same in both the numerator and denominator and thus cancel out.Also, for some t the maximum is attained on the boundary of the constraint set, in which case R(t) = 1.Specifically, in Appendix A we establish the following expression for the EL ratio: where hij = d ij /r ij , ĥij = d ij /(r ij + (−1) j−1 λ), and the Lagrange multiplier λ is determined by the equality in ( 6) with h ij replaced by ĥij .Here we have suppressed the dependence of λ and ĥij on t.Based on the above expression, we can derive large sample properties of the local EL test statistic, −2 log R(t).This is done by approximating −2 log R(t) via a Taylor expansion as a function of the difference between log Ŝ1 (t) (recall from Section 2.1 that Ŝ(t) is the KM estimator) and log Ŝ2 (t).We then make use of asymptotic properties of Ŝj (t) (j = 1, 2) to establish the weak convergence of −2 log R(t).The asymptotic null distribution turns out to be chi-bar square Namely, for t such that 0 < S 0 (t) < 1 and , where Z ∼ N (0, 1) and Z + = max(Z, 0).This result can be used to test the local hypotheses H t 0 versus H t 1 .
To test for the alternative of stochastic ordering, we propose the following maximally selected EL statistic: where 0 < t 1 < t 2 < ∞ are to be specified.We suppress the dependence of K n on t 1 and t 2 .Guidance on the choice of [t 1 , t 2 ] is provided later.Our first result gives the asymptotic null distribution of K n (see Appendix B for the proof).
Theorem 1. Suppose H 0 holds and the common survival function S 0 is continuous.For t 1 and t 2 satisfying S 0 (t 1 ) < 1 and S 0 (t 2 )G j (t 2 ) > 0 for j = 1, 2, where B is a standard Brownian bridge on [0, 1], B + = max(B, 0), To implement the test, we pre-specify one of the intervals [t However, b is unknown, so one of the two intervals has to be estimated.
We can choose [t 1 , t 2 ] based on the smallest and the largest observed lifetimes [see, e.g., 3] or some other biological considerations [33], and then estimate [x 1 , x 2 ] (by [x 1 , x2 ] say).But we cannot tabulate critical values in advance, because [x 1 , x2 ] varies across different data sets.In this case, instead of using the tabulated critical values, R code supplied in a supplementary file [6] can be used to compute the critical value based on [x 1 , x2 ].
On the other hand, pre-determining [x 1 , x 2 ] allows "universal" critical values, and this is the approach we take.Both the choice of [x 1 , x 2 ] and details of implementation will be provided in the next subsection.

Calibrating the test
This section discusses issues in calibrating the test.The first one is the choice of [x 1 , x 2 ].Secondly, having chosen [x 1 , x 2 ], we explain how to estimate [t 1 , t 2 ] and implement the proposed EL test.Justification for this calibration procedure will be provided in Appendix C, where a statistic K * n is defined for K n with estimated [t 1 , t 2 ].Critical values for the test are then obtained via simulation in Section 3.
The choice of [x 1 , x 2 ] is important because the interval width can affect power of the EL test.In a similar context, this issue has been discussed by Davidov and Herman [8]; they proposed a (non-EL-based) test of stochastic ordering for uncensored data via localization, and point out that a narrower [x 1 , x 2 ] gives smaller critical values, but may fail to capture deviations (from H 0 ) outside the interval.We have investigated their recommendation [x 1 , x 2 ] = [0.2,0.8] in a simulation study (see Section 3 and Table 4), and found that the performance is not very sensitive to the choice of x 2 , so our preference is to choose x 2 close to 1 to utilize the local statistics on the right tail.Our simulation study (in Section 3) shows that the choice x 1 = 0.2 and x 2 = 0.98 performs well in terms of balancing power and accuracy, and this is what we recommend in practice.
Having specified [x

Two-sided testing
The above one-sided test for stochastic ordering has an immediate extension to a two-sided test for (2), as needed for the second part of the composite testing procedure described in the Introduction.The two-sided alternative in (2) is the union of the two one-sided alternatives (S 1 ≻ S 2 or S 2 ≻ S 1 ).Based on the union-intersection principle, the test statistic is the maximum of the two onesided test statistics.The asymptotic null distribution of this test statistic is where B is a standard Brownian bridge, as in Theorem 1.The test can therefore be calibrated in much the same way as we did for the one-sided test.

Crossing survival functions
As explained in the Introduction, if there is no prior information that excludes the possibility of crossing survival functions, we conduct the first part of our combined testing procedure.It calls for a consistent test of (1), and this can be done using a 100(1 − α)% simultaneous confidence band for the difference S 1 − S 2 , and showing that the asymptotic level of the resulting test is bounded above by α.The follow-up interval [t 1 , t 2 ] to be used in this connection will be specified in a later section.A band for the ratio S 1 /S 2 could be used in a similar fashion [see, e.g., the EL band given by 24], but here for simplicity we only consider the difference approach.
Consider the random multiplier bootstrap band B n for S 1 − S 2 developed by Parzen et al. [33].Intuitively, the data would support the presence of crossing (H c ) when the lower boundary of B n is > 0 at some time point (implying S 1 (u) > S 2 (u) for some u ≥ 0), and its upper boundary is < 0 at another time point (implying S 1 (v) < S 2 (v) for some v ≥ 0).Therefore, the opposite should lead to rejection of the null hypothesis H c of a crossing: if the lower boundary of B n is ≤ 0 or its upper boundary ≥ 0.
Note that B n is centered on the difference Ŝ1 − Ŝ2 of the KM estimators, and the results of Parzen et al. [33] imply (under the same conditions assumed here) that it has coverage P (S 1 − S 2 ∈ B n ) → 1 − α, and maximal width This leads to an asymptotic level α test as follows.Under To obtain the family-wise error of conducting this test of (1) along with the test for stochastic ordering (2), we appeal to the partitioning principle of Finner and Strassburger [15].This principle holds when the null hypotheses are disjoint (in our case H c and H 0 are indeed disjoint), and shows that the level of each test can be chosen to be the same as the desired family-wise error rate (α).

Stochastically ordered null
We have developed our test for stochastic ordering (3) under the null hypothesis S 1 = S 2 and under the assumption that S 1 and S 2 do not cross.Here we describe how our approach can be extended to the stochastically ordered null hypothesis S 1 ≼ S 2 under the same assumption (i.e., testing where the denominator maximizes over the union of the local (null and alternative) hypotheses and results in no constraint on S 1 (t) and S 2 (t).Since the KM estimator is the NPMLE, if Ŝ1 (t) ≤ Ŝ2 (t) the numerator of R ′ (t) coincides with the unconstrained maximum and thus equals the denominator.If Ŝ1 (t) > Ŝ2 (t), it can be shown that the numerator attains its maximum on the boundary S 1 (t) = S 2 (t) of the constraint set (using log-concavity of (4)).We then have Thus R ′ (t) = R(t) by ( 7), since λ ≥ 0 is the same as Ŝ1 (t) ≤ Ŝ2 (t) by Appendix A. Hence K n does not change under this broader null hypothesis.The same calibration method can be used to obtain an asymptotic level α test because where c α is the upper α-quantile of the limiting distribution in Theorem 1.

Detecting crossing survival functions
In some applications it can be of interest to test for crossing survival functions, i.e., reversing the null and alternative hypotheses in (1).This can be done by carrying out the one-sided test of Section 2.2 in both possible directions.The reason is here the parameter space for the one-sided tests allows for crossing, so that the test of Section 2.2 is interpreted instead as testing S 1 (t) ≤ S 2 (t) for all t versus S 1 (t) > S 2 (t) for some t, based on Section 2.4.1 and the union-intersection principle.If both tests reject, then there is evidence of crossing survival functions.Then, using the intersection-union principle, we take the minimum of the two one-sided test statistics as the test statistic.The R code (provided online) for implementing the one-sided test is readily adapted for this purpose, with critical values obtained from simulating min sup where B − is the negative part of the Brownian bridge B.

An integral type statistic
An integral type EL statistic could be developed as well, as an extension of the integrated statistics provided by El Barmi and McKeague [13] for uncensored data.However, it is challenging to find a suitable integrator that (a) is interpretable, (b) leads to an easily calibrated test.For example, a direct extension of their integrator F to the censored case (i.e., using the Kaplan-Meier estimates), as far as we know, will not lead to an asymptotically distribution free test statistic as our Lemma 3; so this extension does not satify criterion (b).
We have also tried using the following test statistic: with the very unintuive integrator σ2 (t)/(1 + σ2 (t)).The integrator is chosen so that the limiting distribution is the same as the asymptotic null distribution in El Barmi and McKeague [13] with the [x 1 , x 2 ] restriction.However, the integrator is weighting the local EL statistics in a way that is difficult to interpret (criterion (a) violated), and is rather ad hoc and needs to be specified throughout follow-up, just like the WKM class of tests.Due to the undesirable properties of these integrated statistics, we do not pursue this direction further.In comparison, our proposed maximally selected EL statistic does not need to be weighted throughout follow-up and the test is easily calibrated.

Simulation study
In this section, we report the results of a simulation study to evaluate various aspects of the proposed method.We start with the second (i.e., main) step in the combined procedure.We restrict our attention to one-sided tests, but results for the two-sided tests are similar.We first tabulate selected critical values, and then compare the performance of K * n with the (one-sided) log-rank and WKM tests in terms of accuracy and power.Finally we assess the performance of the combined procedure under models of equal and crossing survival functions, and most importantly, stochastic ordering.

Critical values and accuracy
Quantiles of the limiting distribution in Lemma 3 of Appendix C are used as critical values for K * n .These are computed by simulation based on 100,000 replications of standard Brownian bridge over a fine grid on [0, 1] (100,000 equidistant points), for selected values of x 1 and x 2 (see Table 1).
To compute empirical significance levels, we simulate lifetimes from the piecewise exponential distribution displayed as solid line in upper left panel of Figure 1.We consider exponential censoring distribution: , where θ is chosen to give a censoring rate (CR) of 10% or 25%.Our one-sided EL statistic (K * n ) is compared with the one-sided log-rank statistic.Another class of tests for comparison is the one-sided WKM, and we follow recomendations of Pepe and Fleming [34] and select the WKM statistic with the pooled variance estimator and the weight function denoted by ŵc (t) in their paper.
Results on the size of our EL test are given in Table 2, where we use [x 1 , x 2 ] = [0.2,0.98].The test is slightly conservative in small samples but approaches the nominal level as the sample size increases.Such conservativeness has been seen in other maximal deviation-type statistics for stochastic ordering [8].The empirical significance levels of the one-sided log-rank test and the WKM test under the same settings are closer to the nominal level, but sometimes on the anticonservative side.

Power comparisons
In this section, we compare the small sample power of the proposed test with the one-sided WKM and log-rank tests.Two models of lifetime distributions are considered, both with piecewise-constant hazards.In Model A, the hazard functions cross while the survival functions still remain stochastically ordered (see Figure 1, first column).In this case, the one-sided log-rank test can fail to detect the difference between the survival curves because it is designed to detect ordered hazards.In Model B, the two groups have different hazards initially but the same hazard later on, so the difference between the survival functions gradually wears off (see Figure 1, second column).This is a common phenomenon which is also seen in our real data example in Section 4. For both models, we consider exponential and uniform censoring distributions: or Uniform(0, c 1 ), with θ 1 or c 1 chosen to give a CR of 10% or 25% for group 1.
Results are given in Table 3 for outperforms the other tests in all the cases considered, especially in the crossing hazards scenario (Model A).The much lower power of WKM in Model A is surprising, because this test were shown to work well under crossing hazard alternatives in some previous simulation examples [34].The superior performance of our test may be due to two factors: first, our test is based on nonparametric likelihood, so it can be expected to be more powerful than tests that depend on an ad hoc weight function; second, we are using a maximal deviation-type statistic, rather than a weighted average, so our test may be more sensitive to local differences in the survival functions.We have also investigated power under proportional hazards configurations, and our test closely matches the performance of the log-rank and WKM tests (see supplementary tables).These results show that for stochastically ordered alternatives, the proposed EL test can compete effectively with the log-rank and WKM tests, especially when the hazard functions cross.Table 4 gives size and power for various choices of x 1 and x 2 reflecting light or heavy truncation.It is clear from the last two rows that light truncation on the left results in both poor accuracy and power compared with the top row, which corresponds to our recommendation [x 1 , x 2 ]=[0.2,0.98].Yet the performance is not very sensitive to the choice of x 2 , so our preference is to choose x 2 close to 1 in order to reduce truncation.

Combined procedure
Finally we conducted simulations for the combined procedure described in the Introduction: (1) testing the survival functions do not cross, and (2) testing stochastic ordering under the assumption of no crossing.The goal is to see if the combined procedure has good performance under models of equal and crossing survival functions, and most importantly, stochastic ordering.Size and power for various choices of x 1 and x 2 based on 10,000 replications, α = 0.05, n 1 = n 2 = 50, and exponential censoring with censoring rate 10%.Model A: survival functions as in Figure 1, upper left panel.Model B: survival functions as in Figure 1, upper right panel.For size, only the solid survival functions are used.
x Three models of lifetime distributions are considered: the one from Section 3.1 under H 0 , the Model A from Section 3.2 under H 1 , and a new Model C under H c (see Figure 2).Decisions are labeled as "crossing" if H c is not rejected in testing (1), "equality" if H c is rejected but H 0 is not rejected in testing (2), and "stochastic ordering" if both H c and H 0 are rejected.The empirical proportion of decisions from the combined procedure is then reported.
The results are summarized in Table 5.Our combined procedure is effective in correctly identifying H 0 , H c and most importantly, H 1 .The results from Model A, in particular, shows the ability of our combined procedure to rule out H c in testing for stochastic ordering.

Application
A RCT for treatment of severe alcoholic hepatitis [28] is analyzed.The data are obtained by digitizing the published KM curves and reconstructing survival and censoring information using the algorithm developed by Guyot et al. [18].The purpose of the trial was to assess whether a combination therapy of prednisolone plus N-acetylcysteine is better than prednisolone alone (the currently recommended treatment).A total of 174 patients were randomized to taking the combination (n 1 = 85) or only prednisolone (n 2 = 89), and the primary endpoint is their 6-month survival.The KM curves (see the top panel of Figure 3) suggest a stochastic ordering between the two groups.
The case of crossing survival functions is precluded via a rejection of H c in testing (1) in our composite procedure.Application of the one-sided EL test indicates that the combination therapy group has stochastically larger survival pattern than patients receiving only prednisolone (K * n = 10.36,p = 0.018).In comparison, the WKM and the one-sided log-rank tests yield p-values of 0.021 and 0.037, respectively.Examining the cumulative hazards plot (see the bottom panel of Figure 3), we can see that the slopes (i.e.hazards) of the two curves only differ noticeably during the initial 40 days.Such a scenario of an initial hazard difference has been considered in Model B of Section 3.2, where we show our EL test is better adapted to detecting a difference between the two treatment groups.
Nguyen-Khac et al. [28] actually used the two-sided log-rank test and reported a p-value of 0.07.They concluded that the combination therapy does not improve the 6-month survival.In contrast, our two-sided EL test shows that the two treatment groups are significantly different and there is a uniformly higher survival function in one of the groups (p = 0.036, computed by the supplementary R program that implements the two-sided EL test).In this case the EL test shows a more significant result that leads to a completely different conclusion than the log-rank test.

Discussion
We have developed a class of EL-based tests for both one-and two-sided stochastically ordered alternatives under right censoring.The procedure involves first checking that the survival functions do not cross.The proposed test statistic for one-sided stochastic ordering is a maximally selected local EL statistic and is shown to be asymptotically distribution-free.The test statistic for two-sided stochastic ordering is taken as the maximum of the two one-sided test statistics.A simulation study shows that our test can be much more powerful than the log-rank and WKM tests under alternatives with crossing hazards.We applied our test to a RCT involving patients with severe alcoholic hepatitis and found a more significant result than the log-rank and WKM tests.
Our test statistics utilize a data-dependent interval [t 1 , t 2 ], much like the data-dependent weight-function used in integral-type tests based on hazard or survival functions.Such restriction has been used in simultaneous inference of survival functions in censored data [see, e.g., 27,3,33], such as Nair's equal precision confidence band [27] and empirical likelihood based confidence band [20].This cannot be avoided in procedures that are (asymptotically) based on standardized statistics, as far as we know.However, in contrast to methods that rely on the selection of a complete weight function throughout follow-up (e.g., the WKM test), it is actually much easier and more transparent to select just the two tuning parameters (x 1 and x 2 ) needed in our case.Although t 1 and t 2 could be specified using a data-dependent rule (such as 5% of the data in each tail), this approach would have the disadvantage of needing tailor-made critical values for each dataset.In this case, instead of using the tabulated critical values in Table 1, one can use the supplementary R code to compute a critical value based on [x 1 , x2 ].
Our test targets stochastically ordered alternatives through construction of a nonparametric likelihood ratio (EL).It can be expected to be more powerful than commonly used two-sample tests that either are not tailored for such alternatives or depend on an ad hoc weight function.When combined with a test for the absence of a crossing ( 1), it provides more information about the nature of the difference between S 1 and S 2 compared to the omnibus alternative S 1 ̸ = S 2 , in which case the functional parameters S 1 and S 2 may be ordered in one direction at certain time points, but ordered in the reverse direction at other time points.
Our central contribution is the development of the first EL-based test for ordered survival functions in right-censored data settings, and we envision the test to be useful in clinical trials, in reliability engineering, and health policy applications.It would also be of interest to extend our approach to allow the testing of stochastic ordering in k-sample censored data settings, and to explore how it could be used for other types of ordering between distributions, such as increasing convex ordering, likelihood ratio ordering and uniform stochastic ordering (or hazard rate ordering).Another direction is to generalize our approach to cover the situation with left-truncation.This can be done by using the empirical likelihood formulated in Li [22] (who considered the case of one sample, two-sided situation at one time point only), although a full derivation is well beyond the scope of this article.
Then from (14) we have This is exactly (7).We use the simplified notation ĥij and λ to replace ĥ0 ij and λ0 , respectively.
Next, we find an asymptotic expansion of −2 log R(t) as a function of λ(t).We begin, based on (7) ) for j = 1, 2. Using n j /n → p j , and the fact that λ(t) < 0 is equivalent to Ŝ1 (t) > Ŝ2 (t), we can combine the terms for j = 1, 2 and obtain This and (17) give the desired result.
Remark.Lemma 2 shows that −2 log R(t) is asymptotically equivalent to squaring the positive part of a scaled difference between the log of KM estimators from the two samples.The inclusion of only the positive part of the difference can be attributed to the stochastically ordered form of our alternative hypothesis.We have compared the small sample performance of K n and its counterpart based on this squared difference (results not shown), and it turns out the latter tends to be too conservative.The advantage of using the EL approach, as opposed to a test statistic derived from the first term in the expansion of Lemma 2, is that we expect higher-order accuracy [cf. 19].This is parallel to the parametric result in which the likelihood ratio test is asymptotically equivalent to the Wald test, but the former has better higher-order accuracy [see, e.g. , 25].
We now complete the proof of Theorem 1.
We first obtain the weak convergence of −2 log R(t) as a process on [t

Fig 1 .
Fig 1.The piecewise exponential survival functions (top row) and the hazard functions (bottom row) in Model A (first column): S 1 (solid) and S 2 (dashed), and in Model B (second column): S 1 (solid) and S 2 (dashed).

Fig 3 .
Fig 3. Estimates of survival functions (top) and cumulative hazards (bottom) for prednisolone plus N-acetylcysteine (solid line) versus prednisolone alone (dashed line).
1 , x 2 ], we need to estimate [t 1 , t 2 ].Under suitable conditions on b −1 , t l can be consistently estimated by b−1 (x l ) = inf{t : b(t) ≥ x l } for We can then compute K * n accordingly, based on the estimates t1 and t2 .To ensure stability of K * n in small samples, we consider only values of −2 log R(t) inside the interval formed by the smallest and the largest observed lifetimes, [max(T 11 , T 12 ), min(T m11 , T m22 )].Such restriction has been used in simultaneous inference of survival functions in censored data [see, e.g., 3].This leads to considering only t ∈ [max( t1 , T 11 , T 12 ), min( t2 , T m11 , T m22 )].Note that this modification makes no difference asymptotically, since [max(T 11 , T 12

Table 1
Critical values for K * n for selected x 1 , x 2 and α.

Table 2
Empirical significance levels based on 10,000 replications.

Table 3
Power at α = 0.05 based on 10,000 replications.Model A: survival functions as in Figure1, upper left panel.Model B: survival functions as in Figure1, upper right panel.

Table 5
Proportion of decisions from the combined procedure based on 10,000 replications, α = 0.05, n 1 = n 2 = 80, and exponential censoring with censoring rate 10%.Model A: survival functions as in Figure1, upper left panel.Model A.0: solid survival function in Figure1, upper left panel.Model C: survival functions as in Figure2.
, by writing −2 log R(t) as i≤Nj (t) 1 , t 2 ], based on Lemma 2 and large sample properties of the KM estimator.Then by a transformation of the limiting process and the continuous mapping theorem, we get the limiting distribution ofK n .convergence[−2 log R(t), t1 , t2 ] T d −→[U 2 + (t)/σ 2 (t), t 1 , t 2 ] T in D[t 1 , t 2 ] × Θ 2 [see, e.g., 36, Theorem 18.10 (v)].Then applying a similar argument in the last part of the proof for Theorem 1 and the continuous mapping theorem, we get the desired result.