Comparison of methods to testing for di ﬀ erential treatment e ﬀ ect under non-proportional hazards data

: Many tests for comparing survival curves have been proposed over the last decades. There are two branches, one based on weighted log-rank statistics and other based on weighted Kaplan-Meier statistics. If we carefully choose the weight function, a substantial increase in power of tests against non-proportional alternatives can be obtained. However, it is di ﬃ cult to specify in advance the types of survival di ﬀ erences that may actually exist between two groups. Therefore, a combination test can simultaneously detect equally weighted, early, late or middle departures from the null hypothesis and can robustly handle several non-proportional hazard types with no a priori knowledge of the hazard functions. In this paper, we focus on the most used and the most powerful test statistics related to these two branches which have been studied separately but not compared between them. Through a simulation study, we compare the size and power of thirteen test statistics under proportional hazards and di ﬀ erent types of non-proportional hazards patterns. We illustrate the procedures using data from a clinical trial of bone marrow transplant patients with leukemia.


Introduction
The most commonly used two-sample nonparametric test statistic is the log-rank (LR) statistic, which is locally most powerful against proportional hazards (PH) alternatives.However, rank-based log-rank statistics may not be sensitive to certain types of stochastic ordering alternatives, particularly if the hazard functions cross or a delayed effect is observed.In reality, the PH does often not hold true as for example in some immuno-oncology therapies [1].As an alternative to generalized linear rank statistics, [2] proposed nonparametric weighted Kaplan-Meier (WKM) statistics, which are nonrank-based statistics and therefore sensitive to the magnitude and duration of the difference in survival curves over time.These test statistics seems to compare favorably with the popular log-rank test statistic across a broad range of stochastic ordering alternatives.
As a particular case of WKM, you get the restricted mean survival time (RMST) which is an alternative robust and clinically interpretable summary measure that does not rely on the proportional hazard (PH) assumption.[3] concluded that the RMST-based test has better performance than the logrank test when the truncation time is reasonably close to the tail of the observed curves under non-PH scenarios where late separation of survival curves is observed.
Under non-proportional alternatives, the effect of a treatment may wane over time, leading to a decreasing hazard ratio and a closing up of the two survival curves or may have a delayed effect whereby they do not separate until a certain interval of time has elapsed.As another more powerful alternative over the unweighted log-rank statistic, [4] proposed a class of adaptive weighted log-rank statistics.These tests may assign more weight on either early, middle or late survival differences with a suitable choice of weight function.A particularly useful family in the class of weighted log-rank statistics was introduced in [5] and [6].However, in many situations, it is difficult to know in advance the kind of survival differences that will actually occur.Therefore, [7] proposed the maximum of a finite collection of these weighted log-rank statistics trying to overcome the above problem.
Later, [8,9] proposed tests based on a more extended family of test statistics where the weights are governed by two parameters and combinations of them that are more sensitive to nonproportional hazard alternatives.Furthermore, [10] considered a new combination of these test statistics.On the other hand, [11] considered the same family of weighted functions for the WKM test statistic and evaluated the maximum of nine of these test statistics.[12] proposed a different weighted functions for these test statistics.Recently, [13] investigated weighted log-rank tests, the supremum log-rank test and composite tests (mainly based on log-rank tests).On the other hand, [14] studied the performance of nine tests.Seven out of nine are based on weighted log-rank and the other two are WKM and RMST.Both papers came to the conclusion that: there is not a single most powerful test across all scenarios but the maximum of several weighted log-rank tests seems to be the best choice.The former proposed the so-called MaxCombo test and the later the proposal studied by [10].Recently, [15] have performed additional simulations to evaluate the MaxCombo test under some extreme scenarios which are unlikely in real life.Futhermore, they have provided design and analysis considerations based on a combination test under different non-PH types and present a straw man proposal for practitioners.
To our knowledge, the class of maximum weighted log-rank tests and that of maximum weighted Kaplan-Meier tests have not been compared to each other.Therefore, we try to fill this gap as well as to compare them with the RMST which is easily interpretable.The paper is organized as follows: In Section 2, we describe the tests to be compared and our approach to get the critical points.A simulation study of the performance of the tests is presented in Section 3. In Section 4, we apply all the methods to a real data set from a clinical trial of marrow transplant patients with leukemia.We finish with a discussion in Section 5.

Methods
We assume the two-sample general random censorship model.Consider two censored samples with sample sizes n 1 from the treatment group and n 2 from the control group.Let T i j , i = 1, 2, j = 1, ..., n i , be independent, positive random variables and C i j be independent censoring variables that are also independent of the survival variables T i j .The data really observed are X i j , δ i j , i = 1, 2, j = 1, ..., n i , where X i j = min(T i j , C i j ) and δ i j = I T i j ≤ C i j , where I is the indicator function.Define survival functions S i (t) = P(T i j ≥ t) and C i (t) = P(C i j ≥ t), i = 1, 2, which are assumed to be absolutely continuous throughout the paper.The null hypothesis to test is that the survival distributions between two groups are the same, that is,

Weigthed Log-rank tests and combinations
The so-called weighted log-rank statistics are given by where N i (t) is the number of failures in group i before or at time t and Y i (t) the number at risk in group i at time t−, i = 1, 2, w(t) is a bounded nonnegative weight function and t c = sup{t/ Ĉ1 (t) ∧ Ĉ2 (t) > 0} where C i denotes the Kaplan-Meier estimator of the censoring survival function in group i.As the weight function is sensitive to the alternative hypothesis, it should be chosen carefully.[7] proposed w(t) = S (t−) with S (t−) being the left-continuous version of the product-limit estimator ( [16]) for the survival function based on the pooled survival data.We consider the family of test statistic given in (2.2) when the class of functions (2.3) are chosen, WLR λ,γ .In particular, λ = γ = 0 corresponds to the log-rank statistic, WLR 0,0 , which specifies equal weights and λ = 1 and γ = 0 corresponds to the Prentice-Wilcoxon statistic, WLR 1,0 , which places more weight on the earlier time points.In contrast, WLR 0,1 places more weight on the later time points and WLR 1,1 will emphasize the middle differences.[8] studied the maximum over the statistics WLR 0,0 , WLR 2,0 , WLR 0,2 and WLR 2,2 as well as their average.[9] studied the test statistics WLR 1,0 + WLR 0,1 /2, WLR 1,0 + WLR 0,1 /2 and max WLR 1,0 , WLR 0,1 .[10] and [13] considered WLR max3 = max WLR 0,0 , WLR 1,0 , WLR 0,1 .[14] focused on the so called MaxCombo test WLR max4 = max( WLR 0,0 , WLR 0,1 , WLR 1,0 , WLR 1,1 ).According to the conclusions of the papers that have studied these test statistics and combinations, we focus on maximum combination tests.To conduct inference on these test statistics, i.e. to calculate the p-value for testing (2.1), there are different methods.One is to obtain the asymptotic distribution of these test statistics as the maximum of a multivariate normal distribution with mean zero and covariance that can be consistently estimated by for i, j = 1, 2, 3, 4, ŵi (t), i = 1, 2, 3, 4 are the weight functions with λ = γ = 0; λ = 1, γ = 0; λ = 0, γ = 1 and λ = γ = 1, respectively and ∆N i (t) = N i (t) − N i (t−), the number of events at time t in group i.However, in this paper, to account for small and moderate sample sizes, bootstrap is used to obtain p-values as is explained in Section 2.3.

Weigthed Kaplan-Meier tests and combinations
[2] defined the weighted Kaplan-Meier statistics as to stabilize the test statistic and down weight its variance toward the end of the observation period if censoring is heavy.[11] proposed for λ ≥ 0 and γ ≥ 0 (2.4) and C i denotes the Kaplan-Meier estimator of the censoring random variables in group i and ŵ(t) is defined in (2.3).We shall call these test statistics WK M λ,γ .In particular, λ = γ = 0, WK M 0,0 , corresponds to the Pepe-Fleming's WKM test.[11] studied max i, j=0,1,2 WK M i, j .As a particular case, we have the difference in RMST ( [3]) To be fair the comparison of these test statistics with that based on the log-rank test, we will focus on their maximum combination tests counterparts.In particular, WK M max4 = max(WK M 0,0 , WK M 0,1 , WK M 1,0 , WK M 1,1 ) and WK M max3 = max(WK M 0,0 , WK M 0,1 , WK M 1,0 ).To conduct inference on these test statistics, i.e. to calculate the p-value for testing (2.1), the asymptotic distribution of these test statistics can be obtained as the maximum of a multivariate normal distribution with mean zero and covariance that can be consistently estimated by for i, j = 1, 2, 3, 4, and ûi (t), i = 1, 2, 3, 4 are the weight functions given in (2.4) for ŵi (t) defined in the previous section.However, in this paper, to account for small and moderate sample sizes, bootstrap is used to obtain p-values as is explained in Section 2.3.

Computing p-values using bootstrap
The p-value of our test statistics can be obtained using their asymptotic distributions either by numerical integration or Monte Carlo estimation of the multivariate Gaussian distribution.[11] pointed out the variance-covariance matrix may not have a close form solution and requires intense computation.Therefore, our proposal is to use Bootstrap instead of asymptotic distributions that has less computational cost and improve the precision of the asymptotic approximations in small and moderate samples as well as deal with analytically challenging problems in some cases.
We propose to calculate the p-values as follows: • Step 1: To obtain the value of the test statistic for the original sample, T 0 • Step 2: Draw B bootstrap samples from the original data, so n 1 + n 2 observations are obtained and consider the first n 1 as controls and the other n 2 as treatments then to obtain the value of the test statistic for the bootstrap sample, T B (i), i = 1, ..., B. • Step 3: Estimate the p-value by counting the number of T B (i) greater than T 0 .
We also offer R-code facilitating the implementation of the evaluation of the p-value of the test statistics in Supplementary.

Simulation study
We carry out a simulation study to compare the performance of these thirteen test statistics: WLR 0,0 , WLR 0,1 , WLR 1,0 , WLR 1,1 , WLR max4 , WLR max3 , WK M 0,0 , WK M 0,1 , WK M 1,0 , WK M 1,1 , WK M max4 , WK M max3 and D. The simulation design described in [9] has been considered but it is similar to that included in the other cited references such as [17] and [13].We used piecewise exponential models to generate simulated data with parameters λ i (t) from groups 1 and 2 set to represent common alternatives that might arise in real data.To be specific, in Figure 1, (a late survival differences, t < 0.5, λ 1 = 2, λ 2 = 2; t ≥ 0.5, λ 1 = 4, λ 2 = 0.4 and (e) early and late occurring survival differences, t < 0.2, In each case, the censoring distributions are uniform and the censoring percentages in the sample are 0%, 20%, 40% and 60%.Empirical size and power of the tests were evaluated considering α = 0.05 for sample sizes n 1 = n 2 = 20, 50, 70, 100 and 150.For each sample size, 2000 replications were performed for each configuration of survival and censoring distributions and in each one of them 500 bootstrap repetitions were performed.The simulations were conducted using R (https://www.r-project.org//RCore Team, Vienna, Austria), specifically, we simulate the data with the rpwexp function of the nhpsim package [18], which simulates the exponential distribution by parts and allows to specify a distribution where the risk rate changes with time, logrank.test and logrank.maxtestfunctions of the nph package [19] were used to calculate the statistics WLR 0,0 , WLR 0,1 , WLR 1,0 , WLR 1,1 , WLR max4 , WLR max3 and the survtest function of the SurvBin package [20] to calculate the statistics WK M 0,0 , WK M 0,1 , WK M 1,0 , WK M 1,1 and we implement a function to calculate the maximum weighted Kaplan-Meier test statistics, WK M max4 and WK M max3 .Finally, we calculate the D statistic with the rmst2 function of the survRM2 package [21].
As can be seen in Table 1, the simulated type I error of the thirteen test statistics for all sample sizes are close to the nominal, two-sided 5% significance level.Therefore, we can assess their power performance.The power of the thirteen test statistics for each simulation scenario is shown in Table 2. Figure   2 displays the powers under 60% censoring.Under proportional hazard configuration, the log-rank test (WLR 0,0 ) and Pepe-Flemings's weighted Kaplan-Meier test (WK M 0,0 ) have maximum power as expected.However, the power of WK M max3 and WK M max4 are really close.The D statistic is also competitive but under non-censoring.Therefore, these two combination tests have almost the same sensitivity as WLR 0,0 , i.e. the optimal test against proportional hazards alternative.
For early survival differences, as expected, WLR 1,0 and WK M 1,0 have greater power than the other statistics.Even though, WLR max3 fares poorer than them, it is the best among the rest of considered test statistics.In the case of the late survival differences, WLR 0,1 and WK M 0,1 have greater power than the others.Nevertheless, the difference in power between these and WK M max3 is small.Finally, in the case of early and late survival differences, there is not a clear more powerful test statistic but WK M max3 is consistently a good choice for all censoring and sample sizes.In summary, assuming ignorance of the likely treatment effect, except for early survival differences, the WK M max3 test statistic is preferable to the other maximum combination test statistics.For early survival differences, the WLR max3 test statistic exhibits a higher power than WK M max3 but the latter is higher than the well-known log-rank test.In fact, [14] showed a minimal power gain power for WKM and RSMT to the LR test when there is a delayed effect but WK M max3 has better performance than WKM and RMST.

Application to a real data set
We ilustrate the performance of the thirteen test statistics with a real data set, chosen because they appear to show late difference.The data is available in the package KMsurv [22] with the name bmt.The bmt data come from a study on the survival of bone marrow transplant patients with different types leukemia.The complete data frame has 22 variables and to carry out the study we worked with the following: disease free survival time, that is, the time either relapse or death (t2), censoring indicator, 1: relapse or death, 0: censored survival time (d3), disease group, acute lymphoblastic leukemia (ALL) consisting of 38 patients and low risk acute myeloid leukemia (AML) consisting of 54 patients.3 and exhibit a late difference in survival.The p-values associated to the proposed test statistics are summarized in Table 3, from which it can be seen that most of the test statistics reject the null hypothesis H 0 : S 1 = S 2 at the 0.05% level.However, D, WLR 0,1 , WLR max3 and WLR max4 yield a nonsignificant p-value.Although, WLR max3 and WLR max4 pvalues are in the borderline.Therefore, WK M max3 and WK M max4 yield stronger evidence against the null hypothesis of no difference between the two groups than WLR max3 and WLR max4 .Therefore, one can miss a potential regulatory submission opportunity if the primary analysis is carried out based on the well-known RMST or MaxCombo tests (WLR max4 or WLR max3 ).

Discussion
In practice, the log-rank test is the most used to test the equality of two survival distributions in the presence of censoring and it is known to have optimum power to detect a difference in survival distributions when proportional hazards holds.However, non-proportional hazard ratio between the control and experimental arms occurs fairly often in clinical trials.For example, the proportional hazards (PH) assumption often does not hold for the primary time-to-event (TTE) efficacy endpoint in trials of novel immuno-oncology drugs.For some immuno-oncology drugs, a delayed treatment effect is observed, where the survival curves for the immuno-oncology drug and standard of care are not separated initially but start separating after some period of time.In other oncology trials, the separation between the survival curves occurs early on but then the distance between the curves diminishes over time.There may also be cases where the survival curves cross each other, i.e., the PH assumption is violated.With a preconceived type of treatment effect, then the weighted log rank and weighted Kaplan-Meier test statistics are good alternatives to the log-rank test.However, if the direction of the violation from non-PH is not known, then a combination of the previous test statistics should be used.There are a number of papers studying these combinations based on weighted log rank and weighted Kaplan-Meier test statistics separately.Recently, [13] and [14] used Monte Carlo simulation to study the performance of nine different statistics based mainly on weighted log-rank tests.They conclude that a modified version of WLR max3 and WLR max4 , respectively are preferred in the absence of prior knowledge regarding the PH or non-PH patterns.[15] concluded that MaxCombo test shows clear advantages when a large treatment benefit emerges later on the trial.They introduced a design approach for confirmatory trials with WLR max4 and developed a stepwise and iterative approach for calculating sample size when the final analysis is based on this test statistic.Our conclusion is that WLR max3 is better than the MaxCombo test, WLR max4 and WK M max3 is preferable to WLR max3 in most of the cases for small and moderate sample sizes.Performance of the KM based methods is known that depend on the lenght of study period (t c ) and censoring pattern.We have minimum of the largest observed time in each of the two groups as a choice of t c .There are other issues such as the interpretation of the WLR test statistics result in terms of the clinical effect.[23] proposed a Cox-model based time-varying treatment effect estimate to complement the WLR test statistics.However, investigation of these topics is beyond the scope of the present paper.

Figure 1 .
Figure 1.Survival functions considered in the size and power simulations.

Figure 2 .
Figure 2. Simulated power of the statistic tests for 60% censoring.

Figure 3 .
Figure 3. Survival functions for bone marrow transplant patients.

Table 1 .
Simulated type I error.

Table 3 .
Test statistics and their correponding p-values.