A nonparametric test for equality of survival medians using right-censored prevalent cohort survival data

The median is a robust summary commonly used for comparison between populations. The existing literature falls short in testing for equality of survival medians when the collected data do not form representative samples from their respective target populations and are subject to right censoring. Such data commonly occur in prevalent cohort studies with follow-up. We consider a particular case where the disease under study is stable, that is, the incidence rate of the disease is stable. It is known that survival data collected on diseased cases, when the disease under study is stable, form a length-biased sample from the target population. We fill the gap for the particular case of length-biased right-censored survival data by proposing a large-sample test using the nonparametric maximum likelihood estimator of the survivor function in the target population. The small sample performance of the proposed test statistic is studied via simulation. We apply the proposed method to test for differences in survival medians of Alzheimer’s disease and dementia groups using the survival data collected as part of the Canadian Study of Health and Aging.

Recently, Rahbar et al. proposed a hypothesis testing procedure for comparing population medians using right-censored failure time data which does not require the location family assumption. 4 Using the asymptotic distribution of the Kaplan-Meier estimator and an application of the functional delta method, Rahbar et al. derived a quadratic test statistic with an asymptotically chi-squared distribution. 7 In this article, we propose an extension to the method of Rahbar et al. for the case in which the failure time data are both length-biased and right-censored using the nonparametric maximum likelihood estimator (NPMLE) of the unbiased survival function. 8 We note that when the observed failure times are length-biased, the censoring mechanism is informative yielding an NPMLE which places mass on both the observed failure/censoring times of the prevalent cohort study. [8][9][10][11] Thus, the unbiased survival function estimator is fully defined over the entire real line establishing the existence of the median point estimate. While other product-limit estimators for left-truncated right-censored failure time data may be applied (see, e.g. Huang and Qin, 12 Luo and Tsai, 13 and Tsai et al. 14 ), the survival function estimates may not yield a median point estimate when censoring is high. Although there may be instances in which the median point estimate exists when utilizing the bootstrapping procedure of Efron, some bootstrap samples may not yield median estimates requiring them to be discarded. 5 In addition to ensuring the existence of the median point estimate, our use of the NPMLE of the unbiased survival function is motivated by its efficiency compared to all other nonparametric estimators. 9 In Section 2., we define the notation for prevalent cohort studies with follow-up resulting in length-biased right-censored failure time data and describe how the asymptotic representation of the NPMLE of the unbiased survival function may be utilized to define a statistic for testing differences between multiple population medians. In Section 3., we generate simulated failure time data from prevalent cohort studies with follow-up to assess the performance of the proposed testing procedure and in Section 4., we apply our testing methodology to the Canadian Study of Health and Aging (CSHA) dataset to compare median survival between various dementia/Alzheimer's disease groups. We provide some extensions and future research directions in Section 5.

Notation and methodology
In a general prevalent cohort study with follow-up, subjects are screened and only diseased cases are followed until failure/ censoring for which their respective onset dates are recorded retrospectively. 15 Let the fixed constant R denote the date at which subjects are tested for the prevalence of the condition/disease under study (prevalence day) and let O < R denote a generic onset time generated from a stationary Poisson point process. Let T be the underlying failure random variable with distribution function F U (·) = 1 − S U (·) and mean μ. A subject is included in the prevalent cohort study if T > R − O = A (i.e. they survive past prevalence day) where they are subsequently followed until the end of the study or they are lost to follow-up. Let C denote the censoring time measured from R with distribution function G(·). Thus, the prevalent cohort consists of the triples . . , n} made up of the observed truncation times, failure/censoring times and associated indicator functions. When the additional stationarity assumption is made for the underlying onset process of the prevalent cohort data (i.e. the underlying onset dates are drawn from a stationary Poisson point process) the observed failure times are typically referred to as being "length-biased." For details on testing the validity of the stationary onset process assumption, see the literature. [16][17][18][19] Let F LB (·) = 1 − S LB (·) denote the length-biased distribution function for which it may be shown 8 : In this setting, it is sufficient to only use the observed failure/censoring times without the observed truncation times in order to make inferences regarding the unbiased survival function. For a set of length-biased right-censored failure time data, Asgharian and Wolfson established uniform consistency and weak convergence for the NPMLE ofŜ LB and showed where U is a Gaussian process. 9 Using the relationship between the unbiased and length-biased survival functions given in equation (1), Asgharian et al. established uniform consistency and weak convergence for the NPMLE of the unbiased survival functionŜ U . 8 In particular, they established the following weak convergence result: where L y (x) = I [y,∞] (x) − S U (y)/x. As the operator acting on the limiting Gaussian process ofŜ LB is a bounded linear operator, it may then be shown that U * (y) is also a Gaussian process. 20(p. 377) Using similar notation as Rahbar et al., consider the setting in which k ≥ 2 independent populations are sampled for which the observed samples consist of length-biased failure/censoring times. 4 We utilize the subscript j, j = 1, . . . , k, to differentiate the parameters/estimators between the k populations/samples, respectively. For j = 1, 2, . . . , k, we define the median as: for which the sample median estimator,θ U ,j is determined via plugging in the estimator ofŜ U ,j in equation (2). 4 We note that the median estimators for the different groups will always exist as the corresponding unbiased survival function estimators, S U ,j , under the stationarity assumption, are fully defined over the entire real line. This is a direct result of the informative censoring mechanism which forces non-zero probability mass to fall on the observed censoring times. 21 Moreover, as the observed length-biased data are collected from a prevalent cohort study with follow-up through a cross-section at prevalence day, the subjects' onset dates may occur any time arbitrarily far before prevalence day. This implies that if the follow-up period is short, resulting in a large proportion of the observed failure times being right-censored, the NPMLE of the S U ,j functions will be unaffected. For details on the NPMLE of S U ,j without the stationarity assumption, see Tsai et al. 14 For group j, j = 1, 2, . . . , k, it may be shown that the sample median estimator converges to a Gaussian random variable: We summarize the necessary conditions of this result with a sketch of its proof in Theorem 1 and Corollary 4 in the Appendix. Estimates for σ 2 j , j = 1, 2, . . . , k may be obtained by applying the simple bootstrap procedure of Efron to each of the k samples by resampling the pairs (Y i , δ i ), i = 1, 2, . . . , n j , j = 1, 2, . . . , k. 5 Let θ U,1 , . . . , θ U ,k be the population medians and let n 1 , . . . , n k be the sample sizes for each of the k samples. Consider the null hypothesis of equal medians for each of the k groups: Following the approach of Rahbar et al., we define the combined median estimator as: for which k j=1L j = 1. 4 Denoting the combined sample size N = n 1 + · · · + n k , Rahbar et al. proposed the following quadratic test statistic:Λ whereΓ N is a k × k matrix with diagonal entries given bŷ and off-diagonal entries given bŷ . . , k. 4 As remarked by Rahbar et al., the matrix Γ N is not necessarily invertible and so the generalized inverse may be utilized instead of expression (4). Using the asymptotic normality of the sample median estimators,θ U ,j , j = 1, 2, . . . , k, Λ N may be shown to weakly converge to a central χ 2 distribution with k − 1 degrees of freedom (see Theorem 2 of Rahbar et al. 4 ). The percentiles of the central χ 2 distribution may be utilized to test for significance when conducting the equality of medians hypothesis test.

Length-biased/unbiased survival function median comparisons
Although the proposed testing procedure is based on determining differences between the unbiased survival function medians, a question that may be posed is whether the correct conclusion will be reached when testing for equality of the length-biased population medians instead. Using the weak convergence of the length-biased survival function NPMLEŜ LB (·), an equivalent form of the test statistic in equation (4) may be formed to test the null hypothesis: . . , k are the population medians of the length-biased survival functions. For simplicity of exposition, consider the case with k = 2 in which the medians of two survival distributions are being compared. Addona examined different types of group comparisons between unbiased and length-biased survival curves and found that in the two-sample setting, there exist examples in which θ LB,1 < θ LB,2 , however, θ U ,1 > θ U ,2 (i.e. the ordering of medians is "reversed"). 22 In Figure 1, we illustrate a similar phenomenon using Weibull failure time data and show that the unbiased survival curves for two groups can have identical medians whereas the medians of the corresponding length-biased survival curves are clearly different. Therefore, without restriction on the family of underlying failure time distributions, it is not sufficient to test for differences in medians of the length-biased distributions and then subsequently make conclusions regarding the unbiased population medians. For related discussions on testing for equality of survival curves using lengthbiased failure time and adjusting for time-dependent selection mechanism/covariates, see Ning et al. 10 and Huiart and Sylvestre. 23

Simulations
To assess the performance of the proposed hypothesis test, we generated samples of length-biased right-censored failure time data from various distributions to control for the values of the unbiased population medians. Specifically, we considered three failure time distributions: uniform distribution with support (0, 2θ) (median equal to θ), the LogNormal distribution with parameters μ and 1 (median equal to e μ ) and the exponential distribution with rate parameter log (2)/λ (median equal to λ). To generate a sample of length-biased right-censored failure time data, we sampled an onset time, O, from a uniform distribution over (0, R) where R was chosen arbitrarily large such that P(T > R) ≈ 0. We then sampled a failure time T and retained both the onset, failure time pair if T > R − O, otherwise both sampled quantities were discarded. This procedure of accepting/rejecting the sampled (onset, failure time) pairs simulates the effect of left-truncation on the underlying failure time random variables. We censored the forward failure times T − (R − O) by drawing an exponentially distributed censoring time, C, with rate parameter ϕ where we recorded min (C, T − (R − O)) and whether the minimum was a failure/censoring time. The rate parameter ϕ was varied to allow for either 15% or 30% censoring.
We generated multiple length-biased right-censored failure time data sets and computed the empirical test sizes and empirical powers when the medians of the distributions were equal and different, respectively. We considered the case in which k = 3 where we allowed the underlying failure time distributions to be equal or different from each other. We set the sample sizes of the three groups equal to 150, 200, and 250, respectively, and performed the same simulations when all three samples were of size 50. We ran 500 simulation runs and for each run, we estimated the sample median variances using 1000 bootstrapped samples. We reported the empirical test sizes in Table 1 and the empirical powers in Table 2. We find that for the majority of cases when the sample sizes are large, the empirical test sizes are similar to the true test significance level of α = 0.05. When the sample sizes of the three groups were small, the empirical sizes increased in most cases. In the cases involving the exponential distribution, we attribute the larger empirical test sizes to the difficulty in drawing a sufficient number of observations arbitrarily close to time t = 0. In Table 2, we find that when the sample sizes are large, the empirical powers are in most cases above 0.50 and when the sample sizes are small, the empirical powers are diminished. This phenomenon is to be expected as with smaller sizes, the variability of the estimators' increases and it becomes more challenging to detect any differences between the underlying population medians.
To assess how the power of the hypothesis test varied depending on the group sample sizes, we simulated length-biased right-censored failure time data from three groups with different medians (Uniform: 10, LogNormal: e 2 , Exponential: 20). We varied the censoring variable's rate parameter, ϕ, so that either 15% or 30% of the sampled length-biased failure times were right-censored. We computed the empirical powers over 500 simulation runs with 1000 bootstrapped samples per run where all three sample sizes were equal to 50, 100, 200, 300, 400, 500, and 1000. The empirical power curves are plotted in Figure 2. Generally, as the sample size of the individual groups' increases, the power of the test converges to 1 independently of the two different proportions of censoring between the three groups.

Application
The CSHA was a multicenter study of dementia and related geriatric conditions for individuals over the age of 65 in Canada. 24 In 1991, in the first stage of the CSHA, approximately 10,000 subjects living in a community or institutional residences were screened for various forms of cognitive impairment. At the second stage of the CSHA in 1996, the subjects who were still alive were screened a second time for cognitive impairment through a clinical evaluation. For the subjects who screened positive at the first stage of the CSHA, their onset dates were obtained through the recollections of their family members and caregivers and the dates of death were recorded for those subjects who died between the first and second stages of the CSHA. Through the screening tests and the clinical evaluations, the prevalent cohort of subjects (n = 823) consisting of 240 males and 583 females who screened positive at the first stage of the CSHA were classified Table 1. Empirical test sizes for simulated length-biased right-censored failure times of three groups drawn from underlying survival distributions with equal medians of θ = 10. Large sample sizes of three groups: n 1 = 150, n 2 = 200, n 3 = 250. Small sample sizes of three groups: n 1 = n 2 = n 3 = 50. as having either vascular dementia (n 1 = 173), possible Alzheimer's disease (n 2 = 252) or probable Alzheimer's disease (n 3 = 396). Subjects' survival times were (administratively) right-censored if they were still alive at the time of the second stage of the CSHA. Approximately 19%, 24%, and 21% of the subjects screened with vascular dementia, possible Alzheimer's disease, or probable Alzheimer's disease, respectively, were right-censored at the second stage of the CSHA.
Using the observed failure/censoring times for the three dementia/Alzheimer's disease groups of the CSHA, we calculated the unbiased survival function estimates and the associated unbiased survival function median estimates (see Figure 3). The medians for vascular dementia, possible Alzheimer's disease, and probable Alzheimer's disease groups (in months) were 39.9, 50.9, and 47.7, respectively. For other median estimates and estimates of the hazard ratio for death, stratifying by sex, education level, and age at onset of dementia, see Wolfson et al. 24 Using the simple bootstrap, we generated 10,000 bootstrap samples for each group and determined estimates of the sample median variances. Setting Table 2. Empirical powers of a hypothesis test for simulated length-biased right-censored failure times of three groups drawn from underlying survival distributions with varying medians. Large sample sizes of three groups: n 1 = 150, n 2 = 200, and n 3 = 250. Small sample sizes of three groups: n 1 = n 2 = n 3 = 50.  the significance level α = 0.05, using the bootstrapped estimates, we calculated the test statistic from equation (4), Λ n = 2.988, and did not reject the null hypothesis that individual dementia/Alzheimer's disease group medians are equal.

Discussion
The median estimator is a key summary statistic for determining the center of a failure time distribution and can be utilized to compare the failure time distributions of multiple independent groups. Using length-biased right-censored failure time data collected from a prevalent cohort study with follow-up, we proposed an extension to the Rahbar et al. testing procedure using the NPMLE of the unbiased survival function to compare the unbiased medians of k ≥ 2 groups/populations. We applied our testing procedure to simulated prevalent cohort data sets and to data collected from the CSHA. Throughout this article, we proposed a testing procedure for comparing the medians of multiple groups/populations using length-biased right-censored failure time; however, a closely related function to the median survival time is the quantile residual lifetime function (qrlt function). The qrlt function θ α (t) satisfies the equality P(T − t ≥ θ α (t)|T ≥ t) = α for all t ≥ 0 and some α ∈ [0, 1] and determines the conditional α quantile when T ≥ t. The median survival time may be regarded as a single point of the qrlt function when α = 0.5 and t = 0. Recently, Liu et al. utilized the asymptotic properties of the unbiased survival function NPMLE to develop confidence intervals for two test statistics based on the ratio and difference of two qrlt functions. 11 It is not immediately clear how to utilize the proposed test statistics in practice when performing a hypothesis test of equality of qltr functions. It remains an open problem how the proposed methodologies may be extended to the k > 2 sample setting in which multiple population qrlt functions are being compared.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The second author is supported by NSERC RGPIN-2018-05618.

Definition 1.
A stochastic process X (t) is "cadlag" for every point t 0 ∈ R if the process X (·) is continuous at t 0 from the right with limit existing at t 0 from the left. Specifically, for every point t 0 ∈ R, for all sequences {t 1n } ∞ n=1 , which converge to t 0 from the right, P( lim n→∞ X (t 1n ) = X (t 0 )) = 1 and for all sequences {t 2n } ∞ n=1 , which converge to t 0 from the left, P( lim n→∞ X (t 2n ) = X (t 0 − )) = 1.
Proposition 2. Let X (t) be a cadlag stochastic process and let g(t) be a continuous function ∀t ∈ R. The stochastic process X (g(t)) will also be a cadlag stochastic process.
Proof. Let t 0 be an arbitrary point in R. Then since X (·) is a cadlag stochastic process, either 1. X is continuous at t 0 from both the left and right such that for all convergent sequences {t n ∈ R : n = 1, 2, . . . }, which converge to t 0 , we have that P( lim n→∞ X (t n ) = X (t 0 )) = 1, or 2. X has a limit, which exists at t 0 from the left and X is continuous at t 0 from the right such that for all convergent sequences {t n ∈ R : n = 1, 2, . . . }, which converge to t 0 from the left and for all convergent sequences {t * n ∈ R : n = 1, 2, . . . }, which converge to t 0 from the right, we have that P( lim n→∞ X (t n ) = X (t 0 − )) = 1 and P( lim n→∞ X(t * n ) = X(t 0 )) = 1.