survivalBIV: Estimation of the Bivariate Distribution Function for Sequentially Ordered Events Under Univariate Censoring

In many medical studies, patients can experience several events. The times between consecutive events (gap times) are often of interest and lead to problems that have received much attention recently. In this work we consider the estimation of the bivariate distribution function for censored gap times, using survivalBIV a software application for R. Some related problems such as the estimation of the marginal distribution of the second gap time is also discussed. It describes the capabilities of the program for estimating these quantities using four different approaches, all using the Kaplan-Meier estimator of survival. One of these estimators is based on Bayes’ theorem and Kaplan-Meier survival function. Two estimators were recently proposed using the Kaplan-Meier estimator pertaining to the distribution of the total time to weight the bivariate data (de Una-Alvarez and Meira-Machado 2008 and de Una-Alvarez and Amorim 2011). The software can also be used to implement the estimator proposed in Lin, Sun, and Ying (1999), which is based on inverse probability of censoring weighted. The software is illustrated using data from a bladder cancer study.


Introduction
In longitudinal studies of disease, patients may experience several events through a followup period.In these studies, the sequentially ordered events (and the gap times) are often of interest.The events of concern can be of the same nature (e.g., cancer patients can experience recurrent disease episodes) or represent different states in the disease process (e.g., alive and disease-free, alive with recurrence and dead).If the events are of the same nature, this is usually referred as recurrent events, whereas if they represent different states they are In this work we present four methods (estimators) for the bivariate distribution function of the gap times.One simple estimator is based on Bayes' theorem and Kaplan-Meier survival function.This estimator is related to that proposed in Lin et al. (1999) and with estimators proposed by de Uña-Álvarez (de Uña-Álvarez and Meira-Machado 2008; de Uña-Álvarez and Amorim 2011) since all use (in different ways) the Kaplan-Meier estimator (Kaplan and Meier 1958).The estimator proposed by Lin in 1999 uses inverse probability of censoring weighted (IPCW) based on the Kaplan-Meier estimator.On the other hand, the idea behind both estimators proposed by de Uña-Álvarez is the use of the Kaplan-Meier estimator pertaining to the distribution of the total time to weight the bivariate data.The difference between these two methods is that the more recent paper uses a presmoothed version of the Kaplan-Meier estimator (Dikta 1998).Without smoothing, the estimator described in de Uña-Álvarez and Amorim (2011) reduces to that in the de Uña-Álvarez and Meira-Machado (2008).
This paper describes the R-based package survivalBIV (available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=survivalBIV) and its capabilities for implementing nonparametric and semiparametric estimators for the bivariate distribution function for censored gap times.In this article we explain and illustrate how numerical and graphical output for all methods can be obtained using the survivalBIV package.
The following section provides a brief introduction to the methodological background.All four estimators for the bivariate distribution function and marginal distribution of the second gap time are presented.An overview of the use of survivalBIV is given in Section 3. In Section 4 we explain how the package can be used to simulate bivariate censored data and how to use the several functions in the package.An example of its application is given using data from a bladder cancer study in Section 5. A discussion is in Section 6.

Methodological background
In this section we will present four different methods for estimating the bivariate distribution function F 12 (x, y) = P (T 1 ≤ x, T 2 ≤ y), all using the Kaplan-Meier estimator of survival.Some related problems such as estimation of the marginal distribution of the second gap time will also be discussed.

Conditional Kaplan-Meier estimator
A simple estimator for the bivariate distribution function of the gap times is based on Bayes' theorem and Kaplan-Meier survival function (conditional Kaplan-Meier, CKM).Since F 12 (x, y) = P (T 1 ≤ x, T 2 ≤ y) = P (T 2 ≤ y T 1 ≤ x)P (T 1 ≤ x) one simple estimator for the bivariate distribution is given by where F1 (x) is the Kaplan-Meier product-limit estimator based on the pairs ( T1i , ∆ 1i )'s and FKM (y) is the Kaplan-Meier estimator based on the pairs ( T2i , ∆ 2i )'s.The FKM (y T 1 ≤ x, ∆ 1 = 1) is the conditional distribution function for the subset of T 1 ≤ x and ∆ 1 = 1 (the Kaplan-Meier estimator based on the pairs ( T2i , ∆ 2i )'s such that T1i ≤ x and ∆ 1i = 1).

Kaplan-Meier weighted estimator
Another simple estimator was recently proposed by de Uña-Álvarez and Meira-Machado (2008).The idea behind estimation is to use the Kaplan-Meier estimator pertaining to the distribution of the total time to weight the bivariate data.The proposed estimator (Kaplan-Meier weighted estimator, KMW) is given by where is the Kaplan-Meier weight attached to Ỹi when estimating the marginal distribution of Y from ( Ỹi , ∆ 2i )'s, and for which the ranks of the censored Ỹi 's, R i , are higher than those for uncensored values in the case of ties.

Kaplan-Meier presmooth weighted estimator
Recently, de Uña-Álvarez and Amorim (2011) propose a modification of estimator (2) based on presmoothing (Dikta 1998), which allows for a variance reduction in the presence of censoring.Basically, this method uses a presmoothed version of the Kaplan-Meier estimator (see e.g., Dikta 1998 and references therein) pertaining to the distribution of the total time to weight the bivariate data.This is obtained by replacing the censoring indicator variables in the expression of the Kaplan-Meier weights by a smooth fit of a binary regression.This estimator (Kaplan-Meier presmooth weighted estimator, KMPW) is expressed as where are the presmoothed Kaplan-Meier weights.Here, m(x, y) = P (∆ 2 = 1 T1 = x, Ỹ = y, ∆ 1 = 1), belongs to a parametric (smooth) family of binary regression curves, e.g., logistic.Our package provides the results assuming that m denotes a logistic regression model (KMPW).In practice, we assume that m(x, y) = m(x, y; β) where β is a vector of parameters which typically will be computed by maximizing the conditional likelihood of the ∆ 2 's given ( T1 , T2 ) for those with ∆ 1 = 1.
Note that, unlike (2), the KMPW can attach positive mass to pair of gap times with censored second gap time.However, both estimators (2) and (3) attach a zero weight to pairs of gap times with censored first gap time.In the limit case of no presmoothing, the estimator (3) reduces to (2).Conditions under which both estimators are consistent is fully discussed in papers by de Uña-Álvarez and Meira-Machado ( 2008) and de Uña-Álvarez and Amorim (2011).In the latter paper the authors compare the performance of the presmoothed (semiparametric) estimator with the purely nonparametric estimator (without presmoothing) and concluded that the presmoothed estimator improves efficiency in the multivariate setup of gap times.

Inverse probability of censoring weighted estimator
Another estimator for the bivariate distribution function was proposed by Lin et al. (1999).This estimator is based on inverse probability of censoring weighted (IPCW).The rationale behind IPCW is that each subject that is observed at time u is representative of 1 G(u) individuals that might have been observed if there was no censoring.Lin's estimator is expressed as where The censoring survivor function G is typically unknown and needs to be replaced by an estimate.This can be obtained by reversing the role of T and C, using a Kaplan-Meier estimate Ĝ of the censoring survivor function, i.e., using an estimate based on the ( T1i , 1 − ∆ 1i )'s (for the first term in the right-hand side of Equation 4) or ( Ỹi , 1−∆ 2i )'s (for the second term in the right-hand side of Equation 4).This is the simplest choice and was assumed by Lin et al. (1999).Other procedures for estimation of G are appropriate, for example the approach used in prodlim package.Without ties (between event times and censoring times) the two methods provide the same result.Our package allows the user to choose between one of the two methods.
Note that consistency of estimators (2), ( 3) and ( 4) is only guaranteed whenever x + y is smaller than the upper bound of the support of the censoring time.The estimates produced via Bayes' theorem and Kaplan-Meier (CKM) may not produce a valid bivariate distribution since it does not guarantee that the bivariate distribution function is monotone.The problem can be explained by the fact that, as the conditioning set T 1 ≤ x changes, the redistribution to the right of the probability mass associated with censored observations also changes.In contrast to the other two methods, the estimators based on Kaplan-Meier weights (KMW and KMPW) are monotonic (distribution) functions, in the sense that they attach positive mass to each observation.

Marginal distribution of the second gap time
From (1), ( 2), ( 3) and ( 4) we may obtain an estimator for the marginal distribution of the second gap time, F 2 (y) = P (T 2 ≤ y), namely Note that if F1 (+∞) = 1, then ( 5) is the Kaplan-Meier estimator based on ( T2i , ∆ 2i )'s such that ∆ 1 = 1 (i.e., for which the first gap time is uncensored).Estimator ( 6) is different because the Kaplan-Meier weights W i in this estimator are based on the Ỹi -ranks rather than on the T2i -ranks.In fact, since T 2 and C 2 are expected to be dependent, the ordinary Kaplan-Meier estimator of F 2 (estimator (5)) will be generally inconsistent.The corresponding estimator for (3) is obtained using the same ideas as for ( 6) by replacing the weights W i by the presmoothed Kaplan-Meier weight W ⋆ i previously defined.Similarly, from Lin's estimator (4) one can obtain an estimator for the marginal distribution of the second gap time.Again, note that such estimator does not guarantee monoticity.
Results of an extensive simulation study comparing the four methods are reported in the paper by Meira-Machado and Moreira (2010).In this work the authors consider two simulation scenarios, the first scenario is the same as the described in Lin et al. (1999) (see their Section 3).In the second scenario, the gap times were generated using a family of bivariate Weibull distributions (see our Section 4 for more details).The main conclusions are the following: (a) the CKM estimator has larger bias for higher values of the first gap time; (b) the KMW estimator has less bias than its presmoothed version (KMPW); however, as expected, provides large standard errors in estimation (resulting in a wiggly estimator with fewer jumps); (c) the KMW and IPCW estimators are almost unbiased but the last one obtains higher levels of variance for small values of the second gap time.
Other estimators were proposed to estimate the bivariate distribution function.A valid estimator of the bivariate distribution function, was provided by Keilegom (2004) which is based on Akritas (1994).However, this approach has some limitations since some smoothing is required.Recently, alternative estimators for these quantities were also given in Keilegom, de Uña-Álvarez, and Meira-Machado (2011).This methodology assumes that the vector of gap times (T 1 , T 2 ) satisfies the nonparametric location-scale regression model, allowing for the transfer of tail information from lightly censored areas to heavily ones.

Package description
The survivalBIV software contains functions that calculate estimates for the bivariate distribution function.As mentioned in Section 2, this package can be used to implement four methods (CKM, KMW, KMPW and IPCW).This software is intended to be used with the R statistical program R Development Core Team (2011).Our package is composed of 9 functions that allow users to obtain estimates for the bivariate distribution function.Table 1 provides a summary of the functions in this package.
Users can obtain the estimates for the methods discussed in Section 2 by means of three functions, namely, survBIV, summary and plot.Details on the usage of these functions can be obtained with the corresponding help pages.It should be noted that to implement the methods described in Section 2 one needs the following variables: time1, event1, time2 and event2.Covariates have not been included in any of the implemented methods, therefore they are not necessary.is a censored observation, the value is 0 and otherwise the value is 1).The variable time2 represents the observed second time (second gap time).If event1 = 0, the second gap time is not observed and then time2 = 0.The variable event2 is the final status of the individual (takes the value 1 if the second event of interest is observed and 0 otherwise).

Data generation
Users may use the function dgpBIV to generate bivariate survival data.This function can be used to generate bivariate survival times from two of the most known copula functions: Gumbel's bivariate exponential distribution Lu andBhattacharya (1990, 1991), also known as the Farlie-Gumbel-Morgenstern distribution and the bivariate Weibull distribution.In the book by Johnson and Kotz (1972) several bivariate distributions are discussed and procedures of construction are given.
It is well known that Exponential and Weibull distributions are very useful for modeling survival times.The Farlie-Gumbel-Morgenstern distribution is given by F where the marginal distribution functions F 1 and F 2 are exponential with rate parameter θ i , i = 1, 2 and where δ ≤ 1 is the association parameter.The case of independence is obtained for δ = 0 while the maximum of correlation (between T 1 and T 2 ) for the bivariate exponential distribution is obtained for δ = 1 with bound equal to 0.25.These and other theoretical correlations between the bivariate times for this copula distribution (with unit marginal distributions) can be obtained using the input commands shown below.
The following input command provides the estimate for the KMW method.With this command we obtain the pointwise confidence intervals (conf = TRUE) using a 1000 bootstrap replicates (n.boot = 1000).The construction of the pointwise confidence intervals is obtained by randomly sampling the n items from the original data set with replacement.This can be achieved using percentile bootstrap (method.boot= "percentile") or using basic bootstrap (method.boot= "basic").By default all functions use the percentile bootstrap (Davison and Hinkley 1997).
R> bivKMW(object = sim_data_exp, t1 = 0.5108, t2 = 0.9163, conf = TRUE, + conf.level= 0.95, n.boot = 1000) 2.5% 97.5% 0.3015313 0.2692518 0.3337527 One important issue is whether 1000 is a suitable number of resamples to generate.Since a second and a third set of 1000 resamples gave similar results for the bootstrap confidence intervals, this suggests that with these number of resamples the results are consistent.From this perspective 1000 would seem sufficient.
The CPU time needed for running the bivKMW function varies according to whether bootstrap confidence bands are requested or not, the sample size, and the type of processor in the PC computer.The command presented above took no more than 2 second on a PC with an Intel Core i7 processor with 8 GB memory.The same input command but with n = 10000 resamples took a little more than 17 seconds.
Results for the other methods are very similar and can be obtained using the functions bivKMPW, bivCKM and bivIPCW with the same arguments.The bivIPCW function has one extra argument which allows the user to choose how to estimate Ĝ in ( 4) method.cens= "KM" for the Kaplan-Meier method and method.cens= "prodlim" for the method proposed in prodlim package.In general, the two methods (for estimating the survival of censoring times) provide similar results; without ties (e.g., using simulated data) they provide the same result.The method based on Kaplan-Meier is implemented in C language and is faster.
The summary function can be used to obtain estimates for the bivariate distribution function.This function allows the user to obtain the estimates for all four methods using method = "all": R> summary(object = sim_data_exp, t1 = 0.5108, t2 = 0.9163, conf = TRUE, + conf.level= 0.95, n.boot = 1000, method = "all") The CPU time needed for running the command presented below took a little more than 16 seconds.The same input command but with a sample size of n = 10000 took a little more than 68 seconds.Note that this input command is the one which requires more computational effort since all methods are implemented with bootstrap confidence bands (optional).
One limitation of the so-called Farlie-Gumbel-Morgenstern families of bivariate cdf's, is that the correlation of T 1 and T 2 can never exceed 1 3 (0.25 in the bivariate exponential distribution).The bivariate Weibull distribution allows for a larger correlation, which makes it superior to Gumbel's bivariate exponential.The dgpBIV function allows the user to generate a pair of times from the bivariate Weibull distribution with two-parameter marginal distributions.Its survival function is given by where 0 < δ ≤ 1, and each marginal distribution has shape parameter β i and a scale parameter θ i , i = 1, 2. The correlation between the two gap times may be obtained though it is a complicated function of the shape and scale parameters and of δ.Again, the function corrBIV, from the survivalBIV package can be used to calculate the theoretical correlation between times for this bivariate distribution.This function may be valuable for choosing the appropriate shape and scale parameters.For example, choosing δ = 0.6, θ 1 = θ 2 = 7 and shape parameters β 1 = β 2 = 2, lead to about 54% of correlation.Below follow two input commands to illustrate the use these two functions.The first command provides the theoretical correlation while the second generates bivariate survival data from the bivariate weibull with exponential censoring with rate parameter 0.08.
The input commands are shown below.

Example of application: Bladder cancer study
The methods described in Section 2 are illustrated using data from a bladder cancer study (Byar 1980) conducted by the Veterans Administration Cooperative Urological Research Group.In this study, patients had superficial bladder tumors that were removed by transurethral resection.Many patients had multiple recurrences (up to a maximum of 9) of tumors during the study, and new tumors were removed at each visit.For illustration purposes we re-analyze data from 85 individuals in the placebo and thiotepa treatment groups; these data are available as part of the R survival package.Here, only the first two recurrence times (in months) and the corresponding gap times, T 1 and T 2 , are considered.From the total of 85 patients, 47 relapsed at least once and, among these, 29 experienced a new recurrence.
We have a total amount of censoring of 66% from which 44.7% is obtained from censored observations on the first gap time.We have about 38% of censored total time among the uncensored first gap time.
There is a high percentage of censored total time (Y 's) which in general lead to difficulties in the estimation of the bivariate distribution function.The presence of a reasonable amount of censored Y 's among the uncensored T 1 's suggests that presmoothing could lead to an important reduction of variance in estimation (see de Uña-Álvarez and Amorim 2011).
We will calculate estimates for the bivariate distribution function in several points and plot these estimates.This will be done using the survivalBIV package.
In the following, we will demonstrate the package capabilities using data from the bladder cancer study.Below is an excerpt of the data with one row per individual.
R> data("bladderBIV", package = "survivalBIV") R> head(bladderBIV) Each line represents the information from one individual in study.Among the first five observations, only individual represented by line 5 had a recurrence.This individual had a recurrence on month 6 and remained alive and without second recurrence until time 10 (months).
Note that event1 = 0 and event2 = 0 (the remaining five observations) corresponds to a censored first gap time in the initial state ("remained alive without a recurrence").All observations with event1 = 1 and event2 = 1 corresponds to individuals with a first recurrence and a second recurrence.
We computed the estimated values for the four estimators of F 12 (x, y), for x equals to 3, 13, 29 and 49 and y values 3, 10, 17.75 and 36.75, corresponding to marginal survival probabilities of 0.25, 0.5, 0.75 and 0.95.For illustration purposes we only report the estimated values of F 12 (x, y) for two pairs of gap times with 95% bootstrap confidence intervals.
The outputs for the bivariate distribution function and for the marginal distribution of the second gap time are useful displays that greatly help to understand the patients course over time.Plots for these two quantities can easily be obtained.Figure 1 plots the marginal distribution function of the second gap time (time from first to second recurrence) for all methods.These plots are obtained using the following input commands: R> plot(bladder_obj, plot.marginal= TRUE, method = "KMW", + ylim = c(0, 0.65), xlim = c(0, 45)) R> plot(bladder_obj, plot.marginal= TRUE, method = "KMPW", + ylim = c(0, 0.65), xlim = c(0, 45)) R> plot(bladder_obj, plot.marginal= TRUE, method = "IPCW", + ylim = c(0, 0.65), xlim = c(0, 45)) R> plot(bladder_obj, plot.marginal= TRUE, method = "CKM", + ylim = c(0, 0.65), xlim = c(0, 45)) In Figure 1 we can see new insights for each method, for example, about the number of jump points and monotonicity.In this graphical output we have on top the semiparametric estimator (right) and the method without presmoothing.The main difference between the first two methods is that the semiparametric estimator has more jump points, explicitly the censored values of the total time for which the first gap time is uncensored.Below, the method based on Bayes' theorem (CKM) and the method based on inverse censoring.Clearly, we can see that estimator based on inverse censoring (IPCW) provides a plot with more jump points than the remaining methods.Note also that this method provides non-monotone curves.In regard to the number of jump points and monotonicity, similar behaviors can be found in the plots for the bivariate distribution function (Figures 2 and 3).For illustration purposes we only present the plot for the semiparametric method.These plots are obtained through the following input command, R> plot(bladder_obj, plot.bivariate= TRUE, method = "KMPW") Plots for the different methods can be obtained by simply changing the method argument.

Conclusion
This paper discusses implementation in R of some newly developed methods for the bivariate distribution function for censored gap times.The survivalBIV package uses four nonparametric and semiparametric estimators.One of these estimators is the conditional Kaplan-Meier, based on Bayes' theorem and Kaplan-Meier estimator; also, two recent estimators based on the Kaplan-Meier weights pertaining to the distribution of the total time (time to the second or final event of interest).It also implements the inverse probability of censoring weighted estimator proposed by Lin et al. (1999).The package allows for numerical results as well as graphics to be easily obtained.Covariates have not been included in our methods.This is a topic of current research and hopefully will be implemented in future.We plan to constantly update survivalBIV package to cope with other estimators.

Figure 1 :
Figure 1: Marginal distribution function of the second gap time.Bladder cancer data.

Figure 3 :
Figure 3: Contour plots for the bivariate distribution.Bladder cancer data.

Table 1 :
The variable time1 represents the observed time of the first event (first gap time), and event1 the status indicator of the first gap time (if the first gap time Function Description dgpBIV A function that generates bivariate censored gap times from some known copula functions.By default returns a dataset of class survBIV.corrBIV Provides the correlation between the bivariate times for some copula distributions.survBIV Provides the adequate dataset for implementing all the four methods.The new dataset is of class survBIV.bivCKM Provides estimates for the bivariate distribution function for the conditional Kaplan-Meier estimator, CKM.bivIPCW Provides estimates for the bivariate distribution function for the inverse probability of censoring weighted estimator, IPCW.bivKMW Provides estimates for the bivariate distribution function for the Kaplan-Meier weighted estimator, KMW.bivKMPW Provides estimates for the bivariate distribution function for the Kaplan-Meier presmoothed weighted estimator, KMPW.plot A function that provides the plots for the bivariate distribution function and marginal distribution of the second time.summary Summary method for objects of class survBIV.Summary of functions in the package.