Competing risks survival data under middle censoring—An application to COVID-19 pandemic

Survival data is being analysed here under the middle censoring scheme, using specifically quantile function modelling under competing risks. The use of middle censoring scheme has been shown to be very appropriate under the COVID-19 pandemic scenario. Cause-specific quantile inference under middle censoring is employed. Such quantile inferences are obtained through cumulative incidence function based on cause-specific proportional hazards model. The baseline lifetime is assumed to follow a very general parametric model namely the Weibull distribution, and is independent of the censoring mechanism. We obtain estimates of the unknown parameters and cause specific quantile functions under classical as well as a Bayesian set-up. A Monte Carlo simulation study assesses the relative performance of the different estimators. Finally, a real life data analysis is given for illustration of the proposed methods.


Introduction
After the Coronavirus Infectious Disease (COVID-19) first appeared in Wuhan, China in December 2019, the World Health Organization (WHO) declared it a global pandemic on March 11, 2020 due to its global spread [1]. COVID-19 is transmitted mainly during coughing and sneezing; fecal-oral transmission is also reported in a few cases. According to the worldometer website (https://www.worldometers. info/coronavirus/), as of 10 June 2021, there were more than 175 million cases and around 3.78 million deaths globally. COVID-19 has now been reported on every continent. The first case of COVID-19 in India is reported in the state of Kerala on 30 January 2020, after a medical student returned back from China. COVID-19 has occurred at an unprecedented period, and the lockdown measures enacted have influenced human life with serious economic and social development concerns. In comparison to high-income nations, low-and middleincome nations with less developed health systems appear to have greater obstacles and remain vulnerable in regulating and controlling COVID-19 [2].
The COVID-19 outbreak has changed the health system's responsibilities, and it now finds itself not only overloaded, but also with limited capacity to provide services that it had previously extended to communities. COVID-19 patients are clogging up hospitals and health facilities, making it difficult for other symptomatic patients with acute or chronic illnesses to receive standard care [3]. When the entire health system is focused on fighting against the COVID-19 pandemic, medical (clinical trials) and surgical emergencies (including road accidents) are they fall in some random censoring intervals. This censoring scheme arises prominently in time to event analysis in clinical trial studies. Moreover, middle censoring also occurs when the mechanism where the observations are being taken, is closed for a period, due to an external emergency such as the outbreak of disease, war or a strike.
In the current scenario of COVID-19 pandemic, the need for the use and application of middle censoring scheme is highly relevant. For example, in a breast cancer clinical trial centre, patients are registered with node-negative breast cancer. Some of the patients may receive the tamoxifen-alone arm and remaining patients received the combination of radiation and tamoxifen arm. Investigators may be interested to observe events such as local relapse, auxiliary relapse, remote relapse, second malignancy of any kind, and death. After the surgery patients are discharged and they are instructed that they have to come to the hospital for routine check-up and follow-up. But in the pandemic situation, many countries declared a nationwide complete lock-down so that many patients fail to get routine check-ups, and the investigators may also lose the follow-up on these patients. In the interim, some of the patients may experience an event of interest. In this situation the exact time of the event occurrence cannot be observed for some patients except for noting the interval for the event-time, and learning on later inspection that the actual event has occurred during this interval. Thus, middle censoring is highly relevant with immediate application in real life.
To define middle censoring, let 1  with Pr( < ) = 1, otherwise unobservable. Middle censored competing risks data with exponential distribution was studied by Ahmadi et al. [6] and Abuzaid et al. [7]. See also the references cited therein.
A single long-term survivor may have a major impact on mean life in survival study, particularly in the case of heavy tailed models, which are widely used for lifetime data. This type of situation is commonly encountered when the subjects can experience types of mutually exclusive competing risks of death/event. For example, in a cancer clinical trial, the primary risk of concern may be a full or partial reaction to therapy, with death as a competing risk, and death may be attributed due to various risks such as cardiac arrest, corona virus infection etc. Similarly, in liver transplantation an individual can experience one of the three possible outcomes such as death, transplantation and withdrawal from the waiting list.
In modelling of survival data with competing risks, two basic quantities such as cumulative incidence function (CIF) and cause specific hazard function (CSHF) get considerable attention in the statistical literature, see Kalbfleisch and Prentice [8]. The CIF represents the cumulative probability of failure due to cause up to a certain time point , conditional on × 1 vector of covariates ∈ R , which is described as follows where is the time to failure, = , ∈ {1, 2, … , } is the cause of failure. The CSHF simply gives the instantaneous failure rate from the cause at time among the individuals who survives up to . Mathematically, CSHF is defined as Although, in general the distribution function ( ) reaches 1 as → ∞, in the presence of competing risks, the asymptote of the CIF is less than 1, implying that the proportion of the ( ; ) due to cause increases up to some time point, and then plateaus. Therefore, obtaining the mean survival time does not make much sense because it will always be infinite. Hence, in such a situation, quantile-based estimates, which are finite and may be identifiable from observed data are generally found to be more precise and robust. These measures can be used for summarizing a CIF curve. A good discussion of the various theoretical aspects of the quantile function may be found in [9]. Let ( ) be a right-continuous distribution function of a random variable . The quantile function of , say ( ) is defined by The quantile function has a number of unique features that a distribution function does not have. For example, the sum of the quantiles, product of the positive quantiles, and the monotonic transformation of the quantile functions are also quantile functions. These properties make the quantile function a preferred alternative to the distribution function in statistical modelling. The quantile function is gaining popularity as a comprehensive tool for statistical analysis of lifetime data. Quantiles are frequently used in medical research to summarize the survival function. For example, the median survival time has long been used to assess the survival curve. In survival studies, quantile regression has gained increasing interest as a viable alternative to methods using distribution functions. For more details on quantile function one may refer to Sankaran et al. [10] and the references therein.
Modelling of competing risks survival data using the quantile function has been studied by several researchers. Peng and Fine [11,12] discuss non-parametric and regression model approach of quantile function with competing risks. Sankaran et al. [10] propose a quantile based test for comparing the equality of CIFs. Lee and Fine [13] considered parametric and non-parametric inferences for cumulative incidence quantiles without covariates. Lee and Han [14] proposed covariate adjusted quantiles inferences through cause specific proportional hazards (PH) model of the CIF. Lee [15] considered the parametric modelling of the cause specific quantiles with covariates through Weibull PH model and direct semi-parametric improper Gompertz model. Recently, Peng [16] has reviewed various aspects of quantile regression analysis of survival data with competing risks and without competing risks under randomly censored and left truncation mechanism based on the semiparametric approach.
The primary goal of this paper is to obtain quantile based inference for middle censored competing risks survival data based on the parametric regression modelling of CIF. We define the CIF through Weibull PH model. Under this set-up, we obtain both maximum likelihood and Bayes estimates under reasonable priors. The Bayes estimates are obtained using two different loss functions, namely the squared error loss function and LINEX loss function. As one may expect, explicit form of the posterior densities turn out to be intricate, and so we adopt the Markov Chain Mote Carlo (MCMC) simulation algorithms for generating the posterior samples.
The rest of the article is organized as follows. In Section 2 we define the parametric cause-specific quantile functions based on Weibull PH model. In Section 3 we obtain the maximum likelihood estimates under middle censoring scheme for cause specific quantile functions. Bayes estimates based on squared error and LINEX loss functions are provided in Section 4. Section 5 presents a Monte Carlo simulation study to compare the relative performance of the proposed methods. A real data application of the proposed approach is given in Section 6. Finally Section 7 concludes with some remarks.

Parametric cause specific quantile functions
Regression models in survival studies may be developed via Cox's Proportional Hazards (PH) [17] model, in which the effect of the covariates is multiplicative on some baseline hazard function. For parametric regression modelling of survival time one could use some well known distribution for the baseline function. More details can be found in [18] and references therein. The CSHF in terms of the well known Cox's PH model turns out to be of the following form: where 0 ( ) is the baseline CSHF and ⊤ ∈ R is the 1 × vector of regression coefficients of cause . The CIF can be formulated in terms of all the CSHFs as follows The overall survival function ( ; ) is obtained in terms of cumulative CSHFs as ( ; In this article, we consider 0 ( ) is corresponding to a Weibull distribution i.e. 0 ( ; , ) = ( ) −1 . The CIF under the cause specific PH assumption (4) is then given by where is the vector of parameters and = ( , , ). From (6) it can be seen that the closed form expression for ( ; , ) exists if the shape parameter is common for all causes i.e. 1 = 2 = ⋯ = = . Under this assumption, it takes the form The corresponding cause specific probability density function ( ; , ) is obtained by differentiating ( ; , ), and we get Following general notation for quantile function as in (3), the cause specific quantile/ sub-quantile function is defined as The estimate of cause specific quantile is obtained aŝ− 1 ( ; ) = inf { ∶ ( ; ) ≥ }. Note that, if ( ; ) is continuous and strictly increasing function, then −1 ( ; ) is the unique value of such that ( ; ) = . Therefore, cause specific quantile function of ( ; , ) from Eq. (7), yields We used Weibull distribution for parametric modelling of cause specific quantile functions under middle censoring scheme because it is a very broad and flexible distribution for lifetime data analysis. A detailed discussion on parametric modelling of middle censored lifetime data with covariates can be found in [19]. We now proceed to obtain estimates of the unknown parameters of the Weibull distribution, the regression coefficients, and cause specific quantile functions, under both classical and Bayesian approaches in the following Sections 3 and 4 respectively.

Maximum likelihood estimation
The maximum likelihood estimation (MLE) is widely used among the statistical inference methods because of its desirable properties such as consistency, asymptotic efficiency, and invariance. In the middle censoring scenario, we assume that the lifetime is middle censored by random censoring interval [ , ] which having a bivariate cumulative distribution function (⋅, ⋅). Moreover, we assumed that ( 1 , 1 ), ( 2 , 2 ), … , ( , ) are i.i.d. pairs of bivariate random observations, where left end point and length of the censoring interval = ( − ) are independently follow exponential distributions i.e.
∼ Exp( 1 ) and ∼ Exp ( 2 ). For ∈ N individuals, let lifetime 's, ( = 1, 2, … , ) and censoring interval [ , ]'s are independent, given the covariate . The observed lifetime for the th individual is given by ) is a censoring indicator. In this study it is assumed that when ∈ [ , ], then the causes of failure can be observed on later inspection. We assume that ( , , , ) are i.i.d. observations of ( , , , ) corresponding to individuals under study. For the observed data ( , , , ), the likelihood function is then given by where ( ) = ( = ) is the indicator function for th cause. It is also assumed that 1 and 2 do not depend on . Without loss of generality we assume that first 1 = ∑ =1 are the uncensored and remaining  7) and (8) can be written as The log-likelihood function = log ( ) is given by The MLEs of unknown parameters are obtained by maximizing the log-likelihood (13). The system of equations with respect to each of the parameters is given in Appendix. Since these equations are not in an explicit form, analytical solutions are not possible. So, we utilize iterative methods such as Newton-Raphson or other techniques to solve the system of equations. We used the optim function in R software for obtaining MLEs of the unknown parameters. By using the invariance property of MLEs we obtain the MLEs of cause specific quantile functions. Suppose that thêis the MLE of then the estimator of −1 ( ; , ) is given bŷ− 1 ( ;̂, ).

Bayes estimation
Bayesian inference is distinctive in that it incorporates prior information with the observed data. For obtaining the Bayes estimates of unknown parameters , , and cause specific quantile −1 ( ; , ), first we need to define the suitable priors of unknown parameters and appropriate loss functions. It is known that there is limited information about the unknown parameters except that > 0, > 0 and −∞ < < ∞, = 1, 2, … , . If all the parameters are unknown then it is difficult to obtain the joint conjugate prior for the parameters. We therefore assumed informative prior by choosing independent gamma priors for and and normal priors for as follows where 1 , 1 , 2 , 2 , and are the hyper parameters. The hyperparameters are assumed to be known and are chosen in such a way that reflects the degree of belief about the unknown parameters. The joint prior distribution of , and from (14) up to the proportionality is given by The joint posterior density of , and is obtained as follows where ( | ) is the likelihood function based on observed data as given in Eq. (12) and ( ) is the joint prior density (15). The denominator part of Eq. (16) involve multiple integrals and it is difficult to obtain the posterior densities of random variables , and in explicit form. Thus the analytical evaluation of posterior samples is impossible. Therefore, in this situation MCMC method can be used to approximate the integrals [20]. Popularly used MCMC algorithms are Gibbs sampling algorithm [21] and Metropolis-Hastings (M-H) algorithm [22]. Since, marginal posterior densities of random variables , and are not obtained in closed form, and so we employ the M-H algorithm.
In this article we consider two different types of loss functions, namely the commonly used squared error (symmetric) loss function and the LINEX (asymmetric) loss function for the purpose of comprehensive comparison of Bayes estimates. Squared error loss function (SELF) is defined as ( ,̂) = ( −̂) 2 for a parameter . Then the Bayes estimate for parameter and −1 ( ; , ) under SELF can be obtained as the posterior mean and calculated aŝ where , = 1, 2, … , are the MCMC posterior random samples drawn from the marginal posterior distribution of random variables , and , and is the number of iteration used in burn-in period. Note that SELF is a symmetric loss function but it is not useful for the situations when under/over estimation is more costly than the over/under estimation and it is considered the equal weight for both under and over estimation. For example, over estimation of survival function and failure rate function is usually much more serious than under estimation. To overcome this difficulty we also consider as an alternative the LINEX loss function (LLF) which is an asymmetric loss function given by ( ,̂) = (̂− ) − (̂− ) − 1, ≠ 0. Under LLF the Bayes estimates of parameter and −1 ( ; , ) can be obtained as followŝ where is the hyper parameter of the LLF and magnitude of reflect the degree of asymmetry. For > 0 the LLF is quite asymmetric about 0 with overestimation being more serious than underestimation. The opposite is true with < 0. If is close to zero then estimates under LLF are approximately equal to estimates obtained under SELF.

A Monte Carlo comparison of the estimates
We conducted a Monte Carlo simulation study to observe the finite sample behaviour of the MLE and Bayes estimators of the unknown parameters and cause specific quantile functions. We generate the random samples through inverse transformation for four different sample sizes i.e. = 25, 50, 100, and 200. For each sample sizes, we simulated 500 sets of data. In this scenario, we computed average estimates (AVE) and mean square error (MSE) for , , and −1 ( ; , ). Besides that we obtained the average length (AVL) along with coverage probability (CP) of the asymptotic confidence interval (ACI) of the MLE and Bayes credible interval (BCI) of the Bayes estimates to compare the precision of the estimates. We also consider two different censoring percentage viz., mild (approximately, 10%), and heavy (approximately, 30%) for observing the impact of censoring. The censoring effect is explained as follows: if lock-down is extended during the COVID-19 pandemic, the percentage of censored observations will increase. As a result, COVID-19 cases are on the rise, posing a threat to the health-care system. We refer these censoring percentages as censoring scheme 1 (CS-1) and censoring scheme 2 (CS-2) respectively. The results of simulation study based on CS-1 and CS-2 are available in Tables 1 and 2 respectively.
In the simulation study, we consider two causes of failure for simplicity i.e. = 1, 2 and one single covariate . The covariate is generated from (0, 1). The survival time is generated by using the steps given in [23]. Next, as we discussed in Section 4 that marginal posterior densities of unknown parameters are not in a closed form, so we utilized the MCMC procedure for generating the random samples from marginal posteriors. For this purpose we used the BUGS software via R2OpenBUGS package in R software [24]. We generate, = 10 000 Markov chains for each parameter and the first = 4000 samples were used in burnin period. Furthermore, for minimizing the effect of the autocorrelation every second equally spaced outcome is considered i.e. thin=2. By the visualization of the convergence diagnostics plots it is realized that chains are converging nicely. Therefore, the last 6000 MCMC samples are used to obtained the Bayes estimates of , , and −1 ( ; , ) based on SELF and LLF.
From Table 1 it is clear that for fixed censoring proportion as the sample size increases, MSEs decreases for MLE and Bayes estimates, which verifies the consistency property of all the estimators. As expected, for small sample size the Bayes estimates under both the loss functions are better than MLE in terms of MSE, AVL and CP. It is also noticed that the CPs for ACIs of the cause specific quantile functions for sample size 25 are little bit away from the nominal level (95%). Similarly, for = 50,100 and 200 we can say that Bayes estimates of baseline parameters and cause specific quantiles under both the loss functions are quite better except some values in sample size 200. The Bayes estimates for llf-1 are smaller as compared to llf-2. From Tables 1 and 2 it is observed that for fixed sample sizes the MSE and AVL of all the estimates is increases as the censoring percentage increases except for some values. Therefore, it indicates that as the censored observations are increased this will leads to the less efficient estimates.
This implies that if the spread of corona virus is not under control in a reasonable period of time, then it will have a significant impact on the human health-care system. Overall it is observed that the CPs maintain the nominal level (95%) of all the estimates for both the censoring schemes.

An illustrative application
In this section, for illustrative purposes a real life application is considered. We have taken real data from a Mayo Clinic study on primary biliary cirrhosis (PBC) of the liver conducted between 1974 to 1984. This data set is available in survival package of R software. During this ten-year period, 312 patients were randomly assigned to receive D-penicillamine or placebo treatment from a total of 424 patients. The remaining 112 patients did not take part in the clinical trial but agreed to have their basic measurements taken and to be followed for survival. Six of those patients were lost to follow-up shortly after diagnosis, so these patients were removed from the study. 161 patients died at the end of the study, 25 patients received a liver transplant, and 232 patients were lost to follow-up. As a result, for two competing outcome variables, liver transplant and death, the competing risks model becomes more reasonable. All the survival times are measured in days. For more information on this PBC data, one may refer to Therneau and Grambsch [25] and application of competing risk on PBC data is available in [26].
In order to compute the survival time in years, it divided by 365, which yielded a median survival time of 4.74 years. First, we check the goodness of fit of the Weibull distribution using fitdistrplus package in R software with assumption that the data is complete. The Kolmogorov-Smirnov distance between the empirical distribution function and the fitted Weibull distribution function is 0.0331 and the correspondingvalue is 0.7378. Therefore, Weibull model appears to be reasonable and cannot be rejected. We also consider the graphical method of goodness of fit to check the appropriateness of the model as given in Fig. 1 and it indicates that the model fits well to the data.
As we have discussed in Section 1 the middle censoring may arise due to COVID-19 pandemic. But this PBC data set does not have middle censored observations and currently we do not have any middle censored follow-up data. However, once the COVID-19 pandemic is over, it is possible that middle censored follow-up data will be available. Therefore, we created an artificial data set using middle censoring, whose left end point is equal to the observed time and the right end point is equal to the + , where is the width of the interval, which is generated from an exponential distribution with a mean value of 10. Then, all the censored individuals in the original data set are considered as the middle censored observations. The competing outcome variables for middle censored observation are randomly assigned from transplant and death.  For this new data set which consists observed exact lifetimes and censored intervals, we check the PH assumption of the model (4) by considering treatment as a covariate. To examine the PH assumption for transplant and death we utilize the graphical method known as Andersen plot [27]. The covariate treatment is discrete and take two values 1 and 2 for D-penicillmain and placebo respectively. We also assume that 106 patients who do not participate in the trial they received the D-penicillmain treatment. Thus data are divided into two strata, corresponding to D-penicillmain and placebo individuals. Suppose that̂0 ( ) be the estimate of baseline cumulative CSHF for th cause in rth stratum, = 1, 2 for transplant and death, and = 1, 2 for Dpenicillmain and placebo respectively. We plot̂1 0 ( ) versuŝ2 0 ( ) for transplant and death. If the proportionality assumptions holds then these plots should be a straight line passing through origin which is verifying by Fig. 2. We then apply the proposed methods of estimation to obtain the estimates of unknown parameters and cause specific quantile functions. These are presented in Tables 3 and 4 under proposed methods of   Fig. 3 in which solid line represents the estimates of the CIF due to death and dotted line represents the estimates of the CIF due to transplant. From Fig. 3 it is observed that the estimates of CIFs have smaller value for transplant as compared to death. gives the information about the stay of the patients in waiting queue, this implies that the 10% and 20% patients will receive the transplant soon after 4 and 7 years respectively and 10% and 20% patients died soon after 2.2 and 3.6 years respectively. This shows that the waiting time of the patients to receive the transplant is roughly two times larger as compared to the death under both the treatment groups. It is observed that the quantile estimates under both the treatment groups have minor differences. This indicates that the effect of treatment is not significantly different on transplant and death.

Concluding remarks
This article considers parametric cause specific quantile inference of the CIF under middle censoring scheme. In this study we have discussed the use of middle censoring scheme in the context of the current COVID-19 pandemic. This research could provide a useful statistical framework for medical practitioners to obtain precise survival analysis for patients who were lost to follow-up due to this pandemic. We believe this to be first such attempt to model the quantile event times of the middle censored data under competing risks. The regression model was developed based on Cox's PH model by assuming a very flexible Weibull distribution for the baseline hazard function. Also, we provide the estimates of unknown parameters and cause specific quantiles under both the classical and Bayesian set-up. The simulation study shows that the Bayes estimates perform well based on informative priors under squared error loss function in terms of MSE as compared to MLE. However, all the estimates exhibit the consistency property, as also the identifiability and appropriate convergence of the proposed model. Overall, the proposed model performed well in simulation studies. In a real data analysis on Primary Biliary Cirrhosis of the liver, goodness of fit criteria verify that the Weibull model fits well to the data. Also the covariate treatment maintains the assumed model PH assumption. Other semi-parametric regression models, such as additive hazard regression model and proportional odds model, may also be appropriate in this context and will be explored elsewhere.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Details of the system of normal equations used in Section 3 are provided here after introducing some notations that help reduce the length of expressions viz.
Partial derivatives of the log-likelihood of Eq. (13) with respect to each of the parameters gives the following equations.