Confidence Bands for the Survival Function Using a Weibull Regression Model in Presence of Arbitrary Censoring

Usually, the exact time at which an event occurs cannot be observed for several reasons; for instance, it is not possible to constantly monitor a characteristic of interest. This generates a phenomenon known as censoring that can be classified as having a left censor, right censor or interval censor. When one is working with survival data in the presence of arbitrary censoring, the survival time of interest is defined as the elapsed time between an initial event and the next event that is generally unknown. This problem has been widely studied in the statistic literature and some progress has been made, toward resolving and the formulation of a bivariate likelihood to estimate parameters in a parametric regression model offers positive development opportunities. In this paper, we construct a bivariate likelihood for the Weibull regression model in the presence of interval censoring. Finally, its performance is illustrated by means of a simulation study.


Introduction
Situations in which the observed response for each individual under study is either an exact survival time or a censoring time are commonly in practice, There are two types of censoring known as type I and type II, respectively.Type I censoring occurs when units that have not failed are removed from a test at a prespecified time; on the other hand, type II censoring occurs when a life test is terminated after r failures are observed.Nonetheless, there can be situations, such as in longitudinal studies which individuals are monitored during a fixed period of time or periodically visited during a certain period.In this context, the time T i , i = 1, . . ., n, until the occurrence of an event of interest for each individual is unknown, the only known fact is that the event occurred on an interval between visits, in other words, between the visit at time L i and the visit at time U i , where L i < U i .It is important to point out that in such studies, the survival time T i is not exactly known.We only know that the event of interest occurred inside the interval (L i , U i ] with L i < T i ≤ U i .Moreover, taking into account that if the event occurs in the exact moment of a visit, which is very unlikely but could occur, an exact survival time would be obtained.In this case, it is possible to assume that L i = T i = U i . On the other hand, it is known that for individuals whose times are rightcensored, the event of interest has not occurred until the last visit, but that could happen at any time beyond that moment.Therefore, it is assumed in this case that T i could occur inside the interval (L i , ∞), with L i being equal to the time period from the start of the study to the last visit, so U i = ∞.
Similarly, it is known that for individuals whose times are left-censored, the event of interest occurred before the first visit and, hence, it is possible to assume that T i occurred on the interval (0, U i ] with L i = 0 representing the start of the study and U i being the time period from the start of the study to the first visit.Schenker (1993), state that to treat the survival time of interest as if it were exact could lead to biased estimators as well as to partial and unreliable conclusions and estimations.
These affirmations somehow motivate different proposals related to the treatment of these censorings in order to avoid bias and to extract more information from the data.Our proposal partially covers this objective.
For the case of right censoring, the Kaplan-Meier estimator could be used to obtain F (t) Kaplan & Meier (1958).However, with interval-censored data, the classic Kaplan-Meier method could not be implemented.For these intervalcensored data, Peto (1973), Turnbull (1974) and Turnbull (1976) developed the so called nonparametric maximum likelihood estimator (NPMLE); from now on, we will call it Turnbull estimator.
The Turnbull estimator is based on a sample of observed intervals [L i , R i ], i = 1, 2, . . .n, including the independent random variables T 1 , T 2 , . . ., T n .As stated above, an exact observation of T i is obtained only if Given this example, the likelihood function to be maximized is: To solve this maximization problem, (Peto 1973) defines two sets: γ = {L i , i = 1, 2, . . .n} and κ = {R i , i = 1, 2, . . ., n} containing the right and left sides of the intervals respectively.
From these sets, new [q 1 , p 1 ] , [q 2 , p 2 ] , . . ., [q m , p m ] disjoint intervals are formed, such that q j ∈ γ, p j ∈ κ and q j ≤ p j .It could be proved that a function that maximizes (1) is constant between the intervals [q j , p j ] and it is not defined inside those intervals.This implies that P (T ∈ (p j−1 , q j )) = 0 for any j.Let s j be the increases of F inside the [q j , p j ] intervals, j = 1, . . ., m, L (F ) must be maximized as a function of s 1 , s 2 , . . ., s m subject to the constrain s j ≥ 0 and s m = 1 − m−1 j=1 s j .Peto addresses this maximization problem using the Newton-Raphson algorithm.
In contrast to Peto, Turnbull (1976), proposes the use of the so called selfconsistency algorithm for the same maximization problem.The idea of the selfconsistency algorithm was first introduced by Efron (1967).Its application for the maximization in (1) is as follows: Let Then, the probability of T i being inside the interval [q j , p j ], given a vector s = (s 1 , s 2 , . . ., s m ) , is given by: Revista Colombiana de Estadística 40 (2017) 85-103 Since F is constant outside the intervals [q j , p j ], the proportion of observations inside the interval [q j , p j ] is equal to: And a vector s = (s 1 , s 2 , . . ., s m ) is called self-consistent if, Following this definition, the Turnbull self-consistency algorithm to calculate the nonparametric estimator of F (t) could be implemented following these steps: 1. Obtain initial estimations of s; for instance, s 2. For i = 1, 2, . . ., n, j = 1, 2, . . ., m, calculate µ ij s (0) in accordance with (2), then update π j s (0) in accordance with (3).
3. Obtain improved estimations for s by finding s (1) 4. Return to step 2, replace s (0) with s (1) , and continue until convergence is attained.Meeker & Escobar (1992) proposed evaluating the effect of perturbations on the model, or the weight they have on the maximum likelihood estimates obtained from censored survival data.Waller & Turnbull (1992) analyzed several graphic methods to check goodness of fit in the case of right censored data, and the proposed making an empirical rescaling of the axes to prevent data to be grouped around particular areas in the graphics.Chang & Weissfeld (1999) proposed two diagnostic methods to evaluate the accuracy of the confidence region based on the partial likelihood function using a Cox's proportional hazards model with censored data.Joly & Commenges (1999) studied both the intensity and survival function for a progressive right-shift multi-state model using arbitrary censored data; they illustrated their method using longitudinal data about AIDS.Rosales & Salazar (2006) generalized the model proposed by Joly & Commenges (1999) and formulated a likelihood function that considers the presence of arbitrary censoring.However, the problem of constructing simultaneous confidence bands with arbitrary censoring still presents opportunities for development.This paper discusses how to obtain simultaneous confidence bands when a Weibull regression model with arbitrary censoring is considered.In the case of simultaneous confidence bands (SCB) for the cumulative distribution function, Cheng & Iles (1983) used the Wald statistic to construct the SCB for quantiles of the cumulative distribution function and the probability of failure.Cheng & Iles (1988) extended their confidence bands results to cumulative distribution functions that are members of the location and scale family with complete data.Jeng & Meeker (2001) generalized the work of Cheng & Iles (1988) using the Wald statistic with the Revista Colombiana de Estadística 40 (2017) 85-103 observed Fisher's information matrix, the Wald statistic with local information matrix Fisher, and the likelihood ratio statistic.Finally, Escobar, Hong & Meeker (2009) extend the work of Cheng & Iles (1983) in the following ways: 1.They showed how to compute SCB based on local information, expected information, and estimated expected information for both the "cdf method" and the "quantile method", Escobar et al. (2009); Cheng & Iles (1983) considered only the expected information case 2. They described a calibration method of the intervals to provide exact coverage for type II censoring and improved approximate coverage for other kinds of censoring.
3. They discussed how to extend these procedures to regression analysis.
This work was motivated by a radiographic progression study conducted in Colombia the propose of which was to identify risk factors related with Rheumatoid Arthritis (RA Rojas, Diaz, Calvo, Salazar, Iglesias, Mantilla & Anaya 2009).Suppose that a patient is observed at irregular times and at each visit his/her health state is assessed and classified in three categories namely mild, moderate, or severe.Since, in general, it is not possible to observe a patient continuously, one of the following situations would probably by time: 1. On the first visit, the patient could be in a moderate or severe state of the disease.In this case, the time when the patient changed from mild to moderate or from mild to severe is unknown.This generates left censored data.
2. The patient is observed at least once in mild or moderate condition and then he/she left the study for some reason.This generates right censored data.
3. In two consecutive visits the patient changed of state (say from mild to moderate or from moderate to severe) but the exact time when this occurred is unknown.This generates interval censored data.
This dataset about radiographic progression of RA exhibits these three types of censoring, and therefore it is not convenient to analyze it using conventional approaches that take into account only right censored data, such as the well-known Cox model.Even if we fit a parametric model that takes into account the dynamics of censoring the data set, the way the goodness of fit is evaluated could not be entirely correct because the confidence bands of Nair (1984) are used; these are nonparametric and only work for right censored data.Ii then seems more reasonable to build confidence bands that take into account arbitrary censored data.PROC LIFEREG of SAS c , allows data to be modeled with censored arbitrary data as long as a parametric regression model is specified.Allison (1995) fitted a Weibull model, but the way he assessed the model's goodness of fit is not entirely satisfactory because he used Nair's confidence bands in the presence of interval censored data, which cannot not correct.
The goal of this paper is to propose simultaneous confidence bands for a Weibull regression model in the presence of arbitrary censored data.Specifically, instead of using likelihood to obtain the confidence interval, we adapted the simultaneous confidence parametric bands proposed by Escobar et al. (2009) in conjunction with the likelihood function of a bivariate distribution.This is a different strategy from that of just imputing the interval censored data.This strategy is the works most important contribution and it yields simultaneous parametric confidence bands.It is contrasted this with PROC LIFEREG of SAS c , which yields nonparametric confidence bands.
They also made comparisons, using a simulation study based on the deviance of two models.The first estimated the parameters using likelihood with arbitrary censoring, and the other estimated the parameters using a bivariate likelihood (Gentleman & Vandal 2001).The goal was to evaluate which of the two likelihoods yielded better estimates.
Take into consideration that the goal of this paper is to build a bivariate likelihood with dependency for interval-censored data in order to find S(t).Since this dependency will be specified by means of copulas, it is important to first define them.

Copulas
Suppose that C α is a distribution function with density c α over [0, 1] 2 for α ∈ R. Let (T 1 , T 2 ) be the failure times and let both (S 1 , S 2 ) and (f 1 , f 2 ) be its corresponding marginal survival and density functions, respectively.If (T 1 , T 2 ) comes from a copula C α , for any α ∈ R, the joint survival and density functions of (T 1 , T 2 ) are given by where α represents the dependency parameter between T 1 and T 2 .
We will use the Archimedean family of copulas because is the most used copula family.A bivariate distribution belonging to the family of Archimedean copula models can be represented in the following way: where φ is a convex and decreasing function such that φ ≥ 0, φ (1) = 0.The φ function is named generator of the C α copula and the inverse of the generator, φ −1 and is the Laplace transform of a latent variable denoted as γ, which induces the dependency α.Thus, the selection of a generator results in several families of copulas.Table 1 shows the forms for bivariate survival functions in three Archimedean copula families.Additionally, Table 2 shows the generators and the Laplace transform for the considered families.

A Likelihood Function for Interval-Censored Bivariate Data
Let T and V be two random variables with the cumulative distribution function F (t, v); both T and V are Type I interval censoring.So, instead of observing the pair (T, V ), we observe the vector be two bivariate random variables with the joint probability density function g(t, v), the joint cumulative distribution function It is understood that (T , V ) and (T, V ) are independent and Pr(T We assume n independent and identically distributed repetitions of The underlying repetitions of (T, V ) are (t 1 , v 1 ), . . ., (t n , v n ).For each observation i, the (T i , V i ) points define 9 rectangles the are denoted as R jki , for j, k = 1, 2, 3, where the values of Let g(t, v) denote the joint density of (T , V ), where t = (t 1 , t 2 ) and v = (v 1 , v 2 ).Let f (t, v) denote the joint density of (T, V ).Since (T , V ) and (T, V ) are independent, the joint density of (T , ) and the fact that ∆ 11 = 1, the distribution of Ψ is obtained as follows: Here, for convenience, we use the notations Then, the loglikelihood is: If F T is the marginal distribution function for T and F V is the marginal distribution function for V , the loglikelihood for F is given by: Revista Colombiana de Estadística 40 (2017) 85-103 When we only have interval and right censoring, δ 11i = 0, δ 12i = 0 and δ 13i = 0, then n (F ) reduces to: In terms of the survival function, we have: Consider the Weibull regression model, Where the response variable T could include the three types of censoring (left, right, and interval censoring), β is a vector of unknown parameters, and σ is the scale parameter (σ > 0), T ∼ Weibull(µ, σ), W ∼ SEV(0, 1), with µ = β 0 + β Z where SEV (0.1) is the standard smallest extreme value distribution.
To verify the assumptions of the Weibull regression model, we use the standardized residuals: If the Weibull model is suitable, then these residuals could be thought of as a censored sample of a small extreme value distribution, (W ∼ SEV(0,1)).
Let V be an auxiliary variable so that T and V are highly dependent, let τ T,V , be the Kendall's tau, τ , between T and V .Since W = (log T − β 0 − β Z)/σ, is an increasing function of T and τ is invariant under monotonic transformations, τ T,V = τ W,V .
To estimate the parameters of the Weibull regression model, we use the bivariate loglikelihood for S, which is expressed as: If the Gumbel copula is used for constructing the bivariate distribution with dependency parameter τ, we have the following: Even though the uniform distribution has rough edges, it works well in the simulation process, as we will show in the next section; however, another distribution could be used, for instance, the beta distribution.

Simulation Study
To explore if the bivariate likelihood with random censoring improves the estimations of the parameters of the Weibull regression model in comparison to the ones obtained with the Turnbull method (Turnbull 1976), the following simulation study was carried out.
Recall that the Weibull regression model is specified as: Therefore, to generate times from a Weibull model, we must generate Z and W , maintaining β, β 0 and σ fixed.Since it is assumed that T and V are highly dependent and that their dependence could be measured with the coefficient τ , that was fixed at τ = 0.99, and since τ is invariant under monotonic transformations, then we must generate W in such a way it satisfies τ (W, V ) = 0.99.
The simulation factors that will be controlled are: 1. Sample size n: the objective of this factor is to assess the effect of the number of individuals in the study during the estimation process.Values of n = 50, 100, 200 will be taken because they can easily appear in practice.
2. Percentage of interval censored observations p: the objective of this factor is to evaluate the effect of the percentage of interval censored observations p during the estimation process.The values of p = 0.5, 0.7, 0.9, will reflect situations with high interval censoring percentages; the remaining data are right censored.
3. Variance of time until the event of interest σ 2 T : the objective of this factor is to assess the effect of the variance of the time until the event of interest during the estimation process.The values of σ 2 T = 4, 25, 100 because we want to observe the effect in the presence of small and large variances.
4. Parameter vector β: the objective of this factor is to evaluate the effect of the explanatory variable Z, on the estimation process.We will consider the following values: β = −0.9,−0.7, −0.5, −0.3.Some simulations were performed with positive values of β and they yielded very similar results to the ones obtained using negative values.
5. Distribution of the explanatory variable Z: the objective of this factor is to assess the effect of the distribution of the explanatory variable Z on the estimation process.For the sake of simplicity only two distributions will be considered, a continuous standard normal distribution Z ∼ NOR(0, 1) and an ordinal discrete binomial distribution with parameters n = 6 and p = 0.5, (Z ∼ BIN(6, 0.5)).However, more complex distributions could also be considered.
Finally, with the simulated data, β 0 , β and σ will be estimated to obtain β 0 , β, σ, and the square root of the mean square errors will be calculated to observe the accuracy of the estimation process.
In Lawless & Babineau (2006), we find a comprehensive discussion on how to generate interval-censored data.Ii specifically refers to the case of a longitudinal study, in which there is a periodic follow-up of the scheduled visits, and it takes into account that the patients could miss some of their appointments.We supposed that there are M potential inspection times a j , j = 0, . . ., M , for instance a j = j.The probability of patients attending each scheduled visit is p.For an individual i, the observed interval censored (L i , R i ] is constructed by defining R i as the first visit in which the event of interest is observed.L i is the previous visit, i.e.L i = max a j : a j < T i , δ i j = 1 and R i = min a j : a j ≤ T i , δ i j = 1, where δ i j = 1, indicates that the visit occurred at time a j .Different values of p lead to different interval lengths.For instance, p = 0.3 implies that 70% of the visits are missed, which would lead to the observation of wide confidence intervals for T .
With censored data, β 0 , β and σ will be estimated using the interval-censored likelihood and we will write those estimates as β 0int , β int and σ int , then the square root of the mean square errors will be calculated to measure the accuracy of the estimation process.
With the censored data, β 0 , β and σ, will be estimated by taking the likelihood as a bivariate likelihood and we will denote those estimates as β 0biv , β biv and σ biv .The square root of the mean square errors will they be calculated to measure the accuracy of the estimation process.This optimization process will be carried out using the Nelder-Mead Simplex algorithm (Nelder & Mead 1965), which is one of the options included in the maxLik package of R software.This algorithm was used instead of the Newton-Raphson method because it showed a better performance in the preliminary tests.
Figure 1 shows that the square root of the mean square errors does not substantially change when varying the sample size.It also shows that if we take the likelihood as a bivariate likelihood for random-censored data, taking into account the auxiliary variable V , and if we estimate β 0 , β, and σ, the square root of the mean square errors of β 0 , β, and σ are much lower than if we estimate these parameters of the Weibull model using a traditional likelihood with random censoring without considering the auxiliary variable V .Figure 2 shows that the square root of the mean square errors does not substantially change when varying the right censoring percentage p and that if we take the likelihood as a bivariate likelihood for random-censored data, taking into account the auxiliary variable V , and if we estimate β 0 , β, and σ, the square root of the mean square errors of β 0 , β and σ are much lower than if we estimate these parameters of the Weibull model using a traditional likelihood with random censoring without considering the auxiliary variable V .
Figure 3 shows that the square root of the mean square errors does not substantially change when varying the variance of the time of interest.It also shows that if we take the likelihood as a bivariate likelihood for random-censored data, taking into account the auxiliary variable V , and if we estimate β 0 , β, and σ, the square root of the mean square errors of β 0 , β, and σ are much lower than if we estimate these parameters of the Weibull model using a traditional likelihood with random censoring without considering the auxiliary variable V .Figure 4 shows that the square root of the mean square errors does not substantially change when varying the coefficient of the explanatory variable β.It also shows that if we take the likelihood as a bivariate likelihood for random-censored data, taking into account the auxiliary variable V , and if we estimate β 0 , β, and σ, the square root of the mean square errors of β 0 , β, and σ are much lower than if we estimate these parameters of the Weibull model using a traditional likelihood with random censoring without considering the auxiliary variable V .
In Figure 5 simultaneous parametric confidence bands (Escobar et al. 2009) are shown.To construct these, we used both data with arbitrary censored, and a bivariate likelihood functions with arbitrary censored data that has an auxiliary variable V that is highly correlated with the response variable.On the right side of the graph, we can see that when the cumulative distribution function with the bivariate likelihood is estimated, taking into account the auxiliary variable V , the cumulative distribution is fairly close to the real cumulative distribution.However, if we do not take into account the auxiliary variable V , the estimated cumulative distribution is not that close to the real cumulative distribution.In the graph on the left, we can observe that the confidence parametric bands proposed by Escobar et al. (2009), in the case of the auxiliary variable, contain all the straight lines.These represent the real cumulative distribution function, whereas when if we do not take into account the auxiliary variable, this straight line is out of the confidence bands.In summary, the use of the bivariate likelihood, the construction of which is undertaken considering the auxiliary variable, is recommended.Escobar et al. (2009) for F (t) to the interval censorored case, using the two likelihoods.

Conclusions and Recommendations
When the goal is to study the lapse of time until the occurrence of an event of interest, and to detect whether such event occurred, it is necessary to measure a variable that could be and index.This index which is known as an auxiliary variable, is correlated to the time of occurrence of the event, and this occurrence time could exhibit left, right, or interval censoring.In addition, if some covariates are available and we want to adjust a parametric regression model to determine which of these covariates are related to the time of occurrence of an event, we can not only use a likelihood with the three censoring types, but also the proposed bivariate likelihood.To obtain the maximum-likelihood estimators of β 0 , β, σ, the maxLik package of the R software was used as this package maximizes the likelihood functions.After this, the Nelder-Mead method was applied since it showed greater stability during the estimation process.
According to the results from the simulation study, we can conclude that the estimated parameters of the Weibull model using the proposed methodology (the bivariate likelihood) are closer to the real values of the parameters than the ones obtained only taking into consideration the three types of censoring.However, it is worth noting that the standard errors associated with the proposed method are consistently higher than the ones from the conventional methods in all the simulation scenarios.
Also, from the simulation study, we can see that, according to the likelihood ratio test, the proposed model (that uses the auxiliary variable in addition to the three types of censoring) performs better than the model that only takes into consideration the three censoring types.This is because a higher percentage of acceptance was obtained.
According to these conclusions, when interval-censored data are available, for which the interval censoring is determined by the measurement of a variable that indicates whether the event of interest occurred or not, and when the goal is to adjust a Weibull regression model, the use of the bivariate likelihood proposed in this paper is recommended.This is because it produces closer estimations to the real parameters than the estimations obtained when the likelihood for intervalcensored data is used.
In terms of future work, this methodology could be implemented as a package of R-project and this work could be applied to other members of the localization and scale family.

Figure 1 :
Figure 1: MSE behavior varying the sample size n using the three estimation methods.

Figure 2 :Figure 3 :Figure 4 :
Figure 2: MSE behavior varying the proportion of interval censored p using three estimation methods.

Figure 5 :
Figure5: Extension of the simultaneous confidence bands ofEscobar et al. (2009) for F (t) to the interval censorored case, using the two likelihoods.

Table 2 :
Generators and their Laplace Transforms.