A goodness of fit test for the survival function under random right censoring

Abstract: The present article contributes a goodness of fit test for the survival function under random right censoring. The test is based on a central limit theorem for the Integrated Square Error of a kernel survival function estimate already available in the literature. Establishing the asymptotic distribution of this error yields the proposed test statistic for deciding on the null hypothesis that the assumed survival function is correct. Numerical simulations quantify the empirical nominal level and power of the suggested test for various sample sizes and amounts of censoring, and facilitate comparison with the power of the data driven Neyman goodness of fit test for censored samples.


Introduction
This article considers goodness of fit testing of the survival function against a fully specified form under the random right censored data setting. The intention is to provide a statistical test applicable to a wide variety of situations for assessing the validity of a parametric model. This is achieved by employing, as the basis of the test, a kernel survival function estimate, which by definition does not depend on distributional assumptions about the underlying data.
Typically, formulation of a hypothesis test involves establishing the asymptotic distribution of a measure of accuracy of the smooth estimate, under the null hypothesis that the true survival function is fully specified by a given parametric model. The preferred measure here is the Integrated Square Error (ISE) because it quantifies the performance of the estimate for the data set at hand. The smooth estimate is taken to be the estimate of [15], given in Section 2 below. The absence in the literature of a goodness of fit test based on the estimate employed here, together with the advantages this estimate offers (discussed in detail in [2]), motivates the development of such a test and extends its scope of application.
Based on the ISE, development of the test statistic proceeds along lines parallel to those of [9] for the density setting, which were also used by [12] for the fixed design nonparametric regression setting. The result reveals that the form of the obtained statistic is determined by the amount of smoothing applied to the data. Furthermore, the empirical power of the proposed goodness of fit test is investigated via an extensive simulation study and is compared, under the same settings, with the power of Neyman's test.
The methodological contributions of the present research include a central limit theorem for the ISE of the smooth estimate of [15] as well as numerical evidence on the null performance and power of the proposed test for various sample sizes and amounts of censoring.
The rest of the paper is organized as follows. The framework of study, together with the definition of the smooth estimate, is given in Section 2. Section 3 is devoted to the central limit theorem for the ISE of the estimate and the development of the suggested test. The simulation studies on the nominal level and the power of both the proposed and the Neyman test are given in Section 4. Section 5 demonstrates how to utilize the test in practice with real data. Section 6 contains a discussion of the present work with emphasis on when and how the test should be used. Technical proofs are deferred to the final section.

Notation and preliminaries
Let T_1, T_2, ..., T_n be a sample of n i.i.d. survival times censored on the right by n i.i.d. random variables U_1, U_2, ..., U_n, independent of the T_i's. Let f and F be the density and distribution function of the T_i's and H the distribution function of the U_i's. The observed data are then the pairs (X_i, Δ_i), i = 1, 2, ..., n, with X_i = min{T_i, U_i} and Δ_i = 1_{T_i ≤ U_i}, where 1_{·} is the indicator random variable of the event {·}. The observed data form an i.i.d. sample with probability density g and distribution function G which satisfies 1 − G = (1 − F)(1 − H). An estimate of the unknown survival function S(x) = P(T > x) for a continuous duration T is Ŝ(x) = 1 − F̂(x), where

F̂(x) = n^{−1} Σ_{i=1}^n [Δ_i / (1 − H(X_i))] W((x − X_i)h^{−1}),   W(x) = ∫_{−∞}^x K(u) du.

The real valued function K is called the kernel and integrates to 1, while h is called the bandwidth and controls the amount of smoothing applied to the estimate. The estimator Ŝ(x) cannot be used directly in practice as it involves the unknown censoring distribution H(x). One solution is to reverse the intuitive roles played by T_i and U_i and estimate 1 − H(x) by the (slightly modified) Kaplan-Meier estimator, [16],

1 − Ĥ(x) = Π_{i: Z_i ≤ x} ((n − i)/(n − i + 1))^{1 − Λ_i},

where (Z_i, Λ_i) are the ordered X_i's, along with their censoring indicators Δ_i, i = 1, ..., n. This gives rise to the practically useful estimator

Ŝ_n(x) = 1 − n^{−1} Σ_{i=1}^n [Λ_i / (1 − Ĥ(Z_i))] W((x − Z_i)h^{−1}),   (1)

which is used next as the basis of the proposed goodness of fit test.
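For concreteness, the estimator above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the inverse-probability-of-censoring-weighted form, the product-limit form of the censoring estimator, and all function names are assumptions made here for illustration, and the integrated Epanechnikov kernel of Section 4 is used.

```python
import numpy as np

def integrated_epanechnikov(x):
    """W(x): the Epanechnikov kernel K(u) = 0.75*(1 - u^2) integrated
    from -infinity to x; equals 0 below -1 and 1 above 1."""
    x = np.clip(x, -1.0, 1.0)
    return 0.5 + 0.75 * x - 0.25 * x**3

def censoring_survival_km(Z, Lam):
    """Product-limit (Kaplan-Meier type) estimate of 1 - H, evaluated at
    the ordered observations Z_i (Lam_i = 1 marks an uncensored time,
    so the censoring estimate jumps only where Lam_i = 0)."""
    n = len(Z)
    i = np.arange(1, n + 1)
    factors = ((n - i) / (n - i + 1.0)) ** (1 - Lam)
    return np.cumprod(factors)

def survival_estimate(x, Z, Lam, h):
    """Smooth survival estimate: one minus the mean of the weighted,
    integrated-kernel terms (assumed IPCW form of the estimator)."""
    one_minus_H = censoring_survival_km(Z, Lam)
    one_minus_H = np.maximum(one_minus_H, 1.0 / len(Z))  # guard the right tail
    weights = Lam / one_minus_H
    return 1.0 - np.mean(weights * integrated_epanechnikov((x - Z) / h))
```

With no censoring (all Λ_i = 1) the weights reduce to 1 and the sketch collapses to the usual smoothed empirical survival function.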

Assumptions and main results
The main result of this research is presented in this section in the form of Theorem 1 below, which is then applied to drawing inference about the true survival curve. Prior to that, the necessary notation and assumptions are given. First, denote by μ_i(K) the ith moment, i = 0, 1, 2, of the function K and by R(K) the integral of the real function K^2 over its domain. The following conditions are assumed throughout:
1. S(x) is twice differentiable and S''(x) is bounded and uniformly continuous.
2. For l = 0, 1, 2, the lth derivative of K, K^(l), is bounded and absolutely integrable with finite second moments.
3. R(K) < +∞ and μ_0(K) = 1, μ_1(K) = 0, μ_2(K) < +∞, i.e. the kernel K is of order 2.
4. There exists small enough h such that W((y − x)h^{−1})/(1 − G(y)) is uniformly bounded for |y − x| > M, for any M > 0.
A consequence of condition 2 is that W(x) is bounded, while condition 3, and particularly μ_0(K) = 1, implies that W(x) → 1 as x → +∞. Conditions 1-3 are satisfied by virtually all kernels used in practice; see for example [19]. Condition 4 essentially means that there should be enough censored data at the right end of the estimation region for the asymptotics to apply. It has to be noted that condition 4 is automatically satisfied when the kernel has bounded support.
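As an illustration of conditions 2-3, the moment and roughness constants of the Epanechnikov kernel (used later in Section 4) can be checked numerically. This is a quick sketch using a hand-rolled trapezoidal rule, not part of the formal development.

```python
import numpy as np

def trapezoid(y, x):
    """Simple trapezoidal rule (kept explicit to avoid library-version issues)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Epanechnikov kernel on its support [-1, 1]
u = np.linspace(-1.0, 1.0, 200001)
K = 0.75 * (1.0 - u**2)

mu0 = trapezoid(K, u)          # mu_0(K): the kernel integrates to 1
mu1 = trapezoid(u * K, u)      # mu_1(K): 0 by symmetry, so K is of order 2
mu2 = trapezoid(u**2 * K, u)   # mu_2(K): finite second moment, equals 1/5
RK = trapezoid(K**2, u)        # R(K): equals 3/5
```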
A widely used measure of the closeness of two curves is the ISE. In the case of the estimator Ŝ(x) it is defined as

I_n = ∫ {Ŝ(x) − S(x)}^2 dx.

Also let Z denote an asymptotically standard normal random variable. Then, the limiting distribution of I_n is given in the next theorem.

Theorem 1. Assume conditions 1-4. Also assume that h → 0 and nh → +∞ as n → +∞. Then the centered and scaled ISE, d(n)(I_n − EI_n), is asymptotically normal, with scaling and variance determined by the limiting behavior of nh^3 as described in Remark 1 below.

Remark 1. It is evident from Theorem 1 that the limiting distribution of the centered and scaled I_n depends on the amount of smoothing applied to the data. If nh^3 → 0, the data are undersmoothed and the stochastic part of the deviation dominates the systematic part; in this situation the stochastic behavior of the centered and scaled ISE is determined by the U-statistic term Î_1 of Lemma 1. In the case of oversmoothing (nh^3 → +∞), the stochastic part of d(n)(I_n − EI_n) is asymptotically negligible compared to the systematic part, and the limiting distribution of the centered and scaled ISE is determined by the cross term Î_4 of Lemma 4. For optimal smoothing, i.e. nh^3 → λ, neither term dominates, as asymptotically they both have the same magnitude; in this case the limiting distribution of the centered and scaled ISE is that of the sum of the two terms above.
Theorem 1 is applied next to create a goodness of fit test. The test addresses the simple null hypothesis H_0: S(x) = S_0(x) against the alternative H_1: S(x) ≠ S_0(x), where S_0(x) denotes the assumed parametric model and is completely known. By Corollary 1 in [2], Theorem 1 and under H_0, the proposed test statistic, T_n, is obtained by replacing Ŝ(x) by Ŝ_n(x) and S(x) by S_0(x) in the expression of I_n. By Lemma 3, the centered and scaled T_n, say T_{n,*}, admits a closed-form expression. Thus, in testing H_0 against H_1 with significance level a, the test suggests rejection of H_0 when T_{n,*}(Var{T_{n,*}})^{−1/2} > z_a, where z_a is the standard normal quantile at level a. Implementation of the test is discussed in the next section.
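The resulting decision rule is simple to state in code. The sketch below assumes only the standardized statistic is available; the bisection-based quantile is an illustrative stand-in for a proper quantile routine, and z_a for a = 5% (one-sided) is about 1.6449.

```python
import math

def z_quantile_upper(a):
    """Approximate upper-tail standard normal quantile z_a by bisecting
    the normal survival function (adequate for a sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > a:
            lo = mid  # survival probability still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2.0

def reject_H0(T_star, var_T_star, a=0.05):
    """Reject H0: S = S0 when the centered and scaled statistic,
    standardized by its standard deviation, exceeds z_a."""
    return T_star / math.sqrt(var_T_star) > z_quantile_upper(a)
```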

Simulations
The primary objective of this section is the study of the power function of the proposed test for small samples. Efficient implementation of the test in practice is discussed first, together with the test's operational characteristics. The power function is then simulated for various sample sizes, amounts of censoring and lifetime distributions in order to provide numerical evidence for its small sample practical performance.
In order to obtain a computationally convenient formula for realizing the test, the sample density is employed as a weight function for the integrands in the definition of T_{n,*}. This has the additional benefit of trimming out low density areas. Repeated use of the fact that weighting the integrand of a functional by the population's density leads to its expected value, which in turn can be reasonably estimated by its sample mean, yields the sample version of T_{n,*}, where the second term on the RHS of (4) results from applying the bias expression of Ŝ_n(x) (Theorem 1, [2]) to (2). The quantities involved in (4) and (5) are discussed next. First, the Epanechnikov kernel is used throughout this section. Its integrated version, W(x), is given by

W(x) = 0 for x < −1,  W(x) = 1/2 + (3/4)x − (1/4)x^3 for |x| ≤ 1,  W(x) = 1 for x > 1.

By direct calculation for this kernel, the constants in (4), (5) and elsewhere throughout this section are R(K) = 3/5 and μ_2(K) = 1/5. Now, Ŝ_n(x), defined in (1), uses the MSE optimal (local) bandwidth of [15], (3.3.1); note that this is also the bandwidth h used in (4) and (5). As this expression involves unknown quantities, following [15], a practical version is given in (7). In (7), f̂(x) and f̂'(x) are estimates of the unknown f(x) and f'(x) respectively, based on an exponential reference distribution with its mean estimated by maximum likelihood; that is, f̂(x) = λ̂e^{−λ̂x} and f̂'(x) = −λ̂^2 e^{−λ̂x} with λ̂ = 1/X̄. Further, by direct calculation based on the Epanechnikov kernel, (7) is implemented with these reference estimates. Next, f_n(x) is the estimate of the underlying density, discussed extensively in [21]. It is implemented with the MSE optimal (local) bandwidth (see also [21], page 1525), where f̂''(x) denotes an estimate of the second derivative of the underlying density, obtained following the same rationale that was used for obtaining (8). Furthermore, in implementing f_n(x), the reflection method of [13] has been employed to reduce boundary bias. This means that, in practice, the second derivative of Ŝ_n(x) in (4) is estimated, based on [13], page 143, by the negative of the boundary-adjusted derivative estimate in (10).
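The integrated Epanechnikov kernel and the exponential-reference estimates just described are straightforward to code. A minimal sketch with illustrative names, using the exponential reference with rate 1/X̄ as in the text:

```python
import math

def W(x):
    """Integrated Epanechnikov kernel: 0 below -1, 1 above 1, and the
    cubic 1/2 + (3/4)x - (1/4)x^3 on the support [-1, 1]."""
    if x <= -1.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 0.5 + 0.75 * x - 0.25 * x**3

def exponential_reference(sample):
    """Reference estimates of f and f' based on an exponential distribution
    whose mean is estimated by maximum likelihood (the sample mean)."""
    rate = len(sample) / sum(sample)  # MLE of the exponential rate
    f_hat = lambda x: rate * math.exp(-rate * x)
    f_prime_hat = lambda x: -rate**2 * math.exp(-rate * x)
    return f_hat, f_prime_hat
```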
In (10), h d = (4/(5n)) 1/7σ whereσ is the standard deviation of the sample. That is, h d is the univariate version of the normal scale bandwidth selector of [6], page 815, (3.2). It is important to note that a separate simulation on the performance of f ′ n (x) (not reported here) indicated that the use of reflection further improves its performance at the boundary.
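The normal scale bandwidth h_d above is a one-line computation. The sketch below uses the sample standard deviation with the usual n − 1 divisor, which is an assumption here since the text does not specify the divisor.

```python
import math

def normal_scale_bandwidth(sample):
    """h_d = (4/(5n))^(1/7) * sigma_hat, the univariate normal scale
    bandwidth used for the density-derivative estimate."""
    n = len(sample)
    mean = sum(sample) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (4.0 / (5.0 * n)) ** (1.0 / 7.0) * sigma
```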
Attention is now shifted to the implementation of the denominator of the test statistic, i.e. (Var{T_{n,*}})^{−1/2} ≡ 1/σ(T_{n,*}). By the same weighting argument as in (4) and (5), and assuming the optimal smoothing version of (3) due to h and h_0, we obtain (11), where λ̂ = nh^3. Note that (12) is realized by using the negative of (10) in place of Ŝ''_n(X_i), and (13) is obtained by direct calculation specifically for the Epanechnikov distribution function. Combining (4) and (11), the test statistic used in practice is T_{n,*}/σ(T_{n,*}).
Performance of the test statistic under the null and alternative hypotheses is discussed next. Three families of common lifetime distributions are assumed for this purpose: the exponential with rate λ (exp(λ)) and the Weibull and Gamma distributions with shape parameter κ and scale parameter λ (W(κ, λ) and G(κ, λ) respectively). The null hypothesis is formulated as H_0: S(x) = e^{−x}, x ≥ 0, which corresponds to the survival function of any of the null distributions exp(1), W(1, 1) or G(1, 1). In order to estimate the power of the test, the probability of rejecting the null hypothesis given that the alternative is true is approximated at each of 12 specific alternatives. Each alternative survival function, say S_1(x), is seen as a member of a sequence of distributions from the same family as S_0(x), with the parameter(s) selected so that the Kullback-Leibler divergence between S_0(x) and S_1(x) progressively increases. Thus, for each distribution, sample size and amount of censoring, the conditional probability of rejecting H_0: S(x) = S_0(x) given that H_1: S(x) = S_1(x) is true is approximated from m = 10,000 independent random samples by the proportion of samples whose standardized statistic exceeds the cut-off point, where, for each specific alternative, S_0(x) in the definition of T̂_{n,*} in (4) is replaced by S_1(x). Assuming significance level a = 5% (probability of type I error), the cut-off points in (14) are estimated by the 95% quantile of the numerical distribution of T̂_{n,*}/σ(T_{n,*}). The distribution of the statistic is approximated by generating 100,000 values for each different sample size, assuming no censoring and under H_0; Definition 7 of [11] is then used for obtaining the cut-off point. It has to be noted that the cut-off points vary by sample size but not by censoring level, so as to get an indication of how censoring affects power. The results of the simulation are presented in Fig. 1, which contains the power of the test for each combination of distribution, sample size and censoring level.
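The two-stage Monte Carlo procedure just described — estimate the cut-off from the statistic's null distribution, then count rejections under the alternative — can be sketched generically. The statistic and samplers below are placeholders (a toy z-test for a normal mean), not the survival statistic of (4); only the structure of the procedure is illustrated.

```python
import math
import random

def null_cutoff(stat_fn, sampler_H0, a=0.05, m=5000, seed=2):
    """Estimate the (1 - a) quantile of the statistic's null distribution."""
    rng = random.Random(seed)
    stats = sorted(stat_fn(sampler_H0(rng)) for _ in range(m))
    return stats[math.ceil((1.0 - a) * m) - 1]

def mc_power(stat_fn, sampler_H1, cutoff, m=2000, seed=1):
    """Proportion of alternative samples whose statistic exceeds the cut-off."""
    rng = random.Random(seed)
    return sum(stat_fn(sampler_H1(rng)) > cutoff for _ in range(m)) / m

# Toy illustration: H0: mean 0 vs H1: mean 0.5, n = 30, unit variance.
n = 30
z_stat = lambda xs: sum(xs) / math.sqrt(n)                 # ~N(0,1) under H0
h0 = lambda rng: [rng.gauss(0.0, 1.0) for _ in range(n)]
h1 = lambda rng: [rng.gauss(0.5, 1.0) for _ in range(n)]
cutoff = null_cutoff(z_stat, h0)   # close to z_0.05 = 1.6449
power = mc_power(z_stat, h1, cutoff)
```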
The simulation results indicate that, as expected, power increases as divergence from H_0 increases. In parallel, power increases with sample size. On average, the test is consistent with the nominal level under no censoring. However, censoring has a drastic effect on the test's nominal level and power: for as low as 10% censoring, the nominal level of the test is (on average) doubled and, as censoring increases further, the type I error and the power increase with it. We conjecture, though, that using cut-off points based on the test statistic's distribution under censoring (calculated in an obvious manner) would yield nominal level and power figures closer to the uncensored ones. Censoring aside, the rejection percentages of Fig. 1 suggest that the further the divergence from the null, the more sensitive the test becomes in detecting the inappropriate fit resulting from H_1.

Fig. 1: Censoring (denoted by C) is either 0% (no censoring), 10%, 20% or 30%, and the sample sizes considered are n = 30, 40, ..., 100 at significance level a = 5%.

Furthermore, it has to be noted that even though many data driven estimates are used in (4) and (5), this does not prevent the test from possessing reasonable power. A theoretical explanation for this, at least asymptotically, is provided by first noticing that using Ĥ(x), f_n(x) and f'_n(x) instead of H(x), f(x) and f'(x) respectively in (4) and (5) results in errors of order o_p(n^{−1/2}), o(n^{−2/5}) and o(n^{−2/7}) respectively. In (4), the substitution of f'(x) by f'_n(x) in the second term of the RHS (which involves the derivative estimate) incurs an additional term of order n^{−1.62} (assuming an optimal bandwidth of order n^{−1/3}), which converges to zero very quickly. Similarly in (5), the substitution of H(x) and f(x) by Ĥ(x) and f_n(x) respectively results in an additional bias term of rate n^{−2.7} which, again, is negligible.
Therefore, in practice it is expected that the test statistic, at least asymptotically, will perform quite reasonably.
In addition, for comparison purposes, the same examples, under exactly the same settings as in Fig. 1, have been replicated using the Neyman data driven goodness of fit test for completely specified distributions. The test is implemented in the R package surv2sample, [20]. The package is not currently available from CRAN; however, it can be incorporated into existing MS Windows R installations via the Rtools package.
The comparison between the two methods indicates that the test suggested in the present research is more sensitive to subtle differences between the null and alternative distributions. However, as divergence from the null increases, the simulation results indicate that Neyman's test appears to be more powerful, although the rejection percentages of the present test remain reasonable. The source R code for obtaining the numbers used in Figs. 1-2, together with the code used for obtaining the cut-off points, as well as full instructions and comments, is available from the Journal's website.

Real data analysis
As real data analysis, the air conditioning unit failure data set of [22] is presented here to exhibit the practical usefulness of the test when employed to validate a parametric model. The data set consists of the time intervals, in hours, between successive failures of the air conditioning system of each member of a fleet of 13 Boeing 720 jet airplanes. The aim of [22] was to find a characterization of the distribution of the failure times at fleet level for maintenance purposes. After pooling the data (yielding thus a total of 213 observations), [22] investigated fitting an exponential survival function with the parameter of the distribution estimated by the mean of the failure intervals. The appropriateness of the fit was assessed by testing the null hypothesis

H_0: S(x) = e^{−x/93.14}   (15)

which, although not rejected by the Kolmogorov-Smirnov test, was found to be unreasonable for practical use as it suggests that failures decline with time; this is counterintuitive in view of the wear out and ageing effects of air conditioning units. It further implies that all failure times follow the same distribution irrespective of which plane they come from, something questionable too, as different planes might be exposed to different conditions which affect failure occurrences. Prompted by these issues, [14] developed models individually for each plane and used those as a basis for suggestions at the fleet level. Moreover, prompted by [7], another suggestion of [14] was the use of mixture distributions as an approximation of aggregated individual survival functions for the pooled data. This route was further followed by [1], [18] and [23], who all assessed the resulting fits graphically. As noted by [14] though, the use of mixture distributions can be uncertain, as adding or deleting one plane from the sample would change the mixture distribution. However, while the last examples correspond to distributions with decreasing failure rate, which, despite any ageing or wear out effects, is consistent with the observed data pattern, increasing failure rate distributions for specific planes in the dataset have been investigated too. As an example, [17] and [8] tested the null hypothesis of exponentiality versus increasing failure rates based on the gamma distribution; in both cases the null is rejected. Yet another approach was based on the fatigue life distribution (also known as the Birnbaum-Saunders (BS) distribution), which is used extensively in reliability applications to model failure times. Specifically, [5] tested (via the likelihood ratio statistic) the null hypothesis that the data come from the BS distribution against the alternative that the underlying distribution is a β-BS. The null was rejected at any usual significance level (p < 0.01).
The fact that all aforementioned approaches are parametric in nature, together with the fact that Ŝ_n(x) is geared towards revealing the pattern of the underlying survival, makes the suggested goodness-of-fit test an impartial tool for testing the validity of a given parametric model. For example, using T_{n,*}/σ(T_{n,*}) to test (15) leads to a p-value of 0.1178 (the statistic's value is equal to 1.1857), so the null model is not rejected at the usual significance levels, although the evidence in its favor is not strong. On the other hand, testing the model suggested by [1] leads to a p-value of 0.03247 (T_{n,*}/σ(T_{n,*}) = 1.8456), which provides evidence against that model.
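Since the standardized statistic is compared with a standard normal, the reported p-values can be reproduced directly from the statistic values via the normal survival function; a quick check using the values quoted in the text:

```python
import math

def p_value(t):
    """One-sided p-value P(Z > t) for an asymptotically N(0,1) statistic."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

p_exponential = p_value(1.1857)  # close to the reported 0.1178
p_mixture = p_value(1.8456)      # close to the reported 0.03247
```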

Discussion
This article has proposed a kernel based goodness-of-fit test, useful for assessing how well a parametric survival estimate matches the true curve under the random right censoring data setting.
The test is based on the asymptotic distribution of the discrepancy between the estimated and actual survival functions. Thus, it summarizes the divergence between observed values and the values expected under the model in question and, as a result, it has very general scope. Specific applications include: a) comparison of the distribution of a data set to a normal distribution or, in general, to a fully specified distribution/survival function, b) testing the normality assumption in analysis of variance, c) testing the normality of residuals in regression and d) testing whether outcome frequencies follow a specified distribution.
Additionally, by straightforward adjustments, it can serve as a goodness-of-fit test for Cox proportional hazards models when they are expressed in terms of the survival function. Further, it can potentially be used as an alternative to resampling techniques, as well as to graphical goodness-of-fit tests, when an objective view of the discrepancy between the estimated and the actual model is needed.

Proof of Theorem 1
The present section starts with an outline of the proof and continues with its full mathematical details in Subsections 7.1 and 7.2.
The proof begins by decomposing the ISE expression of F̂(x) into four terms: (a) a symmetric U-statistic, (b) its diagonal, (c) the integrated square bias of F̂(x) and (d) the integrated product of the stochastic and systematic parts of the deviation of F̂(x) from F(x). These terms are studied in Lemmas 1-4 and the resulting expressions are substituted back so as to obtain the central limit theorem for I_n.
In detail, Lemma 1 shows that (a) is an asymptotically zero mean normal random variable. This is established by expressing the symmetric U-statistic as a martingale and then applying a suitable central limit theorem to quantify the distribution of the statistic. Applicability of the theorem is validated through the conditional Lindeberg and conditional variance conditions given in Corollary 3.1 of [10]. Proceeding to the term in (b), Lemma 2 uses direct calculation to show that the expression there equals σ_2^2 plus a negligible (under optimal bandwidth) term. Lemma 3 shows that the expected ISE equals the term in (c) plus the result of Lemma 2; thus, the result of Lemma 3 is used to shape the LHS of Theorem 1 as well as to cancel out the σ_2^2 term resulting from Lemma 2. Finally, in Lemma 4 the term in (d) is written as a sum of independent and identically distributed random variables and so, by a central limit theorem, verification of the theorem's conditions and straightforward calculations, it is proved that the term is asymptotically normally distributed.
Putting back the results of all four lemmas in the decomposed I n expression and noting that for optimal bandwidth the asymptotic terms are negligible concludes the proof.

ISE decomposition
The proof of Theorem 1 starts with the ISE decomposition. Let Y_i = (X_i, Δ_i), i = 1, ..., n, denote the observed sample and write F̂(x) − F(x) = {F̂(x) − EF̂(x)} + {EF̂(x) − F(x)}, i.e. the sum of the stochastic and systematic parts of the deviation. Squaring and integrating, and splitting the integrated square of the stochastic part into its off-diagonal and diagonal parts, decomposes the ISE as

I_n = Î_1 + Î_2 + Î_3 + Î_4,   (16)

where Î_1 is a symmetric U-statistic, Î_2 is its diagonal, Î_3 is the integrated square bias of F̂(x) and Î_4 is the integrated product of the stochastic and systematic parts. Apply Lemmas 1, 2, 3 and 4 to (16), and note that for optimal bandwidth h ∼ n^{−1/3}, O_p(n^{−3/2}h) = O_p(n^{−7/6}), which is asymptotically negligible. This completes the proof.

Auxiliary Lemmas
Lemma 1. As n → +∞, we have Î_1 ∼ N(0, 2n^{−2}h^3σ_1^2).

Proof. The proof is analogous to the proof of Theorem 1 of [9]. Note that Î_1 is a symmetric U-statistic. Corollary 3.1 of [10] will be used to establish the asymptotic distribution of Î_1. It is readily established that Σ_{i=2}^k T_i, k = 2, ..., n, is a zero mean, square integrable martingale with differences T_i. Moreover, in the present setting the conditional Lindeberg condition of Corollary 3.1 of [10] is equivalent to its unconditional version ((3.6) in [10], page 53). This means that the proof will be completed by showing (a) the Lindeberg condition, for all ε > 0 as n → +∞, and (b) the variance condition. First, note that the Y_i, i = 1, ..., n, are i.i.d. It is then immediately seen that (a) and (b) imply Î_1/s_n ∼ N(0, 1), which in turn verifies the statement of the lemma. Starting with the proof of (a), first note that by Chebyshev's inequality the condition follows from a suitable moment bound. Expanding the summand of the RHS of (19) and using (49) and (50) of Lemma 6 yields (20); combining (19) and (20) gives (21), and using (17), (21) and (37) in (18) proves (a). We now turn our attention to (b). It is easily seen that (b) will be proved if we show (22). Working in exactly the same way as in [9], page 5, and then, for a positive generic constant C, working as in the proof of (4.6) of [9], we obtain (23) and (24). Now, (23), (24) and (36) prove (22), which in turn proves (b).
For the remainder of this section, and in order to simplify notation, define the quantities L_i and w_i(x) used below.

Lemma 2. As n → +∞, Î_2 = σ_2^2 + O_p(n^{−3/2}h).

Proof. Write Î_2 in the form (25) below.

Now,
By the definition of w_i(x), (40) for x = y and (41), after using a Taylor series expansion around x on the term f(x)(1 − H(x))^{−1} in the third step above, and also, for a positive generic constant C_1, applying the definition of w_i(x) and (40) for x = y, we obtain (27); similarly we obtain (28) and (29). Using (27)-(29) in (26) yields EL_i^2 = O(h^2) and hence (30). By (25), setting σ_2^2 = n^{−1}EL_i and using (30) completes the proof.

Lemma 3. As n → +∞, EI_n equals the integrated square bias of F̂(x) plus σ_2^2, up to asymptotically negligible terms.
Proof. This is verified by direct calculation using Theorem 1 in [2]. By applying Lemma 2, the result then follows immediately.
Using the change of variable y − x = hu and (33) yields an expression with remainder o(h^6).
Expanding in a Taylor series around x and using (34) and (35), (32) becomes of the stated order. Working in the same way, it is shown that EZ_i^4 = O(h^{12}). Therefore, Î_4 is normally distributed with zero mean and the variance displayed above.

Proof. The proof is based on conditioning on the number of uncensored observations of the observed sample Y_i = (X_i, Δ_i), i = 1, ..., n. If N denotes the number of uncensored observations, then N ∼ Binomial(n, p), where p = ∫ f(x)(1 − H(x)) dx. Given N = ν, (X_i : Δ_i = 1) is a set of i.i.d. random variables with density f(x)(1 − H(x))/p, for ν = 1, 2, ..., n. Now,