Cumulative distribution function estimation under interval censoring case 1

We consider projection methods for the estimation of the cumulative distribution function under interval censoring, case 1. Such censored data, also known as current status data, arise when the only information available on the variable of interest is whether it is greater or less than an observed random time. Two types of adaptive estimators are investigated. The first is a two-step estimator built as a quotient estimator. The second results from a mean square regression contrast. Both estimators are proved to achieve automatically the standard optimal rate associated with the unknown regularity of the function, although with some restriction for the quotient estimator. Simulation experiments are presented to illustrate and compare the methods.


Introduction
Let X be a survival time with unknown cumulative distribution function (cdf) F. In the interval censoring case 1 model, we are not able to observe the survival time X. Instead, an observation consists of the pair (U, δ) where U is an examination time and δ is the indicator function of the event (X ≤ U). Roughly speaking, the only knowledge about the variable of interest X is whether it has occurred before U or not. Early examples of such interval censoring can be found in demographic studies, see Diamond and McDonald (1991). In epidemiology, these censoring schemes also arise for instance in AIDS studies or, more generally, in the study of infectious diseases when the infection time is an unobservable event.
We assume that U is independent of X, that F has density f and that the cdf G of U has density g. Such data, also known as current status data, may remind us of right-censored data, where the observed data is the pair (min(X, C), 1_{(X≤C)}) with C a censoring variable. However, the estimation procedures in these two censoring models are substantially different. In the right-censoring model, the Kaplan and Meier (1958) estimator is well studied and is asymptotically normal at the rate √n. Current status data have nevertheless been studied by many authors in the last two decades, see Jewell and van der Laan (2004) for a state of the art. In the interval censoring model, the nonparametric maximum likelihood estimator (NPMLE) of the survival function is proved in Groeneboom and Wellner (1992) to be uniformly consistent and pointwise convergent to a nonnormal asymptotic distribution at the rate n^{1/3}. In van de Geer (1993), it is also established that the NPMLE converges at rate n^{−1/3} in L²-norm.
Recent extensions take two directions. First, more general contexts are considered. For example, van der Vaart and van der Laan (2006) build nonparametric estimates of the survival function for current status data in the presence of time-dependent and high-dimensional covariates: they provide central limit theorems with rate n^{1/3} and nonstandard limiting processes. The second direction aims at proposing smooth estimates that may take into account the possible smoothness of the survival function. Indeed, the NPMLE is a piecewise constant function. The locally linear smoother proposed by Yang (2000), unlike the NPMLE, may be non-monotone, but it has a better convergence rate than the NPMLE when the density f is smooth and the kernel function and the bandwidth are properly chosen. In the same spirit, Ma and Kosorok (2006) introduce an adaptive modified penalized least squares estimator built with smoothing splines, but their main objective is the study of semiparametric models. They have in mind the same type of penalization device that we present here, but their penalty functions contain many complicated terms that would be difficult to estimate.
Here, we also pursue the search for smooth (or piecewise smooth) adaptive estimators. We present two different penalized minimum contrast estimators built on trigonometric, polynomial or wavelet spaces, whose associated penalty terms are really simple; the minimization of the penalized contrast function makes it possible to choose a space that leads both to a non-asymptotic automatic squared-bias/variance compromise and to an asymptotic optimal convergence rate according to the regularity of the function F in terms of Besov spaces. An interesting feature of the procedure is that the estimators and their study are made straightforward by the powerful Talagrand (1996) inequality for empirical centered processes. We also use technical properties proved in a regression framework by Baraud et al. (2001) and Baraud (2002) for the mean-square estimator. Globally, the available tools and algorithms for adaptive density and regression estimation make our solution easy to study and to implement.
The plan of the paper is as follows. Section 2 introduces the quotient and the regression estimators, after the description of the lifetimes model. We also give a detailed description of the projection spaces with their main properties. Then, in Section 3, we study a projection estimator of the density associated with the examination times of individuals whose failure has already occurred. Both convergence and adaptation results are given. This estimator is then applied to the estimation of the cumulative distribution function via a quotient construction. Section 4 describes a direct adaptive procedure to estimate the distribution function based on a mean square regression contrast. Simulations compare both approaches in Section 5. We use as benchmarks the NPMLE and also the simple piecewise constant estimator proposed by Birgé (1999). Lastly, most proofs and technical lemmas are deferred to Section 6.

Model and assumptions
Let (U_1, δ_1), ..., (U_n, δ_n) be a sample of the pair (U, δ) where δ_i = 1_{(X_i ≤ U_i)} and the sequences (U_i)_{1≤i≤n} and (X_i)_{1≤i≤n} are independent. We are interested in the estimation of the distribution function F of the lifetime X on a compact set A only. By rescaling the data, we take A = [0, 1] without loss of generality.
The compact set is considered as fixed in the theory, even if, in practice, it is determined from the data. Recall that we denote by f and F the density and the cumulative distribution function of the unobserved lifetime X, and by g and G those of the examination time U. A function of interest is the density ψ of the U_i restricted to the individuals for which δ_i = 1, defined by:

ψ(u) = F(u) g(u).   (2.1)

It is clear that this equation provides a way to build an estimator of F. This approach is developed in Section 3. The censoring mechanism is such that the conditional law of δ = 1_{(X≤U)} given U = u is a Bernoulli law with parameter F(u), and as a consequence we have:

E(δ | U = u) = F(u).

This relation will lead to the definition of a direct mean-square estimator of F. Both strategies require the following assumption:

[A1] The density g of the random time U is lower and upper bounded on A: there exist finite real constants g_0 > 0 and g_1 > 0 such that, for all x ∈ A, g_0 ≤ g(x) ≤ g_1.
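The data-generating mechanism just described is easy to simulate. The sketch below is our own illustration (function names are ours, not the authors'): it draws current status pairs (U_i, δ_i) with X and U independent, and checks empirically that E(δ | U = u) = F(u).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_current_status(n, sample_x, sample_u, rng):
    """Draw n current status pairs (U_i, delta_i) with delta_i = 1{X_i <= U_i},
    X and U drawn independently."""
    x = sample_x(n, rng)
    u = sample_u(n, rng)
    return u, (x <= u).astype(float)

# Example in the spirit of Model 4 of Section 5: X ~ Exp(rate mu = 0.5),
# U ~ Exp(rate lambda = 1), so F(u) = 1 - exp(-0.5 * u).
u, delta = simulate_current_status(
    5000,
    sample_x=lambda n, r: r.exponential(scale=2.0, size=n),   # scale = 1 / mu
    sample_u=lambda n, r: r.exponential(scale=1.0, size=n),   # scale = 1 / lambda
    rng=rng,
)
# E(delta | U = u) = F(u): the empirical mean of delta for U near 1
# should be close to F(1) = 1 - exp(-0.5) ~ 0.39.
```

Only the pairs (U_i, δ_i) are kept; the latent lifetimes X_i are discarded, which is exactly the censoring scheme of the model.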

Definition of the estimators
Assume that we have at our disposal a collection of finite-dimensional spaces of functions, denoted by (S_m)_{m∈M_n}, satisfying the following assumption:

(H_1) (S_m)_{m∈M_n} is a collection of finite-dimensional linear subspaces of L²([0,1]), with dimension dim(S_m) = D_m such that D_m ≤ n for all m ∈ M_n, and satisfying:

there exists Φ_0 > 0 such that, for all m ∈ M_n and all t ∈ S_m, ||t||_∞ ≤ Φ_0 √D_m ||t||,   (2.3)

where ||t||² = ∫_0^1 t²(x) dx, for t in L²([0,1]).

Quotient estimator
As already mentioned, the first strategy requires estimating ψ and g. The estimator g̃ of g is chosen as the adaptive density estimator defined in Massart (2007), Chapter 7, namely g̃ = ĝ_{m̂_g}, where

ĝ_m = arg min_{t∈S_m} γ_n^g(t), with γ_n^g(t) = ||t||² − (2/n) Σ_{i=1}^n t(U_i),

and m̂_g = arg min_{m∈M_n} [γ_n^g(ĝ_m) + pen_g(m)].
For the estimation of ψ, we consider the following contrast function:

γ_n^ψ(t) = ||t||² − (2/n) Σ_{i=1}^n δ_i t(U_i).   (2.5)

The penalty function will be motivated and defined later. The contrasts γ_n^g and γ_n^ψ are both empirical versions of the L² distance between a function t in S_m and the function of interest (g or ψ). To see this, take the expectation of e.g. γ_n^ψ:

E(γ_n^ψ(t)) = ||t||² − 2 ∫_0^1 t(x) ψ(x) dx = ||t − ψ||² − ||ψ||².

This illustrates that minimizing γ_n^ψ is likely to provide a function t that minimizes in mean ||t − ψ||², and thus estimates ψ, on the space S_m. Now, the adaptive estimators ψ̃ of ψ and g̃ of g being defined, we can use Equality (2.1) to build a quotient estimator of the distribution function F by setting:

F̃(x) = min( max( ψ̃(x)/g̃(x), 0 ), 1 ).   (2.7)
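On the histogram basis ([H], i.e. r = 0), the minimizer of the contrast for ψ has explicit coefficients â_j = (1/n) Σ_i δ_i ϕ_j(U_i). A minimal sketch of this computation, assuming U rescaled to [0, 1] (our own illustration, with hypothetical helper names):

```python
import numpy as np

def psi_hat_histogram(u, delta, D):
    """Projection estimator of psi = F * g on the histogram basis with D cells of [0, 1].
    Basis: phi_j = sqrt(D) * 1_{I_j}; coefficients a_hat_j = (1/n) sum_i delta_i phi_j(U_i)."""
    n = len(u)
    cells = np.minimum((u * D).astype(int), D - 1)              # cell index of each U_i
    a_hat = np.sqrt(D) * np.bincount(cells, weights=delta, minlength=D) / n
    return a_hat                                                # psi_hat = sqrt(D) * a_hat[j] on I_j

# Check on Model 1 of Section 5: F(u) = u and g = 1, so psi(u) = u.
rng = np.random.default_rng(1)
n = 20000
u = rng.uniform(size=n)
delta = (rng.uniform(size=n) < u).astype(float)
a = psi_hat_histogram(u, delta, D=4)
# On cell I_j, psi_hat equals sqrt(4) * a[j], close to the midpoint value (j + 0.5) / 4.
```

The coefficients are simple empirical means, which is what makes the linearity of the contrast so convenient in the theoretical study.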

Regression estimator
On the other hand, a direct estimator of the cdf F can be obtained by considering the following mean-square contrast:

γ_n^MS(t) = (1/n) Σ_{i=1}^n (δ_i − t(U_i))².   (2.8)

In this case, we set

F̂_m = arg min_{t∈S_m} γ_n^MS(t),   (2.9)

in the sense that we can always compute a vector (F̂_m(U_1), ..., F̂_m(U_n)) as the orthogonal projection of the vector (δ_1, ..., δ_n) on the subspace of R^n defined by {(t(U_1), ..., t(U_n)), t ∈ S_m}. Then we define F̂_{m̂_0} by:

m̂_0 = arg min_{m∈M_n} [ γ_n^MS(F̂_m) + κ_0 (D_m/n) ],   (2.10)

where κ_0 is a numerical constant.
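For piecewise constant spaces, the least-squares projection of (δ_1, ..., δ_n) reduces to per-cell averages of the δ_i. A sketch under that assumption (illustrative code, not the authors' implementation):

```python
import numpy as np

def F_hat_ls(u, delta, D):
    """Least-squares projection of (delta_1, ..., delta_n) onto
    {(t(U_1), ..., t(U_n)) : t piecewise constant on D cells of [0, 1]}:
    on each cell it is the average of the delta_i whose U_i fall in the cell."""
    cells = np.minimum((u * D).astype(int), D - 1)
    num = np.bincount(cells, weights=delta, minlength=D)
    den = np.bincount(cells, minlength=D)
    return np.divide(num, den, out=np.zeros(D), where=den > 0)

# Model 3 of Section 5: F(u) = u^2 on [0, 1].
rng = np.random.default_rng(2)
n = 20000
u = rng.uniform(size=n)
delta = (rng.uniform(size=n) < u ** 2).astype(float)
fv = F_hat_ls(u, delta, D=5)
# fv[j] approximates the average of F over the j-th cell.
```

For a general basis, the same projection is obtained by ordinary least squares of the vector δ on the design matrix (ϕ_λ(U_i))_{i,λ}.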

Remark about the NPMLE estimator
In the above setting, it is conceivable to consider the log-likelihood contrast

γ_n^MLE(t) = (1/n) Σ_{i=1}^n [ δ_i ln t(U_i) + (1 − δ_i) ln(1 − t(U_i)) ].

If t is supposed to be a piecewise constant function with jumps only at the observed points, then the NPMLE F̂_n which maximizes γ_n^MLE can be obtained by the max-min formula, see Jewell and van der Laan (2004):

F̂_n(U_(i)) = max_{s≤i} min_{t≥i} ( Σ_{j=s}^{t} δ_(j) ) / (t − s + 1),

where U_(1) ≤ ... ≤ U_(n) denote the ordered examination times and δ_(j) the corresponding indicators. Most results are essentially of an asymptotic nature for the NPMLE, as already mentioned. Nevertheless, it is of interest, for setting the benchmark, to include it in our simulation study, see Section 5. The advantage is that no adaptation is required, but the NPMLE is a piecewise constant function. Now, another approach is to consider the histogram-type estimator introduced by Birgé (1999).
Let (I_j)_{1≤j≤D} be a partition of [0, 1] and let us consider a piecewise constant function t = Σ_{j=1}^D a_j 1_{I_j}. If one looks for such a function that maximizes the contrast γ_n^MLE, one finds the estimator given by Birgé (1999), whose value on each I_j is the proportion of indicators δ_i equal to one among the observations with U_i ∈ I_j. If D is of order n^{1/3}, Birgé (1999) gives weak assumptions ensuring that the L¹-risk is of order n^{−1/3}. A finer model selection strategy might be developed to take a possibly higher regularity of F into account, but this would make the theoretical study more technical and the estimation algorithms difficult to implement for general bases; moreover, Hellinger-type risks would have to be considered, in the same way as in Birgé and Rozenholc (2006). The contrasts proposed here have, by comparison, the advantage that the empirical processes to be controlled are linear with respect to the functions t, a property that the NPMLE would not share. This explains why we rather consider the contrasts γ_n^ψ and γ_n^MS.
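For reference, the NPMLE above can be computed by isotonic regression of the indicators δ_i sorted by U_i (pool-adjacent-violators), which coincides with the max-min formula. A compact sketch (our own illustration; function names are ours):

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to the sequence y."""
    levels, weights = [], []
    for v in y:
        lev, w = float(v), 1
        while levels and levels[-1] > lev:          # merge blocks violating monotonicity
            lev = (lev * w + levels[-1] * weights[-1]) / (w + weights[-1])
            w += weights[-1]
            levels.pop()
            weights.pop()
        levels.append(lev)
        weights.append(w)
    return np.repeat(levels, weights)

def npmle_current_status(u, delta):
    """NPMLE of F at the ordered examination times: the isotonic regression of the
    delta_i sorted by U_i, equivalent to the max-min formula."""
    order = np.argsort(u)
    return u[order], pava(delta[order])
```

The output is, by construction, a nondecreasing step function with values in [0, 1], illustrating why no smoothness can be exploited by this benchmark.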
Before studying both estimators, let us give some examples of collections (S_m)_{m∈M_n}.

Spaces of approximation
The main assumption is described by (H_1). In this setting, an orthonormal basis of S_m is denoted by (ϕ_λ)_{λ∈Λ_m}, where |Λ_m| = D_m. Let us mention that it follows from Birgé and Massart (1997) that Property (2.3) in the context of (H_1) is equivalent to:

|| Σ_{λ∈Λ_m} ϕ_λ² ||_∞ ≤ Φ_0² D_m.

Moreover, for some results we need the following additional assumption:

(H_2) (S_m)_{m∈M_n} is a collection of nested models; we denote by S_n the space belonging to the collection such that, for all m ∈ M_n, S_m ⊂ S_n, and by N_n the dimension of this nesting space: dim(S_n) = N_n.

We consider more precisely the following examples:

[T] Trigonometric spaces: S_m is generated by 1, √2 cos(2πjx), √2 sin(2πjx) for j = 1, ..., m, so that D_m = 2m + 1.

[P] Regular piecewise polynomial spaces: S_m is generated by m(r + 1) polynomials, r + 1 polynomials of degree 0, 1, ..., r on each subinterval [(j − 1)/m, j/m), j = 1, ..., m, so that D_m = (r + 1)m. In particular, the histogram basis corresponds to r = 0 and is simply denoted by [H]. We denote by [DP] the collection of piecewise polynomials corresponding to dyadic subdivisions with m = 2^q and D_m = (r + 1) 2^q.
[W] Dyadic wavelet-generated spaces with regularity r and compact support, as described e.g. in Donoho and Johnstone (1998).
All those spaces satisfy (H_1), with for instance Φ_0 = √2 for collection [T] and Φ_0 = √(2r + 1) for collection [P]. Moreover, [T], [DP] and [W] satisfy (H_2) since they are nested, S_n being the space with the largest dimension in the collection.

Study of the quotient estimator
Our aim is to estimate the cdf F from the observations (δ_i, U_i), i = 1, ..., n.

E. Brunel and F. Comte/Estimation under case 1 interval censoring 8
We also define ψ_m as the orthogonal projection of ψ on S_m. We can write

ψ̂_m = Σ_{λ∈Λ_m} â_λ ϕ_λ, with â_λ = (1/n) Σ_{i=1}^n δ_i ϕ_λ(U_i),   (3.1)

and E(â_λ) = ⟨ψ, ϕ_λ⟩.   (3.2)

The rate of the estimator ψ̂_m of ψ is quite easy to derive. Indeed, it follows from (3.1), (3.2) and the Pythagoras theorem that

E( ||ψ̂_m − ψ||² ) = ||ψ − ψ_m||² + Σ_{λ∈Λ_m} Var(â_λ) ≤ ||ψ − ψ_m||² + Φ_0² ( ∫_0^1 ψ(x) dx ) D_m/n,

the variance bound following from the bound on Σ_λ ϕ_λ² given in Section 2.3. This can be summarized by the following proposition:

Proposition 3.1. Consider the model described in Section 2.1 and the estimator ψ̂_m = arg min_{t∈S_m} γ_n^ψ(t), where γ_n^ψ(t) is defined by (2.5) and S_m is a D_m-dimensional linear space in a collection satisfying (H_1). Then

E( ||ψ̂_m − ψ||² ) ≤ ||ψ − ψ_m||² + Φ_0² ( ∫_0^1 ψ(x) dx ) D_m/n.   (3.3)

Inequality (3.3) gives the asymptotic rate for one estimator if we consider that ψ belongs to a Besov space B_{α_ψ,p,∞}([0,1]) with finite Besov norm denoted by |ψ|_{α_ψ,p}. For a precise definition of those notions, we refer to DeVore and Lorentz (1993), Chapter 2, Section 7, where it is also proved that, for p ≥ 2, B_{α_ψ,p,∞}([0,1]) ⊂ B_{α_ψ,2,∞}([0,1]). This justifies that we now restrict our attention to spaces B_{α_ψ,2,∞}([0,1]). Then the following (standard) rate is obtained:

Corollary 3.1. Under the assumptions of Proposition 3.1, if ψ belongs to B_{α_ψ,2,∞}([0,1]) with α_ψ > 1/2, then choosing D_m of order n^{1/(2α_ψ+1)} yields E( ||ψ̂_m − ψ||² ) = O( n^{−2α_ψ/(2α_ψ+1)} ).

Remark 3.1. The bound r stands for the regularity of the basis functions for collections [P] and [W], and the regularity α_ψ that can be exploited is bounded by it. For the trigonometric collection [T], no upper bound for the unknown regularity α_ψ is required.

Adaptive estimator of the density ψ
The penalized estimator is defined in order to ensure an automatic choice of the dimension. Indeed, it follows from Corollary 3.1 that the optimal dimension depends on the unknown regularity α_ψ of the function to be estimated in the asymptotic setting and, more generally, on the unknown constants involved in the squared-bias/variance terms. Then we define

ψ̃ = ψ̂_{m̂_ψ}, with m̂_ψ = arg min_{m∈M_n} [ γ_n^ψ(ψ̂_m) + pen_ψ(m) ],

where the penalty function pen_ψ is determined in order to lead to the choice of a "good" model. First, we apply a Talagrand (1996) type inequality to the linear empirical process defined by

ν_n(t) = (1/n) Σ_{i=1}^n [ δ_i t(U_i) − ⟨t, ψ⟩ ].

Then, by using the decomposition of the contrast given by

γ_n^ψ(t) − γ_n^ψ(s) = ||t − ψ||² − ||s − ψ||² − 2 ν_n(t − s),

we easily derive the following result:

Theorem 3.1. Consider the model described in Section 2.1 and a collection of models satisfying (H_1) and (H_2). If pen_ψ(m) ≥ κ Φ_0² ( ∫_0^1 ψ(x) dx ) D_m/n for a large enough numerical constant κ, then

E( ||ψ̃ − ψ||² ) ≤ C inf_{m∈M_n} ( ||ψ − ψ_m||² + pen_ψ(m) ) + C′/n,   (3.7)

where C is a constant depending on Φ_0 and on ∫_0^1 ψ(x) dx.

As is clear from Theorem 3.1, only a lower bound for the penalty is provided. As pen_ψ(·) also appears in the risk bound (3.7), we should not take it much larger than the order D_m/n, because the L² error would then increase and the resulting rate would not be the optimal one. On the other hand, no result is available for smaller penalties. This explains in particular why it is possible to keep the asymptotic rate unchanged when only increasing the constant κ.
Also, if we choose pen_ψ(m) = κΦ_0² ( ∫_0^1 ψ(x) dx ) (D_m/n), it follows from (3.7) that the adaptive estimator automatically makes the squared-bias/variance compromise and, from an asymptotic point of view, reaches the optimal rate, provided that the constant in the penalty is known. Note that Inequality (3.7) is nevertheless non-asymptotic.
Remark 3.2. In practice, the constant in the penalty, denoted above by κ, is found by simulation experiments taking into account very different types of functions ψ. See examples of such work in Birgé and Rozenholc (2006) or Comte and Rozenholc (2004).
The penalty given in Theorem 3.1 cannot be used in practice since it depends on the unknown quantity ∫_0^1 ψ(x) dx = E( δ_1 1_{(U_1 ≤ 1)} ).
A simple solution is to use the fact that ∫_0^1 ψ(x) dx ≤ 1; it follows that Inequality (3.7) would hold for a penalty defined by pen_ψ(m) = κΦ_0² D_m/n. This works, but possibly with a resulting over-estimation of the penalty, in a way depending on the unknown function ψ. The alternative solution is to replace the unknown quantity by an estimator (rather than a bound), and to prove that the estimator of ψ built with this random penalty keeps the adaptation property of the theoretical penalized estimator. This is described in the following theorem, whose proof is omitted since it is almost the same as the proof of Theorem 3.4 in Brunel and Comte (2005).
Theorem 3.2. Consider the model described in Section 2.1 and a collection of models satisfying (H_1) and (H_2). Let ψ̃ = ψ̂_{m̂_ψ} with m̂_ψ selected by using the random penalty

p̂en_ψ(m) = κ Φ_0² ( (1/n) Σ_{i=1}^n δ_i ) D_m/n.

Then

E( ||ψ̃ − ψ||² ) ≤ K inf_{m∈M_n} ( ||ψ − ψ_m||² + D_m/n ) + K_0/n,

where K_0 is a universal constant and K depends on ψ, Φ_0.
In particular, we can derive quite straightforwardly from results such as Theorem 3.2 adaptation to unknown smoothness: if ψ belongs to B_{α_ψ,2,∞}([0,1]) with α_ψ > 1/2, the adaptive estimator ψ̃ reaches the rate n^{−2α_ψ/(2α_ψ+1)} without any prior knowledge of α_ψ.
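The random-penalty selection rule can be sketched on the histogram collection, for which Φ_0 = 1; here κ = 4 as in the simulations of Section 5, and the function names are ours, not the authors':

```python
import numpy as np

def select_dimension(u, delta, dims, kappa=4.0):
    """Penalized choice of the histogram dimension for psi (Phi_0 = 1 here).
    Criterion: gamma_n(psi_hat_D) + pen(D) = -sum_j a_hat_j^2 + kappa * mass_hat * D / n,
    where mass_hat = (1/n) sum_i delta_i estimates the integral of psi in the penalty."""
    n = len(u)
    mass_hat = delta.mean()
    best_D, best_crit = dims[0], np.inf
    for D in dims:
        cells = np.minimum((u * D).astype(int), D - 1)
        a_hat = np.sqrt(D) * np.bincount(cells, weights=delta, minlength=D) / n
        # gamma_n at the minimizer equals minus the squared norm of psi_hat.
        crit = -np.sum(a_hat ** 2) + kappa * mass_hat * D / n
        if crit < best_crit:
            best_crit, best_D = crit, D
    return best_D
```

The identity γ_n^ψ(ψ̂_m) = −Σ_λ â_λ² used in the criterion follows directly from the definition (2.5) of the contrast.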

Application to the estimation of the distribution function F
Consider now the first estimator of F , given by (2.7).
A simple case study shows that if ψ̃(x)/g̃(x) < 0 or ψ̃(x)/g̃(x) > 1, then |F̃(x) − F(x)| ≤ |ψ̃(x)/g̃(x) − F(x)|, and thus the inequality |F̃(x) − F(x)| ≤ |ψ̃(x)/g̃(x) − F(x)| holds for any x. Also, our definition implies that |F̃(x) − F(x)| ≤ 1, for any x. Moreover, to exploit [A1], we define the set

Ω_g = { ω : ∀x ∈ A, g̃(x) ≥ g_0/2 }.

Then, the following bounds are obtained. The first term can be decomposed as follows:

ψ̃/g̃ − ψ/g = (ψ̃ − ψ)/g̃ + ψ (g − g̃)/(g g̃),

and since g̃(x) ≥ g_0/2 on Ω_g, this yields

|F̃(x) − F(x)| 1_{Ω_g} ≤ (2/g_0) |ψ̃(x) − ψ(x)| + (2 ||ψ||_∞ / g_0²) |g̃(x) − g(x)|.

For the second term, taking the expectation on Ω_g^c, we use the following lemma:

Lemma 3.1. Assume that g ∈ B_{α_g,2,∞}([0,1]) for some α_g > 1/2 and consider a collection of spaces S_m such that log(n) ≤ D_m ≤ √n. Then, under Assumptions (3.10), P(Ω_g^c) ≤ C/n.

Finally, by gathering the bounds, we obtain the following proposition:

Proposition 3.3. Under the assumptions of Lemma 3.1,

E( ||F̃ − F||² ) ≤ C(g_0, ||ψ||_∞) ( E||ψ̃ − ψ||² + E||g̃ − g||² ) + C′/n,

where C(g_0, ||ψ||_∞) is a constant depending on g_0 and ||ψ||_∞.
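The truncation step used above is elementary; the sketch below (illustrative, with hypothetical names) clips the ratio to [0, 1], which, as just argued, can only decrease the pointwise error since F itself takes values in [0, 1]:

```python
import numpy as np

def quotient_cdf(psi_vals, g_vals):
    """Quotient estimator values psi_tilde / g_tilde, truncated to [0, 1]:
    since the target F lies in [0, 1], clipping never increases |F_tilde - F|."""
    ratio = np.asarray(psi_vals, dtype=float) / np.maximum(np.asarray(g_vals, dtype=float), 1e-12)
    return np.clip(ratio, 0.0, 1.0)
```

The guard `np.maximum(..., 1e-12)` only protects against a vanishing estimated denominator numerically; on the event Ω_g the denominator is anyway bounded below by g_0/2.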
Note that Theorem 2 in Yang (2000) shows that the rate in the sup-norm over a compact is of order O((ln n/n)^{(1+α_f)/(3+2α_f)}) a.s., where α_f stands for the regularity of the density function f (that is, α_F = α_f + 1).
If the regularity index α_F of F is greater than the regularity index α_ψ of ψ = F g, then the asymptotic rate of the estimator F̃ is given by n^{−α_ψ/(1+2α_ψ)} instead of the optimal one, n^{−α_F/(1+2α_F)}. This is the reason why we propose another contrast to estimate F directly.

Study of the mean square estimator
In this section, we study the mean square estimator of F from (2.9) and its adaptive version. In this context, we define the empirical norm ||·||_n as follows: for t ∈ L²([0,1]),

||t||_n² = (1/n) Σ_{i=1}^n t²(U_i).

Then, the mean-square contrast defined by (2.8) can be decomposed as follows:

γ_n^MS(t) − γ_n^MS(F) = ||t − F||_n² − 2 ν_n(t − F), with ν_n(t) = (1/n) Σ_{i=1}^n ( δ_i − F(U_i) ) t(U_i),

which involves a centered process since E(δ|U = u) = F(u). In this case, we obtain the following result for the penalized estimator:

Theorem 4.1. Consider the collections of models [T] with N_n ≤ √n/ln(n), or [DP] or [W] with N_n ≤ n/ln²(n). Let F̂_{m̂_0} be defined by (2.10), with κ_0 a large enough numerical constant. Then

E( ||F̂_{m̂_0} − F||_n² ) ≤ C inf_{m∈M_n} ( ||F − F_m||² + D_m/n ) + C′/n,

where F_m stands for the orthogonal projection of F on S_m and C and C′ are constants depending on Φ_0 and g.

Note that the computation of the estimator may be more tedious in practice than the quotient one, but the result is obtained directly for the estimator of F, without any regularity condition on ψ. As a consequence, we obtain here a rate depending only on the regularity of F, and we can state the following result:

Corollary 4.1. Under the assumptions of Theorem 4.1, if F belongs to B_{α_F,2,∞}(L) = { t ∈ B_{α_F,2,∞}([0,1]), |t|_{α_F,2} ≤ L }, then

sup_{F ∈ B_{α_F,2,∞}(L)} E( ||F̂_{m̂_0} − F||_n² ) ≤ C(α_F, L) n^{−2α_F/(2α_F+1)},

where C(α_F, L) is a constant depending on α_F, L, and also on Φ_0 and g_0.

Simulations
We consider the regular collection [DP] (see Section 2.3) with degrees less than r_max on a subdivision into intervals [j/2^p, (j + 1)/2^p). The density and regression algorithms minimize the contrast and select the approximation space in the sense that the integers p and r are selected such that 2^p(r + 1) ≤ N_n ≤ n/log²(n) and r ∈ {0, 1, ..., r_max}. Note that the degree r is global, in the sense that it is the same on all the intervals of the subdivision. We take r_max = 9 in practice. Moreover, additive (but negligible) correcting terms are classically involved in the penalty (see Comte and Rozenholc (2004)). Such terms avoid under-penalization and are in accordance with the fact that the theorems provide lower bounds for the penalty. As the correcting terms are asymptotically negligible, they do not affect the rate of convergence. The constants in the penalty are taken equal to 4. Finally, for m = (p, r), the penalties are proportional to 2^p(r + 1 + log^{2.5}(r + 1)), with proportionality factor κ = 4 for the estimation of g and F, and a factor (4/n) Σ_{i=1}^n δ_i for the estimation of ψ. Most programs are available on Yves Rozenholc's web page http://www.math-info.univ-paris5.fr/∼rozen/.

Now, let us describe the simulated models. Remember that the distribution of δ given U = u is a Bernoulli variable with parameter F(u). We consider the following models for generating data:

Model 1. Uniform distribution F: U ∼ U(0, 1) and δ ∼ B(1, U).
Model 2. χ² distribution F: U ∼ U(0, 1) and δ ∼ B(1, F_{χ²_1}(U)).
Model 3. Quadratic distribution F: U ∼ U(0, 1) and δ ∼ B(1, U²).
Model 4. Exponential distribution F: U ∼ γ(1, λ) and δ ∼ B(1, 1 − e^{−μU}), with λ = 1, μ = 0.5.
Model 5. Beta distribution (S-shape) F: U ∼ β(4, 6) and δ ∼ B(1, F_{β(4,8)}(U)), where F_{β(α,β)} is the cdf of a Beta distribution with parameters (α, β).
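For reproducibility, models 1 to 4 above can be generated with numpy alone, using F_{χ²_1}(u) = erf(√(u/2)); model 5 would additionally need a Beta cdf (e.g. scipy.special.betainc). This is our own sketch, not the authors' code:

```python
import numpy as np
from math import erf, sqrt

def make_model(name, n, rng):
    """Generate (U, delta) for models 1-4; delta | U = u ~ Bernoulli(F(u))."""
    if name == "model1":      # F(u) = u
        u = rng.uniform(size=n); p = u
    elif name == "model2":    # F = cdf of chi^2_1: F(u) = erf(sqrt(u / 2))
        u = rng.uniform(size=n)
        p = np.array([erf(sqrt(x / 2.0)) for x in u])
    elif name == "model3":    # F(u) = u^2
        u = rng.uniform(size=n); p = u ** 2
    elif name == "model4":    # U ~ Exp(lambda = 1), F(u) = 1 - exp(-mu u), mu = 0.5
        u = rng.exponential(scale=1.0, size=n); p = 1.0 - np.exp(-0.5 * u)
    else:
        raise ValueError("model 5 needs a Beta cdf, e.g. scipy.special.betainc")
    delta = (rng.uniform(size=n) < p).astype(float)
    return u, delta
```

Each model returns only the observable pairs (U_i, δ_i), on which all the compared estimators operate.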
To study the quality of each estimation procedure and to compare them, we compute, over J sample replications of size n = 60, 200, 500 and 1000, the mean squared errors (MSE) over the sample points u_1, ..., u_K falling in [a, b]:

MSE_j = (1/K) Σ_{k=1}^K ( F̂_j(u_k) − F(u_k) )²,

where F̂_j stands for the (adaptive) quotient estimator F̃, for the penalized regression estimator F̂_{m̂_0}, or for the benchmark NPMLEs, computed over the j-th sample replication for j = 1, ..., J. For the small sample sizes n = 60 and n = 200, the average values are obtained with J = 500 replications, while for large samples (n = 500 and n = 1000), only J = 200 replications are performed.
To avoid boundary effects due to the sparsity of the observations at the end of the interval, particularly for models 2, 4 and 5, the MSE_j's are truncated for each replication, in the sense that we include in the mean only the u_k less than a given quantile value: P(X ≤ 1) = 0.68 for model 2, P(X ≤ 1) = 0.86 for model 4 and P(X ≤ 0.5) = 0.89 for model 5; thus, the MSE_j are computed over [a, b] with a = 0 and b = 1 for models 1 to 4, and b = 0.5 for model 5. Therefore, the MSEs given in Table 1 stand effectively for the truncated arithmetic means of the MSE_j's. As we can see from the results in Table 1, the regression estimator is always better than the quotient and the NPMLE estimators for large samples. However, for small sample sizes, the quotient estimator can behave as well as, and even better than, the regression one; see models 2 and 4 for n = 60, 200. The same remark holds for both Birgé's estimator and Groeneboom's NPMLE. These last estimators have the advantage of being very easy to compute. As a counterpart, they look like step functions whatever the regularity of the function. Moreover, Birgé's estimator is not adaptive, since we have to choose the number of partition cells (D = 5 cells for sample sizes n = 60, 200 and D = 10 cells for n = 500, 1000); see Figure 3. Note also that the density estimator g̃ of g is a very attractive estimator by itself, as shown in Figure 1. In some cases, and particularly for model 4 (see Figure 1), the quotient mechanism goes wrong even though the density estimator performs very well. Figure 1 (right) shows that nearly half of the curves do not recover the right shape. This is a drawback of quotient strategies, which do not have good robustness properties. Regression estimators (see Figure 2) are much more stable. In Figure 3, we give an illustration of all the compared estimators for small (n = 60) and large (n = 1000) samples, and we can see that our adaptive estimators behave as well as, and often better than, the benchmark NPMLEs.
Concluding remarks. Globally, it appears that the regression estimator is better than the quotient estimator, from both the theoretical and the empirical points of view. The latter can outperform the former only in small-sample experiments. The two density estimators involved in the quotient are nevertheless easy to compute, and the estimation algorithms give very good empirical results. Still, even with a well estimated numerator and denominator, the ratio is less satisfactory than the direct regression estimator.
Moreover, there exists a universal constant K_1 such that, for ε > 0,

E( sup_{ℓ∈L} |ν_n(ℓ)|² − 2(1 + 2ε)H² )_+ ≤ K_1 ( (v/n) e^{−K_1 ε nH²/v} + (M²/(n²C²(ε))) e^{−K_1 C(ε)√ε nH/M} ), with C(ε) = √(1+ε) − 1,   (6.1)

where M, H and v are such that sup_{ℓ∈L} ||ℓ||_∞ ≤ M, E( sup_{ℓ∈L} |ν_n(ℓ)| ) ≤ H and sup_{ℓ∈L} Var(ℓ(X_1)) ≤ v. Note that (6.1) is also given in Birgé and Massart (1998), Corollary 2. In both cases, usual density arguments allow one to take, instead of the class L, a unit ball in a finite-dimensional space of functions.
In the same way as Baraud et al. (2001), we introduce, for ||t||_g² = ∫_0^1 t²(u) g(u) du, the ball B^g_{m,m′}(0, 1) = { t ∈ S_m + S_{m′}, ||t||_g = 1 } and the set

Ω_n = { ω : | ||t||_n² / ||t||_g² − 1 | ≤ 1/2, ∀t ∈ ∪_{m,m′∈M_n} (S_m + S_{m′}) \ {0} }.

On the complement of Ω_n, a separate study leads to the following lemma:

Lemma 6.3. If N_n ≤ √n/ln(n) for [T], or N_n ≤ n/ln²(n) for [P] or [W], then P(Ω_n^c) ≤ c/n and E( ||F̂_{m̂_0} − F||_n² 1_{Ω_n^c} ) ≤ c′/n, where c and c′ are positive constants.