Estimators for the interval censoring problem

We study three estimators for the interval censoring, case 2, problem: a histogram-type estimator proposed in Birgé (1999), the maximum likelihood estimator (MLE), and the smoothed MLE, which uses a smoothing kernel. Our focus is on the asymptotic distribution of the estimators at a fixed point. The estimators are compared in a simulation study.


Introduction
Let X_1, . . . , X_n be a sample of unobservable random variables from an unknown distribution function F_0 on the interval [0, 1]. More generally, we could take an arbitrary closed interval [a, b] as support for the underlying distribution, but for the purposes of the development of the theory we can just as well take [0, 1], as is also done in [1].
Suppose that one can observe n pairs (T_i, U_i), independent of X_i, with a joint density function h on the upper triangle of the unit square, for which the sum of the marginal densities is bounded away from zero. Moreover, the indicators of the events {X_i ≤ T_i}, {T_i < X_i ≤ U_i} and {X_i > U_i} provide the only information one has on the position of the random variables X_i with respect to the observation times T_i and U_i. In this set-up we want to estimate the unknown distribution function F_0, generating the "unobservables" X_i. This setting is known as interval censoring, case 2.
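For orientation, the (partial) log likelihood that the estimators below are built on, writing ∆_{i,1} = 1{X_i ≤ T_i} and ∆_{i,2} = 1{T_i < X_i ≤ U_i}, takes the standard form for interval censoring, case 2 (our reconstruction, consistent with the indicators defined above):

```latex
\ell(F) \;=\; \sum_{i=1}^{n}\Bigl\{\Delta_{i,1}\log F(T_i)
  \;+\;\Delta_{i,2}\log\bigl(F(U_i)-F(T_i)\bigr)
  \;+\;\bigl(1-\Delta_{i,1}-\Delta_{i,2}\bigr)\log\bigl(1-F(U_i)\bigr)\Bigr\}.
```

Maximizing this criterion over all distribution functions F is what defines the MLE discussed below.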
The model of current status data, also known as interval censoring, case 1, has been thoroughly studied, and its theory is considerably simpler than that of the interval censoring, case 2, model. In the current status model there is only one observation time T_i, corresponding to the unobservable X_i, and the only information we have about X_i is whether it lies to the left or to the right of T_i.
Although the present paper mainly focuses on the case 2 model, we start by discussing the current status model, in order to put this paper into a more general context and to explain why the case 2 model is so much harder to study.
In the current status model, the only observations available to us are the pairs (T_i, ∆_i), where ∆_i = 1{X_i ≤ T_i}; so we do not observe X_i itself, but only its "current status" ∆_i. The nonparametric maximum likelihood estimator, commonly denoted by NPMLE or just MLE, maximizes the (partial) log likelihood, where the maximization is over all distribution functions F. The MLE can be found in one step by computing the left-continuous slope of the greatest convex minorant of the cusum diagram of the point (0, 0) and the points (i, Σ_{j≤i} ∆_(j)), i = 1, . . . , n, (1.2) using a notation introduced in [10]. Here ∆_(j) denotes the indicator corresponding to the jth order statistic T_(j). The theory for this estimator is further developed in [10], where the (non-normal) pointwise limit distribution is derived and it is shown that the rate of convergence is n^{-1/3}. In contrast, there is no such one-step algorithm for computing the MLE in the case 2 situation, where one again wants to maximize the log likelihood over distribution functions F. One has to take recourse to iterative algorithms, for example the iterative convex minorant algorithm, introduced in [10] and further developed in [11]. Moreover, the MLE can possibly achieve a faster local rate of convergence than in the current status model, depending on properties of the bivariate distribution of the observation times (T_i, U_i).
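The one-step computation just described can be sketched in code. The sketch below obtains the left-continuous slope of the greatest convex minorant of the cusum diagram via the pool-adjacent-violators algorithm, which yields the same solution; the function name and input convention are ours, not from [10]:

```python
# Sketch: current status MLE as the left derivative of the greatest convex
# minorant (GCM) of the cusum diagram of (i, sum_{j<=i} Delta_(j)),
# computed by pool-adjacent-violators (PAVA), which is equivalent.

def current_status_mle(deltas):
    """deltas: indicators Delta_(j), ordered by the sorted observation times."""
    # Each block carries (weight, mean); merging blocks pools adjacent violators.
    blocks = []  # list of [weight, mean]
    for d in deltas:
        blocks.append([1.0, float(d)])
        while len(blocks) > 1 and blocks[-2][1] >= blocks[-1][1]:
            w2, m2 = blocks.pop()
            w1, m1 = blocks.pop()
            blocks.append([w1 + w2, (w1 * m1 + w2 * m2) / (w1 + w2)])
    # Expand block means back to one fitted value per observation.
    fit = []
    for w, m in blocks:
        fit.extend([m] * int(round(w)))
    return fit

fhat = current_status_mle([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])
assert fhat == sorted(fhat)                    # monotone, as a df estimate must be
assert all(0.0 <= v <= 1.0 for v in fhat)      # values stay in [0, 1]
```

The monotonicity and the range restriction seen here are exactly the properties the isotonized estimators inherit, in contrast with Birgé's estimator below.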
In the so-called non-separated case, the density of the pair of observation times (T_i, U_i) is positive on the diagonal, meaning that we can have arbitrarily small observation intervals [T_i, U_i]. For this situation, [1] proposes a simple piecewise constant estimator for F_0, with the purpose of showing that an estimator can be constructed which achieves the (n log n)^{-1/3} convergence rate; this rate is optimal in a minimax sense, both for a global loss function and for a local loss function for estimation at a fixed point. In the separated case, the observation times T_i and U_i cannot become arbitrarily close: there exists an ε > 0 such that U_i − T_i > ε for each i. In this case the convergence rate of Birgé's estimator is n^{-1/3} again, which is also the minimax rate for the current status model. For both situations we derive the asymptotic behavior of Birgé's estimator and compare it with the behavior of the MLE in a simulation study. The simulations show a better behavior of the MLE, probably caused by the local adaptivity of the MLE.
A common complaint about MLEs is that, under the conditions for which the local asymptotic distribution result is derived, other estimators can be suggested which in fact attain a faster rate of convergence. Such estimators are discussed for the current status model in, e.g., [8], [9] and [7]. We introduce a similar estimator for the case 2 model below, the smoothed maximum likelihood estimator (SMLE), defined by smoothing the MLE with an integrated kernel; see (1.3). Analogously to what has been proved for the current status model, we expect the smoothed MLE to converge at (at least) rate n^{-2/5} under appropriate regularity conditions. It is an attractive alternative to the MLE and the histogram-type estimator of [1]. We give a heuristic discussion of this in section 6. Just as in [3] and [4], the asymptotic variance depends on the solution of an integral equation. The asymptotic expressions for the variance, obtained by solving these equations numerically, give a rather good fit with the actually observed variances, as shown in section 6. The SMLE can probably also be used for a two-sample test for interval censored data, analogous to the two-sample test for current status data, introduced in [7]. The MSE of the smoothed MLE is much smaller than that of Birgé's estimator or the MLE for smooth underlying distribution functions, as is illustrated in the sections on the simulations. A picture of the three estimators is shown in Figure 1. The MLE and smoothed MLE are monotone, in contrast with Birgé's estimator. Birgé's estimator can also have negative values and values larger than 1; both events happen in the picture shown. This cannot happen for the MLE and smoothed MLE, since these are based on isotonization; the smoothed MLE is an integral of a positive kernel w.r.t. the (positive) jumps of the MLE, and inherits the monotonicity properties of the MLE.
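For concreteness, here is a minimal sketch of such a smoothed MLE, assuming the construction common in this literature: the jumps of the MLE are smoothed with an integrated kernel. The triweight kernel used here is one common choice; the kernel, function names and toy jump values are our assumptions, not taken from (1.3):

```python
def integrated_triweight(x):
    """Integral over (-inf, x] of the triweight kernel K(u) = (35/32)(1 - u^2)^3."""
    if x <= -1.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 0.5 + (35.0 / 32.0) * (x - x**3 + 0.6 * x**5 - x**7 / 7.0)

def smoothed_mle(jump_points, jump_sizes, t, b):
    """Smoothed MLE at t: MLE jumps of size p_j at tau_j, bandwidth b."""
    return sum(p * integrated_triweight((t - tau) / b)
               for tau, p in zip(jump_points, jump_sizes))

# A toy MLE with jumps at 0.2, 0.5, 0.8; the smoothed version is monotone
# and stays in [0, 1], inheriting these properties from the MLE.
grid = [i / 100 for i in range(101)]
vals = [smoothed_mle([0.2, 0.5, 0.8], [0.3, 0.4, 0.3], t, 0.1) for t in grid]
assert vals == sorted(vals) and min(vals) >= 0.0 and max(vals) <= 1.0
```

Because the integrated kernel is nondecreasing and the jump sizes are positive, monotonicity is automatic, which is the point made in the text above.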
Although histogram-type estimators (like Birgé's estimator) and kernel estimators without any isotonization are much easier to analyze than estimators based on isotonization, the price one has to pay is the behavior illustrated in Figure 1.

A local minimax result for the non-separated case
In this section we derive a local minimax result for the non-separated case of the interval censoring problem, case 2. This result provides the best possible local convergence rate and also the best constant, as far as this constant depends on the underlying distributions. Our approach makes use of a perturbation F_n of F_0, defined for a c > 0 to be specified below. Before stating the theorem to be proved, we introduce some notation. Let ∆ = (∆_1, ∆_2) ∈ T := {(1, 0), (0, 1), (0, 0)} and define the densities q_0 and q_n with respect to the measure μ = λ_1 ⊗ λ_2 on Ω = R²_+ × T, where λ_1 is Lebesgue measure and λ_2 is counting measure. We note that q_0 is the joint density of (T, U, ∆_1, ∆_2).
Furthermore, let (L_n), n ≥ 1, be a sequence of estimators of F_0(t_0), based on samples of size n, generated by q_0. That is, we can write L_n as a Borel measurable function l_n of the n observations. Then the following theorem holds, where E_{n,q} denotes the expectation with respect to the product measure q^{⊗n}.
In our proof we need the following lemma, which is proved in [6]. This type of result is often referred to as "Le Cam's lemma".

Lemma 2.1. Let G be a set of probability densities on a measurable space (Ω, A) with respect to a σ-finite dominating measure μ, and let L be a real-valued functional on G. Moreover, let f : [0, ∞) → R be an increasing convex loss function with f(0) = 0. Then, for any q_1, q_2 ∈ G such that the Hellinger distance satisfies H(q_1, q_2) < 1:

Proof of Theorem 2.1. Let the sets A_{1,n}, . . . , A_{4,n}, partitioning the region {(t, u) : t < u}, be defined with δ_n = c(n log n)^{-1/3}. The partitioning is shown in Figure 2. The squared Hellinger distance between q_0 and q_n can then be written as a sum of integrals over these sets. We now calculate the three integrals over A_{1,n}.
The last integral can be split into two integrals, over the sets [0, t_0 − κ_n) and its complement. Next, a straightforward computation gives the analogous expansion, and the integrals over A_{2,n}, A_{3,n} and A_{4,n} can be treated in a similar way. From these computations we infer the asymptotic squared Hellinger distance between q_0 and q_n. By using Lemma 2.1 we now obtain a lower bound for the minimax risk; maximizing the resulting expression over c yields the desired minimax lower bound.

Asymptotic distribution of Birgé's estimator in the non-separated case

In [1] a histogram-type estimator was constructed to show that the minimax lower bound rate of the preceding section can indeed be attained in the non-separated case. It is defined in the following way. Let t_0 be an interior point of [0, 1], let c be a positive constant and let K = ⌊c^{-1}(n log n)^{1/3}⌋, where n is the sample size and ⌊x⌋ denotes the "floor" of x, i.e., the largest integer smaller than or equal to x. We distinguish two cases.
Note that in this case the intervals I 2 , . . . , I K have length 1/K, but that I 1 and I K+1 have a shorter length. Furthermore, just as in case (i), t 0 is the left boundary point of one of the intervals I j .
In fact, we have slightly modified the definition of Birgé, who always partitions the interval into K subintervals of equal length. The reason for our modification is that we want to assign a fixed position to t_0 with respect to the boundary points of the interval I_j to which it belongs, since the bias of the estimator depends heavily on this position. Letting t_0 be a left boundary point enables us to compare the results for different sample sizes "on an equal footing", so to speak.
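Under our reading of cases (i) and (ii), the modified partition can be sketched as follows: binwidth 1/K with K = ⌊c^{-1}(n log n)^{1/3}⌋, with the grid shifted so that t_0 falls on a left boundary, so that the two edge intervals may be shorter (the function and its details are our sketch, not Birgé's code):

```python
import math

# Sketch of the modified partition: K or K+1 subintervals of [0, 1],
# binwidth 1/K with K = floor((n log n)^{1/3} / c), shifted so that t0
# is the left boundary point of one of the subintervals.

def birge_partition(t0, n, c=1.0):
    K = int(math.floor((n * math.log(n)) ** (1.0 / 3.0) / c))
    w = 1.0 / K
    # first interior cut in (0, w], chosen so that t0 lies on the grid
    offset = t0 - math.floor(t0 / w) * w
    cuts = [0.0]
    x = offset if offset > 0 else w
    while x < 1.0 - 1e-12:
        cuts.append(x)
        x += w
    cuts.append(1.0)
    return cuts

cuts = birge_partition(0.3, 1000)
assert min(abs(x - 0.3) for x in cuts) < 1e-9   # t0 is a left boundary point
```

The two edge intervals have width at most 1/K, which is the "shorter length" of I_1 and I_{K+1} mentioned in case (ii).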
Let ∆_{i,1}, ∆_{i,2} and ∆_{i,3} be defined by (1.1). Following [1], we define, for 1 ≤ j, k ≤ K, the (integer-valued) random variables N_k, M_k and Q_{j,k}. In addition to these random variables, [1] defines the random variables F̂(j,k) and the weights w_{j,k}, defined by (3.1) and (3.2). We are now ready to define Birgé's estimator F̂_n.
Definition 3.1. (Birgé's estimator) Let the intervals I_j be defined as in (i) or (ii) above (depending on the value of t_0), and let F̂(j,k) and the weights w_{j,k} be defined by (3.1) and (3.2), respectively. Then, for t belonging to the interval I_j, Birgé's estimator is the corresponding weighted sum.

In determining the asymptotic distribution of Birgé's estimator, we are faced with the following difficulties.
(1) The weights w_{j,k} are ratios of random variables, which interact with the random variables M̂_k/M_k, N̂_k/N_k and Q̂_{j,k}/Q_{j,k}, for which they are multipliers.
(2) The ratios M̂_k/M_k, N̂_k/N_k and Q̂_{j,k}/Q_{j,k} are themselves ratios of random variables.
(3) The weighted sum, defining Birgé's estimator, consists of dependent summands. The dependence is caused by the dependence among the weights, the dependence among the ratios M̂_k/M_k, N̂_k/N_k and Q̂_{j,k}/Q_{j,k}, and the dependence between the weights and these terms. This prevents a straightforward use of the Lindeberg–Feller central limit theorem.
These difficulties have to be dealt with in turn. The following crucial lemma bears on difficulty (1), by showing that the random weights w_{j,k} are close to deterministic weights w̄_{j,k}.

Lemma 3.1. Consider a partition of [0, 1] into K or K + 1 subintervals, according to the construction of Birgé's estimator, using the scheme of (i) and (ii) at the beginning of this section. Assume that, for a fixed constant c > 0, K = K_n ∼ c^{-1}(n log n)^{1/3}; that is, the asymptotic binwidth is given by c(n log n)^{-1/3}. Moreover, assume that the observation density h is continuous on the upper triangle of the unit square, staying away from zero on its support. Let g_1 and g_2 be the first and second marginal density of h, respectively. Finally, let t_0 be the left boundary point of I_j, let a(t) and b(t) be given as above, and let the deterministic weights w̄_{j,k} be defined by (3.7). Then: and, for m = 1, 2, . . .
It may be helpful to give some motivation for the construction of Birgé's statistic. If we replace N̂_k, N_k, etc. by their expected values, we obtain an expression involving G_1 and G_2, the first and second marginal distribution functions of H, respectively. By expanding F_0 at the left endpoints t_k of the intervals, we obtain an approximation of the bias. One difficulty in this expansion, which we have glossed over for the moment, is that g_1(t_k) tends to zero if t_k → 1, and that, similarly, g_2(t_k) tends to zero if t_k → 0. This difficulty has to be dealt with separately. We do not have that difficulty for h, since we assume that h stays away from zero on its support. The expansion suggests that the asymptotic bias at t_j will be f_0(t_j)/(2K), which is indeed the case. However, the expansion does not explain the particular choice of the weights. Considering the deterministic counterparts w̄_{j,k} of w_{j,k}, given by (3.7) in Lemma 3.1, we see that the weights are proportional to 1/(1 + |j − k|), which has the effect that the smaller observation intervals give the biggest contribution to the estimator. This takes advantage of the fact that the smaller observation intervals do indeed give more precise information on the "unobservable" X_i, if we know that X_i is contained in the interval (see the discussion on this point in section 1). The choice of these weights reduces the variance of the estimator; this is what makes the rate of convergence slightly faster than n^{-1/3}. It seems that the MLE is doing something similar automatically, but in a more efficient way, if we believe the "working hypothesis" discussed in section 1.
Assuming the truth of this "working hypothesis", the asymptotic variance of the MLE only involves the local joint density h of (T i , U i ) at (t 0 , t 0 ) and the density f 0 (t 0 ) of X i at t 0 , whereas the variance of Birgé's estimator also involves the marginal densities of (T i , U i ), which do not appear in the local minimax lower bound, derived in section 2.
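The 1/(1 + |j − k|) weight pattern described above can be illustrated as follows; this sketch keeps only the stated proportionality and the normalization, not the full definition (3.7) with a(t) and b(t):

```python
# Toy version of the deterministic weights: over k = 1..K, weights
# proportional to 1/(1 + |j - k|), normalized to sum to one, so that
# near-diagonal pairs (small observation intervals) contribute most.

def toy_weights(j, K):
    raw = [1.0 / (1 + abs(j - k)) for k in range(1, K + 1)]
    s = sum(raw)
    return [r / s for r in raw]

w = toy_weights(5, 9)
assert abs(sum(w) - 1.0) < 1e-12
assert w.index(max(w)) == 4          # heaviest weight at k = j (index j - 1)
```

The slow 1/(1 + |j − k|) decay is what allows distant bins to still contribute, while the diagonal bins, carrying the most precise information, dominate.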
Also note that the partition needed in the construction of Birgé's estimator depends on a priori knowledge of whether we are in the separated or non-separated case: in the non-separated case binwidths of order (n log n)^{-1/3} are taken (otherwise the faster rate (n log n)^{-1/3} would not be attained), and in the separated case binwidths of order n^{-1/3} (taking (n log n)^{-1/3} would let the variance dominate the bias, as the sample size tends to infinity). For the computation of the maximum likelihood estimator (MLE), discussed in section 5, it is not necessary to use a priori knowledge of the observation distribution; the MLE, considered as a histogram, adapts automatically to the separated or non-separated case and will generally choose smaller binwidths in the non-separated case. This is one of the major advantages of the MLE over Birgé's estimator, apart from its being monotone with values restricted to [0, 1].
Using the notation of Lemma 3.1 we can now formulate the main result for Birgé's estimator.

Theorem 3.1. Let t_j be the left boundary point of the interval I_j, which we assume to converge to an interior point t_0 ∈ (0, 1), as n → ∞. Then: where the right-hand side of (3.13) denotes a normal random variable, with expectation (1/2) c f_0(t_0) and variance

Note that Theorem 3.1 implies an optimal value of c, obtained by balancing the asymptotic bias against the asymptotic variance. This value of the constant was used in the simulations reported below.

Birgé's estimator in the separated case
We consider the asymptotic behavior of Birgé's estimator in the separated case. This is mainly meant for illustrative purposes and to give background to the simulation study. We therefore do not aim to prove results in the widest generality and confine our discussion to the case where the density h of the observed pairs (T_i, U_i) has as support the triangle with vertices (0, ε), (0, 1) and (1 − ε, 1) and stays away from zero on its support, which is the situation we consider in the simulation study. In this case the faster rate (n log n)^{-1/3} is unattainable, and we know that Birgé's estimator (and also the MLE) can only achieve the rate n^{-1/3}. We therefore assume K to be of order n^{1/3} and set K = ⌊c^{-1} n^{1/3}⌋. As in section 3 we introduce deterministic weights w̄_{j,k} to replace the random weights w_{j,k}. Let g_1 and g_2 be the first and second marginal density of h, respectively. The deterministic weights w̄_{j,k} are now defined by (4.3), for k > j, in terms of a normalizing quantity W(t_0). We assume that the integrals on the right-hand side of (4.4) are finite, and hence that W(t_0) < ∞.
We now have the following lemma, which plays a similar role as Lemma 3.1 in section 3.
Lemma 4.1. Consider a partition of [0, 1] into K or K + 1 subintervals, according to the construction of Birgé's estimator, using the scheme of (i) and (ii) at the beginning of section 3. Assume that, for a fixed constant c > 0, K = K_n ∼ c^{-1} n^{1/3}; that is, the asymptotic binwidth is given by c n^{-1/3}. Let the weights w_{j,k} and w̄_{j,k} be defined by (4.1) and (4.3), respectively, where we assume W(t_0) < ∞. Then:

Using this lemma, we get the following limit result (compare with Theorem 3.1).
Theorem 4.1. Suppose that the observation density h has as support the triangle with vertices (0, ε), (0, 1) and (1 − ε, 1) and stays away from zero on its support.
Let I_j be a subinterval belonging to the partition of [0, 1] into K intervals, corresponding to the construction of Birgé's estimator for a sample of size n. Finally, let W(t_0) be defined by (4.4), where we assume W(t_0) < ∞.
Assume that, for a fixed constant c > 0, K = K_n ∼ n^{1/3}/c, and let t_k be the left boundary point of the interval I_k, which we assume to converge to an interior point t_0 ∈ (0, 1), as n → ∞. Then we have, as n → ∞, the convergence (4.6), where the right-hand side of (4.6) denotes a normal random variable, with expectation (1/2) c f_0(t_0) and variance σ². In the simulation study we take the observation density h uniform on the triangle of its support. For ease of reference, we determine here the value of the variance σ² of the asymptotic distribution for this case. If h is uniform, its density is constant on the triangle, and the marginal densities g_1 and g_2 follow by integrating out u and t, respectively. For W(t_0) we then obtain (4.9), and hence, using (4.7), we obtain the value of σ², where W(t_0) is defined by (4.9).
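The displays for h, g_1 and g_2 did not survive extraction; under the stated support assumption they can be reconstructed as follows (our derivation, to be checked against (4.8)). The triangle with vertices (0, ε), (0, 1) and (1 − ε, 1) has area (1 − ε)²/2, so the uniform density and its marginals are:

```latex
h(t,u) \;=\; \frac{2}{(1-\varepsilon)^2}\,
       \mathbf{1}\{0\le t\le u-\varepsilon\le 1-\varepsilon\},
\qquad
g_1(t) \;=\; \int_{t+\varepsilon}^{1} h(t,u)\,du
       \;=\; \frac{2(1-\varepsilon-t)}{(1-\varepsilon)^2},
\quad 0\le t\le 1-\varepsilon,
\qquad
g_2(u) \;=\; \int_{0}^{u-\varepsilon} h(t,u)\,dt
       \;=\; \frac{2(u-\varepsilon)}{(1-\varepsilon)^2},
\quad \varepsilon\le u\le 1.
```

Each marginal integrates to one, as it should, since the integral of g_1 (and of g_2) equals 2/(1 − ε)² times the area (1 − ε)²/2 of the triangle.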

The maximum likelihood estimator
As mentioned in section 1, the (nonparametric) maximum likelihood estimator (MLE or NPMLE) maximizes the (partial) log likelihood, where the maximization is over all distribution functions F. For the non-separated case the following conjecture was given in [5] (the lecture notes of a summer course given at Stanford University in 1990, which later appeared as part 2 of [10]):

Theorem 5.1. (Conjecture in [5]) Let F_0 and H be continuously differentiable at t_0 and (t_0, t_0), respectively, with strictly positive derivatives f_0(t_0) and h(t_0, t_0). Then: where Z is the last time that standard two-sided Brownian motion minus the parabola y(t) = t² reaches its maximum.
It was also shown in [5] that Theorem 5.1 is true for a "toy" estimator, obtained by doing one step of the iterative convex minorant algorithm, starting the iterations at the underlying distribution function F_0; the "toy" aspect is that we can of course not do this in practice. Although more than 20 years have passed since this conjecture was launched, it still has not been proved. In the simulation section we provide some material which seems to support the conjecture, but further research is necessary to settle this question.
For the separated case one can also introduce a toy estimator of the same type, and one can again formulate the "working hypothesis" that the toy estimator and the MLE have the same pointwise limit behavior. Anticipating that this would hold, [14] derived the asymptotic distribution of the toy estimator in the separated case, under conditions (C1) to (C4), where (C3) defines the functions k_{1,ε} and k_{2,ε} and (C4) requires 0 < F_0(t_0) < 1 and 0 < H(t_0, t_0) < 1.
The motivation for these conditions is given in [14] and actually becomes clear from the proof, which is not given here.
Theorem 5.2. ([14]) Suppose that assumptions (C1) to (C4) hold. Let k_i, i = 1, 2, be as defined above, and suppose that f_0, g_1, g_2, k_1 and k_2 are continuous at t_0, where g_1 and g_2 are the first and second marginal densities of h, respectively. Then, if F̂_n^{(1)} is the estimator of the distribution function F_0, obtained after one step of the iterative convex minorant algorithm, starting the iterations with F_0, we have: where Z is the last time where standard two-sided Brownian motion minus the parabola y(t) = t² reaches its maximum.
It is indeed proved in [6] that, under slightly stronger conditions (the most important one being that an observation interval always has length > ε, for some ε > 0), which hold for the examples in the simulation below, the MLE has the same limit behavior, using the same norming constants. The expression for the asymptotic variance in the separated case is remarkably different from the conjectured variance in the non-separated case, which depends on F_0 only via f_0(t_0); this shows that there only the local behavior, given by the density at t_0, is important for the asymptotic variance (assuming that the working hypothesis holds).
Note that if (T_i, U_i) is uniform on the triangle with vertices (0, ε), (0, 1) and (1 − ε, 1), we obtain explicit expressions for the constants, which simplify further if F_0 is the uniform distribution function on [0, 1]. We give some results for the latter model in section 8.

The smoothed maximum likelihood estimator
Let h be the density of (T_i, U_i), with first marginal density h_1 and second marginal density h_2, and let φ_{t,b,F} be a solution of the integral equation (in φ), where the function k_{t,b} is defined by (6.1). Moreover, let the function θ_{t,b,F} be defined by (6.2), where u < v. Then, as in [3] (separated case) and [4] (non-separated case), we have a representation of the smoothed MLE as a functional of the observations. For F = F_0 we get the corresponding integral equation. Using the theory in [3] and [4] again, we find that the solution φ_{t,b,F_0} yields an approximation for n var(F̃_n(t)). The approximation seems to work pretty well, as can be seen in table 1, where we estimated the actual variance for samples of size n = 1000 by generating 10,000 samples of size 1000 from a Uniform(0, 1) distribution F_0 and a uniform observation distribution H on the upper triangle of the unit square. As in the papers cited above, we do not have an explicit expression for φ_{t,b_n,F_0}; a picture of φ_{t,b_n,F_0} for the Uniform(0, 1) distribution F_0 and b_n = n^{-1/5} is shown in Figure 3; the function was computed by solving the corresponding matrix equation on a 1000 × 1000 grid. Note that we apply the smooth functional theory of the papers mentioned above (which is also discussed in [6]) not to a fixed functional, but to changing functionals on shrinking intervals (in the hidden space). The reason we can do this is that the bandwidth b is chosen of larger order than the critical rate n^{-1/3}, so that a different type of asymptotics sets in, with asymptotic normality, etc., instead of the non-standard asymptotics of the MLE itself. This method is also used in [8], for the current status model. In analogy with Theorem 4.2 in [8] we expect the following result to hold, under the conditions on the underlying distributions discussed in [3] and [4]. To avoid messy notation, we will denote the smoothed MLE by F̃_n instead of F̃_n^{ML} in the remainder of this section.

Theorem 6.1 (Conjectured). Let the conditions of [3] (separated case) or [4] (non-separated case) be satisfied.
Moreover, let the joint density h of (T_i, U_i) have a continuous bounded second total derivative in the interior of its domain, let f_0 have a continuous derivative at the interior point t of the support of f_0, and let F̃_n be the smoothed MLE, defined by (1.3). Then, if b_n ∼ n^{-1/5}, we have: where N(0, 1) is the standard normal distribution and σ²_n is defined in terms of θ_{t,b_n,F_0}, given by (6.2).
Note that (the conjectured) Theorem 6.1 covers both the separated and the non-separated case. Unfortunately, we do not have an explicit expression for (6.3) in Theorem 6.1 at present. The functions φ_{F_0}, defining the function θ_{F_0} and hence also the variance σ²_n, are of a rather different nature in the separated and the non-separated case. For an example of this, see Figure 4.
The variance σ²_n can be estimated by σ̂²_n, where φ̂_{t,b_n,F̃_n} solves the integral equation (6.4), and where ĥ_n is a kernel estimate of the density h. For b_n chosen as in the theorem, the distribution function F̃_n will be strictly increasing with probability tending to one. Since F̃_n is also continuously differentiable, equation (6.4) will have an absolutely continuous solution φ̂_{t,b_n,F̃_n}, and we do not have to take recourse to a solution pair, as in [4], which deals separately with a discrete and an absolutely continuous part.
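The numerical scheme used in this section (solving the integral equation as a matrix equation on a grid) can be sketched for a generic Fredholm equation of the second kind; the kernel κ and right-hand side ψ below are toy stand-ins, not the actual kernel of the equation for φ_{t,b,F}:

```python
# Sketch: discretize phi(t) - ∫_0^1 kappa(t,s) phi(s) ds = psi(t) on an
# m-point midpoint grid and solve the matrix equation (I - h*Kappa) phi = psi
# by Gaussian elimination with partial pivoting.

def solve_fredholm(kappa, psi, m=50):
    h = 1.0 / m
    xs = [(i + 0.5) * h for i in range(m)]
    A = [[(1.0 if i == j else 0.0) - h * kappa(xs[i], xs[j]) for j in range(m)]
         for i in range(m)]
    b = [psi(x) for x in xs]
    for col in range(m):                      # forward elimination with pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    phi = [0.0] * m
    for r in range(m - 1, -1, -1):            # back substitution
        s = sum(A[r][c] * phi[c] for c in range(r + 1, m))
        phi[r] = (b[r] - s) / A[r][r]
    return xs, phi

# For kappa(t,s) = t*s and psi(t) = t the exact solution is phi(t) = 1.5*t.
xs, phi = solve_fredholm(lambda t, s: t * s, lambda t: t)
assert max(abs(phi[i] - 1.5 * xs[i]) for i in range(len(xs))) < 1e-3
```

The 1000 × 1000 grid mentioned above corresponds to m = 1000; in practice one would use a linear-algebra library rather than the hand-rolled elimination shown here.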
In the corresponding result for the current status model we have explicit expressions, and we briefly discuss the analogy here, using notation of the same type. Let F̃_n^{(CS)} be the smoothed MLE for the current status model, defined as in (1.3), but now using the MLE F̂_n in the current status model. In this case the function θ_{t,b,F}, representing the functional in the observation space, is given by (6.5), for u ∈ (0, 1), where φ is given explicitly and k_{t,b} is defined by (6.1). Moreover, g is the density of the (one-dimensional) observation distribution. The solution φ^{(CS)}_{t,b_n,F_0} yields an approximation for n var(F̃_n^{(CS)}(t)), so in this case we obtain a central limit theorem; see Theorem 4.2, p. 365, of [8].
Remark 6.1. It is tempting to think that the asymptotic variance for case 2 can be found by computing b E θ²_{t,b,F_0}, just as in the current status model. However, numerical computations suggested that b E θ²_{t,b,F_0} tends to zero in the non-separated case. This might mean that the variance is not of order n^{-4/5} in this case, but perhaps contains a logarithmic factor, in analogy with the variance of order (n log n)^{-2/3} for the histogram-type estimators, like Birgé's estimator and the MLE without smoothing.
However, we do not expect this to happen in the separated case. All this still has to be determined by an analysis of the difference in asymptotic behavior of the functions φ_{t,b_n,F_0} in the separated and the non-separated case (see Figure 4 for a picture of the rather different behavior of φ_{t,b_n,F_0} in these two situations).

Simulation results for the non-separated case
In tables 2 to 5 we present some simulation results for the non-separated case, for Birgé's estimator, the MLE and the smoothed MLE. In all cases the observation density was the uniform density on the upper triangle. All results are based on 10,000 pseudo-random samples. For Birgé's estimator the asymptotically optimal binwidth was chosen in all simulations.
We study the case where f 0 is the uniform density on [0, 1] and give results for the interior points t 0 = 0.3, 0.4, 0.5 and 0.6. Although these points are somewhat arbitrarily chosen, the results are representative for what happens in the interior of the interval.
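The simulation setup just described can be sketched as follows; this is a minimal sketch of the data generation only (variable names are ours), from which the three estimators would then be computed:

```python
import random

def draw_case2(n, seed=0):
    """Simulate interval censoring, case 2: X ~ Uniform(0,1), (T,U) uniform
    on the upper triangle {0 <= t < u <= 1}, independent of X; only
    (T, U, Delta_1, Delta_2) is observed."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        t, u = sorted((rng.random(), rng.random()))  # uniform on the triangle
        d1 = 1 if x <= t else 0                      # X to the left of T
        d2 = 1 if t < x <= u else 0                  # X inside (T, U]
        data.append((t, u, d1, d2))
    return data

obs = draw_case2(2000)
assert all(t <= u and d1 + d2 <= 1 for t, u, d1, d2 in obs)
```

For the separated case one would instead draw (T, U) uniformly on the triangle with vertices (0, ε), (0, 1) and (1 − ε, 1), so that U − T > ε always holds.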
It can be seen from the tables that the squared bias for the MLE is, in all cases, negligible compared to the variance. We note that this is in contrast with Birgé's estimator. Moreover, the variance of the MLE is generally smaller than that of Birgé's estimator. Table 5 shows, not unexpectedly, that the MSE of the smoothed MLE is much smaller than the MSE of either the MLE or Birgé's estimator.

Simulation results for the separated case
For the separated case the results of a simulation study are provided in tables 6 to 14. We first take again F_0 to be the Uniform(0, 1) distribution function. On the other hand, we chose the observation density defined by (4.8), with ε = 0.1, so the observation times T_i and U_i cannot become arbitrarily close. The results are based on 10,000 pseudo-random samples. As in the non-separated case, the MSE of the MLE turns out to be smaller than the MSE of Birgé's estimator; here the difference is, however, even more noticeable.

Table 6

              f_0(t) = 1    f_0(t) = 4(1 − t)^3
  n = 10^6       1.12             1.09
  n = 10^7       1.04             1.04

Table 7
MSE for Birgé's estimator and MLE, times n^{2/3}, t_0 = 0.3, 0.4, 0.5 and 0.6, separated case. The asymptotic MSE (Birgé) and "the asymptotic variance" (MLE) are displayed in bold type.
In tables 7 to 9 we give the results for the MSE, the variance and the squared bias for both estimators. Again it can be seen that the variance of Birgé's estimator is generally larger than the variance of the MLE. Moreover, as in the non-separated case, the squared bias of the MLE is, in all cases, negligible compared to the variance.
To show that the results are not specific to the uniform distribution, we give in tables 11 to 13 the corresponding comparisons for the distribution function F_0 with density f_0, defined by f_0(t) = 4(1 − t)^3, t ∈ [0, 1]. For the computation of the asymptotic variance of the MLE we used (5.1) of section 5. It is seen that the correspondence between the asymptotic expression for the variance and the actual sample variance of the MLE is rather good, and also that the superiority of the MLE w.r.t. Birgé's estimator is still more pronounced for this distribution function. Table 14 shows that the ratio of the MSE of the SMLE to the MSE of the actual MLE is somewhat larger here, which is probably due to the fact that the asymptotic bias plays a larger role for the SMLE in this case (this bias vanishes for the uniform distribution function). The bias of the actual MLE is again very small for this distribution function, however.
As the fit with the asymptotic MSE was not satisfactory for Birgé's estimator in the separated case, we also did some simulations for much larger sample sizes. It turns out that the MSE then approximates the values predicted by the asymptotic theory. Some evidence is given in table 6. The results are based on 1000 pseudo-random samples.

Table 8
Variance for Birgé's estimator and MLE, times n^{2/3}, t_0 = 0.3, 0.4, 0.5 and 0.6, separated case. The asymptotic variances are displayed in bold type.

Table 12
Variance for Birgé's estimator and MLE, times n^{2/3}, t_0 = 0.3, 0.4, 0.5 and 0.6, f_0(t) = 4(1 − t)^3, t ∈ [0, 1], separated case. The asymptotic variances are displayed in bold type.

Summary
In the preceding, the limit distributions of three estimators for the interval censoring, case 2, problem were discussed: Birgé's estimator, the (nonparametric) maximum likelihood estimator (MLE) and the smoothed MLE, which is analogous to the smoothed MLE introduced in [8] for the current status model. Birgé's estimator is mainly of theoretical interest and was constructed to show that the minimax rate can be attained. The construction uses prior knowledge on whether the observation distribution allows arbitrarily small observation intervals (the so-called non-separated case) or not (the separated case). Such prior knowledge is not necessary for the MLE, which adapts automatically to either situation. The conjectured limit distribution of the MLE in the non-separated case, given in [5], was (partially) checked in a simulation study, comparing Birgé's estimator, the MLE and the smoothed MLE. The simulation study seems to support the conjecture. The smoothed MLE converges at a faster rate than either Birgé's estimator or the MLE on which it is based if the underlying distribution is smooth, as is also borne out by the simulation study.
The limit distribution of the MLE in the separated case was given in [6], and the simulation study for the separated case shows that the asymptotic variance arising from this result provides a good approximation to the actual finite sample variance. The difference in behavior between the separated and non-separated cases persists for the smoothed MLE, and in that case crucially depends on properties of the solution of an integral equation, as discussed in section 6. This analysis is based on a local version of the theory developed in [2], [3] and [4]. The (numerical) solution of the integral equation can be used to estimate the variance of the smoothed MLE. The theoretically computed asymptotic variance, using a numerical solution of the integral equation, fits the observed sample variance rather well, but the discussion on this matter is heuristic and still leaves many open questions.

Appendix
We split the proof of Theorem 3.1 into several parts, dealing with the difficulties (1), (2) and (3), mentioned in section 3. Here and in the following we will use some empirical process notation to make the transition to the asymptotic distribution more transparent. As an example, we give a representation of in terms of integrals with respect to empirical distributions. First we write: where δ = (δ 1 , δ 2 , δ 3 ) is the vector of indicators giving the position of the unobservable random variables X i with respect to the observation interval [T i , U i ], and where P n is the empirical measure of the random variables ( The denominator of (10.1), after dividing by n, is rewritten in the form: where G n,1 is the empirical distribution function of the T i , with underlying df G 1 and underlying density g 1 , which is the first marginal of h. Using this notation, we get: where we define the ratio to be zero if the denominator is zero. The terms M k /M k and Q j,k /Q j,k can be rewritten in a similar way. We will also use the following decomposition: We similarly have, denoting 1 − F 0 by F̄ 0 , and One can consider this as a decomposition into a "variance part" and a "bias part", where the first terms on the right-hand sides of the above expressions correspond to the variance part and the second terms to the bias part. We first deal with the bias part. Proof.
(i). If N k , M k , Q j,k and Q k,j are strictly positive for all (relevant) values of k, we can write, see (10.4) to (10.6). We can write this in the following form: t∈Ij , u∈I k dH n (t, u).
imsart-ejs ver. 2011/12/01 file: interval.tex date: December 13, 2011
Combining these results, we find that the variance of the conditional expectation and similarly, with a similar expansion for k < j. The o p (1)-terms are uniform in k, which follows by using exponential inequalities of the same type as used in Lemma 3.1.
It is easily seen that the terms involving values of t k ∉ [ , 1 − ] give a negligible contribution, by noting that if k > j, with a similar upper bound if t k < t j . The result now follows by multiplying by w j,k and summing over k, see (3.11).
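The representation above is written in terms of empirical distribution functions, such as G n,1 for the observation times T i . As a minimal, generic sketch (not the paper's code), an empirical distribution function can be computed as follows:

```python
import numpy as np

def empirical_cdf(sample):
    """Return the empirical distribution function G_n of a sample:
    G_n(t) = (1/n) * #{i : sample_i <= t}."""
    data = np.sort(np.asarray(sample, dtype=float))
    n = data.size
    def G_n(t):
        # searchsorted with side='right' counts the points <= t
        return np.searchsorted(data, t, side="right") / n
    return G_n
```

Integrals with respect to the empirical measure, as used in the decompositions above, are then simply averages of the integrand over the sample points.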
We now define U n,k and V n,k . Note that these are the numerators of the "variance parts" in (10.4) and (10.5), divided by n. The following lemma shows that (in the proper scaling for Birgé's statistic) the variances of the sums of terms involving U n,k and V n,k in Birgé's statistic tend to zero.
Lemma 10.2. Let the conditions of Theorem 3.1 be satisfied, let t j = t 0 , and let α n be defined by (3.12). Moreover, let U n,k and V n,k be defined by (10.9) and (10.10). Then, as n → ∞, Proof. We have, since the covariances of the terms in the sum are zero: As before, we define the ratios to be zero if the denominator is zero. Furthermore, for the variance we now obtain: where (as before). By (3.10): We now define, if j < k, (10.12) and, if j > k: Since the centered statistic has a nondegenerate distribution, this has to come from the sum (10.14). The following lemma shows that (10.14), with the random weights replaced by the deterministic weights w j,k , indeed has a nondegenerate limit distribution.
where the right-hand side denotes a normal random variable with expectation 0 and variance σ 2 0 , defined by (3.14) in Theorem 3.1. Proof. We will prove the result by constructing a martingale-difference array and applying Theorem 1, p. 171, of [12]. Define, for k > j, the random variables.
For k < j we define and (for notational convenience) we define ξ n,j ≡ 0. Let the increasing sequence of σ-fields F n,k , k = 0, 1, . . . , be defined as before. Note: I k = [t k , t k+1 ), k < K, under scheme (i), and I k = [t k , t k+1 ), k ≤ K, and I K+1 = [t K+1 , t K+2 ], under scheme (ii), at the beginning of this section.
Then: E ξ n,k F n,k−1 = 0, k = 1, 2, . . . (10.15) Here and in the following the indices k run from 1 to K or to K + 1, depending on whether scheme (i) or (ii) holds, respectively. Note that, if k < j, we can write and that if t k < T i < t j and U i ∈ I j , using the independence of the X i from the pairs (T i , U i ). Similar relations hold if T i ∈ I j . This implies E ξ n,k F n,k−1 = 0, k = 1, 2, . . . . (10.16) It is also clear that ξ n,k is measurable with respect to F n,k . Let the conditional variances v n,k be defined by v n,k = E ξ 2 n,k F n,k−1 , k = 1, 2, . . . . We first consider the indices k such that |j − k| < ε n K, where ε n = (log n) −1/3 . We then get, if k < j, We similarly get: if k > j and k − j < ε n K. The terms v n,k , where |k − j| ≥ ε n K, give a negligible contribution, since as n → ∞. So we find, as n → ∞. Using again the conditional independence of the X i , given the values of the pairs (T i , U i ), and defining p 0 (t, u) = F 0 (u) − F 0 (t), p̄ 0 (t, u) = 1 − p 0 (t, u), we get, if k < j: The first conditional expectation on the right-hand side of (10.21) arises from terms of the form where T i ∈ I k , U i ∈ I j , and the second one from terms of the form where i ≠ i′ and T i , T i′ ∈ I k ; U i , U i′ ∈ I j , where we added the diagonal terms (where i = i′) for simplicity of notation, since they give a negligible contribution. The other conditional expectations of cross-products are zero. If k > j we get an entirely similar expansion, with the roles of t and u interchanged. The first term on the right-hand side of (10.21) gives a contribution of order

Proof of Theorem 3.1. ad (i). Lemma 10.2 shows that the terms involving N k /N k and M k /M k only give a contribution to the asymptotic bias of Birgé's statistic, but not to the limit distribution of the centered part. The limit distribution of the centered part therefore arises from the terms W n,j,k , where W n,j,k = n log n (t,u)∈I k ×Ij which are the numerators of the fractions.
Now note that where the o p (1)-term is uniform in k by the results given above. Moreover, by part (i) of Lemma 3.1, where h̄(t, u) = h(t, u), t < u, and h̄(t, u) = h(u, t), t ≥ u.
The result now follows from Lemma 10.3.

ad (ii). We first prove (3.15). Since E F n (t j ) is the expectation of E[ F n (t j ) | N k , Q j,k , k > j; M k , Q k,j , k < j ], part (i) of Lemma 10.1 tells us that as n → ∞. This implies: as n → ∞. But since, by part (ii) of Lemma 10.1, as n → ∞, we must have: This yields (3.15). To prove (3.16), we first note that, by part (i) of Lemma 10.1, the variance of the conditional expectation tends to zero. By Lemma 10.2 the sum of terms involving N k and M k also gives an asymptotically negligible contribution to α −1 n F n (t j ) − F 0 (t 0 ) . So we only have to consider the contribution of terms of the form and The variance of (10.22) is given by In Lemma 3.1, (uniform) exponential inequalities are derived for the probabilities of events of the following type: yielding upper bounds tending to zero faster than any power of n. So we get: which tends to zero faster than any power of n, uniformly in k. Here we use the lower bound 1/K for W j 1 {Wj >0} . So we find: This implies: Now let, for t k < 1 − δ, where δ > 0, the event B k be defined by For t k ≥ 1 − δ, we define the event B k by: Similarly to what is true for A j,k , we have that P (B k ) tends to zero faster than any power of n, uniformly in k. This shows that we can also replace the random weights by the deterministic weights w j,k in the asymptotic expression for the variance, using the fact that the terms for t k > 1 − δ will give a contribution of lower order in the summation. So we find: n(log n) 2 · · k:j>k Similarly we find that the summation for k < j gives a contribution which is asymptotically equivalent to 3b(t 0 ) 2 f 0 (t 0 ) c {a(t 0 ) + b(t 0 )} 2 h(t j , t k ).
also on the last interval, since g 1 (t) → 0 on this interval, sup k:k<j |KEM k /n − g 2 (t k )| = sup k:k<j K t∈I k g 2 (t) dt − g 2 (t k ) → 0, also on the first interval, since g 2 (t) → 0 on this interval. We also have: uniformly for all t k not belonging to the first or last interval, which may not have length 1/K (see the construction of the intervals of Birgé's statistic at the beginning of section 3). But on these intervals we have h(t, t j ) ∧ g 2 (t) = g 2 (t) and h(t j , t) ∧ g 1 (t) = g 1 (t), respectively. So we get: sup k:k>j (KEN k /n) ∧ K 2 EQ j,k /n − g 1 (t k ) ∧ h(t j , t k ) → 0, (10.29) and sup k:k<j (KEM k /n) ∧ K 2 EQ k,j /n − g 2 (t k ) ∧ h(t k , t j ) → 0. (10.30) Hence, we get from (10.28), (10.29) and (10.30), on the set A j, , (KEM k /n) ∧ (K 2 EQ k,j /n) j − k + 1 and similarly Moreover, letting ε n = (log n) −1/3 , we get: Hence, using (4.3) and a Riemann sum approximation, we obtain: for each δ > 0. We again use the conditional Cauchy-Schwarz inequality E[ ξ 2 n,k 1 {|ξ n,k |>δ} | F n,k−1 ] ≤ E[ ξ 4 n,k | F n,k−1 ] 1/2 E[ 1 {|ξ n,k |>δ} | F n,k−1 ] 1/2 . Letting p 0 (t, u) = F 0 (u) − F 0 (t), p̄ 0 (t, u) = 1 − p 0 (t, u), we get, if k < j: E ξ 4 n,k F n,k−1 ∼ n w 4 j,k c 8 h(t k , t 0 ) 4 · E t∈I k , u∈Ij p 0 (t, u)p̄ 0 (t, u){p 0 (t, u) 3 + p̄ 0 (t, u) 3 } dH n (t, u) F n,k−1 + n 2 w 4 j,k c 8 h(t k , t 0 ) 4 E t∈I k , u∈Ij p 0 (t, u){1 − p 0 (t, u)} dH n (t, u) 2 F n,k−1 . (10.39) The first and second terms on the right-hand side are, respectively, of order The proof of Theorem 4.1 can now be finished by making the transition from the random weights to the deterministic weights, using Lemma 4.1 (see the proof of Theorem 3.1 at the end of section 3), and using the central limit result of Lemma 10.6.
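The martingale-difference central limit argument used above relies on increments ξ n,k satisfying E[ ξ n,k | F n,k−1 ] = 0, with a scale that is measurable with respect to the past. As a toy illustration of this structure (a generic martingale-difference construction, purely an assumption for illustration and unrelated to the paper's ξ n,k ):

```python
import numpy as np

rng = np.random.default_rng(0)

def martingale_sum(n):
    """Sum of a simple martingale-difference sequence: each increment is a
    Rademacher (+/-1) variable scaled by a function of the running sum, so
    its conditional expectation given the past is exactly zero."""
    s = 0.0
    for _ in range(n):
        scale = 1.0 / (1.0 + abs(s))            # measurable w.r.t. the past
        s += scale * rng.choice([-1.0, 1.0])    # mean zero given the past
    return s

# by the martingale property, the sum has expectation exactly zero
sims = np.array([martingale_sum(200) for _ in range(2000)])
```

The empirical mean of the simulated sums is close to zero, in line with the martingale property; a normal approximation for the standardized sums is what the martingale central limit theorem of [12] delivers under the Lindeberg-type condition verified above.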