Global Rates of Convergence of the MLE for Multivariate Interval Censoring

We establish global rates of convergence of the Maximum Likelihood Estimator (MLE) of a multivariate distribution function in the case of (one type of)"interval censored"data. The main finding is that the rate of convergence of the MLE in the Hellinger metric is no worse than $n^{-1/3} (\log n)^{\gamma}$ for $\gamma = (5d - 4)/6$.


Introduction and overview
Our main goal in this paper is to study global rates of convergence of the Maximum Likelihood Estimator (MLE) in one simple model for multivariate interval-censored data. In section 3 we will show that under some reasonable conditions the MLE converges in a Hellinger metric to the true distribution function on R d at a rate no worse than n −1/3 (log n) γ d for γ d = (5d − 4)/6 for all d ≥ 2. Thus the rate of convergence is only worse than the known rate of n −1/3 for the case d = 1 by a factor involving a power of log n growing linearly with the dimension. These new rate results rely heavily on recent bracketing entropy bounds for d−dimensional distribution functions obtained by Gao [2012].
We begin in Section 2 with a review of interval censoring problems and known results in the case d = 1. We introduce the multivariate interval censoring model of interest here in Section 3, and obtain a rate of convergence for this model for d ≥ 2 in Theorem 3.1. Most of the proofs are given in Section 4, with the exception being a key corollary of Gao [2012], the statement and proof of which are given in the Appendix (Section 6). Finally, in Section 5 we introduce several related models and further problems.

Interval Censoring (or Current Status Data) on R
Let Y ∼ F 0 on R + , and let T ∼ G 0 on R + be independent of Y . Suppose that we observe X 1 , . . . , X n i.i.d. as X = (∆, T ) where ∆ = 1 [Y ≤T ] . Here Y is often the time until some event of interest and T is an observation time. The goal is to estimate F 0 nonparametrically based on observation of the X i 's.
To calculate the likelihood, we first calculate the distribution of X for a general distribution function F : note that the conditional distribution of ∆ conditional on T is Bernoulli: where p(T ) = F (T ). If G 0 has density g 0 with respect to some measure µ on R + , then X = (∆, T ) has density p F,g0 (δ, t) = F (t) δ (1 − F (t)) 1−δ g 0 (t), δ ∈ {0, 1}, t ∈ R + , with respect to the dominating measure (counting measure on {0, 1}) × µ. The nonparametric Maximum Likelihood Estimator (MLE)F n of F 0 in this interval censoring model was first obtained by Ayer et al. [1955]. It is simply described as follows: let T (1) ≤ · · · ≤ T (n) denote the order statistics corresponding to T 1 , . . . , T n and let ∆ (1) , . . . , ∆ (n) denote the corresponding ∆'s. Then the part of the log-likelihood of X 1 , . . . , X n depending on F is given by It turns out that the maximizerF n of (2.1) subject to (2.2) can be described as follows: let H * be the (greatest) convex minorant of the points {(i, j≤i ∆ (j) ) : i ∈ {1, . . . , n}}: LetF i denote the left-derivative of H * at T (i) . Then (F 1 , . . . ,F n ) is the unique vector maximizing (2.1) subject to (2.2), and we therefore take the MLEF n of F to bê with the conventions T (0) ≡ 0 and T (n+1) ≡ ∞. See Ayer et al. [1955] or Groeneboom and Wellner [1992], pages 38-43, for details. Groeneboom [1987] initiated the study ofF n and proved the following limiting distribution result at a fixed point t 0 .
The distribution of Z has been studied in detail by Groeneboom [1989] and computed by Groeneboom and Wellner [2001]. Balabdaoui and Wellner [2012] show that the density f Z of Z is log-concave. van de Geer [1993] (see also van de Geer [2000]) obtained the following global rate result for pF n . Recall that the Hellinger distance h(p, q) between two densities with respect to a dominating measure µ is given by Proposition 2.2. (van de Geer, 1993) h(pF n , p F0 ) = O p (n −1/3 ). Now for any distribution functions F and F 0 the (squared) Hellinger distance h 2 (p F , p F0 ) for the current status model is given by For generalizations of these and other asymptotic results for the current status model to more complicated interval censoring schemes for real-valued random variables Y , see e.g. Groeneboom and Wellner [1992], van de Geer [1993], Groeneboom [1996], van de Geer [2000], Schick and Yu [2000], and Groeneboom, Maathuis and Wellner [2008a,b].
Our main focus in this paper, however, concerns one simple generalization of the interval censoring model for R introduced above to interval censoring in R d . We now turn to this generalization.

Multivariate interval censoring: multivariate current status data
We assume that G 0 has density g 0 with respect to some dominating measure µ on R d . Suppose we observe X 1 , . . . , X n i.i.d. as X = (∆, T ) where ∆ = (∆ 1 , . . . , ∆ d ) is given by ∆ j = 1 [Yj≤Tj ] , j = 1, . . . , d. Equivalently, with a slight abuse of notation, X = (Γ, T ) where Γ = (Γ 1 , . . . , Γ 2 d ) is a vector of length 2 d consisting of 0's and 1's and with at most one 1 which indicates into which of the 2 d orthants of R +d determined by T the random vector Y belongs. More explicitly, define Much as for univariate current status data, Y represents a vector of times to events, T is a vector of observation times, and the goal is nonparametric estimation of the joint distribution function F 0 of Y based on observation of the X i 's. See Dunson and Dinse [2002], Jewell [2007], Wang [2009], and Lin and Wang [2011] for examples of settings in which data of this type arises.
To calculate the likelihood, we first calculate the distribution of X for a general distribution function F : note that the conditional distribution of Γ conditional on T is Multinomial: where p(T ; F ) = (p 1 (T ; F ), . . . , p 2 d (T ; F )) and the probabilities p j (t; F ), j = 1, . . . , 2 d , t ∈ R +d are determined by the F measures of the corresponding sets. Then our model P for multivariate current status data is the collection of all densities with respect to the dominating measure (counting measure on {0, 1} 2 d )× µ given by for some distribution function F on R +d where t ∈ R +d and γ j ∈ {0, 1} with 2 d j=1 γ j = 1. Now the part of the log-likelihood that depends on F is given by and again the MLEF n of the true distribution function F 0 is given bŷ F n = argmax{l n (F ) : F is a distribution function on R +d }. (3.1) For example, when d = 2, we can write Characterizations and computation of the MLE (3.1), mostly for the case d = 2 have been treated in Song [2001], Gentleman and Vandal [2002], and Maathuis [2005Maathuis [ , 2006. Consistency of the MLE for more general interval censoring models has been established by Yu, Yu and Wong [2006]. For an interesting application see Betensky and Finkelstein [1999]. This example and other examples of multivariate interval censored data are treated in Sun [2006] and and Deng and Fang [2009]. For a comparison of the MLE with alternative estimators in the case d = 2, see Groeneboom [2012a].
An analogue of Groeneboom's Theorem 2.1 has not been established in the multivariate case. Song [2001] established an asymptotic minimax lower bound for pointwise convergence when d = 2: if F 0 and G 0 have positive continuous densities at t 0 , then no estimator has a local minimax rate for estimation of F 0 (t 0 ) faster than n −1/3 . By making use of additional smoothness hypotheses, Groeneboom [2012a] has constructed estimators which achieve the pointwise n −1/3 rate, but it is not yet known if the MLE achieves this.
Our main goal here is to prove the following theorem concerning the global rate of convergence of the MLEF n .
Theorem 3.1. Consider the multivariate current status model.
Since the inequality (2.3) continues to hold in R d for d ≥ 2 (with 1/4 replaced by 1/8 on the right side), we obtain the following corollary: Corollary 3.2. Under the conditions of Theorem 3.1 it follows that

Proofs
Here we give the proof of Theorem 3.1. The main tool is a method developed by van de Geer [2000]. We will use the following lemma in combination with Theorem 7.6 of van de Geer [2000] or Theorem 3.4.1 of van der Vaart and Wellner [1996] (Section 3.4.2, pages 330-331). Without loss of generality we can take M = 1 where M is the upper bound of the support of F (see Theorem 3.1).
It remains only to show that (4.6) holds. But this is easy since g 2 Qσ,2 = g 2 Qσ,2 · Q σ (X ). This lemma is based on van de Geer [2000], pages 101 and 103. Note that our constants differ slightly from those of van de Geer.
Lemma 4.2. Suppose that F 0 has density f 0 which satisfies, for some 0 < c 1 < ∞, Then p 0 (which we can identify with the vector p 0 (·, F 0 )) satisfies Proof. This follows immediately from the general d version of (3.2) and the assumption on f 0 .
These inequalities can also be written in the following compact form: Lemma 4.3. Suppose that the assumption of Lemma 4.2 holds. Suppose, moreover, that G 0 has density g 0 which satisfies Proof. The first inequality follows easily from Lemma 4.2: note that The second inequality follows from the first inequality of the lemma.
Lemma 4.4. If the hypotheses of Lemmas 4.2 and 4.3 hold, then the measure Q σ defined by dQ σ ≡ (1/p 0 )1{p 0 > σ}dµ has total mass Q σ (X ) given by (4.10) Proof. This follows from Lemma 4.2, followed by an explicit calculation. In particular, the equality in (4.10) follows from dx by the change of variables t j = e −xj , where the second equality follows by induction: it holds easily for d = 1 (and d = 2); and then an easy calculation shows that it holds for d if it holds for d − 1.
Proof. This follows by combining the results of Lemmas 4.3 and 4.4 with Lemma 4.1, and then using Corollary 6.2 of the bracketing entropy bound of Gao [2012] and stated here as Theorem 6.1. Here is the explicit calculation: by Lemmas 4.3 and 4.4 by Corollary 6.2(b) for ǫ sufficiently small.

Some related models and further problems
There are several related models in which we expect to see the same basic phenomenon as established here, namely a global convergence rate of the form n −1/3 (log n) γ in all dimensions d ≥ 2 with only the power γ of the log term depending on d. Three such models are: (a) the "in-out model" for interval censoring in R d ; (b) the "case 2" multivariate interval censoring models studied by Deng and Fang [2009]; and (c) the scale mixture of uniforms model for decreasing densities in R +d . Here we briefly sketch why we expect the same phenomenon to hold in these three cases, even though we do not yet know pointwise convergence rates in any of these cases.

The "in-out model" for interval censoring in R d
The "in-out model" for interval censoring in R d was explored in the case d = 2 by Song [2001]. In this model where U and V are random vectors in R 2 with U ≤ V coordinatewise). We observe only (1 R (Y ), R), and the goal is to estimate the unknown distribution function F . Song [2001] (page 86) produced a local asymptotic minimax lower bound for estimation of F at a fixed t 0 ∈ R 2 . Under the assumption that F has a positive density f at t 0 , Song [2001] showed that any estimator of F (t 0 ) can have a local-minimax convergence rate which is at best n −1/3 . Groeneboom [2012a] has shown that this rate can be achieved by estimators involving smoothing methods. Based on the results for current status data in R d obtained in Theorem 3.1 and the entropy results for the class of distribution functions on R d , we conjecture that the global Hellinger rate of convergence of the MLEF n (t 0 ) will be n −1/3 (log n) ν for all d ≥ 2 where ν = ν d .

"Case 2" multivariate interval censoring models in R d
Recall that "case 2" interval censored data on R is as follows: suppose that Y ∼ F 0 on R + , the pair of observation times (U, V ) with U ≤ V determines a random interval (U, V ], and we observe X = (∆, Nonparametric estimation of F 0 based on X 1 , . . . , X n ) i.i.d. as X has been discussed by a number of authors, including Groeneboom and Wellner [1992], Geskus and Groeneboom [1999], and Groeneboom [1996]. Deng and Fang [2009] studied generalizations of this model to R d , and obtained rates of convergence of the MLE with respect to the Hellinger metric given by n −(1+d)/(2(1+2d) (log n) d 2 /(2(2d+1) in the case most comparable to the multivariate interval censoring model studied here. While this rate reduces when d = 1 to the known rate n −1/3 (log n) 1/6 , it is slower than n −1/3 (log n) ν for some ν when d > 1 due to the use of entropy bounds involving convex hulls (see Deng and Fang [2009], Proposition A.1, page 66) which are not necessarily sharp. We expect that rates of the form n −1/3 (log n) ν with ν > 0 are possible in these models as well.

Scale mixtures of uniform densities on R +d
Pavlides [2008] and Pavlides and Wellner [2012] studied the family of scale mixtures of uniform densities of the following form: for some distribution function G on (0, ∞) d . (Note that we have used the notation d j=1 y j = |y| for y = (y 1 , . . . , y d ) ∈ R +d .) It is not difficult to see that such densities are decreasing in each coordinate and that they also satisfy for all u, v ∈ R +d with u ≤ v; here ∆ d denotes the d−dimensional difference operator. This is the same key property of distribution functions which results in (bracketing) entropies which depend on dimension only through a logarithmic term. The difference here is that the density functions f G need not be bounded, and even if the true density f 0 is in this class and satisfies f 0 (0) < ∞, then we do not yet know the behavior of the MLEf n at zero. In fact we conjecture that: (a) If f 0 (0) < ∞ and f 0 is a scale mixture of uniform densities on rectangles as in (5.1), thenf n (0) = O p ((log n) β ) for some β = β d > 0. (b) Under the same hypothesis as in (a) and the hypothesis that f 0 has support contained in a compact set, the MLE converges with respect to the Hellinger distance with a rate that is no worse than n −1/3 (log n) ξ where ξ = ξ d . Again Pavlides [2008] and Pavlides and Wellner [2012] establish asymptotic minimax lower bounds for estimation of f 0 (x 0 ) proving that no estimator can have a (local minimax) rate of convergence faster than n −1/3 in all dimensions. This is in sharp contrast to the class of block-decreasing densities on R +d studied by Pavlides [2012] and by Biau and Devroye [2003]: Pavlides [2012] shows that the local asymptotic minimax rate for estimation of f 0 (x 0 ) is no faster than n −1/(d+2) , while Biau and Devroye [2003] show that there exist (histogram type) estimatorsf n which satisfy E f0 f n − f 0 1 = O(n −1/(d+2) ).
Our goal here is to use this result to control bracketing numbers for F d with respect to two other measures C d and R d,σ defined as follows. Let C d denote the finite measure on [0, 1] d with density with respect to λ d given by For fixed σ > 0, let R d,σ denote the (probability) measure on (0, 1] d with density with respect to λ d given by Corollary 6.2. (a) For each d ≥ 2 it follows that for ǫ ≤ ǫ 0 (d) log N [ ] (2 d/2 ǫ, F d , L 2 (C d )) ǫ −1 (log(1/ǫ)) 2(d−1) .
Then it follows easily by direct calculation using Thus for σ ≤ σ 0 (d) we have h j −g j L2(R d,σ ) ≤ 2 d/2+1 ǫ by the arguments in (a). Hence the brackets [g j ,h j ] yield a collection of 2 d/2+1 ǫ− brackets for F d with respect to L 2 (R d,σ) , and this implies that (b) holds.