The bivariate current status model

For the univariate current status model and, more generally, the interval censoring model, distribution theory has been developed for the maximum likelihood estimator (MLE) and the smoothed maximum likelihood estimator (SMLE) of the unknown distribution function; see, e.g., [12], [7], [4], [5], [6], [10], [11] and [8]. For the bivariate current status and interval censoring models distribution theory of this type is still absent, and even the rate at which we can expect reasonable estimators to converge is unknown. We define a purely discrete plug-in estimator of the distribution function which locally converges at rate n^{1/3}, and derive its (normal) limit distribution. Unlike the MLE or SMLE, this estimator is not a proper distribution function. Since the estimator is purely discrete, it demonstrates that the n^{1/3} convergence rate is in principle attainable for the MLE, but whether this actually holds for the MLE is still an open problem. If the cube-root-n rate holds for the MLE, this would mean that the local 1-dimensional rate of the MLE continues to hold in dimension 2, a (perhaps) somewhat surprising result; our simulation results, however, are not in contradiction with this conjecture. We compare the behavior of the plug-in estimator with the behavior of the MLE on a sieve and the SMLE in a simulation study, which indicates that the plug-in estimator and the SMLE have a smaller variance but a larger bias than the sieved MLE. The SMLE is conjectured to have an n^{1/3} rate of convergence if bandwidths of order n^{-1/6} are used; under this assumption, we derive its (normal) limit distribution. Finally, we demonstrate the behavior of the MLE and SMLE for the bivariate interval censored data of [1], which have been discussed by many authors; see, e.g., [18], [3], [2] and [15].


Introduction
We consider the bivariate current status model, also called the bivariate interval censoring, case 1, model. This means that our observations consist of quadruples (T, U, ∆_1, ∆_2), where ∆_1 = 1_{X ≤ T} and ∆_2 = 1_{Y ≤ U}, and where the 'hidden' random vector (X, Y) is independent of the observation pair (T, U). We want to estimate the distribution function F_0 of (X, Y). A maximum likelihood estimator F̂_n of F_0 maximizes the log likelihood

ℓ(F) = Σ_{i=1}^{n} { ∆_{i1}∆_{i2} log F(T_i, U_i) + ∆_{i1}(1 − ∆_{i2}) log[F_1(T_i) − F(T_i, U_i)] + (1 − ∆_{i1})∆_{i2} log[F_2(U_i) − F(T_i, U_i)] + (1 − ∆_{i1})(1 − ∆_{i2}) log[1 − F_1(T_i) − F_2(U_i) + F(T_i, U_i)] }

over all bivariate distribution functions F, where F_1 and F_2 are the first and second marginal distribution functions of F, respectively. Another formulation is that F̂_n maximizes the integral of the same integrand with respect to the empirical measure P_n of the observations (T_i, U_i, ∆_{i1}, ∆_{i2}), i = 1, ..., n.
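For concreteness, the four-term log likelihood described above (one term for each position of (X, Y) relative to (T, U)) can be sketched numerically. The function name and callable interface below are our own illustrative choices, not from the paper, and the candidate df is assumed to assign positive probability to every observed configuration.

```python
import numpy as np

def loglik(F, F1, F2, T, U, d1, d2):
    """Log likelihood of a candidate bivariate df F (with marginals F1, F2)
    for current status data (T_i, U_i, Delta_i1, Delta_i2).  The four terms
    correspond to the four positions of (X, Y) relative to (T, U).
    All four cell probabilities are assumed strictly positive."""
    a = F(T, U)                    # P(X <= T, Y <= U)
    b = F1(T) - a                  # P(X <= T, Y >  U)
    c = F2(U) - a                  # P(X >  T, Y <= U)
    d = 1.0 - F1(T) - F2(U) + a    # P(X >  T, Y >  U)
    return np.sum(d1 * d2 * np.log(a)
                  + d1 * (1 - d2) * np.log(b)
                  + (1 - d1) * d2 * np.log(c)
                  + (1 - d1) * (1 - d2) * np.log(d))
```

For instance, under the independent uniform df F(t, u) = tu, a single observation at (0.5, 0.5) with ∆_1 = ∆_2 = 1 contributes log F(0.5, 0.5) = log 0.25.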
One looks for a solution of the form

F̂_n(t) = Σ_{j=1}^{m_n} p_j 1_{[τ_j, ∞)}(t),

where we denote by [τ_j, ∞) the infinite rectangle [t_{j1}, ∞) × [t_{j2}, ∞), with τ_j = (t_{j1}, t_{j2}). The (Fenchel or Kuhn-Tucker) duality conditions for the solution then state that the directional derivative of the log likelihood in the direction of an extra mass point at t is nonpositive, (1.2), for all t = (t_1, t_2) ∈ ℝ². We must have equality in (1.2) if t = τ_j, j = 1, ..., m_n, is a point of mass of the solution (i.e., the constraints are active!), where the rectangles [τ_j, ∞) are the generators of the solution. An R package, called 'MLEcens', is available for computing the MLE. The algorithm first determines the maximal intersection rectangles where the MLE can have mass via a preliminary reduction algorithm, described in [15], and next computes the mass of the MLE in these rectangles, using the support reduction algorithm of [12]. The R package uses as an example a data set, studied in [1], which is actually not of the current status type but consists of interval censoring, case 2, data. We shall discuss these data in section 5. The MLE for this data set is also discussed in section 7.3.3 of [18], who also refers to [3] and [2] for discussions of the computation of the MLE for this data set. The computation of the rectangles where the MLE puts mass is also treated in [17], where minimax lower bounds for the estimation rate and consistency of the MLE are derived as well.
There is an extensive discussion on where to put the mass, once one has determined rectangles which can have positive mass, see, e.g. [18], section 7.3, [3], [2] and [15]. The algorithm for computing these rectangles, proposed in [15], seems at present to be the fastest.
It is in our view somewhat doubtful whether all the energy spent on computing these maximal intersection rectangles, and the ensuing question of whether one should place the mass of the MLE at the lower left corner or the upper right corner of the rectangles, is really worth the effort. One could also specify in advance a set of points where one allows mass to be placed. In this way one obtains an MLE on a sieve, where the sieve consists of distributions having discrete mass at these points. The bottleneck in the computation of the MLE for the bivariate interval censoring problem is not the determination of the maximal intersection rectangles, but the computation of the mass the MLE puts on these rectangles, since there usually are very many; this phenomenon shows up in particular in simulations. As an example, simulating data from the distribution with density f(x, y) = x + y on the unit square, with a uniform observation distribution, we got for sample size n = 5000 about 5·10^5 possible rectangles where the masses could be placed, which is (at present) an almost prohibitive number if one wants to study the limit behavior of the MLE by simulation on an ordinary desktop computer or laptop. Ultimately, the discussion of these matters should in our view be settled by insights into the distribution theory of the MLE or the MLE on the chosen sieve.
In section 2 we show that a purely discrete plug-in estimator locally attains the n^{1/3} rate, and we determine its asymptotic (normal) distribution. In section 3 we study the local limit behavior of the smoothed maximum likelihood estimator and derive its asymptotic distribution, under the assumption that it can asymptotically be represented by an integral in the observation space, proceeding along similar lines as in the one-dimensional case (see, e.g., [5], [9] and [8]). Section 4 presents a simulation study, comparing the behavior of the MLE, the SMLE and the plug-in estimator; the results seem to be in accordance with Theorems 2.1 and 3.1 in sections 2 and 3, respectively. Section 5 extends the MLE and SMLE to a more general interval censoring setting and applies the methods to a data set in [1], which has been discussed by many authors, see, e.g., [18], [3], [2] and [15]. The paper ends with some concluding remarks on faster rates for the SMLE, which can be attained if one uses higher order kernels.

A purely discrete n^{1/3}-rate estimator for the bivariate current status model

Basically, the MLE for the 1-dimensional current status model is the monotone derivative of (the greatest convex minorant of) the cusum diagram

{(G_n(t), V_n(t)) : t ∈ I},   V_n(t) = ∫_{x ≤ t} δ dP_n(x, δ),

where I is the observation interval and G_n the empirical distribution function of the observations T_1, ..., T_n. So it can be considered to be a monotone version of the 'derivative' dV_n(t)/dG_n(t). Note that if we replace V_n and G_n by their deterministic equivalents V(t) = ∫_{x ≤ t} F_0(x) dG(x) and G, the derivative becomes

dV(t)/dG(t) = F_0(t),

so is indeed the object we want to estimate.
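The 1-dimensional construction just described, the monotone derivative of the greatest convex minorant of the cusum diagram, can be computed with the pool-adjacent-violators algorithm. The sketch below is our own minimal implementation, not the paper's code.

```python
import numpy as np

def cs_mle(T, delta):
    """MLE for 1-d current status data: the left derivative of the greatest
    convex minorant of the cusum diagram (G_n(t), V_n(t)), computed via the
    pool-adjacent-violators algorithm (isotonic regression of the sorted
    indicators on the sorted observation times)."""
    order = np.argsort(T)
    d = delta[order].astype(float)
    level, weight = [], []
    for x in d:
        level.append(x); weight.append(1.0)
        # pool adjacent blocks while monotonicity is violated
        while len(level) > 1 and level[-2] >= level[-1]:
            w = weight[-1] + weight[-2]
            level[-2] = (weight[-1] * level[-1] + weight[-2] * level[-2]) / w
            weight[-2] = w
            level.pop(); weight.pop()
    fhat = np.repeat(level, np.array(weight, dtype=int))
    return T[order], fhat
```

For observation times 1, 2, 3, 4 with indicators 0, 1, 0, 1, pooling the middle violation yields the estimate 0, 0.5, 0.5, 1 at the four observation points.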
For the simplest bivariate current status model, which is sometimes called the 'in-out' model, we only have the information whether the hidden variable is below and to the left of the observation point (T_i, U_i) or not. In this case we could also define

V_n(t, u) = ∫_{v ≤ t, w ≤ u} δ dP_n(v, w, δ),

where δ = 1 represents the situation that the hidden variable is below and to the left of (v, w). If the empirical observation distribution is again denoted by G_n, we this time want to estimate the 'derivative' dV_n(t, u)/dG_n(t, u), since, replacing V_n and G_n by their deterministic equivalents, the derivative becomes

dV(t, u)/dG(t, u) = F_0(t, u).

So we want to find a version of the derivative dV_n(t, u)/dG_n(t, u) under the (shape) restriction that it is a bivariate distribution function. However, a natural cusum diagram for this situation does not seem to exist. But we can define a 2-dimensional 'Fenchel process', incorporating the duality conditions for a solution of the optimization problem. Analogously to the 1-dimensional current status model, the Fenchel duality conditions for the isotonic least squares (LS) estimate, minimizing

Σ_{i=1}^{n} {δ_i − F(T_i, U_i)}²

over bivariate distribution functions F, state that one process has to lie above another, with equality if (t, u) is a point of mass of the solution; so we have to deal with a process which has to lie above a second process, with points of touch at the points of mass of F. Denoting temporarily the dominating process (2.2) by Q_n, we get that the isotonic least squares estimator (which no longer necessarily coincides with the MLE!) can (formally) be denoted by dQ_n(t, u)/dG_n(t, u). Note, however, that the function Q_n is not necessarily close to a convex or concave function, so here the analogy with 1-dimensional current status breaks down. But its 'derivative' w.r.t. dG_n must be a distribution function, which is analogous to the fact that the derivative of the convex minorant of the cusum diagram must be a distribution function in the 1-dimensional case.
For the real current status model the situation is more complicated, since we then have to deal with 4 regions instead of 2. From (1.2) we get a corresponding system of inequalities in t = (t_1, t_2), with equality if (t_1, t_2) is a point of mass of F̂_n. It has been conjectured that the MLE in the bivariate current status model converges locally at rate n^{1/3}, just as in the 1-dimensional current status model (with smooth underlying distribution functions); [17] proves a minimax lower bound of order n^{-1/3}. It would be somewhat surprising if the 1-dimensional rate were preserved in dimension 2, since one generally gets slower rates for density estimators as the dimension increases, and the estimation of the distribution function in the current status model is similar to a density estimation problem, as argued above.
To show that it is in principle possible to attain the local rate n 1/3 , we construct a purely discrete estimator, converging locally at rate n 1/3 . We restrict ourselves for simplicity to distributions with support [0, 1] 2 in the remainder of this section, but the generalization to more general rectangles is obvious. We have the following result, which is proved in the Appendix.
Theorem 2.1. Consider an interior point (t, u), and define the square A_n, with midpoint (t, u), by A_n = [t − h_n, t + h_n] × [u − h_n, u + h_n]. Suppose that the observation distribution G is twice continuously differentiable at (t, u) with a strictly positive density g(t, u) at (t, u), and that F_0 is twice continuously differentiable at (t, u). Moreover, suppose lim_{n→∞} n^{1/3} h_n exists and is strictly positive. Then the estimator

F̂_n(t, u) = ∫_{A_n} δ_1 δ_2 dP_n / ∫_{A_n} dG_n,

where G_n is the empirical distribution function of the observations (T_i, U_i) and P_n is the empirical distribution function of the observations (T_i, U_i, ∆_{i1}, ∆_{i2}), satisfies

n^{1/3} {F̂_n(t, u) − F_0(t, u)} → N(β, σ²),

with asymptotic mean β and variance σ² determined by the local behavior of F_0 and g at (t, u).

We now allow as possible points of mass the points (t_{n1}, u_{n1}), ..., (t_{n,m_n}, u_{n,m_n}), running through a rectangular grid, where the distances between the points on the x- and y-axis are of order n^{−1/3}, and define the estimate F̂_n at each point (t_{ni}, u_{ni}) as in Theorem 2.1. Next we define the masses p_{ni} at the points (t_{ni}, u_{ni}) by the equations

Σ_{j: t_{nj} ≤ t_{ni}, u_{nj} ≤ u_{ni}} p_{nj} = F̂_n(t_{ni}, u_{ni}),   i = 1, ..., m_n.
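The local-ratio character of the plug-in estimator can be sketched as follows: within the square A_n centred at (t, u), it is the fraction of observations whose indicators both equal one. The constant c and the function name are our own illustrative choices.

```python
import numpy as np

def plugin(t, u, T, U, d1, d2, c=1.0):
    """Purely discrete plug-in estimate of F_0(t, u): the average of
    delta_1 * delta_2 over observations falling in the square A_n with
    midpoint (t, u) and half-side c * n^{-1/3} (c is a tuning constant)."""
    h = c * len(T) ** (-1.0 / 3.0)
    inA = (np.abs(T - t) <= h) & (np.abs(U - u) <= h)
    if not inA.any():
        return np.nan            # empty neighbourhood: estimate undefined
    return np.mean(d1[inA] * d2[inA])
```

With three observations near (0.5, 0.5) whose products δ_1δ_2 are 1, 0, 1, the local average is 2/3.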
Note that the estimate F̂_n we obtain in this way is not necessarily a distribution function and that the masses p_{nj} can have negative values. Also note that we get roughly n^{1/3} × n^{1/3} equations in this way, which turned out to be solvable, although it is not clear beforehand that the system one gets is non-singular. Nevertheless, one can build up the system from the lower left to the upper right, filling in more and more values in the corresponding matrix, so it seems likely that in general the solution exists; this seems a point for further research. A picture of F̂_n, together with the MLE, computed on the sieve of points of mass of the plug-in estimator, is shown in Figure 1. The sieved MLE is a proper (discrete) distribution function, so all its masses are nonnegative.
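When the grid is a full rectangular grid, the system of equations defining the masses has an explicit solution by two-dimensional differencing, which may clarify why it turned out to be solvable. The sketch below assumes a complete grid with the plug-in values stored as a matrix; for irregular point sets the system need not have this form.

```python
import numpy as np

def grid_masses(Fhat):
    """Solve sum_{j: t_nj <= t_ni, u_nj <= u_ni} p_nj = Fhat[i1, i2] on a
    full rectangular grid by two-dimensional differencing:
    p[i, j] = F[i, j] - F[i-1, j] - F[i, j-1] + F[i-1, j-1].
    The resulting masses may be negative."""
    D = Fhat.astype(float).copy()
    D[1:, :] -= Fhat[:-1, :]       # difference in the first coordinate
    P = D.copy()
    P[:, 1:] -= D[:, :-1]          # difference in the second coordinate
    return P
```

The double cumulative sum of the returned masses reproduces the input values, so the triangular system is solved exactly on a full grid.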
Assuming the support of the distribution with distribution function F_0 to be [0, 1]², we treat the points near the upper and right boundary in a special way in computing F̂_n. If, for example, t_{ni} > 1 − h_n, where h_n ∼ n^{−1/6} and (t_{ni}, u_{ni}) is a point of the grid, we put the δ_{j1} corresponding to the contribution of observations (T_j, U_j) such that T_j ≥ 2 − t_{ni} − h_n equal to 1. We treat the second coordinate in the same way. This reduces the bias we would otherwise get, with an underestimation of the distribution function near the right and upper boundary. The idea to treat the points near the boundary in this way is inspired by, but different from, the reflection method proposed by [16]. For points near the left and lower boundary, we follow a similar procedure: if, for example, t < h_n, where h_n ∼ n^{−1/6} and (t, u) is a point of the grid, we put the δ_{j1} corresponding to the contribution of observations (T_j, U_j) such that T_j ≤ h_n − t equal to 0. The bias near the boundary is of order O(h_n) in this way, though, and does not attain the order O(h_n²) obtained at interior points.

The smoothed maximum likelihood estimator
Throughout this section we will assume for simplicity that the support of the distribution of the 'unobservables' is [0, 1]² and that we want to estimate the corresponding distribution function F_0 on [0, 1]²; the generalization to more general rectangles is obvious. Let K be a symmetric non-negative kernel, for example the triweight kernel

K(u) = (35/32)(1 − u²)³ 1_{[−1,1]}(u).

Moreover, let the integrated kernel IK be defined by

IK(t) = ∫_{u ≤ t} K(u) du.

We follow the approach for the 1-dimensional case, discussed in the references [4] to [9]. At an interior point (t, u), not too close to the boundary, the smoothed maximum likelihood estimator (SMLE) is just defined by (3.1):

F̂_n^{SM}(t, u) = ∫ IK((t − x)/h) IK((u − y)/h) dF̂_n(x, y),

where F̂_n is the MLE and h the bandwidth. To prevent the negative bias at the right and upper boundary of the support, we also perform a correction by extending the definition of IK near the upper and right boundary, and we define the SMLE near the boundary with the resulting boundary kernels, (3.2). This definition of the (integrated) boundary kernel is based on the reflection boundary correction method for density estimates, proposed in [16]. Note that the definitions (3.1) and (3.2) coincide at points at distance at least h from the upper and right boundary.

We next define a score function a in the hidden space; the corresponding scores in the observation space are obtained as conditional expectations of a given the observed data. With this notation, we want to solve the score equation (3.5), where a is a score function in the hidden space; differentiating (3.5) w.r.t. x and y, we obtain an equation which can be solved explicitly. The solution satisfies a relation which suggests that the asymptotic behavior of the SMLE is given by the representation (3.8), leading at interior points (t, u) to an asymptotic variance of order 1/(nh²) and, assuming max(t, u) ≤ 1 − h, a bias of order h². The SMLE is compared with the MLE in Figure 2. Using the assumption that F̂_n^{SM} has the asymptotic representation (3.8), we get the following result.

Theorem 3.1. Let h = h_n ∼ n^{−1/6}. Then, at interior points (t, u),

n^{1/3} {F̂_n^{SM}(t, u) − F_0(t, u)} → N(β, σ²),

where N(β, σ²) is a normal distribution with first moment β and variance σ² given by the asymptotic bias and variance referred to above.

Remark 3.1.
Note that choosing h_n of order n^{−1/6} is the asymptotically optimal choice (modulo constants), since the variance is of order 1/(nh_n²) and the bias of order h_n², unless the bias is of order o(h_n²) (as happens when F_0 is the uniform distribution function on [0, 1]²). Also note that the bias term caused by the interaction of the observation distribution G and the distribution function F_0, which entered into the bias term of Theorem 2.1, plays no role here.
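At interior points the SMLE can be sketched directly from its definition as an integral of the product kernel with respect to the masses of the MLE; the integrated triweight kernel has an elementary closed form. Boundary correction is omitted in this sketch, and the function names are ours.

```python
import numpy as np

def IK(t):
    """Integrated triweight kernel IK(t) = int_{-1}^{t} (35/32)(1-u^2)^3 du,
    with IK(t) = 0 for t <= -1 and IK(t) = 1 for t >= 1."""
    t = np.clip(t, -1.0, 1.0)
    return (35.0 / 32.0) * (t - t**3 + 0.6 * t**5 - t**7 / 7.0 + 16.0 / 35.0)

def smle(t, u, pts, p, h):
    """Interior-point SMLE: integrate the product kernel against the MLE
    masses p located at the points pts (an m x 2 array), with the same
    bandwidth h in both directions."""
    x, y = pts[:, 0], pts[:, 1]
    return np.sum(p * IK((t - x) / h) * IK((u - y) / h))
```

With all mass at a single point, the SMLE is simply the product of two integrated-kernel factors: it equals 1/4 at the mass point itself and 1 once both coordinates exceed the mass point by at least h.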

A simulation study
In order to compare the behavior of the three estimators, we took 1000 samples from the distribution with distribution function F_0(x, y) = ½xy(x + y), (x, y) ∈ [0, 1]², and generated bivariate current status data from this with respect to the uniform observation distribution on [0, 1]². The sample sizes taken were n = 100, 500, 1000 and 5000, respectively. So we have observations (T_i, U_i) from the uniform distribution on [0, 1]², and for each such pair we get, from the corresponding pair (X_i, Y_i), drawn from F_0 independently of (T_i, U_i), the indicators ∆_{i1} = 1_{X_i ≤ T_i} and ∆_{i2} = 1_{Y_i ≤ U_i}. Since simulations for sample size 5000 with the 'real' MLE, based on the maximal intersection rectangles, would have taken prohibitively long, we instead computed the MLE on a sieve, constructed in the following way. For each sample of size n, we distributed m_n = [n^{2/3}] points on the unit square, by letting their x- and y-coordinates be multiples of n^{−1/3} and permuting these coordinates randomly, according to the uniform distribution on permutations. Here [x] denotes the largest integer ≤ x.
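The data-generating step of this simulation can be sketched as follows, using the mixture representation x + y = ½(2x·1) + ½(1·2y) of the density f(x, y) = x + y to sample the hidden pair (X, Y); the generator interface is an illustrative choice.

```python
import numpy as np

def simulate(n, rng):
    """Generate bivariate current status data of size n from the density
    f(x, y) = x + y on the unit square (so F_0(x, y) = xy(x + y)/2),
    with uniform observation times on [0, 1]^2.  Sampling of (X, Y)
    uses the mixture x + y = (2x)/2 + (2y)/2; a coordinate with density
    2x is drawn as the square root of a uniform variable."""
    mix = rng.random(n) < 0.5
    X = np.where(mix, np.sqrt(rng.random(n)), rng.random(n))
    Y = np.where(mix, rng.random(n), np.sqrt(rng.random(n)))
    T, U = rng.random(n), rng.random(n)
    d1, d2 = (X <= T).astype(int), (Y <= U).astype(int)
    return T, U, d1, d2
```

As a sanity check, P(∆_1 = 1) = ∫ (t²/2 + t/2) dt = 5/12 and P(∆_1 = ∆_2 = 1) = ∫∫ F_0(t, u) dt du = 1/6, which the empirical frequencies approximate for large n.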
So we start with order n 2/3 points on which the sieved MLE can place its mass, to which we add the vertices of the unit square to ensure a finite log likelihood, and compute the MLE which only is allowed to have mass at these points. This set of points is shown in Figure 3, corresponding to sample size n = 1000, and the reduced set of points on which the MLE actually puts positive mass is also shown in this picture.
The rather different set of points on which the plug-in estimate puts its mass is shown in Figure 4. In fact, inspection of the sets of points on which the MLE based on the maximal intersection rectangles puts its mass shows that such sets are rather similar to Figure 3b and not similar to Figure 4. Because of the rather irregular structure of sets like that in Figure 3b, the bias of the MLE is reduced relative to the plug-in estimator, which is constant on squares with sides of order n^{−1/3}. We note, however, that the number of points in Figure 3a is the same as in Figure 4 (the number is 121).
We took the bandwidth h n for the SMLE equal to n −1/6 and we also used this as binwidth for the plug-in estimator. Table 1a shows that there is no indication that n 1/3 times the standard deviation of the MLE is increasing with sample size, and Table 1b suggests that the bias times n 1/3 is vanishing (as is also true in the one-dimensional case!), as n → ∞, so the hypothesis that the rate of convergence of the MLE is n 1/3 is not contradicted.
However, if the rate of the MLE is actually of order n^{1/3}(log n)^{−1/3}, for example, we can probably not detect this in the present way. The theory for the SMLE and plug-in estimators seems to be confirmed by the simulations, in particular at the points (0.4, 0.6) and (0.6, 0.6), where the boundary effects are not yet active. Note that if, for example, n = 500, the bandwidth for the SMLE equals 500^{−1/6} ≈ 0.3549537, so the boundary kernel starts becoming active for the points (0.2, 0.6) and (0.8, 0.6). This is still true for sample size n = 5000, where the bandwidth is 5000^{−1/6} ≈ 0.2418271. It is clear from Table 1a that the MLE has a bigger variance than the other two estimators, but on the other hand the bias of the MLE is usually smaller than that of the other estimators. If the second order partial derivatives of the distribution function F_0 vanish at (t, u), as happens, for example, with the uniform distribution, it is possible to achieve higher rates of convergence with the SMLE and the plug-in estimator.

Table 2: Estimated values of (a) n^{1/3} times the standard deviations and (b) n^{1/3} times the bias for three estimators of F_0(t, u) at a number of values of (t, u), where F_0(t, u) = tu. The values corresponding to ∞ are the asymptotic values, deduced from Theorems 2.1 and 3.1.

We demonstrate this for the uniform distribution function F_0, where we keep the bandwidth constant and equal to 0.4 for the SMLE and equal to 0.2 for the plug-in estimator (to avoid the boundary correction). In this case the bias times n^{1/3} tends to zero for the SMLE and the plug-in estimator, see Table 2b. If we keep the bandwidth constant, the variance of the SMLE and plug-in estimator should be of order n^{−1}, and these estimators should therefore actually attain a parametric rate of convergence in this case. We expect the MLE to again have rate n^{1/3} in this case, however, which is also (to a certain extent) suggested by Table 2.

More general bivariate interval censoring
The purpose of this section is to show that the MLE and SMLE, discussed in the preceding sections in the context of the bivariate current status model, can also be used for more general interval censored data. As mentioned earlier, the current status model is the simplest case of the interval censoring model. For the bivariate interval censoring, case 2, model, each coordinate of the hidden variable is observed at a pair of observation times, with indicators δ recording its position relative to these times; an MLE of F is then obtained by maximizing the corresponding likelihood over bivariate distribution functions F. The Fenchel duality conditions take the same form as before, with equality if (x, y) is a point of mass of dF. For computational purposes (but probably not for the development of distribution theory!) it is more convenient not to distinguish between the measures V_n^{(ij)} and just to introduce the rectangles to which the unobservable observations are known to belong, where we represent the (one-sided) unbounded rectangles by finite rectangles with upper or lower bounds outside the range of the observed data. In this set-up we simply have to maximize

Σ_i f_i log(H_i' p),

where p = (p_1, ..., p_m) is a vector of probability masses at possible points of mass (x_j, y_j), H_i is a vector of length m consisting of ones and zeros, whose jth component equals one if (x_j, y_j) belongs to the ith observation rectangle and is zero otherwise, and the f_i denote the multiplicities at the ith observation point. The masses p_j should be nonnegative and sum to 1. This optimization can easily be handled by using iterative quadratic minimization and the support reduction algorithm, documented in [12]; in fact, the treatment is completely analogous to the treatment of the Aspect experiment for quantum statistics, discussed there. The data of [1] are given in Table 3, where the rectangles to which the hidden observations are known to belong are denoted by [L_{i1}, R_{i1}] × [L_{i2}, R_{i2}], i = 1, ..., n. The frequencies of the hidden observations belonging to these rectangles are given in the 5th and 10th column.
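A simple, if slow, alternative to iterative quadratic minimization for maximizing Σ_i f_i log(H_i'p) is the classical self-consistency (EM) iteration; the sketch below is ours and is not the algorithm used in the paper. It assumes every observation rectangle contains at least one candidate mass point.

```python
import numpy as np

def npmle_em(H, f, steps=500):
    """Self-consistency (EM) iteration for maximizing sum_i f_i log(H_i' p)
    over probability vectors p, where H is the 0/1 incidence matrix of
    candidate mass points in observation rectangles and f holds the
    observation multiplicities."""
    n_obs, m = H.shape
    p = np.full(m, 1.0 / m)            # start from the uniform vector
    N = f.sum()
    for _ in range(steps):
        denom = H @ p                  # likelihood contribution H_i' p
        p = p * (H.T @ (f / denom)) / N
    return p
```

For two candidate points where the first lies in both observation rectangles and the second only in the second rectangle, the likelihood log p_1 + log 1 is maximized by putting all mass on the first point, and the iteration converges to that solution.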
There are 87 observation rectangles and the total sample size, taking the frequencies into account, is 204. The table is also given in [18], Table 7.1, p. 165, but there the rectangles are slightly changed from the data in [1] by lowering the left bounds by 1 if they are larger than zero. Since we do not see a pressing reason for doing that, we just give the data here as they were given by [1]. If the upper bound R_{ij} is unknown, we put R_{ij} = ∞, and if the lower bound L_{ij} is unknown, we put L_{ij} = −∞.

Table 3: The Betensky-Finkelstein data.

The behavior of the MLE shown in Figure 6 is due to the 'ridge' for the larger values of the second coordinate. The levels of both estimates are shown in Figure 7. It seems to us that the SMLE might be the more sensible estimate, also in view of the representational non-uniqueness of the MLE, which is somewhat 'washed out' by the SMLE.

Concluding remarks
In the preceding, three estimators for the bivariate current status model were studied: the maximum likelihood estimator (MLE), which in the simulation study was restricted to the MLE on a sieve; the smoothed maximum likelihood estimator (SMLE), obtained by integrating a kernel w.r.t. the masses of the MLE; and a purely discrete plug-in estimator. The SMLE and the plug-in estimator seem to attain the n^{1/3} rate, with asymptotically normal limit distributions. It might be somewhat surprising that the MLE and the SMLE have the same rate, whereas the natural rates in the one-dimensional case are n^{1/3} and n^{2/5}, respectively, see [9]. But in the bivariate case the variance is of order 1/(nh²) and the bias of order h², if a bandwidth h is taken in both directions and the kernel is of the usual symmetric and positive type. This makes the optimal choice of bandwidth of order n^{−1/6}, leading to a rate of order n^{1/3}.
We concentrated on compact support, but since we focused on local estimation, this restriction does not seem essential, except for the SMLE, since there the boundary played an important role. Further research on this matter seems needed.
It is also possible to attain the local n^{2/5} rate again in the bivariate case, but then one has to take recourse to higher order kernels K_1 with the property ∫ u² K_1(u) du = 0.
One also has to take bandwidths of order n^{−1/10} instead of order n^{−1/6} in this case, which makes the judicious choice of boundary kernels even more important. Moreover, one has to strengthen the conditions of the theorems to the existence of 4th derivatives. However, if one is willing to do that, it is easy to strengthen Theorem 3.1 by letting the kernel IK be based on such a fourth-order kernel K_1 instead of the triweight kernel K. But one loses the property that the resulting estimate is necessarily a distribution function, since the kernel K_1 is no longer positive. For the choice of higher order kernels, see, e.g., [14].
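One standard construction of such a kernel from the triweight K, given here purely as an illustration of the moment-correction idea, is K_1(u) = {(m_4 − m_2 u²)/(m_4 − m_2²)} K(u), with m_2 = 1/9 and m_4 = 1/33 the second and fourth moments of the triweight; K_1 then integrates to 1 while its second moment vanishes.

```python
import numpy as np

def triweight(u):
    """Triweight kernel K(u) = (35/32)(1 - u^2)^3 on [-1, 1]."""
    return np.where(np.abs(u) <= 1, (35.0 / 32.0) * (1 - u**2)**3, 0.0)

def fourth_order(u):
    """Fourth-order kernel K_1(u) = ((m4 - m2 u^2)/(m4 - m2^2)) K(u),
    where m2 = 1/9 and m4 = 1/33 are the moments of the triweight.
    By construction K_1 integrates to 1 and has vanishing second moment,
    at the price of taking negative values."""
    m2, m4 = 1.0 / 9.0, 1.0 / 33.0
    return (m4 - m2 * u**2) / (m4 - m2**2) * triweight(u)
```

A numerical integration over [−1, 1] confirms the two moment conditions and that K_1 is indeed partly negative.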