Nonparametric estimation of the lifetime and disease onset distributions for a survival-sacrifice model

Abstract: In carcinogenicity experiments with animals where the tumor is not palpable, it is common to observe only the time of death of the animal, the cause of death (the tumor or another, independent cause, such as sacrifice) and whether the tumor was present at the time of death; the last two indicator variables are evaluated after an autopsy. Defining the non-negative variables $T_1$ (time of tumor onset), $T_2$ (time of death from the tumor) and $C$ (time of death from an unrelated cause), we observe $(Y, \Delta_1, \Delta_2)$, where $Y = \min\{T_2, C\}$, $\Delta_1 = 1_{\{T_1 \le C\}}$ and $\Delta_2 = 1_{\{T_2 \le C\}}$. The random variables $T_1$ and $T_2$ are independent of $C$ and have a joint distribution such that $P(T_1 \le T_2) = 1$. Some authors call this model a "survival-sacrifice model". In spite of the long history of the model in the biometrics literature (since about 1982), basic properties of the real maximum likelihood estimator (MLE) were still unknown. LJP (1997) claimed that their weighted least squares estimator is more efficient than the MLE of $F_1$ and that the Kaplan-Meier estimator is more efficient than the MLE of $F_2$. However, we show that the MLE of $F_1$ was not computed correctly in LJP (1997), and that the (claimed) MLE of $F_1$ is even undefined in the case of active constraints. In our simulation study we used a primal-dual interior point algorithm to obtain the true MLE of $F_1$. The results show a better performance of the MLE of $F_1$ over the weighted least squares estimator of LJP (1997) at points where $F_1$ is close to $F_2$. Moreover, application to the model used in the simulation study of LJP (1997) showed smaller variances of the MLE estimates of the first and second moments of both $F_1$ and $F_2$, for sample sizes from 100 up to 5000, in comparison with the estimates based on the weighted least squares estimator for $F_1$ proposed in LJP (1997) and the Kaplan-Meier estimator for $F_2$. R scripts are provided for computing the estimates either with the primal-dual interior point method or by the EM algorithm.
We give necessary and sufficient conditions for the MLE (Theorem 3.1), as an element of a cone, where the number of generators of the cone increases quadratically with sample size. From this and a self-consistency equation, turned into a Volterra integral equation, we derive the consistency of the MLE (Theorem 4.1). We conjecture that (under some natural conditions) the methods used to prove consistency can be extended to show that the MLE is $\sqrt{n}$-consistent for $F_2$ and converges at rate $n^{1/3}$ for $F_1$, but this has not yet been proved.


Introduction
Suppose that $(T_1, T_2)$ is a pair of nonnegative random variables with joint distribution $F$ concentrated on $\{(t_1, t_2) : 0 \le t_1 \le t_2 < \infty\}$. Here we think of $T_1$ as the "time of disease onset" and $T_2$ as the "time of death from the disease", and let $F_1$ and $F_2$ denote their respective (marginal) distribution functions. Suppose that $C$ is a nonnegative random variable with distribution function $G$ which is independent of $(T_1, T_2)$. We think of $C$ as the "time of death from an unrelated cause". Furthermore, we can only observe the triple

$$X \equiv (Y, \Delta_1, \Delta_2), \qquad Y = C \wedge T_2, \quad \Delta_1 = 1_{\{T_1 \le C\}}, \quad \Delta_2 = 1_{\{T_2 \le C\}}. \tag{1.1}$$

If $G$ has density $g$ with respect to Lebesgue measure, and the marginal distribution function $F_2$ of $T_2$ has density $f_2$ with respect to Lebesgue measure, then it is easily seen that the joint density of $X$ with respect to the product of Lebesgue measure on $\mathbb{R}_+$ and counting measure on $D \equiv \{(0,0), (1,0), (1,1)\}$ is given by

$$p(y, \delta_1, \delta_2) = \begin{cases} (1 - F_1(y))g(y), & \text{if } (\delta_1, \delta_2) = (0,0), \\ (F_1(y) - F_2(y))g(y), & \text{if } (\delta_1, \delta_2) = (1,0), \\ (1 - G(y))f_2(y), & \text{if } (\delta_1, \delta_2) = (1,1). \end{cases} \tag{1.2}$$

Let $P = P_{F,G}$ denote the corresponding probability measure on $\mathbb{R}_+ \times D$. Note that the marginal density of $(Y, \Delta_2)$ is given by

$$p_2(y, \delta_2) = \begin{cases} (1 - F_2(y))g(y), & \text{if } \delta_2 = 0, \\ (1 - G(y))f_2(y), & \text{if } \delta_2 = 1, \end{cases}$$

which is exactly that of random right censoring of $T_2 \sim F_2$ by $C \sim G$. On the other hand, the marginal density of $(Y, \Delta_1)$ is

$$p_1(y, \delta_1) = \begin{cases} (1 - F_1(y))g(y), & \text{if } \delta_1 = 0, \\ (F_1(y) - F_2(y))g(y) + (1 - G(y))f_2(y), & \text{if } \delta_1 = 1, \end{cases}$$

which is not the same as "current status data" for $T_1 \sim F_1$ with observation time $C \sim G$, since the $\delta_1 = 1$ component of this density only reduces to $F_1(y)g(y)$ if $F_2$ puts all its mass at $+\infty$ (corresponding to a non-lethal disease).
While the resulting "survival-sacrifice model" is very much related to right-censored data via its marginal distribution $P_2$, and to current status data via its marginal distribution $P_1$, the model as a whole is more complicated than either of these simpler models, especially because of the restriction $F_1 \le_s F_2$ which results from $T_1 \le T_2$ a.s. $F$. Our goal is to construct nonparametric estimators of $F_1$ and $F_2$ based on observation of $X_1, \ldots, X_n$ i.i.d. as $X \equiv (C \wedge T_2, \Delta_1, \Delta_2)$. This model has been proposed for experiments involving the study of onset of and mortality from undetectable irreversible diseases (e.g. occult tumors). The model is reasonable when the disease is moderately lethal but incurable and when the cause of death is known. It has a long history in the biometrics literature: see e.g. [4], [14], [19] and, more recently, [20].
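The observation scheme (1.1) is easy to simulate. The following sketch uses illustrative parametric choices (exponential onset time, an independent exponential residual lifetime so that $T_1 \le T_2$ holds automatically, and an independent exponential censoring time) that are our own assumptions, not taken from the paper:

```python
import random

def simulate_survival_sacrifice(n, seed=0):
    """Draw n observations (Y, Delta1, Delta2) from the survival-sacrifice
    observation scheme Y = min(T2, C), Delta1 = 1{T1 <= C}, Delta2 = 1{T2 <= C}.

    Illustrative (assumed) distributions: T1 ~ Exp(1), T2 = T1 + Exp(1)
    (which forces T1 <= T2 a.s.), and C ~ Exp(1/2) independent of (T1, T2)."""
    rng = random.Random(seed)
    obs = []
    for _ in range(n):
        t1 = rng.expovariate(1.0)        # time of disease onset
        t2 = t1 + rng.expovariate(1.0)   # time of death from the disease
        c = rng.expovariate(0.5)         # independent censoring/sacrifice time
        y = min(t2, c)
        obs.append((y, int(t1 <= c), int(t2 <= c)))
    return obs
```

Note that only the patterns $(\delta_1, \delta_2) \in \{(0,0), (1,0), (1,1)\}$ can occur, since $T_2 \le C$ together with $T_1 \le T_2$ forces $T_1 \le C$.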
The parameter space can be taken to be the collection of pairs $(F_1, F_2)$ of distribution functions satisfying $F_1 <_s F_2$, where $F_1 <_s F_2$ means that $F_1(x) \ge F_2(x)$ for every $x \in \mathbb{R}$ and $F_1(x) > F_2(x)$ for some $x \in \mathbb{R}$. The maximum likelihood estimation method for this problem is based on maximization of the log-likelihood function

$$\sum_{i=1}^n \Big\{ (1 - \Delta_{1i}) \log\big(1 - F_1(Y_i)\big) + \Delta_{1i}(1 - \Delta_{2i}) \log\big(F_1(Y_i) - F_2(Y_i)\big) + \Delta_{2i} \log f_2(Y_i) \Big\} + K(g, G),$$

where $f_2(x) \equiv F_2(x) - F_2(x-)$ and $K(g, G)$ is a term involving only the distribution $G$ of $C$. [14] studied nonparametric estimation of $S_1 = 1 - F_1$ and $S_2 = 1 - F_2$, but their work is restricted to the case where $R(t) = S_1(t)/S_2(t)$ is non-increasing, an assumption that may not be reasonable, for example, for progressive diseases whose incidence is concentrated in the early or middle part of the life span. [19] proposed an EM algorithm for the joint estimation of $F_1$ and $F_2$ which converges very slowly to the MLE of $(F_1, F_2)$ (provided the support of the initial estimator contains the support of the MLE). We discuss an implementation of this method in Section 5.2, and give an R script for this implementation in [10].
A more efficient way of computing the full MLE is given by the primal-dual interior point algorithm, as, for example, also discussed in [15]. The latter authors first transform their model before applying the algorithm, but such a transformation is not needed in our present approach for the model considered here. Our implementation is discussed in Section 5.1, and an R script for the implementation is also given in [10].
However, Theorem 1.10 in [2] is applicable to a real convex function $\Phi$ defined on $\mathbb{R}$, while in the application above the function $\Phi$ is in fact defined on $\mathbb{R}^2$, since the value of $F_2$ is not supposed to be constant. It should be mentioned here that, although the Kaplan-Meier estimator $\hat F_{2,KM}$ is uniquely defined, except possibly at times exceeding the largest observation, the pseudo MLE $\hat F_{1,n}$ is uniquely defined only over certain data-determined intervals. Specifically, $\hat F_{1,n}$ is always uniquely defined at the observed $C_i$'s, i.e., the observations for which $\Delta_{2,i} = 0$. [6] established uniform consistency of the pseudo-likelihood estimator $\hat F^{\mathrm{pseudo}}_{1,n}$, obtained by maximizing (1.3) over the polyhedron $\{x : 0 \le x_1 \le \cdots \le x_n \le 1\}$, under the assumption that $F_1$ and $F_2$ are continuous distribution functions and that $P_{F_1} \prec P_G$. Recently, [17] established a global rate of convergence result for the pseudo-likelihood estimator of $F_1$ for a slightly different model under reasonable assumptions concerning a preliminary estimator $F_{2,n}$ of $F_2$. For the weighted least squares estimator proposed by [20], [7] established uniform consistency under the same assumptions used to study the pseudo-likelihood estimator in [6]. [7] also established an asymptotic minimax lower bound for estimation of $F_1(t_0)$ under the following assumptions: (ii) $F_1$ and $G$ are continuously differentiable at $t_0$ with derivatives $f_1(t_0) > 0$ and $g(t_0) > 0$.
Specifically, [7] used Lemma 4.1 of Groeneboom (1996) (see also Theorem 6.1 of [11]) to show the minimax lower bound

$$\liminf_{n \to \infty} n^{1/3} \inf_{\hat F_{1,n}} \max\Big\{ E_{n,p_0} \big|\hat F_{1,n}(t_0) - F_1(t_0)\big|,\; E_{n,p_n} \big|\hat F_{1,n}(t_0) - F_{1,n}(t_0)\big| \Big\} > 0. \tag{1.4}$$

Here $p_0(y, \delta_1, \delta_2; F_1, F_2, G)$ is given by (1.2), and $p_n$ is a local perturbation of $p_0$ in which the dependence on $n$ is just through $F_1 = F_{1,n}$ and not via $F_2$ or $G$. [7] also showed that the WLS estimator achieves this bound, in the sense that (under slightly stronger conditions than those imposed in the minimax lower bound) it converges at the rate $n^{1/3}$. Thus we see that virtually all of the progress to date concerning estimation of $F_1$ and $F_2$ in the model given by (1.2) concerns either the pseudo MLE of $F_1$ or the weighted least squares estimator $\hat F_{1,n}$.
The current challenge is to understand the behavior of the (real, joint) MLE. Here we establish consistency of the sequence of MLEs $\{(\hat F_{n,1}, \hat F_{n,2}) : n \ge 1\}$. We also give some new lower bounds for smooth functionals of the distribution functions, which partially explain why we expect the full MLE to perform better than the pseudo MLE or the WLS estimator of the distribution function $F_1$, and also better than the Kaplan-Meier estimator of the distribution function $F_2$ based on the marginal likelihood, with the advantage of the real MLE appearing mostly when the distribution functions $F_1$ and $F_2$ are nearly equal.
The plan of the paper is as follows. In Section 2 we explain the difficulties in the approach of [20] (LJP (1997)). In Section 3 we give the characterization of the (real) MLE as the solution of a maximization problem on a cone and derive from this a self-consistency equation in Corollary 3.3, which is used to prove consistency in Section 4. In Section 5 we describe the primal-dual interior point method and the EM algorithm for computing the MLE; in Subsection 5.3 we compare the behavior of the two algorithms. These results can be reproduced by running the R scripts for the algorithms in [10]. Results and examples for the estimation of smooth functionals are given in Section 6. Quantile estimation is considered in Section 7, and we end with some concluding remarks in Section 8.

A weighted least squares estimator
The main point of the present section is that the MLE is not the reweighted least squares estimator introduced by [20] (LJP (1997)), because that estimator is not well-defined. Their (non-iterative) weighted least squares estimator, claimed to be superior to the MLE, can in fact coincide with the MLE. One might perhaps hope that maximizing the likelihood while ignoring the constraint $F_1 \ge F_2$ (if it is not explicitly forced by the likelihood), and "resetting" the values $F_1(t)$ and $F_2(t)$ whenever a violation $F_1(t) < F_2(t)$ is encountered, will produce the maximum likelihood estimator in the end. But it is well known that this method does not work: it leads to a procedure that may "hang" somewhere far from the maximizing value, since an algorithm of this type tries to move the estimators to values they cannot attain and resets them to the same (non-maximizing) values each time the active constraints would be violated. In fact, it is quite easy to give numerical examples of this behavior in the present model, where such a method reaches a stationary point that does not correspond to the MLE, even if we start the procedure with positive masses at each point, as recommended in [20]. Using Lagrange multipliers, or the methods of Section 3 below, is the only way out here.
Indeed, a possibility for estimation of $F_1$ is to calculate a weighted least squares estimator as suggested by [20]. Writing $S_1 = 1 - F_1$ and $S_2 = 1 - F_2$, in terms of populations, $R(c) = S_1(c)/S_2(c)$ is the proportion of subjects in the population alive at time $c$ who are disease free (i.e., $1 - R(c)$ is the prevalence function at time $c$), and it can be written as

$$R(c) = P(T_1 > c \mid T_2 > c) = E\big[1 - \Delta_1 \mid C = c,\, T_2 > c\big].$$

So it is possible to rewrite $S_1(c) = R(c)S_2(c)$. Estimating $S_1$ can then be viewed as a regression of $S_2(C)(1 - \Delta_1)$ on the observed $C_i$'s under the constraint of monotonicity. If we substitute $S_2$ by its Kaplan-Meier estimator $\hat S_{2,n} = \hat S_{2,KM}$, we automatically have an estimator for $S_1$, obtained by minimizing

$$\sum_{i : \Delta_{2,i} = 0} \big\{ \hat S_{2,KM}(C_i)(1 - \Delta_{1,i}) - S_1(C_i) \big\}^2$$

under the constraint that $S_1$ is nonincreasing. This minimization problem can be solved by using results from the theory of isotonic regression (see [2]). However, the variance of the response $\hat S_{2,KM}(C_i)(1 - \Delta_{1,i})$ is not constant. We may then use a weighted least squares estimator with weights $w_i$, $i = 1, \ldots, n$, inversely proportional to the variance. This expression for the variance involves the unknown value $S_1(C_i)$ that we want to estimate, suggesting the use of an iterative procedure, in which each step produces a new weighted least squares estimate. If we use the weights $w_j = (1 - \Delta_{2,(j)})/\hat S^2_{2,KM}(Y_{(j)})$ instead, we have an estimator with a closed form that can be calculated as the left derivative of the least concave majorant of the cumulative sum diagram $(0,0), (W_1, G_1), \ldots, (W_n, G_n)$, where $W_i = \sum_{j=1}^i w_j$ and $G_i$ is the corresponding weighted cumulative sum of the responses. [20] claim that the estimator above is more efficient than the MLE of $F_1$, which would be calculated, according to them, as follows. Let $S^0_{1n}$ and $S^0_{2n}$ be initial estimators of $S_1$ and $S_2$, and let $k = 0$. For a given estimator $S^k_{1n}$, use the EM algorithm to compute the estimator $S^{k+1}_{2n}$, adjusting the estimate so that $S^{k+1}_{2n} \ge S^k_{1n}$ in case this restriction is violated.
For a given $S^{k+1}_{2n}$, use the weighted least squares estimator proposed by [20] to obtain a new estimator $S^{k+1}_{1n}$, which is adjusted so that $S^{k+1}_{1n} \le S^{k+1}_{2n}$ if this restriction is violated. By repeating this joint algorithm, the authors claim that the actual MLE of $S_1$ and $S_2$ is obtained. One would indeed expect this to be true in situations like this, but the weighted least squares estimator that they introduced and claim to represent the MLE is in fact undefined at crucial points. The trouble with this approach is that the Lagrangian terms in deriving the "score equations" on page 544 of their paper are completely neglected. From these equations they derive a formula (their (2.4)) for the survival function $S_{1n} = 1 - F_{1n}$ on an interval $(t_j, t_{j+1}]$ on which $F_{1n}$ is constant, where $t_j$ and $t_{j+1}$ are points of jump of $F_{1n}$ (there is also a definition problem here, since the point $t_j$ does not belong to the interval $(t_j, t_{j+1}]$). However, at points $t$ where the constraint $F_{1n}(t) \ge F_{2n}(t)$ is active (these are precisely the points where we get a Lagrange multiplier $\lambda_i > 0$ in the primal-dual interior point algorithm described in Section 5.1), the expression is not defined, since we get zeros in the denominators! As an example, in one of our simulation studies we found values of the MLE of the pair $(F_1, F_2)$ between the points $t_j = 1.141807$ and $t_{j+1} = 1.567906$ for which the constraint is active; as noted above, the expression on the right-hand side of their (2.4) is then undefined, because the denominators become zero when the constraint is active. We close this section by showing the Kaplan-Meier estimator and the (non-iterative) weighted least squares estimator introduced in [20] for a real data set studied by [4] and [19], representing the ages at death (in days) of 109 female RFM mice (Table 1). The disease of interest is reticulum cell sarcoma (RCS).
These mice formed the control group in a survival experiment to study the effects of prepubertal ovariectomy in mice given 300 R of X-rays. The smoother picture for the estimate of $F_2$ in Figure 1 is a consequence of the fact that the estimators of $F_1$ have an $n^{-1/3}$ rate of convergence.
The value of the log-likelihood for this data set at the estimates obtained through the algorithm of [20] described above is smaller than its value at the true MLE of $F_1$ and $F_2$, obtained through the primal-dual interior point algorithm ($-262.7964$ and $-262.5468$, respectively).
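The isotonic-regression step in the weighted least squares construction above can be carried out with the pool-adjacent-violators algorithm (PAVA); for a nonincreasing fit, the weighted solution coincides with the left derivative of the least concave majorant of the cumulative sum diagram. A minimal sketch (our own illustration, not the paper's R code):

```python
def pava_nonincreasing(values, weights):
    """Weighted least squares fit of a nonincreasing sequence to `values`
    by pooling adjacent violators; returns the fitted sequence."""
    # each block holds [weighted mean, total weight, number of points]
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # for a nonincreasing fit, each block mean must not exceed the previous one
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w12 = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / w12, w12, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit
```

For instance, `pava_nonincreasing([1.0, 0.8, 0.9, 0.5], [1.0, 1.0, 1.0, 1.0])` pools the two middle violators into their average, giving the nonincreasing fit `[1.0, 0.85, 0.85, 0.5]`.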

Characterization of the MLE
In this section we derive the so-called Fenchel duality conditions for the (real) MLE. Our problem is to maximize the function

$$\sum_{i=1}^n \Big\{ (1 - \delta_{1,(i)}) \log(1 - x_i) + \delta_{1,(i)}(1 - \delta_{2,(i)}) \log(x_i - y_i) + \delta_{2,(i)} \log(y_i - y_{i-1}) \Big\}, \tag{3.1}$$

where $\delta_{1,(i)}$ and $\delta_{2,(i)}$ correspond to the $i$th order statistic of the observations, $x_i$ and $y_i$ play the roles of $F_1$ and $F_2$ at the $i$th ordered observation time, and the vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ have to satisfy the order restrictions (nondecreasing $x$ and $y$, with $y_i \le x_i$). To allow for ties, let $m$ be the number of distinct observation times, with frequencies $f_{ij}$ satisfying $\sum_{i=1}^m \sum_{j=1}^3 f_{ij} = n$, where $f_{i1}$ is the frequency of the observations at the $i$th point where both deltas are zero, $f_{i2}$ the frequency of observations where the first delta equals 1 and the second 0, and $f_{i3}$ the frequency of observations where both deltas are equal to 1. We define $\varphi(z)$, for $z = (x_1, \ldots, x_m, y_1, \ldots, y_m)$, by

$$\varphi(z) = -\sum_{i=1}^m \Big\{ f_{i1} \log(1 - x_i) + f_{i2} \log(x_i - y_i) + f_{i3} \log(y_i - y_{i-1}) \Big\}, \qquad y_0 \equiv 0. \tag{3.2}$$

Note that (3.2) is the same as (3.1), but with a changed sign, so we have a minimization problem instead of a maximization problem.
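With $x_i$ and $y_i$ the values of $F_1$ and $F_2$ at the distinct observation times, the criterion (3.2) can be sketched in code as follows; the boundary conventions and feasibility checks are our own choices:

```python
import math

def phi(x, y, freq):
    """Sketch of the criterion (3.2): x[i] and y[i] are the values of F1 and F2
    at the i-th distinct observation time, and freq[i] = (f_i1, f_i2, f_i3) are
    the frequencies of the three observation types at that time.
    Returns +inf outside the feasible region (our own boundary convention)."""
    total, x_prev, y_prev = 0.0, 0.0, 0.0
    for xi, yi, (f1, f2, f3) in zip(x, y, freq):
        # order restrictions: x and y nondecreasing, y_i <= x_i <= 1
        if not (x_prev <= xi <= 1.0 and y_prev <= yi <= xi):
            return math.inf
        for f, arg in ((f1, 1.0 - xi), (f2, xi - yi), (f3, yi - y_prev)):
            if f > 0:
                if arg <= 0.0:
                    return math.inf
                total -= f * math.log(arg)
        x_prev, y_prev = xi, yi
    return total
```

For a single time point with one observation of each type, `phi([0.5], [0.2], [(1, 1, 1)])` returns $-\{\log 0.5 + \log 0.3 + \log 0.2\} \approx 3.5066$, while infeasible inputs (e.g. $y_i > x_i$, or nonmonotone $x$) return $+\infty$.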
We are going to reduce the minimization problem to the problem of minimizing on a cone $\mathcal{C}$. Before being able to do so, we have to remove the restriction $x_m \le 1$; we do this by introducing Lagrange terms into the criterion.
This gives a criterion $\varphi_\lambda$, where $\lambda = (\lambda_1, \lambda_2)$, and where the Lagrange multipliers $\lambda_1$ and $\lambda_2$ are nonnegative. If the constraints at the right end are not active, the Lagrange multipliers are equal to zero. At the solution the stationarity conditions hold, with $\varphi$ defined by (3.2). Note that the Lagrange multipliers, as just defined, are zero if the constraints are not active. Furthermore: (ii) if (i) occurs and $f_{k2} > 0$, we set $y_k = 0$.
We disregard the observations with index i for i < k in condition (i) from the minimization problem. Note that (i) and (ii) make the criterion as small as possible without influencing the remaining minimization problem.
We define x 0 = y 0 = 0, and, apart from the definitions in Definition 3.2, we define and where the nonnegative Lagrange multipliers λ 1 and λ 2 are defined in Definition 3.2. The following lemma from [8] will be used.
Lemma 3.1. The vector $\hat z \in \mathcal{C}$ minimizes the convex function $\varphi$ over the cone $\mathcal{C}$ if and only if

$$\langle z, \nabla \varphi(\hat z) \rangle \ge 0, \quad \text{for all } z \in \mathcal{C}, \tag{3.5}$$

and

$$\langle \hat z, \nabla \varphi(\hat z) \rangle = 0. \tag{3.6}$$

A solution of the minimization problem for $\varphi$ is now characterized in the following theorem.

Theorem 3.1. Let the function $\varphi$ be defined by (3.2) and the function $\varphi_\lambda$ by (3.3). Then the vector $\hat z = (\hat x, \hat y)$ minimizes $\varphi(z)$ over the set of feasible vectors if and only if the inequality conditions (i) and the equality condition (ii) hold. Equality holds in (3.8) if $\hat x_i$ and $\hat y_k$ are points of increase of $\hat x$ and $\hat y$, respectively, and if $\hat x_i > \hat y_k$; here $\hat\lambda$ is defined as in Definition 3.2 for the solution $\hat z$.
Proof. By Definition 3.2 we reduced the minimization problem to the minimization problem on the cone C of Definition 3.1. We now apply Lemma 3.1. Since the inequalities (3.5) only have to be checked on the generators of the cone C, we can use the generators of Definition 3.1. This gives the inequality conditions of part (i). The equality condition (ii) is (3.6) in another notation.

Corollary 3.1. Let conditions (i) and (ii) of Definition 3.2 be satisfied, and let the points where $x_i = y_i = 0$ be excluded from the minimization problem. Let the function $\varphi_\lambda$ be defined by (3.3). Furthermore, let the nonnegative Lagrange multipliers $\hat\lambda_1$ and $\hat\lambda_2$ be defined as in Definition 3.2 for the solution $\hat z$. Then the vector $\hat z = (\hat x, \hat y)$ minimizes $\varphi(z)$ iff the following conditions are satisfied:

(3.9)–(3.12). Moreover, equality holds in (3.10) if $(i,k)$ corresponds to an $\hat x_i$ and an $\hat y_k$ such that $\hat x_i$ and $\hat y_k$ are points of increase of $\hat x$ and $\hat y$, respectively, and if $\hat x_i > \hat y_k$.

Proof. Parts (a) and (b) are straightforward consequences of Theorem 3.1. As to part (c), the expression on the left of (3.12) can be rewritten, and the result then follows from part (b).
The convergence criterion of Corollary 3.1 was used to check the convergence of the EM algorithm. Since the EM algorithm converges so slowly, differences between successive iterations are not the right criterion for checking convergence. The primal-dual interior point algorithm uses another criterion, directly based on the gradient with Lagrange multipliers (see Section 5.1), but it was independently checked that its output indeed satisfies the conditions of Corollary 3.1 with high accuracy ($10^{-10}$).
We also have:

Corollary 3.2. Let the conditions of Corollary 3.1 be satisfied, and let the nonnegative Lagrange multipliers $\hat\lambda_1$ and $\hat\lambda_2$ also be defined as in Corollary 3.1. Let $\hat z = (\hat x, \hat y)$ minimize $\varphi(z)$. Then relations (3.13) and (3.14) hold if $(i,k)$, $i \le k$, corresponds to an $\hat x_i$ and an $\hat y_k$ such that $\hat x_i$ and $\hat y_k$ are points of increase of $\hat x$ and $\hat y$, respectively, and if $\hat x_i > \hat y_k$, that is, if we are in the interior of the cone.
Proof. The equality relations in (3.10), together with part (b) of Corollary 3.1, imply the first relation if $(i,k)$ corresponds to an $\hat x_i$ and an $\hat y_k$ such that $\hat x_i$ and $\hat y_k$ are points of increase of $\hat x$ and $\hat y$, respectively, and if $\hat x_i > \hat y_k$. The equality relations in (3.10), together with part (c) of Corollary 3.1, imply the second relation for such points, by telescoping sums.

As an example, consider a case with $n = m$ (no ties). The pair $(t, u) = (1.34, 1.67)$ gives a pair of points as described in Corollary 3.2, and it can be verified that $\hat F_{n1}(t) = 1 > \hat F_{n2}(u) = 3/5$, and that (3.14) is satisfied for $(i,k) = (3,4)$; part (c) of Corollary 3.1 can be verified similarly. In empirical process notation, (3.13) means:

A. E. Gomes et al.
where $t \le u$ and $f_{u3}$ is the number of observations with $\delta_1 = \delta_2 = 1$ at $u$, and where $t$ and $u$ are points of increase of $\hat F_{n1}$ and $\hat F_{n2}$, respectively, satisfying $\hat F_{n1}(t) > \hat F_{n2}(u)$. Relation (3.14) gives a second equation; combining the two, we get equation (3.16), where $t$ and $u$ are as described above. Defining appropriate random measures, and similarly their non-random counterparts, we can write (3.16) in a fixed-point form, and so we get the following self-consistency property.

Corollary 3.3.
Let $t$ and $u$ be points of increase of $\hat F_{n1}$ and $\hat F_{n2}$, respectively, so that $t \le u$ and $\hat F_{n1}(t) > \hat F_{n2}(u)$. Then:

Consistency of the MLE
In this section we derive consistency of the (real) MLE from the Fenchel duality conditions in Section 3. For clarity, we use the notation $F_{01}$ and $F_{02}$ instead of $F_1$ and $F_2$, respectively, for the underlying distribution functions, and use $F_1$ and $F_2$ for limits of sequences of $\hat F_{n1}$ and $\hat F_{n2}$. Assuming that $\hat F_{n1}$ and $\hat F_{n2}$ tend along subsequences to limits $F_1$ and $F_2$, we get from (3.16) in the limit the equation (4.2) below. For proving consistency, we will use the following lemma.

Lemma 4.1. Assume that for all $t$ in the interior of the support of $F_{01}$ there is a nondegenerate interval of points $u$ such that $(t,u) \in S_0$ and $F_{01}(t) > F_{02}(u)$. Moreover, assume that the stated condition holds for all $(t,u)$ in the interior of the set $S_0$. Finally, let the limit relation (4.2) be satisfied by the nondecreasing right-continuous functions $F_1$ and $F_2$ for $(t,u)$ in the interior of the set $S_0$, where we define $0/0 = 0$. Then we must have $F_1 \equiv F_{01}$ and $F_2 \equiv F_{02}$ on the support of $g$.
Remark 4.1. Note that our conditions are somewhat similar to those of [12] for the model of doubly censored data. In fact, their approach also uses a self-consistency equation to derive the results.
Proof of Lemma 4.1. Replacing $F_i$ by $F_{0i}$, $i = 1, 2$, in (4.2), we get an identity. So we can subtract this identity from (4.2); defining $H_1 = F_{01} - F_1$ and $H_2 = F_{02} - F_2$ and differentiating w.r.t. $t$, we obtain a relation which, if $F_1(t) > F_2(u)$ and $g(t) > 0$, can only hold if relation (4.3) is satisfied.
Since we can differentiate the right-hand side of (4.2), and since the factor of $F_2(u)$ on the left-hand side can also be differentiated w.r.t. $u$, we can differentiate $F_2(u)$ and denote its derivative by $f_2(u)$. Using (4.3), with $t$ replaced by $u$, and integrating on both sides, we end up with a homogeneous Volterra equation for $H_2$, which can only have the solution $H_2 \equiv 0$. Hence also $H_1 \equiv 0$ (see (4.3)).
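The last step uses a standard property of homogeneous Volterra equations, which can be sketched as follows (a generic argument, with a bounded kernel $K$ standing in for the specific kernel arising above):

```latex
% If H satisfies the homogeneous Volterra equation of the second kind
%   H(t) = \int_0^t K(t,s)\,H(s)\,ds, \qquad |K(t,s)| \le M \text{ on } [0,T],
% then iterating the equation gives, for every n,
%   |H(t)| \;\le\; M \int_0^t |H(s)|\,ds
%          \;\le\; \cdots \;\le\; \frac{(Mt)^n}{n!}\,\sup_{0 \le s \le T}|H(s)|
%          \;\longrightarrow\; 0 \quad (n \to \infty),
% so H \equiv 0 on [0,T]: the homogeneous equation has only the trivial solution.
\[
  H(t) = \int_0^t K(t,s)\,H(s)\,ds, \quad |K| \le M
  \;\Longrightarrow\;
  |H(t)| \le \frac{(Mt)^n}{n!} \sup_{0 \le s \le T}|H(s)| \;\; \forall n
  \;\Longrightarrow\; H \equiv 0 .
\]
```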
The uniqueness lemma gives the consistency of the MLE.
Proof. By the self-consistency equation (3.16), all limit points of subsequences of $(\hat F_{n1}, \hat F_{n2})$ must satisfy (4.2) for $(t,u)$ in the interior of the set $S_0$, defined in Lemma 4.1. The result now follows from the uniqueness of the solution of this equation.

The primal-dual interior point algorithm and the EM algorithm for the real MLE
In this section we explain the primal-dual interior point method and the EM algorithm, as for example used in [19], for computing the real maximum likelihood estimate of the pair (F 1 , F 2 ) and compare their behavior. The EM algorithm is vastly inferior to the primal-dual interior point method and leads to prohibitive computing times for bigger samples. One of the reasons for its inferiority is that it needs many iteration steps, but the other more fundamental reason is that it needs a number of parameters that increases quadratically with sample size, whereas the primal-dual interior point method only needs a number of parameters which increases linearly with sample size.

Primal-dual interior point algorithm
Instead of using the notation $z = (x_1, \ldots, x_m, y_1, \ldots, y_m)$ as in (3.2), it is more convenient in this section to define the vector $z$ by $z = (y_1, x_1, \ldots, y_m, x_m)$; the relations for the $x_i$ and $y_i$ remain the same. We define the vector $g(z) = (g_1(z), \ldots, g_{3m}(z))$ of constraint functions, expressing the order restrictions on the $x_i$ and $y_i$; since these restrictions are linear, $g$ has a constant derivative matrix $G$.
Note that the matrix $G$ does not depend on $z$. The original maximization problem now becomes a problem of minimizing $\varphi(z)$ over $\mathbb{R}^{2m}$ (adopting the convention that $\varphi(z) = \infty$ if we encounter the logarithm of an argument less than or equal to 0), under the restriction that all components of the vector $g(z)$ are less than or equal to 0. The latter restriction will be denoted by $g(z) \le 0$; the dual problem involves vectors $\lambda$ and $w$ in $\mathbb{R}^{3m}_+$.
The primal-dual interior point method for finding the solution to this minimization problem is now formulated (the method is called "primal-dual" because we solve the primal problem for the vector z and simultaneously the dual problem for the vectors λ and w).
A peculiar difficulty is that not all variables appear in the objective function that we want to maximize (the log likelihood). For example, if $\delta_{1,(i)} = \delta_{2,(i)} = 0$, then only $x_i$ figures in (3.1) and not $y_i$ or $y_{i-1}$ ($y_{i-1}$ could appear in a preceding term, though). For this reason the log likelihood will never have $2m$ arguments, unless only terms $\log(x_i - y_i)$ occur. Nevertheless, it is advantageous to work with the "overparametrized" set of $2m$ variables, since (after the inclusion of the constraints) this produces a Hessian which is a band matrix that can easily be inverted (see below). This structure is lost if we first perform a preliminary reduction to the variables that really appear in the likelihood.
We now start the computation of the MLE with a vector $z_0 \in \mathbb{R}^{2m}$ strictly satisfying all constraints, i.e., $g(z_0) < 0$. For $\lambda$ and $w$ we take as starting values $\lambda_0 = w_0 = 0.5 \cdot \mathbf{1}$, where $\mathbf{1}$ denotes the vector in $\mathbb{R}^{3m}$ with all components equal to 1. For a vector $a$ we denote by $A$ the diagonal matrix with $a_i$ as its $i$th diagonal element; for example, $\Lambda$ is the diagonal matrix with $\lambda_i$ as its $i$th diagonal element. The first (Newton) iteration step now solves a system of equations in $(z, w, \lambda)$, where $\mu_0$ and $\sigma \in (0,1)$ are tuning parameters, which we take $\mu_0 = \sigma = 0.5$. The notation $\nabla_{zz}\varphi(z)$ will be used for the matrix of second derivatives of $\varphi_\lambda(z)$ w.r.t. $z$, the so-called Hessian of the function $\varphi_\lambda$ with respect to $z$. We now define, for fixed $\beta > 0$, a neighborhood $\mathcal{N}(\mu)$ in terms of the Euclidean norm $\|\cdot\|$, and we require the first iterate to be in this set. The parameter $\mu$ is called the duality measure; by taking $\lambda = w = 0.5 \cdot \mathbf{1}$, we have made this parameter equal to 0.5 at the start of the iterations. We now take a final parameter $\gamma \in (0,1)$, and take $\alpha_1$ as the first number in the sequence $1, \gamma, \gamma^2, \gamma^3, \ldots$ such that $(z(\alpha), \lambda(\alpha), w(\alpha))$ solves the system of equations above and lies in the required neighborhood. We then take $(z_1, \lambda_1, w_1) = (z(\alpha), \lambda(\alpha), w(\alpha))$ and $\mu_1 = \mu(\alpha)$, and repeat the procedure for the new values of the parameters, i.e., we solve the corresponding system and find the new $(z(\alpha), \lambda(\alpha), w(\alpha))$, required to lie in $\mathcal{N}(\mu_1)$, denoted by $(z_2, \lambda_2, w_2)$, together with the new $\mu_2$. This is repeated until the duality measure $\mu_k$ is below a certain criterion, say $10^{-10}$ or $10^{-15}$.
Generally we start the $k$th iteration step with the corresponding Newton system in the unknowns $(\Delta z, \Delta\lambda, \Delta w)$, evaluated at the value at the start of the $k$th iteration, and we solve for $\Delta z$, $\Delta\lambda$ and $\Delta w$. We now transform this system into a system that is better suited for numerical computation. We first solve for $\Delta w$. Let $D$ denote the diagonal matrix $W^{-1}\Lambda$. Then we can write the remaining part of the system in a reduced form, which we solve for $\Delta\lambda$; using this result, we obtain a system which we first solve for $\Delta z$, next for $\Delta\lambda$, and finally for $\Delta w$. The only matrix for which inversion is not trivial is the matrix in (5.9), and this matrix is symmetric positive definite at each step. The matrices $W$ and $\Lambda$ are diagonal, so inversion of these is trivial. In our case, the matrix (5.9) is a sparse band matrix (by the particular parametrization we chose!). This fact can be used for fast and efficient inversion, where we only have to reserve computer memory for the elements that can be non-zero. The primal-dual interior point algorithm for the estimation of the MLE of $F_1$ and $F_2$ was applied to the data set presented in Table 1. Figure 2 shows the estimates. It can be seen that the difference with Figure 1 is rather minor. An R script for computing the estimates is given in [10]; it uses the R package Rcpp to connect to the original C++ code (also given there) for the primal-dual interior point algorithm.
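The mechanics of one iteration (solve the perturbed KKT system, eliminate the slack and multiplier increments, then damp the step so that $\lambda$ and $w$ stay strictly positive) can be illustrated on a one-constraint toy problem. This is a hand-rolled sketch of the general scheme, not the paper's C++ implementation:

```python
def toy_primal_dual_ip(sigma=0.5, tol=1e-12, max_iter=200):
    """Minimize (x - 2)^2 subject to x <= 1 by a primal-dual interior point
    method.  Constraint with slack: g(x) + w = (x - 1) + w = 0, w > 0, and
    perturbed complementarity lam * w = sigma * mu.  The exact solution is
    x = 1 with Lagrange multiplier lam = 2."""
    x, lam, w = 0.0, 0.5, 0.5
    for _ in range(max_iter):
        mu = lam * w                          # duality measure (one constraint)
        r1 = 2.0 * (x - 2.0) + lam            # stationarity residual
        r2 = (x - 1.0) + w                    # primal feasibility residual
        r3 = lam * w - sigma * mu             # perturbed complementarity residual
        # Newton system: 2*dx + dlam = -r1 ; dx + dw = -r2 ; w*dlam + lam*dw = -r3.
        # Eliminate dw = -r2 - dx and dlam = (-r3 + lam*r2 + lam*dx)/w,
        # then solve the remaining scalar equation for dx.
        dx = (-r1 + (r3 - lam * r2) / w) / (2.0 + lam / w)
        dlam = (-r3 + lam * r2 + lam * dx) / w
        dw = -r2 - dx
        # fraction-to-boundary damping keeps lam and w strictly positive
        alpha = 1.0
        for v, dv in ((lam, dlam), (w, dw)):
            if dv < 0.0:
                alpha = min(alpha, 0.95 * (-v / dv))
        x, lam, w = x + alpha * dx, lam + alpha * dlam, w + alpha * dw
        if mu < tol and abs(r1) < 1e-9 and abs(r2) < 1e-9:
            break
    return x, lam
```

In the full problem the same elimination is performed blockwise, and the only nontrivial solve involves the symmetric positive definite band matrix (5.9).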

EM algorithm
We discuss here the implementation of the EM algorithm for the survival-sacrifice model in the case of ties. See also [19]. A 2-dimensional discrete distribution in the plane for $(T, U)$ is considered, where $T$ is the time of onset of the disease and $U$ is the time of death due to the disease. Note that our description is a bit different from the discussion on p. 43 and 44 of [19] and closer to their discussion on p. 45 of the actual implementation.
There are three situations to consider.
(i) $T > Y$ and $U > Y$. In this case mass has to be distributed over points $(t_k, u_k)$ such that $t_k > Y$ and $u_k > Y$.
(ii) $T \le Y$ and $U > Y$. In this case we should have mass points for $T \le Y$ and for $U$ strictly to the right of $Y$. So mass should be distributed over points $(t_k, u_k)$ with $t_k \le Y$ and $u_k > Y$.
(iii) Our observation is a time of death $U$. In this case mass should be distributed over points $(t_k, u_k)$ such that $u_k = U$.
Our goal is again to distribute mass over these points so that the likelihood is maximized. We consider the log likelihood, where $m$ is the number of strictly different observation times, and where $f_{i1}$ to $f_{i3}$ are the frequencies of the three possibilities mentioned above, in that order.
According to (i) above, the mass of such an observation is redistributed over the points $(t_k, u_k)$ to which it can belong; similarly for cases (ii) and (iii). Since (with added outside points!) we have the side restriction that $\sum_k p_k = 1$, we add a Lagrange term to the log likelihood, by which the criterion to maximize becomes a penalized log likelihood. Standard arguments show that $\lambda = N$, where $N$ is the total number of observations (sum of the frequencies). Differentiation w.r.t. $p_k$ then yields the so-called self-consistency equations for the maximizing density at the points $(t_k, u_k)$, with the convention $0/0 = 0$, and these equations define the EM iteration. After having computed the values $p^{\mathrm{new}}(t_k, u_k)$, we update $F_1$, $F_2$ and $f_2$ via (5.10) to (5.12), in that order. After this the iteration is repeated until some convergence criterion is reached.
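The self-consistency mechanism behind this EM iteration is the same as in the classical right-censored case, where the fixed point is the Kaplan-Meier estimator. The following simplified sketch (one censored coordinate only, our own illustration rather than the bivariate scheme above) shows the redistribution idea:

```python
def self_consistent_cdf(times, deltas, n_iter=200):
    """Self-consistency iteration for right-censored data: each censored
    observation at Y_i contributes its conditional mass
    (F(t) - F(Y_i)) / (1 - F(Y_i)) to the right of its censoring time.
    The fixed point of this iteration is the Kaplan-Meier CDF."""
    n = len(times)
    grid = sorted(set(times))
    # start from the empirical CDF, ignoring the censoring indicators
    F = {t: sum(y <= t for y in times) / n for t in grid}
    for _ in range(n_iter):
        F_new = {}
        for t in grid:
            total = 0.0
            for y, d in zip(times, deltas):
                if y <= t:
                    if d == 1:
                        total += 1.0  # observed death: full mass at or before t
                    elif F[y] < 1.0:
                        # censored: redistribute conditional mass to the right
                        total += (F[t] - F[y]) / (1.0 - F[y])
            F_new[t] = total / n
        F = F_new
    return F
```

For the data `times = [1, 2, 3, 4]`, `deltas = [1, 0, 1, 1]`, the iteration converges to the Kaplan-Meier values $F(1) = 1/4$, $F(3) = 5/8$, $F(4) = 1$.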
We ran the EM algorithm for the data set presented in Table 1, taken from [19], until the Fenchel duality criteria (i) and (ii) of Theorem 3.1 were satisfied with accuracy $10^{-10}$, after having sorted the observations in strict order of magnitude, correcting for ties. This took 3139 iterations. After the correction for ties, there are 102 strictly different observation times (the same correction for ties was applied for the primal-dual algorithm) out of a total of 109 observations. The Lagrange multipliers of Theorem 3.1, divided by sample size, were given by $\hat\lambda_1 = 0.055214$ and $\hat\lambda_2 = 0.220856$. As in the case of the primal-dual interior point algorithm, the implementation is given in [10].

Simulation results and computing times for the primal-dual interior point algorithm and the EM algorithm
We consider simulation of the example in Section 6.1 below. This is the primary example considered by [20]. We use the implementation of the algorithms in [10]. Because of the speed of the primal-dual interior point algorithm, we can easily run simulations up to sample size n = 10,000. First we present the boxplots of the results of the primal-dual interior point algorithm for sample sizes n = 100, 1000 and 10,000.
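For reference, data from this model can be generated as in the following sketch (Python rather than the R scripts of [10]). The exponential specification used here, T 1 with mean 2, T 2 = T 1 plus an independent standard exponential, and C with distribution G(t) = 1 − e^{−2t/5}, is our reading of the example (it matches the stated means E T 1 = 2 and E T 2 = 3) and should be checked against Section 6.1.

```python
import numpy as np

def simulate_survival_sacrifice(n, rng=None):
    """Draw a sample (Y, d1, d2) from the survival-sacrifice model.

    Assumed specification: T1 ~ Exp(mean 2), T2 = T1 + Exp(mean 1)
    (so T1 <= T2 almost surely), C ~ Exp(mean 5/2) independent of
    (T1, T2).  Observed: Y = min(T2, C), d1 = 1{T1 <= C}, d2 = 1{T2 <= C}.
    """
    rng = np.random.default_rng(rng)
    t1 = rng.exponential(2.0, n)        # time of tumor onset
    t2 = t1 + rng.exponential(1.0, n)   # time of death from the tumor
    c = rng.exponential(2.5, n)         # independent sacrifice time, G(t)=1-exp(-2t/5)
    y = np.minimum(t2, c)               # observed time of death
    d1 = (t1 <= c).astype(int)          # tumor present at death?
    d2 = (t2 <= c).astype(int)          # death caused by the tumor?
    return y, d1, d2
```

Note that d2 = 1 always implies d1 = 1 here, since T 1 ≤ T 2.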
Noting that the means for F 1 and F 2 are 2 and 3, respectively, it is seen that the estimates are still rather biased for sample size 100 (in particular for F 2 ), but become less biased for bigger sample sizes. The estimates of the means were defined as in (7.1) below.
The boxplots of the results for the corresponding computing times of the primal-dual interior point algorithm for sample sizes n = 100, 1000 and 10, 000 are given in Figure 5. The simulations were run on a MacBook Pro and time was measured in seconds.
We compare the estimates and running times in seconds of the primal-dual interior point method and the EM algorithm below for 1000 replications of sample size n = 100. For this sample size the interior point method is more than 1000 times faster than the EM algorithm, and the discrepancy in computing time grows further with increasing sample size. We gave the EM algorithm an upper bound of 10,000 iterations, but the (Fenchel duality) convergence criterion was then often still not reached, in contrast with the situation for the interior point method, which was run until the convergence criterion was reached. The estimates seem roughly similar, though, as can be seen from Figure 6, although the EM algorithm gives even more biased estimates of the means for F 2 (probably due to incomplete convergence of the EM algorithm in computing the MLE).
All examples in this subsection can be checked by running the R scripts in [10].

Smooth functionals
In this section we consider lower bounds for the asymptotic variances of estimators of smooth functionals. We use the methods of [3] and [21] to compute theoretical lower bounds for such functionals and then make numerical comparisons with the variances of the natural "plug-in" estimators in several special cases. In particular, we consider moments as examples of smooth functionals. See Appendix A for details. We mainly give the computations for completeness and to introduce the notation used in the examples. Considerable work remains to prove that the plug-in estimators actually achieve the proposed bounds under reasonable conditions. For work in this direction for other models, including deconvolution, interval censoring, and current status data, see Chapter 10 of [11].

Examples
Example 1. In the particular case studied by [20], we have: Note that F 2 is the distribution of the sum of a standard exponential random variable U and an exponential random variable V with scale parameter 2, where U and V are independent. Let R be the integral operator (see (A.14)). In this case we would get, if Ψ(x) = x (the functional corresponding to the first moment of F 1 ), Hence, for b 2 , defined by (A.5), we get: and b 2 is given by: Note that b 2 ∈ L 2 (G). Furthermore, b(t, 1, 1) satisfies the relation (see (A.7)). We have: and hence b ∈ L 2 (P F,G ). The efficient asymptotic variance is given by: and numerical evaluation of this expression yields in the present case: If Ψ(x) = x 2 (corresponding to the second moment of F 1 ), we have to solve the equation where ψ(t) = 2t, under the side condition b 2 (0) = 0; see (A.18) and (A.19). We get: This can perhaps be most easily seen from the fact that b(·, 1, 1) has to satisfy the differential equation under the side condition b(0, 1, 1) = −8, which follows from (A.7), and by using (6.1). Using these expressions for b 2 and b(·, 1, 1), we obtain as estimate of the efficient asymptotic variance for the estimation of the second moment of F 1 : The largest contribution comes from the first term in the information lower bound, that is, the integral showing that the largest contribution is in fact coming from this part of the first term. Also note that, for example, 10 ∫_0^{10} x^2 e^{−x/10} dx ≈ 1606.03. This indicates that the finite sample variances differ considerably from the values predicted by asymptotic theory, and this is confirmed by our experimental results in Section 5.2; see Table 5 and compare with the truncated integrals shown in Table 2 in the case of E F2 Ψ(Y ).
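For illustration, the truncated integral 10 ∫_0^{10} x² e^{−x/10} dx quoted above can be evaluated numerically and checked against its closed form 20000 − 50000/e (a quick sanity check only, not part of the authors' computations):

```python
import numpy as np

# Trapezoidal evaluation of 10 * int_0^10 x^2 exp(-x/10) dx on a fine grid.
x = np.linspace(0.0, 10.0, 200001)
f = x**2 * np.exp(-x / 10.0)
val = 10.0 * np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(x))

# Closed form via integration by parts: 10 * (2000 - 5000/e) = 20000 - 50000/e.
exact = 20000.0 - 50000.0 / np.e
```

Both evaluations agree with the value 1606.03 reported in the text.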

Example 2. If we take
so only changing the distribution of X to a standard exponential and leaving the other distributions the same, we get: since now lim t→∞ b(t, 1, 1){1 − G(t)} = 0 and b(·, 1, 1) ∈ L 2 (G), in contrast with the preceding example. In this case we calculate the following: if H(x) = 1 − (1 − F 2 (x))(1 − G(x)) is the distribution function corresponding to the minimum of T 2 ∼ F 2 and C ∼ G, then the contribution to the integral in (A.25) from the interval [0, τ n ] can be considerably smaller than I_κ^{-1}. The following table gives these values. The estimates seem to fit the simulation results of the MLE for the corresponding sample sizes nicely.

Quantiles of F 1 and F 2
In order to have a proper view of the behavior of the MLE of F 1 and the Weighted Least Squares estimator proposed by [20] for estimating F 1 , two examples of distributions for T 1 , T 2 and C were considered. Example 1 was exactly as described in Section 6.1, with G(t) = 1 − e^{−2t/5} , t ≥ 0; it makes a comparison of our results with theirs possible. Here is our further example: Example 3. In this example we have In this case, the (marginal) distribution functions are In each case studied, we generated r = 625 samples of sizes 100 and 400; these are the numbers of samples and the sample sizes used by [20] in their paper.
The summary statistics include the Mean Squared Error at 9 quantiles (F 1^{−1}(j/10) and F 2^{−1}(j/10) for j = 1, . . . , 9), estimated by averaging the squared estimation errors over the r replications, and similarly for F̂ 2 . In Tables 7, 8 and 9, the bias and variance of the estimators at each quantile are presented, as well as the standard error of the Mean Squared Error, estimated by the square root of the sample variance of the squared errors divided by r. In order to assess the variability of the ratio of the averages of the MSE of the Weighted Least Squares and NPML estimators of F 1 (and of the Kaplan-Meier and NPML estimators of F 2 ), confidence intervals (± 2 s.e.) for that ratio were calculated for each quantile. The delta method gives us an expression for the asymptotic variance of the ratio of two averages:
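The per-quantile MSE and its standard error can be computed as in the following sketch (illustrative Python; `est_vals` is assumed to hold the r replicated estimates F̂^{(i)} evaluated at the true quantiles F^{−1}(j/10), which is our reading of the summary statistics described above):

```python
import numpy as np

def quantile_mse(est_vals, target=np.arange(1, 10) / 10.0):
    """MSE (with its standard error) of distribution estimates at quantiles.

    est_vals: (r, 9) array; est_vals[i, j] is the i-th replication's
              estimate evaluated at the true quantile F^{-1}((j+1)/10),
              whose true value is (j+1)/10.
    Returns (mse, se): mse[j] is the average squared error over the r
    replications, se[j] the standard error of that average.
    """
    r = est_vals.shape[0]
    sq_err = (est_vals - target) ** 2          # squared error per replication
    mse = sq_err.mean(axis=0)                  # average over replications
    se = sq_err.std(axis=0, ddof=1) / np.sqrt(r)  # s.e. of the average
    return mse, se
```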

Ratio between the average MSE's of the WLSE and the MLE of F 1 , and between those of the Kaplan-Meier estimator and the MLE of F 2 (Example 1).
Consequently, we can estimate the variance of X r /Y r by

(X r /Y r )² { s X ²/(r X r ²) + s Y ²/(r Y r ²) − 2ρ s X s Y /(r X r Y r ) },

where ρ is the correlation between Z 1 and Z 2 (i.e., the correlation between the estimators being compared). Figures 7 and 8 show the relative efficiency (the ratio between the average MSE's), with confidence bands, of the Weighted Least Squares and NPML estimators of F 1 and of the Kaplan-Meier and NPML estimators of F 2 at each quantile (x-axis). Figure 7 shows that our results for Example 1, obtained using the Primal-Dual Interior Point algorithm, do not support the conclusions in [20], since the MLE seems to estimate F 1 better in the right-hand tail for both sample sizes (efficiency > 1), while in their results they claim to have observed a relative efficiency of 0.92 for both sample sizes in the last quantile considered. The difference in the results is caused by the way they calculated the MLE of F 1 , which may not yield the correct estimate. Looking at columns (2), (3) and (4) for both estimators in Table 7, we see that the lower variance of the MLE of F 1 at the last quantile is the reason for its superior performance there; the variance also seems to explain the better performance of the WLS estimator at the other quantiles, since the variance is the biggest component of the MSE for both estimators.
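The delta-method variance estimate for the ratio of two averages can be written as in the following sketch (illustrative Python; `x` and `y` are assumed to hold the paired per-replication squared errors of the two estimators being compared, so that mean(x)/mean(y) is the estimated MSE ratio):

```python
import numpy as np

def ratio_variance(x, y):
    """Delta-method estimate of Var(mean(x) / mean(y)) for paired samples.

    Uses the standard first-order expansion of a ratio of means:
    (mx/my)^2 * (sx^2/mx^2 + sy^2/my^2 - 2*rho*sx*sy/(mx*my)) / r,
    with rho the sample correlation of the pairs.
    """
    r = len(x)
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    rho = np.corrcoef(x, y)[0, 1]
    return (mx / my) ** 2 * (sx**2 / mx**2 + sy**2 / my**2
                             - 2.0 * rho * sx * sy / (mx * my)) / r
```

A ± 2 s.e. interval for the ratio is then mean(x)/mean(y) ± 2 · sqrt(ratio_variance(x, y)).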
For Example 3, the WLS estimator is beaten by the MLE in the central quantiles for all sample sizes. These results indicate that the good performance of their estimator claimed by [20] tends to occur when F 1 and F 2 are far apart, since it was generally beaten by the MLE in the tails in Example 1 and in the central quantiles in Example 3, where F 1 = F 2 . Figure 9 shows plots of the logarithm of the variance of each estimator against the logarithm of the sample size. The slopes of the curves give us information about the rate of convergence of the estimators, since Var(θ̂ n ) ≈ c n^{−r} implies log Var(θ̂ n ) ≈ log c − r log n.
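Such slopes can be estimated by an ordinary least-squares fit on the log-log scale, as in this short sketch (an illustration of the regression just described, not the authors' scripts):

```python
import numpy as np

def rate_slope(ns, variances):
    """Least-squares slope of log Var(theta_n) against log n.

    If Var(theta_n) ~ c * n^{-r}, the fitted slope is approximately -r;
    a slope near -1 suggests an n^{-1/2} convergence rate, a slope near
    -2/3 an n^{-1/3} rate.
    """
    slope, _intercept = np.polyfit(np.log(ns), np.log(variances), 1)
    return slope
```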
In each plot, the solid line refers to the second quantile, the dotted line to the fifth quantile, and the dashed line to the eighth quantile. In the second and eighth quantiles, F 1 and F 2 are far apart, while in the fifth one (in the central part of the range of T 1 and T 2 ) we have F 1 = F 2 . We can see in the plots and in Table 3 that the slopes of the curves for the estimators of F 1 are around −2/3 in the second and eighth quantiles (where F 1 > F 2 ) and around −1 in the fifth one (where F 1 = F 2 ), suggesting that when F 1 = F 2 we probably have n^{−1/2} as the rate of convergence of the estimators of F 1 , a property that the estimators of F 2 have on the whole real line, as we can see in the plots for the Kaplan-Meier and NPML estimators of F 2 in Figure 9 and in Table 3, where all the lines have slopes close to −1. This happens because F 1 (t) = F 2 (t) for t ∈ (a, b) means that the disease kills instantly in the interval (a, b), so that the time of death is also the time of disease onset, which is then actually observed in that situation.

It should be noticed that when ‖F 1 − F 2 ‖ ∞ = 1 the MLE, the pseudo MLE and the Weighted Least Squares estimator of F 1 proposed by [20] coincide. The same is true for the MLE, the pseudo MLE and the Kaplan-Meier estimator of F 2 . This happens because ‖F 1 − F 2 ‖ ∞ = sup t∈R | F 1 (t) − F 2 (t) | = 1 implies that the ordered observed death times U i = C i ∧ T 2,i fall into two blocks of observations, the first one with Δ 2,i = 0 and the second one with Δ 2,i = 1. That makes F̂ 2,KM (U (i) ) = F̂ 2,n (U (i) ) = 0 for the observations in the first block, where F̂ 2,KM is the Kaplan-Meier estimator and F̂ 2,n is the NPML estimator of F 2 .
Also, in that block the weights w i become equal to 1, and the cumulative sum diagram used to calculate the WLS estimator is similar to the one used to calculate the MLE of F ≡ F 1 in interval censoring, case 1 (see Groeneboom and Wellner (1992)), implying that the MLE of F 1 , the pseudo MLE of F 1 and the weighted least squares estimator coincide there. On the other hand, in the second block of observations we have Δ 1,(i) = 1. A look at the expression of the log likelihood shows that it is maximized by taking F̂ 1,n = 1 for all the observations in the second block, and hence the log likelihood coincides with the log likelihood for the right censoring problem, which is maximized in F 2 by the Kaplan-Meier estimator. Also, the cumulative sum diagram for the calculation of the weighted least squares estimator has G i = 0 for those observations, making F̂ 1,WLS (U (i) ) = 1, where F̂ 1,n is the MLE of F 1 and F̂ 1,WLS is the estimator proposed by [20]. Putting all these facts together, we see that in the case ‖F 1 − F 2 ‖ ∞ = 1 all the estimators of F 1 and F 2 considered here coincide.

Estimation of moment functionals
In the computations for the following tables, the estimators of the first and second moments were of the form (7.1), where F̂ n represents any one of the possible estimators of F 1 or F 2 . This amounts to using the estimators (7.2), but putting F̂ n equal to 1 at Y (n) . It is also possible to allow defective distribution functions F̂ n in (7.2), but these estimators then have a large downward bias and a much bigger variance, and for this reason we prefer the estimators (7.1).
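One plausible reading of these moment estimators is sketched below (the displays (7.1)-(7.2) are not reproduced in this text, so this is an assumption to be checked against them): given a step-function estimate F̂ n with known jump locations, any remaining mass 1 − F̂ n (Y (n) −) is placed at the largest observation Y (n), and the moments are computed from the resulting point masses.

```python
import numpy as np

def moment_estimates(jump_times, F_vals, y_max):
    """First and second moments of a step-function estimate of F.

    jump_times: increasing jump locations of the estimate Fhat_n.
    F_vals:     values of Fhat_n at those locations (right-continuous).
    y_max:      largest observation Y_(n); Fhat_n is forced to 1 there,
                i.e. any remaining mass is placed at y_max.
    """
    t = np.append(jump_times, y_max)
    F = np.append(F_vals, 1.0)                    # force Fhat_n(Y_(n)) = 1
    p = np.diff(np.concatenate(([0.0], F)))       # point masses at t
    m1 = np.sum(t * p)                            # first moment
    m2 = np.sum(t**2 * p)                         # second moment
    return m1, m2
```

If the last jump already has F̂ n = 1, the appended point carries zero mass and the estimates are unchanged; using a defective F̂ n instead (dropping the forced jump) reproduces the downward-biased variant mentioned above.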

Concluding remarks
Theoretical properties of the real maximum likelihood estimator for the survival-sacrifice problem, such as consistency, were unknown, and for this reason several pseudo maximum likelihood estimators have been proposed for this model in the past. The Weighted Least Squares estimator of F 1 proposed in [20] (LJP (1997)) tends to be more efficient than the MLE of F 1 for the survival-sacrifice model when F 1 and F 2 are far apart. When F 1 and F 2 are close, however, the opposite seems to happen. This was observed after computing the joint MLE of F 1 and F 2 by the Primal-Dual Interior Point algorithm, since the algorithm of LJP (1997), applied for that purpose, does not yield the true MLE of F 1 and F 2 , which invalidates their conclusions about the relative efficiency of their estimator and the MLE of F 1 . The magnitude of the variance of each estimator at the quantiles seems to explain the differences between the MLE and the Weighted Least Squares estimator of F 1 , since the biases of the two estimators are similar. The better performance of the MLE-based estimators of the first and second moments is probably due to the fact that the MLE's do a better job of estimating the tails of the distributions than the weighted least squares estimator proposed in [20], combined with the Kaplan-Meier estimator. This, in turn, is probably due to the better performance of the MLE's in regions where the constraints become active.
We gave a characterization of the real MLE as an element of a cone in Section 3, and used this to derive a self-consistency equation, which in turn was used to prove consistency of the MLE in Section 4, turning the self-consistency equation into a Volterra integral equation. Further properties of the MLE are still unknown, but we believe that (under some natural conditions) the present approach will also lead to further distributional results, in particular rate n^{1/2} results for the estimate of F 2 (the "Kaplan-Meier part") and rate n^{1/3} results for the estimate of F 1 . We introduced a primal-dual interior point method for computing the MLE, which was shown to be very much faster than the EM algorithm, introduced for this model in [19]. The results of both algorithms can be checked by running the R scripts in [10], where the implementation of both methods is given. The duality criteria of Theorem 3.1 give a natural stopping criterion for the EM algorithm, for which a stopping criterion based on distances between successive iterations is not appropriate because of the slow convergence of the algorithm.

Appendix A
If we want to estimate a functional of the bivariate distribution H of (T 1 , T 2 ), like the first moment of F 1 , the score operator is given by: where G is the distribution of the censoring time C. This operator may be defined on L 2 (H), with range in L 2 (P H,G ); in fact its range is contained in L 0 2 (P H,G ). The adjoint of A on L 0 2 (P H,G ) can be written as See [21], and [3], Theorem 5.4.1, page 202. We now investigate under what conditions the expectation of a function Ψ(T 1 ) of the time of onset T 1 ∼ F 1 is a smooth functional and can therefore be estimated at rate √n. We assume smoothness of the distribution functions and of the function Ψ, allowing us to differentiate in the relations we get below. We also assume Ψ(0) = 0, as is true for the moment functionals considered in the examples. We have: and we have to solve the equation Differentiating with respect to t 1 , we obtain We note here that if (A.8) holds, then (A.9) follows. But we will see an example below of a smooth functional for which (A.8) is not satisfied, and for which therefore (A.9) does not hold either. We now determine the null space of A * . Suppose A * φ = 0. Then we get: for all α = φ(·, 0, 0), where φ ∈ N (A * ) has compact support, and R * is the adjoint of R in L 2 (G). Since the functions φ with compact support are dense in L 2 (G), we get equation (A.16), since R • R * = I; see, e.g., Proposition 8, Part B, p. 421, in [3].
Then the information bound for the estimation of κ(F 2 ) = E F2 Ψ(Y ) is: This is in agreement with the results of [5], [18] and [1].