Maximum-Likelihood Estimation of a Log-Concave Density based on Censored Data

We consider nonparametric maximum-likelihood estimation of a log-concave density in case of interval-censored, right-censored and binned data. We allow for the possibility of a subprobability density with an additional mass at $+\infty$, which is estimated simultaneously. The existence of the estimator is proved under mild conditions and various theoretical aspects are given, such as certain shape and consistency properties. An EM algorithm is proposed for the approximate computation of the estimator and its performance is illustrated in two examples.


Introduction
We consider estimation of an unknown distribution P on (−∞, ∞] based on data which are "censored" in a rather general sense. We assume that q := P ({∞}) is a number in [0, 1) and that P has a log-concave sub-probability density f on R. This means that f = e φ for some concave function In the simplest setting our data consists of independent observations X 1 , X 2 , . . . , X n drawn from P . For q = 0 this case was investigated in detail in Dümbgen and Rufibach (2009). As explained in the latter paper, the shape constraint of log-concavity is rather natural in many situations and leads to enhanced estimators of the distribution function of P as well as good estimators of the density f without requiring the choice of any tuning parameter. See also the review of Walther (2009) about the benefits and possible applications of log-concavity.
In many applications the values X i are not exactly observed. One well-known example is rightcensoring: Suppose that the X i are event times in a biomedical study with values in (0, ∞], i.e. P ((−∞, 0]) = 0 and φ(x) = −∞ for x < 0. Here X i = ∞ means that the event does not happen at all, and q is sometimes referred to as the "cure parameter". If the study ends at time C i from the viewpoint of the i-th unit but X i > C i , then we have a right-censored observation and know only that X i is contained in the interval X i = (C i , ∞]. In other settings one has purely intervalcensored data: The i-th unit is inspected at one or several time points, and at each inspection one can only tell whether the event in question has already happened or not. This gives also an interval X i = (L i , R i ] ⊂ (0, ∞] containing X i . Related to interval-censoring is rounding or binning: For a given partition of (−∞, ∞] into nondegenerate left-open and right-closed intervals, we only know which interval observation X i belongs to. In view of econometric applications (e.g. log-returns, log-incomes) it is desirable to allow negative values of the X i . Whenever we talk about "censored data" we mean right-censored, interval-censored, binned or rounded data. The censoring or inspection time points or the binning intervals are assumed to be either fixed or random and independent from the random variables X i .
In case of censored data, the potential benefits of shape-constraints are even higher than in settings with complete data. To analyze interval-censored data, Dümbgen et al. (2006) constrained the density f on [0, ∞) to be non-increasing or unimodal. The former constraint leads typically to accelerated rates of convergence compared to the unrestricted nonparametric estimator, see for instance Dümbgen et al. (2004). An obvious question is how we can cope with the constraint of f being log-concave, which is stronger than f being unimodal.
The remainder of this paper is organized as follows: In Section 2 we introduce the loglikelihood functions for our general setting and provide necessary and sufficient conditions for the existence of maximizers. In Section 3 we show how the parameter space may be restricted and approximated. Particular algorithms for the computation of the MLEs are proposed in Section 4.
They utilize the EM paradigm of Dempster et al. (1977) and the fast algorithms for complete data by Dümbgen et al. (2011a). Section 5 discusses (partial) identifiability of the special parameter q and some consistency properties of our estimators. In Section 6 we illustrate our methods with real and simulated data. Proofs and technical details are deferred to Section 7.
2 Log-likelihoods and maximum-likelihood estimators Log-likelihood functions. Our full parameter space Θ is the set of all pairs (φ, q) consisting of a concave and upper semicontinuous function φ : R → [−∞, ∞) and a parameter q ∈ [0, 1) such that e φ(x) dx + q = 1. (1) If we fix the value q, the set of all concave and upper semicontinuous functions φ satisfying (1) is denoted by Φ(q).
If we could observe the random variables X 1 , X 2 , . . . , X n , an appropriate normalized loglikelihood function˜ : Θ → [−∞, ∞) would be given bỹ In case of censored data we observe random subintervals X 1 , X 2 , . . . , X n of (−∞, ∞]. More precisely, we assume that either Note that we exclude the possibility of L i = −∞, which is convenient and typically no serious restriction. For instance, in connection with event times X i > 0, the left end points L i are always nonnegative. After conditioning on all censoring and inspection time points or binning intervals, we end up with independent observations X i , and the normalized log-likelihood function : Θ → [−∞, ∞) for our setting is given by Sometimes we want to rule out the possibility of a positive mass q at infinity, in which case we Maximum-likelihood estimators. Our goal is to find a maximum-likelihood estimator (MLE) (φ,q) of (φ, q), i.e. a maximizer of (·, ·) over Θ. Under the restriction that q = 0 we aim to find a MLEφ 0 of φ, i.e. a maximizer of (·) over Φ(0).
Our first theorem characterizes the existence of these MLEs.
Theorem 2.1 (Existence of MLEs) A maximizerφ 0 of (·, 0) over Φ(0) exists if, and only if, A maximizer (φ,q) of (·) over Θ exists if, and only if, there exists no uncensored observatioñ Note that the MLEs may only fail to exist in situations where the exact observations {L i : L i = R i , 1 ≤ i ≤ n} form a one-point set. Therefore both MLEsφ 0 and (φ,q) exist in the case of purely interval-censored, rounded or binned data. In the classical right-censored case, assuming i.i.d. censoring times C 1 , . . . , C n and writing := IP(X i ≤ C i ), the probability for existence of both MLEs is at least 1 − n (1 − ) n−1 , which goes to 1 geometrically fast.
In the first part of Section 3 we describe some simple special cases in which the MLE (φ,q) either does not exist or is rather trivial.
3 Restricting and approximating the parameter spaces Special cases. In some situations a MLE (φ,q) may not exist or may be rather trivial. The next two lemmas describe such scenarios.
In case of µ = µ =: µ, one has to check whether L i = R i = µ for at least one index i. If yes, there exists no MLE. If no, one has to determine the numbers n , n r and boundaries a < µ < b as described in Lemma 3.2. Then any (φ, q) ∈ Θ satisfying (4) is a MLE. Here one can also show that q = 0 and with suitable α, β ∈ R fulfill this constraint.
In case of at least one uncensored observation we have to rule out an additional pathological case: Lemma 3.3 Suppose that L io = R io = µ for some index i o and µ ∈ R. Further suppose that all Then for any q ∈ (0, 1), Shape of the maximizers. We start this section with a rather simple and intuitive fact about the domains ofφ andφ 0 , where the domain of a concave function φ is defined as In what follows let In particular, τ 1 = min{L 1 , L 2 , . . . , L n }. We assume that m ≥ 2, because otherwise Lemma 3.1, 3.2 or 3.3 would apply. It follows directly from Lemma 3.4 that One may even require that dom(φ) ⊂ [τ 1 , τ m ], because for any (φ, q) ∈ Θ, the value of (φ, q) remains the same if we replace q with q + ∞ τm e φ(t) dt and redefine φ(t) := −∞ for t > τ m . Note that φ enters (φ, q) only via the values φ(X i ) for those i with L i = R i and via the integrals τ j+1 τ j e φ(t) dt, 1 ≤ j ≤ m. Indeed we may restrict our attention to piecewise linear functions φ with at most m − 1 changes of slope within their domain: Then there exists a (φ, q) ∈ Θ satisfying (φ, q) ≥ (φ, q) and dom(φ) ⊂ [τ 1 , ∞) and the following conditions: (i) For j ∈ {1, 2, . . . , m} either dom(φ) ∩ (τ j , τ j+1 ) = ∅ or (τ j , τ j+1 ) ⊂ dom(φ). In the latter case,φ is piecewise linear on [τ j , τ j+1 ] ∩ R with at most one change of slope within (τ j , τ j+1 ). It Thenφ is linear on [τ j , τ ] ∩ R.
Approximating the parameter spaces. In view of Theorem 3.5 we consider arbitrary tuples Note that functions φ ∈ Φ t (q) and φ o ∈ Φ o t (q) are completely determined by the tuples In addition we need the larger sets Φ t (q) and Φ o t (q) of functions in Φ(q) which may be represented as pointwise limits of sequences in Φ t (q) and Φ o t (q), respectively. One can easily verify that In case of m ≤ 3, when maximizing (·) over Φ(0), we may replace Φ(0) with its subset Φ (τ 1 ,...,τm) (0). To maximize (·, ·) over Θ, it suffices to consider the set Φ o (τ 1 ,...,τm) (q) in place of Φ(q) for 0 ≤ q < 1.

Algorithms
Throughout this section we exclude the special situations described in Lemmas 3.1, 3.2 and 3.3.
In particular, we assume that R i < ∞ for at least one observation, and m ≥ 2.
Augmented log-likelihood functions. Using a trick of Silverman (1982), we can remove the constraint (1). Let Φ be the set of all concave and upper semicontinuous functions Moreover, for this particular value c, the parameter (φ + c, qe c ) belongs to Θ, and (φ + c, qe c ) = Λ(φ + c, qe c ). These considerations imply the following result: and arg max where arg max refers to the (possibly empty) set of the corresponding maximizers.
Optimizing the cure parameter. It seems that we cannot use Silverman's trick to maximize (φ, q) for a given fixed value q > 0. But the augmented likelihood Λ(φ, q) is useful for finding a better value of q: Let φ be a fixed function in Φ such that Λ(φ, q) > −∞ for some (and thus all) In the special case of all right endpoints R i being finite, Λ(φ, q) < Λ(φ) for any q > 0. Otherwise, if R i = ∞ for at least one observation, the right hand side is strictly decreasing in q > 0 and strictly negative for q > q := #{i : R i = ∞}/n. Hence one can easily maximize Λ(φ, q) with respect to q ≥ 0 as follows: then the maximizer is given by q = 0. Otherwise it is the unique number q ∈ (0, q] such that This number may be determined, for instance, by binary search or a Newton procedure. Note also that q < 1 by assumption. The EM paradigm. Maximizing the augmented log-likelihood function Λ(φ, q) with respect to φ ∈ Φ for a fixed value of q ≥ 0 is a non-trivial task. A major problem is that (·, q) is convex rather than linear or concave. Namely, let φ, where v(∞) := 0, the observationsX i are viewed temporarily as fixed, and Y denotes a random variable such that with the following sub-probability distribution M φ,q on R: For any Borel set B ⊂ R, i.e. the conditional expectation of the complete-data log-likelihood, given the available data. This is the more traditional motivation for the EM algorithm. Existence of a unique maximizer of Λ φ,q (·) over Φ is guaranteed by the following auxiliary result which is just a modification of Theorem 2.2 of Dümbgen et al. (2011b): This maximizer φ satisfies the equation e φ(x) dx = M (R), and the closure of dom(φ) equals the closure of S(M ).
Φ. It will automatically satisfy the equation For if φ new = φ, then the definition of φ new and convexity of (·, q) imply that Now we replace φ with φ new . When maximizing (·, ·) over Θ, we also recalculate q via (6) and (7). This yields possibly a further increase of Λ(φ, q), and the new value q equals 0 or satisfies (7). In principle this procedure is iterated until the "difference" between φ and φ new becomes negligible.
Practical implementation of the EM step. Maximization of Λ φ,q (·) over Φ may be achieved via an active set algorithm as described in Dümbgen et al. (2011a) The latter two are defined as the sets Φ t (q) and Φ o t (q) in Section 3 with the constraint φ ∈ Φ(q) replaced with the requirement φ ∈ Φ. Initially the tuple t = (t 1 , t 2 , . . . , t N ) is chosen as described at the end of Section 3. Later on it may be a subtuple of that.
is a closed set and equal to the convex hull of the support of M φ,q . Hence the closure of the domain of arg max ψ∈Φ Λ φ,q (ψ) is equal to dom(φ). Consequently, if we restrict our attention to candidates in Φ t , then it even suffices to consider functions But for ψ ∈Φ we may write and Stopping the EM iterations and modifying the domains ofφ 0 orφ. Let φ 1 , φ 2 , φ 3 , . . . be our candidates forφ 0 orφ. One should stop iterating the EM step (plus optimization with respect to q) when the changes in the (sub-)probability density is much easier to compute and of the same order of magnitude: where m j and m j are the minimum and maximum, respectively, of It happens often that φ k → −∞ on a non-empty subset of dom(φ 1 ), which may lead to numerical problems or a waste of computation time. One possible way out is as follows: For the computation ofφ 0 one could replace (φ) with For the computation of (φ,q) and working with Φ o t , we may replace (φ, q) with where ε 1 is chosen as before, while ε 2 > 0 unless L j = R j = τ m for some index j. Hence we add artifical "observations" {τ 1 } and (τ m , ∞] or {τ m } with very small weights to our original data set. One can be more ambitious and try to estimate the domain ofφ 0 orφ. That means, whenever there is strong evidence for dom(φ 1 ) being too large, the candidate setΦ may be reduced as follows: Possible reduction 1. Suppose that we would like to computeφ 0 and thatΦ = Φ t . With θ := The latter is strictly negative for all values of θ ∈ (−∞, 0) if In that case, then we recompute φ k , working from now on withΦ = Φ o (t 1 ,...,t N −1 ) . Possible reduction 3. Analogously, suppose thatΦ = Φ o t and #{i : L i = R i = t 1 } = 0. With θ := φ k (t 1 ), γ := φ k (t 2 ) and δ := t 2 − t 1 ) we may write then we recompute φ k , working from now on withΦ = Φ o (t 2 ,...,t N ) .

13
Partial identifiability of q. Without any shape constraints on φ = log f , the cure parameter q would not be identifiable. Indeed, with τ * denoting the maximum of {L 1 , . . . , L n } ∪ {R 1 , . . . , R n } ∩ R, the data X 1 , . . . , X n would only provide information about P = P φ,q on (−∞, τ * ] and the number P ((τ * , ∞]). Even if we knew the distribution function F of P on the whole interval [−∞, τ * ], we could only conclude that On the other hand, let φ be concave, and suppose we know F only on some bounded inter- as well, and the unknown

The latter inequality follows from
is sufficiently large, we get an equality or at least nontrivial lower and upper bounds for q.
Consistency. For simplicity we restrict our attention to the setting of interval-censoring: Let P = P φ,q andP n = Pφ ,q , where (φ,q) = (φ n ,q n ) is based on the following observations: For 1 ≤ i ≤ n let A =: T n,i,0 < T n,i,1 < · · · < T n,i,M ni < T n,i,M ni +1 := ∞ be given design points, where A ∈ [−∞, ∞) is a known lower bound for the support of P . Then observation X i = X n,i is defined as the unique interval T n,i,j = (T n,i,j−1 , T n,i,j ], 1 ≤ j ≤ M ni + 1, containing X i .
For instance, in connection with event times, A = 0 and T n,i,1 , . . . , T n,i,M ni could be inspection time points at which one determines whether the event in question has already happened or not.
In this setting, Theorem 2.1 guarantees existence of a MLEP n . The following consistency result is essentially Theorem 3 of Dümbgen et al. (2006) with obvious modifications of its proof.
Throughout this section asymptotic statements refer to n → ∞. for B ⊂ R. Then Theorem 5.1 implies the following result: for any x ∈ (a, b).
Suppose, for instance, that M ni = 1 for all n and i. A special example for this setting is current status data. Here H n is the empirical distribution of the time points T n,i,1 , 1 ≤ i ≤ n. If H n converges weakly to a probability distribution H on the real line such that the distribution function of H is strictly increasing on an open interval (a, b) ⊂ R, then the assumption of Corollary 5.2, part (ii) is satisfied.
The subsequent result is no longer restricted to the setting of purely interval-censored data. It shows that pointwise stochastic convergence ofF n to F on a nondegenerate interval (a, b) implies uniform convergence in probability, unless F is constant on (a, b). Furthermore, the corresponding estimatorf n of the density f = e φ is consistent on (a, b), too, and the estimatorq n of q satisfies certain inequalities. In what follows we denote the positive part of a real number s by s + := max{s, 0}.
such that 1 (a,b) f is continuous on (x ± δ). Then for any fixed δ > 0, The statements aboutf n − f in this theorem are similar to results of Schuhmacher et al. (2011) in the context of log-concave probability densities on R d . They imply that for any x ∈ (a, b) at which f is continuous, and (1) for any real y ∈ [a, b].

Applications
The algorithm described in the previous section was implemented and made publicly available as contributed package logconcens (Schuhmacher et al., 2013) for the statistical computing environment R (R Core Team, 2013). We give here two demonstrations of this implementation, one for simulated interval-censored data and one for real right-censored data. In both cases we used the domain reduction technique detailed above, but the trick of adding artificial very small or large pseudo-observations with little weights led virtually to the same densities and survival functions.
Simulated data. We simulate event times X i , 1 ≤ i ≤ 100, from a Γ(3, 1)-distribution and inspect them according to independent homogeneous Poisson processes with rate 1. The latter means that for each i, we consider a random sequence (T i,j ) ∞ j=1 which is independent from X i , starts at T i,1 = 0 and has independent, standard exponentially distributed increments T i,j − T i,j−1 , j ≥ 2. Then X i is the unique interval (T i,j−1 , T i,j ], j ≥ 2, containing X i . Figure 1 presents the generated data and comparisons of our log-concave NPMLE in terms of log-densities and survival functions. In the first panel we see the censored data consisting of the n intervalsX i sorted by their left endpoints. The second panel compares our estimatorφ to the true log-density of the Gamma distribution and to the NPMLE based on the exact data X i . The differences are rather small. Note thatq = 0 andφ =φ 0 because all right endpoints R i are finite. The third panel compares the survival functionŜ obtained fromφ to the true survival function S and to the unconstrained nonparametric maximum likelihood estimator of the survival function from Turnbull (1976) (produced with the R package interval; see Fay, 2013). Compared to the latter the survival curve stemming fromφ is clearly preferable as it captures not only the approximate course but also the smoothness of the true survival curve.
In order to analyze the performance of the estimators more thoroughly, we simulated 500 data sets by the above procedure and computedφ andŜ every time. The average supremum norm of |Ŝ − S| was 0.0614, which compares favourably to the value of 0.1540 obtained for the same quantity if we replaceŜ with the Turnbull estimate.
To study the performance for a distribution with a positive cure parameter, we also simulated 500 data sets (X i , 1 ≤ i ≤ 100) from the distribution 0.7 · Γ(3, 1) + 0.3 · δ ∞ and inspected them according to Poisson processes that were restricted to only six inspection times each. The average supremum norm of |F − F | = |Ŝ − S| was then 0.0763 and the average estimation error |q − q| for the cure parameter was 0.0514. ReplacingŜ andq with the Turnbull estimate and its rightmost value, we obtained 0.1482 and 0.07199 respectively.
Of course we benefit in these examples from the fact that the true distribution is really logconcave. On the other hand, many distributions with a non-decreasing hazard rate are log-concave (see Dümbgen and Rufibach, 2009) and in the case of slight misspecification of the model, at least for exact data, the log-concave density estimator is still consistent for a close approximation of the true density (see Dümbgen et al., 2011b).
Real data. We estimate the survival curve for the data from Edmunson et al. (1979), which is available in the dataset ovarian in the R package survival (Therneau, 2013). The survival times in days of 26 patients with advanced ovarian carcinoma were recorded along with certain covariate information, which we ignore here. Twelve observations are uncensored and the rest is right-censored.
The data is depicted in the left panel of Figure 2, where a dot represents an exact observation for a patient at a certain time. The right panel shows the survival function based on our estimator (φ,q) together with the celebrated Kaplan-Meier estimator, which is just the special case of the

Proofs and technical results
An essential ingredient for the proof of Theorem 2.1 are the following inequalities: be a concave function such that R e φ(t) dt ≤ 1, and let x ∈ dom(φ). Then for any y ∈ R with |x − y| ≥ e −φ(x) ,

As for the second part, let
x e φ(t) dt, and by concavity of φ there exists a number Then there exists a concave and upper semicontinuous function φ : R → [−∞, ∞) and a subse- Proof of Theorem 2.1. We first consider (·, 0). According to Lemmas 3.1 and 3.2 it suffices to consider data sets such that In other words, there exist indices i(1), i(2) ∈ {1, 2, . . . , n} such that This implies that M k := max x∈R φ k (x) is bounded. For if M k ≥ log((b − a)/2) and x k is a maximizer of φ k , then in case of x k ≤ (a + b)/2, for all k. But then it follows from Lemma 7.1 that for all k and any x ∈ R.
Hence we may apply Lemma 7.2 to conclude that after replacing (φ k ) k with a subsequence, if necessary, there exists a concave and upper semicontinuous function φ : for any x ∈ interior(dom(φ)).
In addition we may and do assume that lim k→∞ q k = q o ∈ [0, 1]. Again let x k be a maximizer of φ k and set M k := φ k (x k ). We first show that (M k ) k may be assumed to be bounded. Note that by Lemma 7.1, If M k → ∞, then ρ ik → 0 for all i with R i < ∞. This implies that for certain real numbers µ , µ with µ ≤ lim inf k→∞ x k ≤ lim sup k→∞ x k ≤ µ .
If there are no uncensored observations, then for each i ∈ {1, 2, . . . , n} either L i < µ or Thus we may conclude that and γ k , γ kr ≥ 0 are chosen such that µ a eφ k (t) dt = p k and b µ eφ k (t) dt = p kr . The previous considerations show that, after replacing (φ k ) k with a surrogate sequence if necessary, we may assume that φ k ≤ M for all k and some real constant M . Next we show that the limit q o of (q k ) k is strictly smaller than one. Note that so q o = 1 would imply that each observation has to be uncensored or of the form (L i , ∞]. If all uncensored observations would be identical, we could conclude from Lemma 3.3 that there exists no maximizer of (). If a : Thus we may assume that φ k ≤ M for all k and lim k→∞ q k = q o ∈ [0, 1). Let [a, b] := Consequently, max x∈[a,b] φ k (x) ≥ m for all k and some real number m. Hence Lemma 7.1 implies that Again we may apply Lemma 7.2 and dominated convergence to conclude that there exists a func- Proof of Lemma 3.1. Note first that all observations satisfy L i ≤ µ < µ ≤ R i . Hence with equality if, and only if, P φ,q ((L i , R i ]) = 1 for 1 ≤ i ≤ n. But this is easily shown to be equivalent to P φ,q ((µ , µ ]) = 1. For ε > 0, Hence (φ ε ) → ∞ as ε ↓ 0 which implies the first assertion.
with P := P φ,q . But Finally, this bound becomes maximal if, and only if, x = n r /(n + n r ).
Our proof of Theorem 3.5 is based on the following two results: Lemma 7.3 Let a < b < c be real numbers and φ : [a, c] → R continuous and concave. Then there exist real numbers (i) Let γ be the unique real number such thatφ(x) := φ(a) + γ(x − a) satisfies the equation The latter two inequalities are strict unlessφ ≡ φ.   Proof of Lemma 7.3. Let with certain constants γ ≥ (φ(c) − φ(a))/(c − a) ≥ γ r yet to be specified. This is done in two steps. First let for some real number y ≥ φ(b). That means,φ is a triangular function connecting the points (a, φ(a)), (b, y) and (c, φ(c)). Now we choose y as large as possible such that still This means, at least one of the former two inequalities is an equality. It follows from y ≥ φ(b) (15) is strict, we replace the current slope γ by a larger value such that (15) becomes an equality. Likewise, if (16) is strict, we replace the current slope γ r by a smaller value such that (15) becomes an equality. One can easily verify that γ ≤ φ (a +) and

Now comes the second step: If
Proof of Lemma 7.4. Existence and uniqueness of the surrogate functionφ follow from elementary considerations in both scenarios (i) and (ii). One can also verify easily that eitherφ ≡ φ, or γ < φ (a +) and there exists a number b o ∈ (a, c) such that The latter conditions imply the inequalities of part (iii).
If lim inf n→∞ H n ([x, y]) > 0, the latter inequality occurs by Theorem 5.1 with asymptotic probability zero, which proves part (i).
Theorem 5.3 is closely related to results of Schuhmacher et al. (2011) in the context of logconcave probability densities on R d . For the reader's convenience a self-contained proof is given here. We start with some elementary inequalities: Lemma 7.5 Let x 0 < x 1 < x 2 < x 3 be real numbers such that for j = 0, 1, 2. Then Proof of Lemma 7.5. For j = 0, 2 let z j be a maximizer of φ over [x j , x j+1 ]. By concavity, the function φ is bounded from below by min{φ(z 0 ), φ(z 2 )} ≥ min{c 0 , c 2 } on [z 0 , z 2 ] ⊃ [x 1 , x 2 ].