Theoretical properties of the log-concave maximum likelihood estimator of a multidimensional density

We present theoretical properties of the log-concave maximum likelihood estimator of a density based on an independent and identically distributed sample in $\mathbb{R}^d$. Our study covers both the case where the true underlying density is log-concave, and where this model is misspecified. We begin by showing that for a sequence of log-concave densities, convergence in distribution implies much stronger types of convergence -- in particular, it implies convergence in Hellinger distance and even in certain exponentially weighted total variation norms. In our main result, we prove the existence and uniqueness of a log-concave density that minimises the Kullback--Leibler divergence from the true density over the class of all log-concave densities, and also show that the log-concave maximum likelihood estimator converges almost surely in these exponentially weighted total variation norms to this minimiser. In the case of a correctly specified model, this demonstrates a strong type of consistency for the estimator; in a misspecified model, it shows that the estimator converges to the log-concave density that is closest in the Kullback--Leibler sense to the true density.


Introduction
Although work on shape-constrained density estimation dates back to the celebrated paper of Grenander (1956) on the estimation of a decreasing density on the positive half-line, it is in recent years that the area has enjoyed its most significant interest. This is partly because algorithmic and technological advances now allow the computation of estimators that would not previously have been feasible, and partly because statisticians now have more tools at their disposal for the study of the theoretical properties of these estimators.
The attraction of the use of these estimators is that, in contrast to alternative nonparametric density estimation methods such as those based on kernels or wavelets, they provide fully automatic procedures, with no smoothing parameters to choose. Such procedures are particularly desirable when the data are multidimensional, and the choice of (often multiple) smoothing parameters is particularly problematic.
The properties of the Grenander estimator are now quite well understood, thanks to the works of Marshall and Proschan (1965), Prakasa Rao (1969), Devroye (1987), Birgé (1989) and van de Geer (1993), among others. Other examples of shape constraints on univariate densities that have been studied in the literature include convexity (Groeneboom, Jongbloed and Wellner, 2001; Dümbgen, Rufibach and Wellner, 2007) and k-monotonicity (Balabdaoui and Wellner, 2008). It is also known that a maximum likelihood estimator does not exist over the class of unimodal densities; cf. Birgé (1997).
The existence and uniqueness of a log-concave maximum likelihood estimator $\hat{f}_n$ of a density $f_0$ based on a sample $X_1, \ldots, X_n$ in $\mathbb{R}^d$ were proved in Cule, Samworth and Stewart (2007). There, the structure of $\hat{f}_n$ was outlined and an algorithm for its computation was described. The algorithm was implemented in the R package LogConcDEAD (Cule, Gramacy and Samworth, 2007, 2009).
In this paper, we study the statistical properties of the estimator. Importantly, our results do not assume that the underlying density is log-concave. To the best of our knowledge, such results have not been obtained before even for univariate data, but are of interest because in practice it is impossible to tell from a sample of data whether the assumption of log-concavity is satisfied. It is therefore natural to seek assurance that the estimator will behave sensibly if the condition is violated. In our main result (cf. Theorem 4 below), we prove under very mild conditions the existence and uniqueness of a log-concave density $f^*$ that minimises the Kullback-Leibler divergence from $f_0$ and show that there is an interval of values of $a$ for which $\int_{\mathbb{R}^d} e^{a\|x\|} |\hat{f}_n(x) - f^*(x)| \, dx \to 0$ almost surely as $n \to \infty$. The upper bound for the range of allowable values of $a$ is explicitly linked to the rate of tail decay of $f^*$. In the case where $f_0$ is log-concave, it is well-known that $f^* = f_0$, so the result demonstrates the strong consistency of $\hat{f}_n$ in exponentially weighted total variation norms, and in exponentially weighted supremum norms if $f_0$ is continuous. If the true density is not log-concave, we see that the limiting density is the one that is closest (in the Kullback-Leibler sense) to $f_0$. As described in Section 3 below, this result strengthens what was previously known even for the case $d = 1$.
The rest of this paper is organised as follows. In Section 2, we study sequences of logconcave densities that converge in distribution to a limiting density, and demonstrate that the convergence must also occur in much stronger senses. In Section 3, we show that, with probability one, the estimator is uniformly bounded above on R d , and uniformly bounded below on compact subsets in the interior of the support of the true density. This enables us to state and prove the main result described in the previous paragraph. Further auxiliary results can be found in the Appendix.

Convergence of log-concave densities
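We begin with an elementary lemma, whose proof is given in the Appendix.
Lemma 1. Let $f$ be a log-concave density on $\mathbb{R}^d$. Then there exist $a > 0$ and $b \in \mathbb{R}$ such that $f(x) \le e^{-a\|x\| + b}$ for all $x \in \mathbb{R}^d$.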
Notice in particular that if X has density f , then Lemma 1 implies that the moment generating function of X is finite in an open neighbourhood of the origin.
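As a numerical illustration of Lemma 1, read here as the tail bound $f(x) \le e^{-a\|x\| + b}$ used in Proposition 2(c), the following sketch checks the bound for the standard normal density in dimension $d = 1$. The function names, the grid and the choice of the normal density are assumptions made for illustration only. For this density the smallest admissible $b$ for a given $a$ is $a^2/2 - \frac{1}{2}\log(2\pi)$, attained at $|x| = a$.

```python
import math


def normal_pdf(x):
    """Standard normal density, a log-concave density on the real line."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def tail_bound(x, a, b):
    """Exponential envelope exp(-a|x| + b)."""
    return math.exp(-a * abs(x) + b)


def smallest_b(a):
    """sup_x [log f(x) + a|x|] for the standard normal, attained at |x| = a."""
    return 0.5 * a * a - 0.5 * math.log(2.0 * math.pi)


def bound_holds(a, grid):
    """Check f(x) <= exp(-a|x| + b) on a grid, up to floating-point rounding."""
    b = smallest_b(a)
    return all(normal_pdf(x) <= tail_bound(x, a, b) * (1.0 + 1e-12) for x in grid)


def mgf_upper(t, a):
    """Integral of exp(t x) exp(-a|x| + b) dx, finite for |t| < a."""
    b = smallest_b(a)
    return math.exp(b) * 2.0 * a / (a * a - t * t)


grid = [i / 100.0 for i in range(-2000, 2001)]
print(bound_holds(1.0, grid))
print(math.exp(0.5 * 0.5 ** 2), mgf_upper(0.5, 1.0))
```

The second printed line compares the normal moment generating function at $t = 0.5$ with the value obtained by integrating against the envelope, which is how finiteness of the moment generating function near the origin follows from the lemma.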
If $f, f_1, f_2, \ldots$ are densities on $\mathbb{R}^d$, we write $f_n \stackrel{d}{\to} f$ for the convergence in distribution of the corresponding measures; in other words, $\int g(x) f_n(x) \, dx \to \int g(x) f(x) \, dx$ for all bounded continuous functions $g : \mathbb{R}^d \to \mathbb{R}$. The following result shows that when it is known that a sequence of densities is log-concave, convergence in distribution in fact implies much stronger forms of convergence. A similar result, proved independently at around the same time and using different techniques, can be found in Schumacher, Hüsler and Dümbgen (2009). We write $\mu$ for Lebesgue measure on $\mathbb{R}^d$.
Proposition 2. Let $(f_n)$ be a sequence of log-concave densities on $\mathbb{R}^d$ with $f_n \stackrel{d}{\to} f$ for some density $f$. Then:
(a) $f$ is log-concave;
(b) $f_n \to f$, $\mu$-almost everywhere;
(c) Let $a_0 > 0$ and $b_0 \in \mathbb{R}$ be such that $f(x) \le e^{-a_0\|x\| + b_0}$. Then for every $a < a_0$, we have $\int_{\mathbb{R}^d} e^{a\|x\|} |f_n(x) - f(x)| \, dx \to 0$ and, if $f$ is continuous, $\sup_{x \in \mathbb{R}^d} e^{a\|x\|} |f_n(x) - f(x)| \to 0$.
Proof. (a) This part of the proposition can be deduced from Theorems 2.8 and 2.10 of Dharmadhikari and Joag-Dev (1988). Their proof relies on a non-trivial correspondence between log-concave probability measures and log-concave densities, which in turn depends on several other facts about log-concavity; cf. Dharmadhikari and Joag-Dev (1988, pp.46-54). We give an alternative proof because it is perhaps a little more direct, and because it forms part of the proof of part (b) below.
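The mode of convergence in part (c) of the proposition, taken here to be convergence of $\int e^{a\|x\|} |f_n(x) - f(x)| \, dx$ to zero, can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the argument: it takes $d = 1$, lets $f_n$ be the $N(1/n, 1)$ density and $f$ the $N(0, 1)$ density, fixes $a = 1$, and approximates the weighted integral by the trapezoidal rule on a truncated range. All names and numerical settings are hypothetical.

```python
import math


def normal_pdf(x, mean=0.0):
    """Density of N(mean, 1), which is log-concave."""
    z = x - mean
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def weighted_tv(mean, a, half_width=30.0, step=0.005):
    """Trapezoidal approximation of the integral of exp(a|x|) |f_mean(x) - f_0(x)|."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        value = math.exp(a * abs(x)) * abs(normal_pdf(x, mean) - normal_pdf(x))
        weight = 0.5 if i in (0, n) else 1.0  # end points get half weight
        total += weight * value
    return total * step


# Weighted distances for f_n = N(1/n, 1) with n = 1, 10, 100 and a = 1.
distances = [weighted_tv(1.0 / n, a=1.0) for n in (1, 10, 100)]
print(distances)
```

The printed distances shrink as $n$ grows, in line with the proposition. Since a normal density admits the bound $f(x) \le e^{-a_0|x| + b_0}$ for every $a_0 > 0$, any fixed $a$ could be used in place of $a = 1$.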
Let $f_n \stackrel{d}{\to} f$. Our proof relies on two crucial results. The first is that if $\mathcal{D}$ is the class of all Borel-measurable, convex subsets of $\mathbb{R}^d$, then $\sup_{D \in \mathcal{D}} \bigl| \int_D (f_n - f) \, d\mu \bigr| \to 0$ as $n \to \infty$ (Bhattacharya and Rao, 1976, Theorem 1.11, p.22). The second is Lebesgue's differentiation theorem: recall that a family $\{A_\delta : \delta > 0\}$ of Borel subsets of $\mathbb{R}^d$ shrinks nicely to $x_0 \in \mathbb{R}^d$ with eccentricity bound $\eta > 0$ if $A_\delta \subseteq B_\delta$, where $B_\delta$ is the closed (Euclidean) ball of radius $\delta$ centered at $x_0$, and if the family is such that $\mu(A_\delta) > \eta \mu(B_\delta)$ for all $\delta > 0$. Then for $\mu$-almost all $x_0$, we have $\frac{1}{\mu(A_\delta)} \int_{A_\delta} f \, d\mu \to f(x_0)$ (2.1) as $\delta \to 0$, for every family $\{A_\delta : \delta > 0\}$ that shrinks nicely to $x_0$ (Folland, 1999, Theorem 3.21). Points $x_0$ at which this convergence holds are called Lebesgue points of $f$; notice that any continuity point of $f$ is certainly a Lebesgue point of $f$. In fact, it will be helpful to note the following small generalisation: if we have a sequence $\{A_{k,\delta} : k \in \mathbb{N}, \delta > 0\}$ of families that shrink nicely to $x_0$ with the same eccentricity bound $\eta$, then the convergence in (2.1) is uniform in $k$.
Observe by the concavity of log f n k that if x ∈ D k,δ then 2x 0 −x ∈ B δ \D k,δ . It follows that µ(B δ \D k,δ ) ≥ µ(B δ )/2. This means that we can apply Lebesgue's differentiation theorem to choose δ > 0 small enough that for every k, Thus, without loss of generality, we may assume f ≤ lim inf n f n . But by Fatou's lemma, so in fact we may assume f = lim inf f n . Since lim inf f n is log-concave, this proves (a).
. There are three cases to consider: contradicting our first crucial result.
Case (ii): Suppose we are not in Case (i), and that f (x 0 ) > 0, so that by reducing ǫ if necessary we may assume f (x 0 ) > ǫ/2. In this case, since for each k the ratio µ(D k,δ )/µ(B δ ) is decreasing as δ increases, there exist δ > 0 and positive integers k(1) < k(2) < . . . such that It is straightforward to show, using the concavity of log f n k , that µ(D k,δ ) ≤ µ(D k,δ )/t d , where as above, We may also assume that We then obtain a contradiction as in the proof of Proposition 2(a). Hence µ(E 2 ) = 0, as required. This proves (b).
(c) Write φ n = log f n and φ = log f . Without loss of generality assume 0 ∈ int(dom φ), and let η > 0 be small enough that B η (0), the closed ball of radius η > 0 about the origin, is contained in int(dom φ).
Fix a ∈ (0, a 0 ), and let δ = a 0 − a. By Lemma 1, we can find R > 0 such that for all x ≥ R/2. We claim there exists n 0 such that for all x ≥ R and n ≥ n 0 . Note that since, for each n, the ratio on the left-hand side of (2.2) is a decreasing function of x , it suffices to prove that the inequality in (2.2) holds for all x = R and n ≥ n 0 . This is straightforward to see if the ball of radius R about the origin is in int(dom φ), because in that case φ n → φ uniformly on this ball (Rockafellar, 1997, Theorem 10.8). In general, however, we argue as follows.
Suppose we can find a subsequence $(\phi_{n_k})$ and a sequence ( From this we deduce that contradicting the first crucial result in the proof of Proposition 2(a). This establishes our claim at (2.2). But this means there exists $b \in \mathbb{R}$ such that $\sup_{n \ge n_0} f_n(x) \le e^{-(a+\delta/4)\|x\| + b}$. From Proposition 2(b) and the dominated convergence theorem we conclude that $\int_{\mathbb{R}^d} e^{a\|x\|} |f_n(x) - f(x)| \, dx \to 0$.
Now suppose that $f$ is continuous and let $\epsilon \in (0, 1)$. Choose $R > 0$ large enough that $f(x) + \sup_{n \ge n_0} f_n(x) \le \epsilon e^{-a\|x\|}/2$ for all $\|x\| \ge R$. Then certainly, $\sup_{\|x\| \ge R} e^{a\|x\|} |f_n(x) - f(x)| \le \epsilon$ for $n \ge n_0$. Using the continuity of $f$, we can choose a closed, convex set $S \subseteq \mathrm{int}(\mathrm{dom}\, \phi) \cap B_R(0)$ such that $f(x) < e^{-aR}/2$ for all $x \in S^c$. Since $f_n \to f$ uniformly on $S$, we may assume $\sup_{x \in S} |f_n(x) - f(x)| \le \epsilon e^{-aR}/2$ for all $n \ge n_0$. Finally, for $x \in B_R(0) \setminus S$, we may assume $\epsilon > 0$ is small enough that $f_n(0) \ge \epsilon e^{-aR}$ for $n \ge n_0$. Since $f_n(x) \le \epsilon e^{-aR}$ for $x$ on the boundary of $S$ and $n \ge n_0$, it follows that $f_n(x) \le \epsilon e^{-aR}$ for $x \in B_R(0) \setminus S$ and $n \ge n_0$. We deduce that $\sup_{x \in \mathbb{R}^d} e^{a\|x\|} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

Theoretical properties of the log-concave maximum likelihood estimator
Let $X_1, X_2, \ldots$ be an independent and identically distributed sequence with density $f_0$ (not necessarily log-concave), and for $n \ge d + 1$ let $\hat{f}_n$ denote the log-concave maximum likelihood estimator of $f_0$ based on $X_1, \ldots, X_n$. Let $E$ denote the support of $f_0$; that is, the smallest closed set with $\int_E f_0 = 1$. The lemma below establishes appropriate upper and lower bounds for $\hat{f}_n$. Part (a) of the lemma strengthens Theorem 3.2 of Pal, Woodroofe and Meyer (2007), where for the case of univariate data and a log-concave underlying density it was shown that the random variable $\sup_{n \in \mathbb{N}} \sup_{x \in \mathbb{R}^d} \hat{f}_n(x)$ is finite with probability one. To the best of our knowledge, lower bounds such as that appearing in part (b) have not been studied before.
Proof. (a) Let $g(x) = \exp(-\|x\| + b)$, where the normalisation constant $b$ is chosen to ensure $g$ is a density, so that say.
Now let C = e M , where M is large enough that M > k + 1 and such that Let f be any log-concave density with sup x∈R d f (x) = C. We claim that, for sufficiently large n, the log-concave density g has higher log-likelihood. More precisely, if 'i.o.' stands for 'infinitely often', we claim that The result follows immediately from this claim. To establish the claim, write φ = log f , and observe that The first term on the right-hand side of (3.1) is zero, by the strong law of large numbers.
To prove the second term on the right-hand side of (3.1) is zero, the idea is to show that there is only a small set on which φ is large, so with high probability only a small proportion of the observations are in this set. To this end, It follows by Fubini's theorem that Hoeffding's inequality. The first Borel-Cantelli lemma then completes the proof of (a).
(b) By the concavity of $\log \hat{f}_n$, it suffices to prove the conclusion of this part of the lemma when the infimum over $x \in \mathrm{conv}\, S$ is replaced with an infimum over $x \in S$. Let $S$ be a compact subset of $\mathrm{int}(E)$ and let $\delta > 0$ be small enough that $S^\delta = \{x \in \mathbb{R}^d : \mathrm{dist}(x, S) \le \delta\}$ is contained in $\mathrm{int}(E)$. Consider the function $x_0 \mapsto \int_{B_{\delta/2}} f_0$, where $B_{\delta/2}$ again denotes the closed ball of radius $\delta/2$ centered at $x_0$. This function is positive and continuous on $S^{\delta/2}$, so attains its (positive) infimum on this compact set. Thus there exists $p > 0$ such that $\int_B f_0 \ge p$ for any Borel subset $B$ of $\mathbb{R}^d$ containing a ball of radius $\delta/2$ centered at a point in $S^{\delta/2}$. Now let $f$ be any log-concave density on $\mathbb{R}^d$, and let $c = 2 \inf_{x \in S} f(x)$. We show that if $c \ge 0$ is sufficiently small, then $f$ is not the maximum likelihood estimator for large $n$. By Lemma 3(a), we may assume $\sup_{x \in \mathbb{R}^d} f(x) \le C$. Recall that the density $g$ was defined by $g(x) = e^{-\|x\| + b}$, where $b$ is a normalisation constant, and again by Hoeffding's inequality. By the first Borel-Cantelli lemma, and arguing as in the proof of Lemma 3(a) above, we conclude that
Our next theorem is the main result in this paper and establishes desirable performance properties of the log-concave maximum likelihood estimator. We first recall that the Kullback-Leibler divergence of a density $f$ from $f_0$ is given by $d_{KL}(f_0, f) = \int_{\mathbb{R}^d} f_0 \log (f_0 / f)$. Jensen's inequality shows that the Kullback-Leibler divergence is non-negative, and equal to zero if and only if $f = f_0$ (almost everywhere). Thus in the case where $f_0$ is log-concave, Theorem 4 below shows that the log-concave maximum likelihood estimator $\hat{f}_n$ is strongly consistent in certain exponentially weighted total variation metrics. Convergence in exponentially weighted supremum norms also follows if $f_0$ is continuous.
The theorem strengthens known results even in the univariate case, which include Corollary 1 of Pal, Woodroofe and Meyer (2007), where it was proved that $\hat{f}_n$ is strongly consistent in Hellinger distance, and Corollary 4.2 of Dümbgen and Rufibach (2009), where (weak) consistency of $\hat{f}_n$ in the unweighted total variation distance was established. (The observation that the mode of convergence in the univariate consistency result of Corollary 4.2 of Dümbgen and Rufibach (2009) could be strengthened was also made independently at around the same time in Schumacher, Hüsler and Dümbgen (2009).) In the case where the model is misspecified, i.e. $f_0$ is not log-concave, we prove the existence and uniqueness of a log-concave density $f^*$ that minimises the Kullback-Leibler divergence from $f_0$. Moreover, we show that the log-concave maximum likelihood estimator $\hat{f}_n$ converges in the same senses as in the previous paragraph to $f^*$. The natural practical interpretation is that provided $f_0$ is not too far from being log-concave, the estimator is still sensible.
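To make the Kullback-Leibler divergence concrete, taken here in the standard form $d_{KL}(f_0, f) = \int f_0 \log(f_0/f)$, the sketch below evaluates it numerically in dimension $d = 1$. As an illustrative assumption, $f_0$ is the standard normal density and $f$ is the log-concave density $g(x) = e^{-|x| + b}$ with $b = -\log 2$, the univariate version of the reference density used in the proofs. The helper names and the quadrature settings are hypothetical. The numerical value is compared with the closed form $\log 2 + \sqrt{2/\pi} - \frac{1}{2}(\log(2\pi) + 1)$.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def log_f0(x):
    """Log-density of the standard normal distribution."""
    return -0.5 * x * x - LOG_SQRT_2PI


def log_g(x):
    """Log-density of g(x) = exp(-|x| + b) with b = -log 2 (d = 1)."""
    return -abs(x) - math.log(2.0)


def kl_divergence(log_p, log_q, half_width=12.0, step=0.001):
    """Trapezoidal approximation of the integral of p log(p / q)."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        lp = log_p(x)
        value = math.exp(lp) * (lp - log_q(x))
        weight = 0.5 if i in (0, n) else 1.0  # end points get half weight
        total += weight * value
    return total * step


kl_numeric = kl_divergence(log_f0, log_g)
kl_exact = math.log(2.0) + math.sqrt(2.0 / math.pi) - 0.5 * (math.log(2.0 * math.pi) + 1.0)
print(kl_numeric, kl_exact)
```

Working with log-densities avoids evaluating $\log 0$ in the tails. The divergence is strictly positive because $g \ne f_0$, and it is zero when both arguments coincide.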
We write log + x = max(log x, 0) and recall that E denotes the support of f 0 .
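Theorem 4. Let $f_0$ be any density on $\mathbb{R}^d$ with $\int_{\mathbb{R}^d} \|x\| f_0(x) \, dx < \infty$, $\int_{\mathbb{R}^d} f_0 \log_+ f_0 < \infty$ and $\mathrm{int}(E) \neq \emptyset$. There exists a log-concave density $f^*$, unique almost everywhere, that minimises the Kullback-Leibler divergence of $f$ from $f_0$ over all log-concave densities $f$. Taking $a_0 > 0$ and $b_0 \in \mathbb{R}$ such that $f^*(x) \le e^{-a_0\|x\| + b_0}$, we have for any $a < a_0$ that $\int_{\mathbb{R}^d} e^{a\|x\|} |\hat{f}_n(x) - f^*(x)| \, dx \to 0$ almost surely and, if $f^*$ is continuous, $\sup_{x \in \mathbb{R}^d} e^{a\|x\|} |\hat{f}_n(x) - f^*(x)| \to 0$ almost surely.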
Remark: The conditions on the underlying density f 0 are very weak indeed. The first condition asks for a finite mean, while the second is satisfied by any bounded density, as well as a wide class of unbounded densities. The third condition is also very weak, but it may help to give an example where it fails: let (q n ) be an enumeration of the rationals in [0, 1], and let f 0 ∝ ½ E , where E = [0, 1] \ ∪ ∞ n=1 (q n − 1 n 2 , q n + 1 n 2 ). In this case int(E) = ∅.
Proof. By the two integrability conditions, the log-concave density g(x) = e − x +b , where b is a normalisation constant, satisfies d KL (f 0 , g) < ∞. We can therefore pick a minimising sequence of log-concave densities (f n ) for the Kullback-Leibler divergence from f 0 ; in other words, the sequence (f n ) satisfies where F 0 denotes the class of all log-concave densities. A slightly simpler version of the argument given in the proof of Lemma 3(a) shows that there exists C > 0 such that f n ≤ C for all n. Similarly, a small modification to the argument in the proof of Lemma 3(b) shows that for every compact subset S of int(E), there exists c > 0 such that lim inf We claim that the sequence (f n ) is tight (or more precisely that the sequence of probability measures corresponding to the sequence of densities is tight). To see this, let S be a compact subset in int(E), and choose c > 0 such that inf x∈S f n (x) ≥ c for large n. Without loss of generality we assume 0 ∈ S and µ(S) > 0. Now, for R sufficiently large, we must have lim sup n→∞ sup x >R f n (x) ≤ c/2, as otherwise the Lebesgue measure of the convex level sets {x ∈ R d : f n (x) > c/2} would be too large for each f n to be a density. It follows that there exist a 0 > 0 and b 0 ∈ R such that sup n∈N f n (x) ≤ e −a 0 x +b 0 , and tightness of the sequence follows.
Prohorov's theorem (Billingsley, 1999, Theorem 5.1) therefore guarantees the existence of a probability measure ν * such that a subsequence (f n k ) converges in distribution to ν * . Now, given ǫ > 0, choose δ = ǫ/(2C). If A is a Borel subset of R d with µ(A) ≤ δ, then since Lebesgue measure is regular we can find an open set A ′ ⊇ A such that µ(A ′ ) ≤ 2δ. Now Thus ν * is absolutely continuous with respect to µ, and we may write f * for its density with respect to µ. By Proposition 2(a), f * is log-concave, and by Proposition 2(b), f n k → f * almost everywhere. Finally, by Fatou's lemma, we have Thus f * does indeed minimise the Kullback-Leibler divergence from f 0 over the class of log-concave densities.
Suppose now that both f * 1 and f * 2 minimise the Kullback-Leibler divergence from f 0 over the class of log-concave densities. Defining we see that f * is a log-concave density with by the Cauchy-Schwarz inequality, with equality if and only if f * 1 = f * 2 , µ-almost everywhere. This proves the claimed uniqueness property of f * . Now, write F 0 for the distribution function corresponding to the density f 0 and P 0 for the distribution on R d induced by F 0 . Similarly, writeF n for the empirical distribution function of X 1 , . . . , X n , andP n for the corresponding empirical measure. By definition off n , we have for any b > 0 that The idea of adding the small constant b > 0 in this calculation first appeared in Pal, Woodroofe and Meyer (2007). We first derive an appropriate uniform law of large numbers to handle the first term on the right hand side of (3.2). By Lemma 3(a), we may assume thatf n ≤ C. Recall that D denotes the class of all Borel-measurable convex subsets of R d . For any log-concave density f with f ≤ C, we have by Fubini's theorem that Combining this result with an application of the strong law of large numbers to the fourth term on the right-hand side of (3.2), we deduce that with probability one, lim sup It follows by the monotone convergence theorem that with probability one, Lemma 6 in the Appendix allows us to deduce from this that R d |f n − f * | a.s.
→ 0, so the full result follows by Proposition 2.
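The uniqueness step in the proof above can be spelled out as follows. This is a sketch under the assumption that the omitted definition of $f^*$ is the normalised geometric mean of the two minimisers $f_1^*$ and $f_2^*$, which is the choice consistent with the appeal to the Cauchy-Schwarz inequality.

```latex
\begin{align*}
f^* &= \frac{(f_1^* f_2^*)^{1/2}}{\int_{\mathbb{R}^d} (f_1^* f_2^*)^{1/2} \, d\mu},
\qquad
\log f^* = \tfrac{1}{2} \log f_1^* + \tfrac{1}{2} \log f_2^*
  - \log \int_{\mathbb{R}^d} (f_1^* f_2^*)^{1/2} \, d\mu, \\
d_{KL}(f_0, f^*) &= \tfrac{1}{2}\, d_{KL}(f_0, f_1^*) + \tfrac{1}{2}\, d_{KL}(f_0, f_2^*)
  + \log \int_{\mathbb{R}^d} (f_1^* f_2^*)^{1/2} \, d\mu, \\
\int_{\mathbb{R}^d} (f_1^* f_2^*)^{1/2} \, d\mu
  &\le \Bigl( \int_{\mathbb{R}^d} f_1^* \, d\mu \Bigr)^{1/2}
       \Bigl( \int_{\mathbb{R}^d} f_2^* \, d\mu \Bigr)^{1/2} = 1 .
\end{align*}
```

Since $\log f^*$ is an average of two concave functions plus a constant, $f^*$ is log-concave. The last display shows that the logarithmic term is non-positive, so $d_{KL}(f_0, f^*)$ is at most the common minimal value. Minimality forces equality in the Cauchy-Schwarz inequality, which holds only if $f_1^* = f_2^*$, $\mu$-almost everywhere.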

Appendix
Before proving Lemma 1, we first derive a basic property of a log-concave density f . Recall that the epigraph of a concave function φ : The closure of φ, denoted cl(φ), is the concave function whose epigraph is the closure in R d+1 of the epigraph of φ. The functions φ and cl(φ) agree almost everywhere, and we say φ is closed if φ = cl(φ).
Lemma 5. A log-concave density f is bounded above and the version of f that is closed attains its maximum.
Proof. Without loss of generality, we may assume $\log f$ is closed. It has no directions of increase, because otherwise there would exist $\epsilon \in \mathbb{R}$ such that the set $\{x \in \mathbb{R}^d : \log f(x) \ge \epsilon\}$ is $d$-dimensional, convex and unbounded (so of infinite Lebesgue measure). Theorem 27.2 of Rockafellar (1997) therefore gives that $\log f$ attains its (finite) maximum.
We can now prove Lemma 1.
The following lemma is used in the proof of Theorem 4. The conclusion can be immediately strengthened using Proposition 2, and is stated in this way only for brevity.
Lemma 6. Let f 0 be any density on Let f * be a log-concave density that minimises the Kullback-Leibler divergence from f 0 over the class of log-concave densities. If (f n ) is a sequence of log-concave densities satisfying Proof. Let Φ : R d → R be the Young function Φ(x) = (1 + |x|) log(1 + |x|) − |x|. The Orlicz space L Φ is the set of (equivalence classes of) measurable functions f : R d → R, whose Luxemburg norm f Φ , given by is finite. LetΦ(y) = e |y| − |y| − 1 denote the Young conjugate of Φ, and let · Φ denote the corresponding Luxemburg norm on LΦ. Then by Rao and Ren (1991, Proposition 1, p.58), and the remark following it, for f ∈ L Φ and g ∈ LΦ, we have |f g| ≤ 2 f Φ g Φ .
Noting that f 0 Φ < ∞, an immediate application of this formula yields that for any Borel subset D of R d , . (4.1) Now let f be a log-concave density with sup x∈R d f (x) = C ≡ e M . For large M, we have as in the proof of Lemma 3(a) that µ({x : f (x) ≥ 1}) ≤ 1 8 M d e −M ≤ e −M/2 . It follows that for any b 0 ∈ (0, 1) and b < b 0 , we have Here, the penultimate inequality uses (4.1). We deduce that the sequence (f n ) in the statement of the lemma satisfies the condition that there exists C ≥ 1 such that sup n∈N sup x∈R d f n (x) ≤ C.
Now let S be a compact subset of int(E). Find δ > 0 such that S δ ⊆ int(E) and, as in the proof of Lemma 3(b), find p > 0 such that B f 0 ≥ p for all Borel subsets B of R d that contain a ball of radius δ/2 centered at a point in S δ/2 . Let f be any log-concave density on R d with sup x∈R d f (x) ≤ C, and write c = 2 inf x∈S f (x). If c ∈ [0, C] is sufficiently small, then we can find b 0 > 0 small enough that Then writing B = {x : f (x) ≤ c}, we have for all b ∈ (0, b 0 ) that We deduce that there exists c > 0 such that lim inf n→∞ inf x∈conv S f n (x) ≥ c. (4.2) As in the proof of Theorem 4, we have from (4.2) that the sequence (f n ) is tight. Thus if (f n k ) is an arbitrary subsequence of (f n ), then there exists a further subsequence (f n k(l) ) and a log-concave density f such that |f n k(l) − f | → 0. But then, by the dominated and monotone convergence theorems, lim sup bց0 lim sup with equality if and only if f = f * almost everywhere. By the hypothesis of the lemma, we must have |f n k(l) −f * | → 0. Since every subsequence of (f n ) has a further subsequence converging in total variation norm to f * , we must have |f n − f * | → 0.
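The Orlicz-space step in the proof of Lemma 6 rests on the pair of Young functions $\Phi(x) = (1 + |x|) \log(1 + |x|) - |x|$ and its conjugate $e^{|y|} - |y| - 1$. As a sanity check, the sketch below verifies Young's inequality $|xy| \le \Phi(x) + \bar{\Phi}(y)$ on a grid, together with the equality case $y = \log(1 + x)$. The function names and the grid are assumptions made for illustration, and $\bar{\Phi}$ is simply the label used here for the conjugate.

```python
import math


def young_phi(x):
    """Phi(x) = (1 + |x|) log(1 + |x|) - |x|."""
    t = abs(x)
    return (1.0 + t) * math.log1p(t) - t


def young_conjugate(y):
    """Conjugate of Phi: exp(|y|) - |y| - 1."""
    s = abs(y)
    return math.expm1(s) - s


def young_gap(x, y):
    """Phi(x) + conjugate(y) - |x y|, non-negative by Young's inequality."""
    return young_phi(x) + young_conjugate(y) - abs(x * y)


grid = [i / 10.0 for i in range(0, 101)]
min_gap = min(young_gap(x, y) for x in grid for y in grid)
equality_gap = max(abs(young_gap(x, math.log1p(x))) for x in grid)
print(min_gap, equality_gap)
```

The gap vanishes along $y = \log(1 + x)$ because $\Phi'(x) = \log(1 + x)$ for $x \ge 0$, and the conjugate is obtained by integrating the inverse $e^{y} - 1$ of this derivative.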