Uniform central limit theorems for the Grenander estimator

We consider the Grenander estimator, that is, the maximum likelihood estimator for non-increasing densities. We prove uniform central limit theorems for certain subclasses of bounded variation functions and for Hölder balls of smoothness s > 1/2. We do not assume that the density is differentiable or continuous. The proof can be seen as an adaptation of the method for the parametric maximum likelihood estimator to the nonparametric setting. Since nonparametric maximum likelihood estimators lie on the boundary of the parameter space, the derivative of the likelihood cannot be expected to equal zero as in the parametric case. Nevertheless, our proofs rely on the fact that the derivative of the likelihood can be shown to be small at the maximum likelihood estimator.


Introduction
A fundamental approach to statistical estimation is to find the probability measure which renders the observations most likely. This principle of maximum likelihood estimation has proved very successful in parametric estimation but leads to difficulties in nonparametric problems, since the likelihood is typically unbounded so that no maximum is attained. However, in nonparametric problems with shape constraints the maximum likelihood estimator is often well defined, and thus the maximum likelihood approach can be extended to these situations. Examples include non-increasing, convex, concave and log-concave functions.
The classical parametric maximum likelihood theory is based on the estimator θ̂_n being in the interior of the parameter space and on the resulting fact that the derivative of the likelihood vanishes at θ̂_n. The theory of nonparametric maximum likelihood estimation is quite different from the parametric theory, since the estimator lies on the boundary of the parameter space and thus in general the derivative of the likelihood will not be zero. But in some nonparametric situations the derivative of the likelihood can be shown to be sufficiently small, enabling a proof strategy that parallels the one in the classical parametric theory. Nickl (2007) considered the maximum likelihood estimator for estimating a density in a Sobolev ball and proved uniform central limit theorems using this approach.
We pursue this method of proof in the problem of estimating a non-increasing density p_0 on the unit interval. The maximum likelihood estimator p̂_n is called the Grenander estimator in this situation since it was first derived by Grenander (1956). It is well known to be the left derivative of the least concave majorant of the empirical distribution function. The main results will be uniform central limit theorems for the Grenander estimator that in particular imply

    √n ( ∫_0^1 f dP̂_n − P_0 f ) →_d N(0, P_0 f² − (P_0 f)²)

for functions f, where P_0 is the probability measure of the non-increasing density p_0, P̂_n is the probability measure induced by p̂_n, and P_0 f ≡ ∫_0^1 f(x) dP_0(x). Our results are uniform over f varying in a class of functions, and we cover two different types of classes. The first type is a subclass of the bounded variation functions, and for each point of discontinuity t of p_0 the indicator function 𝟙_{[0,t]} is contained in such a class. The second type of class is given by balls in Hölder spaces C^s of order s > 1/2. Under a strict curvature condition and for continuously differentiable p_0, Kiefer and Wolfowitz (1976) proved that the difference between the Grenander estimator and the empirical distribution function in supremum norm is with probability one of order n^{-2/3} log n. This implies in particular a uniform central limit theorem for the class of all indicator functions 𝟙_{[0,t]}, t ∈ [0, 1].
Results similar to the Kiefer–Wolfowitz theorem hold under other shape constraints as well. Balabdaoui and Wellner (2007) showed such a result in the case where the density is assumed to be convex decreasing and where the maximum likelihood estimator is replaced by the least squares estimator. Dümbgen and Rufibach (2009) derived the rate of estimation for log-concave densities in supremum norm and showed that the difference between the empirical and the estimated distribution function is o(n^{-1/2}). Durot and Lopuhaä (2014) showed a general Kiefer–Wolfowitz type theorem which covers the estimation of monotone regression curves, monotone densities and monotone failure rates. Jankowski (2014) studied the local convergence rates of the Grenander estimator in situations where it is misspecified and derived the asymptotic distribution of linear functionals under possible misspecification. Our perturbation approach is particularly suitable on domains where a strict curvature condition holds and leads to weaker assumptions on p_0 and f in these domains. Beyond asymptotic normality for single functionals, our work establishes uniformity in the underlying functional, which is important for the application of the results by Nickl (2009) concerning convolutions of density estimators.
The uniform results can be interpreted in the context of Bickel and Ritov (2003), who consider rate optimal estimators of the density which simultaneously lead to the efficient estimation of functionals with uniform convergence. They coin the expression "plug-in property" for such estimators. Nickl (2007) discusses applications in the context of the uniform central limit theorems for the MLE over a Sobolev ball. Uniform central limit theorems were also shown for kernel density estimators and for wavelet density estimators (Giné and Nickl, 2008, 2009). This paper is organised as follows. Section 2 states the uniform central limit theorems for a subclass of the bounded variation functions and for Hölder balls. In Section 3 we explain the general approach. In Section 4 we derive upper and lower bounds in probability for the Grenander estimator and recall the L_2-convergence rate. In Section 5 the perturbation approach is further developed and the main results are proved.

Main results
Let X_1, ..., X_n be i.i.d. on [0, 1] with law P_0 and distribution function F_0(x) = ∫_0^x dP_0, x ∈ [0, 1]. In order to state the main results we introduce some notation. We define the empirical measure P_n = n^{-1} Σ_{i=1}^n δ_{X_i}, the empirical cumulative distribution function F_n(x) = ∫_0^x dP_n, x ∈ [0, 1], and the log-likelihood function

    ℓ_n(p) = ∫ log p dP_n = n^{-1} Σ_{i=1}^n log p(X_i).   (1)

Under the assumption E_{P_0} |log p(X)| < ∞ for all p ∈ P, we can define the limiting log-likelihood function

    ℓ(p) = ∫ log p dP_0.   (2)

If P is known to have a monotone decreasing density p, then the associated maximum likelihood estimator p̂_n maximises the log-likelihood function ℓ_n(p) over the class P_mon of non-increasing probability densities on [0, 1], that is,

    max_{p ∈ P_mon} ℓ_n(p) = ℓ_n(p̂_n).   (3)

The maximum likelihood estimator p̂_n is known to be the left derivative of the least concave majorant F̂_n of the empirical distribution function F_n. For a set T let ℓ^∞(T) denote the space of bounded real-valued functions on T with the usual supremum norm ‖·‖_∞. Throughout, →_d denotes convergence in distribution as in Chapter 1 of van der Vaart and Wellner (1996). The P_0-Brownian bridge G_{P_0} is defined as the tight Gaussian random variable arising from the centred Gaussian process with covariance

    E[G_{P_0}(f) G_{P_0}(g)] = P_0(fg) − (P_0 f)(P_0 g),

where P_0 f = ∫_0^1 f(x) dP_0(x). The first main result is a uniform central limit theorem for a subclass of the bounded variation functions. We start with the general result in Theorem 1 and consider its consequences in Corollary 1 and Theorem 2. Let f, p_0 ∈ L_1[0, 1] and assume that the weak derivatives of f|_{(0,1)} and p_0|_{(0,1)} in the sense of regular Borel signed measures exist; denote them by Df and Dp_0, respectively; cf., e.g., p. 42 in Ziemer (1989). We define BV[0, 1] := {f ∈ L_1[0, 1] : ‖f‖_1 + |Df|((0, 1)) < ∞}, where |Df| is the total variation of the signed measure Df. In the following theorem it will be important that Df is absolutely continuous with respect to Dp_0, since we want to ensure that the perturbations p_0 ± ηf with |η| small (or slightly modified perturbations) are decreasing functions.
To this end we denote the Radon–Nikodym derivative by Df/Dp_0 and assume that its essential supremum with respect to Dp_0, denoted by ‖Df/Dp_0‖_{∞,Dp_0}, is bounded.
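The characterisation of p̂_n as the left derivative of the least concave majorant of F_n also makes the estimator easy to compute. The following sketch is ours, not part of the paper (the name `grenander` is our choice): it obtains the majorant by a standard upper convex hull scan over the vertices of the empirical distribution function and reads off its slopes.

```python
import numpy as np

def grenander(x):
    """Grenander estimator: left derivative of the least concave
    majorant (LCM) of the empirical distribution function.

    x -- 1-D array of observations in (0, 1], assumed distinct.
    Returns (knots, slopes): the LCM is piecewise linear with the given
    knots, and the estimated density equals slopes[i] on
    (knots[i], knots[i+1]].
    """
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    # vertices of the empirical CDF, prepending the origin (0, 0)
    px = np.concatenate(([0.0], xs))
    py = np.arange(n + 1) / n
    # least concave majorant = upper convex hull of these vertices,
    # computed by a monotone (Graham-type) scan over the sorted x-values
    hull = [0]
    for i in range(1, n + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            slope_ab = (py[b] - py[a]) / (px[b] - px[a])
            slope_bi = (py[i] - py[b]) / (px[i] - px[b])
            # b lies on or below the chord from a to i: drop it, since
            # keeping it would violate concavity (decreasing slopes)
            if slope_bi >= slope_ab:
                hull.pop()
            else:
                break
        hull.append(i)
    knots = px[hull]
    slopes = np.diff(py[hull]) / np.diff(knots)
    return knots, slopes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # draw from the decreasing density p0(x) = 2(1 - x) by inversion
    sample = 1.0 - np.sqrt(1.0 - rng.random(500))
    knots, slopes = grenander(sample)
    print(slopes[:3])  # estimated density on the first few segments
```

By construction the returned slopes are strictly decreasing, so the estimate is a genuine non-increasing density, and it integrates to one exactly because the hull starts at (0, 0) and ends at (X_(n), 1).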
The proof of Theorem 1 is deferred to the end of the paper. We note the difference between the n^{-2/3} rate and the (n/log n)^{-2/3} rate in the Kiefer–Wolfowitz theorem, and that Theorem 1 does not imply the Kiefer–Wolfowitz theorem. However, we now present a result on the distribution function that is not covered by the Kiefer–Wolfowitz theorem. For points t ∈ (0, 1) where p_0 is discontinuous we can take the indicator function f = 𝟙_{[0,t]} in Theorem 1. Then Df = −δ_t. Say p_0 has at t a discontinuity of size ∆ ≡ lim_{s↑t} p_0(s) − lim_{s↓t} p_0(s) > 0. Then Dp_0 = −∆δ_t − µ for a positive measure µ. In this case ‖Df/Dp_0‖_{∞,Dp_0} = 1/∆. This leads to the following.
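For concreteness, the Radon–Nikodym computation in this example can be displayed as follows (our rendering of the argument just given, assuming µ puts no mass on {t}):

```latex
% f = \mathbf 1_{[0,t]} has weak derivative Df = -\delta_t; a downward jump
% of size \Delta of p_0 at t contributes the atom in Dp_0 = -\Delta\delta_t - \mu:
\frac{dDf}{dDp_0}(t) = \frac{Df(\{t\})}{Dp_0(\{t\})} = \frac{-1}{-\Delta} = \frac{1}{\Delta},
\qquad
\frac{dDf}{dDp_0} = 0 \quad \text{$Dp_0$-a.e.\ on } (0,1)\setminus\{t\},
% so the essential supremum with respect to Dp_0 is
\Bigl\| \frac{Df}{Dp_0} \Bigr\|_{\infty, Dp_0} = \frac{1}{\Delta}.
```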
Under a strict curvature condition the set B in Theorem 1 contains C^1-Hölder balls. By a strict curvature condition we have in mind that p_0′ is bounded away from zero, that is, inf_{x∈[0,1]} |p_0′(x)| ≥ ξ > 0, or equivalently ‖1/p_0′‖_∞ ≤ 1/ξ. We do not want to assume that p_0′ exists classically. To allow for discontinuities of p_0 and to stay in the general setting of weak derivatives, we assume that the Lebesgue measure λ on [0, 1] is absolutely continuous with respect to Dp_0 and replace the assumption ‖1/p_0′‖_∞ ≤ 1/ξ by the weaker assumption ‖λ/Dp_0‖_∞ ≤ 1/ξ, where λ/Dp_0 is the Radon–Nikodym derivative. We remark that ‖λ/Dp_0‖_{∞,Dp_0} = ‖λ/Dp_0‖_∞. Let F be a C^1-Hölder ball and f ∈ F. Then ‖Df/Dp_0‖_{∞,Dp_0} = ‖(Df/λ)(λ/Dp_0)‖_{∞,Dp_0} ≤ (1/ξ) ‖f′‖_∞, and we see that the C^1-Hölder ball F is contained in B for some B. This special case of Theorem 1 with F instead of B can be generalized to functions in s-Hölder spaces C^s of order s > 1/2.

Theorem 2. Suppose p_0 ∈ P is bounded and satisfies p_0 ≥ ζ > 0. Let the Lebesgue measure λ on [0, 1] be absolutely continuous with respect to Dp_0 and let ‖λ/Dp_0‖_∞ < ∞. Let F be a ball in the s-Hölder space of order s > 1/2. Then

    √n (P̂_n − P_0) →_d G_{P_0} in ℓ^∞(F).

In particular, for any f ∈ C^s we have

    √n (P̂_n f − P_0 f) →_d N(0, P_0 f² − (P_0 f)²).

The proof of Theorem 2 will be given at the end of the paper.
Linear functionals of the Grenander estimator have been studied by Jankowski (2014), so let us discuss the differences in scope and in assumptions between the results. A distinct feature of Theorems 1 and 2 is that they are not stated for a fixed function f but uniformly in f over classes of functions. Taking a different perspective, Jankowski's emphasis is on the problem of possible misspecification, meaning that the true density p_0 does not necessarily have to be non-increasing. Jankowski (2014) distinguishes between curved and flat parts of p_0 (or, in case of misspecification, of its Kullback–Leibler projection) and assumes on the portion of the support where p_0 is curved that p_0 is continuously differentiable and that |p_0′| is bounded, which is used for the application of the Kiefer–Wolfowitz theorem in the proof. The assumption that p_0 is continuously differentiable is widely used in the literature on the Grenander estimator, so it is worthwhile to remark that we require p_0 neither to be differentiable nor to be continuous. The function f defining the functional is assumed by Jankowski (2014) to be differentiable on the curved part and to be in L_p, p > 2, on the flat part. On the curved part, Theorem 1 allows for discontinuities of f at points where p_0 is discontinuous, and Theorem 2 only requires Hölder smoothness of order s > 1/2. On a possible flat part, Theorem 1 assumes f to be constant, which in view of the results by Jankowski (2014) is the natural condition to ensure a Gaussian limit. To summarise the discussion, the perturbation approach has the advantage of providing uniform results under low regularity assumptions on p_0; it requires stronger assumptions on f on the flat part while needing weaker assumptions on the curved part.

The derivative of the likelihood function
Many classical properties of maximum likelihood estimators θ̂_n of regular parameters θ ∈ Θ ⊂ R^p, such as asymptotic normality, are derived from the fact that the derivative of the log-likelihood function vanishes at θ̂_n,

    (∂/∂θ) ℓ_n(θ) |_{θ = θ̂_n} = 0.
This typically relies on the assumption that the true parameter θ_0 is interior to Θ, so that by consistency θ̂_n will eventually be interior as well. In the infinite-dimensional setting, even if one can define an appropriate notion of derivative, this approach is usually not viable, since p̂_n is never an interior point of the parameter space even when p_0 is. We now investigate these matters in more detail in the setting where P consists of bounded probability densities. In this case we can compute the Fréchet derivatives of the log-likelihood function on the space L^∞ = L^∞(X) equipped with the ‖·‖_∞-norm. Recall that a real-valued function L : U → R on an open subset U of a normed space is Fréchet differentiable at f ∈ U with derivative DL(f) if DL(f) is a continuous linear map and |L(f + h) − L(f) − DL(f)[h]| = o(‖h‖) as ‖h‖ → 0. The following proposition shows that the log-likelihood function ℓ_n is Fréchet differentiable on the open convex subset of L^∞ consisting of functions that are positive at the sample points. A similar result holds for ℓ if one restricts to functions that are bounded away from zero. We recall these results from Proposition 3 in Nickl (2007).
Proposition 1. For any finite set of points x_1, ..., x_n ∈ X define

    U(x_1, ..., x_n) := {f ∈ L^∞(X) : min_{1≤i≤n} f(x_i) > 0},  U := {f ∈ L^∞(X) : ess inf f > 0}.

Then U(x_1, ..., x_n) and U are open subsets of L^∞(X). Let ℓ_n be the log-likelihood function from (1) based on X_1, ..., X_n ∼ i.i.d. P_0, and denote by P_n the empirical measure associated with the sample. Let ℓ be as in (2). For α ∈ N and f_1, ..., f_α ∈ L^∞ the α-th Fréchet derivatives of ℓ_n : U(X_1, ..., X_n) → R, ℓ : U → R at a point f ∈ U(X_1, ..., X_n), f ∈ U, respectively, are given by

    D^α ℓ_n(f)[f_1, ..., f_α] = (−1)^{α−1} (α − 1)! ∫ (f_1 ⋯ f_α / f^α) dP_n,   (6)

    D^α ℓ(f)[f_1, ..., f_α] = (−1)^{α−1} (α − 1)! ∫ (f_1 ⋯ f_α / f^α) dP_0.   (7)

We deduce from the above proposition the intuitive fact that the limiting log-likelihood function has a derivative at the true point p_0 > 0 that is zero in all 'tangent space' directions h in

    H := {h ∈ L^∞[0, 1] : ∫_0^1 h(x) dx = 0},

that is,

    Dℓ(p_0)[h] = ∫ (h/p_0) dP_0 = ∫_0^1 h(x) dx = 0 for all h ∈ H.   (9)

However, in the infinite-dimensional setting the empirical counterpart of (9), for h ∈ H and p̂_n the nonparametric maximum likelihood estimator, is not true in general. Even if the set P the likelihood was maximised over is contained in U(X_1, ..., X_n), it will itself in typical nonparametric situations have empty interior in L^∞, and the maximiser p̂_n will lie at the boundary of P. As a consequence we cannot expect that p̂_n is a zero of Dℓ_n. Following ideas in Nickl (2007) we can circumvent this problem in some situations: if the true value p_0 lies in the 'interior' of P in the sense that local L^∞-perturbations of p_0 are contained in P ∩ U(X_1, ..., X_n), then we can bound Dℓ_n at p̂_n.
Lemma 1. Let p̂_n be as in (3) and suppose that for some h ∈ L^∞, η > 0, the line segments joining p̂_n and p_0 ± ηh are contained in P ∩ U(X_1, ..., X_n). Then

    |Dℓ_n(p̂_n)[h]| ≤ η^{-1} |Dℓ_n(p̂_n)[p_0 − p̂_n]|.

Proof. Since p̂_n is a maximiser over P, we deduce from differentiability of ℓ_n on U(X_1, ..., X_n) that the derivative at p̂_n in the direction of p_0 + ηh ∈ P ∩ U(X_1, ..., X_n) necessarily has to be nonpositive, that is,

    Dℓ_n(p̂_n)[p_0 + ηh − p̂_n] ≤ 0,

or, by linearity of Dℓ_n(p̂_n)[·],

    η Dℓ_n(p̂_n)[h] ≤ −Dℓ_n(p̂_n)[p_0 − p̂_n] ≤ |Dℓ_n(p̂_n)[p_0 − p̂_n]|.

Applying the same reasoning with −η we see

    −η Dℓ_n(p̂_n)[h] ≤ |Dℓ_n(p̂_n)[p_0 − p̂_n]|.

Divide by η to obtain the result.
The above lemma is interesting if we are able to show that, as n → ∞,

    |Dℓ_n(p̂_n)[p̂_n − p_0]| = o_P(1/√n),

since then the same rate bound carries over to Dℓ_n(p̂_n)[h]. This can in turn be used to mimic the finite-dimensional asymptotic normality proof for maximum likelihood estimators, which does not require (4) but only that the score is of smaller stochastic order of magnitude than 1/√n. As a consequence we will be able to obtain the asymptotic distribution of linear integral functionals of p̂_n and, more generally, for P̂_n the probability measure associated with p̂_n, central limit theorems for √n(P̂_n − P_0) in 'empirical process'-type spaces ℓ^∞(F). To understand this better we notice that Proposition 1 implies the following relationships: if we define the projection of f ∈ L^∞ onto H,

    π_0(f) ≡ p_0 (f − P_0 f),   (15)

and if we assume p_0 > 0, then π_0(f) ∈ H, Dℓ(p_0)[π_0(f)] = 0 and Dℓ_n(p_0)[π_0(f)] = (P_n − P_0)f, so that:

Lemma 2. Suppose p_0 > 0. Let p̂_n be as in (3) and let P̂_n be the random probability measure induced by p̂_n. For any f ∈ L^∞ and P_n the empirical measure we have

Heuristically, the right-hand side equals, up to second order,

Control of (11) at a rate o_P(1/√n), combined with stochastic bounds on the second centred log-likelihood derivatives and convergence rates for p̂_n − p_0 → 0, thus gives some hope that one may be able to prove (P̂_n − P_0 − P_n + P_0)(f) = (P̂_n − P_n)(f) = o_P(1/√n), and that thus, by the central limit theorem for (P_n − P_0)f,

    √n (P̂_n − P_0)(f) →_d N(0, P_0 f² − (P_0 f)²)

as n → ∞.

Bounding the estimator and L 2 -convergence rate
We establish some first probabilistic properties of p̂_n that will be useful below: if p_0 is bounded away from zero then so is p̂_n on the interval [0, X_(n)], where X_(n) is the last order statistic. Similarly, if p_0 is bounded above then so is p̂_n with high probability.
Note next that since F_0 is strictly monotone we have X_i = F_0^{-1}(F_0(X_i)), where the U_(i)'s are distributed as the order statistics of a sample of size n of a uniform random variable on [0, 1] and where U_(0) = 0 by convention. Hence it suffices to bound

    Pr( U_(n) − U_(n−j) > ζj/(ξn) for some j = 1, ..., n ).

By a standard computation involving order statistics, the joint distribution of the U_(i), i = 1, ..., n, is the same as that of the Z_i/Z_{n+1}, where Z_i = Σ_{l=1}^i W_l and the W_l are independent standard exponential random variables. Consequently, for δ > 0, the probability in (18) splits into two terms A and B. The term A is, since δ > 0, less than an arbitrary ε/2 > 0 from some n onwards, by the law of large numbers. For the term B, Markov's inequality yields for ξ small enough a bound that tends to zero, using the facts that E W_1^p = p! and Var(W_1) = 1.

b) p̂_n is the left derivative of the least concave majorant of the empirical distribution function F_n. Since F_0 is concave and continuous, it maps [0, 1] onto [0, 1] and satisfies F_0(t) ≤ p_0(0) t ≤ t ‖p_0‖_∞, so that we obtain a bound in terms of F_n^U, where F_n^U is the empirical distribution function based on a sample of size n from the uniform distribution. The density of the order statistics of n uniform U(0, 1) random variables is n! on the set of all 0 ≤ x_1 < ··· < x_n ≤ 1. Let M ≥ ‖p_0‖_∞ and set C ≡ M/‖p_0‖_∞; then the complement of the event in (19) has probability given by an explicit computation with j = 2, ..., n − 1. In particular, the probability in (19) equals ‖p_0‖_∞/M and can be made small by choosing M large.
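The representation of uniform order statistics through cumulative sums of exponentials that drives this computation is easy to check numerically. The snippet below is our illustration (not from the paper): it builds the order statistics both ways and compares their empirical distribution functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Representation used in the proof: with W_1, W_2, ... i.i.d. standard
# exponential and Z_i = W_1 + ... + W_i, the vector
# (Z_1/Z_{n+1}, ..., Z_n/Z_{n+1}) is distributed as the order statistics
# of n i.i.d. uniform random variables on [0, 1].
w = rng.exponential(size=n + 1)
z = np.cumsum(w)
u_repr = z[:-1] / z[-1]

# direct construction of uniform order statistics, for comparison
u_sorted = np.sort(rng.random(n))

# the representation is automatically increasing and lies in (0, 1)
assert np.all(np.diff(u_repr) > 0)
assert 0.0 < u_repr[0] < u_repr[-1] < 1.0

# crude Kolmogorov-Smirnov style comparison of the two empirical CDFs
grid = np.linspace(0.0, 1.0, 21)
F1 = np.searchsorted(u_repr, grid) / n
F2 = np.searchsorted(u_sorted, grid) / n
print(float(np.max(np.abs(F1 - F2))))  # small for large n
```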
We can now derive the rate of convergence of the maximum likelihood estimator of a monotone density. The rate corresponds to functions that are once differentiable in an L 1 -sense, which is intuitively correct since a monotone decreasing function has a weak derivative that is a finite signed measure. The following convergence rate in Hellinger distance is given in Example 7.4.2 in van de Geer (2000). It is used here to derive the L 2 -convergence rate. Kulikov and Lopuhaä (2005) prove a much finer result by deriving the asymptotic distribution of the L k -errors under stronger smoothness assumptions.
Proposition 2. Suppose p_0 ∈ P_mon and that p_0 is bounded. Let p̂_n satisfy (3). Then

    h(p̂_n, p_0) = O_{P_0^N}(n^{-1/3})

and also

    ‖p̂_n − p_0‖_2 = O_{P_0^N}(n^{-1/3}).

Proof. The statement for the Hellinger distance is contained in Example 7.4.2 in van de Geer (2000). p_0 is bounded by assumption and ‖p̂_n‖_∞ = O_{P_0^N}(1) by Lemma 3b). The result in L_2-distance then follows from the inequality ‖p̂_n − p_0‖_2² ≤ 2(‖p̂_n‖_∞ + ‖p_0‖_∞) h²(p̂_n, p_0).
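The inequality invoked in the last step is elementary; here is the pointwise computation (our sketch, under the convention h²(p, q) = ∫(√p − √q)², which is the one consistent with the constant 2 above):

```latex
% For a = \hat p_n(x), b = p_0(x) \ge 0:
(a - b)^2 = (\sqrt a - \sqrt b)^2 (\sqrt a + \sqrt b)^2
          \le 2\,(a + b)\,(\sqrt a - \sqrt b)^2 ,
% since (\sqrt a + \sqrt b)^2 \le 2(a + b). Integrating over [0,1]:
\|\hat p_n - p_0\|_2^2
  \le 2\bigl(\|\hat p_n\|_\infty + \|p_0\|_\infty\bigr)
      \int_0^1 \bigl(\sqrt{\hat p_n} - \sqrt{p_0}\,\bigr)^2
  = 2\bigl(\|\hat p_n\|_\infty + \|p_0\|_\infty\bigr)\, h^2(\hat p_n, p_0).
```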

Putting things together
The maximiser p̂_n is in some sense an object that lives on the boundary of P: it is piecewise constant with step discontinuities at the observation points, exhausting the possible 'roughness' of a monotone function.
We can construct line segments in the parameter space through p_0, following the philosophy of Lemma 1. In order to ensure that the perturbed function lies again in P, we will perturb a non-increasing density p_0 ≥ ζ > 0 with weak derivative Dp_0 by ηh, where h ∈ L^∞ satisfies ∫_0^1 h(x) dx = 0 and Dh is absolutely continuous with respect to Dp_0 with Radon–Nikodym density satisfying ‖Dh/Dp_0‖_{∞,Dp_0} < ∞. Then indeed p_0 + ηh ∈ P_mon for η of absolute value small enough, since D(p_0 + ηh) = Dp_0 + ηDh is then a negative measure, so that p_0 + ηh is non-increasing, where p_0 + ηh is possibly modified on a null set to equal the integral of Dp_0 + ηDh. A similar statement holds if we replace h by π_0(f) defined in (15) when ‖f‖_∞ + ‖Df/Dp_0‖_{∞,Dp_0} is finite. For simplicity we write p_0 + ηπ_0(f) for the integral of its weak derivative.
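Why small |η| suffices can be made explicit (our display of the argument just sketched):

```latex
% Writing dDh = (Dh/Dp_0)\,dDp_0 and using that Dp_0 is a negative measure,
D(p_0 + \eta h) = Dp_0 + \eta\, Dh
  = \Bigl(1 + \eta\,\frac{Dh}{Dp_0}\Bigr)\, Dp_0 .
% The prefactor satisfies, Dp_0-almost everywhere,
1 + \eta\,\frac{Dh}{Dp_0}
  \ge 1 - |\eta|\,\Bigl\|\frac{Dh}{Dp_0}\Bigr\|_{\infty,Dp_0}
  \ge 0
\quad\text{whenever}\quad
|\eta| \le \Bigl\|\frac{Dh}{Dp_0}\Bigr\|_{\infty,Dp_0}^{-1},
% so D(p_0 + \eta h) is again a negative measure and p_0 + \eta h is
% non-increasing.
```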
For p_0 and f as above we thus obtain from Lemma 1 that, on events of probability as close to one as desired,

    |Dℓ_n(p̂_n)[π_0(f)]| ≤ C |Dℓ_n(p̂_n)[p_0 − p̂_n]|

for some constant C that depends on K and ζ only. Note that the differential calculus from Proposition 1 applies, since p̂_n, p_0, as well as all points on the line segment (1 − t)p̂_n + t p_0, t ∈ (0, 1), lie in U(X_1, ..., X_n) ∩ P, using Lemma 3a). We next need to derive stochastic bounds on the likelihood derivative at p̂_n in the direction of p_0.
Lemma 5. Suppose p_0 is bounded and satisfies inf_{x∈[0,1]} p_0(x) ≥ ζ > 0. Then Dℓ_n(p̂_n)[p̂_n − p_0] = O_{P_0^N}(n^{-2/3}).

Proof. By Lemma 3 we can restrict to an event where p̂_n is bounded above and bounded away from zero on [0, X_(n)], and by (21) further to an event where ‖p̂_n − p_0‖_2 ≤ M n^{-1/3} for some finite constant M. For any a_n → 0 with n a_n → ∞ and some c > 0,

    Pr(1 − X_(n) > a_n) = Pr(X_(n) < 1 − a_n) = (F_0(1 − a_n))^n ≤ (1 − c a_n)^n → 0;

in particular we obtain for a_n = log n/n that 1 − X_(n) = O_{P_0^N}(log n/n). Let us define the random function p̄_n^{-1} ≡ p̂_n^{-1} on [0, X_(n)] and zero on (X_(n), 1]. By Dℓ_n(p̄_n) and Dℓ(p̄_n) we denote the corresponding right-hand sides in (6) and (7). We observe that Dℓ_n(p̄_n) = Dℓ_n(p̂_n). The function h ≡ p̄_n^{-1}(p̂_n − p_0) on [0, 1] and h ≡ 0 elsewhere is of bounded variation, with norm ‖h‖_BV ≡ ‖h‖_1 + |Dh|(R) bounded by a fixed constant C that depends only on k, ξ and ‖p_0‖_∞. We observe that Dℓ(p_0)[p̂_n − p_0] = 0 by (7), and obtain the asserted bound, where we have used Theorem 3.1 in Giné and Koltchinskii (2006) with H = id, σ = M n^{-1/3}, F = const, combined with the bracketing entropy bound for monotone functions (van der Vaart and Wellner, 1996, Theorem 2.7.5) and its straightforward generalisation to bounded variation functions to control the supremum of the empirical process, (21) to control the second term, and (24) for the last integral.
We are now ready to prove Theorem 1 and Theorem 2.
Proof of Theorem 1. We use Lemma 2, Proposition 1, the fact that p̂_n, p_0 ∈ U(X_1, ..., X_n) by Lemma 3, and a Taylor expansion up to second order to decompose (P̂_n − P_n)(f) into four terms, where p̄_n equals, on [0, X_(n)], some mean value between p̂_n and p_0, and p̄_n^{-1} is zero otherwise by convention. Here again D³ℓ_n(p̄_n) and D³ℓ(p̄_n) stand for the corresponding right-hand sides in (6) and (7). The first term is bounded using Proposition 3, giving the bound Bn^{-2/3} in probability. We define h ≡ p_0^{-1}(p̂_n − p_0)(f − P_0 f) on [0, 1] and h ≡ 0 elsewhere, so that the second term equals |(P_n − P_0)h|. With probability arbitrarily close to one we have ‖h‖_BV ≲ ‖f‖_∞ + ‖f‖_BV ≲ ‖f‖_∞ + ‖Df/Dp_0‖_{∞,Dp_0} and ‖h‖_{2,P} ≲ ‖f‖_∞ n^{-1/3}. The second term is then bounded, similarly as in (25) above, by

    sup_{h : ‖h‖_BV ≤ CB, ‖h‖_{2,P} ≤ MBn^{-1/3}} |(P_n − P_0)(h)| = O_{P_0^N}(Bn^{-2/3}).
The third term is bounded in the same way, using ‖p̂_n − p_0‖_BV = O(1) and noting that p̄_n, as a convex combination of p̂_n and p_0, has variation bounded by a fixed constant on [0, X_(n)], so that we can estimate the term by the supremum of the empirical process over a fixed BV-ball, using again Lemma 3 to bound p̄_n from below on [0, X_(n)]. Using the last fact, the fourth term is also seen to be of order ‖f‖_∞ ‖p̂_n − p_0‖_2² = O_P(Bn^{-2/3}) in view of (21), completing the proof of the first claim. The second claim follows from the fact that B is a bounded set in the space of bounded variation functions and thus a Donsker class.
Proof of Theorem 2. It is sufficient to prove the result for 1/2 < s < 1. Let {ψ_{lk}} be a boundary-corrected Daubechies wavelet basis, where we use the notation ψ_{−1,k} = φ_{l_0,k}. We denote Besov spaces by B^s_{pq} and will use their wavelet characterisation, see for example Giné and Nickl (2009), which also holds for Besov spaces on compact intervals using boundary-corrected wavelets (Giné and Nickl, 2014). We decompose the functions f in a ball F of C^s by using the projection π_{V_j}(f) onto the span of the wavelets up to resolution level j. Since C^s = B^s_{∞∞} for s ∉ N and since the C^1-norm is bounded by the B^1_{∞1}-norm, we have for the wavelet partial sum π_{V_j}(f) of f ∈ C^s

    ‖π_{V_j}(f)‖_{C^1} ≲ Σ_{l≤j} 2^{3l/2} max_k |⟨f, ψ_{lk}⟩| ≲ 2^{j(1−s)} ‖f‖_{C^s},

and since the class {f − π_{V_j}(f)} is contained in a fixed s-Hölder ball, which is a P_0-Donsker class for s > 1/2 in view of Corollary 5 in Nickl and Pötscher (2007), and has envelopes that converge to zero, we see that the third term in (26) is also o_{P_0^N}(1/√n) (since the empirical process is tight and has a degenerate Gaussian limit). The remaining claims follow from the fact that F is a P_0-Donsker class.