A Kiefer-Wolfowitz type of result in a general setting, with an application to smooth monotone estimation

We consider Grenander type estimators for monotone functions $f$ in a very general setting, which includes estimation of monotone regression curves, monotone densities, and monotone failure rates. These estimators are defined as the left-hand slope of the least concave majorant $\hat{F}_n$ of a naive estimator $F_n$ of the integrated curve $F$ corresponding to $f$. We prove that the supremum distance between $\hat{F}_n$ and $F_n$ is of the order $O_p\big((n^{-1}\log n)^{2/(4-\tau)}\big)$, for some $\tau\in[0,4)$ that characterizes the tail probabilities of an approximating process for $F_n$. In typical examples, the approximating process is Gaussian and $\tau=1$, in which case the convergence rate $n^{-2/3}(\log n)^{2/3}$ is in the same spirit as the one obtained by Kiefer and Wolfowitz (1976) for the special case of estimating a decreasing density. We also obtain a similar result for the primitive of $F_n$, in which case $\tau=2$, leading to a faster rate $n^{-1}\log n$, also found by Wang and Woodroofe (2007). As an application in our general setup, we show that a smoothed Grenander type estimator and its derivative are asymptotically equivalent in first order to the ordinary kernel estimator and its derivative.


Introduction
Grenander [8] proved that the maximum likelihood estimator of a distribution function F that is concave on its support is the least concave majorant F̂_n of the empirical distribution function F_n of the n independent observations. In the case where F is absolutely continuous with probability density function f, the concavity assumption on F simply means that f is nonincreasing on its support, and the so-called Grenander estimator of f is the left-hand slope of F̂_n. Kiefer and Wolfowitz [9] showed that F̂_n and F_n are close for large n and, as a consequence, that F̂_n enjoys optimality properties similar to those of F_n, with the advantage of satisfying the shape constraint of being concave. Roughly speaking, Kiefer and Wolfowitz [9] prove in their Theorem 1 that, if f is bounded away from zero with a continuous first derivative f′ that is bounded and bounded away from zero, then, with probability one, the supremum distance between F̂_n and F_n is of the order n^{−2/3} log n. Their main motivation was to prove the asymptotic minimax character of F̂_n. Their result easily extends to the case of an increasing density function, replacing the least concave majorant with the greatest convex minorant.
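As a small numerical illustration of this construction (the code and names are ours, not part of [9]), the least concave majorant of an empirical distribution function can be computed as the upper boundary of the convex hull of the points (X_(i), i/n), and the Grenander estimator as its left-hand slopes:

```python
import numpy as np

def least_concave_majorant(x, y):
    """Values of the least concave majorant of the points (x[i], y[i]) at x.

    The LCM is the upper boundary of the convex hull of the points, found by a
    monotone-chain scan; x must be strictly increasing.
    """
    hull = [0]
    for i in range(1, len(x)):
        # pop the last vertex while it lies on or below the chord to point i
        while len(hull) > 1:
            i0, i1 = hull[-2], hull[-1]
            if (y[i1] - y[i0]) * (x[i] - x[i1]) <= (y[i] - y[i1]) * (x[i1] - x[i0]):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])

# Empirical distribution function of a sample from a decreasing density (Exp(1)).
rng = np.random.default_rng(0)
xs = np.sort(rng.exponential(size=100))
grid = np.concatenate(([0.0], xs))        # prepend a = 0, where F(a) = 0
ecdf = np.arange(101) / 100               # F_n at the points of grid
lcm = least_concave_majorant(grid, ecdf)  # least concave majorant at the same points
grenander = np.diff(lcm) / np.diff(grid)  # left-hand slopes: nonincreasing
```

The slopes of the majorant are nonincreasing by construction, which is exactly the shape constraint obeyed by the Grenander estimator.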
In the setting of estimating an increasing failure rate, Wang [19] proves that, under appropriate assumptions, the supremum distance between the empirical cumulative hazard and its greatest convex minorant is of the order o_p(n^{−1/2}), again with the motivation of establishing asymptotic optimality of the constrained estimator. A similar result is proved in Kochar, Mukerjee and Samaniego [10] for a monotone mean residual life function. In the regression setting with a fixed design, Durot and Tocquet [7] consider the supremum distance between the partial sum process and its least concave majorant and prove that, if the regression function is decreasing with a continuous derivative that is bounded and bounded away from zero, then this supremum distance is of the order O_p(n^{−2/3}(log n)^{2/3}). They also provide a lower bound, showing that n^{−2/3}(log n)^{2/3} is the exact order of the supremum distance. A generalization to the case of a random design was developed by Pal and Woodroofe [14]. Similar results were proved for other shape-constrained estimators; see Balabdaoui and Wellner [1] for convex densities and Dümbgen and Rufibach [4] for log-concave densities.
Although the first motivation for Kiefer-Wolfowitz type of results has been asymptotic optimality of shape constrained estimators, other important statistical applications are conceivable. For instance, the Kiefer-Wolfowitz result was a key argument in Sen, Banerjee and Woodroofe [16] to prove that, although bootstrapping from the empirical distribution function F_n or from its least concave majorant F̂_n does not work for the Grenander estimator of a decreasing density function at a fixed point, the m out of n bootstrap, with m ≪ n, from F_n does work. Likewise, Durot, Groeneboom and Lopuhaä [6] use a Kiefer-Wolfowitz type of result to prove that a smoothed bootstrap from a Grenander-type estimator works for k-sample tests in a general statistical setting, which covers the monotone regression model and monotone density model among others. Mammen [12] suggests using such a result to make an asymptotic comparison of two different estimators for a monotone regression function: one of them is obtained by smoothing a Grenander type estimator and the other one is obtained by "monotonizing" a kernel estimator.
The aim of this paper is to establish a Kiefer-Wolfowitz type of result in the same general setting as considered in [6]. We recover the aforementioned Kiefer-Wolfowitz type of results as special cases of our general result. As an application of our result, we consider the problem of estimating a smooth monotone function: we provide an asymptotic comparison between an ordinary kernel estimator and a smooth monotone estimator.
The paper is organized as follows. In Section 2, we define our general setting and state our Kiefer-Wolfowitz type inequality. Applications to estimating smooth monotone functions are described in Section 3. Proofs are deferred to Section 4.

A Kiefer-Wolfowitz type inequality
First, we define our general setting as well as the notation that will be used throughout the paper. Then we state our main result. Finally, we illustrate our result on several classical settings, such as monotone density or monotone regression, and we recover the Kiefer-Wolfowitz type of results mentioned in Section 1 as special cases of our main theorem.

The setting
Suppose that, based on n observations, we have a càdlàg step estimator F_n at hand for a concave function F : [a, b] → R, where a and b are known reals. In the sequel, we assume that F is continuously differentiable with F(a) = 0 and we denote by f its first derivative, which means that

F(t) = ∫_a^t f(u) du, for t ∈ [a, b]. (1)

A typical example is the case where we have independent observations with common density f on [a, b], and where the estimator for F is the empirical distribution function F_n of the observations. Further details are given in Section 2.3, where two more examples are investigated. We will impose the following assumptions:

(A1) The function f is decreasing and continuously differentiable on [a, b], with 0 < inf_{t∈[a,b]} |f′(t)| ≤ sup_{t∈[a,b]} |f′(t)| < ∞.

(A2) There exists a constant C > 0 such that, for all x ≥ 0 and t ∈ [a, b], the process M_n = F_n − F satisfies E[ sup_{u∈[a,b], |u−t|≤x} (M_n(u) − M_n(t))^2 ] ≤ C x/n.

Furthermore, we assume that there exists an embedding either into Brownian motion or into Brownian bridge.
(A3) There exists a Brownian motion or Brownian bridge B_n, an increasing function L on [a, b], and constants q ≥ 3 and C > 0, such that

P( n^{1−1/q} sup_{t∈[a,b]} | F_n(t) − F(t) − n^{−1/2} B_n(L(t)) | > x ) ≤ C x^{−q}, for all x ∈ (0, n]. (2)

Note that we can write

B_n(t) = W_n(t) − ξ_n t, (3)

where W_n is a Brownian motion and ξ_n ≡ 0 if B_n is a Brownian motion, while ξ_n ∼ N(0, 1) is independent of B_n if B_n is a Brownian bridge. Finally, we require the following smoothness assumption.
(A4) The function L : [a, b] → R is increasing and continuously differentiable, with L′ bounded and bounded away from zero.

These are the usual assumptions when studying the L_p-error of isotonic estimators. Our assumptions (A1) and (A2) are identical to (A1) and (A2') in [5]. Our assumptions (A3) and (A4) are similar to (A4) in [5]. However, the latter assumption is somewhat stronger and requires q > 12 and a bounded second derivative of L.
It is explained in [5] that several classical models are covered by the above general framework; some of them will be discussed below. Moreover, note that (A1) resembles the usual assumption for a Kiefer-Wolfowitz type of result. Indeed, it can be proven that in the case where the derivative f′ vanishes on an interval with non-empty interior, the supremum distance between F̂_n and F_n is no longer of the order n^{−2/3}(log n)^{2/3}.

Main results in the general setting
We are mainly interested in the supremum distance between the estimator F_n^E and its least concave majorant F̂_n^E on [a, b]. The processes F_n^B and F_n^W are not estimators, in the sense that they are not built from the observations, but they play an important role as approximating processes for the estimator F_n^E.
In the particular case where S = B, W, this result can be made more precise by considering moments of the supremum distance between F_n^S and its least concave majorant F̂_n^S, rather than the stochastic order. The result for moments is stated in the following theorem.
for some K r > 0 that does not depend on n.
Since the supremum distance between least concave majorants is less than or equal to the supremum distance between the processes themselves, the triangle inequality yields (5). Theorem 2.1 with S = E now follows, using (5) together with Theorem 2.1 with S = B.
Remark 2.1 Note that the sharper Theorem 2.2 also holds with S = E if, instead of (A3), we make a slightly more restrictive assumption. More precisely, assume (A1), (A2), (A4), and instead of (2), assume that the corresponding moment bound (6) holds for some q ≥ 3 and some K that does not depend on n. By convexity, we have (a + b)^q ≤ 2^{q−1}(a^q + b^q) and therefore (6) yields (7), using that 2/3 ≤ 1 − 1/q. It thus follows from (7) and Theorem 2.2 that
for some K q that does not depend on n.
To conclude Subsection 2.2, we give the main steps of the proof of Theorem 2.2. The proof is inspired by the proof of Theorem 5.1 in [7]. One of the main ingredients is to show that, although the least concave majorant F̂_n^S depends on the whole process F_n^S, its value at a fixed point x depends only on the values of F_n^S in a small neighborhood of x. This is made precise in the following lemma, which states that, with probability tending to one, the least concave majorant of the process F_n^S at a point x, where S = B, W, is asymptotically identical to the least concave majorant of the restriction of this process to an interval with center x and length of the order n^{−1/3}(log n)^{1/3}. The lemma generalizes Lemma 5.1 in [7], where only the case S = W with the specific variance function L(t) = t was considered. In what follows, F̂_{n,c_n}^{(S,x)} denotes the least concave majorant of the restriction of F_n^S to such an interval with center x. Note that the right hand side of the bound in Lemma 2.1 tends to zero as n → ∞, provided c_0 is sufficiently large. As Lemma 2.1 proves that F̂_n^S(x) is close to the localized version F̂_{n,c_n}^{(S,x)}(x), another ingredient needed to prove our main results is a precise bound on the difference between F̂_{n,c_n}^{(S,x)}(x) and F_n^S(x). Such a bound is established in the following lemma.

Lemma 2.2 Assume (A1), (A2), and (A4). Let c_n = (c_0 log n)^{1/3} for some c_0 > 0 that does not depend on n. Using the same notation as in Lemma 2.1, there exist positive numbers K_1 and K_2 such that, for S = B, W, the stated bound holds for all A > 0 sufficiently large.
The proof of Lemmas 2.1 and 2.2, as well as the proof of Theorem 2.2 are deferred to Section 4.

Examples of specific settings
This section is devoted to specific settings to which Theorem 2.1 applies.

Monotone regression function.
We have observations y_1, . . . , y_n from a fixed-design regression model with a non-increasing regression function f. We assume that the errors ǫ_i are independent, having the same distribution with a finite variance σ^2 > 0. In this case, the estimator for F in (1) is the partial sum process F_n defined in (10). As a special case of Theorem 2.1 we obtain the following corollary.
Corollary 2.1 Suppose that E|ǫ_i|^q < ∞, for some q ≥ 3, and that (A1) holds. Let F_n be defined by (10) and let F̂_n denote the least concave majorant of F_n on [a, b]. Then

sup_{t∈[a,b]} | F̂_n(t) − F_n(t) | = O_P( n^{−2/3} (log n)^{2/3} ).
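To make this setting concrete, here is a small numerical sketch (our notation; the design points i/n and the particular f are illustrative choices, not prescribed by the paper). It uses the classical fact that the left-hand slopes of the least concave majorant of the cumulative sum diagram coincide with the least-squares nonincreasing (antitonic) fit of the observations, computed by the pool adjacent violators algorithm:

```python
import numpy as np

def antitonic_pava(y):
    """Least-squares nonincreasing fit of y, by pooling adjacent violators."""
    vals, wts = [], []
    for v in y:
        vals.append(float(v))
        wts.append(1)
        # merge the last two blocks while they violate the nonincreasing order
        while len(vals) > 1 and vals[-1] > vals[-2]:
            w = wts[-2] + wts[-1]
            merged = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            vals[-2:] = [merged]
            wts[-2:] = [w]
    return np.repeat(vals, wts)

# Toy monotone regression: y_i = f(i/n) + noise, with f(t) = 2 - t decreasing.
rng = np.random.default_rng(1)
n = 200
t = np.arange(1, n + 1) / n
y = (2.0 - t) + rng.normal(scale=0.3, size=n)

F_n = np.cumsum(y) / n        # partial-sum process at the points i/n
f_hat = antitonic_pava(y)     # slopes of the least concave majorant
F_hat = np.cumsum(f_hat) / n  # least concave majorant at the points i/n
```

In particular, F_hat majorizes F_n and the two agree at the right endpoint, as a least concave majorant of the partial sums must.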

Monotone density.
We have independent observations X_i, for i = 1, 2, . . . , n, with common density f : [a, b] → R, where a and b are known real numbers. The estimator for the distribution function F in this case is the empirical distribution function F_n in (11).

Corollary 2.2 Suppose that (A1) holds and that inf_{t∈[a,b]} f(t) > 0. Let F_n be defined by (11) and let F̂_n denote the least concave majorant of F_n on [a, b]. Then

sup_{t∈[a,b]} | F̂_n(t) − F_n(t) | = O_P( n^{−2/3} (log n)^{2/3} ).

Random censorship with monotone hazard.
We have right-censored observations (X_i, Δ_i), for i = 1, . . . , n, where X_i = min(T_i, C_i) and Δ_i = 1{T_i ≤ C_i}. The failure times T_i are assumed to be nonnegative and independent with distribution function G, and are independent of the i.i.d. censoring times C_1, . . . , C_n. Note that in this setting, we only consider the case a = 0, since this is more natural. The estimator for the cumulative hazard F is defined via the Nelson-Aalen estimator N_n as follows: let t_1 < · · · < t_m denote the ordered distinct uncensored failure times in the sample, let n_k denote the number of individuals at risk just before time t_k, and let d_k denote the number of uncensored failures at t_k. Then

N_n(t) = Σ_{k : t_k ≤ t} d_k / n_k, for t ≥ t_1,

with N_n(t) = 0 for all t < t_1 and N_n(t) = N_n(t_m) for all t ≥ t_m. The estimator F_n is the restriction of N_n to [0, b].
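A minimal sketch of this computation (our notation: delta[i] = 1 marks an uncensored failure; the toy data are illustrative):

```python
import numpy as np

def nelson_aalen(x, delta):
    """Nelson-Aalen estimator of the cumulative hazard from right-censored data.

    x[i] is the observed time min(T_i, C_i); delta[i] is 1 for an observed
    failure and 0 for a censored observation.  Returns the ordered distinct
    uncensored times t_k together with N_n(t_k) = sum_{t_j <= t_k} d_j / n_j,
    where d_j counts failures at t_j and n_j counts subjects at risk just
    before t_j.
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    times = np.unique(x[delta == 1])
    increments = [np.sum((x == t) & (delta == 1)) / np.sum(x >= t) for t in times]
    return times, np.cumsum(increments)

# Four subjects: the one observed at time 2 is censored, the rest are failures.
times, cumhaz = nelson_aalen([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```

Between uncensored failure times the estimator is constant, matching the conventions N_n(t) = 0 for t < t_1 and N_n(t) = N_n(t_m) for t ≥ t_m.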

Estimating a smooth monotone function
In many applications, the parameter of interest f : [a, b] → R, e.g., a density function, a regression mean, or a failure rate, is known to be non-increasing (the non-decreasing case can be treated likewise), so it is natural to incorporate this shape constraint into the estimation procedure. Consider the setting of Section 2.1. A popular estimator for f under the constraint that f is non-increasing is the Grenander-type estimator f̂_n, defined on (a, b] as the left-hand slope of the least concave majorant F̂_n of F_n. This estimator is a step function and, as a consequence, it is not smooth. Moreover, the rate of convergence of f̂_n is n^{1/3} if f has a first derivative f′ that is bounded away from zero, whereas competitive smooth estimators may have faster rates in cases where f is smooth, provided that the smoothing parameter on which they depend is chosen in an appropriate way. On the other hand, such estimators typically do not satisfy the monotonicity constraint. In this section, we are interested in estimators that are both non-increasing and smooth. Based on our Theorem 2.1, we will prove that such an estimator achieves the optimal rate of convergence under certain smoothness conditions. The estimator is obtained by smoothing the Grenander-type estimator f̂_n, and resembles the estimators m_IS in [12] and m_n in [13]. In this way, one first applies an isotonization procedure followed by smoothing. Note that a natural alternative would be to interchange the two steps, that is, first smooth and then isotonize (for example, by taking the L_2 projection of a smooth estimator on the space of non-increasing functions). However, this second proposal may have the drawback of not being smooth. It may happen that the two proposals are asymptotically equivalent in first order if f is smooth; see [12] for a precise statement in the regression setting.
See also [17] for a comparison of the second proposal with an ordinary kernel estimator and with the Grenander estimator when estimating a monotone density with a single derivative.
We will compare a non-increasing smooth estimator to an ordinary kernel-type estimator f̃_n, corrected at the boundaries in such a way that f̃_n converges to f with a fast rate over the whole interval [a, b] (whereas the non-corrected kernel estimator may show difficulties at the boundaries). To define f̃_n, we consider a sequence of positive bandwidths h_n and a non-negative kernel function K : R → [0, ∞) that has a continuous second derivative, is supported on [−1, 1], and integrates to one. At the boundary regions [a, a + h_n) and (b − h_n, b], we consider the local linear bias correction; see, e.g., [18]. We are interested in f̂_ns, the estimator defined in the same manner as f̃_n, with F_n replaced by its least concave majorant F̂_n. Thus, f̂_ns is a smoothed version of the Grenander-type estimator f̂_n, linearly extended at the boundaries. In the following lemma, we establish that this estimator is monotone, provided K is non-negative and supported on [−1, 1]. A similar result was obtained in [13], page 743, in the regression setting for a log-concave kernel K.

Proof. Let p_1, . . . , p_m be the jump sizes of the Grenander-type estimator f̂_n at the points of jump τ_1 < · · · < τ_m ∈ (a, b]. When we define K_{h_n}(t) = h_n^{−1} K(t/h_n), for t ∈ R, and τ_0 = a, then for t ∈ [a + h_n, b − h_n], we can write f̂_ns(t) as the finite sum (16) over the jump sizes p_i. Differentiating (16) and using that K is supported on [−1, 1], together with the fact that (t − a)/h_n ≥ 1, yields f̂′_ns(t) ≤ 0 for all t ∈ [a + h_n, b − h_n]. In particular, we have f̂′_ns(a + h_n) ≤ 0 and f̂′_ns(b − h_n) ≤ 0, so it immediately follows from the definition of f̂_ns that f̂_ns is also non-increasing on the intervals [a, a + h_n] and [b − h_n, b]. Since f̂_ns is continuous, we conclude that f̂_ns is non-increasing on the whole interval [a, b].
Note that according to (16), for t ∈ [a + h_n, b − h_n], the estimate f̂_ns(t) can be computed as a finite sum over the jump sizes p_i of the Grenander-type estimator f̂_n. Note, moreover, that f̂_n can easily be computed using the pool adjacent violators algorithm (PAVA) or a similar device, see, e.g., [2]. Thus, the monotone smooth estimator f̂_ns(t) is easy to implement. This was already pointed out in [6], Section 4.2, where the estimator is used in the bootstrap calibration of a statistical test for monotone functions. Such estimators have also been considered in [13] and [12].
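To illustrate why this estimator is easy to implement, the following sketch smooths a toy nonincreasing step function with the Epanechnikov kernel. Because the step function has finitely many jumps, the convolution can be evaluated exactly through the integrated kernel, in the spirit of the finite sum (16); the particular kernel, the toy step function, and the function names are our choices:

```python
import numpy as np

def int_epanechnikov(s):
    """Integrated Epanechnikov kernel: IK(s) = int_{-1}^{min(s,1)} 0.75*(1-u^2) du."""
    s = np.clip(s, -1.0, 1.0)
    return 0.5 + 0.75 * s - 0.25 * s**3

def smooth_step(breaks, levels, h, t):
    """Exact kernel smoothing, at bandwidth h, of a nonincreasing step function.

    The step function equals levels[j] on [breaks[j], breaks[j+1]); returns the
    convolution int K_h(t - u) f(u) du, intended for t in [a + h, b - h].
    """
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for j in range(len(levels)):
        lo, hi = breaks[j], breaks[j + 1]
        out += levels[j] * (int_epanechnikov((t - lo) / h) - int_epanechnikov((t - hi) / h))
    return out

breaks = np.array([0.0, 0.3, 0.7, 1.0])  # jump points of a toy step estimator
levels = np.array([3.0, 2.0, 1.0])       # nonincreasing levels
h = 0.1
t = np.linspace(0.1, 0.9, 81)            # interior points of [a + h, b - h]
f_ns = smooth_step(breaks, levels, h, t)
```

The smoothed curve is again nonincreasing, in line with the monotonicity lemma above, and reproduces each level exactly wherever the kernel window does not overlap a jump.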
Next, as an application of Theorem 2.1, we establish that the smoothed Grenander-type estimator f̂_ns is close to the kernel estimator f̃_n. This will ensure that the two estimators are asymptotically equivalent in first order.
where we used (17) with l = 0, 1 in the last inequality. Combining this with a similar argument on [b − h_n, b], together with an application of (17) for l = 0 on [a + h_n, b − h_n], completes the proof of the lemma.
Thanks to Lemma 3.2, we are able to derive the limit behavior of f̂_ns from that of f̃_n. To give an example of application of the lemma, suppose that f belongs to a Hölder class H(L, α), for some L > 0 and α ∈ (1, 2], which means that f has a first derivative satisfying |f′(x) − f′(y)| ≤ L|x − y|^{α−1} for all x, y ∈ [a, b]. It is known that, in typical settings (including the specific settings investigated in Section 2.3 above), the kernel estimator defined by (13), with bandwidth

h_n = R_n n^{−1/(2α+1)}, where 0 < R_n + R_n^{−1} = O_P(1), (18)

and a kernel function K satisfying ∫ uK(u) du = 0, satisfies f̃_n(x) − f(x) = O_P(n^{−α/(2α+1)}) for all fixed x ∈ (a, b) independent of n. Moreover, this rate of convergence is optimal in the minimax sense in typical settings, e.g., see Theorem 2.3 in [3]. With h_n defined as in (18) we have h_n^{−1} n^{−2/3}(log n)^{2/3} ≪ n^{−α/(2α+1)}, so it follows from Lemma 3.2 combined with (19) that f̂_ns(x) − f(x) = O_P(n^{−α/(2α+1)}) as well. This means that f̂_ns(x) has the same limit distribution as f̃_n(x), and that f̂_ns(x) achieves the same minimax rate of convergence, provided that h_n is chosen according to (18). According to Lemma 3.2, the smoothed Grenander-type estimator f̂_ns is asymptotically equivalent in first order to the ordinary kernel estimator f̃_n, provided that both estimators are based on a smoothing parameter h_n satisfying (18). The literature provides various adaptive methods for calibrating the smoothing parameter of an ordinary kernel estimator, e.g., see [11]. Therefore, one can use any of these methods to calibrate the smoothing parameter of the smoothed Grenander-type estimator, so that it achieves the minimax rate.
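The bandwidth comparison above boils down to an inequality between exponents: with h_n of order n^{−1/(2α+1)}, the term h_n^{−1} n^{−2/3} has exponent 1/(2α+1) − 2/3, which lies strictly below the minimax exponent −α/(2α+1) exactly when (1 + α)/(2α + 1) < 2/3, that is, when α > 1 (logarithmic factors aside). A quick numerical check of this elementary inequality (the grid of α values is ours):

```python
import numpy as np

# Exponent of h_n^{-1} n^{-2/3} when h_n ~ n^{-1/(2a+1)}, versus the minimax
# exponent -a/(2a+1); a positive gap means the Kiefer-Wolfowitz contribution
# is of strictly smaller order than the minimax rate.
alphas = np.linspace(1.001, 2.0, 500)
kw_exponent = 1.0 / (2 * alphas + 1) - 2.0 / 3.0
minimax_exponent = -alphas / (2 * alphas + 1)
gap = minimax_exponent - kw_exponent
```

The gap closes as α decreases to 1, which is consistent with the requirement α ∈ (1, 2] in the example above; at α = 2 the gap equals 1/15.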
Proofs

Proof of Corollary 2.2. It immediately follows from Theorem 6 in [5] that (A2) holds. Moreover, from the proof of Theorem 6 in [5] it can be seen that, due to the Hungarian embedding, (A3) holds for arbitrary q with L = F and B_n a Brownian bridge. Since f is bounded and bounded away from zero, (A4) also holds with L = F, so Corollary 2.2 follows from Theorem 2.1.
Proof of Corollary 2.3. It immediately follows from Theorem 3 in [5] that (A2) holds under the assumptions of Corollary 2.3. Similarly to Theorem 3 in [5], it can be proved that (A3) holds for arbitrary q under the assumptions of Corollary 2.3, with B_n a Brownian bridge and an appropriate function L. With this choice of L, assumption (A4) also holds under the assumptions of Corollary 2.3, so Corollary 2.3 follows from Theorem 2.1.
The following result on the increments of Brownian motion will be used several times.
Proof of Lemma 2.1. The proof is inspired by the proof of Lemma 5.1 in [7]. Recall that we only need to prove the lemma for the case [a, b] = [0, 1]. Moreover, possibly enlarging K_1, we only need to prove the lemma for n sufficiently large. For all x ∈ [0, 1], let x̂_i^S and x̂_s^S denote, respectively, the infimum and supremum of suitable sets of points, with the conventions that the infimum of an empty set is (x + 2c_n n^{−1/3}) ∧ 1 and the supremum of an empty set is (x − 2c_n n^{−1/3}) ∨ 0. If F̂_n^S(x) ≠ F̂_{n,c_n}^{(S,x)}(x), then we must have either x̂_i^S > x or x̂_s^S < x. Therefore, it suffices to prove that there exist positive numbers K_1 and K_2 such that (20) and (21) hold, provided n is sufficiently large. We will only prove (20), since (21) can be proven with similar arguments. If x̂_i^S > x for some x, then, by definition, there exist y ≤ x − 2c_n n^{−1/3} and z ≥ x such that the line segment joining (y, F_n^S(y)) and (z, F_n^S(z)) is above (t, F_n^S(t)) for all t ∈ (y, z). This line segment is above (x − c_n n^{−1/3}, F_n^S(x − c_n n^{−1/3})), which implies that the slope of the line segment joining (y, F_n^S(y)) and (x − c_n n^{−1/3}, F_n^S(x − c_n n^{−1/3})) is smaller than the slope of the line segment joining (z, F_n^S(z)) and (x − c_n n^{−1/3}, F_n^S(x − c_n n^{−1/3})). For any fixed α ∈ R, this implies the same slope comparison for the shifted process t ↦ F_n^S(t) − αt. In particular, with α_x = f(x) + c_n n^{−1/3}|f′(x)|, we arrive at a bound of the form (22): P(x̂_i^S > x for some x ∈ [0, 1]) ≤ P_1 + P_2, for suitable probabilities P_1 and P_2. Furthermore, with t_x = n^{−2/3} c_n^2 f′(x)/4, we have P_1 ≤ P_{1,1} + P_{1,2}. We first consider P_{1,1}. From (A1), the derivative f′ (which is defined respectively as the right and the left derivative of f at the boundary points 0 and 1) is continuous on the compact interval [0, 1]. Therefore, f′ is uniformly continuous on this interval. Since c_n n^{−1/3} tends to zero, we obtain by using Taylor's expansion that the relevant increments of F can be expanded up to a (1 + o(1)) factor, where the o(1) term is uniform in x ∈ [0, 1]. Therefore, with M_n^S = F_n^S − F, we obtain a bound on P_{1,1} in terms of inf_{t∈[0,1]} |f′(t)|(1 + o(1)).
Hence, for n sufficiently large, P_{1,1} is bounded by the probability that n^{2/3} times the relevant supremum of increments of M_n^S exceeds the corresponding threshold. By definition of F_n^S, for all t ∈ [0, 1] we have the representation (24), where W_n and ξ_n are taken from (3). Moreover, L is increasing with a bounded first derivative, which implies (25). Combining (24) and (25) yields a bound in terms of sup_{t∈[0,1]} L′(t).
It thus follows from (23) that P_{1,1} is bounded by the sum of the two probabilities on the right hand side of (26), provided n is sufficiently large. Since ξ_n is distributed as a standard Gaussian variable, the first probability on the right hand side of (26) is bounded from above by exp(−K_2 c_n^2 n^{1/3}), for a suitable constant K_2. Furthermore, using again (25) shows that the second probability on the right hand side of (26) is bounded from above by a probability involving increments of W_n. It then follows from Lemma 4.1 that there exist positive numbers K_1 and K_2 that only depend on L and f, such that this probability satisfies a bound of the same exponential form. Since c_n^{−1} n^{1/3} ≥ 1 for n sufficiently large, possibly diminishing the constants K_2 above, we conclude that there exist positive numbers K_1 and K_2 such that (29) holds, provided n is sufficiently large. Next, consider P_{1,2}. For all x ∈ [0, 1] and z ≥ 1, let Y_n(x, z) be the corresponding normalized increment. Let ε = inf_{t∈[0,1]} |f′(t)| and let a be a real number with aε > 2 sup_{t∈[0,1]} |f′(t)| (which implies that a ≥ 2). Moreover, recall that α_x = f(x) + c_n n^{−1/3}|f′(x)| and t_x = n^{−2/3} c_n^2 f′(x)/4. Now, distinguish between z ≥ a and z ∈ [1, a]. For all z ≥ a, Taylor's expansion yields a suitable bound on the drift term. Denote by E_n the set of all (x, z) such that x ∈ [0, 1] and z ∈ [a, x n^{1/3}/(2c_n)]. Using (24) then yields (31). Since ξ_n is distributed as a standard Gaussian variable, the first probability on the right hand side of (31) is bounded from above by exp(−K_2 c_n^2 n^{1/3}), for a suitable constant K_2. To bound the second probability on the right hand side of (31), denote by k_n the integer part of n^{1/3} c_n^{−1} and, for all j ∈ {0, 1, . . . , k_n + 1}, let t_j = j/k_n. Setting y = 2c_n n^{−1/3} z or y = 0, the triangle inequality allows us to replace the supremum over (x, z) ∈ E_n by suprema over the grid points t_j. For all j, Lemma 4.1 yields a bound with some positive K_1 and K_2 that depend only on L and f. Therefore, it follows from the definitions of k_n and c_n that there exist positive numbers K_1 and K_2 such that the resulting probability satisfies the required exponential bound, provided n is sufficiently large, where we changed scale and origin in the Brownian motion.
By (3.3) in [7], we have a Gaussian tail bound valid for all positive numbers b and d. Using that L′ is bounded and bounded away from zero, we conclude that there exists K > 0 such that this bound holds for all j. We conclude that there exist positive numbers K_1 and K_2 such that the corresponding bound holds uniformly over E_n, where we recall that E_n denotes the set of all (x, z) with x ∈ [0, 1] and z ∈ [a, x n^{1/3}/(2c_n)]. Using (30), we conclude that the contribution of z ≥ a to P_{1,2} satisfies the required bound. Next, we consider the case z ∈ [1, a] and establish an upper bound for the probability on the right hand side of (33). Since c_n n^{−1/3} tends to zero as n → ∞ and f′ is uniformly continuous on [0, 1], we have a Taylor expansion in which 2c_n^2 |f′(x)| z(1 − z) ≤ 0 and the o(c_n^2) term is uniform in z ∈ [1, a] and x ∈ [0, 1]. Therefore, the desired inequality holds for all z ∈ [1, a] and x ∈ [0, 1], provided n is sufficiently large, so it follows from (33), using (24), where we recall that W_n and ξ_n are taken from (3), that the probability on the right hand side is bounded from above by the sum of two terms. With similar arguments as used to bound (26), we obtain that there exist positive numbers K_1 and K_2 such that the corresponding bound holds. We have already proved that P_1 ≤ P_{1,1} + P_{1,2}, where P_{1,1} satisfies (29), so we derive from (22) that, for some positive K_1 and K_2, P(x̂_i^S > x for some x ∈ [0, 1]) ≤ P_2 + K_1 n^{2/3} c_n^{−2} exp(−K_2 c_n^3). To deal with P_2, one can write P_2 ≤ P_{2,1} + P_{2,2}, where P_{2,1} and P_{2,2} are defined analogously, with α̃_x = f(x̃) + c_n n^{−1/3}|f′(x̃)|/4 and t̃_x = n^{−2/3} c_n^2 f′(x̃)/4, where x̃ = x − c_n n^{−1/3}/2. One can then conclude, using similar arguments as above, that there exist positive numbers K_1 and K_2 such that the analogous bound holds for P_2. According to (24), since |η| ≤ 2c_n for all η ∈ I_n(x), we obtain a bound in terms of sup_{t∈[0,1]} L′(t).
According to Lemma 4.1, there exist positive numbers K_1 and K_2 such that, for all A > 0, we have

P( n^{1/6} sup_{u,v} (W_n(u) − W_n(v)) > A c_n^2/4 ) ≤ K_1 n^{1/3} c_n^{−1} exp(−K_2 A^2 c_n^3).
Moreover, ξ_n is standard Gaussian and therefore

P( 2c_n n^{−1/6} |ξ_n| sup_{t∈[0,1]} L′(t) > A c_n^2/4 ) ≤ exp(−K_2 A^2 n^{1/3} c_n^2),

where K_2 = (8 sup_{t∈[0,1]} L′(t))^{−2}/2. Since n^{1/3} ≥ c_n for sufficiently large n, combining this with (36) yields the desired inequality for n sufficiently large. Possibly enlarging K_1, the inequality still holds true for small n, so it holds true for all integers n. This completes the proof of (9).

Now let X_n^S denote the normalized supremum distance under consideration, where c_n is defined as in Lemmas 2.1 and 2.2. It follows from Fubini's theorem that the moments of X_n^S can be expressed as in (40), where we used the fact that a probability is smaller than or equal to one, and performed the change of variable v = u^{1/r}. To bound the probability on the right hand side of (40), note that, from the triangle inequality, P(X_n^S > v) is bounded by the probability that n^{2/3} times the supremum distance between F_n^S and F exceeds the corresponding threshold, since F is equal to its least concave majorant, and the supremum distance between least concave majorants is less than or equal to the supremum distance between the processes themselves. Hence, we obtain a bound in terms of W_n and ξ_n taken from (3). As in the proof of Lemma 4.1, the first probability on the right hand side is bounded from above by 2 exp(−v^2 c_n^4 n^{−1/3}/(2^7 L(1))). Since ξ_n is standard Gaussian, the second probability on the right hand side is bounded from above by exp(−v^2 c_n^4 n^{−1/3}/(2^7 L(1)^2)). Thus, there exists a positive number K_3 such that