CDF and Survival Function Estimation with Infinite-Order Kernels

A reduced-bias nonparametric estimator of the cumulative distribution function (CDF) and the survival function is proposed using infinite-order kernels. Fourier transform theory on generalized functions is utilized to obtain the improved bias estimates. The new estimators are analyzed in terms of their relative deficiency to the empirical distribution function and Kaplan-Meier estimator, and even improvements in terms of asymptotic relative efficiency (ARE) are present under specified assumptions on the data. The deficiency analysis introduces a deficiency rate which provides a continuum between the classical deficiency analysis and an efficiency analysis. Additionally, an automatic bandwidth selection algorithm, specially tailored to the infinite-order kernels, is incorporated into the estimators. In small sample sizes these estimators can significantly improve the estimation of the CDF and survival function as is illustrated through the deficiency analysis and computer simulations.


Introduction
We consider the problem of estimating the CDF in contexts of independently and identically distributed (iid) data and randomly right-censored data. Indeed, the seminal paper of Kaplan and Meier [11] solves this problem with the product-limit estimatorthe nonparametric maximum likelihood estimator of the CDF-but there is still room for improvement, especially when the sample size is small.
The most obvious drawback of the Kaplan-Meier estimator, like the empirical distribution function (EDF), is its lack of smoothness. Kernel smoothing easily remedies this problem, but also introduces two new issues of choosing the best kernel and bandwidth.
Kernel smoothing also improves the estimator mean square error (MSE) performance by decreasing its variance while introducing a slight bias resulting in an overall improvement of the MSE. The MSE improvement, however, is typically only a second-order improvement, since the original estimator's first-order MSE convergence rate already achieves the best-possible √ n-rate. When the asymptotic relative efficiency (ARE) between the Kaplan-Meier estimator and its smoothed counterpart is one, as is typically the case, a distinction in performance can be measured by considering the asymptotic relative deficiency, or just simply the deficiency between the two estimators. The general notion of deficiency and subsequent calculations with the proposed estimators is provided in Section 3 which also illustrates that an actual increase in efficiency can be achieved with the new estimators under certain (rather strong) assumptions of the distribution function.
Higher-order MSE improvement is influenced by the kernel order-the higher the kernel order, the greater the improvement. Therefore the best kernel-based estimators, the ones with smallest asymptotic MSE, are the estimators that use infinite-order kernels. Current methods traditionally invoke second-order kernels [23] and more recently a hybrid kernel estimator has been investigated [13], but infinite-order kernel methods allow for the greatest improvement in bias rates without affecting the rates of the variance. The main argument against the use of large-order kernels in density estimation is the concern that the estimator may be negative on some intervals when it is known that the true probability density is always nonnegative. This argument, however, is moot in the density estimation context (so also in the CDF estimation context) since the estimator can easily be truncated to zero when it goes negative then renormalized to have a total area of one without affecting the MSE convergence rate. General construction of the infinite-order kernel estimators are introduced in the following section and a compatible bandwidth selection algorithm that adapts to the infinite-order kernels is described in Section 4.
Another pitfall of all kernel estimators of the density is the lack of consistency at boundary points when the support of the density lies in an interval or half-interval.
Simple reflection [25] solves this problem in the density estimation context and an analogous fix also exists for CDF estimators. Boundary correction and standardization methods specific to kernel-smoothed CDF estimators are discussed in Section 5.
Simulations with iid and censored data illustrate the effectiveness of the infiniteorder kernel estimators coupled with the automatic bandwidth selection algorithm of Section 6. Uniform improvement in MSE over existing estimators is observed in the simulations. Since estimation of the CDF is so fundamental in standard statical analysis, there are many applications of the new estimators beyond just estimating the underlying CDF. Some of these applications are included in the last section on Discussions and Conclusions.

Estimation with Flat-Top Kernels
The analysis will be confined to independently and identically distributed (iid) data, but extensions to randomly right censored with possible left truncation can be more generally derived; cf. [3,24].
Let X 1 , . . . , X n be independent 1 and identically distributed random vectors in R with absolutely continuous distribution function F and corresponding probability den-sity function f . Estimation of f with infinite-order kernels was considered in [21] and [3]; here we consider the integration of those estimators in the construction of the CDF estimator.
The traditional estimator of the CDF is the empirical distribution function, or EDF, which is given byF where I(·) represents the indicator function. The kernel estimator of the probability density, f , is then given bŷ where K is a kernel that integrates to one (but not necessarily nonnegative!) and h is the bandwidth parameter. To insure consistency off h , h should satisfy the condition h → 0 as n → ∞ but with nh → ∞.
The smoothed estimation of the CDF,F h , is constructed by integratingf h . That The estimatorF h (t) is equivalent to the EDF in terms of first-order asymptotic performance, but improvements are achieved in the higher-order terms. The estimator F h (t) effectively smooths the EDF, decreasing its variance at the cost of introducing a slight bias. The variance improvement is uniform across different kernels, affecting only the second-order constant and not the second-order rate (refer to equation (2) below); however the additional bias that gets introduced in the smoothing can be minimized significantly by using kernels of large order with infinite-order kernels providing the most benefit. The variance ofF h (t), as derived in [15], is given by The bandwidth parameter h only enters the variance expression through the secondorder term which is negative. So the larger h is, the smaller the variance ofF h (t) becomes. However, we will see below in Theorem 1 that the smaller h is, the smaller the bias ofF h (t) becomes. Therefore there is an optimal h that strikes a compromise between the bias and variance terms which is presented in Corollary 1 below.
We now construct a family of infinite-order kernels, following [21], that are derived from "flat-top functions". We start with a continuous, real-valued function κ given by where g is any continuous, square-integrable function that is bounded in absolute value by one and satisfies g(|c|) = 1. The region |s| < c is referred to as the "flattop neighborhood", but in some cases we may wish to relax the requirement to allow g(s) ≈ 1 when s is close to c. This "effective flat-top neighborhood" is useful when using an infinitely smooth function κ(s) as described in [19] and Section 6 below. The Fourier transform of κ then produces the infinite-order kernel, K, of interest. Specifically, The MSE ofF h (t) with an infinite-order kernel K is now computed under various assumptions on the smoothness of the underlying density. Let φ(t) be the characteristic The following three assumptions quantifies the degree of smoothness of the density f (x) by the rate of decay of its characteristic function.
Assumption A(p): There is a p > 0 such that Assumption B: There are positive constants d and D such that |φ(t)| ≤ De −d|t| .
Assumption C: There is a positive constant b such that φ(t) = 0 when |t| ≥ b.
Theorem 1. LetF h (t) be a kernel smoothed estimator of the CDF with an infiniteorder kernel derived from a flat-top function.
(i) Suppose assumption A(p) holds, then (ii) Suppose assumption B holds, then To optimize the amount of smoothing under the MSE criterion-i.e., to optimize the bandwidth h-we choose the bandwidth that allows the squared bias rates to be comparable to the second-order variance rates. The optimal bandwidths are provided in the following corollary.
(i) Suppose assumption A(p) holds. Letting h ∼ an −β where a is any positive constant and β = (2p + 1) −1 optimizes the tradeoff between the bias and variance of F h (t) and gives (ii) Suppose assumption B holds. Letting h ∼ a/ log n where a < 2d is a constant optimizes the tradeoff between the bias and variance ofF h (t) and gives (iii) Suppose assumption C holds. Letting h ≤ 1/b be fixed guarantees zero bias and the best possible variance rate.
Estimation of the survival function with randomly right censored data can be similarly improved with the smoothing of the Kaplan-Meier estimator with infinite-order kernels. Density estimation of censored data with infinite-order kernels is analyzed in [3], and an estimator of the survival function can be similarly derived from this density estimator through integration as in (1). The same conclusions as Theorem 1 and Corollary 1 will also hold for the smoothed version of the Kaplan-Meier estimator with infinite-order kernels. This is detailed in the following theorem where the proof has been omitted as it follows naturally from the iid case above.
DefineŜ h (t) to be a smoothed estimator of the survival function, derived from smoothing the Kaplan-Meier estimator with an infinite-order kernel of the form given in (4); i.e.,Ŝ where s j is the height of the jump of the Kaplan-Meier estimator at X j (cf. [3] for more details). The following theorem is consistent with the results described in [14].
Theorem 2. LetŜ h (t) be a kernel smoothed estimator of the survival function as in when h ∼ an −β where a is any positive constant and β = (2p + 1) −1 .
The analysis under assumptions B and C of the above theorem are considerably more complex and have been omitted.

Deficiency
The notion of deficiency was introduced in the article "Deficiency" by Hodges and Lehmann [10] wherein several deficiency calculations are provided. Many articles followed suit using the deficiency concept to compare kernel-smoothed estimators, but many of the approaches used in calculating the deficiency strayed from the original and simple techniques employed by Hodges and Lehmann; c.f. [1,5,6,7,23,26].
The simplicity of the original deficiency computations is maintained in the proof of The deficiency concept is described as follows. Given an estimator, S m , based on a sample of size m and a more efficient estimator, T n , based on a sample of size n with equivalent performance as S m . The difference between the sample sizes, d = m − n, defines the relative deficiency between the two estimators. The original paper of Hodges and Lehmann mostly dealt with situations where d approaches a finite limit as n goes to infinity in which case the two estimators have an asymptotic relative efficiency (ARE) of one. However, it is still possible for two estimators to have an ARE of one yet with a deficiency that approaches infinity. Therefore calculation of the rate in which d approaches infinity gives a generalization of the original deficiency concept.
In the following theorem, a formula is derived for computing the generalized deficiency between two estimators from their MSE performance which explicitly computes the rate at which d approaches infinity.
Theorem 3. Suppose the mean squared errors of two estimators S n and T n are given Define m = m(n) to be the sample size for which MSE(T m ) equals (up to a second order term) MSE(S n ). Then the asymptotic deficiency of T n relative to S n is d = m − n and In the next theorem, the deficiency of two estimators is calculated when the secondorder term in the MSE expansion decreases at the rate n r log n which is very close to the leading term of n r . Therefore the deficient index, d, will approach infinity at a faster rate indicating a larger discrepancy in the performance of the two estimators. (i) Suppose assumption A(p) holds. When h ∼ an −β where a > 0 is constant and (ii) Suppose assumption B holds. When h ∼ a/ log n where a < 2d is a constant, the deficiency ofF h (t) relative toF (t) is n.

Bandwidth Selection
We now present a simple bandwidth selection algorithm that requires very minimal computation and adapts to the specialized family of infinite-order kernels that is utilized in this paper. The methods suggested in [20] for iid data and in [3] for censored data present an algorithm that automatically selects the optimal bandwidth in density estimation. Remarkably, these same algorithms can also be used to select the best bandwidth in CDF estimation. Although the bias in estimating the CDF is smaller than the bias of the density estimators, the variance of the CDF estimator is also smaller than the variance of the density estimator. This algorithm automatically adapts to the appropriate assumption A(p), B, or C and generates a bandwidth that is consistent for the ideal bandwidth given by Corollary 1. The algorithm is also computationally light as well as being simple to describe, and we now proceed to describe it.
Letφ be the natural estimate of the characteristic function given bŷ In the context of censored data,F (x) in the above expression is replaced with the Kaplan-Meier estimator of the CDF. The main key to the algorithm is finding when φ(t) ≈ 0; more specifically, determining the smallest value t * such that φ(t) ≈ 0 for all t ∈ (t * , t * + ε) for some pre-specified ε. Then the estimate of the bandwidth is given byĥ = 1/t * . The formal algorithm is presented below.

Bandwidth Selection Algorithm
Let C > 0 be a fixed constant, and ε n be a nondecreasing sequence of positive real numbers tending to infinity such that ε n = o(log n). Let t * be the smallest number such that |φ(t)| < C log 10 n n for all t ∈ (t * , t * + ε n ) Then letĥ = c/t * where c is the "flat-top radius" depicted in equation (3).
The positive constant C is irrelevant in the asymptotic theory, but is relevant for finite-sample calculations. The main idea behind the algorithm is to determine the smallest t such that φ(t) ≈ 0. In most cases this can be visually seen without explicitly computing the threshold in (6).

Boundary Correction and Standardization
Vanilla versions of the kernel estimators for density estimation break down when the support of the density is restricted to a subset of the real line. For instance, in estimating the probability density function of data taken from an exponential distribution, most kernel estimators give substantial area to negative values even when it is known that the support of the density is nonnegative. It is not too difficult to see that simple kernel estimators of the density will not be consistent at the boundary of the density's support; cf. [25]. However, a simple remedy by reflection works well when the support is not too complex. For instance when the support of the density is [a, ∞), then the is consistent at the boundary point a ( [25]).
This problem, therefore, also carries over to the situation of estimating the CDF.
Indeed the EDF and Kaplan-Meier estimators do not suffer from this drawback, but the kernel smoothed versions do. By integrating (7), we deduce a boundary-corrected version of the kernel-smoothed CDF estimator with the same formulation as (7). For andF h (t) = 0 when t < a. In the special case a = 0, we have the simple formulâ There is an additional issue that only affects higher-order kernel estimators and not second-order estimators. Specifically, higher-order kernel estimators of the density are not necessarily nonnegative, which means higher-order kernels estimators of the CDF are not necessarily contained within the range [0, 1] or forced to be nondecreasing. The natural remedy for these density estimators is to truncate negative estimates to zero and then renormalize the area to one. When this is performed, the corresponding CDF estimator will be a valid CDF. However this approach causes the kernel estimator of the CDF to lose its simplistic representation that is given in the right-hand side of (1), so instead, a simple alternative standardization technique is suggested. To insure the estimator is nondecreasing,F h (t) is replaced by sup (−∞,t)Fh (t), and to insure the range is between 0 and 1, max(F h (t), 1) and min(F h (t), 0) are invoked.
ReplacingF h (t) with sup (−∞,t)Fh (t) is equivalent to replacing the estimator of the densityf h (x) with the truncated versionf + h (x) = max(f h (x), 0) and then integrating the truncated density estimator from −∞ to t. Sincef + h (x) has better MSE performance than the nontruncated counterpartf h (x) [22], it follows that the nondecreasing estimator sup (−∞,t)Fh (t) has better MSE performance than the originalF h (t). Similarly, the MSE of the range restricted estimator produced from max(F h (t), 1) and min(F h (t), 0) will also be improved since it is known the CDF has a range bounded in [0,1]. This is formalized in the following corollary.
Corollary 3. LetF h (t) be as in Theorem 1. A modified estimator is defined as Then it follows that andF h (t) satisfies the necessary properties of a CDF.

Simulations
We evaluate the performance of the proposed infinite-order kernel estimators with the more traditional second-order kernel estimators and the EDF/Kaplan-Meier estimator.
Boundary correction, as described in Section 5, is applied to the estimators when appropriate. As any choice of function g(x) in (3)  By making the flat-top function κ(x) infinitely smooth, the resulting kernel via the Fourier transform will have tails that decay exponentially. Therefore in situations in estimating the density with boundary conditions, the kernel derived from the infinitely smooth flat-top function is more close to having the desirable quality of being compactly supported than the kernel which is derived from the trapezoidal function. One example of an infinitely smooth κ(x) is [17] κ(s) = which resembles and infinitely smooth trapezoid and is controlled by the two parameters b and c. In the simulations, we also used this function κ for comparisons with the parameters b = 1 and c = .05. A plot of this κ is given below. This function is perfectly flat only from 0 to .05, but it is "effectively" flat from 0 to about .5. Therefore the effective flat-top radius is taken to be .5, and it is this value that is used in the bandwidth selection algorithm described above in Section 4.
A slightly modified bandwidth selection algorithm was invoked that retains the function of the bandwidth algorithm described above. The key in the bandwidth algorithm is to find the smallest value of t * so thatφ(t * ) ≈ 0. To automate this procedure, the value t * was chosen to be the first value for whichφ(t * ) starts to level off.
A Gaussian kernel is used in the second-order kernel estimator, and cross validation, as suggested in [4], is used to select the bandwidth for this estimator. Estimates were simulated over 1000 realizations.
The first simulation study considers the estimation of a N (0, 1) CDF from iid data.
One may imagine the second-order Gaussian kernel estimator to do quite well in this context, but in fact the infinite-order kernel performs uniformly better. MSE estimates are provided at three points (t = −1.5, 0, 1.5) and under two different sample sizes (n = 15, 30).  Here again the infinite-order kernels consistently outperform the second-order kernel estimator and the Kaplain-Meier estimator in term of MSE performance. In particular, the smoothed trapezoid is shown to perform well near the boundary point which can be attributed to its exponential tails making it more compactly supported.   Another very standard use of the EDF is found in the bootstrap method. In the smoothed bootstrap, data is drawn from a smoothed EDF, and when the estimator of the smoothed EDF is improved, the smoothed bootstrap is also improved to give more accurate inferences [9,18]. The bootstrap method is particularly beneficial when sample sizes are small, and therefore invoking infinite-order kernel estimators in this situation is often very natural.
Hazard function estimation on small samples can also be significantly be improved.
Hazard estimators, constructed from dividing a smoothed density estimate by a smoothed survival function, as in [12], have performance that is typically dictated by the convergence of the density estimator [3]. However in small sample sizes, accurate estimation of the survival function is just as crucial as accurate estimation of the density.
The new infinite-order kernel estimators of the CDF and survival function is shown through analysis and demonstrated through simulations to be more accurate than the EDF and Kaplain-Meier estimators with significant improvements seen in small sample sizes and data from a distribution that has a rapidly decaying characteristic function.
Significant improvements in terms of an increase in efficiency is also produced by the new estimators when the characteristic function of the data is identically zero after some finite value. Additionally, the bandwidth selection algorithm that accompanies the new estimator is computationally simpler with faster convergence rates than the cross-validation bandwidth selection algorithms used with finite-order kernels.

A Technical Proofs
Proof of Theorem 1.
From the following computation computing the bias ofF h (t) amounts to computing the bias ofK t−X i h . Starting with its expectation, we have If we define K h (t) = 1 h K t h , then the expectation above can be written in very simply as where denotes convolution.
To proceed, we will employ Fourier transform theory on (mathematical) distributions, otherwise known as generalized functions. By invoking generalized functions, we can compute the Fourier transform of not just the standard class of integrable functions, but also many non-integrable functions like constants and cumulative distribution functions. This theory, in general, is very technical and readers unfamiliar with the subject are referred to [2] for a nice treatment of the subject.
As K is the Fourier transform of κ, κ is therefore the inverse Fourier transform of K. Through a simple change of variables, we have where the notation F and F −1 will represent the Fourier transform and its inverse.
Next we wish to derive the Fourier transform of the CDF F (t). This is the first generalized function that we encounter and its Fourier transform involves the Dirac delta function, δ(s). Using the Heaviside step function H(x) given by H(x) = 1(x > 0), we rewrite F (t) as Therefore the Fourier transform of F (t) reduces to the product of the Fourier transforms of f (x) and H(x); i.e.
We will now proceed with estimating the bias ofF h (t).
bias F h (t) = K h F (t) − F (t) The last equality comes from the flat-top property of κ function; specifically, κ(sh) = 1 for |sh| ≤ 1 implies κ(sh) − 1 = 0 for |s| ≤ 1/h. Since κ is bounded by one, we have the following bound on the bias ofF h (t), And under assumption C, Proof of Theorem 4.
The proof of Theorem 4 follows the same lines as the proof of Theorem 3 with n δ replaced with log n.