Uniform in bandwidth consistency of kernel estimators of the density of mixed data

We establish a general uniform in bandwidth consistency result for kernel estimators of the unconditional and conditional joint density of a distribution defined by a mixed discrete and continuous random variable.

MSC 2010 subject classifications: Primary 60F15, 62G07; secondary 62G08.


Introduction
Kernel nonparametric function estimation methods have long attracted a great deal of attention. Although they are popular, they present only one of many approaches to the construction of good function estimators. These include, for example, nearest-neighbor, spline, neural network, and wavelet methods. These methods have been applied to a wide variety of data. In this article, we shall restrict attention to the construction of consistent kernel-type estimators of joint (unconditional and conditional) densities based on mixed data, that is data with both discrete and continuous components.
When faced with such data, researchers have traditionally resorted to a "frequency" approach. This involves splitting the continuous data into subsets ("cells") according to the realizations of the discrete data, in order to produce consistent estimators. However, as the number of subsets increases, the amount of data in each cell tends to decrease, leading to a "sparse data" problem. In such cases, there may be insufficient data in each subset to deliver sensible density estimators (they will be highly variable). Aitchison and Aitken [1] proposed a novel extension of the kernel density estimation method to a discrete data setting in a multivariate binary discrimination context.
The approach we consider below uses "generalized product kernels". For the continuous components of a variable we use standard kernels (Epanechnikov, etc.), and for a general multivariate unordered discrete component we apply the kernels suggested by Aitchison and Aitken [1]. In the case of ordered categorical data, alternative approaches can be used, essentially applying nearest-neighbor weights (see, e.g., Wang and van Ryzin [20], Burman [3] and Hall and Titterington [10]). Smoothing methods for ordered categorical data have been surveyed by Simonoff [18, Sec. 6]. For illustration purposes, we show how this can be done using a kernel estimator proposed by Wang and van Ryzin [20].
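To make the generalized product kernel concrete, here is a minimal sketch (function names and the Epanechnikov choice are illustrative assumptions, not code from the works cited): it combines standard scaled kernels for the continuous components with the Aitchison and Aitken [1] kernel for unordered discrete components.

```python
import numpy as np

def epanechnikov(u):
    """Standard Epanechnikov kernel for a continuous component."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def aitchison_aitken(x, z, lam, r):
    """Aitchison-Aitken kernel for an unordered discrete component taking
    values 0, ..., r-1: weight 1 - lam on a match, lam/(r-1) otherwise."""
    return np.where(np.asarray(x) == np.asarray(z), 1.0 - lam, lam / (np.asarray(r) - 1))

def product_kernel(xc, zc, h, xd, zd, lam, r):
    """Generalized product kernel: product of scaled continuous kernels
    over the p continuous components times discrete kernels over the q
    unordered discrete components."""
    cont = np.prod(epanechnikov((xc - zc) / h) / h)
    disc = np.prod(aitchison_aitken(xd, zd, lam, r))
    return float(cont * disc)
```

For instance, at a matching discrete value with λ = 0 and bandwidth h = 1, the kernel reduces to the continuous factor alone.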
Mason and Swanepoel [13] introduced a general method based on empirical process techniques to prove uniform in bandwidth consistency of a wide variety of kernel-type estimators. It is a distillation of results of Einmahl and Mason [8] and Dony et al. [5], whose work was motivated by the original groundwork of Nolan and Marron [14]. The goal of the present paper is to provide a general uniform in bandwidth consistency result for kernel estimators of the joint density of a distribution, which is defined by a mixed discrete and continuous random variable. We shall use the setup of Li and Racine [11] and show that the general Theorem of Mason and Swanepoel [13] applies to it. Our results will imply uniform in bandwidth consistency of the kernel density estimators for mixed discrete and continuous data of Li and Racine [11] and the kernel estimator of the conditional density for such data of Hall, Racine and Li [9].
In Section 2 we introduce and describe our basic setup, and some needed notation, constructions and assumptions. We prove our main technical result in Section 3 and in Section 4 we use it to prove a uniform in bandwidth consistency theorem for kernel density estimators of mixed data. Applications are given in Section 5. Section 6 contains the material from Mason and Swanepoel [13] that we use to prove our results. We conclude in Section 7 with an appendix on pointwise measurability.

Some basic notation, a probability construction and assumptions
In order to state and prove our results we shall need the following basic setup, notation, probability constructions and assumptions. First, we focus on the case when we have a mix of continuous and general multivariate unordered (nominal) variables. The case when the discrete variables are ordered (ordinal) will be dealt with at the end of Section 4.

The Li and Racine setup
We shall take our basic setup from Li and Racine [11], using the notation (with some modifications) of Hall, Racine and Li [9]. For p ≥ 1, q ≥ 1, let X = (X^c, X^d) be a random vector, where X^c takes values in R^p and X^d is discrete. Assume that X^d takes on a finite number of values, and let K be a measurable real-valued function on R satisfying conditions (K.i)-(K.iv) stated in Subsection 2.4.1 below. From now on we assume, for convenience of labeling, that for each 1 ≤ k ≤ q, X^d_k takes on the values 0, 1, . . . , r_k − 1, where r_k ≥ 2.
Notice that the display above holds for each λ ∈ Γ. For any 0 < δ < 1, let Γ(δ) be defined as above; we see that the corresponding bound holds uniformly in λ ∈ Γ(δ). Consider the Aitchison and Aitken [1] kernel estimator p_n(x^d, λ) of p(x^d).

Remark 1.
Although p_n(x^d, λ) was initially proposed by Aitchison and Aitken [1] as a smooth estimator of p(x^d) in a multivariate binary data discrimination context, it has since often been applied to the analysis of general multivariate unordered discrete variables. Note that when λ = 0, the estimator p_n(x^d, λ) reduces to the conventional frequency estimator p̂_n(x^d) = n^{-1} N_n(x^d). Therefore, the smoothed estimator p_n(x^d, λ) includes the frequency estimator as a special case. From a statistical perspective it is known (see, e.g., Brown and Rundell [2], and Ouyang et al. [16]) that the smooth estimator p_n(x^d, λ) may introduce some finite sample bias; however, it may also reduce the variance substantially, leading (with a bandwidth λ that balances bias and variance) to a reduction in the mean squared error of p_n(x^d, λ) relative to the frequency estimator p̂_n(x^d).
Ouyang et al. [16] provide an informative discussion of some further interesting properties of p_n(x^d, λ). It is, among other things, pointed out that p_n(x^d, λ) can be viewed as a Bayes-type estimator because it is a weighted average of a uniform probability and a frequency estimator. Their simulation studies also show that p_n(x^d, λ), particularly when used in conjunction with a data-driven method of bandwidth selection such as least-squares cross-validation, performs much better than the commonly used frequency estimator p̂_n(x^d), especially when some of the discrete variables are uniformly distributed (a specific definition of "uniformly distributed variables" is provided in their Section 2).
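As a small numerical illustration of Remark 1 and the Bayes-type interpretation above (a sketch under the labeling convention of Subsection 2.1, with hypothetical variable names), the smoothed estimator is a λ-weighted average of the frequency estimator and the uniform probability:

```python
import numpy as np

def smoothed_freq(sample, x, lam, r):
    """Aitchison-Aitken smoothed estimator p_n(x, lam) of P(X^d = x) for a
    univariate unordered variable with support {0, ..., r-1}."""
    return float(np.where(sample == x, 1.0 - lam, lam / (r - 1)).mean())

sample = np.array([0, 0, 1, 2])  # toy sample with r = 3 categories
# lam = 0 recovers the frequency estimator N_n(x)/n;
# lam = (r-1)/r collapses the estimate to the uniform probability 1/r.
```

Algebraically, smoothed_freq equals (1 − λr/(r−1)) p̂_n(x) + λ/(r−1), exhibiting the weighted-average (Bayes-type) form noted by Ouyang et al. [16].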
Our aim is firstly to study the uniform in bandwidth consistency of the estimators introduced above. Our objective is to establish the result stated in Theorem 2 in Section 4. In order to do this we must first build some needed framework and machinery.

Some useful classes of functions
In order to apply the Mason and Swanepoel [13] general uniform in bandwidth consistency theorem we must introduce the following classes of functions.
Notice that there is a one to one correspondence between T × (0, 1] and (0, 1]^p, given in (2.6). From this class we form the class G_{K,0} of measurable real-valued functions of z ∈ R^p defined in (2.8). Using this notation, together with the one to one correspondence given in (2.6), we obtain the representation displayed above.

Remark 2.
The class of functions given in this subsection can be used to apply the Theorem in Mason and Swanepoel [13] to obtain uniform in bandwidth consistency results for multivariate kernel estimators based on a vector of smoothing parameters, where the components may be different.

A useful probability construction
We shall see that the following probability construction will come in very handy.
are independent of each other. For each x^d and n ≥ 1, recall the definition of N_n(x^d) given in (2.4). We find that, for any class F of measurable real-valued functions ϕ, the distributional identity displayed above holds. To see the kind of argument that establishes this distributional identity, consult the proof of Proposition 3.1 of Einmahl and Mason [6].

Assumptions
Here are our basic assumptions on the kernel and the joint and marginal densities.

Assumptions on the kernel K
The kernel K satisfies the following conditions (K.i)-(K.iv). Note that (K.ii) and (K.iv) imply that, for any h > 0, the bound displayed above holds.

Assumptions on the joint and marginal densities
For j ∈ {1, . . . , p}, and for a measurable subset A ⊂ R^p and ε > 0, we define A_ε as displayed above.

Technical result
In this section we establish a technical result that will be used in the next section to prove our uniform in bandwidth theorem for kernel density estimators for mixed discrete and continuous data.
For any i ≥ 1 and x^d ∈ D, set Z_i(x^d) as displayed above. In the following proposition, for g_{t,x^c} ∈ G_K, we use the notation given above. (Here and elsewhere in these notes, log x denotes the natural logarithm of the maximum of x and e.)

Proposition 1. Let K satisfy (K.i)-(K.iv) and the marginal densities fulfill (f.i). Then for any c > 0 the bound (3.11) holds, where the constant appearing in it is a finite constant depending on c, x^d, and the stated assumptions on the kernel K and the marginal densities.
Proof. Throughout the proof keep in mind that A is the set used in assumption (f.i) and to define the class G_K in (2.7). Choose any x^d ∈ D. Notice the representation that holds for any g_{t,x^c} ∈ G_K. (See the notation (6.33) below.) The assumptions of Proposition 1 allow us to apply the general Theorem of Mason and Swanepoel [13] (see below) with G = G_K to conclude (3.11). In particular, we see that (K.ii) implies that (G.i) holds (assumptions (G.i)-(G.iv) are stated in Subsection 6.2). Also, it is readily shown using (f.i) and (K.ii) that (G.ii) is fulfilled, that is, (3.12) holds for some constant. To see this, observe that g_{t,x^c}(·, h) is zero off the set B_{t,h}(x^c), and that for all h small enough, uniformly in x^c ∈ A and t ∈ T, B_{t,h}(x^c) ⊂ A_ε, so that (f.i) holds. From these observations (3.12) follows. The results in the Appendix prove that (K.i) implies that the pointwise measurability assumption (G.iii) holds for the class G_{K,0}. (Note that in assumption (F.ii) of Mason and Swanepoel [13], G should be G_γ.) For any 1 ≤ j ≤ p, define the class of functions K_j as above. Using assumption (K.i), an application of Lemma 22 of Nolan and Pollard [15] shows that each K_j satisfies (G.iv). Further, since by assumption (K.ii) |K| is bounded by some κ > 0, we can apply Lemma A.1 of Einmahl and Mason [7] to infer that G_{K,0} satisfies (G.iv).

Main technical result
Here is our main technical result. In the following, for any λ ∈ Γ, g_{t,x^c} ∈ G_K and x^d ∈ D, we use the notation displayed above.

Theorem 1. Let K satisfy (K.i)-(K.iv) and the marginal densities fulfill (f.i). Then for any choice of c > 0 and 0 < b_0 < 1 we have, with probability 1,

(3.13)
where c_n = c log n/n and B(c) is a finite constant depending on c and the stated assumptions on the kernel K and the marginal densities.
In order to prove the theorem we require the following lemma.

(3.14)
where c_n = c log n/n and C(c, z^d) is a finite constant depending on c, z^d and the stated assumptions on the kernel K and the marginal densities.
Proof. Choose any z^d ∈ D. Notice that by Wald's identity the displayed equality holds. Since the assumptions of Proposition 1 hold, the sequence of random variables {N_n(z^d)}_{n≥1} is independent of {Z_n(z^d)}_{n≥1}, and N_n(z^d) → ∞ with probability 1, we see that, for every d_0 > 0, with probability 1, the stated bound holds for some finite constant. Now since, with probability 1, N_n(z^d)/n → p(z^d) > 0, and thus d_{N_n(z^d)} ≤ 2d_0 log n/(np(z^d)) for all large enough n, and 2d_0 log n/(np(z^d)) ≤ c_n for small enough d_0 > 0, we see from (3.15) that the claim follows. Next, for each g_{t,x^c} ∈ G_K, we get, using the assumptions on K, (f.i) and (2.9), that the displayed bound holds for all h > 0 small enough. Thus, by the law of the iterated logarithm, with probability 1, the corresponding bound holds for some C_0 > 0. The proof of (3.14) now follows from (3.16) and (3.17) and the Kolmogorov zero-one law.

Proof of Theorem 1. Notice that, as a process in the underlying observations, the displayed distributional identity holds.
(Recall the probability construction in Subsection 2.3.) From this we obtain the displayed identity. Noting that each |K^d_λ(x^d, z^d)| ≤ 1, we see then, using (3.19), with |D| denoting the cardinality of D, that the stated bound holds by Lemma 2, with probability 1. The Kolmogorov zero-one law now completes the proof.

Uniform in bandwidth consistency theorem
For any δ > 0, define the set displayed above, where Γ is as in (2.2). Given sequences 0 < a_n < b_n < 1, set H_n as above. Note that if h_1 = · · · = h_p = h, then H_n becomes H_n = {h ∈ (0, 1] : a_n ≤ h ≤ b_n}.

Theorem 2. Let K satisfy (K.i)-(K.iv) and the marginal densities fulfill (f.i).
For any sequences 0 < a_n < b_n < 1 and 0 < δ_n < 1 satisfying b_n → 0, δ_n → 0 and na_n/log n → ∞, and whenever the density f(·|x^d) is uniformly continuous on the subset A_ε of R^p for some ε > 0, we have, with probability 1, the stated uniform consistency. In order to prove the theorem we require the following lemma. Let {ε_n}_{n≥1} be a sequence of positive constants such that ε_n → 0 as n → ∞, and set H(ε_n) = {h ∈ (0, 1]^p : max h ≤ ε_n}.

Lemma 3. Let K satisfy (K.i)-(K.iv) and the marginal densities fulfill (f.i). Whenever, for a given z^d ∈ D, f(·|z^d) is uniformly continuous on A_ε, we have (4.21).
Notice that when (K.i)-(K.iv) are satisfied, we get the displayed bound by using (2.9). Hence, with ε_n(p) defined accordingly, and using the assumption that f(·|z^d) is uniformly continuous on A_ε, we get (4.21), keeping in mind that ε_n → 0 as n → ∞.
Proof of Theorem 2. Notice that, by the one to one correspondence given in (2.6), for any x^d ∈ D the displayed relation holds, where h = max h. Since, by the probability construction in Subsection 2.3, the two processes have the same distribution, we can assume for the purpose of proving limit results that we have equality in (4.22). We see then, keeping in mind the one to one correspondence given in (2.6), that the maximum in question is, by (3.13), almost surely bounded for some constant C > 0.

Let max λ = max{λ_1, . . . , λ_q}. Notice that for each λ ∈ Γ the displayed inequality holds for x^c ∈ R^p and x^d ∈ D. Theorems 1 and 2 then again hold with Γ, K^d_λ and D replaced by Γ^o, K^{d,o}_λ and D^o, respectively, where D^o is a finite subset of D. This follows from the inequality above and an exact repetition of the steps in the proofs above.
In practice, it is likely that some of the discrete variables will have natural orderings while the others will be unordered. Following Section 2.5 of Racine [17], let X̄^d denote a q_1 × 1 vector (say the first q_1 components of X^d) of discrete variables that do not have a natural ordering (1 ≤ q_1 ≤ q), and let X̃^d denote the remaining discrete variables that do have a natural ordering. In this case, we can construct a product kernel combining the two kinds of discrete kernels. Then the conclusions of Theorems 1 and 2 remain unchanged using this kernel. The proofs of this claim are identical to those above.
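For the ordered components, the kernel proposed by Wang and van Ryzin [20] can be sketched as follows (a minimal illustration under the geometric-decay form commonly attributed to them; the function name is hypothetical):

```python
import numpy as np

def wang_van_ryzin(x, z, lam):
    """Wang-van Ryzin kernel for an ordered discrete variable, 0 <= lam < 1:
    mass 1 - lam on a match, geometric decay 0.5 * (1 - lam) * lam**d in the
    distance d = |x - z| otherwise, so nearby categories receive more weight."""
    d = np.abs(np.asarray(x) - np.asarray(z))
    return np.where(d == 0, 1.0 - lam, 0.5 * (1.0 - lam) * lam ** d)
```

Unlike the Aitchison-Aitken kernel, which weights all mismatches equally, this kernel respects the ordering: the weight shrinks as the categories move further apart.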

Application to Li and Racine estimator
In this subsection we shall apply Theorem 3.1 of Li and Racine [11] to obtain a uniform in bandwidth consistency result for their estimator. They treat the density estimator of f(x^c, x^d) in the case h_i = h for i = 1, . . . , p and λ_j = λ for j = 1, . . . , q. Also, their h_i is our h_i^{1/p}, so in our notation their estimator takes the form displayed above.
Their setting is a bit different from ours. However, this does not affect the conclusion of their Theorem 3.1; see their comment on the general multivariate discrete case following the statement of Theorem 3.1. Keeping in mind that their h_i is our h_i^{1/p}, if one assumes, in addition to the conditions of our Theorem 2, those of their Theorem 3.1, one gets for their cross-validation estimators ĥ and λ̂ of the smoothing parameters h and λ the displayed rates, where for appropriate c_1 > 0 and c_2 > 0, α = min{2, p/2} and β = min{1/2, 4/(4 + p)}. This implies that λ̂ = o_P(1) and, for appropriate 0 < a < b < ∞, with probability converging to 1, ĥ lies between the stated bounds. Thus, we can apply Theorem 2 to conclude the corresponding uniform consistency with h = (ĥ, . . . , ĥ) and λ = (λ̂, . . . , λ̂).
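The Li and Racine estimator treated in this subsection (all continuous bandwidths equal to h, all discrete bandwidths equal to λ) can be sketched as follows; the Epanechnikov kernel and the function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def mixed_density(Xc, Xd, xc, xd, h, lam, r):
    """Li-Racine type kernel density estimator f_n(xc, xd) for mixed data.
    Xc: (n, p) continuous sample; Xd: (n, q) unordered discrete sample with
    component k supported on {0, ..., r[k]-1}; h and lam are scalar bandwidths."""
    # product of scaled continuous kernels over the p components, per observation
    cont = np.prod(epanechnikov((xc - Xc) / h) / h, axis=1)
    # product of Aitchison-Aitken discrete kernels over the q components
    disc = np.prod(np.where(Xd == xd, 1.0 - lam, lam / (np.asarray(r) - 1)), axis=1)
    return float(np.mean(cont * disc))
```

With λ = 0 the estimator reduces to the frequency ("cell") approach: only observations whose discrete values match x^d contribute to the continuous kernel average.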

Application to Hall, Racine and Li estimator
The Hall, Racine and Li [9] setup is as follows. Assume that X = (X^c, X^d), for p ≥ 1, q ≥ 1, is as in the Li and Racine [11] setup. Introduce an additional continuous real-valued random variable Y and assume that (X, Y) = (X^c, X^d, Y) has joint density f(x, y) = f(x^c, x^d, y), with marginal density m(x) = f(x). They study the kernel estimator of the conditional density of Y given X = x. In order to apply our Theorem 2 we assume that the displayed conditions hold for x^c and z = (z_1, . . . , z_p) ∈ R^p, and for y and z_0 ∈ R, with L being a kernel with the same properties as K. Notice that the Hall, Racine and Li [9] h_j are h_j^{1/(p+1)} in our notation. If one assumes, in addition to the conditions of our Theorem 2, those of their Theorem 2, one gets for their cross-validation estimators ĥ and λ̂ of the smoothing vectors h and λ the stated rates, for appropriate a_i > 0, i = 0, . . . , p, and b_j > 0, j = 1, . . . , q, whenever all of the variables (X^c, X^d) are relevant in the sense of Hall, Racine and Li [9]. Therefore we can apply Theorem 2 to get the desired uniform in bandwidth consistency.

It will be necessary in our presentation to distinguish between G and G_0. Always keep in mind that functions g ∈ G are defined on S × (0, 1] and functions g_0 ∈ G_0 are defined on S. Introduce the class of estimators, indexed by g ∈ G and 0 < h < 1, given in (6.33).
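The Hall, Racine and Li conditional density estimator discussed above is a ratio of a joint kernel estimate of (X, Y) and a marginal kernel estimate of X; a minimal sketch (assuming Epanechnikov kernels for both K and L, with hypothetical function names) is:

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def conditional_density(Xc, Xd, Y, xc, xd, y, h, h0, lam, r):
    """Kernel estimator of the conditional density of Y given X = (xc, xd):
    the joint density estimate of (X, Y) divided by the marginal estimate of X."""
    cont = np.prod(epanechnikov((xc - Xc) / h) / h, axis=1)
    disc = np.prod(np.where(Xd == xd, 1.0 - lam, lam / (np.asarray(r) - 1)), axis=1)
    mx = cont * disc                               # weights defining m_n(x)
    joint = mx * epanechnikov((y - Y) / h0) / h0   # weights defining f_n(x, y)
    return float(np.sum(joint) / np.sum(mx))
```

Because the same x-weights appear in numerator and denominator, the estimate is a weighted kernel average of the Y observations, with weights concentrated on observations close to x in both the continuous and discrete components.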

The underlying assumptions and basic definitions
Let X be a random variable from a probability space (Ω, A, P ) to a measure space (S, S). In the sequel, || · || ∞ denotes the supremum norm on the space of bounded real valued measurable functions on S. To formulate our basic theoretical results we shall need the following class of functions. Let G denote the class of measurable real valued functions g of (u, h) ∈ S × (0, 1] introduced in our general setup (6.31) and recall the class of functions G 0 on S defined in (6.32). We shall assume the following conditions on G and G 0 : Note that (G.iii) is a measurability condition that we assume in order to avoid using outer probability measures in all of our statements. A pointwise measurable class G 0 has a countable subclass G c such that we can find for any function g ∈ G 0 a sequence of functions {g m , m ≥ 1} in G c for which lim m→∞ g m (x) = g(x) for all x ∈ S. See Example 2.3.4 in [19].
Condition (G.iv) is a so-called uniform entropy condition. As usual, we define the covering numbers as in (6.34), where G is an envelope function for G_0 and the supremum is taken over all probability measures Q on (S, S) with Q(G^2) < ∞. We shall now explain the notation in (6.34). By an envelope function G for G_0 we mean a measurable function G : S → [0, ∞] that dominates |g_0| for every g_0 ∈ G_0. Note that, by the definition of the class G_0, such an envelope function exists. The d_Q in (6.34) is the L_2(Q)-metric, and for any γ > 0, N(γ, G_0, d_Q) is the minimal number of d_Q-balls of radius γ needed to cover the entire function class G_0.
We use η as our (constant) envelope function when condition (G.i) holds. (In this case EG^2(X) < ∞ is trivially satisfied.) For future reference, recall that we say that a class F is of VC-type for the envelope function F if N(ε, F) ≤ Cε^{−ν}, 0 < ε < 1, for some constants C > 0 and ν ≥ 1. (Here N(ε, F) is defined as in (6.34), with F and F replacing G_0 and G, respectively.) This condition is automatically fulfilled if the class is a VC subgraph class (see Theorem 2.6.7 on page 141 of [19], to which we refer the reader for a definition of a VC subgraph class).

A uniform in bandwidth result
We shall need the following special case of the Theorem in Mason and Swanepoel [13]. Note that when we apply this result, we should keep in mind that in condition (F.ii) given there, G should be G γ . For an even more general uniform in bandwidth result see Theorem 4.1 of Mason [12].

Appendix: Pointwise measurability
We say that a class G_0 of measurable functions g : S → R is pointwise measurable if there exists a countable subclass G_c ⊆ G_0 such that for any function g in G_0 we can find a sequence of functions g_m ∈ G_c, m ≥ 1, for which g_m(x) → g(x) for all x ∈ S.
Example. Consider a real-valued right-continuous function K : R → R and define the class of functions F_K given in (7.36). Then this class is always pointwise measurable. Let Q denote the rationals; the subclass that will do the job here is the one with rational parameters. Proof. We claim that F_K is a pointwise measurable class. To see this, choose any g(u) = K(γu + ρ) ∈ F_K, u ∈ R, and set, for m ≥ 1, g_m(u) = K(γ_m u + ρ_m) with suitable rational γ_m and ρ_m. Thus, since γ_m u + ρ_m → γu + ρ and K is right continuous at γu + ρ, we see that g_m(u) → g(u) as m → ∞.
This proof is taken from that of Lemma A.1 of Deheuvels and Mason [4] with a couple of misprints fixed, and for the benefit of the reader is repeated here.
Trivially we get that if K_1, . . . , K_p are right-continuous functions on R and ϕ is a fixed measurable real-valued function on R, then the class of functions {(x_1, . . . , x_p, y) ↦ ϕ(y) ∏_{j=1}^p K_j(γ_j x_j + ρ_j) : γ_j > 0, ρ_j ∈ R, 1 ≤ j ≤ p} is pointwise measurable.