Empirical measures for incomplete data with applications

Abstract: Methods are proposed to construct empirical measures when there are missing terms among the components of a random vector. Furthermore, Vapnik-Chervonenkis type exponential bounds are obtained on the uniform deviations of these estimators from the true probabilities. These results can then be used to deal with classical problems, such as statistical classification via empirical risk minimization, when there are missing covariates among the data. Another application involves the uniform estimation of a distribution function.


Introduction
Consider the random vector X^T = (Y^T, Z^T) ∈ R^{d+p}, where Y ∈ R^d, d ≥ 1, is always observable but Z ∈ R^p, p ≥ 1, may be missing. Now let t : R^{d+p} → R^s, s ≥ 1, be a given (known) function and, for any measurable set A ⊂ R^s, consider the estimation of

µ(A) = P{t(X) ∈ A}.   (1)

When t(X) = X and A = (−∞, x_1] × ··· × (−∞, x_{d+p}], then (1) is the usual cumulative distribution function of the random vector X, evaluated at (x_1, ..., x_{d+p}). Alternatively, taking t(X) = Z and A = (−∞, z_1] × ··· × (−∞, z_p] gives the c.d.f. of the subvector Z.

Let D_n = {X_1, ..., X_n} be a random sample, where the X_i, i = 1, ..., n, are i.i.d. copies of X. Clearly, when every X_i^T = (Y_i^T, Z_i^T) is fully observable (i.e., there are no missing Z_i's), one can use the classical empirical version

µ_n(A) = n^{-1} Σ_{i=1}^n I{t(X_i) ∈ A}.   (2)

Now the celebrated inequality of Vapnik and Chervonenkis (1971) can be used to obtain uniform (in A) performance bounds on the deviations of µ_n(A) from µ(A). More specifically, let A be a class of measurable sets {A | A ⊂ R^s} and define

S(A, n) = max_{t(x_1), ..., t(x_n) ∈ R^s} #{ {t(x_1), ..., t(x_n)} ∩ A | A ∈ A },

i.e., the maximal number of distinct subsets of n points of R^s that can be picked out by the sets of the class A.
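As a small concrete illustration (not from the paper; the toy class of half-lines and all function names are assumptions), the sketch below computes the empirical measure (2) for orthant-type sets and counts, by brute force, how many distinct subsets of a one-dimensional sample are picked out by half-lines, a number that is bounded by S(A, n):

```python
import numpy as np

def empirical_measure(t_x, a):
    """mu_n(A) of (2) for A = (-inf, a_1] x ... x (-inf, a_s],
    where t_x is an (n, s) array holding t(X_1), ..., t(X_n)."""
    return np.mean(np.all(t_x <= a, axis=1))

def distinct_subsets_halflines(points, thresholds):
    """Brute-force count of the distinct subsets of the 1-d sample `points`
    picked out by the toy class A = {(-inf, c] : c in thresholds};
    this count is at most the shatter coefficient S(A, n)."""
    return len({tuple(points <= c) for c in thresholds})

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # fully observed sample, d + p = 3
Z = X[:, 1:]                                        # example: t(X) = Z, the last two coordinates
print(empirical_measure(Z, a=np.zeros(2)))          # estimates P{Z_1 <= 0, Z_2 <= 0}
print(distinct_subsets_halflines(X[:, 0], np.linspace(-3, 3, 200)))
```

For half-lines on a sample of n distinct points, at most n + 1 distinct subsets can be picked out, which is what the brute-force count approaches as the threshold grid becomes fine.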
The combinatorial quantity S(A, n), called the n-th shatter coefficient of the class A, measures the richness (massiveness) of the class A and is always bounded by 2^n. The following result is well known and goes back to Vapnik and Chervonenkis (1971).

Theorem 1. For every ε > 0 and every n ≥ 1,

P{ sup_{A∈A} |µ_n(A) − µ(A)| > ε } ≤ 8 S(A, n) e^{−nε²/32}.
For a proof of the above result based on symmetrization arguments see, for example, Devroye et al. (1996). Using more complicated techniques, it is possible to improve the constants that appear in the exponent of Theorem 1; see Devroye (1982) for more on this. Other relevant results along these lines are those of Talagrand (1994), Dudley (1978), and Massart (1990).
Remark 1. When the collection of sets A is uncountable, the measurability of the supremum in the above theorem can become an important issue. One way to deal with the measurability problem, in general, is to work with outer probability; see, for example, the monograph by van der Vaart and Wellner (1996). In the rest of this article we shall assume that the supremum functionals do satisfy measurability conditions.
In passing we also note that if n^{-1} log S(A, n) → 0, as n → ∞, then, by the Borel-Cantelli lemma, sup_{A∈A} |µ_n(A) − µ(A)| → 0 almost surely, as n → ∞.

To deal with the general case where there are missing Z_i's (recall that X^T = (Y^T, Z^T)) among the data, we start by defining the random variable

δ = 1 if Z is observable, and δ = 0 if Z is missing.

Then the data D_n may be represented by

D_n = { (Y_i^T, δ_i Z_i^T, δ_i), i = 1, ..., n },

where δ_i Z_i is unobserved whenever δ_i = 0. Clearly, the estimator µ_n in (2) is no longer computable because some of the Z_i's may be missing. In order to revise (2) appropriately, we also need to take into account the missing probability mechanism, i.e., the quantity P{δ = 1 | Y, Z} = E(δ | Y, Z). If the missing probability mechanism satisfies

P{δ = 1 | Y, Z} = P{δ = 1},   (3)

then it is said to be Missing Completely At Random (MCAR). Of course, the MCAR assumption is rather unrealistic and restrictive. A more widely used assumption in the literature is the Missing At Random (MAR) assumption, where one has

P{δ = 1 | Y, Z} = P{δ = 1 | Y},   (4)

i.e., the probability that Z is missing may depend on the observable Y but not on Z itself. For more on these and other missing patterns one may refer to Little and Rubin (2002), p. 12.

When the missing probability mechanism satisfies the (unrealistic) MCAR assumption, one may just use the complete cases to estimate µ(A); a complete case is a case with δ_i = 1. For example, if p := P{δ = 1} ≠ 0, then one may consider the simple estimator

µ̂_n(A) = (np)^{-1} Σ_{i=1}^n δ_i I{t(X_i) ∈ A},   (5)

provided that p is known. Under the MCAR assumption, the estimator (5) is unbiased for µ(A). To appreciate this, simply observe that

E(µ̂_n(A)) = (np)^{-1} Σ_{i=1}^n E[ E(δ_i | X_i) I{t(X_i) ∈ A} ] = p^{-1} E[ p I{t(X) ∈ A} ] = µ(A),

because under the MCAR assumption, E(δ | X) = P(δ = 1) = p. In fact, more is true:

Theorem 2. Let µ̂_n(A) be as in (5). Then for every ε > 0 and every n ≥ 1,

P{ sup_{A∈A} |µ̂_n(A) − µ(A)| > ε } ≤ c_1 S(A, n) e^{−c_2 n ε²},

where c_1 = 8 and c_2 = 2^{-1} p².
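As a quick numerical illustration of (5), here is a minimal sketch (not from the paper; the function names and the toy data-generating mechanism are assumptions) of the complete-case estimator under MCAR with a known p = P{δ = 1}:

```python
import numpy as np

def complete_case_estimator(t_x, delta, a, p):
    """Estimator (5): (n p)^{-1} sum_i delta_i I{t(X_i) in A},
    for A = (-inf, a_1] x ... x (-inf, a_s]; p = P{delta = 1} is assumed known."""
    inside = np.all(t_x <= a, axis=1)                 # I{t(X_i) in A}
    return np.sum(delta * inside) / (len(delta) * p)

rng = np.random.default_rng(1)
n, p = 2000, 0.7
X = rng.normal(size=(n, 3))
delta = rng.binomial(1, p, size=n)                    # MCAR: delta independent of X
print(complete_case_estimator(X, delta, a=np.zeros(3), p=p))  # approx P{X <= 0} = 0.125
```

Because δ is generated independently of the data, the complete cases form a genuine random subsample, and the division by p removes the bias, exactly as in the expectation calculation above.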
When p = P(δ = 1) is unknown, it may be replaced by p̂ = n^{-1} Σ_{i=1}^n δ_i in (5), and the bound in Theorem 2 continues to hold with different constants c_1 > 0 and c_2 > 0.
In the rest of this article we shall focus on the popular (and more realistic) MAR assumption (4). In this case the empirical version of µ(A) given by

(n′)^{-1} Σ_{i=1}^n δ_i I{t(X_i) ∈ A},

where n′ may be taken to be np, or Σ_{i=1}^n δ_i, or n, is no longer appropriate. This is because the resulting set-indexed empirical process, indexed by A ∈ A, is not centered at µ(A), not even asymptotically.
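To see why the centering fails, here is a short supporting calculation (not taken from the paper): with n′ = Σ_{i=1}^n δ_i, the strong law of large numbers and the MAR assumption (4) give

```latex
\[
\frac{\sum_{i=1}^{n}\delta_i\, I\{t(X_i)\in A\}}{\sum_{i=1}^{n}\delta_i}
\;\longrightarrow\;
\frac{E\big[\delta\, I\{t(X)\in A\}\big]}{E[\delta]}
=\frac{E\big[P\{\delta=1\mid Y\}\, I\{t(X)\in A\}\big]}{E\big[P\{\delta=1\mid Y\}\big]}
\quad\text{almost surely,}
\]
```

which coincides with µ(A) for every A only in special cases, for instance when P{δ = 1 | Y} is constant, i.e., under MCAR.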
In the next section we propose methods for estimating µ(A), uniformly (in A), and derive counterparts of Theorem 1 for our proposed estimators. As an immediate consequence of our results, one can establish various Glivenko-Cantelli type theorems for incomplete data under the MAR assumption.

Main results
In this section we consider procedures to correct the naive complete-case estimator, where the correction is done by weighting the complete cases by the inverse of the missing-data probabilities p(Y) := P{δ = 1 | Y}, or estimates thereof. We recall that under the MAR assumption

P{δ = 1 | Y, Z} = P{δ = 1 | Y} := p(Y).

To motivate our approaches, we first consider the simple (but unrealistic) case where the function p(y) is completely known. Consider the revised estimator

µ̄_n(A) = n^{-1} Σ_{i=1}^n (δ_i / p(Y_i)) I{t(X_i) ∈ A}.   (6)

How good is µ̄_n(A) as an estimator of µ(A)? Clearly, under the MAR assumption, E(µ̄_n(A)) = µ(A). More importantly, we have

Theorem 3. Let µ̄_n be as in (6) and suppose that p_min = min_y p(y) > 0. Then for every ε > 0 and every n ≥ 1,

P{ sup_{A∈A} |µ̄_n(A) − µ(A)| > ε } ≤ 8 S(A, n) e^{−n ε² p²_min / 32}.

In practice, of course, the function p(y) is almost always unknown and must first be estimated. Here we consider two possible estimators of p(y): the first one is a kernel regression function estimator, whereas the second approach is based on the least-squares method.
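The following minimal sketch (not from the paper; all names and the toy missingness model are illustrative assumptions) implements the inverse-probability-weighted estimator (6) for orthant sets A, in the idealized situation where p(y) is known:

```python
import numpy as np

def ipw_estimator(t_x, Y, delta, a, p_of_y):
    """Estimator (6): n^{-1} sum_i (delta_i / p(Y_i)) I{t(X_i) in A},
    assuming the missingness probability p(y) = P{delta = 1 | Y = y} is known."""
    weights = delta / p_of_y(Y)                       # inverse-probability weights
    inside = np.all(t_x <= a, axis=1)                 # I{t(X_i) in A}
    return np.mean(weights * inside)

rng = np.random.default_rng(2)
n = 5000
Y = rng.normal(size=(n, 1))
Z = Y + rng.normal(size=(n, 1))                       # Z is correlated with Y
p_of_y = lambda y: 0.3 + 0.6 / (1.0 + np.exp(-y[:, 0]))   # known p(y), with p_min >= 0.3
delta = rng.binomial(1, p_of_y(Y))                    # MAR: delta depends on Y only
print(ipw_estimator(Z, Y, delta, a=np.zeros(1), p_of_y=p_of_y))  # estimates P{Z <= 0} = 0.5
```

Under MAR, each complete case is up-weighted by 1/p(Y_i), which restores E(µ̄_n(A)) = µ(A) even though the complete cases by themselves form a biased subsample.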

A kernel-based method
Our first estimator of p(y) = E(δ | Y = y) is the kernel regression estimator

p̂(y) = [ Σ_{i=1}^n δ_i K((y − Y_i)/h_n) ] / [ Σ_{i=1}^n K((y − Y_i)/h_n) ],   (7)

with the convention 0/0 = 0, where the function K : R^d → R is the kernel with smoothing parameter h_n (→ 0, as n → ∞). We then estimate µ(A) = P{t(X) ∈ A} by

µ̂_n(A) = n^{-1} Σ_{i=1}^n (δ_i / p̂(Y_i)) I{t(X_i) ∈ A}.   (8)

To study µ̂_n we first state some conditions.

C1. p_min = min_y p(y) > 0.
C2. The random vector Y has a compactly supported probability density function f(y), and f is bounded away from zero on its support. Furthermore, both f and its first-order partial derivatives are uniformly bounded.
C3. The kernel K is a bounded probability density function on R^d with a finite first absolute moment. Furthermore, h_n → 0 and n h_n^d → ∞, as n → ∞.

C4. The partial derivatives ∂p(y)/∂y_i exist for i = 1, ..., d, and are bounded uniformly in y on the support of f.
Condition C1 essentially states that the probability that Z can be observed (i.e., δ=1) will be nonzero (for all Y = y). Condition C2 is often imposed in nonparametric regression in order to avoid having unstable estimates (in the tails of the pdf f of Y). Condition C3 is not a restriction since the choice of the kernel K is at our discretion. In fact, K only needs to be a proper density with a finite first absolute moment. Condition C4, which has also been used by Cheng and Chu (1996), p. 65, is technical.
Theorem 4. Let µ̂_n(A) be as in (8). Then, under conditions C1-C4, for every ε > 0 there is an n_o such that for all n > n_o,

P{ sup_{A∈A} |µ̂_n(A) − µ(A)| > ε } ≤ c_4 [ S(A, n) e^{−c_5 n ε²} + n e^{−c_6 n h_n^d ε²} ],

where c_4, c_5, and c_6 are positive constants not depending on n or ε.
The constants c_4, c_5, and c_6 that appear in Theorem 4 depend on the function p (as well as on many other terms) through p_min = min_y p(y). In fact, the proof of the theorem makes it clear (with some more effort) how explicit values of these constants can be extracted. The estimator µ̂_n defined via (8) and (7) is quite easy to compute in practice. However, the bound in Theorem 4 is not as tight as the one in Theorem 3. This is because of the presence of the term n h_n^d that appears in the exponent of the bound of Theorem 4. In a sense, this shows that the effective sample size for the results of Theorem 4 is n h_n^d (and not n).
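A minimal computational sketch of (7) and (8) is given below (not from the paper; the Gaussian product kernel, the bandwidth handling, and all function names are illustrative assumptions). It applies the convention 0/0 = 0 where the denominator of (7) vanishes:

```python
import numpy as np

def kernel_p_hat(y_eval, Y, delta, h):
    """Estimator (7): Nadaraya-Watson estimate of p(y) = E(delta | Y = y),
    using a Gaussian product kernel (the kernel choice is illustrative only);
    the convention 0/0 = 0 is applied where the denominator vanishes."""
    diffs = (y_eval[:, None, :] - Y[None, :, :]) / h          # (m, n, d)
    K = np.exp(-0.5 * np.sum(diffs ** 2, axis=2))             # kernel weights, (m, n)
    num = K @ delta
    den = K.sum(axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def kernel_ipw_estimator(t_x, Y, delta, a, h):
    """Estimator (8): plug the kernel estimate p_hat(Y_i) into the inverse weights."""
    p_hat = kernel_p_hat(Y, Y, delta, h)
    inside = np.all(t_x <= a, axis=1)
    # only complete cases (delta_i = 1) contribute; the small floor is a numerical guard
    return np.mean(np.where(delta == 1, inside / np.maximum(p_hat, 1e-12), 0.0))
```

In practice the bandwidth h would be chosen so that h → 0 and n h^d → ∞, in line with condition C3.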

The least-squares method
Our second method uses the least-squares estimator of p(y) to construct an empirical version of µ(A). More specifically, suppose that the regression function p(y) = E(δ | Y = y) belongs to a class P of functions p′ : R^d → [p_min, 1], where, as before, p_min = min_y p(y). Also, let p̂_LS(y) be the least-squares estimator of p(y), i.e.,

p̂_LS = argmin_{p′ ∈ P} S_n(p′), where S_n(p′) = n^{-1} Σ_{i=1}^n (δ_i − p′(Y_i))².

Then we have the following counterpart of (8):

μ̃_n(A) = n^{-1} Σ_{i=1}^n (δ_i / p̂_LS(Y_i)) I{t(X_i) ∈ A}.   (9)
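The sketch below (not from the paper; the finite logistic-type class P and all names are illustrative assumptions) computes p̂_LS by direct minimization of S_n over a small candidate class and then forms the estimator (9):

```python
import numpy as np

def s_n(p_func, Y, delta):
    """Empirical squared error S_n(p') = n^{-1} sum_i (delta_i - p'(Y_i))^2."""
    return np.mean((delta - p_func(Y)) ** 2)

def least_squares_p(Y, delta, candidates):
    """p_hat_LS: the member of the (finite, illustrative) class P minimizing S_n."""
    return min(candidates, key=lambda f: s_n(f, Y, delta))

def ls_ipw_estimator(t_x, Y, delta, a, candidates):
    """Estimator (9): inverse weighting by the least-squares estimate of p(y)."""
    p_ls = least_squares_p(Y, delta, candidates)
    inside = np.all(t_x <= a, axis=1)
    return np.mean(delta * inside / p_ls(Y))

# illustrative class P: logistic-type functions of the first coordinate of y,
# truncated to [p_min, 1] so that the inverse weights stay bounded (condition C1)
p_min = 0.2
candidates = [
    (lambda y, b=b: np.clip(1.0 / (1.0 + np.exp(-b * y[:, 0])), p_min, 1.0))
    for b in np.linspace(-2.0, 2.0, 41)
]
```

Truncating the candidate functions to [p_min, 1] keeps the inverse weights in (9) bounded.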
To assess the performance of μ̃_n(A), we employ results from empirical process theory. For fixed y_1, ..., y_n, let N_1(ε, P, (y_i)_{i=1}^n) be the ε-covering number of P with respect to the empirical measure of the points y_1, ..., y_n. That is, N_1(ε, P, (y_i)_{i=1}^n) is the cardinality of the smallest subclass P′ of P such that for every p′ ∈ P there is a p′′ ∈ P′ with n^{-1} Σ_{i=1}^n |p′(y_i) − p′′(y_i)| < ε. (See, for example, van der Vaart and Wellner (1996), p. 83, or Pollard (1984), p. 25.) The following result gives bounds on the uniform deviations of μ̃_n(A) from µ(A).
Theorem 5. Let μ̃_n be as in (9). Then, under condition C1, for every ε > 0 and every n ≥ 1,

P{ sup_{A∈A} |μ̃_n(A) − µ(A)| > ε } ≤ 8 S(A, n) e^{−c_7 n ε²} + 8 E[ N_1(c_8 ε², P, (Y_i)_{i=1}^n) ] e^{−c_9 n ε⁴},

where c_7 = p²_min/8, c_8 = p⁴_min/2048, and c_9 = p⁸_min/((16²)(128)).

The two methods of estimation discussed in this section are of course very different, and the performance of each one depends on the function p. More specifically, if one is certain that p belongs to a known class of functions P, then the least-squares method would be preferable. In this case the performance bound of Theorem 5 is nonasymptotic, in that it holds for every n ≥ 1. Unfortunately, if p ∉ P, then the conclusion of Theorem 5 is no longer guaranteed to hold. In this case (and in the general case where one has no knowledge of the class of functions P), one can use the kernel estimator instead. There are, however, some theoretical drawbacks here: Theorem 4 is only asymptotic (it holds for large n). Furthermore, the kernel estimator requires more regularity conditions on the function p (as reflected by the conditions of Theorem 4).

An application
The results developed in Section 2 can be used to estimate a distribution function in the presence of missing covariates; this was briefly explained in the introduction. Here we consider an application of our results to the problem of statistical classification. More specifically, let (X, W) be an R^{d+p} × {0, 1}-valued random pair, where X = (Y^T, Z^T)^T, with Y ∈ R^d, d ≥ 1, and Z ∈ R^p. The problem of statistical classification involves the prediction of W based on the vector of covariates X. Formally, one seeks a classifier (a function) Ψ : R^{d+p} → {0, 1} for which the probability of misclassification (incorrect prediction), i.e., P{Ψ(X) ≠ W}, is as small as possible. It is a simple exercise to verify that the best classifier, i.e., the classifier with the lowest misclassification probability, is given by

Ψ_B(x) = 1 if P{W = 1 | X = x} > 1/2, and Ψ_B(x) = 0 otherwise.

Since Ψ_B is virtually always unknown, one uses the data to construct a classifier. Given a random sample D_n = {(X_1, W_1), ..., (X_n, W_n)}, where each pair (X_i, W_i) is fully observable, one tries to construct a classifier Ψ_n in such a way that its misclassification error

L_n(Ψ_n) = P{ Ψ_n(X) ≠ W | D_n }

is, in some sense, as small as possible. Let L(Ψ) = P{Ψ(X) ≠ W} denote the error probability of a (nonrandom) classifier Ψ, and put L* = L(Ψ_B). If L_n(Ψ_n) → L*, almost surely, as n → ∞, we say Ψ_n is strongly consistent. If the convergence holds in probability, Ψ_n is said to be weakly consistent. Given a class C of candidate classifiers Ψ, the principle of empirical risk minimization (ERM) chooses the classifier (from C) that minimizes the empirical error

n^{-1} Σ_{i=1}^n I{Ψ(X_i) ≠ W_i}.   (10)

Now consider the case where Z_i may be missing in D_n. Clearly, the data can be represented by

D_n = { (Y_i^T, δ_i Z_i^T, δ_i, W_i), i = 1, ..., n },

where δ_i = 0 if Z_i is missing (otherwise δ_i = 1). First note that (10) cannot be computed because not every X_i is fully observable. Furthermore, using the complete cases alone will not work, because the complete cases can only estimate quantities such as P{Ψ(X) ≠ W | δ = 1}, which is not the same as P{Ψ(X) ≠ W}. Our proposed ERM-type estimator of the best classifier is given by

Ψ̂_n = argmin_{Ψ ∈ C} n^{-1} Σ_{i=1}^n (δ_i / p̃(Y_i)) I{Ψ(X_i) ≠ W_i},   (11)

where p̃(y) is either p(y) (if p(y) is known), or p̂(y) in (7), or p̂_LS(y) of Section 2.2. Let Ψ* be the best classifier in C, i.e., Ψ* satisfies

L(Ψ*) = min_{Ψ ∈ C} L(Ψ).

How good is Ψ̂_n as an estimator of Ψ*? To answer this question, let L_n(Ψ̂_n) be the error of the classifier Ψ̂_n, i.e., L_n(Ψ̂_n) = P{Ψ̂_n(X) ≠ W | D_n}. Also, let A_C be the class of all sets A of the form

A = { (x, w) ∈ R^{d+p} × {0, 1} : Ψ(x) ≠ w },  Ψ ∈ C.

Then, we have the following result.
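The following minimal sketch (not from the paper; the finite class of linear classifiers, the weight vector p_tilde_Y, and all function names are illustrative assumptions) computes the ERM-type rule (11) over a small candidate class:

```python
import numpy as np

def weighted_empirical_risk(classify, X, W, delta, p_tilde_Y):
    """The weighted empirical error minimized in (11):
    n^{-1} sum over complete cases of I{Psi(X_i) != W_i} / p~(Y_i)."""
    cc = delta == 1                                   # rows with delta_i = 0 may hold missing Z entries
    errors = (classify(X[cc]) != W[cc]).astype(float)
    return np.sum(errors / p_tilde_Y[cc]) / len(W)

def erm_classifier(candidates, X, W, delta, p_tilde_Y):
    """Psi_hat_n of (11): the candidate minimizing the weighted empirical risk."""
    return min(candidates, key=lambda c: weighted_empirical_risk(c, X, W, delta, p_tilde_Y))

# illustrative finite class of linear classifiers x -> I{a0 + a^T x > 0}
rng = np.random.default_rng(3)
candidates = [
    (lambda x, a=a, a0=a0: (x @ a + a0 > 0).astype(int))
    for a in rng.normal(size=(25, 3))
    for a0 in (-0.5, 0.0, 0.5)
]
```

Here p_tilde_Y would hold p(Y_i), p̂(Y_i) from (7), or p̂_LS(Y_i), evaluated at the observed Y_i.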
Theorem 6. Let Ψ̂_n be as above. Then for every ε > 0 there is an n_o > 0 such that for every n > n_o:

(i) if p̃(y) = p̂(y) of (7) is used in (11), and conditions C1-C4 hold, then

P{ L_n(Ψ̂_n) − L(Ψ*) > ε } ≤ c_20 S(A_C, n) e^{−c_21 n ε²} + r_n(ε),

where r_n(ε) is the kernel-based remainder term corresponding to the second term of the bound in Theorem 4;

(ii) if p̃(y) = p̂_LS(y) of Section 2.2 is used in (11), and condition C1 holds, then

P{ L_n(Ψ̂_n) − L(Ψ*) > ε } ≤ c_22 S(A_C, n) e^{−c_23 n ε²} + r_n(ε),

where r_n(ε) is the covering-number remainder term corresponding to the second term of the bound in Theorem 5.

Here c_20, c_21, c_22, and c_23 are positive constants not depending on n or ε.
Remark 2. Theorem 6 is based on the assumption that the missing probability mechanism satisfies the strong MAR assumption P{δ = 1 | Y, Z, W} = P{δ = 1 | Y}. If, instead, one only assumes that P{δ = 1 | Y, Z, W} = P{δ = 1 | Y, W}, i.e., that the probability that Z_i is missing can depend on both Y_i and W_i, then one can revise Ψ̂_n in (11) by replacing p̃(Y_i) with (an estimate of) P{δ_i = 1 | Y_i, W_i}; one can then show, with some more effort, that the conclusion of part (i) of Theorem 6 continues to hold with different constants c_20 > 0 and c_21 > 0. Similar results can also be established for part (ii) of Theorem 6.
In passing we also note that the size of S(A_C, n) depends on the underlying class C. When, for example, C is the popular class of linear classifiers, i.e., classifiers of the form

Ψ(x) = I{ a_0 + a_1 x_1 + ··· + a_{d+p} x_{d+p} > 0 },

where a_0, a_1, ..., a_{d+p} ∈ R, then S(A_C, n) ≤ n^{d+p+1}; see, for example, Chapter 13 of Devroye et al. (1996). In this case, if we choose p̃(y) = p̂(y), where p̂ is as in (7), then L_n(Ψ̂_n) → L(Ψ*) almost surely, as n → ∞, by the Borel-Cantelli lemma, provided the bound of Theorem 6 is summable in n (which holds, for example, when n h_n^d / log n → ∞).
Remark 3. Strictly speaking, the classifier Ψ̂_n is not suitable if the new observation X (based on which W has to be predicted) also has missing covariates, i.e., if, in addition to the data D_n, the new observation X = (Y^T, Z^T)^T is also allowed to have a missing Z.
Proof of Theorem 3

Throughout, let D′_n denote a ghost sample, i.e., an independent copy of the data D_n (with corresponding indicators δ′_i), and let µ̄′_n(A) be the version of (6) computed from D′_n. Following the standard symmetrization argument, one first shows that, for n ε² ≥ 2/p²_min, any data-dependent set A_ε ∈ A with |µ̄_n(A_ε) − µ(A_ε)| > ε satisfies the chain of inequalities (12). But the far left and far right sides of (12) do not depend on any particular A_ε, and the chain of inequalities between them remains valid on the set { sup_{A∈A} |µ̄_n(A) − µ(A)| > ε }. Therefore, integrating the two sides with respect to the distribution of D_n, over this set, one finds (13). Next, let R_1, ..., R_n be i.i.d. random variables, independent of D_n and D′_n, with P{R_i = +1} = P{R_i = −1} = 1/2, and observe that introducing these random signs does not change the distribution of the symmetrized process. This last observation, together with (13), yields (14) (upon conditioning on D_n). Now observe that, for fixed x_1, ..., x_n, the number of different vectors (I{t(x_1) ∈ A}, ..., I{t(x_n) ∈ A}) obtained, as A ranges over A, is just the number of different sets in { {t(x_1), ..., t(x_n)} ∩ A | A ∈ A }, and this number is bounded by S(A, n). This gives (15). Since, conditional on D_n, the term Σ_{i=1}^n (R_i δ_i / p(Y_i)) I{t(X_i) ∈ A} is the sum of n independent zero-mean random variables, bounded by −1/p_min and +1/p_min (where p_min = min_y p(y) > 0), one can use Hoeffding's inequality to bound the resulting conditional probability. This bound, together with (14) and (15), gives the bound stated in the theorem. This proves the theorem for n ε² ≥ 2/p²_min. When n ε² < 2/p²_min, the theorem is trivially true.
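For reference, the version of Hoeffding's inequality used in the last step can be stated as follows (a standard fact, not specific to this paper): if V_1, ..., V_n are independent, zero-mean, and a_i ≤ V_i ≤ b_i, then for every t > 0,

```latex
\[
P\Big\{\Big|\sum_{i=1}^{n} V_i\Big| > t\Big\}
\;\le\; 2\exp\!\Big(-\frac{2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\Big).
\]
```

Taking, for instance, V_i = R_i (δ_i/p(Y_i)) I{t(X_i) ∈ A}, so that b_i − a_i ≤ 2/p_min, and t = nε/4, the right-hand side becomes 2 exp(−n ε² p²_min / 32), which matches the exponential factor in the bound of Theorem 3.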
To prove Theorem 4, we first state a lemma.
Lemma 1. Under conditions C2-C4,

sup_y | E[ h_n^{−d} K((y − Y_1)/h_n) ] − f(y) | ≤ c h_n  and  sup_y | E[ h_n^{−d} δ_1 K((y − Y_1)/h_n) ] − p(y) f(y) | ≤ c h_n,

where c is a positive constant not depending on n.

Proof.
A one-term Taylor expansion gives the first bound, where Y_{1,i} and Y_i are the i-th components of Y_1 and Y, respectively, and Y* is a point in the interior of the line segment joining Y and Y_1; a second one-term Taylor expansion then yields the remaining bound. This completes the proof of the lemma.

Proof of Theorem 4
First note that, for every A ∈ A, with µ̄_n(A) as in (6),

|µ̂_n(A) − µ(A)| ≤ |µ̂_n(A) − µ̄_n(A)| + |µ̄_n(A) − µ(A)|,

and therefore

P{ sup_{A∈A} |µ̂_n(A) − µ(A)| > ε } ≤ P{ sup_{A∈A} |µ̄_n(A) − µ(A)| > ε/2 } + P{ sup_{A∈A} |µ̂_n(A) − µ̄_n(A)| > ε/2 } =: I_n + II_n.

The term I_n is bounded by Theorem 3 (with ε/2 in place of ε).
To deal with the term II_n, first decompose it into two parts, II_n(1) and II_n(2). To deal with II_n(1), note that, for n large enough, Lemma 1 implies that the relevant quantity exceeds f_min p²_min ε/8. Moreover, conditional on Y_i, the terms T_j are independent, zero-mean random variables, bounded by −h_n^d ‖K‖_∞ and +h_n^d ‖K‖_∞; furthermore, the conditional variance Var(T_j(Y_i) | Y_i) can be bounded explicitly. Turning to the least-squares case of Theorem 5, the key observation is that S_n(p̂_LS) − S_n(p′) ≤ 0, by the definition of p̂_LS. Therefore, from empirical process theory, one arrives at the corresponding exponential bound with c_17 = p⁸_min/((16²)(128)).

Proof of Theorem 6
Let L̃_n(Ψ) = n^{-1} Σ_{i=1}^n (δ_i / p̃(Y_i)) I{Ψ(X_i) ≠ W_i} denote the weighted empirical error minimized in (11). A standard argument gives

L_n(Ψ̂_n) − L(Ψ*) ≤ 2 sup_{Ψ∈C} | L̃_n(Ψ) − L(Ψ) |,

where the last line follows from the fact that L_n(Ψ) = L(Ψ), for any nonrandom classifier Ψ, and the observation that L̃_n(Ψ̂_n) ≤ L̃_n(Ψ), for every Ψ ∈ C (by the definition of Ψ̂_n in (11)). Therefore, if we let A_C be the class of all sets of the form { (x, w) : Ψ(x) ≠ w }, Ψ ∈ C, and note that L̃_n(Ψ) is precisely the estimator of Section 2 applied to the random pair (X, W), the function t(x, w) = (x, w), and the set {(x, w) : Ψ(x) ≠ w}, then one finds

P{ L_n(Ψ̂_n) − L(Ψ*) > ε } ≤ P{ sup_{Ψ∈C} |L̃_n(Ψ) − L(Ψ)| > ε/2 } ≤ 8 S(A_C, n) e^{−c_19 n ε²} + r_n(ε),

for n large enough, where r_n(ε) is as in the statement of the theorem (by Theorems 4 and 5).