The False Discovery Rate for Statistical Pattern Recognition

The false discovery rate (FDR) and false nondiscovery rate (FNDR) have received considerable attention in the literature on multiple testing. These performance measures are also appropriate for classification, and in this work we develop generalization error analyses for FDR and FNDR when learning a classifier from labeled training data. Unlike more conventional classification performance measures, the empirical FDR and FNDR are not binomial random variables but rather a ratio of binomials, which introduces challenges not addressed in conventional analyses. We develop distribution-free uniform deviation bounds and apply these to obtain finite sample bounds and strong universal consistency.

Unlike conventional performance probabilities, whose empirical versions are related to binomial random variables, empirical versions of FDR and FNDR are related to ratios of binomial variables. This necessitates the development of novel concentration inequalities and methods of analysis.

Notation
More formally, in this paper we consider the following scenario. Let $\mathcal{X}$ be a set and let $Z = (X, Y)$ be a random variable taking values in $\mathcal{Z} = \mathcal{X} \times \{0, 1\}$. The variable X corresponds to a pattern or feature vector and Y to a class label associated with X; Y = 0 corresponds to the null hypothesis (e.g., that no target is present) and Y = 1 corresponds to the alternative hypothesis (e.g., that a target is present). The distribution on $\mathcal{Z}$ is unknown and is denoted by P. Assume we make n independent and identically distributed training observations of Z, denoted $Z^n = (X_i, Y_i)_{i=1}^n$. A classifier is a function $h : \mathcal{X} \to \{0, 1\}$ mapping feature vectors to class labels. Let H denote a collection of different classifiers. A false discovery occurs when h(X) = 1 but the true label is Y = 0. Similarly, a false nondiscovery occurs when h(X) = 0 but Y = 1. We define the false discovery rate (FDR) and the false nondiscovery rate (FNDR) of h by $R_D(h) := P(Y = 0 \mid h(X) = 1)$ and $R_{ND}(h) := P(Y = 1 \mid h(X) = 0)$, respectively.

Related Concepts
These definitions, which are natural in the classification setting, coincide with the so-called positive FDR/FNDR of Storey [21,22], so named because it can be seen to equal the expected fraction of false discoveries/nondiscoveries, conditioned on a positive number of discoveries/nondiscoveries having been made. Storey makes some decision-theoretic connections to classification [22], but does not consider learning from data. Storey's definition does not cover the case where the conditioning event has probability zero. We define FDR and FNDR in these cases to be infinity. Our convention has the effect of assigning high costs to classifiers that fail to make at least some discoveries (and nondiscoveries). This is consistent with the multiple testing perspective, where the goal is to generate interesting hypotheses for further examination. A classifier that makes no discoveries is of no use for such purposes. Further comments on the definition of FDR and FNDR are given after the proof of Theorem 2.
In certain communities, different terms embody the idea behind FDR. In the medical diagnostic testing literature, the positive predictive value (PPV) is defined as the "proportion of patients with positive test results who are correctly diagnosed" [1]. In database information retrieval problems, the precision is defined as the ratio of the number of relevant documents retrieved by a search to the total number of documents retrieved by a search [23]. Both PPV and precision are equal to 1 − FDR. Precision is discussed further in Section 5.2.
Finally, several researchers have recently investigated connections between multiple testing and statistical learning theory. McAllester's PAC-Bayesian learning theory may be viewed as an extension of multiple testing procedures to (possibly uncountably) infinite collections of hypotheses [16]. Blanchard and Fleuret present an extension of the Occam's razor principle for generalization error analysis in classification, and apply it to derive p-value adjustment procedures for controlling FDR [5]. Arlot et al. develop concentration inequalities that apply to multiple testing with correlated observations [2]. None of these works consider FDR/FNDR as performance criteria for classification.

Connections to Cost-Sensitive Learning
In Sections 3 and 4 we consider the performance measure $E_\lambda(h) := R_{ND}(h) + \lambda R_D(h)$. It can be shown that the global minimizers of this criterion have the form
$$h_c(x) = \mathbf{1}_{\{\eta(x) \ge c\}} \qquad (1)$$
for some c, where $\eta(x) := P(Y = 1 \mid X = x)$ and, if necessary, this family of classifiers is extended by a standard randomization argument if its receiver operating characteristic (ROC) is not concave. Storey [22] gives a proof for the case where the two class-conditional distributions are continuous. The classifiers in (1) are also the optimal classifiers for Bayes cost-sensitive learning. That is, they are also the minimizers of weighted Bayes costs, in which a cost parameter γ sets the relative weight of the two types of error. Proof of this fact is a direct generalization of the case of the probability of error, when γ = 1 [7].
Unfortunately, existing analyses for cost-sensitive classification cannot be readily applied to our problem. Given λ, it is true that our criterion $E_\lambda(h)$ can be minimized by performing cost-sensitive classification with a certain cost parameter γ. The critical issue is that γ is an implicit function of λ, and cannot be determined a priori without knowledge of the underlying distribution. Therefore, when only data are given, applying existing cost-sensitive classification methods to our problem would require estimating γ. In practice, this would most likely entail learning cost-sensitive classifiers $h_{\gamma_i}$ for some grid of values $\{\gamma_i\}$ that grows increasingly dense as n → ∞. Then, the best of these candidates would be selected by minimizing an estimate of $E_\lambda(h)$. Such a procedure would likely be computationally expensive.
From an analytical standpoint, it seems plausible that generalization error analyses for cost-sensitive classification could be useful; however, the need to search for a γ that approximately minimizes our criterion would significantly complicate the analysis. The objective of our work is to develop a much more direct approach, which does not require repeated cost-sensitive classification.
Therefore, the distinction between our problem and cost-sensitive classification is in some ways analogous to the difference between the Neyman-Pearson and Bayesian theories of hypothesis testing. Even though these two problems have a likelihood ratio as their optimal solution, the specific thresholds for the likelihood ratios are determined in very different ways depending on which criterion is employed. In our setting, the differences are further compounded by the fact that we are learning from data.

Overview
In the next section we present and prove uniform deviation bounds for FDR and FNDR. In Section 3, we discuss performance measures based on FDR and FNDR, and in Section 4 we establish the strong universal consistency of a learning rule with respect to the measure E λ . Section 5 treats performance measures which constrain FDR, and the final section offers a concluding discussion. Several aspects of our analysis deviate from standard techniques, a consequence of certain unique features of FDR and FNDR, and we highlight these throughout the paper.

Uniform Deviation Bounds
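Define empirical analogues to the FDR and FNDR according to $\hat R_D(h) := \frac{1}{n_D(h, Z^n)} \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 1,\, Y_i = 0\}}$ and $\hat R_{ND}(h) := \frac{1}{n_{ND}(h, Z^n)} \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 0,\, Y_i = 1\}}$ whenever the denominators are positive, where $n_D(h, Z^n) := \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 1\}}$ and $n_{ND}(h, Z^n) := \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 0\}}$ are binomial random variables. This section describes a uniform bound on the amount by which the empirical estimate of FDR/FNDR can deviate from the true value. Note that unlike the usual empirical estimates for the probability of error/false positive rate/false negative rate, here both numerator and denominator are random, and both depend on h.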
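The empirical quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `empirical_rates` is hypothetical, and returning infinity when a denominator is zero is an assumption made here to mirror the convention adopted for the population quantities.

```python
def empirical_rates(h, xs, ys):
    """Return (fdr_hat, fndr_hat, n_d, n_nd) for a 0/1-valued classifier h.

    n_d and n_nd are the numbers of discoveries and nondiscoveries; both
    depend on h, which is what distinguishes these estimates from ordinary
    error rates with a fixed denominator.
    """
    preds = [h(x) for x in xs]
    n_d = sum(1 for p in preds if p == 1)
    n_nd = len(preds) - n_d
    false_disc = sum(1 for p, y in zip(preds, ys) if p == 1 and y == 0)
    false_nondisc = sum(1 for p, y in zip(preds, ys) if p == 0 and y == 1)
    # Assumption: an empty denominator yields infinity (illustrative choice).
    fdr_hat = false_disc / n_d if n_d > 0 else float("inf")
    fndr_hat = false_nondisc / n_nd if n_nd > 0 else float("inf")
    return fdr_hat, fndr_hat, n_d, n_nd


# Toy usage: a threshold classifier on five one-dimensional points.
xs = [0.1, 0.4, 0.6, 0.7, 0.9]
ys = [0, 1, 0, 1, 1]
fdr_hat, fndr_hat, n_d, n_nd = empirical_rates(lambda x: int(x >= 0.5), xs, ys)
```

In the toy data, three points are declared discoveries and one of them has label 0, while two points are declared nondiscoveries and one of them has label 1.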
Assume H is countable, and let $[\![h]\!]$ be a real-valued functional on H such that $\sum_{h \in H} 2^{-[\![h]\!]} \le 1$. Such a functional can be identified with a prefix code for H, in which case $[\![h]\!]$ is the codelength associated to h. If $\sum_{h \in H} 2^{-[\![h]\!]} = 1$, then $2^{-[\![h]\!]}$ may be viewed as a prior distribution on H.
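For δ > 0, we introduce the penalty terms $\phi_D(h, \delta) := \sqrt{\frac{[\![h]\!] \log 2 + \log(2/\delta)}{2\, n_D(h, Z^n)}}$ and $\phi_{ND}(h, \delta) := \sqrt{\frac{[\![h]\!] \log 2 + \log(2/\delta)}{2\, n_{ND}(h, Z^n)}}$. The interpretation of these expressions as penalties comes from the learning algorithms studied below, where we minimize the empirical error plus a penalty to avoid overfitting. Note that the penalties are data dependent.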
Theorem 1. With probability at least 1 − δ with respect to the draw of the training data, $|\hat R_D(h) - R_D(h)| \le \phi_D(h, \delta)$ for all h ∈ H. Similarly, with probability at least 1 − δ with respect to the draw of the training data, $|\hat R_{ND}(h) - R_{ND}(h)| \le \phi_{ND}(h, \delta)$ for all h ∈ H. The results are independent of the underlying probability distribution.
Because of the form of the penalty terms, the bound is larger for classifiers h that are more complex, as represented through the codelength $[\![h]\!]$, and smaller when more discoveries/nondiscoveries are made. This result leads to finite sample bounds and strong universal consistency for certain learning rules based on minimization of the penalized empirical error, as developed in the sequel.
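The two qualitative properties just described (the bound grows with the codelength and shrinks with the number of discoveries) can be seen in a small sketch. The closed form used below is an assumption: it is the penalty obtained by calibrating Hoeffding's inequality with a codelength-weighted union bound. The name `hoeffding_penalty` and the infinite value for a zero count are likewise illustrative choices.

```python
import math


def hoeffding_penalty(codelength_bits, delta, count):
    """Assumed penalty: sqrt((bits * log 2 + log(2 / delta)) / (2 * count)).

    codelength_bits plays the role of the codelength of h, and count is the
    number of discoveries (or nondiscoveries) made by h on the sample.
    """
    if count <= 0:
        return float("inf")  # assumption: no information without discoveries
    complexity = codelength_bits * math.log(2) + math.log(2 / delta)
    return math.sqrt(complexity / (2 * count))


# More discoveries tighten the bound; a longer codelength loosens it.
few = hoeffding_penalty(codelength_bits=8, delta=0.05, count=20)
many = hoeffding_penalty(codelength_bits=8, delta=0.05, count=200)
complex_h = hoeffding_penalty(codelength_bits=64, delta=0.05, count=200)
```

Because the count sits in the denominator, a classifier that makes ten times as many discoveries has its penalty reduced by a factor of about 3.16 at the same codelength.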
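Proof. We prove the first statement, the second being similar. For added clarity, write the penalty as $\phi_D(h, \delta, n_D(h, Z^n))$, where $\phi_D(h, \delta, k) := \sqrt{\frac{[\![h]\!] \log 2 + \log(2/\delta)}{2k}}$. Consider a fixed h ∈ H. The fundamental concentration inequality underlying our bounds is Hoeffding's [14], which, in one form, states that if $S_k$ is the sum of k > 0 independent random variables bounded between zero and one, and $\mu = E[S_k]$, then $P(|S_k - \mu| \ge kt) \le 2e^{-2kt^2}$ for all t > 0. To apply Hoeffding's inequality, we need the following conditioning argument. Condition on the values $h(X_1), \ldots, h(X_n)$, with $n_D(h, Z^n) = k > 0$. By independence of the realizations, the indicators $\mathbf{1}_{\{Y_i = 0\}}$ at the k indices where $h(X_i) = 1$ are independent Bernoulli variables with mean $R_D(h)$, and $\hat R_D(h)$ is their average. Therefore, applying Hoeffding's inequality conditioned on this event with $t = \phi_D(h, \delta, k)$, the conditional probability that $|\hat R_D(h) - R_D(h)| > \phi_D(h, \delta, k)$ is at most $2e^{-2k\phi_D(h, \delta, k)^2} = \delta\, 2^{-[\![h]\!]}$. Since this bound does not depend on the conditioning event, it also holds unconditionally. The result now follows by applying the union bound over all h ∈ H, since $\sum_{h \in H} \delta\, 2^{-[\![h]\!]} \le \delta$.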
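The conditioning step in the proof can be checked numerically for a single fixed classifier. The sketch below is a Monte Carlo illustration under assumed values: the discovery probability, the true FDR, the sample size, and the Hoeffding-form penalty with zero codelength are all choices made here, not taken from the paper. It draws a binomial number of discoveries, then draws the false discoveries conditionally on that count, and records how often the empirical FDR falls outside the penalty.

```python
import math
import random


def violation_rate(n=200, p_discovery=0.25, fdr=0.3, delta=0.1,
                   trials=2000, seed=0):
    """Fraction of trials in which |fdr_hat - fdr| exceeds the penalty."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # The denominator is random: number of points with h(X) = 1.
        n_d = sum(1 for _i in range(n) if rng.random() < p_discovery)
        if n_d == 0:
            continue  # assumed infinite penalty, so no violation is counted
        # Given n_d, the false discoveries are a sum of n_d Bernoulli(fdr).
        false_disc = sum(1 for _j in range(n_d) if rng.random() < fdr)
        penalty = math.sqrt(math.log(2 / delta) / (2 * n_d))
        if abs(false_disc / n_d - fdr) > penalty:
            violations += 1
    return violations / trials


rate = violation_rate()
```

Hoeffding's inequality is conservative, so the observed violation frequency is expected to sit well below the nominal level δ rather than close to it.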
The technique of conditioning on the random denominator of a ratio of binomials has also been applied in other settings [15,18]. Unlike those works, however, here the binomial denominator depends on the classifier h. This presents difficulties for extending the above techniques to uncountable classes H. See the final section for further discussion of this issue.

Measuring Performance
We would like to be able to make FDR/FNDR related guarantees about how a data-based classifier $\hat h$ performs. For this, we need to specify a performance measure or optimality criterion that incorporates both FDR and FNDR quantities simultaneously. One possibility is to specify a number 0 < α < 1 and seek the classifier such that $R_{ND}(h)$ is minimal while $R_D(h) \le \alpha$. We consider this setting in Section 5. Another is to specify a constant λ > 0 reflecting the relative cost of FDR to FNDR, and minimize $E_\lambda(h) := R_{ND}(h) + \lambda R_D(h)$. This measure was introduced by Storey [22], but was not studied in a learning context. The uniform deviation bounds of the previous section immediately imply the following computable bound on a classifier's performance with respect to this measure.
Corollary 1. For any δ > 0 and n ≥ 1, with probability at least 1 − 2δ with respect to the draw of the training data, $E_\lambda(h) \le \hat R_{ND}(h) + \phi_{ND}(h, \delta) + \lambda \left( \hat R_D(h) + \phi_D(h, \delta) \right)$ for all h ∈ H.
In the next section, we analyze a learning rule based on minimizing the bound of Corollary 1, and establish its strong universal consistency.

Strong Universal Consistency
Denote the globally optimal value of the performance measure by $E^*_\lambda := \inf_h E_\lambda(h)$, where the inf is over all measurable $h : \mathcal{X} \to \{0, 1\}$. We seek a learning rule $\hat h_{\lambda,n}$ such that $E_\lambda(\hat h_{\lambda,n}) \to E^*_\lambda$ almost surely, regardless of the underlying probability distribution. Thus let $\{H_k\}_{k \ge 1}$ be a family of finite sets of classifiers with universal approximation capability. That is, assume that $\lim_{k \to \infty} \inf_{h \in H_k} E_\lambda(h) = E^*_\lambda$ for all distributions on (X, Y). Furthermore, assume this family to be nested, meaning $H_1 \subseteq H_2 \subseteq H_3 \subseteq \cdots$. For example, if $\mathcal{X} = [0, 1]^d$, we may take $H_k$ to be the collection of histogram classifiers based on a binwidth of $2^{-k}$. Recall that we can set $[\![h]\!] = \log_2 |H_k|$ for $h \in H_k$, where $|H_k|$ is the cardinality of $H_k$. For histograms, we have $|H_k| = 2^{2^{kd}}$ and hence $[\![h]\!] = 2^{kd}$, so that $[\![h]\!] \log 2 = \log |H_k| = 2^{kd} \log 2$.
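The histogram example can be made concrete with a short sketch. The helper names below are hypothetical. The count follows from the fact that a binwidth of 2^-k splits [0, 1]^d into 2^(kd) cells and a histogram classifier assigns one label bit to each cell.

```python
def histogram_codelength_bits(k, d):
    """log2 of the number of histogram classifiers: one bit per cell."""
    return 2 ** (k * d)


def cell_index(x, k):
    """Index of the cell containing x in [0, 1]^d, as a tuple of integers."""
    cells_per_axis = 2 ** k
    # Points on the upper boundary are placed in the last cell.
    return tuple(min(int(xi * cells_per_axis), cells_per_axis - 1) for xi in x)


def histogram_classifier(labels, k):
    """Build h from a dict mapping cell indices to labels in {0, 1}."""
    return lambda x: labels.get(cell_index(x, k), 0)


# d = 2, k = 1: four cells, hence 2**4 = 16 classifiers and a 4-bit codelength.
bits = histogram_codelength_bits(k=1, d=2)
h = histogram_classifier({(1, 1): 1}, k=1)
```

The codelength doubles d times with every unit increase in k, which is why the growth of the resolution must be tied to the sample size in the consistency argument.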
The bound of Corollary 1 suggests bound minimization as a strategy for selecting a classifier empirically. However, rather than minimizing over all possible classifiers in some H k , we first discard those classifiers whose empirical numbers of discoveries or nondiscoveries are too small. In these cases, the penalties are possibly quite large, and we are unable to obtain tight concentrations of empirical FDR/FNDR measures around their true values. However, as n increases, we are able to admit classifiers with increasingly small proportions of (non)discoveries, so that in the limit, we can still approximate arbitrary distributions. This aspect is another unique feature of FDR/FNDR compared to traditional performance measures.
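Formally, set $\delta_n = 1/n^2$ and define $\hat H_n := \{h \in H_{k_n} : n_D(h, Z^n)/n \ge p_n \text{ and } n_{ND}(h, Z^n)/n \ge p_n\}$, where $p_n := (\log n)^{-1}$. Here $k_n$ is such that $k_n \to \infty$ as $n \to \infty$ and $\log |H_{k_n}| = o(n/\log n)$. For the histogram example, $\log |H_{k_n}| = 2^{k_n d} \log 2$, and thus the assumed conditions on the growth of $k_n$ are essentially the same (up to a logarithmic factor) as for consistency of histograms in other problems. For example, in standard classification, $2^{k_n d} = o(n)$ is required [7]. Denote the bound of Corollary 1 by $\hat E_\lambda(h) := \hat R_{ND}(h) + \phi_{ND}(h, \delta_n) + \lambda \left( \hat R_D(h) + \phi_D(h, \delta_n) \right)$ and define the classification rule $\hat h_{\lambda,n} := \arg\min_{h \in \hat H_n} \hat E_\lambda(h)$.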
If $\hat H_n = \emptyset$, then $\hat h_{\lambda,n}$ may be defined arbitrarily.
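The selection rule can be sketched for a finite list of candidate classifiers. This is an illustration under stated assumptions rather than the authors' implementation: the function name `select_classifier` is hypothetical, each candidate is supplied with its codelength in bits, the penalty is taken to have the Hoeffding form sqrt((bits * log 2 + log(2 / delta_n)) / (2 * count)), and `None` is returned when no candidate survives the screening step.

```python
import math


def select_classifier(candidates, xs, ys, lam):
    """Minimize the penalized bound over candidates that make enough
    discoveries and nondiscoveries. candidates is a list of (h, bits)."""
    n = len(xs)
    if n < 3:
        return None, float("inf")  # p_n = 1 / log(n) is not below 1 yet
    delta_n = 1.0 / n ** 2
    p_n = 1.0 / math.log(n)
    best, best_bound = None, float("inf")
    for h, bits in candidates:
        preds = [h(x) for x in xs]
        n_d = sum(1 for p in preds if p == 1)
        n_nd = n - n_d
        # Screening: discard classifiers with too few (non)discoveries.
        if n_d / n < p_n or n_nd / n < p_n:
            continue
        fdr_hat = sum(1 for p, y in zip(preds, ys) if p == 1 and y == 0) / n_d
        fndr_hat = sum(1 for p, y in zip(preds, ys) if p == 0 and y == 1) / n_nd
        complexity = bits * math.log(2) + math.log(2 / delta_n)
        phi_d = math.sqrt(complexity / (2 * n_d))
        phi_nd = math.sqrt(complexity / (2 * n_nd))
        bound = fndr_hat + phi_nd + lam * (fdr_hat + phi_d)
        if bound < best_bound:
            best, best_bound = h, bound
    return best, best_bound


# Toy usage: three threshold classifiers on 20 evenly spaced points.
xs = [i / 20 for i in range(20)]
ys = [int(x >= 0.5) for x in xs]
thresholds = [0.25, 0.5, 0.75]
candidates = [(lambda x, t=t: int(x >= t), math.log2(len(thresholds)))
              for t in thresholds]
best, best_bound = select_classifier(candidates, xs, ys, lam=1.0)
```

With n = 20 the screening level is 1 / log(20), roughly 0.33, so the thresholds 0.25 and 0.75 are discarded for making too few nondiscoveries and discoveries respectively, and only the threshold 0.5 competes.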
Theorem 2. For any distribution on (X, Y), and any λ > 0, $E_\lambda(\hat h_{\lambda,n}) \to E^*_\lambda$ almost surely. That is, $\hat h_{\lambda,n}$ is strongly universally consistent.
Proof. First consider the case where there is no measurable $h : \mathcal{X} \to \{0, 1\}$ such that both P(h(X) = 0) > 0 and P(h(X) = 1) > 0. This occurs when X is deterministic. Then $E^*_\lambda = \infty$, and trivially $\hat h_{\lambda,n}$ achieves optimal performance. So assume this is not the case.
By this lemma we have $P(\Theta^n) \le \delta_n = 1/n^2$ for all $n > N_1$. (Here we need only the second part of the conclusion of the lemma; later we use the lemma in its full generality.) Thus Now consider the first term on the right-hand side of (4). Define the events We consider the two terms individually and show that each of them is finite.
To bound the first term on the right-hand side of (5) we use the following lemma.
Define the events and hence it suffices to show $\sum_{n=1}^{\infty} P(\Omega^n_{1i} \mid \Theta^n)$ is finite for each i = 1, 2, 3, 4. We shall consider $\Omega^n_{11}$ and $\Omega^n_{13}$, the other two cases following similarly.
For $h \in \hat H_n$ we have $n_{ND}(h, Z^n)/n \ge p_n$ and therefore for $n \ge N_2$, for some $N_2$ sufficiently large. Here we use $\delta_n = 1/n^2$ and $\log |H_{k_n}| = o(n/\log n)$. Then Furthermore, by the uniform deviation bound, Now consider the event $\Omega^n_2$. Applying Lemma 1 with ν = ε/2, we have that
In the definitions of $R_D(h)$ and $R_{ND}(h)$, we define these quantities to be infinity when the conditioning event has probability zero (see Introduction). This forces the globally optimal classifier to have both P(h(X) = 1) > 0 and P(h(X) = 0) > 0 whenever possible. The same property would hold provided $R_D(h) > (1 + \lambda)/\lambda$ when P(h(X) = 1) = 0 and $R_{ND}(h) > 1 + \lambda$ when P(h(X) = 0) = 0. Were we to define FDR or FNDR to be smaller, our consistency argument would not apply universally. In particular, it might fail for distributions where the global minimizer of $E_\lambda$ has either P(h(X) = 0) = 0 or P(h(X) = 0) = 1, such as when X is deterministic. In a preliminary version of this work, we defined $R_D(h)$ and $R_{ND}(h)$ to be zero when the conditioning event is a null event, and were able to prove consistency under a very mild condition on the underlying distribution [19].

Constraining FDR
In this section we apply Theorem 1 to analyze a rule that seeks to minimize the FNDR subject to the constraint that FDR ≤ α, where α is a user-defined significance level. In fact, we first present a more general result, and then deduce results for this and other constrained learning problems as corollaries.
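Thus, let H be a collection of classifiers as before, but not necessarily finite. Let $R_0$ and $R_1$ be measures of Type I and Type II error. For example, these may be FDR and FNDR, false positive rate and false negative rate, or some combination thereof. Assume that for i = 0, 1, there exists a data-based estimate $\hat R_i$ of $R_i$, and a penalty $\phi_i(h, \delta)$, which define a symmetric confidence interval for $R_i$. That is, suppose that for any 0 < δ < 1, $P\left( |\hat R_i(h) - R_i(h)| \le \phi_i(h, \delta) \text{ for all } h \in H \right) \ge 1 - \delta$. Let $h^*_{H,\alpha} := \arg\min \{ R_1(h) : h \in H,\ R_0(h) \le \alpha \}$. Consider the learning rule
$$\hat h_{H,\alpha} := \arg\min \{ \hat R_1(h) + \phi_1(h, \delta) : h \in H,\ \hat R_0(h) \le \alpha + \phi_0(h, \delta) \}. \qquad (6)$$
Theorem 3. The learning rule defined in Eqn. (6) is such that, for any δ > 0 and any n ≥ 1, with probability at least 1 − 2δ with respect to the draw of the training data, $R_1(\hat h_{H,\alpha}) \le R_1(h^*_{H,\alpha}) + 2\phi_1(h^*_{H,\alpha}, \delta)$ and $R_0(\hat h_{H,\alpha}) \le \alpha + 2\phi_0(\hat h_{H,\alpha}, \delta)$. The result holds regardless of the data-generating distribution.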
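Proof. Assume that both
$$|\hat R_0(h) - R_0(h)| \le \phi_0(h, \delta) \text{ for all } h \in H \qquad (7)$$
and
$$|\hat R_1(h) - R_1(h)| \le \phi_1(h, \delta) \text{ for all } h \in H, \qquad (8)$$
which, by assumption, occurs with probability at least 1 − 2δ. By (7), we deduce the second half of the theorem from $R_0(\hat h_{H,\alpha}) \le \hat R_0(\hat h_{H,\alpha}) + \phi_0(\hat h_{H,\alpha}, \delta) \le \alpha + 2\phi_0(\hat h_{H,\alpha}, \delta)$, where the second inequality follows from $\hat R_0(\hat h_{H,\alpha}) \le \alpha + \phi_0(\hat h_{H,\alpha}, \delta)$, which follows from the definition of $\hat h_{H,\alpha}$. To get the first half of the theorem, observe $\hat R_0(h^*_{H,\alpha}) \le R_0(h^*_{H,\alpha}) + \phi_0(h^*_{H,\alpha}, \delta) \le \alpha + \phi_0(h^*_{H,\alpha}, \delta)$. Therefore, $h^*_{H,\alpha}$ is among the candidates in the minimization defining $\hat h_{H,\alpha}$. Then $R_1(\hat h_{H,\alpha}) \le \hat R_1(\hat h_{H,\alpha}) + \phi_1(\hat h_{H,\alpha}, \delta) \le \hat R_1(h^*_{H,\alpha}) + \phi_1(h^*_{H,\alpha}, \delta) \le R_1(h^*_{H,\alpha}) + 2\phi_1(h^*_{H,\alpha}, \delta)$.
This theorem can immediately be combined with Theorem 1 to give performance guarantees for the case $R_0(h) = R_D(h)$ and $R_1(h) = R_{ND}(h)$, for a countable class H. In particular, define the rule
$$\hat h_{H,\alpha} := \arg\min \{ \hat R_{ND}(h) + \phi_{ND}(h, \delta) : h \in H,\ \hat R_D(h) \le \alpha + \phi_D(h, \delta) \}. \qquad (9)$$
We have the following.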

Corollary 2.
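Assume H is countable. For any δ > 0 and any n ≥ 1, with probability at least 1 − 2δ with respect to the draw of the training data, the learning rule in (9) satisfies $R_{ND}(\hat h_{H,\alpha}) \le R_{ND}(h^*_{H,\alpha}) + 2\phi_{ND}(h^*_{H,\alpha}, \delta)$ and $R_D(\hat h_{H,\alpha}) \le \alpha + 2\phi_D(\hat h_{H,\alpha}, \delta)$, where $h^*_{H,\alpha}$ minimizes $R_{ND}(h)$ over h ∈ H subject to $R_D(h) \le \alpha$.
To extend such a result to a universally consistent estimator, based on the discussion of Theorem 2, it would be necessary to take H growing with the sample size n, and to exclude classifiers making too few discoveries or nondiscoveries. The details are similar to those of Section 4, and a formal development is omitted.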
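A constrained rule of this kind can be sketched for a finite candidate list. The code below is an illustration under assumptions, not the paper's stated rule: it minimizes the penalized empirical FNDR among candidates whose empirical FDR is at most α plus the FDR penalty, it uses a Hoeffding-form penalty, it skips classifiers with no discoveries or no nondiscoveries as a simplification, and the name `constrained_select` is hypothetical.

```python
import math


def constrained_select(candidates, xs, ys, alpha, delta):
    """Return the candidate minimizing fndr_hat + phi_nd subject to
    fdr_hat <= alpha + phi_d. candidates is a list of (h, bits)."""
    n = len(xs)
    best, best_objective = None, float("inf")
    for h, bits in candidates:
        preds = [h(x) for x in xs]
        n_d = sum(1 for p in preds if p == 1)
        n_nd = n - n_d
        if n_d == 0 or n_nd == 0:
            continue  # simplification: such classifiers are not considered
        fdr_hat = sum(1 for p, y in zip(preds, ys) if p == 1 and y == 0) / n_d
        fndr_hat = sum(1 for p, y in zip(preds, ys) if p == 0 and y == 1) / n_nd
        complexity = bits * math.log(2) + math.log(2 / delta)
        phi_d = math.sqrt(complexity / (2 * n_d))
        phi_nd = math.sqrt(complexity / (2 * n_nd))
        if fdr_hat > alpha + phi_d:
            continue  # relaxed empirical FDR constraint is violated
        objective = fndr_hat + phi_nd
        if objective < best_objective:
            best, best_objective = h, objective
    return best, best_objective


# Toy usage with three threshold classifiers on 20 evenly spaced points.
xs = [i / 20 for i in range(20)]
ys = [int(x >= 0.5) for x in xs]
thresholds = [0.25, 0.5, 0.75]
candidates = [(lambda x, t=t: int(x >= t), math.log2(len(thresholds)))
              for t in thresholds]
best, objective = constrained_select(candidates, xs, ys, alpha=0.1, delta=0.1)
```

The constraint is relaxed by the penalty so that a classifier satisfying the population constraint is unlikely to be excluded by sampling noise, which is the step the proof of Theorem 3 relies on.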

Neyman-Pearson Classification
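If we take $R_0$ and $R_1$ to be the false positive rate and false negative rate, respectively, we may apply Theorem 3 to recover and generalize known results for Neyman-Pearson classification [6,18]. Specifically, set $R_0(h) := P(h(X) = 1 \mid Y = 0)$ and $R_1(h) := P(h(X) = 0 \mid Y = 1)$. There are several possible penalties that provide uniform bounds on the deviation between these quantities and their natural empirical estimates, $\hat R_0(h) := \frac{1}{n_0} \sum_{i : Y_i = 0} \mathbf{1}_{\{h(X_i) = 1\}}$ and $\hat R_1(h) := \frac{1}{n_1} \sum_{i : Y_i = 1} \mathbf{1}_{\{h(X_i) = 0\}}$, where $n_j := \sum_{i=1}^n \mathbf{1}_{\{Y_i = j\}}$. Examples of such penalties (e.g., VC and Rademacher penalties) are given in [17]. As a concrete example, we state a result here for the case of countable H. Thus define the penalties, for j = 0, 1, $\phi_j(h, \delta) := \sqrt{\frac{[\![h]\!] \log 2 + \log(2/\delta)}{2 n_j}}$ if $n_j > 0$, and $\phi_j(h, \delta) := 1$ if $n_j = 0$.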
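Define the rule
$$\hat h_{H,\alpha} := \arg\min \{ \hat R_1(h) + \phi_1(h, \delta) : h \in H,\ \hat R_0(h) \le \alpha + \phi_0(h, \delta) \}. \qquad (10)$$
We have the following.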

Corollary 3.
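Assume H is countable. For any δ > 0 and any n ≥ 1, with probability at least 1 − 2δ with respect to the draw of the training data, the learning rule in (10) satisfies $R_1(\hat h_{H,\alpha}) \le R_1(h^*_{H,\alpha}) + 2\phi_1(h^*_{H,\alpha}, \delta)$ and $R_0(\hat h_{H,\alpha}) \le \alpha + 2\phi_0(\hat h_{H,\alpha}, \delta)$, where $h^*_{H,\alpha}$ minimizes $R_1(h)$ over h ∈ H subject to $R_0(h) \le \alpha$.
We note that Theorem 3, applied in the context of Neyman-Pearson classification, is a stronger result than those in [6,18], which do not explicitly allow penalties that depend on the classifier h.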

Precision and Recall
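As a final application of Theorem 3, we analyze the precision and recall performance measures, common in database information retrieval problems (see Introduction). Precision and recall can both be defined in terms of quantities already discussed. Denote the precision and recall by $Q_{PR}(h) := 1 - R_D(h)$ and $Q_{RE}(h) := 1 - R_{FN}(h)$, where $R_{FN}(h) := P(h(X) = 0 \mid Y = 1)$ is the false negative rate, and let $\hat Q_{PR}(h) := 1 - \hat R_D(h)$ and $\hat Q_{RE}(h) := 1 - \hat R_{FN}(h)$ be the empirical estimates. In this setting the goal is to find the classifier with the largest precision, while maintaining a recall of at least β, where β is a user-specified level. Thus the optimal classifier in a given class H is
$$h^*_{H,\beta} := \arg\max \{ Q_{PR}(h) : h \in H,\ Q_{RE}(h) \ge \beta \}.$$
Define the rule
$$\hat h_{H,\beta} := \arg\max \{ \hat Q_{PR}(h) - \phi_D(h, \delta) : h \in H,\ \hat Q_{RE}(h) \ge \beta - \phi_{FN}(h, \delta) \},$$
where $\phi_D$ is the penalty of Theorem 1 and $\phi_{FN}$ is the false negative rate penalty of Section 5.1. We have the following.
Corollary 4. Assume H is countable. For any δ > 0 and any n ≥ 1, with probability at least 1 − 2δ with respect to the draw of the training data, $Q_{PR}(\hat h_{H,\beta}) \ge Q_{PR}(h^*_{H,\beta}) - 2\phi_D(h^*_{H,\beta}, \delta)$ and $Q_{RE}(\hat h_{H,\beta}) \ge \beta - 2\phi_{FN}(\hat h_{H,\beta}, \delta)$.
Proof. To apply Theorem 3, note that maximizing $Q_{PR}(h)$ is equivalent to minimizing $R_D(h)$, that the constraint $Q_{RE}(h) \ge \beta$ is equivalent to $R_{FN}(h) \le \alpha := 1 - \beta$, and similarly for the empirical objective and constraint. Furthermore, since $|\hat Q_{PR}(h) - Q_{PR}(h)| = |\hat R_D(h) - R_D(h)|$ and $|\hat Q_{RE}(h) - Q_{RE}(h)| = |\hat R_{FN}(h) - R_{FN}(h)|$, we have that the assumptions of Theorem 3 are satisfied with the stated penalties.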

Conclusion
This paper demonstrates that FDR and FNDR control is possible in the context of statistical learning theory, where the distribution of (X, Y ) is unknown except through training data. We develop empirical estimates of these quantities and derive uniform deviation bounds which assess the closeness of these empirical estimates to the true FDR and FNDR. Unlike most other performance measures in statistical learning theory, which are related to binomial random variables, the FDR and FNDR measures are related to ratios of binomial random variables, which requires the development of novel bounding techniques. These bounds are then used to analyze learning rules that minimize a weighted combination of FDR and FNDR, or that minimize FNDR subject to a constraint on FDR.
Our strong universal consistency result indicates that it is necessary to prevent the learning algorithm from selecting classifiers making too few discoveries or nondiscoveries, as error estimates for such classifiers may be poor. Extending our results to uncountable classes H is an interesting open question, and may require the development of new techniques. The standard proofs of common generalization error bounds for uncountable classes, such as Rademacher and VC penalties, rely on the introduction of an artificial "ghost" sample [7]. That technique would require every h ∈ H to have the same empirical number of discoveries (or nondiscoveries) on both the original and ghost samples, which is generally not the case. Recently El-Yaniv and Pechyony [11] have extended the ghost sample technique to cases where the training and ghost samples have different sizes (their results are stated in the context of transductive learning), and some of their arguments may be useful in this regard.