P-values for classification

Let $(X,Y)$ be a random variable consisting of an observed feature vector $X\in \mathcal{X}$ and an unobserved class label $Y\in \{1,2,...,L\}$ with unknown joint distribution. In addition, let $\mathcal{D}$ be a training data set consisting of $n$ completely observed independent copies of $(X,Y)$. Usual classification procedures provide point predictors (classifiers) $\widehat{Y}(X,\mathcal{D})$ of $Y$ or estimate the conditional distribution of $Y$ given $X$. In order to quantify the certainty of classifying $X$ we propose to construct for each $\theta =1,2,...,L$ a p-value $\pi_{\theta}(X,\mathcal{D})$ for the null hypothesis that $Y=\theta$, treating $Y$ temporarily as a fixed parameter. In other words, the point predictor $\widehat{Y}(X,\mathcal{D})$ is replaced with a prediction region for $Y$ with a certain confidence. We argue that (i) this approach is advantageous over traditional approaches and (ii) any reasonable classifier can be modified to yield nonparametric p-values. We discuss issues such as optimality, single use and multiple use validity, as well as computational and graphical aspects.


Introduction
Let $(X, Y)$ be a random variable consisting of a feature vector $X \in \mathcal{X}$ and a class label $Y \in \Theta := \{1, \ldots, L\}$ with $L \ge 2$ possible values. The joint distribution of $X$ and $Y$ is determined by the prior probabilities $w_\theta := \mathbb{P}(Y = \theta)$ and the conditional distributions $P_\theta := \mathcal{L}(X \mid Y = \theta)$ for all $\theta \in \Theta$. Classifying such an observation $(X, Y)$ means that only $X$ is observed, while $Y$ has to be predicted somehow. There is a vast literature on classification, and we refer to McLachlan [7], Ripley [10] or Fraley and Raftery [4] for an introduction and further references.
Let us assume for the moment that the joint distribution of $X$ and $Y$ is known, so that training data are not needed yet. In the simplest case, one chooses a classifier $\widehat{Y} : \mathcal{X} \to \Theta$, i.e. a point predictor of $Y$. A possible extension is to consider $\widehat{Y} : \mathcal{X} \to \{0\} \cup \Theta$, where $\widehat{Y}(X) = 0$ means that no class is viewed as plausible. A Bayesian approach would be to calculate the posterior distribution of $Y$ given $X$, i.e. the posterior weights $w_\theta(X) := \mathbb{P}(Y = \theta \mid X)$. In fact, a classifier $\widehat{Y}^*$ satisfying $\widehat{Y}^*(X) \in \arg\max_{\theta \in \Theta} w_\theta(X)$ is well known [7, Chapter 1] to minimize the risk $R(\widehat{Y}) := \mathbb{P}(\widehat{Y}(X) \ne Y)$.
An obvious advantage of using the posterior distribution instead of the simple classifier $\widehat{Y}^*$ (or $\widehat{Y}$) is additional information about confidence. That means, for instance, the possibility of computing the conditional risk $\mathbb{P}(\widehat{Y}^*(X) \ne Y \mid X) = 1 - \max_\theta w_\theta(X)$. However, this depends very sensitively on the prior weights $w_\theta$. Small changes in the latter may result in drastic changes of the posterior weights $w_\theta(X)$. Moreover, if some classes $\theta$ have very small prior weight, the classifier $\widehat{Y}^*$ tends to ignore these, i.e. the class-dependent risk $\mathbb{P}(\widehat{Y}^*(X) \ne Y \mid Y = \theta)$ may be rather large for some classes $\theta$. For instance, in medical applications each class may correspond to a certain disease status while the feature vector contains information about patients, including certain symptoms.
Here it would be unacceptable to classify each person as being healthy, just because the diseases in question are extremely rare. Note also that some study designs (e.g. case-control studies) allow for the estimation of the P θ but not the w θ . Moreover, there are applications in which the w θ change over time while it is still plausible to assume fixed conditional distributions P θ .
Another drawback of the posterior probabilities w θ (X) is the following: Suppose that the prior weights w θ are all identical and that for some subset Θ o of Θ with at least two elements the conditional distributions P θ , θ ∈ Θ o , are very similar. Then the posterior distribution of Y given X divides the mass corresponding to Θ o essentially uniformly among its elements. Even if the point X is right in the 'center' of the distributions P θ , θ ∈ Θ o , so that each class in Θ o is perfectly plausible, the posterior weights are not greater than 1/#Θ o . If w θ (X) is viewed merely as a measure of plausibility of class θ, there is no compelling reason why these measures should add to one.
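Instead, we propose to compute for each class $\theta \in \Theta$ a p-value $\pi_\theta(X)$ for the null hypothesis that $Y = \theta$. That is, $\pi_\theta : \mathcal{X} \to [0, 1]$ satisfies
$$\mathbb{P}(\pi_\theta(X) \le \alpha \mid Y = \theta) \le \alpha \quad \text{for all } \alpha \in (0, 1). \tag{1.1}$$
Given such p-values, the set $\widehat{\mathcal{Y}}_\alpha(X) := \{\theta \in \Theta : \pi_\theta(X) > \alpha\}$ is a prediction region for $Y$ with confidence $1 - \alpha$. If $\widehat{\mathcal{Y}}_\alpha(X)$ happens to be a singleton, we have classified $X$ uniquely with given confidence $1 - \alpha$. In case of $2 \le \#\widehat{\mathcal{Y}}_\alpha(X) < L$ we can at least exclude some classes with a certain confidence.
So far the classification problem corresponds to a simple statistical model with finite parameter space $\Theta$. A distinguishing feature of classification problems is that the joint distribution of $(X, Y)$ is typically unknown and has to be estimated from a set $\mathcal{D}$ consisting of completely observed training observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. Let us assume for the moment that all $n + 1$ observations, i.e. the $n$ training observations $(X_i, Y_i)$ and the current observation $(X, Y)$, are independent and identically distributed. Now one has to consider classifiers $\widehat{Y}(X, \mathcal{D})$ and p-values $\pi_\theta(X, \mathcal{D})$ depending on the current feature vector $X$ as well as on the training data $\mathcal{D}$. In this situation one can think of two possible extensions of (1.1): For any $\theta \in \Theta$ and $\alpha \in (0, 1)$,
$$\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = \theta) \le \alpha, \tag{1.2}$$
$$\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = \theta, \mathcal{D}) \le \alpha + o_p(1) \quad \text{as } n \to \infty. \tag{1.3}$$
It will turn out that Condition (1.2) can be guaranteed in various settings. Condition (1.3) corresponds to "multiple use" of our p-values: Suppose that we use the training data $\mathcal{D}$ to construct the p-values $\pi_\theta(\cdot, \mathcal{D})$ and classify many future observations $(\tilde{X}, \tilde{Y})$. Then the relative number of future observations with $\tilde{Y} = b$ and $\pi_\theta(\tilde{X}, \mathcal{D}) \le \alpha$ is close to a random quantity depending on the training data $\mathcal{D}$.
P-values as discussed here have been used in some special cases before. For instance, McLachlan's [7] "typicality indices" are just p-values $\pi_\theta(X, \mathcal{D})$ satisfying (1.2) in the special case of multivariate gaussian distributions $P_\theta$; see also Section 3. However, McLachlan's p-values are used primarily to identify observations not belonging to any of the given classes in $\Theta$. In particular, they are not designed and optimized for distinguishing between classes within $\Theta$.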
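The step from p-values to a prediction region can be sketched in a few lines. The sketch below assumes that the region collects all class labels whose p-value exceeds $\alpha$; the function name `prediction_region` and the example p-values are hypothetical and serve only as illustration.

```python
def prediction_region(p_values, alpha):
    """Return the set of class labels whose p-value exceeds alpha.

    p_values: dict mapping a class label to its p-value in [0, 1].
    alpha:    test level in (0, 1); the region has confidence 1 - alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return {theta for theta, p in p_values.items() if p > alpha}


# Hypothetical p-values for one feature vector and L = 3 classes.
p = {1: 0.62, 2: 0.03, 3: 0.004}
region_5 = prediction_region(p, 0.05)  # unique classification at level 5%
region_1 = prediction_region(p, 0.01)  # class 3 is excluded at level 1%
```

A smaller level $\alpha$ can only enlarge the region, which is why the second region contains the first.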
Also the use of receiver operating characteristic (ROC) curves in the context of logistic regression or Fisher's [3] linear discriminant analysis is related to the present concept. One purpose of this paper is to provide a solid foundation for procedures of this type.
The remainder of this paper is organized as follows: In Section 2 we return to the idealistic situation of known prior weights $w_\theta$ and distributions $P_\theta$. Here we devise p-values that are optimal in a certain sense and related to the optimal classifier mentioned previously. These p-values serve as a gold standard for p-values in realistic settings. In addition we describe briefly McLachlan's [7] typicality indices and a potential compromise between these p-values and the optimal ones. Section 3 is devoted to p-values involving training data. After some general remarks on cross-validation and graphical representations, we discuss McLachlan's [7] p-values in view of (1.2) and (1.3). Nonparametric p-values satisfying (1.2) without any further assumptions on the distributions $P_\theta$ are proposed in Section 3.3. These p-values are based on permutation testing, and the only practical restriction is that the group sizes $N_\theta := \#\{i : Y_i = \theta\}$ within the training data should exceed the reciprocal of the intended test level $\alpha$. We claim that any reasonable classification method can be converted to yield p-values. In particular, we introduce p-values based on a suitable variant of the nearest-neighbor method. Section 3.4 deals with asymptotic properties of various p-values as the size $n$ of $\mathcal{D}$ tends to infinity. It is shown in particular that under mild regularity conditions the nearest-neighbor p-values are asymptotically equivalent to the optimal methods of Section 2. These results are analogous to results of Stone [12, Section 8] for nearest-neighbor classifiers. In Section 3.5 the nonparametric p-values are illustrated with simulated and real data. Finally, in Section 3.6 we comment on Condition (1.3) and show that the $o_p(1)$ cannot be avoided in general.
In Section 4 we comment briefly on computational aspects of our methods. Section 5 introduces the notion of 'local identifiability' for finite mixtures, which is of independent interest. For us it is helpful to define the optimal p-values in a simple manner and it is also useful for the asymptotic considerations in Section 3.4. Proofs and technical arguments are deferred to Section 6.
Let us mention a different type of confidence procedure for classification: Suppose that $[a_\theta(X, \mathcal{D}), b_\theta(X, \mathcal{D})]$ is a confidence interval for $w_\theta(X)$. Precisely, let $a_\theta(X, \mathcal{D}) \le w_\theta(X) \le b_\theta(X, \mathcal{D})$ for all $\theta \in \Theta$ with probability at least $1 - \alpha$. Then
$$\widehat{\mathcal{Y}}(X, \mathcal{D}) := \bigl\{\theta \in \Theta : b_\theta(X, \mathcal{D}) \ge \max_{b \in \Theta} a_b(X, \mathcal{D})\bigr\}$$
would be a prediction region for $Y$ such that $\widehat{Y}^*(X) \in \widehat{\mathcal{Y}}(X, \mathcal{D})$ with probability at least $1 - \alpha$. Note, however, that this gives no control over the probability that $Y \in \widehat{\mathcal{Y}}(X, \mathcal{D})$. In fact, the latter probability could be close to 50 percent. By way of contrast, with the p-values in the present paper we can guarantee to cover $Y$ with a certain confidence, even in situations where consistent estimation of the conditional probabilities $w_\theta(X)$ is difficult or even impossible.

Optimal p-values and alternatives
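Suppose that the distributions $P_1, \ldots, P_L$ have known densities $f_1, \ldots, f_L > 0$ with respect to some measure $M$ on $\mathcal{X}$. Then the marginal distribution of $X$ has density $f := \sum_{b \in \Theta} w_b f_b$ with respect to $M$, and
$$w_\theta(x) = \frac{w_\theta f_\theta(x)}{f(x)}.$$
Hence the optimal classifier $\widehat{Y}^*$ may be characterized by
$$\widehat{Y}^*(X) \in \arg\max_{\theta \in \Theta} w_\theta f_\theta(X).$$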

Optimal p-values
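Here is an analogous consideration for p-values. Let $\pi = (\pi_\theta)_{\theta \in \Theta}$ consist of p-values $\pi_\theta$ satisfying (1.1). Given the latter constraint, our goal is to provide small p-values and small prediction regions. Hence two natural measures of risk are, for instance,
$$R(\pi) := \mathbb{E} \sum_{\theta \in \Theta} \pi_\theta(X) \quad \text{or} \quad R_\alpha(\pi) := \mathbb{E}\, \#\widehat{\mathcal{Y}}_\alpha(X).$$
Elementary calculations reveal that
$$R(\pi) = \sum_{\theta \in \Theta} \int_0^1 R_\alpha(\pi_\theta)\, d\alpha \quad \text{and} \quad R_\alpha(\pi) = \sum_{\theta \in \Theta} R_\alpha(\pi_\theta)$$
with $R_\alpha(\pi_\theta) := \mathbb{P}(\pi_\theta(X) > \alpha)$.
Thus we focus on minimizing $R_\alpha(\pi_\theta)$ for arbitrary fixed $\theta \in \Theta$ and $\alpha \in (0, 1)$ under the constraint (1.1). Since $x \mapsto 1\{\pi_\theta(x) > \alpha\}$ may be viewed as a level-$\alpha$ test of $P_\theta$ versus $\sum_{b \in \Theta} w_b P_b$, a straightforward application of the Neyman-Pearson Lemma shows that the p-value
$$\pi^*_\theta(x) := P_\theta\bigl\{z \in \mathcal{X} : (f_\theta/f)(z) \le (f_\theta/f)(x)\bigr\}$$
is optimal, provided that the distribution $\mathcal{L}\bigl((f_\theta/f)(X)\bigr)$ is continuous. Two other representations of $\pi^*_\theta$ are given by
$$\pi^*_\theta(x) = P_\theta\bigl\{z \in \mathcal{X} : w_\theta(z) \le w_\theta(x)\bigr\} = P_\theta\bigl\{z \in \mathcal{X} : T^*_\theta(z) \ge T^*_\theta(x)\bigr\}, \quad T^*_\theta := \sum_{b \ne \theta} w_{b,\theta}\, \frac{f_b}{f_\theta}, \quad w_{b,\theta} := \frac{w_b}{\sum_{c \ne \theta} w_c}.$$
The former representation shows that $\pi^*_\theta(x)$ is a non-decreasing function of $w_\theta(x)$. The latter representation shows that the prior weight $w_\theta$ itself is irrelevant for the optimal p-value $\pi^*_\theta(x)$; only the ratios $w_c/w_b$ with $b, c \ne \theta$ matter. In particular, in case of $L = 2$ classes, the optimal p-values do not depend on the prior distribution of $Y$ at all.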
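To make the optimal p-value concrete, here is a minimal sketch for $L = 2$ univariate normal classes with a common standard deviation. It assumes that the optimal p-value for class $\theta$ is the $P_\theta$-probability that the likelihood ratio of the other class against class $\theta$ is at least as large as at the observed point. The function names, parameter values and the Monte Carlo sample size are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist


def log_ratio(x, mu_theta, mu_other, sigma):
    """log(f_other / f_theta)(x) for normal densities with common sigma."""
    return (x - 0.5 * (mu_theta + mu_other)) * (mu_other - mu_theta) / sigma**2


def optimal_p_value_mc(x, mu_theta, mu_other, sigma, n_draws=20000, seed=0):
    """Monte Carlo estimate of P_theta{z : ratio(z) >= ratio(x)}."""
    rng = random.Random(seed)
    t_x = log_ratio(x, mu_theta, mu_other, sigma)
    hits = sum(
        log_ratio(rng.gauss(mu_theta, sigma), mu_theta, mu_other, sigma) >= t_x
        for _ in range(n_draws)
    )
    return hits / n_draws


def optimal_p_value_exact(x, mu_theta, mu_other, sigma):
    """Closed form: the ratio is monotone in x, so the p-value is a normal tail."""
    if mu_theta == mu_other:
        raise ValueError("the two classes must have different means")
    z = (x - mu_theta) / sigma
    sign = 1.0 if mu_other > mu_theta else -1.0
    return 1.0 - NormalDist().cdf(sign * z)


# Class theta has mean 0, the other class has mean 2; the point x = 1 lies midway.
p_exact = optimal_p_value_exact(1.0, mu_theta=0.0, mu_other=2.0, sigma=1.0)
p_mc = optimal_p_value_mc(1.0, mu_theta=0.0, mu_other=2.0, sigma=1.0)
```

Note that neither function uses prior weights, in line with the remark that for two classes the optimal p-values do not depend on the prior distribution of $Y$.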
Here and throughout this paper we assume the likelihood ratios T * θ (X) to have a continuous distribution. It will be shown in Section 5 that many standard families of distributions fulfill this condition. In particular, it is satisfied in case of X = R q and P θ = N q (µ θ , Σ θ ) with parameters (µ θ , Σ θ ), Σ θ nonsingular, not all being identical. Further examples include the multivariate t-family as it has been advocated by Peel and McLachlan [8] to robustify cluster and discriminant analysis. These authors also discuss maximum likelihood via the EM algorithm in this model. Without the continuity condition on L(T * θ (X)) one could still devise optimal p-values by introducing randomized p-values, but we refrain from such extensions.
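Let us illustrate the optimal p-values in two examples involving normal distributions:
Example 2.1. (Standard model) Let $P_\theta = N_q(\mu_\theta, \Sigma)$ with mean vectors $\mu_\theta \in \mathbb{R}^q$ and a common symmetric, nonsingular covariance matrix $\Sigma \in \mathbb{R}^{q \times q}$. Then
$$T^*_\theta(x) = \sum_{b \ne \theta} w_{b,\theta} \exp\bigl((x - (\mu_\theta + \mu_b)/2)^\top \Sigma^{-1} (\mu_b - \mu_\theta)\bigr) \tag{2.1}$$
with the normalized weights $w_{b,\theta} = w_b / \sum_{c \ne \theta} w_c$. In the special case of $L = 2$ classes, let $Z(x) := (x - (\mu_1 + \mu_2)/2)^\top \Sigma^{-1} (\mu_2 - \mu_1) / \|\mu_1 - \mu_2\|_\Sigma$ with the Mahalanobis norm $\|v\|_\Sigma := (v^\top \Sigma^{-1} v)^{1/2}$. Then $\pi^*_1(x) = \Phi\bigl(-Z(x) - \|\mu_1 - \mu_2\|_\Sigma/2\bigr)$ and $\pi^*_2(x) = \Phi\bigl(Z(x) - \|\mu_1 - \mu_2\|_\Sigma/2\bigr)$, where $\Phi$ denotes the standard gaussian c.d.f. In case of $\|\mu_1 - \mu_2\|_\Sigma/2 \ge \Phi^{-1}(1 - \alpha)$, the prediction region $\widehat{\mathcal{Y}}_\alpha(x)$ contains at most one class label. Thus the two classes are separated well so that any observation $X$ is classified uniquely (or viewed as suspicious) with confidence $1 - \alpha$. In case of $\|\mu_1 - \mu_2\|_\Sigma/2 < \Phi^{-1}(1 - \alpha)$, the feature space contains regions with unique prediction and a region in which both class labels are plausible.
Example 2.2. Consider $L = 3$ classes with equal prior weights $w_\theta = 1/3$ and bivariate normal distributions $P_\theta$. Figure 1 shows a typical sample from this distribution and the corresponding p-value functions $\pi^*_\theta$. The latter are on a grey scale with white corresponding to zero and black corresponding to one. The resulting prediction regions $\widehat{\mathcal{Y}}_\alpha(x)$ for $\alpha = 5\%$ and $\alpha = 1\%$ are depicted in Figure 2. In the latter plots, the color of a point $x \in \mathbb{R}^2$ has the following meaning: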

Typicality indices
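An alternative definition of p-values is based on the densities themselves, namely,
$$\tau_\theta(x) := P_\theta\bigl\{z \in \mathcal{X} : f_\theta(z) \le f_\theta(x)\bigr\}.$$
These typicality indices quantify to what extent a point $x$ is an outlier with respect to the single distributions $P_\theta$. These p-values $\tau_\theta$ are certainly suboptimal in terms of the risk $R_\alpha(\pi_\theta)$. On the other hand, they allow for the detection of observations which belong to none of the classes under consideration.
In the special case of $\mathcal{X} = \mathbb{R}^q$ and $P_\theta = N_q(\mu_\theta, \Sigma_\theta)$, with $\|X - \mu_\theta\|^2_{\Sigma_\theta}$ having conditional distribution $\chi^2_q$ given $Y = \theta$, the typicality indices may be expressed as $\tau_\theta(x) = 1 - F_q\bigl(\|x - \mu_\theta\|^2_{\Sigma_\theta}\bigr)$, where $\|v\|^2_\Sigma := v^\top \Sigma^{-1} v$ and $F_q$ denotes the c.d.f. of $\chi^2_q$. These p-values allow for the separation of two different classes $\theta, b \in \Theta$ only if [...] is sufficiently large. Thus they suffer from the curse of dimensionality and may yield much more conservative prediction regions than the p-values $\pi^*_\theta$.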

Combining the optimal p-values and typicality indices
The optimal p-values $\pi^*_\theta$ and the typicality indices $\tau_\theta$ may be viewed as extremal members of a whole family of p-values if we introduce an additional class label 0 with 'density' $f_0 \equiv 1$ and prior weight $w_0 > 0$. Then we define the compromise p-value [...]
In the setting of Example 2.1 there is another modification which is similar in spirit to Ehm et al. [1]: When defining the p-value for a particular class $\theta$ we replace the other distributions [...]

Training data
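Now we return to the realistic situation of unknown distributions $P_\theta$ and p-values $\pi_\theta(X, \mathcal{D})$ with corresponding prediction regions $\widehat{\mathcal{Y}}_\alpha(X, \mathcal{D})$. From now on we consider the class labels $Y_1, Y_2, \ldots, Y_n$ as fixed while $X_1, X_2, \ldots, X_n$ and $(X, Y)$ are independent with $\mathcal{L}(X_i) = P_{Y_i}$. That way we can cover the case of i.i.d. training data (via conditioning) as well as situations with stratified training samples. In what follows let
$$\mathcal{G}_\theta := \{i \in \{1, \ldots, n\} : Y_i = \theta\} \quad \text{and} \quad N_\theta := \#\mathcal{G}_\theta.$$
We shall tacitly assume that all group sizes $N_\theta$ are strictly positive, and asymptotic statements as in (1.3) are meant as
$$n \to \infty \quad \text{and} \quad N_b/n \to w_b \ \text{ for all } b \in \Theta. \tag{3.1}$$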

Visual assessment and estimation of separability
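Before giving explicit examples of p-values, let us describe our way of visualizing the separability of different classes by means of given p-values $\pi_\theta(\cdot, \cdot)$. For that purpose we propose to compute cross-validated p-values
$$\pi_\theta(X_i, \mathcal{D}_i)$$
for $i = 1, 2, \ldots, n$ with $\mathcal{D}_i$ denoting the training data without observation $(X_i, Y_i)$. Thus each training observation $(X_i, Y_i)$ is treated temporarily as a 'future' observation to be classified with the remaining data $\mathcal{D}_i$. Then we display these cross-validated p-values graphically. This is particularly helpful for training samples of small or moderate size. In addition to graphical displays one can compute the empirical conditional inclusion probabilities
$$\widehat{I}_\alpha(b, \theta) := \frac{\#\{i \in \mathcal{G}_b : \theta \in \widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)\}}{N_b}$$
and the empirical pattern probabilities
$$\widehat{P}_\alpha(b, S) := \frac{\#\{i \in \mathcal{G}_b : \widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i) = S\}}{N_b}$$
for $b, \theta \in \Theta$ and $S \subset \Theta$. These numbers $\widehat{I}_\alpha(b, \theta)$ and $\widehat{P}_\alpha(b, S)$ can be interpreted as estimators of
$$I_\alpha(b, \theta \mid \mathcal{D}) := \mathbb{P}\bigl(\theta \in \widehat{\mathcal{Y}}_\alpha(X, \mathcal{D}) \mid Y = b, \mathcal{D}\bigr) \quad \text{and} \quad P_\alpha(b, S \mid \mathcal{D}) := \mathbb{P}\bigl(\widehat{\mathcal{Y}}_\alpha(X, \mathcal{D}) = S \mid Y = b, \mathcal{D}\bigr),$$
respectively; see also Section 3.4. For large group sizes $N_b$, one can also display the empirical ROC curves
$$(0, 1) \ni \alpha \mapsto 1 - \widehat{I}_\alpha(b, \theta),$$
which are closely related to the usual ROC curves employed, for instance, in logistic regression or linear discriminant analysis involving $L = 2$ classes.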
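The empirical summaries described above are simple counts. The following sketch assumes that the inclusion probability for a pair $(b, \theta)$ is the fraction of training observations from class $b$ whose cross-validated prediction region contains $\theta$, and that the pattern probability for $(b, S)$ is the fraction whose region equals $S$. The function name, the data layout and the example numbers are hypothetical.

```python
from collections import Counter


def inclusion_and_pattern_probabilities(cv_p_values, labels, classes, alpha):
    """Summarize cross-validated p-values.

    cv_p_values[i][theta] is the cross-validated p-value of class theta
    for training observation i; labels[i] is its true class.
    Every class in `classes` is assumed to occur at least once in `labels`.
    """
    inclusion = {}
    pattern = {}
    for b in classes:
        group = [i for i, y in enumerate(labels) if y == b]
        # Prediction region of each observation in group b.
        regions = [
            frozenset(t for t in classes if cv_p_values[i][t] > alpha)
            for i in group
        ]
        n_b = len(group)
        for theta in classes:
            inclusion[(b, theta)] = sum(theta in r for r in regions) / n_b
        for s, count in Counter(regions).items():
            pattern[(b, s)] = count / n_b
    return inclusion, pattern


# Hypothetical cross-validated p-values for four observations and two classes.
labels = [1, 1, 2, 2]
cv_p = [
    {1: 0.50, 2: 0.01},
    {1: 0.30, 2: 0.20},
    {1: 0.02, 2: 0.70},
    {1: 0.04, 2: 0.60},
]
inclusion, pattern = inclusion_and_pattern_probabilities(cv_p, labels, [1, 2], 0.05)
```

Evaluating `1 - inclusion[(b, theta)]` over a grid of levels `alpha` would give the points of an empirical ROC curve for the pair $(b, \theta)$.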

Typicality indices
For the sake of simplicity, suppose that $P_\theta = N_q(\mu_\theta, \Sigma)$ with unknown mean vectors $\mu_1, \ldots, \mu_L \in \mathbb{R}^q$ and an unknown nonsingular covariance matrix $\Sigma \in \mathbb{R}^{q \times q}$. Consider the standard estimators
$$\widehat{\mu}_\theta := N_\theta^{-1} \sum_{i \in \mathcal{G}_\theta} X_i \quad \text{and} \quad \widehat{\Sigma} := (n - L)^{-1} \sum_{\theta \in \Theta} \sum_{i \in \mathcal{G}_\theta} (X_i - \widehat{\mu}_\theta)(X_i - \widehat{\mu}_\theta)^\top.$$
Then the squared Mahalanobis distance
$$T_\theta(X, \mathcal{D}) := (X - \widehat{\mu}_\theta)^\top \widehat{\Sigma}^{-1} (X - \widehat{\mu}_\theta)$$
can be used to assess the plausibility of class $\theta$, where we assume that $n \ge L + q$. Precisely,
$$\mathcal{L}\bigl(C_\theta T_\theta(X, \mathcal{D}) \mid Y = \theta\bigr) = F_{q,\, n-L-q+1} \quad \text{with} \quad C_\theta := \frac{n - L - q + 1}{q\,(n - L)\,(1 + N_\theta^{-1})};$$
see [7]. Here $F_{k,z}$ denotes the $F$-distribution with $k$ and $z$ degrees of freedom, and we use the same symbol for the corresponding c.d.f. Hence the typicality index
$$\tau_\theta(X, \mathcal{D}) := 1 - F_{q,\, n-L-q+1}\bigl(C_\theta T_\theta(X, \mathcal{D})\bigr)$$
is a p-value satisfying (1.2). Moreover, since the estimators $\widehat{\mu}_b$ and $\widehat{\Sigma}$ are consistent, one can easily verify property (1.3) as well.
Example 3.1. An array of ten electrochemical sensors is used for "smelling" different substances. In each case it produces raw data $X \in \mathbb{R}^{10}$ consisting of the electrical resistances of these sensors. Before analyzing such data one should standardize them in order to achieve invariance with respect to the substance's concentration. One possible standardization is to replace $X$ with [...].
Thus we end up with data vectors in $\mathbb{R}^9$. For technical reasons, group sizes $N_\theta$ are typically small, and not too many future observations may be analysed. This is due to the fact that the system needs to be recalibrated regularly. Now we consider a specific dataset with "smells" of $L = 12$ different brands of tobacco and fixed group sizes $N_\theta = 3$ for all $\theta \in \Theta$. We computed the cross-validated typicality indices $\tau_\theta(X_i, \mathcal{D}_i)$ described above. Figure 3 displays for each training observation $(X_i, Y_i)$ the p-values $\tau_1(X_i, \mathcal{D}_i), \ldots, \tau_{12}(X_i, \mathcal{D}_i)$ as a row of twelve rectangles. The area of these is proportional to the corresponding p-value. The first three rows correspond to data from the first brand, the next three rows to the second brand, and so on. Figure 4 displays the corresponding prediction regions $\widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ for $\alpha = 0.01$. Within each row the elements of $\widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ are indicated by rectangles of full size. These pictures show that classes 1 and 2 are each separated well from the other eleven classes. Classes 5, 8, 9 and 12 overlap somewhat but are clearly separated from the remaining eight classes. Finally there are three pairs of classes which are essentially impossible to distinguish, at least with the present method, but which are separated well from the other ten classes. These pairs are 3-4, 6-7, and 10-11. It turned out later that brands 6 and 7 were in fact identical. Note also that all except one prediction region $\widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ contain the true class and at most three additional class labels.

Nonparametric p-values via permutation tests
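For a particular class $\theta$ let $I(1) < I(2) < \cdots < I(N_\theta)$ be the elements of $\mathcal{G}_\theta$. An elementary but useful fact is that $(X, X_{I(1)}, X_{I(2)}, \ldots, X_{I(N_\theta)})$ is exchangeable conditional on $Y = \theta$. Thus let $T_\theta(X, \mathcal{D})$ be a test statistic which is symmetric in $(X_{I(j)})_{j=1}^{N_\theta}$. We define $\mathcal{D}_i(x)$ to be the training data with $x$ in place of $X_i$. Then the nonparametric p-value for class $\theta$ is defined as
$$\pi_\theta(X, \mathcal{D}) := \frac{\#\{i \in \mathcal{G}_\theta : T_\theta(X_i, \mathcal{D}_i(X)) \ge T_\theta(X, \mathcal{D})\} + 1}{N_\theta + 1}. \tag{3.2}$$
By exchangeability, this p-value satisfies (1.2).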
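The permutation construction can be sketched as follows. The sketch assumes the p-value has the form (count + 1)/(N_θ + 1), where the count runs over training observations of class θ whose statistic, computed after swapping that observation with the new point, is at least as large as the statistic of the new point. The statistic `distance_to_class_mean` is a hypothetical one-dimensional example chosen only because it is symmetric in the class-θ features; it is not a statistic proposed in the text.

```python
def nonparametric_p_value(x, theta, data, statistic):
    """Permutation p-value for the hypothesis that x belongs to class theta.

    data:      list of (feature, label) pairs.
    statistic: function (point, theta, data) -> float, where large values
               indicate that class theta is implausible for `point`.
    """
    t_new = statistic(x, theta, data)
    count = 0
    n_theta = 0
    for i, (x_i, y_i) in enumerate(data):
        if y_i != theta:
            continue
        n_theta += 1
        # Training data with x in place of the i-th feature vector.
        swapped = data[:i] + [(x, y_i)] + data[i + 1:]
        if statistic(x_i, theta, swapped) >= t_new:
            count += 1
    return (count + 1) / (n_theta + 1)


def distance_to_class_mean(point, theta, data):
    """Hypothetical statistic: distance of `point` to the mean of class theta."""
    xs = [x for x, y in data if y == theta]
    return abs(point - sum(xs) / len(xs))


training = [(0.0, 1), (1.0, 1), (2.0, 1), (10.0, 2), (11.0, 2)]
p_central = nonparametric_p_value(1.0, 1, training, distance_to_class_mean)
p_outlying = nonparametric_p_value(10.0, 1, training, distance_to_class_mean)
```

With three observations in class 1 the smallest attainable p-value is 1/4, which illustrates why the group sizes should exceed the reciprocal of the intended level $\alpha$.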
As for the test statistic T θ (X, D), the optimal p-value in Section 2 suggests using an estimator for the weighted likelihood ratio T * θ (x) or a strictly increasing transformation thereof. In very high-dimensional settings this may be too ambitious, and T θ (X, D) could be any test statistic quantifying the implausibility of "Y = θ".
Plug-in statistic for standard gaussian model. For the setting of Example 2.1 and Section 3.2 one could replace the unknown parameters $w_c$, $\mu_c$ and $\Sigma$ in $T^*_\theta$ with $N_c/n$, $\widehat{\mu}_c$ and $\widehat{\Sigma}$, respectively. Note that the resulting p-values always satisfy (1.2), even if the underlying distributions $P_c$ are not gaussian with common covariance matrix.
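Nearest-neighbor estimation. One could estimate $w_\theta(\cdot)$ via nearest neighbors. Suppose that $d(\cdot, \cdot)$ is some metric on $\mathcal{X}$. Let $B(x, r) := \{y \in \mathcal{X} : d(x, y) \le r\}$, and for a fixed positive integer $k \le n$ define
$$\widehat{r}_k(x) = \widehat{r}_k(x, \mathcal{D}) := \min\bigl\{r \ge 0 : \#\{i \le n : X_i \in B(x, r)\} \ge k\bigr\}.$$
Further let $\widehat{P}_\theta$ denote the empirical distribution of the points $X_i$, $i \in \mathcal{G}_\theta$, i.e.
$$\widehat{P}_\theta(B) := N_\theta^{-1}\, \#\{i \in \mathcal{G}_\theta : X_i \in B\} \quad \text{for } B \subset \mathcal{X}.$$
Then the $k$-nearest-neighbor estimator of $w_\theta(x)$ is given by
$$\widehat{w}_\theta(x, \mathcal{D}) := \frac{\widehat{w}_\theta\, \widehat{P}_\theta\bigl(B(x, \widehat{r}_k(x))\bigr)}{\sum_{b \in \Theta} \widehat{w}_b\, \widehat{P}_b\bigl(B(x, \widehat{r}_k(x))\bigr)}$$
with certain estimators $\widehat{w}_b$ of $w_b$. The resulting nonparametric p-value is defined with $T_\theta(x, \mathcal{D}) := -\widehat{w}_\theta(x, \mathcal{D})$. Note that in case of $\widehat{w}_b = N_b/n$, we simply end up with the ratio
$$\widehat{w}_\theta(x, \mathcal{D}) = \frac{\#\{i \in \mathcal{G}_\theta : d(X_i, x) \le \widehat{r}_k(x)\}}{\#\{i \le n : d(X_i, x) \le \widehat{r}_k(x)\}}.$$
For simplicity, we assume $k$ to be determined by the group sizes $N_1, \ldots, N_L$ only. Of course one could define $\pi_\theta(X, \mathcal{D})$ with $k = k_\theta(X, \mathcal{D})$ nearest neighbors of $X$, as long as $k_\theta(X, \mathcal{D})$ is symmetric in the $N_\theta + 1$ feature vectors $X$ and $X_i$, $i \in \mathcal{G}_\theta$. Moreover, in applications where the different components of $X$ are measured on rather different scales, it might be reasonable to replace $d(\cdot, \cdot)$ with some data-driven metric.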
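As an illustration of the simple ratio form, the following sketch assumes estimated prior weights $N_b/n$, so that the estimator is the share of class-θ points among all training points within the k-th smallest distance from $x$. The function names, the default absolute-value metric and the data are hypothetical.

```python
def knn_posterior_weight(x, theta, data, k, dist=lambda a, b: abs(a - b)):
    """k-nearest-neighbor estimate of the posterior weight of class theta at x.

    data: list of (feature, label) pairs; k must satisfy 1 <= k <= len(data).
    Ties at the k-th distance are all included in the ball.
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must lie between 1 and the number of observations")
    distances = sorted(dist(x, x_i) for x_i, _ in data)
    r_k = distances[k - 1]  # smallest radius whose ball holds at least k points
    in_ball = [y_i for x_i, y_i in data if dist(x, x_i) <= r_k]
    return sum(1 for y in in_ball if y == theta) / len(in_ball)


def knn_statistic(x, theta, data, k):
    """Test statistic: large values mean that class theta is implausible."""
    return -knn_posterior_weight(x, theta, data, k)


training = [(0.0, 1), (0.5, 1), (1.0, 2), (3.0, 2), (4.0, 2)]
w_1 = knn_posterior_weight(0.4, 1, training, k=3)
w_2 = knn_posterior_weight(0.4, 2, training, k=3)
```

Passing `knn_statistic` (with `k` fixed) to a permutation scheme then gives a nearest-neighbor p-value; a data-driven metric could be supplied through the `dist` argument.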
Logistic regression. Suppose for simplicity that there are $L = 2$ classes and that $X \in \mathbb{R}^d$ contains the values of $d$ numerical or binary variables. Let $(\widehat{a}, \widehat{b}) = (\widehat{a}(\mathcal{D}), \widehat{b}(\mathcal{D}))$ be the maximum likelihood estimator for the parameter $(a, b) \in \mathbb{R} \times \mathbb{R}^d$ in the logistic model, where [...] Then possible candidates for $T_1(x, \mathcal{D})$ and $T_2(x, \mathcal{D})$ are given by [...] Extensions to multicategory logistic regression as well as the inclusion of regularization terms to deal with high-dimensional covariable vectors $X$ are possible and will be described elsewhere.

Asymptotic properties
Now we analyze the asymptotic behavior of the nonparametric p-values $\pi_\theta(X, \mathcal{D})$ and the corresponding empirical probabilities $\widehat{I}_\alpha(b, \theta)$ and $\widehat{P}_\alpha(b, S)$. Throughout this section, asymptotic statements are to be understood within setting (3.1). As in Section 2 we assume that the distributions $P_\theta$ have strictly positive densities with respect to some measure $M$ on $\mathcal{X}$. The following theorem implies that $\pi_\theta(X, \mathcal{D})$ satisfies (1.3) under certain conditions on the underlying test statistic $T_\theta(X, \mathcal{D})$. In addition the empirical probabilities $\widehat{I}_\alpha(b, \theta)$ and $\widehat{P}_\alpha(b, S)$ turn out to be consistent estimators of $I_\alpha(b, \theta \mid \mathcal{D})$ and $P_\alpha(b, S \mid \mathcal{D})$, respectively.
Theorem 3.1. Suppose that for fixed $\theta \in \Theta$ there exists a test statistic $T^o_\theta$ on $\mathcal{X}$ satisfying the following two requirements: [...] In particular, for arbitrary fixed $\alpha \in (0, 1)$, [...]
If the limiting test statistic $T^o_\theta$ is equal to $T^*_\theta$ or some strictly increasing transformation thereof, then the nonparametric p-value $\pi_\theta(\cdot, \mathcal{D})$ is asymptotically optimal. The next two lemmata describe situations in which Condition (3.3) or (3.4) is satisfied.
Lemma 3.2. [...]
Lemma 3.3. Suppose that $(\mathcal{X}, d)$ is a separable metric space and that all densities $f_b$, $b \in \Theta$, are continuous on $\mathcal{X}$. Alternatively, suppose that $\mathcal{X} = \mathbb{R}^q$ equipped with some norm. Then Condition (3.3) is satisfied with $T^o_\theta = T^*_\theta$ in case of the $k$-nearest-neighbor rule with $\widehat{w}_\theta = N_\theta/n$, provided that $k = k(n) \to \infty$ and $k/n \to 0$.

Examples
The nonparametric p-values are illustrated with two examples.
Example 3.2. The lower right panel in Figure 1 shows simulated training data from the model in Example 2.2, where $N_1 = N_2 = N_3 = 100$. Now we computed the corresponding prediction regions $\widehat{\mathcal{Y}}_{0.05}(x, \mathcal{D})$ based on the plug-in method for the standard gaussian model (which is not correct here) and on the nearest-neighbor method with $k = 100$ and standard euclidean distance. Figure 5 depicts these prediction regions.
To judge the performance of the nonparametric p-values visually we chose ROC curves, where we concentrated on the plug-in method. In Figure 6 we show for each pair (b, θ) ∈ Θ × Θ the true ROC curves of π * θ (·) and π θ (·, D), both of which had been estimated in 40'000 Monte Carlo Simulations of X ∼ P θ . In addition we show the empirical ROC curve α → 1 − I α (b, θ) (black step function). Note first that the difference between the (conditional) ROC curve of π θ (·, D) and its empirical counterpart 1 − I α (b, θ | D) is always rather small, despite the moderate group sizes N b = 100. Note further that the ROC curves of π θ (·, D) and π * θ (·) are also close together, despite the fact that the plug-in method uses an incorrect model. These pictures show clearly that distinguishing between classes 1 and 2 is more difficult than distinguishing between classes 2 and 3, while classes 1 and 3 are separated almost perfectly.
Of course these pictures give only partial information about the performance of the p-values. In addition one could investigate the joint distribution of the p-values via pattern probabilities; see also the next example.
Table 1: Empirical performance of $\widehat{\mathcal{Y}}_{0.05}(\cdot, \cdot)$ and $\widehat{\mathcal{Y}}_{0.01}(\cdot, \cdot)$ in Example 3.3.
Example 3.3. This example is from a data base on quality management at the University hospital at Lübeck. In a long-term study on mortality of patients after a certain type of heart surgery, data of more than 20'000 cases have been reported. The dependent variable is $Y \in \{1, 2\}$ with $Y = 1$ and $Y = 2$ meaning that the patient survived the operation or not, respectively. For each case there were $q = 21$ numerical or binary covariables describing the patient (e.g. sex, age, various specific risk factors) plus covariables describing the circumstances of the operation (e.g. emergency or not, experience of the surgeon).
We reduced the data set by taking all $N_2 = 662$ observations with $Y = 2$ and a random subsample of $N_1 = 3N_2 = 1986$ observations with $Y = 1$. Without such a reduction, the nearest-neighbor method would not work well due to the very different group sizes. Now we computed nonparametric cross-validated p-values based on the plug-in method from the standard gaussian model, logistic regression, and the nearest-neighbor method with $k = 200$. In the latter case, we first divided each component of $X$ corresponding to a non-dichotomous variable by its sample standard deviation, because the variables are measured on very different scales. Table 1 reports the performance of $\widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ as a predictor of $Y_i$ for $\alpha = 5\%$ and $\alpha = 1\%$. In each cell of the table the entries correspond to the three methods mentioned above. This example shows the p-values' potential to classify a certain fraction of cases unambiguously even in situations in which overall risks of classifiers are not small, which is rather typical in medical applications. Note again that the method does not require any knowledge of prior probabilities. Logistic regression yielded slightly better results than the other two in terms of the fraction of cases with $\widehat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i) = \{Y_i\}$. The other two methods performed similarly.

Impossibility of strengthening (1.3)
Comparing (1.2) and (1.3), one might want to strengthen the latter requirement to
$$\mathbb{P}\bigl(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = \theta, \mathcal{D}\bigr) \le \alpha \quad \text{almost surely.} \tag{3.8}$$
However, the following lemma entails that there are no reasonable p-values satisfying (3.8). Recall that we are aiming at p-values such that $\mathbb{P}\bigl(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = b\bigr)$ is large for $b \ne \theta$.

Computational aspects
The computation of the p-values in (3.2) may be rather time-consuming, depending on the particular test statistic T θ (·, D). Just think about classification methods involving variable selection or tuning of artificial neural networks by means of D. Also the nearest-neighbor method with some data-driven choice of k or the metric d(·, ·) may result in tedious procedures. In order to compute π θ (·, D) as well as π θ (X i , D i ) one can typically reduce the computational complexity considerably by using suitable update formulae or shortcuts.
Naive shortcuts for the nonparametric p-values. One might be tempted to replace $\pi_\theta(X, \mathcal{D})$ with the naive p-values
$$\pi^{\mathrm{naive}}_\theta(X, \mathcal{D}) := \frac{\#\{i \in \mathcal{G}_\theta : T_\theta(X_i, \mathcal{D}) \ge T_\theta(X, \mathcal{D})\} + 1}{N_\theta + 1}.$$
One can easily show that the conclusions of Theorem 3.1 remain true with $\pi^{\mathrm{naive}}_\theta(\cdot, \cdot)$ in place of $\pi_\theta(\cdot, \cdot)$. However, finite sample validity in the sense of (1.2) is not satisfied in general, so we prefer the alternative shortcut described next. Note also that empirical ROC curves offered by some statistical software packages, as a complement to logistic regression or linear discriminant analysis with two classes, are often based on this shortcut.
Valid shortcuts for the nonparametric p-values. Often the computations as well as the program code become much simpler if we replace $T_\theta(X, \mathcal{D})$ and $T_\theta(X_i, \mathcal{D}_i(X))$ in Definition (3.2) with $T_\theta(X, \mathcal{D}(X, \theta))$ and $T_\theta(X_i, \mathcal{D}(X, \theta))$, respectively, where $\mathcal{D}(X, \theta)$ denotes the training data $\mathcal{D}$ after adding the "observation" $(X, \theta)$. That means, before judging whether $\theta$ is a plausible class label for a new observation $X$, we augment the training data by $(X, \theta)$ to determine the test statistic $T_\theta(\cdot, \mathcal{D}(X, \theta))$. Then we just evaluate the latter function at the $N_\theta + 1$ points $X$ and $X_i$, $i \in \mathcal{G}_\theta$, to compute
$$\pi_\theta(X, \mathcal{D}) := \frac{\#\{i \in \mathcal{G}_\theta : T_\theta(X_i, \mathcal{D}(X, \theta)) \ge T_\theta(X, \mathcal{D}(X, \theta))\} + 1}{N_\theta + 1}.$$
This p-value does satisfy Condition (1.2), and the conclusions of Theorem 3.1 remain true as well. In this context it might be helpful if the underlying test statistics satisfy some moderate robustness properties, because $X$ may be an outlier with respect to the distribution $P_\theta$.
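The valid shortcut fits the statistic only once per candidate label. The sketch below assumes the p-value is (count + 1)/(N_θ + 1), with the statistic determined from the training data augmented by the new point labelled θ. The fitting routine `fit_distance_to_mean` is a hypothetical one-dimensional example and not a statistic taken from the text.

```python
def shortcut_p_value(x, theta, data, fit_statistic):
    """P-value for class theta using one fit on the augmented training data.

    data:          list of (feature, label) pairs.
    fit_statistic: function (theta, data) -> function point -> float,
                   where large values indicate implausibility of class theta.
    """
    augmented = data + [(x, theta)]        # add the "observation" (x, theta)
    t = fit_statistic(theta, augmented)    # fitted once, then only evaluated
    t_new = t(x)
    group = [x_i for x_i, y_i in data if y_i == theta]
    count = sum(1 for x_i in group if t(x_i) >= t_new)
    return (count + 1) / (len(group) + 1)


def fit_distance_to_mean(theta, data):
    """Hypothetical statistic: distance to the mean of the class-theta features."""
    xs = [x for x, y in data if y == theta]
    center = sum(xs) / len(xs)
    return lambda point: abs(point - center)


training = [(0.0, 1), (1.0, 1), (2.0, 1), (10.0, 2), (11.0, 2)]
p_central = shortcut_p_value(1.0, 1, training, fit_distance_to_mean)
p_outlying = shortcut_p_value(10.0, 1, training, fit_distance_to_mean)
```

In the second call the outlying point 10.0 pulls the fitted class mean from 1.0 to 3.25, which illustrates why some robustness of the statistic can be helpful here.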
Update formulae for sample means and covariances. In connection with the typicality indices of Section 3.2 or the plug-in method for the standard gaussian model, elementary calculations reveal the following update formulae for groupwise mean vectors and sample covariance matrices: Replacing $\mathcal{D}$ with the reduced data set $\mathcal{D}_i$ for some $i \in \mathcal{G}_\theta$ has no impact on [...] Replacing $\mathcal{D}$ with the modified data set $\mathcal{D}_i(X)$ for some $i \in \mathcal{G}_\theta$ results in [...]
Update formulae for the nearest-neighbor method. For convenience we restrict our attention to the valid shortcut involving $\mathcal{D}(X, \theta)$. To compute the resulting p-values $\pi^{\mathrm{naive}}_\theta(X, \mathcal{D}(X, \theta))$ quickly for arbitrary feature vectors $X \in \mathcal{X}$, it is convenient to store the $n(1 + 2L)$ numbers [...] with $i \in \{1, \ldots, n\}$ and $b \in \Theta$, where [...] For then one can easily verify that [...] Hence classifying a new feature vector $X$ requires only $O(n)$ steps for determining the $1 + L^2$ numbers $\widehat{r}_k(X, \mathcal{D}(X, \theta))$ and $N_b(X, \mathcal{D}(X, \theta))$ and the $nL^2$ numbers $N_b(X_i, \mathcal{D}(X, \theta))$, where $1 \le i \le n$ and $b, \theta \in \Theta$.
Computing the crossvalidated p-values with the valid shortcut is particularly easy, because replacing one training observation (X i , Y i ) with (X i , θ) does not affect the radii r k (x, D).
In case of data-driven choice of k or d(·, ·), the preceding formulae are no longer applicable. Then the valid shortcut is particularly useful to reduce the computational complexity.

Likelihood ratios and local identifiability
In previous sections we assumed that the distribution of likelihood ratios such as w θ (X) or T * θ (X) is continuous. This property is related to a property which we call 'local identifiability', a strengthening of the well-known notion of identifiability for finite mixtures. Throughout this section we assume that the distributions P 1 , P 2 , . . . , P L belong to a given model (Q ξ ) ξ∈Ξ of probability distributions Q ξ with densities g ξ > 0 with respect to some measure M on X .
A standard example of an identifiable family is the set of all nondegenerate gaussian distributions on $\mathbb{R}^q$; see [13]. Holzmann et al. [6] provide a rather comprehensive list of identifiable classes of multivariate distributions. In particular, they verify identifiability of families of elliptically symmetric distributions on $\mathcal{X} = \mathbb{R}^q$ with Lebesgue densities of the form
$$g_\xi(x) = \det(\Sigma)^{-1/2}\, h_q\bigl((x - \mu)^\top \Sigma^{-1} (x - \mu);\, \zeta\bigr). \tag{5.1}$$
Here the parameter $\xi = (\mu, \Sigma, \zeta)$ consists of an arbitrary location parameter $\mu \in \mathbb{R}^q$, an arbitrary symmetric and positive definite scatter matrix $\Sigma \in \mathbb{R}^{q \times q}$ and an additional shape parameter $\zeta$ which may also vary in the mixture. For each shape parameter $\zeta$, the 'density generator' $h_q(\cdot; \zeta)$ is a nonnegative function on $[0, \infty)$ such that $\int_{\mathbb{R}^q} h_q(\|x\|^2; \zeta)\, dx = 1$. One particular example is the family of multivariate t-distributions with
$$h_q(u; \zeta) = \frac{\Gamma((\zeta + q)/2)}{(\pi \zeta)^{q/2}\, \Gamma(\zeta/2)} \Bigl(1 + \frac{u}{\zeta}\Bigr)^{-(\zeta + q)/2}$$
for $\zeta > 0$. We mention that the subsequent arguments apply to most of the elliptically symmetric families discussed by Holzmann et al. [6]. Peel et al. [9] discuss classification for directional data and our method can be extended to distributions with non-euclidean domain, combining the arguments below with methods in Holzmann et al. [5].
As prominent examples we mention the von Mises family for directional data and the Kent family for spherical data.
Local identifiability entails the following conclusion: Suppose that Q is equal to m j=1 λ j Q ξ(j) for some number m ∈ N, pairwise different parameters ξ(1), . . . , ξ(m) in Ξ and nonnegative numbers λ 1 , . . . , λ m . Then one can determine the ingredients m, ξ(1), . . . , ξ(m) and λ 1 , . . . , λ m from the restriction of Q to any fixed measurable set B o ⊂ X with M (B o ) > 0. The following theorem provides a sufficient criterion for local identifiability which is easily verified in many standard examples.
Theorem 5.1. Let $M$ be Lebesgue measure on $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_q$ with open intervals $\mathcal{X}_k \subset \mathbb{R}$. Suppose that the following two conditions are satisfied: (i) [...]; (ii) [...] may be extended to a holomorphic function on some open subset of $\mathbb{C}$ containing $\mathcal{X}_k$. Then the family $(Q_\xi)_{\xi \in \Xi}$ is locally identifiable.
One can easily verify that Condition (ii) of Theorem 5.1 is satisfied by the densities g ξ in (5.1), if the density generators h q (·; ζ) may be extended to holomorphic functions on some open subset of C containing [0, ∞). Hence, for instance, the family of all multivariate t-distributions is locally identifiable.

Proofs
Proof of Theorem 3.1. Since the distributions $P_1, \ldots, P_L$ are mutually absolutely continuous, Condition [...] tends to zero for any fixed $\epsilon > 0$. It follows from the elementary inequality [...] for real numbers $r, r_o, s, s_o$ that [...] where $|R_1| \le (N_\theta + 1)^{-1}$ and [...] by virtue of Condition (3.4). These considerations show that [...] where $\widehat{F}_\theta(r) := \widehat{P}_\theta\{z \in \mathcal{X} : T^o_\theta(z) \ge r\}$, $F_\theta(r) := P_\theta\{z \in \mathcal{X} : T^o_\theta(z) \ge r\}$.
Here we utilized the well-known fact [11] that $\|\widehat{F}_\theta - F_\theta\|_\infty = o_p(1)$. Since $\pi^o_\theta(X) = F_\theta(T^o_\theta(X))$, this entails Conclusion (3.5). As to the remaining assertions (3.6-3.7), note first that (3.5) implies that [...] tends to zero for any fixed $\epsilon > 0$, again a consequence of mutual absolute continuity of $P_1, \ldots, P_L$. Similarly as in the proof of (3.5) one can verify that [...] with $\widehat{G}_{b,\theta}(u) := \widehat{P}_b\{z \in \mathcal{X} : \pi^o_\theta(z) > u\}$ and $G_{b,\theta}(u) := P_b\{z \in \mathcal{X} : \pi^o_\theta(z) > u\}$, while [...] Since the latter probability tends to zero as $\epsilon \downarrow 0$, we obtain Claim (3.7). This implies Claim (3.6), because [...]
Proof of Lemma 3.2. It is a simple consequence of the weak law of large numbers that [...]. Now one can easily show that (3.3) is satisfied with $T^o_\theta$ defined as in (2.1). The results from Section 5 entail that $\mathrm{Leb}_q\{x \in \mathbb{R}^q : T^o_\theta(x) = c\} = 0$ for any $c > 0$, so that (3.4) is satisfied as well.
Proof of Lemma 3.3. The assumptions imply the existence of a Borel set $\mathcal{X}_o \subset \mathcal{X}$ with $\mathbb{P}(X \in \mathcal{X}_o) = 1$ such that the following additional requirements are satisfied:
$$\mathbb{P}(X \in B(x, r)) > 0 \quad \text{for all } x \in \mathcal{X}_o,\ r > 0, \tag{6.1}$$
$$\lim_{r \downarrow 0} \frac{P_b(B(x, r))}{P_\theta(B(x, r))} = \frac{f_b}{f_\theta}(x) \quad \text{for all } \theta, b \in \Theta,\ x \in \mathcal{X}_o. \tag{6.2}$$
In case of continuous densities $f_1, f_2, \ldots, f_L > 0$ on a separable metric space $(\mathcal{X}, d)$, this is easily verified with $\mathcal{X}_o$ being the support of $\mathcal{L}(X)$, i.e. the smallest closed set such that $\mathbb{P}(X \in \mathcal{X}_o) = 1$. In case of $\mathcal{X} = \mathbb{R}^q$ and $d(x, y) = \|x - y\|$, existence of such a set $\mathcal{X}_o$ is a known result from geometric measure theory; cf. Federer [2, Theorem 2.9.8].
In view of (6.1-6.2), it suffices to show that for arbitrary fixed $x \in \mathcal{X}_o$ and $b \in \Theta$,
$$\widehat{r}_{k(n)}(x) \to_p 0 \quad \text{and} \quad \frac{\widehat{P}_b\bigl(B(x, \widehat{r}_{k(n)}(x))\bigr)}{P_b\bigl(B(x, \widehat{r}_{k(n)}(x))\bigr)} \to_p 1.$$
[...] according to (6.4) and (6.1). These considerations show that $\widehat{r}_{k(n)}(x) \to_p 0$, but $\widehat{r}_{k(n)}(x) \ge r_n$ with asymptotic probability one. Now we utilize that the process [...]
[...] In case of $q = 1$, this entails that $W \subset \mathcal{X} = \mathcal{X}_1$ contains an accumulation point within $\mathcal{X}_1$, and the identity theorem for analytic functions yields that $h = 0$ on $\mathcal{X}$. But this would be a contradiction to $(Q_\xi)_{\xi \in \Xi}$ being identifiable.