On optimal Bayesian classification and risk estimation under multiple classes

A recently proposed optimal Bayesian classification paradigm addresses optimal error rate analysis for small-sample discrimination, including optimal classifiers, optimal error estimators, and error estimation analysis tools with respect to the probability of misclassification under binary classes. Here, we address multi-class problems and optimal expected risk with respect to a given risk function, which are common settings in bioinformatics. We present Bayesian risk estimators (BRE) under arbitrary classifiers, the mean-square error (MSE) of arbitrary risk estimators under arbitrary classifiers, and optimal Bayesian risk classifiers (OBRC). We provide analytic expressions for these tools under several discrete and Gaussian models and present a new methodology to approximate the BRE and MSE when analytic expressions are not available. Of particular note, we present analytic forms for the MSE under Gaussian models with homoscedastic covariances, which are new even in binary classification.


Introduction
Classification in biomedicine is often constrained by small samples so that understanding properties of the error rate is critical to ensure the scientific validity of a designed classifier. While classifier performance is typically evaluated by employing distribution-free training-data error estimators such as cross-validation, leave-one-out, or bootstrap, a number of studies have demonstrated that these methods are highly problematic in small-sample settings [1]. Under real data and even under simple synthetic Gaussian models, cross-validation has been shown to suffer from a large variance [2] and often has nearly zero correlation, or even negative correlation, with the true error [3,4]. Among other problems, this directly leads to severely optimistic reporting biases when selecting the best results among several datasets [5] or when selecting the best classification rule among several candidates [6] and difficulties with performance reproducibility [7].
Furthermore, there are typically no accuracy guarantees for error estimators when applied under small samples. Distribution-free bounds on the mean-square error (MSE) or its square root, the root-mean-square (RMS), of an error estimator with respect to the true error rate are typically either unavailable or unhelpful under small samples. Consider leave-one-out error estimation for a discrete histogram rule that breaks ties with class 0. The following is a distribution-free RMS bound [8]: where S is a random sample, θ is a feature-label distribution, and n is the sample size. To guarantee an RMS less than 0.5 for all distributions, this bound indicates that a sample size of at least n = 209 would be required. Typically, the error of a classifier should be between 0 and 0.5 so that an RMS of 0.5 is trivially guaranteed. Rather than a distribution-free approach, recent work takes a Bayesian approach to address these problems. The idea is to assume the true distributions characterizing classes in the population are members of an uncertainty class of models. We also assume that members of the uncertainty class are weighted by a prior distribution, and after observing a sample, we update the prior to a posterior distribution. For a given classifier we may find an optimal MSE error estimator, called a Bayesian error estimator (BEE) [9,10] and evaluate the MSE of any arbitrary error estimator [11,12]. These quantities are found by conditioning on the sample in hand and averaging with respect to the unknown population distribution via the posterior, rather than by conditioning on the distribution and averaging over random samples as in (1). Not only does the Bayesian framework supply more powerful error estimators, but the sample-conditioned MSE allows us to evaluate the accuracy of error estimation. The Bayesian framework also facilitates optimal Bayesian classification (OBC), which provides decision boundaries to minimize the BEE [13,14]. In this way, the Bayesian framework can be used to optimize both error estimation and classification.
Classifier design and analysis in the Bayesian framework have previously been developed for binary classification with respect to the probability of misclassification. However, it is common in small-sample classification problems to be faced with classification under multiple classes and for different types of error to be associated with different levels of risk or loss. A few classical classification algorithms naturally permit multiple classes and arbitrary loss functions; for example, a plug-in rule takes the functional form for an optimal Bayes decision rule under a given modeling assumption and substitutes sample estimates of model parameters in place of the true parameters. This can be done with linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for multiple classes with arbitrary loss functions, which essentially assume that the underlying class-conditional densities are Gaussian with equal or unequal covariances, respectively. Most training-data error estimation methods, for instance, cross-validation, can also be generalized to handle multiple classes and arbitrary loss functions. However, it is expected that the same difficulties encountered under binary classes with simple zero-one loss functions (where the expected risk reduces to the probability of misclassification) will carry over to the more general setting, as they have in ROC curve estimation [15]. Support vector machines (SVM) are inherently binary but can be adapted to incorporate penalties that influence risk by implementing slack terms or applying a shrinkage or robustifying objective function [16,17]. It is also common to construct multi-class classifiers from binary classifiers using the popular "one-versus-all" or "all-versus-all" strategies [18]. The former method builds several binary classifiers by discriminating one class, in turn, against all others, and at a given test point reports the class corresponding to the highest classification score. The latter discriminates between each combination of pairs of classes and reports a majority vote. However, it is unclear how one may assess the precise effect of these adaptations on the expected risk.
We are thus motivated to generalize the BEE, sampleconditioned MSE, and OBC to treat multiple classes with arbitrary loss functions. We will present analogous concepts of Bayesian risk estimation (BRE), the sample-conditioned MSE for risk estimators, and optimal Bayesian risk classification (OBRC). We will show that the BRE and OBRC can be represented in the same form as the expected risk and Bayes decision rule with unknown true densities replaced by effective densities. This approach is distinct from the simple plug-in rule discussed earlier, since the form of the effective densities may not be the same as the individual densities represented in the uncertainty class. We will also develop an interpretation of the conditional MSE based on an effective joint density, which is new even under binary classes with a zero-one loss function.
Furthermore, we will provide analytic solutions under several models: discrete spaces with Dirichlet priors (discrete models) and Gaussian distributions with known, independent scaled identity, independent arbitrary, homoscedastic scaled identity, and homoscedastic arbitrary covariance models, all with conjugate priors (Gaussian models). We provide expressions for the BRE and conditional MSE for arbitrary classification in the discrete model and binary linear classification in the Gaussian model. The analytic form that we provide for the MSE of arbitrary error estimators under homoscedastic models is completely new without an analog in prior work under binary classification and zero-one loss. For models in which an analytic form for the BRE and conditional MSE are unavailable, for instance, under multi-class or non-linear classification in the Gaussian model, we also discuss efficient methods to approximate these quantities. In particular, we present a new computationally efficient method to approximate the conditional MSE based on the effective joint density.

Notation
We denote random quantities with capital letters, e.g., Y ; realizations of random variables with lowercase letters, e.g., y; and vectors in bold, e.g., X and x. Matrices will generally be in bold upper case, e.g., S. Spaces will be denoted by a stylized font, e.g., X . Distributions with conditioning will be made clear through the function arguments; for instance, we write the distribution of X given Y as f (x | y). The probability space of expectations will be made clear by denoting random quantities in the expectation and conditioning, e.g., the expectation of Y conditioned on the random variable X and the event C = c is denoted by E[ Y | X, c]. When the region of integration in an integral is omitted then this region is the whole space. Any exceptions in notation will be defined throughout.

Bayes decision theory
We next review concepts from classical Bayes decision theory. Consider a classification problem in which we are to predict one of M classes, y = 0, . . . , M − 1, from a sample drawn in feature space X . Let X and Y denote a random feature vector and its corresponding random label. Let f (y | c) be the probability mass function of Y, parameterized by a vector c, and for each y, let f (x | y, θ y ) be the class-y-conditional density of X, parameterized by a vector θ y . The full feature-label distribution is parameterized by c and θ = {θ 0 , . . . , θ M−1 }.
Let λ(i, y) be a loss function quantifying a penalty in predicting label i when the true label is y. The conditional risk in predicting label i for a given point, x, is defined as . ( The expected risk of a given classification rule, ψ : X → {0, . . . , M − 1}, is given by where is the probability that a class-y point will be assigned class i by the classifier ψ, and the i = {x ∈ X : ψ(x) = i} partition the sample space into decision regions. A Bayes decision rule (BDR) minimizes expected risk or, equivalently, the conditional risk at each fixed point x: By convention, we break ties with the lowest index, i ∈ {0, . . . , M − 1}, minimizing R(i, x, c, θ).

Optimal Bayesian risk classification
In practice, the feature-label distribution is unknown so that we must train a classifier and estimate risk or error with data. The Bayesian framework resolves this by assuming the true feature-label distribution is a member of a parameterized uncertainty class. In particular, assume that c is the probability mass function of Y, that is, c = {c 0 , . . . , c M−1 } ∈ M−1 , where f (y | c) = c y and M−1 is the standard M − 1 simplex defined by c y ∈[ 0, 1] for y ∈ {0, . . . , M − 1} and M−1 y=0 c y = 1. Also assume θ y ∈ T y for some parameter space T y , and θ ∈ T = T 0 ×. . .×T M−1 .
Let C and denote random vectors for parameters c and θ, respectively. Finally, assume C and are independent prior to observing data and assign prior probabilities, π(c) and π(θ).
Priors quantify uncertainty we have about the distribution before observing the data. Although non-informative priors may be used as long as the posterior is normalizable, informative priors can supplement the classification problem with information to improve performance when the sample size is small. This is key for problems with limited or expensive data. Under mild regularity conditions, as we observe sample points, this uncertainty converges to a certainty on the true distribution parameters, where more informative priors may lead to faster convergence [12]. For small samples, the performance of Bayesian methods depends heavily on the choice of prior. Performance tends to be modest but more robust with a non-informative or weakly informative prior. Conversely, informative priors offer the potential for great performance improvement, but if the true population distribution is not well represented in the prior, then performance may be poor. This trade-off is acceptable as long as the prior is an accurate reflection of available scientific knowledge so that one is reasonably sure that catastrophic results will not occur. If multiple models are scientifically reasonable but result in different inferences, and if it is not possible to determine which model is best from data or prior knowledge, then the range of inferences must be considered [19]. For the sake of illustration, in simulations, we will utilize either low-information priors or a simple prior construction method for microarray data, although modeling and prior construction remain important problems [20].
Let S be a sample, that is, a realization of n independent labeled points drawn from X . Also let x y i denote the ith sample point in class y and n y denote the number of sample points observed from class y. Given a sample, the priors are updated to posterior densities: where the product on the right is the usual likelihood function. and are marginal posteriors of C and . Thus, independence between C and is preserved in the posterior. Constants of proportionality are found by normalizing the integral of posteriors to 1. When the prior density is proper, this all follows from Bayes' rule; otherwise, (7) and (8) are taken as definitions, where we require posteriors to be proper. f (c | S) depends on the prior and sampling method used. For instance, if C is known, then π(c) and f (c | S) are both point masses at the known value of C. Under separate sampling, in which the number of sample points in each class is fixed to an arbitrary value prior to sampling, f (c | S) = π(c). Under random sampling, the sample size is fixed at n and the number of points observed from each class is determined by independent draws from the feature-label distribution. Given a Dirichlet prior on C with hyperparameters α = {α 0 , . . . , α M−1 }, a special case being α 0 = . . . = α M−1 = 1 for a uniform distribution on M−1 , then under random sampling the posterior on C is still Dirichlet with hyperparameters α * y = α y + n y .

Bayesian risk estimation
We define the BRE to be the minimum mean-square error (MMSE) estimate of the expected risk or, equivalently, the conditional expectation of the expected risk given observations. Given a sample, S, and a classifier, ψ, that is not informed by θ, thanks to posterior independence between C and , the BRE is given by, If we assume that {X, Y } and S are independent given C and , then We may thus write the BRE in (12) as where ε i,y (ψ, S) = E[ ε i,y (ψ, ) | S] is the posterior probability of assigning a class-y point to class i, The second equality follows from Fubini's theorem, and in the last equality, X is a random vector drawn from the density in the integrand of (16). We also have f (y | S) = E C y | S , which depends on the prior for C and is easily found, for instance, from (9) under Dirichlet posteriors.
Comparing (3) and (15), observe that f (y | S) and f (x | y, S) play roles analogous to f (y | c) and f (x | y, θ y ) in Bayes decision theory. We thus call f (x | y, S) the effective class-y conditional density or simply the effective density.
Whereas the BRE addresses overall classifier performance across the entire sample space, X , we may also consider classification at a fixed point, x ∈ X . We define the Bayesian conditional risk estimator (BCRE) for class i ∈ {0, . . . , M−1} at point x ∈ X to be the MMSE estimate of the conditional risk: Again assuming {X, Y } and S are independent given C and , and if we further assume X is independent from C, , and S, then, Applying Bayes' rule, and applying this to (18), we have This is analogous to (2) in Bayes decision theory. Furthermore, given a classifier ψ, is the marginal distribution of x given S. Hence, the BRE of ψ is the mean of the BCRE across the sample space.
For binary classification, ε i,y (ψ, S) has been solved in closed form as components of the BEE for both discrete models under arbitrary classifiers and Gaussian models under linear classifiers, so the BRE with an arbitrary loss function is available in closed form for both of these models. When closed-form solutions for ε i,y (ψ, S) are not available, from (17), ε i,y (ψ, S) may be approximated for all i and a given fixed y by drawing a large synthetic sample from f (x | y, S) and evaluating the proportion of points assigned class i. The final approximate BRE can be found by plugging the approximate ε i,y (ψ, S) for each y and i into (15).
A number of practical considerations for BEEs addressed under binary classification naturally carry over to multiple classes, including robustness to false modeling assumptions [9,10] and a prior calibration method for microarray data analysis using features discarded by feature selection and a method-of-moments approach [21]. Furthermore, classical frequentist consistency holds for BREs on fixed distributions in the parameterized family owing to the convergence of posteriors in both the discrete and Gaussian models [12].

Optimal Bayesian risk classification
We define the OBRC to minimize the BRE, that is, where C is a family of classifiers. If C is the set of all classifiers with measurable decision regions, it can be shown that ψ OBRC exists and is given for any x ∈ X by Analogously to the relationship between the BRE and expected risk, the OBRC has the same functional form as the BDR with f (y | S) substituted for the true class probability, f (y | c), and f (x | y, S) substituted for the true density, f x | y, θ y , for all y. Closed-form OBRC are available for any model in which f (x | y, S) has been found, including discrete and Gaussian models [13]. A number of important properties also carry over, including invariance to invertible transformations, pointwise convergence to the Bayes classifier, and robustness to false modeling assumptions.

Sample-conditioned MSE of risk estimation
In a typical small-sample classification scenario, a classifier is trained from data and a risk estimate found for the true risk of this classifier. A key question arises: How close is the risk estimate to the actual risk? A Bayesian approach answers this question with the sample-conditioned MSE of the BRE relative to the true expected risk: This MSE is precisely the quantity that the BRE minimizes, and it quantifies the accuracy of R as an estimator of R, conditioned on the actual sample in hand. Thanks to posterior independence between C and , it can be decomposed: where we have applied (3) in (23) and noted E f (y | C) f(z | C) | S = E C y C z | S . Second-order moments of C y depend on our prior for C and can be found, for instance, from (10) and (11) under Dirichlet posteriors. Hence, evaluating the conditional MSE of the BRE boils down to evaluating the BRE itself, R(ψ, S), and evaluating expressions of the form E ε i,y (ψ, y )ε j,z (ψ, z ) | S . Furthermore, if we additionally assume 0 , . . . , M−1 are pairwise independent, then when y = z, where ε i,y (ψ, S), given in (16), is a component of the BRE. The conditional MSE of an arbitrary risk estimate, R • (ψ, S), is also of interest and may be easily found from the BRE and the MSE of the BRE: In this form, the optimality of the BRE is clear. For binary classification with zero-one loss, the sampleconditioned MSE of the BRE converges to zero almost surely as sample size increases, for both discrete models under arbitrary classifiers and Gaussian models with independent covariances under linear classifiers [12].
Closed-form expressions for the MSE are available in these models. In this work, we extend this to multi-class discrimination under discrete models and binary linear classification under homoscedastic Gaussian models. For cases where closed-form solutions are unavailable, in the next section, we present a method to approximate the MSE.

Efficient computation
The following new interpretation for E ε i,y (ψ, y ) ε j,z (ψ, z ) | S is useful in both deriving analytic forms for and approximating the MSE. From (4), where we have again applied Fubini's theorem. Further, we may write where X and W are random vectors drawn from an effective joint density, defined using similar independence assumptions as in (14): The marginal densities of X and W under f (x, w | y, z, S) are precisely the effective density, i.e., where f (θ y | S) is the marginal posterior density of y . Further, we have an effective conditional density of W given X, where we have used the fact that the fractional term in the integrand of the second equality is of the same form as the posterior defined in (8), updated with a new independent sample point with feature vector x and class y. Hence, the effective joint density may be easily found, once the effective density is known. Furthermore, from (29), we for each x, and evaluating the proportion of pairs, (x, w), for which x ∈ i and w ∈ j . Additionally, since x is marginally governed by the effective density, from (17) we may approximate ε i,y (ψ, S) by evaluating the proportion of x in i . Evaluating the OBRC, BRE, and conditional MSE requires and E[ C y C z | S] based on the posterior for C and finding the effective density, f (x | y, S), and the effective joint density, f (x, w | y, z, S), based on the posterior for . At a fixed point, x, one may then evaluate the posterior probability of each class, f (y | x, S), from (19) and the BCRE from (20). The OBRC is then found from (22) or, equivalently, by choosing the class, i, that For any classifier, the BRE is given by (15) with ε i,y (ψ, S) given by (16) (or equivalently (17)) using the effective density, f (x | y, S). The MSE of the BRE is then given by (24), where E ε i,y ( y )ε j,z ( z ) | S is given by (25) when 0 , . . . , M−1 are pairwise independent and y = z, and E ε i,y ( y )ε j,z ( z ) | S is otherwise found from (28) (or equivalently (29)) using the effective joint density, f (x, w | y, z, S). The MSE of an arbitrary risk estimator can also be found from (26) using the BRE and the MSE for the BRE. We summarize these tools for several discrete and Gaussian models in Appendices 1, 2, and 3 by providing the effective density, the effective joint density (or a related density), ε i,y (ψ, S), and E ε i,y ( y )ε j,z ( z ) | S .

Simulation setup and results
In the this section, we examine several synthetic data simulations, where random distributions and samples are generated from a low-information prior, and demonstrate the performance gain and optimality of Bayesian methods within the Bayesian framework. We also examine performance with informed priors in two real datasets.

Classification rules
We consider five classification rules: OBRC, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), linear support vector machine (L-SVM), and radial basis function SVM (RBF-SVM). We will implement OBRC under Gaussian models. We used built-in MATLAB functions to implement LDA and QDA. For a collection of binary-labeled training sample points, an SVM classifier finds a maximal margin hyperplane based on a well-behaved optimization objective function and a set of constraints. When the data are not perfectly linearly separable, introduction of slack variables in the optimization procedure leads to soft margin classifiers for which mislabeled sample points are allowed. The resulting hyperplane in the feature space is called L-SVM. Alternatively, the underlying feature space can be transformed to a higher dimensional space where the data becomes linearly separable. The equivalent classifier back in the original feature space will generally be non-linear [22,23]. When the kernel function is a Gaussian radial basis function, we call the corresponding classifier RBF-SVM. We used the package LIBSVM, which, by default, implements a one-versus-one approach for multi-class classification [24]. Since SVM classifiers optimize relative to their own objective function (for example, hinge loss), rather than expected risk, we exclude them from our analysis when using a non-zero-one loss function.
For all classification rules, we calculate the true risk defined in (3) and (4). We find the exact value if a formula is available; otherwise, we use a test sample of at least 10,000 points generated from the true feature-label distributions, stratified relative to the true class prior probabilities. This will yield an approximation of the true risk with RMS ≤ 1/ √ 4 × 10, 000 = 0.005 [8].

Risk estimation rules
We consider four risk estimation methods: BRE, 10-fold cross-validation (CV), leave-one-out (LOO), and 0.632 bootstrap (boot). When we do not have closed-form formulae for calculating the BRE, we approximate it by drawing a sample of 1,000,000 points from the effective density of each class. In CV, the training data, S, is randomly partitioned into 10 stratified folds, S (i) for i = 1, 2, . . . , 10. Each fold, in turn, is held out of the classifier design step as the test set, and a surrogate classifier is designed on the remaining folds, S \ S (i) , as the training set. The risk of each surrogate classifier is estimated using S (i) . The resulting risk values from all surrogate classifiers are then averaged to get the CV estimate. To reduce "internal variance" arising from random selection of the partitions, we average the CV estimates over 10 repetitions (10 randomly generated partitions over S). If the number of folds equals the sample size, n, then each fold consists of a single point and we get the LOO risk estimation. Bootstrap risk estimators are calculated using bootstrap samples of size n, where in each bootstrap sample, points are drawn, with replacement, from the original training dataset. A surrogate classifier is designed on the bootstrap sample and its risk estimated using sample points left out of the bootstrap sample. The basic bootstrap estimator is the expectation of this risk with respect to the bootstrap sampling distribution. The expectation is usually approximated by Monte Carlo repetitions (100 in our simulations) over a number of independent bootstrap samples. It is known that this estimate is high biased. To reduce bias, the 0.632 bootstrap reports a linear combination of this estimate, with weight 0.632, and the low-biased resubstitution risk estimate, with weight 0.368 [25][26][27].
Under linear classification, the sample-conditioned MSE from (24) is found analytically by evaluating (52), plugging in the appropriate values for k and γ 2 depending on the covariance model, and E ε i,y ( y )ε j,z ( z ) | S for z = y are found via (25) for independent and (53) for homoscedastic covariance models, plugging in appropriate values for k and γ 2 . When analytic forms are not available, the sampleconditioned MSE is approximated as follows. In independent covariance models, for each sample point generated to approximate the BRE, we draw a single point from the effective conditional density with y = z, giving 1,000,000 sample point pairs to approximate E ε i,y ( y )ε j,y ( y ) | S for each y. In homoscedastic covariance models, to find the BRE, we have 1,000,000 points available from the effective density for each y. We generate an additional 1, 000, 000 × (M − 1) synthetic points for each y, thus allocating 1,000,000 synthetic points for each combination of y and z. For each of these points, we draw a single point from the effective conditional density of a class-z point given a class-y point. For each y and z, the corresponding 1,000,000 point pairs are used to approximate

Synthetic data
In synthetic data simulations, we assume all classes are equally likely and that the data is stratified, giving an equal number of sample points from each class. We further assume Gaussian feature-label distributions. Table 1 lists all models and prior distributions used. We implement both a low number of features (D = 2) and a high number of features (D = 20), with independent arbitrary, homoscedastic arbitrary, and independent identity covariance priors. Under each type of prior, we consider classification under a non-zero-one loss function for binary classification and a zero-one loss function for multiple classes. For each prior model and a fixed sample size,  we evaluate classification performance in a Monte Carlo estimation loop with 10,000 iterations. In each iteration, we follow a two-step procedure for sample generation: (1) generate random feature-label distribution parameters from the prior (each serving as the true underlying feature-label distribution) and (2) generate a random sample of size n from this fixed feature-label distribution. The generated random sample is used to train classifiers and evaluate their true risk. In the non-zero-one loss case, we also estimate risk and evaluate its accuracy using the performance metrics discussed earlier. We vary the sample size throughout and analyze its effect on performance.

Real data
We consider two real datasets. The first is a breast cancer dataset containing 295 sample points [28], which will be used to demonstrate binary classification under a nonzero-one loss function. The second is composed of five different cancer types from The Cancer Genome Atlas (TCGA) project, which demonstrates multi-class classification under zero-one loss. In all real-data simulations, we assume that c y is known and equal to the proportion of class-y sample points in the whole dataset. We form a Monte Carlo estimation loop to evaluate classification and risk estimation, where we iterate 1000 times with the breast cancer dataset and 10,000 times with the TCGA dataset. In each iteration, we obtain a stratified training sample of size n, i.e., we select a subset of the original dataset, keeping the proportion of points in class y as close as possible to c y for every y. We use these training points to design several classifiers, while the remaining sample points are used as holdout data to approximate the true risk of each designed classifier. For the breast cancer dataset, we also use the training data to estimate risk and find the sample-conditioned MSE of the BRE. We vary sample size and analyze its effect on performance.
To implement Bayesian methods, we assume Gaussian distributions with arbitrary independent covariances in all real-data simulations. We calibrate hyperparameters, defined in Appendix 2, using a variant of the methodof-moments approach presented in [21]. In particular, we construct a calibration dataset from features not used to train the classifier and set ν y = s y /t y , κ y = 2(s 2 y /u y ) + D + 3, m y =[ m y , . . . , m y ], and S y = (κ y −D−1)s y I D , where m y is the mean of the means of features among class-y points of the calibration dataset, and s y is the mean of the variances of features in class y. t y is the variance of the means of features in class y, where the 10 % of the means with the largest absolute value are discarded. Likewise, u y is the variance of the variances of features in class y, where the 10 % of the variances with the largest value are discarded.
In the breast cancer data, 180 patients are assigned to class 0 (good prognosis) and 115 to class 1 (bad prognosis) in a 70-feature prognosis profile. A correct prognosis is associated with 0 loss, wrongly declaring a good prognosis incurs a loss of 1, and wrongly declaring a bad prognosis incurs a loss of 2. We use pre-selected features for classifier training, originally published in [29]. When D = 2, these features are CENPA and BBC3, and when D = 5, we also add CFFM4, TGFB3, and DKFZP564D0462. Rather than discard the 70 − D features not used for classification, we use these features to calibrate priors using the method-of-moments approach described above.
For our second dataset, we downloaded level-3 microarray data from the TCGA data portal for five different kinds of cancers: breast invasive carcinoma (BRCA) with 593 sample points, colon adenocarcinoma (COAD) with 174 sample points, kidney renal clear cell carcinoma (KIRC) with 72 sample points, lung squamous cell carcinoma (LUSC) with 155 sample points, and ovarian serous cystadenocarcinoma (OV) with 562 sample points. We pooled all the sample points into a single dataset, removed features with missing values in any cancer type (17,016 features remained out of 17,814), and quantile-normalized the data with the median of the ranked values. We preselect features for classifier training and prior calibration using the full dataset and one of two methods, which both operate in two phases: in phase 1, we pass D+100 features, and in phase 2, we select D features from those passing phase 1. The D features passing both phases are used for classifier training, and the features passing phase 1 but not phase 2 are used for prior calibration. The first feature selection method (FS-1) passes features that minimize a score evaluating separation between classes in phase 1 and selects features that minimize a score evaluating Gaussianity of the classes in phase 2. To evaluate separation between classes in phase 1, for each pair of classes, we obtain t-test p-values for each feature and rank these across all features, low p-values being assigned a lower rank, and finally, we report the rank product score for each feature over all 10 pairs of classes. To evaluate Gaussianity in phase 2, for each class, we rank Shapiro-Wilk test p-values across all features passing phase 1, high p-values being assigned a lower rank, and report the rank product score for each feature across all five classes. The second feature selection method (FS-2) passes features minimizing the rank product score from Shapiro-Wilk tests applied to all 17,016 features in phase 1, and in phase 2, we select D features from those passing phase 1 using sequential forward search (SFS) with LDA classification and resubstitution risk as the optimization criterion.

Discussion
Models 1 and 2 focus on the effect of risk on classification and risk estimation performance. In Fig. 1, we evaluate the performance of risk estimators and classifiers under model 1. Graphs in the left column present the mean, averaged over all 10,000 sample realizations, of the true risk and all risk estimators considered for LDA, QDA, and OBRC classification. Note for small samples of size n = 20 and LDA or QDA classification, surrogate classifiers in the bootstrap risk estimator are occasionally undefined depending on the realized bootstrap sample.
These events are thrown out so that only a subset of the original 10,000 sample realizations are used to approximate the mean bootstrap risk estimator. The graphs on the right column of Fig. 1 present the square root of the mean, averaged over all sample realizations, of the square difference between the true risk and each risk estimator, which we call the empirical RMS. The square root of the mean, averaged over all sample realizations, of the sampleconditioned MSE of the BRE from (24), which we call the Bayesian RMS, is also shown.
The BRE is an unbiased estimator, so the mean true risk and mean BRE curves should be aligned with enough iterations, which is observed. The empirical and Bayesian RMS both approximate the unconditional RMS so that these curves should also be aligned with enough iterations, as observed. Furthermore, the BRE is theoretically optimal in both the sample-conditioned and unconditioned RMS, and as expected, the empirical and Bayesian RMS curves for BRE under each classification rule outperform all other risk estimation rules. Thus, the BRE yields a significant improvement over classical risk estimators in terms of both bias and RMS performance within the Bayesian model. If we compare classification rules, the RMS of BRE is consistently lower for OBRC than LDA and QDA, although there is no theoretical guarantee for this. Similar curves for model 2 are provided in Fig. 2.
To illustrate how the sample-conditioned MSE may be used to assess the accuracy of a risk estimate, suppose that we have observed a sample, trained a classifier, and obtained the BRE and the MSE of the BRE. For this fixed sample, but random parameters in the Bayesian model, the true risk has a mean equal to the BRE and a variance equal to the sample-conditioned MSE so that the random variable Z = ( R − R)/RMS( R|S) must have zero mean and unit variance. This holds for any classification rule. In Fig. 3, we present quantile-quantile (Q-Q) plots of the sample quantiles of Z versus theoretical quantiles from a standard normal distribution. Figure 3a provides Q-Q plots with realizations of Z taken under OBRC classification and BRE risk estimation in model 1 with various sample sizes, along with a 45°reference line, and Fig. 3b provides similar graphs for model 2.
Observe that Z appears approximately standard normal, particularly under large sample sizes. Under smaller samples, Z appears more positively skewed but has approximately zero mean and unit variance. Q-Q plots for other classifiers are similar.
In Figs  In Fig. 6, we present the mean and standard deviation of the true risk with respect to all sample realizations as a function of sample size for models 3 and 4. OBRC outperforms all other classification rules with respect to mean risk, as it must, since the OBRC is defined to minimize mean risk. Although there is no guarantee that OBRC should minimize risk variance, in these examples, the risk variance is lower than in all other classification rules. The performance gain is particularly significant for small samples. Consider Figs. 6a and 6b, where we observe that, at a sample size of 10, the risk of OBRC has a mean of about 0.16 and standard deviation of about 0.065, whereas the risk of the next best classifier, RBF-SVM, has a mean of about 0.22 and standard deviation of about 0.09. Figure 7 provides the performance of risk estimators under OBRC classification in models 5 and 6, demonstrating performance in 20 dimensions with independent scaled identity covariance priors. Settings in model 5 are designed to produce a low mean risk and model 6 a high mean risk. Graphs in the left column present the mean true risk, averaged over all 10,000 sample realizations; the center column presents empirical and Bayesian RMS curves; and the right column presents Q-Q plots of Z. As in Figs. 1 and 2, the BRE appears unbiased, the empirical and Bayesian RMS curves are aligned, and the RMS curves are optimal. From the Q-Q plots, the distribution of Z appears to be skinny-tailed even under large n, although it is approximately zero mean and unit variance.
In Fig. 8  for D = 2 and D = 5, respectively. Graphs in the left column present the mean true risk and mean risk estimates, graphs in the center column present the empirical RMS of all risk estimates and the Bayesian RMS for the BRE, and graphs in the right column present the Q-Q plots of Z for various sample sizes. LDA, QDA, and OBRC are presented in the top, center, and bottom rows, respectively. Although the BRE is theoretically unbiased and minimizes RMS when averaged across random distributions in the uncertainty class, when applied to a specific dataset or distribution, we now observe a bias (in the left column) and a discrepancy between the empirical and Bayesian RMS (in the center column). In particular, for all classifiers under D = 2 and for LDA under D = 5, we observe a high bias, for QDA and OBRC under D = 5, we observe a low bias, and in all cases, the Bayesian RMS lies below the empirical RMS. That being said, the empirical RMS still outperforms that of distribution-free resampling error estimators (LOO, CV, and boot). Although resampling estimators are nearly unbiased, they suffer from such large variance under small samples that the BRE, despite imperfections in the Gaussianity assumption and prior construction method, may still outperform in practice thanks to optimization. Turning to classifier performance, in these simulations, LDA appears to outperform QDA and OBRC with independent arbitrary covariances. Keep in mind that Bayesian methods are not guaranteed to be optimal in all datasets and all settings but, rather, are only optimal within the assumed model. In fact, OBRC with homoscedastic arbitrary covariances (not shown in the figures) performs as well as, or significantly better than, LDA, suggesting that covariances in this problem are approximately homoscedastic. From the Q-Q plots, Z deviates from the reference standard normal CDF, with a clear shift in the mean and sometimes variance. For instance, under the LDA classification with D = 2 and n = 70 (corresponding to Fig. 9c), the mean of Z is 0.76 and the standard deviation is 1.08, and  Fig. 10i), the mean of Z is −1.08 and the standard deviation is 1.49. In Fig. 11, we present the mean of the true risk with respect to random samples from the TCGA dataset, as a function of sample size, for different feature selection methods and selected feature set sizes. Due to covariance estimation problems, QDA cannot be trained for D = 20 in this range of sample sizes. OBRC with calibrated priors consistently outperforms under small samples and performs robustly under large samples. These results depend on the particular features selected and note LDA may have an advantage under FS-2, which minimizes the apparent error of LDA classifiers.
In real applications, data rarely satisfy modeling assumptions, for instance, Gaussianity, and there may be a concern that performance will suffer. Firstly, keep in mind the need to validate assumptions in the Bayesian model. For example, Gaussianity tests and homoscedasticity tests may be used to validate these underlying assumptions. Our real-data simulations demonstrate a few examples of how Gaussianity tests may be used in conjunction with Bayesian methods. Secondly, previous works have shown that Bayesian methods are relatively robust to deviations from a Gaussianity assumption [10,14]. This is observed, for instance, in Figs. 9 and 10. Thirdly, inference from non-informative priors may serve as a reference. The OBRC under non-informative priors and an arbitrary homoscedastic covariance model behaves similarly to LDA and under an arbitrary independent covariance model behaves similarly to QDA [13,14]. Thus, the OBRC can be seen as unifying and optimizing these classifiers. This applies in Fig. 11, where OBRC with an appropriate covariance model and non-informative prior performs indistinguishably from LDA. The conditional MSE is also an immensely useful tool to quantify the accuracy of a risk estimator. For instance, one may employ the MSE for censored sampling by collecting batches of sample points until the sample-conditioned MSE reaches an acceptable level, and either an acceptable risk has been achieved or it has been determined that an acceptable risk cannot be achieved. Lastly, although we provide analytic solutions under discrete and Gaussian models, the basic theory for this work does not require these assumptions. For instance, recent work in [30] develops a Bayesian Poisson model for RNA-Seq data, where Bayesian error estimators and optimal Bayesian classifiers are obtained using Markov chain Monte Carlo (MCMC) techniques.

Conclusion
We have extended optimal Bayesian classification theory to multiple classes and arbitrary loss functions, giving rise to Bayesian risk estimators, the sample-conditioned MSE for arbitrary risk estimators, and optimal Bayesian risk classifiers. We have developed a new interpretation of the conditional MSE based on effective joint densities, which is useful in developing analytic forms and approximations for the conditional MSE. We also provide new analytic solutions for the conditional MSE under homoscedastic covariance models. Simulations based on several synthetic Gaussian models and two real microarray datasets also demonstrate good performance relative to existing methods. x must all be positive for all x and y. The effective density is thus given by:

Appendix 1: Discrete models
where α y * x . Thus, we have The effective joint density, f (x, w | y, z, S), for y = z, can be found from properties of Dirichlet distributions. We have for any y ∈ {0, . . . , M − 1} and x, w ∈ X , where δ xw equals 1 if x = w and 0 otherwise. From (28),

Appendix 2: Gaussian models
Suppose X is a D dimensional space in which each point is represented by a column vector and each class-y conditional distribution is Gaussian with mean vector μ y and covariance matrix y . We will consider independent covariance models, where the y are mutually independent prior to observing the data, and homoscedastic covariance models, where y are identical for all y [13]. We will also consider three structures for the covariance: known, scaled identity, and arbitrary. Throughout, we use μ y and y to denote both random quantities and their realizations, and we use y 0 to denote a valid covariance matrix, i.e., a symmetric, positive definite matrix. Throughout, we will find analytic forms for the BRE and conditional MSE under binary linear classifiers, ψ, of the form where g(x) = a T x + b for some vector a and scalar b, and a superscript T denotes matrix transpose.

Known covariance
Assume that y 0 is known so that y = μ y with parameter space T y = R D . We assume the μ y s are mutually independent and use the following prior:  with hyperparameters ν y ∈ R and m y ∈ R D , where |·| denotes a determinant. When ν y > 0, this is a Gaussian distribution with mean m y and covariance y /ν y . Under this model, the posterior is of the same form as the prior, with updated hyperparameters ν * y = ν y + n y , where μ y is the usual sample mean of training points in class y. We require ν * y > 0 for a proper posterior. The effective density was shown in [13] to be the following Gaussian distribution: To find the BRE for a linear classifier, let P = (−1) i g(X). From the effective density, Thus, where (x) is the standard normal CDF. This result was also found in [10].
and the effective joint density is thus given by Now let Q = (−1) j g(W). Since X and W are governed by the effective joint density in (40): Hence, from (29), we have where (x, y, ρ) is the joint CDF of two standard normal random variables with correlation ρ. When y = z, E ε i,y (ψ, y )ε j,z (ψ, z ) | S is found from (25).

Homoscedastic arbitrary covariance
Assume y =[ μ y , ], where the parameter space of μ y is R D and the parameter space of consists of all symmetric positive definite matrices. Further, assume a conjugate prior in which the μ y s are mutually independent given so that where π(μ y | ) is as in (34) with hyperparameters ν y ∈ R and m y ∈ R D , and with hyperparameters κ ∈ R and S, a symmetric D × D matrix. If ν y > 0, then π(μ y | ) is Gaussian with mean m y and covariance /ν y . If κ > D − 1 and S 0, then π( ) is an inverse-Wishart distribution with hyperparameters κ and S. If in addition κ > D + 1, the mean of exists and is given by E[ ] = S/(κ − D − 1); thus, S determines the shape of the expected covariance. The posterior is of the same form as the prior with the same updated hyperparameters given by (35) and κ * = κ + n, where y is the usual sample covariance of training points in class y ( y = 0 if n y ≤ 1). The posteriors are proper if ν * y > 0, κ * > D − 1 and S * 0. The effective density for class y is multivariate student t with k = κ * − D + 1 degrees of freedom, location vector m * y , and scale matrix ν * y +1 kν * y S * [13]. In other words, f (p | y, S) and f (q | p, y, y, S) are precisely in the form required by this lemma with D = 1. Hence, [ P, Q] T follows a bivariate student t distribution when y = z, and when y = z, Thus, E ε i,y ( y )ε j,y ( y ) | S can be found from (29). In particular, when y = z, E ε i,y ( y )ε j,y ( y ) | S = P(P ≤ 0 ∩ Q ≤ 0 | y, y, S) and when y = z, where T(x, y, ρ, d) is the joint CDF of two standard multivariate student t random variables with correlation ρ and d degrees of freedom.

Independent arbitrary covariance
Assume =[μ y , y ], where the parameter space of μ y is R D and the parameter space of y consists of all symmetric positive definite matrices. The independent arbitrary covariance model assumes a conjugate prior with independent y s and π(θ y ) = π(μ y | y )π( y ), where π(μ y | y ) is of the same form as in (34) with hyperparameters ν y ∈ R and m y ∈ R D , and π( y ) is of the same form as in (42) with hyperparameters κ y ∈ R and S y , a symmetric D×D matrix. The posterior is of the same form as the prior with updated hyperparameters given by (35) and κ * y = κ y + n y , S * y = S y + (n y − 1) y + ν y n y ν y +n y ( μ y − m y )( μ y − m y ) T .
The effective density for class y is multivariate student t as in (44) with k y = κ * y − D + 1 and S * y in place of k and S * , respectively [13]. Further, (45) also holds with m iy = (−1) i g(m * y ) and with k y and γ 2 y = a T S * y a in place of k and γ 2 , respectively. Under binary linear classification, ε i,y (ψ, S) is given by (46) with k y and γ 2 y in place of k and γ 2 . The same result was found in [10]. E ε i,y ( y )ε j,y ( y ) | S is solved similarly to before, resulting in (47), (50), (51), and ultimately (52), with k y , S * y and γ 2 y in place of k, S * , and γ 2 , respectively. E ε i,y ( y )ε j,z ( z ) | S for y = z is found from (25).

Homoscedastic scaled identity covariance
In the homoscedastic scaled identity covariance model, y is assumed to have a scaled identity structure, i.e., y =[ μ y , σ 2 ] where y = σ 2 I D and I D is a D × D identity matrix. The parameter space of μ y is R D for all y and of σ 2 is (0, ∞). We also assume the μ y s are mutually independent given σ 2 : where π(μ y | σ 2 ) is of the same form as (34) with hyperparameters ν y and m y , and with hyperparameters κ ∈ R and S, a symmetric D×D real matrix. When ν y > 0, π(μ y | σ 2 ) is a univariate Gaussian distribution with mean m y and covariance y /ν y , and when (κ + D + 1)D > 2 and S 0, π(σ 2 ) is a univariate inverse-Wishart distribution. If in addition (κ + D + 1)D > 4, then E[ σ 2 ] = trace(S) (κ+D+1)D−4 . The form of (57) has been designed so that the posterior is of the same form as the prior with the same hyperparameter update equations given in the arbitrary covariance models, (35) and (43). We require ν * y > 0, (κ * +D+1)D > 2, and S * 0 for a proper posterior.
Let P = (−1) i g(X). Since P is an affine transformation of a multivariate student t random variable, again it has the same form as in (45) with k = (κ * + D + 1)D − 2, m iy = (−1) i g(m * y ), and γ 2 = trace(S * )a T a. Following the same steps as in the homoscedastic arbitrary covariance model, under binary linear classification, ε i,y (ψ, S) is given y =[ μ y , σ 2 y ] where y = σ 2 y I D , and that the parameter space of μ y is R D and of σ 2 y is (0, ∞) for all y. Also, assume the y s are mutually independent, with π(θ y ) = π(μ y | σ 2 y )π(σ 2 y ), where π(μ y | σ 2 y ) is of the same form as in (34) with hyperparameters ν y ∈ R and m y ∈ R D , and π(σ 2 y ) is of the same form as in (57) with hyperparameters κ y ∈ R and S y , a symmetric D × D real matrix. The posterior is of the same form as the prior with the same hyperparameter update equations in (35) and (55). We require ν * y > 0, (κ * y + D + 1)D > 2 and S * y 0 for a proper posterior.
The effective density for class y is multivariate student t, as in (58) with k y = (κ * y + D + 1)D − 2 and S * y in place of k and S * , respectively [13]. Under binary linear classification, ε i,y (ψ, S) is given by (46) with m iy = (−1) i g(m * y ) and with k y and γ 2 y = trace(S * y )a T a in place of k and γ 2 . The effective joint density, f (x, w | y, y, S), is solved as before, resulting in (59) and (61) with k y and S * y in place of k and S * , respectively. Further, E ε i,y ( y )ε j,y ( y ) | S is solved from (51) resulting in (52), with k y and γ 2 y in place of k and γ 2 , respectively. E ε i,y ( y )ε j,z ( z ) | S for y = z is found from (25).
Proof. After some simplification, one can show Simplifying further, we obtain