Optimal decision theory for diagnostic testing: Minimizing indeterminate classes with applications to saliva-based SARS-CoV-2 antibody assays

In diagnostic testing, establishing an indeterminate class is an effective way to identify samples that cannot be accurately classified. However, such approaches also make testing less efficient and must be balanced against overall assay performance. We address this problem by reformulating data classification in terms of a constrained optimization problem that (i) minimizes the probability of labeling samples as indeterminate while (ii) ensuring that the remaining ones are classified with an average target accuracy X. We show that the solution to this problem is expressed in terms of a bathtub-type principle that holds out those samples with the lowest local accuracy up to an X-dependent threshold. To illustrate the usefulness of this analysis, we apply it to a multiplex, saliva-based SARS-CoV-2 antibody assay and demonstrate up to a 30% reduction in the number of indeterminate samples relative to more traditional approaches.


Introduction
The SARS-CoV-2 pandemic has highlighted the importance of antibody testing as a means to monitor the spread of diseases such as COVID-19 [1,2]. But the widespread deployment of new assays has also revealed fundamental problems in the ability to reliably analyze the corresponding measurements. Early on, this shortcoming was attributed to low prevalence, which made it difficult to distinguish true and false positives [3]. However, it soon became clear that there were deeper issues related to statistical interpretation of raw data, suggesting the need to revisit the underlying theory of diagnostic classification [4][5][6].
In this context, a fundamental problem arises when many measurements fall near a cutoff used to distinguish positive and negative samples. The probability of correctly classifying these borderline cases hovers near 50%, so that even a small fraction thereof can significantly decrease overall accuracy. A common solution is to define a third, indeterminate class for which one cannot draw meaningful conclusions, although this class need not lie near a cutoff [7-13]. While this approach increases the average accuracy for those samples that are classified, it also decreases testing efficiency. Thus, there is a need to develop strategies that balance the construction of indeterminate classes against overall assay performance.
At the outset, and in contrast with traditional methods, it is important to note that concepts such as specificity and sensitivity per se are not fundamental quantities of interest in our analysis. As discussed in Section 6, they describe the accuracy of a fixed and subjective classification scheme in two degenerate cases: 0% and 100% prevalence. As such, it is trivial (but useless) to optimize either quantity by assigning all samples to a single class. Rather, we treat accuracy, defined as a prevalence-weighted, convex combination of sensitivity and specificity, as fundamental, since this naturally interpolates between the aforementioned degenerate cases. This choice also highlights an important (but often-ignored) fact: the numbers of false positives and false negatives change with prevalence. Thus, sensitivity and specificity may not be useful metrics of assay performance in a setting where a disease is actively spreading. The bathtub-type principle also reveals that these quantities are not mathematically fundamental, since they arise from more granular notions of conditional accuracy.
Ultimately, this analysis leads to the realization that classification accuracy has both a local and a global interpretation, and the interplay between these interpretations is fundamental both to the problem considered herein and to the general theory of classification. In particular, one can construct conditional probability density functions (PDFs) P(r) and N(r) of a measurement outcome r (i.e. a local property) for (known) positive and negative samples. As shown in Ref. [5], these PDFs are necessary to maximize the global accuracy, since the equation
q P(r) = (1 − q) N(r) (1)
defines the boundary between D_P⋆ and D_N⋆ when q is the prevalence. In the present work, we show that P(r) and N(r) also directly define the local accuracy Z(r), and that its global counterpart is the average value of Z(r). We next observe that the boundary given by Eq. (1) is the set for which Z = 50%, its lowest possible value. The corresponding points are the first to be held out, since they contribute most to the average error. Moreover, one sees that systematically removing the least accurate r yields the fastest increase in the global accuracy for the remaining points. Our bathtub-type principle formalizes this idea.
This intuition also distinguishes our approach from Ref. [5], which considered uncertainty in classification due to effects that are external to the assay. In that work, the authors demonstrated that the optimal binary domains become ambiguous when prevalence is only given to within confidence intervals. They showed that this problem can be addressed by holding out samples whose classes were likewise ambiguous. In contrast, our approach defines the indeterminate class as those measurements with the highest inherent uncertainty as quantified in terms of local accuracy. In practice, we find that such effects are often more important, especially as several references have addressed issues pertaining to accurate, unbiased prevalence estimation [5,6,14].
From a practical standpoint, the main inputs to our analysis are training data associated with positive and negative samples; thus our approach is compatible with virtually any antibody assay. These data are used to construct the conditional PDFs ( ) and ( ), so that the classification and holdout problems are reduced to mathematical modeling. This is also the key limitation of our approach insofar as such models are necessarily subjective. However, this problem is not unique to our method. Where possible, we incorporate objective information about the measurement process. See Section 4 and Ref. [5] for a deeper discussion of such issues and other limitations.
The remainder of this manuscript is organized as follows. Section 2 reviews key notation and terminology. Section 3 presents the general theory for defining optimal indeterminate domains. Section 4 illustrates this analysis in the context of a multiplex, saliva-based SARS-CoV-2 antibody assay. Section 5 considers numerical validation of our analysis, and Section 6 concludes with a discussion and comparison with past works. The Appendix provides a proof of our main result and other supporting information.
[Footnote 2: The testing community has largely restricted its attention to global assay properties, since regulatory reporting focuses on assay performance for large populations [2].]
[Footnote 3: An interesting corollary of the proofs in Ref. [5] is that Z ≥ 50% for optimally defined classification domains without indeterminates. Thus, we never need consider relative errors less than 50%. See also Section 3 and the Appendix.]

Notation and terminology
One of our primary goals is to provide practical, accessible tools for constructing indeterminate classes. In doing so, we must combine abstract ideas from measure theory with concepts in applied diagnostics; it is not reasonable to assume that any reader will have a background in both. In this section, we provide necessary information so that: (I) an expert in diagnostics can understand and implement our main results (i.e. construct holdout domains) without needing to derive the associated proofs; and (II) a mathematician can verify our work and understand the applied context. Readers with background in either the mathematical notation (Section 2.1) or diagnostic terminology (Section 2.2) may skip the corresponding sections. [We also advise mathematicians to see our definition of a bathtub-type principle in Section 2.1, as it differs from Theorem 1.14 in [15].] We also refer readers to Refs. [16,17] for deeper background on measure theory.

Mathematical notation and concepts
• By a set, we mean a collection of objects, e.g. measurements or measurement values. By a domain, we typically mean a set in some continuous measurement space; see, e.g., Fig. 1.
• The symbol ∈ indicates set membership. That is, x ∈ A means that x is an element of the set A.
• The symbol ∅ denotes the empty set, which has no elements.
• The operator ∪ denotes the union of two sets. That is, C = A ∪ B is the set containing all elements that appear in either A or B.
• The operator ∩ denotes the intersection of two sets. That is, C = A ∩ B is the set of elements shared by both A and B.
• The operator ∕ denotes the set difference. We write C = A∕B to mean the set of all objects in A that are not also in B. Note that in general, A∕B ≠ B∕A. Equivalently, A∕B can be interpreted as the ''subtraction'' or removal from A of the elements it shares in common with B.
• The notation A = {x ∶ *} defines the set A as the collection of elements x satisfying condition *.
• By a bathtub-type principle, we mean the solution to a constrained optimization problem that determines an optimal set D⋆ via a nonlinear inequality of the form D⋆ = {x ∶ f(x) ≤ τ⋆} for some constant τ⋆ and function f(x).
Unless otherwise specified, the ''size'' or measure of a set refers to the probability of a sample falling within that set, i.e. its probability mass. By the same token, we generally avoid using size to describe the actual dimensions (in measurement space) of a domain.

Notation and concepts from applied diagnostics
• Training data correspond to samples whose true classes are known. In general, training data are used to construct probability models and/or to validate an analysis.
• Test data correspond to samples whose true classes are unknown, or are treated as such for purposes of validation. Test data are the objects to which a classification analysis is applied.
• Prevalence is the fraction of samples in a population that are positive. As such, it is the probability that a person picked at random is positive, given no other information.
• Sensitivity (specificity) is the relative fraction of positive (negative) samples that are classified correctly. We take the common convention that these quantities refer to expectation values or averages.

Notation specific to the present work
• The non-calligraphic sets D_P and D_N denote positive and negative classification domains in the binary (no-holdout) problem.
• Calligraphic sets 𝒟_P and 𝒟_N are the corresponding domains in the classification problem with indeterminate samples; 𝒟_h is the indeterminate set.
• The shorthand 𝒟 = 𝒟_P ∪ 𝒟_N is used throughout and denotes the set of all samples that are classified as positive or negative.
• The use of a superscript ⋆ denotes an optimal quantity. For example, 𝒟_P⋆ is an optimal positive classification domain.

Minimum probability indeterminate class
We begin with the mathematical setting underlying classification. Consider an antibody measurement r, which can be a vector associated with multiple distinct antibody targets. We take the set of all admissible measurements to be Ω. Our goal is to define three domains, 𝒟_P, 𝒟_N, and 𝒟_h, associated with positive, negative, and indeterminate (or h for ''holdout'') samples. In particular, we say that a test sample is positive if it falls inside 𝒟_P (i.e. r ∈ 𝒟_P), and likewise for the other domains.
We require that these domains have several basic properties to ensure that they define a valid classification scheme. Recalling that P(r) and N(r) are the conditional probability densities associated with positive and negative samples, define the measures of a set A ⊂ Ω with respect to P and N to be
P(A) = ∫_A P(r) dr, N(A) = ∫_A N(r) dr. (2)
That is, P(A) is the probability of a positive sample falling in A, etc. We then require that
P(𝒟_P ∪ 𝒟_N ∪ 𝒟_h) = N(𝒟_P ∪ 𝒟_N ∪ 𝒟_h) = 1 (3)
and
P(𝒟 ∩ 𝒟′) = N(𝒟 ∩ 𝒟′) = 0 (4)
when 𝒟 ≠ 𝒟′, for 𝒟, 𝒟′ chosen from 𝒟_P, 𝒟_N, or 𝒟_h. Eq. (3) states that the probability of any event falling in the positive, negative, or indeterminate domains is one; i.e. any sample can be classified. Eq. (4) states that the probability of a sample falling in more than one domain is zero; i.e. a sample has a single classification.
Given P(r) and N(r), the law of total probability [18] implies that
Q(r) = q P(r) + (1 − q) N(r) (5)
is the PDF that a test sample yields measurement r, where q is the prevalence. The quantity q P(r) is the probability that a sample is both positive and yields r, with a corresponding interpretation for (1 − q) N(r). This motivates us to define the total error rate
E[𝒟_P, 𝒟_N] = ∫_{𝒟_P} (1 − q) N(r) dr + ∫_{𝒟_N} q P(r) dr. (6)
The terms on the right-hand side (RHS) are the rates of false positives and false negatives (normalized by the number of tests). Eq. (6) treats any misclassification as equally undesirable, but importantly, indeterminates are not considered errors in Eq. (6). Thus, E so defined is not the error rate of the assay restricted to samples that fall only within 𝒟_P and 𝒟_N. The latter is defined as
E_𝒟[𝒟_P, 𝒟_N] = E[𝒟_P, 𝒟_N] ∕ ∫_𝒟 Q(r) dr, (7)
where 𝒟 = 𝒟_P ∪ 𝒟_N is the set of all samples not in the indeterminate region. Eq. (7) is a conditional expectation; i.e. it is the average error conditioned on the set of samples that can be classified.
[Footnote 4: See Refs. [5,14] for an unbiased method to estimate q without needing to classify samples.]
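On a quadrature grid, the error rates of Eqs. (6) and (7) can be estimated directly. The following is a minimal Python sketch; the grid, densities, and domain masks are hypothetical inputs, not quantities from the assay itself:

```python
import numpy as np

def error_rates(q, P, N, w, in_DP, in_DN):
    """Quadrature estimates of the total error rate (Eq. (6)) and the
    conditional error on the classified set D (Eq. (7)).

    q: prevalence; P, N: class-conditional densities evaluated on grid
    points; w: quadrature weights; in_DP, in_DN: boolean masks giving the
    positive and negative classification domains."""
    # Eq. (6): false positives + false negatives, normalized by all tests.
    E = np.sum(w[in_DP] * (1 - q) * N[in_DP]) + np.sum(w[in_DN] * q * P[in_DN])
    # Probability mass of D = D_P ∪ D_N under the total PDF Q = qP + (1-q)N.
    in_D = in_DP | in_DN
    mass_D = np.sum(w[in_D] * (q * P[in_D] + (1 - q) * N[in_D]))
    return E, E / mass_D
```

With no holdout (in_DP and in_DN covering the whole grid), mass_D is one and the two rates coincide, as noted in the text.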
In Ref. [5] we showed that when the set Z_{1∕2} = {r ∶ q P(r) = (1 − q) N(r)} has measure zero and 𝒟_h is the empty set, E is minimized by the binary classification scheme
D_P⋆ = {r ∶ q P(r) > (1 − q) N(r)}, (8a)
D_N⋆ = {r ∶ q P(r) < (1 − q) N(r)} (8b)
for prevalence q. In light of the definition of Q(r), interpretation of Eqs. (8a) and (8b) is straightforward: classify a sample as positive (negative) if the probability of being both positive (negative) and having value r is greater than the corresponding probability of being negative (positive) and having value r. [See Chapter 3 of [19] for related ideas.] While D_P⋆ and D_N⋆ are not the optimal sets for the problem at hand, they play a fundamental role in the analysis that follows. We also note an important corollary that when Z_{1∕2} has non-zero measure, Eqs. (8a) and (8b) are generalized to
D_P⋆ = {r ∶ q P(r) > (1 − q) N(r)} ∪ Ẑ_P, (9a)
D_N⋆ = {r ∶ q P(r) < (1 − q) N(r)} ∪ Ẑ_N, (9b)
where Ẑ_P and Ẑ_N are an arbitrary partition of Z_{1∕2}. The physical interpretation of this generalization is that any point having equal probability of being negative or positive can be assigned to either class without changing the error. In practice, however, classification often reverts to Eqs. (8a) and (8b), as Z_{1∕2} has zero measure for many practical PDFs.
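Eqs. (8a) and (8b) amount to a weighted likelihood comparison. The following is a minimal sketch with hypothetical one-dimensional Gaussian class conditionals (the means and the example points are illustrative only):

```python
import numpy as np

def classify_binary(r, P, N, q):
    """Optimal binary rule of Eqs. (8a)-(8b): a measurement r is classified
    positive when q*P(r) > (1-q)*N(r), and negative otherwise.
    P and N are callables returning the class-conditional densities."""
    return np.where(q * P(r) > (1 - q) * N(r), "positive", "negative")

# Illustrative 1D Gaussian conditionals (hypothetical parameters).
def P_pdf(r):
    return np.exp(-0.5 * (r - 2.0) ** 2) / np.sqrt(2 * np.pi)

def N_pdf(r):
    return np.exp(-0.5 * r ** 2) / np.sqrt(2 * np.pi)

# At q = 1/2 the boundary is the midpoint r = 1 between the two means.
labels = classify_binary(np.array([0.5, 1.5]), P_pdf, N_pdf, 0.5)
```

For q = 1∕2 this reduces to a pure likelihood-ratio test; other prevalences shift the boundary, consistent with the discussion in Section 6.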
In the present work, we assume that there is a desired average accuracy X and that L = 1 − E_𝒟[D_P⋆, D_N⋆] < X when all samples are classified. Our goal is to define a minimum probability indeterminate class 𝒟_h⋆ and domains 𝒟_P⋆ and 𝒟_N⋆ for which L[𝒟_P⋆, 𝒟_N⋆] = X; that is, we wish to hold out the fewest samples so that those remaining are classified with the desired accuracy. Mathematically, we seek to minimize
Q(𝒟_h) = ∫_{𝒟_h} Q(r) dr (10)
subject to the constraint
q ∫_{𝒟_P} P(r) dr + (1 − q) ∫_{𝒟_N} N(r) dr = X ∫_𝒟 Q(r) dr (11)
for 𝒟 = 𝒟_P ∪ 𝒟_N. In light of Eq. (7), this constraint fixes the conditional expectation [20] of the assay accuracy restricted to 𝒟; i.e. the accuracy of the assay excluding the holdout domain must be X.
To solve this problem, it is useful to introduce several auxiliary concepts. In particular, define the local accuracy of the unconstrained (i.e. no indeterminate), binary classification to be
Z(r, D_P, D_N) = [q P(r) 1_{D_P}(r) + (1 − q) N(r) 1_{D_N}(r)] ∕ Q(r), (12)
where D_P and D_N cover the whole set Ω up to sets of measure zero; moreover, let Z⋆(r) = Z(r, D_P⋆, D_N⋆) be the local accuracy of the optimal solution to the binary problem. Then the solution to the constrained problem given by Eqs. (10) and (11) is
𝒟_h⋆ = {r ∶ Z⋆(r) < X_0} ∪ C, (13a)
𝒟_P⋆ = D_P⋆ ∕ 𝒟_h⋆, (13b)
𝒟_N⋆ = D_N⋆ ∕ 𝒟_h⋆, (13c)
where X_0(X) is the solution to the equation
∫_𝒟 Z⋆(r) Q(r) dr = X ∫_𝒟 Q(r) dr, 𝒟 = Ω ∕ 𝒟_h⋆, (14)
for any set C ⊂ {r ∶ Z⋆(r) = X_0} satisfying Eq. (14). Proof of this result, as well as the strict interpretation of C, requires significant analysis of Eq. (11) and is reserved for the Appendix. Here we provide an intuitive interpretation and describe a straightforward algorithm for computing Eqs. (13a)-(13c). Eq. (13a) states that the points to label indeterminate are those with the lowest local accuracy, up to some threshold value X_0 that depends on X. Eqs. (13b) and (13c) then amount to the observation that the positive and negative domains are the same as in the unconstrained binary problem, except that we remove the corresponding points with low enough local accuracy. Eq. (14) requires that the average local accuracy for the classification sets 𝒟_P⋆ and 𝒟_N⋆ be X. By virtue of the fact that 𝒟_h = Ω ∕ 𝒟, this fixes the boundary of the indeterminate set. That is, the upper bound X_0(X) on the indeterminate local accuracy is the lower bound on the accuracy for sets that can be classified. The set C(X) is a bookkeeping artifact accounting for the situation in which the set of points with local accuracy X_0(X) has non-zero probability mass. In this case, not all of these points need to be held out if doing so would make L greater than X. The choice of which points to make indeterminate is then subjective, as they all have the same local accuracy. In practice (e.g. for smooth PDFs), C(X) is a set of measure zero with respect to Q, so that we can ignore it in Eq. (13a).
[Footnote 5: E and E_𝒟 are equal when 𝒟_h is the empty set. Note also that one can measure Z_{1∕2} with respect to either P or N; this is because the set Z_{1∕2} by definition entails that q P(r) = (1 − q) N(r).]
[Fig. 1: Training data associated with the saliva assay described in Refs. [11,12]. Red x denote known positives (confirmed via polymerase chain-reaction measurements), and blue o denote pre-pandemic samples, which are assumed to be negative for SARS-CoV-2 antibodies. The bold horizontal and vertical black lines are cutoffs used to classify samples. Data falling above the horizontal line (red shaded domain) are classified positive; data in the lower right box (shaded blue) are negative; and data in the lower left box (shaded yellow) are indeterminate. The SARS-CoV-2 IgG measurements (vertical axis) are a sum of seven antibody levels measured by the assay, whereas the total IgG measurement (horizontal axis) is the total immunoglobulin-G (IgG) measurement as determined by an enzyme-linked immunosorbent assay (ELISA).]
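For the optimal binary domains, the local accuracy reduces to a simple ratio. The following sketch reflects our reading of Eq. (12) evaluated on D_P⋆ and D_N⋆, where the larger of the two joint densities is the one assigned by the optimal rule:

```python
import numpy as np

def local_accuracy(qP, qN):
    """Local accuracy Z*(r) of the optimal binary rule: the probability that
    the assigned class is correct at measurement r.  Inputs are the joint
    densities qP = q*P(r) and qN = (1-q)*N(r); since the optimal rule picks
    the larger term, Z* = max(qP, qN) / (qP + qN), which lies in [1/2, 1]."""
    return np.maximum(qP, qN) / (qP + qN)
```

Points on the boundary of Eq. (1), where the two joint densities are equal, attain the minimum value 1∕2 and are the first candidates for holdout.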
From Eqs. (13a)-(14) it is clear that determining X_0(X) is the key step in defining the optimal classification domains. Fortunately, the interpretation of Eq. (14) leads to a straightforward bisection method. First note that 1∕2 ≤ Z⋆(r) ≤ 1. Let X_0 = 3∕4 be an initial guess for the value of X_0(X), and let X_n be the nth update, computed iteratively as follows: evaluate the average local accuracy of the provisional classification set {r ∶ Z⋆(r) > X_n}; if this average is less than X, raise the waterline, and otherwise lower it, halving the bracketing interval at each step. The existence of a non-trivial set C(X) can be deduced from the observation that the averages do not converge, but rather cycle between two well-separated values, depending on whether X_n is greater than or less than X_0(X). In this case, the set C(X) can be defined arbitrarily, but consistent with Eq. (14), once X_0(X) is identified to sufficient accuracy. (In practice, and given the speed of convergence, we find that there is little value in considering starting points other than X_0 = 3∕4.)
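The bisection just described can be sketched as follows. This is a minimal Python implementation on a quadrature grid; the grid values and probability masses are hypothetical inputs:

```python
import numpy as np

def waterline_bisection(Zstar, weights, X, tol=1e-6, max_iter=100):
    """Bisection for the waterline X_0 of Eq. (14): hold out points with
    Zstar <= X0 so that the weighted average of Zstar over the remaining
    points equals the target accuracy X.

    Zstar:   local accuracies Z*(r) evaluated on a quadrature grid
    weights: corresponding probability masses Q(r) * dr"""
    lo, hi = 0.5, 1.0   # Z*(r) always lies in [1/2, 1]
    X0 = 0.75           # initial guess suggested in the text
    for _ in range(max_iter):
        keep = Zstar > X0
        mass = weights[keep].sum()
        avg = (weights[keep] * Zstar[keep]).sum() / mass if mass > 0 else 1.0
        if abs(avg - X) < tol:
            break
        if avg < X:
            lo = X0     # accuracy too low: raise the waterline
        else:
            hi = X0     # accuracy too high: lower the waterline
        X0 = 0.5 * (lo + hi)
    return X0
```

When the level set {Z⋆ = X_0} carries non-zero mass, the loop exhausts max_iter while the average cycles, which is the numerical signature of a non-trivial C(X).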

Example applied to a salivary SARS-CoV-2 IgG Assay
To illustrate the analysis of Section 3, we consider a saliva-based assay described in Refs. [11,12]. We refer the reader to those manuscripts for details of assay design, sample preparation, and measurement processes. For each sample, two measurement values are output: a total immunoglobulin G (IgG) enzyme-linked immunosorbent assay (ELISA) measurement; and a sum of seven SARS-CoV-2 IgG measurements associated with distinct antigen targets. As a preliminary remark, we observe that the numerical range of the data spans several decades of median fluorescence intensity (MFI), which is difficult to model directly. We also note that the measurements are bounded from below by zero and have a finite upper bound. This motivates us to transform each numerical value v via log₂[v + 2] − 1, which puts the data on the scale of bits. Empirically, we also find that this transformation better separates the positive and negative populations. Total IgG values are then rescaled to the domain [0, 1] by dividing each measurement by the maximum. SARS-CoV-2 measurements are similarly rescaled to the domain [0, 1], although we divide the log-transformed data by 7, since there were no samples with saturated values. After transformation, each sample is represented by a two-dimensional vector r = (x, y), where x is the normalized total IgG value and y is the normalized SARS-CoV-2 counterpart.
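The preprocessing just described can be sketched as follows; the helper names and the dataset-dependent maximum are hypothetical stand-ins:

```python
import numpy as np

def to_bits(v):
    """Map raw MFI values to a bit scale via log2(v + 2) - 1 (Section 4)."""
    return np.log2(np.asarray(v, dtype=float) + 2.0) - 1.0

def normalize(total_igg, cov_igg, max_total_bits, n_bits=7):
    """Rescale both coordinates to [0, 1]: total IgG by the maximum observed
    value in bits (max_total_bits, a dataset-dependent constant), and
    SARS-CoV-2 IgG by the fixed value n_bits = 7 used in the text."""
    x = to_bits(total_igg) / max_total_bits
    y = to_bits(cov_igg) / n_bits
    return x, y
```

Note that v = 0 maps exactly to 0 bits, so the lower measurement bound is preserved by the transform.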
The results of this transformation are shown in Fig. 1, along with the classification domains currently used with this assay. The goal of the analysis is to maintain accuracy while decreasing the number of indeterminate samples, i.e. by finding the domain 𝒟_h with the smallest probability mass. We remind the reader that size does not refer to the (generalized) volume in measurement space. Rather, it refers to the fraction of samples expected to fall within the domain, since this is what controls the number of indeterminate samples. Thus, it is possible for 𝒟_h to be quite large when expressed in terms of antibody levels and yet contain very few samples.
To motivate our probability models, we consider the phenomena that could affect measurements. In particular, we anticipate that for positive samples, there should be a degree of correlation between total IgG and SARS-CoV-2-specific antibodies. However, at extreme total IgG values, the SARS-CoV-2 levels may become independent, as (i) all measurements revert to noise in the limit of vanishing total IgG, or (ii) SARS-CoV-2 antibody levels decouple from total antibody levels when the latter is excessively high, e.g. if an individual has been exposed to a large number of different pathogens. We also recognize that the ELISA instrument only reports numerical values on the domain [x_min, x_max]. Thus, fluorescence levels above x_max are rounded down to the upper bound, and levels below x_min are rounded up to the lower bound. As shown in Fig. 1, this has the effect of accumulating data (and thus probability mass) on the lines x = x_min and x = x_max, which is a manifestation of data censoring [21,22]. While details are reserved for the Appendix, this observation leads us to model positive and negative samples via a PDF of the form
P(x, y) = (1 − α − β) P_0(x, y) + α δ(x) P_l(y) + β δ(x − 1) P_r(y), (15)
where 0 ≤ α ≤ 1, 0 ≤ β < 1, δ(x) is the Dirac delta function, and P_0(x, y) is assumed to be bounded and continuous on the whole domain. The functions P_l(y) and P_r(y) characterize the probability of SARS-CoV-2 antibody levels for measurement values saturated at the left (l) and right (r) bounds. We emphasize that the use of delta functions in Eq. (15) is formal and should be treated with care. A more rigorous interpretation of what is meant by Eq. (15) is discussed in the Appendix.
To model the function P_0(x, y), we treat the total IgG measurements as independent normal random variables with an unknown mean and variance. Within the domain 0 < x < 1 (note the strict inequalities) and 0 ≤ y < ∞, we assume that the SARS-CoV-2 measurements are well described by a Gamma distribution with a fixed (but unknown) scale factor and a shape parameter having a sigmoidal dependence on x. This dependence is motivated by the correlation described previously. Taken together, this yields a PDF of the form
P_0(x, y) ∝ exp[−(x − μ)² ∕ (2σ²)] y^{k(x)−1} exp[−y∕θ], (16)
where μ, σ, θ, and the parameters of the sigmoid k(x) are to be determined. The boundary functions P_l(y) and P_r(y) are defined to be the corresponding conditional densities of y given saturation, which describe the probability that a total IgG value below (above) x = 0 (x = 1) will be mapped back to the lower (upper) instrument bound. The free parameters are determined via maximum likelihood estimation using a censoring-based technique; see the Appendix and Refs. [21,22]. As an approximation, we truncate the y-domain to 0 ≤ y ≤ 1 and renormalize the resulting PDF on this domain.
[Figure caption: In order to define the indeterminate region, we use the target global accuracy X to define a maximum local accuracy up to which we hold out samples. Increasing the target global accuracy of the restricted classification raises the waterline, thereby holding out more samples.]
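The bulk density described above can be sketched as follows: normal in x, Gamma in y, with a shape parameter that varies sigmoidally in x. The sigmoid parameterization (k0, k1, x0, w) and all numerical values below are illustrative assumptions, not the fitted values from the Appendix:

```python
import math

def gamma_pdf(y, shape, scale):
    """Gamma density with shape/scale parameterization."""
    return y ** (shape - 1.0) * math.exp(-y / scale) / (math.gamma(shape) * scale ** shape)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bulk_pdf(x, y, mu=0.6, sigma=0.15, theta=0.05, k0=1.5, k1=6.0, x0=0.5, w=0.1):
    """Sketch of P_0(x, y) before truncation to 0 <= y <= 1.  The Gamma
    shape parameter varies sigmoidally with x, encoding the correlation
    between total IgG and SARS-CoV-2 IgG for positive samples."""
    shape = k0 + k1 / (1.0 + math.exp(-(x - x0) / w))  # sigmoid in x
    return normal_pdf(x, mu, sigma) * gamma_pdf(y, shape, theta)
```

In practice the same functional form would be refit separately to the positive and negative training data, with the truncation and renormalization applied afterward.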
For the negative PDF N(x, y), we anticipate that non-specific binding of the total IgG antibodies to the SARS-CoV-2 antigens will lead to a degree of correlation, albeit to a lesser extent than for positives. Thus, we use the same functional form as for P(x, y), but refit the parameters using the negative training data. Fig. 2 shows the outcome of this exercise for the two training sets. Because P_l(y), P_r(y), and the corresponding terms for N(x, y) are continuous with respect to the Gamma portion of P(x, y) and N(x, y), the former can be inferred from the contour lines in the figure (up to a normalization factor) and are thus not shown.
Figs. 3 and 4 show Z⋆(r) and the waterlines necessary to achieve different average accuracies. The bathtub-type principle is shown in the latter; see also Ref. [15] for related ideas. To ensure that L = X, we only hold out samples up to the corresponding value of X_0(X). Note that indeterminates are concentrated in regions where there is significant overlap between positive and negative samples. Fig. 5 shows the corresponding classification domains computed according to the bathtub-type principle for a target accuracy of 99.6%; see also Table 1. Relative to the original classification domains, the analysis reduces the empirical rate of indeterminate samples by more than 12% while increasing both the accuracy and sensitivity of the assay (with empirical specificity remaining constant). See also Fig. 6 and Section 6 for additional examples of holdout domains.
[Fig. 5: The empirical accuracy is 98.8%, with a specificity of 100% and sensitivity of 96.7%. The total accuracy is the prevalence-weighted combination of these latter quantities. Note the prevalence is associated with the restricted set of samples that are actually classified; see Section 6. Discrepancy between the theoretical and empirical accuracies is due to idealization of the modeling and stochasticity in the data. For comparison, the horizontal and vertical black lines are the same as in Fig. 1 and denote the corresponding classification domains originally used for this assay. The indeterminate region based on the bathtub-type principle reduces the number of unclassified samples by more than 12% relative to the original domains while maintaining specificity and improving sensitivity for the training data. See also Table 1.]

Numerical validation
To validate that the sets 𝒟_P⋆, 𝒟_N⋆, and 𝒟_h⋆ obtained in Section 3 are optimal, we consider a numerical experiment wherein we perturb 𝒟_h⋆ by swapping probability mass between it and the classified set, with the exchanged masses related by the ratio
[Z(r) − X] ∕ [Z(r′) − X]. (19)
In principle Z(r) can be an arbitrary definition of local accuracy, although in practice we take Z(r) = Z⋆(r) in this section. The interpretation of Eq. (19) is as follows. In taking a point r′ from 𝒟 and adding it to 𝒟_h, and vice-versa for r, we must ensure that the constraint Eq. (11) remains satisfied. The ratio [Z(r) − X] ∕ [Z(r′) − X] provides the ''rate-of-exchange'' of probability. For example, if Z(r) − X < Z(r′) − X < 0, then adding r to 𝒟 will infinitesimally decrease the global accuracy, so that we must hold out a larger (yet still infinitesimal) fraction of Q in the vicinity of r′. It is clear that Eq. (19) goes through a singularity when Z(r′) → X and becomes negative for Z(r′) > X and Z(r) < X. The interpretation of this is straightforward: we should always reverse any swap for which a point with local accuracy greater than the average is put in the indeterminate class. Such points are not considered in the analysis below. More rigorous interpretations of Eq. (19) are considered in the Appendix, especially in the context of the singular PDF given by Eq. (15).
The benefit of Eq. (19) is that it allows us to estimate a ''set-partial derivative'' by computing the relative probability exchange for any point in the indeterminate domain. In particular, we compute this quantity (Eq. (20)) for the optimal domains 𝒟_h⋆ and 𝒟⋆. Fig. 7 shows the logarithm of Eq. (20) for a mesh of points in the indeterminate region, taking Z(r) = Z⋆(r). Note that swapping any point in the indeterminate region with one in the positive or negative classification domains increases the size of the indeterminate domain, as expected.
To validate that swapping points between 𝒟_P⋆ and 𝒟_N⋆ does not increase the accuracy of the assay or decrease the size of the indeterminate domain, we examine the quantity Z⋆(r) directly. In particular, the Appendix shows that Z⋆(r) ≥ 1∕2 for all r ∈ 𝒟⋆ guarantees that 𝒟_P⋆ = D_P⋆ ∕ 𝒟_h⋆ and 𝒟_N⋆ = D_N⋆ ∕ 𝒟_h⋆ are optimal for the indeterminate region 𝒟_h⋆. Fig. 3 demonstrates that this inequality holds for the solution given by Eqs. (13a)-(14). Thus, no rearrangement of points decreases the size of the indeterminate domain.
[Table 1: Summary of the fraction of holdouts, sensitivity, and specificity for the data in Figs. 5 and 6. The rectilinear classification method is described in Fig. 1, while the optimal method is given by Eqs. (13a)-(14). For sensitivity, specificity, and accuracy calculations, the numbers in brackets are empirical 95% confidence intervals.]

The role of prevalence
Examination of Eq. (11) reveals that the terms on the LHS are proportional to prevalence-weighted estimates of sensitivity and specificity. In particular, recognize that
S_e = ∫_{𝒟_P} P(r) dr ∕ ∫_𝒟 P(r) dr, (21a)
S_p = ∫_{𝒟_N} N(r) dr ∕ ∫_𝒟 N(r) dr (21b)
are the sensitivity and specificity restricted to the domain 𝒟. When there is no indeterminate domain, the normalization factors ∫_𝒟 P(r) dr = ∫_𝒟 N(r) dr = 1, so that Eqs. (21a) and (21b) revert to the standard definitions of these quantities. In this case, we see that Eq. (11), which no longer acts as a constraint, amounts to the statement that the prevalence-weighted sum of sensitivity and specificity is equal to X; that is,
q S_e + (1 − q) S_p = X. (22)
When we permit an indeterminate class, however, the interpretation is not as straightforward. In particular, the presence of the term N = ∫_𝒟 Q(r) dr on the right-hand side (RHS) appears problematic, for note that it implies
q ∫_{𝒟_P} P(r) dr + (1 − q) ∫_{𝒟_N} N(r) dr = X N. (23)
The normalization factor N differs from its counterparts in Eqs. (21a) and (21b). Thus, it is not obvious what our constraint enforces about the sensitivity and specificity of the assay restricted to 𝒟. The resolution to this conundrum is to recognize that the prevalence of the population also changes when we restrict classification to 𝒟. This is not to say that the value of q itself (i.e. associated with the total population) changes, but rather that the relative fraction of positives and negatives differs on 𝒟 ⊂ Ω. This is not unexpected, since the shape of the indeterminate region is a function of the local accuracy Z, which depends on the specifics of the probability models. Mathematically, we understand these observations by rewriting Eq. (23) in the form
(q N_P ∕ N) S_e + [(1 − q) N_N ∕ N] S_p = X, (24)
where N_P = ∫_𝒟 P(r) dr and N_N = ∫_𝒟 N(r) dr are the required normalization constants. Eq. (24) becomes an analogue of Eq. (22) of the form
q_𝒟 S_e + (1 − q_𝒟) S_p = X, (25)
where q_𝒟 = q N_P ∕ N is the prevalence restricted to the domain 𝒟. Note that q_𝒟 has the properties necessary to be a prevalence:
0 ≤ q_𝒟 ≤ 1, (26)
which is a consequence of the definition of N.
Thus, we see that the constraint corresponds to a restricted-prevalence-weighted sum of sensitivity and specificity. From a theoretical standpoint, Eq. (26) is extremely serendipitous. The constraint as defined by Eq. (11) only refers to the prevalence of the full population. It is not obvious that this equation would remain a prevalence-weighted sum when holding out samples, especially as the restricted prevalence does not in general equal q. Further implications of this observation are explored in the next section.
However, an immediate practical consequence of Eq. (26) is that the relative fraction of positives reported by an assay using indeterminates is not a reliable estimator of the total prevalence. In order for the restricted prevalence q_𝒟 to equal q, one requires
N_P = N_N. (27)
That is, q = q_𝒟 only occurs when the holdout domain removes equal probability mass from both class-conditional models, which is extremely restrictive.
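The restricted prevalence and the condition under which it coincides with q can be checked directly; a one-function sketch (the mass values passed in are hypothetical):

```python
def restricted_prevalence(q, N_P, N_N):
    """Prevalence restricted to the classified set D:
    q_D = q*N_P / (q*N_P + (1-q)*N_N), where N_P and N_N are the masses of D
    under the positive and negative models.  The denominator is the mass of
    D under Q, so q_D = q exactly when N_P = N_N (cf. Eq. (27))."""
    return q * N_P / (q * N_P + (1.0 - q) * N_N)
```

Because the denominator is q N_P + (1 − q) N_N = ∫_𝒟 Q(r) dr, the complementary weight (1 − q_𝒟) automatically equals (1 − q) N_N ∕ N, consistent with Eq. (24).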
To overcome this problem, we recall that Ref. [5] demonstrated how an unbiased estimate of the total prevalence can be constructed, without classifying samples, using a simple counting exercise on subdomains of Ω. The validity of that method is independent of the assay accuracy, so that it can be used to estimate q in the present work. Indeed, such techniques are necessary to construct the optimal classification domains, given the fundamental role of q in their definitions. We refer the reader to Ref. [5] for a deeper discussion of such issues.

Other notions of optimality
A common practice in the testing community is to preferentially optimize an assay so that either the specificity or sensitivity reaches a desired target, but not explicitly a linear combination of the two. Eq. (25) and the bathtub-type principle suggest a route by which our method can solve an analogue of this problem. However, a deeper investigation of sensitivity and specificity is first necessary to motivate this generalization and understand how such methods differ from Eqs. (13a)-(14). [See also Ref. [23] for additional notions of optimality, as well as Refs. [24-26] for other approaches to defining classification domains.] Examination of the binary problem reveals that when q = 1∕2, the domains D_P⋆ and D_N⋆ equally weight sensitivity and specificity; that is, errors in either are treated as equally undesirable. It is straightforward to show that increasing q will increase sensitivity at the expense of specificity, and vice versa. The interpretation of this observation is that as the number of positive samples increases, we should increase the size of the positive classification domain so as to capture their increasing share of the population. It is therefore possible and even likely that when the prevalence approaches 0% or 100%, either sensitivity or specificity may be unacceptably low, since the corresponding contribution to the total accuracy becomes negligible.
[Fig. 8: This yields an empirical specificity of 100% for the training data while keeping the empirical sensitivity above 94%. Note that the indeterminate domain (light blue) is increased only into the positive classification domain (yellow-green) in attempting to satisfy inequality (30). The teal strip adjacent to the light blue and yellow-green regions is the modified indeterminate domain. After increasing the empirical specificity to 100%, the optimized domains hold out 15.1% of samples, as opposed to 22.3% for the rectilinear method; see Table 1.]
A possible solution to this problem is to recast Eq. (11) as an inequality constraint, together with additional constraints imposing user-defined lower bounds on the sensitivity and specificity; see inequalities (28)-(30). While an optimal solution to this problem is beyond the scope of the current manuscript, the bathtub-type principle suggests a construction akin to active-set methods [27]. First, solve the optimization problem associated with Eqs. (10)-(11) and check the resulting values of sensitivity and specificity. If these quantities are deemed too small, remove samples up to user-defined waterlines (which may be different), applied separately to samples in the negative and positive classification domains. Fig. 8 shows an example of this approach applied to the data in previous figures. We originally set the target accuracy to X = 0.99 but required that the empirical specificity be 100% for the training set. To accomplish this, we set the corresponding waterline to 0.972, which augments the size of the indeterminate domain (teal strip added to the light blue domain) without decreasing the number of true negatives.
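The active-set-style adjustment described above can be sketched as follows. This is an illustrative, discrete simplification with hypothetical names (the paper's construction raises a waterline over continuous domains): positively classified training samples are moved into the indeterminate class, lowest local accuracy first, until the empirical specificity meets the target.

```python
import numpy as np

def enforce_specificity(pred_pos, is_negative, local_acc, target=1.0):
    """Enlarge the indeterminate class into the positive classification
    domain until the empirical specificity reaches `target`.
    Returns a mask of samples moved to the indeterminate class."""
    indet = np.zeros(len(pred_pos), dtype=bool)

    def spec():
        tn = np.sum(is_negative & ~pred_pos)
        fp = np.sum(is_negative & pred_pos & ~indet)
        return tn / (tn + fp) if (tn + fp) else 1.0

    # Positively classified samples, worst local accuracy first.
    for i in sorted(np.flatnonzero(pred_pos), key=lambda j: local_acc[j]):
        if spec() >= target:
            break
        indet[i] = True
    return indet

pred_pos = np.array([1, 1, 1, 0, 0, 1, 0, 1], bool)   # binary calls
is_neg   = np.array([0, 1, 0, 1, 1, 1, 1, 0], bool)   # known negatives
acc      = np.array([0.90, 0.60, 0.95, 0.80, 0.90, 0.55, 0.99, 0.97])
held = enforce_specificity(pred_pos, is_neg, acc, target=1.0)
```

Note that, as in Fig. 8, the indeterminate class grows only into the positive classification domain, so the number of true negatives is unchanged.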

Relationship between prevalence, sensitivity, and specificity
Eq. (25) and the examples of Secs. 6.1 and 6.2 raise the question: to what extent is prevalence-weighted accuracy a preferred or natural framework for diagnostic classification, as opposed to methods based on explicit reference to sensitivity and specificity? To unravel this, consider that the latter two are purely theoretical properties of a specific choice of classification domain and are only loosely connected to the reality of testing. This is evident from the definitions given by Eqs. (21a) and (21b). The concept of prevalence, which implies the existence of a population, does not enter; rather, all that is needed is a choice of the classification domains. Thus, an assay can have exceptional sensitivity and yet yield positive calls that are wrong half the time if the prevalence is low. In a related vein, it is clear that specificity and sensitivity only characterize assay accuracy in the limits of prevalence approaching 0 and 1, respectively.
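The point that high sensitivity does not guarantee correct positive calls at low prevalence can be made concrete with a quick calculation (the numbers below are illustrative only): the prevalence-weighted accuracy is q·(sensitivity) + (1−q)·(specificity), while the probability that a positive call is correct follows from Bayes' rule and collapses as the prevalence q approaches zero.

```python
def weighted_accuracy(q, sens, spec):
    """Prevalence-weighted accuracy: q*sensitivity + (1-q)*specificity."""
    return q * sens + (1 - q) * spec

def ppv(q, sens, spec):
    """Probability that a positive call is correct (Bayes' rule)."""
    return q * sens / (q * sens + (1 - q) * (1 - spec))

# Perfect sensitivity and 98% specificity at 2% prevalence:
acc = weighted_accuracy(0.02, 1.0, 0.98)  # high overall accuracy...
p = ppv(0.02, 1.0, 0.98)                  # ...yet a positive call is
                                          # right only about half the time
```

Here the overall accuracy is about 98%, yet roughly half of all positive calls are false, illustrating why sensitivity and specificity alone do not characterize performance on a real population.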
Here we encourage a new perspective. As a baseline strategy, the most important task is to correctly classify samples; at the least, this is of the utmost importance to patients. Moreover, computing accurate prevalence estimates is critical for epidemiologists (although we have shown previously that this problem can be solved accurately without recourse to classification). With this goal in mind, the sensitivity and specificity are subservient to accuracy via Eq. (11), and it is not unreasonable to let them change with prevalence if doing so increases overall testing accuracy. We highlight this because, under such a paradigm, sensitivity and specificity lose their status as the key performance metrics that define the ''quality'' of an assay, and they cannot be viewed as static properties. Such observations are not to say that these quantities are useless, however. Clearly there are times when it is more important to correctly identify samples from one class, and this motivates the generalization of Section 6.2.
But these observations clarify our perspective of why the prevalence sets a natural scale for classification. In particular, Eq. (11) has two equivalent interpretations: (i) the accuracy of the assay must be X; and (ii) the prevalence-weighted sensitivity and specificity must be X. The equivalence of these interpretations arises from the fact that notions of accuracy assume the existence of a population to which the test is applied. Thus, Eq. (25) is perhaps unsurprising in light of Eq. (11), because both are self-consistent statements about the properties of a population.
The benefit of treating prevalence-weighting as a natural framework for diagnostic classification is that one can easily identify when subjective elements (i.e. not intrinsic to the population) have been added to the analysis. For example, the indeterminate domain in Fig. 8 associated with the inequalities (28)-(30) is not optimal insofar as there is a smaller counterpart that yields the same average accuracy for the classified data. However, it is clear by construction how we have modified the latter, i.e. by adding a user-defined constraint on the specificity. Likewise, even Eq. (11) should be viewed as a subjective modification of the unconstrained, prevalence-weighted classification problem.
Ultimately the choice of classification method is best determined by assay developers, and there may be situations in which prevalence weighting is inappropriate. Nonetheless, we feel that the analysis herein highlights the assumptions behind our work and attempts to ground it in objective elements inherent to the population of interest.

Implications of an indeterminate class
The use of an indeterminate class in diagnostics can have consequences, especially for individual patients. We briefly consider such issues here.
We first note what does not change: it is still possible to estimate the prevalence of an entire population, even though some samples are held out. This is a consequence of the methods in Refs. [5,14], which yield unbiased predictions of the prevalence that converge in mean-square without ever classifying data. These results can be understood physically by recognizing that prevalence estimation entails determining the number of positive samples, not identifying their classes. The latter task is more specific and amounts to a choice about how to interpret the data, which has no bearing on its underlying statistical properties (such as prevalence).
This observation highlights the subjective nature of classification insofar as Eqs. (6)-(11) are choices of the types of errors we wish to minimize. These choices are informed by the measurement setting and dictate what role an indeterminate class plays in reporting results to individuals. In mass surveillance studies, for example, the primary goal may be to deduce the prevalence of various populations. In such cases, a large fraction of inconclusive results could reduce individual confidence in testing, although it should not affect the overall aims of the study. In other settings, such as antibody testing to assess immunity, the specific test results matter more. In some cases an indeterminate class as constructed herein (in terms of local accuracy) could in fact increase the confidence and/or usefulness of the diagnostic. For example, in testing an immunocompromised individual, the potential loss associated with an incorrect result suggests the need for a more stringent criterion (e.g. using local accuracy) to determine if an individual has seroconverted.

Limitations and open directions
A fundamental limitation of our analysis is the assumption that the probabilistic models describing positive and negative samples remain valid outside the scope of the training data. This problem is common to virtually any classification scheme and is primarily an issue of modeling. Such issues have been explored in a previous manuscript, to which we refer the reader [5]. We note here, however, that model-form errors may introduce uncertainty on the order of a few percent in the conditional probability densities. Thus, it is likely that modeled estimates of accuracy will be incorrect by a proportional amount. This is seen, for example, in the holdout domain computed in Fig. 5. However, Section 6.2 provides a means of ensuring that the indeterminate domains are recomputed to satisfy any constraints on empirical estimates of accuracy. We also note that approaches that do not explicitly account for prevalence and/or conditional probabilities are likely to suffer significantly more model-form error than estimates based on our approach.
Regarding the indeterminate analysis, Eqs. (13a)-(14) and the generalization considered in Section 6.2 may pose a challenging optimization problem, although the solution could be extremely useful for satisfying regulatory and/or public health requirements. Moreover, formalizing the algorithm described in that section and studying its properties relative to the optimal solution may be useful.
A practical limitation of our analysis concerns the definition of assay performance when we allow for variable, prevalence-dependent classification domains. Current standards advocate using sensitivity and specificity estimated for a single validation population having a fixed prevalence. To realize the full potential of our analysis, it is instead necessary to (i) estimate assay accuracy and the uncertainty therein, (ii) characterize the admissible classification domains, and (iii) compute sensitivities and specificities, all as functions of the variable prevalence. While such issues have been partly considered in Ref. [5], deeper investigation of this uncertainty quantification is necessary for widespread adoption of these techniques.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.
Eq. (14) motivates an auxiliary function of the waterline that is monotone increasing, negative at 1/2, and positive for some argument greater than 1/2. Thus, there exists a unique waterline value at which one of two situations holds: either (I) the function is continuous there and vanishes, which directly implies Eq. (14); or (II) the function suffers a discontinuity, so that it is negative at the waterline and positive for any argument immediately above it. The latter case occurs when the level set S of points whose local accuracy equals the waterline has non-zero measure, and we may take the correction set C to be any subset C ⊂ S for which Eq. (14) is satisfied. The existence of such a C is guaranteed by the linearity of integration, which implies that the auxiliary function is a continuous, monotone increasing function of the measure of the held-out subset of S and passes through zero. Any such zero implies Eq. (14) and defines an appropriate C.
The proof that Eqs. (13a)-(14) minimize Eq. (10) relies on the observation that any point whose local accuracy falls below the waterline is farther from the mean value than any point whose local accuracy exceeds it. Thus, it ''costs additional probability'' to swap points between the indeterminate region and the union D⋆ of the optimal positive and negative classification domains while satisfying the constraint. To see this mathematically, let D be any other union of positive and negative classification domains satisfying Eq. (11); we do not consider domains D that amount only to choosing a different subset C ⊂ S while maintaining Eq. (14). Expanding Eq. (11) yields Eq. (33), whose second term can be further expanded as Eq. (34). Clearly the first term on the RHS of Eq. (34) is negative, whereas the second is positive. Noting that the local accuracy on D⋆ exceeds that on the holdout domain, one finds by inserting Eq. (34) into Eq. (33) that the latter can be expressed in a form whose coefficients satisfy inequality (36). Consider now the difference H of the objective functions. By inequality (36), we see that H > 0. Moreover, the accuracy of any classification domains associated with the binary problem is bounded above by its value on the optimal domains. Clearly any choice besides the optimal positive and negative domains entails increasing the measure of the holdout domain to ensure that the constraint is satisfied. □

Remark. Lemma 1 is distinct from Theorem 1.14 of Ref. [15] in several subtle ways. The latter minimizes a functional L[f(t)g(t)] of a product of two functions, where f(t) is arbitrary and g(t) satisfies the inequality 0 ≤ g(t) ≤ 1 (these play roles analogous to the corresponding quantities in our work). This objective is supplemented with the constraint that the expectation value of g(t) be a constant. The corresponding bathtub principle identifies an optimal g(t) as either an indicator function or a sum of two indicator functions. In the former case, g(t) defines an analogue of our optimal domain. However, the structure of the constraint in Theorem 1.14 fixes the measure of the domain and only allows its shape to vary.
In contrast, we minimize the measure of the holdout domain subject to an auxiliary constraint, which is a conditional expectation value. Doing so allows both the shape and the measure of the holdout domain to vary. We refer the reader to Theorem 1.14 of Ref. [15] for a more in-depth comparison. The reader may also verify that the case in which the optimizer is a sum of two indicator functions does not alter the interpretation above.

Appendix B. On PDFs with Dirac masses

Fig. 1 illustrates that biological phenomena may generate a signal so strong that the instrument saturates, i.e. it reaches an upper limit above which it cannot distinguish different measurement values. This saturation effectively rounds the ''true'' measurement down to that maximum, so the only conclusion we can draw about a reported value at the maximum is that the true value is greater than or equal to it. Similarly, there exists a lower limit up to which smaller measurement values are rounded. The goal of this appendix is to incorporate such information into the probability modeling.

For concreteness, we restrict ourselves to the one-dimensional measurements associated with the total IgG assay. We assume that, were the optical photodetector not restricted to the range between the lower and upper limits, the recorded measurement would have been returned on an unbounded domain. Because the measurements have been transformed to a logarithmic coordinate system, arbitrarily negative values are meaningful. Without additional information about the probability of total IgG antibody levels, we make the minimal assumption that the uncensored measurement is described by a Gaussian distribution with unknown mean and variance. Thus, on the open interval between the limits, the probability density of a measurement is the corresponding Gaussian; see Eq. (38). However, at the boundaries we only know that the true values lie below and above the respective thresholds. Thus, the probabilities of measuring the lower and upper limits are given by the integrals of Eq. (38) over the corresponding tails, and the full probability model is a continuous density on the interior plus Dirac masses at the endpoints weighted by these tail probabilities. To determine the mean and variance, we maximize with respect to these parameters the product of likelihoods of the training data, or alternatively minimize the negative log-likelihood. To construct the two-dimensional PDF associated with Eq. (15), we assume a corresponding probability model for the SARS-CoV-2 IgG measurements and use standard maximum-likelihood estimation to identify the distribution parameters. The full PDF for the training data is then given by the product of the corresponding PDFs for total IgG and SARS-CoV-2 measurements and has the form given by Eq. (15). Note that Eq. (15) does not require modification of the proof in the previous section, since any single point in the interior is a set of measure zero, provided the continuous parts of the positive and negative PDFs are bounded. However, we do require care in defining the local accuracy and the optimal classification domains on the boundaries, where the Dirac-mass weights of the negative PDF enter as the analogues of those of the positive PDF.
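The boundary-aware likelihood just described can be sketched with a censored-Gaussian fit. This is a minimal sketch under our own assumptions (the paper's actual parameters, limits, and data differ): interior points contribute the Gaussian density, while points at the limits contribute the corresponding tail mass.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def censored_gaussian_nll(params, x, lo, hi):
    """Negative log-likelihood of a Gaussian censored to [lo, hi]:
    interior points use the density, boundary points the tail mass."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                # keep sigma positive
    interior = (x > lo) & (x < hi)
    nll = -np.sum(norm.logpdf(x[interior], mu, sigma))
    nll -= np.sum(x <= lo) * norm.logcdf(lo, mu, sigma)  # mass below lo
    nll -= np.sum(x >= hi) * norm.logsf(hi, mu, sigma)   # mass above hi
    return nll

# Synthetic data: true mean 1.0, std 2.0, instrument saturating at [-2, 4].
rng = np.random.default_rng(1)
raw = rng.normal(1.0, 2.0, 50_000)
x = np.clip(raw, -2.0, 4.0)
fit = minimize(censored_gaussian_nll, x0=[0.0, 0.0],
               args=(x, -2.0, 4.0), method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
```

Ignoring the censoring (e.g. fitting an ordinary Gaussian to the clipped data) would bias both the mean and the variance, which is precisely what the Dirac-mass terms correct.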

Appendix C. On the point-swap derivatives
To justify the use of Eq. (19), return to Eq. (11) and consider a set D and its complement, the holdout domain. Consider balls B and B′ having small radii and centered about a point in the holdout domain and a point in D, respectively, with each ball entirely contained in its domain. Momentarily assume that the PDFs do not contain Dirac masses. Define modified domains in which B and B′ have been interchanged without violating Eq. (11). Taking the difference of Eq. (11) defined relative to the original and modified domains yields an expression involving the dimensionality of the measurement space; rearranging this expression yields Eq. (48). Note that each term therein is proportional to the volume of the respective ball, i.e. to the infinitesimal probability mass contained in that ball. Thus, the derivative given by Eq. (48) is the relative change in probability mass exchanged between D and the holdout domain upon swapping the two points.
If we change the class of a point (moving it from one classification domain to the other), it may be necessary to hold out additional points, or it may be possible to move points from the indeterminate domain into the classification domains. In either case, letting B and B′ have the same definitions as before and assuming Eq. (11) holds, one finds a balance relation in which B′ is the ball moved to (+) or from (−) the indeterminate domain, depending on the sign of the first term; note that we also require the corresponding inequality on the local accuracy to hold inside B′. Again taking the limit in which the respective radii are small, one finds that the LHS must be positive and the denominator on the RHS is positive. Thus, the + and − signs on the RHS occur when the local accuracy is greater than or less than 1/2, corresponding to the situations in which probability moves to and from the indeterminate region, respectively. Thus, in assessing when the holdout domain grows, it is sufficient to test whether the local accuracy exceeds 1/2. The analysis of this section is easily generalized to the case of Eq. (15) by noting that, for points on the boundary lines, the balls should be taken as intervals on the line with length twice the radius. This yields the appropriate generalization of the probability associated with those points.
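The growth criterion just derived, namely testing whether the local accuracy exceeds 1/2, reduces to a pointwise comparison of the prevalence-weighted conditional densities. The sketch below uses assumed Gaussian models for illustration; the densities and prevalence are not those of the actual assay.

```python
import numpy as np
from scipy.stats import norm

def local_accuracy(x, q, pdf_neg, pdf_pos):
    """Probability that the more likely class at x is correct,
    given prevalence q and the two conditional densities."""
    p = q * pdf_pos.pdf(x)        # prevalence-weighted positive density
    n = (1 - q) * pdf_neg.pdf(x)  # prevalence-weighted negative density
    return np.maximum(p, n) / (p + n)

# Near the crossover of the two weighted densities the local accuracy
# approaches 1/2, so those points are the first candidates for the
# indeterminate class; far from it, classification is nearly certain.
z_border = local_accuracy(1.0, 0.5, norm(0, 1), norm(2, 1))
z_clear = local_accuracy(4.0, 0.5, norm(0, 1), norm(2, 1))
```

Points with local accuracy near 1/2 are exactly those that the bathtub-type principle holds out first, consistent with the point-swap analysis above.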