Optimal classification and generalized prevalence estimates for diagnostic settings with more than two classes


An accurate multiclass classification strategy is crucial to interpreting antibody tests. However, traditional methods based on confidence intervals or receiver operating characteristics lack clear extensions to settings with more than two classes. We address this problem by developing a multiclass classification scheme, based on probabilistic modeling and optimal decision theory, that minimizes a convex combination of false classification rates. The classification process is challenging when the relative fraction of the population in each class, or generalized prevalence, is unknown. Thus, we also develop a method for estimating the generalized prevalence of test data that is independent of classification of the test data. We validate our approach on serological data with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) naïve, previously infected, and vaccinated classes. Synthetic data are used to demonstrate that (i) prevalence estimates are unbiased and converge to true values and (ii) our procedure applies to arbitrary measurement dimensions. In contrast to the binary problem, the multiclass setting offers wide-reaching utility as the most general framework and provides new insight into prevalence estimation best practices.



Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has highlighted the importance of accurately classifying antibody test results. Most work has focused on labeling data as previously infected (seropositive) or naïve. With the deployment of SARS-CoV-2 vaccines beginning in late 2020, there is a clear need for a classification scheme that correctly distinguishes between naïve, previously infected, and uninfected but vaccinated individuals. However, the traditional diagnostic classification methods of confidence intervals and receiver operating characteristics have no obvious extensions to a multiclass setting.
Current multiclass applications in diagnostic classification are mostly limited to supervised learning and do not address the central role of mathematical modeling in diagnostics. Example studies include the application of support-vector machines to automatically sort endomysial autoantibody tests for celiac disease into one of four classes (Caetano dos Santos et al., 2019) and the use of deep neural networks to label resting-state functional magnetic resonance imaging results with one of six Alzheimer's disease states (Ramzan et al., 2020). However, these approaches may not accurately quantify training-population characteristics or account for the role of prevalence, both of which should inform the classification procedure. In contrast, modeling can overcome this limitation. Binary (two-class) examples include two-dimensional (2D) modeling of antigen targets coupled with optimal decision theory (Patrone and Kearsley, 2021), statistical modeling applied to either antibody or viral-load tests (Böttcher et al., 2022), and an approach to the time-dependent problem for antibody measurements (Bedekar et al., 2022). However, none of these works discussed multiclass extensions.
This paper uses mathematical modeling to fully address the task of multiclass classification in the context of diagnostic testing. We begin by showing that the notion of generalized prevalence, the relative fraction of the population in each class, is fundamental for defining our objective function, the convex combination of false classification rates (Section 3). Minimizing this objective yields optimal classification. Interestingly, we show that these prevalences can be computed without classification by solving a linear system. We validate our methods using a SARS-CoV-2 serological data set with naïve, previously infected, and vaccinated classes (Ainsworth et al., 2020; Wei et al., 2021) (Section 4). We then computationally validate the convergence of generalized prevalence estimates to the true values in mean square and illustrate a generalization to 2D data (Section 5). Finally, the discussion includes further analysis of prevalence estimation, extensions, and limitations (Section 6).

Notation
This work combines set and measure theory with applied diagnostics; readers will likely not be experts in both. So that the ideas of this paper can be readily implemented by diagnostics experts and the applications understood by mathematicians, we provide baseline terminology from both fields.

Definitions from applied diagnostics
• The naïve class comprises individuals that have not been previously infected or vaccinated. In a binary classification, such samples are often referred to as 'negative'.
• The previously infected class comprises individuals with a prior infection but who are unvaccinated. In a binary classification, such samples are often referred to as 'positive'.
• The vaccinated class comprises individuals who have been inoculated against a disease without a prior infection.
• Training data correspond to samples for which the true classes are known. Typically, such data are used to construct conditional probability models.
• Test data correspond to samples for which the true classes are unknown or assumed to be unknown for validation purposes. Typically, a classification procedure is applied to such data.
• Generalized prevalence is the relative fraction of samples in a population that belong to each class.

Definitions from measure theory
• A set is a collection of objects, e.g. measurement values. A domain is a set in a continuous measurement space; see Figure 1 for an example.
• The symbol R denotes the set of all real numbers. The symbol R^m denotes the real coordinate space of dimension m, consisting of all real-valued vectors of length m.
• The symbol ∈ indicates set membership. The expression r ∈ A means r is in set A.
• The symbol ⊂ denotes the subset relationship of two sets. The expression A ⊂ B means that all elements in A are contained in B.
• The use of a superscript C denotes the complement of a set. The set D^C contains all elements in the measurement space not in D.
• The symbol ∅ denotes the empty set, which contains no elements.
• The operator ∪ denotes the union of two sets. The set C = A ∪ B contains all elements in either A or B or both.
• The operator ∩ denotes the intersection of two sets. The set C = A ∩ B contains all elements in both A and B.
• The operator \ denotes the set difference. The set C = A \ B contains all objects in A that are not also in B. An equivalent interpretation is that A \ B is the result of removing the common elements of A and B from A.
• The notation A = {r : ∗} defines the set A as all r that satisfy condition ∗.

Notation specific to this paper
• The set Ω denotes the entire measurement space.
• The label C_j refers to the jth class.
• The generalized prevalence for class C_j is denoted by q_j.
• The set D_j denotes a domain corresponding to C_j.
• The use of a superscript ⋆ denotes an optimal quantity. For example, D^⋆_j could be an optimal classification domain corresponding to class C_j.

Generalized prevalence estimation and multiclass classification
Prevalence estimation and classification rely on the same framework of antibody measurements. For each individual or sample, we represent the corresponding measurements as a vector r = (r_1, ..., r_m) ∈ Ω ⊂ R^m. Here, r could denote measurements of m antibody types targeting different parts of a virus, as measured in median fluorescence intensity (MFI). Let P_j(r) describe the probability that a sample from class C_j yields measurement value r. These conditional probability density functions (PDFs) are assumed known in this section; their construction is considered for example serological data in Section 4.1.
The generalized prevalence q_j is the relative fraction of the population corresponding to class C_j. In what follows we assume there are n classes. The generalized prevalences must sum to 1, which implies

$$\sum_{j=1}^{n} q_j = 1, \tag{1a}$$

or equivalently, for any class k,

$$q_k = 1 - \sum_{j \neq k} q_j. \tag{1b}$$

The probability density Q(r) of a measurement r for a test sample is given by

$$Q(r) = \sum_{j=1}^{n} q_j P_j(r). \tag{2}$$

The product q_j P_j(r) is the probability density that a random sample both belongs to class C_j and has measurement value r; thus, the expression for Q is an instance of the law of total probability. This quantity plays an important role in prevalence estimation and classification.
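As a concrete sketch of (2), the mixture density Q can be assembled from class-conditional models; the Gaussian PDFs and prevalence values below are illustrative assumptions, not the models fitted later in the paper.

```python
import math

# Hypothetical class-conditional Gaussian PDFs P_j(r) for n = 3 classes
# (naive, vaccinated, previously infected); means and sigmas are illustrative.
def gauss(r, mu, sigma):
    return math.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

pdfs = [lambda r: gauss(r, 2.0, 0.6),   # P_1: naive
        lambda r: gauss(r, 5.0, 0.8),   # P_2: vaccinated
        lambda r: gauss(r, 8.0, 0.7)]   # P_3: previously infected

q = [0.5, 0.3, 0.2]  # generalized prevalences, summing to 1 per (1a)

def Q(r):
    """Total measurement density, Q(r) = sum_j q_j P_j(r), per (2)."""
    return sum(qj * Pj(r) for qj, Pj in zip(q, pdfs))

# Q is itself a PDF: it should integrate to ~1 (Riemann sum on a wide grid).
grid = [i * 0.01 for i in range(-500, 1501)]
mass = sum(0.01 * Q(r) for r in grid)
```

Since Q mixes PDFs with weights summing to one, its total mass is one, which the grid sum confirms numerically.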

Generalized prevalence estimation
To demonstrate the importance of prevalence in diagnostic classification, consider the United States population's SARS-CoV-2 antibody response. In early 2020, most samples should have been classified as naïve because the disease prevalence was small. By February 2022, the disease prevalence was estimated at 57.7 % (Clarke et al., 2022); a significant fraction of samples should have been classified as previously infected. Crucially, the same measurement value may be classified differently depending on the disease prevalence. This example shows that prevalence plays an integral role in classifying diagnostic tests and should be estimated before classification of test data. We address this need by designing unbiased estimators for the prevalences {q_j}.
For n classes, consider a partition {D_j} that separates the measurement space Ω into n nonempty domains. It is important to note that these D_j are not classification domains. Define

$$Q_j = \int_{D_j} Q(r)\,\mathrm{d}r = \sum_{k=1}^{n} q_k P_{j,k}, \qquad j = 1, \ldots, n, \tag{3}$$

where

$$P_{j,k} = \int_{D_j} P_k(r)\,\mathrm{d}r. \tag{4}$$

Writing this as a linear system yields

$$Q_j = \sum_{k=1}^{n} P_{j,k}\, q_k, \qquad j = 1, \ldots, n-1. \tag{5}$$

Taking k = n in (1b) implies

$$q_n = 1 - \sum_{j=1}^{n-1} q_j. \tag{6}$$

Substituting (6) into (5) yields the prevalences q as the solution to the system

$$(P - \mathbf{P}_n)\, q = Q - P_n, \tag{7}$$

where q is the vector of length n − 1 whose jth entry is q_j, P is the (n − 1) × (n − 1) matrix whose (i, j)th entry is P_{i,j}, \mathbf{P}_n is the (n − 1) × (n − 1) matrix whose (i, j)th entry is P_{i,n}, Q is the vector of length n − 1 whose jth entry is Q_j, and P_n is the vector of length n − 1 whose jth entry is P_{j,n}. The last prevalence q_n is found via (1b) with k = n. We assume that the inverse of the matrix P − \mathbf{P}_n exists; Section 6.1 further discusses the matrices P and P − \mathbf{P}_n.
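The linear system (7) can be assembled and solved in a few lines. The sketch below uses n = 3 with an illustrative, made-up matrix of domain masses P_{j,k}; given exact Q_j, the true prevalences are recovered exactly.

```python
# Recovering prevalences from the linear system (7) for n = 3 classes.
# P[j][k] = probability mass of class k's PDF in trial domain D_j; these
# numbers are illustrative, with each column summing to 1.
P = [[0.90, 0.08, 0.02],
     [0.08, 0.84, 0.10],
     [0.02, 0.08, 0.88]]
q_true = [0.5, 0.3, 0.2]

# Forward model (3): Q_j = sum_k q_k P_{j,k}
Qvec = [sum(q_true[k] * P[j][k] for k in range(3)) for j in range(3)]

# Build the reduced (n-1)x(n-1) system (P - P_n) q = Q - P_n and solve the
# 2x2 case by Cramer's rule.
A = [[P[i][j] - P[i][2] for j in range(2)] for i in range(2)]
b = [Qvec[i] - P[i][2] for i in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
q1 = (b[0] * A[1][1] - b[1] * A[0][1]) / det
q2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
q_est = [q1, q2, 1.0 - q1 - q2]   # last prevalence from (1b)
```

Because the Q_j here are exact rather than estimated, the recovery is exact up to floating-point error.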
To estimate the generalized prevalences, estimate the Q_j by

$$\hat{Q}_j = \frac{1}{S} \sum_{s=1}^{S} I(r_s \in D_j), \tag{8}$$

where r_s is the sth test-sample measurement. Here, S is the total number of samples and I denotes the indicator function. Substituting Q̂_j for Q_j in (7) yields an estimate q̂_k for q_k. When the PDFs P_j(r) are known and (P − \mathbf{P}_n)^{−1} exists, these estimates q̂_k are unbiased, i.e., E[q̂_k] = q_k. This follows directly from the fact that Q̂_j is a Monte Carlo estimator of Q_j. Further, the generalized prevalence estimates converge to the true values in mean square as the number of samples is increased (Caflisch, 1998). This is illustrated in Section 5.1. We note that the generalized prevalence estimates are not unique due to the arbitrariness of the {D_j}. However, the non-uniqueness allows us to select any reasonable partition over which to find the estimators {Q̂_j}. This is discussed further in Section 6.1.
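A minimal end-to-end sketch of the estimator: draw test samples from a hypothetical three-class Gaussian mixture, form the Monte Carlo estimates (8) on fixed trial domains, and solve the reduced system (7). All models and domain edges below are illustrative assumptions.

```python
import math, random

random.seed(7)

# Illustrative 1D setup: three hypothetical Gaussian classes and trial domains
# D_1 = (-inf, 3.5], D_2 = (3.5, 6.5], D_3 = (6.5, inf).
params = [(2.0, 0.6), (5.0, 0.8), (8.0, 0.7)]
q_true = [0.5, 0.3, 0.2]
edges = (3.5, 6.5)

def cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mass(k, j):
    """P_{j,k} from (4): mass of class k's PDF in domain D_j (exact for Gaussians)."""
    lo = (-math.inf, edges[0], edges[1])[j]
    hi = (edges[0], edges[1], math.inf)[j]
    mu, sg = params[k]
    return cdf(hi, mu, sg) - cdf(lo, mu, sg)

P = [[mass(k, j) for k in range(3)] for j in range(3)]

# Draw S test samples from the mixture and form the Monte Carlo estimates (8)
S = 20000
draws = [random.gauss(*params[random.choices(range(3), weights=q_true)[0]])
         for _ in range(S)]
Qhat = [sum(1 for r in draws if (j == 0 and r <= edges[0])
            or (j == 1 and edges[0] < r <= edges[1])
            or (j == 2 and r > edges[1])) / S for j in range(3)]

# Solve the reduced 2x2 system (7) by Cramer's rule; the last prevalence
# follows from (1b).
A = [[P[i][j] - P[i][2] for j in range(2)] for i in range(2)]
b = [Qhat[i] - P[i][2] for i in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
q1 = (b[0] * A[1][1] - b[1] * A[0][1]) / det
q2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
q_est = [q1, q2, 1.0 - q1 - q2]
```

With S = 20000 the sampling error of each Q̂_j is a few tenths of a percent, so the recovered prevalences land close to the true values without any sample ever being classified.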

Optimal classification
Our task is to define a partition {D_j} (not necessarily the same as for prevalence estimation) of the measurement space Ω such that each domain corresponds to one and only one class C_j. A measurement r is assigned to class j if r ∈ D_j. We require

$$\bigcup_{j=1}^{n} D_j = \Omega \tag{9a}$$

and

$$\mu_\ell(D_j \cap D_k) = 0 \quad \text{for all } j \neq k \text{ and all } \ell, \tag{9b}$$

where μ_ℓ(X) = ∫_X P_ℓ(r) dr. Here, (9a) ensures that any sample can be classified and (9b) enforces single-label classification (up to sets of measure zero). To identify an optimal partition {D^⋆_j}, we construct the loss function

$$\mathcal{L}(D_1, \ldots, D_n) = \sum_{j=1}^{n} q_j \int_{\Omega \setminus D_j} P_j(r)\,\mathrm{d}r. \tag{10}$$

Here, (10) is the generalized prevalence-weighted convex combination of false classification rates as a function of the domains D_j. Intuitively, we expect that a sample with measurement r should be assigned to the class domain D_j to which it has the highest probability of belonging; that is, the one with the highest value of q_j P_j(r) over all j. Accordingly, the loss function (10) penalizes misclassified measurements r with high probability values.
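The resulting optimal rule reduces to an argmax: assign r to the class maximizing q_j P_j(r). A minimal sketch with hypothetical Gaussian class models:

```python
import math

# Sketch of the classification rule implied by minimizing (10): assign r to
# the class with the largest q_j P_j(r). Gaussian models and prevalences are
# illustrative assumptions, not the fitted PDFs of the paper.
def gauss(r, mu, sigma):
    return math.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

params = [(2.0, 0.6), (5.0, 0.8), (8.0, 0.7)]  # naive, vaccinated, prev. infected
q = [0.5, 0.3, 0.2]

def classify(r):
    """Return the index j maximizing q_j P_j(r)."""
    scores = [qj * gauss(r, mu, sg) for qj, (mu, sg) in zip(q, params)]
    return max(range(3), key=scores.__getitem__)
```

Note that the prevalences enter the rule directly: raising q_j enlarges the region assigned to class j, which is why prevalence estimation precedes classification.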
To address situations in which a measurement has an equal highest probability of belonging to two or more classes, we introduce the following set for each class C_j:

$$E_j = \{r : q_j P_j(r) = q_k P_k(r) \ \text{for some } k \neq j \ \text{and} \ q_j P_j(r) \geq q_\ell P_\ell(r) \ \text{for all } \ell\}. \tag{11}$$

In most practical implementations all E_j have measure zero, and the domains

$$D_j^\star = \{r : q_j P_j(r) > q_k P_k(r) \ \text{for all } k \neq j\} \tag{12}$$

minimize the loss function L up to sets of measure zero. The proof is shown in A and involves a straightforward application of set theory; see also Williams and Rasmussen (2006) for similar ideas.
If E_j has nonzero measure, randomly assigning a measurement in E_j to one of the classes to which it has equal maximal probability of belonging does not affect the loss L. In this case, the optimal domains are generalized to

$$D_j^\star = \{r : q_j P_j(r) > q_k P_k(r) \ \text{for all } k \neq j\} \cup Z_{E_j}, \tag{13}$$

where Z_{E_j} is an element of a partition of E_j that we define iteratively as follows.
$$Z_{E_1} = E_1, \qquad Z_{E_j} = E_j \setminus \left( E_1 \cup \cdots \cup E_{j-1} \right), \quad j = 2, \ldots, n. \tag{14}$$

This ensures that no measurement in a set E_j is assigned to more than one optimal domain. Note that by construction, Z_{E_n} is empty: any measurement in E_n is tied with some class k < n, so it also lies in E_k and is removed.
Figure 1 shows a 2D conceptual illustration of {E_j}, which are the lines delineating the optimal regions. In 2D, line segments have Lebesgue measure zero. Thus, classification follows (12). Note that the "multipoint" at which the lines meet has equal probability of belonging to all three classes. We discuss this further in Section 6.2.

Example applied to SARS-CoV-2 antibody data with three classes
To demonstrate the concepts developed in Section 3, we apply our methods to serological data with three classes. Publicly available data sets associated with Ainsworth et al. (2020) and Wei et al. (2021) provide previously infected, naïve, and vaccinated antibody measurements. The vaccine data (Wei et al., 2021) are recorded for individuals who were inoculated with one of two vaccines. We refer to these as Vaccine A and Vaccine B and analyze the populations separately and together. The studies provide SARS-CoV-2 anti-spike immunoglobulin G (IgG) antibody measurements; see B for measurement details. We use one-dimensional (1D) data to illustrate a straightforward multiclass example; Section 5.2 demonstrates that our analysis holds for higher measurement dimensions.
All data are transformed to a logarithmic scale as follows:

$$\tilde{r} = \log_2(1 + r). \tag{15}$$

Here, r and r̃ represent the original and log-transformed values; r has units of ng/mL and r̃ is nondimensional. This transformation puts the data on the scale of bits and allows for better viewing of measurements that range over several decades of MFI values in the original units. Wei et al. (2021) truncated vaccinated samples, with lower and upper transformed limits of 1 and roughly 8.
Figure 2 shows a histogram of the data with the vaccinated category split by vaccine manufacturer (Figure 2a) and combined (Figure 2b). Previously infected samples have the largest spike IgG antibody levels and naïve samples the smallest; the vaccinated class falls in the middle. The vaccinated class overlaps with some naïve and previously infected samples. Due to the truncation of vaccinated measurements, the corresponding right-most histogram bin contains many samples. We separate the data into randomly generated training (80 % of samples) and test (20 %) populations.

Conditional probability distributions
We fit probability distributions to the training data to model the naïve, previously infected, and vaccinated antibody responses. For our purposes, we assume these are distinct classes; a sample belongs to one and only one of the three categories. To construct the conditional PDF for each population, we select a parameterized model that empirically characterizes the shape and spread of the samples. We determine parameters separately for the three training populations by maximum likelihood estimation (MLE). The naïve training population is fit to a Burr (type XII) distribution,

$$P_N(r) = \frac{kc}{\alpha} \frac{(r/\alpha)^{c-1}}{\left[1 + (r/\alpha)^{c}\right]^{k+1}}, \qquad r > 0, \tag{16}$$

which describes a right-skewed sample population. The previously infected training population is fit to a stable distribution described by the characteristic function

$$\varphi(t) = \exp\!\left( i \delta t - |\gamma t|^{\alpha} \left[ 1 - i \beta\, \mathrm{sgn}(t) \tan(\pi \alpha / 2) \right] \right) \tag{17}$$

for α ≠ 1. Here, i is the imaginary unit and sgn is the sign function, which returns +1, −1, or 0 according to the sign of its argument. This distribution describes a left-skewed sample population.
We fit the vaccinated training populations to an extreme-value distribution after observing the mostly symmetric shape of the data with a spike at the right truncation limit:

$$P_V(r) = \frac{1}{\sigma} \exp\!\left(\frac{r - \mu}{\sigma}\right) \exp\!\left[-\exp\!\left(\frac{r - \mu}{\sigma}\right)\right]. \tag{18}$$

We apply data censoring to better fit the truncated data; this is described in C. Figure 3 shows the conditional PDFs, represented as continuous curves, trained on the three-class training data with a Vaccine A vaccinated class. The blue, red, and black curves correspond to the naïve, previously infected, and vaccinated models. The effect of truncating the data at the upper limit is visible in the right-most bin of the vaccinated class histogram; this is accounted for by the data censoring. As a result, the vaccinated class PDF exhibits spikes at the upper and lower truncation values. This spike is an artifact of the original data collection process and not a typical feature of such data.
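To illustrate the MLE step, the sketch below fits a minimum-type extreme-value (Gumbel) model to synthetic data by a crude grid search over the log-likelihood. This is a stand-in for a standard MLE routine, and the true parameters, sample size, and search ranges are all assumptions for the example.

```python
import math, random

random.seed(3)

# Synthetic "vaccinated-like" data from a minimum-type Gumbel model; the
# parameters below are illustrative, not the paper's fitted values.
mu_true, sigma_true = 5.0, 0.8

def sample():
    # Inverse-CDF sampling for F(x) = 1 - exp(-exp((x - mu)/sigma))
    u = random.random()
    return mu_true + sigma_true * math.log(-math.log(1.0 - u))

data = [sample() for _ in range(1500)]

def neg_log_lik(mu, sigma):
    # log f(x) = z - exp(z) - log(sigma), with z = (x - mu)/sigma
    z = [(x - mu) / sigma for x in data]
    return -sum(zi - math.exp(zi) - math.log(sigma) for zi in z)

# Crude grid search over (mu, sigma); a real fit would use a proper optimizer.
best = min(((mu, sg) for mu in [4.5 + 0.02 * i for i in range(51)]
                     for sg in [0.5 + 0.02 * j for j in range(31)]),
           key=lambda p: neg_log_lik(*p))
mu_hat, sigma_hat = best
```

With 1500 samples the maximum-likelihood estimates land close to the generating parameters, which is the behavior MLE fitting of (16)-(18) relies on.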

Generalized prevalence estimation
Recall that prevalence estimation of test data requires a partition that separates the measurement space Ω into n nonempty domains. Here, the number of classes is n = 3. We create a partition using k-means clustering with k = 3, which assigns each measurement to the cluster with the closest mean. Figure 4 shows the partition for our test data set with a Vaccine A vaccinated class. The clustering separates the three populations reasonably well; see Section 6.1 for the importance of this statement. The partition need not perfectly separate the data by class to estimate prevalences with high accuracy. We estimate generalized prevalences for the test data via (7) and record true and estimated values in Table 1.
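A one-dimensional k-means partition of this kind can be sketched in a few lines (Lloyd's algorithm); the synthetic three-cluster data and the initial centers below are illustrative.

```python
import random

random.seed(1)

# Synthetic 1D data with three well-separated groups (illustrative).
data = ([random.gauss(2.0, 0.4) for _ in range(300)] +
        [random.gauss(5.0, 0.4) for _ in range(300)] +
        [random.gauss(8.0, 0.4) for _ in range(300)])

def kmeans_1d(xs, centers, iters=50):
    """Lloyd's algorithm: assign to nearest center, then recompute means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

centers = kmeans_1d(data, centers=[1.0, 4.0, 9.0])
# In 1D, the trial-domain boundaries sit midway between adjacent centers.
edges = [(centers[0] + centers[1]) / 2, (centers[1] + centers[2]) / 2]
```

The two edges define the three trial domains {D_j} used in (7); as noted above, the partition only needs to separate the populations reasonably well, not perfectly.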

Optimal classification
We classify the training data using known generalized prevalences via (12). Figure 5a shows the optimal domains, labeled D^⋆_N, D^⋆_V, and D^⋆_P, for a Vaccine A vaccinated class. For this 1D example with three classes, the optimal classification domain boundaries can be represented by upper and lower threshold levels. Samples with measurements below the smaller level are classified as naïve, samples with measurements between the thresholds as vaccinated, and samples with measurements above the larger level as previously infected. All three populations have overlapping PDFs, which reduces classification accuracy. Accurate classification of test data is possible with reasonably close prevalence estimates. We classify the test data using estimated generalized prevalences and display the optimal classification domains for a Vaccine A vaccinated class in Figure 5b. Training and test data classification errors are recorded in Table 2. Averaged over the three treatments of the vaccinated class, the training data error is 7.61 % and the test data error is 5.11 %.
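In 1D, the two thresholds are the points where adjacent prevalence-weighted densities cross, which can be located by bisection; the Gaussian models and prevalences below are illustrative stand-ins for the fitted PDFs.

```python
import math

# Locating the two 1D thresholds where q_j P_j(r) = q_k P_k(r) for adjacent
# classes. Gaussian models and prevalences are illustrative assumptions.
def gauss(r, mu, sigma):
    return math.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

params = [(2.0, 0.6), (5.0, 0.8), (8.0, 0.7)]  # naive, vaccinated, prev. infected
q = [0.5, 0.3, 0.2]

def crossing(j, k, lo, hi, tol=1e-10):
    """Bisection on g(r) = q_j P_j(r) - q_k P_k(r), assuming one sign change in [lo, hi]."""
    g = lambda r: q[j] * gauss(r, *params[j]) - q[k] * gauss(r, *params[k])
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(lo) * g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

lower = crossing(0, 1, 2.0, 5.0)   # naive / vaccinated threshold
upper = crossing(1, 2, 5.0, 8.0)   # vaccinated / previously infected threshold
```

Because the thresholds depend on the q_j, re-estimating the prevalences shifts both boundaries; this is the mechanism by which prevalence feeds into classification.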

Computational validation
We numerically demonstrate two important features of our generalized prevalence estimation and multiclass optimal classification procedures. First, we show the convergence of our prevalence estimates to the true values as the number of samples is increased. Second, we present a 2D tri-class problem to show how the method generalizes to higher-dimensional measurement spaces.

Convergence of prevalence estimates
We use our probability models (16)-(18) to generate synthetic data sets whose relative frequencies match the generalized prevalences of the Ainsworth et al. (2020) and Wei et al. (2021) data. We systematically increase the number of synthetic data points while holding generalized prevalences fixed to study the effect of sample size on prevalence convergence. For each number of points, we generate 1000 synthetic data sets and compute statistics on our results.
Figure 6 shows our analysis for the Vaccine A vaccinated class. Figure 6b shows a boxchart of the statistics for 10^2, 10^3, 10^4, and 10^5 samples. The estimates have more outliers and variation when few samples are used; both decrease as the number of points is increased. Even with few samples, the median generalized prevalence estimates are close to the true generalized prevalences. Table 3 records the means and standard deviations of our results. Even for only 1000 samples, our estimates agree with the true generalized prevalences to roughly 2 % relative error.
Figure 6c plots the standard deviation of the prevalence estimate error on a log-log scale against the number of samples. The standard deviation should decrease with the inverse square root of the number of samples (Caflisch, 1998), which is plotted for comparison. Our empirical convergence rates all agree with the theory through 10,000 samples; the rate is maintained for the Vaccine A vaccinated class.
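The Monte Carlo convergence rate can be checked empirically: replicating the estimate at two sample sizes, the standard deviation should shrink roughly by the square root of the sample-size ratio. The setup below (Gaussian classes, fixed trial domains) is illustrative.

```python
import math, random

random.seed(0)

# Empirical check that the prevalence-estimate error shrinks like 1/sqrt(S).
params = [(2.0, 0.6), (5.0, 0.8), (8.0, 0.7)]  # illustrative Gaussian classes
q_true = [0.5, 0.3, 0.2]
edges = (3.5, 6.5)

def cdf(x, mu, sg):
    return 0.5 * (1.0 + math.erf((x - mu) / (sg * math.sqrt(2.0))))

lo = (-math.inf, edges[0], edges[1]); hi = (edges[0], edges[1], math.inf)
P = [[cdf(hi[j], *params[k]) - cdf(lo[j], *params[k]) for k in range(3)]
     for j in range(3)]

def estimate(S):
    """One prevalence estimate of q_1 from S random test samples."""
    counts = [0, 0, 0]
    for _ in range(S):
        r = random.gauss(*params[random.choices(range(3), weights=q_true)[0]])
        counts[0 if r <= edges[0] else 1 if r <= edges[1] else 2] += 1
    Qhat = [c / S for c in counts]
    A = [[P[i][j] - P[i][2] for j in range(2)] for i in range(2)]
    b = [Qhat[i] - P[i][2] for i in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return (b[0] * A[1][1] - b[1] * A[0][1]) / det

def std_at(S, reps=200):
    es = [estimate(S) for _ in range(reps)]
    m = sum(es) / reps
    return math.sqrt(sum((e - m) ** 2 for e in es) / reps)

s100, s1600 = std_at(100), std_at(1600)
ratio = s100 / s1600   # Monte Carlo theory predicts ~ sqrt(1600/100) = 4
```

The observed ratio fluctuates around 4, consistent with the inverse-square-root rate plotted in Figure 6c.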

Generalization to higher dimensions
We now explore a 2D synthetic numerical validation of generalized prevalence estimation and multiclass optimal classification. See Luke et al. (2022) for a discussion of the implications of higher-dimensional modeling on diagnostic testing accuracy. The synthetic values we use are modeled on the receptor-binding domain (RBD) and nucleocapsid (N) SARS-CoV-2 antibody targets; together these form a measurement pair r. Details about the models and information about the data are given in D. Figure 7a shows an example of 2D synthetic antibody measurements with naïve, previously infected, and vaccinated classes with true prevalences of 0.3, 0.2, and 0.5. The conditional PDFs are shown as contour lines of constant probability. We use 1000 total synthetic samples.
To quantify uncertainty in the prevalence estimates, we randomly generate 1000 synthetic sets of samples using fixed prevalences. We then partition the measurement space via k-means clustering using one synthetic sample set (see Figure 7b), fix the partition, and use (7) to generate prevalence estimates for all sets. The results are shown in Table 4. Figure 8 shows histograms of the generalized prevalence estimates and true values, which fall within the middle of each distribution. We classify using these estimated prevalences via (12) and find an average error of 1.58 %. Figure 7c shows example optimal classification domains. The gold region is the previously infected domain, the purple region is the vaccinated domain, and the remainder of the measurement space, colored in light blue, defines the naïve domain. For this example, the false classification rate is 1.8 %.

Limiting cases of prevalence estimation and implications for assay design
An interesting observation about our prevalence estimation scheme is that the structure of the matrix underpinning the linear system encodes information about overlap between populations. As such, the matrix potentially informs best practices for prevalence estimation. Further, our method extends the binary procedure of Patrone and Kearsley (2021) and may provide insight into the simpler setting. Here we examine limiting cases of prevalence estimation and connect characteristics of the matrix P to assay accuracy.
We explore interpretations of equivalent definitions of singularity of the matrix P. Recall that the quantity P_{j,k} gives the probability mass of class k falling in domain D_j. If all elements of a row (column) of the matrix P are zero, the probability of any measurement value falling in (belonging to) the corresponding domain (class) is zero. If the columns of P are linearly dependent, the probability of a sample belonging to class C_k having a measurement in domain D_j is a linear combination of the probabilities of samples belonging to all other classes having measurements in domain D_j. This occurs for a choice of partition where all points fall in a single domain D_j. In this extreme case, there is an apparent dependence (in the linear algebra sense) of the measurement values of different classes. As a related example, for the 1D SARS-CoV-2 antibody data from Ainsworth et al. (2020) and Wei et al. (2021), we can construct a partition where one trial domain is empty, both P and P − \mathbf{P}_n are singular, and therefore prevalence estimation is not possible.
To avoid this situation, one should select nonempty trial domains, i.e., training data should lie in each element of the partition. In the limiting case that P_{ij} = 0, the measurement of a sample in class C_j has zero probability of falling in domain D_i. The most extreme separation of training data occurs when the PDFs have nonzero support only on mutually exclusive elements of the partition. In this setting, the matrix P is a permutation matrix, and the prevalence estimates are merely the relative fractions Q of measurements in each domain. If the partition elements are correctly matched to the classes, this extreme separation corresponds to a perfect assay because there are no misclassifications. We note that the matrices that result from a k-means partition of the 1D SARS-CoV-2 tri-class data are close to permutation matrices; one example is

$$P = \begin{pmatrix} 0.9749 & 0.0058 & 0.1055 \\ 0.0237 & 0.0495 & 0.8101 \\ 0.0015 & 0.9447 & 0.0844 \end{pmatrix}. \tag{19}$$

Selection of trial domains with a high degree of class separation may be a key to our low-error prevalence estimates. We speculate that under certain conditions a matrix P that is a permutation matrix may be optimal in the sense that it minimizes the prevalence estimate error. In 1D, it may be possible to construct an optimization in terms of the samples assigned to each element of the partition, such as

$$\underset{f_1, f_2}{\operatorname{argmin}} \left\| P_\pi P - I \right\|, \tag{20}$$

where f_1(x) and f_2(x) are indicator functions determining which samples are assigned to elements 1 and 2 of the partition (without loss of generality). Here, P_π P is the row permutation of P closest to the identity matrix, I. For the matrix P given by (19), for example, P_π = [e_1; e_3; e_2], where e_j is the jth standard basis vector. We leave a search for the minimum prevalence error estimate to future work; see Patrone and Kearsley (2022) for an approach to the binary case. The extension of their work to the multiclass setting is not obvious because the objective function to minimize can be generalized in many different ways.
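Finding the row permutation P_π closest to the identity is a small assignment problem; for n = 3 a brute-force search over all permutations suffices, illustrated here on the matrix in (19).

```python
from itertools import permutations

# The k-means-derived matrix from (19).
P = [[0.9749, 0.0058, 0.1055],
     [0.0237, 0.0495, 0.8101],
     [0.0015, 0.9447, 0.0844]]

def closest_row_permutation(M):
    """Return the row ordering minimizing the Frobenius distance of the
    permuted matrix to the identity (brute force; fine for small n)."""
    n = len(M)
    def dist(perm):
        return sum((M[perm[i]][j] - (1.0 if i == j else 0.0)) ** 2
                   for i in range(n) for j in range(n))
    return min(permutations(range(n)), key=dist)

perm = closest_row_permutation(P)
```

For (19) the search keeps row 1 first and swaps rows 2 and 3, matching P_π = [e_1; e_3; e_2]; for larger n, a proper assignment-problem solver would replace the brute force.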
As a final note on extreme cases of prevalence estimation, in expectation, the problem is unconstrained.The constrained problem may be a viable alternative when it is known that one prevalence is close to zero.

Local accuracy
Recall that in Section 3.2 we needed to consider sets of measurements with equal probability of belonging to more than one class. This concept is related to local accuracy, Z, which compares the probability that a test sample belongs to a particular class and has measurement r to the measurement density of a test sample with measurement r. We generalize the binary version from Patrone et al. (2022) to the multiclass setting:

$$Z(r, D_1, \ldots, D_n) = \sum_{k=1}^{n} I(r \in D_k)\, \frac{q_k P_k(r)}{Q(r)}, \tag{21}$$

where {D_k} partitions Ω. Let Z^⋆(r) = Z(r, D^⋆_1, ..., D^⋆_n) be the local accuracy of the optimal solution to the multiclass problem. It is straightforward to show that 1/n ≤ Z^⋆ ≤ 1. Due to optimality, if r ∈ D^⋆_k, we have q_k P_k(r) ≥ q_j P_j(r) for j ≠ k. Then

$$Q(r) = \sum_{j=1}^{n} q_j P_j(r) \leq n\, q_k P_k(r),$$

and so

$$Z^\star(r) = \frac{q_k P_k(r)}{Q(r)} \geq \frac{1}{n}.$$

In the multiclass setting, we have Z^⋆ = 1/n when

$$q_1 P_1(r) = q_2 P_2(r) = \cdots = q_n P_n(r).$$

We will refer to such an r, if it exists, as a multipoint of the optimal domains. The lower bound on Z^⋆ is only attained at a multipoint. To see this, consider some measurement v that is not a multipoint. Then there exist j, k ∈ {1, ..., n}, j ≠ k, such that q_j P_j(v) < q_k P_k(v). Then, since the classification is optimal, v ∉ D^⋆_j and v ∈ D^⋆_m for some m (it may be that m = k). Clearly, q_j P_j(v) < q_m P_m(v). Further, q_ℓ P_ℓ(v) ≤ q_m P_m(v) for ℓ ≠ j. It follows that

$$Q(v) = \sum_{\ell=1}^{n} q_\ell P_\ell(v) < n\, q_m P_m(v),$$

and so Q(v) < n q_m P_m(v), which gives 1/n < Z^⋆(v) for a non-multipoint. The concept of local accuracy could be used to decide which values to hold out in an indeterminate class in order to meet a global accuracy target (see Patrone et al., 2022). For any measurement, we can compute what the local accuracy would be if we chose to assign it to each class in turn. Using the SARS-CoV-2 tri-class example as an illustration, conducting this procedure on a measurement r may result in, say, similarly high local accuracies for the previously infected and vaccinated classes, but a low local accuracy for the naïve class. In this situation and without an optimal classification scheme, we may not feel confident labeling the sample as previously infected or as
vaccinated, since their probabilities are similar, but we can say the sample is almost certainly not naïve. This leads naturally to the observation that any subset of classes can be combined to make a new class. In particular, we can reduce the problem to a binary classifier. For our serological example, perhaps it is desirable to consider previously infected and vaccinated samples together, or equally possible that it is difficult to tell them apart for a particular assay, and so our goal becomes to classify them separately from naïves. This reduction of the problem size by combining classes is in a sense a projection onto a lower class space. Specifically, consider

$$Q(r) = \tilde{q}_1 \tilde{P}_1(r) + \tilde{q}_2 \tilde{P}_2(r), \tag{22}$$

where P̃_1(r) and P̃_2(r) are newly created PDFs with associated prevalences q̃_1 and q̃_2 = 1 − q̃_1. The task becomes to find q̃_1, for which there may be an optimal strategy. The diagnostic community's current analog to local accuracy is the concept of a likelihood ratio (LR), calculated as s_e/(1 − s_p) for a previously infected test result, where s_e and s_p represent sensitivity and specificity. The previously infected LR can be interpreted as the ratio of probabilities of correctly to incorrectly predicting a previously infected result (Riffenburgh, 2011). These values use average population information through s_e and s_p, which may not always be available or representative. In contrast, local accuracy uses local information, since it is conditioned on knowing individual measurement values.
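The local accuracy of the optimal rule, Z^⋆(r) = max_j q_j P_j(r)/Q(r), and its bounds 1/n ≤ Z^⋆ ≤ 1 are easy to check numerically; the Gaussian class models below are illustrative assumptions.

```python
import math

# Local-accuracy sketch for the optimal rule: Z*(r) = max_j q_j P_j(r) / Q(r).
# Gaussian models and prevalences are illustrative.
def gauss(r, mu, sigma):
    return math.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

params = [(2.0, 0.6), (5.0, 0.8), (8.0, 0.7)]
q = [0.5, 0.3, 0.2]

def local_accuracy(r):
    w = [qj * gauss(r, mu, sg) for qj, (mu, sg) in zip(q, params)]
    return max(w) / sum(w)   # the assigned class is the argmax, so Z* = max/sum

# Z* stays within [1/n, 1] across the measurement range and dips toward 1/2
# near pairwise class boundaries (where only two classes compete).
zs = [local_accuracy(0.05 * i) for i in range(0, 201)]
```

Measurements deep inside a class region have Z^⋆ near 1, while values near domain boundaries have low Z^⋆ and are natural candidates for an indeterminate hold-out class.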

Extensions
Multiclass methods are readily equipped to handle further stratification of antibody data, such as by age group, biological sex, or coronavirus disease 2019 (COVID-19) booster status. An additional class could be added for individuals who are both vaccinated and previously infected. Studies have demonstrated a greater antibody response post-vaccination for previously infected versus COVID-19 naïve recipients (Dalle Carbonare et al., 2021; Narowski et al., 2022); this could allow these populations to be distinguished by our classification scheme. Further, we minimize the prevalence-weighted combination of misclassifications, but the optimization problem can be rewritten for any desired objective function. Reformulations include "rule-in" or "rule-out" tests that meet desired sensitivity or specificity targets (Florkowski, 2008). Our methods may even be generalizable to multi-label classification, in which a sample can be assigned to more than one class; we anticipate challenges designing the corresponding optimization problem. Finally, the methods presented here can be applied to any setting where class-size estimation and population labeling are required; an example is cell sorting in flow cytometry.

Limitations
Model selection is inherently subjective, although Schwartz (1967) showed that the error goes to zero as more data points are added. As the number of antibody measurements increases, corresponding to viewing the data in higher dimensions, additional modeling choices become available. Patrone and Kearsley (2021) suggest the possibility of minimizing misclassifications over a family of models; see also Smith (2013) for a discussion of model-form errors. Classification accuracy and prevalence estimation for the 1D data sets from Ainsworth et al. (2020) and Wei et al. (2021) suffer from overlap between their spike IgG values. If more measurements were available per sample, modeling the data in a higher dimension could improve class separation and thereby lower error rates (see Luke et al., 2022). Further, our models do not account for time dependence. This is important when classifying antibody tests, as antibody levels are known to have a half-life on the order of several months post infection or vaccination (Xia et al., 2021; Kwok et al., 2022). See Bedekar et al. (2022) for a time-dependent approach to the binary setting.

Implications for assay developers
We have solved the multiclass diagnostic classification problem, which was previously unresolved. Antibody measurements from vaccinated individuals can now be distinguished from previously infected and naïve samples.
Our work is the first to obtain unbiased predictions of the relative fractions of vaccinated, previously infected, and naïve individuals in a population. These estimates improve as more samples are added. Best practices for conducting these predictions include dividing the range of all possible measurement values into nonempty regions that create separation between samples of neighboring regions. This can be achieved using off-the-shelf clustering algorithms. Our procedure hinges on selecting probability distributions to model the training populations, which can be done automatically for measurements of a single antibody target in several open-source programming languages. Our classification scheme is also easily implementable and can be modified to prioritize specificity if desired. Regardless of the reformulation, the error is minimized by construction.
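To make the region-based best practice above concrete, the sketch below partitions a 1D measurement axis into k = 3 nonempty regions with a minimal k-means implementation and reports the fraction of test samples landing in each region (the quantities that feed the prevalence estimate). The data, function names, and initialization scheme are illustrative assumptions, not the authors' code.

```python
def kmeans_1d(values, k=3, iters=50):
    """Minimal 1D k-means: partition the measurement axis into k regions.

    Initialization is deterministic (spread-out quantiles of the sorted
    data) so that, for well-separated populations, the regions are
    nonempty, as the text recommends.
    """
    svals = sorted(values)
    centers = [svals[int(i * (len(svals) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each sample to its nearest center
            i = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[i].append(v)
        # recompute centers; keep the old center if a cluster empties
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def region_fractions(values, clusters):
    """Fraction of test samples falling in each clustered region."""
    return [len(c) / len(values) for c in clusters]

# toy 1D "antibody" data: three well-separated synthetic populations
data = ([0.1 + 0.01 * i for i in range(50)]
        + [1.0 + 0.01 * i for i in range(30)]
        + [2.0 + 0.01 * i for i in range(20)])
centers, clusters = kmeans_1d(data, k=3)
fracs = region_fractions(data, clusters)
```

For well-separated toy populations like these, the recovered region fractions match the true population fractions, which is the property the prevalence estimator relies on.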

Acknowledgements
This work is a contribution of the National Institute of Standards and Technology and is not subject to copyright in the United States. R.L. was funded through the NIST PREP grant 70NANB18H162. The aforementioned funder had no role in study design, data analysis, decision to publish, or preparation of the manuscript. Use of data provided in this paper has been approved by the NIST Research Protections Office (IRB no. ITL 20202057). The authors wish to thank Drs. Daniel Anderson and Eric Shirley for useful discussions during preparation of this manuscript.

Data Availability
Analysis scripts and data developed as a part of this work are available upon reasonable request. Original data are provided in Ainsworth et al. (2020) and Wei et al. (2021).

Declarations of Competing Interests
The authors have no competing interests to declare.

Ainsworth et al. (2020) provide previously infected and naïve samples (which they refer to as positive and negative). Ainsworth et al. (2020) used a true negative rate of 99 % to set a positivity threshold of 8 million MFI units. Previously infected serological samples were taken at least 20 days post symptom onset from individuals whose infections were confirmed via reverse transcription polymerase chain reaction (RT-PCR) of a nose or throat swab. The naïve serological samples were collected pre-pandemic.

Wei et al. (2021) provide vaccinated serological samples. Measurements were recorded from individuals who had been inoculated with either the Oxford-AstraZeneca ChAdOx1 nCoV-19 (Vaccine A) or the Pfizer-BioNTech BNT162b2 (Vaccine B) vaccine. The vaccinated samples were collected at various time points relative to the first inoculation, ranging from 14 days prior to 66 days after. To control for variation resulting from the length of time after inoculation, we use only data from 28 days post first dose. Wei et al. (2021) truncated vaccinated measurements below 2 ng/mL (0.4 % of total data) and above 500 ng/mL (7 % of total data). The truncation was not applied to the previously infected and naïve samples because they were first reported in Ainsworth et al. (2020). In total, there are 976 naïve, 536 previously infected, and 686 vaccinated samples (362 Vaccine A and 324 Vaccine B samples taken at the 28-day mark).

C. Data censoring
Given a general distribution $f(x; \phi)$ with parameters $\phi$ that fits the data between the truncation limits $x_{\min} < x < x_{\max}$, the probability that a measurement is above the upper truncation limit is
$$p_{\max} = \int_{x_{\max}}^{\infty} f(x; \phi)\, dx,$$
and the probability that a measurement is below the lower truncation limit is
$$p_{\min} = \int_{-\infty}^{x_{\min}} f(x; \phi)\, dx.$$
These definitions of $p_{\min}$ and $p_{\max}$ are used in the updated likelihood function
$$\mathcal{L}(\phi) = p_{\min}^{S_{\min}}\, p_{\max}^{S_{\max}} \prod_{j} f(x_j; \phi),$$
where $S_{\min}$ and $S_{\max}$ count the samples at or beyond the lower and upper truncation limits and the product runs over the interior samples. The censored PDF can be written as
$$f_c(x; \phi) = p_{\min}\, \delta(x - x_{\min}) + f(x; \phi)\, \mathbb{1}_{x_{\min} < x < x_{\max}} + p_{\max}\, \delta(x - x_{\max}),$$
where $\delta$ is the Dirac delta and $\mathbb{1}$ is the indicator function.
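The censoring construction can be sketched in code. The example below fits a deliberately simple stand-in model, an exponential distribution with rate lambda, which is our illustrative choice and not one of the paper's fitted distributions: interior samples contribute the density directly, while samples at or beyond the truncation limits contribute p_min or p_max, and the rate is chosen by maximizing the resulting censored likelihood over a grid.

```python
import math

def censored_loglik(lam, interior, n_min, n_max, x_min, x_max):
    """Censored log-likelihood for an exponential stand-in model
    f(x; lam) = lam * exp(-lam * x):
    censored samples contribute log(p_min) or log(p_max)."""
    p_min = 1.0 - math.exp(-lam * x_min)   # P(X <= x_min)
    p_max = math.exp(-lam * x_max)         # P(X >= x_max)
    ll = n_min * math.log(p_min) + n_max * math.log(p_max)
    ll += sum(math.log(lam) - lam * x for x in interior)
    return ll

def fit_rate(interior, n_min, n_max, x_min, x_max):
    """Grid-search MLE of the rate (a sketch; real fits would optimize)."""
    grid = [0.01 * i for i in range(1, 500)]
    return max(grid, key=lambda lam: censored_loglik(
        lam, interior, n_min, n_max, x_min, x_max))

# deterministic quasi-sample from Exp(1) via inverse-CDF midpoints
n = 200
xs = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
x_min, x_max = 0.05, 3.0
interior = [x for x in xs if x_min < x < x_max]
n_min = sum(1 for x in xs if x <= x_min)
n_max = sum(1 for x in xs if x >= x_max)
lam_hat = fit_rate(interior, n_min, n_max, x_min, x_max)
```

Because censored samples still enter the likelihood through p_min and p_max, the recovered rate stays close to the true value even though the tails are only counted, not observed.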

D. 2D synthetic probability models
Our 2D exploration builds off work by Patrone and Kearsley (2021), which used data from an assay developed in Liu et al. (2020). The two dimensions correspond to measurements taken at the receptor-binding domain (RBD), a substructure of the spike protein of SARS-CoV-2, and the nucleocapsid (N) protein that stabilizes the viral RNA. Values are recorded as MFIs and follow the same logarithmic scale given by (15). The log-transformed RBD values are x and N values are y. All model parameters are determined via MLE. The naïve samples should have relatively low values of both RBD and N since they have no specific immune response to SARS-CoV-2. However, we still expect small naïve signals due to cross-reactivity with other coronaviruses such as NL63 and HKU1, which cause common colds. In contrast, previously infected samples should have produced a relatively high immune response to both RBD and N.
It is natural to expect some correlation between the signals, which motivates a change of variables $z = (x + y)/\sqrt{2}$, $w = (x - y)/\sqrt{2}$. This will cause the data to be distributed along the diagonal. Our naïve distribution $N$ is given by
$$N(z, w) = \frac{z^{k-1} e^{-z/\theta}}{\Gamma(k)\,\theta^{k}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_w} \exp\!\left(-\frac{(w - \mu_w)^2}{2\sigma_w^2}\right), \qquad \sigma_w = \alpha\, e^{z/\beta}.$$
This is a hybrid gamma-normal distribution in which the standard deviation of the variable corresponding to the direction perpendicular to the diagonal, $w$, depends on the variable corresponding to the diagonal, $z$: $\sigma_w = \alpha \exp(z/\beta)$. This allows the data to fan out slightly away from the origin along the diagonal.
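A minimal sampler illustrating this construction is sketched below. The rotation and the relation sigma_w = alpha * exp(z / beta) follow the text; the specific parameter values and function names are illustrative assumptions rather than the fitted MLE values.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def to_diagonal(x, y):
    """Rotate (x, y) into along-diagonal (z) and across-diagonal (w) coords."""
    return (x + y) / SQRT2, (x - y) / SQRT2

def from_diagonal(z, w):
    """Inverse rotation back to (x, y) measurement coordinates."""
    return (z + w) / SQRT2, (z - w) / SQRT2

def sample_naive(n, shape=2.0, scale=1.0, alpha=0.1, beta=4.0, seed=1):
    """Sketch of the hybrid gamma-normal sampler: z ~ Gamma(shape, scale),
    then w ~ Normal(0, sigma_w) with sigma_w = alpha * exp(z / beta),
    so the cloud fans out along the diagonal as z grows.
    Parameter values are illustrative, not fitted."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        z = rng.gammavariate(shape, scale)
        w = rng.gauss(0.0, alpha * math.exp(z / beta))
        pts.append(from_diagonal(z, w))
    return pts
```

Because the change of variables is an orthogonal rotation, it is exactly invertible, so densities specified in (z, w) translate directly back to the (RBD, N) measurement plane.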
The previously infected distribution $P$ is given by
$$P(\zeta, w) = \frac{\zeta^{a-1}(1-\zeta)^{b-1}}{B(a, b)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_w} \exp\!\left(-\frac{(w - \mu_w)^2}{2\sigma_w^2}\right), \qquad \sigma_w = \theta\sqrt{\zeta}.$$
This is a hybrid beta-normal distribution. Here, we use a different change of variables for the diagonal direction to rescale the beta distribution to its domain $[0, 1]$: $\zeta = (x + y)/(9\sqrt{2})$. Again, the standard deviation of the second variable $w$ depends on the first, $\zeta$: $\sigma_w = \theta\sqrt{\zeta}$. This is a modeling choice that allows a slightly wider distribution for large N and RBD values. The beta distribution is selected because we recognize that the previously infected samples should span the entire diagonal line $y = x$.
When creating a synthetic vaccinated category, we consider characteristics that set vaccinated and previously infected individuals' antibody measurements apart. Since the vaccines target the binding site on the spike protein, we expect both vaccinated and previously infected individuals to have high RBD values. However, only naturally infected individuals should exhibit high N values. Thus, we create a vaccinated category that is clustered in the bottom right of an N vs. RBD plot.
For the vaccinated ($V$) population, we use a hybrid Weibull-normal distribution:
$$V(z, w) = \frac{p}{\lambda}\left(\frac{z}{\lambda}\right)^{p-1} e^{-(z/\lambda)^{p}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_w} \exp\!\left(-\frac{(w - \mu_w)^2}{2\sigma_w^2}\right), \qquad \sigma_w = \alpha\, e^{z/\beta}.$$
The standard deviation of the second variable $w$ depends on the first, $z$, and takes the same form as that for the naïve distribution.

The optimal naïve, vaccinated, and previously infected domains are labeled D⋆_N, D⋆_V, and D⋆_P. The optimal decision boundary separating the vaccinated and previously infected classes for both the Vaccine B and combined vaccinated samples lies at the upper truncation limit of the vaccinated data. Thus, no vaccinated samples are misclassified as previously infected in these two cases. This may be caused by the accumulation of vaccinated samples in the largest bin; in contrast, the Vaccine A vaccinated class (Figure 5) does not exhibit such a large spike. Our data-censored vaccinated PDFs account for this behavior: for large spikes, all samples at that truncation limit are sorted into the vaccinated class.

Figure 1:
Figure 1: Illustration of classification domains D_1, D_2, and D_3, in which the sets of equal probability of a measurement belonging to two or more classes are shown as lines separating the optimal regions.

Figure 3:
Figure 3: Conditional PDFs for the three naïve, previously infected, and vaccinated classes trained on the training data for a Vaccine A visualization of the vaccinated class. See Supplemental Figure S1 for the PDFs of the Vaccine B and combined visualizations.

Figure 4:
Figure 4: Test data k-means partitioning with a Vaccine A vaccinated class for generalized prevalence estimates. We use k = 3 clusters; the clustered domains are labeled D_1, D_2, and D_3. See Supplemental Figure S2 for the partitions of the Vaccine B and combined visualizations of the vaccinated class.

Figure 5:
Figure 5: Training (a) and test (b) data with a Vaccine A vaccinated class with optimal decision thresholds using a known prevalence. Vertical dashed lines indicate the optimal decision boundaries. The optimal naïve, vaccinated, and previously infected domains are labeled D⋆_N, D⋆_V, and D⋆_P. See Supplemental Figures S3 and S4 for the optimal domains for Vaccine B and combined visualizations of the vaccinated class.
Figure 6: Prevalence estimation convergence for synthetic data using a Vaccine A vaccinated class (1000 simulations). In (b), the boxes display the median (line inside the box) and the upper and lower quartiles (top and bottom edges). The whiskers show non-outlier maximum and minimum values; outliers vary from the median by more than 1.5 times the difference between the upper and lower quartiles and are shown as circles. In (b) and (c), the subscripts N, P, and A denote naïve, previously infected, and Vaccine A vaccinated. The number of samples is S.
Figure 7: (a) Level sets of conditional PDFs with example synthetic data, (b) k-means clustering, (c) optimal classification domains with estimated generalized prevalences. In (c), the subscripts N, P, and V denote naïve, previously infected, and vaccinated.
Figure 8: (a-c) Histograms of generalized prevalence estimates from 1000 synthetic data sets. Ten bins are used for each histogram, and the true prevalence is shown as a vertical red line.
Ainsworth et al. (2020) recorded antibody measurements in MFI. For comparison, Wei et al. (2021) convert from MFI to ng/mL following
$$\log_{10}(m) = A + Bf + C\, I_{f > D}\,(f - D), \tag{B17}$$
where the constants are $A = 0.221738$, $B = 1.751889 \times 10^{-7}$, $C = 5.416675 \times 10^{-7}$, and $D = 9.19031 \times 10^{6}$. Here, $m$ is the measurement in ng/mL, $f$ is the measurement in MFI, and $I_{f > D}$ denotes the indicator function, returning one if the measurement exceeds $D$ and zero otherwise.

Figure S9: Conditional PDFs for the three classes trained on the training data for Vaccine B and combined visualizations of the vaccinated class.

Figure S9 shows the conditional PDFs for the naïve, previously infected, and vaccinated classes for two visualizations of the vaccinated class. The effects of truncating the data at the upper limit are visible in the right-most bins of the vaccinated class histograms; these are accounted for by the data-censored vaccinated PDFs. Figure S10 shows the k-means partition for the Vaccine B and combined visualizations of the vaccinated class. The optimal domains for the training and test data with a Vaccine B vaccinated class are shown in Figure S11 and labeled as D⋆_N, D⋆_V, and D⋆_P; the same for a combined vaccinated class are shown in Figure S12. The training data are classified with known generalized prevalences and the test data using estimated generalized prevalences.

Figure S10: Test data k-means partitioning for generalized prevalence estimates. We use k = 3 clusters; the clustered domains are labeled D_1, D_2, and D_3.

Figure S11:
Figure S11: Training (a) and test (b) data with optimal decision thresholds using known (a) and estimated (b) prevalences for a Vaccine B vaccinated class. Vertical dashed lines indicate optimal decision boundaries.

Figure S12:
Figure S12: Training (a) and test (b) data for a combined vaccinated class with optimal decision thresholds. The training data are classified with known generalized prevalences and the test data using estimated generalized prevalences. Vertical dashed lines indicate the optimal decision boundaries.

Table 1:
Estimated and true generalized prevalences for the test data naïve (N), previously infected (P), and vaccinated (V) classes. Vaccine A (A) and Vaccine B (B) are considered separately and together (All).

Table 2:
Classification errors for training data using a known prevalence and test data with an estimated prevalence. Vaccine A and Vaccine B are considered separately and together (Combined).

Table 3:
Generalized prevalence estimate means and standard deviations taken over 1000 simulations of synthetic data generated from the probability models, for an increasing number of samples, with a Vaccine A vaccinated class.