Bayesian inference of infected patients in group testing with prevalence estimation

Group testing is a method of identifying infected patients by performing tests on a pool of specimens collected from patients. For the case in which the test returns a false result with finite probability, we propose Bayesian inference and a corresponding belief propagation (BP) algorithm to identify the infected patients from the results of tests performed on the pool. We show that the true-positive rate is improved by taking into account the credible interval of a point estimate of each patient. Further, the prevalence and the error probability in the test are estimated by combining an expectation-maximization method with the BP algorithm. As another approach, we introduce a hierarchical Bayes model to identify the infected patients and estimate the prevalence. By comparing these methods, we formulate a guide for practical usage.


I. INTRODUCTION
In clinical testing methods such as blood tests and polymerase chain reaction (PCR) tests, discovering infected patients from a large population requires significant operating costs. Because of limitations in the number of available devices, reagents, and technologists, a high demand exists for more efficient methods of testing. Group testing is one of the approaches for the reduction of the operating costs by performing tests on pools of specimens obtained from patients [1,2]. It is known that with the assumption that the rate of the infected patients in the population is sufficiently small, in principle, one can identify the infected patients from the tests on pools whose number is smaller than that of the population. Originally, group testing was developed for blood testing during World War II, and is now applied to various fields such as quality control in product testing [3], estimation of the content of genetically mutated organisms in maize grains [4], and multiple access communication [5].
The identification of the infected patients from the results of group test is mathematically formulated as a channel coding problem to reconstruct the original signal from a codeword transferred through noisy channel [6], where the original signal, codeword, and noisy channel correspond to the state of patients, states of the pools, and errors in the tests, respectively. Further, group testing can be regarded as a variant of compressed sensing [7,8] with discrete variables and logical sums. Hence, the progress in the last decade in sparse estimation including compressed sensing has revived interest in group testing, and information-theory approaches have achieved bound evaluation of group testing at a limiting case [9,10].
Recently, in response to the epidemic infection of COVID-19 that requires testing on large populations, the idea of group testing has attracted increasing attention [11][12][13] from the viewpoint of practical application rather than mathematics. In practice, clinical testing sometimes results in errors even when the operation is precise. For example, in PCR tests, false negative probabilities of up to 3% and false positive (FP) probabilities of up to 5% have been observed [14]. Moreover, the bacterial or viral load in the specimen depends on the specimen collection method and timing [15]. Therefore, a specimen sometimes does not contain a sufficient amount of the pathogen to exceed the detection limit, even when the patient is infected. Further, post-collection contamination of pathogens into a specimen can cause a positive result even when the patient is not infected. Statistical inference can contribute to the correction of errors by estimating the true state of patients from noisy test data, and quantifying the credibility of the estimation.
In this paper, we introduce Bayesian inference to identify the infected patients in the group testing problem considering the finite false probabilities in the test. The infection probability of each patient is approximately calculated by a belief propagation (BP) algorithm because of its low computational cost [16], although the BP algorithm does not achieve the information theoretic bound [10]. Our contributions with regard to the BP algorithm are as follows: has been applied to its estimation [17][18][19]. We construct the estimator of prevalence in addition to the identification of the infected patients by combining an expectation-maximization (EM) method with a BP algorithm. Following the same procedure, the TP and FP probabilities of the test are also estimated.
: (iii) As another approach, we introduce the hierarchical Bayes model to identify the infected patients and estimate the prevalence.
We apply the BP algorithm to the hierarchical Bayes model and evaluate the performance in comparison with the approach described in (ii), and find that the computational cost of the hierarchical Bayes approach is lesser than that required in (ii).
The remainder of the paper is organized as follows. In Section II, we explain the problem setting of group testing and introduce Bayesian inference. In Section III, we introduce the BP algorithm and show its reconstruction performance under finite false probabilities. In Section IV, we propose an estimator based on the confidence interval of the estimated infection probability. In Section V, we discuss the estimation of the unknown parameters in group testing by using the likelihood calculated by the BP algorithm. In section VI, we introduce the hierarchical Bayes model for group testing and discuss the estimation of the prevalence by applying the BP algorithm to the hierarchical model. Section VII presents a summary and discussion, and supplementary comments are provided in Section VIII.

II. PROBLEM SETTING
We consider a population of N patients and M groups on which the test is performed. Let us denote the state of N-patients by X (0) ∈ {0, 1} N , where X i = 1 and X i = 0 means that i-th patient is infected and not infected, respectively. The grouping of the patients is determined by the pooling matrix F ∈ {0, 1} M×N , where F µi = 1 and F µi = 0 means that the i-th patient is in the µ-th group and not, respectively. The true state of the µ (= 1, · · · , M)-th group, denoted by Y (0) µ , is given by where denotes the logical sum of N components. Namely, when µ-th pool contains at least one infected patient, the state of the µ-th pool is 1 (positive), and 0 (negative) otherwise.
The test returns a false result with finite probability. We assume that the errors in the tests are independent from each other and model the observation (result of the test) as where C(·) is the probabilistic function whose behavior is given by [10] P(C(a) = 1|a = 1) = p TP , P(C(a) = 0|a Here, p TP and p FP correspond to the TP and FP probabilities in the test, respectively, and these values are common for all tests. Fig.1 shows matrix representation of the group testing, where we note that the summations in the usual matrix factorization are replaced with a logical sum. We focus on the case α ≡ M/N < 1, where the number of tests is smaller than that of the patients.
Hereafter, we consider that the size of group is fixed to N G . Further, the overlap, which is the number of groups that each patient belongs to, is fixed at N O ; hence, the relationship N O = α × N G holds.

A. Bayesian inference
From the property of C(·), the generative model of Y is given by where and The purpose is to infer the true states of patients X (0) from the observation Y . In general, the true generative process of Y is unknown, but it is reasonable to assume that the process is expressed by the conditional Bernoulli distribution eq.(5), as the variables Y and X (0) are binary, although the value of true parameters p TP and p FP are not known in advance. Bayesian inference is a preferred method in the presence of the reasonable model. As prior distribution of the patient states, we use following distribution: where ρ ∈ [0, 1] is the prevalence, which is not known in advance † . Following the Bayes rule, the posterior distribution is given by wherep TP ,p TP andρ are the assumed TP probability, FP probability, and prevalence, respectively. The i-th patient's state is identified on the basis of the marginal distribution given by where X\X i denotes components of X other than X i . As the variable X i is binary, we can represent the marginal distribution using a Bernoulli probability θ i as and θ i corresponds to the infection probability, namely, the probability that X i = 1. The simplest estimate of X (0) i is the maximum a posteriori (MAP) estimator given by where I(a) is the indicator function whose value is 1 when a is true, and 0 otherwise.

III. BELIEF PROPAGATION
The computation of the marginal distribution requires an exponential order of the sums, and thus is intractable. We approximately calculate the marginal distribution using the BP algorithm on the factor graph representation of the group testing [10,20]. Fig.2 shows the factor graph representation of the group testing for N = 6, M = 4, N G = 3 and N O = 2. Here, we denote M(µ) and G(i) as the indices of the patients in the µ-th pool, and that of the pools in which the i-th patient is included, respectively. The conditional probability P(Y µ |X) depends on X i (i ∈ M(µ)); hence, the posterior distribution can be expressed as a bipartite graph, as shown in Fig.2. For the edge that connects the µ-th factor (test) and the i-th variable (patient), two kinds of messages are defined asm which correspond to posterior information and output information, respectively. Intuitively, the messages m i→µ (X i ) andm µ→i (X i ) represent marginal distributions of X i before and after the µ-th test is performed, respectively. As X i ∈ {0, 1}, these messages are represented by one parameter asm whereθ µ→i and θ i→µ are given byθ and Using these messages, we can approximate the marginal distribution as and thus the infection probability is approximated aŝ and the MAP estimator is given byX

15: end for
First, we consider the case where we know the correct parameters;p TP = p TP ,p FP = p FP , andρ = ρ. The pseudocode of the BP algorithm for group testing with known parameters is shown in Algorithm 1 ‡ . We check the performance of BP algorithm for randomly constructed pooling matrix under the constraint as i F µi = N G ∀µ and µ F µi = N O ∀i. The true state of patients X (0) is also randomly generated under the constraint that i X (0) i = N ρ. The accuracy of the MAP estimator is measured by the TP rate and FP rate given by respectively. A TP value larger than p TP and an FP value smaller than p FP indicates that the performance of the BP-based identification is better than the parallel test of N-patients. α increases, that is, as the number of tests increases, TP increases and the ρ region where TP is larger than p TP extends. FP also has a smaller value than p FP for small values of ρ, and this success region extends as α increases. For large ρ,X (MAP) converges to 0 becauseθ i < 0.5 ∀i, and both TP and FP decrease to zero. A similar tendency is shown for different values of p TP and p FP .
As an example, we show TP and FP for N = 1000, N G = 10, p TP = 0.95, and p FP = 0.1 in Fig.4. The FP probability p FP has large influence on TP, which is obvious from the comparison of Figs. 3 and 4. Intuitively, an increase in p FP (or decrease in p TP ) causes uncertainty of the identification and decreasesθ i ; hence, the MAP estimator tends to be zero. The dependence of TP on p TP and p FP is shown in Fig.5(a) at N = 1000, N G = 10 (N O = 5), and ρ = 0.01. The solid line indicates TP = p TP , and a TP value over the solid line means the reconstruction by the BP algorithm achieves a higher TP than the parallel test of N-patients, and this situation is achieved at a sufficiently small FP probability p FP < 0.05 in this parameter region. The reconstruction performance also depends on N G . Fig.5 (b) Shows the N G -dependence of TP at N = 1000, M = 500, p TP = 0.95, and p FP = 0.05. TP increases as N G increases without increasing the number of tests.
In practical testing, one of the objective is to identify the infected patients in order to prevent spreading of the disease. Therefore, increasing TP is a priority issue, and we mainly focus on the improvement of TP using the BP algorithm.

A. BP algorithm needs "decision threshold"
Before proceeding to the improvement of TP, we discuss the trivial fixed points of the BP algorithm and introduce the idea of "decision threshold" for the identification of infected patients. From eq.(23), whenθ µ→i = 1 for µ ∈ G(i), we obtainθ i = 1 irrespective of the value ofρ. This situation arises when θ j→µ = 0 for j ∈ M(µ)\i at p FP = 0. Therefore,θ i = 1 is achieved when the patients j ∈ M(µ)\i are estimated as negative before µ-th test is performed, where µ ∈ G(i). This is the case in which the i-th patient is trivially identified as positive. In other words, the BP algorithm does not returnθ i = 1 for general cases; hence, to determine the infected patients X (0) i ∈ {0, 1} from an estimate at the BP fixed pointθ i ∈ [0, 1], we need a "decision threshold" such as a MAP estimator, whereθ i = 0.5 is the threshold for determining the infected patients. TP and FP depend on this threshold, and our strategy for the improvement of TP is the appropriate choice of the threshold as discussed in next section.
The threshold atθ i = 0 is expected to lead conservative result, but it is not appropriate for general value of p TP . Following similar logic,θ i = 0 is obtained when at least one of the components ofθ µ→i among µ ∈ G(i) takes the value 0, which is achieved at p TP = 1 and Y µ = 0 or p TP = 0 and Y µ = 1. The former case means that all patients belong to the µ-th test are negative when Y µ = 0 and p TP = 1. In the latter case, Y µ = 1 means Y (0) µ = 0 because p TP = 0; Hence, all patients belonging to the positive test are negative. In other words,θ i is always larger than zero when p TP is less than 1 or larger than 1. Therefore, all of the patients are judged as positive under the threshold atθ i = 0, which corresponds to FP = 1.

IV. IMPROVEMENT OF TRUE-POSITIVE RATE CONSIDERING FLUCTUATION OF THE ESTIMATES
The estimated Bernoulli probabilityθ is a function of Y , and fluctuates depending on the probabilistic observation. The quantification of the credibility ofθ helps in determining the infected patients under conditions of noisy observation data. The confidence interval is one of the guides in inference considering the input fluctuation [21]. For convenience, we introduce the following statistic:τ which gives the MAP estimator asX Here, we assume that the generative model has the corresponding "true value" τ i . Following the normal theory, the 95% confidence interval of the true value of τ i is constructed as where 1.96 is the 97.5% quantile of the standard normal distribution, andσ i is the estimate of the standard error . We resort to the nonparametric bootstrap method to estimate the standard error [22]. We generate b = 1, · · · , N B bootstrap samples We perform the BP algorithm for every bootstrap sample * * , and denote the estimate under the b-th bootstrap sample asτ (b) .
, the estimate of the standard error ofθ i is given bŷ where is the average over the bootstrap samples.
We define the bootstrap estimate of the i-th patient's state aŝ which means that patients whose confidence interval runs over the region τ > 0 are regarded as infected. In comparison with the MAP estimator eq.(28), the decision threshold over which the patients are estimated as infected is lower by 1. The MAP estimator cannot achieve a higher TP than p TP for any ρ, but the bootstrap estimator improves TP to be greater than p TP . Meanwhile, FP of the bootstrap estimator is higher than that of the MAP estimator, which is caused by the reduced decision threshold compared with that of the MAP estimator. However, FP is smaller than p FP for sufficiently small ρ; hence, we consider the bootstrap estimator as practicable. The situation is the same for other parameter region. As an example, we show TP and FP of bootstrap estimator at N = 1000, M = 400, N G = 20, p TP = 0.95, p FP = 0.1 in Fig.7.
For the intuitive understanding of the bootstrap estimator, we show examples of the bootstrap distributions of τ in Fig.8 at N = 1000, M = 500, N G = 10, p TP = 0.95, and p FP = 0.1, where the solid line representsτ and the two dashed lines indicate the confidence interval. This histogram was obtained from 1000 bootstrap samples; note that the confidence interval eq.(29) is not that for the bootstrap distribution. Fig.8(a) shows the bootstrap distribution of an infected patient who is judged as non-infected by the MAP estimator and as infected by the bootstrap estimator. Fig.8(b) shows the same for a non-infected patient. The patients shown in Fig.8 (a) and (b) contribute to the increase of TP and FP of bootstrap estimate, respectively.

V. ESTIMATION OF UNKNOWN PARAMETERS BY EXPECTATION MAXIMIZATION
In this section, we consider the estimation of the unknown parameters: prevalence ρ, TP probability p TP , and FP probability p FP . We construct their estimator by the maximum likelihood method, where the likelihood is given by and the estimators are given byρ An approximation of the log-likelihood is given by the BP algorithm as Bethe free entropy [20], defined as where We derive the maximum-likelihood estimator by the stationary condition of the Bethe free entropy [23]. After the calculation shown in the appendix, we obtainρ where · µ denotes the expectation of X (µ) ≡ {X i |i ∈ M(µ)} according to the posterior distribution with respect to the µ-th test defined by and Eqs. (44)-(45) always have trivial fixed points at 0 and 1, and to avoid these solutions, we solve following expressions: We calculate the infected probability of patients and the estimators of the unknown parameters by the expectation-maximization (EM) method; in the E-step, the fixed point of BP, {θ µ→i } and {θ i→µ }, is achieved by recursive updating under a fixedp TP ,p FP andρ; in the M-step, these parameters are updated according to the extremization conditions of eqs. (43)-(45). We term this method the BP+EM algorithm, which is summarized in Algorithm 2 † † . We note that the behavior of the BP+EM algorithm heavily depends on the initial condition ofp TP andp FP . When the initial conditions ofp TP andp FP are close to their true values, the BP+EM algorithm are stable; hence, the proposed method should be treated as a correction of the experimentally estimated values. Meanwhile, the estimation ofρ is insensitive to the initial condition, which is smaller than α.

VI. HIERARCHICAL BAYES APPROACH
As another approach to estimating prevalence, we introduce the hierarchical Bayes model, where the prevalence is regarded as a hyperparameter distributed according to the hyperprior distribution which is the beta distribution with hyperhyperparameters a and b, and B(a, b) is the beta function. The beta distribution is the conjugate of the Bernoulli distribution. A graphical representation of group testing for the hierarchical Bayes model is shown in Fig.10. The prior distribution of X i under a given ρ is regarded as an "interaction" that is represented by a factor node I i . We introduce additional messages π i→i ,π i→i ,r i→ρ , and r ρ→i for all i, that are propagated from X i to I i , I i to X i , I i to ρ, and ρ to I i , respectively, as shown in Fig.10.

Hyperparameter
Hyperprior The messages propagated between the bipartite graph of Y and X are given bỹ whereπ i→i carries the prior information to X i . Here, we expressπ i→i (x i ) by one parameterρ i , which is derived later, as Using eq.(56), the parametersθ µ→i and θ i→µ that express the messages as Eqs. (14)- (15), are given bỹ and U µ and W µ are given by eqs. (18)- (19). The messages between variables and priors are given by and we obtainρ Further, by setting we obtain which corresponds to the infection probability when the prior is ignored. The messages between prior I i and the hyperparameter ρ are given byr Using these messages, we can approximate the marginal distribution as and the infection probability of X i is estimated aŝ We call the BP algorithm for the hierarchical Bayes model the hierarchical BP (HBP) algorithm; its psuedocode is summarized in Algorithm 3.

A. Comparison of BP+EM and HBP for finite system size
In this section, we discuss the difference between the BP+EM algorithm and HBP algorithm as estimation methods for prevalence. First, we consider the N → ∞ limit, where the saddle point method can be applied to the integral of ρ in eq.(63). After the calculation shown in Appendix B, we obtainρ where ρ * i satisfies We note that eq.(71) does not depend on the hyperprior, and the prevalence is estimated asρ = 1 N N i=1ρ i . Comparing the estimated prevalence in the HBP algorithm with that of the BP+EM algorithm eq.(43) shows that the difference between the two estimators is negligible at N → ∞. Therefore, we compared BP+EM and HBP focusing on the following points.
: (I) Accuracy as an estimator of the prevalence for finite N As mentioned previously, the difference between the two estimators is negligible at N → ∞. However, the two estimators do not coincide with each other at finite N. We quantify the accuracy of the estimator at finite N using bias defined by whereρ(Y, F) denotes the estimates under given Y and F. An accurate estimator results in a low bias value.

: (II) Computational time
Although the mathematical forms of the estimator of prevalence are similar, the update rules in BP+EM algorithm and HBP algorithm differ from each other. The BP+EM algorithm consists of a double loop, namely the E-step for BP and the M-step for updatingρ. In the HBP algorithm, the messages and the estimator are updated at the same time. The difference between these update rules influences the computational time.  Fig.12 (a) and Fig.13 (a) show N-dependence of the bias for the BP+EM and HBP algorithm at α = 0.5, ρ = 0.05, p TP = 0.99, p FP = 0.01 (Fig.12), and α = 0.5, ρ = 0.05, p TP = 0.95, p FP = 0.05 (Fig.13). In these algorithms, the same 100-realization of X 0 , F, and Y were used for comparing these methods. The mean of the beta distribution is given by a/(a + b); hence, the mean of the hyperprior at a = 0.5, b = 0.95 matches the true value of ρ. BP+EM and HBP at a = 0.5, b = 0.95 show almost the same dependency on N in bias. When a and b are not chosen to match the mean of the hyperprior, bias becomes large in finite N, but the difference in biase vanishes as N → ∞. Fig.12 (b) and Fig.13 (b) show N-dependence of the computation time, where we fixed our experimental environment to use a single 3.5 GHz Intel Core i7 CPU. The computational time of the HBP algorithm is less than that of the BP+EM algorithm, and this priority stands out for the high-noise case, which is obvious from the comparison between Fig.12 (b) and Fig.13 (b).
From these results, we consider that the choice of using BP+EM or HBP depends on the purpose. When precise estimation of prevalence is required and one has no conception of the appropriate hyperprior in small system size, BP+EM algorithm should be used. For quick identification of the infected patients, in particular for large system size, the HBP algorithm is well suited to the demand.

VII. SUMMARY AND DISCUSSION
In this study, we investigated the group testing problem where the test possess finite false probabilities. We introduced the BP algorithm to infer the infected patients under the Bayesian inference settings. The performance of the BP algorithm, in particular for the TP rate, was improved by considering the credible interval of the point estimate assigned to each patient. Our approach used bootstrap distribution to estimate the interval. The unknown parameters in the model, in particular prevalence, can be estimated using the EM method and hierarchical Bayes modeling. We compared these method and formulated a guide for practical usage.
We concentrated on the pooling matrix randomly constructed under the column-wise and row-wise constraint specified by N G and N O . The adaptive procedure of group testing was also examined, where the pooling for the next stage was sequentially designed by taking into account the output of the test in the previous stage [26][27][28]. Extension of our BP and HBP algorithm to the adaptive setting is a promising way to explore more efficient pooling and test scheduling.
The MATLAB code used in this study is distributed on GitHub https://github.com/AyakaSakata/GroupTesting.

VIII. COMMENTS AND NOTES
: † The prevalence of the population does not necessarily match an individual's infection probability; hence, the prior distribution is an assumption. We consider that this form of prior information is appropriate for the estimation of the prevalence in the sense that prevalence of the population generated by Bernoulli(ρ) converges to ρ for sufficiently large N.
When the infection probability of each patient can be guessed from symptoms before performing the test and has different values for each patient, one can apply the information into the prior as ρ i . The BP algorithm for this case is obtained from HBP method by fixing the value ofρ i at ρ i .
: ‡ We introduce a damping factor d ∈ (0, 1] for the stabilization of the algorithm as For HBP, we introduce additional damping as In all the numerical simulations performed in this study, we set d = 0.1 conservatively, but this choice extends the time for convergence. To achieve faster and stable convergence, an adaptive setting of the damping factor for group testing is worth studying [29].
: § Here, 1 M denotes an M-dimensional vector whose components take the value 1.
: ¶ For the construction of the confidence interval for Bayesian point estimate, the parametric bootstrap method is another approach [30,31], where the bootstrap samples are generated according to the posterior distribution with the point estimate.
: Another construction method of the credible interval uses the bootstrap percentile as where G −1 B (α) is the α-percentile of the bootstrap distribution. We tried this interval for determining the infected patients. The obtained TP is comparable with the normal theory, but FP tends to be large compared with the interval determined by the normal theory.
: * * The bootstrap sample contains the same row vector of F with high probability. We omit the overlapped rows to stabilize the BP algorithm.
: † † We solve eq.(51) and eq.(52) by the Newton method in the M-step; however, the optimization at each M-step induces algorithmic instability. Hence, we updatep TP andp FP for only one step following the Newton method.