Determination of Varying Group Sizes for Pooling Procedure

Pooling is an attractive strategy for screening infected specimens, especially for rare diseases. An essential step in performing a pooled test is determining the group size. Sometimes an equal group size is not appropriate because of population heterogeneity. In that case, varying group sizes are preferred and can be determined when individual information is available. In this study, we propose a sequential procedure that determines varying group sizes by fully utilizing the available information. The procedure is data driven. Simulations show that it performs well in estimating the parameters.


Introduction
Routine monitoring or large-scale screening is common in biomedical research for identifying infected specimens [1][2][3][4]. However, some test kits, e.g., the nucleic acid amplification test (NAAT), are expensive [2,5]. Therefore, the expense of a large-scale monitoring process is often a financial burden when resources are limited [6][7][8]. The strategy of pooling biospecimens is attractive for addressing this issue [9][10][11]; it was first used during World War II to screen for syphilis [12]. The strategy first pools specimens into groups and then screens these groups. If a group tests negative, all specimens in the group are declared negative; otherwise, individual tests are performed. When the prevalence is low, the total number of tests under pooling is far smaller than under individual testing. Because of its efficiency and cost saving, pooling is now applied in many fields, such as agriculture [13], genetics [14,15], HIV/AIDS [16,17], blood screening [18], and environmental epidemiology [19,20]. The gain from pooling depends mainly on the pooling algorithm. Assuming homogeneity of the population, dozens of papers have investigated how to design an efficient algorithm [21][22][23][24][25]. However, this assumption might be violated in practice [26][27][28]. When individual information is available, it is of interest to estimate the individual-level prevalence by incorporating that information. Note that only the group-level status, e.g., positive or negative, is observed.
This problem has been studied in a parametric context through the framework of binary regression models [29][30][31], and also in semiparametric [32,33] and nonparametric contexts [34,35]. However, the aforementioned work mostly uses a single group size that is determined in advance.
A set of pool sizes might be more appropriate when considering population heterogeneity. For example, varying pool sizes were used to estimate the infection prevalence of Myxobolus cerebralis, which causes whirling disease, among free-ranging salmonid fish collected from the Truckee River in Nevada and California [36]. In a study estimating the prevalence of several viruses in carnations grown in nursery glasshouses in Victoria, sequential pooled testing involving several pool sizes was adopted [37]. Using a single group size might be optimal for some estimates but far from optimal for others, especially when little information is available ahead of the experiment [37,38]. More work on this issue is warranted, since the benefit of a pooling algorithm depends mainly on the choice of pool size [38][39][40]. In this study, we propose a pooling strategy with varying pool sizes that takes advantage of individual information. Our procedure is a data-driven pooling algorithm in which groups are formed sequentially. Its performance is extensively investigated through simulations and a real data set.

Notations and Background.
Suppose N specimens are assigned to m groups, each of size k_i for i = 1, 2, ..., m. Let z_i denote the observed status of the i-th group, and let X_ij denote the covariates of the j-th specimen in the i-th group for j = 1, ..., k_i and i = 1, ..., m. The observations are {z_i, X_ij, j = 1, ..., k_i, i = 1, ..., m}, where X_ij = (1, x_{1,ij}, ..., x_{d-1,ij})^T.
Here, the notation A^T represents the transpose of the matrix A. The sensitivity and specificity of the screening tool are denoted by S_e and S_p, respectively. The full likelihood function is

L(β; z, X) = ∏_{i=1}^{m} [P(z_i = 1)]^{z_i} [1 − P(z_i = 1)]^{1 − z_i}, with P(z_i = 1) = S_e − r ∏_{j=1}^{k_i} (1 − p_ij),

where r = S_e + S_p − 1 and p_ij is the risk probability of the j-th specimen in the i-th group. The parameter β is defined by β = (β_0, β_1, ..., β_{d−1})^T, and the function g^{−1}(·) is a known, monotone, and differentiable link function.
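The likelihood above can be evaluated directly. Below is a minimal Python sketch, assuming a logistic model p_ij = 1/(1 + exp(−X_ij^T β)); the function and argument names are ours, not the paper's.

```python
import math

def group_test_loglik(beta, groups, z, se, sp):
    """Log-likelihood of group-testing data under a logistic risk model.

    groups : list of pools; groups[i][j] is the covariate vector X_ij
             (including the leading 1) of specimen j in pool i.
    z      : list of 0/1 observed pool statuses.
    se, sp : sensitivity and specificity of the assay.
    """
    r = se + sp - 1.0
    ll = 0.0
    for zi, pool in zip(z, groups):
        # probability that every specimen in the pool is truly negative
        prob_all_neg = 1.0
        for x in pool:
            eta = sum(b * xv for b, xv in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))  # individual risk probability
            prob_all_neg *= 1.0 - p
        # P(pool tests positive) = Se - r * P(all members negative)
        prob_pos = se - r * prob_all_neg
        ll += zi * math.log(prob_pos) + (1 - zi) * math.log(1.0 - prob_pos)
    return ll
```

Maximizing this function over β with any numerical optimizer yields the maximum likelihood estimator used throughout the paper.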
Sometimes there is a maximum admissible group size k_max; e.g., a large group size might bring a dilution effect. Therefore, we should carefully choose an appropriate group size that is no larger than k_max. Define the set K = {1, 2, ..., k_max}, and write k = (k_1, ..., k_m), k_i ∈ K, i = 1, ..., m. Once the group sizes k are determined, we can obtain the estimator of β by maximizing the likelihood function L(β; z, X). The Fisher information matrix of the parameter β, denoted I(β, k), depends on both β and k; its calculation is presented in the Supplemental Material. To obtain a better estimator of β, we seek the k that maximizes the Fisher information I(β, k). However, optimizing over individual-level assignments makes this goal difficult to achieve directly.
When the pools are homogeneous, the Fisher information I(β, k) approximately reduces to a simpler form in which each group contributes a term C_i(β, k_i). We therefore propose to determine the group sizes by minimizing each C_i(β, k_i) with respect to k_i for i = 1, ..., m. Note that this approximation requires the pools to be homogeneous.
There are two ways to obtain homogeneous pools: reorder the specimens according to the similarity of their covariates, or according to their individual risk probabilities. The latter is adopted in this study. Following the method in McMahan et al. [42], the procedure for forming homogeneous pools is as follows. First, use training data or prior knowledge to obtain an initial estimator β^(0) [42]. Second, sort the specimens by their risk probabilities. Let G denote the set containing the covariates of all enrolled specimens, G = {x_1, ..., x_N}, where N is the number of specimens and x_i is the covariate vector of the i-th specimen. Sort G by the risk probability p_i = g(x_i^T β^(0)) in descending order to obtain a sorted set G_s = {x_{s_1}, ..., x_{s_N}}. The remaining steps are performed directly on this sorted set.
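This sorting step can be sketched as follows; `sort_by_risk` is a hypothetical helper name, and a logistic risk function is assumed.

```python
import math

def sort_by_risk(X, beta0):
    """Sort specimens by initial risk p_i = logistic(x_i^T beta0), descending.

    X     : list of covariate vectors (each including the leading 1).
    beta0 : initial estimate of beta from training data or prior knowledge.
    Returns the covariate vectors reordered from highest to lowest risk.
    """
    def risk(x):
        eta = sum(b * xv for b, xv in zip(beta0, x))
        return 1.0 / (1.0 + math.exp(-eta))
    return sorted(X, key=risk, reverse=True)
```

The returned list plays the role of G_s in the algorithm that follows.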

Sequential Adaptive Pooling Algorithm.
Our strategy is an adaptive design, which is often adopted in biological experiments and also in pooled testing [22]. Before stating the algorithm, we need the following result. Suppose specimens have been assigned to the first l − 1 groups with corresponding group sizes k_1, ..., k_{l−1}. Let n_l = ∑_{j=1}^{l} k_j for l ≥ 1 and n_0 = 0. Denote W_l(β) = −log(1 − g(x_{s_{n_{l−1}+1}}^T β)). Then the group size of the next group, k_l, equals k_max if k_max ≤ ϕ_0/W_l(β^(0)). Here, ϕ_0 is the root of the equation 2S_e(1 − S_e)(ϕ − 1)e^{2ϕ} + r(2S_e − 1)(ϕ − 2)e^{ϕ} + 2r^2 = 0 and is approximately 1.8414. The proof of this result is presented in the Supplemental Material. Our pooling strategy is described as follows. Step 1. Label the specimens according to the ordering of G_s; for example, label the specimen with covariates x_{s_1} as number 1. Tentatively assign the specimens with labels up to k_max to the l-th group.

Computational and Mathematical Methods in Medicine
Step 3. Let G_s = G_s \ G_l and l = l + 1. Repeat Step 2 to form the next group in the same way until all specimens are assigned.
Step 4. Screen the groups and obtain the maximum likelihood estimator of β. Note that this is a data-driven pooling strategy. Additionally, the above procedure does not strictly require that all specimens be enrolled before screening, since the set G_s is dynamic and can be refreshed with newly enrolled specimens.
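The steps above can be sketched in code. In this hypothetical Python sketch, `phi0` solves the stated equation for ϕ_0 numerically by bisection (the bracket [1, 2] is an assumption that holds for the accuracies tried here), and `form_groups` assigns risk-sorted specimens sequentially, assuming k_l = min(k_max, max(1, floor(ϕ_0/W_l(β^(0))))); the paper states explicitly only the case k_l = k_max, so the floor rule is our reading.

```python
import math

def phi0(se, sp, lo=1.0, hi=2.0, tol=1e-10):
    """Solve 2*Se*(1-Se)*(phi-1)*e^(2*phi) + r*(2*Se-1)*(phi-2)*e^phi + 2*r^2 = 0
    for phi by bisection, where r = Se + Sp - 1.  The bracket [lo, hi] is an
    assumption; widen it if f(lo) and f(hi) do not differ in sign."""
    r = se + sp - 1.0
    def f(phi):
        return (2.0 * se * (1.0 - se) * (phi - 1.0) * math.exp(2.0 * phi)
                + r * (2.0 * se - 1.0) * (phi - 2.0) * math.exp(phi)
                + 2.0 * r * r)
    if f(lo) * f(hi) > 0:
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def form_groups(X_sorted, beta0, k_max, phi_0):
    """Sequentially assign risk-sorted specimens to pools, taking the next
    pool size as k_l = min(k_max, max(1, floor(phi_0 / W_l))), where W_l is
    computed from the highest-risk unassigned specimen (our assumption for
    the case k_l < k_max)."""
    groups, start = [], 0
    while start < len(X_sorted):
        x = X_sorted[start]
        eta = sum(b * xv for b, xv in zip(beta0, x))
        p = 1.0 / (1.0 + math.exp(-eta))   # risk of the leading specimen
        w = -math.log(1.0 - p)             # W_l(beta0)
        k = min(k_max, max(1, int(phi_0 / w)))
        groups.append(X_sorted[start:start + k])
        start += k
    return groups
```

Low-risk specimens thus get pooled at the cap k_max, while high-risk specimens end up in small pools, down to individual testing.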

Numerical Results.
In this section, we evaluate the performance of the proposed procedure, which we name PSV (pooling strategy with varying group sizes). For comparison, we also present the results of the pooling strategy with a single fixed group size k, denoted PSF(k). The group size k for PSF(k) is either given in advance, e.g., k = 5 or 10, or determined from the average prevalence of the enrolled samples. For the latter, we determine the optimal single group size k* by minimizing the variance of the prevalence estimator.
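The search for k* can be sketched numerically. The variance expression below is our own delta-method approximation under standard asymptotics with the number of pooled tests held fixed; it is not a formula taken from the paper.

```python
def optimal_single_size(p, se, sp, k_max=50):
    """Pick the single pool size k minimizing an approximate asymptotic
    variance of the prevalence estimator per pooled test.

    With pi_k = Se - r*(1-p)^k the probability that a pool of size k tests
    positive, the delta method gives Var(p_hat) per test proportional to
        pi_k * (1 - pi_k) / (r^2 * k^2 * (1-p)^(2k-2)).
    """
    r = se + sp - 1.0
    q = 1.0 - p
    def var_per_test(k):
        pi = se - r * q ** k
        return pi * (1.0 - pi) / (r ** 2 * k ** 2 * q ** (2 * k - 2))
    return min(range(1, k_max + 1), key=var_per_test)
```

Under this approximation the optimal size shrinks as the prevalence grows, matching the intuition that rare conditions admit larger pools.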
To investigate the performance of these methods, define the link function g(·) as the logistic function g(u) = 1/(1 + exp(−u)). Then the individual prevalence is obtained through the model p_ij = g(X_ij^T β). We first consider a single covariate (d = 2), following either the normal distribution N(2, 1.5) or the gamma distribution Γ(2.5, 0.8). The corresponding parameters are set to β_0 = −3 and β_1 = 0.4. Samples are generated under these settings, and each procedure is repeated M = 5000 times. We report the estimators of β_0 and β_1, along with their mean square errors (MSE), in Table 1 under different settings of sensitivity, specificity, and number of groups. In Figure 1, we further report the relative bias of the parameter estimates. Table 1 shows that all procedures perform similarly except PSF(5). When using a fixed-size procedure, we have to choose the group size in advance. This choice is crucial for a group testing algorithm, since the precision of the estimators depends heavily on the group size. In our setting, the average individual prevalence is about 0.0997, and the corresponding optimal single group size is mostly k* = 13, 12, and 11 for (S_e, S_p) = (0.99, 0.99), (0.95, 0.95), and (0.9, 0.9), respectively. Consequently, PSF(10) performs better than PSF(5), since the latter uses too small a group size. Figure 1 further shows the relative bias of the estimates of β_0 and β_1. Our procedure with varying group sizes, PSV, performs very well under the different scenarios. PSF(5) again has the poorest performance in terms of relative bias. As data-driven pooling strategies, PSV and PSF(k*) both perform well, but PSV has smaller bias, which is a desirable property.
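The data-generating step of such a simulation can be sketched as follows. We read N(2, 1.5) as mean 2 and standard deviation 1.5 (the parameterization is not stated in the text), and for simplicity pool specimens in enrollment order rather than after risk sorting; the function name and defaults are ours.

```python
import random, math

def simulate_pooled_tests(n, k, beta0=-3.0, beta1=0.4, se=0.95, sp=0.95, seed=1):
    """Generate n specimens under the logistic model
    p = 1/(1 + exp(-(beta0 + beta1*x))), x ~ N(2, sd=1.5),
    pool them in order into groups of size k, and return the
    observed (error-prone) 0/1 pool statuses."""
    rng = random.Random(seed)
    status = []
    for _ in range(n):
        x = rng.gauss(2.0, 1.5)            # covariate draw
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        status.append(rng.random() < p)    # true infection status
    z = []
    for i in range(0, n, k):
        truly_pos = any(status[i:i + k])
        # imperfect assay applied at the pool level
        if truly_pos:
            z.append(1 if rng.random() < se else 0)
        else:
            z.append(1 if rng.random() < 1.0 - sp else 0)
    return z
```

Feeding the resulting pool statuses and covariates into the likelihood yields one replicate of the Monte Carlo study.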
The overall relative bias of these estimators, reported in Figure 3, confirms this property. It also reveals that pooling procedures using a single group size are not desirable for a heterogeneous population, even when the group size is carefully chosen, e.g., as k*.

An Illustrative Application.
Verstraeten et al. conducted a surveillance study in Kenya to monitor the trend in HIV risk over time [43]. The samples were collected from pregnant women, along with potential risk covariates such as age, parity, and education level. They used a common group size of 10 to estimate the seroprevalence of HIV. However, the individual prevalence of HIV is related to these risk covariates; e.g., the risk of HIV might tend to increase with age. For this data set, Vansteelandt et al. reported a set of group sizes varying between 5 and 12 under a cost-precision trade-off [40].

Discussion
In biological and epidemiological studies, there is growing interest in developing methods that yield more accurate results at lower cost. Group testing is such a cost-saving strategy. In this study, we developed a pooling strategy that uses varying group sizes when individual information is available. This strategy is attractive since it depends only on the information of the enrolled specimens and does not require a group size chosen in advance. Owing to its data-driven nature and theoretical justification, the proposed procedure, PSV, performs robustly under different settings. It is convenient in practice, since one does not have to worry about choosing an appropriate group size.
Varying group sizes are reasonable when the target population is diverse. For example, a sequential testing procedure using several group sizes was adopted to estimate virus infection levels in carnation populations grown in glasshouses, since different carnation populations were expected to have a wide range of infection levels [45]. We can pool more specimens into one group when the probability of testing positive is small. It is reasonable to balance the probability of testing positive across groups, which mimics the situation in which all enrolled specimens are homogeneous.
In this study, we also propose a procedure using a single group size k* determined by minimizing the variance of the prevalence estimator. This procedure may be preferred when a simple design is desired or when the diversity among the specimens to be screened is negligible. In addition, we did not consider the cost of collecting specimens. If a test is much more expensive than collecting a specimen, then the cost of testing is the main consideration in a large-scale screening project. Otherwise, it is necessary to take the overall cost of collection and testing into account when using a pooling strategy.

Data Availability
The Kenya data supporting this study are from previously reported studies and datasets, which have been cited. The data are available at https://cran.r-project.org/package=binGroup.

Conflicts of Interest
The authors declare no conflicts of interest.