Accurate estimation of molecular counts from amplicon sequence data with unique molecular identifiers

Abstract Motivation Amplicon sequencing is widely applied to explore heterogeneity and rare variants in genetic populations. Resolving true biological variants and quantifying their abundance is crucial for downstream analyses, but measured abundances are distorted by stochasticity and bias in amplification, plus errors during polymerase chain reaction (PCR) and sequencing. One solution attaches unique molecular identifiers (UMIs) to sample sequences before amplification. Counting UMIs instead of sequences provides unbiased estimates of abundance. While modern methods improve over naïve counting by UMI identity, most do not account for UMI reuse or collision, and they do not adequately model PCR and sequencing errors in the UMIs and sample sequences. Results We introduce Deduplication and Abundance estimation with UMIs (DAUMI), a probabilistic framework to detect true biological amplicon sequences and accurately estimate their deduplicated abundance. DAUMI recognizes UMI collision, even on highly similar sequences, and detects and corrects most PCR and sequencing errors in the UMI and sampled sequences. DAUMI performs better on simulated and real data compared to other UMI-aware clustering methods. Availability and implementation Source code is available at https://github.com/DormanLab/AmpliCI. Supplementary information Supplementary data are available at Bioinformatics online.

4. Iterate the E step and the M step until convergence.
reads R and barcodes B, and unobserved hidden variables W is (S1) Define the conditional expectation of the hidden data Z i = (Z i1 , Z i2 ) given the observed data (b i , r i ) and computed with the current parameter estimates η (t) and Γ (t) , The EM algorithm iterates between an E step, where the expectations e (t) isk are computed, and an M step, where the model parameters are estimated to maximize equation (S1), with e (t) isk substituted for 1 {Zi1=us,Zi2=h k } . Iterations will continue to a local maximum of Eq. (1) when the relative change in observed data log likelihood drops below a threshold (default, 1 × 10 −6 ). To simplify the notation in the following, we drop the superscripts (t) indicating the tth EM iteration, but each of e isk , ξ sk , λ s , φ sk , some to be defined below, as well as the parameters η, Γ and their updatesη,Γ, will depend on the iteration t.

E step:
For the ith read, e isk is, by Bayes rule, where all probabilities are given in the main text and parameters are replaced with their current estimates.

M step:
Parameter η can be updated withη while the sth row γ s of transition matrix Γ can be updated by maximization of where λ s is the Lagrange multiplier to impose the sum constraint for transitions out of u s . In the absence of a penalty (ρ = 0), the MLE isγ where ξ sk = n i=1 e isk is the expected number of molecules where UMI u s is attached to molecule h k . For positive ρ, the maximum penalized-likelihood estimators (MPLEs) of γ s areγ sk = 0 when ξ sk = 0 and solutions of the score functions, for each k with ξ sk > 0. Yin et al. (Yin, 2016) prove the solution to these score functions exists and maximizes the objective function when (a) at least one ξ sk > ρ and (b) |ξ sk − ρ| > ω for all k. Under these conditions,γ sk is the valid root of φ sk ± φ 2 sk + 4ωλ s ξ sk 2λ s , where φ sk = ξ sk − ωλ s − ρ and λ s is the root of implicit equation In practice, we set ω to a very small value (default: 10 −20 ) for LASSO-like regularization and apply the Newton-Raphson algorithm to find the root λ s initialized as max 0≤k≤K ξ sk − ρ.
Since ω is very small, it is rare that condition (b) is not met, especially when ρ is chosen not to be an integer. Condition (a) is not met (all ξ sk < ρ) when (c) u s is an error UMI incorrectly included in U or (d) u s is a valid UMI that was only weakly amplified, probably with early-cycle PCR error(s) to haplotype(s) h k identical or similar to haplotypes included in H. The first problem (c) may be resolved by applying a penalty to η s , though we did not attempt such a solution in DAUMI since we took other precautions to limit the inclusion of false UMIs in U. The second problem (d) is difficult to resolve without properly modeling the PCR error and amplification process. DAUMI and all compared methods do not model PCR and cannot distinguish early-cycle PCR errors from legitimate variation in the sample. Instead, these methods rely on the fact that early-cycle PCR errors are exceptionally rare compared to late-cycle PCR errors, because there is far more opportunity for error after the template has been geometrically amplified. Thus, these quantification methods detect and discard most PCR and sequencing errors, but inevitably leave a few early cycle PCR errors. In the few cases where (a) is not met, we setγ sk = 1 {k=k} fork = arg max k ξ sk . As long as the overall M step improves equation (S1) at each iteration, the EM algorithm is still guaranteed to converge (McLachlan and Krishnan, 2008).
We check for this condition and never encountered a problem in any of the real or simulation datasets analyzed.

S2 Automated Selection of Penalty Parameter ρ
To automate selection of penalty parameter ρ, we propose a simple model for the observed abundance distribution p UMI (x) of UMIs in the dataset. This distribution is a mixture of true UMIs and error UMIs generated by PCR and sequencing errors. Our goal is to identify a threshold that distinguishes true and error UMIs.
We assume molecular abundance follows a stochastic Galton-Watson branching process (Stolovitzky and Cecchi, 1996), which is often used to model the PCR amplification process (Pflug and von Haeseler, 2018). The model assumes each molecule is independently duplicated with amplification efficiency χ ∈ (0, 1) per PCR cycle. Let denote the PCR error rate, a small probability that a molecule is copied with at least one error. The number of error-free copies X t of a true UMI without collision after the tth PCR cycle is given by the stochastic recursion where [1 − ]χ is the error-free amplification efficiency. After σ cycles of PCR, the molecules are sampled and sequenced to produce observed count where ι is the sampling rate and δ is the sequencing error probability.
The probability p BP (x; σ, χ, , ι, δ) of x faithful copies of a molecule after σ cycles of PCR amplification followed by sequencing is not analytically available, but can be computed by numerical methods (Lange, 2010). Assuming all PCR amplification errors are unique, the abundance of an error UMI depends only on the random PCR cycle T in which it originated and not also on the amplification of similar molecules in a neighborhood around it. Then, using the law of total probability, the probability of observing x copies of an error UMI is where Pr(T = t) is the probability of a PCR error in the tth cycle or a sequencing error when T = σ + 1.
Overall, the abundance of a UMI follows a mixture distribution where we have included error-free amplification and sequencing of original molecules present at T = 0.
In practice, we take set ι = 1 when calculating p UMI (x). In our experience, ι is substantially confounded with χ and σ, when the latter is unknown. Setting ι = 1 constrains the parameters sufficiently to produce good estimates, as assessed on simulation data. Assuming UMIs are unique prior to amplification (no collision), we have is a function of the parameters {χ, σ, , δ}, which we can optimize to match the observed UMI abundance distribution.
There is no analytical solution to estimate the parameters {χ, σ, , δ}, and the presence of unmodeled artefacts, like collision, in the right tail of the abundance distribution makes numerical maximum likelihood estimation difficult. Instead, we right-truncate the data at count τ and select parameters that minimize the Kolmogorov-Smirnov (KS) statistic (Stephens, 1974) for conditional distribution From the left, the first mode of the observed UMI abundance distribution represents errors, while the second and possibly additional modes represent PCR amplified products (Stolovitzky and Cecchi, 1996;Pflug and von Haeseler, 2018). To estimate the parameters of the PCR process, it is important to retain information about PCR amplification by truncating the distribution to the right of the second mode. We use R package multimode (Ameijeiras-Alonso et al., 2021) to find the mode of each distribution after removing sequencing errors with AmpliCI. These distributions are shown for simulation data in Figure S4 and HIV data in Figure S6. Given the conditional cumulative distribution func-  (Schirmer et al., 2016;Stolovitzky and Cecchi, 1996;Potapov and Ong, 2017;Quince et al., 2011), but the user may need to adjust these parameters, particularly τ or σ, to match their data. Note, σ is not necessarily the actual number of experimental PCR cycles, but an effective number of PCR amplification cycles under our "ideal" model of PCR amplification. Rather than an exhaustive search over the entire grid, we implement a block relaxation algorithm (de Leeuw, 1994), alternating between optimization of (χ, σ) while holding ( , δ) fixed and vice versa until the KS statistic converges with the relative change between iterations < 10 −4 for each possible value of τ .
The remaining challenge is to choose a threshold ρ that will remove most error UMIs without discarding true UMIs. In practice, we choose ρ to be the 5th percentile of the fitted observed abundance distribution p BP (x;σ,χ,ˆ , 1,δ) for true UMIs, which implies an estimated five percent of true variants will have observed abundance below the threshold ρ. Table S4 shows the fitted model and selected ρ for each τ ∈ {100, 150, 200, 300} in all test datasets. For most datasets, our strategy works well and helps to select appropriate ρ ( Figure S3 and S5). On the V1V2 and V3 datasets, the strategy selects an optimal or near optimal ρ (Table S9), but that choice is always ρ = 1 for V3, no matter the truncation point τ . There is no obvious second mode in raw UMI count data ( Figure S5), either because there was very inefficient amplification, we estimateχ = 0.05, or severe downsampling at sequencing. Both scenarios limit UMI replication and inhibit all UMI-based quantification methods. At the same time, these data display some very highly replicated UMIs, perhaps because of high rates of UMI collision or highly stochastic amplification rates (Potapov and Ong, 2017). For all these reasons, it is difficult to pick a threshold to separate errors from real variants. In general, we suggest visual assessment of the observed UMI abundances before and after error removal to assess the adequacy of the automatically selected ρ. In the case of V3, we manually chose ρ = 20, which yielded slightly worse performance in our assessments than the automatically selected ρ = 1 (Table S9), thus demonstrating that we did not cherry pick results. The good news for the user is that DAUMI performance is minimally impacted by selection of ρ. It appears to have most impact on low amplification datasets like V3, but even with our suboptimal manual override choice of ρ, DAUMI performance was superior to other methods (Table 1).
For assessing performance on the real data, we had to choose ρ for random binary splits of the data ( §3.2). For V1V2, ρ was selected independently for each subset by the automatic procedure. For V3, ρ = 10 was selected for both subsets, half the value chosen by visual assessment on the whole dataset.
For the single-cell molecular spike dataset, we pooled the UMIs from all cells and estimated ρ = 9 by the automatic procedure. Since we expect minimal UMI collision across cells, we applied this threshold directly to each cell.

S3 Tuning Run Parameters of Other Methods
As just demonstrated for selection of ρ, it is important and sometimes difficult to set appropriate values for run parameters; the default values are not always optimal. To verify the optimality of the default parameters used in the other quantification methods studied in this work, we ran Calib, UMI-tools and DAUMI under different parameter settings. There are no run parameters for the Naïve method, and tuning Starcode-umi was deemed too difficult as there are at least five parameters and no guidance to set them. Table S6 shows the results on simulation data while varying run parameters. Specifically, we tried edit distance d = 1 or d = 2 for UMI-tools, varied ρ for DAUMI, and tried six parameter settings for Calib as suggested by the authors (Orabi et al., 2018). UMI-tools performed better with the default edit distance d = 1, though the number of false positives slightly increases. While Calib's default parameter values did not always produce the best results, the effect of different parameter settings was minimal.
Finally, we show that DAUMI at default settings always outperformed other algorithms at their optimal parameter settings, even though default ρ does not always achieve the best performance.

S4 Abundance Estimation for Known Variants
In some cases, the set of possible haplotypes is known a priori (Zhang et al., 2021). To mimic this situation, we set H to the true 25 haplotypes for simulations 1-4. This experiment also allows us to explore DAUMI abundance estimation with less chance for false positives, since only a few of the 25 true haplotypes are not sampled in the data and could be incorrectly included as false positives. Table S7 shows that performance does increase as compared to the performance in Table S6, where the set H is not known. DAUMI can accurately estimate deduplicated abundance on datasets without UMI collision (Simulations 1-2) as expected, and there is little difference as ρ varies. However, for datasets with UMI collision (Simulations 3-4), DAUMI underestimates haplotype deduplicated abundance (regression slope b < 1), especially as ρ increases, because then even legitimate UMI to haplotype linkages with transition probability γ sk > 0 are eliminated by the penalty. For Simulation 4 with low PCR efficiency, the underestimation is particularly sensitive to ρ, presumably because there are more true molecules failing to amplify well. When two haplotypes are assigned to the same UMI and the number of assigned reads to one haplotype is below the threshold ρ, DAUMI will eliminate the less observed haplotype as a likely error, leading to an underestimation of haplotype abundance. However, it is not a good idea to simply set ρ very low, since then DAUMI will retain false linkages between UMIs and haplotypes. There may be weak evidence for UMI/haplotype combinations (u s , h k ) that do not actually exist in the sample if there is at least one observed read (b i , r i ) plausibly generated from (u s , h k ). For example, DAUMI identifies two false positives for Simulation 3 and one false positive for Simulation 4 with ρ = 0.01 (a choice well below the automatically selected value [see Table S2]). These are haplotypes included in H that were not actually sampled. When the haplotype set H is not constrained by a known truth, there will be even more weak linkages that survive when ρ is set too low.

S5 Extracting UMIs from V1V2 Dataset
The reverse reads of the V1V2 dataset contain a 52-55 nucleotide (nt) technical sequence at the 5' end, starting with 0-3 random nucleotides, an 18nt adapter, a 9nt UMI, and a 25nt primer with two ambiguous nucleotides, one Y and another R, used to amplify the target region in the HIV env gene. Our goal is to extract the 9nt UMI, discard the rest of the technical sequence, and recover the approximately 250nt sampled sequence. This section describes an independent model used only in data preprocessing; reuse of DAUMI model notation does not communicate simile.
We assume there are no indel errors in the technical sequence of each read, so the adapter sequence starts at read position 1, 2, 3, 4, or is not present at all. Let W i1 ∈ {0, 1, 2, 3, 4} be the unknown number of random nucleotides at the start of read X i or W i1 = 4 when the technical sequence is not present.
Further, let W i2 ∈ {0, 1, 2, 3}, defined only when W i1 < 4, be the unknown state of the unambiguous primer sans UMI with the Y and R nucleotides resolved. We assume W i1 and W i2 are independent and For the observed data, dropping read index i, let be the probability of read nucleotide x at position j given W = w. Read position j can index the 0-3 nucleotides in the 5' random leader, nucleotides in the UMI, either of two ambiguous positions in the primer, other unambiguous nucleotides in the technical sequence or the sample sequence. Assume the random 5' leader, the UMI and the sampled sequence are adequately modeled as independently and identically distributed nucleotides within class, but allow the nucleotide composition to vary: q r = (q rA , q rC , q rG , q rT ) for the random nucleotides in the 5' leader and UMI and q s = (q sA , q sC , q sG , q sT ) for the sample sequence. At all other sites, assume errors are independent, but not equally likely across sites. Let δ j be the probability of an error-free nucleotide at read position j. Given an error, let γ N1N2 be the probability that true nucleotide N 1 is misread as read nucleotide N 2 . Define the index sets Let δ = (δ 1 , δ 2 , ...) T , γ = (γ AC , γ AG , γ AT , γ CA , γ CG , γ CT , γ GA , γ GC , γ GT , γ TA , γ TC , γ TG ), η = (η 0 , η 1 , . . . , η 4 ), and ζ = (ζ 0 , ζ 1 , ζ 2 , ζ e ). Then our unknown parameter vector is θ = (δ T , γ T , q T u , q T s , η T , ζ T ) T . Finally, if there are n reads, the length of read i is J i , all reads are independent and all nucleotides within reads are conditionally independent, then the complete data likelihood is In the E step, we need to compute E[ln L C (θ | X, W ) | X], but since the complete data log likelihood is linear and the reads are independent, it amounts to computing for each i ∈ {1, 2, . . . , n}, k ∈ {0, 1, 2, 3}, and l ∈ {0, 1, 2, 3}. When k = 4, l is undefined, and we have with expected counts We fit all 61,881 reads to this model, converting ambiguous N nucleotides (affecting just 170 reads) to A. We iterated the EM algorithm until the relative change in log likelihood was below 0.001. We dropped 18,861 reads shorter than the 52nt primer and 700 reads unlikely to contain the technical sequence, i.e.
with Pr(W i1 = 4 | X i ) ≥ 0.5. We dropped three reads with posterior probability Pr(W i1 = 4 | X i ) < 0.5, but arg max k∈{0,1,2,3},l∈{0,1,2,3} Pr(W i1 = k, W i2 = l | X i ) < Pr(W i1 = 4 | X i ); these could have been retained but were lost to a small bug in our code. Since we noticed several reads had a poor match to the 25nt primer, we dropped another 8,763 reads where the log likelihood of the primer region only was smaller than −100 as likely technical artefacts. Finally, we dropped 25 surviving reads with ambiguous N nucleotides. For the 33,529 remaining reads, we assumed the 9nts starting at positionŵ i1 + 18 of the ith read constituted the UMI and the sampled molecule extended fromŵ i1 + 52 to the end of the read.
Here,ŵ i1 = arg max k∈{0,1,2,3} Pr(W i1 = k | X i ) is the most likely number of 5' leader nucleotides. The posterior probabilities used for screening were obtained from the e ikl , k < 4 and e i4 values computed in the last E step.

S6 Supplementary Figures and Tables
Here we include all supplementary figures and tables that have been cited in the main text or in supplementary sections S1-S4. Figure S1 compares DAUMI and other methods with respect to their ability to detect UMI collisions. Figure S2 shows the pipeline for simulating datasets. Figures S3-S6 illustrate the observed UMI abundance distributions before and after error correction; the selected parameter ρ is shown as a red vertical line on the UMI abundance distributions without error correction ( Figures S3   and S5). Figure S7 shows the performance of UMI-unaware AmpliCI on the simulated datasets. Figure S8 compares true and estimated values of UMI mixing proportions η in simulation. Figure S9 shows the number of estimated haplotypes, their deduplicated abundances, and a Venn diagram of overlaps for the HIV V1V2 data. Figure S10 demonstrates the low UMI diversity found in the HIV V3 dataset. Figure S11 uses Venn diagrams to compare haplotypes found by different methods on the HIV V3 dataset.  Table S2 summarize the four simulation conditions that generated the data analyzed in Figure 3. Table S3 provides running time information for DAUMI. Table S4 provides detailed information about the achieved fits of the branching process model ( §S2) during selection of ρ. Table S5 provides numeric summaries for the linear model fits shown in Figure 3. Table S6 shows the performance of different methods while varying run parameters on the simulation data. Table S7 shows the performance of DAUMI with H fixed to a set of known candidates and varying ρ on the simulation data. Table S8 summarizes the HIV datasets. Table S9 shows the effect of ρ on DAUMI performance on the HIV datasets. Table S10 compares methods when all singletons are a priori disqualified to be haplotypes, which is the default behavior for DAUMI and Naïve, but not the other methods. Table S11 shows the improvements of DAUMI with a different initial set U on one single cell of the single-cell molecular spikes dataset.

Resolvable Resolvable
Resolvable Figure S1: DAUMI can correctly resolve the origin of haplotypes even if there are UMI collisions. Figure S2: Data simulation pipeline. Total N molecules are tagged with UMIs, after being sampled from K haplotypes under a multinomial distribution Mult(N, π). Then tagged molecular are amplified with a given efficiency for a given number of PCR cycles to mimic PCR amplification process. Lastly, ART is applied to simulate single-end reads to generate final FASTQ files. Simulation 3 Simulation 4 Figure S3: UMI observed abundance distribution of simulation datasets (errors uncorrected). Simulations 1 and 4 were simulated with PCR efficiency 0.5 in 10 cycles. Simulations 2 and 3 were simulated with PCR efficiency 0.9 in 7 cycles. There are UMI collisions in Simulations 3 and 4. The red vertical line is the ρ selected by our proposed method.

Simulation 1
Simulation 2 Simulation 3 Simulation 4 Figure S4: UMI abundance distribution of simulation datasets after removing sequencing errors with UMI-unaware AmpliCI. See legend of Figure S3 for more details about the simulations.  Figure S5: UMI observed abundance distribution of V1V2 and V3 datasets (errors uncorrected). V3 is highly right-skewed, so (c) focuses on the left tail for UMI with observed abundance at or below 500. The red vertical line is the ρ selected either by our proposed method (V1V2) or by eye (V3).  Figure S7: Abundance estimation of UMI-unaware AmpliCI on simulated data. AmpliCI was performed to denoise whole (UMI + haplotype) sequences. Deduplicated abundances were count as number of denoised haplotypes (with UMI unattached). See legend of Figure 3 for more details.  Figure S11: Venn diagrams comparing recovered haplotypes (with deduplicated abundance no less than two) on two subsets of V3 (Caskey et al., 2017) by DAUMI (ρ = 10), Calib, UMI-tools, Starcode-umi and Naïve methods, made by VennDiagram R package (v1.6.20) (Chen and Boutros, 2011). Results of the first out of five random partitions is shown.       : centered scaled distance between ρ and expected amplified abundance, (ρ − µ)/s. The theoretical mean (µ) and variance (s 2 ) and autoselected ρ are reported in Table S2. Comparable results for H unknown in Figure 3 and Tables S5-S6. The best performer for each dataset is bolded unless there is a tie of more than two values of ρ.     Table S11: DAUMI default settings are conservative for spike data. We report performance of DAUMI, UMI-tools and Naïve method on one single cell (cell barcode: TCGTAGACCAGATTCG) from the single-cell molecular spike dataset. DAUMI (AIC) shows results of DAUMI with a larger initial UMI set U, prepared by AmpliCI with option --useAIC, which more liberally accepts haplotypes than the default BIC. Columns are as described for Figure 5 in the main text.