Single duplex DNA sequencing with CODEC detects mutations with high sensitivity

Detecting mutations from single DNA molecules is crucial in many fields but challenging. Next-generation sequencing (NGS) affords tremendous throughput but cannot directly sequence double-stranded DNA molecules (‘single duplexes’) to discern the true mutations on both strands. Here we present Concatenating Original Duplex for Error Correction (CODEC), which confers single duplex resolution to NGS. CODEC affords 1,000-fold higher accuracy than NGS, using up to 100-fold fewer reads than duplex sequencing. CODEC revealed mutation frequencies of 2.72 × 10−8 in sperm of a 39-year-old individual, and somatic mutations acquired with age in blood cells. CODEC detected genome-wide, clonal hematopoiesis mutations from single DNA molecules, single mutated duplexes from tumor genomes and liquid biopsies, microsatellite instability with 10-fold greater sensitivity and mutational signatures, and specific tumor mutations with up to 100-fold fewer reads. CODEC enables more precise genetic testing and reveals biologically significant mutations, which are commonly obscured by NGS errors.

account for common germline indels and noisy sites. We assume each read is observed independently and use the empirical error rate observed in deep CODEC WGS NA12878 to calculate the likelihood of each observed homopolymer length given the reference length. The sum of per-site MSI posterior probability is used as the sample MSI score and was plotted in Figure 5c.
CODEC-MSI outputs a sample-level MSI score by summing over of the posterior probability (a.k.a. genotype quality (GQ) score) of all MSI-positive loci. Thereby, CODEC-MSI model is essentially an indel (more specifically Short Tandem Repeat (STR)) calling model. By jointly modeling the tumor and normal samples and limiting the genotype to be only AA = Homozygous reference or AB = Heterozygous, it computes the posterior probability of a MS locus being: Without subscript, G = {Gs, Gg} is the combined genotype of somatic and germline mutations; Gs or Gg Î{AA, AB}. D stands for the data, L is the vector of repeat lengths (e.g., len(ATATAT) = 3) for a AT repeat) in tumor, and y is the number of reference fragments in normal. And we ignore a site if the normal contains any non-reference reads. The likelihood function P(D|G) = P(L|Gs, Gg) P(y|Gg) assumes independence of tumor and normal sample. Here, the P(y|Gg) is a binomial distribution with p = 0.5 for AB and p = 0.999 (assuming error rate 10 -3 ) for AA. Since when depth is low, we may not resolve the genotype. However, the larger the y, the more likely that Gg = AA. Similar to Saunders et al. [2], we modeled tumor sample as a mixture of the normal sample with somatic mutations and normal sample as a mixture of germline mutations with noise. In tumor sample, each fragment is independently observed thus have Here, P(Ti = g) is the probability of read i originating from a germline cell and P(Ti = s), a somatic cell. Note that P(Ti = g) + P(Ti = s) = 1 so that we can reduce to a scalar parameter η. Strictly speaking, η should vary locus by locus according to the clonalities. Practically, this is very hard to estimate given the low depth (<4x). Hence, we let η represent tumor fraction which we can estimate using all informative loci. Next, for each read pair, we assume it is generated by one chromosome (or a haplotype) which is again randomly selected with p = 0.5 from a diplotype. We estimate the alternative haplotype from the tumor sample by taking the most probable non-reference haplotype, i.e., the one has the greatest number of read support. We calculate the diplotype to read likelihood by summing over the two possible haplotypes: !(1 $ |G) = : !(5 $ |;) !(;|G) where P(li|H) is the empirical distribution, in this case the observed repeat length distribution conditional on the true repeat length. We used CODEC on NA12878 to estimate the repeat length empirical distribution, represented by the heatmap in Figure 5b. To account for the low depth and potential contamination, we utilize public variant database such as gnomAD and use the population variant allele frequency in the prior probability: P(G) = P(N) P(S), where P(N) and P(S) are Bernoulli distribution with pn = max(popAF, 10 -4 ) and ps = 0.5, when genotype is AB.

Duplex recovery and downsampling to certain family sizes
Custom python scripts were used to downsample in Figure 2b,c and 4i. For duplex recovery and mutant duplex recovery, we subsampled the pre-consensus family-assigned reads (after Fgbio GroupReadsByUmi) per target ( Figure  2c or whole-genome-wide ( Figure 4i) at log spaced fractions starting from 10-4 (np.logspace(-4, 0, 30)) and calculated the number of duplexes or Mutect2-supported mutant duplexes formed at each downsampled fraction, respectively. In this study, this allowed us to understand situations when only limited sequencing was given (e.g., <100 read pairs).
To understand the impact of family size on residual SNV rate, we wrote another python script for downsampling ( Figure 2b). In our sample, the number of duplex consensuses having the exact family sizes (number of pre-collapsed raw read pairs) was limited and thus gave less confident results. Thus, we took advantage of families with strictly larger family sizes and downsampled them to have certain target family size. We also sought to maintain an equal or close ratio between the number of reads from each strand.

WGS coverage calculation and cost comparison
On WGS data from CODEC and standard NGS, we first used 'Picard MarkDuplicates' for deduplication. Then, 'Picard CollectWgsMetrics' with default parameters was used to calculate the coverage. The coverage of CODEC WGS considered all CODEC reads including byproducts. When calculating CODEC correct product coverage, we first removed byproducts (discordant reads) using SAMtools and then ran 'Picard CollectWgsMetrics'. Duplex depth is the number of final duplex bases (passing all CODEC filters) divided by the size of interrogated genomic regions. The numbers of final duplex bases were calculated by single-fragment mutation caller and used as the denominator for calculating residual mutation rates. The numbers of final duplex bases were also used for calculating the cost of CODEC vs. Duplex Sequencing in Figure 2d. The cost of CODEC and Duplex Sequencing were proportional to the number of raw read pairs (CODEC 214M, Duplex Sequencing 305M). The actual cost of Illumina sequencing is assumed at $7/Gb (assuming NovaSeq 6000 S4 and accounting for the Q30 rate).

Fragment-level and base-level filters
1) The number of Q30+Q30 bases (bases from R1 and R2 are both equal to or above Q30) is at least 70% of the overlap length of R1 and R2.
2) The number of discordant bases between R1 and R2 and non-ACGT bases is less than 5% of the total read length. 3) Mapping quality is at least 60. 4) No 5¢ end clipping is allowed. 5) Fragment length is above 30 bp. 6) The secondary alignment score is no more than 50% of the primary alignment score. 7) Each read is allowed to have less than 5 non-N substitutions (base from the read is A/C/G/T and is different from the reference genome). 8) No cluster of mutations near the fragment ends, i.e., three mutations are spaced out less than 15 bp from each other and from either end of the fragment. 9) A read should contain no indel when it is used for calling SNVs. Apart from the fragment-level filters, we have a few base-level filters: 1) both bases from R1 and R2 should have quality scores equal to or above 30 (≥Q30) and should agree with each other, 2) the bases immediately preceding or succeeding of indel should be ≥Q30 and are not N, 3) mutations within 12 bp from the fragment ends are not called, and 4) 95% of the read-pairs in the same family should agree on the mutation that is being called.

Comparing to standard NGS on single-fragment mutation calling
When comparing CODEC WGS vs. standard WGS on detecting mutations from single DNA molecules, we first matched on their coverage. GATK DownsampleSam was used in all the scenarios. For the eight breast cancer tumors used in Figure 4a,g,h, we downsampled the standard WGS tumor samples to match CODEC coverage on per sample basis (range 1.37x to 2.38x). The 15 normal buffy-coat standard NGS samples used in Figure 3c were downsampled to 6x since CODEC coverage has a tighter distribution (5.46x to 6.93x). Note that two of the CODEC samples originally had high coverage 18x and we downsampled them to 5.6x. Lastly, to generate Figure 4e, we downsampled the standard WGS MSI-H tumors from 12x to 1x, with an incremental of 1, and CODEC from 5x to 3x, 1x, 0.5x, 0.25x, 0.125x, 0.05x, and finally to 0.025x. CODEC single-fragment mutation caller was used by the standard NGS data in these scenarios with a couple of slight modifications: 1) We allowed mutations called from the overhang regions where R1 and R2 do not overlap since the fragment sizes of the standard NGS are usually larger. 2) the Q30 + Q30 filter was changed to calculating the ratio of Q30 bases throughout the read pair, again because of the small or 0 overlap between R1 and R2 in standard NGS. Similar to duplex depth of CODEC data, we thus can also calculate the number of final evaluated bases from standard NGS (Supplementary Table 1).

Examining clonal tumor mutation detection with CODEC WGS
We deeply sequenced a tumor biopsy (Figure 4b) (tumor purity = 0.86) using CODEC with 509M raw read pairs. The purity was estimated by ABSOLUTE [3] on 60x standard WGS. We downsampled the data to 10%, 20%,…, 90% of total reads and called mutations using the single-fragment mutation caller. To strike a balance between sensitivity and specificity, we chose to use a Q30+Q30 cutoff at 50%. The rest of the filters were kept the same as described in the earlier section. From all somatic SNVs called from 60x WGS by Mutect2, we selected a set of SNVs (n = 3,408) with cancer cell fraction (CCF, estimated by ABSOLUTE) ≥ 0.9 and within our high-complexity regions (2.3B) as clonal SNVs. When estimating the PPV, we used all SNVs (n = 11,714) as the ground truth without restricting to CCF. For CODEC data, we called SSNVs with at least 1 or 2 mutant duplex support. The theoretical sensitivities were estimated by a binomial distribution with p = 0.86 ´ 0.5, assuming all heterozygous mutations. The projection was through binomial distributions where p was estimated from the data, which we found to be lower than the ideal case.

Sperm DNA purification
The sperm sample was obtained as part of an IRB-approved human subjects protocol at New York University Grossman School of Medicine. Sperm was purified from semen prior to initial freezing by density gradient centrifugation, and after thawing for processing by a second density gradient centrifugation with ORIGIO gradient 40/80 buffer (Cooper Surgical, 84022060) per the manufacturer's instructions. An aliquot of sperm was then centrifuged at 300 RCF for 5 minutes at room temperature. The supernatant was removed, leaving approximately 50 μL of sperm/buffer at the bottom of the microtube. The tube was tapped gently 5 times to break up the sperm pellet before adding lysis buffer. Sperm lysis buffer was prepared by combining (for each sample) 497.5 µL of QIAGEN Buffer RLT (QIAGEN, 79216) without β-mercaptoethanol, and 2.5 μL of 0.5 M Bond-Breaker TCEP Solution (Thermo Scientific, 77720) for a lysis buffer with 2.5 mM TCEP final concentration. Five hundred microliters of sperm lysis buffer were added to the sperm sample, without pipette mixing. One hundred milligrams of 0.2 mm stainless steel beads (Next Advance, SSB02) were then added to each sample and homogenization was performed with the TissueLyser II (QIAGEN) at 20 Hz. DNA was then extracted using the QIAamp DNA Mini Kit (QIAGEN, 51306) with a modified protocol as follows. Five hundred microliters of buffer AL were added to each lysate and vortexed well. Then, 500 μL of 100% ethanol were added and vortexed well. Then, the mixture was applied to a QIAamp DNA Mini spin column and the remaining standard QIAamp protocol was followed. DNA was eluted with 100 µL of 10 mM Tris pH 8. RNase treatment was then performed by adding 12 µL of 10x PBS pH 7.4 (Gibco, 70011044), 2 µL of Monarch RNase A (NEB, T3018L), and 6 µL nuclease-free water. The reaction was incubated at room temperature for 5 minutes and immediately purified using 0.8x SPRI beads with elution using 35 µL of low TE buffer. A methylation-based somatic cell contamination assay was performed [4] to confirm sperm purity.