Evaluating and Improving Small Subunit rRNA PCR Primer Coverage for Bacteria, Archaea, and Eukaryotes Using Metagenomes from Global Ocean Surveys

ABSTRACT Small subunit rRNA (SSU rRNA) amplicon sequencing can quantitatively and comprehensively profile natural microbiomes, representing a critically important tool for studying diverse global ecosystems. However, results will only be accurate if PCR primers perfectly match the rRNA of all organisms present. To evaluate how well marine microorganisms across all 3 domains are detected by this method, we compared commonly used primers with >300 million rRNA gene sequences retrieved from globally distributed marine metagenomes. The best-performing primers compared to 16S rRNA of bacteria and archaea were 515Y/926R and 515Y/806RB, which perfectly matched over 96% of all sequences. Considering cyanobacterial and chloroplast 16S rRNA, 515Y/926R had the highest coverage (99%), making this set ideal for quantifying marine primary producers. For eukaryotic 18S rRNA sequences, 515Y/926R also performed best (88%), followed by V4R/V4RB (18S rRNA specific; 82%)—demonstrating that the 515Y/926R combination performs best overall for all 3 domains. Using Atlantic and Pacific Ocean samples, we demonstrate high correspondence between 515Y/926R amplicon abundances (generated for this study) and metagenomic 16S rRNA (median R2 = 0.98, n = 272), indicating amplicons can produce equally accurate community composition data compared with shotgun metagenomics. Our analysis also revealed that expected performance of all primer sets could be improved with minor modifications, pointing toward a nearly completely universal primer set that could accurately quantify biogeochemically important taxa in ecosystems ranging from the deep sea to the surface. In addition, our reproducible bioinformatic workflow can guide microbiome researchers studying different ecosystems or human health to similarly improve existing primers and generate more accurate quantitative amplicon data. IMPORTANCE PCR amplification and sequencing of marker genes is a low-cost technique for monitoring prokaryotic and eukaryotic microbial communities across space and time but will work optimally only if environmental organisms match PCR primer sequences exactly. In this study, we evaluated how well primers match globally distributed short-read oceanic metagenomes. Our results demonstrate that primer sets vary widely in performance, and that at least for marine systems, rRNA amplicon data from some primers lack significant biases compared to metagenomes. We also show that it is theoretically possible to create a nearly universal primer set for diverse saline environments by defining a specific mixture of a few dozen oligonucleotides, and present a software pipeline that can guide rational design of primers for any environment with available meta’omic data.

A mplicon sequencing is a powerful tool for understanding microbial community composition and dynamics in the oceans and other ecosystems (1), but the PCR amplification step is potentially biased due to both technical issues during amplification and mismatches to organisms found in natural environments (2)(3)(4)(5)(6). Despite these concerns, PCR amplicon sequencing retains several advantages that make it desirable to investigate and correct biases. First, it is a high-throughput and low-cost technique, making it suitable for large numbers of samples, e.g., for global surveys of sediment, water, animal-associated, and other microbial communities (1).
Second, the targeted nature of the PCR assay means relatively small numbers of sequences are sufficient for detecting rare organisms even when we have no genomes from any of their relatives, due to the conserved nature of the molecule and the existence of comprehensive small subunit (SSU) rRNA sequence databases. Being able to quantify the abundance and dynamics of rare organisms is important for understanding ecosystem function since many rare bacteria have impacts far greater than their low abundances might imply (7,8). While PCR assays are targeted, we note that this does not imply they need to be taxonomically restricted. For example, there are some primer binding sites in the SSU rRNA molecule that are nearly universally conserved between Archaea, Bacteria, and Eukarya (discussed further below).
Third, using untargeted metagenomic sequencing for taxonomic profiling still has a number of significant disadvantages versus amplicon sequencing. Direct taxonomic assignment of randomly sheared metagenomic reads by recruitment to reference genomes has the potential to assign taxonomy at a very high resolution. However, the best currently available genome databases still recruit only ;40% of metagenomic reads for marine prokaryotes (9), meaning a significant fraction of untargeted metagenomic data sets are taxonomically uncharted. This problem is even more acute for environmental eukaryotes with large amounts of noncoding genomic DNA and fewer genomic references. It is also possible to extract SSU rRNA (or other marker genes) from metagenomes to generate a taxonomic profile. However, SSU rRNA fragments typically represent only a small fraction of a given data set (;0.1%), so this is a far more costly way to obtain a comprehensive community profile. In addition, because metagenomically retrieved marker genes are randomly sheared fragments covering both conserved and hypervariable regions and are in most cases too short to provide a unique match to reference sequences, they must be assembled or otherwise clustered using nondeterministic algorithms. Because of this, the taxonomic resolution of shortread metagenomic marker gene data is limited (e.g., near family or genus level for the bestknown conserved markers like rRNA). In contrast, with rRNA amplicons, modern "denoising" algorithms produce exact amplicon sequence variants (ASVs) that are stable biogeographic markers that can be intercompared without reanalysis (10). The resulting data sets are also relatively simple to analyze since they consist of a single gene region and can be comprehensively classified with databases such as SILVA or RDP (11,12).
Recent studies have shown that PCR amplicon sequencing of mock microbial communities can recover known relative and absolute abundances extremely well and thus provides accurate quantification of natural gene copy abundances (3,13,14). In turn, this accuracy allows amplicon data to serve as a ground-truth for modeling/ecological studies by quantifying community members and their dynamics. While welldesigned primers have the potential to recover truly quantitative data, these studies also underscored the critical fact that a single mismatch between the primer and template sequences can have a dramatic effect on the measured community composition in complex natural mixtures. For example, a single internal mismatch in the popular Earth Microbiome Project primer set caused an ;10-fold bias against the most common bacteria in seawater (SAR11 cluster), and terminal 39 mismatches can completely prevent amplification (2,3,6,13). Since the extent of this bias is not easily predicted, PCR primers should incorporate degenerate bases or specific oligonucleotide variants so that all targeted organisms are perfectly matched without overly diluting the common perfect matches in primer mixtures. This is especially critical for abundant taxa such as SAR11 as distortions in their relative abundances will skew the remainder of the community, but it is also important to consider for rare taxa which might have essential biogeochemical or ecological roles.
Previous studies have shown high coverage of natural taxa is possible by designing moderately degenerate primers without sacrificing specificity or PCR efficiency (3,13). This primer design was accomplished by comparing oligonucleotide sequences to a SSU rRNA database such as SILVA or RDP (11,12) and then checking for mismatches to organisms known to be abundant in the environment of interest. This approach led to marked improvements in primer design, for example by Apprill et al. (2) and Parada et al. (3) who reported that small modifications to existing primers could better quantify the dominant marine taxa SAR11 and Thaumarchaea. However, in these primer evaluations, the reliance on full-length references and giving each sequence in a database equal weight can lead to a distorted perspective of the actual extent of matches and mismatches expected in real samples since they do not take into account the highly unequal abundances in nature. In addition, some environments may have abundant taxa poorly represented or unrepresented in these curated reference databases.
Wear et al. (6) studied the effect of these potential biases empirically by testing 4 primer sets currently in broad use by marine microbiologists on a 16S rRNA mock microbial community derived from natural seawater communities near Santa Barbara, CA. These authors tested the effect of their analysis pipeline on recovery of mock community sequences for different primers and found their primer-pipeline combination had several sequence-specific biases, in some cases due to a primer mismatch. While this was an important step toward cross-comparing primers in an oceanographic context, Wear et al. (6) noted that results are specific to their mock microbial community, analyzed with their particular pipeline, and thus it remains unknown how representative their results are for other ecosystems and other pipelines. More specifically, it is currently unknown how many primer-mismatched rRNA sequences occur across diverse oceanographic environments, a key piece of information for designing field measurements and interpreting results.
To more fully evaluate the extent to which primers currently in broad use by the marine microbiology community perfectly match naturally occurring sequences, we developed a workflow to conduct in silico primer evaluations with metagenomic SSU rRNA fragments, under the general assumption that metagenomes have fewer methodological biases than PCR. While it is already known that some primers have better coverage for abundant marine organisms than do others (3,13), a precise quantitative comparison of these primers with others that are in broad use has not yet been conducted.
We developed a new software pipeline for analyzing globally distributed marine metagenomic data sets (see Table S1 at https://osf.io/gr4nc/) to make a quantitative and objective evaluation of PCR primers for pelagic marine ecosystems and to suggest specific improvements based on this evidence. We further experimentally compared the quantitative performance of SSU rRNA amplicon sequencing, performed for this study, against the published shotgun metagenomic data from the same sample DNA from the BioGEOTRACES study (20). Our pipeline also allowed us to make intercomparisons between primer sets often not considered together. For example, we compared the performance of two "universal" (3-domain) PCR primers that amplify both 16S and 18S rRNA with primers designed to specifically target only 16S or 18S rRNA and exclude the other molecule.
Because our software pipeline is broadly applicable for any environment where metagenomic data exist, it will allow investigators to design environment-specific PCR probes that recover as many (or as few) targeted organisms as desired. It is also flexible, allowing design or improvement of primers from different variable regions either to retain stable intercomparisons with existing data or to target a region of the SSU rRNA molecule that has better taxonomic resolution for taxa of interest (15). Our approach has the added advantage of providing a quantitative, evidence-based framework to identify which taxa may have been affected by biases in existing data sets generated with suboptimal primers-information that is often unknown or only recognized anecdotally by specific investigators.

RESULTS AND DISCUSSION
Overall primer coverage for pelagic ocean data sets. To evaluate the real-world extent of primer biases, we compared PCR primers currently in broad use by the marine microbiology community (see Table S2 at https://osf.io/gr4nc) to SSU rRNA sequences extracted from a geographically diverse set of pelagic ocean metagenomes ( Fig. 1; see also Table S1 at https://osf.io/gr4nc). These analyses were based on the percentage of perfect matches between primers and metagenome sequences as a metric to determine potential PCR performance, since previous studies have shown that even a single mismatch can significantly bias results (3,13).
We observed relatively even recruitment of metagenomic reads across the SSU rRNA molecule, which indicates that each different primer region has broadly similar potential to generate accurate taxonomic profiles of naturally occurring organisms (see supplemental results at https://osf.io/gr4nc). However, when we compared the precise sequences derived from the primer-binding regions to the oligonucleotide primer sequences, we observed that predicted median coverage of perfect matches varied widely among primer pairs (see Table S2 at https://osf.io/gr4nc), with 515Y/ 806RB and 515Y/926R showing the most consistently high incidence of perfect matches (Fig. 2). The ability of all primers to capture the underlying diversity in the metagenomic samples can be improved to some extent by increasing degeneracies, though this varied significantly between primer sets. Adding a single additional degeneracy fixes the majority of mismatches for some primers (e.g., 785R; discussed further below), whereas others have greater than 2 mismatches to metagenomic sequences and thus would require more extensive modifications (see Fig. S1 to S28 at https://osf .io/gr4nc). Below, we discuss the performance of these primers across 4 broad taxonomic groups (Archaea, Bacteria, Cyanobacteria 1 plastidal 16S rRNA, and Eukarya). We separated Cyanobacteria 1 phytoplankton plastidal 16S rRNA from bacterial 16S rRNA since these organisms are responsible for the vast majority of marine primary productivity and thus are important to quantify accurately. For Archaea, the primer with the best median coverage was the EMP 515Y/806RB combination (coverage = 0.996), followed by 515Y/926R (0.982), and 926wF/1392R (0.952). We also note that our results suggest the 341F/785R primer combination could amplify most Archaea with the addition of several degeneracies in 341F (see Fig. S1 to S7 at https://osf.io/gr4nc).
For Cyanobacteria and chloroplast 16S rRNA (derived from eukaryotic phytoplankton), the primer pair with the highest median coverage was 515Y/926R (0.992), followed by 515Y/806RB (0.881), 27F/338R (0.826), 926wF/1392R (0.714), and 341F/785R (0.702). We note that the vast majority of these mismatched sequences come from chloroplast 16S rRNA sequences-the common cyanobacterial groups Prochlorococcus and Synechococcus have high coverage with all primers listed above, so this is only a concern for those wishing to use chloroplast 16S rRNA data to quantify eukaryotic phytoplankton.
For Eukarya, the primer pair with the highest median coverage was 515Y/926R (0.883), followed by V4F/V4RB (0.822), 926wF/1392R (0.792), and 1389F/1510R (0.732). The distributions of coverage were considerably broader for Eukarya versus all other 16S rRNA categories, which may be due to a higher sequence heterogeneity among eukaryotic genomes. As above, all primer sets could be improved with the addition of some degeneracy (see Fig. S22 to S28 at https://osf.io/gr4nc), but approaching perfect coverage seems less practically achievable for eukaryotic SSU rRNA sequences. In addition, our analysis indicates that improvements in eukaryotic coverage are most likely to come from targeting universally conserved rRNA regions, such as those containing on the x axis, a situation approached by the UNIV V4-V5 primer for Cyanobacteria 1 plastid 16S rRNA, for example. Note that the "All SSU rRNA" panel represents all of the SSU rRNA targeted by a particular primer set (e.g., for 1389F/1510R this includes only Eukarya while for 926wF/1392R it includes all 4 taxonomic groups). Data are corrected for predicted taxonomic overlap between forward and reverse primers. Coordinates refer to locations of primer alignments to the Escherichia coli 16S rRNA gene (strain K-12, substrain MG1655).
We note, however, that recognizable dinoflagellate chloroplast rRNA sequences were almost completely missing from the metagenomic sequences (for all primer regions, not just 515Y/926R) and thus may not appear in the resulting amplicons regardless of primer choice. This is possibly due to the unusual genomic organization/subcellular localization of chloroplast genes in dinoflagellates (16), or alternatively because we lack sufficiently accurate or comprehensive databases for identifying dinoflagellate chloroplast 16S rRNA (discussed further below). It is possible, however, to recover and confidently identify dinoflagellate 18S rRNA sequences from the 515Y/926R primer pair (17), making it feasible to reconstruct a holistic picture of phytoplankton community composition/abundance in combination with the chloroplast 16S rRNA.
In summary, the primer pairs with the highest environmental coverage were 515Y/ 926R (universal), 515Y/806RB (prokaryote-only), and V4F/V4RB (eukaryote-only). It is worth emphasizing that both universal primer sets have the potential to be equally or more comprehensive for Eukarya than the Eukarya-specific primer sets tested here. In addition, for those wishing to quantify the abundance of eukaryotic phytoplankton using chloroplast data, the 515Y/926R primer set nearly perfectly matches all environmental sequences. These results are a quantitative confirmation that the careful primer design previously reported for the 515Y/806RB/926R primers has resulted in very high overall coverage in oceanic ecosystems. On the other hand, certain primer sets (e.g., 341F/785R) have very low coverage due to mismatches to dominant organisms, despite being designed to maximize environmental coverage (18). Resulting data may thus be distorted, though we note that removing taxa with mismatches to primers from processed data will recover accurate relative abundances for the remaining taxa (13), due to the fact that biases are thought to be taxon specific (19). This relative abundance correction approach depends on our pipeline's specific identification of mismatched taxa and provides a way to recover useful quantitative information from legacy data sets that may have been biased during PCR. In combination with ecosystem-specific full-length 16S rRNA databases, this correction approach could represent an important tool for integrating historical amplicon data sets and modern long-read sequencing data into a single intercomparable framework for observing longer-term changes in ecosystem structure (15). In addition, by eliminating or at least controlling for primer bias, our approach could help better evaluate how well particular primer regions perform with respect to taxonomic profiling and classification since it would tease apart the effect of variable region from amplification artifacts.
A quantitative intercomparison between SSU rRNA amplicon sequencing and metagenomics. In order for amplicon sequencing to become useful as a quantitative tool for measuring natural gene abundances, it is desirable to benchmark amplicon abundances against an external reference that is not affected by potential PCR biases to gain a more objective determination of how well a primer set recovers true gene abundances. While shotgun metagenomics is not necessarily free of all biases, these biases are distinct from those in PCR-based amplicon studies; thus, if the two data types yield similar community composition patterns, we can be more confident about our overall conclusions. We tested whether amplicon-based strategies recover similar patterns as metagenomes by generating PCR amplicon sequences for a subset of 272 BioGEOTRACES (20) samples (cruises GA03 and GP13). We used the same DNA, eliminating extraction biases as a potential confounding factor in this intercomparison. We were thus able to make a direct "apples-to-apples" quantitative comparison between the relative abundances of organisms determined with amplicons and those determined with metagenomic reads.
To accomplish this, we compared exact amplicon sequence variants (ASVs) from the 515Y/926R primer pair with metagenomic reads from the same region of the SSU rRNA molecule. To make a more robust quantitative comparison, ASVs were clustered as necessary to account for the fact that short (150-bp) metagenomic reads could in some cases be attributed only to a broader phylogenetic group (e.g., Prochlorococcus and SAR11). The summed abundances of all taxonomic groups, as determined by ASVs and metagenomic reads, were then compared using a linear least-squares regression. For 16S rRNA sequences, we observed very similar community composition for both techniques, as shown by very high R 2 values and coverage in Fig. 3A (median R 2 = 0.98; range = 0.55 to 1.0). This indicated that the ASV sequences were recovering the same diversity of organisms present in metagenomic reads from the primer region and is consistent with the above-mentioned observation of even coverage of metagenomic reads across the SSU rRNA molecule.
These patterns were generally consistent across the surface-mesopelagic transition where there is a major change in community composition (Fig. 3). We do, however, note that at depths of .150 m about 38% of samples had R 2 values below 0.95 (Fig. 3A), versus 9.2% for samples at depths of ,150 m. These lower correspondences are unlikely to be due to mismatches to the primer region, since predicted primer coverage is uniformly high across depth for the 515Y/926R primer pair across depth (Fig. 3B). Other factors such as DNA quality or quantity may have been responsible for this, but we cannot rule out the possibility that it involves characteristics of the sequences between the priming regions such as secondary structure or expanded loops known from some eukaryotes (21). Regardless, these results show that amplicon sequencing and metagenomic profiling typically have a very high quantitative correspondence with the primers tested here (515Y/926R). Additionally, we provide a software toolbox for making further intercomparisons between arbitrary primer sets and metagenomes that could be useful for optimizing both PCR amplicon sequencing and metagenomics for diverse environments.
Improving existing primers. In addition to quantifying mismatches, our pipeline also identifies the specific primer region variants found in particular naturally occurring taxa and their coverage across data sets. This information can guide improvements for any of the primers tested here, and we identify two specific use cases. First, some of the primers identified as having major mismatches are dominated by a few organisms, and thus small modifications are all that is necessary to make them more appropriate for oceanic ecosystems. Second, we were interested to know whether for the betterperforming primers it would be possible to identify a combination of oligonucleotides that would perform well across all marine environments tested here. Such a primer set could have the advantage of being able to monitor rare but biogeochemically important taxa that may become more abundant due to climate change, e.g., as a result of expanding oxygen minimum zones.
To illustrate how a primer with poor predicted performance could be improved, we use the case of 785R, which has a mismatch to the dominant organism SAR11. Because the SAR11 mismatches represent up to 40% of total 16S rRNA reads (see Table S3 at https://osf.io/gr4nc), this one mismatch will have a major effect on the resulting data but could be corrected by adding a single additional degeneracy at position 7 in the primer (59-GACTACNVGGGTATCTAATCC). However, 785R also has mismatches to a number of abundant chloroplast sequences (Fig. 2), which could be corrected with two additional degeneracies (59-RAYTACNVGGGTATCTAATCC). Compared to the original primer (59-GACTACHVGGGTATCTAATCC), these primers would require the synthesis of a larger number of oligonucleotide combinations (12 and 48 versus 9 originally) but are well within the number of combinations currently in use for other environmental amplification strategies (22). Interestingly, despite this being the worst-performing primer for marine ecosystems due to mismatches to the dominant taxon SAR11, we predict that by adding the above degeneracies this primer would achieve nearly perfect coverage for Bacteria (see Fig. S15 to S21 at https://osf.io/gr4nc). This suggests the primer-binding region of 785R is highly conserved among Bacteria, consistent with the initial publication describing this primer as the best overall for prokaryotic taxa across diverse environments (18). This observation also underscores the value of comparing primer sequences with environmental data since it uncovered a significant, unexpected mismatch to an abundant taxon that can now be corrected.
We also investigated whether it is possible to improve primer sets that already perform well (e.g., 515Y/806RB/926R) to the point where they offer near-universal coverage of all taxa identified in this study. To do so, we identified data sets that had significant potential for coverage improvement (see Fig. S1 to S28 at https://osf.io/gr4nc) and certain rare order-level taxa that had many mismatches or are known to be biogeochemically important.
Our data showed that the 806RB primer performed extremely well overall for 16S rRNA sequences but did miss a number of 16S rRNA chloroplast sequences mostly from the Mamiellales clade, which includes small picoeukaryotes such as Ostreococcus and Bathycoccus. As shown in Fig. S8 to S14 at https://osf.io/gr4nc, these taxa were still missed when allowing for up to 2 mismatches, meaning that they are unlikely to be accurately quantified with the current primer design. Alterations would be possible, but we note that this may be an intentional design choice, since chloroplast amplification may be undesirable in certain environments (23) but is often very useful in euphotic zone marine water samples where it can be used to quantify primary producer abundance. Related to this, we note that a context-dependent advantage of 806RB is that it lacks an 18S rRNA binding site. This means that in combination with 515Y, nuclear 18S rRNA will not be amplified with 806RB. In other words, even though the variable region covered by 515Y/806RB contains eukaryotic sequences, this primer set will not recover them as amplicons-demonstrating that primer design places a first-order constraint on how well a given amplicon product reflects the natural community. In contrast, the 515Y/926R combination will amplify these eukaryotic sequences. In a practical sense, amplifying 18S rRNA sequences with 16S rRNA can present problems, for example, in host-associated microbiome studies where 18S rRNA could overwhelm targeted 16S rRNA. This overwhelming of libraries by 18S rRNA sequences is, however, unlikely to be a major issue for the 515Y/926R primer pair for pelagic ocean samples. This is due to the fact that 18S rRNA sequences will generally be selected against in the sequencing process (thus reducing their overall abundances [13]) and the fact that marine environments are generally dominated by prokaryotic microorganisms (13) (see Fig. S29 at https://osf.io/gr4nc).
The 515Y primer was relatively easy to improve, given that its performance was already very robust across diverse data sets (see Fig. S1 to S28 at https://osf.io/gr4nc). We identified four positions where further degeneracies could be added (59-GTGBCAGCMSYCGCGGTMA) and were sufficient to resolve the vast majority of mismatches including to rare taxa such as certain representatives of the Patescibacteria (also known as candidate phyla radiation [CPR] bacteria) which were previously identified as taxonomic "blind spots" in PCR amplicon analysis (5). However, incorporating these new degeneracies produced a relatively large number of oligonucleotide combinations compared to the original primer (48 versus 4 originally), and most of these new variants were not observed in the metagenomic data. We thus decided to redesign the primer based on a different approach, starting from a nondegenerate primer and adding in only variants that could be detected in the natural environment or in the SILVA database. This resulted in a specific mixture of 13 specific oligonucleotides that we term 515Yp-min ("p" for pelagic, "min" for minimal). This mixture of oligonucleotides is tailored specifically to the environmental sequences present in natural marine environments. It maximizes organismal coverage, while keeping the overall degeneracy at a relatively low level compared to other degenerate primers used in environmental microbiology (22). Another key advantage of this tailored approach is that it permits iterative improvements to a primer set over time. For example, we added a variant matching dinoflagellate chloroplast sequences from the PhytoRef database (24) that may potentially improve quantification of these organisms' plastidal 16S rRNA in combination with other modifications to 926R (discussed below). In the future, if we were able to identify other variants explaining why dinoflagellate chloroplast 16S sequences are conspicuously absent from 515Y/926R amplicon data, it would then be feasible to add these new oligonucleotides to the mixture without creating excessive sequence redundancy.
The 926R primer had mismatches to more taxa and thus required more extensive modifications. For example, Table S3 at https://osf.io/gr4nc shows that 926R has mismatches to both Ectothiorhodospirales (an uncultivated group related to purple sulfur bacteria [25]) and the Rickettsiales (an order that includes mitochondrial sequences in addition to free-living organisms [26]). While less abundant overall, we also noted mismatches in the Tara Oceans data set to Brocadiales, the order containing anammox bacteria (27). In the Guaymas Basin sediment samples, some of the dominant mismatches were to the Campylobacteria, dominant chemoautotrophs in many sulfidic systems (28), as well as unusual Archaea that are now recognized to be much more diverse than previously thought and missed by some PCR primers (29). Accurate quantification of these rare taxa would allow the monitoring of unusual but potentially significant changes in biogeochemistry. For example, the Campylobacteria are known to be abundant in certain low-oxygen microenvironments such as in sediment trap particles (30), and the Brocadiales are abundant in oxygen minimum zones (27). Being able to better quantify these "sentinel taxa" could allow us to monitor the dynamics of these low-oxygen environments that are likely to expand under global climate change.
Following the same approach discussed above for 515Yp-min, we developed a mixture of 38 oligonucleotides ("926Rp-min") that contains perfect matches for abovementioned taxa, primer variants that have .2% overall relative abundance across all environments, and 2 sequences from dinoflagellate chloroplasts that were not previously covered. This set of oligonucleotides is comprehensive for all the environments tested in our study and would likely result in major improvements in quantification of all of the above-mentioned taxa. To achieve the same organismal coverage by adding more degenerate sites would increase the total number of primer variants to at least 144 (compared to 16 in the original primer). Not only would this dilute perfectly matching oligonucleotides and be unlikely to result in efficient PCR amplification, it would also include many redundant oligonucleotides unnecessary with an explicit oligonucleotide mixture. The improvements in coverage for this new mixture may be particularly apparent for Rickettsiales templates, many of which have a single mismatch at the 39 end of the primer that is known to completely prevent PCR amplification (6). Since the order Rickettsiales contains sequences from mitochondrial SSU rRNA, this modification will potentially allow for more effective quantification of protist abundances, similar to how chloroplasts quantify marine protistan phytoplankton.
Since it is beyond the scope of this paper to provide exhaustive recommendations for each primer set, we direct interested readers to an Open Science Foundation repository which contains all necessary output files, scripts, and a suggested workflow for improving primers using metagenomic data (https://osf.io/gr4nc/). Improvements should be possible for most taxon/primer combinations since a few mismatches typically dominated, with a long tail of rare variants that may derive from sequencing errors or pseudogenized SSU rRNA. This information would allow an interested reader to optimize one of the primers investigated here for particular taxa or environments of interest.
Moving forward, we recommend oceanographers consider applying these two new primer mixtures for pelagic water column surveys to maximize organismal coverage (515Yp-min, 926Rp-min; see Table S4 at https://osf.io/gr4nc/). By correcting for 39 mismatches that lead to taxonomic "blind spots" and including exact matches to taxa found in low-oxygen regions, we should be able to better quantify the abundance and dynamics of organisms critical to marine carbon, nitrogen, and sulfur cycling. In addition, by representing primers as a specific mixture of oligonucleotides rather than a single sequence with degeneracies, it leaves open the possibility of small modifications in the future to improve coverage; the same is not true for fully degenerate primers, where each new degeneracy will multiply the total number of combinations by at least 2-fold and further dilute the other perfectly matching oligonucleotides. If proven to be as quantitative as the original 515Y/926R primers (3,13), these new mixtures would provide an affordable, comprehensive (all organisms from Archaea to zooplankton), and scalable method to measure whole-community compositions across time and space in the world's oceans. While further validation is necessary, we believe our results and those of previous studies (3,(13)(14)(15) demonstrate the potential for PCR amplicon methods to accurately quantify natural marker gene abundances in a robust and accurate manner, either by recovering true relative abundances or in combination with internal standards to obtain absolute quantification (14). This would allow confident measurement of microbial community composition alongside well-developed methods such as those for quantifying phytoplankton photosynthetic pigments. In turn, this would help expand our understanding of how oceanic microbes interact with biogeochemical cycles and respond to global climate change.
Finally, our bioinformatic approach could be applied elsewhere to generate ecosystem-specific primer sets. For example, it remains unknown whether the patterns we observed for seawater would be true in other systems, such as terrestrial sediment, soil, freshwater, or animal/plant-associated microbiomes. Our reproducible workflow, combined with expert curation and broad environmental sampling, could ensure that taxa of interest are accurately quantified by oligonucleotide primers for any environment where deeply sampled metagenomes could be used as a database. In addition, our approach could allow for the rational evaluation and improvement of new primers for next-generation long-read amplicon sequencing, or to improve the performance of primers used to identify animal species of economic or conservation importance. By evaluating primer coverage objectively and providing simple ways to iteratively improve existing primer sets, it would ensure that the potential of these techniques for monitoring microbiomes relevant to ecosystem functioning and human health is fully realized.

MATERIALS AND METHODS
Sample scope/biogeographic distribution and evaluated primers. We analyzed globally distributed metagenomic data sets that reflect a range of depths and nutrient regimes ( Fig. 1; see also Table  S1 at https://osf.io/gr4nc) and tested primers that are commonly used in environmental microbiology (see Table S2 at https://osf.io/gr4nc) (2-4, 18, 31-35). Metagenomic samples included worldwide sampling expeditions such as Tara Oceans (36), the Malaspina expedition (37), and BioGEOTRACES (20). Among these three campaigns, the Malaspina metagenomes focus largely on the deep sea, whereas Tara and BioGEOTRACES sample primarily the sunlit ocean. In addition, we used time-series data from the Hawaii Ocean Time-Series and the Bermuda Atlantic Time-Series (38, 39) (HOT/BATS; both analyzed with BioGEOTRACES as BiHOBA [BioGEOTRACES-HOT-BATS]), a bloom time-series in Monterey Bay (40), and the San Pedro Ocean Time-Series (41) (SPOT). We also included in the analysis two sites representing suboxic/anoxic systems-the Saanich Inlet time-series (42) and samples collected from hydrothermally altered sediments in Guaymas Basin (43). We also generated new 16S/18S rRNA universal amplicon sequences for two BioGEOTRACES cruise transects (GA03/GP13; noted with open circles in Fig. 1) which were used for making intercomparisons between metagenomes and amplicons.
Evaluation of primer coverage/biases using metagenomic reads. In order to evaluate the performance of our primer set compared to others, we used metagenomic reads to estimate the fraction of SSU rRNA fragments that could be successfully targeted by (i.e., perfectly match) each primer set. In brief, our pipeline retrieves SSU rRNA from fastq-formatted shotgun metagenomic samples, removes sequences containing uninformative repeats, sorts into four organismal categories (Archaea, Bacteria, Cyanobacteria [including plastidal 16S rRNA], and Eukarya), and aligns these sequences to a group-specific reference SSU rRNA sequence (e.g., Saccharomyces cerevisiae for Eukarya). To obtain sequences overlapping each oligonucleotide primer and to exclude nonprimer reads, coordinates provided in the config file were used to extract reads from alignments that overlapped the primer-binding region plus 5 leading or trailing bases. These reads, which represent SSU rRNA fragments with a primer-binding region, were then quality filtered and compared with primer sequences (see Table S2 at https://osf.io/gr4nc/) at 0-, 1-, and 2-mismatch thresholds (see Fig. S1 to S28 at https://osf.io/gr4nc) to identify matched and mismatched sequences. As an additional quality-control step, we excluded any aligned read falling in the primer region that did not have a match to the primer at a 6-mismatch threshold (e.g., misalignments or distantly related sequences). Both mismatched and matched reads were then classified using the SILVA 132 database (Archaea, Bacteria, Eukarya) and PhytoRef (11,24) (Cyanobacteria 1 plastidal 16S rRNA). The specific software implementation is described fully in the supplemental material at https://osf.io/gr4nc/. For those wishing to inspect results further, complete plaintext summaries of output and summary plots are available online at https://osf.io/gr4nc/.
These computational steps were implemented with the Snakemake workflow engine (44) to create a documented and reproducible pipeline which is freely available (https://github.com/jcmcnch/MGPrimerEval). It automatically produces tabular output/summary plots and can be applied to new samples and primers by modifying template configuration files. The specific functioning of the three workflows is described in more detail in the supplemental material at https://osf.io/gr4nc/ and on the GitHub page.
PCR amplification and ASV generation. Two hundred seventy-two DNA samples from GEOTRACES cruises GA03 and GP13 were used to generate PCR amplicons and represent either surface (GP13) or surface plus upper mesopelagic water samples (GA03). GP13 is a longitudinal transect in the southern Pacific, whereas GA03 is a longitudinal transect in the North Atlantic. DNA used for PCR amplicons was the exact same DNA material used to produce the metagenomes described by Biller et al. (20) except that it was diluted to 0.5 ng/ ml in low-EDTA Tris-EDTA (TE) buffer prior to amplification. DNA was amplified with the 515Y/926R primers (515Y, 59-GTGYCAGCMGCCGCGGTAA/926R, 59-CCGYCAATTYMTTTRAGTTT) (3) using 0.5 ng of DNA template in a 25-ml reaction mixture with a final primer concentration of 0.3 mM. Primers were part of a larger construct with Illumina adapter constructs/barcodes already included, allowing for single-step library preparation. Thermocycling was as follows: an initial denaturation at 95°C for 120 s followed by 30 amplification cycles of 95°C for 45 s, 50°C for 45 s, and 68°C for 90 s and a final elongation step at 68°C for 300 s. Laboratory methods are described in more detail at https://dx.doi.org/10.17504/protocols.io.vb7e2rn. Both 16S and 18S rRNA mock communities (staggered and even versions of both [3,13]) were included in the sequencing run as quality controls. Amplicons were sequenced with HiSeq 2 Â 250 RapidRun technology at the University of Southern California (USC) sequencing center in a run pooled with shotgun metagenomic samples. Raw basecall data were demultiplexed with bcl2fastq (-r 20 -p 20 -w 32 -barcode-mismatches 0) (https://support.illumina.com/ sequencing/sequencing_software/bcl2fastq-conversion-software.html) using the 6-bp reverse indices and subsequently with the forward 5-bp barcode with cutadapt (48) according to the templates provided at https:// github.com/jcmcnch/demux-notes. Raw data from this step of the pipeline are available on NCBI (see below).
Sequence data were analyzed with a custom in silico pipeline available at https://github.com/ jcmcnch/eASV-pipeline-for-515Y-926R and based on qiime2 2019.4 (45). Briefly, 16S and 18S rRNA sequences were first partitioned using a custom 16S/18S rRNA-specific database derived from SILVA 132 Using Metagenomes To Improve PCR Amplicon Primers and PR2 (11,46) which is available at https://osf.io/e65rs/. After partitioning, primers were removed from amplicons allowing up to 20% mismatches to the primer, and raw sequences were imported into qiime2 and denoised with DADA2 (10). Taxonomy was assigned using a naive Bayesian classification algorithm with the SILVA 132 database subsetted to the amplicon region as a reference (47). Sequences identified as chloroplasts were reannotated with PhytoRef (24), and 18S rRNA sequences were additionally annotated with PR2 (46). A step-by-step protocol for these informatic steps is available at https://dx .doi.org/10.17504/protocols.io.vi9e4h6. After denoising and annotation, biom tables were exported with taxonomy as tsv files and converted to relative abundances which were used as input for the metagenome-amplicon comparison.