Random sampling causes the low reproducibility of rare eukaryotic OTUs in Illumina COI metabarcoding

DNA metabarcoding, the PCR-based profiling of natural communities, is becoming the method of choice for biodiversity monitoring because it circumvents some of the limitations inherent to traditional ecological surveys. However, potential sources of bias that can affect the reproducibility of this method remain to be quantified. The interpretation of differences in patterns of sequence abundance and the ecological relevance of rare sequences remain particularly uncertain. Here we used one artificial mock community to explore the significance of abundance patterns and disentangle the effects of two potential biases on data reproducibility: indexed PCR primers and random sampling during Illumina MiSeq sequencing. We amplified a short fragment of the mitochondrial Cytochrome c Oxidase Subunit I (COI) for a single mock sample containing equimolar amounts of total genomic DNA from 34 marine invertebrates belonging to six phyla. We used seven indexed broad-range primers and sequenced the resulting library on two consecutive Illumina MiSeq runs. The total number of Operational Taxonomic Units (OTUs) was ∼4 times higher than expected based on the composition of the mock sample. Moreover, the total number of reads for the 34 components of the mock sample differed by up to three orders of magnitude. However, 79 out of 86 of the unexpected OTUs were represented by <10 sequences that did not appear consistently across replicates. Our data suggest that random sampling of rare OTUs (e.g., small associated fauna such as parasites) accounted for most of variation in OTU presence–absence, whereas biases associated with indexed PCRs accounted for a larger amount of variation in relative abundance patterns. These results suggest that random sampling during sequencing leads to the low reproducibility of rare OTUs. We suggest that the strategy for handling rare OTUs should depend on the objectives of the study. Systematic removal of rare OTUs may avoid inflating diversity based on common β descriptors but will exclude positive records of taxa that are functionally important. Our results further reinforce the need for technical replicates (parallel PCR and sequencing from the same sample) in metabarcoding experimental designs. Data reproducibility should be determined empirically as it will depend upon the sequencing depth, the type of sample, the sequence analysis pipeline, and the number of replicates. Moreover, estimating relative biomasses or abundances based on read counts remains elusive at the OTU level.

139 in both directions were assembled, checked for stop codons and frameshifts, and aligned in 140 Geneious (Biomatters) to build a reference library containing COI barcodes for each of the 34 141 specimens included in the mock sample. The complete list of specimens, including taxonomy 142 and GenBank accession numbers, are provided in Table S1. A PCR reaction performed with the 143 negative control extraction confirmed the absence of contaminants (no band on 1.5% agarose 144 gel).  Table 1) to minimize the likelihood of false read-to-sample 152 assignments due to tag jumping (Schnell et al. 2015). Three PCR reactions (i.e. triplicates) were 153 performed using each of the seven indexed primer pairs for a total of 21 PCRs using the PCR 154 mixtures and touchdown temperature profile described previously (Leray & Knowlton 2015). 155 PCR reactions were prepared in a room free of PCR amplicons. Triplicate PCRs were pooled (for 156 a total of seven pools) to limit stochastic amplification biases, and purified using Agencourt 200 differences in rates of sequence evolution among taxonomic groups. Here, we defined -l 3 and -u 201 4 because it was shown to create OTUs that closely reflect morphological species grouping 202 among marine invertebrates by providing the lowest frequency of false positives (   We assessed bias by calculating the pairwise Jaccard and Bray-Curtis at two levels of 244 replication: primer index replicate and sequencing replicate. Within-primers pairwise 245 dissimilarities were calculated between communities obtained using identical primer indices but 246 sequenced in different MiSeq runs (i.e. primer1/run1 vs. primer1/run2; primer2/run1 vs. 247 primer2/run2). Between-primers pairwise dissimilarities were calculated between communities 248 obtained using different primer indices regardless of the sequencing run (i.e. primer1/run1 vs. 249 primer2/run1; primer2/run1 vs. primer1/run2).

250
Within-run pairwise dissimilarities were calculated between communities obtained within 251 the same MiSeq sequencing run regardless of the indexed primer set used (i.e. primer1/run1 vs. 252 primer2/run1; primer2/run1 vs. primer3/run1). Between-run pairwise dissimilarities were 253 calculated between communities obtained in different MiSeq sequencing runs regardless of the 254 indexed primer set used (i.e. primer1/run1 vs. primer1/run2; primer1/run1 vs. primer2/run2). 255 Pairwise community dissimilarities calculated within any indexed primer result from discordance 256 that occurred after library preparation. Dissimilarities calculated between and within MiSeq runs 257 as well as between indexed primers within a run result from a combination of technical and 258 random sampling artifacts occurring during PCR amplification, library preparation and 259 sequencing (Table 2). To further examine similarities in OTU composition, we calculated 260 hierarchical cluster trees using an Unweighted Pair Group Method with Arithmetic mean 262 dataset 100 times using 75% of the sequences in the smallest sample (34,206). We also 263 visualized Bray-Curtis differences between samples using a principal coordinate analyses 264 (PCoA). The score of each OTU was plotted in 2-dimensional PCoA space to illustrate their 265 influence on the dissimilarities between samples. Finally, we tested differences in OTU 266 composition between primer indices and sequencing runs using permutational multivariate    (Homo sapiens and a rodent) representing 296 a total of 244 sequences were removed from the dataset, leaving a total of 120 eukaryotic OTUs.
297 BLASTn searches against the reference barcode library revealed that 34 of these OTUs 298 corresponded to the 34 species included in the mock sample. We will refer to these OTUs as 299 "target OTUs". The 86 OTUs that did not match any species included in the mock sample are 300 hereafter referred to as "non-target OTUs". Among them, 31 (25.8%) had >97% similarity to 301 GenBank and BOLD sequences while 41 (34.2%) could be confidently assigned to higher 302 taxonomic levels using the Bayesian phylogenetic approach implemented in SAP, leaving only 303 14 OTUs (11.7%) unidentified. As noted above, the 34 target OTUs belonged to six animal phyla 304 (Table S1) Figure S1. Branch tip symbols indicate the method of identification of each OTU. Target: OTU that matched the COI sequence of a species included in the mock sample (referred to as "target OTUs" throughout the manuscript; note that OTUs that did not match any target OTU are referred to as "nontarget OTUs" throughout the manuscript); NCBI/BOLD: OTU that did not match a target OTU but had >97% similarity with a reference COI barcode in NCBI or BOLD; SAP: OTU that did not match a target OTU nor a reference COI barcode in NCBI or BOLD but could be confidently assigned to higher a taxonomic level using a Bayesian phylogenetic approach implemented in the Statistical Assignment Package (SAP); Unidentified: OTU that could not be confidently identified to any taxonomic group using the three approaches detailed above.

Rank abundance plot of OTUs detected in the mock sample
Colours indicate the method of identification of each OTU (see Fig. 2 legend). Target: OTU that matched the COI sequence of a species included in the mock sample (referred to as "target OTUs" throughout the manuscript); NCBI/BOLD: OTU that did not match a target OTU but had >97% similarity with a reference COI barcode in NCBI or BOLD; SAP: OTU that did not match a target OTU nor a reference COI barcode in NCBI or BOLD but could be confidently assigned to higher a taxonomic level using a Bayesian phylogenetic approach implemented in the Statistical Assignment Package (SAP); Unidentified: OTU that could not be confidently identified to any taxonomic group using the three approaches detailed above* indicates the presence of an abundant yet unidentified OTU. Reproducibility of non-target OTUs in indexed PCRs as a function of the total number of reads they represent in each sequencing run "Non-target OTUs" correspond to OTUs that did not match the COI sequence of a species included in the mock sample. Each point represents data for one sequencing run for one nontarget OTU, so that for example, four non-target OTUs found in all seven indexed PCR trials in sequencing run 1 and two non-target OTUs found in all PCR trials in sequencing run 2 had a total number of reads >10 for both sequencing runs.

Jaccard pairwise dissimilarity among indexed PCRs and Illumina Miseq runs
The effect of rare OTUs on Jaccard was evaluated by sequentially removing low abundance OTUs from the rarefied OTU