Impact of DNA Sequencing and Analysis Methods on 16S rRNA Gene Bacterial Community Analysis of Dairy Products

Validated methods are urgently needed to improve DNA sequence-based assessments of complex bacterial communities. In this study, we used 16S rRNA PCR amplicon and gDNA mock community standards, consisting of nine, dairy-associated bacterial species, to evaluate the most commonly applied 16S rRNA marker gene DNA sequencing and analysis platforms used in evaluating dairy and other bacterial habitats. Our results show that bacterial metataxonomic assessments are largely dependent on the DNA sequencing platform and read curation method used. DADA2 improved sequence annotation compared with QIIME 1, and when combined with the Ion Torrent PGM DNA sequencing platform and the Greengenes database for taxonomic assignment, the most accurate representation of the dairy mock community standards was reached. This approach will be useful for validating sample collection and DNA extraction methods and ultimately investigating bacterial population dynamics in milk- and dairy-associated environments.

This approach has enabled the simultaneous identification of the majority of bacteria in complex microbial communities. Although analysis of 16S rRNA gene diversity has provided significant new perspectives on bacterial habitats, there remain challenges to sample preparation, DNA sequencing, and data analysis approaches for ensuring accurate measurements of bacterial populations.
DNA sequencing platforms, including 454 pyrosequencing, Illumina, Ion Torrent, and Pacific Biosciences have also been shown to cause variation in bacterial community assessments (4,(18)(19)(20)(21). Moreover, data analysis methods, especially read clustering approaches (i.e., generating representative sequences), are known to have a significant impact on the interpretation of bacterial composition (21)(22)(23)(24)(25)(26)(27). De novo sequence clustering can result in unstable operational taxonomic units (OTUs) between projects that are composed of different sequences with each clustering iteration (28,29). Reference-based sequence clustering tends to result in fewer sequence variants than de novo methods (21,23), but can still lead to overestimation of bacterial community diversity caused by insufficient read quality control and error filtering (27). As a result, there is now an effort to move away from OTU-based methods toward DNA sequences that represent single nucleotide variation (30,31). One of the amplicon sequence variant (ASV) clustering methods is DADA2 (Divisive Amplicon Denoising Algorithm 2), which builds a quality-based model for filtering error and identifying variation in 16S rRNA gene sequences (26).
Herein, we sought to compare different DNA sequencing, read assembly, and data analysis strategies for the capacity to accurately detect the composition of a mock bacterial community consisting of nine species commonly found in milk. To eliminate biases introduced by sample type and DNA extraction method and focus on close examination of the biases introduced by PCR, sequencing, and bioinformatics analyses, we employed two different mock communities consisting of either organism-specific PCR amplicons or purified genomic DNA (gDNA) (Fig. 1). This approach allowed us to compare the performance of two popular benchtop DNA sequencers (Illumina MiSeq and Ion Torrent PGM), paired-end read assembly of Illumina MiSeq, OTU (QIIME 1 open reference)/ASV (DADA2) analysis methods, and reference taxonomy databases (Greengenes and RDP). Our results showed that the combination of DADA2 and the Greengenes analysis pipeline, paired with Ion Torrent PGM sequencing, results in the most accurate representation of the mock communities.

RESULTS
Comparison of representative sequence analysis of 16S rRNA V4 region reads generated with the Illumina MiSeq and Ion Torrent PGM. A mock community was prepared by combining 16S rRNA V4 region PCR amplicons from nine bacterial strains (Table 1) in equimolar quantities prior to DNA sequencing on the Ion Torrent PGM and Illumina MiSeq instruments. Sequences were either assembled (Illumina MiSeq) or maintained as single-end reads (Illumina MiSeq and Ion Torrent PGM). A low percentage of reads were identified as chimeras (0 to 1.4%) (see Table S1 in the supplemental material), and the remaining reads were analyzed using QIIME 1 (UCLUST) following the open-reference pipeline at a 97% threshold or DADA2 pipelines for OTU or ASV identification using the Greengenes (version 13.8) and RDP (version GOLD for QIIME 1 and version 11.5 for DADA2) reference databases. Total reads after quality filtering for each sequencing and read assembly method are shown in Table S1.
Illumina MiSeq paired-end assemblies. QIIME 1 analysis of Illumina MiSeq pairedend assembled reads with recommended parameters (32) resulted in total OTU numbers that were at least 4.2-fold greater than the expected nine OTUs encompassing strains included in the mock community (Table 2). When the Greengenes database was used for OTU alignment, 85 OTUs were identified. The majority of those OTUs (i.e., 65) were assigned to taxa included in the mock community. Although the numbers of OTUs varied for each taxon, a single OTU representative encompassed the majority (Ͼ 60%) of reads for each of the mock community members (Table 2). For example, out of nine Staphylococcus OTUs identified with Greengenes, 99% of the reads were represented by one OTU. The remaining 20 OTUs identified with Greengenes were either designated as taxa that were not included in the mock community or were designated only to the order level. When the RDP database was used as the reference database for QIIME 1 analysis, the total OTUs decreased to 70, despite the increase in spurious "other"  assignments (Table 2). Nevertheless, the overall OTU number remained considerably higher than the expected nine OTUs based on the mock community composition. The numbers of ASVs identified with DADA2 from Illumina paired-end assemblies were lower than the numbers of OTUs assigned with QIIME 1 ( Table 2). Because DADA2 assigns ASVs independently from taxonomic reference databases, total ASV numbers were the same using both Greengenes and RDP. The only distinction was that one Clostridiaceae ASV and both Escherichia ASVs identified using Greengenes were designated as Clostridium and Enterobacteriaceae in RDP (Table 2).
Illumina MiSeq unassembled single-end reads. Without read assembly, the QIIME 1 pipeline resulted in 68 and 36 total OTUs with the Greengenes and RDP databases, respectively (Table 3). These OTU numbers were lower than the paired-end assemblies ( Table 2). DADA2, on the other hand, resulted in a slightly higher number of ASVs (40 ASVs) than the paired-end assemblies (38 ASVs) (Tables 2 and 3). More reads were regarded as "other" taxa, and this result was most likely due to the shorter lengths of  the unassembled, single-end MiSeq reads. Between the two reference databases, Greengenes resulted in more accurate taxonomic assignments with DADA2. Two Bacillales ASVs and one Clostridiales ASV that were included in the "other" ASV category by RDP were assigned as Bacillus and Clostridiaceae with Greengenes. Moreover, the Escherichia/Shigella ASV in RDP was unambiguously allotted to the Escherichia genus by Greengenes (Table 3).
Ion Torrent PGM reads. The application of QIIME 1 with the Greengenes database to the Ion Torrent reads resulted in the highest number of OTUs out of any of the methods applied (Table 4). The use of RDP in QIIME 1 also yielded high OTU numbers, comparable to those found for the paired-end Illumina MiSeq assemblies (Table 4). Conversely, DADA2 resulted in only 13 ASVs (Table 4). The 4 additional ASVs compared to the expected 9 ASVs were created due to errors in the homopolymer regions (see Fig. S1 in the supplemental material), and the 13 ASVs were distributed across the 9 bacterial taxa included in the mock community, with the exception of one ASV with ambiguous taxonomy (Bacillales) identified using RDP, which was identified as Bacillus using Greengenes. Greengenes also improved the assignment of Escherichia coli to the genus level, as opposed to the family level in RDP (Table 4).
For each of the three DNA sequencing/read curation methods tested, DADA2 assigned fewer ASVs per taxon and resulted in fewer spurious ASVs than QIIME 1 (UCLUST) assigned OTUs, except in Illumina single-end results analyzed with the RDP database. DADA2 taxonomic identification was more specific with the Greengenes than the RDP database. Therefore, the combined DADA2/Greengenes approach was used for the subsequent analyses described below.
Assessments of the gDNA mock community were altered depending on DNA sequencing platform. A gDNA mock community was prepared by mixing equal quantities of gDNA from the nine milk-associated bacterial species prior to barcoded 16S rRNA V4 region PCR amplification (Fig. 1). The PCR products were then used for sequencing on either the Illumina MiSeq or Ion Torrent PGM, followed by analysis with the DADA2/Greengenes method. More chimeras were found for the gDNA mock community (ranging from 0.3 to 4.6%) than the PCR amplicon mock community (Table S1), indicating amplification errors arose from multitemplate PCR. However, except for the known variation in platform-dependent read lengths, nucleotide sequences of the most abundant ASVs assigned to each of the nine mock community species were identical between the Illumina MiSeq (single and paired ends) and Ion Torrent PGM platforms (see Illumina MiSeq paired-end assembly in Fig. S2, Illumina MiSeq single-end reads in Fig. S3, and Ion Torrent PGM reads in Fig. S4 in the supplemental material). Nucleotide sequence alignments of those ASVs to the corresponding ASVs identified from the PCR amplicon mock community also showed 100% nucleotide sequence conservation ( Fig. S2 to S4). For both Illumina MiSeq assembled and unassembled (single-end) reads, the gDNA mock community resulted in high numbers of ASVs (Table 5). These numbers were higher than those found for the PCR amplicon mock community (Tables 2 and 3) and were primarily due to the higher quantities of spurious ASVs (e.g., Clostridiales, Lactobacillus, and Oscillospira) present at low proportions (0.02 to 3.85% of total reads for each ASV) (see Table S2 in the supplemental material). As a result, the Shannon index of the gDNA mock community was elevated compared to the PCR amplicon mock community for both paired-end and single-end Illumina MiSeq results (Fig. 2), and these values were significantly increased compared to the expected ␣ diversity based on mock community composition. Interestingly, the same number of 13 ASVs was found for the gDNA and PCR amplicon mock communities when the Ion Torrent PGM was used (Table 5), and the Shannon index of the gDNA mock community resembled the PCR amplicon mock community expected value (Fig. 2).
Ion Torrent PGM sequencing with the DADA2/Greengenes method resulted in more accurate representations of the gDNA and PCR amplicon mock communities. DNA sequencing approaches were next compared for their capacity to yield the expected ␤ diversity and proportions of bacterial taxa included in the two mock communities. According to UPGMA (unweighted pair group method using average linkages) hierarchical clustering of Bray-Curtis dissimilarity metrics, results from the three DNA sequencing approaches (Illumina MiSeq paired-end assembly, single-end, and Ion Torrent PGM) were all in different clusters compared to the expected bacterial composition, independent of the gDNA or 16S rRNA PCR amplicon community type (Fig. 3A). Conversely, UPGMA of the weighted Unifrac distance metrics clustered the sequences according to mock community type (Fig. 3B). These comparisons showed that the gDNA mock communities sequenced with the Ion Torrent PGM were the most similar to theoretical (expected) proportions. No single method was found best suited for representing the PCR amplicon mock community (Fig. 3B). To assess whether the use of DADA2/Greengenes influenced this outcome, the other data analysis methods were compared, and it was found that the DNA sequencing platform used was consistently influential on mock community ␤ diversity (e.g., QIIME 1 with the Greengenes database is shown in Fig. S5 in the supplemental material). Examination of the relative abundances of individual taxa across the three DNA sequencing approaches showed that for the 16S rRNA PCR amplicon mock community, the proportions of most bacterial species were mostly not significantly altered compared to expected theoretical values. Exceptions to this finding were the reduced proportions of Enterococcaceae and Enterococcus found for Illumina paired-end assemblies and Streptococcus for both Illumina MiSeq methods as well as the Ion Torrent PGM platform (Fig. 4). For the gDNA mock community, the proportions of Escherichia and Streptococcus were significantly different from expected for all three DNA sequencing platforms. The proportions of Pseudomonas were also significantly lower than expected for the single-and paired-end assemblies from the Illumina MiSeq, and the proportions of Bacillus and Lactococcus were also altered for the paired-end assemblies. Lastly, there were higher proportions of "other" taxa for both Illumina MiSeq methods (Fig. 4), especially in the gDNA mock community. Overall, even though DNA sequencing with the Ion Torrent PGM combined with DADA2/Greengenes analysis did not completely provide the expected bacterial composition, this approach resulted in the most accurate representations of the bacteria and their proportions in both gDNA and PCR amplicon mock communities.

DISCUSSION
By comparing DNA sequencing methods, analysis algorithms, and reference databases using dairy relevant bacterial DNA (PCR amplicon and gDNA) mock communities, we found that the DADA2/Greengenes data analysis methods with the Ion Torrent PGM yielded the most accurate interpretations of the 16S rRNA V4 variable region relative to the other methods (Illumina MiSeq, QIIME 1, RDP) tested. This conclusion is notable considering that DADA2 was developed for analysis of Illumina DNA sequence reads (26). Although successfully applied for that purpose (25,33), our findings show that the DADA2 algorithm is compatible with the Ion Torrent reads and error profile. Moreover, our study also offers new and detailed 16S rRNA data comparisons on single-versus multitemplate PCR and single-versus pair-end assembled Illumina reads, which can be broadly informative to benchmark bioinformatics workflows and to the study of bacterial diversity and composition in other microbial habitats besides dairy products.
Application of DADA2 and QIIME 1 (UCLUST) analysis pipelines to the same 16S rRNA gene data showed that DADA2 assigned fewer total and spurious OTUs/ASVs than QIIME 1 even with stringent filtering (32). Because read length was kept consistent FIG 4 Relative abundance of taxa in the 16S rRNA PCR amplicon and gDNA mock communities. Relative abundances of expected taxa are labeled with the corresponding taxonomic level from sequencing results. "Amplicon" represents the 16S rRNA PCR amplicon mock community, and "gDNA" represents the gDNA mock community. The results shown were analyzed following the DADA2 pipeline with the Greengenes database. Each bar represents the mean Ϯ SD from three replicates. Proportions for each community were compared to expected proportions using ANOVA with Bonferroni's multiple-comparison test. P values of Ͻ0.05 were considered to be significantly different from the expected values and are indicated by an asterisk above each bar plot. within each DNA sequencing and data assembly platform, platform-specific differences in OTU/ASV numbers were mainly derived from the core algorithms used for filtering and clustering representative sequences. The DADA2 core algorithm includes errorrate-based denoising, isBimeraDenovo chimera identification, and ASV inference (26). In QIIME 1, the core analysis includes the USEARCH chimera identification and OTU picking strategy (34). Comparison of OTUs and ASVs using the QIIME 1 and DADA2 pipelines, respectively, also showed that the DADA2 analysis pipeline was able to assign ASVs to more specific taxonomic levels (genus) than QIIME 1. This could be the result of the different taxonomy classifiers employed by DADA2 (RDP's naive Bayesian classifier) and QIIME 1 (UCLUST classifier) (35).
OTU and ASV taxonomy assignments were also compared with consideration to 16S rRNA gene reference databases. Results from the different combinations of analysis methods and reference databases showed that the majority of OTUs/ASVs detected were representatives of bacterial taxa included in the 16S rRNA PCR amplicon and gDNA mock communities. Each bacterial species was represented by at least a single OTU/ASV. Additional OTUs/ASVs were largely due to low-abundance sequence variants. DNA sequences of the predominant ASVs/OTUs were 100% identical between gDNA and PCR amplicon mock communities, further supporting the precision of the technique. When QIIME 1 was applied, the RDP_GOLD database (36) yielded lower numbers of total OTUs than found with Greengenes 13.8, independent of whether the Illumina MiSeq or Ion Torrent PGM was used to generate the DNA sequence reads. However, the RDP_GOLD database has not been updated since 2011 (36) and could potentially be missing many bacterial sequences, leading to less differentiation between OTUs. With the DADA2 pipeline, ASVs were inferred prior to taxonomy assignment (26), resulting in the same total ASV numbers for both the RDP 11.5 and Greengenes 13.8 databases. However, assignments of DADA2 ASVs were still influenced by reference databasespecific taxonomic nomenclature and DNA sequences (37), such that Greengenes provided deeper, more accurate taxonomic assignments than those found with RDP.
The Illumina MiSeq and Ion Torrent PGM methods also clearly impacted the outcomes of our mock community analyses. The Illumina MiSeq is well established and known for its low error rate, high-volume read outputs, and low sequencing cost per Gb (38,39). Although Illumina MiSeq reads had higher Phred quality scores, for both single-end and paired-end assembled Illumina MiSeq reads, greater numbers of unexpected taxa and OTUs/ASVs were observed compared to the Ion Torrent PGM. This finding could be the result of differences in library preparation methods, external contamination, index switching (40), and/or substitution errors (41,42) specific to the Illumina MiSeq. To reduce misassigned reads, previous studies have suggested using a dual-index strategy (43) and stringent filtering at the index region (40), as well as sequencing of negative controls for in silico removal of contaminant reads (44). In contrast, the use of the Ion Torrent PGM with our read trimming parameters resulted in the lowest numbers of DADA2 assigned ASVs. At 13 ASVs for both mock communities, this number was only slightly greater than the nine predicted. All 13 ASVs were repeatedly assigned to members of the mock communities, except for one lowabundance ASV when RDP was applied. The four additional ASVs were the result of read errors in the homopolymer regions, a common Ion Torrent error model (20,38,39) that still passed the DADA2 filtering with recommended parameters (https://benjjneb .github.io/dada2/faq.html#can-i-use-dada2-with-my-454-or-ion-torrent-data). This error model could be further reduced by increasing the homopolymer error penalty value. Interestingly, the Ion Torrent PGM reads resulted in the highest numbers of OTUs when QIIME 1 was used to analyze the data. This might have been due to the higher number of erroneous reads that were passed by QIIME 1 filtering, but were identified as sequence chimeras and artifacts by DADA2.
For the PCR amplicon mock community, bacterial diversity analyses based on the DADA2/Greengenes pipeline showed that the results from the Illumina MiSeq were similar to Ion Torrent PGM and the in silico expected values. However, the gDNA mock community relative abundances of certain bacteria in the gDNA mock community were significantly altered compared to expected proportions according to the paired-and single-end Illumina MiSeq methods. This was particularly the case for Streptococcus and Lactococcus. Because Streptococcus and Lactococcus have similar 16S rRNA gene sequences, variation in their relative abundances could be caused by the accumulation of substitution errors, a common error that occurs with the Illumina MiSeq instrument (41,42). No spurious taxa were found in either the PCR amplicon or gDNA mock community in Ion Torrent results analyzed with the DADA2/Greengenes pipeline. In contrast, the gDNA mock community contained 9-fold and 5-fold more "other" spurious taxa compared to the PCR amplicon mock community in Illumina paired-and single-end results, respectively. To this regard, the majority of these taxa (and proportion of reads, Ͼ74%) were assigned to bacterial orders, families, and genera that are highly related to the species included in the mock community (e.g., Clostridiales, Lactobacillus, Oscillospira, and Turicibacter). This, together with the lower numbers of ASVs found for the corresponding PCR amplicon community and single-end results, indicates that errors resulting from paired-end Illumina MiSeq assembly are augmented by combining multitemplate PCR with joining forward and reverse reads. This issue can be mitigated by using single-end reads (as shown by the data here), fewer PCR cycles (33), and increasing the denaturing time (45).
By the use of bacterial DNA standards from nine dairy-relevant bacterial species, we found that DNA sequencing and analysis pipelines contributed significant variations to OTU/ASV distributions and observed bacterial diversities. Moreover, PCR biases and errors from multitemplate DNA amplifications are not entirely filtered with the Illumina MiSeq method. Overall, the Ion Torrent PGM DNA sequencer combined with the DADA2/Greengenes pipeline led to more accurate OTU/ASV assignments and bacterial diversity measurements of the PCR amplicon and gDNA mock communities under our study conditions. The Ion Torrent PGM method is recognized for shorter run times, lower instrument cost, and flexibility in sequencing scale per run by the use of different sequencing chips (38,39). Therefore, this platform could be of particular use to study dairy and other food products with short shelf life times. Moreover, with DADA2 being wrapped in the QIIME 2 platform, we agree with the QIIME 2 developers that new sequencing results should be analyzed using QIIME 2 with a standardized analysis pipeline (e.g., DADA2) instead of QIIME 1 (UCLUST) (46). Further improvements might be reached by refinements to taxonomy classifiers (35), updating reference databases to emphasize bacteria found in different environments, such as dairy foods, and/or testing other reference databases, such as SILVA (37,47,48). Lastly, we recognize that upstream sample processing and DNA extraction protocols can introduce significant biases into assessments of bacterial community composition (1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15). Therefore, the data analysis methods applied here should be tested using whole-cell mock communities containing different proportions of bacteria as well as on complex environmental samples. Moreover, to increase reproducibility, consistent methodology and inclusion of negative and positive controls in each run/project are recommended (49). The findings here and the continued development of microbial diversity analysis methods should result in even more reliable comparisons within and between bacterial habitats.

MATERIALS AND METHODS
Bacterial strains and culture conditions. Bacterial strains representing species commonly found in bovine milk were used to construct a mock bacterial community (Table 1). Each bacterial strain was grown in standard laboratory culture medium with negative controls for that species and harvested at early stationary phase by centrifugation at 13,000 ϫ g for 2 min. The laboratory culture media were as follows: Bacillus subtilis, Pseudomonas fluorescens, and Escherichia coli, LB (Lennox broth; Thermo Fisher Scientific); Enterococcus faecalis and Streptococcus agalactiae, brain heart infusion broth (Thermo Fisher Scientific); Staphylococcus aureus, tryptic soy broth (Becton Dickinson); Corynebacterium bovis, tryptic soy broth (Becton Dickinson) with 0.1% Tween 80; Lactococcus lactis, M17 broth (Becton Dickinson) with 0.5% glucose; and Clostridium tyrobutyricum, reinforced clostridial broth (Becton Dickinson). All strains were incubated at 37°C, with the exception of B. subtilis, L. lactis, and P. fluorescens, which were incubated at 30°C. B. subtilis, C. bovis, E. faecalis, E. coli, and P. fluorescens were grown under aeration (250 rpm).
Genomic DNA extraction and PCR amplification. Genomic DNA was extracted using the MagMAX Total nucleic acid isolation kit (Thermo Fisher Scientific, Vilnius, Lithuania) according to the manufacturer's protocol with the repeat bead beating method on a FastPrep-24 instrument (MP Biomedicals LLC).
The DNA concentration was measured with the Qubit 3.0 fluorometer using the Qubit double-stranded DNA (dsDNA) HS assay kit (Life Technologies, Eugene, OR). PCR amplification was performed using Ex Taq DNA polymerase (TaKaRa, Otsu, Japan) and primers F515 and R806 (50) with a random 8-bp barcode on the 5= end of F515 for sample multiplexing (51,52). PCR was initiated at 94°C for 3 min, followed by 35 cycles of 94°C for 45 s, 54°C for 60 s, and 72°C for 30 s, with a final extension step at 72°C for 10 min. Negative controls were run for each barcoded primer. No PCR product for the negative controls was observed on a 1.5% agarose gel. PCR products were pooled and then gel purified with the Wizard SV gel and PCR clean-up system (Promega, Madison, WI).
Preparation of the mock communities. A schematic experimental design for preparing the mock communities is shown in Fig. 1. For the gDNA mock community, 100 ng gDNA isolated from each of the strains was pooled in three separate replicates. The proportion of each bacterial strain in the gDNA mock community was determined by taking into account the genome size and 16S rRNA gene copy number (Table 1). To construct the amplicon mock community, gDNA of the nine bacterial strains was amplified in triplicate by using three different barcoded PCR primers. Amplicon concentrations were measured with the Quant-iT PicoGreen dsDNA assay kit (Life Technologies, Eugene, OR) prior to pooling at equal molar concentrations.
DNA sequencing. For Illumina sequencing, the KAPA HTP library preparation kit (KK8234, Kapa Biosystems, Pittsburgh, PA) was used for the ligation of NEXTflex adapters (Bioo Scientific, Austin, TX) to the 16S rRNA amplicons prior to 250-bp paired-end sequencing (with 7% PhiX control) on an Illumina MiSeq instrument at the University of California, Davis, Genome Center (http://genomecenter.ucdavis .edu/). For Ion Torrent sequencing, non-barcoded Ion A and Ion P1 adapters were ligated to the pooled amplicons, followed by templating, enrichment, and sequencing on the One-Touch 2 and One-Touch ES systems and Ion PGM using the 400 sequencing kit and a 318 v2 chip (Life Technologies, Carlsbad, CA).
16S rRNA gene sequence analysis. An in silico mock community, termed "expected," was created using the 16S V4 amplicon sequences from published genomes and reference genomes for the specific bacterial species (Table 1). In addition, the expected 16S V4 region copy numbers were normalized based on the genome size and 16S rRNA gene copy numbers.
Illumina MiSeq sequencing outputs were trimmed with the fastx_tools (53) to keep the first 245 and 170 bases for the forward and reverse reads, respectively (for quality profiles, see Fig. S6 and S7 in the supplemental material). The Ion Torrent sequence output BAM file was converted to FASTQ format using BEDTools (54), and reads shorter than 200 bp were also removed. The first 280 bases of the Ion Torrent reads were kept for analysis (for quality profiles, see Fig. S6 and S7 in the supplemental material).
The FASTQ files were then analyzed with QIIME version 1.9.1 and DADA2 1.6.0 (26,55). In QIIME 1, Illumina reads from the two orientations (forward and reverse) were analyzed either with or without assembly where the join_paired_ends.py (fastq-join method) (56) script was used with minimum 100-bp overlap and 1% maximum difference between overlapping sequences. Ion Torrent single-end and paired-end assembled Illumina FASTQ files then had the barcode (8 bases) and primer regions (forward primer, 21 bases; reverse primer, 20 bases) removed and were demultiplexed using the split_libraries-_fastq.py script with no barcode error and quality filtered at Q30. Chimeric sequences were identified using USEARCH (34,36) with both the de novo and reference-based methods against the Greengene database version 13.8 (57,58) via the identify_chimeric_seqs.py command with default parameter values. Sequences from both Illumina and Ion Torrent as well as the in silico mock community with expected proportions were merged as one fasta file for operational taxonomic unit (OTU) clustering using the pick_open_reference_otus.py script with recommended parameters (32) and the UCLUST method at 97% similarity thresholds. The Greengenes version 13.8 (57,58) and RDP_GOLD (36) databases were used as references for OTU assignments. Archaea, chloroplasts, and low-abundance (0.005%) OTUs were removed from the OTU tables (32).
In DADA2, for single-end analysis, the truncated Illumina and Ion Torrent FASTQ files after barcode (8 bases) and primer sequence (forward primer, 21 bases; reverse primer, 20 bases) trimming were demultiplexed using split_libraries_fastq.py script with no barcode error and no quality filter (-r 999, -n 999, -q 0, -p 0.0001). Since the single-end reads were already quality trimmed, no additional truncation was performed in DADA2 to be consistent in read length with QIIME 1 analysis. For paired-end analysis, in order to get matched sequence files, raw Illumina reads were demultiplexed in pairs using the idemp tool (59) with no barcode error. Barcode and forward and reverse primer regions were then trimmed with fastx_tools (53). The resulting reads were truncated in DADA2 to keep the first 196 bases of the forward reads and 121 bases of the reverse reads, which were later merged after ASV inference with no error allowed and a 51-bp minimum overlap to be consistent with the QIIME 1 method in resulting read length. For reads from both Ion Torrent and Illumina MiSeq, the error model learning [learnErrors()], dereplication [derepFastq()], and ASV inference [dada()] were performed in R with the DADA2 default parameter, except for added parameters for Ion Torrent [dada(HOMOPOLYMER_GAP_PENALTYϭ-1, BAND-_SIZE ϭ 32)]. Chimeras were identified and removed after sequence clustering via the removeBimeraDenovo() function with the "consensus" method and the isBimeraDenovoTable() function default settings.
Taxonomy was assigned to the resulting amplicon sequence variants (ASVs) using RDP database version 11.5 (60) and Greengenes database version 13.8 with the minimum bootstrap confidence at 80 (57,58). Ion Torrent and Illumina single-end and paired-end assembled reads were merged with the in silico mock community using the phyloseq package in R (61), and singletons and low-abundance (0.005%) ASVs were removed to be consistent with QIIME 1 analysis. Sequences of spurious ASVs were further aligned with sequences in the NCBI nr/nt database using BLASTn (62) with default settings.
Statistics. OTU/ASV counts were rarefied at 5,483 sequences per sample to retain all samples for downstream analyses. Significant differences in the observed mock community composition (␣ diversity and taxonomic distribution) were determined by analysis of variance (ANOVA) with the Bonferroni's multiple-comparison test. A P value of Ͻ0.05 indicates significance. The significance of sample clustering was indicated by permutational multivariate ANOVA using the adonis function from the vegan package in R (63) with a P value of Ͻ0.05 through 9,999 permutations.
Accession number(s). Joined-and single-end DNA sequences after quality filtering and trimming have been deposited in the Qiita database (https://qiita.ucsd.edu) under study ID no. 11351 and in the European Nucleotide Archive (ENA) under accession no. ERP104377.