A 16S rRNA gene sequencing and analysis protocol for the Illumina MiniSeq platform

Abstract High‐throughput sequencing of the 16S rRNA gene on the Illumina platform is commonly used to assess microbial diversity in environmental samples. The MiniSeq, Illumina's latest benchtop sequencer, enables more cost‐efficient DNA sequencing relative to larger Illumina sequencing platforms (e.g., MiSeq). Here we used a modified custom primer sequencing approach to test the fidelity of the MiniSeq for high‐throughput sequencing of the V4 hypervariable region of 16S rRNA genes from complex communities in environmental samples. To this end, we designed additional sequencing primers that enabled application of a dual‐index barcoding method on the MiniSeq. A mock community was sequenced alongside the environmental samples in four different sequencing runs as a quality control benchmark. We were able to recapture a realistic richness of the mock community in all sequencing runs, and identify meaningful differences in alpha and beta diversity in the environmental samples. Furthermore, rarefaction analysis indicated diversity in many environmental samples was close to saturation. These results show that the MiniSeq can produce similar quantities of high‐quality V4 reads compared to the MiSeq, yet is a cost‐effective option for any laboratory interested in performing high‐throughput 16S rRNA gene sequencing.

16S rRNA gene sequencing, because the MiniSeq is a benchtop sequencer available for ca. 30% of the cost compared to the MiSeq. Furthermore, the reagent kits for the MiniSeq are also ca. 30% the cost of the MiSeq, yet are capable of generating up to 8 million pairs of reads, and the High Output version of this kit produces a volume of sequence data up to 25 million reads (Illumina 2016a).
The dual-indexed custom primer 16S rRNA gene sequencing protocol for the V4 hypervariable region is widely applied in microbial diversity studies (Kozich, Westcott, Baxter, Highlander, & Schloss, 2013), but was originally developed for sequencing on the MiSeq platform. Thus, our aim was to optimize this dual-indexed custom primer 16S sequencing protocol for the MiniSeq platform, in order to test the fidelity of the MiniSeq for 16S rRNA gene sequencing.
Our modifications to the existing high-throughput 16S rRNA sequencing protocol (Kozich et al., 2013) use new sequencing primers to adapt this method for the MiniSeq. We performed multiple highthroughput sequencing runs targeting the V4 hypervariable region of the 16S rRNA gene derived from complex environmental samples, alongside a mock community with a known number of different species. Platform fidelity was assessed by alpha diversity analyses of a mock community of known species composition, which shows that with the proper quality controls the MiniSeq is capable of producing quality 16S rRNA gene sequence data that can be used to rapidly and reliably assess microbial diversity in complex environmental samples.

| Cultivation and DNA extraction of the 16S mock community
To create a mock community (>3% dissimilarity threshold, see Table   S1), pure cultures were isolated from soil, human skin, cell phone swabs, freshwater and saltwater, and grown on agar plates for 3-7 days at room temperature. For genomic DNA extraction, a small amount of each bacterial strain was transferred into a 2 ml sterile lysing Matrix E tube and 800 μl of preheated (60°C) sterile filtered C1 extraction buffer (38 ml saturated NaPO4 [1 mol/L] buffer, 7.5 ml 100% ethanol, 4 ml MoBio's lysis buffer solution C1 [MoBio, Carlsbad, CA], 0.5 ml 10% SDS) was added. The samples were homogenized for 40 s at a speed of 6 m/s using a QuickPrep-24 5G homogenizer (MP Biomedicals, Santa Ana, CA) and heated for 2 min at 99°C in an Eppendorf ThermoMixer C (Thermo Fisher Scientific, Waltham, MA), followed by two freeze-thaw (−80°C/room temperature) cycles to lyse bacterial cells. After repetition of the homogenizing step, the samples were centrifuged for 10 min at 10,000 g in a Heraeus Pico 21 centrifuge (Thermo Fisher Scientific, Waltham, MA).
Microbial DNA was purified using the MoBio PowerClean Pro DNA Clean-Up Kit (Qiagen, Hilden, Germany) following the manufacturer's instructions using 100 μl of the supernatant. DNA was quantified fluorometrically on the Qubit version 3.0 (Life Technologies, Grand Island, NY) using the Qubit dsDNA high sensitivity assay kit (Life Technologies).
To confirm the number of species in the mock community, the full length 16S rRNA gene of each isolate was amplified and sequenced by Sanger sequencing. Two conserved primers (27f, 1492r) were used to amplify the entire gene during PCR with the following conditions: initial denaturation at 95°C for 3 min; 30 cycles of denaturation at 95°C for 30 s; annealing at 56°C for 30 s; elongation at 72°C for 1 min and a final 5 min extension at 72°C. Individual reactions consisted of 1 μl template DNA, 5 μl 5× Green GoTaq Flexi Buffer (Promega), 3 μl MgCl 2 (25 mmol/L), 1 μl fw primer (10 μmol/L), 1 μl rv primer (10 μmol/L), 12.9 μl nuclease-free water, dNTP Mix (10 mmol/L), and 0.1 μl GoTaq Green DNA Polymerase (Promega). The amplicons were subjected to Sanger sequencing using the facilities of the Biocenter of the Ludwig-Maximilian University (LMU), Martinsried. To confirm dissimilarity thresholds of >3% for all 18 species, we aligned the sequences using BLAST version 2.2.26+ (Altschul, Gish, Miller, Myers, & Lipman, 1990). We pooled the isolates at equimolar concentration and created technical replicates of the mock community to assess the reproducibility of the method.
Environmental samples included salt marsh sediments, freshwater pond sediments, marine sponges, salt water aquaria, and carbonate biofilms ( Figure 3). Samples to assess levels of contamination (which were also sequenced, and OTUs removed from the environmental samples) were collected from dust in three different labs in the building where the sequencing and PCR amplifications were performed. Genomic DNA of environmental samples (Run A, n = 45; Run B, n = 88; Run C, n = 84; Run D, n = 90) was extracted according to the protocol of Orsi et al. (2017). In brief, samples were transferred to either 50-ml or 2-ml Lysing Matrix E tubes containing 1.4 mm ceramic spheres, 0.1 mm silica spheres, and one 4 mm glass sphere (MP Biomedicals, OH) following each incubation. 15 ml (for 50 ml tubes) or 1 ml (for 2 ml tubes) of the extraction buffer (C1 lysing buffer (MoBio, Carlsbad California), 10% SDS, 100% ethanol, and 1 mol/L Na 2 HPO 4 ) was added and homogenized for 40 s in a Fast-Prep 5G homogenizer at a speed of 6 m/s. Then, the supernatant containing the DNA was purified with the DNeasy PowerClean Pro Cleanup Kit (Qiagen, Germany).
Extracted DNA was quantified by using the Qubit double-stranded DNA (dsDNA) high-sensitivity assay kit and a Qubit 3.0 fluorometer (Invitrogen, Eugene, OR).
The barcoded DNA amplicons were analyzed on a 1.5% (w/v) agarose gel, and excised and purified for sequencing using the Zymoclean Gel DNA Recovery Kit (Zymo Research, Irvine, CA), adding 15 μl of buffer EB to elute DNA. After gel extraction, DNA concentrations were measured using Qubit and diluted first to 10 nmol/L and then to a final 1 nmol/L in a serial dilution before the samples were pooled (adding 5 μl of every sample).

| 16S sequencing strategy and primer design
We had to design additional Index 2 sequencing primers (see Table 1) to enable the dual-index barcoding method on the MiniSeq. Without these additional index sequencing primers on the MiniSeq, it is impossible to demultiplex the samples after the run because the Index 2 sequences will not be sequenced. An additional Index 2 sequencing primer is needed because, as opposed to the MiSeq, the MiniSeq only reads Index 2 after the clusters have been turned around to sequence the pair reads (see Figure 1). Sequencing proceeds in the direction of the flow cell and starts by generating Read 1 (150 bp) using Read 1 sequencing primer, followed by obtaining Index 1 (8 bp) using Index 1 sequencing primer. Clusters are turned around using the oligonucleotides provided on the flow cell. After bridging, Index 2 sequencing primer generates Index 2 (8 bp) and Read 2 sequencing primer finally obtains Read 2 (150 bp).
We used the additional Index 2 sequencing primers to perform F I G U R E 1 Schematic description of the dual-index sequencing strategy on the MiniSeq. Reading the figure from top to bottom shows the sequential order of paired-end sequencing steps (four total). "Turn around" indicates the step of paired-end turn around on the flow cell surface. The sequencing proceeds in the direction of the flow cell surface, which in this figure is located on the right side (arrows point in direction of sequencing reaction). Sequencing starts by using Read 1 primer to sequence Read 1, followed by Index 1 primer to generate Index 1. The MiniSeq only uses the oligonucleotides on the flow cell for bridging and both the second index and the paired read are sequenced after the clusters are turned around. Hence an Index 2 primer is needed to sequence Index 2. Read 2 is then sequenced by using the Read 2 primer (after Kozich et al., 2013). Sequencing primers for only the forward primer 515F-Y (Parada et al., 2016) are shown, for sequencing primers needed for the 515F primer (Caporaso et al., 2012) please see Table 1 for all sequencing primers control library (1.8 pM). The method used to align sequences to PhiX to determine the error profiles is made possible by the software provided by Illumina that is preinstalled on the sequencer. This is calculated automatically by the software after each run, as long as the PhiX has been added to the library loaded into the reagent cartridge.  Table 1). The results of the MiniSeq sequencing runs and the 16S rRNA gene sequences from the mock community are publicly available in the ENA Project PRJEB24504.

| 16S bioinformatics analysis and OTU assignment
Demultiplexing and base calling were both performed using bcl2fastq Conversion Software v2.18 (Illumina, Inc.). All bioinformatics analysis were conducted in USEARCH version 9.2.64 (Edgar, 2010) and QIIME version 1.9.1 (Caporaso et al., 2010). The initial step was to assemble paired-end reads using the fastq_merge pairs command with default parameters allowing for a maximum of five mismatches in the overlapping region. Stringent quality filtering was carried out using the fastq_filter command. We discarded low-quality reads by setting the maximum expected error threshold (E_max), which is the sum of the error probability provided by the Q score for each base, to 1. Reads were de-replicated and singletons discarded. Reads were clustered into OTUs sharing 97% sequence identity using the heuristic clustering algorithm UPARSE (Edgar, 2013), which is implemented in the clus-ter_otus command. The algorithm performs de novo chimera filtering and OTU clustering simultaneously (Edgar, 2013). The usearch_global command assigned the reads to OTUs and created an OTU table for further downstream analysis. Taxonomy was assigned in QIIME (Caporaso et al., 2010) through BLASTn searches 2.2.26+ (Altschul et al., 1990) with an identity threshold of 90% (hits below 90% identity were not considered) against the SILVA ribosomal RNA gene database (Quast et al., 2013) release for QIIME SILVA123. As a quality control step, we removed all OTUs containing <10 sequences and which had no BLASTn hit. Spurious OTUs were identified in the mock community as those OTUs with a closest BLASTn hit to organisms that were not in the original mock community. The OTU tables were rarefied to the sample containing the lowest number of sequences, with a threshold of >10,000 sequences (all samples having less than 10,000 sequences were removed from analyses prior to the rarefaction step).

| 16S data analysis
In order to investigate beta diversity structures of our samples, we performed downstream analysis in R version 3.3.0 (R Development Core Team 2011). Nonmetric multivariate (NMDS) analyses of the microbial communities were calculated using a Bray-Curtis distance in the Vegan package (Oksanen et al., 2017). Analysis of Similarity (ANOSIM) was performed using 999 permutations with a Bray-

Curtis distance. Rarefaction analyses on environmental samples
were performed in QIIME version 1.9.1 using both observed species and chao1 metrics.

| RE SULTS AND D ISCUSS I ON
The main modification of our MiniSeq protocol from the dual-index sequencing method of Kozich et al. (2013) is the use of an additional index sequencing primer. This additional index sequencing primer is necessary because the MiniSeq does not sequence the second index using adapters present on the flow cell surface as the MiSeq does. Rather, the MiniSeq reads Index 2 only after the clusters have been turned around to sequence the paired-end reads (Figure 1). Thus, in addition to the three sequencing primers described by Kozich et al. (2013), we designed and used new Index 2 sequencing primers, Index2.515F-Y (5 The primers were diluted and loaded into the specified positions on the Illumina reagent cartridge. We used primers shown in (a) for sequencing Runs A and B, where primers shown in (b) were used for sequencing Runs C and D.
′-TTACCGCGGCKGCTGRCACACAATTACCATA-3′) and Index2.515F (5′-TTACCGCGGCKGCTGGCACACAATTACCATA-3′) to enable the dual-index barcoding method on the MiniSeq (see Table 1 for all sequencing primers). We tested this modified approach on four different 16S rRNA sequencing runs including diverse environmental samples as well as a mock community composed of 18 different bacterial species. The mock community was created from pure cultures, whose 16S rRNA genes were determined through Sanger sequencing to be >3% different (Table S1). Environmental samples were collected from salt marsh sediments, freshwater pond sediments, marine sponge, beach sediments, salt water aquaria, and microbial biofilms recovered on carbonate sediments.

| Run performances
Run A yielded a total of 1.23 Gbp with cluster density of 76 ± 9 K/ mm 2 and >73% of the clusters passing filter (PF) ( Spiking in the genomic DNA resulted in clusters PF > 80% for Runs B-D, which is expected for optimized cluster density on the platform. For Runs C and D, clustering efficiency of PF > 95% was achieved. For example, sequencing Run B generated 3.31 Gbp with a cluster density of 170 ± 3 K/mm 2 and >84% of clusters PF (Table 2).
Run B had 88% of all bases from both reads assigned a quality score of Q ≥ 30 with an estimated error rate of 0.8%. Run C yielded 2.67 Gbp with a cluster density of 124 ± 1 K/mm 2 and >95% of the clusters PF (Table 2). For Run C, a Q ≥ 30 was achieved by 94% of all bases, with an estimated error rate of 0.43%. Sequencing Run D generated 2.56 Gbp with a cluster density of 120 ± 51 K/mm 2 and >94% of the clusters PF (Table 2). In Run D, 93% of bases had a quality score of Q ≥ 30 with an estimated error rate of 0.47%. We note that it is difficult to distinguish sequencing errors from PCR errors, and thus refer to the error rates predicted for the amplicons from the PhiX data as estimated error rates as these do not account for PCR errors.

| Terminal G homopolymers
The In our first 16S rRNA sequencing run (Run A) that had relatively poor quality (cluster density 76 ± 9 K/mm 2 , PF < 80%), 7% of forward reads and 8% of reverse reads had long (>10) terminal poly-G strings (see Figure S1). that a cluster density >120 K/mm 2 and percent of clusters passing filter >90% provide for a high-quality run. We urge caution when analyzing rare taxa (Sogin et al., 2006) with 16S data generated on the MiniSeq, as the low-sequencing depth may not be sufficient.

Moreover, sequences with terminal poly-G homopolymers need
to be carefully accounted for as they could lead to spurious OTUs.
Other modern methods of analysis such as DADA2 (Callahan et al., 2016) could assist with poly-G containing reads and other erroneous reads in MiniSeq 16S rRNA gene amplicon data.

| Mock community analysis
Because the V4 hypervariable region is ca. 250 bp in length the 150 bp pair of reads produced by the MiniSeq overlap 50 bp on average, this may impact diversity estimates (because mismatches in the overlapping contigs are used to assess errors and platform fidelity).
We used OTU clustering to see whether the true richness could be recovered in the mock community. For Run A, we had 6 replicates, whereas for Run C and D, 3 replicates were sequenced. After data processing (see Methods), the UPARSE algorithm (Edgar, 2013)
been misclassifications due to sequencing errors. The remaining four are presumed to be either contaminants, or derived from sample cross talk. Thus, the UPARSE method could accurately recover the microbial richness from our MiniSeq 16S rRNA gene data. Other studies using UPARSE also showed that the number of OTUs generated with this method is in close concordance with the number of species in mock communities (e.g., Edgar, 2013;Flynn, Brown, Chain, MacIsaac, & Cristescu, 2015). While the exact number of OTUs in the mock community was not obtained (Figure 2), mock communities are rarely recovered at the exact richness after 16S high-throughput sequencing with variability reaching >30% of the richness in the original mock community even under stringent criteria (Edgar, 2013). This is typically attributed to additional undetected contaminants, potential multiple rRNA gene copies harbored by some of the genomes, and single sequencing errors that can occur in low abundance in the sample index barcodes (Edgar, 2016). Furthermore, analysis of a mock community sequenced in parallel with environmental samples is challenging due to sample "cross talk" that can occur partly due to errors in the barcodes themselves (Edgar, 2016). This can occur either during PCR or sequencing, but is difficult to assess if the environmental samples contain similar strains as the mock community.
In our case, several of the strains (Pseudomonas fluorescens, Vibrio natriegens, Pseudoalteromonas flavipulchra) in our mock community are closely related to organisms in the environmental samples from marine sediments and corals that were sequenced (Table S1). Thus, it is difficult to speculate on the exact degree of sample cross talk in our sequencing runs.

| Analysis of environmental samples
Our quality control procedures for the MiniSeq 16S rRNA gene data appears to be reasonably prudent, because the richness of our recovered mock community OTUs relative to the starting richness falls within the variability of stringently controlled mock community sequence analyses (Edgar, 2013). To control for contamination, we also sequenced lab dust samples and extraction blanks and removed OTUs shared with the environmental samples. After removal of contaminant OTUs, a significantly different (ANOSIM: p = .001, R:.9) microbiome for each sample was observed (Figure 3). Given that the richness of the mock community is close to the true value, these beta diversity analyses show that the MiniSeq is a viable platform for high-throughput 16S rRNA gene sequencing studies of microbiomes.
Furthermore, samples included in the rarefaction analysis reached saturation, indicating the MiniSeq can sample diversity adequately from environmental samples ( Figure S2). For each run, there was variability (0.54%-13.91%) in the number of reads (clusters PF) obtained per sample, with Runs C and D having the least variability in sequencing depth between samples (Table 2). Runs C and D were also the highest quality in terms of % clusters passing filter. These data show that even for a sequencing run with 90 samples, after pooling the samples at equimolar concentrations a high-quality run (>90% of clusters PF) on the MiniSeq can provide upwards of ca.
50,000 reads per sample (Table S2). Comparing sequencing fidelity across platforms is a feasible way of validating high-throughput sequencing approaches (Caporaso et al., 2012). However, mock communities can also be used as a way to test the fidelity of high-throughput sequencing platforms (Benítez-Páez, Portune, & Sanz, 2016;Caporaso et al., 2011). Thus, while we do not compare our results to those obtained from larger sequencing platforms, for example, a MiSeq (as described by Caporaso et al., 2012), the analyses of the mock community show that the MiniSeq is able to capture a realistic picture of its microbial diversity. For con- increases feasibility for small laboratories to perform their own highthroughput sequencing of the 16S rRNA marker gene.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.