Non-biological synthetic spike-in controls and the AMPtk software pipeline improve fungal high throughput amplicon sequencing data

High throughput amplicon sequencing (HTAS) of conserved DNA regions is a powerful technique to characterize biological communities from environmental samples. Recently, spike-in mock communities have been used to measure accuracy of sequencing platforms and data analysis pipelines. The fungal internal transcribed spacer (ITS) region is difficult to sequence due to its variability (length and sequence divergence) across the fungal kingdom. To assess the ability of sequencing platforms and data processing pipelines using fungal ITS amplicons, we created two ITS spike-in control mock communities composed of single copy plasmid DNA: a biological mock community (BioMock), consisting of cloned ITS sequences, and a synthetic mock community (SynMock), consisting of non-biological ITS-like sequences. Using these spike-in controls we show that pre-clustering steps for variable length amplicons are critically important and a major source of bias is attributed to initial PCR reactions. These data suggest HTAS read abundances are not representative of starting values. We developed AMPtk (amplicon toolkit), a versatile software solution equipped to deal with variable length amplicons featuring a method to quality filter HTAS data based on spike-in controls. While we describe herein a non-biological (synthetic) mock community for ITS sequences, the concept can be widely applied to any HTAS dataset.

results of variable length amplicons from HTAS. We demonstrate that read abundances are an 127 unreliable proxy for measuring relative abundances in fungal communities on both Illumina and A single step PCR reaction was used to create Ion Torrent compatible sequencing libraries, and primers were designed according to manufacturer's recommendations. Briefly, the 195 forward primer was composed of the Ion A adapter sequence, followed by the Ion key signal 196 sequence, a unique Ion Xpress Barcode sequence (10-12 bp), a single base-pair linker (A), 197 followed by the fITS7 primer (Ihrmark et al., 2012). The reverse primer was composed of the Ion trP1 adapter sequence followed by the conserved ITS4 primer (White et al., 1990). Sequencing  processing HTAS data are broken down into i) pre-processing reads, ii) clustering into OTUs, iii) 220 filtering OTU table, and iv) assigning taxonomy.

222
Pre-processing reads -Data structures from Roche 454 and Ion Torrent are similar where 223 reads are in a single file and have a unique barcode at the 5' end of the read followed by the 224 gene-specific priming site; therefore, AMPtk processes reads from these two platforms very 225 similarly. As a preliminary quality control step, only reads that have a valid barcode and forward 226 primer are retained. Next, reverse primer sequences are removed and reads are trimmed to a user-defined maximum length. Data from Illumina is processed differently because reads are 228 most often paired-end reads and most sequencing centers provide users with de-multiplexed by 229 sample paired-end data (i.e. output of 'bcl2fastq'). AMPtk first merges the paired end reads 230 using USEARCH or VSEARCH, phiX spike-in control is removed with USEARCH, forward and 231 reverse primers are removed if found, and all data are combined into a single file. Pre-232 processing reads in AMPtk from any of the sequencing platforms results in a single output file 233 that is compatible with all downstream steps.

235
Clustering reads into OTUs  249 that read counts in each OTU that are below the index-bleed threshold are set to zero as they 250 fall within the range of data that could be attributed to index-bleed. alignment percent identity is < 97%, then the best confidence score from UTAX or SINTAX is used, iii) if there is disagreement between taxonomy levels assigned by each method then a 263 least common ancestor (LCA) approach is utilized to generate a conservative estimate of 264 taxonomy. AMPtk also can take a QIIME-like mapping file that can contain all the metadata 265 associated with the HTAS study; the output is then a multi-fasta file containing taxonomy in the 266 headers, a classic OTU   (Table 1). Given the small percentage of ITS1 and ITS2 regions 293 that are greater than 450 bp (the current upper limit of the Ion Torrent PGM platform), the 294 number of taxa in the reference database that are unlikely to sequence on the Ion Torrent due to amplicon length is relatively small (Table 1). Illumina MiSeq is now capable of paired end 300 296 bp read lengths (2 x 300); however, reads need to overlap for proper processing in NGS 297 software platforms and thus a ~ 500 bp size limit would also be able to sequence most taxa in 298 the reference database.     Given the results from analysis of the UNITE datasets, we set out to create a representative ITS mock community to be used as a spike-in sequencing control to determine 314 the quantitative nature of ITS HTAS and to measure the performance between the commonly 315 used Illumina MiSeq platform versus the Ion Torrent PGM. To circumvent the problematic 316 issues associated with the ITS region, we reasoned that cloned ITS sequences in plasmid form 317 would allow for accurate quantification and pooling, thus providing a means to test the accuracy 318 of the sequencer platforms and data processing workflows. Therefore, we cloned known ITS to compare the different software algorithms here, but will briefly mention that when we did run 336 our data through QIIME, the number of OTUs was highly over-estimated and the error rates 337 were very high (Table S2). We were unable to run our data through Mothur due to the inability to 338 do a multiple sequence alignment and subsequent distance matrix of the ITS region. The best 339 performing clustering pipeline was UPARSE; however there were several issues with how the 340 reads were pre-processed and quality filtered that lead to suboptimal results (Table S3 and   341   Table S4). It is important to note that all of these software solutions have been built with 16S 342 amplicons in mind and several have been optimized for Illumina data.

343
The major difference in 16S amplicons versus those of ITS1/ITS2 is that the lengths of distinguishing feature makes ITS sequences from diverse taxa impossible to align (Schoch et 346 al., 2012) and thus represents a major limitation in data processing. To illustrate the importance 347 of properly pre-processing ITS data, we clustered using UPARSE the ITS1 and ITS2 regions 348 using the UNITE reference database (Figure 2). Using the full length ITS1/ITS2 sequences as a 349 benchmark, we then explored the outcome of trimming/truncating the sequences to different    abundance-based metrics to analyze HTAS, without giving any consideration to 405 presence/absence-based metrics. We reasoned we could investigate this issue using the ITS 406 BioMock artificial community, which would not suffer from bias associated with DNA extraction, 407 ITS copy numbers, and intraspecific variation. We compared the relative read abundances of

437
In HTAS experiments, considerable effort is made to try to sequence to an equal depth    reads per sample in any of our sequencing runs was nearly 60,000 (Supplemental Table S5).
Unequal sequencing depth has been used as rationale for explaining the lack of correlation 443 between read abundance and actual abundance. Therefore, random subsampling of reads in 444 each sample prior to clustering (also called rarefying) has been widely used in the literature,

454
consensus on a mechanism of index-bleed during the sequencing run has yet to be reached.

455
Index-bleed is a significant challenge to overcome as sample crossover has the potential to

484
The SynMock was tested as a spike-in control on both the Ion Torrent and Illumina

485
MiSeq platforms. The raw data were processed using AMPtk and clustered using UPARSE.

486
These data illustrate that the synthetic sequences are able to be processed simultaneously with 487 real ITS sequences and provide a way to track the level of index-bleed between multiplexed 488 samples ( Figure 6). The increased benefit of being able to track the SynMock sequences as 489 they "bleed" out of the sample allows for a more accurate measurement of index-bleed. Using 490 default Illumina de-multiplexing (allowing 1 mismatch in the index sequence), index-bleed using 491 the SynMock community was 0.072% ( Figure 6C). To determine if allowing mismatches in the 492 index reads was increasing index-bleed, we reprocessed the data with 0 mismatches and found 493 that index-bleed was reduced to 0.046%. While index-bleed was reduced by nearly half, the reduce index-bleed in the data. We noted that in our Illumina dual-indexing library prep that there was increased index-bleed on samples that had a shared reverse index (i7), suggesting in the barcode resulted in 0.167% index-bleed while allowing 0 mismatches in the barcode 501 resulted in 0.156% index-bleed ( Figure 6C). While these data would suggest that index-bleed is 502 perhaps higher in Ion Torrent PGM datasets, we have subsequently used the SynMock on more 503 than 10 different HTAS Ion Torrent PGM experiments and have since seen much lower levels of 504 index-bleed, occasionally approaching 0% index-bleed.

505
Many environmental samples can contain hundreds of taxa and thus a legitimate 506 concern is that the 12 member SynMock community does not represent a realistic community in 507 terms of diversity in a sample. To test if the SynMock was able to be recovered in a more 508 complex community, we mixed SynMock together with two environmental samples that had 509 more than 200 OTUs in previous sequencing runs. These mixed samples show that SynMock 510 could be recovered from a complex community and the sequences behave like real ITS 511 sequences ( Figure 6A). While many studies have set a read count threshold to filter "noisy" data 512 from OTU tables, this threshold has been typically selected arbitrarily, i.e. OTUs with read 513 counts less than 100 or less than 10% of the total, etc. Use of the SynMock spike-in control 514 allowed for data driven thresholds to be measured and moreover for the ability to filter the OTU 515 table based on the calculated index-bleed. The AMPtk filter command calculates index-bleed by 516 mapping the OTUs to the mock community and then provides a way to filter the OTU table 517 based on this calculated value. AMPtk filters across each OTU in the table such that difficult to 518 sequence or "low abundance" OTUs are not indiscriminately dropped. Taken together, these 519 data illustrate the utility of a non-biological mock community in parameterizing data processing 520 steps and importantly providing a method in AMPtk to reduce index-bleed from HTAS datasets.

550
The ITS region is widely used as a molecular barcode in fungal biodiversity studies as it 551 is easy to amplify and public reference databases are robust. However, HTAS with the ITS 552 region presents some unique challenges due to variability in sequence characteristics such as 553 length and copy number. Most HTAS software development and optimization has been focused 554 on the 16S molecular barcode, a region that is near uniform in length across prokaryotic taxa.

555
Thus, there is a need for a software solution that can more accurately account for variable 556 length amplicons. We developed a single-copy mock community based on cloned ITS 557 sequences as a tool to validate and compare different NGS platforms and data processing 558 pipelines. Using an artificial single-copy mock community of cloned ITS sequences in plasmids 559 (BioMock), we determined that the core clustering/denoising algorithms work for variable length 560 amplicons; however, pre-processing techniques widely used for uniform length amplicons 561 introduce significant error into the pipelines. Simplifying the pre-processing of sequencing reads 562 (i.e., identifying unique sequence barcodes, forward/reverse primers, and trimming reads to a 563 uniform length without data loss) resulted in large improvement in downstream OTU clustering.

564
The pre-processing of reads prior to quality filtering is critical for variable length amplicons and 565 is implemented in AMPtk.
HTAS are not a reliable proxy for inferring biological relative abundance. These data do support 569 use of presence/absence (binary) metrics as we were able to recover all members of our mock 570 community, even when they were spiked into a diverse environmental sample. We identified the 571 initial PCR reaction (library construction) as the major source of read number bias, a conclusion