Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications

Culture-independent metagenomic detection of microbial species has the potential to provide rapid and precise real-time diagnostic results. However, it is potentially limited by sequencing and taxonomic classification errors. We use simulated and real-world data to benchmark rates of species misclassification using 100 reference genomes for each of ten common bloodstream pathogens and six frequent blood-culture contaminants (n=1568; only 68 genomes were available for Micrococcus luteus). Simulating both with and without sequencing error for both the Illumina and Oxford Nanopore platforms, we evaluated commonly used classification tools, including Kraken2, Bracken and Centrifuge, utilizing mini (8 GB) and standard (30–50 GB) databases. Bracken with the standard database performed best: the median percentage of reads across both sequencing platforms identified correctly to the species level was 97.8% (IQR 92.7–99.0) [range 5–100]. For Kraken2 with a mini database, a commonly used combination, median species-level identification was 86.4% (IQR 50.5–93.7) [range 4.3–100]. Classification performance varied by species, with Escherichia coli being more challenging to classify correctly (probability of reads being assigned to the correct species: 56.1–96.0%, varying by tool used). Human read misclassification was negligible. By filtering out shorter Nanopore reads, we found performance similar or superior to Illumina sequencing, despite higher sequencing error rates. Misclassification was more common when the misclassified species had a higher average nucleotide identity to the true species. Our findings highlight that taxonomic misclassification of sequencing data occurs and varies by sequencing and analysis workflow. To account for 'bioinformatic contamination' we present a contamination catalogue that can be used in metagenomic pipelines to ensure accurate results that can support clinical decision making.

Some degree of contamination is expected and can arise from imperfect sampling, the laboratory environment, equipment and reagents, and also bioinformatically. Bioinformatic contamination can occur from errors in demultiplexing, where multiple samples are run at once using different indexes, and from misclassification of sequence data as the wrong genus or species.
One of the primary challenges in metagenomics is that reads may be generated from multiple species, rather than from a single species as when sequencing an isolate from culture. This creates a computational challenge: how to accurately identify and differentiate all species contained within a sample, while ruling out any species present only as artefacts? This must also be done against an ever-increasing number of reference sequences to compare with. Metagenomic classifiers can be grouped based on their reference database type: DNA-to-DNA methods, for example Kraken [2], Kraken2 [3], Bracken [4] and PathSeq [5]; DNA-to-protein methods, for example Kaiju [6] and Diamond [7]; and DNA-to-marker methods, which include only specific gene families in their reference databases, for example MetaPhlAn [8] and mOTUs [9]. In a recent benchmarking study [10], DNA-to-DNA methods were the best-scoring methods, while DNA-to-marker methods performed more poorly.
Some studies have benchmarked classifiers and observed that the performance of different classifiers varies significantly even on the same benchmark datasets [10][11][12][13][14][15]. However, performance depends both on the taxonomic classifier and on the reference database of sequences of known species used, and few studies have compared performance using a uniform database [10]. For example, while Ye et al. previously performed classification benchmarking on short-read sequencing, previous studies have not extensively benchmarked long-read sequencing data or specifically investigated the effects of database selection or read length on classification performance. Furthermore, no catalogue of how common taxonomic misclassification is by species exists in the literature to allow studies to account for bioinformatic contamination arising from the classification process [10]. Therefore, there is still a need for thorough benchmarking to shed light on the optimal selection of bioinformatic tools for metagenomics and to account for 'bioinformatic contamination', which is an important and potentially underrecognized issue in the field [16].
Here we use simulated and real-world data to define the intrinsic rates of species misclassification that occur using Illumina and Oxford Nanopore sequencing platforms with commonly used classification tools and databases. We simulate reads from reference genomes, rather than simulating metagenomic samples, to create a controlled series of experiments that allow us to estimate the extent of taxonomic misclassification that occurs from sequencing a range of common pathogens and contaminants. Our findings can be applied to metagenomic samples where there is evidence of the presence of at least one bacterial species. We derive species-specific thresholds for identifying whether the presence of a second, third, or further additional species arises from misclassification or true presence. Therefore, our approach allows the extent of bioinformatic contamination arising from taxonomic misclassification to be accounted for in clinical diagnostic workflows, and polymicrobial infections to be identified more reliably.

Impact Statement
Metagenomics may transform clinical microbiology by enabling more rapid species detection in a potentially unbiased manner and reducing reliance on culture-based approaches. However, it is still limited by ongoing challenges such as sequencing and classification software errors. In this study, we use simulated and real-world data to define the intrinsic rates of species misclassification that occur using Illumina and Oxford Nanopore sequencing platforms with commonly used taxonomic classification tools and databases. We quantify the extent of 'bioinformatic contamination' arising from the classification process. This enables us to identify the best performing tools that maximize classification accuracy, and to suggest how taxonomic misclassification can be formally accounted for in clinical diagnostic workflows. Specifically, we specify thresholds for identifying or excluding polymicrobial infections in metagenomic samples, based on rates of misclassification of similar species, which might have clinical implications when treating infection.

Simulated data
We generated simulated sequence data for ten common blood-stream pathogens and six common blood-culture contaminants. For each species, 100 randomly selected reference genomes were downloaded where possible (on 10 January 2021) from the NCBI RefSeq database [17], i.e. 1568 genomes in total (only 68 genomes were available for M. luteus). The scope of this evaluation was restricted to the identification of bacterial infections only. Although we opted not to remove genomes from this random selection that overlapped with the classification database (detailed below), we performed a sensitivity analysis excluding these genomes to ensure that contamination was not being underestimated. However, genomes very similar to those in the reference database are likely to make up some of any genomes sequenced in any given setting. Therefore, our approach, taking a random selection (including overlapping genomes), may offer more representative results.
Simulated Illumina and Oxford Nanopore reads were generated from each genome, with separate simulations with and without sequencing errors. The relevance of simulations without sequencing errors is to allow estimates of what performance might be achieved as sequencing technologies improve and per-base error rates are potentially reduced in newer releases [18]. Illumina reads were simulated using ART_Illumina v2.3.7 [19] (art_illumina -i reference.fna -p -l 250 -f 20 -m 999 -s 1 -o <output> -nf 5 -na -ss MS), generating reads at 20× coverage for each reference genome (Fig. S5) (available in the online version of this article), to simulate a realistic MiSeq throughput, which is around the minimum depth previously shown to be adequate for variant identification and antimicrobial resistance prediction [20]. Oxford Nanopore reads were simulated using NanoSim-H v1.1.0.4 [21,22] (nanosim-h reference.fna -p ecoli_R9_1D -o <output> -n 11000), generating reads at approximately 20× coverage per genome with a median read length of 5921 bp (IQR 2745–11709) [range 125–120218] (Fig. S6). The default ecoli_R9_1D error profile, which mimics the R9 flow cell using 1D sequencing, and the MiSeq error profile were used for Nanopore and Illumina read simulation, respectively. No pre-processing of fastq reads was performed as this was a controlled simulated experiment.
To assess potential misclassification of human reads as bacterial taxa, we performed 100 Illumina simulations from the human genome (GRCh38, downloaded from NCBI on 19 April 2022) using ART_Illumina, with coverage set at 0.032× to approximate high human contamination in metagenomic samples (art_illumina -i GRCh38.fna -p -l 250 -f 0.032 -m 999 -s 1 -o <output> -nf 5 -na -ss MS).

Taxonomic classifiers and sequence databases
Previous studies have demonstrated that Kraken-based classification tools deliver faster classification speeds and similar accuracy compared to comparator tools [10,23]. Therefore, we chose to evaluate Kraken2 (v2.0.7) [3] and Bracken (v2.5) [4], as well as Centrifuge [24], which was written to address the high memory requirements of Kraken. Mini (8 GB) and standard (30–50 GB) uniform databases were built from archaeal, bacterial, viral and human reference genomes downloaded from NCBI RefSeq on 10 January 2021. Databases were built according to each tool's default settings. The mini database resembled an up-to-date version of 'MiniKraken', limited to 8 GB, which down-samples the standard Kraken2 database using a hash function. As Centrifuge does not allow for the restriction of size when building a database, only a standard database was built. Fastq files were classified using the default settings of Kraken2 (kraken2 --db $DBNAME sequence_data.fq --report /), Bracken (bracken -d $DBNAME -i sequence_data.fq -o /) and Centrifuge (centrifuge -k 1 -x $DBNAME -f / --report-file / -S /); however, for Centrifuge we set the distinct primary assignment value to 1 (-k 1) in order to provide a comparable standard where each read is assigned to only one taxonomic level. Kraken2 and Bracken produce the same report format, while the Centrifuge output was converted to a Kraken-style report using the command (centrifuge-kreport -x $DBNAME centrifuge.output.txt > krakenstyle.output.txt).

Real-world data
We used a convenience sample of real-world Illumina sequencing data for benchmarking, consisting of reads generated from sequencing of 100 pure isolates of S. aureus, E. coli and K. pneumoniae from previously published work [NCBI project accession numbers PRJNA690682, PRJNA604975] [25,26]. Sequenced isolates originated from blood-stream infections between 15 September 2008 and 01 December 2018 in Oxfordshire, UK. Available 150 bp paired-end short-read sequences were generated using the Illumina HiSeq 4000.

Statistical analysis
We measured classifier performance as the proportion of reads classified as the correct species or correct genus, given the known species of the reference genome or clinical isolate from which reads were simulated or obtained. The reported read abundance for each species may be referred to as either 'raw', calculated from the relative abundance of reads from each taxon, or 'corrected', accounting for genome size such that the relative number of organisms of each taxon is estimated. We used raw abundance estimates unless the correction is automatically performed by the software, as in the case of Bracken.
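The genome-size correction can be sketched as follows. This is an illustrative Python sketch of the idea, not the code used by Bracken or in this study, and all numeric values are hypothetical: dividing each taxon's read count by its genome length approximates the relative number of organisms, which is then renormalized.

```python
def corrected_abundance(read_counts, genome_sizes):
    """Convert raw read counts per taxon into genome-size-corrected
    relative abundances, approximating the 'corrected' estimate
    (relative number of organisms rather than of reads)."""
    # Reads per base of genome approximates relative organism count
    organisms = {taxon: read_counts[taxon] / genome_sizes[taxon]
                 for taxon in read_counts}
    total = sum(organisms.values())
    return {taxon: n / total for taxon, n in organisms.items()}

# Hypothetical example: equal read counts but different genome sizes,
# so the smaller genome represents proportionally more organisms
raw = {"E. coli": 1000, "S. aureus": 1000}
sizes = {"E. coli": 5_000_000, "S. aureus": 2_800_000}
corrected = corrected_abundance(raw, sizes)
```

In this toy example the two taxa have equal raw read abundance (50% each), but after correction the smaller S. aureus genome accounts for the larger share of estimated organisms.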
We used generalized linear regression (with a logit link function, binomial family and robust standard errors) to investigate associations between the proportion of reads classified as the correct species and sequencing technology, taxonomic classifier, database and species.
Similarly, to investigate the extent to which species misclassification occurred because the true and misclassified genomes were similar, we used a linear model (R code: lm(ANI ~ pct_at_this_level_or_lower, data)) to relate the average nucleotide identity (ANI) between the true species and the misclassified species to the proportion of reads misclassified as the incorrect species. ANI values for each correct–incorrect species pair were generated by taking the median value over all pairwise comparisons between 30 genomes of each species misclassified (downloaded from NCBI RefSeq on 24 June 2021) and 50 genomes of the correct species using FastANI (fastANI --ql …) [27]. Analyses were performed using R (v4.1) and RStudio (v1.4.1717) [28].
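The two steps above, collapsing pairwise FastANI values into a single median per species pair and fitting a simple linear model, can be sketched in Python. This is our illustration only (the study used FastANI output and R's lm), and all numeric values below are hypothetical:

```python
from statistics import mean, median

def median_pairwise_ani(pairwise_ani):
    """Collapse all pairwise FastANI values for one correct/misclassified
    species pair into a single median ANI, as described in the text."""
    return median(pairwise_ani)

def ols_fit(x, y):
    """Closed-form simple linear regression (slope, intercept),
    mirroring the R call lm(ANI ~ pct_at_this_level_or_lower)."""
    xbar, ybar = mean(x), mean(y)
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return slope, ybar - slope * xbar

# Hypothetical data: misclassification tends to rise with ANI
pct_misclassified = [0.5, 2.0, 5.0, 10.0, 25.0]
median_ani = [80.0, 85.0, 88.0, 91.0, 95.0]
slope, intercept = ols_fit(pct_misclassified, median_ani)
```

A positive fitted slope corresponds to the paper's finding that misclassification is more common between species with higher ANI.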
To investigate the relationship between read length and species classification performance, we generated Oxford Nanopore and Illumina reads of varying lengths, using error-containing and error-free profiles, from 100 reference genomes each of S. aureus and E. coli, and plotted read length vs. the proportion of reads correctly classified by Kraken2.

Determinants of taxonomic classification accuracy
We simulated sequence data from 1568 genomes representing ten common pathogens and six common blood-culture contaminants. We chose S. aureus, classified with Kraken2 using a mini (8 GB) database, with simulated Illumina reads including sequencing error, as the reference category for comparisons [29][30][31].

Best performing combinations
Overall, Bracken with the standard database was the best performing option (Figs 1 and S1 show abundance classification details of five important common blood-stream pathogens and five contaminants, respectively). Detailed misclassification results are presented in Table S3, while Table 3 provides a summary with recommended 95th, 99th and 100th percentile cut-off values to determine if additional organisms are present, i.e. there is polymicrobial infection. Of the 1568 randomly selected genomes used in simulations, 7.6% overlapped with the reference genomes used to build the classification database; however, when overlapping genomes were removed from this analysis this did not affect the findings (data are not shown separately as they are very similar). A higher sequencing depth of 40× compared to 20× did not result in any improvements in classification performance or reduction in misclassified or unclassified reads (Table S4).
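Percentile-based cut-offs of the kind summarized in Table 3 can be derived from per-genome misclassification percentages. The following nearest-rank sketch illustrates the idea only; it is not the authors' exact procedure, and the input values are hypothetical:

```python
import math

def percentile_cutoffs(misclassified_pcts, percentiles=(95, 99, 100)):
    """Nearest-rank percentile cut-offs from a list of per-genome
    misclassification percentages for one species pair."""
    ordered = sorted(misclassified_pcts)
    n = len(ordered)
    cutoffs = {}
    for p in percentiles:
        # smallest value with at least p% of observations at or below it
        rank = max(1, math.ceil(p / 100 * n))
        cutoffs[p] = ordered[rank - 1]
    return cutoffs

# Hypothetical misclassification percentages across 100 simulated genomes
pcts = [0.0] * 90 + [0.5, 0.5, 1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 10.0, 26.6]
cutoffs = percentile_cutoffs(pcts)
```

The 100th percentile is simply the worst case observed, while the 95th and 99th percentiles give progressively more conservative thresholds.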

Relationship between species misclassification and nucleotide identity
Misclassification at the species level was more common when the misclassified species had a higher average nucleotide identity to the true species (Fig. 4 and Table S5). However, there were several outliers where misclassification was greater than expected for the given level of genome-wide nucleotide identity, for example B. thuringiensis was misclassified at 25% when B. cereus was the true species, and S. pneumoniae at 10% when S. mitis was the true species. This reflects the difficulty distinguishing species within known groups or complexes of bacteria, and potentially reflects that some parts of the genome may have higher nucleotide identity than the genome-wide average.

Comparison with real-world data
Findings from Illumina simulations (using MiSeq error profiles) were very similar to real-world Illumina data (generated with the HiSeq platform; Fig. 5).

DISCUSSION
The potential for metagenomic technology to improve diagnostics and patient outcomes has been widely recognized [32], and there is extensive work ongoing to overcome current challenges associated with use of these approaches in clinical settings [33].
In this article, we use simulated and real-world sequencing data to define the intrinsic rates of bioinformatic species misclassification that occur using common sequencing platforms, classification tools and databases. This enables us to quantify the extent of bioinformatic contamination in the classification process, to account for this formally in diagnostic workflows, and to further guide the selection of optimal bioinformatic tools to maximize classification accuracy. Overall, across both short and long reads we found that using the Bracken classifier with a 'standard', i.e. larger, database resulted in the most accurate performance across all species, with almost no reads unclassified (Figs 1 and S1). In comparison, Kraken2 with a mini database, a commonly used combination, performed worse. Previous studies with Illumina short reads have reported estimates from Kraken2 to be more accurate than Kraken1, KrakenUniq, CLARK and Centrifuge at both the genus and species levels [3]. Furthermore, studies have shown with Illumina data that Bracken, which re-assigns reads at intermediate taxonomic nodes in Kraken1/2 outputs to the species level or above, produces far better abundance estimates [4,10]. Our results confirm that Bracken produces far more accurate abundance estimates for both long- and short-reads, especially when using a standard database. Previous metagenomic studies have rarely used Bracken or standard, i.e. larger, databases [1], the latter in part because of higher memory requirements, and we encourage future studies to utilize the most accurate tools available where computational resources permit.
To formally account for bioinformatic contamination in clinical diagnostic workflows, we produced the first known catalogue of misclassified species from simulating 1568 genomes representing ten common pathogens and six common blood-culture contaminants using different classification tools, databases and sequencers (summary catalogue in Table 3, and detailed catalogue in Table S3). It should be noted that our proposed minimum cut-off thresholds for identifying the presence of additional species are based on the extent of bioinformatic contamination arising from taxonomic misclassification only. Other forms of contamination are possible during sampling and laboratory workflows, which also need to be considered. Our thresholds can be applied to metagenomic samples by multiplying the estimated proportion of a sample that is from a specific species by the 95th, 99th or 100th percentile misclassification cut-off values in Table 3. For instance, if a sample by read proportion is found to have 50% Streptococcus mitis and 40% S. epidermidis, then the 95th percentile threshold for the proportion of reads misclassified as S. pneumoniae would be 50% × 26.6% + 40% × 0% = 13.3%. Therefore, even if the remaining 10% of the sample were classified as S. pneumoniae this would be considered a bioinformatic contaminant, while S. mitis and S. epidermidis would not, as they are both above their respective modified cut-off values. Refinements to this heuristic approach are possible, for example probabilistically estimating the proportion of each sample made up of a given species while simultaneously accounting for misclassification.
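The worked example above can be expressed programmatically. The following Python sketch is our illustration, not the authors' code; the cut-off values are those quoted in the text for the 95th percentile:

```python
def contamination_threshold(sample_composition, cutoffs, target):
    """Combined misclassification threshold for `target`: the sum over
    detected species of (observed read proportion x that species'
    cut-off for being misclassified as `target`)."""
    return sum(prop * cutoffs.get(source, {}).get(target, 0.0)
               for source, prop in sample_composition.items())

# Worked example from the text: 50% S. mitis, 40% S. epidermidis, with
# 95th percentile cut-offs for misclassification as S. pneumoniae of
# 26.6% (S. mitis) and 0% (S. epidermidis)
sample = {"S. mitis": 0.50, "S. epidermidis": 0.40}
cutoffs = {"S. mitis": {"S. pneumoniae": 0.266},
           "S. epidermidis": {"S. pneumoniae": 0.0}}
threshold = contamination_threshold(sample, cutoffs, "S. pneumoniae")
# threshold == 0.133, i.e. up to 13.3% of reads classified as
# S. pneumoniae are consistent with bioinformatic contamination alone
```

Any apparent S. pneumoniae abundance at or below this threshold would be treated as a bioinformatic contaminant rather than evidence of polymicrobial infection.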
We used existing real-world Illumina data to show correlation with the Illumina simulator (Fig. 5), while the Nanopore simulator used has been benchmarked extensively elsewhere [21]. Detailed 95th, 99th and 100th percentile cut-off values for other species are available in Table S3, which may be used to determine if there is strong evidence that additional organisms are present, i.e. there is polymicrobial infection or, conversely, if only bioinformatic contamination is present. Using simulations, we also show that misclassification of human reads is negligible when human DNA contamination might exist in clinical samples.
Additionally, our results show that misclassification at the species level was more common when the misclassified species had a higher average nucleotide identity to the true species. However, there were several outliers where misclassification was greater than expected for the given level of genome-wide nucleotide identity reflecting the difficulty distinguishing species within known groups or complexes of bacteria. Tools such as DeepMicrobes, which use a deep learning-based computational framework for taxonomic classification may offer an advantage in these situations where far fewer false positives are produced compared to other tools regardless of different degrees of nucleotide similarity [34].
The advent of long-read technologies such as Oxford Nanopore has significant advantages for metagenomic analysis, from the detection of structural variants to improved de novo assembly [35]. Furthermore, long-read sequencing of native molecules eliminates amplification bias while preserving base modifications [36]. However, these technologies are limited by a higher error rate affecting 6–12% of all sequenced bases, with a significant fraction of insertion and deletion errors [37]. This may affect the success of current classification methods, as there are few algorithms developed to exploit long-read data. However, even after simulating sequencing errors, the longer reads of Nanopore sequencing meant that it could still outperform Illumina sequencing. This finding is supported by previous work [38]. Following on from this observation, we directly investigated the impact of read length on the proportion of reads correctly classified. Allowing for sequencing error, for E. coli and S. aureus, Nanopore reads of ~2000 bp performed similarly to Illumina reads, and performed better at read lengths >5000 bp. Our findings have several implications: as technology and accuracy for long-read sequencing advance, we expect classification accuracy to improve significantly. Furthermore, filtering out shorter reads (at least those <1000 bp) from error-prone Nanopore sequencing can provide performance either equivalent or superior to standard Illumina sequencing.
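In practice, read-length filtering is typically performed with dedicated tools (for example NanoFilt or seqkit). As a minimal self-contained sketch of the idea, using hypothetical toy reads and the 1000 bp threshold suggested above:

```python
def parse_fastq(lines):
    """Minimal FASTQ parser: yields (header, sequence, plus, quality)
    tuples from an iterable of lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        yield (header.rstrip(), next(it).rstrip(),
               next(it).rstrip(), next(it).rstrip())

def filter_by_length(records, min_length=1000):
    """Keep only reads whose sequence is at least min_length bp."""
    return [rec for rec in records if len(rec[1]) >= min_length]

# Toy example: read1 is 1200 bp, read2 is 400 bp
fastq_lines = ["@read1", "ACGT" * 300, "+", "I" * 1200,
               "@read2", "ACGT" * 100, "+", "I" * 400]
kept = filter_by_length(list(parse_fastq(fastq_lines)), min_length=1000)
# only read1 passes the 1000 bp filter
```

Applying such a filter before classification discards the short, error-prone Nanopore reads that drive most misclassification, at the cost of some sequencing yield.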
Although we have thoroughly analysed and discussed technical factors that correlate with taxonomic misclassification, biological factors that might be responsible for misclassification may also exist. One such biological aspect that we have addressed here is the presence of species complexes, i.e. groups of species that are closely related yet named differently, which can result in the erroneous classification of sequence reads. Other biological factors might include the contribution of highly similar genomic regions that are not specific to a particular species but are shared across two or more species. These regions are well known to be prevalent in bacterial pathogens (e.g. in mobile genetic elements) and will affect read misclassification rates even if the sequencing reads are error-free. Future studies are required to ascertain how much bioinformatic contamination may be removed by building classifier databases from modified genome-sequence databases, which contain exclusively the relatively 'stable' genomic backbone regions after excluding regions shared among species.
In conclusion, our findings highlight the pertinent issue of the presence of bioinformatic contamination and how this varies by the combination of sequencer, classifier and database used. While we have recommended Bracken using a standard database to be utilized in metagenomic workflows, we have produced a misclassification catalogue whereby misclassification can be accounted for using any combination of sequencers, classifiers or databases benchmarked in this study. Benchmarking metrics should inform the choice and implementation of metagenomic sequencing tools for both research and clinical applications.

Funding information
The study was supported by the National Institute for Health Research (NIHR) Biomedical Research Centre, Oxford. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. DWE is a Robertson Foundation Fellow. KG is a Rhodes Scholar.