NanoRTax, a real-time pipeline for taxonomic and diversity analysis of nanopore 16S rRNA amplicon sequencing data

Introduction
Since the adoption of next-generation sequencing (NGS) technologies, the continuous development of sequencing techniques and cost reductions have revolutionized the study of microbial communities [1]. The ever-growing availability of sequencing equipment in research laboratories and facilities has dramatically increased the number of metagenomic studies, databases, and bioinformatic tools [2,3]. Consequently, a wide range of applications has emerged in life and health sciences, such as the integration of sequencing approaches in clinical settings [4,5], where these methods can bolster the speed and sensitivity of traditional microbial culturing and antibiotic susceptibility testing [6].
The introduction of third-generation sequencing technologies, such as those from Oxford Nanopore Technologies (ONT), has enabled the sequencing of long reads (>1 kbp) on a portable platform, making it possible to sequence samples even in nonspecialized environments [7,8]. In particular, ONT long reads can span complete transcripts or genes, including target sequences such as the full-length 16S rRNA gene used for taxonomic classification of bacteria. The increase in read length has improved taxonomic resolution and classification accuracy, making it possible to assign reads beyond the genus level when performing pathogen identification or diversity analyses [9]. ONT sequencing platforms also offer the unique possibility of accessing the read data of an ongoing experiment in real time when paired with modern GPU basecalling modes. This characteristic, along with the availability of rapid library preparation protocols, has enabled turnaround times of less than 6 h, a dramatic decrease from the 48-72 h required for microbial culture approaches, emphasizing the potential of bringing streamlined sequencing and real-time analysis to time-critical settings [10,11].
This challenge requires pairing rapid laboratory protocols with bioinformatic tools adapted for real-time workflows. In addition, the effect of tool and database selection on long-read taxonomic classification needs to be comprehensively evaluated in a real-time analysis scenario [12]. Here we present NanoRTax, a Nextflow-based pipeline for bacterial taxonomic classification and sample diversity analysis of nanopore full-length 16S rRNA amplicon reads. The pipeline integrates state-of-the-art read classification methods, downstream analysis, and real-time capability to enable benchmarking of 16S rRNA gene classification methods while the sequencing experiment is in progress. The pipeline is paired with an independent Dash web application that provides immediate access to taxonomic information, diversity statistics, and visualizations.

Materials and methods
NanoRTax is implemented in the Nextflow [13] workflow management system to enable efficient parallel execution and built-in integration of software dependencies using Docker containers and conda environments.
NanoRTax input consists of basecalled and demultiplexed FASTQ files following the structure of MinKNOW sequencing software output directories. The output path of an ongoing experiment can be specified for real-time analysis of newly generated FASTQ files. First, input sequences undergo a quality control step using Fastp [14]. By default, reads shorter than 1400 base pairs (bp) or longer than 1700 bp are discarded to keep only near full-length 16S rRNA sequences. However, these thresholds can be set manually by the user to select alternative length intervals. Then, taxonomic assignment is performed by one or more classifiers chosen among Kraken2 [15], Centrifuge [16], and BLAST [17]. Database and parameter selection for each tool can be specified via the command line or in pre-loaded configuration files, and users can easily replace the database with another of their choice. The classification output is then processed to extract the full taxonomy for every classified read using Taxonkit [18]. This information is used in the next step to generate the final NanoRTax output, which includes the sequence relative abundances, diversity index calculations at different taxonomic levels, and an abundance table of taxa for each sample analyzed in the run. A report aggregation step is performed while new FASTQ sequence files are fed to the pipeline and classified. This synchronizes NanoRTax execution with the sequencing experiment and allows the inspection of partial results while the experiment is ongoing.
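The default length filter and the relative abundance calculation described above can be illustrated with a minimal Python sketch. This is not the pipeline's code: in NanoRTax the filtering is done by Fastp and the abundance tables are produced by the workflow itself. The function names (`read_fastq`, `keep_near_full_length`, `relative_abundance`) are hypothetical, and the sketch assumes plain four-line FASTQ records and one taxon label per classified read.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

MIN_LEN = 1400  # default lower bound stated in the text (bp)
MAX_LEN = 1700  # default upper bound stated in the text (bp)

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    buf = []
    for line in lines:
        buf.append(line.rstrip("\n"))
        if len(buf) == 4:
            yield buf[0], buf[1], buf[3]
            buf = []


def keep_near_full_length(
    records: Iterable[Record], min_len: int = MIN_LEN, max_len: int = MAX_LEN
) -> Iterator[Record]:
    """Keep reads whose length lies in [min_len, max_len], bounds included."""
    for header, seq, qual in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq, qual


def relative_abundance(taxa: Iterable[str]) -> Dict[str, float]:
    """Map each taxon label to its fraction of the classified reads."""
    counts = Counter(taxa)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}
```

Passing `min_len` and `max_len` explicitly mirrors the user-adjustable length interval mentioned in the text; the taxon labels would come from the classifier output after taxonomy extraction.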
For user-friendly visualization of the partial or complete outputs, the pipeline can be paired with an independent Python Dash web application, which serves as a dashboard to explore outputs in real time. The interface integrates interactive summary tables and plots regarding quality control parameters, relative abundances with modifiable frequency cutoffs, and sample diversity index calculations over time. The general workflow of NanoRTax and software versions are detailed in Fig. 1 and Table 1.

Results
To assess the usefulness of the NanoRTax real-time analysis capability, we analyzed full-length 16S rRNA gene nanopore sequencing data from tracheal aspirates of adult Intensive Care Unit (ICU) patients with non-pulmonary sepsis (n = 31; 25 survivors and six deceased patients), collected at a single medical-surgical ICU within 8 h of sepsis diagnosis. We previously showed that this small cohort had sufficient statistical power and described that a reduction in genus-level bacterial lung diversity within 8 h of sepsis diagnosis is associated with ICU mortality, providing a potentially novel and early prognostic biomarker of non-pulmonary sepsis with better prognostic ability than other commonly used clinical scores [19]. We reanalyzed these data using the complete NanoRTax workflow and generated patient-level reports [20].
For each ICU sample, species-level diversity index metrics were calculated periodically from 5,000 to 100,000 reads to simulate different time points of an ongoing sequencing experiment. The Shannon diversity index calculated at species level for each time point was then compared between deceased and surviving patients based on Kraken2 classifications against the NCBI RefSeq database containing only bacterial genomes. BLAST classifications based on the NCBI 16S RefSeq database were used only for reference, as this combination of classifier and database provided the best performance in our previous assessments [21]. The predictive value of the lung bacterial diversity index was assessed by fitting a linear model and calculating the Area Under the Curve (AUC) from Receiver Operating Characteristic (ROC) curves (Fig. 2). We observed a reduction in the Shannon diversity index in deceased ICU patients compared to survivors as early as less than 2 h into the simulated sequencing experiment, which roughly corresponds to 5,000 reads (Wilcoxon test, p = 0.002; AUC = 88.67% (Kraken2) and 86.00% (BLAST)).
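As an illustration of the quantities involved, the following Python sketch computes the Shannon index at cumulative read checkpoints and a rank-based AUC for a single predictor. It is a sketch under stated assumptions, not the analysis code used in this study: the natural logarithm is assumed for the Shannon index, the AUC is computed with the Mann-Whitney pairwise formulation rather than by fitting a model, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


def shannon_index(taxa: Iterable[str]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over the observed taxa."""
    counts = Counter(taxa)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def shannon_at_checkpoints(
    taxa: Sequence[str], checkpoints: Sequence[int]
) -> Dict[int, float]:
    """Shannon index on the first n classified reads for each checkpoint n.

    Checkpoints larger than the number of available reads are skipped,
    mimicking an experiment that has not yet produced that many reads.
    """
    return {n: shannon_index(taxa[:n]) for n in checkpoints if n <= len(taxa)}


def auc_lower_is_positive(
    positives: Sequence[float], negatives: Sequence[float]
) -> float:
    """Rank-based AUC: probability that a positive case has a lower score
    than a negative case, counting ties as one half."""
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p < q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Here the "positive" class would be the deceased patients, since lower diversity is the signal of interest. For a single predictor, this pairwise statistic equals the area under the ROC curve obtained by thresholding the diversity index directly.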

Discussion
The strong association of reduced lung bacterial diversity with a worse sepsis prognosis highlights the importance of host-microbial interactions and provides an early prognostic biomarker for sepsis. An early sepsis response has been proven to be of paramount importance for improving patient outcomes and will remain relevant until novel drugs or interventions are demonstrated to be effective [22]. Applications in the context of diagnosis and mortality prediction have been explored recently, aiming to integrate not only sequencing information but also clinical data to enable better diagnosis, prognosis prediction, or entailment of treatment [23,24,25]. In this study, we simulated a realistic scenario of a real-time framework to predict ICU mortality in sepsis patients based on 16S rRNA gene sequencing experiments on lung samples paired with rapid analysis protocols, allowing us to draw the same conclusions as those from a complete 48 h sequencing dataset. Moreover, these results extended the previously observed association of lung dysbiosis with mortality [19] to the species level, as a result of the higher taxonomic resolution achieved by sequencing the full-length 16S rRNA gene. NanoRTax enables the immediate analysis of data while sequencing by implementing the rapid classifiers Kraken2 and Centrifuge, which have recently been evaluated for long-read metagenomic profiling [26,27]. Additionally, the taxonomic assignment can be performed with BLAST to provide a framework to benchmark the tool in a real-time context or to evaluate Kraken2 and Centrifuge against a gold-standard BLAST classification. Our results also serve as a proof of concept of how real-time bioinformatic workflows could shorten turnaround times in critical care settings and suggest their possible use in future research on early-response strategies for sepsis.
While NanoRTax was designed for full-length 16S rRNA gene taxonomic analysis of microbial samples, a focus on different amplification targets and the use of pipeline parametrization could take the application beyond bacterial profiling. Similar NanoRTax-based classification workflows can be proposed for the detection of fungal and viral infections [28,29], while non-taxonomic targeted amplicons can profile either specific antimicrobial resistance genes or entire gene panels such as the resistome [30]. Furthermore, continuous releases of the ONT sequencing chemistry and improvements in the basecalling algorithms are expected to positively impact taxonomic assignments using NanoRTax. ONT hardware releases such as the ONT Flongle sequencing adapter or the ONT VolTRAX library preparation device can simplify rapid portable sequencing workflows by reducing the resources needed for the experimental protocols [31]. Enhanced portability and analytical speed directly benefit the in-situ assessment of microbial samples and confer relevance to the real-time bioinformatics tools described in this study. However, there are substantial practical challenges for routine taxonomic classification and metagenomics applications outside research practices [32]. Both analytical factors, such as sensitivity limitations due to genome size, pathogen load, or ease of microorganism lysis [33], and sample factors, such as background contamination issues [34,35], can affect classification results in metagenomics studies. Bioinformatic analysis has also turned out to be non-trivial, since the completeness and accuracy of the ever-growing sequence databases and the different approaches of taxonomic methods have been demonstrated to have an important effect on results [2,36,37]. Thus, careful interpretation and constant benchmarking of analysis methods and databases will be key to the success of taxonomic classification and metagenomic applications [38].
To take data analysis to a real-time scenario, GPU basecalling of the raw data generated from an ONT sequencing experiment is necessary to enable streaming bioinformatic analysis. In fact, both the fast and the high-accuracy basecalling models in CPU mode are too slow for real-time applications (e.g., basecalling of 4,000 16S rRNA gene reads using fast mode took 7-13 min using 48 CPU threads vs 10-15 s using GPU in our settings).
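The timings quoted above can be turned into throughput and speedup ranges with simple arithmetic. The Python sketch below uses only the figures given in the text (4,000 reads, 7-13 min on CPU, 10-15 s on GPU); the function and variable names are illustrative, and the results describe these quoted settings only.

```python
# Figures quoted in the text: 4,000 reads basecalled in fast mode.
READS = 4_000
CPU_SECONDS = (7 * 60, 13 * 60)  # 7-13 min using 48 CPU threads
GPU_SECONDS = (10, 15)           # 10-15 s using GPU


def throughput_range(reads, seconds_range):
    """Return (slowest, fastest) reads per second for a (min, max) time range."""
    fastest_time, slowest_time = seconds_range
    return reads / slowest_time, reads / fastest_time


def speedup_range(cpu_range, gpu_range):
    """Return the (smallest, largest) GPU-over-CPU speedup implied by the ranges."""
    return cpu_range[0] / gpu_range[1], cpu_range[1] / gpu_range[0]


cpu_rate = throughput_range(READS, CPU_SECONDS)
gpu_rate = throughput_range(READS, GPU_SECONDS)
speedup = speedup_range(CPU_SECONDS, GPU_SECONDS)

print(f"CPU: {cpu_rate[0]:.1f}-{cpu_rate[1]:.1f} reads/s")
print(f"GPU: {gpu_rate[0]:.1f}-{gpu_rate[1]:.1f} reads/s")
print(f"GPU speedup: {speedup[0]:.0f}x-{speedup[1]:.0f}x")
```

With these numbers, the CPU processes roughly 5 to 10 reads per second and the GPU roughly 267 to 400 reads per second, a 28- to 78-fold difference.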

Conclusions
We have developed NanoRTax, a bioinformatics pipeline to enable real-time taxonomic analysis of full-length 16S rRNA nanopore reads featuring multiple classification tools and immediate output visualization. We applied the NanoRTax workflow to the evaluation of 16S rRNA gene sequencing data from lung samples with the aim of predicting mortality in sepsis patients admitted to the ICU. Although further experimental demonstrations are needed, our results obtained from the analysis of simulated very early sequencing data (within 2 h) support the benefits of implementing NGS-based assessments in this scenario. Although this field is developing at a fast pace, we expect that routine clinical metagenomics will remain outside critical time-response scenarios until its limitations are addressed. We anticipate that real-time bioinformatic analysis tools and implementations will advance concurrently with NGS development and applications.

Consent for publication
Not applicable.

Availability of data and materials
The NanoRTax Nextflow pipeline and Python Dash web application are freely available under the MIT license on GitHub [39] (https://github.com/genomicsITER/NanoRTax). The repository includes instructions and a testing dataset for a minimal pipeline execution.