Recommendations for the introduction of metagenomic next-generation sequencing in clinical virology, part II: bioinformatic analysis and reporting

Metagenomic next-generation sequencing (mNGS) is an untargeted technique for determination of microbial DNA/RNA sequences in a variety of sample types from patients with infectious syndromes. mNGS is still in its early stages of broader translation into clinical applications. To further support the development, implementation, optimization and standardization of mNGS procedures for virus diagnostics, the European Society for Clinical Virology (ESCV) Network on Next-Generation Sequencing (ENNGS) has been established. The aim of ENNGS is to bring together professionals involved in mNGS for viral diagnostics to share methodologies and experiences, and to develop application guidelines. Following the ENNGS publication Recommendations for the introduction of mNGS in clinical virology, part I: wet lab procedure in this journal, the current manuscript aims to provide practical recommendations for the bioinformatic analysis of mNGS data and the reporting of results to clinicians.


Introduction
Metagenomic next-generation sequencing (mNGS) is an untargeted technique for the determination of DNA/RNA sequences in a variety of clinical sample types from patients with infectious syndromes [1][2][3]. mNGS is suited to the identification of any pathogen, including variants that have diverged at typical PCR amplification targets, pathogens not known to be associated with a specific clinical syndrome, and novel pathogens that may remain undetected by target-based methods [4,5]. Despite these clear advantages, mNGS is still in its early stages of translation into clinical application. One of the challenges in the clinical use of mNGS is the current lack of standardization of methods and workflows, including the bioinformatic analysis needed to ensure fit-for-purpose, sensitive and specific pathogen detection. The performance of metagenomic methods is heavily dependent on accurate bioinformatic analysis, and both the classification algorithms and the databases are crucial factors determining the overall performance of available pipelines [6,7]. A wide range of metagenomic pipelines and taxonomic classifiers have been developed, often for biodiversity studies analysing the composition of the microbiome in different samples and cohorts. In contrast, when applying mNGS for patient diagnostics, false-negative and false-positive bioinformatic classification results can have significant consequences for patient care. Reports on bioinformatic tools for metagenomic analysis for virus diagnostics typically describe the algorithms and validation of single in-house pipelines developed by the authors themselves [8][9][10][11][12], stressing the need for high-quality validation studies. The development of guidelines and recommendations on mNGS bioinformatic analysis methods and reporting will assist the implementation of mNGS in diagnostic laboratories, ensuring the validity of results and thus optimizing patient management.
To support the development and implementation of mNGS procedures for virus diagnostics, a network has been established under the auspices of the European Society for Clinical Virology (ESCV): the ESCV Network on Next-Generation Sequencing (ENNGS). The aim of this network is to bring together professionals involved in mNGS for viral diagnostics, to share materials, methodologies and experiences, and to develop recommendations for the implementation and use of mNGS in clinical diagnostics and Public Health laboratories.

Aim and scope
This review aims to provide recommendations for the implementation and validation of bioinformatic analysis methods for viral mNGS. The wet lab part of the process has been discussed previously (Part I) [13] and is outside the scope of the current review. We aim to provide practical recommendations for the analysis and reporting steps to aid in the successful implementation of fit-for-purpose mNGS procedures in viral diagnostic laboratories.

Bioinformatic software, expertise and information technology (IT) equipment
Processing of mNGS data is performed either by specialized bioinformaticians or by non-bioinformaticians working through user-friendly interfaces to tools and pipelines. Most metagenomic software pipelines are in the public domain and require expertise in bioinformatics. On the hardware side, the options are i) the use of local computers, or ii) the use of remote, more powerful computers, including cloud computing. Although some bioinformatic pipelines can be run on relatively modest desktop servers, even directly in the laboratory, the recommended setup for routine clinical metagenomic analysis, which requires considerable computational capacity, is access to a cluster server, usually situated within a dedicated, physically separated "core" IT facility with infrastructure for central data processing, accessible either directly or via external providers of the analysis pipelines (Table 1, Recommendation 1). User-friendly software options include cloud-based platforms with web front-end interfaces, which facilitate direct uploading of the raw files from sequencing instruments and direct downloading of the final output analyses from the server. Examples of such interfaces and platforms are Galaxy [14] and INSaFLU (https://insaflu.insa.pt/) [15], server hosting (e.g. Amazon Web Services, Microsoft Azure), and cloud-based software solutions, which can be scaled on demand and frequently run at lower operational costs (see Table 2). Finally, "third-generation" small sequencers based on nanopores, which have relatively low capacity for metagenomic runs and are currently used for research applications, may simplify and streamline both the laboratory and bioinformatic processes, allowing real-time analysis on a laptop computer and, in the future, potentially near the bedside [16][17][18].

Data security
Data should be protected from unauthorized access and actions, loss, and destruction. Patient privacy should be guaranteed, and justified use and governance of personal data should be considered when implementing metagenomic procedures. The complexity and data management issues associated with NGS have led an increasing number of diagnostic laboratories to turn to cloud services [19]. Cloud computing facilitates on-demand self-service, broad network access, resource pooling, and metering capabilities, but also means that the end user generally has limited knowledge of, or no control over, the exact location of the provided computational services [19]. Therefore, it is recommended to have written agreements with cloud service providers on the protection of information against unauthorized access, use, disclosure, disruption, modification, or destruction, on confidentiality, and on timely and reliable access to and use of information (Recommendation 2). Furthermore, since accreditation of laboratory activities requires that every component of the assay be verified prior to reporting patient test results, the agreement should include the management of new software releases, to enable validation prior to using a new version for patient care.

Storage of raw data
NGS FASTQ data and metadata files should be stored with file and folder names that are unique and informative, aiding classification and sorting (https://www.ukdataservice.ac.uk/manage-data/format/organising.aspx) [20] (Recommendation 3). It is recommended to include, for example, the date of data delivery, the project team or (sub)department, the project name, the sequencing library number, and unique sample identifiers such as sample number and date, applied consistently over time and across different people. A standardized submission protocol providing metadata and data handling is supported by a Laboratory Information and Management System (LIMS). The original data files saved in the folder, as well as the folder itself, should have read-only access, and files in the folder should keep their original names, supporting the standards required for method accreditation. The names of FASTQ files containing Illumina reads typically include the flow cell number, sample name, sample number, machine lane number and read type (R1/R2), for instance "HK2LLDSXX_7074-09-002-001_CTGATCGT-ATATGCGC_L004_R1.fastq". The names of FAST5 files containing nanopore (ONT) raw electrical signals typically include the flow cell number, the run ID and a consecutive number of the files generated per barcode, for instance "FAK96194_5138107d5a8425587f0828dd31f396e3ebd774c4_1.fast5"; these need to be converted into FASTQ format using, for instance, Guppy. Most tools for NGS data processing accept files in the compressed formats 'tar', 'zip', or 'gzip'.
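As an illustration of such a naming convention, the minimal Python sketch below parses an Illumina-style FASTQ file name into its metadata fields; the field names and the underscore-delimited layout are assumptions based on the example above and should be adapted to the local convention.

```python
from pathlib import Path

def parse_illumina_fastq_name(path: str) -> dict:
    """Split an Illumina-style FASTQ name of the form
    <flowcell>_<sample>_<index>_<lane>_<read>.fastq into its components.
    Field names are illustrative; adapt them to the local naming convention."""
    stem = Path(path).name.replace(".fastq.gz", "").replace(".fastq", "")
    flowcell, sample, index, lane, read = stem.split("_")
    return {"flowcell": flowcell, "sample": sample, "index": index,
            "lane": lane, "read": read}

# Example based on the file name given in the text
print(parse_illumina_fastq_name(
    "HK2LLDSXX_7074-09-002-001_CTGATCGT-ATATGCGC_L004_R1.fastq"))
```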

Data preprocessing
Sequence data quality can be visualized with tools such as FastQC [21] (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC [22]. Visualization is followed by data pre-processing, which includes the removal of low-quality and low-complexity reads and bases (e.g. using PRINSEQ [23]) and of sequence adapters using tools like Trimmomatic [24] and Cutadapt [25]; their algorithms are fairly comparable, with minor differences in read counts after trimming. Some tools, e.g. Trimmomatic, do not auto-detect adapters and require an adapter file to be supplied.
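The principle of quality filtering can be illustrated with a short, self-contained Python sketch; it is not a replacement for dedicated trimmers such as Trimmomatic or Cutadapt, and the thresholds shown (mean Phred 20, minimum length 50) are arbitrary examples.

```python
import gzip

def mean_phred(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Sanger/Illumina 1.8+ encoding)."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)

def filter_fastq(in_path: str, out_path: str,
                 min_quality: float = 20.0, min_length: int = 50) -> None:
    """Keep reads whose length and mean base quality exceed the given thresholds."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # header, sequence, '+', quality
            if not record[0]:
                break
            seq, qual = record[1].strip(), record[3].strip()
            if len(seq) >= min_length and mean_phred(qual) >= min_quality:
                fout.writelines(record)
```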

Removal of human sequences
Certain types of data analysis may require removal of ribosomal RNA reads or human reads prior to classification, both for ethical and data protection reasons and to speed up downstream data analysis. Validation of the efficacy of human read removal [26][27][28][29] is recommended in light of the General Data Protection Regulation (GDPR).
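One common approach, sketched below under the assumption that the raw reads have first been mapped to a human reference genome (e.g. GRCh38) with a standard aligner, is to retain only the reads that do not align to the human genome; the code uses the pysam package and is a simplified illustration rather than a validated removal procedure.

```python
import pysam  # third-party package for reading BAM files

def non_human_read_names(bam_path: str) -> set:
    """Return the names of reads that did NOT align to the human reference.
    Assumes bam_path was produced by mapping the raw reads to e.g. GRCh38."""
    keep = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped:
                keep.add(read.query_name)
    return keep

# The returned names can then be used to subset the original FASTQ files
# before classification, so that human sequences do not enter downstream analysis.
```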

Data analysis: version control
Downstream mNGS data analysis may be restricted to taxonomic classification of sequence reads or may alternatively include de novo assembly of reads into contigs or scaffolds, followed by alignment to a set of genomes, which requires selection of the appropriate tools for particular tasks and targets [30][31][32][33]. Currently there are no optimal or gold standard tools, and different approaches can produce different results for the same FASTQ file. In a recent ENNGS comparison of viral metagenomic pipelines, performance was determined by the combined components of each pipeline, including the algorithm, settings, and database [34].
It is recommended to apply version control to the pipeline tools and external databases used for NGS data analysis of clinical samples. For each tool used in the pipeline, at least the following parameters/options have to be documented: the date of analysis, the name and version of the tools and external databases, and the user-defined and default values of the parameters used for each tool, e.g. using the version management tool (Bio)Conda [35]. Additionally, it is recommended to version the overall ensemble of tools, e.g. using workflow management tools or Docker containers (e.g. Snakemake [36], Nextflow [37]) (Recommendation 4). Subsequently, the workflow and its default settings can be hosted on GitHub/GitLab [38], platforms with built-in version control.
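A lightweight way to capture this information is to write a machine-readable manifest alongside every analysis. The Python sketch below is illustrative only: the database file name, the Kraken2 version call and the parameter shown are assumptions, to be replaced with the tools and settings of the local pipeline.

```python
import hashlib
import json
import subprocess
from datetime import date

def file_md5(path: str) -> str:
    """MD5 checksum documenting the exact reference database file that was used."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    "analysis_date": date.today().isoformat(),
    "tools": {  # versions as reported by the tools themselves (illustrative call)
        "kraken2": subprocess.run(["kraken2", "--version"],
                                  capture_output=True, text=True).stdout.strip(),
    },
    "databases": {"refseq_viral": file_md5("refseq_viral.fasta")},  # hypothetical file
    "parameters": {"confidence": 0.1},  # user-defined and default settings
}

with open("analysis_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```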

Taxonomic classification algorithms
Taxonomic profiling gives insight into the taxonomic composition of the samples analyzed and results in relative abundances of organisms belonging to taxa at different taxonomic levels, for viruses primarily species, genus and family. Depending on the specific clinical question addressed, sequences may be further classified below the species level, for example into genotypes (hepatitis B and C viruses), subtypes (HIV-1), or isolates, although this is beyond the remit of the taxonomy provided by the ICTV, and the ability to accurately sub-type varies between pipelines [34].
Reads can be classified using different algorithmic approaches that can handle large numbers of sequencing reads in a reasonable amount of time [39]. To do so, most algorithms rely on stretches of perfect sequence matches with reference sequences, named k-mers. These tools can be divided into three groups: i) DNA-to-DNA classification (BLASTn-like; e.g. megaBLAST [40], Kraken [41], Centrifuge [42], CLARK [43]), ii) DNA-to-protein classification (BLASTx-like; e.g. DIAMOND [44], Kaiju [45], Genome Detective [46], SURPI [47], RIEMS [48]) and iii) marker-based classification (e.g. MetaPhlAn2 [49]). DNA-to-protein tools can be more sensitive to novel and highly variable sequences owing to the lower mutation rates of amino acid compared with nucleotide sequences [45,50,51]. An aspect that should be taken into account when selecting a taxonomic classification algorithm is the precision versus recall trade-off. High recall usually comes at the cost of a decline in precision, meaning that false positive taxa are reported at low abundance levels [34,39,52]. Each read is usually assigned a score or confidence level by the taxonomic algorithm, which can be taken into account by downstream applications as a reliability estimate of the classification [53].
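The k-mer principle underlying DNA-to-DNA classifiers can be illustrated with a deliberately naive Python sketch: every k-mer of a read votes for the reference taxa that contain it, and the taxon with the most votes wins. Production tools use far more compact indexes and lowest-common-ancestor logic; this example only conveys the idea.

```python
from collections import Counter

def build_kmer_index(references: dict, k: int = 31) -> dict:
    """Map each k-mer to the set of reference taxa containing it (toy index).
    references: {taxon_name: genome_sequence}."""
    index = {}
    for taxon, genome in references.items():
        for i in range(len(genome) - k + 1):
            index.setdefault(genome[i:i + k], set()).add(taxon)
    return index

def classify_read(read: str, index: dict, k: int = 31):
    """Assign a read to the taxon sharing the most exact k-mers with it,
    mimicking the principle (not the implementation) of k-mer classifiers."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        for taxon in index.get(read[i:i + k], ()):
            votes[taxon] += 1
    return votes.most_common(1)[0] if votes else ("unclassified", 0)
```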

Reference database
Selection of the reference database can significantly influence the results of taxonomic classification [7]. The reference database should consist of genomes that cover the entire genetic diversity of relevant organisms and should be curated in order not to contain any artificial, low-quality or incorrectly named genome sequences (Recommendation 5). Poorly curated databases containing misannotated reference sequences will lead to false positive results due to incorrect assignment of ambiguously mapped/aligned reads or k-mers. Incomplete databases missing newly discovered or uncommon viral strains can lead to false negative results [54]. Database compression by removal of duplicate sequences [46] is an effective way to save storage space, but compression can lead to a decreased performance in pathogen detection [39]. In general, larger databases enable more accurate sub-typing/classification to isolate level.
Several viral databases are available to the scientific community (examples are shown in Table 3). The complete NCBI GenBank nucleotide database [55], restricted to sequences assigned to viruses (NCBI:txid10239), contains redundant sequences, requires considerable computational resources and leads to a number of false-positive virus assignments [7], as GenBank entries are not curated. In contrast, NCBI's non-redundant RefSeq database [56] is relatively small, providing one sequence per species accurately assigned based on ICTV taxonomy, and, importantly, is well curated, significantly reducing the number of false-positive assignments to provisional sequences that can be inaccurate. Recently discovered viruses and virus variants highly divergent from the NCBI RefSeq reference sequence may remain unidentified, the latter also depending on the stringency of the mapping criteria of the classification algorithm used, as described above [4]. In clinical diagnostic practice, NCBI's RefSeq database is commonly used for the identification and classification of viruses and resulted in good overall performance in an international benchmark study [34]. Curated vertebrate virus genome databases, which conveniently lack non-vertebrate viruses for clinical diagnostic purposes, have also been proposed, for example Virosaurus (https://viralzone.expasy.org/8676) [57], in which sequences are clustered to remove redundancy.
With the exponential growth of the number of genome sequences in public databases, it is important to periodically update the reference databases used for taxonomic profiling, and to validate this update (Recommendation 6). The frequency of the update depends on the need to classify at subtype or isolate level, and on the appearance of novel viruses in the updated public databases. Finally, some virus reference sequences contain stretches of human origin, which may initially be noticed through the consistent appearance of such hits. This type of misannotation can be detected by aligning the assigned sequencing reads with BLAST, whereby the top hits turn out to be of human origin. Tagging/blacklisting such entries can structurally prevent misannotation of sequences and false positive results.

Table 2. Examples of external providers of web-based user-friendly viral metagenomic analysis tools and interfaces [34,77].

Removal of contaminating sequences
Contamination can be introduced at several steps of the workflow, including nucleic acid extraction kits, reagents and diluents, the post-sampling environment (e.g. airborne particles, index switching, carry-over from past sequencing runs), and misclassification related to the classification algorithms used and/or the reference databases available [58,59]. As mentioned in Part I of these guidelines, positive and negative controls should be included in the sequencing run so that post-sequencing contamination removal can be performed, either manually or using computational algorithms (Recommendation 7). Two examples of such tools are Recentrifuge [59] and the R package Decontam [60]. These algorithms are based on different assumptions: Recentrifuge classifies candidate contaminating taxa based on their relative frequency in the samples compared to the controls and checks for crossover contamination, whereas Decontam assumes that sequences from contaminating taxa are likely to have frequencies that inversely correlate with sample DNA concentration and are also likely to have a higher prevalence in control samples than in true samples (contaminating species do not have to compete with true species in the negative control). Furthermore, Recentrifuge takes into account the score level of the classifications provided by the taxonomic classifier at every step, thereby removing potential false positive taxa introduced by the taxonomic algorithm. It must be taken into account that (low level) sequences detected in the negative run control not uncommonly originate from highly abundant species present in patient samples (e.g. due to index hopping). Automated removal of contaminating sequences should be validated (Recommendation 7). Alignment of sequence reads against a contaminant database (e.g. using BWA) can also be useful.
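The prevalence-based reasoning can be illustrated with a simplified Python sketch that flags taxa detected at least as frequently in negative controls as in patient samples; this is a toy heuristic for illustration, not the statistical model implemented in Decontam or the scoring used by Recentrifuge.

```python
def flag_prevalence_contaminants(sample_counts: dict, control_counts: dict) -> set:
    """Flag taxa whose prevalence (fraction of libraries in which they are detected)
    is at least as high in negative controls as in patient samples.
    sample_counts / control_counts: {taxon: [read counts per library]}."""
    flagged = set()
    for taxon in set(sample_counts) | set(control_counts):
        s = sample_counts.get(taxon, [])
        c = control_counts.get(taxon, [])
        prev_s = sum(x > 0 for x in s) / len(s) if s else 0.0
        prev_c = sum(x > 0 for x in c) / len(c) if c else 0.0
        if prev_c > 0 and prev_c >= prev_s:
            flagged.add(taxon)  # candidate contaminant, to be reviewed manually
    return flagged
```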

Normalization of read counts
For quantitative or semi-quantitative results, normalization of the number of reads assigned to a given taxon by the total number of reads generated for each sample is useful, since the number of generated sequencing reads can differ considerably between samples [61,62]. Additionally, differences in average genome size between taxa can lead to misinterpretation of the results; therefore, additional normalization by the average genome length of each taxonomic group at a given taxonomic level is required, for example by reporting read counts per kilobase of genome length per million reads [47,58] (Recommendation 8).
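A minimal implementation of such length- and depth-normalized counts, assuming the read count assigned to a taxon, the (average) genome length of that taxon and the total library size are known, could look as follows:

```python
def reads_per_kb_per_million(assigned_reads: int, genome_length_bp: int,
                             total_reads: int) -> float:
    """Normalize a taxon's read count by genome length (per kb) and by
    sequencing depth (per million reads in the library)."""
    return assigned_reads / (genome_length_bp / 1_000) / (total_reads / 1_000_000)

# Example: 250 reads assigned to a 30 kb viral genome in a library of 10 million reads
print(reads_per_kb_per_million(250, 30_000, 10_000_000))  # ~0.83 reads per kb per million
```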

Datasets for validation
The bioinformatic pipeline should be evaluated using data from real samples that are well characterized by molecular diagnostic methods, which can be supplemented with analysis of in silico datasets [63,64] (Recommendation 9). Artificial mNGS reads can be generated using tools such as ART or CAMISIM [65], or other simulators, reviewed in [66]. Using simulated datasets, the impact of variable amounts of background sequences (e.g. reads of human or bacterial origin), different mutation rates, the detection rate of less-related viral genomes, and multiple combinations of settings (single versus paired-end reads and different read lengths) can be tested [6].
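For quick sanity checks of a pipeline, reads can even be simulated with a few lines of Python, as in the toy sketch below, which draws reads uniformly from a reference genome and introduces random substitution errors; dedicated simulators such as ART or CAMISIM model platform-specific error profiles and should be preferred for formal validation.

```python
import random

def simulate_reads(genome: str, n_reads: int, read_length: int = 150,
                   error_rate: float = 0.01) -> list:
    """Draw reads uniformly from a reference genome and add random substitutions."""
    reads = []
    for _ in range(n_reads):
        start = random.randint(0, len(genome) - read_length)
        read = list(genome[start:start + read_length])
        for i in range(read_length):
            if random.random() < error_rate:
                read[i] = random.choice([b for b in "ACGT" if b != read[i]])
        reads.append("".join(read))
    return reads
```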

Pipeline performance
Pipeline performance, i.e. recall (sensitivity), precision (positive predictive value) and/or F1-score, should be determined with real datasets from samples with a known status based on gold standard molecular diagnostic methods (Recommendation 10). The F1-score is defined as the harmonic mean of sensitivity (recall/true positive rate) and precision [6]. Specificity analysis for mNGS methods is hampered by the extremely high number of negative mNGS findings without an available PCR result. By calculating precision instead, the unknown proportion of true negative findings is conveniently avoided.
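For clarity, the three metrics can be computed from the counts of confirmed and unconfirmed findings as in the short sketch below; the example numbers are invented for illustration.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Precision, recall (sensitivity) and F1-score from counts of findings,
    e.g. mNGS hits compared against gold standard PCR results."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 18 PCR-confirmed hits, 2 unconfirmed hits, 3 PCR-positive targets missed
print(precision_recall_f1(18, 2, 3))  # (0.90, ~0.857, ~0.878)
```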
The limit of detection of the entire workflow should be determined in line with the intended use of the assay. Assessment of pipeline performance should include base calling, alignment, and target identification.

Threshold for defining a positive result
For pathogen detection, the threshold for defining a positive result has to be established during the validation phase by comparison with gold standard molecular techniques. Since virus read counts and their distribution, and thus the threshold, are dependent on factors throughout the entire wet lab and analysis workflow, the threshold has to be determined for every protocol.
Recent validation work suggests that, for robust identification of a positive result, non-overlapping reads mapping to three or more different genomic regions of the identified organism should be present [1,9,67]. A threshold based on read distribution appears more accurate than a threshold based (only) on the number of reads: high read numbers from amplicon contaminants would mistakenly be reported, and a few reads distributed over several genome locations of a pathogen may be missed when setting a strict threshold based on read counts alone [1,68,69]. Therefore, confirmation of positive results should include mapping of the reads to a relevant reference sequence of the identified organism(s), resulting in genome coverage information, either as an automated part of the pipeline or as a secondary analysis (Recommendation 11). It must be noted that identification of bacteria would require different criteria [70].
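A simple way to operationalize the read-distribution criterion, sketched below under the assumption that the start coordinates of reads mapped to the candidate reference are available, is to merge overlapping read intervals and count the resulting distinct regions; the positions and read length in the example are invented.

```python
def distinct_regions_covered(read_positions: list, read_length: int, min_gap: int = 0) -> int:
    """Count non-overlapping genomic regions covered by mapped reads.
    read_positions: 0-based start coordinates of reads mapped to the candidate reference.
    Reads whose intervals overlap (or lie within min_gap bases) are merged into one region."""
    regions = 0
    current_end = None
    for start in sorted(read_positions):
        if current_end is None or start > current_end + min_gap:
            regions += 1  # a new, non-overlapping region begins
        current_end = max(current_end or 0, start + read_length)
    return regions

# A hit supported by >= 3 distinct regions is more credible than many reads on one amplicon
positions = [120, 145, 5_000, 20_310]
print(distinct_regions_covered(positions, read_length=150) >= 3)  # True
```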

Ring trials
Benchmarking of a variety of pipelines [63,64] has recently been performed by the ENNGS using datasets from clinical samples, with RT-PCR as the gold standard [34]. A wide variety of viral metagenomic pipelines was used in the participating clinical diagnostic laboratories. In this benchmark, detection of low-abundance viral pathogens and of mixed infections remained a challenge. Benchmarks are required for accreditation purposes, can reveal less effective components of a workflow and, moreover, can point out best practices with regard to the common aim of the participants: the use of mNGS for clinical diagnostics. A ring trial organized by the Swiss Institute of Bioinformatics covered the performance of both the wet and the dry lab procedures [71]. QCMD initiated an external quality assessment (EQA) scheme for metagenomic workflows in 2020 (Q4) using spiked samples. Clinical laboratories providing an mNGS service should participate in ring trials or a formal EQA scheme where available; schemes that test both the wet lab and the bioinformatics are preferable.

Result review and reporting
Before reporting, the mNGS data need to be technically evaluated and reviewed for quality, possible laboratory contamination, and plausibility (Recommendation 12), which may be done in an interdisciplinary team with molecular microbiology, bioinformatics, and clinical virology expertise [72]. This technical team should consider the quality of the run and the expected number of spike-in control reads. (Kit) contaminants or sequences also detected in the no-template controls should be corrected for. For the evaluation and confirmation of a viral infection, the depth of coverage and the number of different genome regions covered have to be taken into account (Fig. 1). Potential false positive hits based on classification misassignments can be detected manually using BLAST. Confirmatory PCRs targeting mNGS sequence hits are useful, particularly in the early phase of implementation.
After technical review, the result of mNGS should be reported to the clinician in a compact format that facilitates decision making with regard to the treatment strategy and further diagnostic steps. The reports should therefore be comprehensive yet easy to read, and contain only clinically relevant or potentially relevant information. The essence of diagnostics is to identify potentially clinically relevant findings and interpret their significance. Therefore, hits of known reagent contaminants, misassignments, bacteriophages, and common (retro)viral endogenous sequences should not be reported to the requesting clinician [9] (Recommendation 13).
Pathogenic viruses detected as bystanders, i.e. not associated with the clinical syndrome at presentation, such as hepatitis C virus and HIV, can be detected by mNGS and should be reported. At the moment of the clinical request for (viral) mNGS, the clinician should be informed about the potential to detect bystander pathogens [73]. This information can be made available, for example, on the (digital) request form or in the diagnostic information booklet, and it should be made clear to the clinician that by requesting mNGS, virus identification in the broadest sense is agreed upon [74].
Viruses of unknown pathogenicity or uncommonly detected viruses may not have been associated with a specific disease before, but may at a later point in time turn out to be associated with a specific syndrome, as seen with astrovirus encephalitis; reporting of these viruses is therefore recommended. The interpretation of an unknown or potential association of the metagenomic finding in the particular patient can subsequently be discussed with the clinician or commented on in the report, for example in the case of low level detection of herpes viruses.
In case of the discovery of an exotic or novel agent, a literature review, personal discussion with the clinician and further virological testing may be required.

Fig. 1. Examples of coverage plots [46] with true positive mNGS findings (a-c) confirmed by PCR in real clinical samples: a) human coronavirus HKU-1, 3951 reads, 89 % genome coverage; b) human mastadenovirus A, 19 reads, 8 % genome coverage, >3 genome locations; c) spiked-in equine arteritis virus, 14 reads, 5 % genome coverage, >3 genome locations; and d) an example of a false positive mNGS finding showing a mapped hepatitis C virus amplicon contaminant, 133,213 reads, 4 % coverage but only 1 genome location. The top bar represents the nucleotide alignment, the bottom bar(s) the amino acid alignment; green zones indicate matching sequences. The distribution of reads over the genome is an important parameter for defining a positive result.

Conclusions
For some clinical syndromes, there is a need to extend the diagnostic portfolio with mNGS. The recommendations provided here are intended to guide clinical diagnostic and Public Health laboratories in the implementation of viral mNGS bioinformatic pipelines and workflows. Bioinformatic software tools and platforms will continue to develop rapidly, and it is anticipated that these future developments will support the progressive and broad introduction of viral metagenomic sequencing into clinical diagnostic and Public Health laboratories.

Declaration of Competing Interest
The authors declare no conflict of interest.