rMAP: the Rapid Microbial Analysis Pipeline for ESKAPE bacterial group whole-genome sequence data

The recent re-emergence of multidrug-resistant pathogens has exacerbated their threat to worldwide public health. The evolution of the genomics era has led to the generation of huge volumes of sequencing data at an unprecedented rate due to the ever-reducing costs of whole-genome sequencing (WGS). We have developed the Rapid Microbial Analysis Pipeline (rMAP), a user-friendly pipeline capable of profiling the resistomes of ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacter species) using WGS data generated from Illumina’s sequencing platforms. rMAP is designed for individuals with little bioinformatics expertise, and automates the steps required for WGS analysis directly from the raw genomic sequence data, including adapter and low-quality sequence read trimming, de novo genome assembly, genome annotation, single-nucleotide polymorphism (SNP) variant calling, phylogenetic inference by maximum likelihood, antimicrobial resistance (AMR) profiling, plasmid profiling, virulence factor determination, multi-locus sequence typing (MLST), pangenome analysis and insertion sequence (IS) characterization. Once the analysis is finished, rMAP generates an interactive web-like html report. rMAP is simple to install and can be run using simple commands. It represents a rapid and easy way to perform comprehensive bacterial WGS analysis using a personal laptop in low-income settings where high-performance computing infrastructure is limited.


INTRODUCTION
The recent re-emergence of multidrug-resistant pathogens through persistent misuse of antibiotics has exacerbated their threat to worldwide human public health and wellbeing. Such organisms, consisting of Staphylococcus aureus, Pseudomonas aeruginosa and Klebsiella species belonging to the ESKAPE pathogen group, have been flagged among the most notorious micro-organisms expressing tremendously high levels of antimicrobial resistance by the World Health Organization (WHO), and have been reported by many studies to contribute to the high frequency of nosocomial infections which have led to high morbidity and mortality rates all over the world [1][2][3].
In the same spirit, rapid advances in diagnostic science and personalized medicine have seen the emergence of high-throughput next-generation sequencing technologies to replace conventional microbiology laboratories, and this has greatly reduced diagnostic costs and turnaround times for results for infectious pathogens as a way of keeping pace with emerging multidrug-resistant varieties. Next-generation processes generally involve parallel sequencing, producing vast quantities of genomic data, and extensive modern computation infrastructure is required to make sense of the sequencing data in downstream analysis. Furthermore, another bottleneck in the deployment of high-throughput sequencing (HTS) technologies is the ability to analyse the increasing amount of data produced in a fit-for-purpose manner [4]. The field of microbial bioinformatics is thriving and quickly adapting to technological changes, which creates difficulties for clinical microbiologists with little or no bioinformatics background in following the complexity and increasingly obscure jargon of this field [4].
The routine application of whole-genome sequencing (WGS) requires cheap, user-friendly techniques that can be used on-site by personnel who have not specialized in big data management [5,6]. The ability of bioinformaticists to analyse, compare, interpret and visualize the vast increase in bacterial genomes is struggling to keep up with these developments [7]. Many biologists are drowning in data and are in desperate need of a tool capable of deciphering this complex information, and it is predicted that these trends will continue in the foreseeable future as the generation of genome data becomes cheaper and more abundant [7]. Therefore, we introduce the Rapid Microbial Analysis Pipeline (rMAP), a one-stop toolbox that uses Illumina WGS data to characterize the resistomes of bacteria of ESKAPE origin. This is an open-source, user-friendly, command-line, automated and scalable pipeline for conducting analysis of HTS data produced by Illumina platforms. rMAP takes raw sequencing data as input and performs bacterial bioinformatic analysis steps, including: adapter and low-quality sequence trimming, de novo genome assembly, genome annotation, SNP variant calling, phylogenetic inference by maximum likelihood, antimicrobial resistance profiling, plasmid profiling, virulence factor determination, multi-locus sequence typing (MLST), pangenome analysis and insertion sequence (IS) characterization.

Pipeline architecture
rMAP is a tool implemented in four programming languages, namely Shell script, Python, Perl and R. It was precompiled and supports the Linux 64-bit architecture and macOS version 10.14.6 (Mojave) and above. It was originally built using WSL Ubuntu 20.04.1 LTS (Focal Fossa) and Ubuntu 18.04.4 LTS (Bionic Beaver) and the binaries are compatible with noarch-Unix-style operating systems. rMAP was built using a collection of published, reputable tools, including FastQC [8], MultiQC [9], Trimmomatic [10], Shovill, Megahit [11], Prokka [12], Freebayes, SnpEff [13], IQtree [14], BWA [15], Samtools [16], Roary [17] and ISMapper [18]. All of the tools and third-party dependencies required by rMAP are resolved and containerized within a conda environment as a single package so as not to interfere with already existing programs. The programs in the conda environment are built on top of Python version 3.7.8 [19] and are compatible with R statistical package version 4.0.2 [20]. A full list of the packages used by rMAP is provided in Table 1.

Overview of rMAP workflow
rMAP can be used with an unlimited number of samples of different species and origins. However, it was built to target pathogens of public health concern exhibiting high levels of antimicrobial resistance (AMR) and nosocomial infections. It can be applied to isolates of human and animal origin to give insights into the transmission dynamics of AMR genes at the human-animal interface.

Benchmarking datasets
The pipeline was tested on numerous bacterial pathogens from the ESKAPE group isolated from different origins (clinical, faecal, animal and sewage), sequenced on Illumina platforms and obtained from the publicly available repositories the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) under the following accessions: Enterococcus species (SRR8948878, SRR8948879, SRR8948880,

Impact Statement
The evolution of the genomics era has led to the generation of massive volumes of sequencing data and different bioinformatics tools have been developed to analyse these data. The ever-reducing costs of whole-genome sequencing (WGS) have led to diagnostic and research laboratories obtaining genome sequencing technologies. The considerable bioinformatics skills needed to analyse the large volume of genomic data from these platforms and the complex format in which results are presented are two important impediments to the implementation of WGS. To the best of our knowledge, there is currently no published all-in-one bioinformatics tool that successfully provides: genome-assembly statistics; single-nucleotide polymorphism (SNP) variant calling; phylogenetic analysis; antimicrobial resistance, plasmid and virulence factor profiling; multi-locus sequence typing; pangenome analysis; and insertion sequence (IS) characterization for ESKAPE pathogens. Therefore, we introduce rMAP (https://github.com/GunzIvan28/rMAP), a rapid microbial analysis pipeline for comprehensive analysis of bacterial WGS data. This is an open-source, user-friendly, command-line and scalable pipeline for conducting WGS analysis of Illumina sequencing reads. It represents a rapid and easy way to perform comprehensive bacterial WGS analysis using personal laptops, especially in low-income settings where high-performance computing infrastructure is limited. rMAP generates a web-like interactive html report (https://gunzivan28.github.io/rMAP/) that can be shared and interpreted by microbiologists.

Core pipeline features
rMAP requires three mandatory parameters: an input directory that contains sequence reads in either fastq or fastq.gz format, a user-defined output directory and a reference genome in either GenBank or fasta format. A full GenBank reference genome file is recommended for the --reference option in order to obtain annotated VCF files. The raw fastq files are submitted directly to rMAP, with no prior bioinformatics treatment. The pipeline's features can be summarized in the order of: SRA sequence download, quality control, adapter trimming, de novo assembly, resistome profiling, variant calling, phylogenetic inference, pangenome analysis, insertion sequence mapping and report generation, as shown in Fig. 1.
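The three mandatory inputs can be checked before any analysis starts. The following Python sketch is illustrative only and is not taken from the rMAP source code: the function names and the list of accepted reference file extensions are assumptions made for this example.

```python
from pathlib import Path

READ_SUFFIXES = (".fastq", ".fastq.gz")
# Assumed extensions for GenBank and fasta references (illustrative only).
REFERENCE_SUFFIXES = (".gbk", ".gb", ".fasta", ".fa")


def is_read_file(name):
    """True if the file name looks like a fastq or fastq.gz read file."""
    return name.endswith(READ_SUFFIXES)


def is_reference(name):
    """True if the file name looks like a GenBank or fasta reference."""
    return name.lower().endswith(REFERENCE_SUFFIXES)


def check_inputs(input_dir, output_dir, reference):
    """Validate the three mandatory inputs and return the read files found."""
    reads = sorted(p for p in Path(input_dir).iterdir()
                   if p.is_file() and is_read_file(p.name))
    if not reads:
        raise ValueError(f"no fastq or fastq.gz reads found in {input_dir}")
    if not is_reference(Path(reference).name):
        raise ValueError(f"reference should be GenBank or fasta: {reference}")
    # Create the user-defined output directory if it does not exist yet.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    return reads
```

Failing early on a missing reference or an empty input directory avoids starting a long run that cannot complete.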

Sequence read archive download
rMAP is able to retrieve sequences from the NCBI SRA using fastq-dump [21]. The user simply creates a list of the sample accession numbers to be downloaded and saves it in the home directory. The downloaded sequences are saved in a default directory called SRA-READS, which is created by rMAP.
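As a rough illustration of this step, the Python sketch below reads an accession list and builds one fastq-dump command per accession. It is a minimal sketch under stated assumptions: the helper names are hypothetical, the one-accession-per-line layout is assumed, and the fastq-dump options shown (--split-files, --gzip, --outdir) are standard SRA Toolkit options that may differ from those rMAP actually passes. The commands are only constructed, not executed.

```python
def read_accessions(text):
    """Parse one accession per line, skipping blank lines and '#' comments."""
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def fastq_dump_command(accession, outdir="SRA-READS"):
    """Build (but do not run) a fastq-dump command for one accession.

    --split-files writes paired mates to separate files and --gzip
    compresses the output; whether rMAP uses these exact options is an
    assumption of this sketch.
    """
    return ["fastq-dump", "--split-files", "--gzip",
            "--outdir", outdir, accession]


if __name__ == "__main__":
    listing = "SRR8948878\nSRR8948879\n"
    for acc in read_accessions(listing):
        print(" ".join(fastq_dump_command(acc)))
```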

Quality assessment and filtering
The pipeline autodetects any non-zipped fastq reads and compresses them to the fastq.gz format for optimization purposes during downstream analysis. FastQC [8] generates sequence quality reports and statistics from each individual sample, which are then aggregated into a single graphically interactive html report using MultiQC [9].
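The autodetection and compression of plain fastq files can be sketched in a few lines of Python using only the standard library. The function names are hypothetical, and the choice to leave the original file in place is an assumption of this example rather than documented rMAP behaviour.

```python
import gzip
import shutil
from pathlib import Path


def compress_fastq(path):
    """Gzip a plain .fastq file next to the original and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def compress_all(input_dir):
    """Compress every non-gzipped .fastq file found in input_dir.

    The '*.fastq' pattern does not match files that already end in
    '.fastq.gz', so compressed reads are left untouched.
    """
    return [compress_fastq(p) for p in sorted(Path(input_dir).glob("*.fastq"))]
```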

Adapter and low-quality sequence read trimming
Trimmomatic [10] is used to trim off adapters, using a set of pre-defined Illumina library preparation adapters saved in fasta format, and low-quality sequence regions from the raw input sequence reads. The pipeline's default parameters for quality and minimum sequence length are set at a phred quality score of 27 and 80 base pairs, respectively, to accommodate sequencing data that may not be of the very high recommended quality (i.e. 33).
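To make the two default thresholds concrete, the Python sketch below applies a quality cut-off of 27 and a minimum length of 80 bp to a single read. It is a deliberately simplified illustration and not Trimmomatic's algorithm: it only trims low-quality bases from the 3' end, the function names are hypothetical, and the Phred+33 ASCII encoding of the quality string is assumed.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 ASCII encoding assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_trailing(seq, quality, min_quality=27):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], quality[:end]


def keep_read(seq, min_length=80):
    """A trimmed read is kept only if it is at least min_length bases long."""
    return len(seq) >= min_length
```

A read whose final bases score below 27 is shortened first, and is then discarded if fewer than 80 bases remain.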

De novo assembly and annotation
Two assemblers are available for the user to choose from, Shovill [22] and Megahit [11], each with an advantage over the other. Both algorithms take the trimmed reads as their input and perform k-mer-based assembly to produce contigs. Megahit was very fast, with run times almost half those of its counterpart, but produced slightly lower quality assembly metrics. Assembly with Shovill involves guided mapping of the contigs to a reference and numerous rounds of genome polishing using Pilon to remove gaps, and takes more time but produces good-quality assembly metrics (N50, L50, genome length). Prodigal [23] is used to predict open reading frames from the assembled contigs, which are then functionally annotated using Prokka [12].
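The assembly metrics N50 and L50 follow directly from the contig lengths: N50 is the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length, and L50 is the number of contigs needed to reach that point. A minimal Python sketch of this standard definition is given below; the function name is chosen for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until
    at least half of the total assembly length is covered.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty assembly
```

For contigs of 100, 80, 60, 40 and 20 bp (300 bp in total), the two longest contigs already cover 180 bp, so N50 is 80 and L50 is 2.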

Variant calling
The trimmed reads are aligned against an indexed reference in fasta format using the Burrows-Wheeler aligner [15] to produce sequence alignment map (SAM) files. Soft- and hard-clipped alignments are removed from the SAM files using Samclip (https://github.com/tseemann/samclip). Samtools [16] then sorts, marks duplicates and indexes the resultant binary alignment map (BAM) files. Freebayes [24] calls variants using Bayesian models to produce variant call format (VCF) files containing single-nucleotide polymorphism (SNP) information, which is filtered using bcftools (https://github.com/samtools/bcftools) and normalized for biallelic regions using Vt [25]. The filtered VCF files are annotated using snpEff [13]. Raw, tab-separated, annotated and filtered VCF files are available for the users to manipulate.
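The kind of filtering applied to the called variants can be illustrated with a short Python sketch that keeps only biallelic SNP records above a quality threshold. This is not the bcftools or Vt implementation and does not reproduce rMAP's actual filter settings: the function names and the QUAL threshold of 30 are assumptions made for the example. Only the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, ...) is relied upon.

```python
BASES = set("ACGT")


def is_biallelic_snp(ref, alt):
    """True when REF and ALT are each a single A, C, G or T."""
    return ref in BASES and alt in BASES


def filter_vcf_lines(lines, min_qual=30.0):
    """Keep header lines and biallelic SNP records with QUAL >= min_qual.

    min_qual is a hypothetical threshold chosen for illustration.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        if qual != "." and float(qual) >= min_qual and is_biallelic_snp(ref, alt):
            kept.append(line)
    return kept
```

Multi-allelic sites (for example ALT "G,T"), indels and low-quality calls are dropped, while header lines pass through unchanged.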

Phylogenetic inference
Because phylogenetic algorithms are computationally demanding in terms of RAM and core threads, rMAP uses SNP-based analysis, which has been proven to be faster than using sequencing data to infer phylogeny. A single VCF file containing all the samples and their SNPs is generated towards the end of the variant-calling step and converted into a multi-alignment fasta file. Multi-sequence alignment is performed using Mafft [33], with the removal of ambiguously aligned reads and the selection of informative regions to infer phylogeny using BMGE [34]. IQtree [14] tests various substitution models and constructs trees from the alignments using the maximum-likelihood method with 1000 bootstraps. The resulting trees are visualized in rectangular (phylogram), circular (phylogram) and circular (cladogram) forms.
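The idea behind an SNP-based alignment can be sketched as follows: only the variable positions are kept, and each sample contributes its alternate base where it carries an SNP and the reference base elsewhere. The Python example below is a simplified illustration of that idea; the data layout and function names are assumptions, and it does not describe the specific conversion tool used inside rMAP.

```python
def snp_alignment(reference_bases, sample_snps):
    """Build a pseudo-alignment over the variable sites only.

    reference_bases: {position: reference base} for every variable site
    sample_snps:     {sample name: {position: alternate base}}
    """
    positions = sorted(reference_bases)
    alignment = {"Reference": "".join(reference_bases[p] for p in positions)}
    for sample, snps in sample_snps.items():
        # A sample without a call at a site is given the reference base.
        alignment[sample] = "".join(
            snps.get(p, reference_bases[p]) for p in positions
        )
    return alignment


def to_fasta(alignment):
    """Serialize the alignment as multi-fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in alignment.items())
```

Because every sequence in the output has one column per variable site, the alignment stays small even for large genomes, which is what keeps tree inference tractable on a laptop.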

Pangenome analysis
Roary [17] is employed by rMAP to perform core and accessory pangenome analysis across the input samples using general feature format (.gff) files generated from the annotation step. Fasttree is used to convert the core genome alignment to the newick format. The scalable vector graphic (SVG) file obtained from the pangenome analysis is converted to a portable network graphic (PNG) file format by cairosvg (https://cairosvg.org/). The resulting trees are visualized in rectangular (phylogram), circular (phylogram) and circular (cladogram) forms.
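The distinction between core and accessory genes can be illustrated from a gene presence/absence table: a gene found in (nearly) all samples is core, and everything else is accessory. The Python sketch below shows this split; it is not Roary's implementation, the function name is hypothetical, and the 0.99 core threshold is an assumption chosen for illustration.

```python
def split_pangenome(presence, n_samples, core_fraction=0.99):
    """Split genes into core and accessory lists.

    presence:      {gene: set of sample names carrying the gene}
    n_samples:     total number of samples analysed
    core_fraction: hypothetical threshold; a gene present in at least
                   this fraction of samples is treated as core
    """
    core, accessory = [], []
    for gene, samples in sorted(presence.items()):
        if len(samples) / n_samples >= core_fraction:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory
```

With three samples, a gene carried by all three is reported as core, whereas a resistance gene carried by only one or two of them falls into the accessory genome.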

Insertion sequence (IS) analysis
rMAP checks for the presence of mobile genetic elements, in particular insertion sequences, using ISMapper [18], which scans the entire length of each genome for homology against a set of well-known insertion sequence families commonly found in ESKAPE isolates [35] and the ISfinder database (https://www-is.biotoul.fr/index.php), as shown in Table 2.

Reporting and visualization of the reports
rMAP stores and formats reports from each stage of the pipeline under one directory called 'reports' and uses R-base [20] with a set of R packages, including ggtree [36], RColorBrewer, ggplot2 [37], knitr [38], rmarkdown [39], plotly [40], reshape2 [41] and treeio [42], to generate a web-like interactive html report with explanations at every stage of analysis that can easily be shared and interpreted by individuals with little bioinformatics experience. An example of such a report can be accessed via https://gunzivan28.github.io/rMAP/. The reporting format for rMAP was mainly adapted from the Tormes [6] pipeline. The results from a successful run can be found under the user-defined output directory and consist of files from assembly, annotation, insertion sequences, MLSTs, pangenomes, phylogeny, plasmids, quality reports, QUAST assembly statistics, reports, resistance genes, trimmed reads, variant calling and virulence factors for further analysis. rMAP retains all of the intermediate files generated after a successful run so that they can be interrogated further by experienced bioinformatics users. The contents extracted from the intermediate files and presented in the html report are summarized, with a short description, in Table 3. Examples of visuals generated by the pipeline are illustrated in Fig. 2.

Computational infrastructure and benchmarking
The original philosophy of creating rMAP was to create a tool that can be easily installed and run on a decent personal computer. The pipeline was successfully compiled on two personal computers with the following specifications: Dell Inspiron 5570 8th Gen Intel Core i7-8550U CPU @1.80 GHz   Table 4.

DISCUSSION
Although other pipelines developed under the same philosophy and functionality as rMAP, such as Tormes [6], ASA3P [43] and the recently published Bactopia [44], exist, we noticed that each of these had a shortcoming that we aimed to address. In terms of usability, Tormes [6] was the most friendly pipeline, with one major drawback: it could not be launched without a tab-separated metadata file complying with a set of criteria. It was also more oriented to bacterial species-specific analyses, namely Escherichia coli and Salmonella species. ASA3P [43] and Bactopia [44] required a bioinformatics-competent user for operation, since they are written in complex languages, namely, Groovy and Nextflow, respectively. Other similar pipelines, such as Nullarbor (https://github.com/tseemann/nullarbor), were extremely difficult to compile and use compared to their counterparts, requiring a metadata file conforming with a set of criteria. In cases where metadata files are required, the different software flagged errors or halted task executions, as correctly conforming metadata files were essential for the downstream analyses.
rMAP, on the other hand, comes with features aimed at overcoming the limitations of its counterparts. It requires no prior preprocessing of the sequences or metadata files. The user only provides three essential requirements, namely, an input directory, an output directory and a reference genome to run the pipeline. The pipeline is written in basic programming languages that do not require advanced expertise or troubleshooting to be launched. rMAP is highly portable and capable of operating on decent personal computers running  As a significant limitation, rMAP is coded exclusively in Bash and is not implemented within a modern workflow language manager, such as Snakemake or Nextflow. The ultimate consequence of this is that a user will have to either restart the whole run or manually check which steps have completed successfully and resume the run by only selecting options that were not performed while excluding the computed steps from the main command script. Implementation of the pipeline within a modern workflow language will feature in the next release of the software.

CONCLUSION
rMAP is a robust, scalable, user-friendly, automated bioinformatics analysis workflow for Illumina WGS reads that has demonstrated efficiency in the analysis of public health-significant pathogens. Therefore, we recommend it as a tool for continuous monitoring and surveillance that is suitable for assessing antimicrobial resistance gene trends, especially in low-income countries with limited computational bioinformatics infrastructure.

Availability and future directions
The source code is available on GitHub under a GPL3 licence at https://github.com/GunzIvan28/rMAP. Questions and issues can be sent to ivangunz23@gmail.com and bug reports can be filed as GitHub issues. Although rMAP itself is published and distributed under a GPL3 licence, some of its dependencies bundled within the rMAP volume are published under different licence models.

Funding information
This work was supported through the Grand Challenges Africa programme (GCA/AMR/rnd2/058). Grand Challenges Africa is a programme of the African Academy of Sciences (AAS) implemented through the Alliance for Accelerating Excellence in Science in Africa (AESA) platform, an initiative of the AAS and the African Union Development Agency (AUDA-NEPAD). GC Africa is supported by the Bill and Melinda Gates Foundation (BMGF) and The African Academy of Sciences and partners. The views expressed herein are those of the authors and not necessarily those of the AAS and its partners. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.