MGS-Fast: Metagenomic shotgun data fast annotation using microbial gene catalogs

Abstract Background Current methods used for annotating metagenomics shotgun sequencing (MGS) data rely on a computationally intensive and low-stringency approach of mapping each read to a generic database of proteins or reference microbial genomes. Results We developed MGS-Fast, an analysis approach for shotgun whole-genome metagenomic data utilizing Bowtie2 DNA-DNA alignment of reads that is an alternative to using the integrated catalog of reference genes database of well-annotated genes compiled from human microbiome data. This method is rapid and provides high-stringency matches (>90% DNA sequence identity) of the metagenomics reads to genes with annotated functions. We demonstrate the use of this method with data from a study of liver disease and synthetic reads, and Human Microbiome Project shotgun data, to detect differentially abundant Kyoto Encyclopedia of Genes and Genomes gene functions in these experiments. This rapid annotation method is freely available as a Galaxy workflow within a Docker image. Conclusions MGS-Fast can confidently transfer functional annotations from gene databases to metagenomic reads, with speed and accuracy.

Abstract: Current methods used for annotating metagenomics shotgun sequencing (MGS) data, rely on a computationally intensive and low stringency approach of mapping each read to a generic database of proteins or reference microbial genomes. We developed MGS-Fast, an alternative analysis approach for shotgun whole genome metagenomic data utilizing Bowtie2 DNA-DNA alignment of reads, to the IGC database of well annotated genes compiled from human microbiome data. This method is rapid and provides high stringency matches (>90% DNA sequence identity) of the metagenomics reads to genes with annotated functions. We demonstrate the use of this method with data from a study of liver disease and synthetic reads, and Human Microbiome Project shotgun data, to detect differentially abundant KEGG gene functions in these experiments. This rapid annotation method is freely available as a Galaxy workflow within a Docker image.

BACKGROUND
The initial focus of metagenomics studies, such as the Human Microbiome Project [1] was to survey the microbial communities present in various sites on and in the human body, but the focus of research has now shifted to understanding the functional role these microbes play in metabolic and disease processes. Assessment of the taxonomic diversity and composition of metagenome samples using amplicon sequencing of the 16S rRNA marker gene is inexpensive and has been applied to map a wide variety of microbial communities , but it is also subject to bias and lacks sensitivity below the species level. It is known that individual bacterial isolates with identical 16S genes may differ by as much as 15-30% in their genomes [2], which may include toxin production, antimicrobial, or metabolic genes. Alternatively, metagenomics shotgun sequencing (MGS) of all DNA present in a biological sample can be used for computational prediction of gene functions of sequenced DNA fragments to infer differences in the biological function of microbial communities [3]. Existing bioinformatics tools to characterize MGS data face bottlenecks due to the large computational task of comparing millions of short DNA sequences (50 to 200 nucleotides in length) to various databases of known proteins, conserved protein motifs, or  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 annotated genomes. The BLAST [4] method used to compare DNA sequences (i.e., reads) to a database, requiring hundreds of CPU hours to analyze a typical MGS sample containing hundreds of millions of reads. Approaches to overcome this computational bottleneck include the reduction of read data file complexity, for example, through de-duplication or by de novo assembly. However, these data reduction methods themselves require substantial computational effort and can introduce significant bias. Furthermore, misassemblies can introduce significant biases, since the WGS reads correspond to hundreds of bacterial genomes and chimeric contigs can be created [5]. This is especially true for gut microbiomes where closely related species with similar genomes are present, and this could be exacerbated in the case where significant gene transfer occur across species (transposase, phage and lateral gene transfer). The result of a misassembly is to distort abundance information, as genomic sequences could be assembled together, resulting in losing the signal for species present in the sample. In addition, one of the advantages of the mapping approach, is that it is possible to recover the presence of a gene even when the coverage for that gene is not sufficient for assembling it. Therefore, gene presence in the sample can be better identified used the raw reads and comparing to annotation databases, rather assemblies that is difficult to ensure that species are not artificially masked during the assembly process.
Other methods involve the use of faster, but less sensitive sequence matching algorithms such as BLAT [6] or RAPSearch (MG-RAST webserver [7]), or reduced databases for functional protein identification, thus providing a less precise assay for microbial protein function. For example, the MG-RAST webserver, the wait queue for data processing can be up to several weeks. Carr and Borenstein [8] compared MGS annotation using BLAST vs. BWA (a DNA sequence similarity tool very similar to Bowtie) and they conclude that at short evolutionary distances, BWA has a higher precision and recall than BLAST for identifying KEGG orthologs, but recall and precision for BWA drops dramatically at greater evolutionary distances.

MGS-Fast Algorithm and Software Implementation.
We created a computationally efficient pipeline for MGS data analysis called MGS-Fast, which combines several data pre-processing steps ( read data trimming and filtering of low-quality sequences, removal of human host contaminant sequences), with taxonomic and functional profiling of metagenomic WGS sequence data. The pipeline leverages software broadly used in the bioinformatics community for quality control, taxonomy, DNA sequence alignment, and taxonomic profiling (details in Methods section). The novelty of MGS-Fast is based on the use of stringent DNA-DNA matching to annotated and high-quality bacterial DNA sequences from the integrated catalog of reference genes (IGC) in the human gut microbiome [9]. The IGC database contains 9,879,896 gut microbe genes with annotations based on the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/kegg1.html). The bioinformatics workflow of MGS-Fast utilizes Bowtie2 to rapidly map a MGS dataset and assign known functions to reads originating from microbial genes, producing counts of KEGG gene orthologs as output.
The KEGG counts are then applied to identify differentially abundant microbial gene functions in metagenomics datasets, and separate samples from different patient groups (Fig. 1). The IGC database is precompiled as a Bowtie 2 index which is deployed automatically during installation of MGS-Fast, and is also available as separate download (Availability section). Furthermore, the MGS-Fast pipeline is packaged as a pre-configured, ready to execute software within a Docker container, that is easy to deploy by non-bioinformatics experts through a single command (Supplementary Information -Software-Manual). Researchers working with non-human MGS gut data, can create their own custom database of microbial genes (Methods section) for functional profiling. Once a custumized database is prepared, MGS-P Fast allows for parallel processing of multiple WGS metagenomic samples with, increasing accuracy and reducing computational time for the functional assignment of reads.

Data Analysis of Liver Cirrhosis Metagenomic Samples.
In our study, the MGS-Fast pipeline was used for the analysis of gut microbiome samples from 10 patients with liver cirrhosis, and 10 control samples from an earlier study by Qin et. al [10], obtained from the European Nucleotide Archive accession ERP005860. Upon completion, the pipeline generated gene function abundance counts, with a total of 3785 KEGG IDs that was similar to the number (4,801) in the original study by Qin et. al. Next, following the recommendations in [11], we analyzed 502 out of 3785 KEGG IDs that had significantly different abundance scores (FDR corrected P-value threshold 0.05, Suppl. KEGG-FDR.CSV) as a mixture model with a Negative Binomial distribution, using the R package edgeR, version 3.7 [12]. In order to visualize the differences between groups of KEGG ID abundances, we also used edgeR to create a multi-dimensional scaling plot (MDS, Fig. 1), were clear separation was observed between the healthy vs. cirrhosis samples.

1.
Abundance plot generated using the R package edgeR and a mixture model with a Negative Binomial distribution, for the KEGG annotations generated by MGS-Fast using as input data from gut microbiomes of healthy patients (HV) and patients with liver cirrhosis (LD).
Next, we mapped 502 FDR-corrected (0.05) KEGG IDs returned by MGS-Fast to pathways using the KEGG website tools (http://www.genome.jp/kegg/tool/map_pathway2.html). We were able to identify functional groups and pathways modules which corresponded closely to the ones found in the original study by Qin et al. Specifically, the majority of the pathway modules detected were for membrane transport, including oligopeptide transport systems, in addition to zinc, glutamine and energy coupling factor transport (complete list in Suppl. file "KEGG modules -502 KEGG IDs.doc"). Furthermore, and as reported in the original study we found prevalent pathway modules for carbohydrate, amino-acid and 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 energy metabolism, including the citrate and Krebs and Calvin cycles, gluconeogenesis, glyoxylate and glycolysis cycle. We also observed a set of liver-cirrhosis-associated markers similarly with Qin et al., including assimilatory nitrate reduction, denitrification, GABA biosynthesis and GABA shunt, in addition to heme biosynthesis. The GABA neurotransmitter system is correlated to brain disease [13] as a result of liver dysfunction, because of increased GABA levels in the blood have the potential to go through the blood-brain barrier and cause hepatic encephalopathy. Finally, we also detected a set of pathway modules for ammonia production, which could lead to increases ammonia levels in the blood as described by the original study (Qin et al. 2014). In this respect, we also found the assimilatory nitrate reduction pathway module to be present, in addition to dissimilatory nitrate reduction and the complete nitrification pathway (Suppl. file "KEGG modules -502 KEGG IDs.doc").

Comparative Pipeline Performance and Processing Times.
The MGS-Fast Docker container was deployed on an 8-CPU Intel Xeon Server supporting hyper-threading for a total of 8 parallel processes ("threads"), in addition to 128 GB RAM memory. This is a highperformance computing server, commonly found in laboratories performing genome sequencing bioinformatics. In order to compare computational performance of MGS-Fast to other published pipelines for metagenomic annotation, we measured the processing time for each tool in the different pipelines using the patient datasets from our study. For ensuring compatibility of the results we applied the option "--threads 8" or similar for all pipelines (Kraken, GOTTCHA [14][15], and HumanAn2 [https://huttenhower.sph.harvard.edu/humann2]) included in our comparison. The pipelines have been setup according to the documentation for each, using the standard full database for Kraken ("krakenbuild --standard --db $DBNAME") and the latest bacterial databases for GOTTCHA (ftp://ftp.lanl.gov/public/genome/gottcha/. Processing times for the MGS-Fast workflow (sample id ERR526291, number of reads 15,181,542 x 2), in comparison to the other pipelines are shown in Table 1.
Interestingly, most other pipelines compared with MGS-Fast do not perform preprocessing of data such as quality control or removal of host WGS reads, except for GOTTCHA which offers users the option to trim input DNA reads. Table 1 lists the times required by MGS-Fast for the data pre-processing steps, including read trimming and removing of host sequences. Furthermore, Fig. 2 reports processing times for all MGS-Fast pipeline steps when used for analysis of different patient metagenomic data sets, which ranged from 1.5 GB to 11GB in size. Similarly, Fig. 3 provides total time of execution comparison for the same datasets with Kraken that uses a large in-memory k-mer database versus the compressed Burrows-Wheeler Transform (BWT) -Bowtie2 aligner used for MGS-Fast, and DIAMOND using protein based alignment.

WGS metagenomic datasets used as controls for MGS-Fast.
We evaluated the performance of the MGS-Fast pipeline further by using a range of metagenomic samples from mouse and human collected across different body sites (e.g., gut, mouth, skin), in addition to environmental samples (e.g., copper mine waste), and negative controls of simulated read data from real or synthetic genomes ( Table 2). The Human Oral Microbiome Database (HOMD, [16]) was included to the analysis workflow as additional annotation data. In more detail, Bowtie2 alignments of human gut (fecal) samples performed by MGS-Fast resulted in 95.62% of all reads in the sample being successfully mapped to the database. For human oral and skin microbiome samples 82.32% and 33.02% % of the reads, respectively ( Table 2) were mapped to the database.
Simulated FASTQ reads from the human reference genome GRCh38 aligned at only 7.35%, which was expected since our pipeline filters out human sequences. As a positive control, we mapped MetaSim [17] simulated reads from the E. coli K12 reference genome (GenBank: accession U00096.3) and 98.5% of the sample was aligned. Furthermore, as negative controls we included the HMP mock microbial community (SRR172902; 28.82% of reads aligned to database), a synthetic metagenome (SRR3732372) made from a mixture of DNA from lab strains of bacteria (10.23% of reads aligned to database), and a copper mine waste sample (MG-RAST accession 4664533.3; 8.69% of reads aligned to database; Table 2). Finally, false positive matches were evaluated by aligning a set of randomly generated reads by the XS simulator [18]. As expected, only 0.5% of the sample reads aligned to our database. Randomly generated reads XS simulator 0.53

MGS-Fast Pipeline Structure and Data Processing.
The MGS-Fast workflow begins with quality control using FastQC (FastQC , RRID:SCR_014583)(red rectangles, Fig. 4A, http://www.bioinformatics.babraham.ac.uk/projects/fastqc), which creates as output a report on many aspects of input data quality. Next, Trimmomatic (Trimmomatic , RRID:SCR_011848)( [20], blue rectangle, Fig. 4A) is used to remove sequencing adapters, primers and low quality sequence data. Human host DNA is removed by alignment of reads to the human reference genome using Bowtie2 (Bowtie , RRID:SCR_005476)( [21], left green rectangle, Fig. 4A), using the human GRCh38 reference genome (https://www.ncbi.nlm.nih.gov/grc/human). This filtering step retains only the "unmatched" reads corresponding to the metagenome as specified by the "--un" (unaligned) option from Bowtie2. The retained reads are then aligned to the IGC microbiome gene catalog database with Bowtie2 (right green rectangle, Fig. 4A), using the "--end-to-end -sensitive" option, in order to assign KEGG protein function IDs to each read. The IGC database (http://meta.genomics.cn/meta/dataTools) contains approximately 10 million KEGG-annotated microbial genes, collected from 1267 public human gut microbiome samples plus an additional 922 complete annotated prokaryotic genomes. The software versions included in this workflow are the following: FASTQC 0.11.6, Trimmomatic 0.32.1, Bowtie 2.2.6 and MetaPhlAn 2.5.0 (MetaPhlAn, RRID:SCR_004915) [22], with a default pre-set of parameters (details in Supplementary Manual) that can be easily adjusted and changed by the users through the Galaxy interface.
(A).  Next, a custom Python script integrated in the workflow (yellow rectangle, Fig. 4A) is used to count the number of IGC genes and KEGG IDs generated as output of Bowtie2 with IGC. The script counts the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 number of reads aligned to each gene ID in the BAM file generated by Bowtie2, and stores the counts in a two column "gene ID -abundances" file. The annotations were also parsed from the original IGC/HOMD FASTA files, in order to produce a second file of "gene ID -KEGG ID" that was saved as a Python 'dictionary' data structure. The "gene ID -abundances" are then read line by line using the Python script, which also loads the dictionary data structure, and matches the "gene ID -abundance" list entries with these of the "gene ID -KEGG ID" based on the gene ID. The KEGG IDs of the matching lines from the two lists, are then used by the script as a key for a new dictionary containing key-value pairs. The value corresponding to the KEGG ID keys of the dictionary is set to the corresponding abundance count, and also the count is incremented when the KEGG ID has is already present the dictionary as the lists are parsed. Following all data processing steps, MGS-Fast prints the KEGG IDs and read counts for each gene in a text file, which is used as input file for the EdgeR script that creates abundance plots ("abundanceplot.R", available at https://github.com/BCIL/MGS-Fast). The R script filters out genes with low counts, keeping those rows where the count per million (cpm) is ≥1 in at least 6 samples. The cpm for mapped reads is essentially counts scaled by the number of fragments sequenced in one million. Furthermore, we used the "calcNormFactors" function from EdgeR in our script, which normalizes for RNA composition by finding a set of scaling factors for the library sizes across samples. This essentially re-scales the library size resulting in an "effective" library size, which is then used for abundance calculations. This step helps to remove any further artefacts of read distributions per gene that might be introduced for example at the initial stage of read trimming.

MGS-Fast Pipeline Software Distribution and Data Options.
The MGS-Fast pipeline was developed using the workflow canvas of the Galaxy bioinformatics web server (Galaxy , RRID:SCR_006281) (Fig. 4A, [23]), which was first pre-installed and configured to run within a Docker virtual machine container (http://www.docker.com). The Galaxy web server was chosen since it provides an intuitive, web-browser interface for non-technical users. Users can easily access and run the MGS-Fast pipeline via the Galaxy interface and developers can use the Galaxy workflow canvas to build and modify the pipeline. Our goal was to develop a complete software bundle within a docker container, which includes the MGS-Fast workflow and all required bioinformatics software, in addition to all other software dependencies. The entire pipeline is implemented in a series of steps (Fig. 4A), which are automated via Galaxy. The users only need to select the input datasets (Fig. 4B), and the Galaxy workflow engine will automatically execute all remaining analysis steps of the pipeline. Furthermore, the input data directory is attached and automatically available through the Galaxy interface when users set up MGS-Fast and specify the data directories (Suppl. Software Manual). Upon completion of an MGS-Fast pipeline run, the users can download all result files, or simply view the output within the Galaxy interface (Fig. 4C). Furthermore, users can run MGS-Fast by reconfiguring the analysis steps, or rerunning a single tool instead of the whole pipeline.
MGS-Fast also provides two options for users to create custom Bowtie2 indexes, for filtering both host genome reads and for annotating the metagenomic reads. Through the first option users can specify the location of a file containing the sequence of a host genome or metagenome, through a text-based menu during the initial run of the MGS-Fast container (Suppl. Manual). The scripts inside the container will automatically build an index for the provided genome, and make it available for use on the Galaxy interface without any futher effort by the user. As a second option, we made an additional pipeline (Suppl. Material) available which is called "Galaxy-Workflow-Custom_MGS-Fast.ga" and can be imported to an already installed and running MGS-Fast workflow. The custom workflow is identical with the regular workflow used by MGS-Fast, but provides users the option to use a FASTA file containing the sequence(s) of the custom genome as input. The Bowtie2 index for the provided genome is automatically built during the first run of the workflow, and is then made available for all subsequent runs. Similarly, users can at 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 P any point add additional custom genome indexes, both for filtering host DNA reads and for classifying metagenomes. Users can download host genomes (e.g., mouse reference https://www.ncbi.nlm.nih.gov/genome/?term=Mouse), and also a range of WGS metagenomes from the Joint Genome Institute [https://img.jgi.doe.gov/cgi-bin/m/main.cgi] .

DISCUSSION
MGS-Fast can confidently transfer functional annotations from annotated gene databases to sequence reads in metagenomic datasets. For microbial read annotation and assignment of KEGG IDs by alignment to the International Genome Consortium (IGC), the MGS-Fast uses the Bowtie2 algorithm requiring by default 90% DNA sequence identity in finding matches. While the DNA to DNA alignment performed by Bowtie2 is less sensitive than translated BLAST utilizing information from conservative amino acid substitutions, at the 90% level of identity, we only have exact matches from DNA fragments of the same species, or orthologs between closely related species [24]. We have also considered increasing the sensitivity of our method, by changing the Bowtie2 parameters and including the "--very-sensitive-local" parameter or increasing the number of allowed mismatches, but decided against it since it would allow less stringent DNA-DNA alignments and more false positives. Nonetheless, this option is still available for users as the Bowtie2 parameters can be easily adjusted through the Galaxy interface, when initiating a MGS-Fast pipeline run. However, caution is required in interpreting the results with increased sensitivity parameters, since a high percentage of false positive alignments will make assignment of metabolic function to metagenomic reads less reliable. In addition to Bowtie2 our pipeline also provides annotations through MetaPhlAn2, using a database of approximately one million unique clade-specific marker genes from 17,000 reference genomes. This enables MGS-Fast to identify taxa within narrow clades, even in the absence of reference genomes for species in the gut community. MetaPhlAn2 enabled us to identify a microbial organism at higher taxonomic levels of genera or family, and we observed identification of 8931 organisms, out of which 6681 have been annotated by MetaPhlan at the order taxonomic level ("ceae").
Using Docker container technology, we bundled all required software components and the MGS-Fast pipeline as a pre-configured, ready to use bioinformatics package for performing standardized, automated metagenomics analysis on any desktop or laptop computer running Windows, MacOS or the Linux operating system. Both the container and source code are publicly available for download (Availability section), and can be easily installed by non-bioinformatics experts with a single command (Suppl. Software-Manual). Users can then simply access the MGS-Fast pipeline via the Galaxy interface by entering the network address of the container (made available to the user when the installation is complete) on their web browser. Furthermore, the Docker container can be deployed on the cloud or institutional clusters, where users can run multiple instances of MGS-Fast in parallel in order to process multiple NGS samples, or within a single instance of MGS-Fast using Galaxy's Data Collections input data options .
Regarding computational performance for large-scale studies, we tested MGS-Fast with a set of Illumina HiSeq 2000 oral microbiome datasets ranging from 1.5GB to 11GB (Fig. 2) in file size (https://www.ebi.ac.uk/ena/data/view/SRX978361). Using a compute server with average computational capacity (128GB, 8 CPU core) the processing time for MGS-Fast ranged from 20 minutes for smaller read sets to 2.5 hours for the larger ones. The cumulative processing time to complete running MGS-Fast for all datasets included in this study was approximately 15 hours (900 minutes). Processing times for large WGS metagenomic studies (100 samples), the complete study can be processed in the course of a few days. While running a single dataset at a time on our computer server, we noticed that the hardware capacity was underutilized and decided to implement MGS-Fast analysis in parallel, reducing 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 the total time required to process the datasets included here. In a production setting, where also more computational capacity might be available, researchers could use tens of instances at the same time and efficiently process large-scale data sets.
We then compared the performance of the Burrows-Wheeler Transform (BWT) MGS-Fast used for MGS-Fast. (Fig. 3) with Kraken [14] and DIAMOND [25] that use respectively large in-memory k-mer database and using protein based alignment, versus the compressed Burrows-Wheeler Transform (BWT) -Bowtie2 aligner used for MGS-Fast. The Kraken approach is based on a database of pre-classified k-mers, and although it classify millions of reads in just a few minutes, their memory requirements are usually high requiring a high-performance, expensive compute server that some labs do not have access to in order to be able to complete the analysis. In contrast, MGS-Fast similar to multiple other published tools in the literature using the more efficient Bowtie2 index structure, which allows for memory efficiency storage and increased speed when querying the database [26], as also evidenced by our results in the present study.
Furthermore, our results are substantiated by the a subsequent release by the authors of Kraken of a newer tool called Centrifuge [27], that unlike Kraken and similar to other nucleotide-based classification tools in the literature is using is also the Burrows-Wheeler transform (BWT) for the genome database. This strategy uses one-tenth the space of a Kraken index for the same database, providing faster classification speed and lower memory requirements, making it possible to perform large-scale metagenomics annotation on a desktop computer. As reported in this study [27], Centrifuge took only 47 min on a standard desktop computer to analyze a total of 26 GB of input sequence data, and where Kraken and MegaBLAST required 100 GB and 25 GB of memory respectively for their indexes, Centrifuge requires only 4.2 GB. Similar results have been reported for multi-fold increased speed in comparison to Kraken with BWT use in PALADIN and Kaiju [28][29], and as reported in these studies an important constraint for Kraken is its memory usage, where the database grows in linear proportion to the number of distinct k-mers in the genomic library (at 12 bytes per k-mer).
Regarding using nucleotide read alignment versus protein translated search for metagenomic classification, the fact is that amino acid sequences are conserved better at evolutionary distances, leading to more sensitive read classification in the case of distant species or tax. Both DIAMOND and Kaiju [25,29] align six-frame translations of reads against a protein database. Similarly to what was mentioned above, Kaiju indexes the reference protein database using BWT as does our MGS-Fast tool, allowing metagenomic sequences to be searched fast and with low-memory footprint against a large protein database. Given a metagenomic sample and the pre-built index, Kaiju first translates every read in all six reading frames, splitting the read at stop codons. As reported in this study, by using the BWT as an index for the reference protein database, Kaiju classifies up to millions of reads per minute and is typically faster than k-mer-based methods such Kraken and Clark [30].

Competing Interests
The authors declare that they have no competing interests.

Authors' contributions
The construction and testing of the MGS-Fast method was implemented by S. Brown with assistance from Y. Hao and H. Chen. The Docker image for MGS-Fast was built by B. Laungani with assistance from T. Ali, C. Dong, C. Lijeron, and B. Kim. The data analysis was performed and the manuscript text was written by K. P Krampis, S. Brown and C. Wultsch. The supplemental software manual was written jointly by B. Laungani with additions by K. Krampis, T. Ali, C. Dong, C. Lijeron, B. Kim and C. Wultsch.