Introduction

The last two decades of research have brought vast changes to the field of genomics. The sequencing of a genome and the subsequent annotation of gene functions that were originally performed by teams of researchers expending thousands of man-hours of labor has become a standard laboratory technique that can be performed by a single person in one day. As sequencing technology has advanced and the cost has dropped, the number of genomes being deposited into the public databases has outpaced Moore's Law1,2. This has shifted the bottlenecks in genomic analysis from the sequencing per se to the tools that are used for annotation and genomic analysis.

In 2008, the RAST server (Rapid Annotation using Subsystem Technology) was developed to annotate microbial genomes3,4. It works by projecting manually curated gene annotations from the SEED database onto newly submitted genomes5,6,7. The key to the consistency and accuracy of the RAST algorithm has been the carefully structured annotation data in the SEED, which are organized into subsystems (sets of logically related functional roles)5. As a result, RAST has become one of the most popular sources for consistent and accurate annotations for microbial genomes. The RAST community currently consists of ~10,000 active users who have contributed an average of 1,170 microbial genomes per week in the last year. It is also being used as the foundation for maintaining consistency for automated metabolic modeling in the ModelSEED8 and KBase (kbase.us) and for comparative genomics in the bacterial pathogen database, PATRIC9,10.

RAST and other annotation engines encapsulate software for identifying and annotating specific genomic features into a standard annotation pipeline11,12,13,14,15,16. This approach has several advantages including offering speed, convenience and consistency to the user. In order to annotate with RAST, users submit their contigs to the server where the computation is performed. This frees users from having to download and install multiple programs, or to perform intensive computations. However, despite these advantages, this approach also has limitations. For instance, the default pipeline may not always be the best choice for a given genome. It is also difficult for researchers to customize annotation pipelines by choosing different tools and adding their own features and annotations. Until recently, it has also been difficult to submit batches of genomes and demand has been increasing for a version of RAST that can accommodate custom batch submissions.

In this paper, we describe a new modular implementation of RAST, which we call the RAST tool kit (RASTtk). RASTtk allows users to build their own annotation pipelines with a choice of gene calling algorithms, annotation scripts and output formats. It also provides a framework for users to add features and annotations to a processed RAST job. RASTtk can handle batch submissions of genomes, as well as the batch submission with custom annotation pipelines. RASTtk can be used on both the RAST website (http://rast.nmpdr.org) and the command line.

Results and Discussion

Accessing RASTtk

In order to make RAST more flexible and to keep pace with advancements that are being made in bioinformatics, we have separated the individual steps of the RAST annotation pipeline to provide a version that is modular, extensible and customizable.

RASTtk exists as a set of advanced options on the RAST web server (http://rast.nmpdr.org) (Figure 1). It has also been made available as a set of stand-alone scripts in the Interactive Remote Invocation Service (IRIS) environment (http://iris.theseed.org/). IRIS is a web application that functions like a command line window. Through IRIS, users can use the latest RASTtk scripts to build and compare custom annotation pipelines. In addition, RASTtk scripts can be installed and run locally using the RASTtk.app DMG (Mac only) (https://github.com/TheSEED/RASTtk-Distribution/releases/). Individual scripts are also available through the KBase GitHub page (https://github.com/kbase/genome_annotation). Tutorials for using the RASTtk scripts can be found on the SEED website (http://tutorial.theseed.org).

Figure 1
figure 1

RASTtk options that are available on the RAST website (http://rast.nmpdr.org).

A table of options is displayed when the user selects the RASTtk annotation scheme and clicks the checkbox for “Customize”. Individual steps can be turned off and on using the check boxes. Parameters and conditions can be changed or added as needed. Dragging and dropping table rows will change the order of the steps.

The RASTtk Default Pipeline

During an annotation job, the data from individual scripts must be collected and integrated into a coherent picture of the genome as a whole. The abstract layout of a RASTtk pipeline is described below:

  1. 1

    An initial tool converts a set of contigs (with a minimal amount of metadata) into a special file format called a Genome Typed Object (GTO). A GTO is a Java Script Object Notation (JSON) formatted file that is human-readable and allows for easy exchange of data objects [www.json.org].

  2. 2

    Each step in the pipeline transforms an input GTO into an enhanced GTO. For example, calling genes with Prodigal uses the command “rast-call-features-CDS-prodigal” to add gene calls to an input GTO, which produces an expanded GTO (see Table 1 for a list of the commands implementing the current set of supported transformation tools).

    Table 1 Characteristics of the RASTtk scripts
  3. 3

    A user can export data from a GTO. Generally, the final product of a sequence of transformation commands would be used to export a tab-delimited spreadsheet or GenBank entry17; numerous alternative export formats are also supported.

We offer a default pipeline for RASTtk, which represents our recommendation for a rapid and accurate annotation workflow. This workflow differs slightly from the “classic” RAST pipeline (Figure 2). For instance, we have added Prodigal18 as an additional gene caller because of its improved accuracy with short genes and start positions and because it is more robust to differences in G+C content12. We have also rewritten several of the core RAST algorithms, including the tools that find rRNA genes and resolve overlapping features. We have included a new version of the k-mer-based annotation algorithm19 and scripts that find repeat regions, CRISPRs, insertion sequences and Streptococcus repeats.

Figure 2
figure 2

The RAST workflow.

Each individual step is bounded by a box and steps are connected by arrows. New RASTtk steps are indicated by red boxes and arrows. Improvements in the original steps are indicated in red text. Steps that are no longer part of the RASTtk pathway are indicated by gray arrows.

Calling ribosomal RNAs

Many tools exist for finding and curating collections of rRNAs13,20, but we needed a program that is simple, lightweight and fast. The current RAST rRNA finder is a custom script that uses hand-curated and phylogenetically diverse set of representative sequences of the 23S (currently 81 representatives), 16S (currently 120 representatives) and 5S (currently 292 representatives) rRNAs. These sets represent the diversity of genomes in the SEED and have been curated for the correct endpoints. The rRNAs of a new genome are simply found using a BLASTN21 search against this curated set. This script also reports partial length matches because genome assemblies with incomplete rRNA operons have become common.

Calling tRNAs

In order to find the tRNAs, we currently use tRNAscan-SE, a tool that was written by Lowe and Eddy22. It uses a secondary structure based searching method to find the tRNA genes.

Calling large repeat regions

The annotation of genomic regions that have been acquired by horizontal gene transfer remains a major challenge to maintaining consistency and accuracy in RAST. Repeat regions are often indicative of horizontal gene transfer and are hallmarks of insertion sequences and other mobile elements. Because of this, we have added a new script to the default RASTtk pipeline that performs a BLASTN21 search of the genome against itself and reports any region occurring more than once with ≥ 95% nucleotide identity. These precomputed repeat regions can then be used for comparative analyses and as supporting data for more detailed annotation of mobile elements.

Calling seleno- and pyrrolysylproteins

Selenoproteins are widespread among the sequenced bacterial and archaeal genomes occurring in ~25% of the genomes in the CoreSEED (a collection of ~1000 highly curated diverse bacterial and archaeal genomes utilized by RAST). They contain the rare amino acid selenocysteine, which is incorporated at a UGA stop codon in frame23. In order to find these proteins, a hand-curated set of known selenoproteins is kept in a BLAST database21. When a match to a potential selenoprotein is found within a genome, it is then searched for the in-frame stop codon. If the stop is found, then the protein is annotated as a selenoprotein. Pyrrolysylproteins are less common among the currently sequenced genomes, occurring in ~1% of the sequenced bacterial and archaeal genomes in the CoreSEED. Similar to selenocysteine, pyrrolysine is incorporated at a UAG stop codon23. We search for pyrrolysylproteins using the same strategy.

Calling Streptococcus repeat elements

Streptococcus species contain small interspersed repeats that may modulate gene expression and can be used for epidemiological typing24,25. We have added tools created by Croucher et al. for finding these elements24. When the user specifies the metadata for the genome, these scripts will be run if the genus is Streptococcus. In the future, we anticipate adding other species-specific annotation tools and their conditional usage in RASTtk will mirror that of the Streptococcus repeat finder.

Calling CRISPR elements

CRISPRs, clustered regularly interspaced short palindromic repeats, are a special type of repeat region found in many bacterial and archaeal genomes. The CRISPR array contains spacer regions matching foreign DNA that are regularly spaced and bounded by repeat regions. The DNA of the spacer region is transcribed and used to interfere with incoming foreign DNA. Because of their importance in horizontal gene transfer and in biotechnology26,27, we have added a script to the RASTtk pipeline that finds CRISPR elements. This script works by using a Perl regular expression search [Wall, L., Christiansen, T. & Orwant, J. Programming perl. (O'Reilly Media, Inc., 2004)] to find recurring direct repeats of at least 24 nucleotides in length and spaced at regular intervals. The output of the script is three new feature types: the entire CRISPR array, the CRISPR spacer and the CRISPR repeat.

Calling genes

RASTtk offers the option to call the open reading frames with Glimmer328, GeneMarkS29 and Prodigal18. The original web-based version of RAST used Glimmer3 and our current default RASTtk pipeline uses both Prodigal and Glimmer3. The output of both programs is added to the GTO file and then an optimal set of calls is chosen in the overlap resolution step (described below).

Annotating proteins with k-mers

Historically, RAST has made heavy use of the FIGfam collection maintained within the SEED project6. FIGfams are protein families in which it is believed that all members of the same family share an identical function and were derived from a common ancestor (i.e., they are all isofunctional homologs). The original implementation of the k-mer-based assignment of function was based on the use of signature k-mers3,19. A signature k-mer is defined as an 8-mer amino acid sequence in which the vast majority (over 80%) of occurrences are found within FIGfams sharing a common function and that do not occur in any FIGfam with a different function. For example, a k-mer for which 93% of the occurrences within the FIGfam collection were in families implementing the function SSU ribosomal protein S13p (S18e) would be considered a “signature of function”. In this case, the signatures depend critically on the FIGfams.

Once we had modestly consistent collections of sequences in the SEED, we introduced a version of k-mer analysis that was not based on FIGfams. The notion of signature k-mer was modified to an 8-mer amino acid sequence in which the vast majority (over 80%) of occurrences were within sequences assigned an identical function. This version depends on the collection of protein sequences with consistent annotations. Currently, we base this version on a collection of about 1000 representative genomes present in a subset of the SEED called the CoreSEED (core.theseed.org). The CoreSEED represents our best attempt at annotation consistency and is currently the main focus of our manual annotation efforts.

The CoreSEED database attempts to provide the most consistent manual annotations for the core metabolic and house keeping functions in a relatively small and diverse set of the bacteria and archaea, whereas the PubSEED database (from which the FIGfam collection is generated) attempts to absorb new annotations from the academic community for many genomes. The default RASTtk workflow first searches against the limited number of more stable annotation-based k-mers from the CoreSEED and then if an annotation cannot be found it searches against the larger collection of FIGfam based k-mers from the PubSEED.

In addition to these two k-mer based gene-function assignment tools, there are also two analogous tools that use these k-mer sets to search for function-containing regions of DNA that do not require gene calls. They are useful for searching for genes in regions where calls may have been missed and for assessing functions in un-assembled sequence data.

Annotating proteins missed by k-mers

If no function can be found for a protein-encoding gene during the k-mer analysis, a final search that uses a combination of BLAT30 and BLASTP21 is performed against a set of non-redundant genus-specific protein databases for the organism's genus and, when available, closest relatives. If a matching protein is found with an e-value < = 1e-5 and a percent identity > = 50%, then the function from the protein in the database is assigned to the gene.

Resolving Overlapping Calls

After using different tools to call open reading frames and annotate features, we try to resolve the results into a coherent picture. To resolve overlapping features, we use a dynamic programming algorithm that resembles the scoring algorithm in Prodigal18. It works by scanning the genome and generating a score for each alternative combination of feature calls for tRNA, rRNA and protein-encoding genes. In general, for a given location on the contig, tRNA and rRNA receive a higher score than protein encoding genes and large overlaps receive a negative score based on the length of the overlap. Large gaps between genes also result in negative scores. After considering all of the combinations calls, the genomic arrangement with the highest score is chosen. User-defined features are exempt from consideration by this algorithm.

Additional Analysis Scripts

Customizing and updating an annotation job

Users can run custom analysis jobs locally on the command line or in IRIS and then add their own features to an annotation job. In order to input specialty features, the user must create a tab-delimited text file, which contains a unique identifier, the location, feature type and function. These are then added to the GTO file and can be exported in a variety of formats. Users can also update functions directly by providing a tab-delimited file with the identifier and the new function. The update is then logged in the GTO file.

Batch genome submissions

RASTtk supports the ability to upload a directory of GTO files either using IRIS or the RASTtk.app DMG. When a batch upload is performed, the entire directory is uploaded to the RAST server and placed into the queue. A job identifier is returned to the user and this is used to check the status of the job and to download the job when it is completed. A custom RASTtk pipeline can be invoked by adding a special JSON formatted workflow file along with the directory of genomes. For more information on using RASTtk in batch mode, please refer to the RASTtk tutorials (http://tutorial.theseed.org).

Calling prophage elements

In order to find potential prophage elements we have added PhiSpy31. PhiSpy uses a combination of several independent heuristic methods to identify regions in the genome, which may be derived from phages or mobile elements.

Finding insertion sequences

We have added a new tool that uses a reference set of end sequences and transposase proteins from the SEED and ISfinder databases3,32 to search the genome for IS elements. Matches are found by using a combination of BLASTN for the end regions and BLASTX for the proteins21.

Identifying special gene sets

The PATRIC project is an integration of data and tools for studying bacterial pathogens. In order for RASTtk to support PATRIC, it was necessary to improve the identification of genes relating to virulence and drug development33. RASTtk now offers an analysis script for this purpose. It searches against custom BLAST databases that have been built from ARDB34 and CARD35 for finding potential antibiotic resistance genes; DrugBank36 and TTD37 for finding potential drug targets; VFDB38, Victors39 and the PATRIC virulence factors10,33 for finding potential virulence factors; and the human reference genome sequence40 for finding potential human homologs, which is an important step in drug screening analyses.

Future Directions

RASTtk enables users to optimize and customize the annotation steps for a given genome and to apply these pipelines to sets of genomes as customized batch submissions. The modularity of RASTtk also makes it much easier to develop and incorporate software for improving genome annotations and we anticipate adding tools to RASTtk as they become needed. We also expect the utilization of RASTtk to result in the community development of annotation pipelines aimed at solving more specialized annotation problems, such as more accurate gene calling in prophages and eukaryotes. We also anticipate providing specialty scripts that improve the accuracy for gene families that are currently difficult to annotate, such as pathogenicity-related genes, transporters and mobile elements.