A Computational System for Identifying Operons Based on RNA-Seq Data

An operon is a set of neighboring genes in a genome that is transcribed as a single polycistronic message. Genes that are part of the same operon often have related functional roles or participate in the same metabolic pathways. The majority of all bacterial genes are co-transcribed with one or more other genes as part of a multi-gene operon. Thus, accurate identification of operons is important in understanding co-regulation of genes and their functional relationships. Here, we present a computational system that uses RNA-seq data to determine operons throughout a genome. The system takes the name of a genome and one or more files of RNA-seq data as input. Our method combines primary genomic sequence information with expression data from the RNA-seq files in a unified probabilistic model in order to identify operons. We assess our method’s ability to accurately identify operons in a range of species through comparison to external databases of operons, both experimentally confirmed and computationally predicted, and through focused experiments that confirm new operons identified by our method. Our system is freely available at https://cs.wellesley.edu/~btjaden/Rockhopper/.


Introduction
An operon is a set of consecutive genes on the same strand in a genome that are cotranscribed into a single polycistronic message. Operons were first described by Jacob and Monod [1]. Operons pervade the genomes of bacteria and archaea, and less commonly can be found in eukaryotes such as nematodes [2]. They are a means by which an organism can implement co-expression of related genes. As a result, identification of operons is an important component in elucidating gene regulatory networks.
A variety of experimental approaches have been employed to detect operons, such as RNA polymerase footprinting, primer extension or S1 mapping, northern blotting, RNase protection assays, and RT-PCR. While these experimental methods may be precise and provide strong evidence in support of an operon, they can be relatively costly or time consuming. Thus, a number of computational methods have been developed for systematic identification of operons throughout a genome. These computational methods have the advantage of being fast and efficient, but the operon predictions they generate can have varying degrees of accuracy with respect to their correspondence with experimentally verified operons [3].
Computational methods for predicting operons generally start with directons, which are consecutive genes on the same strand of a genome [4], and use various features indicative of operons to predict whether the genes in the directon are co-transcribed as part of a single operon or not. The features used in computational methods for predicting operons (reviewed in [5,6]) can be described generally by three categories: (1) primary sequence features, (2) external data source features, and (3) transcript expression features.
In the first category, primary sequence features, the most commonly used feature is intergenic distance [6]. The distribution of intergenic distances for genes known to be part of the same operon is different, and on average shorter in length, than the distribution of intergenic distances for consecutive genes in a genome that are not part of the same operon. Intergenic distance between consecutive genes is straightforward to compute and offers strong predictive power relative to many other features used for operon prediction [6,7]. Another example of a primary sequence feature that can be used for operon prediction is the presence of transcription signals, such as promoters or terminators -most commonly Rhoindependent terminators -at the beginning and end of the candidate operon, respectively, but not between genes within the candidate operon [8]. Codon usage is commonly used for the problem of identifying protein-coding gene sequences in a genome, and an early studied observed different codon usage patterns for genes in the same operon as compared to genes in different operons based on a small sample of genes [9]. Several studies have employed codon usage as a feature in operon prediction under the assumption that codon usage for genes in the same operon is more similar than codon usage for genes not in the same operon [10][11][12]. However, the results are mixed when evaluating the contribution of codon usage as a feature in operon prediction, with some examples of codon usage showing significant predictive power [10] and other examples showing little or no statistical difference when incorporating this feature into operon prediction [11,12].
In the second category, external data source features, features for operon prediction have been derived from various data sources beyond an organism's genomic sequence. One feature in this category is the functional relationship of the protein products from genes in a candidate operon [13]. Functionally related genes as determined from clusters of orthologous groups (COGs) of proteins and participation in similar metabolic pathways can be indicative of genes that are co-transcribed as part of the same operon [14]. Comparative genomics analyses are commonly used to identify these features, as phylogenetic profiles evince neighborhoods of conserved genes [15,16]. Among operon prediction methods that utilize features from the first two categories but not the third category, transcript expression data, the method based on the STRING database [13] and its corresponding webservers [7,17] demonstrates some of the highest accuracy of operon predictions [18], and it uses conserved genes profiles as a core feature.
In the third category, transcript expression data, features are determined from whole-genome transcript expression assays. Early studies used microarray data to aid in operon prediction. Co-expression across different conditions of consecutive genes suggests an increased likelihood of the genes belonging to the same operon [19]. Similarly, observed expression from the intergenic region between genes indicates an enhanced likelihood of the genes being co-expressed [20]. In recent years, there has been an explosion in the use of RNA-seq data throughout biological and related sciences. The Sequence Read Archive (SRA) now contains more than 20 peta-bases of sequencing data [21,22]. Along with this growth in RNA-seq data is its increasing use in characterizing operons. In some cases, methods based on RNA-seq data have been developed and applied to identify operons in specific organisms [23][24][25][26]. In other cases, general purpose tools have been developed to predict operons broadly based on RNA-seq data [12,16,27].
In order to improve access to the growing number of operon identifications, a number of databases and web servers are available. In some cases, computational predictions of operons are made offline for a large number of genomes and a database allows users to query these pre-computed operon predictions [15,17,28]. In other cases, web servers enable real-time prediction of operons based on user supplied data [7,16]. A number of databases also integrate curated information about experimentally verified operons [29][30][31]. While the various computational methods and databases for predicting operons are not always in agreement [3,24], tools for computational identification of operons can be effective, often achieving predictive accuracies of 90% or more [13]. Table 1 provides a summary of fifteen different methods for predicting operons. Four of the methods listed in Table 1 provide a webserver interface for accessing the operon predictions. The final seven columns in the table indicate the features used by each method when making its predictions. The least commonly used feature in this set is codon usage, with only one of the fifteen methods employing this feature, whereas the most commonly used feature is intergenic distance between operon genes, which is used by eleven of the fifteen methods.
One challenge for these computational methods is that, in many cases, they are trained on data from organisms, such as Escherichia coli and Bacillus subtilis, where a large number of operons have been confirmed experimentally, but their applicability to prokaryotes more broadly is difficult to assess without large sets of independently verified operons outside of a few model organisms. Operon prediction is further challenged by the dynamic nature of transcription, as operon structures can change when environmental conditions vary [12]. Sets of consecutive genes may be co-transcribed as part of an operon in one condition but be individually transcribed in another condition, e.g., when alternative promoters occur both before and within operons [32,33]. Most methods make static predictions of genes forming an operon, rather than dynamic condition-specific predictions.
sequencing reads to a reference genome, assembling transcripts, identifying transcript boundaries and novel transcripts such as regulatory RNAs, de novo transcript assembly when reference genomes are unavailable, normalizing data from different experiments to enable meaningful inter-experimental integration and comparison, quantifying transcript abundance levels, statistical testing for differential gene expression in different experimental conditions, and visualizing results in a genome browser. Rockhopper also uses RNA-seq data to identify operons throughout genomes. In the remainder of this article, we detail the methodology used by Rockhopper for operon prediction, assess the predictions with more focused wet-lab experiments, compare the results with databases of operon identifications, and evaluate the results provided by other groups who have employed Rockhopper for operon annotation. Though the methods that we describe have applicability to other domains, the remainder of this article focuses on bacterial operon identification, since this is the domain where the vast majority of operons are found and have been studied.

Input to Rockhopper software system
As input, Rockhopper requires the name of a genome and one or more files of RNA-seq reads. Sequencing read files may be in fastq, qseq, fasta, sam or bam format, and the files optionally may be gzipped. The sequencing reads may correspond to single-end reads or paired-end reads, and they may be strand-specific or strand-ambiguous. Based on the supplied name of a genome, Rockhopper automatically downloads the genome sequence and gene annotations from RefSeq [35]. If a user wishes to analyze data from an organism not found in RefSeq, the user may supply their own genome sequence file in fasta format and their own gene annotation file as input to Rockhopper.

Mapping sequencing reads
For all RNA-seq data in this study, prior to using Rockhopper to map sequencing reads to a genome, the Trimmomatic [36] tool was used with default parameter settings for adapter trimming and quality filtering. Rockhopper was then employed to align sequencing reads to a genome. Rockhopper maps reads to a genome in a manner similar to that of Bowtie 2 [37], by building an FM-index [38] for the genome based on the Burrows-Wheeler transform [39]. One of the challenges in analyzing RNA-seq data is that reads often map ambiguously to different transcripts, e.g., when a genome contains multiple paralogous genes [40]. The approach used by RNA-seq analysis pipelines for resolving these ambiguities and allocating reads to paralogous genes can have significant effects on downstream analyses such as testing for differential gene expression [41]. Rockhopper uses the straightforward approach of allocating ambiguous reads equally among the genes to which the reads map. In future work, there may be opportunities for evolving Rockhopper's model, as recent studies have proposed new methods for allocating ambiguous reads that have the potential to result in increased accuracy of downstream analyses [42].

Computing directon information
As with many other operon prediction methods [4], Rockhopper begins its identification of operons with directons, which are sets of adjacent genes in a genome on the same strand.
Using the input gene annotation files, Rockhopper calculates all directons in a genome before determining which subset of directons are co-transcribed, i.e., are operons. As context, Figure 1a shows the length of directons in the E. coli genome [43]. Here, we use the number of genes in the directon to describe its length rather than, say, the number of nucleotides. As Figure 1a illustrates, there are approximately 500 instances of directons containing a single gene among the 4,386 protein-coding genes in the E. coli genome. Single gene directons occur when a gene is flanked on either side in the genome by a gene on the opposite strand. Similarly, there are approximately 250 instances of length two-gene directons and approximately 150 instances of length three-gene directons. In comparison, Figure 1b shows the length of operons in E. coli as reported by RegulonDB [31]. E. coli is used here because it is arguably the organism where operons have been most extensively studied and because very few other bacterial genomes have large sets of experimentally validated operons. As Figure 1b

Intergenic distance as a feature of operon prediction
Consistent with most other methods for predicting operons throughout a genome, Rockhopper uses the intergenic distance between genes as a predictive feature in determining operons, as consecutive genes separated by a small intergenic distance are more likely to be co-transcribed than consecutive genes separated by a large intergenic distance [6]. Figures 2a, 2b, and 2c show distributions of the intergenic distances between consecutive genes on the same strand (solid red line) and on opposite strands (solid blue line) for Bacillus subtilis, Vibrio cholerae, and Escherichia coli, respectively. Figure 2c additionally shows the distribution of intergenic distances between consecutive genes that are part of documented operons in RegulonDB (solid black line) in Escherichia coli [31]. As Figure 2c demonstrates, the distribution of intergenic distances for documented operons in RegulonDB (in black) is more similar to the distribution for consecutive genes on the same strand (red) than consecutive genes on the opposite strand (blue) though it has less deviation than the distribution for consecutive genes on the same strand (red). Dashed lines in the figures illustrate versions of the distributions smoothed via an Epanechnikov kernel function [44]. Figures 2a, 2b, and 2c illustrate how different genomes have different propensities for shorter or longer intergenic regions between consecutive genes on the same strand and between consecutive genes on opposite strands. As broader context, Figure 2d shows the same distributions but, rather than for individual genomes, for 131,238 bacterial genomes available from RefSeq [35]. For a particular genome designated by the user, Rockhopper calculates distributions analogous to those shown in Figures 2a, 2b, and 2c. Then, given the intergenic distance in nucleotides between two consecutive genes on the same strand, Rockhopper estimates the probability that the genes correspond to the same operon and the probability that the genes do not correspond to the same operon. The probability that the genes correspond to the same operon is based on the smoothed distribution of intergenic distances between consecutive genes on the same strand (dashed red line in Figure 2) for the relevant genome. The probability that the genes do not correspond to the same operon is based on the smoothed distribution of intergenic distances between consecutive genes on opposite strands (dashed blue line in Figure 2) for the relevant genome.

Expression patterns as a feature of operon prediction
RNA-seq data evince expression patterns of genes, and these expression patterns contain information about the likelihood that consecutive genes in a genome are part of the same operon [24]. Genes that are co-transcribed as part of the same polycistronic message generally have related expression levels to each other [19,25]. Like other features for operon identification, expression patterns contain some predictive information but also have limits in their ability to distinguish genes that are co-transcribed from genes that are not cotranscribed. For example, one study observed differential expression of polycistronic genes in 43% of operons in E. coli [26]. Thus, as one feature to aid in its prediction of genes belonging to the same operon, Rockhopper uses the relationship of expression across conditions assayed by RNA-seq experiments. Rockhopper computes two distributions based on the correlation coefficient of two consecutive genes' expression levels across the assayed conditions [45]. The first distribution corresponds to correlations of consecutive genes on the same strand throughout the genome. The second distribution corresponds to correlations of consecutive genes on opposite strands throughout the genome. These two expression correlation distributions, like the two distributions of intergenic distances described in the previous section, enable Rockhopper to estimate the probability that two consecutive genes are part of the same operon or not, based on the relevant feature.

Probabilistic model for combining features
In order to combine information from different features suggestive of genes belonging to the same operon into a unified probabilistic model for predicting operons, Rockhopper employs a naïve Bayes classifier. Specifically, let g be consecutive genes on the same strand of a genome. Rockhopper models the probability that g corresponds to an operon, i.e., the genes in g are co-transcribed as part of the same polycistronic message, as probability operon g = prior_probability operon * probability g operon probability g (1) and the probability that g does not correspond to an operon as probability not operon g = prior_probability not operon * probability g not operon probability g . ( The denominator in expressions (1) and (2) (4) respectively. Note, the products in expressions (3) and (4) above are over the d=2 features corresponding to intergenic distance and expression pattern, as described in the two previous sections above. The prior probability that consecutive genes on the same strand are cotranscribed as part of the same operon is estimated as 1.0 -(# of directons/number of pairs of genes on the same strand) [4]. As points of reference, the prior probability of two consecutive genes on the same strand being part of the same operon is computed by Rockhopper as 69% for genes in V. cholerae, 73% for genes in E. coli, 84% for genes in S. enterica, and 85% for genes in S. pyogenes. Once the probability of g being in the same operon and the probability of g not being in the same operon are computed, g is classified as an operon or not based on whichever probability is greater using the maximum a posteriori (MAP) decision rule argmax k ∈ operon, not operon prior_probability k * ∏ i = 1 d probability g i k . (5)

Evaluation metrics
When evaluating operon predictions, different metrics may be employed to assess the accuracy of the predictions. One possible metric considers whether all genes in an operon are correctly predicted or not. The challenge with such a metric is that some operons contain many genes, as indicated in Figure 1b, and the metric will not distinguish between approaches that correctly predict most but not all genes in an operon versus approaches that do not correctly predict any genes in an operon. Thus, rather than each entire operon as the basic unit of prediction and evaluation, gene pairs are used as the basic unit, where a gene pair is defined here as two consecutive genes on the same strand in a genome. For genomewide operon predictions, all gene pairs in the genome are identified and a prediction is made for each as to whether the two genes in the pair are co-transcribed as part of the same operon or not. Assuming there exists some external source of information about operons in the genome, the predictions can then be compared to the external source to quantify how many gene pair predictions are correct. A true positive prediction corresponds to a gene pair predicted to be part of the same operon that is part of the same operon in the external source. A false positive prediction corresponds to a gene pair predicted to be part of the same operon that is not part of the same operon in the external source. A true negative prediction corresponds to a gene pair predicted not to be part of the same operon that is not part of the same operon in the external source. And a false negative prediction corresponds to a gene pair predicted not to be part of the same operon that is part of the same operon in the external source. The sensitivity (aka, recall or true positive rate) of the predictions is the percentage of gene pairs belonging to the same operon in the external source that are correctly predicted to be part of the same operon, i.e., the number of true positive predictions divided by the sum of true positive and false negative predictions. The specificity (aka, selectivity or true negative rate) of the predictions is the percentage of gene pairs not belonging to the same operon in the external source that are correctly predicted not to be part of the same operon, i.e., the number of true negative predictions divided by the sum of true negative and false positive predictions.

RNA-seq data
All RNA-seq data used in this study is publicly available from the Sequence Read Archive [21]

Rockhopper usage and availability
Rockhopper is available from the website https://cs.wellesley.edu/~btjaden/Rockhopper. In addition to the source code, installation and executables are available for PCs and Macs, and a JAR file is available for any platform. Rockhopper is implemented using the Java programming language and Java version 1.6 or later is required for its execution. Rockhopper can be run either from the command line or via a graphical user interface (GUI). In the past year, Rockhopper was downloaded approximately seven thousand times from four thousand unique IP addresses.
As output for operon predictions, Rockhopper generates a tab-delimited text file. Each line of the file corresponds to an operon prediction. Each operon prediction has five components: the nucleotide coordinate of the start of the coding sequence of the first gene in the operon, the nucleotide coordinate of the end of the coding sequence of the last gene in the operon, the strand of the operon, the number of genes in the operon, and a list containing the names of the genes in the operon. In addition to the text file containing information about all of the operon predictions, Rockhopper enables visualization of the results using the Integrative Genomics Viewer (IGV) genome browser [46]. Figure 3 illustrates the visual output of Rockhopper's operon predictions along with the location of annotated genes and sequencing reads on both strands of the genome. Figure 3 focuses on a genomic region containing a nine gene operon that has been well studied and is experimentally confirmed [47][48][49][50]. In the figure, the first four tracks show that there is the largest expression for this region on the minus strand in the second experimental condition, and there is evidence of expression spanning all nine genes. As the fifth track indicates, Rockhopper predicts that all nine genes in this region are co-transcribed as part of the same operon. While the figure focuses on a specific region, the genome browser included as part of Rockhopper allows interactive exploration of Rockhopper's operon predictions throughout a genome.

Evaluation of predictions against operon databases
We used Rockhopper to predict operons throughout the genomes of ten bacteria: Neisseria gonorrhoeae FA1090, Salmonella enterica subspecies enterica serovar Typhimurium strain LT2, Streptococcus pyogenes M1 GAS, Escherichia coli strain K-12 substrain MG1655, Caulobacter vibrioides CB15, Helicobacter pylori 26695, Pseudomonas aeruginosa PA01, Bacillus subtilis subsp. subtilis str. 168, Shewanella oneidensis MR-1, and Vibrio cholerae 01 biovar El Tor str. N16961. Table 2 shows the number of operons predicted by Rockhopper in each genome and the number of RNA-seq experiments used to make the predictions. The number of genes in a genome that Rockhopper predicted to be part of a multi-gene operon ranged from 38% for Caulobacter vibrioides to 80% for Vibrio cholerae.
Rockhopper's operon predictions were compared to the operon predictions from the Database of prOkaryotic OpeRons (DOOR 2 ), a leading database of computationally predicted operons in bacteria [16]. The final column in Table 2 indicates the similarity, as measured by the Rand coefficient [51], between Rockhopper's operon predictions and DOOR's operon predictions. The overlap between the two sets of predictions ranged between 88% for V. cholerae to 95% for E. coli.
Operon predictions from both Rockhopper and DOOR were then evaluated against three external operon databases that are based on operons with experimental evidence: RegulonDB contains highly curated information about experimentally verified operons in E. coli [31], Sharma et al. produced an extensive annotation of operons in Helicobacter pylori [23], and DBTBS consists of experimentally supported operons in Bacillus subtilis [29].
Results from comparing Rockhopper's and DOOR's predictions to databases of experimentally supported operons are shown in Table 3. The sensitivities in Table 3 suggest that both Rockhopper and DOOR successfully identify most operons in a genome. The specificities in Table 3 suggest a low false positive rate for both approaches, i.e., when consecutive genes are not part of the same operon then the approaches successfully identify most of the genes as not being co-transcribed. Across the three databases, the sensitivity of Rockhopper's predictions range from 88% to 95%, slightly higher than the sensitivity of DOOR's predictions, which range from 84% to 93%. The specificity of predictions from Rockhopper and DOOR are comparable, ranging between 81% to 96% and between 80% to 95%, respectively. The slightly higher sensitivity of Rockhopper as compared to DOOR at a comparable specificity suggests that Rockhopper is identifying more operons as it has a higher number of true positive predictions without the cost of increased false positive predictions.
To evaluate how the number of RNA-seq experiments might influence Rockhopper's performance, we varied the number of E. coli RNA-seq data sets that we used from 2 to 16 when making operon predictions with Rockhopper. We used E. coli here because of the large number of RNA-seq data sets available for this organism and because there is an external database of operons, RegulonDB [31], that is well curated. The sensitivity of Rockhopper's operon predictions when compared to operons in RegulonDB remained the same at 90% as the number of RNA-seq data sets used by Rockhopper increased from 2 to 16. The specificity of Rockhopper's operon predictions increased slightly, from 79% to 81%, as the number of RNA-seq data sets used by Rockhopper increased from 2 to 16.
In order to better understand how different features impact the accuracy of Rockhopper's operon predictions, we calculated the mean values of the features for different sets of predictions -those corresponding to true positive, false positive, true negative, and false negative predictions. Here, we used Rockhopper's predictions from E. coli as compared to RegulonDB [31], as this is a comprehensive and regularly curated database of experimentally verified operons. For the feature of intergenic distance, the mean distance between pairs of E. coli genes that Rockhopper predicts as belonging to the same operon are

Experimental validation of operon predictions
We hypothesize that some of Rockhopper's predictions that do not correspond to verified operons in external databases may indeed be genuine operons that have not been identified or validated previously. If this is the case, some of Rockhopper's predictions that are counted as "false positives" and that contribute to a lower specificity score may indeed be "true positives" and their incorrect classification would be a consequence of comparing the predictions to incomplete operon databases. To test this hypothesis, we performed experiments to evaluate whether some of Rockhopper's novel operon predictions, i.e., predictions of operons not found in the database, might correspond to genuine operons as opposed to being false predictions. We began by choosing 10 random sets of consecutive genes in E. coli according to the following criteria: each set of genes was predicted by Rockhopper to be co-transcribed as part of a multi-gene operon, some of the genes in the set are identified in RegulonDB as being part of the same operon whereas other genes in the set are not identified in RegulonDB as being part of the same operon, and the genes showed significant expression in the RNA-seq data. The 10 sets of genes ranged in size from 2 genes to 11 genes. For 13 pairs of consecutive genes within these sets that were identified in RegulonDB as being part of the same operon and for 14 pairs of consecutive genes within these sets that were not identified by RegulonDB as being part of the same operon, we performed RT-PCR to test for co-transcription. We found evidence from the RT-PCR experiments supporting co-transcription for 11 of the 13 pairs that RegulonDB reported to be part of the same operon, and we found evidence from the RT-PCR experiments supporting co-transcription for 12 of the 14 pairs that RegulonDB did not report to be part of the same operon [27]. To further understand whether some of Rockhopper's predictions might correspond to operons that have not previously been identified, we similarly tested some of our novel operon predictions in N. gonorrhoeae using RT-PCR. For six pairs of genes in N. gonorrhoeae that Rockhopper predicted to be co-transcribed as part of the same operon, we tested the pairs for co-transcription using RT-PCR and found evidence of cotranscription for all six [27]. For two pairs of genes in N. gonorrhoeae that Rockhopper predicted not to be co-transcribed, we tested the pairs for co-transcription using RT-PCR and found evidence of co-transcription for one of the two pairs [27]. Thus, in two different genomes, one in which operons have been studied extensively (E. coli) and one in which there has been less examination of operon structures (N. gonorrhoeae), we found that a high percentage of Rockhopper's novel operon predictions were supported by independent experimental evidence. However, there were also a small number of examples of previously verified operons for which we did not observe evidence of co-transcription in our RT-PCR experiments and one example where Rockhopper did not predict genes to be part of the same operon yet we observed evidence of co-transcription in our RT-PCR experiments. We interpret these findings to support the notion that Rockhopper's operon predictions are largely accurate and are applicable to different bacterial genomes, but there is still room for improvement both in increasing the sensitivity of the predictions and reducing the false positive rate.

Discussion
We evaluated Rockhopper's operon predictions by comparing them to databases of experimentally determined operons and to a leading computational approach. Rockhopper demonstrates sensitivities between 88% and 95% and specificities between 81% and 96% when its predictions are compared to databases of operons for three bacterial species. Rockhopper shows slightly improved sensitivity and comparable specificity when compared to DOOR 2.0, a leading method for operon prediction [16]. These results indicate the importance of combining multiple features for operon prediction, as individual features on their own cannot reliably distinguish co-transcribed genes from non-co-transcribed pairs of genes. Further, these data suggest that there may be opportunities to further improve the predictive accuracy of the system by differentially combining the features. Rockhopper's Bayesian classification approach makes the simplifying assumption that features are independent of each other. However, further work may benefit from exploring more complex relationships among features.
One of the challenges with evaluating operon prediction methods is that there are few genomes for which large sets of operons have been experimentally confirmed. For the few such genomes where large sets of validated operons exist, it is unclear to what extent the set of confirmed operons is complete. Thus, assessment of predictions as true/false positives/ negatives is imperfect. The situation is further complicated as many operons are dynamic in nature, with genes being co-transcribed in one condition but not in another. One study in E. coli found that 36% of operons had internal promoters and terminators that generated multiple transcription units [26]. Rockhopper makes static rather than dynamic operon predictions, though it can be used for dynamic predictions as well. One group built a database called OperomeDB of condition specific operons for more than a hundred bacterial transcriptomes based on Rockhopper's operon predictions [52]. Another group developed a method for condition specific operon prediction and evaluated their results against Rockhopper's operon predictions for three genomes, Porphyromonas gingivalis, Escherichia coli, and Salmonella enterica, observing that Rockhopper consistently led to more accurate operon predictions than the condition specific approach [12].
As further evaluation, we assessed Rockhopper's predictions for two genomes using RT-PCR. For many of Rockhopper's predictions that did not correspond to known operons and which we had labeled as false positive predictions when comparing them to various databases, we observed evidence that the genes were co-transcribed as part of the same operon and that a number of Rockhopper's novel operon predictions correspond to previously unknown operons.
Rockhopper's method for predicting operons throughout a genome has been evaluated by other groups and applied to a broad range of bacteria. In an interesting application, Rockhopper was used to identify operons in E. coli that responded to colorectal cancer [53].
An alternative to Rockhopper's approach for operon prediction is to use features that are genome-specific or that have variable success in different genomes, such as computationally predicting transcription signals including promoter regions, transcription factor binding sites, and rho-independent terminators. In contrast, Rockhopper was designed so that it can be used for operon prediction in any bacterial genome for which RNA-seq data is available.
Our results indicate that the accuracies of its predictions are relatively stable across different genomes.

Conclusion
Operons are one mechanism by which organisms can coordinate the expression of proximal genes. Understanding which genes are co-transcribed as part of a single polycistronic message provides information about the functional relationship and regulation of the genes. Computational methods have proven useful as efficient tools for predicting operons throughout a genome. Here, we present one such computational method, Rockhopper. Rockhopper uses RNA-seq data, which is increasingly being generated, to aid in its operon predictions. Rockhopper combines primary sequence information and RNA-seq data in a unified probabilistic model in order to make operon predictions in a genome. Rockhopper demonstrates high accuracy across a range of bacterial genomes. We hope that Rockhopper serves as a useful tool to the community for accurate, efficient, and user-friendly operon identification.    The visualization created by Rockhopper using the IGV genome browser illustrates a 7,842 basepair region in the E. coli genome, including sequencing reads corresponding to this region from two experimental conditions. The first condition corresponds to minimal medium supplemented with 0.2% D-glucose (SRA SRR794972) and the second condition corresponds to LB medium exposed to 0.5% α-methylglucoside (SRA SRR794863). The region contains nine ATP synthase genes on the minus strand: atpI, atpB, atpE, atpF, atpH, atpA, atpG, atpD, atpC. There are six tracks (rows of information) displayed in the browser.

Highlights
The first track (in blue) indicates sequencing reads observed on the plus strand from the first experimental condition. The second track (in blue) indicates sequencing reads observed on the plus strand from the second experimental condition. The third track (in red) indicates sequencing reads observed on the minus strand from the first experimental condition. The fourth track (in red) indicates sequencing reads observed on the minus strand from the second experimental condition. The fifth track (in magenta) represents an operon prediction from Rockhopper. The sixth track (in magenta) at the very bottom indicates the locations of nine annotated genes on the minus strand in the E. coli genome. Different methods for predicting operons are shown. The first column provides a citation for the method. The second column indicates the year of publication. The third column indicates the number of genomes for which operon predictions were made. The fourth column indicates if a webserver interface is provided for accessing the predictions. The final seven columns indicate features used for predictions: intergenic distance between genes in an operon, conservation of genes as determined through comparative genomics, functional annotation of genes, transcription signals in the form of promoter regions or terminators, codon usage of genes, microarray gene expression data, and RNA-Seq data.