The Co-regulation Data Harvester for Tetrahymena thermophila: automated high-throughput gene annotation and functional inference in a microbial eukaryote

Identifying co-regulated genes can provide a useful approach for defining pathway-specific machinery in an organism. To be efficient, this approach relies on thorough genome annotation, which is not available for most organisms with sequenced genomes. Studies in Tetrahymena thermophila, the most experimentally accessible ciliate, have generated a rich transcriptomic database covering many well-defined physiological states. Genes that are involved in the same pathway show significant co-regulation, and screens based on gene co-regulation have identified novel factors in specific pathways, for example in membrane trafficking. However, a limitation has been the relatively sparse annotation of the Tetrahymena genome, making it impractical to approach genome-wide analyses. We have therefore developed an efficient approach to analyze both co-regulation and gene annotation, called the Co-regulation Data Harvester (CDH). The CDH automates identification of co-regulated genes by accessing the Tetrahymena transcriptome database, determines their orthologs in other organisms via reciprocal BLAST searches, and collates the annotations of those orthologs' functions. Inferences drawn from the CDH reproduce and expand upon experimental findings in Tetrahymena. The CDH, which is freely available, represents a powerful new tool for analyzing cell biological pathways in Tetrahymena. Moreover, to the extent that genes and pathways are conserved between organisms, the inferences obtained via the CDH should be relevant, and can be explored, in many other systems.

expression in T. thermophila, as judged based on mRNA levels, can reveal functionally significant co-regulation. Gene regulation in this species appears to predominantly occur at the level of transcription [14], and so steady-state mRNA levels may explain the majority of steady-state protein levels, as reported in other systems [15]. In this report, we will refer to genes that are listed as co-expressed in the TetraFGD as co-regulated.

A high-throughput analysis of T. thermophila gene expression profiles revealed that accurate gene networks can be inferred from co-regulation data [13], providing evidence that co-regulated genes tend to be functionally associated. This approach has been used in bacterial, mammalian, and apicomplexan systems [12,16,17]. There is also experimental evidence in T. thermophila to support the conclusion that co-regulation corresponds to functional association: Co-regulation data were used to successfully predict novel sorting factors and proteases involved in the biosynthesis of a class of secretory vesicles, called mucocysts [18,19]. These results suggest that the and in which taxa to run the forward BLAST searches. The results of the "Documents" folder: /Documents/CoregulationDataHarvester/csvFiles.  [23,24]. This process is well-studied and known to be driven by a special adaptation of RNA interference, utilizing Dicer- and Piwi-like proteins, among other factors [25,26]. TWI1 encodes a Piwi-like protein that plays a central role in programmed genome rearrangement [27,26]. When TWI1 is entered as the query for the CDH, the CDH retrieves a large number of DNA- and RNA-processing factors, as well as chromodomain proteins (Supplementary File 1). Importantly, these include the key factors known to be involved in programmed genome rearrangement ( were subsequently shown to represent key enzymes for GRL cleavage [19,29].

Using CTH3 as a query for the CDH results in a list that includes a large number of genes known to be involved in mucocyst biogenesis (Table 2), and is enriched in membrane-trafficking factors and proteins with as-yet unknown functions in this organism (Supplementary File 3). Among the latter are a subunit of the AP3 complex and a syntaxin in the STX7 subfamily. Subsequent functional analysis of these genes showed that they are both essential for mucocyst formation, providing the best evidence to date that mucocysts are lysosome-related organelles (Kaur et al., submitted).

The CDH report for this CTH3 query is attached as Supplementary File 4.

This report is also edited to indicate the cases when the CDH matched or expanded upon existing gene annotations. (Table 3; Supplementary Files 2 and 4). Effectively, the CDH increased the annotation coverage of the genes co-regulated with TWI1 from 46% to 60%, and the annotation coverage of the genes co-regulated with CTH3 from 41% to 57%. Specifying the BLAST parameters allows the user to discover the most informative functional predictions for their genes and pathways of interest. Limiting the CDH search to lineages outside of the ciliates is more likely to retrieve previously annotated orthologs, but runs the increased risk that weak homologs will generate spurious results. For some processes, such as programmed genome rearrangement in which TWI1 is involved, the most informative BLAST searches may be those restricted to the ciliates. In our trials, the effectiveness of the CDH is maintained regardless of which taxa the BLAST searches are run against.
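The risk of spurious results from weak homologs is one reason the forward BLAST search is paired with a reciprocal search. The following is a minimal sketch of a reciprocal-best-hit check, not the CDH's own code: the gene names, the hit tables, and the 1e-5 e-value cutoff are hypothetical values chosen for illustration, and the criteria the CDH actually applies are not specified here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical BLAST hits as (subject_id, e_value) pairs; not real CDH output.
Hit = Tuple[str, float]


def best_hit(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the subject with the lowest e-value, or None if no hit passes the cutoff."""
    passing = [hit for hit in hits if hit[1] <= max_evalue]
    if not passing:
        return None
    return min(passing, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(
    forward: Dict[str, List[Hit]],
    reciprocal: Dict[str, List[Hit]],
    max_evalue: float = 1e-5,  # assumed cutoff, for illustration only
) -> Dict[str, str]:
    """Map each query gene to a candidate ortholog confirmed by the reverse search."""
    orthologs: Dict[str, str] = {}
    for query, hits in forward.items():
        candidate = best_hit(hits, max_evalue)
        if candidate is None:
            continue
        # Keep the pair only if the candidate's own best hit is the original query.
        if best_hit(reciprocal.get(candidate, []), max_evalue) == query:
            orthologs[query] = candidate
    return orthologs


# Hypothetical forward hits (query genome -> target taxon).
forward = {
    "geneA": [("humanX", 1e-40), ("humanY", 1e-8)],
    "geneB": [("humanZ", 1e-6)],
    "geneC": [("humanW", 0.5)],  # weak homolog, fails the cutoff
}
# Hypothetical reciprocal hits (target taxon -> query genome).
reciprocal = {
    "humanX": [("geneA", 1e-38)],
    "humanZ": [("geneD", 1e-20), ("geneB", 1e-7)],  # best hit is a different gene
}

print(reciprocal_best_hits(forward, reciprocal))  # prints {'geneA': 'humanX'}
```

In this toy data, geneB is dropped because its candidate matches a different gene more strongly in the reverse direction, and geneC is dropped because its only hit is too weak to pass the cutoff.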

In addition to providing a means of quickly gathering available data about a set of co-regulated genes and inferring their functions, the CDH data can be extended to investigate the potential overlap between components of different cellular pathways. For example, NUP50 encodes a protein that functions both in nuclear import at the nuclear pore complex and as part of a complex involved in transcription [30,31]. Accordingly, the genes co-regulated with NUP50 show extensive overlap with genes co-regulated with an import  freely available and provides a systematic framework for genome annotation.

It quickly gathers information from disparate databases and, by optionally reusing BLAST results that had been stored during previous queries, can increase in speed with successive uses. In providing a new means to analyze transcriptomic data, the CDH makes clear the potential for using the rapidly growing amount of genomic and transcriptomic data in many organisms, to facilitate functional analysis in poorly annotated or emerging model systems.
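The speed-up from reusing stored BLAST results can be illustrated with a small on-disk cache. This is a sketch under stated assumptions rather than a description of how the CDH stores its data: the `BlastCache` class, the JSON file layout, and the `fake_search` stand-in for a real BLAST call are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class BlastCache:
    """Hypothetical cache: keeps search results on disk, keyed by query ID."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load results stored by earlier runs, if the cache file exists.
        self.data: Dict[str, list] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get_or_run(self, query_id: str, run_search: Callable[[str], list]) -> list:
        """Return stored results for query_id, running the search only on a miss."""
        if query_id not in self.data:
            self.data[query_id] = run_search(query_id)
            self.path.write_text(json.dumps(self.data))
        return self.data[query_id]


calls: List[str] = []


def fake_search(query_id: str) -> list:
    """Stand-in for a slow remote BLAST search; records each invocation."""
    calls.append(query_id)
    return [[query_id + "_hit", 1e-30]]


cache_file = Path(tempfile.mkdtemp()) / "blast_cache.json"
first = BlastCache(cache_file).get_or_run("geneA", fake_search)
# A fresh instance simulates a later session that reads the stored results.
second = BlastCache(cache_file).get_or_run("geneA", fake_search)

print(first == second, len(calls))  # prints True 1
```

The second lookup returns the stored result without calling the search again, which is the behavior that lets repeated queries run faster.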

Users of the CDH should keep in mind that its reports are necessarily limited by pre-existing data from the TGD, TetraFGD, and the NCBI.

For example, the TetraFGD does not provide co-expression data for genes whose expression level falls below a set threshold. Because of this limit, some T. thermophila genes may be overlooked by the CDH. Executable files for the program can be found at http://ciliate.org/index.php/show/CDH.

Figure 1: CDH architecture. Beginning with a single T. thermophila gene as a query, the CDH identifies all genes that are co-regulated with it, via the TetraFGD. Next, the CDH uses the TGD to gather the annotation and sequence data for each gene in the co-regulated set. For each gene in the co-regulated set, the CDH then runs forward and reciprocal BLAST searches, through the NCBI and TGD, to identify likely orthologs. A phrase matching algorithm, based on the Ratcliff-Obershelp algorithm [22], as implemented by the Python difflib library, is then used to summarize the annotations of the putative orthologs for each T. thermophila gene in the co-regulated set. These summaries, which provide predictions about the function (e.g., relevant biological pathway) of the T. thermophila gene query, are presented along with the other data gathered, in the final report.

Figure 2: Setting CDH search parameters. The CDH is run through the terminal. The CDH prompts the user to define several parameters. These are: 1) the initial gene, i.e., the query; 2) the z-score threshold to be applied as cutoff for strength of co-regulation, which determines how many of the co-regulated genes will be subject to analysis via homology; 3) the extent to which data gathered in prior searches should be used; 4) whether results should be stored in Dropbox; 5) whether to run BLAST searches with cDNA or protein sequences; and 6) in which taxa to run the BLAST searches. For (2), the z-score threshold determines how many co-regulated genes will be included.
For example, raising the threshold increases the stringency of the requirement for strength of co-regulation, and so results in fewer co-regulated genes that are subsequently analyzed via BLAST, etc. For (3), the available options are: a) to run the search from scratch, overwriting any files associated with the queried gene; b) to re-use existing data for co-regulation, annotations, and sequences, but to run all of the BLAST searches from scratch; c) to re-use any existing data that are pertinent to the given query; d) to clear NCBI database errors from a previously run search and redo the associated BLAST searches; e) to run the search only for the co-regulation, annotation, and sequence data. The example query in this screenshot is set to run a CDH search for the gene TTHERM_00313130 (Sortilin 4); to consider genes that are co-regulated with it with a z-score of 5 or higher; to gather all of the data de novo; to save all of the data locally; and to run the BLASTp searches only within the Ciliates.
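The phrase-matching step named in the Figure 1 caption can be illustrated with difflib, whose SequenceMatcher implements Ratcliff-Obershelp-style matching. The sketch below is one plausible way to extract phrases shared among ortholog annotations and is not the CDH's own routine: the function names, the word-level tokenization, the two-word minimum, and the example annotations are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def shared_phrases(a: str, b: str, min_words: int = 2) -> List[str]:
    """Return runs of at least min_words consecutive words common to both annotations."""
    words_a, words_b = a.lower().split(), b.lower().split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    phrases = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel; the size check skips it.
        if block.size >= min_words:
            phrases.append(" ".join(words_a[block.a:block.a + block.size]))
    return phrases


def summarize(annotations: List[str], min_words: int = 2) -> List[str]:
    """Rank phrases by how many annotation pairs share them (most frequent first)."""
    counts: Counter = Counter()
    for a, b in combinations(annotations, 2):
        counts.update(shared_phrases(a, b, min_words))
    return [phrase for phrase, _ in counts.most_common()]


# Hypothetical ortholog annotations for one gene in a co-regulated set.
example = [
    "AP-3 complex subunit beta-1",
    "AP-3 complex subunit delta",
    "adaptor protein AP-3 complex subunit beta",
]

print(summarize(example))  # prints ['ap-3 complex subunit']
```

Comparing word lists rather than raw characters keeps the matches aligned to whole words, so the summary reads as a phrase rather than a fragment.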