Computational discovery and annotation of conserved small open reading frames in fungal genomes

Background Small open reading frames (smORF/sORFs) that encode short protein sequences are often overlooked during the standard gene prediction process thus leading to many sORFs being left undiscovered and/or misannotated. For many genomes, a second round of sORF targeted gene prediction can complement the existing annotation. In this study, we specifically targeted the identification of ORFs encoding for 80 amino acid residues or less from 31 fungal genomes. We then compared the predicted sORFs and analysed those that are highly conserved among the genomes. Results A first set of sORFs was identified from existing annotations that fitted the maximum of 80 residues criterion. A second set was predicted using parameters that specifically searched for ORF candidates of 80 codons or less in the exonic, intronic and intergenic sequences of the subject genomes. A total of 1986 conserved sORFs were predicted and characterized. Conclusions It is evident that numerous open reading frames that could potentially encode for polypeptides consisting of 80 amino acid residues or less are overlooked during standard gene prediction and annotation. From our results, additional targeted reannotation of genomes is clearly able to complement standard genome annotation to identify sORFs. Due to the lack of, and limitations with experimental validation, we propose that a simple conservation analysis can provide an acceptable means of ensuring that the predicted sORFs are sufficiently clear of gene prediction artefacts. Electronic supplementary material The online version of this article (10.1186/s12859-018-2550-2) contains supplementary material, which is available to authorized users.


Background
Small open reading frames (smORF) are sequences that potentially encode for proteins but are shorter than other more commonly translated genomic DNA sequences [1]. Such protein sequences can theoretically range from a minimum of two to~100 residues. Various values have been reported for what can be acceptable as the limits to be a small and functional protein. The problem of determining what constitutes the minimum number of codons to be considered as protein coding has been discussed since the earliest genome sequences for Saccharomyces cerevisiae were published [2,3]. In addition to the term smORF, these sequences have also been referred to as short open reading frames (sORFs) and the proteins that they encode have at times been referred to as microproteins.
Despite the term sORF turning up only in more recent literature, the existence of genes that code for proteins of 150 residues and less have been known for more than three decades. Functional sORFs have been identified in a wide range of organisms from prokaryotes to humans. The Sda protein (46 residues) found in Bacillus subtilis is known to inhibit sporulation by preventing the activation of a required transcription factor [4,5]. Proteins such as TAL (11 residues), found in Drosophila melanogaster, are known to be important for leg development [6,7]. The Cg-1 protein (< 33 amino acids) is involved in controlling tomato-nematode interaction [8]. In Homo sapiens, the humanin (24 amino acids) protein is involved in mitochondria-nuclear retrograde signalling that controls apoptosis [9,10]. Possibly the smallest ORF reported to date encodes a six residue polypeptide -MAGDIS; this ORF is referred to as the upstream open reading frame (uORF) in the mRNA of S-adenosylmethionine decarboxylase (AdoMetDC), a key enzyme in the polyamine biosynthesis pathway [11].
Nevertheless, it has also been shown that many ORFs with lengths of 100 or less amino acids may have been missed during gene prediction from whole genome sequences because the gene prediction tools are tuned to ignore small genes perceived to be 'junk' or non-protein coding [21]. For example, the early genome annotations of S. cerevisiae had defined 100 residues and 150 residues as the minimum number to be encoded by an ORF thus in a way setting a parameter value for future gene predictions and annotation work [2,3]. Perhaps as a consequence of such practices being integrated as part of standard gene prediction protocols, the number of sORFs that have been identified over the years has remained relatively small. Although the parameters for the gene prediction can be tweaked and changed in light of a better understanding regarding the existence of sORFs, the challenge of ascertaining that the annotated sORFs are indeed protein coding and not artefacts remains [17].
In this work, we have identified potential sORFs from fungal genomes by specifically repeating the gene prediction and annotation processes based on a residue length cutoff of 80 amino acids or less and specified the range of sORFs length distribution among homologs to avoid false positives. The cutoff of 80 residues was chosen as a simplistic means of selecting ORFs that were most likely to have been overlooked by previous gene predictions. The identification of 1986 putative predicted sORFs involved a large sequence dataset extracted from 31 fungal genome sequences with a total of 210,928 ORFs from existing gene prediction and annotation. The predicted sORFs were then compared to identify highly conserved examples within the fungal genomes dataset by adopting the assumption that such highly conserved sequences may code for common or even essential functions and are thus unlikely to be artefacts or randomly matched examples. This can potentially be a quick and inexpensive means of identifying subsets of sORFs that are classified as hypothetical proteins for experimental characterization.

Identification of potential sORFs
The fungal genomes selected were required to have associated annotations for predicted genes thus limiting our dataset to only 31 genomes at the time the work was initiated. These annotations were utilized to identify 5255 sORFs genes that had already been identified in the original annotation to encode for a maximum of 80 amino acid residues. The ORF prediction process was then repeated for all 31 fungal genomes using the computer programs getorf [22,23] and sORFfinder [24] as detailed in methods section. This process resulted in 16,156,945 sORFs identified by getorf and 902,110 found by sORFfinder. The results of both searches were overlapped to yield a consensus of 42,587 potential sORFs sequences encoding for 80 residues or less. The ORFs predicted by getorf with a cutoff of 240 nt were considered as genes that can either be a region that is free of STOP codons or a region that begins with a START codon and ends with a STOP codon [22,23]. However, all the sORFs identified in this study have both START as well as STOP codons.
Characterization of the fungal conserved sORFs CD-HIT [25] clustering of the combined 47,842 potential sORFs at 70% sequence identity resulted in 730 sORFs clusters that comprised of 1986 sequences putatively conserved in at least two fungal species (Fig. 1). Four of the sORFs predicted have Kozak sequences based on an ORF integrity value that was derived by calculating their coding potential using CPC2 [26]. The majority of the sORFs predicted were in the under 40 residues range with the shortest sORF in this dataset being composed of only 11 amino acids (Fig. 2).
The clustering based on 70% similarity of the 47,842 potential sORFs resulted in a total of 1986 putative sORFs that are conserved within at least two fungal species (Additional file 1). Among the 1986 putatively conserved sORFs, 927 have homologs with known functions (35 from the purpose built sORF prediction process; 892 from existing genome annotations) (Fig. 3). The remainder 1059 putatively conserved sORFs have uncharacterized functions and can be further divided into two categories: the first set -23 sORFs with homologs outside of the 31 fungal genomes; and the second set -1036 sORFs with no detectable sequence homologs outside of the 31 fungal genomes. The latter set can thus be considered as fungi specific sORFs.

Discussion
The standard gene prediction process may miss ORFs that encode for protein sequences of less than 100 residues [12,27]. In order to address this, we carried out a two pronged approach using a dataset of 31 available fungal genomes to carry out: (i) identification of ORFs that have already been annotated to be below 80 residues in length and (ii) repeating the gene prediction process for each genome to specifically identify genes that encode for sORFs of 80 residues or less.
The bias of the parameter often used to predict genes that require a minimum of 100 codons may bypass the sequences in the intergenic spaces, especially when such regions are less than 100 residues in length, but yet they may actually encode for functional proteins of less than 80 residues. Intronic sequences may also hypothetically code for such sORFs. Therefore, these sequences were used as the target for sORF searches in this work. Predicted sORFs that were found to occur in multiple genomes were selected for further characterization. However, it was anticipated that such searches can return a large number of predicted genes, many of which could be artefacts of the search process itself. In order to address this, the pool of predicted sORFs were then compared to each other to find potentially homologous sequences within the predicted sORFs dataset. It is expected that such short sequences that are conserved in several genomes can be assumed to be of functional importance and thus not an artefact of the gene prediction process, especially more so if those sequences were also extracted from the intronic regions as was the case in this study.
At the time this work was initiated, 31 fungal genomes were selected because they had relatively complete genome sequences and had accompanying annotations. All the selected genomes were from the kingdom fungi and from various phyla including Ascomycota, Basidiomycota and Microsporidia (Table 1). Due to this diversity, we therefore believe that the workflow developed would be widely applicable for all fungi and possibly for other kingdoms as well.
The first approach merely involved identifying ORFs from the existing available annotations for sequences that fitted the maximum 80 residues criterion used for this study. This approach was dependent on parameters had been set by the annotators of the deposited data as the minimum number of codons that were to be considered as protein coding. The sORFs retrieved from this extraction provided a reference for what had already been identified. The second approach, which can be considered as the major feature of this work, involved repeating the gene prediction and annotation process by specifically identifying potential ORFs in the intronic, exonic and intergenic regions. We had opted to focus the searches on sequences extracted from the intronic and intergenic regions because a relatively high number of sORFs can be found within these regions as demonstrated by the discoveries of 3241 putative sORFs in the intergenic regions of Arabidopsis thaliana [28] and 15 sORFs in the intronic regions of Drosophila [29].
The sORF identification in the second approach involved the use of two computer programs, sORFfinder and getorf. The sORFfinder program was specifically designed to detect small ORFs. The getorf program, which is available as part of the emboss package, employs a less stringent approach that simply involves setting the sequence length parameter for genes to be below 240 nucleotides within the start to stop codons reading frame. In order to throw a wider net, we specifically included intronic and intergenic sequences as inputs for sORF identification. It is not unexpected that the output of  both programs would contain false positives. In order to narrow down the selection, we had only selected outputs that were agreed on by both sORFfinder and our getorf runs at the 240 nt cap. This filtering was done in order to reduce the number of sequences for further characterization. It is however possible that true sORFs are present in the dataset that were predicted by only one of the programs and thus not investigated further as a part of this work. This is a clear limitation of the process that we had introduced as a means to acquire a more manageable number of sequences for further characterization.
The bypassing of sORFs that are located in the intergenic regions can occur during what is considered as the standard gene prediction process because these stretches of sequence only have sufficient length to encode for polypeptides that may be shorter than 100 residues [12,27] and are thus overlooked as simply being non-coding filler sequences between two coding sequences. In order to address the possibility that a large number of sORFs in the intergenic regions may have been missed during a standard gene prediction process as demonstrated by the work of Hanada et al. that identified novel small open reading frames that were confirmed to at least be transcribed [28], our analysis also specifically targeted for the presence of sORFs in those sequences.
Although there were no predicted sORFs that were conserved in all 31 genomes, there were 68 sORFs from two homologous clusters that were present in 26 of the 31 fungal genomes. Additionally, there are 1663, 215, and 40 sORFs that could be found in ¼, ½ and ¾ of the 31 genomes, respectively (Fig. 4). The two clusters identified by the genome screening approach consist of sORFs that are homologous to 40S ribosomal protein S28 (Fig. 5a i) and 40S ribosomal protein S30 (Fig. 5a ii). In the first cluster, 11 sORFs from eight species, Cryptococcus neoformans, Candida glabrata, Eremothecium cymbalariae, Kazachstania africana, Naumovozyma castellii, Agaricus bisporus, Aspergillus nidulans and Myceliopthora thermophila that were originally annotated as hypothetical proteins, were updated to be homologs of 40S ribosomal protein S28. The annotation for this homology assignment was obtained using BLAST and domain analysis using InterProScan. Furthermore, the evolutionary analysis on this cluster showed that all of these sORFs are conserved in fungi and are closely related in the fungal group when compared against the outgroup, Ananas comosus (pineapple) (Fig. 6a). This demonstrates the utility of reannotation projects in general and especially when they are designed to identify specific targets such as the one we have carried out in updating the existing annotation.
The function of sORFs in the first alignment set, which are conserved in about half of the 31 genomes ( Fig. 5b i), are proteolipid membrane potential modulators that modulate the membrane potential, particularly to resist high cellular cation concentration. In eukaryotic organisms, stress-activated mitogen-activated protein kinases normally play crucial roles in transmitting environmental signals that will regulate gene expression for allowing the cell to adapt to cellular stress [30]. This protein is an evolutionarily conserved proteolipid in the plasma membrane which, in S. pombe, is transcriptionally regulated by the Spc1 stress MAPK (mitogen-activated protein kinases) pathway. There are two sORFs (C. dubliniensis-sf4096_1 and P. pastoris-sf9282_1) from the computational reannotation that were clustered together with conserved alignments, and thus indicating they may have the same function (Table 2). Further evolutionary analysis of this cluster showed that all of the sORFs in this cluster are closely related in fungal group against bacteria, Halmonas xinjiangensis (Fig. 6b).
The second alignment in Fig. 5b ii shows four sORFs predicted from the reannotation that are homologs to  the 60S ribosomal protein. This is a possible indicator that sORFs may have been missed during a standard genome annotation process. Our analysis identified a higher number of sORFs candidates in S. cerevisiae compared to that published by Kastenmayer et al. [14]. The total of 77 sORFs predicted for S. cerevisiae contained all the 16 sORFs predicted by Kastenmayer et al. There are 20 sORFs in this set that were predicted by the sORFfinder and getorf integrated prediction process ( Table 3). The other 57 sORFs predicted for S. cerevisiae have already been previously identified and was extracted from the genome screening approach.
In 274 clusters predicted from the 31 fungal genomes, 892 putative conserved sORFs have been annotated previously as a gene and have known functions. Characterization of the putative conserved sORFs revealed that approximately 3.8% of the newly predicted sORFs have known functions but were not annotated as genes in the available genome annotations. Our sORF annotation workflow also determined that 832 of the putative conserved sORFs predicted are hypothetical proteins or have no characterized function (Fig. 3). Even though these sORFs do not have a known function, their conservation across multiple species imply that their presence is of some functional importance. Out of the total of 848 predicted sORFs from the 31 genomes (Fig. 3), 93 sORFs from the sORFfinder-getorf integration output have homologs in other organisms (Additional file 2).
The total 1986 predicted sORFs were blast searched against the refseq database [31] and classified according to the three major Gene Ontology (GO) classes of molecular function, biological process and cellular component. Of the 1986 sORFs predicted, only 617 predicted sORFs could not be classified according to GO classes. This resulted in 2746 putative sORFs being classified into biological process, 4546 putative sORFs classified as cellular components and 155 predicted sORFs classified to be involved in molecular function (Fig. 7). The number of genes resulting from the Gene Ontology classification are higher than the total number of predicted sORFs predicted because one gene can be associated with multiple classes. The overall classification showed that most of the sORFs predicted have roles in biosynthesis and nucleic acid metabolism.
The methods that we have developed from available and proven tools are expected to be easily deployable to other genomes as and when they become available with minimal modifications. Recently, a psychrophilic yeast genome had been reported [33] that has other functional data also available such as gene expression during cold stress [34,35] and the characterization of proteins involved in cold adaptation [36][37][38]. The mining of such genomes for sORFs that can then be integrated to the functional data may be a cost-effective means of identifying sORFs that are involved in psychrophily or other relevant extremophilic adaptations.

Conclusions
The results of our work reveal that a high number of potential sORFs could be overlooked by the standard gene prediction workflow. We therefore recommend that the standard genome annotation process be complemented by analyses that specifically target the annotation of sORFs [39,40], and then have both results integrated to Fig. 7 Classification of predicted conserved sORFs based on Gene Ontology. The solid colors represents cellular components, the dot patterns represents biological processes and cross patterns represents molecular functions provide a more complete genome annotation. This workflow is applicable for big data analysis because this study involved a large number of sequences from 31 completed fungal genomes that consisted of intergenics, introns, ORFs and genome sequences. Although the functional validation for predicted sORFs cannot be done based solely on the genome sequence without any corresponding transcriptomic or proteomic data, it is still possible to imply a putative status for the predicted sORFs by evaluating their conservation with the assumption, albeit a very simplistic one, that the observed conservation implies a conserved function of some biological importance and thus less likely to be artefacts of the gene prediction process. Furthermore, the predicted sORFs predicted will be incorporated into a database consisting of sORFs from fungal genomes.

Methods
A workflow was created to predict sORFs from fungal genomes ( Fig. 1) and the components and steps involved are provided below. The source code for the programs in this workflow have been deposited on GitHub -https://github.com/firdausraih/sORFs-fungal-genomes (Additional file 3). The data for sORFs were sourced from two datasets: (i) existing annotations made available with the genome sequences and (ii) a purpose built search.

Source of genome data
The data for 31 fungi genomes were downloaded via FTP from the NCBI at ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Fungi/ (Table 1). These 31 fungi genomes were selected from 36 fungi genomes in NCBI based on the completeness of their genome analysis and annotation.

Screening sORFs from genome annotation
Known or existing sORF annotations were first extracted from the existing genome annotations available for the fungal genomes used. This dataset was restricted to annotations for a maximum length of less than 240 nucleotides or 80 amino acids.

Identification of intergenic, intronic and coding regions
The intronic and coding regions for the genomes were identified using Artemis [41] [https://www.sanger.ac.uk/ science/tools/artemis] based on the chromosome, scaffold or contig sequences and the protein coding sequences for each genome. The intergenic regions were extracted from the genome annotations in the General Feature Format (GFF) format using a Perl script. The intergenic regions were extracted from both the forward and reverse strands.

Identification of sORF using sORFfinder-getorf approaches
Gene predictions that specifically targeted the identification of sORFs were done by using sORFfinder-getorf approaches that combined two programs: getorf and sORFfinder. The prediction of the sORFs were carried out for each scaffold and/or chromosome. The sORFs predictions using getorf from the EMBOSS package [23,42] were restricted to a maximum length of 240 nucleotides. Identification of sORFs by sORFfinder [24] was carried out using a 0.5 probability parameter. The results of sORFfinder, which by default is set at a maximum of 100 amino acids, were then filtered for output containing 80 amino acids in length.

Determining existing homologs for the predicted sORFs
The predicted sORFs from getorf and sORFfinder search outputs for each fungi genome were combined and clustered using CD-HIT-EST [25,43] and those with 100% identify were removed. Unique sequences that represented each cluster were then used as BLAST queries to search against a database of open reading frames (ORFs) for 31 fungal genomes using BLASTX [44,45]. BLAST hits that aligned to less than two thirds of the query sequences and with less than 30% sequence identity were removed and the remainder were used as a potential sORFs dataset.

Identification of conserved sORFs
The pre-annotated sORFs and those that were predicted as potential sORFs were then combined and clustered using CD-HIT at 70% identity to remove clusters that contained only a single sequence. For each cluster, sORFs that have homologs in at least two different species in one cluster were considered as potentially conserved sORFs. All conserved sORFs were identified their Kozak sequences using CPC2 [26].

Multiple sequence alignments and evolutionary analysis
Identification of conserved sORFs in the clusters were carried out using the MUSCLE [46] sequence alignment program. A multiple sequence alignment generated using MUSCLE, which included one out group identified by PSI-BLAST [45], was used as input to construct a phylogenetic tree with 1000 bootstrap replications using the Jones-Taylor-Thornton (JTT) model based on the Neighbor joining method using PHYLIP 3.695 [47,48].

Function prediction and classification
The predicted sORFs were annotated using blast, interpro and classified using BLAST2GO into the three main Gene Ontology classes of molecular function, biological processes and cellular component [49][50][51].