Well-characterized sequence features of eukaryote genomes and implications for ab initio gene prediction

In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA genes, remains challenging. Ab initio gene prediction has two main aspects: the computed values that describe sequence features and the algorithm used to train the discriminant function; different combinations of the two are employed in various bioinformatic tools. Herein, we briefly review the well-characterized sequence features of eukaryote genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.

Owing to tremendous progress in the efficiency, accuracy and cost of high-throughput sequencing technologies, genome sequences of a large number of eukaryotic, prokaryotic and archaeal organisms are becoming available [1,2]. These efforts are expected to open a window onto a better understanding of many biological processes, because the essential information is in principle encoded in genome sequences. Nevertheless, meaningfully decoding this huge amount of DNA sequence remains challenging; for example, we are still in our infancy in understanding the biological implications of the substantial fraction of "junk DNA" in eukaryote genomes, which does not encode any known protein [3]. Additionally, a recent publication revealed that sequence context has functional consequences by influencing the substitution rate of adjacent nucleotides [4], which complicates the biological interpretation of genome sequences because more complex mathematical models are required.
In contrast to experimental investigations of biological function, in silico analysis of DNA sequences is essential in the post-genomic era. Many general properties of DNA sequences, such as GC content and base composition, have been widely used for in silico analysis [5]. Additionally, ab initio prediction of gene structure is a critical step after sequencing a whole genome and has therefore received much attention over the past decade [6]. Because of limitations in biological knowledge and bioinformatic algorithms, however, the precision of existing gene prediction tools still needs to be improved. In the present article, we briefly review the well-characterized features of DNA sequences and their applications to ab initio gene prediction in eukaryotes. Although some of the literature was published more than ten years ago, it is still helpful for providing an overall landscape and promoting the development of bioinformatic tools. The genome architectures of the available eukaryotic species are also briefly summarized first.

Outlines of genome architecture
Exploring the evolutionary dynamics and biological consequences of genome size, base composition, and the relative proportions of functional and nonfunctional sequences is a fascinating challenge in biology. Transposable genetic elements, in combination with natural selection, are acknowledged to contribute to genome evolution, resulting in a considerable accumulation of repetitive sequences [7][8][9]. However, many of the mechanisms proposed to account for genome evolution remain uncertain or controversial, and these topics are beyond the scope of the present review. Fortunately, the recently prevailing approach of pan-genome analysis is anticipated to provide more insight into this field [10]. Large variations in genome size (C-value) have been widely observed even among closely related species of the same genus [3], which is termed the C-value paradox. Publications on the diversity patterns, evolutionary mechanisms and research methodologies related to genome size in eukaryotes were recently summarized [11]. The traditional view suggests that more than 90% of the human genome is nonfunctional and therefore regarded as "junk DNA", whereas the ENCODE project recently argued that up to 80% of the genome sequence has a functional role [2,12]. The two views remain under heated debate. Here, we analyzed the genome sequences of 32 representative eukaryote species and roughly compared their genome size, GC content, and relative proportions of intergenic regions, exons and introns (Fig. 1). Unsurprisingly, an intuitive correlation between genome size and the fraction of intergenic regions could be observed. Additionally, the proportions of exons and introns change more or less consistently.
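The indices compared above (GC content and the fractions of exons, introns and intergenic regions) can be derived from a genome sequence and its annotation. The following is a minimal sketch rather than the scripts used for Fig. 1: the function names are hypothetical, the input is assumed to be tab-separated GFF3 rows with `gene` and `exon` features, exons are assumed to lie within genes, and introns are approximated as genic positions not covered by any exon.

```python
from collections import defaultdict

def gc_content(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def merged_length(intervals):
    """Total length covered by 1-based inclusive intervals, overlaps merged."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def genome_fractions(gff_lines, genome_size):
    """Return (exon, intron, intergenic) fractions of the genome."""
    spans = defaultdict(list)  # (feature type, seqid) -> list of intervals
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue
        spans[(cols[2], cols[0])].append((int(cols[3]), int(cols[4])))
    genic = sum(merged_length(v) for (t, _), v in spans.items() if t == "gene")
    exonic = sum(merged_length(v) for (t, _), v in spans.items() if t == "exon")
    return (exonic / genome_size,
            (genic - exonic) / genome_size,
            (genome_size - genic) / genome_size)
```

Merging overlapping intervals before summing keeps exons shared by alternative transcripts from being counted twice.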

Well-characterized features within genome sequence
Although this cannot be completely verified, conserved features of DNA sequences are expected to exist that correspond to various biological functions; some of them are already known, while others remain unknown. On the basis of this supposition, in silico analysis of DNA sequences can be used for functional investigation. On the whole, the features of DNA sequences in eukaryotic genomes can be categorized into two classes: compositional properties and functional signals (Fig. 2).
Fig. 1 (caption fragment). In brief, all five indices were generated by parsing the annotation of each reference genome (in GFF format) downloaded from NCBI (March 2016); these steps were performed using in-house Python scripts. Additionally, a screenshot of the NCBI taxonomic tree is used to show the phylogenetic relationships among species, with the full Latin scientific names.

Repetitive sequences
Our knowledge of the organization of eukaryote genomes has increased dramatically owing to the ever-growing number of genome sequences [1]. A well-known feature of eukaryote genomes is that a substantial proportion consists of repetitive sequences occurring hundreds or thousands of times [13]. According to their evolutionary origins and genomic distribution, repetitive DNA sequences can be broadly classified into three types [14,15]: tandem repeats, interspersed repeats, and long terminal repeats (LTRs). Tandem repeats, such as microsatellites, minisatellites and satellites, are characterized by two or more contiguous repetitions of short fragments [16]. Interspersed repeats mainly include short and long interspersed elements; both of them, together with LTRs, are evolutionarily derived from transposable elements [17,18]. The evolutionary dynamics, diversity patterns, and biological functions of repetitive sequences in eukaryote genomes have been intensively reviewed elsewhere [19][20][21].
Specific databases, such as Repbase Update [22] and SINEBase [23], provide platforms and computational tools for depositing, naming and annotating repetitive sequences in eukaryotes. Meanwhile, various bioinformatic tools have been developed for finding repetitive sequences in genomes, including RepeatMasker [24], PILER [25] and RepeatExplorer [26]. In human, a de novo tool estimated that about 70% of the entire genome is repetitive or repeat-derived, which is higher than the estimates obtained with alignment-based approaches [20,27]. In practice, repetitive sequences are routinely masked before eukaryotic gene finding because they generally do not encode proteins [28].
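To make the definition of tandem repeats concrete, the sketch below finds perfect microsatellite-like repeats with a regular expression. It is only an illustration and does not reflect how RepeatMasker, PILER or RepeatExplorer work; the function name and the thresholds (`max_unit`, `min_repeats`) are assumptions chosen for the example.

```python
import re

def find_microsatellites(seq, max_unit=6, min_repeats=4):
    """Yield (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. The repeat unit is matched
    lazily, so the shortest unit that explains the repeat is reported.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        yield m.start(), m.end(), unit, (m.end() - m.start()) // len(unit)
```

For instance, `list(find_microsatellites("GGTCACACACACATTG"))` reports a single `CA` repeat with five copies. Imperfect repeats and longer satellites would need alignment-based or de novo methods such as those cited above.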

Coding measures
Owing to the constraints of natural selection, the base composition of protein-coding DNA sequences differs significantly from that of non-coding sequences or from random expectation. Various coding measures related to base composition were proposed early on a statistical basis [29]. Among them, the most widely used is codon usage bias [30], for which the observed frequencies of all 64 possible codons in a DNA sequence are first counted. Alternatively, each codon can be translated into its amino acid to give the observed frequencies of the 20 amino acids and the stop signal. These observed frequencies of codons or amino acids are then used to build a discriminant function for distinguishing coding from non-coding sequences. More generally, a sequence can be parsed into words of an arbitrary length of n nucleotides, whose observed frequencies are then calculated. After comparison of various word lengths, the 6 bp word, also termed the hexamer, has been acknowledged as the most informative index [31].
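Codon and hexamer frequencies of this kind can be obtained with a single word-counting routine. The sketch below is a minimal illustration with hypothetical function names; it assumes the input coding sequence starts in frame and simply skips words containing ambiguous bases.

```python
from collections import Counter

def kmer_frequencies(seq, k, step=1, frame=0):
    """Relative frequencies of k-mers read every `step` bases from `frame`."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k] for i in range(frame, len(seq) - k + 1, step)
    )
    # Keep only words made of unambiguous bases.
    counts = {w: n for w, n in counts.items() if set(w) <= set("ACGT")}
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}

def codon_usage(cds):
    """Frequencies of non-overlapping, in-frame codons."""
    return kmer_frequencies(cds, k=3, step=3)

def hexamer_usage(seq, in_frame=True):
    """Frequencies of hexamers, read in frame (step 3) or at every base."""
    return kmer_frequencies(seq, k=6, step=3 if in_frame else 1)
```

Comparing such frequency tables between known coding and non-coding training sets (for example as log-ratios per word) is one simple way to turn them into a discriminant score.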
Although the genetic code is read in triplets, the degree of biological conservation differs significantly among the first, second and third codon positions. Therefore, the base composition bias among the three codon positions would be expected to provide valuable information for discriminating between coding and non-coding sequences [29]. To better demonstrate this issue, we analyzed the base frequencies at the three positions in coding segments and untranslated regions for 38,542 human mRNA reference sequences. Additionally, 12,367 sequences of known human lincRNAs were included for comparison (Fig. 3). Our results clearly revealed the bias of base composition within coding segments in terms of both absolute and relative frequencies. However, both intergenic and intron sequences remain to be investigated. In fact, more than two decades ago, Fickett proposed a statistical index named Fickett TESTCODE [32], which combines information on base composition and codon usage bias and has been employed for computationally estimating the coding potential of DNA sequences [33]. Recently, the Python package repDNA was published for efficiently generating feature vectors related to the base composition of DNA sequences [5], which could facilitate analysis for biologists without a strong bioinformatic background.
[Fig. 2 labels: Mutual information; Singularity density distribution; Entropic segmentation, etc.]
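The position-specific comparison described above amounts to tabulating base frequencies separately for the first, second and third codon positions. The following sketch shows one way to do this; the function names are hypothetical, the sequence is assumed to be in frame, and this is not the analysis code behind Fig. 3.

```python
def position_base_frequencies(cds):
    """Return {codon position (1-3): {base: frequency}} for an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    result = {}
    for pos in range(3):
        bases = [b for b in cds[pos:usable:3] if b in "ACGT"]
        result[pos + 1] = {
            b: (bases.count(b) / len(bases) if bases else 0.0) for b in "ACGT"
        }
    return result

def gc3(cds):
    """GC content at the third codon position."""
    third = position_base_frequencies(cds)[3]
    return third["G"] + third["C"]
```

Applying the same function to a non-coding sequence read in an arbitrary frame gives the baseline against which the positional bias of coding segments can be judged.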

Other mutual information
Regardless of functional implications, statistical dependencies within a sequence can also be used to discriminate between coding and non-coding sequences. For example, the average mutual information, an information-theoretic quantity, was designed as a species-independent statistical index for distinguishing coding from non-coding DNA sequences [34]. A segmentation method based on the estimated entropy of the base composition of a DNA sequence proved to be powerful for finding borders between coding and non-coding regions [35]. Local properties of DNA sequences, rather than global features, were also successfully used for partitioning the coding and non-coding regions of eukaryotic genomes [36].
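As an illustration of the information-theoretic idea, the mutual information between bases separated by a distance k is I(k) = sum over base pairs (a, b) of p_ab(k) * log2(p_ab(k) / (p_a * p_b)). The sketch below computes this quantity and a profile over several distances. It is a generic estimator under simple assumptions (plug-in frequencies, no finite-sample correction) and is not claimed to reproduce the exact index of [34]; the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(seq, k):
    """Mutual information (in bits) between bases separated by distance k."""
    seq = seq.upper()
    pairs = [
        (seq[i], seq[i + k])
        for i in range(len(seq) - k)
        if seq[i] in "ACGT" and seq[i + k] in "ACGT"
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal counts of the first base
    right = Counter(b for _, b in pairs)   # marginal counts of the second base
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

def mutual_information_profile(seq, max_k):
    """I(k) for k = 1..max_k."""
    return [mutual_information(seq, k) for k in range(1, max_k + 1)]
```

Averaging the profile over a chosen range of k yields a single number per sequence that can be fed to a classifier alongside the compositional measures.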

Functional signals
In addition to the compositional properties of DNA sequences mentioned above, eukaryotic genome sequences contain many intrinsic signals that guide various biological functions, such as transcription, pre-mRNA processing, and translation into protein [28]. Briefly, the well-known functional signals related to gene transcription mainly include the TATA box, initiator, cap signal, CpG islands and polyadenylation signal. The genomic distribution, sequence characteristics and computational detection of transcriptional signals have been specifically addressed elsewhere [37][38][39]. After transcription into pre-mRNA, splicing removes the introns to produce mature mRNA; splice sites are recognized by the canonical GT at the donor site (the 5' end of the intron) and AG at the acceptor site (the 3' end of the intron) [40,41]. Besides start and stop codons, the Kozak sequence (GCC(A/G)CCAUGG) and upstream open reading frames (uORFs) are the principal translational signals [42].
Although these functional signals play important roles in predicting gene structure and organization, especially for protein-coding genes, two intrinsic limitations should be taken into account when including them in bioinformatic algorithms. First, functional signals are short motifs, so their occurrence in a DNA sequence has little statistical significance when considered alone. Second, not all genes contain the canonical functional signals; some signals may be completely absent or present in non-canonical forms. For example, minor types of splice sites have been acknowledged in addition to the canonical GT/AG [40]. In practice, therefore, functional signals and compositional properties are combined for gene prediction.
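A small example shows both how a functional signal can be scanned for and why it is weak evidence on its own. The sketch below grades the context of every AUG against the Kozak sequence quoted above, using only the two positions usually regarded as most important (a purine at -3 and G at +4). The function name and the three-level grading are assumptions made for illustration.

```python
import re

def kozak_contexts(mrna):
    """Yield (index of A in AUG, strength) for each AUG with enough flanks.

    strength is "strong" (purine at -3 and G at +4), "adequate" (one of
    the two) or "weak" (neither). DNA input is accepted; T is read as U.
    """
    seq = mrna.upper().replace("T", "U")
    for m in re.finditer(r"(?=AUG)", seq):
        i = m.start()
        if i < 3 or i + 3 >= len(seq):
            continue  # not enough flanking sequence to judge the context
        purine = seq[i - 3] in "AG"
        plus4 = seq[i + 3] == "G"
        if purine and plus4:
            strength = "strong"
        elif purine or plus4:
            strength = "adequate"
        else:
            strength = "weak"
        yield i, strength
```

Because AUG triplets in some acceptable context occur frequently by chance in long sequences, such a scan yields many candidates, which is why signals are combined with compositional measures in practice.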

Bioinformatic tools for ab initio gene prediction
Over the past two decades, ab initio gene prediction from anonymous DNA sequences has made great progress [43], boosted by the need for genome annotation as eukaryotic genomes became available [44]. Among existing tools, much attention has been paid to the prediction of protein-coding genes because of their functional importance and algorithmic convenience. By contrast, the number and function of non-coding RNA (ncRNA) genes in eukaryotes, with the exception of tRNAs and rRNAs, have remained largely unknown [45]. Therefore, computational approaches for finding ncRNA genes in eukaryote genomes deserve specific attention [46].

Brief description on prediction of protein-coding genes
The prevailing tools for computational prediction of protein-coding genes in eukaryotes have been considerably optimized, and specific reviews and comparative technical analyses of their strengths and weaknesses have already been published [6,47,48,49]. In the present review, therefore, we only summarize the pivotal features of the prevailing tools for ab initio prediction of eukaryotic genes in Table 1. Briefly, computational approaches to ab initio gene prediction can be discussed in terms of two aspects: the information used to describe DNA sequences and the algorithms employed to establish the discriminant function. The sequence features of eukaryote genomes relevant to gene prediction have been documented above. For modeling the discriminant function, the most often used algorithms include Markov models and dynamic programming. In practice, most of the tools also use sequence similarity information from database searches to improve prediction accuracy.
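The combination of a Markov model with dynamic programming can be illustrated with the Viterbi algorithm on a toy two-state hidden Markov model. This is a deliberately simplified sketch: the states ("C" for a GC-rich coding-like region, "N" for an AT-rich non-coding region) and all probabilities below are hypothetical values chosen for the example, not parameters of any tool in Table 1, which use far richer state structures.

```python
import math

# Hypothetical toy parameters, for illustration only.
STATES = ("C", "N")
START_P = {"C": 0.5, "N": 0.5}
TRANS_P = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT_P = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Most probable state path for seq (log-space dynamic programming)."""
    if not seq:
        return []
    log = math.log
    score = [{s: log(start_p[s]) + log(emit_p[s][seq[0]]) for s in states}]
    back = []
    for symbol in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            prev = max(states, key=lambda p: score[-1][p] + log(trans_p[p][s]))
            row[s] = score[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][symbol])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters, `"".join(viterbi("ATATATGCGCGCGC"))` labels the AT-rich prefix as `N` and the GC-rich suffix as `C`. Real gene finders extend the same idea with states for exons, introns, splice sites and intergenic regions.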

Prediction of ncRNA genes
The term ncRNA generally refers to an RNA molecule that functions directly as RNA without being translated into protein [56]. Therefore, ncRNAs lack functional ORFs and/or the sequence features typical of protein-coding genes. However, the absence of a significant ORF or of coding measures is not sufficient to conclude that a sequence is an ncRNA gene [3]. There are a variety of ncRNAs with different structures and functions [45,57], which significantly complicates the ab initio prediction of ncRNA genes in eukaryote genomes. In theory, a conserved feature of most if not all ncRNAs is the presence of secondary structure, which would facilitate computational prediction [46,57].

miRNA genes
The microRNAs (miRNAs) are an abundant family of ncRNAs of ~22 nucleotides that play ubiquitous roles in post-transcriptional regulation in eukaryotes. According to the biogenesis pathway, mature miRNAs are derived from intermediate precursors (pre-miRNAs) of more than 70 nucleotides, which are almost always characterized by a stem-loop structure [58,59]. Another feature of miRNAs is the high evolutionary conservation of primary sequences and secondary structures, even across taxonomically diverse species [45]. Therefore, the prevailing computational approaches for finding miRNA genes tend to depend on both intrinsic sequence features and homology [60,61]. However, it is also necessary to predict non-conserved or species-specific miRNA genes [62]; hence we focus here on ab initio approaches that rely entirely on intrinsic features.
First, the potential to form a hairpin structure is vital for selecting candidate miRNA genes, and it can be computationally deduced from the free energy estimated by tools such as RNAfold [63] and Mfold [64]. Indeed, homology search-based approaches such as MiRscan [65] and miRseeker [66] were also designed to first scan the intergenic regions of an entire genome and generate a full list of candidates from the deduced hairpin sequences before the homology search. Accordingly, the PalGrade tool first assigns a score to each candidate sequence according to the stability of its computed hairpin; this score, together with other features such as hairpin length and loop length, is subsequently used to build the predictor of miRNA genes [62].
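To give a feel for how hairpin potential can be scored, the sketch below uses the classical Nussinov dynamic programming recurrence, which maximizes the number of nested base pairs. This is a swapped-in simplification: RNAfold and Mfold minimize a thermodynamic free energy rather than counting pairs, so the value here is only a crude proxy. The function name and the default `min_loop` of 3 unpaired bases are assumptions.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (AU, GC, GU) in a sequence.

    Nussinov-style dynamic programming; at least `min_loop` unpaired
    bases must separate the two partners of any pair.
    """
    rna = rna.upper().replace("T", "U")
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j is unpaired
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in pairs:  # case: base j pairs with k
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

Dividing the result by half the sequence length gives a rough 0 to 1 pairing fraction that could be used to rank candidate hairpins before a more accurate energy-based evaluation.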
The support vector machine method can be used to discriminate between real and pseudo pre-miRNAs, as implemented in triplet-SVM [67]. Similar to triplet-SVM, MiPred [68] additionally employs thermodynamics-related features and the random forest algorithm to achieve higher performance. A more sophisticated algorithm in ProMiR [69], termed the paired hidden Markov model-based probabilistic co-learning method, was proposed to use sequential and structural characteristics for efficiently predicting non-conserved miRNA genes. An alternative approach is HHMMiR, which uses a hierarchical hidden Markov model to describe evolutionarily non-conserved hairpins [70]. A Naïve Bayes classifier (BayesmiRNAfind) was also proposed for the prediction of miRNA genes; it efficiently uses data from multiple species to provide a better training dataset [71].
Recently, the speed of the computational algorithm has also been deliberately taken into consideration when predicting miRNA genes from an entire genome. The ab initio tool miRNAFold [72] uses an approximation algorithm to search for hairpin sequences within a genome, which significantly reduces the number of candidates of interest. Along with rapid advances in high-throughput sequencing of small RNAs, computational tools for miRNA prediction, such as MiRDeep and its variants [73], have been designed to use the sequenced short reads for structural analysis.
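The kind of feature vector used by structure-aware classifiers can be sketched as follows. The code is loosely modeled on the idea of local structure-sequence "triplet" elements: each interior nucleotide is combined with the paired or unpaired status of itself and its two neighbours, giving 4 x 8 = 32 features. The details here (treating ")" as "(", counting over the whole sequence, normalizing to frequencies) are assumptions of this sketch and may differ from the published triplet-SVM implementation; the secondary structure is assumed to be supplied in dot-bracket notation by an external folding tool.

```python
from itertools import product

def triplet_features(seq, structure):
    """Return {element: frequency} for the 32 structure-sequence triplets.

    Keys look like "G(((" or "A..(": the middle nucleotide followed by
    the paired "(" or unpaired "." status of positions i-1, i, i+1.
    """
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    paired = structure.replace(")", "(")  # only paired vs unpaired matters
    keys = [b + "".join(s) for b in "ACGU" for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:  # skips ambiguous bases or unexpected symbols
            counts[key] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}
```

The resulting fixed-length vector can be passed to any standard classifier, such as a support vector machine or a random forest, trained on known and pseudo hairpins.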

lncRNA genes
Long non-coding RNAs (lncRNAs) are typically more than 200 nucleotides in length and lack protein-coding capability; their estimated number in the human genome is significantly higher than that of protein-coding genes [74,75]. Experimental examination of lncRNA genes has become feasible in eukaryotic species because they can be profiled by RNA-seq, owing to their poly(A) tails and other mRNA-like features [76]. In contrast to miRNAs, however, ab initio prediction of the genomic sequences that are transcribed into lncRNAs is much more difficult because of the lack of informative features and evolutionary conservation [74]. Despite this, a few characteristics of lncRNAs, such as secondary structure, protein-coding potential and miRNA binding sites, have been proposed as features [77].
In practice, several existing tools can be used to computationally deduce the coding potential of cDNA sequences or of transcripts assembled from RNA-seq data. On the basis of six biologically meaningful sequence features, including possible ORFs and homology search hits, the Coding Potential Calculator (CPC) was successfully established using the support vector machine method [78]. Similar to CPC, the CPAT tool instead uses logistic regression to model four sequence features for the estimation of coding potential [33]. Ab initio prediction of lncRNA genes from the genome sequence alone can be expected once our understanding of lncRNA biology has significantly increased.
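ORF-derived quantities are among the simplest features for estimating coding potential. The sketch below extracts the longest forward-strand ORF of a transcript and reports its length and coverage. It illustrates the type of feature involved and is not the feature extraction of CPC or CPAT; the function names are hypothetical, only ATG-initiated ORFs that end with an in-frame stop are considered, and the reverse strand is ignored on the assumption that the transcript orientation is known.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, with the stop codon
    included; (0, 0) is returned when no complete ORF exists.
    """
    seq = seq.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best

def orf_features(seq):
    """ORF length (nt) and ORF coverage (ORF length / transcript length)."""
    start, end = longest_orf(seq)
    length = end - start
    return {
        "orf_length": length,
        "orf_coverage": length / len(seq) if seq else 0.0,
    }
```

Such features could be combined with the hexamer or positional composition measures discussed earlier as inputs to a logistic regression or support vector machine model.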

Concluding remarks
With the increasing sophistication and complexity of machine learning methods, it is anticipated that more and more biological processes can be computationally modeled. Meanwhile, high-throughput sequencing technologies produce huge amounts of biological data each day, which further motivates the development of computational biology. Ab initio computational prediction of eukaryotic genes, with its long history of intensive research, has contributed considerably to our understanding of the related biological questions. However, there remain practical needs not only for further improvement in the prediction accuracy of protein-coding genes but also for the development of new approaches for finding ncRNA genes. In the present review, therefore, we outline the achievements of the past two decades in two main aspects of ab initio gene prediction: the well-characterized sequence features of eukaryote genomes and their use in bioinformatic tools. Prediction methods based on homology search are not addressed here because their concept is relatively straightforward.