An optimized approach for annotation of large eukaryotic genomic sequences using genetic algorithm

Background Detection of important functional and/or structural elements and identification of their positions in a large eukaryotic genomic sequence are an active research area. Gene is an important functional and structural unit of DNA. The computation of gene prediction is, therefore, very essential for detailed genome annotation. Results In this paper, we propose a new gene prediction technique based on Genetic Algorithm (GA) to determine the optimal positions of exons of a gene in a chromosome or genome. The correct identification of the coding and non-coding regions is difficult and computationally demanding. The proposed genetic-based method, named Gene Prediction with Genetic Algorithm (GPGA), reduces this problem by searching only one exon at a time instead of all exons along with its introns. This representation carries a significant advantage in that it breaks the entire gene-finding problem into a number of smaller sub-problems, thereby reducing the computational complexity. We tested the performance of the GPGA with existing benchmark datasets and compared the results with well-known and relevant techniques. The comparison shows the better or comparable performance of the proposed method. We also used GPGA for annotating the human chromosome 21 (HS21) using cross-species comparisons with the mouse orthologs. Conclusion It was noted that the GPGA predicted true genes with better accuracy than other well-known approaches. Electronic supplementary material The online version of this article (10.1186/s12859-017-1874-7) contains supplementary material, which is available to authorized users.


INTRODUCTION
Biological sequences are primarily useful computational data in molecular biology. Sequences represent symbolic descriptions of the biological macromolecules like DNA, RNA, and Proteins. A sequence can provide a vital insight into the biological, functional, and/or structural data about a molecule encoded in it. Therefore, the molecular information can be easily deciphered by analyzing several biological sequences. Over the past decade a major boost in sequencing, especially after the advent of next-generation sequencing (NGS) technologies (Liu et al. 2012(Liu et al. ) led to 2002. Other than these, some of the approaches based on neural network (Zhou et al. 2008), wavelet transform (Abbasi et al. 2011), Genetic Algorithm (GA) (Sree and Babu 2013;Hwang et al. 2013) have been applied for accurate detection of a gene.
The ab-initio based method predicts the genes directly from the genomic sequences relying on two significant features like gene signals and gene content. Several ab-intio programs have been extensively used in genome annotation, such as GENSCAN, FGENESH, and GeneID. However, the ab-intio based approaches normally predict a higher rate of false positive results while annotating large multi genomic sequences (Dunham et al. 1999). In particular, ab-initio gene identifiers determine the intergenic splice sites poorly in the prediction process.
Conversely, a homology search on the databases of already established and experimentally verified coding sequences have performed well and proved a good approach in decoding the structure of the genes having known homologs. At present, a large number of known protein coding genes, cDNA, proteins, and ESTs are available in the databases. Therefore, sequence similarity based gene prediction methods are becoming increasingly useful in finding the putative genes in genomic sequences and thereby provide an evolutionary relationship between the raw genomic data and known cDNA, proteins or gene databases.
To date, many different techniques are available for solving gene location detection and its structure prediction in the large eukaryotic genome. Acencio and Lemke (2009) introduced a decision tree-based classifier and trained that with different attributes like network topological features, cellular compartments, and biological processes to generate various predictors for finding essential genes in S. cerevisiae. EVidenceModeler (EVM) (Haas et al. 2008) was presented as a tool for automated eukaryotic gene structure annotation that computed weighted consensus gene structures based on both type and abundance of available evidence. SCGPred (Li et al. 2008) was another score based gene-finding program that combines multiple sources of evidence. Logeswaran et al. (2006) had developed a WAM-CpG algorithm based on the weight array method (WAM) and CpG islands. Nasiri et al. (2011) analyzed the performance of different ab-initio gene finders on orthologous genes of human and mouse. Genome Annotation based on Species Similarity (GASS) (wang et al. 2015) is a shortest path model with DP based tool. It annotated a eukaryotic genome by aligning the exon sequences of the annotated similar species.
DNA numerical representation is another approach that was utilized in several algorithms where residues were converted to numerical values. Akhtar et al. (2008) had presented DNA symbolic-to-numeric representations and compared it with the existing techniques in terms of accuracy for both the gene and the exon prediction. Abbasi et al. (2011) showed a significant improvement in accuracy of exonic region identification using a signal-processing algorithm that was based on Discrete Wavelet Transform (DWT) and cross-correlation method. Saberkari et al. (2013) predicted the locations of exons in DNA strand using a Variable Length Window approach based on z-curve.
It was a 3-D curve used to illustrate DNA sequences and to present a complete description of DNAs' biological behavior. A Digital Signal Processing (DSP) based method was used by Inbamalar and Sivakumar (2015) to detect the protein coding regions where the DNA sequences were converted into numeric sequences using Electron Ion Interaction Potential (EIIP).
Evolutionary algorithms like GA based techniques have also been used in solving the gene prediction problem.
Perez-Rodriguez and Garcia-Pedrajas (2011) developed a method based on genetic algorithm for gene prediction.
Another GA approach (GA_PAUC) was used by Hwang et al. (2013) to maximize the partial Area Under the Curve (AUC). This technique used features of sequence information, protein-protein interaction network topology, and gene expression profiles to maximize the AUC of Receiver Operating Characteristic (ROC) plot. Amouda et al. (2010) proposed a web based tool Intron Multiple Aligner by Genetic Algorithm (iMAGA) for aligning the intron sequences to find their pattern. A model called feature-based weighted Naive Bayes model (FWM) which was based on Naïve Bayes classifiers, logistic regression, and genetic algorithm, was developed by J. Cheng et al. (Cheng et al. 2013).
In this paper, we propose a GA based optimized gene prediction method named as Gene Prediction with Genetic Algorithm (GPGA). It is used in the analysis of large, unknown eukaryotic genomic sequences by mapping with known genes. The advantage of this algorithm is that it can be utilized as a tool for identifying a gene optimally in a large genomic sequence. The GPGA is a novel evolutionary process with significant accuracy that can be utilized in mapping of a genome with genes present in several well-known repositories like Ensemble (http://www.ensembl.org), UCSC (http://www.genome.ucsc.edu) browser and others. We observe the performance of GPGA, which is very promising in terms of sensitivity and specificity on different benchmark datasets.

RESULTS AND DISCUSSION
The performance of the GPGA was tested on two benchmark datasets, HMR195 (Rogic et al. 2001), and SAG . In the experiment, we statistically evaluated the sensitivity and specificity of the proposed method at exon level and also compared the results with other well-known and relevant techniques. Further, we annotated human chromosome 21 with GPGA for large-scale evaluation.
The proposed algorithm has been written in C and implemented on an IBM Power 6 system with 8 GB RAM per core.

Test datasets
To test the performance of GPGA, we considered two benchmark datasets of different categories having wellannotated genomic sequences. One dataset (HMR195) comprises the real genomic sequences where each sequence contains only one gene. Whereas, another dataset (SAG) consists of a set of annotated gene sequences which are arbitrarily placed in the background of random intergenic DNAs and each sequence contains more than one gene.
The datasets were taken from the GeneBench suite (http://www.imtech.res.in/raghava/genebench). Brief descriptions of these datasets are given below. The GPGA is a GA-based homology technique, which determines the presence of a gene by identifying the position(s) of exons in a large unannotated eukaryotic DNA sequence. In the experiment, we compared the positions of exons found by the GPGA in the corresponding genomic sequence with the actual positions mentioned in the annotation file provided with the test datasets. For such test, we generated a customize dataset of homologous genes consulting both the test datasets (HMR195 and SAG).
We considered top three species, namely, human, mouse, and rat in searching the homologs for the test datasets since they are phylogenetically very close and the test datasets also consist of the genomic sequences of those three species. To construct a dataset of homologous genes for the sequences of test datasets, we used Blast Like Alignment Tool (BLAT) (Kent WJ 2002) of UCSC genome browser using the default nucleotide alignment parameters. We considered BLAT as the target database of BLAT is not a set of sequences, but instead an index derived from the assembly of the entire genome. First, we extracted the genes from the genomic sequences of both test datasets based on the positions of exons, mentioned in their respective annotation files and then make each of them as a query to the BLAT run. For HMR dataset, all 195 genes and for SAG dataset, 178 genes were searched against human, mouse, and rat genome separately using their latest assembly (Human: hg38; mouse: mm10; and rat: rn6). For human origin of the test genomic sequence, we searched homologous sequences in mouse and rat, for mouse origin, we searched homologs in human and rat, and for rat origin we searched in human and mouse. We  had not been performed directly with the extracted exons from the genomic sequences of test datasets based on the mentioned positions in the annotation file. Because position comparison of GPGA by extracting exons based on the actual position with the given annotated position reduces the real genomic level complexity. In BLAT run, although we had always considered the top homologs, some of them are of poor quality in terms of similarity. It is also noted that some of the homologous sequences did not contain the same number of exons of the query and/or precise exon boundary presumably because the BLAT results contain newly assembled genome with respect to our benchmark datasets. However, we included those sequences to our new homolog sets to increase the noise in gene data and to test the efficiency of GPGA by excluding them as falsely verified sequences. Duplicate inclusion of a same homologous sequence for different query (gene) was eliminated from the sets. Thus, we prepared two homolog sets, one for HMR dataset and another for SAG dataset and finally we combined them to prepare one customized homolog set. The process flow for generating the final homolog dataset is shown diagrammatically in Figure 1.
The proposed method extracted each of the exons separately from the homolog sets and searched for the presence of them to 195 genomic sequences of HMR and 43 sequences of SAG considering both plus (Watson) and minus (Crick) strands.

Performance assessment
The predicted exon positions by GPGA were compared with the actual exon positions present in their corresponding annotation file. In the post-processing of matched result, we had considered a filtering criterion to identify a gene as true prediction where minimum 60 percent similarity at the gene level along with at least 60 bp sequence length was found. We performed statistical analysis of the experimental results to determine the performance accuracy of GPGA (see Methods for details). The results were also compared with other well-known and relevant annotation tools.
For both HMR and SAG datasets, we measured , and W E (wrong exon) using human, mouse, and rat homolog sets separately. The average value of each parameter was calculated separately for human, mouse, and rat homolog sets and was considered as the final measurement (see Supplemental_file_1.pdf: Statistical analysis and Table S1). The statistical measure was not considered when GPGA did not find any such homologous exons of a gene in a sequence.  and W E (the proportion of predicted wrong exons and actual predicted exons) were also included in the evaluation process for the superiority of the tools. Here, the GPGA also performed well. The accuracy measurement parameters are presented in Supplemental_file_2.xls (see Table S4 and S5). Furthermore, even when a good similarity is found, the limits of predicted exon positions were not always very precise. Small exons were also missed by GPGA becau- From Figure 4, it is observed that the number of wrong exon prediction is maximum for the smallest length range exons (≤30). This is because for a small length exon, the chance of getting alternative regions, other than the actual position is higher.

Annotation of human chromosome 21
We had also performed annotation of human chromosome 21 (HS21) to observe the performance of GPGA at the chromosome level. We have selected HS21, as it is the smallest human autosome that wraps around 1-1.5% of the human genome and its structure and gene content have also been intensively studied. Therefore, it is considered as an excellent dataset to validate any gene prediction method. Number of gene identification on each human chromosome using various approaches is now an active research area. One of the most challenging works among them is the cross-species comparison of human genome. The comparison of human genome with other related species helps to identify the related genes since the functionally related sequences tend to be actively conserved through evolution. We have chosen the phylogenetically related species like mouse genome for cross-species comparison with the human genome. HS21 shows conserved syntenies to mouse chromosomes 16, 17, and 10 (MM- Due to limited computing resources, we initially divided the target sequence into multiple (total 26 numbers) divisions. Each of them consists of 16-lac bp of HS21 except the last one. Each of the divisions was run against total comprehensive sets of MM10, MM16, and MM17.

Results of Annotation
In the experiment, we analyzed the results defining the stringency based on the lengths and similarities of the conserve sequences. We accordingly categorized the sequence length into 50, 100, and 150 bp. value, we identified a large number of conserved blocks and genes. However, a large number of exons of a gene did not follow GT-AG splicing rule. We also got a significant number of blocks and genes over 150 bp stringency.
However, we considered medium length of 100 bp as our final value to balance the sensitivity as well as specificity (details are provided in Supplemental_file_3.pdf: Table S6). Following the stringency of 100-70, (Table 1) Table S9). From Figure 6, it was noted that the regions of conserved blocks and the locations of genes were close to each other and they were distributed more at the distal part (gene-rich region) of HS21. When the sequences were compared with 80% identity over 100 bp (100-80), 1607 conserve blocks and 194 genes were detected.
For 100-70 level, we also provided the base substitution data in Supplemental_file_3.pdf (seeTable S7, S8, and Figure S1) showing the highest rate of transitions (substitution between two purines and between two pyrimidines) than transversion (substitution between one purine and one pyrimidine) and the higher rate of substitution at third codon position (Wobble position) than first and second.
To compare the GPGA result with others, we considered only those genes that have either unique start or end positions. We excluded alternate transcripts having the same start and end positions. Out of the 361 genes predicted by GPGA for HS21, we found 283 genes share unique start and/or end positions. Table 2 shows the comparative results of GPGA with Refseq and Gencode assembly. Refseq found 411 genes, of which 271 contain unique start and/or end positions. Out of 271 genes, 158 genes were partially (either of the both ends of a gene) predicted by the GPGA. The partial prediction is because, GPGA performed ungapped mapping for finding the conserved genes, whereas, the results of other methods showed alignment including gaps. Gencode basic assembly got 309 unique genes, of which 162 genes were partially predicted by the GPGA. On the other hand, Gencode comprehensive set got more genes than Gencode basic as it contained more novel exons. It predicted 509 unique genes, out of which 174 genes were partially predicted by GPGA. Table 3 contains the comparative results of GPGA along with other gene prediction tools. From the table, it is noticed that the GPGA predicted more genes than other gene prediction tools.     The results proved the performance superiority of GPGA compared with other well-known ab-initio or homology based approaches. Nevertheless, due to limited computational resources, we split the entire HS21 into several segments. This approach increased the computation time.

CONCLUSION
The proposed approach (GPGA) is an integer based evolutionary process which simplifies the gene prediction technique. The GPGA was tested with two well-known benchmark datasets HMR195 and SAG to evaluate the performance in terms of sensitivity and specificity at the exon level. However, one of the datasets HMR195 consists of real genomic sequences and the other one SAG contains semi-artificial set of genomic sequences. Such choice of datasets helps to truly measure the performance of an approach in a noisy environment. Finally, it was noted that the GPGA truly predicted the gene better than other well-known approaches and its accuracy is more than 90%. 1 5

Gene prediction tools
The limitation of GPGA is that it often fails to predict the correct position of a short length exon since same sequence is frequently repeated in a genomic sequence. Another shortfall of GPGA is that it can perform well on an unannotated raw sequence, only when there is a well coverage of annotated information of orthologous genes.
In future research work, we want to introduce further the information of content sensors and signal sensors like GC-content value, TATA box, promoters and other compositional parameters along with the sequence homology to improve the sensitivity of the GPGA. We also wish to perform parallel computing for large-scale annotation without splitting the query length. In addition, we would like to observe the performance of the GPGA after introducing gaps in it.

Genetic algorithm
One of the most commonly used evolutionary techniques for optimization is GA, which is stochastic in nature. It iteratively executes a set of individuals called a population. Each individual is referred to as a chromosome that encodes a possible solution to the given problem. Each solution is assigned a problem specific fitness score. After every iteration (generation), the fittest individuals are carried on to the next generation, and this process continues until a termination criterion is fulfilled. The three genetic operators -selection, crossover, and mutation help to modify a population in each generation. The conventional GA normally represents a chromosome by a binary string.
Binary representation, however, can be problematic for solving some problems as it is sometime difficult to encode a real problem with binary window. Another problem in binary coding is the increased length of the string for representing a large and complex optimization problem, which increases the computational complexity and the memory space. So, the problem specific GAs have been necessary to develop with other types of representations in mind, apart from binary notation.
One of the most used GAs is the Real coded GA (RGA), whose significance is justified in several theoretical studies (Goldberg DE 1991;Radcliffe NJ 1991). In RGA, chromosomes are represented by the real ( Here, we have modified the conventional GA with the integer coding. The changes in crossover and mutation have also been performed for solving the problem efficiently. Such modification improves the performance of the proposed GPGA.

Gene Prediction with Genetic Algorithm
The objective of the proposed method (GPGA) is to map a well-annotated known sequence onto an unknown large genomic sequence. The mapping determines the homologous relationship between the known sequence (with the known genes) and the unknown genomic sequence by identifying the homologous gene(s) in the unknown sequence. .

Fitness Function
The fitness score of a chromosome represents the alignment score. The alignment finds the presence of a conserve region (exon) in the query sequence. In the score calculation, we have considered that an identical match gets +1, and a mismatch gets a 0. Thus, the score is computed by the following fitness function, where, w i defines a local alignment score and n is the total number of local alignments.  .

Genetic operators
Three genetic operators namely, selection, crossover, and mutation play an important role towards the convergence of the problem. These operators also maintain a balance between the exploration and exploitation of the search space (Ortiz-Boyer et al. 2007).

Selection operator
In GPGA, we have considered tournament selection technique with tournament size 3 as a selection operator.

Crossover operator
In the GPGA, we have considered a modified crossover operation named as Adaptive Position Prediction (APP) crossover. APP crossover is a self-controlled-crossover operation that adaptively modifies Segment of a genome sequence(Q)(Chromosome P 1 ) where, rnd is the function to generate random number. Thus, the crossover operation helps to predict the correct exon position by adaptively narrowing down the difference between l and u . This adaptive nature helps in finetuning of the operator for converging to the optimal position.
The APP crossover operation is represented algorithmically in the following way.
• Randomly select two parents

Mutation operator
The mutation operation is performed similar to the APP crossover. It is also named as Adaptive Position Prediction (APP) mutation. It mutates the offspring generated from the crossover operation to another possible offspring to maintain the diversity in the population for faster searching of the optimal position of the given exon Thus, the modified offspring is generated as follows.
where, rnd is the random number generator.
The algorithmic steps of the APP mutation operation are given below.