Genetic Algorithm Approach to Find the Estimated Value of HMM Parameters for NS5 Methyltransferase Protein

Dengue is a pandemic disease caused by the dengue virus (DENV), a mosquito-borne flavivirus. In recent years, dengue has emerged as a leading cause of severe illness and death in developing countries. About 400 million dengue infections occur worldwide each year. In general, dengue infections cause only mild illness, but they occasionally progress to a lethal illness termed severe dengue, for which there is no specific treatment. Machine learning plays a significant role in bioinformatics and other fields of computer science. It draws on approaches such as the Hidden Markov Model (HMM), Genetic Algorithm (GA), Artificial Neural Network (ANN), and Support Vector Machine (SVM). The GA is a randomized search algorithm that solves problems by mimicking natural selection. Many machine learning techniques based on HMMs have been applied successfully. In this work, we first apply HMM parameters to the biological sequence and then compute the probability of the observation sequence for a mutated gene sequence. This study compares both methods, GA and HMM, to obtain the highest estimated value of the observation sequence. We also discuss the applications of GA in bioinformatics. In further studies, we will apply other machine learning approaches to find the best results for protein studies.

Dengue virus (DV), the causative agent of dengue, belongs to the family Flaviviridae and is transmitted to humans by the bite of Aedes aegypti mosquitoes. Four serotypes (dengue virus serotype 1, serotype 2, serotype 3, and serotype 4) are recognized (El Sahili and Lescar, 2017). Dengue disease ranges from a flu-like illness termed dengue fever to dengue hemorrhagic fever. In severe cases, it causes dengue shock syndrome and sometimes ends in death. The most prevalent clinical symptoms of acute dengue disease are hemorrhagic diathesis, liver involvement, and plasma leakage. The DV genome is organized into a single open reading frame (ORF) of single-stranded, positive-sense RNA of 900 kDa, flanked at the 5' end by a type I cap and at the 3' end by untranslated regions, and it encodes a precursor polyprotein. Post-translational processing of the precursor polyprotein gives rise to three structural proteins (C, prM, and E) and seven nonstructural proteins (NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5) (Anasir et al., 2020).
In this work, we consider the NS5 Methyltransferase protein of the dengue virus and find the probability of its observation sequence. First, we convert the protein sequence into a nucleotide sequence. After that, we apply the GA operators (selection, crossover, and mutation). The forward algorithm of the HMM, which calculates the probability of the observation sequence, is discussed in detail in our previous paper (Katiyar et al., 2020). In this work, we compare the HMM and GA results on the basis of crystallographic data, namely the resolution (in Angstroms).
Here, we define the Hidden Markov Model as a probabilistic model in which sequences are generated by two simultaneous stochastic processes. The model captures hidden information from observable sequential symbols (e.g., a nucleotide sequence such as AGCT). It is defined by the number of states (N), the number of observation symbols (M), the transition probabilities (A), the emission probabilities (B), and the initial probabilities (π). The two stochastic processes are the process of moving between states and the process of emitting an output sequence. The sequence of state transitions is hidden and is observed only through the sequence of emitted symbols (Alghamdi R 2016). An HMM is characterized by the following elements, beginning with: 1. N, the number of states in the model. Given appropriate values of N, M, A, B, and π, the HMM can be used as a generator to produce an observation sequence.
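The use of an HMM as a generator can be illustrated with a minimal Python sketch. The two-state model, its probability values, and the function name sample_hmm are assumptions chosen for illustration only; they are not the parameters used in this study.

```python
import random

def sample_hmm(A, B, pi, symbols, T, seed=0):
    """Generate T observation symbols from an HMM with parameters (A, B, pi)."""
    rng = random.Random(seed)
    n_states = len(pi)
    # Draw the initial hidden state from the initial distribution pi.
    state = rng.choices(range(n_states), weights=pi)[0]
    observations = []
    for _ in range(T):
        # Emission process: pick a symbol using the current state's row of B.
        observations.append(rng.choices(symbols, weights=B[state])[0])
        # Transition process: pick the next hidden state using the row of A.
        state = rng.choices(range(n_states), weights=A[state])[0]
    return "".join(observations)

# Hypothetical 2-state model over the nucleotide alphabet (illustrative values).
A = [[0.7, 0.3], [0.4, 0.6]]                       # transition probabilities
B = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]   # emission probabilities for A, C, G, T
pi = [0.5, 0.5]                                    # initial probabilities
print(sample_hmm(A, B, pi, "ACGT", T=12))
```

Only the emitted symbols are returned; the state path stays internal to the function, which mirrors the hidden nature of the state process.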

Three Basic Problems for HMMs
Given the form of the HMM described in the previous section, there are three basic problems of interest that must be solved for the model to be useful in real-world applications (Mor B et al., 2020). These problems are the following.
Problem 1: Given the observation sequence O = O_1 O_2 … O_T and a model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
Problem 2: Given the observation sequence O = O_1 O_2 … O_T and the model λ, how do we choose a corresponding state sequence Q = q_1, q_2, …, q_T that is optimal in some meaningful sense (i.e., best "explains" the observations)?
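Problem 1 is usually solved with the forward algorithm. The following Python sketch shows the standard recursion under stated assumptions: the two-state model and its values are hypothetical, the observation sequence is assumed to be non-empty, and the function name forward_probability is introduced here for illustration.

```python
def forward_probability(obs, A, B, pi, symbols="ACGT"):
    """Return P(O | lambda) for a non-empty observation string obs."""
    index = {s: k for k, s in enumerate(symbols)}
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][index[obs[0]]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for ch in obs[1:]:
        k = index[ch]
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][k]
                 for j in range(n)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Hypothetical 2-state model over the nucleotide alphabet (illustrative values).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]
pi = [0.5, 0.5]
print(forward_probability("ACGT", A, B, pi))
```

The recursion costs on the order of N^2 T operations, compared with the exponential cost of summing over every possible state path.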

Genetic Algorithm (GA)
The genetic algorithm is a search method inspired by natural selection that belongs to the class of Evolutionary Algorithms (EA). Genetic algorithms are a type of optimization algorithm, meaning they are used to find the solution(s) to a given problem that maximize or minimize an objective (Silva et al., 2019). A genetic algorithm imitates the biological processes of reproduction and natural selection to find the best, or "fittest", solution. As in evolution, many steps of a genetic algorithm operate randomly, and the method lets us set the level of randomization and control (Jennings et al. 2019). GAs are therefore expected to be powerful algorithms that provide a random yet in-depth search. These features make GAs attractive compared with other optimization methods, which have drawbacks such as instability or dependence on derivatives, linearity, or other properties. GAs are designed to simulate a biological process, and much of their terminology is borrowed from biology. However, the entities this terminology refers to in genetic algorithms are much simpler than their biological counterparts (L Haldurai et al., 2016). The basic components of genetic algorithms are:
1. A function to optimize
2. A population of chromosomes
3. Random selection of chromosomes
4. Crossover to produce the next generation of chromosomes
5. Mutation of chromosomes in the new generation
GAs can be implemented in a variety of programming languages. They are also well suited to modeling phenomena in economics, ecology, the human immune system, population genetics, and social systems, and to machine learning optimization (Harsh Bhasin et al. 2011).
The necessary steps of the genetic algorithm are:
(i) generate a random population of chromosomes (suitable candidate solutions for the problem);
(ii) calculate the fitness f(x) of each chromosome x;
(iii) create a new population by repeating steps (iv) to (vii) until the new population is complete;
(iv) select two parent chromosomes of better fitness;
(v) with some crossover probability, perform a crossover between the selected parents (without crossover, the offspring would be the same as a parent);
(vi) similarly, with some mutation probability, mutate the new offspring at each locus;
(vii) place the resulting offspring in the new population;
(viii) use the newly generated population for the next run of the algorithm;
(ix) if the end condition is satisfied, stop the run and return the best solution in the current population;
(x) otherwise, go back to step (ii) for fitness evaluation.
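The steps above can be sketched in Python as follows. This is a generic illustration, not the implementation used in this study: the tournament-style parent selection, the single-point crossover, the default rates, and the toy fitness function (the number of G characters) are all assumptions, and the chromosomes are assumed to be equal-length strings of at least two characters.

```python
import random

def genetic_algorithm(fitness, population, crossover_rate=0.8,
                      mutation_rate=0.01, generations=50,
                      alphabet="ACGT", seed=0):
    """Maximise `fitness` over equal-length strings; returns the best string seen."""
    rng = random.Random(seed)
    population = list(population)
    best = max(population, key=fitness)
    for _ in range(generations):
        new_population = []
        while len(new_population) < len(population):
            # Selection: each parent is the fitter of two random picks
            # (tournament selection, an assumed choice).
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            child = p1
            # Crossover: with some probability, splice the parents at one cut point.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, len(p1))
                child = p1[:cut] + p2[cut:]
            # Mutation: each locus is replaced with a small probability.
            child = "".join(
                rng.choice(alphabet) if rng.random() < mutation_rate else c
                for c in child
            )
            new_population.append(child)
        population = new_population
        # Keep track of the best chromosome found in any generation.
        best = max(population + [best], key=fitness)
    return best

# Toy usage: a hypothetical fitness that counts G characters.
start_rng = random.Random(1)
start = ["".join(start_rng.choice("ACGT") for _ in range(20)) for _ in range(10)]
print(genetic_algorithm(lambda s: s.count("G"), start))
```

In the setting of this paper, the fitness would instead be the value of P(O|λ) returned by the forward algorithm for each candidate sequence.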
Some applications of genetic algorithms are listed in Table 1. GA methods play a significant role and provide a useful set of tools in bioinformatics analysis and in other fields.
GA-based methodology brings accuracy, efficiency, and potential for growth to data analysis in bioinformatics. The basic principle behind GAs is that they create and maintain a population of individuals represented by chromosomes. A chromosome is essentially a character string, analogous to the chromosomes found in DNA. These chromosomes typically encode solutions to a problem and undergo a process of evolution according to rules of reproduction and mutation.

In this work, we use the NS5 Methyltransferase protein sequence. The protein structures were downloaded from the RCSB PDB (http://www.rcsb.org), and their accession codes are used in this study. The coordinates and structure factors for DENV-2 NS5 Methyltransferase protein complexes have been deposited in the Protein Data Bank under accession code.
The objective is to find the optimum value of the function. In the crossover step, we first select one member of the population and choose two random points that mark a segment of some length (Figure 1). The same is done in a second selected member. We then exchange that segment between the two members. In mutation, we make minor changes to any one member of the population. Table 2 shows the data for 20 gene sequences of the non-structural methyltransferase protein, with their resolution, residue counts, and relevant references.

Selection of Population
From Table 2, we select any two sequences (PDB code 1 and PDB code 2) as the initial population, apply the crossover and mutation functions, and generate the estimated values of both using the forward algorithm of the HMM. We then keep the PDB code with the better estimated value. Next, we select one more PDB code sequence from Table 2, pair it with the PDB code that has the better estimated value, and apply the same crossover and mutation functions. Again we determine which PDB code has the better estimated value. This process continues until all the PDB codes in Table 2 have been used.
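The sequential selection procedure can be sketched in Python as follows. This is one possible reading of the description, not the authors' code: the function names, the placeholder labels P1 to P3, and the choice to carry the varied (rather than the original) sequence forward are all assumptions. The vary argument stands for the crossover and mutation step, and score stands for the forward-algorithm estimate.

```python
def sequential_selection(sequences, vary, score):
    """Walk through `sequences` (label -> sequence), keeping the better candidate.

    vary(a, b)  -> (a2, b2): crossover and mutation applied to a pair (assumed hook)
    score(seq)  -> float:    estimated value, e.g. P(O | lambda) (assumed hook)
    """
    labels = list(sequences)
    best_label, best_seq = labels[0], sequences[labels[0]]
    for label in labels[1:]:
        # Vary the current best together with the next sequence from the table.
        a, b = vary(best_seq, sequences[label])
        # Keep whichever of the two has the better estimated value.
        if score(b) > score(a):
            best_label, best_seq = label, b
        else:
            best_seq = a
    return best_label, best_seq

# Toy usage with placeholder labels, an identity `vary`, and a G-count score.
toy = {"P1": "AAGT", "P2": "GGGT", "P3": "ACGT"}
print(sequential_selection(toy, lambda a, b: (a, b), lambda s: s.count("G")))
```

Because every sequence is compared once against the running best, the procedure needs one pass over the table.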

Crossover
To generate a random population, we applied the crossover operator to the given gene sequences. In this method, we selected two gene sequences of the NS5 Methyltransferase protein of the dengue virus (from protein codes 1 to 20 in Table 2) and exchanged segments between them.
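A segment-exchange crossover of this kind can be sketched in Python as follows. The function name two_point_crossover and the way the two cut points are drawn are assumptions for illustration; the same positions are used in both sequences so that the exchanged segments have equal length.

```python
import random

def two_point_crossover(seq1, seq2, rng=random):
    """Exchange the segment between two random cut points of two sequences."""
    # Restrict the cut points to the shorter sequence so both slices exist.
    n = min(len(seq1), len(seq2))
    # Two distinct cut points, sorted so that i < j.
    i, j = sorted(rng.sample(range(n + 1), 2))
    child1 = seq1[:i] + seq2[i:j] + seq1[j:]
    child2 = seq2[:i] + seq1[i:j] + seq2[j:]
    return child1, child2

# Toy usage on two short illustrative strings (not sequences from Table 2).
print(two_point_crossover("AAAAAAAAAA", "CCCCCCCCCC", random.Random(3)))
```

Each child keeps the length of its parent, and at every position the pair of children holds the same two characters as the pair of parents.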

Mutation
In this process, we slightly change a small part of the gene sequence. Put simply, mutation changes one or more genes; the variant used here is also called interchanging mutation. After applying the mutation operator, we obtain the mutated gene sequence and then calculate its optimum value using the forward method of the HMM approach.
In this method, we took two strings of gene sequences of the NS5 Methyltransferase protein and applied the crossover and mutation operators. The process of selection, crossover, and mutation is shown in Figure 2. We selected two random points marking a segment of some length in seq1 and a segment of the same length in seq2, and then applied the crossover operation, which exchanges the selected parts of seq1 and seq2. As a result, we obtained seq1* and seq2*. We then applied the mutation operator to seq2* and obtained seq2**. (In the mutation operation, we selected a single character 'G' and a single character 'C' at different places in seq2* and interchanged them to obtain seq2**.) The same process is applied in Figure 3, which shows the results after the crossover and mutation operations.
Fig. 3. Result after the mutation operation
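The interchanging mutation described above, in which one 'G' and one 'C' swap places, can be sketched in Python as follows. The function name interchange_mutation and the random choice of positions are assumptions for illustration.

```python
import random

def interchange_mutation(seq, first="G", second="C", rng=random):
    """Swap one randomly chosen `first` character with one `second` character."""
    pos_first = [k for k, c in enumerate(seq) if c == first]
    pos_second = [k for k, c in enumerate(seq) if c == second]
    if not pos_first or not pos_second:
        # Nothing to interchange: the sequence is returned unchanged.
        return seq
    i, j = rng.choice(pos_first), rng.choice(pos_second)
    chars = list(seq)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

# Toy usage on a short illustrative string (not a sequence from Table 2).
print(interchange_mutation("ATGCA"))
```

The mutation preserves the nucleotide composition of the sequence and changes only the order of two characters.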
The forward algorithm of the HMM uses the matrices A = (a_ij), B = {b_j(k)}, and π = (π_i), and each sequence has its own tables of values for these matrices.
Using the matrices A, B, and π, we apply the GA and the forward algorithm of the HMM to find the global maximum value of P(O|λ) and store it in Table 3.

Results and Discussion
This paper presents a comparison of the HMM and GA results against the resolution (the crystallographic value of each protein sequence given in Table 2). Table 3 shows the value of P(O|λ) for the gene sequences used in this work, obtained with the forward algorithm of the HMM and after the crossover and mutation operations of the GA. The result also depends on the length of the gene sequence. We took 20 gene sequences of the NS5 Methyltransferase protein, with their residue counts (i.e., the total number of A, G, C, and T characters in the sequence), the P(O|λ) value from the forward algorithm of the HMM, and the P(O|λ) value from the forward algorithm after applying the GA method. With the forward algorithm, protein 1L9K has the minimum value, 1.96E-304, and protein 3MTE has the maximum value, 1.40E-210. After GA, protein 2P3Q has the minimum value, 7.08E-311, while protein 3MQ2 has the maximum value, 1.81E-206. Figure 4 compares the values from Table 3 after the forward algorithm and after GA.
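Values such as 7.08E-311 lie below the smallest normal double-precision number (about 2.2E-308), so a direct product of probabilities loses precision for long sequences. A common remedy, which is not described in this paper and is shown here only as a hedged sketch, is to rescale the forward variables at every step and accumulate the logarithm of P(O|λ). The two-state model and the function name forward_log10 are assumptions, and every observation is assumed to have non-zero probability under the model.

```python
import math

def forward_log10(obs, A, B, pi, symbols="ACGT"):
    """Return log10 P(O | lambda) using per-step scaling of the forward variables."""
    index = {s: k for k, s in enumerate(symbols)}
    n = len(pi)
    log10_p = 0.0
    alpha = [pi[i] * B[i][index[obs[0]]] for i in range(n)]
    for t in range(len(obs)):
        if t > 0:
            k = index[obs[t]]
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][k]
                     for j in range(n)]
        # Normalise alpha to sum to 1 and record the scale factor in log space.
        scale = sum(alpha)
        log10_p += math.log10(scale)
        alpha = [a / scale for a in alpha]
    return log10_p

# Hypothetical 2-state model over the nucleotide alphabet (illustrative values).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]
pi = [0.5, 0.5]
print(forward_log10("ACGT" * 500, A, B, pi))
```

With this formulation, sequences of different lengths can be compared on a logarithmic scale without the result underflowing to zero.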
Figure 4 shows a comparative graph of HMM and GA to find which method reaches the global maximum. We observed that with GA the global maxima values lie between 9.85 and 9.35, while with HMM they lie between 8.93 and 8.59. For this experiment, therefore, GA consistently gives a better global maximum than the HMM forward algorithm. The global maximum values of both algorithms can thus indicate which of the given four proteins to use as a drug target (in the case of drug design). Among the four proteins with maxima values, the protein with the highest value, i.e., 9.86, can be taken as the better candidate.
In the case study, we discussed the crystallographic resolution (in Angstroms) of a protein structure resolved by X-ray diffraction or Nuclear Magnetic Resonance (NMR). This field can be queried for a value or a range of values. Here, however, we apply computational methods to estimate the resolution of a protein sequence so that large sets of protein sequences can be handled. Most structure data are obtained from X-ray crystallography and NMR spectroscopy. X-ray crystallography determines the arrangement of atoms within a protein by passing X-rays through a crystallized form of the protein and analyzing the resulting diffraction pattern. This technique provides the highest resolution and usually yields only one ). In addition to these experimental methods, some researchers use computational modeling to predict the structure of a protein by simulating the forces that act on each atom in a molecule of known structure. However, this method produces non-experimental models and the least dependable results.

Conclusion
In this work, we studied and applied the GA and the forward algorithm of the HMM to the protein sequence of the NS5 Methyltransferase protein of the dengue virus. Finally, we compared the estimated value from the forward algorithm of the HMM with that of the GA approach. In our experiment, we observed that the genetic algorithm provides a better result than the forward algorithm. In the future, we will repeat this experiment with a larger number of longer gene sequences and compare the results of different algorithms, such as GA, Metropolis, GWW, and Ant Colony algorithms, to evaluate their ability to find the global maximum. In further studies, we will improve the algorithm and make it more effective for long protein sequence prediction using a multi-core computing platform, and we will use the other machine learning approaches mentioned above for biological data analysis and discovery in drug design.
The remaining elements that characterize the HMM are:
2. M, the number of distinct observation symbols per state.
3. The state transition probability distribution A = (a_ij).
4. The observation symbol probability distribution B = {b_j(k)}.
5. The initial state distribution π = (π_i).

Fig. 4. Comparative analysis of P(O|λ) values between HMM and Genetic Algorithm for 20 gene sequences of proteins

Table 1. Some common applications of the Genetic Algorithm

Table 3. Values of P(O|λ) using the Forward Algorithm and the Genetic Algorithm