Classification of Retroviruses Based on Genomic Data Using RVGC

Abstract: Retroviruses are a large group of infectious agents with similar virion structures and replication mechanisms. Retrovirus infections can lead to AIDS, cancer, neurologic disorders, and other clinical conditions, all of which can be fatal. Detection of retroviruses by genome sequence is a biological problem that benefits from computational methods. The National Center for Biotechnology Information (NCBI) promotes science and health by making biomedical and genomic data available to the public. This research aims to classify the different types of retrovirus genome sequences available at the NCBI. First, nucleotide pattern occurrences are counted in the given genome sequences at the preprocessing stage. Based on their significance, the number of features used for classification is then reduced to five. Classification is carried out in two phases. The first phase uses only two features. Data left unclassified in the first phase is passed to the second phase, where the final decision is taken with the remaining three features. Three data sets of animal and human retroviruses are selected: the training data set is used to reduce the number of features and to train the classifier, the validation data set is used to validate the models, and the test data set is used to analyze the performance of the classifier. We also use decision tree, naive Bayes, k-nearest neighbors, and support vector machines to compare results. The results show that the proposed approach performs better than the existing methods on the imbalanced retrovirus genome-sequence dataset.


Introduction
Viruses are obligate parasites that affect cellular organisms, and are therefore called genetic parasites. They can only replicate when they have access to the cellular system of a host organism. They are composed of two or three main parts. The first and most important part is the genetic material, composed of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). The second part is the protein coat that protects the genes. Some viruses also have a third part called an envelope, consisting of lipids surrounding the whole virus particle [1]. Most studies have been done on viruses that are associated with some disease. Retroviruses have an RNA genome and a reverse transcriptase (RT) gene that causes the conversion of RNA to DNA. This converted DNA is integrated into the host DNA after entering a cell. DNA is built from nucleotides of four types, namely cytosine (C), thymine (T), adenine (A), and guanine (G). Therefore, DNA is an ordered sequence of A, T, G, and C. For the computer scientist, this ordered sequence of nucleotides looks like a string whose characters are taken from the alphabet {A, T, G, C}. A codon is a group of three nucleotides, so there are 4^3 = 64 different codons over A, T, G, and C. Different organisms have different codon counts, and these counts can be used for computational processing in research on these organisms.
There are many databases available online that provide DNA sequences for different organisms. One of the most important and most widely used is the National Center for Biotechnology Information (NCBI) database. Some genome-sequence regions contain statistically useful data, while other regions are either less useful or contain hardly detectable information. Genome-sequence data are used in many computational methods of statistical processing to detect the relevant regions inside the genome sequences [2,3].
Genomic data have issues of variable dimensionality, characters from a limited alphabet, and imbalanced data. The retrovirus genome-sequence dataset is imbalanced: the majority class has more samples than the minority class. Almost all classifiers have higher error rates on the minority class but perform well on the majority class. From a statistical point of view, this is a general problem that affects almost all classifiers. The minority class samples may not represent their class well, so these methods give poor results on unseen data. Different techniques are used to handle imbalanced datasets, e.g., upsampling and downsampling [4,5]. Consider two classes: diseased and healthy. The healthy class has 100 samples, and the diseased class has one sample. In medical terms, the majority class is negative (−ve) and the minority class is positive (+ve). This situation is natural in pathology and diagnostics. Now suppose we have the prior knowledge that roughly 99% of people are healthy. If we classify a sample of 1000 persons and declare all of them healthy, and the doctors find that 20 of them are actually sick, then the classifier's accuracy is 98%. The classifier appears to perform very well, yet one class remains completely unidentified, and therefore accuracy is not a suitable performance measure here.
We can use an alternative performance measure that weighs both classes equally. In the example above, all 980 samples of the healthy class (100%) and none of the 20 samples of the diseased class (0%) were identified correctly. Averaging the two per-class rates gives (100% + 0%)/2 = 50%, which is poor.
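The arithmetic of this example can be checked with a minimal sketch. The function names are illustrative and not part of the original study; the equally weighted measure is written here as the mean of the two per-class rates, which is the usual definition of balanced accuracy.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the per-class recognition rates, so both classes weigh equally."""
    sensitivity = tp / (tp + fn)  # rate on the positive (diseased) class
    specificity = tn / (tn + fp)  # rate on the negative (healthy) class
    return (sensitivity + specificity) / 2


# 1000 persons, 20 diseased (positive) and 980 healthy (negative);
# the classifier declares everyone healthy.
tp, fn = 0, 20
tn, fp = 980, 0

print(accuracy(tp, tn, fp, fn))           # 0.98
print(balanced_accuracy(tp, tn, fp, fn))  # 0.5
```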
This study aims to classify various types of retroviruses using retrovirus DNA sequences available in biomedical repositories, e.g., NCBI, with the help of computational methods. The focus of this study is a similarity measure that does not require alignment. We observe the performance of different features and machine learning techniques for retroviral genome classification. This paper develops a two-phase algorithm to classify various types of retroviruses using retrovirus DNA sequences available in biomedical repositories. The performance of the classifier is compared with that of other machine learning algorithms.
The rest of the paper is organized as follows. The related work has been presented in Section 2. In Section 3, we have presented the methodology and the algorithm. Results are presented and discussed in Section 4, and the work is concluded in Section 5.

Related Work
Retrovirus classification requires a proper and well-formed database of nucleotide sequences of retrovirus DNA. Many resources provide retrovirus DNA sequence data. A list of recent and previous databases is available at [6]. The National Center for Biotechnology Information (NCBI) is one of the important resources of genetic information. The required genome-sequence databases are easily available at NCBI in two forms. One of them is the Reference Sequence (RefSeq) database, containing combined data for each model species of viruses. The other is GenBank, containing data for each publicly available virus. RefSeq provides a comprehensive set of useful, non-redundant, well-annotated, and explicitly connected DNA and protein records for each organism. Sequence records are presented in a widely accepted format and are accepted after computational validation [7]. GenBank, on the other hand, provides open access to a comprehensive collection of all original sequences. Sequences discovered and approved by NCBI are grouped in a comprehensive archive [8].
Alignment-based and alignment-free methods are the two general types of classification methods for viral DNA. Alignment-based classification is a traditional technique based on matching DNA sequences. It proceeds in three steps. First, conserved regions are identified in the DNA sequences. Second, alignment is performed through insertion, deletion, and mutation. Third, distance measures between genomes are derived from the alignment scores. Several techniques in the literature are based on sequence alignment and the derivation of alignment scores. A review of those techniques is available in [9,10]. For example, alignment can be performed between every pair of DNA sequences or between multiple DNA sequences simultaneously [11][12][13][14][15][16]. Alignment can also be based on certain local DNA sequence structures [17][18][19][20][21] or on the complete global structure of the DNA sequences. Substitution scoring matrices such as point accepted mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM), along with many other scoring systems, have been presented to perform classification [22,23]. These methods work well on small and similar viral DNA sequences, but they face computational and fundamental limitations on diverse and large viral DNA sequences. In terms of computational complexity, it is infeasible to perform optimal alignment for the large data sets of viral DNA sequences generated by next-generation sequencing techniques [24,25]. The alignment-based methods presented above require $O(L^2)$ time and space, where $L$ is the length of a sequence. More computationally efficient methods with specific properties for sequence alignment have been developed for specialized purposes, but the techniques used in these methods may not reflect the phylogeny [26,27].
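To illustrate where the quadratic cost comes from, the following is a minimal sketch of a global (Needleman-Wunsch style) alignment score. It is not taken from any of the cited methods, and the match, mismatch, and gap values are illustrative assumptions rather than a PAM or BLOSUM scoring system. The table has (len(s)+1) x (len(t)+1) cells, each filled in constant time, which gives the $O(L^2)$ time and space behaviour for two sequences of length $L$.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t under a simple scoring scheme.

    The scoring values are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: substitution, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4
print(global_alignment_score("ACGT", "ACT"))   # 1
```

Doubling the sequence length quadruples the number of cells, which is why optimal alignment becomes impractical for large collections of long genomes.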
The evolutionary assumptions used in developing scoring methods and sequence alignment may not reflect phylogeny in fundamental virology concepts [28,29]. In addition, the evolutionary method assumes linearity in scoring methods based on different scales [30]. Due to the limited number of features, these methods are combined with distance-based classifiers to develop potentially more powerful machine learning algorithms. Alignment-free methods classify viral DNA sequences based on the degree of similarity between different features. As an alternative to alignment-based schemes or similarity score procedures, alignment-free schemes map the viral genome sequence to a point in a feature space, where the distance between the features of the original sequences helps classify the viruses [31]. Modern representation techniques perform classification using nucleotide occurrence statistics and information about nucleotide positions [32]. Examples include k-mer counts, Kolmogorov complexity of the sequence, absent words, matrix invariants, genomic signal processing, curves, and images [33][34][35][36][37][38].
Feature selection and limited biological information are the common drawbacks of alignment-free methods. However, these methods work well in several respects. They are useful when the DNA sequence is the only available information, because the associated biological knowledge required by the alignment process is not needed. They work well where highly diverse DNA sequences are available and the alignment process is not trustworthy. They can deal with large DNA sequence datasets more efficiently, as all sequences are represented in a fixed format as points in the feature space. Therefore, they can be used with machine learning techniques and applications such as the k-nearest neighbour (k-NN) classifier, rule-based classification, support vector machine (SVM), and artificial neural network [39][40][41][42].
In earlier studies, alignment-free methods using nucleotide statistics worked efficiently for dissimilar viral DNA sequences but gave poor results for similar viral DNA sequences [43]. In later studies, however, alignment-free methods with more sophisticated features worked well compared to alignment-based methods, even at the species and genus levels [40].
Machine learning techniques can be categorized by the kind of distance information they use, such as feature vectors and hierarchical relationships. The k-NN classifier has been used to predict the label of a virus DNA sequence [44,45]. The distance between the features of the input sequence and the features of the training data is calculated. The prediction is made by a majority vote among the classes of the k nearest neighbours, where k is a parameter of the model, and the winning class is assigned to the input DNA sequence [1].
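The majority-vote rule described above can be sketched in a few lines. This is a generic illustration, not the implementation used in [44,45]: the Euclidean distance, the function name, and the toy two-dimensional feature vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Sort the training samples by Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    # Count the labels of the k closest samples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors (hypothetical values, for illustration only).
train_vectors = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
train_labels = ["Human", "Human", "Human", "Animal", "Animal"]

print(knn_predict(train_vectors, train_labels, (2, 2), k=3))  # Human
print(knn_predict(train_vectors, train_labels, (8, 9), k=3))  # Animal
```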
Random Forest (RF) is an ensemble technique comprising groups of decision trees. In [45], a large number of uncorrelated decision trees was built during training. Each tree was constructed by selecting a random subset of the training virus genome-sequence data. A random subset of characteristic variables was selected at each node based on probability and maximum information to grow a tree. The tree was then grown by repeatedly splitting nodes up to a threshold. To select the label of a given DNA sequence, every tree cast a single vote for a class, and the class with the maximum number of votes was the final prediction of the RF technique. A technique was also used to recognize unknown genes of related purpose from specified data by applying a support vector machine (SVM). A quality evaluation method was developed in which the quality of DNA chromatograms was classified as low or high. The SVM classifier was used to predict the two classes [45]. Machine learning techniques were presented to distinguish infected from actual genes, and a review of different genome data classification mechanisms based on machine learning was discussed in detail [46].
A method was proposed for generating global features of genome sequences. Human endogenous retrovirus genome sequences were used as the data set. Infinite sequence generators were evolved to produce sequences with an augmented collection of matching blocks over a critical size in the target genome sequences. Compared to other techniques such as GC content, infinite string matching is a multiple-location-based technique. Different types of global features were selected, and genome sequences were classified using single-feature threshold classifiers [47].
In [35], a DNA sequence-based species classification technique was presented. Three data sets, i.e., iris, wine, and new-thyroid, were selected for this purpose. To develop efficient and robust classification algorithms, different DNA signature components such as GC content, exon (sum of the first three nucleotides) and intron (fourth nucleotide), weight, and annealing temperature were used as features. DNA sequence-based data classification (DSDC) was presented for species classification. It was observed that no data tuning, preprocessing, or post-processing steps of data mining were needed. It was also observed that the proposed algorithms work well compared with different differential evaluation variants. Nearest-neighbour classification was used for optimization, and 1-NN was used as a performance baseline. The average accuracy of the DSDC algorithms was 74.15% for the wine dataset, 85.58% for the new-thyroid dataset, and 87.33% for the iris dataset.
In [48], the Fourier transform was used to generate characteristic sets based on the amount of randomness in order to classify previously unidentified retrovirus DNA sequences. This study used four types of data sets: HERV, complete retroviral genome data (RV), negative data (NRV), and the human genome. These data sets were collected from NCBI, and HERV was collected from RetroSearch. Four types of features were generated using the Fourier phase histogram. These features were then used to analyze the accuracy of the RF classifier. It was observed that the RF classifier produces satisfactory results in distinguishing retroviral genomes from non-coding sections.
The basic local alignment search tool (BLAST) is similar to the Smith-Waterman-Gotoh algorithms, but the difference is that it uses a heuristic search rather than an exhaustive search. This permits it to run about 50 times faster at the cost of some accuracy. It recognizes similarities (hits) between input and query sequences and assigns scores. Overlapping hits were grouped, and region scores were assigned based on the BLAST scores of the sequences. Using FASTA, regions between two stop codons that were long enough (<62 nucleotides) were searched and compared to a database of over 6000 non-retroviral and retroviral proteins. FASTA searches and BLAST searches are comparable, except that FASTA is exclusively tuned for aligning different proteins. This database has been expanded and updated. Data is presented online in addition to data for similar regions from RepeatMasker. Often, there are perceptible differences [49].

Methodology
In the preprocessing step, we count the nucleotide patterns in the given DNA sequences of both human and animal retroviruses. Let $P = [p_1, p_2, p_3, \ldots, p_{64}]$ be the nucleotide patterns, where each $p_i$ represents a group of three nucleotides over the alphabet $\Sigma = \{A, C, G, T\}$. We count the occurrences of each pattern $p_i$ in the given DNA sequence data and store the counts for animals and humans separately. The flow of the methodology is shown in Fig. 1.
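The counting step can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the text does not state whether the three-nucleotide windows overlap, so the sketch assumes a sliding window with a configurable `step` (1 for overlapping windows, 3 for non-overlapping codons), and the names `PATTERNS` and `count_patterns` are hypothetical.

```python
from itertools import product

# All 4^3 = 64 three-nucleotide patterns over {A, C, G, T}, in a fixed order.
PATTERNS = ["".join(p) for p in product("ACGT", repeat=3)]


def count_patterns(sequence, step=1):
    """Return a 64-element count vector for one genome sequence.

    step=1 slides the window one base at a time (assumption);
    step=3 counts non-overlapping triplets instead.
    """
    counts = dict.fromkeys(PATTERNS, 0)
    seq = sequence.upper()
    for start in range(0, len(seq) - 2, step):
        window = seq[start:start + 3]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    return [counts[p] for p in PATTERNS]


vector = count_patterns("ACGTACGT")
print(len(vector))                    # 64
print(vector[PATTERNS.index("ACG")])  # 2
```

Every genome, whatever its length, is mapped to a vector of the same 64 dimensions, which is what makes the later range-based comparison between human and animal samples possible.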
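Let $h_i \in \mathbb{N}^{64}$ be the $i$-th human retrovirus sample, for $i \in [1, m]$, and let $a_j \in \mathbb{N}^{64}$ be the $j$-th animal retrovirus sample, for $j \in [1, n]$. We define $H$ and $A$ such that

$$H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_m \end{bmatrix}, \qquad A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}.$$

Features are in row order for both $H$ and $A$. We redefine the human data $H$ in column order, so that the $i$-th column $g_i$ contains the $i$-th feature of all samples:

$$H = [g_1, g_2, g_3, \ldots, g_{64}].$$

Similarly, we redefine the animal data $A$ in column order, so that the $j$-th column $b_j$ contains the $j$-th feature of all samples:

$$A = [b_1, b_2, b_3, \ldots, b_{64}].$$

This step resolves the issues of characters from a limited alphabet and variable dimensionality. We then minimize the number of features by selecting only significant features for classification. The details of the feature reduction step are as follows. Let $G^i_{\min}$ be the minimum value of the $i$-th feature $g_i$:

$$G^i_{\min} = \min(g_i).$$

Similarly, $G^i_{\max}$ is the maximum value of the $i$-th feature $g_i$:

$$G^i_{\max} = \max(g_i).$$

Let $b_i(j)$ denote the $j$-th value of $b_i$. The value 1 is assigned to $X^i_1(j)$ if $b_i(j)$ is greater than or equal to $G^i_{\min}$:

$$X^i_1(j) = \begin{cases} 1 & \text{if } b_i(j) \ge G^i_{\min} \\ 0 & \text{otherwise.} \end{cases}$$

Likewise, the value 1 is assigned to $X^i_2(j)$ if $b_i(j)$ is greater than $G^i_{\max}$:

$$X^i_2(j) = \begin{cases} 1 & \text{if } b_i(j) > G^i_{\max} \\ 0 & \text{otherwise.} \end{cases}$$

We compute $X^i$ by subtracting the column sum of $X^i_2$ from the column sum of $X^i_1$:

$$X^i = \sum_{j=1}^{n} X^i_1(j) - \sum_{j=1}^{n} X^i_2(j),$$

so that $X^i$ counts the animal samples whose $i$-th feature lies within the human range $[G^i_{\min}, G^i_{\max}]$. Consider the matrix $X$ whose columns are $x_1, x_2, x_3, \ldots, x_n$. We define $F = \{f_1, f_2, f_3, f_4, f_5\}$ as the five features for which the values $X^i$ are minimal.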
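The feature reduction step can be illustrated with a minimal sketch that follows the definitions above: for each feature, it counts the animal samples that are at least the human minimum, subtracts those that exceed the human maximum, and keeps the features with the smallest counts. The function names, the tie-breaking by feature index, and the toy data are assumptions for illustration only.

```python
def overlap_counts(human, animal):
    """For each feature i, count the animal samples inside the human range.

    human and animal are lists of equal-length count vectors (one row per sample).
    """
    n_features = len(human[0])
    counts = []
    for i in range(n_features):
        g_min = min(row[i] for row in human)
        g_max = max(row[i] for row in human)
        x1 = sum(1 for row in animal if row[i] >= g_min)  # at or above the human minimum
        x2 = sum(1 for row in animal if row[i] > g_max)   # above the human maximum
        counts.append(x1 - x2)
    return counts


def select_features(human, animal, n_selected=5):
    """Indices of the n_selected features with the smallest overlap count."""
    counts = overlap_counts(human, animal)
    # sorted() is stable, so ties are broken by the lower feature index (assumption).
    ranked = sorted(range(len(counts)), key=lambda i: counts[i])
    return ranked[:n_selected]


# Toy data with three features instead of 64 (hypothetical values).
human = [[5, 10, 3], [6, 12, 4]]
animal = [[5, 20, 1], [7, 11, 9], [6, 30, 3]]

print(overlap_counts(human, animal))                 # [2, 1, 1]
print(select_features(human, animal, n_selected=2))  # [1, 2]
```

A small overlap count means that few animal samples fall inside the range spanned by the human samples for that feature, so the feature separates the two classes well.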

Classifier
The classification of the training data is carried out in two phases. The first phase is based on the features $f_1$ and $f_2$. The second phase is based on three features, namely $f_3$, $f_4$, and $f_5$.

Phase I
In Phase I, the features $f_1$ and $f_2$ are selected for classification, and only these features are taken from the given data. We define $A'$ as the data set of the $f_1$, $f_2$ columns of the given data $A$:

$$A' = [b_{f_1}, b_{f_2}].$$

Similarly, $H'$ is the data set of the $f_1$, $f_2$ columns of the human data $H$:

$$H' = [g_{f_1}, g_{f_2}].$$

Let $G^{f_i}_{\min}$ be the minimum value of the feature $g_{f_i}$:

$$G^{f_i}_{\min} = \min(g_{f_i}).$$

Let $G^{f_i}_{\max}$ be the maximum value of the feature $g_{f_i}$:

$$G^{f_i}_{\max} = \max(g_{f_i}).$$

Now we define $Y^i_1$ such that $Y^i_1$ contains 1 for all values of the feature data $b_{f_i}$ of $A'$ that are greater than or equal to $G^{f_i}_{\min}$, and 0 otherwise. Similarly, $Y^i_2$ contains 1 for all values of $b_{f_i}$ that are greater than $G^{f_i}_{\max}$, and 0 otherwise. We compute $Y^i$ by subtracting $Y^i_2$ from $Y^i_1$:

$$Y^i = Y^i_1 - Y^i_2,$$

where $Y^i$ represents the count of features that belong to the human range.

Phase II
All samples whose Phase I decision label $D^i_1$ is "Unknown" are selected for Phase II.
In Phase II, $f_3$, $f_4$, and $f_5$ are the features selected for classification. We define $A''$ as the set of $f_3$, $f_4$, $f_5$ feature data of the given data:

$$A'' = [b_{f_3}, b_{f_4}, b_{f_5}].$$

Similarly, we define the $f_3$, $f_4$, $f_5$ feature data for humans as follows:

$$H'' = [g_{f_3}, g_{f_4}, g_{f_5}].$$

Let $G^{f_i}_{\min}$ be the minimum value of the feature $g_{f_i}$:

$$G^{f_i}_{\min} = \min(g_{f_i}).$$

Let $G^{f_i}_{\max}$ be the maximum value of the feature $g_{f_i}$:

$$G^{f_i}_{\max} = \max(g_{f_i}).$$

Now we define $Z^i_1$ such that $Z^i_1$ contains 1 for all values of the animal data $b_{f_i}$ that are greater than or equal to $G^{f_i}_{\min}$, and 0 otherwise.
Similarly, $Z^i_2$ contains 1 for all values of the given data $b_{f_i}$ that are greater than $G^{f_i}_{\max}$, and 0 otherwise.
We compute $Z^i$ by subtracting $Z^i_2$ from $Z^i_1$:

$$Z^i = Z^i_1 - Z^i_2.$$

We take the decision $D^i_2$ on the basis of $Z^i$.
We carried out the simulations with the help of MATLAB software.
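The two-phase cascade can be sketched as below. The sketch is written in Python rather than MATLAB and is not the authors' implementation. The in-range count follows the definitions of $Y^i$ and $Z^i$ (a value counts when it is at least the human minimum and does not exceed the human maximum). The decision thresholds, however, are assumptions made only for illustration, since the decision equations are not reproduced in the text: Phase I is assumed to answer "Human" when both features lie in the human range, "Animal" when neither does, and "Unknown" otherwise, while Phase II is assumed to answer "Human" only when all three features lie in the human range.

```python
def human_ranges(human):
    """(min, max) of every feature over the human training samples."""
    n_features = len(human[0])
    return [
        (min(row[i] for row in human), max(row[i] for row in human))
        for i in range(n_features)
    ]


def in_range_count(sample, ranges, features):
    """Number of the given features whose value lies inside the human range."""
    return sum(1 for f in features if ranges[f][0] <= sample[f] <= ranges[f][1])


def classify(sample, ranges, phase1_features, phase2_features):
    """Two-phase decision; the thresholds below are illustrative assumptions."""
    # Phase I: decide with the first two selected features.
    y = in_range_count(sample, ranges, phase1_features)
    if y == len(phase1_features):
        return "Human"
    if y == 0:
        return "Animal"
    # Phase II: only samples left "Unknown" by Phase I reach this point.
    z = in_range_count(sample, ranges, phase2_features)
    return "Human" if z == len(phase2_features) else "Animal"


# Toy data with five already-selected features (hypothetical values).
human = [[5, 10, 3, 7, 1], [6, 12, 4, 9, 2]]
ranges = human_ranges(human)
phase1, phase2 = [0, 1], [2, 3, 4]

print(classify([5, 11, 0, 0, 0], ranges, phase1, phase2))  # Human (Phase I)
print(classify([0, 0, 3, 8, 1], ranges, phase1, phase2))   # Animal (Phase I)
print(classify([5, 0, 3, 8, 1], ranges, phase1, phase2))   # Human (Phase II)
print(classify([5, 0, 3, 8, 9], ranges, phase1, phase2))   # Animal (Phase II)
```

Because the ranges come from the human training samples, every human training sample falls inside all of them, which is one way to read the result that all human training samples are labeled correctly in Phase I.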

Detection Algorithm
Input: Gen-a genome

Results
Results of the classifier are presented in Tab. 1. The proposed method correctly detects 91.30% of the genomes used in the training data. In Phase I of the training step, 30 of 41 animal retroviruses are correctly labeled as "Animal", 2 are wrongly labeled as "Human", and 9 are labeled as "Unknown". All human retroviruses are classified correctly. In Phase II of the training step, 7 of the 9 remaining animal retroviruses are correctly labeled as "Animal" and 2 are wrongly labeled as "Human". The result of the classifier during the validation stage is 96.15%. In Phase I of the validation step, 17 of 24 animal retroviruses are correctly labeled as "Animal", 1 is wrongly labeled as "Human", and 6 are labeled as "Unknown". Of the human data, 1 is correctly labeled as "Human" and 1 is labeled as "Unknown". In Phase II of the validation step, 1 human and 6 animal retroviruses are classified correctly. The result of the classifier during the testing stage is 92%. In Phase I of the testing step, 16 of 22 animal retroviruses are correctly labeled as "Animal", 1 is wrongly labeled as "Human", and 5 are labeled as "Unknown". All human retroviruses are detected correctly. In Phase II of the testing step, 4 animal retroviruses are correctly labeled as "Animal" and 1 is wrongly labeled as "Human". Results are given in Tabs. 1-7.

No. | Classifier | Human samples | Human correct | Animal samples | Animal correct | Accuracy (%)
1 | [classifier name missing] | 3 | 2 | 22 | 21 | 92
2 | Naive Bayes | 3 | 3 | 22 | 21 | 96
3 | kNN1 | 3 | 2 | 22 | 22 | 96
4 | kNN3 | 3 | 2 | 22 | 22 | 96
5 | kNN5 | 3 | 0 | 22 | 22 | 88
6 | SVM | 3 | 1 | 22 | 22 | 92
7 | Our classifier | 3 | 3 | 22 | 20 | 92

Performance Analysis
To check the performance of our classifier, standard performance metrics are used in this research. Given a test set with $N$ samples, let $N_P$ and $N_N$ be the number of positive samples ("Animal") and the number of negative samples ("Human") within the dataset, respectively ($N = N_P + N_N$). After the classification, let $T_P$ be the number of positives classified as positive and $F_N$ the number of positives classified as negative ($N_P = T_P + F_N$). Similarly, let $T_N$ be the number of negatives classified as negative and $F_P$ the number of negatives classified as positive ($N_N = T_N + F_P$). For this research, we have considered the metrics given in Tab. 8 to analyze the performance. Considering Tab. 6, we have taken 22 samples from animals ($N_P = 22$) and 3 samples from humans ($N_N = 3$) as the test dataset for the classifier. Thus $N = 25$. Again from Tab. 6, it is clear that $T_P = 20$, $F_P = 0$, $T_N = 3$, and $F_N = 2$. Results of the performance analysis are shown in Tab. 8, and the confusion matrix is given in Tab. 9.
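As a worked check of these counts, the following minimal sketch computes several standard metrics from $T_P = 20$, $F_N = 2$, $T_N = 3$, and $F_P = 0$. Which metrics Tab. 8 actually lists is not visible here, so the set below (accuracy, sensitivity, specificity, precision, and the equally weighted mean of sensitivity and specificity) is an assumption.

```python
def metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts (positive class: "Animal")."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)  # positives recognized as positive
    specificity = tn / (tn + fp)  # negatives recognized as negative
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Test-set counts reported for the proposed classifier.
results = metrics(tp=20, fn=2, tn=3, fp=0)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

The accuracy evaluates to 23/25 = 0.92, which agrees with the 92% reported for the testing stage.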

Alternate Performance Measure
We can also use the alternative performance measure presented in the Introduction, which weighs both classes equally. The results of this measure are shown in Tab. 10.

Discussion
• This alignment-free similarity measure can be used for motif-based protein sequences, phylogenetic tree construction, protein sequence analysis, clinical pathology, and other medical sciences.
• We have selected the feature ranges based on the minimum and maximum values. Other range selection methods can also be used, depending on the precision of the classifier.
• We have performed a random classification technique. Other types of classification methods can be used and analyzed based on the classifier's accuracy.
• Classification can be performed in multiple phases by selecting two features in each phase until significant results are reached.
• The result of the classifier can be improved by selecting mutated genes in the training stage.

Conclusion
In this study, we developed an algorithm for the classification of retroviruses based on DNA sequences. First, the preprocessing step counts the occurrences of nucleotide patterns in the given DNA sequences. In the second step, the features are reduced to five based on their significance. In the final stage, classification is carried out in two phases. In the first phase, we select two features. Data not classified in the first phase is passed to the next phase. In the second phase, we select three features. Three data sets were selected: the first was used in training, the second in validation, and the third to test the classifier's performance. It has been observed that the selected number of features provides a significant result compared to other combinations of features. The issues of characters from a limited alphabet and variable dimensionality are handled in the preprocessing step. The thresholds selected for the classifier in both phases provide reasonably significant results compared to other thresholds. The selected classification procedure gives significant results on all data sets, namely "Training", "Validation", and "Testing". Almost all classifiers have higher error rates on the minority class but perform well on the majority class. The proposed algorithm provides better results on both the majority and minority classes of imbalanced data.
Funding Statement: This work was supported by the Soonchunhyang University Research Fund.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.