Detection of intra-family coronavirus genome sequences through graphical representation and artificial neural network

In this study, chaos game representation (CGR) is introduced for investigating the pattern of genome sequences. It is an image representation of the genome that gives an overall visualization of the sequence. CGR is a mapping technique that assigns each base of a sequence to a position in a two-dimensional plane to portray the DNA sequence. Importantly, CGR provides a one-to-one mapping for individual nucleotides as well as for the sequence. A coordinate in the CGR plane identifies the corresponding base and its location in the original genome. Therefore, the whole nucleotide sequence up to the current nucleotide can be restored from a single point of the CGR. In this study, CGR coupled with an artificial neural network (ANN) is introduced as a new way to represent the genome and to classify intra-family coronavirus sequences. A hierarchical clustering study was done to validate the approach, and it was found to be more than 90% accurate when the result was compared with the phylogenetic tree of the corresponding genomes. Interestingly, the method makes the genome representation significantly shorter (more than 99% compressed), saving data space while preserving the genome features.


Introduction
Representations of genomic data in numerical, graphical or audio form have gained importance in bioinformatics research. In recent years, many studies have proposed different genome representations to reveal significant features of DNA. A graphical representation of DNA is convenient for visually analysing the distribution of the nucleotide bases A, C, G and T. One such mapping model is the H-curve, a three-dimensional representation of the DNA sequence introduced by Hamori and Ruskin in 1983 (Hamori and Ruskin, 1983). Later on, researchers developed many other graphical representations (Bielińska-Wąż and Wąż, 2017; Mo et al., 2018; Randić et al., 2006; Hoang et al., 2016; Touati et al., 2021; poor and Yaghoobi, 2019; Sun et al., 2020). Chaos Game Representation (CGR) is one of these graphical techniques; it assigns each DNA base to a position in a two-dimensional plane to portray the DNA sequence (Hoang et al., 2016). This 2-D representation was introduced to depict the local and global patterns of DNA by using iterated function systems (IFS), which are based on chaotic dynamics (Jeffrey, 1990). The CGR technique was recently shown to be an efficient classifier of helitron families in Caenorhabditis elegans genomes (Touati et al., 2021). The CGR tool is also useful for comparative studies among genome sequences (Hoang et al., 2016). Through CGR, a genome sequence can be characterized in both graphical and numerical form. Importantly, CGR provides a one-to-one mapping for individual nucleotides as well as for the sequence (Jeffrey, 1990). A single coordinate in the CGR plane identifies the corresponding base and its location in the original genome. Therefore, the whole nucleotide sequence up to the current nucleotide can be restored from a single point of the CGR.
Additionally, a CGR portrays whole-genome data or parts of the sequence in a single plane, which makes the CGR tool practical for comparative studies (Deschavanne et al., 1999). Recently, CGR was used to capture recurrence features from SARS-CoV-2 datasets, and the results were proposed for clustering (Olyaee et al., 2020). CGR plots are mostly drawn on a square, but they can also be drawn on an n-vertex polygon (Xiaohui et al., 2014).
The year 2020 began under the threat of a pandemic, as a virus severely affected human life and gradually spread almost all around the world. The human community is presently suffering from a pandemic caused by a positive-strand RNA virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Yan et al., 2020). Moreover, several viral diseases are mentioned in the World Health Organization (WHO) pandemic list, such as Middle East respiratory syndrome (MERS), Disease X, severe acute respiratory syndrome (SARS), Ebola and influenza. These viruses have affected humans over several years (WHO, n.d.). Initially, WHO suspected Disease X as the source of the present pandemic. However, a novel coronavirus (2019-nCoV) was identified as the cause of the COVID-19 pandemic (Wu et al., 2020; Zhu et al., 2020). Sometimes, a new virus emerges from variants of an existing virus or after mutations. For example, SARS-CoV-2 is 79.5% similar in nucleotide sequence to the earlier SARS-CoV and shows 96% similarity to another coronavirus, SL-Cov-RaTG13. Beyond these two examples, many more cases of nucleotide sequence similarity have been reported.
Laboratory-based genome sequence analysis is expensive and time-consuming. Therefore, computation using an artificial neural network (ANN) can be an efficient tool for sequence analysis, for example in segmentation (Cheng et al., 2012) and in computing on DNA sequences (Zhong et al., 2020). An ANN consists of an input layer, one or more hidden layers and an output layer. The smallest processing unit of an ANN is the neuron, which connects the different layers of the network (Garro et al., 2016). Each neuron receives the inputs from the previous layer and computes a weighted sum of them using individual weights. The interconnections, nodes and layers of a neural network are inspired by a simplified model of the human brain and are used to compute different recognition algorithms. An ANN is used when plenty of data is available and no explicit algorithmic solution is possible. The data are used to train the network during its learning process (Hoang et al., 2020). A multilayer perceptron (MLP) with one or more hidden layers becomes more accurate when given sufficient training data (Hoang et al., 2020). A multilayer ANN is capable of finding correlations among the input data. This machine learning tool is efficient in pattern recognition and can be applied to gene detection, sequence classification, disease detection and many other problems in bioinformatics. The ANN method has been used for detection in high-dimensional, complex cancer datasets (Lancashire et al., 2009). The technique is one of the deep learning methods applied to data analysis and computer vision, and it was efficient in classifying DNA damage based on comet assay images (Atila et al., 2020). In this context, a neural network can be beneficial for organizing virus sequences into their families. The virus sequences considered here are positive-strand RNA, about thirty thousand bases in size.
A classification and clustering study was previously carried out on different virus sequences using a support vector machine (SVM) and the Haar wavelet (Paul et al., 2021). The present study explores the combined performance of CGR and ANN for detecting virus sequences from the same family. For this purpose, members of the coronavirus family, including SARS-CoV, SARS-CoV2, MERS and Alphacoronavirus (Alpha CoV), were included in the analysis.

Material and methods
The CGR method converts a one-dimensional sequence into a two-dimensional graphical form. First, the DNA nucleotides are assigned to the four vertices of a unit square in the Euclidean plane. The coordinates of the nucleotide vertices are A = (0,0), C = (0,1), T = (1,0) and G = (1,1). The mapping process starts from a dot at the center of the plane, P_0 = (0.5, 0.5). It reads the DNA string (S_n) and takes the first character to plot in the plane. The first point, P_1, lies halfway between the last dot (the center) and the vertex matching the first nucleotide. The next point, P_2, lies exactly halfway between P_1 (which now acts as the initial point) and the vertex of the second nucleotide. The process continues in the same way up to the Nth, or last, character of the DNA sequence.
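The halving rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the vertex coordinates and the starting point are taken from the text, while the function name `cgr_points` and the choice to reject symbols other than A, C, G and T are assumptions made for the example.

```python
# Vertex coordinates as given in the text.
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "T": (1.0, 0.0), "G": (1.0, 1.0)}


def cgr_points(sequence):
    """Return the CGR points P_1..P_N of a DNA string, starting from P_0 = (0.5, 0.5)."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        # Assumption: any symbol other than A, C, G, T raises KeyError.
        vx, vy = VERTICES[base]
        # Move halfway from the current point towards the vertex of this base.
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points


# First three points of the illustrative prefix 'ACT'.
points = cgr_points("ACT")
print(points)  # [(0.25, 0.25), (0.125, 0.625), (0.5625, 0.3125)]
```

Each step halves the distance to a vertex, so the coordinates are exact binary fractions for short sequences. For genomes of tens of thousands of bases, floating-point precision limits how far back the point alone can be decoded.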
A short sequence 'ACTTGAATG' (S_n, where the length N = 9) is taken as an illustration of the CGR process, shown in Fig. 1a. Inversely, the original sequence can be traced back from a given CGR coordinate, as shown in Fig. 1b. The position of each base in the sequence governs the geometric coordinates in the CGR plane. Each coordinate of the CGR plane acts as a memory of the source sequence up to the specific nucleotide pointed to by that coordinate (Jonas et al., 2001). For example, the point of the third nucleotide (from left to right, P_3 = T) in S_n can be used to recover the preceding nucleotides (first and second, P_1 = A, P_2 = C), that is, the subsequence up to the third position, as in Fig. 1b. First, P_3 is found in the fourth quadrant of the largest square of the CGR plane, where the corresponding nucleotide vertex is 'T'. The CGR plane is then zoomed into that quadrant only. In this square (green square), P_3 falls in the 'C'-side sub-square (the upper-left one, since C = (0,1)). Similarly, in the next step, the point P_3 lies on the 'A' vertex side (yellow square). If the vertices of these squares are arranged in ascending order of square size, the controlling vertices 'ACT' are retrieved as the recovered sequence.
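The zoom-in procedure can also be written as an arithmetic inversion of the halving step: the quadrant containing the point gives the most recent base, and doubling the point and subtracting that vertex gives the previous point. The sketch below assumes the vertex layout from the text. The function name `recover_prefix` is hypothetical, and the approach is only reliable while floating-point precision holds, which is roughly the last 50 bases with double precision.

```python
# Vertex coordinates as given in the text.
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "T": (1.0, 0.0), "G": (1.0, 1.0)}


def recover_prefix(point, length):
    """Recover the `length` bases that produced `point`, most recent base last."""
    x, y = point
    bases = []
    for _ in range(length):
        # The quadrant containing the point identifies the most recent base.
        vx = 1.0 if x >= 0.5 else 0.0
        vy = 1.0 if y >= 0.5 else 0.0
        base = next(b for b, v in VERTICES.items() if v == (vx, vy))
        bases.append(base)
        # Undo the halving step: previous point = 2 * current point - vertex.
        x, y = 2.0 * x - vx, 2.0 * y - vy
    # Bases were collected from newest to oldest, so reverse them.
    return "".join(reversed(bases))


# P_3 of the example sequence, i.e. the point reached after reading 'ACT'.
print(recover_prefix((0.5625, 0.3125), 3))  # ACT
```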

Frequency Chaos Game representation (FCGR)
The Frequency Chaos Game Representation (FCGR) is an extension of CGR. It converts a DNA sequence into a 2-D plot based on the k-mer occurrences in the DNA. In the FCGR method, the CGR unit square is divided into 2^k × 2^k sub-squares (Lichtblau, 2019). Each sub-square collects the k-mers according to the geometric position of the corresponding nucleotides of the DNA string. For k = 1, or first-order FCGR (FCGR_1), the CGR is divided into four sub-squares, each holding a single nucleotide, i.e. A, C, T or G (poor and Yaghoobi, 2019). Fig. 2a shows the FCGR_1 plot of the sequence S_n; the monomer counts are A = 3, C = 1, G = 2 and T = 3, which sum to the sequence length of 9. Similarly, FCGR_2 has sixteen sub-squares, one per dinucleotide, as shown in Fig. 2b. Here, the dimer counts are AA = 1, AC = 1, GA = 1, AT = 1, CT = 1, TG = 2 and TT = 1 (eight overlapping dimers in total), and the rest are zero. FCGR_3, FCGR_4 or any FCGR_k can be determined in the same way.
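The k-mer counts behind an FCGR can be checked with a short sliding-window count. The sketch below assumes overlapping, non-circular windows, so a sequence of length N yields N - k + 1 k-mers. The helper name `kmer_counts` is introduced only for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a DNA string (non-circular windows)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


example = "ACTTGAATG"  # the illustrative sequence S_n from the text
monomers = kmer_counts(example, 1)
dimers = kmer_counts(example, 2)

print(sum(monomers.values()))  # 9 monomers in a 9-base sequence
print(sum(dimers.values()))    # 8 overlapping dimers
print(dimers["TG"])            # 2
```

A `Counter` returns zero for k-mers that never occur, which matches the empty sub-squares of the FCGR plot.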

FCGR to numerical coding
The occurrence of each k-mer can be obtained from the FCGR_k process, and the frequency matrix of the FCGR can be restructured as a list of k-mer occurrences. Dimers (k = 2), trimers (k = 3) or general k-mers can be determined from the nucleotide sequence. The total number of possible dimers is 4^2 = 16, of trimers 4^3 = 64, and so on (4^k for k-mers). Here, FCGR_3 (k = 3) is shown as an example; FCGR_k for any other k can be found similarly. Each smallest unit square of the FCGR_3 plane is mapped according to the occurrence of the corresponding trimer. As an example, the SARS-CoV sequence (accession number: AY278741) of length 29,727 base pairs was represented in FCGR_3 form (Drosten et al., 2003). The 2-D image of the corresponding sequence is shown in Fig. 3. The color indicates the abundance of each codon (trinucleotide) in the genome sequence: darker colors represent fewer occurrences, and lighter colors represent the most frequent trinucleotides.
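One way to turn a sequence into the 4^k-element numerical code is to place each k-mer in the sub-square that contains its CGR point and then flatten the grid. The sketch below makes several assumptions that the text does not specify: raw counts rather than normalized frequencies, row 0 at the bottom of the plane (y increasing with the row index), row-major flattening, and skipping any k-mer that contains a symbol other than A, C, G or T. The names `fcgr_cell`, `fcgr_matrix` and `fcgr_vector` are hypothetical.

```python
# Each base contributes one bit per axis, matching the vertex coordinates in the text.
BITS = {"A": (0, 0), "C": (0, 1), "T": (1, 0), "G": (1, 1)}


def fcgr_cell(kmer):
    """Return (row, col) of the 2^k x 2^k sub-square that holds this k-mer."""
    row = col = 0
    # The most recent base selects the coarsest quadrant, so read the k-mer backwards.
    for base in reversed(kmer):
        bx, by = BITS[base]
        col = 2 * col + bx
        row = 2 * row + by
    return row, col


def fcgr_matrix(sequence, k):
    """Count overlapping k-mers into a 2^k x 2^k grid (row 0 is the bottom row)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in BITS for base in kmer):  # assumption: skip ambiguous symbols
            row, col = fcgr_cell(kmer)
            matrix[row][col] += 1
    return matrix


def fcgr_vector(sequence, k):
    """Flatten the FCGR grid row by row into a list of 4^k counts."""
    return [count for row in fcgr_matrix(sequence, k) for count in row]


features = fcgr_vector("ACTTGAATG", 3)
print(len(features))  # 64
```

With this indexing, the cell of 'ACT' is row 2, column 4 of the 8 × 8 grid, which is the sub-square containing the CGR point (0.5625, 0.3125) reached after reading 'ACT' from the center.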

Classification approach by neural network
T. Paul et al. Expert Systems With Applications 194 (2022) 116559
Here, a feed-forward MLP model was designed based on supervised learning. At the beginning of the training phase, the weights of the model were initialized with random values. Then, at each presentation of the training data (epoch), the model follows a specific learning protocol, adjusting the weights to minimize the error between the desired and the actual output. Here, the learning process used the scaled conjugate gradient method. The output y_k(x) is defined by Eq. (2), where x_i are the input neurons, ω_i are the propagated weights, and 'b' stands for the bias.
The FCGR_3 matrix elements were taken as the input of the neural network, as shown in Fig. 4. The four virus family members are the outputs to be detected by the ANN. The ANN was designed with 64 input neurons for the 64 elements of the FCGR_3 matrix and 4 output neurons for the classification result over the four types of viruses. The hidden layer had 10 neurons, and the model was built in the Neural Network Pattern Recognition application in Matlab 2019b.
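The 64-10-4 architecture can be illustrated with a forward pass in plain Python. This is only a sketch under stated assumptions, not the Matlab model. The tanh hidden activation, the softmax output and the small uniform weight initialization are assumptions, and training with the scaled conjugate gradient method is not shown. Each neuron computes the weighted sum of its inputs plus a bias, as described for the network output above.

```python
import math
import random


def dense(inputs, weights, biases):
    """For each neuron, compute sum_i(w_i * x_i) + b."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Convert raw output scores into probabilities that sum to one."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Random initial weights (assumed range) and zero biases for one layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, hidden, output):
    """Forward pass: 64 FCGR_3 inputs -> 10 hidden neurons -> 4 class scores."""
    h = [math.tanh(z) for z in dense(features, *hidden)]  # assumed activation
    return softmax(dense(h, *output))


rng = random.Random(0)
hidden = init_layer(10, 64, rng)
output = init_layer(4, 10, rng)

# Hypothetical input: a flat FCGR_3 vector with every count equal to one.
probs = forward([1.0] * 64, hidden, output)
predicted_class = probs.index(max(probs)) + 1  # classes numbered 1..4
```

With untrained random weights the prediction is meaningless. The sketch only shows how a 64-element FCGR_3 vector flows through the layers to four class probabilities.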

Datasets
The different coronavirus sequences are available from the National Center for Biotechnology Information (NCBI) (NCBI-Database, 2020). The data for the present study were downloaded together with the GenBank accession number, location and publication date. A total of 1787 sequences were taken, and they are provided with their FCGR_2 and FCGR_3 representations as a supplementary file (S1).

Result and discussion
The CGR method is capable of translating the genome sequence into graphical and numeric data at the same time. Here, all the sequences were mapped into the FCGR_2 and FCGR_3 formats and then classified through the ANN. Of the 1787 sequences, fifteen percent were randomly selected for validation and another fifteen percent for testing. The remaining seventy percent were used as training data. The four different coronavirus types were classified, and confusion matrices for the training, validation and test data are shown in Fig. 5a and Fig. 5b for the FCGR_2 and FCGR_3 formats, respectively. The target classes were labelled 1, 2, 3 and 4 for MERS, SARS-CoV-2, Alpha CoV and SARS-CoV, respectively.
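A random 70/15/15 partition of the 1787 samples can be sketched as follows. The rounding rule and the fixed seed are assumptions made so that the example is reproducible; the Matlab application may divide the data slightly differently, so the subset sizes printed here are illustrative rather than the authors' exact counts.

```python
import random


def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle sample indices and split them into training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_frac)    # assumption: round to nearest integer
    n_test = round(n_samples * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]       # the remaining samples
    return train, val, test


train, val, test = split_indices(1787)
print(len(train), len(val), len(test))  # 1251 268 268
```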
Each FCGR_2 vector has sixteen elements and each FCGR_3 vector has sixty-four. Here, the virus genome sequences (approximately 30,000 bases) were mapped into the CGR formats (lengths 16 and 64). Fig. 5a and Fig. 5b show that the FCGR_3 data perform better than the FCGR_2 data for classification. FCGR_3 achieved 99.8% overall accuracy, whereas FCGR_2 was 90.4% accurate on the classification task. Both mapped datasets (FCGR_2 and FCGR_3) achieved the desired outcomes. Therefore, FCGR_k datasets for other values of k were not prepared, as they would have been either very short (4 elements for k = 1) or very long (256 or more elements for k ≥ 4).
The dendrogram is a tree that represents a hierarchical clustering. It was built from the CGR data, with eleven genome sequences as leaf nodes, in Fig. 6a and Fig. 6b. Eleven randomly chosen sequences were used as representatives to keep the plot readable; however, the program can handle all 1787 sequences in one tree (some examples are given in the supplementary file). Between two and four genomes were taken at random from each species in the 1787-sequence coronavirus dataset. A hierarchical binary clustering method was applied to form four distinct clusters for the groups (SARS-CoV, SARS-CoV2, MERS and Alpha CoV) of the virus family. The branch colors indicate the different groups. To verify the result, a phylogenetic tree (Fig. 7) was plotted with the same genome sequences that were used in Fig. 6a and Fig. 6b. The CGR dendrogram shows that the genome features are preserved even after the long genome sequences are transformed into short FCGR_3 and FCGR_2 sequences.
The dendrograms (Fig. 6a and Fig. 6b) and the phylogenetic tree (Fig. 7) were made from the same genome sequences, which were selected randomly from the dataset (given in the supplementary files). The input sequence length for the phylogenetic tree is nearly thirty thousand bases (the coronavirus sequence length). In contrast, vectors of only 16 and 64 elements represented the virus genomes in the dendrograms. The phylogenetic tree gives the expected result because it uses the full length of the sequence without any mapping or transformation of the data (Deng et al., 2006). Strikingly, the dendrogram, or hierarchical cluster tree, built from the CGR data was nested into a series of subsets corresponding to SARS-CoV, SARS-CoV-2, MERS and Alpha CoV. The dendrogram and the phylogenetic tree are not identical (Fig. 6 and Fig. 7), but, importantly, the subsets of both trees contain the same group members. The only difference between the two trees lies in the Alpha CoV subset, which changes the evolutionary distances.
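The idea of building a cluster tree from short FCGR vectors can be illustrated with a small agglomerative clustering routine. The text does not state which distance metric or linkage rule was used, so the Euclidean distance and average linkage below are assumptions. The toy vectors are hypothetical values chosen for the example, not real FCGR data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average linkage (assumed): mean pairwise distance between members.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # binary merge of the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy feature vectors forming two tight groups.
toy = [[5, 1, 0, 2], [5, 1, 1, 2], [0, 4, 6, 1], [1, 4, 6, 0]]
print(agglomerate(toy, 2))  # [[0, 1], [2, 3]]
```

Recording the order and distance of each merge would give the information needed to draw a dendrogram. Here only the final grouping is returned.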
The CGR process is a one-to-one sequential mapping that gives a numerical representation of genome data. As a result, the nature of the genome is preserved after the transformation. A great advantage of the transformation is the possibility of reconstructing the original genome data up to a given point of the CGR sequence. Moreover, the CGR behaves as a genome signature in a unique two-dimensional plot (Hoang et al., 2016; Deschavanne et al., 1999). The CGR pattern of the genome sequences is unique to each species, with only slight variation along the whole genome. The variation in CGR pattern among species depends mainly on a few factors, i.e. the base concentration, unusual repetitions and stretches of bases in the genome (Deschavanne et al., 1999). Here, for the coronavirus, there is only a very small sequence difference between strains of the same virus found at different locations. Therefore, the 2-D CGR plane shows only a mild pattern (color) difference in some of the smallest unit squares (Fig. 3b). However, different viruses, even from the same family, differ in protein structure, which gives a more diverse 2-D CGR plane than variants of the same genome. An ANN is an efficient categorizing tool that can classify the 2-D CGR plane matrix data by using learning algorithms.

Conclusion
In this study, a novel virus classification method was introduced. First, the virus protein sequences were translated into the CGR plane. There are sixty-four codons and twenty amino acids, which are responsible for making proteins. Coincidentally, the CGR plane for k = 3 consists of sixty-four smallest unit squares. Therefore, each codon was assigned to a specific smallest square on the CGR plane. All the genome sequences were plotted into the CGR plane, which gave a 2^k × 2^k matrix for each sequence. The protein features were thus mapped into the 2-D CGR plane. Visualization of the CGR plot gives a first impression of the sequences. Here, the matrices were created from various genomes of different coronaviruses to show that the CGR plane is efficient for classifying the groups of viruses. From the results, it can be concluded that the clustering was done with a high accuracy rate. Therefore, the method is expected to be even more efficient for inter-family clustering of more diverse virus sequences. The classification method was designed on the basis of the CGR method, and the ANN tool was used for higher accuracy. In addition, the technique was used in distance analysis. Virus classification using CGR was compared with phylogenetic analysis of the genome sequences, and the two were found to be comparable.
The work resulted in an artificial intelligence-based algorithm to detect virus protein sequences. The main benefit of the method is that it encodes the genome sequence (big data) into organized, well-represented small data. The FCGR_2 and FCGR_3 representations have 0.053% and 0.213% of the size of the actual genome sequence, respectively. Two more dendrograms were plotted and are given as supplementary files (S2 and S3). The sequences for plots S2 and S3 were taken randomly from the dataset (S1). There is a small clustering error in plot S3 (1 out of the 36 samples in Fig. 6a, S2 and S3), where a SARS-CoV2 sequence was grouped with the Alphacoronavirus family. Despite this small clustering error, the trees give an idea of the virus segmentation for the majority (35 out of 36) of the studied samples. The error is plausible because the number of CGR sequences used for a hierarchical clustering tree is very small (11 to 13 sequences). When the system was trained with a large number of samples (1787 samples), it performed with an accuracy of 99.8% for FCGR_3 and 90.4% for FCGR_2, which implies good efficiency when working on small datasets. Consequently, using this method will reduce the processing time and memory occupancy of a genome database. This will improve the efficiency of the virus-detecting tool, as the training can be done with a larger number of genome samples than in earlier methods. The virus protein sequences will be converted into image sequences, which will be efficient for categorizing the dataset into virus families.
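The reported size ratios follow from simple arithmetic, sketched below. The calculation assumes that size is measured as the number of elements (16 or 64 FCGR entries against roughly 30,000 bases) rather than as bytes on disk, and it uses the approximate genome length quoted in the text.

```python
def relative_size(n_features, genome_length):
    """Feature-vector length as a percentage of the genome length."""
    return 100.0 * n_features / genome_length


genome_length = 30_000  # approximate coronavirus genome length from the text
fcgr2 = relative_size(4 ** 2, genome_length)  # 16 dimer counts
fcgr3 = relative_size(4 ** 3, genome_length)  # 64 trimer counts

print(f"{fcgr2:.3f}% {fcgr3:.3f}%")  # 0.053% 0.213%
```

Both values correspond to a reduction of more than 99% in the number of stored elements.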

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.