Genomic signal processing for DNA sequence clustering

Genomic signal processing (GSP) methods, which convert DNA data to numerical values, have recently been proposed; they offer the opportunity to apply existing digital signal processing methods to genomic data. One of the most widely used methods for exploring data is cluster analysis, which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the inspection and analysis of the results and of possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.


INTRODUCTION
However, since these methods require large computational times for determining similarity among sequences, the use of K-means is not feasible for this application. Therefore, other approaches for DNA clustering have been proposed based on the use of these similarity computation methods. Two of the most popular algorithms for clustering biological sequences are CD-HIT (Li and Godzik, 2006) and UCLUST (Edgar, 2010). Both algorithms use a greedy approach for identifying representative sequences that can be used as a "seed" to group all of the sequences that have a similarity score above a certain threshold. However, the computational resources necessary to perform the multiple sequence alignments remain the main challenge, which limits the number of sequences that can be clustered.

An approach for the analysis of genomic data that has captured the attention of researchers in recent years is the use of genomic signal processing (GSP), which is based on the use of digital signal processing (DSP) theory and algorithms to analyze DNA or protein sequences. GSP methods require the transformation or mapping of the biological sequences, usually represented as a string of characters (i.e., A, T, G and C), to a numeric representation (i.e., a signal) that can be processed using [...] Cheever et al., 1989), and sequence alignment (Skutkova et al., 2015).

One of the main advantages of GSP methods is that the analysis of the genomic data can be performed very quickly because of the optimal coding of the algorithms and the processors that have been designed specifically for those tasks. [...] of features from the Fourier spectrum which may reduce the dimensionality of the data and perhaps its discriminative power as compared with the use of the whole raw spectrum. Moreover, these works employ a hierarchical clustering algorithm instead of K-means, whose properties allow us to generate plots that are different from the traditional dendrograms and that facilitate the exploration of the results.

In this paper, we propose an approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also present a visualization method that allows us to easily inspect and analyze the results. Our results indicate the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
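As a rough illustration of the overall idea, the following Python sketch maps toy sequences to a spectral feature vector and groups them with K-means. It is not the MATLAB implementation used in this work: the function names, the fixed set of `n_bins` normalized frequencies, the division by the squared sequence length, and the farthest-point initialization are all assumptions made to keep the example short and deterministic.

```python
import cmath
from typing import List, Sequence, Tuple


def power_spectrum(sequence: str, n_bins: int = 8) -> List[float]:
    """Hypothetical feature extractor: summed power of four indicator signals.

    Assumption: the spectrum is sampled at the normalized frequencies
    k / (2 * n_bins), k = 1..n_bins, so every sequence yields n_bins values.
    """
    seq = sequence.upper()
    n = len(seq)
    features = []
    for k in range(1, n_bins + 1):
        freq = k / (2.0 * n_bins)  # cycles per base, up to 0.5
        power = 0.0
        for nucleotide in "ACGT":
            # Fourier sum of the binary indicator signal of this nucleotide.
            total = sum(
                cmath.exp(-2j * cmath.pi * freq * t)
                for t, base in enumerate(seq)
                if base == nucleotide
            )
            power += abs(total) ** 2
        features.append(power / n ** 2)  # assumed length normalization
    return features


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int,
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd iterations with a deterministic farthest-point start (assumed)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest current centroid.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy sequences (made up): two with period 2, two with period 4.
sequences = ["AT" * 8, "AT" * 6, "ACGG" * 4, "ACGG" * 3]
features = [power_spectrum(s) for s in sequences]
labels, centroids = kmeans(features, k=2)
print(labels)
```

Because every sequence is reduced to a vector of the same length, no alignment is needed before clustering, and sequences sharing the same periodic structure end up in the same group regardless of their length.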

DNA sequence to signal

In order to employ DSP methods on genomic data, it is first necessary to transform or map the DNA sequences to be analyzed into numerical values representing the information they contain. Several DNA numerical representations have been proposed. One of the most popular of these DNA-to-signal mappings is the Voss representation, which employs four binary indicator vectors, each denoting the presence of one type of nucleotide at a specific location within the DNA sequence (Voss, 1992).
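The Voss mapping can be illustrated with a few lines of Python. This is a minimal sketch: the function name `voss_representation` and the choice of a dictionary keyed by nucleotide are illustrative assumptions, not part of the original implementation.

```python
from typing import Dict, List

NUCLEOTIDES = "ACGT"


def voss_representation(sequence: str) -> Dict[str, List[int]]:
    """Return four binary indicator vectors, one per nucleotide.

    Position t of the vector for nucleotide X is 1 when the sequence
    has X at position t, and 0 otherwise.
    """
    seq = sequence.upper()
    return {
        nucleotide: [1 if base == nucleotide else 0 for base in seq]
        for nucleotide in NUCLEOTIDES
    }


signal = voss_representation("ATTCGCAT")
for nucleotide in NUCLEOTIDES:
    print(nucleotide, signal[nucleotide])
# A [1, 0, 0, 0, 0, 0, 1, 0]
# C [0, 0, 0, 1, 0, 1, 0, 0]
# G [0, 0, 0, 0, 1, 0, 0, 0]
# T [0, 1, 1, 0, 0, 0, 0, 1]
```

For a sequence made only of A, C, G and T, exactly one of the four vectors is 1 at each position.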

Given a DNA sequence α (e.g., α = ATTCGCAT...), we can employ the Voss representation to compute its corresponding four-dimensional DNA signal X_α by applying Eq. (1) [...]

1. Compute a main centroid point M in the n-dimensional space corresponding to the geometrical center of the K centroid locations, computed as $M_i = \frac{1}{K}\sum_{j=1}^{K} C_{j,i}$, where i ∈ [1, 2, ..., n].
2. For each cluster j, compute the Euclidean distance d_j of its centroid C_j with respect to the main centroid M: $d_j = \sqrt{\sum_{i=1}^{n} (C_{j,i} - M_i)^2}$.
3. The centroids of the k clusters are sorted according to their distance to the main centroid, and an angle is assigned to each of them according to its index ι ∈ [0, 1, ..., k] in the sorted array: [...]
4. The main centroid M and the cluster centroids C_ι are mapped into a two-dimensional space φ, [...]
5. Each centroid C_ι is plotted as a point around the main centroid M, according to its distance and its angle, as computed by: [...]
6. We sort each set of DNA sequences in Ω assigned to a specific centroid C_ι according to the distance δ_z of each sequence z with respect to its assigned centroid. The angle θ_z is also computed similarly to step 3.
7. Finally, each sequence z is plotted into φ by computing its corresponding coordinates as: [...]

[...] The NCBI database reported that none of them are mitochondrial genes, but the product of nuclear genomic sequencing, where the scaffold primary assembly showed those fractions with alignment homology reported to COXI but no proven genetic activity. We also found that both the second and third copies of [...] belonging to this kingdom, which have a greater chance to group together due to their high class similarities.
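The seven visualization steps listed earlier can be sketched in Python as follows. Several details are assumptions, because the corresponding equations are not available here: the angle of the centroid with sorted index ι is taken as 2πι/K, the main centroid M is placed at the origin of the plane, polar coordinates are converted with the usual cosine and sine, and each sequence is placed around its own plotted centroid at radius δ_z. The function name `radial_layout` is likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def radial_layout(
    centroids: List[List[float]],
    points: List[List[float]],
    labels: List[int],
) -> Tuple[Dict[int, Tuple[float, float]], List[Tuple[float, float]]]:
    k = len(centroids)
    n_dims = len(centroids[0])
    # Step 1: main centroid M, the geometric center of the K centroids.
    main = [sum(c[i] for c in centroids) / k for i in range(n_dims)]
    # Step 2: distance of each cluster centroid to M.
    dist = [euclidean(c, main) for c in centroids]
    # Step 3: sort centroids by distance; the angle depends on the sorted index.
    order = sorted(range(k), key=lambda j: dist[j])
    centroid_xy: Dict[int, Tuple[float, float]] = {}
    for index, j in enumerate(order):
        theta = 2.0 * math.pi * index / k  # assumed angle rule
        # Steps 4-5: M sits at the origin; centroid j is drawn at (d_j, theta).
        centroid_xy[j] = (dist[j] * math.cos(theta), dist[j] * math.sin(theta))
    # Steps 6-7: place each sequence around its own centroid.
    point_xy: List[Tuple[float, float]] = [(0.0, 0.0)] * len(points)
    for j in range(k):
        members = [z for z, label in enumerate(labels) if label == j]
        members.sort(key=lambda z: euclidean(points[z], centroids[j]))
        for rank, z in enumerate(members):
            delta = euclidean(points[z], centroids[j])
            theta = 2.0 * math.pi * rank / len(members)  # assumed angle rule
            cx, cy = centroid_xy[j]
            point_xy[z] = (cx + delta * math.cos(theta),
                           cy + delta * math.sin(theta))
    return centroid_xy, point_xy


# Toy data (made up): three cluster centroids and four feature vectors.
centroids = [[0.0, 0.0], [4.0, 0.0], [0.0, 2.0]]
points = [[0.0, 1.0], [4.0, 1.0], [1.0, 2.0], [0.0, 0.5]]
labels = [0, 1, 2, 0]
centroid_xy, point_xy = radial_layout(centroids, points, labels)
print(centroid_xy)
print(point_xy)
```

Under these assumptions the layout preserves two quantities from the n-dimensional space: the distance of every centroid to M, and the distance of every sequence to its assigned centroid. The angles carry only ordering information.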

Note that the second largest kingdom, Plants, decomposes faster than Fungi, which is the third largest group.

To determine the validity of the results, we computed centroids for the true kingdoms and compared these centroids to those discovered with our method. Figure 7 depicts the mean square distances between each cluster centroid and the sequences assigned to that cluster by the proposed method using K = 6, and the mean square distances between a cluster centroid generated with the sequences corresponding to each of the six kingdoms.

Figure 7. Mean square distances between each cluster centroid and the sequences assigned to that cluster by the proposed method using K = 6, and the mean square distances between a cluster centroid generated with the sequences corresponding to each of the six kingdoms.

We evaluated the performance of the MATLAB implementation of the proposed algorithm "Signal Tool for [...] The time required to transform the 141 sequences from strings of characters to their corresponding PSDs was 0.921 seconds, and it is not considered in Table 1 since this is performed only once. Note that the time required by STARS is significantly smaller than that of ClustalW. UCLUST is time-constant at 1 second for every experiment; however, note that the number of clusters generated by this method was practically the same as the number of sequences (i.e., the method assigns a cluster to each sequence). This is because UCLUST requires a sequence identity of at least 40% for amino acids and 65% for nucleotides (Edgar, 2010).

[...] show rapid changes that can determine divergence along several phylogenetic groups, according to which hypervariable region is being evaluated. If this were the case, K-means clustering may be adapted to steps of low mutation rates before high mutation rate regions. COXI gene mutations spanning all of the sequence may increase the amount of spurious clustering due to converging hotspots.
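The centroid comparison behind Figure 7 can be sketched as a single helper that is applied twice, once with the cluster labels and once with the true class labels. The sketch below uses made-up two-dimensional vectors in place of the real spectral features; the function name `mean_square_distances` and the toy labels are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def mean_square_distances(
    features: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Mean squared Euclidean distance of each group's members to its centroid."""
    groups: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
    for vector, label in zip(features, labels):
        groups[label].append(vector)
    result: Dict[Hashable, float] = {}
    for label, members in groups.items():
        # Centroid of the group: coordinate-wise mean of its members.
        centroid = [sum(column) / len(members) for column in zip(*members)]
        total = sum(
            sum((x - c) ** 2 for x, c in zip(vector, centroid))
            for vector in members
        )
        result[label] = total / len(members)
    return result


# Toy feature vectors (made up for illustration).
features = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]]
cluster_labels = [0, 0, 1, 1]  # e.g., labels returned by K-means
kingdom_labels = ["plants", "plants", "plants", "fungi"]  # hypothetical classes

msd_clusters = mean_square_distances(features, cluster_labels)
msd_kingdoms = mean_square_distances(features, kingdom_labels)
print(msd_clusters)
print(msd_kingdoms)
```

In this toy case the groups found by clustering are tighter than the groups defined by the hypothetical class labels, which is the kind of contrast the two sets of mean square distances are meant to expose.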

The presence of a spurious cluster whose members are gathered together by their size is an indication of the need to filter out sequences with large indels. Despite such mishaps, the proposed method is capable of performing an analysis of relationships between multiple DNA sequences with minimum handling and without the need for sequence alignment, which results in less human and computational time compared to traditional methods. We tested this method with a number of markers (i.e., mammal mtDNA, influenza [...]. Also consistent with the COXI results, the most evident aspect is the tendency to prioritize division of heavily populated groups.

The proposed method may be used to evaluate the capability of a marker or gene to differentiate between organisms at different levels, to identify subgroups within a set of organisms, and to perform classification of organisms with respect to known sequences or classification of sections of a DNA sequence. Furthermore, this method can also be used to perform similar analysis with amino acid sequences.

We have demonstrated that it is possible to group DNA sequences based on their frequency components.

It is the subject of future work to identify whether distinct frequency bands carry greater weight in the clustering of sequences.

The proposed method has been coded and executed in MATLAB. The source code and the datasets employed for the results presented in this paper are available at GitHub.

CONCLUSION

We have presented a method for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also proposed a visualization method that allows us