Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images

Graphical representation of DNA sequences is one of the most popular techniques for alignment-free sequence comparison. Here, we propose a new method for the feature extraction of DNA sequences represented by binary images, by estimating the similarity between DNA sequences using the frequency histograms of local bitmap patterns of images. Our method shows linear time complexity for the length of DNA sequences, which is practical even when long sequences, such as whole genome sequences, are compared. We tested five distance measures for the estimation of sequence similarities, and found that the histogram intersection and Manhattan distance are the most appropriate ones for phylogenetic analyses.


Introduction
Sequence alignment 1,2 is generally used to estimate similarities between relatively short sequences such as nucleotide sequences of genes or amino acid sequences of proteins. However, the time complexity of the alignment is O(L 2 ) for sequences of length L, which requires an enormous amount of computation time when L is large. Therefore, it is necessary to develop so called alignment-free methods, which are independent of alignment, to compare long sequences such as whole genome sequences in practical time.
In this article, we propose a new method for sequence comparison categorized in the graphical representation. We express a DNA sequence as a binary image-each pixel of a binary image is plotted in either black or white-on a 2-dimensional space and count the occurrence frequencies of 3 × 3 bitmap patterns on the binary image. The distance between the binary images is measured based on the occurrence frequency histograms of the bitmap patterns. As for distance measures between histograms, we selected five frequently used measures: histogram intersection 41 , Manhattan distance, Bhattacharyya distance 42 , Jensen-Shannon divergence 43 , and Kendall's rank correlation coefficient 44 . Based on phylogeny of 31 mitochondrial genome sequences, we seek for the most appropriate distance measure for our method among the five.

Generating a binary image from a DNA sequence
Here, we describe, step-by-step, the procedure to generate a binary image from a DNA sequence.

Graphical representation of a DNA sequence
At first, we assign two-dimensional vectors that are perpendicular to each other to individual types of bases, A, T, G, and C. The number of the independent variations of the assignment is 3!/2 = 3, when we identify the assignments that can be transformed from each other by the rotations of 90-degree or the inversion with respect to the vertical or horizontal axis (Fig. 1). We chose the left most assignment in Fig. 1, where nucleotides A and T are placed on the upper quadrants, and G and C on the lower ones, so that the GC content of a DNA sequence can be grasped easily from the resultant graphical representation. Next, connecting consecutively the vectors assigned to the bases of the DNA sequence from the origin to the end one by one, we can draw a 2-dimensional graph. Fig. 2(a) shows an example-a graphical representation of sequence "ACATATG."

Multiplying weighting factors
In order to extract potential information conveyed by individual bases, we introduced weighting factors based on a Markov chain model into the process of generating binary images 45 . As the weighting factor, we used self-information I(E), which is the amount of information that we will receive when a certain event E occurs. Let P(E) be the probability that event E occurs, then I(E) is defined as I(E) = − log 2 P(E) in bits.
Here, we define P(E) according to the second order Markov chain, concerning about codons in the coding regions of DNA sequences, which is calculated by the occurrence Figure 2: Generating a binary image of sequence "ACATATG": (a) the primary graphical representation, (b) the graphical representation modified with weighting factors, and (c) the generated binary image.
frequencies of triplets of bases. The probability that nucleotide z occurs after a pair of nucleotides xy (x, y, z ∈ {A, T, G, C}) is calculated by where N xys (s ∈ {A, T, G, C}) is the number of occurrence of triplet xys, which is measured in all the DNA sequences analyzed. The individual vectors corresponding to the bases on a DNA sequence are multiplied by the weighting factors according to the preceding two bases. For example, when P(A|AC), P(T|CA), P(A|AT), P(T|TA), and P(G|AT) are .20, .66, .41, .31, and .44, respectively, the weighting factors are calculated to be 2.3, 0.60, 1.3, 1.7, and 1.2, respectively. The first two of a series of vectors in Fig. 2(a) are drawn without weightings because the corresponding weighting factors can not be available. The third vector A is multiplied by 2.3 because the preceding doublet of bases is AC and, therefore, the weighting factor corresponding to P(A|AC) is chosen. The remaining vectors are similarly multiplied by the weighting factors. As a result, the graphical representation of sequence "ACATATG" is modified as shown in Fig. 2(b).

Generating a binary image
A binary image is a digitized image that has pixels of only two possible values 0 and 1, which are typically plotted in white and black, respectively. From the graphical representation of a DNA sequence, we generate a binary image by the following manner; we set value 1 for the pixels that include at least a part of a vector, and 0 otherwise, in the graphical representation ( Fig. 2(c)).

Local patterns
We define a local pattern as a bitmap of a set of adjacent pixels of a certain size of window. Because each pixel of a binary image has two pixel values, the number of local patterns is 2 n , where n is the number of pixels belonging to a window. Windows of too large size are dominated by white pixels, and, on the other hand, those of too small size can not have enough variations to express a DNA sequence. In this study, therefore, we chose the window size of 3 × 3, where the number of the local patterns is 2 9 = 512. Note that we do not include the local pattern whose pixels are all white into the local pattern histograms because it represents the empty background of the images. Fig. 3 shows the five examples of local patterns of window size 3 × 3 with their serial numbers below. We derive the serial numbers by lining up the pixels from the upper left corner to the lower right and interpreting them as a binary number with zeros and ones for white and black pixels, respectively, with the upper left corner being the highest bit.

Counting the occurrence frequencies of local patterns
Sliding a 3 × 3 window by one pixel per move over the binary image, we counted the occurrence frequencies of the local patterns. If the trajectory of the graphical representation of a DNA sequence is like a random walk, the average distance between the origin and the terminus of the trajectory would be O(L 1/2 ), where L is the sequence length. In that case, the rectangle area covering the whole trajectory is proportional to L; hence the computational time to count the occurrence frequencies would become O(L). In reality, however, the trajectory is not a random walk but a curved line of length O(L) in most cases (see Fig. 4 in Section 3), in which the computational time to count the occurrence frequencies becomes O(L 2 )-this is equivalent to that for pair-wise sequence alignment.
Accordingly we devised the method of counting the occurrence frequencies as follows. We divided the binary image into square regions of 10 × 10 pixel, and marked the regions having at least one black pixel, simultaneously when generating the binary image. Then, we counted the occurrence frequencies of the local patterns only in the marked regions. Thus we can reduce the computational time of counting to O(L), although there remains room for improvement in reducing the numerical coefficients.

Distance measures between local pattern histograms
There are several measures to estimate similarity/dissimilarity between two histograms. Among them, we chose commonly used five measures as the distance measure between the local pattern histograms, and compared them, seeking for the appropriate measure for our method. We briefly describe them below. In the following formulas, p i and q i are the occurrence frequencies of the local pattern of serial number i in histograms P and Q, respectively, and N is the largest serial number of the local patterns (i.e., N = 511 for 3 × 3 pixel local patterns). Note that, in the calculation of the distances, the occurrence frequencies are normalized to be Histogram intersection (HI) Histogram intersection was proposed by Swain et al. 41 for color indexing of images, which is defined as It ranges from 0 to 1, with 1 for P and Q being identical. To calculate distances, we convert it to D HI (P, Q) = 1 − HI (P, Q).
Manhattan distance (MD) Manhattan distance, also known as City block distance or L 1 -norm, is defined as which ranges from 0 to 2, with 0 for P and Q being identical.
Bhattacharyya distance (BD) Bhattacharyya distance 42 is defined between two probability distributions from a divergence which ranges from 0 to 1, with 1 for P and Q being identical. The Bhattacharyya distance is defined from the divergence as D BD (P, Q) = − ln BD (P, Q).
Jensen-Shannon divergence (JS) Jensen-Shannon divergence 43 is a symmetrized and smoothed version of Kullback-Leibler divergence 46 , which is defined as where M = (P + Q)/2 and KL ( · , M) is the Kullback-Leibler divergence calculated by Here, m i = (p i + q i )/2. The Jensen-Shannon divergence ranges from 0 to 1, with 0 for P and Q being identical.
Kendall's rank correlation coefficient (τ) Kendall's rank correlation coefficient 44 , also known as Kendall's τ, is defined as where X is the number of concordant i, j (i > j) pairs in which (p i − p j )(q i − q j ) > 0 is satisfied; Y is the number of the discordant pairs in which (p i − p j )(q i − q j ) < 0 is satisfied; r is the number of one kind of the tie pairs in which p i = p j and q i q j are satisfied; and s is the number of the other kind of the tie pairs in which p i p j and q i = q j are satisfied. If both p i = p j and q i = q j are satisfied, the corresponding i, j pairs are excluded from the computation. Kendall's τ lies between −1 and 1, with 1 for the rank orders of p i s and q i s being completely in agreement with each other, and with −1 for them being completely reversal with each other. We re-scaled the Kendall's τ as so that D τ (P, Q) ranges from 0 to 1, with 0 for the rank orders of P and Q being identical.

Genome sequences analyzed
We downloaded mitochondrial genome sequences of 31 mammalian species from Gen-Bank 47 and analyzed them. Table 1 summarizes the genome sequences. Mitochondrial genomes are widely used to study genome evolution and phylogenetic inference due to, for example, a high mutation rate relative to nuclear genomes and a nearly uniform size for mammalian species.

Weighting factors
We counted the number of occurrences of every tri-nucleotides in all the genome sequences listed in Table 1, sliding a window of length three by one nucleotide per move, and calculated the weighting factors by the method described in Section 2.1. The calculated weighting factors are shown in Table 2. The large (small) weighting factors indicate that the corresponding triplets rarely (often) occur in the genome sequences.

Local pattern histograms
We counted the numbers of occurrences of the local patterns for the 31 mammalian species and constructed the local pattern histograms.      Fig. 7 shows the phylogenetic trees constructed from the calculated distance matrices using Unweighted Pair Group Method with Arithmetic mean (UPGMA). The trees are drawn by statistical analysis software R 49 . The trees of the top panel, the middle one, and the bottom one are constructed based on distance measures HI and MD, BD and JS, and τ, respectively. The upper two trees seem to be well reconstructed in that primates, elephants, cats, bears, and so on, are located in their respective clades, although sheep is separated from buffalo-cow pair in the middle tree. In the bottom tree, on the other hand, some species are located on inadequate places; for instance, pig and white rhinoceros are included in primates, and leopard is separated from cat-tiger pair. Table 3 shows the Pearson's correlation coefficients calculated from the distance matrices among the five distance measures. We can find that HI-MD pair and BD-JS pair are strongly correlated with each other, which confirms the topological aspects of the resultant phylogenetic trees in Fig. 7. To evaluate our phylogenetic trees quantitatively, we measured the Robinson-Foulds (R-F) distances 50 between our trees and a reference tree (Fig. 8), which was constructed by ClustalW 51 based on the multiple sequence alignment of the mitochondrial genome sequences. The R-F distances were calculated by treedist program in Phylogeny Inference Package (PHYLIP) 52 . Table 4 summarizes the measured R-F distances. Among the five distance measures, HI and MD show the best performance.  At this point, it is worth discussing the position of hedgehog. We compared our tree constructed by HI and MD (the top panel of Fig. 7) with those given by Huang et al. 21 and Yu et al. 53 Overall configuration of our tree approximately agrees with their trees except for the position of hedgehog. Krettek et al. 54 performed a phylogenetic analysis using concatenated sequences of 13 protein-coding genes of mitochondrial genomes of nine mammals including human, harbor seal, cow, and hedgehog, and identified the position of hedgehog as basal relative to the other species included. Our result is consistent with Krettek et al. 54 rather than Huang et al. 21 and Yu et al. 53

Conclusion
In this study, we proposed a novel method for estimating similarities between DNA sequences. In this method, we express DNA sequences as binary images by replacing individual bases with 2-dimensional vectors and connecting them successively. We counted the occurrence frequencies of 3 × 3 bitmap patterns on the binary images and measured the distances between them based on the frequency histograms of the bitmap patterns. We tried five frequently used distance measures to estimate similarity/dissimilarity between two histograms: histogram intersection, Manhattan distance, Bhattacharyya distance, Jensen-Shannon divergence, and Kendall's rank correlation coefficient.
We compared our phylogenetic trees with that constructed by ClustalW for mitochondrial genomes of 31 mammalian species. Among the five distance measures, histogram intersection and Manhattan distance showed the best performance in terms of Robinson-Foulds distance between the phylogenetic trees, as well as having the more appropriate position of hedgehog on the phylogenetic tree than those given by Huang et al. 21 and Yu et al. 53 .
The most time consuming step in our method is counting the occurrence frequencies of local patterns. Its time complexity is O(L) for sequence length L, which is practical even for very long sequences, although the proportionality constant is large; it is thought to be possible to make the constant smaller to reduce the actual computation time by improving the counting method.