Inter-chromosomal k-mer distances

Inversion Symmetry is a generalization of the second Chargaff rule, stating that the count of a string of k nucleotides on a single chromosomal strand equals the count of its inverse (reverse-complement) k-mer. It holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2 Mbp to 200 Mbp. Building on this formalism we introduce the concept of k-mer distances between chromosomes. We formulate two k-mer distance measures, D1 and D2, which depend on k. D1 takes into account all k-mers (for a single k) appearing on single strands of the two compared chromosomes, whereas D2 takes into account both strands of each chromosome. Both measures reflect dissimilarities in global chromosomal structures. After defining the various distance measures and summarizing their properties, we also define proximities that rely on the existence of synteny blocks between chromosomes of different bacterial strains. Comparing pairs of strains of bacteria, we find negative correlations between synteny proximities and k-mer distances, thus establishing the meaning of the latter as measures of evolutionary distances among bacterial strains. The synteny measures we use are appropriate for closely related bacterial strains, where considerable sections of chromosomes demonstrate high direct or reversed equality. These measures are not appropriate for comparing different bacteria or eukaryotes. K-mer structural distances can be defined for all species. Because of the arbitrariness of strand choices, we employ only the D2 measure when comparing chromosomes of different species. The results for comparisons of various eukaryotes display interesting behavior which is partially consistent with conventional understanding of evolutionary genomics. In particular, we define ratios of minimal k-mer distances (KDR) between unmasked and masked chromosomes of two species, which correlate with both short and long evolutionary scales.
k-mer distances reflect dissimilarities among global chromosomal structures. They carry information which aggregates all mutations. As such they can complement traditional evolution studies, which mainly concentrate on coding regions.


Background
The phenomenon of Inversion Symmetry (IS) has recently been reevaluated and established in [1]. This generalization of the second Chargaff rule [2] implies that the number of occurrences of any sequence $m$ of length k on a chromosomal strand S is equal to the number of occurrences of its inverse (reverse-complement) sequence $m_{inv}$ on the same strand. Another way of stating the same fact is that the number of occurrences of $m$ on one chromosomal strand is equal to the number of occurrences of $m$ on the other strand provided both are being read along their own 5′ to 3′ directions.
The accuracy of such statements depends on the length k of the nucleotide sequences which are being employed. It turns out to have a monotonic dependence on k, i.e. as k increases the symmetry worsens. If one sets the required accuracy at 10% one finds [1] that it holds for k ≤ KL where KL grows logarithmically with the length L of the chromosome. KL values for mammals are 9 or 10, while for bacteria they are 7 or 8. These choices of KL guarantee that all possible k-mers of a particular k-value will be found on the chromosome in question.
Inversion symmetry can be restated as the demonstration of a low k-mer distance between the two strands of the same chromosome [3], with exact symmetry implying zero distance. The notion of k-mer distances between different chromosomes, within and between species, is a simple extension of the same basic idea: comparing frequencies of all strings of nucleotides of the same length k on different chromosomes, summing over one or over both strands of each chromosome.
Short k-mer distances can be interpreted as large structural similarities between chromosomes. In bacteria we establish correlations of short k-mer distances between bacterial strains with large synteny proximities. Both concepts are explained in the Methods section. For bacterial strains, they also serve as good measures of evolutionary distances.
The synteny proximities which we employ are valid measures between bacterial strains which are very close evolutionary relatives. Otherwise one cannot find large genomic sections with high identities among them. Conventional synteny measures which are used in genomic evolutionary studies [4] are therefore very different from our synteny proximities, and are mostly concentrated on coding regions. k-mer distances, which are global measures, can be used to compare any two chromosomes. When studying eukaryotes, the compared chromosomes are dominated by non-coding regions. Comparing minimal k-mer distances between various genomes, we find interesting results. In particular, ratios of unmasked to masked minimal genome distances correlate with evolutionary distances among different species.

Definitions and properties of k-mer distances between chromosomes
The term k-mer refers (in the genomic context) to all possible nucleotide substrings of length k that are contained in a given chromosomal strand of length L, uncovered by a sliding-window search. The total number of their occurrences is $N = L - k + 1$. We define the empirical frequency of a specific k-mer, e.g. $m_1$, in the strand S as the number of occurrences $n_S(m_1)$ of this k-mer in S divided by N:
\[ f_S(m_1) = \frac{n_S(m_1)}{N}. \tag{1} \]
Let us define the k-mer distance $D_1$ as the L1-norm of the difference between $4^k$-dimensional vectors containing frequencies of all k-mers, when comparing two chromosomal strands (e.g. positive strands of two chromosomes) $S_1$ and $S_2$:
\[ D_1^k(S_1, S_2) = \sum_{i=1}^{4^k} \left| f_{S_1}(m_i) - f_{S_2}(m_i) \right|. \tag{2} \]
The index 1 in $D_1$ refers to the fact that we use only one strand on each chromosome in this comparison of two chromosomes.
Similarly, we may define a distance measure $D_2$ by taking into account both strands of the two chromosomes, reading them along their own 5′ to 3′ directions. Since each specific k-mer on the negative strand is accompanied by its inverse (reverse-complement) on the positive strand, we may define $D_2$ as
\[ D_2^k(S_1, S_2) = \frac{1}{2} \sum_{i=1}^{4^k} \left| f_{S_1}(m_i) + f_{S_1}(m_i^{inv}) - f_{S_2}(m_i) - f_{S_2}(m_i^{inv}) \right|, \tag{3} \]
where we use a single strand on each chromosome, define for every k-mer $m_i$ its inverse (reverse-complement) $m_i^{inv}$, and sum over all k-mers along a single strand of each of the two chromosomes. Division by 2 is introduced in the definition of $D_2$ because the effective number of counts on each chromosome becomes 2N.
The triangular inequality implies that
\[ \left| f_{S_1}(m_i) + f_{S_1}(m_i^{inv}) - f_{S_2}(m_i) - f_{S_2}(m_i^{inv}) \right| \le \left| f_{S_1}(m_i) - f_{S_2}(m_i) \right| + \left| f_{S_1}(m_i^{inv}) - f_{S_2}(m_i^{inv}) \right| \tag{4} \]
for every single k-mer. Since the map from $m_i$ to $m_i^{inv}$ is one-to-one on the set of k-mers, each of the two terms on the right-hand side sums to $D_1^k(S_1, S_2)$. It follows then that
\[ D_2^k(S_1, S_2) \le D_1^k(S_1, S_2). \tag{5} \]
Using the above definitions we summarize the properties of k-mer distances:
1. Positivity. By definition all distances are nonnegative.
2. If $D_{1,2}^k(S_1, S_2) = 0$ then $S_1$ and $S_2$ are equivalent, in the sense that both chromosomes have the same frequencies of all k-mers. This does not necessarily imply that the two chromosomes are equal to each other, because they may differ in length.
3. Symmetry. By definition, $D_{1,2}^k(S_1, S_2) = D_{1,2}^k(S_2, S_1)$.
4. Inequality: $D_2^k(S_1, S_2) \le D_1^k(S_1, S_2)$, as proved above in Eq. 5.
5. Triangular inequalities of distances: $D_{1,2}^k(S_1, S_3) \le D_{1,2}^k(S_1, S_2) + D_{1,2}^k(S_2, S_3)$. This can be proved in an analogous fashion to property 4.
6. Inversion symmetry [1] implies that $D_1^k(S_1, S_2) = 0$ if $S_2$ is the inverse of $S_1$ (or equivalent to it in the sense of property 2). Otherwise this distance will be positive. Such a definition of inversion symmetry has been introduced by [3]. $D_2^k(S_1, S_2) = 0$ is a trivial statement for two strands which are inverses of each other.
7. Monotonic increase with k: $D_1^{k-1}(S_1, S_2) \le D_1^k(S_1, S_2)$. To prove this property note that each (k-1)-mer $m_j^{k-1}$ is the prefix of four k-mers $m_i^k$, so that, neglecting end effects of order 1/N, $f_S(m_j^{k-1}) = \sum_{i \in \{j\}} f_S(m_i^k)$, where $\{j\}$ denotes the set of four indices i associated with j. The inequality follows by applying the extended triangular inequality to each set of four $f_i$ whose k-mers $m_i^k$ begin with the same (k-1)-mer $m_j^{k-1}$ with index j, and summing over the indices using the {j,i} association. This proof can be trivially extended to $D_2$. One condition for these inequalities to hold is that all k-mers are realized on the chromosomal strands which are being investigated, i.e. all $n(m_i^k) > 0$.
Finally we touch upon the question of the range of k-values for which the distance measures can be applied.
Shporer et al. [1] have introduced the notion of the KL limit. This is the k-value for which Inversion Symmetry fails at the rate of 10%. They demonstrated that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic slope of KL ~ 0.7 ln(L), where L is the length of the chromosome. This limit can also be derived from the assumption that $L \gg 4^k$, allowing for all k-mers to be expressed on the chromosome. As an example of relevant statistics we display in Fig. 1 the percentage of missing k-mers, i.e. those which do not appear on the strand, and the distance between two close strains of E. coli as function of k, demonstrating that good results are obtained for k ≤ KL = 7.
When evaluating distances between two chromosomal strands with different lengths, $L_1$ and $L_2$, one should limit oneself to KL where $L = \min(L_1, L_2)$, guaranteeing that the same k is valid for both chromosomal strands which are being compared.
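As a rough consistency check on the logarithmic slope quoted above, the requirement that all $4^k$ possible k-mers have room to occur on a strand of length L can be turned into a bound on k. The short derivation below is only a sketch: the margin $c \gg 1$ is a hypothetical constant introduced here to express "much larger than", and it is not taken from [1].

```latex
\[
  4^k \le \frac{L}{c}
  \;\Longrightarrow\;
  k \le \frac{\ln L - \ln c}{\ln 4}
  \approx 0.72\,\ln L - 0.72\,\ln c .
\]
```

The coefficient $1/\ln 4 \approx 0.72$ is close to the empirical slope of 0.7, while the offset depends on the chosen margin, so this argument constrains the slope rather than the absolute KL values.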
We provide a Python program for calculating k-mer distances between two chromosomes, given as FASTA files, at https://github.com/akafri/k-mer-distances.
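To make the definitions of this section concrete, the following is a minimal sketch of the two distance measures, written independently of the program in the repository above and not meant to reproduce it. The function names are hypothetical, sequences are passed as plain strings (FASTA parsing is left out), and windows containing characters other than A, C, G, T are skipped, which is an assumption about how ambiguous bases might be handled.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
NUCLEOTIDES = set("ACGT")


def reverse_complement(seq):
    """Return the inverse (reverse-complement) of a nucleotide string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmer_frequencies(strand, k):
    """Empirical k-mer frequencies on one strand, via a sliding window."""
    strand = strand.upper()
    counts = Counter(strand[i:i + k] for i in range(len(strand) - k + 1))
    # Assumption: windows with ambiguous bases (e.g. N) are dropped, so the
    # normalisation is the number of valid windows rather than L - k + 1.
    counts = {m: c for m, c in counts.items() if NUCLEOTIDES.issuperset(m)}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("strand contains no valid k-mers for this k")
    return {m: c / total for m, c in counts.items()}


def d1(strand1, strand2, k):
    """L1 distance between single-strand k-mer frequency vectors."""
    f1, f2 = kmer_frequencies(strand1, k), kmer_frequencies(strand2, k)
    kmers = f1.keys() | f2.keys()  # absent k-mers have frequency 0
    return sum(abs(f1.get(m, 0.0) - f2.get(m, 0.0)) for m in kmers)


def d2(strand1, strand2, k):
    """Two-strand distance: each k-mer is pooled with its inverse."""
    f1, f2 = kmer_frequencies(strand1, k), kmer_frequencies(strand2, k)
    kmers = set(f1) | set(f2)
    kmers |= {reverse_complement(m) for m in kmers}
    total = 0.0
    for m in kmers:
        inv = reverse_complement(m)
        total += abs(f1.get(m, 0.0) + f1.get(inv, 0.0)
                     - f2.get(m, 0.0) - f2.get(inv, 0.0))
    return total / 2  # both strands contribute, hence the factor 1/2
```

With these helpers, `d1(s, reverse_complement(s), k)` measures how well a strand `s` obeys inversion symmetry, while `d2(s, reverse_complement(s), k)` vanishes by construction. For chromosome-scale inputs a vectorised k-mer counter would be preferable to this dictionary-based version.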

Definition of synteny distances
Synteny blocks are genetic sequences in genomes of two species which consist of aligned homologous genes. A recent example of their importance was demonstrated by [5,6]. Here we introduce definitions of synteny distances, which will be used to compare with k-mer distances. This comparison will be carried out using different strains of the same bacterium, where large synteny blocks with identity percentages higher than 90% exist. The threshold of 90% is arbitrary. It was chosen to guarantee high similarity between the relevant chromosomes. For bacteria, where the selection of a positive strand is well defined, we differentiate between Direct Synteny Blocks (DSB), appearing along the same strand in both genomes, and Inverse Synteny Blocks (ISB), lying on opposite strands. An example is shown in Fig. 2.
To search for synteny blocks, BLAST was first used to identify local alignments between the two full sequences. The R package OmicCircos [7] was used to visualize results. From the BLAST output, we extract synteny blocks that have identity percentages higher than 90%, and calculate the overall sequence lengths of DSB and ISB ($L_{DSB}$ and $L_{ISB}$, respectively). We then define the direct synteny proximity $P_{DSYN}$, based on $L_{DSB}$, and the overall synteny proximity $P_{SYN}$, based on both $L_{DSB}$ and $L_{ISB}$, each expressed relative to the lengths $L_1$ and $L_2$ of the chromosomes $S_1$ and $S_2$ which are being compared.
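The block-length bookkeeping described above can be sketched as follows. The sketch assumes BLAST tabular output (`-outfmt 6` with its default twelve columns), in which a hit on the opposite strand is reported with subject start greater than subject end. The function name is hypothetical, overlapping alignments are not merged, and the normalisation by the mean chromosome length is an assumption made only for illustration, since the exact normalisation is not restated here.

```python
def synteny_proximities(blast_rows, len1, len2, min_identity=90.0):
    """Sum direct and inverse block lengths from BLAST outfmt-6 rows.

    blast_rows: iterable of tab-separated lines with the default columns
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    Returns (direct proximity, overall proximity).
    """
    l_dsb = 0  # total length of direct synteny blocks
    l_isb = 0  # total length of inverse synteny blocks
    for row in blast_rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.rstrip("\n").split("\t")
        identity = float(fields[2])
        length = int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if identity <= min_identity:
            continue  # keep only blocks above the identity threshold
        if sstart <= send:
            l_dsb += length  # same strand in both genomes
        else:
            l_isb += length  # subject coordinates reversed: opposite strand
    # ASSUMPTION: normalise by the mean of the two chromosome lengths.
    norm = (len1 + len2) / 2
    return l_dsb / norm, (l_dsb + l_isb) / norm
```

Because overlapping alignments are counted more than once in this sketch, the returned values can exceed 1 for highly repetitive inputs; a production version would first merge overlapping intervals.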

The matched-pair algorithm for k-mer distances between two species
To define distances between two eukaryote genomes we started by evaluating a distance matrix between all chromosomes of the two species. We then constructed a graph whose vertices are the chromosomes of the two species and its edges (lines connecting the vertices) represent the distance value of each pair. We proceeded along the following algorithmic steps:
1. Eliminate edges with distances > 1 from the graph.
2. Define an empty distance vector.
3. Find the edge of the graph with the lowest distance value.
4. Add this value as an entry to the distance vector.
5. Remove this edge from the graph and repeat from step 3 until the graph is exhausted.
6. Inspect the resulting distance vector and report its minimum (the first edge considered by the matching algorithm) and its median.
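The steps above can be sketched in a few lines. The function names and the dictionary input format are hypothetical. The name "matched-pair" suggests that both chromosomes of a selected edge are retired together with it, which is how the sketch behaves by default; this is an assumption, and passing `match_vertices=False` follows the steps literally by removing only the selected edge.

```python
from statistics import median


def matched_pair_distances(dist, max_distance=1.0, match_vertices=True):
    """Greedy selection of inter-species chromosome pairs by distance.

    dist maps (chromosome_of_species_a, chromosome_of_species_b) to the
    k-mer distance of that pair.
    """
    # Step 1: drop edges with distance > max_distance.
    # Sorting ascending is equivalent to repeatedly picking the lowest edge.
    edges = sorted((d, a, b) for (a, b), d in dist.items()
                   if d <= max_distance)
    used_a, used_b = set(), set()
    vector = []  # step 2: empty distance vector
    for d, a, b in edges:  # steps 3 and 5
        # ASSUMPTION: a chromosome already matched is not reused.
        if match_vertices and (a in used_a or b in used_b):
            continue
        vector.append(d)  # step 4
        used_a.add(a)
        used_b.add(b)
    return vector


def summarize(vector):
    """Step 6: minimum and median of the distance vector, or None if empty."""
    if not vector:
        return None
    return min(vector), median(vector)
```

Under the default setting the result is a greedy one-to-one matching of chromosomes, so the median describes how close the typical matched pair is, rather than the typical pair overall.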

Distance measures in bacteria
We compared genomes of 23 strains of E. coli and 14 strains of Salmonella enterica. They are listed in Tables 1 and 2. In Fig. 3 we present correlations of $P_{DSYN}$ with $D_1$ for (a) E. coli and for (b) S. enterica strains. In each of the two data sets we have looked into all pairs of strains. The data are presented for k = 7. We report only results between strains of the same bacterium since no significant correlation was found between any two strains of the two different bacteria. The larger number of E. coli strain pairs leads to a clearer observation of the correlations.
Next we turn to correlations of overall synteny with $D_2^{k=7}$. This is presented in Fig. 4. Once again we note the strong correlations in the data. The strong negative correlation is particularly significant for the E. coli strains where we have many more pairs of strains which can be compared with one another. Hence we limit our further analysis to just E. coli strains.
Table 3 Minimal and median $D_2^{k=8}$ distances between six genomes belonging to different mammals, for unmasked versions of the genomes. See Methods for definition of the computational procedure.
In order to appreciate the variation with k we display in Fig We find different correlations of the two measures with $P_{DSYN}$. Whereas $D_1$ displays the expected negative correlation, for all relevant k, $D_2$ is less sensitive to the direct synteny measure. This may be expected since $D_2$ is a measure sensitive to both strands whereas $P_{DSYN}$ is sensitive to only one strand in each chromosome.
In order to appreciate this result let us dwell on the question of why inversion symmetry [1] holds up to large k-values of order KL. The plausible explanation is that genomes evolve through rearrangement processes. These rearrangements are inversions of sections between two breakpoints on the same chromosome. They may follow one another in a nested fashion. This scenario can explain the observed inversion symmetry, as demonstrated in [1]. Pevzner and Tesler [5] have argued that such phenomena are the basis of chromosomal evolution for single chromosomes and, with lower probability, also between different chromosomes. Here we observed that $D_1$ between two strains of bacteria correlates strongly with both $P_{DSYN}$ and $P_{SYN}$ for all k ≤ 7, both reflecting chromosomal evolution at the short evolutionary scale appropriate to different strains of the same bacteria.

Distance measures between different species
In the previous section we have analyzed k-mer distances between closely related bacterial strains, where the synteny distances that we have defined can be easily observed. When evolutionary genomics is applied to different eukaryotes one often limits oneself to similarity between homologous proteins rather than accurate duplications or inversions of large sections of the DNA. The use of k-mer distances can indicate similarities between full chromosomes, which is the study we propose. From Inversion Symmetry we learn the powerful effect of rearrangement within a single chromosome. Rearrangements may also occur between chromosomes and k-mer distances reflect their effects.
Evaluating minimal $D_2$ distances according to the matched-pair algorithm (see Methods) we obtain the results displayed in Tables 3 and 4. The genome inputs, both unmasked (Table 3) and masked (Table 4), are taken from the UCSC server (see data supplementary file). Clearly, there is quite a difference between the two choices: masking reduces the distance values considerably. We use k = 8, which is a choice appropriate for all displayed species in Tables 3, 4, 5 and 6. There are several striking results in Tables 3 and 4. One important result is the closeness of minimal and median distance values. This implies that similar k-mer distances are observed for many chromosomal pairs of the two genomes, and are not limited to a single particular pair of chromosomes. In other words, homology spreads out between different chromosomal sections of the two compared species.
Table 4 Minimal and median $D_2^{k=8}$ distances between masked genomes of different mammals. See Methods for definition of the computational procedure.
Table 5 Ratio of unmasked to masked minimal $D_2^{k=8}$ distances. The ratios among primates and rodents are correlated with evolutionary time estimates (http://www.timetree.org/), but this is not true for the rest of this table.
Another important result is the huge difference between minimal k-mer distances of unmasked and masked genomes. Conventional understanding regards the low complexity components of the unmasked regions as unprotected by evolution. Hence ratios of unmasked to masked minimal $D_2^{k=8}$ distances measure the aggregated effect of different strengths of mutations when the low complexity sections of genomes are taken into account.
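Written out explicitly, the ratio just described can be expressed as below. The species labels A and B and the subscripts marking the unmasked and masked genome versions are notation introduced here for illustration; the minimum is the one reported by the matched-pair algorithm.

```latex
\[
  \mathrm{KDR}(A, B) =
  \frac{\min D_2^{k=8}\left(A_{\mathrm{unmasked}}, B_{\mathrm{unmasked}}\right)}
       {\min D_2^{k=8}\left(A_{\mathrm{masked}}, B_{\mathrm{masked}}\right)}
\]
```

Since masking reduces the distance values, the denominator is the smaller of the two quantities and the ratio is expected to exceed 1.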
The results for these ratios are presented in Table 5. They seem to be correlated with evolutionary time lapses among primates and rodents, where the separation between human and chimpanzee is dated at 6.7 MYA (million years ago), between mouse and rat at 20 MYA and between rodents and primates at 90 MYA. However, the correlation breaks down when these four species are compared with dog and cow. The separation age between the primates and dog and cow is estimated at 96 MYA, and between dog and cow at 78 MYA. All the evolutionary estimates are derived from the time-tree website (http://www.timetree.org/).
A major tool employed in genomic evolutionary studies is Reversal (or inversal) Distance (RD) [5,6]. Concentrating on the orders and details of genes or other markers, the idea is to work out how many inversions take place along the evolutionary path from one species to another. RD is the minimum number of reversals required to transform one genome into the other. The web-tool of (http://www.timetree.org/) can be used to evaluate such distances. They fit the evolutionary time estimates much better, which is somewhat tautological because the estimates of (http://www.timetree.org/) take the RD methodology into account. However, RD is problematic when very large evolutionary distances are concerned, because of the shortage of genes which can be compared between distant organisms. K-mer distances are not subject to such constraints. Hence they can be applied to such problems. In Table 6 we compare human with the nematode (C. elegans) and the fruit fly (D. melanogaster), using the same methods as in Table 5. Obviously these results are satisfactory.
Interestingly, k-mer distances are immune to large inversion events. In fact, this was the reason we used them to begin with, starting with the lessons drawn from Inversion Symmetry of chromosomes. On the other hand, k-mer distances are sensitive to all other mutations that occur along an evolutionary path. In this sense, K-mer minimal Distance Ratios among genomes (KDR) can serve as a complement to RD. Moreover, KDR is applicable to all eukaryotes.
The full potential of KDR has still to be investigated and explained. Evolutionary genomic tools deal extensively with substitution rates, in particular the nonsynonymous ones affecting amino-acid changes in proteins. The analogous investigation of substitution rates in low-complexity and high-complexity genomic regions is needed to explain how KDR, or the various minimal or median k-mer distances among genomes, can be used for meaningful evolutionary conclusions.

Conclusions
We have introduced measures of k-mer distances, and applied them to bacteria and to eukaryotes. The two measures $D_1$ and $D_2$ were compared to synteny measures in bacteria, tracing large identical sections of chromosomes between two strains of the same species. We identified a strong correlation between $D_1$ and direct syntenic regions and a strong correlation between $D_2$ and both direct and inverse syntenies, which indicates evolutionary similarity between two strains. We argue therefore that k-mer distances are validated as good measures for evolutionary distances within bacteria. $D_2$ measures are also adequate for estimating distances between any two genomes which may have very ancient common ancestors. We exemplify this fact by demonstrating such distance measures between several eukaryotes. We find that there exists considerable difference between masked and unmasked distances, as expected from common evolutionary understanding of rapid variation in low complexity regions, being less protected by evolution. Moreover, we exploit this difference to establish minimal K-mer Distance Ratios (KDR), which correlate with evolutionary time scales of primates and rodents, as well as very large time scales such as between human, nematode and fruit fly.
Table 6 Unmasked and masked minimal $D_2^{k=8}$, their ratios, defined as KDRs, and the separation age estimates derived from (http://www.timetree.org/).
Whereas conventional evolutionary studies continue to use traditional methods following changes within and throughout homologous genes, our k-mer distances take into account the full chromosomes, involving both coding and non-coding sections. As such, they carry novel information which complements traditional investigations.