Inversion symmetry of DNA k-mer counts: validity and deviations

The generalization of the second Chargaff rule states that counts of any string of nucleotides of length k on a single chromosomal strand equal the counts of its inverse (reverse-complement) k-mer. This Inversion Symmetry (IS) holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2Mbp to 200 Mbp. The existence of IS has been demonstrated in the literature, and other pair-wise candidate symmetries (e.g. reverse or complement) have been ruled out. Studying IS in the human genome, we find that IS holds up to k = 10. It holds for complete chromosomes, also after applying the low complexity mask. We introduce a numerical IS criterion, and define the k-limit, KL, as the highest k for which this criterion is valid. We demonstrate that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic dependence of KL ~ 0.7 ln(L), where L is the length of the chromosome. We introduce a statistical IS-Poisson model that allows us to apply confidence measures to our numerical findings. We find good agreement for large k, where the variance of the Poisson distribution determines the outcome of the analysis. This model predicts the observed logarithmic increase of KL with length. The model allows us to conclude that for low k, e.g. k = 1 where IS becomes the 2nd Chargaff rule, IS violation, although extremely small, is significant. Studying this violation we come up with an unexpected observation for human chromosomes, finding a meaningful correlation with the excess of genes on particular strands. Our IS-Poisson model agrees well with genomic data, and accounts for the universal behavior of k-limits. For low k we point out minute, yet significant, deviations from the model, including excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes. Interestingly, this correlates with a significant (but small) excess of genes on the same positive strands.


Background
Erwin Chargaff has made, in 1950, the important observation that the numbers of nucleotides in DNA satisfy #A = #T and #G = #C [1,2]. This statement, made on the basis of experimental observations with fairly large errors, played a crucial role in realizing that DNA has an underlying base-pair grouping, as subsequently proposed by Crick and Watson [3] in their double-helix structure.
The second Chargaff rule [4] states that the same sets of identities of nucleotide pairs hold for each long enough single DNA strand. This rule has been tested [5] for genome assemblies of many species, and found to be globally valid for eukaryotic chromosomes, as well as for bacterial and archaeal chromosomes. It fails for mitochondria, plasmids, single-stranded DNA viruses and RNA viruses.
The validity of the second Chargaff rule was unexpected. Obviously it should be regarded as a global rule, i.e. applicable to large sections of chromosomes. Nonetheless, not being derived from a compelling principle, such as the one underlying the first rule, it remains a mystery. This is even more so, when one studies extended versions of Chargaff 's second rule. Indeed, Albrecht-Buehler [6] observed that for triplet oligonucleotides, or 3-mers, it remains true that their chromosomewide frequencies are almost equal to those of their reverse-complement 3-mers. Prabhu [7] has shown that this symmetry holds up to 5-mers in various species. This has been reviewed by Baldi and Brunak [8] who have argued that such symmetry rules have to be incorporated in Markov models of genomic sequences.
We refer to the symmetry between counts of k-mers and their reverse complements as Inversion Symmetry (IS): the counts of a k-mer of nucleotides on a chromosomal strand are almost equal to those of its inverse (reverse-complement) string.
Note that this implies that the number of times a string of nucleotides of length k is observed on a strand, when read from 5' to 3' , is almost equal to the number of times it is observed on the other strand when the latter is read from its 5' end to 3' end.
Recent analyses of inversion symmetry include the following: Qi and Cuttichia [9] who have shown that inversion symmetry exists while reverse symmetry fails, i.e. k-mers and their reverses do not appear with equal rates; Baisnee, Hampson and Baldi [10], who introduced a measure S1 to analyze inversion symmetry in a systematic fashion; Kong et al. [11], who established the validity of IS on 786 chromosomes of many species and showed that reverse or complement symmetry do not hold, and argued that IS may be due to segmental or wholegenome inverse duplications; Wang et al. [12] who argued that values of k for which k-mer IS is valid increase with organismal complexity; and Afreixo et al. [13] who applied various criteria to demonstrate the statistical significance of IS up to k = 10. Studies of symmetries related to IS appear in [14,15].
We introduce an IS measure which is different from S1 of [10], albeit the numerical results of both measures are correlated (see section 4 in Methods). Our measure is based on the ratio between differences of counts of inverse k-mer pairs and their sum. We propose the criterion that if the average of this normalized measure (over all strings of length k) is less than about 0.1, IS will be regarded as a valid approximate symmetry. The average is taken over all M k ≤ 4 k strings of length k which exist at least once on the chromosome. The value of k, for which the IS measure is closest to 0.1, is defined as the k-limit (KL). This turns out to be KL = 10 on long human chromosomes (see Additional file 1), and KL = 7 or KL = 8 for bacteria.
Using this measure, one can readily demonstrate the existence of inversion symmetry, and the absence of analog symmetries between reverse pairs or complement pairs, as well as compare between different species. We will show that the k-limit of inversion symmetry, KL, is logarithmically dependent on the length L of the chromosome, or of a chromosomal section on which it is measured. Moreover, this dependence is universal, i.e. it is valid for most species.
To analyze all these observations on a rigorous statistical basis, we introduce a Poisson model for the random occurrence of counts, regarding N(S) as a stochastic variable for any string S of length k. We define X(S,S * ) = |N(S)-N(S * )|/(N(S) + N(S * )), which is a stochastic variable having positive values 0 ≤ X ≤ 1. In general S * is some permutation of S over the set of all strings of length k. When S * ≡ S inv , i.e. where S * is the inverse of S, IS implies that X < <1. For strict IS, X = 0. In practice we may observe small deviations when checking for its realization on a chromosome. The important question we address is whether these deviations mean that the IS rule is not valid, or that the data are consistent with IS yet the observed values of X reflect statistical fluctuations.
To answer this question we introduce the stochastic variable Z(S,S * ) = (N(S)-N(S * ))/(N(S) + N(S * )) ½ . The symmetry assumption means that N(S) = N(S*), i.e. these two stochastic variables have the same distribution. If we further assume that N(S)~Poisson then Z should be approximately distributed as a standard normal (see Methods).
We will demonstrate on genomic data that for inversion symmetry we empirically observe that Z~Standard Normal, but for other pairings of S and S* (e.g. reverse or complement) it is not. Continuing with the analysis of IS we show that For small k, E k [X] is extremely small, yet E k [|Z|] = E k [X/σ X ] > 2, where σ X is the (theoretically estimated) standard deviation of X(S,S inv ). Therefore we conclude that there exists a systematic small breaking of IS, observed for k < 4 on human chromosomes. For large k (k > 5 on human chromosomes) E k [|Z|] = E k [X/σ X ] < 1 hence, due to the large variance, one may state that the observed X values are consistent with IS. Moreover, the data are consistent with ZS tandard Normal for large k.
The empirical values of X(S,S inv ) for large k are of the order of magnitude of (N(S) + N(S inv )) -½ . The logarithmic variation of the k-limit, KL, as function of chromosomal length L, is correctly predicted by our IS-Poisson model. We use the italicized notation N, X, Z, for the stochastic variables of our model, and employ N, X, Z for their empirical counts on chromosomes. KL is defined as the value of k for which E k [X] is closest to 0.1.

Inversion symmetry (Generalized 2 nd Chargaff Rule)
Let S and S * be two strings of nucleotides of same length k, i.e. two k-mers. Suppose they appear N(S) and N(S * ) times respectively on a particular chromosome. We denote by X(S,S * ) the normalized difference X(S,S * ) = |N(S)-N(S * )|/(N(S) + N(S * )) where S is one of the M k different k-mers over the 4 nucleotides, which are being counted on the chromosome at least once, i.e. N(S) > 0 and/or N(S * ) > 0. If both N(S) = N(S * ) = 0, X(S,S * ) is defined to be 0. In general 0 ≤ X(S,S * ) ≤ 1.
We use E k [X] = ∑ S X(S,S * )/M k , where M k is the number of different k-mers encountered empirically, as a measure to demonstrate and quantify the studied symmetry. For low and moderate k, we find that M k = 4 k , but for large k-values, such as k > 10 in the human genome, many of the k-mers may not be realized empirically, leading to lower M k . In the following we will look at values of X(S,S * ) over various possible choices of string pairs S and S * , and demonstrate that for inverse pairs they are distributed differently than for other types of kmer pairs. Let us start by computing E k [X] for inverse pairs (i.e. S and S * ≡ S inv are reverse-complements of each other) for different k, on various chromosomes of the human genome assembly HG38. Data were downloaded from the UCSC genome browser http://genome.uscs.edu. The values of E k [X] for several human chromosomes are displayed as function of k in Fig. 1. Inversion Symmetry (IS) is seen to hold quite well for k-mers with large k-values for all the displayed chromosomes. Chr Y, which is the shortest among the 24 chromosomes, has the least inversion symmetry. IS holds also for all other chromosomes (Additional file 1). It fails for the mitochondrial chromosome, which is a wellknown exception to the 2 nd Chargaff rule.
Repetitive structures are well-known to constitute major fractions of eukaryotic chromosomes, hence one may wonder to what extent they are responsible for the observed inversion symmetry. To resolve this issue, we employed the same operations on the masked output of the UCSC genome browser, after filtering chromosomes for interspersed repeats and low complexity sequences. The results keep displaying the same behavior, with negligible differences for high values of k. Even chrY, which is well known for containing numerous repeats, with only 36 % of it surviving the masking filter, keeps showing the same qualitative behavior as in Fig. 1. In Additional file 1 we provide a list of the highest k-values for which E k [X] ≈ 0.1, both before and after masking (which removes repetitive and low-complexity stretches of the chromosome). We define the k-limit (KL) of IS, as the value of k for which E k [X] is closest to 0.1. The observed reduction in KL from 10 to 9 for the largest chromosomes, is due to the fact that masking shortens the effective chromosome length. The dependence of KL on length is an issue to which we will return below.
We have performed the same analysis on the older genome assembly HG18, leading to very similar results (see Additional file 2). We find similar IS results for mouse, frog, fly, worm, and yeast. Moreover, we find that inversion symmetry holds also for bacteria, but it is valid for a lower range of k-mers, only up to KL = 6 or 7.

Outstanding features of inverse k-mer pairs
In order to demonstrate how Inversion Symmetry, observed for frequencies of inverse pairs, differs from other natural pairings, we compare different choices of pairings of k-mers, We have evaluated histograms of X(S,S * ) = |N(S)-N(S * )|/(N(S) + N(S * )) for all pairings, and computed their averages E k [X] = 4 -k ∑ S X(S,S * ) for different k. Calculations were performed both for human chromosomes as well as for many other species. Figure 2a depicts the distributions of X values for inverse pairs on human chr 1 of HG38, evaluated for k = 4 to 10. These distributions are very narrow, leading to very low E k [X] values, consistent with the results displayed in Fig. 1. As k increases they widen, leading to increasing E k [X] values, which will be discussed below and are quoted in Table 1. In Fig. 2b and c we plot the corresponding distributions for the cases of random pairs (b), where for each S a random choice of S * is being made, without repetition, and reverse pairs (c) on chr 1. Distributions of complement pairs (d) are identical to those of reverse pairs and are therefore not displayed as an additional figure. Note that the distributions in 2b and 2c are completely different from 2a: they possess a rugged wavy behavior, stretching over the whole range of 0 < X < 1. Since k-mer distributions on the human genome are known to be different for strings containg CG dimers [16], we studied the same problems removing all such k-mers. It turns out that, for the resulting k-mer strings, the second peak in (b) and (c) disappears. But, even then, cases b and c continue to be very different from case a, displaying long tail distributions. Such characteristic differences occur also for all other species that we have tested, and also for masked chromosomes in human.
We can further use these distributions to establish that a symmetry relation holds only for inverse pairs, leading to very low E k [X] values, and not for any other pairing. Table 1 lists the values of μ k = E k [X] for the three cases a, b and c, making it quite evident that IS holds and other symmetries do not. We will not dwell on it further, since Kong et al. [11] have already established the validity of IS (albeit using different measures) on 786 chromosomes of various species, and showed that reverse or complement symmetries do not hold.

Statistical analysis of inversion symmetry
In the Methods section we point out that, for large enough counts, if the counts N(S) and N(S*) are drawn from the same distribution, then the variable Y = (N(S) -N(S*))/(N(S) + N(S*)) should have an approximately Gaussian distribution with mean 0 and standard deviation σ G. Moreover, the distribution of X = |Y| will have an expectation value E k [X] = 0.8 σ G and standard deviation σ X =0.6 σ G . If the counts N(S) are drawn from a Poisson model, we expect for each pair to find σ G = (N(S) + N(S*)) -½ . Hence Z = (N(S) -N(S*))/(N(S) + N(S*)) ½ should follow a standard normal distribution, i.e. a Gaussian with mean = 0 and variance = 1. Hence the IS-Poisson model predicts E k (|Z|) = 0.8 and σ k (|Z|) = 0.6, when rounded up to first decimal point.
We have tested this model by evaluating results for inverse pairs of k-mers on chr1 of HG38. The results are displayed in Table 2.
For low k-values, where E k [|Z|] = E k [X/σ X ] ≥ 2, one may say that a mathematical hypothesis of strict IS is invalid, since the peak of the Z-distribution lies outside the allowed confidence interval. On the other hand, clearly for all k < 4, E k [X] < <0.01. Although the violation of IS is very small numerically, it is still statistically significant.
For large k > 5 we see that the data tend toward the prediction of our IS-Poisson model, approaching the limit of E k (|Z|) ± σ k (|Z|) = 0.8 ± 0.6. This means that, due to the large variance, arising from relatively small values of N(S) + N(S inv ), the mathematical IS hypothesis cannot be refuted. It also means that variance plays a dominant role leading to the observed values of X(S,S inv ) which are of the order of magnitude of (N(S) + N(S inv )) -½ . This implies that we should be able to deduce the behavior of the k-limit, which is indeed the case as will be shown below.
To get a visual confirmation of the Gaussian nature of the Z-distribution we plot in Fig. 3a the results for k = 8 on human chr 1. The ensemble of Z-values contributed by all k-mers makes up the Gaussian distribution which is displayed here. The variance calculated from these data is 1.27, quite close to the value 1 expected from a standard normal distibution. For comparison, we display in Fig. 3b the analogous distributon of reverse pairs, which has variance of 1600. Note the different scales and shapes, which reflect the large difference between inverse and reverse distributions. The complement-pair distribution (not shown here) is essentially identical to the reverse one.
Considering the peak at Z = 0 displayed in Fig. 3b, it is important to note that there exists a subset of k-mers which obey S = S rev , i.e. they are palindroms. They will contribute to the peak at Z = 0, with small variations of palindroms contributing to the region around this peak. Nonetheless, their numbers are small compared to all 8mers: only 3224 out of 65536 8-mers lie within |Z| < 1.65, which is where 90 % of a standard normal distribution are expected to reside. Hence the variance of the reverse-pair Z-ditribution is very large.

Inversion symmetry for chromosomal sections
We next test to what degree IS is valid within various sections of human chromosomes. In Additional file 3, we display a characteristic distribution of inverse pairs drawn from a section of length 10Mbp, and in Additional file 4 we show an analogous distribution for length of 1Mbp. The IS quality, as determined by our convention, deteriorates leading to lower k-limits as the length of the section decreases, but it remains valid. The distributions in Additional file 4 are evidently noisier than their analogs in Additional file 3; however they are much narrower than those of the reverse and random pairs (not shown here).
To study systematically different sections of chromosomes, we evaluate the E k [X] values of inverse, random and reverse pairs, on non-overlapping windows of given lengths L. In general, inverse-pairs lead to smaller E k [X] than the other pairing choices. To determine the k-limit we impose the condition E k [X] <0.1 on the average over all chromosomal sections. The example displayed in Additional file 5 is of chr1, which is being tested with windows of length L = 5Kbp for inverse-pairs of k = 2. Although the average value is 0.07, obeying our criterion for IS validity, it is quite obvious that on many 5 K windows the values are higher. The value KL = 2 is chosen as the k-limit of IS validity in this case. Reducing the section length further down to L = 1Kbp, in Additional file 6, we find that IS fails even at order k = 1, i.e. the second Chargaff rule does not hold for such short sectors.
Similar evaluations for different chromosomes, on both HG18 and HG38 assemblies, lead in a consistent manner to the k-limits of "human sections" displayed in Table 3, where they are compared with results obtained for various other species, both eukaryotes and prokaryotes. They all follow a logarithmic increase of KL as function of the length of the chromosomal section, as is quite evident from their display in Fig. 4.
The logarithmic increase is modelled well by our IS-Poisson model. To prove it let us define N(S) = f(S) L /4 k , and let us assume that E[|Z|] reaches its asymptotic value 0.8. We may then rewrite The expression in {} is the expectation value of (f(S)) -½ within the |Z|-distribution. Let us denote it by 0.8c k Table 2 Results of the evaluation of averages and variances over k-mers of X and Z distributions on human chr 1. Large k-values approach the results E k (|Z|) = 0.8 and σ k (|Z|) = 0.6 expected from standard normal Z distributions If c k is a slowly varying function of k then k < lnL/ln4 + const = 0.72lnL + const. This result is evidently borne out by the experimental fit k < 0.73 ln(L) + const. in Fig. 4. Furthermore, studying human chr1 we find that the experimental averages of (f(S)) -½ (without weighting by |Z|) obtain the values 1.24, 1.45, 1.47 for k = 4, 6, 8 respectively. This verifies that c k is indeed a slowly varying function.
An early observation of inversion symmetry measures increasing logarithmically with sequence size has been   [10] for various DNA and RNA sequences (see Fig. 1 in [10]).

Modeling inversion symmetry
If IS holds exactly for some k = k 0 , it will hold also for all k < k 0 , since the latter are substrings of the former and, therefore, all the frequencies of the k inverse-pair substrings will be matched (since the frequencies of their k 0 hosts are being matched). In Methods we show that this statement is also true when IS is approximately true, i.e. when E k [X] < <1. In practice we find it to hold when we apply our criterion E k [X] ≤0.1 (see Table 4). One may wonder to what extent the opposite may hold within, e.g., low order Markov models: will a Markov model, constructed such that it satisfies IS for some k induce IS at the level k + 1? The answer is negative. Even for low values of k, a Markov model based on a lower statistic cannot generate the higher statistic [8]. This issue has been discussed in [10] where the difference between the two has been termed "residual symmetry". The simplest random model is that of a uniform distribution, which is generated on the basis of the second Chargaff rule (i.e. #A = #T and different from #C = #G). Such a distribution will trivially account for low E k [X] values for inverse pairs at large values of k, limited by the length of the model chromosome. However it will Displayed results are for chr1 of HG38, HG18, mouse, C elegans, and S cerevisiae, and for the full bacterial chromosome of E coli. M refers to masked chromomes. Centromere regions were removed from the HG 38 data. Highlighted results are the ones determining the k-limit, KL, of the different chromosomes also give rise to very low values for reverse pairs at a similar range of k, because any comparison of k-mers with one of their permutations will lead to similar E k [X]. In other words, this random independent (but not IID) model satisfies additional symmetries that are not observed in genomic data. Therefore it is not a realistic model of inversion symmetry. A plausible explanation of the observed IS can be based on the fact that genomes evolve through rearrangement processes. By comparing synteny blocks in human and mouse, Pevzner and Tesler [17] have argued that rearrangements occur on many scales in the genome, and intra-chromosomal rearrangements are more frequent than inter-chromosomal ones. Rearrangements may be viewed as inversions of sections between two breakpoints on the chromosome, and they may even follow one another in a nested fashion. Their study [17] demonstrated that human and mouse chr X share 281 synteny blocks of size >1 Mb, and at least 245 rearrangements occurred since the divergence of the two species.
Building on this intuition, derived from comparative genomics, it seems reasonable to assume that a series of such rearrangements on different scales may lead to IS. This mechanism has already been suggested by [6], and has been studied by [18] and by [11]. We have tested it on a simple model, starting from the human mitochondrial chromosome, which does not satisfy the second Chargaff rule. Since the mitochondrial chromosome is only 16Kbp long, we first construct out of it an enlarged model chromosome with length L = 100Mbp, by concatenating random selections of subsequences of chr M. We then apply to it rearrangements at various scales. We found that 5000 rearrangements at scales of 100 K have led to good IS effects, but best results were obtained for 50,000 rearrangements, whose breakpoints were randomly chosen, and their section lengths befit a uniform random distribution of length < 10 K. These results exhibit a high degree of IS, as displayed in Additional file 7. To look further into it, we present in Table 4  to 0.8, hence the success of the KL formula. By comparing with the 3 rd part of the table, we see that for these values of k, only for a very small fraction of all possible k-mers, both S or S inv are not realized on the studied chromosomes.

Inversion symmetry: validation and deviations
Our criterion for approximate IS, E k [X] ≤ 0.1, was introduced as an intuitive but somewhat arbitrary decision. From Table 4 we learn that this is where the Z distribution approximates very well the data. For larger k we observe that some of the k-mers do not appear, and their fraction increases rapidly with k. Hence our criterion selects also the range where almost all k-mer strings are being realized. This serves as a posterior justification of our IS criterion. Table 4 carries also the message that, for small k, a strict validity of IS cannot be guaranteed. This may be interpreted by stating that the breaking of IS is small, but it is statistically significant. In particular, testing the 2 nd Chargaff rule, one finds a systematic deviation from N(T) = N(A) and N(G) = N(C) for all human chromosomes, as displayed in Table 5. For most human chromosomes, we find an excess of T over A and of G over C on the positive strand. Only chr 8 and chr 22 display opposite trends. As shown, these results are statistically significant, when compared with an assumption that the counts of complement nucleotides are derived from the same Poisson distribution. Moreover, the same is true for both bare and masked versions of the chromosome. The difference between the bare and masked regions of the chromosome defined the low-complexity chromosomal regions. The asymmetry seems to be quite significant in all three regions (bare,masked,low complexity) as can be seen in Additional files 8 and 9.
It is well-known that there exist local violations of the 2 nd Chargaff rule; in particular, there exists an excess of #G over #C and #T over #A on the coding strand within most genes. Green et al. [19] have argued that mutational asymmetry has acted over long periods of time to produce such a compositional asymmetry, and discontinuities of such asymmetries are associated with loci of replication origin. These questions have also been studied by Huvet et al. [20]. Could it be that the asymmetry that we have encountered is somehow connected to these findings? Since the gene coding strand may be either the plus (P) or the minus (M) strand of the conventional genomic notation, this may seem to be unrelated, assuming there is equal probability for genes to occur on each strand.
The convention which is being used in the UCSC genome browser is that the "plus" strand refers to the linear 5' to 3' order of encountering the p-arm before the centromere, which is followed by the q-arm of the chromosome. This convention is consistent with NCBI "top" assignment. Counting protein coding genes and RNA genes on human chromosomes as recorded by GeneCards (http://www.genecards.org) we are led to the conclusion that they display a clear excess of genes on the plus (P) strands. The results are displayed in Table 6. Their relation to preferences of #T > #A and #G > #C on the P strand looks statistically significant. Clearly genes occupy only a small fraction of human chromosomes but they could still be the cause for the very small deviation from the Chargaff rule. It may also be that some other mechanism leads to a built-in excess on the chromosomes, and the latter affects the preference of gene allocations within the two strands. A notable exception to the observed general trend is chr 11.
There are two different issues which are noteworthy in Table 6. One is the correlation of the preference of #T > #A and #G > #C with the positive labeling of the strand. The other is the correlation of #T > #A and #G > #C with the preference for gene counts. Whereas the first may be coincidental (although it could be related to the labelling convention whose sources we were unable to trace), we believe that the second can be meaningful.
Next we looked for the violation of the 2 nd Chargaff rule on mouse and yeast, with the purpose of characterizing the asymmetries and looking for correlations with gene occurrences. The gene counts were obtained from MGI (MRK_list2 in ftp://ftp.informatics.jax.org/pub/reports/index.html) for mouse, and from SGD snapshot (http://www.yeastgenome.org/genomesnapshot) for yeast. While asymmetries of nucleotide occurrences are evident and significant in both, gene data are quite smaller than in human and no conclusive correlations can be deduced.  All Z values are very significant, but for Z(G,C) on chrY which corresponds to a p-value of 0.367. All other have inequality p-values < 0.01. On all chromosomes we observe #G > #C on the positive strand. Same is true for #T > #A, but for chr8 and chr22, where #T < #A, which is also a significant observation (|Z| > 2.575 corresponds to an inequality p-value < 0.005) The analyses are listed in Additional files 10 and 11. Finally we test the 2 nd Chargaff rule on C elegans and E coli in Additional file 12. While the former shows some significant inconsistencies, the latter is completely consistent. This behavior correlates well with the trends already noted in the first raw of Table 4, indicating large values of E[|Z|] for k = 1.

Discussion
Inversion symmetry may be stated as the equality N(S) = N(S inv ) where N are counts and S is some arbitrary string existing on a chromosome. Conventionally one studies such equalities over the space of all S which are k-mers of some given length k. In addition to this equality one requires that, if S inv is replaced by other permutations over the space of all k-mers, analog rules will not hold. After reinvestigating these questions on various genomic data, with special attention devoted to human data, we turned to a rigorous statistical study. For this purpose we defined the normalized differences Y = (N(S)-N(S inv ))/(N(S) + N(S inv )) and X = |Y|. If the equality N(S) = N(S inv ) holds for stochastic variables N, we expect the variable Y to have approximately Gaussian behavior. If, moreover, N is a Poisson distribution, then Z = (N(S)-N(S inv ))/(N(S) + N(S inv )) ½ should have approximately a standard normal distribution. The stochastic variable Z is the appropriate one to be used for a z-test, characterizing the significance of IS values displayed by Y or X, under the IP-Poisson model.
In order to characterize approximate IS we have employed E k [X] ≤ 0.1 as a convenient measure. We saw in Table 4 that it captures the region for which significant results are obtained, and almost all k-mers appear on the chromosome. Defining the k-limit KL as the kvalue for which E k [X] turns out to be closest to 0.1, we uncover a logarithmic increase of KL with chromosomal length. It turns out that this behavior is accounted for by our IS-Poisson model.
Our original definition of IS regarded it as an approximate symmetry. As such it was seen to be valid for all ranges of k up to KL. With the advent of the IP-Poisson model, we may investigate to what extent it can serve as an exact symmetry. It turns out that, for very low k, E k [X] though extremely low, is significantly different from 0. In other words, the confidence intervals derived from IS-Poisson, exclude a peak at Z = 0. This has lead us to investigate the violation of the 2 nd Chargaff rule, i.e. deviations from the relations N(T) = N(A) and N(G) = N(C). We find that deviations are very significant in human and in mouse, and quite significant on chromosomes of other eukaryotes. Moreover, in human we observe that, for most chromosomes, N(T) > N(A) and N(G) > N(C), i.e. these excesses are observed to occur on chromosomal plus (P) strands. Investigating the occurrences of genes on both strands, we find a similar excess with significant slight preference for the P strand. These results, for nucleotide excess and gene excess, are displayed in Table 6, and are seen to hold for a large majority of chromosomes. 1 Still, there exist also some counter-examples. Could it be that the known asymmetries of complement nucleotides on gene coding strands are related to the observed correlation of the two effects in the human genome? This remains an interesting question for future studies.

Conclusions
Inversion symmetry is valid for almost all chromosomes, even after filtering out their low-complexity regions. We have defined an empirical criterion of IS, and a corresponding k-limit (KL), which is the highest k for which all k-mer distributions abide by the symmetry. Analyzing the IS behavior using rigorous statistical methods, and comparing empirical results with our IS-Poisson model, we account for the universal increase of KL with respect to the chromosomal length.
For low k we find minute, yet significant, deviations from strict IS. This includes excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes. We point out that this finding correlates with a significant (but small) excess of genes on the same positive strands.

Methods
For a string S of length k, and its symmetry-related S*, we introduce the stochastic variables N(S) and N(S*), and through them the following variables X, Y and Z (using an italicized notation): All three variables can be evaluated for each specific k-mer, hence they carry implicit indices of the space of 4 k k-mers. The empirical values of the different variables will be denoted by N, X, Y and Z. Measures similar to Y appear in the literature for k = 1 and k = 2 and are known as skews.
The symmetry-relation means that N(S) and N(S*) are drawn from the same distribution, N(S) = N(S*). Furthermore, it is reasonable to assume that k-mer appearances on a long chromosome resemble a Poisson process. This has been verified by us by investigating counts for all non-overlapping windows of some size L (e.g. L = 100 K on human chr 1). If the expectation of the Poisson is large enough (a typical quoted number is 30), we can safely assume by the Central Limit Theorem that the distribution of the statistic Z we define is well approximated by a standard normal distribution.
1. The semi-Gaussian distribution of inverse-pair differences.
Let us consider a pair of k-mers, with counts N(S) and N(S*) respectively, for which we evaluate the ratios Y = (N(S) -N(S*))/(N(S) + N(S*)) and X = |Y|. Moreover, we assume that these counts are due to two random variables drawn from the same distribution, thus having the same average, E[N(S)-N(S*)] = 0, and follow a Gaussian distribution G = exp(−Y 2 /2 σ G 2 )/ σ G (2π) ½ . The counts of the distribution of X = |Y| will then follow a semi-Gaussian distribution P = 2 exp(−X 2 /2 σ G 2 )/ σ G (2π) ½ , defined for positive X only. The mean and variance of this semi-Gaussian are E[X] = σ G (2/π) ½ ≈ 0.798σ G and V[X] = σ G 2 (1-2/ π). Hence σ X = 0.603σ G = 0.755E X . Empirical verification of inverse pair distributions can be carried out by choosing counts for all nonoverlapping windows of some size L (e.g. L = 100 K on human chr 1). Testing the X and Y distributions for inverse pairs of k = 8 we find the above description to be valid.

Poisson distributions of counts.
Let us now assume that the counts N(S) observed on a chromosome, are realizations of stochastic variables which follow Poisson distributions, each with its own mean = variance. In the IS limit the distribution of the inverse N(S inv ) coincides with that of N(S). Their difference should have a mean of 0, and variance which is the sum of the variances. Thus, for each inverse pair of k-mers, we expect Y = (N(S) -N(S inv ))/(N(S) + N(S inv )) to become approximately Gaussian with mean 0 and standarddeviation σ G =1/(N(S) + N(S inv )) ½ . Alternatively we can state that Z = (N(S) -N(S inv ))/(N(S) + N(S inv )) ½ should approximately follow a standard normal distribution with mean = 0 and variance = 1. It follows then, from the previous paragraph, that in this regime we should obtain, after averaging over all kmers, the results E k (|Z|) = 0.8 and σ k (|Z|) = 0.6. Note that E k (|Z|) may also be viewed as E k (X/σ X ), where σ X = 1/(N(S) + N(S inv )) ½ for every particular k-mer under consideration. Tables 2 and 4  For simplicity we assume that all 4 k k-mers are being realized on the chromosomal strings. Let {S j , j = 1…4 k } be the set of all k-mers, and {S' i , i = 1…4 k-1 }be the set of all (k-1)-mers. Each (k-1)mer can be extended to the right by one nucleotide, resulting in four k-mers. Let us refer to this extension of S' i as the set S jϵI . Corresponding relations will hold for their inverse partners, extended by nucleotides to their left. It follows then that the counts of these sets can be related N(S' i ) = ∑ jϵI N(S j ). Hence Using the notation N(S j ) -N(S j inv ) = ΔN(S j ), we note that the numerator on the right obeys |∑ jϵI ΔN(S j )| ≤ ∑ jϵI |ΔN(S j )|. Because of varying signs this inequality may imply a strong decrease. We may now compare the expressions of E k-1 [X] = 4 -k+1 ∑ i X k-1 (i) and E k [X] = 4 -k ∑ j X k-1 (j). Using the results of the previous paragraph we conclude that the numerators of X k-1 (i) in E k-1 [X] are smaller (or equal) than the numerators of X k (j), where jϵI, in E k [X]. Note however that the denominators of X k-1 (i) and X k (j), where jϵI, are different. To the extent that all N(S jϵI ) have similar values within the group jϵI when we approach the IS limit, this leads to X k-1 (i) ≤ ∑ jϵI X k (j)/4, which implies that E k-1 X ½ ≤E k X ½ : In practice, for large k, we find in Tables 2 and 4 that E k-1 [X] ≈ E k [X]/2. It should be emphasized that the monotonic increase holds in the IS limit, i.e. when E k [X] < <1, but it is not a general property of k-mers on any chromosomal section. Synthetic counter examples can be constructed. 4. Comparison of E k [X] with the S1 measure. The measure S1, introduced by Baisnee Hampson and Baldi [10], comparing counts of all kmers with their inverses, is defined by S1 ¼ 1À The denominator in this expression equals twice the length of the chromosome. The numerator may be regarded as an L1 distance between two sets of sequences. Note the difference from our measure E k [X], which may be written as E k [X] averages the relative difference of all k-mers on equal footing, whereas S1 sums all absolute differences. A comparison of the two different measures on human chr1 is presented in Table 7. We find that E k [X] is roughly twice (1-S1) k , and the latter is approximately equal to E k-1 [X].