Inverse Symmetry in Complete Genomes and Whole-Genome Inverse Duplication

The cause of symmetry is usually subtle, and its study often leads to a deeper understanding of the bearer of the symmetry. To gain insight into the dynamics driving the growth and evolution of genomes, we conducted a comprehensive study of textual symmetries in 786 complete chromosomes. We focused on symmetry based on our belief that, in spite of their extreme diversity, genomes must share common dynamical principles and mechanisms that drive their growth and evolution, and that the most robust footprints of such dynamics are symmetry related. We found that while complement and reverse symmetries are essentially absent in genomic sequences, inverse–complement plus reverse–symmetry is prevalent in complex patterns in most chromosomes, a vast majority of which have near maximum global inverse symmetry. We also discovered relations that can quantitatively account for the long observed but unexplained phenomenon of -mer skews in genomes. Our results suggest segmental and whole-genome inverse duplications are important mechanisms in genome growth and evolution, probably because they are efficient means by which the genome can exploit its double-stranded structure to enrich its code-inventory.


Introduction
Symmetry has been considered as an aspect of beauty in mathematics [1], physics [2], chemistry [3], evolution [4], human appearance [5], and psychology [6]. The cause of symmetry is usually subtle, and the pursue of it often leads to a deep understanding of the possessor of the symmetry. Chargaff's parity rule, stating that in a DNA sequence contents of A and T, and of C and G, are separately identical [7], was a crucial clue to Watson and Crick's discovery of the double helical structure of DNA [8]. Chargaff's second parity rule (CPR2) states that at a lower level of accuracy the first rule also extends to a single strand of DNA [9][10][11][12]. This monomeric base-complement symmetry has two possible generalizations to k-letter words, or k-mers: complement and reverse-complement, or inverse, symmetries. It has been suggested that CPR2 is a special case of inverse symmetry, not complement symmetry [13][14][15][16]. If we represent the four types of bases by black and white arrows that can point up or down-black, A/T; white, C/G; up, A/G; down, C/T-and place a k-mer in front of a pair of right-angled mirrors, then the laterally reflected image is the reverse (conjugate) of the k-mer, the image reflected through the mirror below, its complement, and the doublyreflected image through both mirrors, its inverse. For example, the reverse, complement, and inverse conjugates of the 5-mer AAGTC are CTGAA, TTCAG, and GACTT, respectively. In our notion, a genome with a perfect symmetry is one where for every word in the genome, there is symmetry-conjugate of that word somewhere else in the genome. For a proper discussion of symmetry a quantitative description of the phenomenon, including its complete absence, is needed.
In a search for insights into the dynamics that drive the growth and evolution of genomes, we conducted a comprehensive study of the three symmetries in 786 complete chromosomes (all the complete genomes available in public databases when the study was initiated). We focused on symmetry because, in spite of the extreme diversity of genomes, we expect the same dynamical principles and mechanisms to drive genome growth, and expect footprints of the symmetry generating part of the dynamics to be the most robust. For this study we define a new symmetry index, x r , where r~r, c, and i stand for reverse, complement and inverse, respectively, that allows us to quantify accurately the three symmetries in any sequence long enough-about 5 kb-to make word counting in the sequence statistically meaningful. Each symmetry index has a value ranging from zero (perfect symmetry) to approximately unity (absence of symmetry). The value of our index is intuitively understood. For instance, in the case of inverse symmetry, an index value of 0.05 for 5-mers implies that the average difference in the frequencies of all inverse-conjugate pairs of 5-mers is one-twentieth that of all pairs of 5-mers. Using the indexes we verified that all three symmetries are absent as expected in sufficiently long random sequences, and we found that: reverse and complement symmetries are absent, globally and locally, in the 786 complete chromosomes studied; in sharp contrast, a high level of global inverse symmetry (GIS) is ubiquitous in almost all complete chromosomes; the grand average of the GIS index in all complete chromosomes is 0:073+0:066; while broadly similar in their global behavior, chromosomes exhibit a wide variety of patterns in local inverse symmetry (LIS); coding and non-coding regions have essentially the same global and local symmetry properties. We infer from these results that inverse segmental duplication, in several forms, is an important mechanism in the growth and evolution of genomes. As a byproduct we also gained a quantitative understanding of reverse, complement, and inverse skews in genomes, in monomers and k-mers.

Partition of k-mers into m-sets
By a genomic sequence we mean a single-stranded sequence. We call a k-nucleotide word a k-mer and denote the set of all t:4 k types of k-mers by S. Given a sequence, we count the frequency of occurrence (or frequency) f u of each k-mer type u in S using an overlapping sliding window of width k and slide one [17]. The sum of the frequencies is P u[S f u~L {kz1, approximated by L, and the mean frequency is f f~L=t. Let the fractional A/T-and C/G-content of a sequence be denoted by p and q~1{p, respectively. Whereas the p : q ratio varies widely from genome to genome, the well-verified CPR2 [9,11,12,18] states that in any long stretch of a single strand of genomic sequence the A:T and C:G ratios are both invariably close to 1. This property suggests a binary partition of S into subsets (m-sets) S m , m~0 to k, where each of the t m~k m 2 k types of k-mers in S m contain m and only m A/T's (note that P m t m~t ) [19]. For example, in the case of k~2, S 0 is the set CC, CG, GC, GG; S 1 is the set CA, CT, GA, GT, AC, AG, TC, TG; and S 2 is the set AA AT, TA, TT.

Definition of Symmetry Indexes
Given k §2, let P r be the set of distinct r-conjugate (but nonself-conjugate) pairs of k-mers types, where r~r, c, and i denote reverse, complement, and inverse symmetry, respectively. For example, for k~2, The r-symmetry index, x r , is defined as: where u { is the r-conjugation of u, s mu is the standard deviation of the m-set to which both u and u { belong, and N r is the number of r-conjugate pairs in P r . For example, for k~2, P r~6 , P c~8 , and P r~6 . We make two remarks concerning Eq. (1). First, k-mers from different m{sets are treated separately. Second, the difference (f u {f u { ) is measured relative to ffiffi ffi 2 p s mu , the average fluctuation of frequencies of k-mers within the m-set to which u and u { belong. Because the standard deviation in frequencies of all k-mers (for given k, but regardless of m-sets) depends sensitively on base composition and is not an accurate measure of the fluctuation in frequencies, the two features mentioned above are crucial for disentangling symmetry from the effects of base composition. By design x r is expected to be close to unity in the absence of r-symmetry. A x r significantly less than unity indicates the presence of r-symmetry and x r~0 implies exact r-symmetry. We have verified that x r &1 for all three symmetries in random sequences.
Comparing x r with an L 1 -Distance Index In [15] an index defined in terms of an L 1 -distance, , was used to measure symmetries in DNA sequences: the closer S 1 approaches unity the better the symmetry. The weakness of an algorithm based on S 1 is that it is of the order of unity in all cases. Moreover, unlike x it is sensitive to compositional variations. Table 1 gives the values of x r and S 1 r (r~c and i) for the 4.6 Mb E. coli chromosome, the 228 Mb human chromosom I, and matching random sequences. Here, matching means having the same length and base composition. We obtain a matching random sequence by either sufficiently scrambling the genomic sequence or generating a random sequence using an appropriately loaded die, and have found the two methods yield mutually consistent results as far as symmetry is concerned. It is evident that x has a significantly better analyzing power. The S 1 values in the table does indicate inverse symmetry to be better than complement symmetry in the E. coli genome and the human chromosome I. However, they also indicate both symmetries in random sequences are at least as good as inverse symmetry is in DNA sequences, which is of course incorrect. In sharp contrast, the x values correctly indicate that inverse symmetry is present at a high level in DNA sequences where complement symmetry is absent, and both symmetries are absent in random sequences. The S 1 in Table 1 for random sequence is very close to unity not because f u {f u { j j between a conjugate-pair is especially small, but because in a random sequence the difference between any given pair (from the same m-set) is small. This illustrates the importance of measuring f u {f u { j j against s mu , which has non-trivial properties [20]. Table 1. Comparing symmetries measured by x and S 1 . We derive a relation between x r and v r , the fraction of r-conjugate k-mers that are paired. For simplicity we do not explicitly mention m-sets and write s m as s s, but it is understood that a pair of k-mers always implies a pair in which both k-mers belong to the same m-set. Let SDf T~ffi ffi ffi 2 p s s be the average frequency difference of an unrelated pair (of k-mers). This means a typical pair of k-mers have respective frequencies and f z~ f f zSDf T=2, so that on average the fraction of unrelated k-mers that are ''paired-up'' is Similarly, if SDf T r is the average frequency difference in a r-conjugate pair, then the fraction of k-mers paired with their respective r-conjugate partners is It follows from the definition of x r (Eq. (1)) that a mean-field approximation of its value is Note that Sx r T~1 when v r~v0 and Sx r T~0 when v r~0 :5. It is important to realize that the value of v 0 alone does not determine the level of symmetry. For long random sequences, s s scales as L 1=2 so that s s 2 f f *0 and v 0 0:5. Yet none of three symmetries are present in random sequences, nor are they expected (see Table 1). Finally, the fraction of r-conjugate-paired k-mers above background level is Global and Segmental Symmetry Index We use x r,gl to denote the global x r for a whole chromosome and x r,l to denote the index for a segment of length l. For each chromosome we compute a x r,l versus l plot such as the one shown in Fig. 1, where each data is the mean segmental value x i,l (in this case r~i) of all the non-overlapping segments of length l into which the chromosome is partitioned. The error bar gives the standard deviation. The datum at full length is the global x i,gl for the chromosome. The body of data is seen to be roughly linear in the log-log plot for segment lengths up to near the full chromosome length, followed by a sharp drop in x i,l thereafter. There is a measurable k-dependence in the data (here k = 2 stands out, but not in all cases) but in this report we consider mostly k-averaged data. We utilize this property to characterize a chromosome by x i,bg and r x , where x i,bg is the linear part of the k-averaged x i,l extrapolated to full chromosome length, and r x is the ratio x i,bg x i,gl . For example, in Fig. 1, x i,bg &0:38, x i,gl &0:051, and r x &7:5.

The x-Matrix
Given a chromosome, a user defined overlapping sliding window is used to generate a set of N overlapping segments of length l covering the entire chromosome. The (i,j) element of the symmetric N|N x i -matrix is the x i,2l value of the concatenation (of length 2l) of the i th and j th segments of the set.

Symmetry Index for Coding and Non-Coding Parts
From each complete sequence, the coding and noncoding segments are spliced from a single strand of the chromosome and the segments-coding and noncoding-are separately concatenated in the order and orientation as they occur in the strand to form two sequences, the coding and non-coding parts, respectively. Symmetry indexes for the two parts are separately computed.

The Complete Sequences
The 786 complete sequences analyzed in this study, 356 eubacteria chromosomes, 28 archaea chromosomes, and 402 chromosomes from 28 eukaryotes, were downloaded in November of 2006 from the National Center for Biotechnology Information (NCBI) chromosome database [21], except the rice genome, which was taken from the Rice Annotation Project Database (RAP-DB) [22]. The set included all the non-redundant prokaryotic and eukaryotic complete genomes in public databases at the time of the download. Individual chromosomes range in length from 200 kb to 230 Mb. The total length of the 786 sequences is 2:18|10 10 bases. A list of the complete chromosomes is given in Table S1, SI.

Computing Programs
All computing programs used in generating the results reported in this paper can be downloaded from the ISDB [23].

The Inverse Symmetry Database (ISDB)
Data, in the form of numerical lists and plots, on local and global symmetries for 786 complete chromosomes are given in the Inverse Symmetry Database (ISDB) [23]. Here we present a

Reverse and Complement Symmetries Are Absent on All Scales
The quantity x r,l measures the segmental average of the r-symmetry index for a chromosome partitioned into segments of length l (Methods), and a x r,l {l-plot reveals the scale-dependence of x r,l . We computed the x r,l {l and x c,l {l plots for a large selection of chromosomes and found that in all cases the two symmetries were absent on all scales. Two examples, for B. burgdorferi and the human chromosome 1, are shown in Fig. 2. The data given at the top of Fig. 3 (a) is a summary of the finding that reverse and complement symmetries are globally absent, x r,gl &x c,gl &1, in all chromosomes studied (see ISDB [23] for full results). These results confirm a previous finding [13] that CPR2 cannot be a specialization to monomers of a general k-mer complement symmetry.  (Table S1, SI) showing that GIS is strongly present, namely x i,gl vv1. Fig. 3 (b) shows the k-dependence of category-averaged x i,gl varies with category, possibly as a reflection of the diversity of organisms included in each category. For example, the vertebrates are phylogenetically far closer than the organisms included in the unicellular category. Because the k-dependence is not pronounced and owing to the large quantity of data, in this report we will focus on k-averaged results. Data for eubacteria and archaea are not given separately as they are not significantly different. A power-law dependence on sequence length L,

All Chromosomes Have Good Global Inverse Symmetry
where d~0:48+0:04 and L 0~1 :4|10 4 b. The grand average for 786 sequences is Sx i,gl T~0:073+0:066. A possible mathematical origin of the power-law behavior will be reported elsewhere.
Complete lists of k-averaged (k = 2 to 6) and k-specific x i,gl are given in Table S1, SI and ISDB [23], respectively.

The Case of E. coli as an Example
The 4.6 Mb genome of E. coli K12 is almost compositionally even with p~0:492. We examine the results for k~4. The mean frequency (in bases) of 4-mers is f f~L 4 4~1 7,900 (to three significant figures). The standard deviations for the m-sets (Methods) are computed to be s m~9 000, 7660, 5510, 5950,  6050 for m~0 to 4, respectively. The weighted mean standard deviation is s s~P m s m L m L~6410, hence the average difference between the frequencies of a typical pair of 4-mers (among the 32640 pairs) is Df~ffi ffi ffi 2 p s s~9060. There are 120, 128, and 120 reverse-, complement-, and inverse-conjugate pairs of 4-mers. The global symmetry indexes for the genome are computed to be x r~0 :977, x c~0 :946, and x i~0 :031. The average frequency differences for a typical pair of reverse, complement, and inverse-conjugate pairs of 4-mers are Df r~8 910, Df c~8 620, and Df i~2 80, respectively. We remark that x i is well approximated by a mean-field estimation of Sx i T~Df i s s~0:031 (Methods). This implies, for instance, that if the frequency of the 4-mer AAGC is 18000, then the frequencies of its reverse-conjugate (CGAA), complement-conjugate (TTCG) and inverse-conjugate (GCTT) would fall within the ranges 9000-27000, 9400-26400, and 17700-18300, respectively, with *0:95% confidence. Note that a single inverse-conjugate pair having a frequency difference of the order of Df r is sufficient to cause Df i to increase from 280 to 860 and raise x i from 0.031 to 0.095.

Chromosomes Exhibit Several Types of LIS
The x i,l {l plots ( Fig. 1, Methods) of chromosomes exhibit considerable variation (see ISDB [23] for a complete set of such plots). We notice that a x i,l {l plot can be meaningfully characterized by two quantities: x i,bg , which measures the ambient, or background, LIS, and r x , the ratio x i,bg x i,gl . A large r x implies the symmetry is much stronger globally than it is locally. Fig. 4 (a) is a x i,bg {r x plot of the 786 sequences. Although the data do not appear to form distinct clusters, for ease of discussion we used the function T~5x i,bg À Á 2 z 0:3r x À Á 2 to partition chromosomes into four types: Type A, T w9; type B, 9 §T w4; type C, 4 §T w1; type D, T ƒ1. This way of classification implies that whereas the difference between two adjacent chromosomes on opposite sides of a boundary may be fuzzy, there is a stark distinction between, say, type A and type D chromosomes. Of the 356 eubacteria, 33%, 17%, 38%, and 12% are types A, B, C, and D, respectively. The 28 archaeons are split evenly between types C and D, with types A and B absent. About 4%, 21% and 75% of the 402 eukaryotic chromosomes are type B, C and D, respectively ( Fig. 4 (b)). A classification of chromosomes by inverse symmetry type is given in Table S2, SI. Many of the phylogenetically most deeply rooted thermophilic eubacteria, including A. aeolicus and T. maritima, are type D. Multicellular chromosomes, with chromosomes larger than the typical bacterial ones, are exclusively type D, but some smaller protozoan chromosomes, including some from P. falciparum and E. cuniculi, are type B or C ( Figure S1, SI). With few exceptions inter-chromosomal differences in multicellular organisms are slight, while those in protozoans tend to be larger. Within a complete sequence, the general properties in inverse symmetry of coding and non-coding parts are similar ( Figure S2, SI). We note that CPR2 is significantly more strongly violated in individual genes (exons in eukaryotes) than in non-coding regions (see, e.g., [24]). However, this difference is not apparent in our case because of the way we concatenate coding (and non-coding) parts. Specifically, the coding part concatenates all genes in both orientations into a single strand (Methods). Several factors that generally hold-there are exceptions-now conspire to make CPR2 violation on a long stretch of the coding concatenate to be typically much weaker than that in a single gene: the level of violation is fairly uniform for all genes; the densities of positively (+) and negatively ({) oriented genes are about the same; the violations on +genes and {genes have opposite signs. The vast majority of chromosomes have x i ƒ0:2; see Table S3, SI for a list of 38 exceptional chromosomes. Of the 23 prokaryotic chromosomes in this category, 10 are type A, 6 each are types B and C, and 1 is type D; all are eubacteria. Of the eukaryotic chromosomes in this category, 9 and 6 each are types B and C, respectively; all are unicellular and most are from the yeast, P. falciparum, and E. cuniculi genomes. Only one chromosome has x i w0:4: X. fastidiosa, with x i~0 :517. A preliminary study indicates a correlation between type-classification and phylogeny [25].
x i -Matrix Reveals Strong Intra-Chromosomal Correlation in Segmental Inverse Symmetry The x i -matrix is designed to display the inverse relation between two segments from the same chromosome. In the present case, a chromosome is scanned by a window of width 100 kb and slide 25 kb, and the (i, j)-element of a x i -matrix gives the x i -value for the 200 kb concatenate composed of the i th and j th (100 kb) segments in the chromosome (Methods). By definition a x i -matrix is symmetric. Fig. 5 shows graphical representations of x i -matrices, or x i -matrix plots, of four representative chromosomes, C. acetobutylicum, E. carotovora, M. mazei, and Synechocystis, one for each of the four types. The chromosomes were chosen for their typicality, for having x i,gl 's being approximately 0.05, and for having lengths approximately 4 Mb. We first focus on the x i -matrix plot for the type-A C. acetobutylicum, Fig. 5 (a). It is composed of four roughly equalsized quadrants (the plot is symmetric with respect to its skew diagonal). With the finer structure ignored, the two diagonal quadrants are ''gray/blue'', or x i,200 &0:3, and the two skewdiagonal quadrants are white, or x i,200 &1. An exception is a single pixel near the midpoint of the skew-diagonal, which is light gray/ blue. The whiteness of the skew-diagonal (except the midpoint) implies LIS on a scale of 100 kb is absent in the entire chromosome. For reference, the x i -matrix plot for a random sequence will be all white. The midpoint of the chromosome happens to be near the terminal site (ter) of replication. We call the (approximate) half-chromosome to the left of ter the lead-strand, and the other half the lag-strand. The (200 kb) concatenates whose x i,200 make up the lower-left (upper-right, respectively) quadrant are composed of segments both from the lead-strand (lag-strand), and those whose x i,200 make up the upper-left quadrant (or the lower-right, which is the same) are composed of one segment from the lead-strand and another from the lag-strand. The light color of the two skew-diagonal quadrants implies that the k-mer contents of any two segments from the same strand are similar, such that the inverse-symmetry property of the concatenate is close to that of either of the component segments, which in this case have similar low levels of symmetry. In contrast, and this is the most interesting part of the plot, the dark color of the two diagonal quadrants suggests that any two segments taken from different strands-one each from the lead and lag strands-have a significant ''inverse relation'', meaning that, as far as k-mer content is concerned, the two segments are relatively close to being mutual inverse conjugates. This being the case, the nature of the non-whiteness of the single pixel near the middle of the skew-diagonal is also understood: it indicates a (200 kb) concatenate that straddles the ter site, so that one of its component segments is (mostly) from the lead-strand, and the other from the lag-strand. We can now interpret the entire pattern of the type-A x i -matrix plot as follows: the chromosome is bisected by the ter site into two almost equal strands (in some cases the bisection occurs at the origin site (ori) and in some cases the partition is not so nearly equal), each of which is without inverse symmetry, but k-mer-wise the two are nearly mutual inverse conjugates. However, it cannot be said that the two strands are simply mutual mirror inverse copies. For if that were the case, then the two diagonal quadrants would be mostly white save for a black, narrow diagonal ridge several pixels wide. In the type-A x i -matrix plot shown in Fig. 5, which is that of C. acetobutylicum, the neat four-quadrant appearance reflects the fact that the ori site is close to the origin of the genome. In some type-A chromosomes neither the ori nor the ter site is close to the origin, and this causes their x i -matrix plots to look superficially more complicated (see below).
The x i -matrix plot for the type-B chromosome is similar to a type-A plot, except that all four quadrants are a shade darker than their counterparts in the type-A plot, caused by the chromosome having a higher (and nearly) homogeneous ambient inverse symmetry with x i,200 &0:5. The type-D plot is qualitatively different from the two just discussed. It does not have a quadrant structure and therefore exhibits no hint of bisection of the chromosome. Rather, not counting finer structures, it shows x i,200 to be 0.2 or less everywhere, implying that the inverse relation between any two segments is strong. The type-C chromosome has a structure intermediate between types B and D. The example shown in Fig. 5 is broadly composed of two sections: a type-B-like section from 0 to 2.7 Mb and a type-D-like section from 2.7 to 4 Mb. In this context, the 0.9 to 3.8 Mb segment of the type-D Synechocystis appears to be a ''super-type D'' embedded within the chromosome. Fig. 5 (a-c) show that some segments within a chromosome, some as long as 1 Mb, have LIS significantly distinct from that of the rest of the chromosome. Such segments suggest alien origins, possibly the result of lateral gene transfers [26].
The type-specificity of x i -matrix plots is also seen in the box plots in Fig. 6, which compare distributions of frequency differences (D) of pair of k-mers (in this case 4-mers) in various pair-sets. For type A the set of D's for intra-strand inverseconjugate pairs (lead and lag) are not different from that for uncorrelated pairs (whole), while that for all inverse-conjugate pairs (inv) has distinctly smaller values. In contrast, for type D the lead-and lag-sets are similar to the inv-set and have distinctly smaller values than the whole-set. As before the patterns for types B and C are intermediate between type A and type D.
In Fig. 7 the k~3 x i -matrix plots of 40 prokaryotic chromosomes are shown to indicate the diversity of chromosomes as reflected in such plots. The types of the eight rows are, respectively: A (eubacteria), A (eub), B (eub), C (archaea), C (eub), C (eub), D (arc), D (eub). We remark that in some plots a tidier four-quadrant structure can be obtained by a shifting of the  coordinates of the origin. The general trend of type-dependence of the plots is such that, going from A to D, the largest structure decreases in size, and the lightest color gets darker, or equivalently, the ambient level of inverse symmetry rises. All plots are rich in fine structure and invite a deeper level of analysis than presented here. The k-specific x i -matrix plots of all 384 prokaryotic chromosomes are given in ISDB [23]. x i -matrix plots of eukaryotic chromosomes, which generally take longer to compute, will be added to the database at a later date.
Four-Quadrant Structure of x i -Matrix Plot Is Directly Related to Discontinuity in x i,l -l Plot Some characteristics of chromosomes that cause each to have a distinct x i -matrix plot are evident in their x i,l {l plot. The (a) panels in Fig. 8 illustrate the four type-distinct x i,l {l-plots. These plots have two prominent features: (i) From l §2 kb onward, except the data near full length, x i,l and l have an approximate power-law relation, with the power-law exponent increasing from type A (in which case it is almost zero) to type D; (ii) x i,l tends to show a discontinuity when l is near full length, with the discontinuity increasing from type D (not apparent) to type A. The magnitude of the continuity is given by r x (Methods), with r x~1 indicating no discontinuity. For a type-A chromosome, the large value of r x (*20) is directly related to four-quadrant structure of its x i -matrix plot ( Fig. 5 (a)). For l less than half the full sequence length L, x i,l gives the (averaged) values of x i for segments that are entirely either in the lead-strand or in the lag-strand of the chromosome. Since in these halves inverse symmetry is essentially absent, x i,l *1. That x i,l decreases from about 1 at l L=2 to about 0.05 at l~L implies the k-mers in the lead-and lag-strands have a strong inverse-conjugate relation.
We now analyze in detail the plots in Fig. 8 (a) near l equals to full length L where the discontinuity of data occurs. Let u be a typical k-mer (say, AACGC, in the case of k~5) in an m-set (for the present example it is the m~2 set), and u { (GCGTT) be its inverse-conjugate. Consider the two halves of the chromosome, the lead-strand and the lag-strand, and let f u and f u { (f ' u and f ' u { , resp.) be the frequencies of u and u { from the lead-strand (lagstrand). Then the data near l*L=2 in panels (a) of Fig. 8 imply that on average (Methods) where ffiffi ffi 2 p s mu is the average fluctuation of the frequencies of all kmers in the m-set (the subscript u in m u indicates the m-set to which the pair u and u { belongs). The value of x i,bg ranges from *1 for type A, implying that LIS is absent in both the lead-and lag-strands of the chromosome, to being in the range of 0.2 to 0.5 for types B and C, implying LIS is moderate, to *0:05 for type D, implying LIS being strongly present in both halves. The data at the end points giving the discontinuity may be written as The last term is expressed as a remainder because, regardless of type, the global x i,gl is typically small, of the order of 0.05. Significantly, the negative sign attached to the right-hand-side characterizes a ''mutual inverse relation'' between the lead-and lag-strands. To summarize, a strong LIS is absent in type A but present in type D everywhere, and the strong GIS in type A is an expression of a lead-lag mutual inverse relation while in type D it may be an extension of LIS.
Ori and ter Sites Are Revealed by x i -Scanning Each of the (b) panels in Fig. 8 gives the result of segmental x i,100 when a chromosome is scanned by a 100 kb wide sliding window. In each case x i,100 fluctuates around a nearly constant background value that is typical; about 1.0, 0.5, 0.3, and 0.15 for types A, B, C, and D, respectively. Over the background are isolated sharp minima, prominent in types A and B and less conspicuous in type C but absent in type D. As a general rule, in types A and B the two deepest minima occur near the ori and ter sites. In type C the minima near these sites are two among many. In type D there is no longer any feature that is conspicuously associated with either site. It has been shown that x i -scanning is an effective tool for locating ori and ter sites for non-type-D chromosomes [25]. Known or putative ori and ter sites in all prokaryotic chromosomes studied are given in the ISDB [23].

Ori or ter Sites Are Centers of Inverse-Symmetry Reflection (CIR)
In the (a) plots of Fig. 8 the ordinates of the 4 and ? symbols give x ori and x ter , the values of x i,100 at the ori and ter sites, respectively. The fact that at least one of x ori or x ter is noticeably less than the average x i,100 (except for type D) is another indication that segments straddling the ori and ter sites tend to have a high level of inverse symmetry. To further test this inference, we compute three types of x i,l -plots, with results shown in the (c) panels of Fig. 8. The square symbols represent data for segments whose centers are either the ori or the ter site, whichever is nearer the midpoint of the chromosome, the bullet (triangle, resp.) symbols represent data for segments that start from ori or the ter site and extend towards the 59 (39 resp.) end. The results indicate that while the bullet-and triangle-symbol data are type-dependent and tract the x i,l {1 plot of the chromosome ((a) panels), the squaresymbol data are similar for every chromosome and drop rapidly with increasing l regardless of type. This confirms our interpretation that ori and ter site are (near) centers of inverse-symmetry reflection (CIR).

CIRs Are Turning Points for Inverse Skews
Cumulative base-skews, or compositional asymmetry, have been noticed to ''turn'' at loci near ori and ter sites [18,[27][28][29][30]-here called CIRs-and this fact has been used to locate such sites [31]. This phenomenon can be understood as a consequence of the pair of relations Eqs. (6) and (7) when applied to monomers. The relations however predicate a more general phenomena unrelated to the overall base composition of a chromosome: inverse skews in k-mers (not just in monomers) is strongest in type A and weakest in type D. The panels (d) of Fig. 8 show, in the four representative chromosomes, cumulative GC-and AT-skews, and panels (e) show cumulative inverse skews in eight base-neutral, inverse-conjugate 4-mer-pairs. The correlation between the magnitudes (determined by x i,l ) of the skews and type is evident. Eq. (7) states that the slope of a cumulative inverse skew, when it is measurable, must change sign at a CIR. Our relations make no prediction with regards to the relative magnitudes and signs of the GC-and AT-skews. Data on monomer, 2-mer, and 4-mer inverse-skews in all prokaryotic chromosomes studied are give in the ISDB [23].

Complement and Reverse Skews Are the Norm and Do Not Have Turning Points
From Eq. (6), and because complement and reverse symmetries are both absent (x r,bg &x c,bg &1), pervasive cumulative complement and reverse skews in k-mers without turning points are expected in all chromosomes. We have verified this to be true in general. As examples, panels (f) in Fig. 8 show cumulative reverse skews in eight base-neutral, complement-conjugate 4-mer-pairs in the four representative chromosomes (see Figure S3, SI for more examples).

A Quantitative Description of the Three Prototypes of Cumulative k-mer-Skews
The relation between base-skew [18,[27][28][29][30][31][32][33] and complement/ inverse symmetry has been noted previously [13][14][15][16]34]. Rocha et al. [35] made a comprehensive review of the pro and con of eight hypotheses, including the most often evoked cytosine deamination, genome rearrangements, recombination signals, put forward to explain compositional skews. They concluded that whereas all the (eight) hypotheses have the potential to explain part of available data, none is entirely satisfactory, and that the simplest explanation is that the bias is multifactorial.
Our study shows that cumulative skews in k-mers, including monomer, have three prototypes: up-and-up (row (f), Fig. 8; Figure  S3, SI), up-and-down (or down-and-up) (rows (d) and (e), type-A and -B columns, Fig. 8), and flat (rows (d) and (e), type-D column, Fig. 8). The classification is dictated by Eqs. (6) and (7) and the values of x r,bg and x r,gl : up-and-up if x r,bg &x r,gl &1; up-and-down if x r,gl %x r,bg 1; flat if x r,gl &x r,bg %1. Thus, the cumulative skew between an inverse-conjugate pair of k-mers (including monomers) will be up-and-down in type-A or -B chromosomes (mildly so for type-C), flat in type-D chromosomes, and never up-and-up in any chromosome. In contrast, the cumulative skew between a reverseconjugate or complement-conjugate pair of k-mers, or any pair of k-mers that are not inverse-conjugate (unless the pair are related by a hidden, not yet discovered symmetry), will be up-and-up in any chromosome. In other words, up-and-up manifests no symmetry, up-and-down manifests strong global symmetry but weak or no local symmetry, and flat manifests strong local (and consequentially global) symmetry. To sum: cumulative skew of the up-and-up type is the norm, to be expected between a randomly selected pair, and the other two types are special, occurring only between inverseconjugate pairs.
In the up-and-up and up-and-down types, the approximate constant of the slope of the cumulative skew is a reflection of the typical approximate uniformity of k-mer-content on a scale greater than 25 kb in most chromosomes [36,37]. Eq. (6) shows the key to the magnitude of the slope is s m . It has been pointed out that s m has an approximate kand m-dependent universal (for all genomes) value: ffiffi ffi 2 p s m & ffiffi ffi 2 p s s&0:20L 3:2 ð Þ {k (to within a factor of two) [19] (Strictly, s s&max 0:14L 3:2 ð Þ {k , 2 k L {1=2 ), so the maximum k for which the first expression applies is 10 and 14 for 2 Mb and 200 Mb chromosomes, respectively). This means that, typically, the full values (when x r,bg~1 ) for skews are about 6:3|10 4 b per Mb for monomers, 2:0|10 4 b per Mb for 2-mers, and about 1:9|10 3 b per Mb for 4-mers. These estimates, together with the computed x r,bg and x r,gl , give a reasonable quantitative account of the k-mer-skews we observe, including data shown in Fig. 8 (d-e) and in Figure S3, SI. An in-depth quantitative study of this subject will be reported elsewhere.

Inverse Duplication Generates Inverse Symmetry
Segmental duplication is known to be a driving force in chromosome growth and evolution [38][39][40][41][42][43], and inverse segmental duplication (ISD), or segmental duplication from one strand of the DNA to the other strand ( Fig. 9), is also known to have occurred in chromosome evolution [44][45][46]. ''Countless inversions and inverted transporsitions,'' which are types of ISD events, were invoked to explain patterns of violation of CPR2 in 3-mers [47,48]. Instead of being mass produced by ISD, inverse-conjugate pairs may also be generated in a deliberate base-by-base process in the form of stem-loop extrusions (SLEs) from duplex DNA [12], to be discussed at length in a later section. We know of no mechanism analogous to ISD that can stochastically generate either reverse or complement symmetry on a large scale, and this may explain why these two symmetries are not prominent in chromosomes. Figure 9. A schematic inverse segmental duplication event. In an inverse segmental duplication, a segment is copied, inverted, and reinserted into the host sequence. In such an event, because the inverse conjugates of all k-mers appearing in the copied segment is added to the sequence via the inverse duplicate, such a duplication event enhances global inverse symmetry (GIS). doi:10.1371/journal.pone.0007553.g009 The fraction of a sequence that causes it to manifest inverseinverse symmetry is estimated in mean-field approximation f f is the fraction of unrelated k-mers that are paired-up and Sx i T is the mean-field approximation of x i . For sequences with strong inverse symmetry (x i &0:05), Using the universal value for s s given above and noting that f f~L 4 k we have 2dv inv &0:10 1:26 ð Þ k (for kƒ10), which for k~2 to 6 yield 2dv inv~0 :25+0:15: This suggests that inverse-symmetry generating mechanisms played a major role in chromosome composition.

Codon Usage and Inverse Symmetry
We analyze possible effects of codon usage on inverse symmetry. The 3-mer and codon frequencies from a DNA sequence differ in two aspects. First, codons are counted in the natural orientations of the genes, while 3-mers are counted along the gene concatenate in one direction (Methods). For example, the codon Trp, or UGG, encoded in a negatively oriented gene (or in a gene in the negative strand) adds one count to the frequency of codon Trp, but one count to that of the 3-mer CCT, not AGG. Second, codons are read only from protein-coding genes and are frame defined, while 3-mers are read over the entire sequence-non-coding as well as coding parts-using a sliding window of slide one. Therefore, the summed frequency of the codons (L codon ) is one-third the summed length of protein-coding genes, and that of 3-mers (L 3mer ) is the length of the DNA sequence (minus 2). In prokaryotes, where the coding region is typically about 88% of the chromosome, the ratio L codon =L 3mer 1=3, so the likelihood that codon usage could determine the inverse symmetry of the entire chromosome is already small. In increasingly advanced eukaryotes the ratio becomes progressively much less than 1/3-in human it is about 0.01, and the likelihood decreases accordingly. Table 2 uses the coding region (gene concatenate) in the four type-representative chromosomes to illustrates the difference between inverse symmetries in codons and 3-mers. Over the entire gene concatenate 3-mers have excellent symmetry even as codons have no or very poor symmetry. This can be the manifest of ''genic inverse symmetry'', meaning that the two sets of positively and negatively oriented genes are broadly homologous. Genic inverse symmetry ensures k-mer inverse symmetry in the coding region, independent of codon usage. The ''lead'' and ''lag'' results in Table 2 show that, just as for k-mers, genic inverse symmetry may also be global (type A), or local (type D), or shades in between. Even good genic inverse symmetry is insufficient to explain the kmer inverse symmetry we observe, because the latter is of the entire chromosome, which includes both coding and non-coding regions. Therefore, genic inverse symmetry cannot be the cause genomic inverse symmetry. Rather, it is a consequence of whatever that causes genomes to have inverse symmetry.

Type A Suggests Chromosome-Size Inverse Duplication and the prox Hypothesis
To simplify discussion we define the following: a prox duplication is one such that the site of insertion of the duplicated segment is proximal (relative to chromosome-scale) to the site of duplication; a dist duplication, necessarily trans-CIR, is one such that the site of insertion is distal to the site of the duplication (Fig. 10). A prox-ISD tends to enhance LIS-near the location of the ISD event-as well as GIS, whereas a dist-ISD can enhance only GIS. The most parsimonious explanation for the fact that a type-A chromosome has strong GIS while both of its two approximately equal sized halves-the lead and lag strands-are without inverse symmetry is that it is the result of a whole-genome/chromosome ISD (WGID) on a chromosome that had no inverse symmetry before the event. Following a WGID event, an originally symmetry-free chromosome will have perfect GIS, with x i,bg~1 but x i,gl~0 , which defines an extreme type-A chromosome. The x i -matrix plot of a chromosome immediately after an WGID event will not look exactly like Fig. 5 (a), however. Its skew-diagonal quadrants will be mostly white, just like in Fig. 5 (a). The two dark diagonal  quadrants seen in panel (a) will be replaced by quadrants of a much lighter shade (but not white, owing to the fact that the wordcontent of a chromosome has a fair degree of homogeneity), with a black, narrow diagonal strip running through it. The actual type-A patterns seen in Figs. 4 and 5 could be the consequence of (i) very few of either prox-ISD or dist-DSDs (direct segmental duplications) but (ii) many prox-DSDs occurring after the WGID. If we assume that a WGID is the major event giving rise to the pattern of inverse symmetry in type-A chromosomes, then (i) and (ii) above may be viewed as constraints that need to be satisfied. Since it is known that DSD is a major driving force in genome growth [38][39][40][41][42][43], satisfaction of constraint (ii) and the second half of constraint (i) follows if we hypothesize that as a general rule SDs are mostly prox. We call this the '' prox hypothesis''. (This hypothesis is consistent with an unrelated requirement, put forward for understanding the general existence of long-range correlation in genomes, that at least a significant portion of DSDs are made in tandem [20].) The prox hypothesis drastically simplifies the narrative for a type-A chromosome: a chromosome suffers a WGID, after which very few prox-ISD occurred. If a WGID event did occur then we should find homologs between the two arms of the chromosome. Our preliminary study using sequence alignment indicates that the pattern of homologs is consistent with the hypothesis that a WGID occurred in a type-A chromosome (C. acetobutylicum) and not in a type-D chromosome (Synechocystis), as shown in Fig. 11. On the other hand, both chromosomes exhibit good genic inverse symmetry in ways that are expected, global in type A and local in type D. A comprehensive BLAST-based study of this topic is underway and results will be reported elsewhere. An alternative explanation for type A is that there had been no WGID, instead the high degree of inverse symmetry was the result of many dist-ISDs, such that the cumulative length of inversely copied segments was close to (dv inv *) 13% of the full chromosome length (Eq. (8)). For this explanation to work constraint (i) is still needed. This alternative explanation seems highly unnatural since it implies there were simultaneously many dist-ISDs and very few prox-ISDs. Whole-genome duplication (WGD) was first proposed by Ohno as an important mechanism for genome evolution [49]. Recently it has been firmly established that such events did occur in yeast [50][51][52], ray-finned fishes [53], and freshwater puffer fish [54]. The possibility of WGID was previously discussed in connection with base-skews in B. burgdorferi [16,34].

Type D Suggests Many prox-ISD Events
In a type-D chromosome, the existence of inverse symmetry on all scales (greater than 5 kb) including local genic inverse symmetry (Fig. 11) and the homogeneity of x i across the entire chromosome (type-D patterns in in Figs. 4, 5, and 6) can best be explained as the result of many small, and mostly prox, ISD events; necessarily prox because otherwise LIS on a small scale would not be generated. This explanation is consistent with the prox hypothesis. Our results suggest that the upper bound of the distance between copying and insertion sites in a prox-ISD should be considerably less than 100 kb. In spite of the absence of distinct CIRs in type-D chromosomes (type D in Fig. 8 (b,c)), we may not rule out the possibility that early WGIDs did occur in such chromosomes, because much of the trace of an early WGID, assuming that it had occurred, would have been obliterated by the large number of small ISDs that came afterwards.

Possible Role of SLE in Local Inverse Symmetry
Extrusion of a stem-loop (SLE) from duplex DNA [12] can enhance local inverse symmetry in general and CPR2 in particular. If such structures are of adaptive significance, then in a scenario of ''Nature (writing) with parity primarily at the oligonucleotide level'', organisms which had single base mutations that strengthened the stem would have been selected over organisms that had not [24]. Inverse-conjugate pairs formed in such a process are necessarily extremely proximal to each other, with a separation not exceeding tens of base pairs. SLE cannot be the sole cause of inverse symmetry because unlike prox-ISD it does not generate local genic inverse symmetry, which is prevalent in type-D chromosomes (Table 2 and Fig. 11). Furthermore, SLE promotes inverse symmetry at the expense of violating Chargaff's (first) parity rule [7] (CPR1), a key to DNA replication and biological inheritance. The CPR1-violating effect of SLE may be lessened if it is considered as a mechanism that generates, as opposed to a structural form that manifests, inverse symmetry. In this case SLE exists as a extruding structure only during its formative phase (which cannot be too short a time if it is to be formed one nucleotide at a time [12]), but subsequently becomes a part of the (non-extruding) stem-loop structure [55] via DNA replication (making two copies that are slightly different). Then only a small number of structural SLEs need to exist in the chromosome at any given time, causing only a miniscule violation of CPR1. Current sequencing techniques based on the shotgun method cannot distinguish SLS from SLE. The number of SLSs in excess of expected background with minimum stem length of 12 bp and loop length ranging from 5 to 100 nt was measured to be *0:004 per base in 40 prokaryotes [55]. If we assume all SLSs with loop lengths less than 50 nt-about half the total [55]-are generated by the SLE mechanism then we estimate the SLE contribution to 2dv inv to be about 0:002|12~0:024, about onetenth of the full value of 2dv inv (Eq. (8)).

A Unified Interpretation for All Types
If we view types B and C as intermediates between A and D, then a unified interpretation of the behavior of x i for all types emerges: Every chromosome, with the possible exception of type-Ds, experienced a WGID. The chromosomes differ mainly in the amount of prox-ISDs each had, in increasing amount from type A to D, with type A hardly any, and type D close to the saturation amount, involving a fraction of 2dv inv *0:25 of the chromosome. In all cases a large number (unconstrained as far as inverse symmetry is concerned) of prox-DSD events may have happened, while dist events, either ISDs or DSDs, occurred rarely. A fraction of highly proximal inverse-conjugate pairs, possibly contributing to up to one-tenth of the local inverse symmetry in type D chromosomes, and possibly a larger fraction in the background component in the inverse symmetry in the other types, may have been generated by SLE instead of ISD. The DSD and ISD (and SLE) events that occurred were mostly neutral, because the coding and non-coding parts in a chromosome do not differ significantly in their patterns of inverse symmetry ( Figure S2, SI). On the other hand, because segments involved in ISD (as well as DSD) events sometimes contained genes, ISD enhanced genic inverse symmetry just as it did k-mer inverse symmetry, so the unified interpretation explains the results seen in Table 2 and in Fig. 11. Many alternative interpretations are possible for our data, but none will be as unifying, simple, and parsimonious as the one proposed here. We believe that by refining and expanding the analysis reported here, a great deal more about how genomes grew and evolved can be learned.
Our study suggests that the ISD events, if they did occur, were causatively related to DNA replication. First, we found it consistent to identify the sites of insertion of WGID events in type-A, B and C chromosomes (the CIRs) with ori or ter sites. Second, genomes known to have multiple ori sites tend to be archaeons and eukaryotes [56,57], not eubacteria, and we found archaeons and eukaryotes tended to be type D and never type A, while eubacteria tended to be the opposite. On the other hand, some type-D chromosomes are from eubacteria, and half of archaea are not type D. It could be that not all ISD events are associated with replication, or that some eubacteria also have multiple ori sites while some archeaons do not, or both. We offer no explanation why replication may cause ISD, except to point out that during replication the chromosome is spliced at the ori site, and this offers opportunities for the chromosome to misconnect on rare occasions, possibly resulting in an ISD event. In any case, the genome seemed to have developed machineries for ISD and used it frequently, probably because ISDs allow it to efficiently exploit its double-stranded structure to enrich its code-inventory.  Figure S1 x i,l -l plots for six eukaryotes (number of chromosomes in parentheses). (a) Yeast (16), (b) Worm (6), (c) Fly (6), (d) Human (24), (e) P. falciparum (14), (f) E. cuniculi (11). In each case the result for all chromosomes are overlayed. Results for other k-mers are similar. Found at: doi:10.1371/journal.pone.0007553.s004 (1.26 MB TIF) Figure S2 x i,l -l plots for the coding and non-coding parts of (a) the type-A eubacterial B. burgdorferi (5% of chromosome is noncoding), (b) the type-D archaeon M. acetivorans (29%), (c) the type-C chromosome 14 of the protozoan P. falciparum (41%), and (d) the type-D chromosome 1 of human (49%). Found at: doi:10.1371/journal.pone.0007553.s005 (0.92 MB TIF) Figure S3 Cumulative skews of 12 base-neutral reverse-and complement-conjugate paris of 4-mers in four types of chromosomes. Reverse (a) and complement (b) skews in type-A C. acetobutylicum; reverse (c) and complement (d) skews in type-B E. carotovora; reverse (e) and complement (f) skews in type-C M. mazei; reverse (g) and complement (h) skews in type-D Synechocystis.

Author Contributions
Conceived and designed the experiments: HCL. Performed the experiments: SGK WLF HDC ZTH NZ. Analyzed the data: SGK WLF HDC ZTH NZ BZ HCL. Wrote the paper: HCL.