Great Lakes on the mend.

Comparative genomic approaches are useful in identifying molecular differences between organisms. Currently available methods fail to identify small changes in genomes, such as expansion of short repetitive motifs and to analyse divergent sequences. In this report, we describe an anchor-based whole genome comparison (ABWGC) method. ABWGC is based on random sampling of anchor sequences from one genome, followed by analysis of sampled and homologous regions from the target genome. The method was applied to compare two strains of Mycobacterium tuberculosis CDC1551 and H37Rv. ABWGC was able to identify a total of 104 indels including 20 expansion of short repetitive sequences and five recombination events. It included 18 new unidentified genomic differences. ABWGC also identified 188 SNPs including eight new ones. The method was also used to compare M. tuberculosis H37Rv and M. avium genomes. ABWGC was able to correctly pick 1002 additional indels (size >100 nt) between the two organisms in contrast to MUMmer, a popular tool for comparative genomics. ABWGC was able to identify correctly repeat expansion and indels in a set of simulated sequences. The study also revealed important role of small repeat expansion in the evolution of M. tuberculosis strains.


INTRODUCTION
Comparative analysis of fully sequenced genomes is a powerful approach to detect and measure diversity among organisms. It has become apparent in the last few years that many biological properties including clinical features can be inferred successfully from the analysis of full genome sequences (1)(2)(3)(4). In parallel a number of experimental studies have been initiated to document differences among closely related organisms and field isolates in the form of sequence differences, such as single nucleotide polymorphisms (SNPs), repetitive sequencebased polymorphisms, variable transposon insertions, recombination events, etc (5). Both these approaches complement each other. In the context of pathogenic organisms the results from these studies can help in developing newer methods for diagnosis and identification of drug and vaccine targets (6)(7)(8).
One of the major problems in understanding hostpathogen relationship in many infections is to explain the variety of clinical features ranging from asymptomatic infection to different forms of invasive disease. Many of the differences in the clinical features can be attributed to genetic differences among pathogenic strains. For example, comparative genome sequence analysis identified 1500 distinct genes present in pathogenic Escherichia coli O157, H7 but absent in non-pathogenic E. coli strain K-12 (9). These include genes involved in colonization and toxin production, responsible for disease pathology. Among mycobacteria, SNPs, insertion elements and genomic deletions have been associated with clinical features of different strains and species (10). In these organisms, unlike many others, there is no evidence for the presence of toxin genes which can be directly associated with virulence. Comparative analysis of a number of different strains and species of mycobacteria indicate that many of the sequence polymorphisms arise from specific deletion patterns. The genes affected by the deletions have important roles in the biology of these organisms (11). Since, deletions tend to be irreversible events (12,13), the pattern of deletions can be used to deduce the phylogeny of the mycobacteria. The distribution of deletions suggests that Mycobacterium tuberculosis H37Rv has not originated from M. bovis (14,15) as thought previously. Also, deletions along with other mutational analysis can be used as markers to study the evolution of genomes. The SNPs have also been used to carry out phylogenetic analysis of M. tuberculosis strains (16). Some of the identified SNPs in M. tuberculosis alter activities of enzymes thought to be involved in pathogenesis. The results show that this species is highly clonal, without detectable lateral gene transfer. Attempts have also been made to associate virulence with insertion of IS elements and repetitive polymorphic sequences (17). Different markers have been deployed for typing clinical isolates of M. tuberculosis, for example IS6110-based restriction fragment length polymorphism (RFLP) (18), spoliogotyping, a PCR-based method using repetitive elements at a single locus (19) and variable number tandem repeat (VNTR) typing, a tool based on repetitive elements (20). While these experimental approaches are useful for detection of a few differences, they cannot display total genomic diversity. Microarray hybridization has also been used to map variations among strains and isolates of mycobacteria (21,22). Though this approach involves a large number of loci spanning the genome subtle differences are likely to be missed, as coding regions are normally used for making the arrays and detection is based on hybridization. Computational methods based on whole genome sequences may be more useful for identification of small changes among genomes.
The existing computational approaches used for identification of the diverged sites and regions between genomes are based on complete genome alignment. Several algorithms have been developed for alignment of large sequences. These can be categorized as local alignment and global alignment-based methods. Anchor identification is the first step in global alignment-based methods, such as DIALIGN (23), DBA (24), LAGAN (25), GLASS (26) and AVID (27). While GLASS and AVID use exact k-mers as anchors, substitutions or mismatches are allowed in DIALIGN and LAGAN. The consecutive anchor pairs are then aligned. These methods are not able to identify genomic rearrangements, such as inversions and translocations (28). On the other hand local alignment techniques, such as MUMmer (29), WABA (30), BLASTZ (31) and SSAHA (32) are able to locate translocations and rearrangements. These methods also use anchor-based strategies employing slightly different approaches for anchor identification. While MUMmer identifies exact matches, WABA allows mismatches at the wobble positions. The local alignmentbased tools fail to identify short repeat expansions as these are either shown as overlapping regions (MUMmer) or the two overlapping regions are merged (WABA). Many of these methods are not suitable for analysis of a pair of diverged sequences as the anchors are not identified properly or very few anchors are identified. Therefore, there is a need for new genome comparison tools that can analyse both closely related and diverged genomes with capability to find indels, repeat expansions and rearrangements, and other mechanisms contributing to genome diversity.
In this article, we present a new method anchor-based whole genome comparison (ABWGC) for identification of divergent regions between genomes. We have tested our method on different strains of a pathogen M. tuberculosis. We have also analysed two different species of mycobacteria to show that this method can also identify divergent sequences, a feature not available in many other comparative genomic tools. Based on the comparison with other alignment tools we conclude that ABWGC is a preferred method for finding small changes in genomes such as small repeat expansion, indels, rearrangement between closely related genomes and analysis of divergent genomes.

METHODS
Let S (the query) and T (the target) be two genomes of length N and M, respectively. We first select some random positions on the query genome. Each of these positions would be starting points of the anchors. The anchors are of fixed length m and we require that these anchors be non-overlapping. As such we need to ensure that there was a minimum distance, L, between two successive random positions, where L ! m. We obtain this as follows.
Let x 1 ; x 2 , . . . ; x N be a random permutation of the numbers 1; 2, . . . ; N, where each permutation is equally likely to occur. This random permutation is obtained by the Mersenne Twister programme (http,//www.math.sci. hiroshima-u.ac.jp/ m-mat/MT/emt.html). The random positions of the anchors are constructed according to the following iterative scheme, let y 1 ¼ x 1 ; and y 2 ¼ x k 1 , where k 1 ¼ minfj > 1; j x j À y 1 j! Lg; having defined y i and k iÀ1 let ; j x j À y l j! L for all l ig. We terminate this iterative scheme when it is not possible to define any further y. Let fy 1 ; y 2 , . . . ; y n g be the set of all possible y's obtained by the above scheme. We note here that y 1 ; y 2 , . . . ; y n need not be in either an increasing or a decreasing order. However, with a slight abuse of notation assume that y 1 ; y 2 , . . . ; y n are in an increasing order.
Let j i denote the nucleotide at the position y i þ j in the query genome S. Thus, for example j i ¼ A if the nucleotide at the ðy i þ jÞth position in the query genome S is A, etc. The string represents the string consisting of m consecutive nucleotides of the genome S starting at the y i th position. The strings Að1Þ; Að2Þ, . . . ; AðnÞ represents our anchors at position y 1 ; y 2 , . . . ; y n on the genome S. The choice of y i 's ensure that these anchors do not overlap.
Based on these anchors we obtain a set of strings Bð1Þ; Bð2Þ, . . . ; BðnÞ from the target genome T. The string B(i) is that segment of T which gives the highest BLAST score when compared with the string A(i) of the query genome S.
To fix notation, let the string B(i) start from the position t i of the target genome T. Letting j i denote the nucleotide at the position t i þ j in the target genome, we have We note that B(i)'s may be overlapping, and although A(i)'s are arranged in an increasing order according to their position in the genome S, B(i)'s need not preserve that order. Let i.e. p i is the inter-anchor distance in the S genome (including the end points) and l i is the distance between the corresponding BLAST hits (the absolute value taken so as to ignore inversions in location of the hits). The positions of the anchors were recorded in order to find inversions. In case of duplications there will be multiple hits in the target genome T for an anchor Aði þ 1Þ. The expected position of Bði þ 1Þ is estimated by adding p i of the homologous anchor in the S genome with the position of B(i). If an anchor is present in multiple copies, there would be a number of hits with nearly equal mismatch scores. The anchor, whose position is close to the expected computed position, is chosen for analysis.

Mismatch score calculation
The proportion of mismatches between the strings A(i) and B(i) was calculated between these strings. The mismatch score is based on a binary scheme, a nucleotide match between the string A(i) and B(i) was given a value of 1 and a mismatch was given as 0. The sum of these 1s and 0s normalized with the length of the anchor was a mismatch score of the anchor. More formally, the mismatch score is given by the quantity CNS (cumulative normalized score) given below. Let The mismatch score is ðAðiÞ; BðiÞÞ: where n is the total number of anchors.

Estimation of anchor order
The gene-order approach used depends on the conservation of the genes (33), based on the same approach the anchor order was estimated as follows, For i ¼ 1; 2, . . . ; n À 2 let where oðAðiÞ; BðiÞÞ is 1 if the anchor order is preserved and it is 0 if there is a breakpoint in the synteny of the two genomes. This oðAðiÞ; BðiÞÞ is used instead ðAðiÞ; BðiÞÞ to obtain an equivalent CNS.

Identification of variable regions
All the p i and l i where the difference between them was >10 bp were extracted from genome S and genome T, respectively. The extracted sequences were then globally aligned to obtain the exact position of the insertion, deletion and duplication. The position were then mapped with the genome to get the coordinates of these events. For identification of divergent regions due to nucleotide alterations, a clustering approach was used. High scoring anchors above a threshold were clustered and a sampling procedure was carried out in the entire clustered region. If the samples also followed the high scoring criteria then the whole clustered region was a divergent region. The divergent region flanking an anchor its boundaries were extended in both the direction till a low scoring region was reached. To know the precise boundaries of the divergent region, the homologous regions from both the S and T genomes were aligned globally. The parameters such as gap open penalty and gap extension penalty were changed according to divergence between the genomes. The indels and duplications were then subsequently identified.

Data
The full genome sequences of M. tuberculosis strain CDC1551, H37Rv and M. avium subsubspecies paratuberculosis K-10 strain were obtained from NCBI (http,//www.ncbi.nlm.nih.gov/genomes/lproks.cgi). The three strains of M. tuberculosis F11, C and Haarlem have been isolated from South Africa, New York City and Netherland, respectively and the complete genome sequences are not yet available. The partial genome sequence were obtained from Broad Institute (http,// www.broad.mit.edu/annotation/genome/mycobacteriumtuberculosis-spp/MultiDownloads.html). Three simulated data sets were also prepared in order to quantify the performance of ABWGC. The first simulated data set was a DNA sequence of 3 kb extracted from the genome of M. tuberculosis H37Rv in which 1 kb region was randomly mutated. Another data set for simulation consists of two DNA sequences which contains five and six copies of a tandem repeat, respectively. In the third simulated data set in a DNA sequence, two foreign sequences were inserted with a gap of 10 nt between the inserted regions.

Availability
The set of programmes which constitute ABWGC and detailed instruction of their use can be obtained from the corresponding author on request (anchalv@gmail.com, av1563@students.jnu.ac.in, alok0200@mail.jnu.ac.in).

Description of ABWGC
Random sampling and determination of optimal anchor length. The approach described here was based on random sampling of fully sequenced genome sequences. Short sequences of predefined fixed length were extracted from random positions of a given genome (S) with some restrictions as described in the Methods section. These samples were used as anchors. BLAST algorithm was used to get the homologous regions for each of the anchors in the target genome T. A score was calculated for each anchor, based on sequence mismatch with target anchor and a cumulative normalized score (CNS) was then derived encompassing all the anchors as described in the Methods section. The CNS is essentially the average score of all the anchors reflecting the level of diversity at the nucleotide sequence level. Detailed analysis has also shown that the final CNS score was independent of sampling (data not shown).
To determine the optimum length, anchors varying from 25 to 300 nt from M. tuberculosis strain H37Rv (S genome) were extracted and compared with M. avium (T genome). Figure 1 shows the change in CNS at every sampling point for an indicated anchor size. The value of CNS changed with increase in the number of samples eventually reaching a plateau after about 3000 samples. Sampling from a genome involved $10% of the total size in terms of nucleotides. For all analysis, the value of CNS used was the one obtained when the curve reached a plateau. The value of CNS was similar if the anchor lengths varied from 100 to 300 nt. However, there was a marked difference in the score when the anchor size was less than 100 suggesting that 100 nt could be the optimum size of the anchor for determining divergence between genomes.
Identification of diverged regions at the level of nucleotide sequence. Anchors with high score in any pairwise comparison correspond to divergent DNA segments. For analysis of the diverged regions, the mismatch score of each anchor was plotted in order to obtain the overall distribution ( Figure 2a). When two strains of M. tuberculosis were compared, most of the anchors had score less than 20 suggesting a high degree of identity (Figure 2a and c). A few anchors had high scores, suggesting that these may lie in diverged regions. The boundaries of the diverged regions were identified by clustering the neighbouring anchors with scores above a threshold. The threshold value for the identification of the diverged regions was obtained by the distribution analysis of the scores of anchors obtained when M. tuberculosis genome was compared with a randomly generated sequence of the same length and base composition (Figure 2b and d). None of the individual anchor score was less than 70. A threshold score of 20 was used for subsequent analysis involving closely related organisms, such as strains of M. tuberculosis H37Rv and CDC1551. More sampling was carried out from the entire clustered segment. If the anchors derived from this sampling were also found to have high scores above the threshold, then the entire clustered region was considered to be the divergent region. To find a divergent region around an anchor (a high score anchor flanked by anchors with low scores) sampling was carried out in its flanking regions. Sampling was stopped when a low scoring region was reached on both sides of the divergent anchor. Precise boundaries were then obtained by aligning the putative divergent regions using a combination of both local (Smith and Waterman alignment) and global alignment (Needle and Wunsch) methods (34,35). Apart from large variations it is also possible to find small changes (for example SNP) in nucleotide sequences. If an anchor mismatch score is above 0 it indicates sequence mismatches. Precise location of SNP was obtained by alignment of homologous anchors. As, the number of anchors used in the analysis covers 10% of the genome, the number of SNPs identified may be a fraction of the total numbers present. Increasing the number of samples would increase the detection rate.
Insertion, deletion, recombination and inversion. Interanchor regions were extracted and anchor order was determined from the nucleotide positions of the respective anchors in the two genomes. All these measures were then used to determine indels, duplications and disruption in synteny. The differences in the lengths of inter-anchor regions of S and the lengths between the corresponding anchors of T genome were indicative of these events. The plot of the distribution of inter-anchor lengths of two closely related organisms is expected to contain most points on or around a diagonal, reflecting small or no change in the distance. The few points scattered around the diagonal line depict small changes. Large alterations can be identified as the points away from the diagonal ( Figure 3). The inter-anchor length distribution analysis showed that >97% of the anchors were within 10 nt. Therefore, a threshold of 10 nt was used to identify those inter-anchor regions selected for further processing. The selected inter-anchor sequences were extracted from both S and T genomes and aligned using a global alignment method to identify the indels. Both the strands of the genome were taken into consideration while extracting homologous inter-anchor regions. All the indels obtained by ABWGC were checked again using a BLAST analysis against the target genome for their absence or presence in the target genome. Identification of repetitive sequences. An insertion event would result from the addition of a stretch of nucleotides or an increase in the number of repeats of a repetitive sequence. In order to determine the nature of an insertion, the inserted region was extracted and BLAST analysis was carried out against the inter anchor region where the insertion was originally found. The presence of repetitive elements is indicated by multiple BLAST hits. Number of hits is equal to the number of copies of the repetitive sequence present in the region. The consecutive multiple hit depicts tandem repeats. The results were confirmed by tandem repeat finder (36).
Analysis of two M. tuberculosis strains ABWGC was applied to catalogue the genomic differences between the two strains of M. tuberculosis. The strains H37Rv and CDC1551 were used as S and T genomes, respectively. The majority of the anchors displayed scores in the range of 0-10 confirming a close evolutionary relationship between the two genomes. However, high scores, that is more mismatches, were observed for some of the anchors, suggesting that these anchors correspond to regions of divergence (Figure 2a). The anchors (17) that map to the complementary strand of the T genome reflect flipping of the sequences. The number of such events was found to be five between these two strains ( Table 1). The two strains of M. tuberculosis H37Rv and CDC1551 are 99% identical in nucleotide sequence. Only two divergent regions were obtained by our analysis which mapped to MT1499 gene in M. tuberculosis CDC1551 and Rv3343c in H37Rv. These genes code for PE-PGRS and PPE, respectively. We also identified SNPs that map to the anchors. Eight of these SNPs have not been reported before though the SNPs observed by our analysis at the default level of sampling, that is 10% of the genome, is $6.5% of that identified by MUMmer ( Table 2). Inability of the MUMmer to identify these eight SNPs may be due to their locations in regions not aligned by MUMmer.
Analysis of the inter-anchor regions helped to identify 104 indels between the two strains. The majority of the variations ($85%) were restricted to the 5 0 or the 3 0 end of the coding regions. Only a small fraction (15%) of the insertions in M. tuberculosis H37Rv was mapped to the intergenic regions. Among the protein-coding genes (37), PE-PGRS and PPE genes accounted for the largest family (17) of proteins affected. Previous studies have shown that a number of large indels (>500 nt) are due to insertion elements, such as IS6110 (37). IS elements are not included in the results as these are easier to identify. Eight non-IS elements-derived indels were also detected. For example, an insertion of 653 bp in M. tuberculosis H37Rv was located in Rv1091 gene, a member of PE-PGRS family. Similarly, an insertion of 2273 bp spanning the genes Rv2123 and Rv2124 was also found in M. tuberculosis H37Rv. These changes are likely to alter the functions of the genes.
The nature of the indels was identified by a detailed analysis as described before. The results revealed that many of the observed indels were actually due to an increase in copy number of small tandem repeats (Table 3). In M. tuberculosis H37Rv, 22% (10/45) of the insertions were due to the increase in copy number of small repeats and ten such expansion of repeats were identified in M. tuberculosis CDC1551. Some of these alterations do not change the reading frame of the gene, resulting in a small insertion in the protein. For example, an 18-mer repetitive element was present in M. tuberculosis CDC1551 as two tandem copies. Three tandem copies of the same element was found in M. tuberculosis H37Rv (Figure 4a). Since the repetitive motif (18) is a multiple of 3 the reading frame is maintained. Deletions in M. tuberculosis CDC1551 involving PPE and PE-PGRS family (Rv0747, Rv01818c, Rv3343, Rv2741) followed an interesting pattern. The deletions involved a stretch of nucleotides containing repetitive sequences (Figure 4b), suggesting that recombination between the repeats may have been instrumental in the process of deletion. Some of the differences between the two strains of M. tuberculosis, reported here, were not identified by any previous study (Tables S1 and S2 in Supplementary Data). All the small repeats identified by ABWGC were essentially confirmed by tandem repeat finder ( Table 4). The differences were minor due to doubling of the size of the repetitive sequences predicted by ABWGC and mismatches allowed in tandem repeats by tandem repeat finder.
An analysis was carried out with M. tuberculosis strains F11, C and Haarlem to check if the variations observed between the strains H37Rv and CDC1551 are also seen in other strains. Since strains F11, C and Haarlem are not fully sequenced, detailed analysis using Figure 3. The inter-anchor length distribution. The anchors were extracted from M. tuberculosis CDC1551 randomly as described in the Methods section. These anchors were ordered in terms of their position in the genome. The homologous of these anchors were from M. tuberculosis H37Rv. The distance between two consecutive anchors were computed for both genomes. These inter-anchor lengths of two genomes were then plotted. The deviation from the diagonal represents the difference between the inter-anchor length of a pair of homologous anchors.
ABWGC was not possible. In this analysis anchors defining all the known insertions of H37Rv and CDC1551 (S ) were used as query sequences against sequences of strains F11, C and Haarlem (T ). The inserts from CDC1551 genome were extracted and the presence/ absence of the 46 inserted regions were observed in all the 3 T genomes. Twenty seven deletion sites were identical in size between H37Rv and F11 and variation was observed at two sites (Table 5). Strain C and Haarlem showed 18 and 15 identical deletion sites, respectively. This suggests that F11 is closest to H37Rv compared to other strains. This was further confirmed by taking inserted regions from H37Rv and comparing with the three strains.
Alteration in anchor order. Recombination events lead to reorganization in genomes which can be observed as conservation of genic synteny (38,39). Anchors can also be used to determine reorganization in genomes. The disruption in anchor order can easily be computed in a pairwise comparison. The anchors which map directly within transposable elements have been excluded from the analysis. The results presented clearly showed that such events occurred in mycobacterial genomes (Table 6). In CDC1551, the breakpoints in anchor order were due to transposable elements. Four breakpoints in anchor order in H37Rv were found in comparison to CDC1551. These regions map to genes Rv1316c, Rv2020c and Rv3020c. All of these genes contained repetitive sequences. Homologous recombination between these repetitive sequences may be responsible for the changes in anchor order ( Table 6).
Analysis of the anchors in H37Rv and CDC1551 revealed a rearranged region in H37Rv. This region contained a phiRv1 prophage within the biotin operon (10). There is a repetitive element REP13E12 flanking the rearranged region. Biotin is required as the growth supplement in some strains of M. tuberculosis. Horizontal gene transfer may have played a role in the integration of the prophage and this event may have occurred before M. tuberculosis became intracellular (40). The prophage intergration may be the cause of rearrangement of the genomic region in H37Rv. This region was also identified as a flipped region where the homologous anchor mapped to the complementary strand.

Analysis of two divergent species M. tuberculosis and M. avium
In order to find out if homologous anchors can be unambiguously identified in divergent genomes the method was applied to two different species of Mycobacterium, namely M. tuberculosis H37Rv and M. avium. The results of the analysis are shown in Figure 5. The majority of anchor scores were above 10, showing that the two genomes were not closely related. The analysis revealed presence of 1811 indels in H37Rv of size >100 nt. Moreover, four inverted regions were identified. Some of the results are shown in the Supplementary Table S3. It is clear from the analysis that it is possible to apply anchor-based method to the analysis of genomes of different species.

Evaluation of Performance
Mycobacterial genomes. A number of tools for comparing genomes have been reported. These include   Table 7. MUMmer was used with default settings (run-mummer 3 was used for the analysis).
While it identified 86 indels in the strain H37Rv (10), ABWGC and LAGAN (25) located 104 such events, which included the indels identified by MUMmer. MUMmer identifies indels as the regions between two consecutive MUMs that cannot be aligned. When two strains of M. tuberculosis were aligned by MUMmer, some overlapping MUMs were observed. These overlapping MUMs correspond to repetitive elements, and exact position could not be located. The output from LAGAN and MUMmer could not be processed to reveal that the insertions, were due to change in copy number of small tandem-repetitive motif. The local alignment tool BLASTZ was also used to test its performance. The indels identified by BLASTZ were fragmented as many short indels rather than a single large indel. For example, an insertion of 180 nt at 150882 in M. tuberculosis CDC1551 was identified by ABWGC and MUMmer but not by BLASTZ.
For closely related genomes such as H37Rv and CDC1551, complete analysis by ABWGC took <3 min on Pentium IV 3 GHz CPU. The results clearly showed that ABWGC is better in finding indels particularly those due to changes in the number of repetitive sequences compared to other currently available methods.
The results of analysis of M. tuberculosis strain H37Rv and M. avium are shown in Table 8. Only indels >100 nt are reported for comparison. The alignment of the two genomes by MUMmer is represented as clusters of MUMs. The regions between consecutive clusters are those which MUMmer failed to align. ABWGC was able to align these regions and identified probable positions of differences. For example, an indel of 990 nt was identified by MUMmer when H37Rv (starting from 399261)   was aligned with M. avium (starting from 4279373). Our results showed two separate insertions in H37Rv of sizes 231 (PE family) and 278 (13E12 family). In addition to these genes, this region encoded aspartate aminotransferase gene in both genomes (percent identity of 87%). A partial list of some of the differences between the predictions from the two methods is given in Table 8.
Overall it appears that ABWGC is a method of choice when divergent genomes have to be compared.
On simulated sequences. The results presented so far suggest that ABWGC is a better method for comparative analysis of genomes for finding indels, particularly in diverged genomes. This was further confirmed using     24699  18  2  335812  18  3  427312  15  4  428188  60  2  2163731  69  5  2165328  75  3  2347415  58  3  3171467  54  2  33663826  63  2  3948753 603 2 a set of simulated sequences. The first simulation data contained DNA sequence of 3 kb in which 1 kb region was mutated randomly to make a divergent region.
MUMmer identified the divergent region as an indel when the simulated data set was compared with the original sequence. In another simulated data set, six tandem copies of a repetitive motif was inserted at a given position. This sequence was compared with the same sequence but containing five copies of the repeat. MUMmer failed to identify the tandem copy correctly. Both the sequences were correctly analysed by ABWGC. In the third simulated example, two foreign sequences were inserted with a gap of 10 nt. MUMmer reported it as one large insertion whereas ABWGC reported as two insertions. The results confirmed that ABWGC is likely to be a better method for comparing genomes in order to find and characterize indels and expansion of repeats.

DISCUSSION
ABWGC fixes anchors randomly, followed by processing of the anchor coordinates to get comparative identification of diverged regions in terms of indels, repeats, inversion, etc. Many genome alignment tools also use anchors as the starting step. However, ABWGC differs from other methods in terms of the way anchors are generated and further processing of anchor information. Though the method is not capable of identification of all SNPs, it helps in identification of some of the SNPs that map within an anchor region. In this study, anchors totaling 10% of the total genome were sampled. Increasing the amount of sampling will increase coverage of the genome and it will be possible to find more SNPs.
Our preliminary results have shown that some of the measures in ABWGC can be used to get a distance estimate between genomes (manuscript in preparation). Pairwise distance estimates can then be used to draw genome trees, a feature not present in any other genome alignment tools.
The results presented here using both real and simulated data clearly show that ABWGC is a method of choice in finding specific differences among genomes compared to other genome alignment methods. The nature of the indels is deciphered by further processing of the data using global alignment tools. Since random anchors usually flank indels, the alignment tools are able to find correctly the indels and their positions unlike in other methods. ABWGC is also useful in finding changes in genomes due to increase or decrease in the number of short repetitive sequences. MUMmer, AVID and GLASS use exact matches for finding anchors. It is difficult to find exact matches in diverged sequences. The problem is more serious for non-coding regions which are more divergent than coding regions. Since ABWGC allows mismatches it is possible to identify homologous anchors in a pair of diverged genomes.
Previous studies have indicated that SNPs and indels have played a significant role in the evolution of M. tuberculosis strains (40)(41)(42)(43)(44)(45)(46). Indels were found to be mainly due to insertion elements (10,47). The major finding of this study is the identification of all the indels varying from one to hundreds of nucleotides. In this study, detailed analysis of indels of <10 nt was not included. But our limited analysis has shown that, like SNPs these are important for overall functional diversity of an organism. For example, one nucleotide insertion in coding region can change the reading frame unless compensated otherwise. A large fraction of the indels was due to expansion of short repetitive motifs. The amplification/deletion of the repeat sequences is thought to be due to replication slippage or unequal recombination events or by single-stranded annealing pathway (47). These processes may be responsible for the observed repeat expansion. Presence of additional copies of these motifs would in principle change the proteome of mycobacterial cells as many of this map to the coding regions and have a subtle effect on clinical features. The surprising finding of this study is the large number of variations found in the two strains of M. tuberculosis. Since the strain M. tuberculosis H37Rv is being cultivated for a long time and the strain M. tuberculosis CDC1551 is relatively a new isolate, the variations may be due to long-term in vitro cultivation of M. tuberculosis H37Rv. In order to rule out this possibility, the strain M. tuberculosis H37Rv was compared with the recently cultivated isolates M. tuberculosis F11, C and Haarlem and results showed that M. tuberculosis H37Rv is much closer to M. tuberculosis F11 compared to M. tuberculosis CDC1551. Therefore, the observed indels in the strain M. tuberculosis H37Rv were not due to long-term cultivation.
A number of experimental and a few computational studies have been carried out to find genetic variations in different strains and isolates of M. tuberculosis (10). ABWGC was not only able to find all the reported differences but also a number of other variations that have not been reported so far. It will be interesting to see the distribution of these variations among a number of clinical isolates from patients with different clinical manifestations. Our studies agree with published results which clearly show that M. tuberculosis undergoes genomic changes characterized by SNPs, indels, repeat expansion, etc. These can explain the level of phenotypic diversity seen among clinical isolates (46,48). A number of studies in prokaryotes have shown that multiple copies of tandem repeats play critical role in the evolution of bacterial genome. Occasionally, these changes can be correlated with the phenotypic properties of the organism. For example, the analysis of multiple locus variable number of tandem repetitive sequences showed that the Dutch Bordetella pertussis strain rapidly changed in late 1990s (49). The change in repeat number has given a clue of evolution. These changes influence antigenic variation leading to altered virulence (49). Such tandem repetitive elements have also been reported in Francisella tularensis (50) and Neisseria meningitidis (51). Therefore, it appears that changes in the number of tandem repeats may be a general mechanism by which organism evolve. Our findings about M. tuberculosis may be part of an overall strategy for bacterial evolution. Identification of such differences can help in identifying new diagnostic markers and targets for vaccine and drug discovery. In conclusion, our results show that ABWGC can be a useful tool in comparative genomics, providing features that are not available in other genome analysis tools.