Distribution of polymorphic and non-polymorphic microsatellite repeats in Xenopus tropicalis.

Our bioinformatics analysis found over 91,000 di-, tri-, and tetranucleotide microsatellites in a survey of 25% of the X. tropicalis genome, suggesting there may be over 360,000 within the entire genome. Within the X. tropicalis genome, dinucleotide microsatellites (78.7%) vastly outnumbered tri- and tetranucleotide microsatellites. Similarly, AT-rich repeats are overwhelmingly dominant. The four AT-only motifs (AT, AAT, AAAT, and AATT) account for 51,858 of the 91,304 microsatellites found. Individually, AT microsatellites were the most common repeat found, representing over half of all di-, tri-, and tetranucleotide microsatellites. This contrasts with data from other studies, which show that AC is the most frequent microsatellite in vertebrate genomes (Toth et al. 2000). In addition, we determined the rate of polymorphism for 5,128 non-redundant microsatellites embedded in unique sequences. Interestingly, this subgroup of microsatellites had significantly longer repeats than genomic microsatellites as a whole. Microsatellite loci with tandem repeat lengths of more than 30 bp exhibited a significantly higher degree of polymorphism than other loci. Pairwise comparisons show that tetranucleotide microsatellites have the highest polymorphic rates. Furthermore, AAT and ATC showed significantly higher polymorphism than other trinucleotide microsatellites, while AGAT and AAAG were significantly more polymorphic than other tetranucleotide microsatellites.


Introduction
Microsatellites are short tandem repeats of a DNA sequence that are highly abundant in the genomes of eukaryotes (Hearne et al. 1992; Tautz 1993; Schlotterer, 2000). The high levels of allelic variation, codominant inheritance, and ease of analysis have made these markers attractive for population genetics, genome mapping, pedigree studies, and forensic analyses (Wright and Bentzen, 1994; Ellegren, 2000). In spite of the promising aspects of microsatellites as useful molecular markers, little is known about their origin, evolution, organization, dynamics, and roles in genomes. Recently, with the exponential increase in the number of genomic sequences available for different organisms, bioinformatic approaches have been used to investigate the distribution and frequencies of different types of microsatellites (Toth et al. 2000; Katti et al. 2001; Subramanian et al. 2003; La Rota et al. 2005). Comparisons of the frequency and distribution of microsatellites among different eukaryotic genomes have revealed that the most dominant microsatellite types vary across taxa (Toth et al. 2000).
Xenopus laevis and its diploid sister species X. tropicalis are among the major model systems for the fields of molecular, cell, and developmental biology. In the past several years, the genomic information on Xenopus has accumulated rapidly, and NCBI now carries over 1.25 million EST sequences for X. tropicalis. The Joint Genome Institute (JGI) has released the assembly version 4.1 of the X. tropicalis whole genome shotgun reads at a coverage of 7.65X (http://genome.jgi-psf.org/Xentr4/Xentr4.info.html). The present study represents part of our efforts to generate a genetic map for X. tropicalis using microsatellites as markers.
One of our initial steps in generation of the genetic map was to develop a large set of "nonredundant" microsatellite markers. In this context we define our nonredundant microsatellite markers as di-, tri-, and tetranucleotide microsatellites containing a minimum of five non-interrupted tandem repeats, which are embedded in single copy flanking sequences and thus (with proper primer design) can amplify a unique genomic location. The purpose of this manuscript is to investigate: (1) the distribution and frequency of perfect di-, tri-, and tetranucleotide microsatellites in the X. tropicalis genome; (2) the relative abundance of different repeat classes and motifs in nonredundant microsatellites; and (3) the variations in the rate of polymorphism within nonredundant microsatellites along with the factors, such as tandem repeat length and base composition, which affect these variations.

Materials and Methods
Animals
DNA samples from two unrelated X. tropicalis frogs from each of the two major inbred strains, Nigerian and Ivory Coast, were used for polymorphic analysis. Frogs and/or DNA samples were generously provided by R. Grainger, U. Va., and R. Harland, UC Berkeley. The JGI sequence data was obtained from a single female Nigerian frog.

Estimation of frequencies of genomic microsatellites
Xenopus tropicalis genome assembly 4.1, generated by the Joint Genome Institute (JGI), Department of Energy (DOE) was used to estimate the distribution and frequencies of di-, tri-, and tetranucleotide microsatellites. For this study, all noninterrupted di-, tri-, and tetranucleotide microsatellites with 5 or more tandem repeats were analyzed. A total of 445 million bases, representing about 25% of the Xenopus tropicalis genome, was analyzed using a Perl script SSRIT (Temnykh et al. 2001). So as not to skew for microsatellites present only on long scaffolds, we analyzed 256 scaffolds ranging in size from 23,997 bp (Scaffold-2010) to 7,817,814 bp (Scaffold-1). The repeat motifs of di-, tri-, and tetranucleotide microsatellites were compressed into core groups in which different reading frames and complementary strand sequence were merged (Table 1). The results from output tables of SSRIT were analyzed using Microsoft Excel.
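The search criteria and the motif compression described above can be illustrated with a short sketch. This is not the SSRIT Perl script used in the study; it is a minimal Python approximation that assumes a plain uppercase DNA string as input. The function names (`canonical_motif`, `find_perfect_ssrs`) are hypothetical, and the choice of the alphabetically smallest rotation as the core-group label is an assumption made for illustration.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Collapse a motif into one core group label.

    All reading frames (rotations) of the motif and of its reverse
    complement are generated; the alphabetically smallest one is used
    as the label (an illustrative convention, not taken from the paper).
    """
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    candidates = []
    for m in (motif, revcomp):
        candidates.extend(m[i:] + m[:i] for i in range(len(m)))
    return min(candidates)


def _is_lower_order(unit):
    """True if the unit is itself a repeat of a shorter unit (e.g. ATAT, AA)."""
    k = len(unit)
    return any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_perfect_ssrs(seq, min_units=5, unit_sizes=(2, 3, 4)):
    """Return (start, core_motif, n_units) for each perfect, uninterrupted repeat."""
    seq = seq.upper()
    hits = []
    for k in unit_sizes:
        # One unit of length k followed by at least (min_units - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_units - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if _is_lower_order(unit):
                continue  # e.g. ATAT is counted as the dinucleotide AT instead
            hits.append((match.start(), canonical_motif(unit), len(match.group(0)) // k))
    return sorted(hits)
```

With this convention, `canonical_motif("GATA")` and `canonical_motif("TATC")` both map to the AGAT core group, which is how different reading frames and the complementary strand end up merged in a single count.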

Selection of non-redundant microsatellites and polymorphism testing
The term "nonredundant microsatellites" refers to di-, tri-, and tetranucleotide microsatellites containing a minimum of five non-interrupted tandem repeats that are embedded in single copy sequences. These microsatellites were identified by a bioinformatics screen from the Xenopus tropicalis genome assembly 2.0. The data mining script was based on the publicly available computer program, Tandem Repeats Finder (TRF) (Benson, 1999), and modified to find di-, tri-, and tetranucleotide microsatellites with more than 5 repeats embedded in unique flanking sequences suitable for primer design. Initially, nonredundant di-, tri-, and tetranucleotide microsatellites were identified randomly from the entire genome. Subsequently, identification of nonredundant microsatellites was targeted to underrepresented scaffolds. Once nonredundant tri- and tetranucleotide repeat sequences had been identified from all scaffolds that include them, the data mining script was further modified to identify primarily dinucleotide repeats.
Primer pairs with an annealing temperature of 58 °C (± 2 °C) were designed and initially tested on agarose gels to confirm their amplification under standard conditions (58 °C, 1.5 mM Mg2+, and 30 cycles). All primer pairs that amplified single bands were tested for polymorphisms between Nigerian and Ivory Coast strains. Polymerase chain reaction conditions consisted of 10 ng DNA, 0.5 µM of forward and reverse primers, 1.5 mM MgCl2, 0.2 mM of dGTP, dCTP, dTTP, 0.02 mM of dATP, 0.05 U/µl of Taq, 1X buffer, and 0.07 µCi/µl of 35S-dATP. The PCR amplification profile was 94 °C for 4 min followed by 30 cycles of 94 °C for 1 min, 58 °C for 1 min, and 72 °C for 2 min, with a final elongation of 30 min at 72 °C. Amplified products were electrophoresed in polyacrylamide gels and visualized by autoradiography. The known sequence of the pGEM-3zf(+) vector was used as a ladder to establish the size of the microsatellites.

Statistical analyses
Significance of the differences in length of di-, tri-, and tetranucleotide microsatellites and the mean copy number of different motifs was determined by ANOVA. This step was followed by a post-test using the GraphPad Prism software, which employs the Bonferroni correction to adjust for multiple comparisons. Comparisons in average copy numbers between genomic and nonredundant microsatellites were carried out by Student's t-tests. Contingency tables were used to compare the polymorphism among microsatellites with different lengths, different types of microsatellites, and different motifs.
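As an illustration of a contingency-table comparison, the sketch below computes a Pearson chi-square statistic for a 2 x 2 table of polymorphic versus non-polymorphic loci in two groups. The text does not state which test was applied to the contingency tables, so the use of the uncorrected Pearson statistic is an assumption, and the function name `chi_square_2x2` is hypothetical. No counts from the study are used here.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the 2 x 2 table:

                    polymorphic   not polymorphic
        group 1          a               b
        group 2          c               d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical values of the chi-square distribution with 1 degree of freedom.
CRITICAL_P05 = 3.841
CRITICAL_P01 = 6.635
```

A statistic above 3.841 corresponds to p < 0.05 and above 6.635 to p < 0.01 for a single 2 x 2 comparison; comparisons across more than two groups would need a larger table and more degrees of freedom.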

Results
Distribution and frequencies of di-, tri-, and tetranucleotide microsatellites
A total of 91,304 perfect di-, tri-, and tetranucleotide microsatellites with a minimum of five tandem repeat units were identified in 444,970,789 bp (~25%) of the X. tropicalis genome (Table 2). The total length of perfect di-, tri-, and tetranucleotide sequence represented in this sample is 1,705,957 bp, or 0.38% of the total DNA analyzed. Dinucleotide microsatellites account for 78.7% of identified microsatellites and significantly outnumber tri- and tetranucleotide microsatellites (p < 0.001). The average distance between two trinucleotide microsatellites (59.9 kb) is almost 10 times that between two dinucleotide microsatellites (6.2 kb). Our analysis suggests that every one million base pairs of genomic sequence contains an average of 161 dinucleotide, 27 tetranucleotide, and 17 trinucleotide microsatellites. Among the di-, tri-, and tetranucleotide repeat classes of microsatellites, the most abundant repeat motifs are AT, AAT, and AGAT respectively (Table 2). These three repeat motifs account for more than 66% of the microsatellites present in the X. tropicalis genome, with the AT microsatellite alone representing over 50% of the total microsatellites in the genome. Figure 1 shows the mean number of tandem repeats present in each of the four most abundant microsatellite motifs for each repeat class. Interestingly, for both the dinucleotide and tetranucleotide repeat classes, the most abundant motif also contained the highest number of tandem repeats; that is, both the AT and AGAT repeats were significantly longer than other di- and tetranucleotide repeats (p < 0.001). However, this trend was not seen in the trinucleotide repeat class, as AAT repeats, the most abundant trinucleotide motif, were not significantly longer than ATC repeats.
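The density and spacing figures above follow from simple arithmetic on the totals. The sketch below reproduces that arithmetic in Python; the per-class counts are approximated from the reported percentages (78.7%, 8.1%, and 13.2%) rather than taken from Table 2, so the results can differ slightly from the published values. The constant and function names are hypothetical.

```python
TOTAL_SSRS = 91_304          # perfect di-, tri-, and tetranucleotide repeats found
BASES_SURVEYED = 444_970_789  # bp analyzed, about 25% of the genome
SSR_BASES = 1_705_957         # bp occupied by these repeats

# Class shares as reported in the text; counts derived from them are approximate.
SHARE = {"di": 0.787, "tri": 0.081, "tetra": 0.132}


def density_per_mb(count, bases=BASES_SURVEYED):
    """Number of loci per one million base pairs."""
    return count / bases * 1e6


def mean_spacing_kb(count, bases=BASES_SURVEYED):
    """Average distance between neighbouring loci, in kb."""
    return bases / count / 1e3


approx_counts = {cls: TOTAL_SSRS * share for cls, share in SHARE.items()}
densities = {cls: density_per_mb(n) for cls, n in approx_counts.items()}
spacings = {cls: mean_spacing_kb(n) for cls, n in approx_counts.items()}

# Fraction of the surveyed sequence occupied by repeats, and a naive
# whole-genome extrapolation that assumes the surveyed 25% is representative.
ssr_fraction = SSR_BASES / BASES_SURVEYED
genome_estimate = TOTAL_SSRS / 0.25
```

The whole-genome extrapolation simply scales the observed count by four, which is the basis for the estimate of more than 360,000 loci.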
Relative abundance of nonredundant di-, tri-, and tetranucleotide microsatellites
As part of an ongoing effort to identify PCR-amplifiable markers for use in developing a genetic map of X. tropicalis, data mining strategies were developed to identify microsatellites embedded in unique sequences suitable for unique genomic localization. To this end, we identified 5,128 nonredundant microsatellites, which were subsequently analyzed for polymorphisms (see Materials and Methods). The distribution and relative abundance of these nonredundant di-, tri-, and tetranucleotide microsatellites are shown in Table 3 and Figure 2. As was seen in the genomic survey, AT, AAT, and AGAT are also the most abundant nonredundant motifs, accounting for 90.30%, 73.52%, and 59.48% of di-, tri-, and tetranucleotide motifs respectively (Table 3). Likewise, AC, ATC, and ACAT are the second most abundant motifs in their respective repeat classes. CG repeats, which were found in low numbers in the genomic survey, were absent from our set of nonredundant microsatellites. Table 4 compares the average number of repeat units between genomic and nonredundant di-, tri-, and tetranucleotide microsatellite repeat classes. In all cases, the nonredundant microsatellites have significantly longer repeats than their genomic counterparts (p < 0.001). This trend is also seen for most individual repeat motifs and is most pronounced for the dinucleotide motifs (Fig. 3).
Polymorphism of di-, tri-, and tetranucleotide microsatellites
Effects of repeat length on the degree of polymorphism within microsatellites
To examine the relationship between repeat length and degree of polymorphism, microsatellite loci were classified into seven groups based on the length of their core repeat sequences. The percentage of loci in each length group that are polymorphic is shown in Figure 4. Clear trends can be observed for the tri- and tetranucleotide microsatellites, showing a correlation between repeat length and degree of polymorphism. To determine if these trends were statistically significant, each microsatellite motif was divided into two length classes. Loci with a motif length of 30 bp or less were designated as Class I markers, while those of more than 30 bp were designated as Class II markers. Analysis of these groups revealed that the Class II markers exhibited a significantly higher degree of polymorphism than Class I markers for all three microsatellite repeat classes (di-, tri-, and tetranucleotide) (Table 5). This strongly suggests that repeat length affects the degree of polymorphism of microsatellites.
Variations in polymorphism among different types of microsatellites
Statistical analysis further indicates that the polymorphic rates of the three repeat classes of microsatellites analyzed are significantly different (Table 5).
Here, "polymorphism rate" refers to the proportion of microsatellites in a given class that were shown to be polymorphic among individuals from the two strains of X. tropicalis. The pairwise comparisons show that tetranucleotide microsatellites have the highest polymorphic rates, significantly higher than dinucleotide and trinucleotide microsatellites (p < 0.01). Within the Class II markers, tetranucleotide microsatellites also exhibit the highest rate of polymorphism; however, the difference is significant only in comparison with dinucleotide microsatellites (p < 0.01), and not with trinucleotide microsatellites. In Class I markers, tetranucleotide microsatellites also exhibit a significantly higher polymorphism rate than trinucleotide loci (p < 0.05). However, the polymorphism rate for Class I tetranucleotide microsatellites was not significantly higher than that of dinucleotide microsatellites (p = 0.21).
Variations in polymorphism among different motifs of microsatellites
Figure 5 shows the rate of polymorphism for the most common microsatellite motifs. Although there was no significant difference in the rate of polymorphism among the three dinucleotide motifs (AT, AC, and AG), among the four most abundant trinucleotide motifs, AAT and ATC show significantly higher polymorphism than AAG and AGG (p < 0.01). Likewise, the most abundant tetranucleotide motifs, AGAT and AAAG, are significantly more polymorphic than ACAT and AAAT (p < 0.01). The higher polymorphism of microsatellites with motifs of AAT, ATC, AGAT, and AAAG seems to be correlated with their relatively longer repeat length (Fig. 3).
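The Class I / Class II split and the polymorphism rate defined above can be sketched as follows. This is an illustrative Python sketch, not the procedure used in the study: it assumes that "length" means the repeat tract length (unit size times number of units), and the function names and the input tuple layout are hypothetical.

```python
from collections import defaultdict


def size_class(tract_bp):
    """Class I: repeat tract of 30 bp or less; Class II: more than 30 bp."""
    return "I" if tract_bp <= 30 else "II"


def polymorphism_rates(loci):
    """Proportion of polymorphic loci per (unit_size, size_class) group.

    `loci` is an iterable of (unit_size, n_units, is_polymorphic) tuples,
    where unit_size is 2, 3, or 4 and is_polymorphic records whether the
    locus differed between the two strains tested.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [polymorphic, total]
    for unit_size, n_units, is_polymorphic in loci:
        group = (unit_size, size_class(unit_size * n_units))
        counts[group][0] += bool(is_polymorphic)
        counts[group][1] += 1
    return {group: poly / total for group, (poly, total) in counts.items()}
```

For example, a dinucleotide locus with 15 repeat units (30 bp) falls in Class I, while one with 16 units (32 bp) falls in Class II.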

Discussion
Characteristics of X. tropicalis genome and the distribution of microsatellites
Our bioinformatics analysis found over 91,000 di-, tri-, and tetranucleotide microsatellites in ~25% of the X. tropicalis genome, suggesting there may be over 360,000 within the entire genome. Within the X. tropicalis genome, dinucleotide (78.7%) microsatellites vastly outnumber tri- (8.1%) and tetranucleotide (13.2%) microsatellites. Although there is some variation in the literature, these observations generally agree with data from other vertebrates (Toth et al. 2000). In the present study, the trinucleotide repeats are the least abundant of the microsatellites, which is consistent with studies in other vertebrates as well. Trinucleotide repeats, however, are more prevalent in protein coding regions, while di- and tetranucleotide repeats are scarce in exons (Li et al. 2002, 2004; Morgante et al. 2002; Toth et al. 2000; Dieringer and Schlotterer, 2003). The latter is probably the result of negative selection against frameshift mutations, which limits the expansion of microsatellites in coding sequences (Metzgar et al. 2000). During our analysis of the three types of microsatellites in scaffolds from the Xenopus tropicalis genome assembly 4.1, we noticed that trinucleotide repeats were overrepresented in some scaffolds and underrepresented in others. This could enable us to distinguish exon-rich scaffolds from those scaffolds containing primarily intergenic regions. In the X. tropicalis genome the AT-rich repeats are overwhelmingly dominant. The most abundant motif in each of the three types of microsatellites (AT, AAT, and AGAT) is AT-rich (Table 2). Among all the di-, tri-, and tetranucleotide repeats identified, 51,858 out of 91,304 repeats (56.8%) are 100% AT repeats (i.e. AT, AAT, AAAT, and AATT), while 90,128 repeats (99%) have an AT content of at least 50%. The high abundance of the AT-rich repeats in X. tropicalis could be partly attributable to the low melting temperature of AT-rich fragments and high mutation rates in poly (A/T) tracts (Prasad et al. 2005). However, these factors cannot explain why different taxa have different abundant repeat motifs.
Although exceptions exist (Schug et al. 1998), AC repeats have been reported as the most common dinucleotide repeats in most animals, including humans (Beckmann and Weber, 1992; Nadir et al. 1996; Katti et al. 2001), primates (Jurka and Pethiyagoda, 1995; Toth et al. 2000), rodents (Beckmann and Weber, 1992; Toth et al. 2000), chickens (Moran, 1993), Fugu (Edwards et al. 1998), bivalves (Cruz et al. 2005), and Drosophila (Schug et al. 1998; Bachtrog et al. 1999). In contrast, AG repeats are found to be the most abundant dinucleotide repeats in honey bees (Estoup et al. 1993) and yellowjacket wasps (Thoren et al. 1995), while AT repeats dominate the dinucleotide microsatellites in silkworms (Prasad et al. 2005) and yeast (Toth et al. 2000). Significantly, the predominance of AT repeats in the X. tropicalis genome found in the present study is the first such report in vertebrates. Interestingly, our results differ from those of Toth et al. (2000), who found that AC repeats are the most abundant repeats in vertebrates, occurring more than twice as frequently as AT repeats. In their study, 12.15% of the vertebrate taxonomic group was represented by Xenopus laevis, the sister species of X. tropicalis. Further analysis is needed to determine whether the distribution of repeat motifs observed in X. tropicalis is characteristic of Xenopus laevis or other closely related frog species.
In contrast with dinucleotide abundance levels, the most prevalent tri- and tetranucleotide repeats of X. tropicalis, AAT and AGAT, are consistent with the results in some other vertebrates including X. laevis, although differing from those seen in some mammalian species (Edwards et al. 1998; Toth et al. 2000). Schlotterer (2000) has suggested that taxon-specific predominance of different repeat motifs could be influenced by a different base composition in the genome as well as differences in the DNA mismatch repair systems. In addition, Prasad et al. (2005) have suggested there is a potential relationship between distribution of repeat motifs and higher-order chromatin structure. Tetranucleotide microsatellites containing the AGAT (GATA) motif are known to be associated with the sex chromosome in humans and to play a role in higher-order chromatin organization and function (Singh et al. 1994; Zhao et al. 1995; Subramanian et al. 2003). Because of the predominance of AGAT repeats in its genome, X. tropicalis provides a unique opportunity for comparative studies on the role of these repeats.

Comparisons between nonredundant and genomic microsatellites
The nonredundant di-, tri-, and tetranucleotide microsatellites, which were used as candidate markers for our linkage map, were independently identified from the X. tropicalis genome. The criterion for identifying nonredundant markers is that they have unique flanking sequences that are long enough and sufficiently complex to enable the design of unique PCR primers (Sharrocks, 1994). Among the three repeat types of nonredundant microsatellites analyzed, the distribution pattern of different motifs is generally consistent with that of genomic repeats, in that the most abundant di-, tri-, and tetranucleotide nonredundant motifs are AT, AAT, and AGAT. Although this subset of microsatellite loci is similar to those identified in the entire X. tropicalis genome, the relative abundance of different motifs within di-, tri-, or tetranucleotide microsatellites shows some divergence between nonredundant and genomic repeats. For example, the AT repeats account for 64.7% of the total dinucleotide genomic loci, but 90.3% of the nonredundant dinucleotide loci, suggesting that a smaller proportion of AC and AG repeats are embedded in unique sequences with flanking sequences long enough for useful primer design. The discrepancies between the abundance of specific motifs in nonredundant loci versus genomic microsatellites may result from the appearance of AC or AG repeat strings embedded within more complex repetitive sequences. Large complex minisatellite repeats comprise over 1% of the X. tropicalis genome, with our initial surveys suggesting that sequences containing AC repeats appear in very high copy numbers in these minisatellites. Inclusion of AC or AG repeat strings in larger, more complex minisatellite sequences could skew the distribution of repeat motifs among genomic microsatellites.

Table note: For comparisons of polymorphism rates between Class 1 and 2 within each microsatellite repeat group (di-, tri-, and tetra-), * means the two size classes are significantly different (p < 0.05) and ** means highly significantly different (p < 0.01). For comparisons of polymorphism rates among the three microsatellite repeat groups (di-, tri-, and tetra-), + means significant differences (p < 0.05) and ++ means highly significant differences (p < 0.01) (see text).

Factors affecting microsatellite variation
It is well known that microsatellites are hot spots for genome mutation and variation (Weber, 1990; Ellegren, 2004). The variability seen in microsatellites is primarily due to sequence length polymorphisms caused by variable numbers of tandem repeats (Ellegren, 2000, 2004; Neff and Gross, 2001). In the present study, we compared the percentage of polymorphic loci in two different size classes (class I: length ≤ 30 bp; class II: length > 30 bp). We found that class II markers are significantly more polymorphic than class I markers for all three microsatellite repeat types. This suggests that loci with larger numbers of repeats are more prone to mutation/expansion than those with fewer repeats. This result is consistent with other observations based on pedigree analyses (Brinkmann et al. 1998; Schug et al. 1998; Bachtrog et al. 2000; Kayser et al. 2000) and population genetics studies (Goldstein and Clark, 1995). The correlation between repeat length and the variability of microsatellites is understandable under the replication slippage model, one of the widely accepted mutation mechanisms (Levinson and Gutman, 1987): the longer the repeat tract, the more opportunities exist for slipped-strand mispairing to occur. Repeat type is another factor that has been found to affect stability of microsatellites (Schlotterer, 2000; Ellegren, 2004). Our study compared the polymorphism rate of 1,907 di-, 933 tri-, and 2,288 tetranucleotide microsatellites. The results indicate that the tetranucleotide microsatellites have the highest rate of polymorphism while the dinucleotide microsatellites are the least polymorphic. Our results agree with Weber and Wong's observation (1993) that the mutation rate for tetranucleotides is almost four times higher than that of dinucleotide repeats. Sia et al. (1997) reported similar mutation rates for tetranucleotide and dinucleotide repeats. However, two subsequent studies using different methodologies (Chakraborty et al. 1997; Lee et al. 1999) reached the conflicting conclusion that dinucleotide microsatellites have higher mutation rates than tetranucleotide microsatellites. The discrepancies between these studies may have resulted from insufficient data. Additional analysis is required to clarify the effects of repeat type on the polymorphism of microsatellites. It has also been reported that the base composition of the repeat motifs may play a role in the variations of microsatellites. When they compared slippage rates between different microsatellites with different base compositions in Drosophila using an in vitro replication system, Schlotterer and Tautz (1992) found that sequences with high AT content mutate faster than those with high GC content. In contrast, Bachtrog et al. (2000) found that GT/CA-containing microsatellites of D. melanogaster had the highest mutation rate, while the AT-containing microsatellites had the lowest. Still another study showed that CA and GA repeats of similar length in the Escherichia coli genome exhibit similar mutability (Eckert and Yan, 2000). Although our results indicate there are no differences in the polymorphic rate among the three dinucleotide motifs (AT, AC, and AG), among the most predominant tri- and tetranucleotide microsatellites, AAT and ATC exhibit a higher rate of polymorphism than AAG and AGG, and AGAT and AAAG are more frequently polymorphic than ACAT and AAAT. It remains unclear if the higher variability in AAT, ATC, AGAT, and AAAG microsatellites is a universal or species-specific phenomenon. In humans, an AAAG tetranucleotide locus has also demonstrated hypermutability (Talbot et al. 1995). It is worth noting that the four tri- and tetranucleotide microsatellite motifs showing the highest rate of polymorphism all have a higher number of repeat units per locus than the less polymorphic motifs in their repeat class.
However, this trend does not hold for the dinucleotide loci, as AT loci have significantly more tandem repeats than either AC or AG loci, yet the rate of polymorphism of AT does not significantly differ from the other two.