Genome-Wide Microsatellite Characterization and Marker Development in the Sequenced Brassica Crop Species

Although much research has been conducted, the pattern of microsatellite distribution has remained ambiguous, and the development/utilization of microsatellite markers has still been limited/inefficient in Brassica, due to the lack of genome sequences. In view of this, we conducted genome-wide microsatellite characterization and marker development in three recently sequenced Brassica crops: Brassica rapa, Brassica oleracea and Brassica napus. The analysed microsatellite characteristics of these Brassica species were highly similar or almost identical, which suggests that the pattern of microsatellite distribution is likely conservative in Brassica. The genomic distribution of microsatellites was highly non-uniform and positively or negatively correlated with genes or transposable elements, respectively. Of the total of 115 869, 185 662 and 356 522 simple sequence repeat (SSR) markers developed with high frequencies (408.2, 343.8 and 356.2 per Mb or one every 2.45, 2.91 and 2.81 kb, respectively), most represented new SSR markers, the majority had determined physical positions, and a large number were genic or putative single-locus SSR markers. We also constructed a comprehensive database for the newly developed SSR markers, which was integrated with public Brassica SSR markers and annotated genome components. The genome-wide SSR markers developed in this study provide a useful tool to extend the annotated genome resources of sequenced Brassica species to genetic study/breeding in different Brassica species.


Introduction
Microsatellites, which are also known as simple sequence repeats (SSRs, often defined as 1-6 bp), variable numbers of tandem repeats (VNTRs) and short tandem repeats (STRs), have been found in all genomic regions of all examined organisms. 1 Microsatellites have been traditionally regarded as 'junk' DNA and are mainly used as 'neutral' genetic markers. 2 In recent years, microsatellites have been demonstrated to have many important biological functions (e.g. the regulation of chromatin organization, DNA metabolic processes, gene activity and RNA structure) 3,4 and have therefore emerged as the third major class of genetic variations, alongside single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). 5 Microsatellite markers are co-dominant, multi-allelic, easily detected, hypervariable, highly reproducible and abundant in the genome. 6 Therefore, among the available genetic marker systems (e.g. RFLP, RAPD, SSR, AFLP, SRAP and SNP), the SSR marker has been the preferential choice for various applications, such as variety identification, genetic diversity evaluation, phylogenetic relationship analysis, genetic map construction, linkage/association mapping of gene/QTL, marker-assisted selection and comparative mapping. 7,8
Of the 47 genera in the Brassiceae tribe within the Brassicaceae (Cruciferae) family, the genus Brassica currently comprises 38 species, 9 which include economically important crops that provide many vegetables, condiments, fodders and oil products. 10 The main cultivated Brassica species include three diploid species, Brassica rapa (AA, n = 10), Brassica nigra (BB, n = 8) and Brassica oleracea (CC, n = 9) and three allotetraploid species, Brassica juncea (AABB, n = 18), Brassica napus (AACC, n = 19) and Brassica carinata (BBCC, n = 17).
The genetic relationships of the six widely cultivated Brassica species are described by U's triangle, 11 in which the three allotetraploid species originated from chromosome doubling after natural hybridization between the three diploid species.
Much research has been conducted to identify/characterize genomic/genic microsatellites and/or to develop markers in the Brassica species through probe (containing a repeated motif) hybridization against genomic/cDNA clones 12-19 or through in silico analysis of publicly available bacterial artificial chromosome (BAC) sequences, 20 BAC-end sequences (BESs), 21-23 genome survey sequences (GSSs), 24 whole genome shotgun sequences (WGSs), 25,26 expressed sequence tag sequences 27,28 and unique transcript sequences. 29-32 However, the pattern of microsatellite distribution has remained ambiguous, and the development/utilization of SSR markers has still been limited/inefficient in Brassica, which is mostly due to the lack of genome sequences. First, the sequences, programmes, criteria and parameters used for mining microsatellites have usually differed across these previous studies, which has made it difficult to compare and integrate their results to obtain definitive conclusions on the pattern of microsatellite distribution. Secondly, only a small part of the genomic sequences of usually one species has been analysed in each of these previous studies. Therefore, it has been impossible to obtain general conclusions on the pattern of microsatellite distribution. In addition, the total number (≈10 000) of previously developed publicly available SSR markers is still limited 33 and not sufficient for many studies that require a large number and/or high density of genetic markers, such as high-density linkage map construction, gene/QTL fine-mapping and genome-wide/regional association mapping. Thirdly, due to the lack of genome sequences, the genomic distribution of microsatellites and the physical position(s)/product number(s) of the previously developed publicly available Brassica SSR markers have been entirely or mostly unclear, which has hindered their exact and/or effective utilization.
Thanks to the rapid development of genome sequencing technology, genome sequences are currently available for tens of plant species (http://www.phytozome.net), including three recently sequenced Brassica crop species, namely B. rapa, 34 B. oleracea (http://www.ocri-genomics.org/bolbase/index.html) and B. napus (our unpublished data). These sequences provide a powerful tool for genome-wide microsatellite characterization and/or marker development, which has been conducted in several model and crop plants, such as Arabidopsis (http://www.arabidopsis.org/), rice, 35 maize (mips.helmholtz-muenchen.de/plant/maize/), sorghum (genome.jgi-psf.org/Sorbi1/Sorbi1.home.htm), black cottonwood, 36 cucumber, 37 Brachypodium distachyon 38 and foxtail millet 39 but not Brassica. In view of this circumstance, we conducted genome-wide microsatellite characterization and marker development in the three sequenced Brassica crop species. The main objectives of this study were as follows: (i) to characterize and compare the frequency and distribution with respect to the motif length, type and repeat number of microsatellites in the assembled genomic sequences of these Brassica species; (ii) to characterize and compare the genomic distribution of microsatellites in the assembled pseudochromosomes of these Brassica species; (iii) to develop SSR markers from the assembled genomic sequences of these Brassica species and determine their copy number and positional relationship with the previously developed publicly available Brassica SSR markers and the annotated genome components; (iv) to construct a user-friendly comprehensive SSR marker database of Brassica and (v) to evaluate the newly developed genome-wide SSR markers by PCR (polymerase chain reaction) amplification in representative B. napus inbred lines.

Sources of genome sequences
The three inbred/pure lines, namely Chiifu-401 (B. rapa), O212 (B. oleracea) and Zhongshuang11 (B. napus), were sequenced by our own and several other institutes using Illumina GA II technology, and high-quality sequence reads were assembled using stringent parameters. Finally, a total of 40 549 (283.8 Mb), 120 061 (540.0 Mb) and 5098 (1000.9 Mb) sequence scaffolds were obtained for B. rapa, B. oleracea and B. napus, respectively, which represents 58.5, 77.6 and 81.7% of the nuclear genome and covers >98% of the gene space.
The minimum number of repeat units was defined as 12, 6, 4, 3, 3 and 3, respectively, for the mono- to hexanucleotide motifs. Compound microsatellites were defined as ≥2 repeats interrupted by ≤100 bp.
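The thresholds above can be illustrated with a minimal Python sketch. It is not the program used in this study; it is a simplified regular-expression search for perfect repeats only, and it assumes that "interrupted by ≤100 bp" refers to the gap between the end of one repeat and the start of the next. The function names find_ssrs and group_compound are hypothetical.

```python
import re

# Minimum repeat numbers for mono- to hexanucleotide motifs, as stated in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_number) tuples for perfect repeats.

    Coordinates are 0-based and half-open. This simplified scan can miss a
    repeat that starts inside a region consumed by a rejected match.
    """
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Reject motifs that are repeats of a shorter unit (e.g. "ATAT" = "AT" x 2).
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


def group_compound(hits, max_gap=100):
    """Group neighbouring repeats separated by at most max_gap bp.

    Groups with two or more members correspond to compound microsatellites
    under the assumption stated above.
    """
    groups = []
    for hit in hits:
        if groups and hit[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


if __name__ == "__main__":
    example = "GC" + "A" * 12 + "CGT" + "AG" * 6 + "TTC"
    for group in group_compound(find_ssrs(example)):
        print("compound" if len(group) > 1 else "simple", group)
```

In this toy sequence, the (A)12 and (AG)6 repeats are separated by 3 bp and are therefore reported together as one compound microsatellite.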

Development of SSR primers
Primer pairs were designed from the flanking sequences of identified microsatellites using the primer3_core program (http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi) in batch mode. Two Perl scripts, p3_in.pl and p3_out.pl, served as interface modules for the programme-to-programme data interchange between MISA and the primer modelling software Primer3. The primer-designing parameters were 18-27 bp primer length, 57-63°C melting temperature, 30-70% GC content and 100-300 bp product size. The designed SSR primer pairs were named with the name of the sequence scaffold followed by the serial number of the microsatellite (e.g. BrScaffold000001_1).
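As an illustration of how the parameters above translate into Primer3 input, the following Python sketch writes one Boulder-IO record. It is not the p3_in.pl script; the tag names are those of Primer3 version 2.x and are an assumption here, because the Primer3 release used in this study may have expected older tag names. The function name primer3_record and the example coordinates are hypothetical.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length):
    """Build one Boulder-IO record asking Primer3 for primers flanking an SSR.

    ssr_start is 1-based (declared through PRIMER_FIRST_BASE_INDEX).
    Tag names follow Primer3 2.x; older releases used different names.
    """
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", "%d,%d" % (ssr_start, ssr_length)),
        ("PRIMER_FIRST_BASE_INDEX", 1),
        ("PRIMER_MIN_SIZE", 18),          # 18-27 bp primer length
        ("PRIMER_MAX_SIZE", 27),
        ("PRIMER_MIN_TM", 57.0),          # 57-63 degrees C melting temperature
        ("PRIMER_MAX_TM", 63.0),
        ("PRIMER_MIN_GC", 30.0),          # 30-70% GC content
        ("PRIMER_MAX_GC", 70.0),
        ("PRIMER_PRODUCT_SIZE_RANGE", "100-300"),
    ]
    lines = ["%s=%s" % (name, value) for name, value in tags]
    lines.append("=")                     # record terminator in Boulder-IO
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy template: 80 bp flank, (AG)6 repeat, 80 bp flank.
    template = "ACGT" * 20 + "AG" * 6 + "TGCA" * 20
    print(primer3_record("BrScaffold000001_1", template, 81, 12), end="")
```

A record of this form could be piped to primer3_core on standard input; parsing of the output is omitted.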

Localization/mapping of SSR markers by in silico PCR
The primer-pair sequences of previously developed publicly available Brassica SSR markers were downloaded from the brassica.info website (http://www.brassica.info/resource/markers/ssr-exchange.php) and additional files in the recent literature. 20,24,26,29,30,32 To determine their physical positions and copy numbers, the previously and newly developed Brassica SSR markers were aligned to the assembled genomic sequences of the studied Brassica species. This alignment was conducted using the in silico PCR method 41 with the following default parameters: 2 bp mismatch, 1 bp gap, 50 bp margin and 50-1000 bp product size.
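The principle of counting in silico PCR products can be sketched in Python as follows. This is not the in silico PCR program cited above: it is a brute-force illustration that models only the 2-bp mismatch limit and the 50-1000 bp product size, ignores gaps and the margin parameter, and searches a single orientation (forward primer on the given strand). All function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def match_positions(primer, genome, max_mismatch=2):
    """Start positions where primer aligns without gaps and with few mismatches."""
    n = len(primer)
    return [
        i
        for i in range(len(genome) - n + 1)
        if sum(a != b for a, b in zip(primer, genome[i:i + n])) <= max_mismatch
    ]


def in_silico_pcr(fwd, rev, genome, min_size=50, max_size=1000, max_mismatch=2):
    """Return (start, end) of predicted products, 0-based and half-open."""
    rev_site = revcomp(rev)  # sequence of the reverse primer site on the given strand
    starts = match_positions(fwd, genome, max_mismatch)
    ends = [i + len(rev_site) for i in match_positions(rev_site, genome, max_mismatch)]
    return [(s, e) for s in starts for e in ends if min_size <= e - s <= max_size]


if __name__ == "__main__":
    fwd = "ACGTTGCAAGGCTAGCTA"
    rev_site = "GGATCCTTAGCAGTCAAG"
    genome = "TTTT" + fwd + "AT" * 30 + rev_site + "CCCC"
    products = in_silico_pcr(fwd, revcomp(rev_site), genome)
    print(len(products), "product(s):", products)
```

The number of tuples returned corresponds to the number of in silico PCR products recorded for a marker; a practical tool would also search the opposite orientation and use an indexed search rather than a linear scan.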

Validation of SSR markers by PCR amplification
A total of 3974 SSR primer pairs were synthesized to test for PCR amplification in six representative B. napus cultivars/inbred lines (Tapidor, Westar, Zhongshuang11, No. 07197, No. 73290 and No. 91032), which were chosen from the core collections of a natural population and the parents of several segregating populations in our laboratory, for their large genetic distance and extreme trait(s) performance (our unpublished data).
Genomic DNA of the six accessions was isolated from young leaves. PCR was performed in a 20-µl volume that contained 0.2 mM dNTP, 0.5 U of Taq DNA polymerase, 75 ng of template DNA, 0.5 µM each primer and 1× PCR buffer (10 mM Tris pH 9.0, 50 mM KCl and 1.5 mM MgCl2). DNA amplification was conducted by the 'touchdown' method, with the following thermal profile: initial denaturation at 94°C for 5 min; six cycles of 30 s at 94°C, 45 s at 63°C with a 1°C decrease in annealing temperature per cycle and 1 min at 72°C; 26 cycles of 30 s at 94°C, 45 s at 57°C and 1 min at 72°C and a final extension at 72°C for 10 min. The PCR products were separated on 6% denaturing polyacrylamide gels and were visualized by silver staining.
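The touchdown profile described above can be written out step by step with a short Python sketch. It assumes that the six touchdown cycles anneal at 63, 62, 61, 60, 59 and 58°C before the 26 cycles at 57°C; the function name touchdown_profile is hypothetical.

```python
def touchdown_profile():
    """Return the thermal profile as (step, temperature in degrees C, seconds)."""
    steps = [("initial denaturation", 94, 300)]
    # Six touchdown cycles: annealing starts at 63 and drops by 1 per cycle.
    for cycle in range(6):
        steps += [("denature", 94, 30), ("anneal", 63 - cycle, 45), ("extend", 72, 60)]
    # Twenty-six cycles at the final annealing temperature.
    for _ in range(26):
        steps += [("denature", 94, 30), ("anneal", 57, 45), ("extend", 72, 60)]
    steps.append(("final extension", 72, 600))
    return steps


if __name__ == "__main__":
    profile = touchdown_profile()
    hold_seconds = sum(seconds for _, _, seconds in profile)
    print("steps:", len(profile))
    print("total hold time (min):", hold_seconds / 60)
```

The total computed here counts hold times only and excludes the ramping time of the thermal cycler.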

Statistical analysis
The correlation analysis was performed using the PROC CORR procedure of SAS version 8.0. The Excel statistical function CHISQ.TEST was used to obtain the significance level (P-value of the χ² test) of the fit between the observed and hypothetical distributions of microsatellites as well as genes and TEs in the assembled pseudochromosomes.

Frequency and distribution
In accordance with their high correlation (Supplementary Table S1), the distributions with respect to the motif length of microsatellites in the assembled genomic sequences of B. rapa, B. oleracea and B. napus were almost identical: mono-, di-, tri- and tetranucleotide repeats accounted for very similar and relatively high proportions, whereas penta- and hexanucleotide repeats were relatively uncommon (Fig. 1A).
In accordance with their high correlation (Supplementary Table S1), the distributions with respect to the motif type of microsatellites in the assembled genomic sequences of B. rapa, B. oleracea and B. napus were almost identical (Fig. 1B; Supplementary Table S2). More specifically, both the dominant/major and absent/scarce mono- to hexanucleotide motifs in the assembled genomic sequences of the three Brassica species were mostly identical (Table 1; Supplementary Table S3). Interestingly, the dominant/major motifs (A, AT, AAG/AAT, AAAT, AAAAT and AAAAAT) were all A/T rich (Table 1), whereas the absent/scarce motifs were mostly C/G rich (Supplementary Table S3), which was highly consistent with the previous reports on microsatellites identified from 536 seed BACs of B. rapa, 20 3500 genomic clones 42 and 595 577 WGSs 26 of B. oleracea and 13 794 GSSs (mainly BESs) of B. napus. 24 It should be noted that the nucleotide composition characteristics of both the dominant/major and absent/scarce motifs in the assembled genomic sequences of the three Brassica species corresponded well to their much higher A/T (mean = 63.8%) than C/G (mean = 36.2%) content.
In accordance with their high correlation (Supplementary Table S1), the distributions with respect to the motif repeat number of microsatellites in the assembled genomic sequences of B. rapa, B. oleracea and B. napus were also almost identical (Fig. 1C). The microsatellite abundances decreased significantly as the motif repeat number increased, and the rate of this change was the slowest for dinucleotide repeats, followed by mono- and trinucleotide repeats, and was faster for the longer repeats (Fig. 2). As a consequence, the difference between the average and minimum motif repeat numbers was the largest for dinucleotide repeats, followed by mono- and trinucleotide repeats, and was relatively small for tetra- to hexanucleotide repeats (Table 1).
In addition, the motif repeat numbers of the corresponding mono- to hexanucleotide repeats or motifs of microsatellites in the assembled genomic sequences of B. rapa, B. oleracea and B. napus were highly similar (Table 1; Supplementary Table S4). As a consequence, the total repeat length (= microsatellite number × motif length × motif repeat number) proportions of the corresponding mono- to hexanucleotide repeats or motifs of microsatellites in the assembled genomic sequences of B. rapa, B. oleracea and B. napus were mostly similar (Table 1; Supplementary Table S5).
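The total repeat length defined above, and the proportion contributed by each repeat class, can be computed as in the following Python sketch. The counts and mean repeat numbers in the example are hypothetical and are not taken from Table 1.

```python
def total_repeat_length(count, motif_length, mean_repeat_number):
    """Total repeat length = microsatellite number x motif length x motif repeat number."""
    return count * motif_length * mean_repeat_number


def length_proportions(classes):
    """Map each repeat class to its share of the summed total repeat length."""
    totals = {name: total_repeat_length(*values) for name, values in classes.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical values: (microsatellite number, motif length in bp, mean repeat number).
    example = {"mono": (100, 1, 15.0), "di": (50, 2, 10.0), "tri": (50, 3, 5.0)}
    for name, share in length_proportions(example).items():
        print(name, round(100 * share, 1), "%")
```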

Genomic distribution
The genomic distributions of microsatellites and their relation with the annotated genome components (mainly genes and TEs) were investigated (Fig. 3; Table 2), based on the assembled pseudochromosomes of the sequenced Brassica species (currently available for B. rapa and B. oleracea; Supplementary Table S6).
For both B. rapa and B. oleracea, the frequency of microsatellites was high at/near both ends but low in/near the middle of all the pseudochromosomes (Fig. 3), which likely corresponded to the peri-telomere and centromere, respectively. 43 The frequencies of microsatellites for the different pseudochromosomes of B. rapa or B. oleracea were generally comparable, which was in accordance with the similar frequencies of genes/TEs for these chromosomes (Fig. 3; Table 2). Interestingly, the homoeologous chromosomes A3 and C3 exhibited the highest microsatellite frequency among all the pseudochromosomes of B. rapa and B. oleracea, respectively, which was in accordance with their highest gene frequency among these chromosomes (Fig. 3; Table 2). In accordance with the high significance of the P-values of the χ² test between the observed and hypothetical/average frequencies of microsatellites in the 1-Mb genomic intervals (Table 2), the physical distribution of microsatellites on all the pseudochromosomes of both B. rapa and B. oleracea was highly non-uniform (Fig. 3), which suggests the non-random occurrence of microsatellites. In accordance with the usually higher P-values of the χ² test between the observed and hypothetical/average frequencies of microsatellites for the 9 pseudochromosomes of B. oleracea than for the 10 pseudochromosomes of B. rapa (Table 2), the distribution of microsatellites was more uneven in B. oleracea than in B. rapa (Fig. 3), which was likely attributable to the more concentrated distribution of genes/TEs in B. rapa than in B. oleracea. For both B. rapa and B. oleracea, the frequencies of microsatellites in the 1-Mb genomic intervals studied were significantly positively or negatively correlated with those of genes (total r = 0.75 and 0.87) or TEs (total r = −0.61 and −0.73), respectively (Table 2), which was consistent with the observation that the genomic distribution of microsatellites was generally in accordance with that of genes but opposite to that of TEs (Fig. 3). These results were in agreement with the previous findings, which showed that microsatellites are preferentially associated with non-repetitive DNA/gene sequences in the plant genome. 5,44 The high agreement between microsatellites and genes strongly suggests a putative role of microsatellites in regulating gene function 3-5 and supports the use of SSR markers for tagging/cloning genes.
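The two statistics used above can be reproduced in outline with the following Python sketch, which computes a χ² statistic against a uniform expectation and a Pearson correlation coefficient for counts in 1-Mb intervals. The analyses in this study were run in SAS and Excel; the interval counts below are hypothetical and serve only to show the calculation.

```python
import math


def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against their average (uniform) value."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical counts per 1-Mb interval along one pseudochromosome.
    ssrs = [600, 550, 300, 200, 350, 580]
    genes = [220, 200, 90, 60, 110, 210]
    tes = [100, 150, 400, 500, 350, 120]
    print("chi-square (df = %d):" % (len(ssrs) - 1), chi_square_uniform(ssrs))
    print("r(SSR, genes):", pearson_r(ssrs, genes))
    print("r(SSR, TEs):", pearson_r(ssrs, tes))
```

The P-value follows from the χ² distribution with (number of intervals − 1) degrees of freedom; its calculation is omitted here because it requires a statistics library.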
In conclusion, the genomic distributions of microsatellites in the assembled pseudochromosomes of B. rapa and B. oleracea were generally similar.
Table 3). The primer pairs could not be designed for the remaining microsatellites, mostly due to the constraint of obtaining sufficient flanking sequences from either side of the identified microsatellites. Similar results have also been reported in other genome-wide microsatellite marker development studies in plants, such as rice, 35 black
The physical positions of the newly developed genome-wide SSR markers of B. napus will be determined soon, because the anchoring of its sequence scaffolds is expected to be completed within several months (our unpublished data). Because of the polyploid nature of Brassica, 45 SSR markers usually amplify multiple fragments from homologous DNA sequences, which could complicate or cause errors in genotype scoring. Therefore, all of the newly developed genome-wide SSR markers were subjected to in silico PCR analysis in the assembled genomic sequences of B. rapa, B. oleracea and B. napus, and the numbers of in silico PCR product(s) were recorded and summarized (Table 3). Interestingly, the SSR markers that generated tens to thousands of in silico PCR products were mostly associated with the annotated TEs, especially the retrotransposons.

Development and database of genome-wide SSR markers
We also determined the relationship between the physical positions of the newly developed genome-wide SSR markers and the previously developed publicly available Brassica SSR markers as well as the annotated genome components (mainly genes and TEs) (Supplementary Table S7). Of the 115 869 SSR markers developed from B. rapa, 5991 (5.2%), 22 596 (19.5%) and 32 648 (28.2%) overlapped with public Brassica SSR markers, genes and TEs, respectively. Of the 185 662 SSR markers developed from B. oleracea, 12 322 (6.6%), 33 228 (17.9%) and 73 487 (39.6%) overlapped with public Brassica SSR markers, genes and TEs, respectively. Of the 356 522 SSR markers developed from B. napus, 23 928 (6.7%), 58 952 (16.5%) and 161 090 (45.2%) overlapped with public Brassica SSR markers, genes and TEs, respectively. Interestingly, the TE-associated SSR markers rarely overlapped with the annotated genes and mostly generated tens to thousands of in silico PCR products.
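The percentages above, and the marker spacing quoted with the marker frequencies (408.2, 343.8 and 356.2 per Mb), follow from simple arithmetic, which can be checked with the Python sketch below. The helper names pct and kb_per_marker are hypothetical; the counts are those given in the text.

```python
# (total markers, overlapping public SSR markers, genes, TEs), as given in the text.
MARKERS = {
    "B. rapa": (115869, 5991, 22596, 32648),
    "B. oleracea": (185662, 12322, 33228, 73487),
    "B. napus": (356522, 23928, 58952, 161090),
}

# Marker frequencies per Mb, as given in the text.
FREQ_PER_MB = {"B. rapa": 408.2, "B. oleracea": 343.8, "B. napus": 356.2}


def pct(part, total):
    """Percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


def kb_per_marker(per_mb):
    """Average spacing in kb for a frequency given per Mb (1 Mb = 1000 kb)."""
    return round(1000 / per_mb, 2)


if __name__ == "__main__":
    for species, (total, public, genes, tes) in MARKERS.items():
        print(
            species,
            [pct(n, total) for n in (public, genes, tes)],
            "% of markers; one marker every",
            kb_per_marker(FREQ_PER_MB[species]),
            "kb",
        )
```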
To facilitate access to and effective utilization of the Brassica SSR markers, we constructed an integrative database (http://oilcrops.info/SSRdb), which has search tools to obtain useful information on the newly developed genome-wide SSR markers from the sequenced Brassica species and the previously developed publicly available Brassica SSR markers (Fig. 4). For the previously developed publicly available Brassica SSR markers, this information includes the primer-pair sequences, microsatellite repeat, source, reference and number of in silico PCR product(s) in the assembled genomic sequences of the sequenced Brassica species (currently only for B. rapa, B. oleracea and B. napus). For the newly developed genome-wide SSR markers from the sequenced Brassica species, this information includes the following: (i) the sequence, type, length and physical position of microsatellite
(Table 4). Of these, 3880 SSR markers (97.6%) successfully amplified at least one clear fragment, while the remaining 94 (2.4%) failed to amplify, which could be due to the differences between the genome sequences of B. napus and its two progenitors, B. rapa and B. oleracea. 46,47 The amplification rate (97.6%) of the tested SSR markers in the six B. napus cultivars/inbred lines was slightly or much higher than the corresponding rates (94.3/82.9, 89.2 and 77.4%, respectively) for the previously developed SSR markers from GSSs (mainly BESs)/unique transcripts of B. napus, 24,30 BACs of B. rapa 20 and WGSs of B. oleracea, 26 which suggests that the SSR markers developed from the assembled genomic sequences are of high quality. The amplification rate of the tested SSR markers showed small variations for different motif lengths, motif repeat numbers and repeat lengths (i.e. motif length × motif repeat number), which was consistent with the previous reports in Brassica 20,24,26 and rice. 35 For the majority of the tested SSR markers, the numbers of fragment(s) amplified from the six representative B. napus cultivars/inbred lines were equal or very close to those of in silico PCR product(s) in the assembled genomic sequences of B. napus (Supplementary Table S8). In particular, most (1602 of 1813; 88.4%) of the tested SSR markers that generated one in silico PCR product in the assembled genomic sequences of B. napus also amplified only a single clear fragment from the six representative B. napus cultivars/inbred lines. A considerable proportion (1099 of 3880; 28.3%) of the successfully amplified SSR markers also produced weak fragment(s), which could correspond to non-specific amplification(s) from homologous DNA sequences.
The majority (2765 of 3880; 71.3%) of the successfully amplified SSR markers were polymorphic across the six representative B. napus cultivars/inbred lines (Table 4). The polymorphism rate of the tested SSR   1 to >3). The polymorphism rate of the tested SSR markers decreased slightly from the mono- to tetranucleotide repeats, while it increased quickly from the penta- to hexanucleotide repeats. This inconsistency of the relationship between the SSR marker polymorphism level and the motif length was also observed frequently in the previous SSR marker evaluation experiments, such as in the tests of the 627 and 1000 SSR markers from the GSSs (mainly
Table 4. Amplification and polymorphism rate of the tested SSR markers and their association with the number of amplified fragment(s), the motif length, the motif repeat number and the repeat length
37 This type of inconsistency could be attributable to the observation that only a small number of SSR markers of the specific (usually long) motif length(s) have been used to investigate this relationship in all of the above-mentioned studies (e.g. only 21 and 12 penta- and hexanucleotide repeat SSR markers were tested in the current investigation), which makes it worthwhile to develop more SSR markers with long motifs to further investigate the relationship between the SSR marker polymorphism level and the motif length. The polymorphism rate of the tested SSR markers was highly positively correlated with both the motif repeat number and the repeat length (r = 0.74 and 0.86, respectively), which was basically consistent with the previous reports in Brassica 24,30 and other plant species, including cucumber 37 and carrot. 48 Both correlation coefficients in the current investigation were much higher than or equal to the corresponding values (0.21 and 0.41; 0.74 and _) that were estimated with the 627 SSR markers from the GSSs (mainly BESs) of B. napus 24 or the 1009 SSR markers from the assembled genomic sequences of cucumber, 37 respectively. Strikingly, the tested SSR markers that were designed from compound repeats were almost all (80 of 82; 97.6%) polymorphic across the six representative B. napus cultivars/inbred lines (Supplementary Table S8).
Because the 1055 and 2919 tested SSR markers were developed from the sequence scaffolds of B. rapa and B. oleracea, respectively, they were thus designated as 'BrSF' and 'BoSF'. To facilitate the effective utilization of these tested newly developed BrSF and BoSF SSR markers, the following useful information was provided (Supplementary Table S8): (i) the type, length, position and sequence of the microsatellite repeat; (ii) the name, sequences, annealing temperatures and expected product size of the primer pair; (iii) the number of in silico PCR product(s) in the assembled genomic sequences of the sequenced Brassica species (currently for B. rapa, B. oleracea and B. napus) and (iv) the polymorphism survey and number of fragment(s) amplified in six representative B. napus cultivars/inbred lines.

The pattern of microsatellite distribution is likely conservative in Brassica
In the current study, almost all of the important characteristics of microsatellite distribution in the assembled genomic sequences of the three recently sequenced Brassica crop species have been analysed and compared. To the best of our knowledge, this study is the first report on the genome-wide analysis and comparison of the pattern of microsatellite distribution across the different species within the same genus in plants.
First, the frequencies of microsatellites in the assembled genomic sequences of B. rapa (496.8 per Mb), B. oleracea (424.8 per Mb) and B. napus (420.6 per Mb) were similar, and all were higher than almost all of the previous estimations. 20,21,24,26,42 The slightly higher frequency of microsatellites in B. rapa than in both B. oleracea and B. napus is likely attributable to the more concentrated distribution and lower content of TEs in the assembled genomic sequences of B. rapa than in B. oleracea and B. napus (Fig. 3) because the frequencies (285.5, 272.0 and 285.4 per Mb) of microsatellites in the coding DNA sequences of the three species are almost equal. 49 Secondly, in accordance with the high correlation between these variables (Supplementary Table S1), the distributions with respect to the motif length, type and repeat number of microsatellites in the assembled genomic sequences of the three Brassica species were almost identical (Fig. 1; Supplementary Table S2). More specifically, both the dominant/major and absent/scarce mono- to hexanucleotide motifs in the assembled genomic sequences of the three Brassica species were mostly identical (Table 1; Supplementary Table S3). Interestingly, the dominant/major motifs were all A/T rich, while the absent/scarce motifs were mostly C/G rich, which corresponded well to the much higher A/T than C/G content in the analysed sequences. Thirdly, the repeat numbers of the corresponding repeats or motifs for the three Brassica species were mostly similar (Table 1; Supplementary Table S4). Fourthly, the total repeat length (= microsatellite number × motif length × motif repeat number) proportions of the corresponding repeats or motifs of microsatellites in the assembled genomic sequences of the three Brassica species were also mostly similar (Table 1; Supplementary Table S5). In addition, the genomic distributions of microsatellites in the assembled pseudochromosomes of B. rapa and B. oleracea were generally similar (Fig. 3).
In conclusion, almost all of the analysed important characteristics of microsatellite distribution in the assembled genomic sequences of the three sequenced Brassica crop species were highly similar or almost identical, which suggests that the pattern of microsatellite distribution is likely conservative in Brassica. This circumstance is understandable because B. napus (AACC, 2n = 38) originated from chromosome doubling after the very recent (≈0.01 MYA) natural hybridization between B. rapa (AA, 2n = 20) and B. oleracea (CC, 2n = 18), 11
To the best of our knowledge, this study is the first report on genome-wide SSR marker development in Brassica. Only a small proportion of the newly developed genome-wide SSR markers (5.2, 6.6 and 6.7% for B. rapa, B. oleracea and B. napus, respectively) overlapped with the previously developed publicly available Brassica SSR markers (Supplementary Table S7), which suggests that most of the newly developed genome-wide SSR markers represent new SSR markers. The large number and high frequency of the genome-wide SSR markers developed from the sequenced Brassica species in this study could be useful for many studies that require a large number and/or high density of molecular markers, such as high-density linkage map construction, gene/QTL fine mapping and genome-wide/regional association mapping.
The exact physical positions of the majority of the newly developed genome-wide SSR markers of the sequenced Brassica species have been determined (http://oilcrops.info/SSRdb) based on the mapped sequence scaffolds (Supplementary Table S6) from which they were designed. In fact, the physical positions of most of the previously developed publicly available Brassica SSR markers have also been determined by in silico mapping against the pseudochromosomes of these sequenced Brassica species (http://oilcrops.info/SSRdb). The high-density SSR marker-based physical maps constructed in this study could be useful for the rapid selection of genome-wide SSR markers that are well distributed over these chromosomes for various genotyping applications.
Because of the polyploid nature of Brassica, 45 the developed SSR markers usually amplify multiple fragments from the homologous DNA sequences, as revealed in the current (Supplementary Table S8) and previous 12-14,22,24,26,27,29,30,42,51 studies in Brassica. This could complicate or cause errors in the genotype scoring due to the reciprocal overlapping and uncertain allelism of these fragments. 33 However, only a small proportion of the previously developed publicly available Brassica SSR markers have been reported to be single locus. 33 Therefore, there is an urgent need to develop more single-locus SSR markers to facilitate their application in Brassica. Previously, single-locus SSR markers were developed by practical PCR amplification in a panel of inbred lines, 33 which was time consuming, labour intensive, costly and, thus, inefficient. In the current study, through the highly efficient in silico PCR analysis, a large number of newly developed genome-wide SSR markers (92 517, 121 169 and 93 084 for B. rapa, B. oleracea and B. napus, respectively) were found to generate one in silico PCR product in the assembled genomic sequences of the three sequenced Brassica species (Table 3). In addition, thousands of previously developed publicly available Brassica SSR markers were also found to generate one in silico PCR product in the assembled genomic sequences of these Brassica species (http://oilcrops.info/SSRdb). More importantly, most (88.4%) of the tested SSR markers that generated one in silico PCR product in the assembled genomic sequences of B. napus also amplified a single clear fragment in the six representative B. napus cultivars/inbred lines (Supplementary Table S8). These results suggest that SSR markers that generate one in silico PCR product are putative single-locus markers and could be especially useful.
Interestingly, the proportion (27.9%) of the newly developed genome-wide Brassica SSR markers that generated one in silico PCR product in the assembled genomic sequences of B. napus (Table 3) was close to the corresponding proportion (33.8%) of the 9858 previously developed SSR markers from the GSSs/unique transcripts of B. napus, the BACs of B. rapa and the GSSs of B. oleracea, 33 which amplified a single clear fragment in six B. napus inbred lines.
Also known as 'functional' markers, 52 genic SSR markers are developed from genes and have a high transferability across related species. 52 Although several studies have been conducted to develop genic SSR markers from the ESTs/unique transcripts of B. rapa, 29,31,32 B. oleracea 31 and B. napus, 30-32 the total number (<5000) of publicly available genic SSR markers has remained limited in Brassica (http://oilcrops.info/SSRdb). In the current study, a large number of newly developed genome-wide SSR markers (32 648, 33 228 and 58 952 for B. rapa, B. oleracea and B. napus, respectively) were located in the annotated genes (Supplementary Table S7) and thus belonged to the genic SSR markers. Of these, only a small proportion (7.2, 6.1 and 6.7% for B. rapa, B. oleracea and B. napus, respectively) overlapped with the previously developed publicly available Brassica SSR markers (http://oilcrops.info/SSRdb). This finding suggests that most of these newly developed Brassica genic SSR markers could represent new 'functional' markers, which should be highly useful in evolutionary studies, 29 comparative mapping, 32 candidate gene association mapping 53 and molecular breeding.
Given the high transferability of SSR markers across the cultivated and wild Brassica species, 27 15,16,22,25 a considerable proportion of the newly developed genome-wide Brassica SSR markers (especially the genic SSR markers) should also be useful for species that belong to other genera and tribes within the Brassicaceae family.
More importantly, we also constructed an integrative SSR marker database for Brassica (http://oilcrops.info/SSRdb), which not only provides useful information on the newly developed genome-wide SSR markers from the sequenced Brassica species (currently only B. rapa, B. oleracea and B. napus) but is also integrated with the previously developed, publicly available Brassica SSR markers and the annotated genome components (mainly genes and TEs). To the best of our knowledge, this is the first comprehensive SSR marker database for Brassica, and it should be a significant contribution to the Brassica research community.

Implications for SSR marker development
The numbers of clear fragment(s) amplified in the six representative B. napus cultivars/inbred lines for the 3974 tested SSR markers were usually equal or close to the numbers of in silico PCR product(s) in the assembled genomic sequences of B. napus (Supplementary Table S8). This finding suggests that the number of products amplified by SSR markers can be estimated relatively accurately by in silico PCR, in accordance with previous reports in plants such as rice 56 and Brachypodium. 38 Therefore, the target microsatellite should be subjected to BLAST/in silico PCR analysis to estimate its copy number before SSR marker development, especially in polyploid species. In addition, most (88.4%) of the tested SSR markers that generated one in silico PCR product were also confirmed by practical PCR analysis (Supplementary Table S8). Therefore, single/low-copy microsatellites identified in silico should be preferred for marker development.
Replication slippage and recombination are currently considered the two major mechanisms responsible for microsatellite expansion or contraction. 2,3,5,57 Because only small numbers of SSR markers were tested for specific motif length(s), the relationship between the SSR marker polymorphism level and the motif length was usually inconsistent in both the current (Table 4) and previous 20,24,26,30,37 studies. However, the general trend was similar: the SSR marker polymorphism level tended to decrease as the motif length increased. This relationship is understandable because shorter motifs allow more possible replication slippage events per unit length of DNA. 58,59 In addition, the SSR marker polymorphism level was positively correlated with both the motif repeat number and the repeat length in both the current (r = 0.74 and 0.86, respectively) and previous 37,47,48,60 studies. More importantly, almost all (97.6%) of the tested compound SSR markers were polymorphic. These relationships are also understandable because more motifs, a larger motif repeat number and a longer repeat length give more opportunity for replication slippage. 2 Therefore, microsatellites with a shorter motif length, a larger motif repeat number, a longer repeat length and, especially, a compound repeat should be preferred for marker development.
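The selection criteria above can be illustrated with a short Python sketch that detects perfect microsatellites and orders them by repeat length and motif length. It is only a sketch under stated assumptions: the minimum repeat thresholds, the function names and the sort order are hypothetical choices made for illustration and are not the parameters used in this study, and compound repeats are not handled.

```python
import re

# Hypothetical minimum repeat numbers per motif length (1-6 bp);
# these thresholds are assumptions, not the values used in the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AGAG')."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d)
                   for d in range(1, k))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Find perfect SSRs in an upper-case DNA string.

    Returns a list of dicts with start position, motif, motif length,
    motif repeat number and total repeat length.
    """
    hits = []
    for k, n_min in min_repeats.items():
        # A k-bp unit followed by at least (n_min - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n_min - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_primitive(motif):
                continue  # e.g. skip 'AGAG' repeats already reported as 'AG'
            length = m.end() - m.start()
            hits.append({"start": m.start(), "motif": motif,
                         "motif_len": k, "repeats": length // k,
                         "length": length})
    return hits


def rank_ssrs(hits):
    """Order candidates: longer repeat first, then shorter motif (assumed priority)."""
    return sorted(hits, key=lambda h: (-h["length"], h["motif_len"]))


# Toy sequence containing an (AG)12 and an (ATC)5 microsatellite.
example = "TTGCA" + "AG" * 12 + "CCTGA" + "ATC" * 5 + "GGT"
ranked = rank_ssrs(find_ssrs(example))
```

In this toy sequence the dinucleotide repeat is ranked ahead of the trinucleotide repeat, in line with the preference for shorter motifs and longer repeats described above.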
It should be noted that a considerable proportion (Supplementary Table S7) of the newly developed genome-wide SSR markers from the sequenced Brassica species were located within TEs, the so-called 'mobile DNA sequences', 61 and should thus be unstable. In addition, the SSR markers associated with TEs (especially retrotransposons) mostly generated tens to thousands of in silico PCR products (http://oilcrops.info/SSRdb). Therefore, caution is warranted when developing markers from microsatellites that are associated with TEs (especially retrotransposons).
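A simple screening step along these lines could be sketched in Python as follows. The record layout, marker identifiers, coordinates and the threshold of one in silico PCR product are all hypothetical and serve only to illustrate the idea of flagging markers that overlap annotated TEs or generate multiple products.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def screen_markers(markers, te_intervals, max_products=1):
    """Split markers into kept and flagged lists.

    markers      : list of dicts with keys id, chrom, start, end, n_products
                   (an assumed record layout)
    te_intervals : list of (chrom, start, end) tuples for annotated TEs
    max_products : hypothetical limit on in silico PCR products per marker
    """
    kept, flagged = [], []
    for m in markers:
        in_te = any(chrom == m["chrom"] and
                    overlaps(m["start"], m["end"], te_start, te_end)
                    for chrom, te_start, te_end in te_intervals)
        if in_te or m["n_products"] > max_products:
            flagged.append(m["id"])
        else:
            kept.append(m["id"])
    return kept, flagged


# Hypothetical example records (not taken from the database).
example_markers = [
    {"id": "m1", "chrom": "A01", "start": 100, "end": 250, "n_products": 1},
    {"id": "m2", "chrom": "A01", "start": 900, "end": 1050, "n_products": 1},
    {"id": "m3", "chrom": "C02", "start": 100, "end": 250, "n_products": 37},
]
example_tes = [("A01", 1000, 5000), ("C03", 0, 500)]
kept_ids, flagged_ids = screen_markers(example_markers, example_tes)
```

Here `m2` is flagged because it overlaps a TE on the same chromosome, and `m3` because it yields many in silico products, while `m1` is kept.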