Efficiency of microsatellite isolation from orchids via next generation sequencing

Microsatellites, or simple sequence repeats (SSRs), are highly polymorphic, co-dominant genetic markers commonly used for population genetic analyses, although de novo development of species-specific microsatellites is cost- and time-intensive. Orchidaceae is one of the most species-rich families of angiosperms, with more than 30,000 species estimated. Despite this high species diversity, microsatellites are available for only a few species, and all were developed using Sanger sequencing methods. For the first time in orchids, we used 454 GS-FLX sequencing to isolate microsatellites in two species (Cypripedium kentuckiense and Pogonia ophioglossoides), and we report preliminary results of the study. From the 1/16 plate that was subjected to sequencing, 32,665 reads were generated, of which 15,473 fragments contained at least one SSR. We selected 20,697 SSRs representing di-, tri-, and tetra-nucleotides. While 3,674 microsatellites had flanking regions on both sides, useable primer pairs could be designed for 255 SSRs. The mean numbers of reads, SSRs, and SSR-containing reads useful for primer design estimated for 15 other orchid species using the Sanger sequencing method were 166, 78, and 31, respectively. The results demonstrate that the efficiency of microsatellite isolation in orchids is substantially higher with the 454 GS-FLX sequencing technique than with Sanger sequencing methods.


INTRODUCTION
Microsatellites, or simple sequence repeats (SSRs), are regions of DNA that contain short tandem repeats (STRs) of 1 to 6 nucleotides. Microsatellites are known to occur in the genomes of living organisms including bacteria, plants, and animals [1]. Although microsatellites are less abundant in coding regions than in non-coding regions, they are randomly distributed throughout the genome [2]. In general, microsatellites are considered neutral markers, but evidence exists for their role in gene expression [3,4]. Variation in SSRs is thought to arise primarily when strand slippage, accelerated by the presence of repeats, leads to stepwise mutation during replication, resulting in higher mutation rates in SSR regions than in other regions of the DNA [2,5].
Over the past two decades, microsatellites have become one of the most popular markers for population genetics [6-10]. High intraspecific polymorphism, codominant inheritance, and ease of use are some of their advantages [11], although the high cost and time investment limit their widespread use, especially for species whose genomes are not yet sequenced well enough to allow rapid searches for SSRs within existing genomic sequence databases. In such cases, de novo isolation of SSRs becomes a necessary first step. While several methods are available for microsatellite isolation [12], the most commonly used conventional method involves screening for microsatellites in genomic libraries after hybridization with artificial probes, cloning of the hybridized DNA fragments, and sequencing of the colony DNA using the Sanger method [13-17]. Despite the high cost and time investment, this method yields relatively few DNA fragments (100 to a few hundred at most), and the percentage of SSR-containing reads is typically quite low (0.04% to 12%; reviewed in Zane et al. [12]).
Recent advances in next generation sequencing have expedited the identification of SSRs in non-model organisms [18,19]. Various next generation sequencing techniques are available, such as 454 GS-FLX (Roche Applied Science, Penzberg, Germany), SOLEXA (Illumina, San Diego, CA), SOLiD (Applied Biosystems, Foster City, CA), and tSMS (Helicos Biosciences, Cambridge, MA) [18]. Among these, 454 GS-FLX sequencing is the most useful for microsatellite isolation because it generates longer reads than the other techniques [20,21]. So far, only a few studies have reported isolation of microsatellites using next generation sequencing in plants (reviewed in Zalapa et al. [22]).
In this study, we report preliminary data on microsatellite isolation from two orchid species, Cypripedium kentuckiense and Pogonia ophioglossoides, using microsatellite enrichment followed by 454 GS-FLX sequencing. DNA of the two species was pooled before enrichment and sequencing to reduce the per-species cost of microsatellite isolation. Both species are insect-pollinated and each is presumed to be outcrossing. Microsatellites are not available for either species. We also compared the efficiency of microsatellite isolation between this study and other orchid studies that used the Sanger sequencing method for microsatellite isolation.

MATERIALS AND METHODS
Total DNA was extracted from approximately 100 mg of leaf tissue from single plants of Cypripedium kentuckiense and Pogonia ophioglossoides using the Qiagen DNeasy Plant Mini Kit (Qiagen, Valencia, CA), following the manufacturer's protocol. Quality and quantity of the DNA were determined with a NanoDrop ND1000 (Thermo Scientific, Woburn, MA). Five micrograms of contaminant-free DNA (100 ng/µl concentration and 260/280 ratio of ~1.8) from each species were combined in a single tube and digested with the restriction enzyme RsaI (New England Biolabs, Beverly, MA). After digestion, fragments were ligated to double-stranded linkers, denatured, and hybridized to three artificial biotinylated microsatellite oligonucleotide mixes (mix one = (AG)12, (TG)12, (AAC)6, (AAG)8, (AAT)12, (ACT)12, (ATC)8; mix two = (AAAC)6, (AAAG)6, (AATC)6, (AATG)6, (ACAG)6, (ACCT)6, (ACTC)6, (ACTG)6; mix three = (AAAT)8, (AACT)8, (AAGT)8, (ACAT)8, (AGAT)8). Hybridized fragments were then captured on magnetic streptavidin beads (Dynal, Lake Success, NY). After the unhybridized fragments were washed out, the remaining DNA was eluted from the beads and amplified with the primer SimpleX-6. Sequencing of 1/16th of a plate of the SSR-enriched libraries was carried out on a 454 GS-FLX high-throughput sequencer using Titanium chemistry at the Savannah River Ecology Lab (SREL), University of Georgia.
Sequences were subjected to a 3'-end quality trim if any of the terminal 25 bases of the read had a quality score (phred score) below 20 or if the read contained an ambiguous base. Reads shorter than 150 bp were discarded. Contig sequences were assembled in CAP3 [38] using the criteria of 98% sequence identity and a minimum overlap of 75 bp. All singlet and contig DNA reads were screened in Msatcommander version 0.8.1 [39] for di-nucleotide SSRs with ≥8 repeats and for tri- and tetra-nucleotide SSRs with ≥7 repeats each. Primer pairs were then designed in PRIMER3 [40], using the default criteria of the program, for the SSRs that had suitable flanking nucleotide sequences of appropriate length on either side.
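The screening criteria above (≥8 repeats for di-nucleotides, ≥7 for tri- and tetra-nucleotides, plus flanking sequence on both sides) can be illustrated with a minimal Python sketch. This is not the Msatcommander algorithm; it is a simple regular-expression scan for perfect repeats, and the function names and the 20 bp minimum flank length are assumptions introduced here for illustration only.

```python
import re

# Minimum repeat counts from the Methods: di >= 8, tri and tetra >= 7.
MIN_REPEATS = {2: 8, 3: 7, 4: 7}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ACAC')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq, min_flank=20):
    """Scan one read for perfect di-, tri-, and tetra-nucleotide SSRs.

    min_flank is a hypothetical threshold (not from the paper) for the
    sequence required on each side of the repeat before primer design.
    """
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_primitive(motif):
                continue  # e.g. (ACAC)7 is reported once, as (AC)14
            hits.append({
                "start": m.start(),
                "end": m.end(),
                "motif": motif,
                "repeats": (m.end() - m.start()) // unit,
                "flanked": m.start() >= min_flank
                and len(seq) - m.end() >= min_flank,
            })
    return sorted(hits, key=lambda h: h["start"])
```

For a read that carries an (AG)10 repeat with 20 bp on each side and an (AAT)7 repeat close to the 3' end, the sketch reports both SSRs but flags only the first as flanked, which mirrors the gap between SSRs detected and SSRs usable for primer design.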
To compare the efficiency of microsatellite isolation between this study and other orchid studies, we selected 15 studies that used Sanger sequencing methods for microsatellite isolation and reported at least two of three basic statistics: the number of reads, the total number of microsatellites, and the total number of SSR-containing reads that were useful for primer design.

RESULTS
From the 1/16th plate that was subjected to sequencing, 32,665 unique DNA fragments (reads), including 2,982 contigs, were generated, yielding 9.24 Mb of sequence data. Mean G/C content and read length among the 32,665 reads were 39% and 283 bp, respectively. Approximately 66% of all reads were under 300 bp, and only about 1% of the reads were longer than 500 bp. The longest read length across all DNA fragments was 702 bp (in two individual reads). Of the 32,665 DNA reads, 15,623 contained at least one SSR meeting the minimum repeat threshold (≥8 repeats for di-nucleotides, ≥7 for tri- and tetra-nucleotides). We observed multiple SSRs in 5,305 individual DNA reads (33.95% of 15,623 reads). Overall, 20,697 SSRs meeting these thresholds were identified in the combined, partial genomic library of both species. Di-nucleotide SSRs were the most abundant, followed by tetra-nucleotide (28.62%) and then tri-nucleotide SSRs (12.58%) (Table 1). Approximately 18% of the SSRs (3,674 of 20,697) had useable flanking regions on both sides and hence were selected as candidate SSRs for primer development. Of these, 255 SSRs and their flanking regions (6.94% of 3,674 SSRs) met the default primer design criteria in PRIMER3.
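The proportions reported above can be recomputed from the raw counts. The short Python sketch below does this; the constant and function names are introduced here for illustration, and the total sequence yield is approximated as the number of reads multiplied by the mean read length, which is an assumption rather than the authors' stated calculation.

```python
# Counts as reported in the Results section.
READS_TOTAL = 32_665
READS_WITH_SSR = 15_623
READS_MULTI_SSR = 5_305
SSRS_TOTAL = 20_697
SSRS_FLANKED = 3_674
SSRS_WITH_PRIMERS = 255
MEAN_READ_LENGTH_BP = 283


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


def summarize():
    """Recompute the proportions along the read -> SSR -> primer funnel."""
    return {
        "reads_with_ssr_pct": pct(READS_WITH_SSR, READS_TOTAL),
        "multi_ssr_reads_pct": pct(READS_MULTI_SSR, READS_WITH_SSR),
        "flanked_ssrs_pct": pct(SSRS_FLANKED, SSRS_TOTAL),
        "primer_ssrs_pct": pct(SSRS_WITH_PRIMERS, SSRS_FLANKED),
        # Approximation: reads x mean read length, in megabases.
        "approx_total_mb": READS_TOTAL * MEAN_READ_LENGTH_BP / 1e6,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the recomputed values agree with the reported ones to within rounding: roughly 48% of reads carry an SSR, about 34% of those carry more than one, about 18% of SSRs are flanked on both sides, and about 6.9% of the flanked SSRs yield primers.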

DISCUSSION
The frequency of microsatellites in the combined genomes of C. kentuckiense and P. ophioglossoides can be considered relatively high because approximately 50% of the DNA reads contained at least one microsatellite. The number of reads generated in this study (32,665 reads) was about 200 times higher than the mean number of reads (166 reads) obtained for the 15 orchid species using the Sanger sequencing method (see Table 2 for data and references).
The number of SSRs found in our study (20,697 SSRs) was more than 260-fold higher than the average number of SSRs (78 SSRs) estimated for those 15 orchid species. The maximum and minimum numbers of SSRs found among the 15 studies were 270 and 12, respectively. The number of SSR-containing reads suitable for primer design in those orchid studies ranged from 7 to 63, with a mean of 31, which is about eight times lower than the 255 obtained in our study. This clearly indicates that the efficiency of microsatellite isolation in orchids using next generation sequencing is several-fold higher than with the Sanger method.
Table 2 (fragment). Mean across the 15 Sanger-based orchid studies: 166 reads, 78 SSRs, and 31 SSR-containing reads useful for primer design. Footnotes: 2, enrichment with di-nucleotide SSRs; 3, enrichment with tri-nucleotide SSRs; 4, enrichment with tetra-nucleotide SSRs.
When the results of this study were compared with the results of microsatellite isolation in other plant species, the number of SSRs found in this study (20,697 SSRs) was more than 12% higher than the mean number of SSRs (18,437 SSRs) estimated for 20 studies that used 454 GS-FLX sequencing [22]. Similarly, the number of microsatellite-containing reads useful for primer design in our study was more than 40% higher than in those 20 studies. The number of SSRs found in the combined genomes of C. kentuckiense and P. ophioglossoides was more than 84-fold higher than the mean number of SSRs (245 SSRs) reported for 71 different plant species using the Sanger sequencing method [22]. The mean number of SSR-containing reads (50 reads) in those 71 studies was about five times lower than in our study. Collectively, these observations further confirm that the combined genomes of C. kentuckiense and P. ophioglossoides have a relatively high number of microsatellites, and that the efficiency of microsatellite isolation using next generation sequencing is substantially higher than with Sanger sequencing.
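The comparisons in this section are simple ratios of the counts from this study to the published means. The following Python sketch recomputes them from the figures quoted in the text; the dictionary keys and function names are illustrative labels introduced here. For the orchid primer-design comparison, the ratio is shown against both the mean (31) and the maximum (63) of the 15 Sanger-based studies.

```python
# Counts from this study (Results section).
OUR_READS = 32_665
OUR_SSRS = 20_697
OUR_PRIMER_READY = 255


def fold(ours, theirs):
    """How many times larger `ours` is than `theirs`."""
    return ours / theirs


def comparisons():
    """Ratios of this study's counts to the published reference values."""
    return {
        # 15 orchid studies using Sanger sequencing (Table 2).
        "reads_vs_orchid_sanger_mean": fold(OUR_READS, 166),
        "ssrs_vs_orchid_sanger_mean": fold(OUR_SSRS, 78),
        "primer_ready_vs_orchid_sanger_mean": fold(OUR_PRIMER_READY, 31),
        "primer_ready_vs_orchid_sanger_max": fold(OUR_PRIMER_READY, 63),
        # 20 plant studies using 454 sequencing [22].
        "ssrs_vs_plant_454_mean": fold(OUR_SSRS, 18_437),
        # 71 plant species using Sanger sequencing [22].
        "ssrs_vs_plant_sanger_mean": fold(OUR_SSRS, 245),
        "primer_ready_vs_plant_sanger_mean": fold(OUR_PRIMER_READY, 50),
    }


if __name__ == "__main__":
    for name, ratio in comparisons().items():
        print(f"{name}: {ratio:.2f}x")
```

The recomputed ratios are roughly 197-fold for reads and 265-fold for SSRs against the orchid Sanger means, about 1.12 (that is, about 12% higher) against the 454-based plant studies, and about 84-fold for SSRs and about 5-fold for primer-ready reads against the Sanger-based plant studies.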
Although fewer oligonucleotide probes were used for enrichment of di-nucleotide SSRs (2 probes) than for tri-nucleotides (5 probes) and tetra-nucleotides (13 probes), the number of di-nucleotide SSRs in C. kentuckiense and P. ophioglossoides was higher than the numbers of tri- and tetra-nucleotide SSRs (Table 1). These results indicate that the genomes of these species are rich in di-nucleotide repeats. Other studies have also reported a higher percentage of di-nucleotide repeats than of tri- and/or tetra-nucleotide repeats in plants (e.g., [41] (55% di-nucleotides and 45% tri-nucleotides); [42] (46% di-nucleotides, 34% tri-nucleotides, 8% tetra-nucleotides, and 12% others); [20] (89% di-nucleotides, 9% tri-nucleotides, and 2% tetra-nucleotides)). Although we did not use TA as a probe in the microsatellite enrichment procedure, we still detected 353 SSRs composed of this motif. This is not surprising, as TA is the most common di-nucleotide repeat motif found in plants [43].
Generally, as the length of an SSR increases, polymorphism also increases [44-46]. Although more than 21% (4,436 of 20,697) of the SSRs had more than 50 repeats, primers could not be designed for any of them because, in those reads, either the repeats were prematurely terminated or suitable flanking regions for primer design were lacking. This outcome reflects a limitation of 454 sequencing, whereby a majority of the SSR-containing DNA fragments are shorter than 400 bp [21]. The percentage of SSR-containing reads that are useful for primer design is typically lower with next generation sequencing than with Sanger sequencing because the latter can generate fragments up to 1,000 bp long. However, the shorter read length in high-throughput sequencing is easily outweighed by the larger number of SSR-containing sequences.

Figure 1. Number of di-, tri-, and tetra-nucleotide SSRs of different repeat lengths obtained from a combined enriched genomic library of two orchid species, Cypripedium kentuckiense and Pogonia ophioglossoides. Only one tetra-nucleotide SSR was between 151 and 200 repeats (the color is not visible in the pie chart because of this low number). The minimum number of repeats was 8 for di-nucleotide SSR motifs and 7 for tri-nucleotides and tetra-nucleotides.