High-Density SNP Genotyping of Tomato (Solanum lycopersicum L.) Reveals Patterns of Genetic Variation Due to Breeding

The effects of selection on genome variation were investigated and visualized in tomato using a high-density single nucleotide polymorphism (SNP) array. 7,720 SNPs were genotyped on a collection of 426 tomato accessions (410 inbreds and 16 hybrids) and over 97% of the markers were polymorphic in the entire collection. Principal component analysis (PCA) and pairwise estimates of F st supported that the inbred accessions represented seven sub-populations including processing, large-fruited fresh market, large-fruited vintage, cultivated cherry, landrace, wild cherry, and S. pimpinellifolium. Further divisions were found within both the contemporary processing and fresh market sub-populations. These sub-populations showed higher levels of genetic diversity relative to the vintage sub-population. The array provided a large number of polymorphic SNP markers across each sub-population, ranging from 3,159 in the vintage accessions to 6,234 in the cultivated cherry accessions. Visualization of minor allele frequency revealed regions of the genome that distinguished three representative sub-populations of cultivated tomato (processing, fresh market, and vintage), particularly on chromosomes 2, 4, 5, 6, and 11. The PCA loadings and F st outlier analysis between these three sub-populations identified a large number of candidate loci under positive selection on chromosomes 4, 5, and 11. The extent of linkage disequilibrium (LD) was examined within each chromosome for these sub-populations. LD decay varied between chromosomes and sub-populations, with large differences reflective of breeding history. For example, on chromosome 11, decay occurred over 0.8 cM for processing accessions and over 19.7 cM for fresh market accessions. The observed SNP variation and LD decay suggest that different patterns of genetic variation in cultivated tomato are due to introgression from wild species and selection for market specialization.


Introduction
Selection of favorable alleles through domestication and breeding has led to dramatic changes in seed and fruit attributes, plant habit, and productivity. For tomato (Solanum lycopersicum L), breeding has involved the competing forces of narrowed genetic variation due to best by best crosses followed by selection [1,2], and the expansion of genetic variation due to the introgression of genes for biotic stress resistance from wild species [3][4][5]. The long history of crossing to wild relatives has broadened the genetic diversity in contemporary germplasm relative to vintage and landrace germplasm [6][7][8][9]. In addition, breeding for distinct market classes and production systems has led to genetic differentiation in contemporary germplasm. For example, processing and field-grown fresh market tomatoes are now distinct sub-populations [8,9].
Novel sequencing technologies have uncovered sufficient variation to investigate the effect of human selection across an entire plant genome [10]. Single nucleotide polymorphisms (SNPs) are a predominant form of sequence variation among individuals representing as much as 90% of the genetic variation in any species [11]. SNPs are distributed throughout a genome, they provide stable markers for genetic analysis, and their detection is amenable to automation. It is increasingly cost and time efficient to genotype large populations in a high-throughput manner. Because of these advantages, SNPs have become a marker system of choice for genetic analysis in plant species. In tomato, SNPs have been discovered using several methods: in silico mining of expressed sequence tag (EST) databases [12][13][14], intron and amplicon sequencing of conserved orthologous set (COS) genes [15,16], and hybridization to oligonucleotide arrays [8]. Recently, 62,576 non-redundant SNPs were identified based on transcritpome sequences for six tomato accessions [10].
High-throughput SNP discovery has been accompanied by the development of array-based genotyping platforms that permit rapid scoring of several thousand markers in parallel [17]. Such SNP genotyping methods have facilitated high-density genetic map construction and genome-wide association analysis. For example, the use of a maize array with 49,585 SNPs produced two genetic linkage maps with 20,913 and 14,524 markers, respectively [18]. In tomato, we used the ''SolCAP'' array with 7,720 SNPs to generate high-density genetic maps using two F 2 interspecific populations: 3,503 markers in the S. lycopersicum LA0925 x S. pennellii LA0714 (EXPEN 2000) population and 4,491 markers in the Moneymaker x S. pimpinellifolium LA0121 (EXPIM 2012) population [19]. An array with 44,100 SNPs was used to genotype 413 diverse accessions of rice and analyzed for association with 34 quantitative traits [20]. SNP data from such arrays can also be a resource for germplasm management in breeding programs and has a role in genomic selection strategies for crop improvement [18,21].
The analysis of variation in tomato populations has often focused on differences between cultivated and wild species. In contrast, relatively little is known about which genes or genomic regions distinguish market classes within cultivated genepools. In the early 1900s, significant effort was placed on the evaluation of wild germplasm as a source of new resistance genes [3,4]. Wide crosses were used to introduce new sources of resistance, with several varieties in commerce carrying introgressed genes by the late 1930s and early 1940s. These efforts often had a regional focus, and it remains unclear how introgressed regions are dispersed within breeding programs [22]. In addition, there was a concerted effort to breed for distinct plant habits and fruit characteristics for the processing and fresh market tomato industries. These efforts were initiated in the 1940s and resulted in the first cultivars suitable for machine harvest by the early 1960s [23]. At least three sources of S. pimpinellifolium were incorporated into processing breeding in order to develop compact plants amenable to ''once over'' (i.e. destructive) harvest [23].
In order to investigate genetic variation on the tomato genome due to contemporary breeding, we subjected a collection of 426 accessions to high-throughput genotyping using the SolCAP SNP array. The tomato accessions represented different market classes of cultivated tomato and closely related wild species. Knowledge of the genetic and physical organization of the SNPs [19] allowed us to conduct population level analysis based on the germplasm panel. SNP genotypes from the array were analyzed to identify sub-populations, and to assess genetic differentiation and diversity between and within sub-populations. We also investigated patterns of linkage disequilibrium (LD) within each chromosome in three sub-populations of large-fruited cultivated accessions (processing, fresh market, and vintage). The population level analysis revealed regions of the genome with high genetic variation between subpopulations, suggesting that historical breeding practices have led to different patterns of genetic variation in cultivated tomato germplasm.

Array-based SNP Genotyping
We used an array consisting of 7,720 SNPs distributed throughout the genome [10,19] to genotype 426 accessions (410 inbreds and 16 hybrids; referred to as the SolCAP germplasm) (Table S1 and Table S2). The hybrids were chosen to maximize heterozygosity and develop a cluster file for the GenomeStudio software (Illumina Inc., San Diego, CA, USA). In order to establish an accurate and automatic genotype calling procedure for the GenomeStudio software that is usable across the entire genepool of cultivated tomato, clustering based on the SolCAP germplasm was cross-validated with a cluster file based on 92 hybrids developed independently by TraitGenetics [19]. The resulting high-quality cluster file is available through the eXtension website [24].
In order to assess the quality of SNP calls, we duplicated 34 accessions that were randomly selected using independent DNA preparations and genotyping facilities. The average proportion of consistent calls across all accessions was 98.7%. Cluster analysis was performed, and for all 34 accessions, the nearest neighbor was the duplicate accession. The proportion of consistent calls increased to 99.3% by excluding three accessions, OH981136 (processing, 91.8%), Purple Clabash (vintage, 90.0%), and Principle Borghese (vintage, 94.1%). One of the duplicate samples for OH981136 showed higher rates for no call (8.8%) and heterozygote call (8.4%) relative to the other sample (0.7% for no call and 0.2% for heterozygote call). The data quality between two samples of Purple Clabash was also different from each other (0.7% vs. 5.7% for no call). Heterozygote call rates were similar for the two samples. The duplicate samples of Principe Borghese differed by 5.9%, but showed similar rates for no call (0.8% vs. 1.0%) and heterozygote call (0.2% vs. 0.2%). These results suggest that overall reproducibility is high, and that differential calls may result from DNA quality that affects the percentage of ''no call'', residual heterozygosity, and variation within accessions.
The data were analyzed to determine the polymorphism rate based on the inbred accessions. A total of 7,500 SNPs (97.2%) were polymorphic and 61 SNPs (0.8%) were monomorphic (Table 1). Among the 7,500 polymorphic SNPs, there were 7,375 SNPs with ,10% missing data, 84 SNPs with 10-20% missing data, and 41 SNPs with .20% missing data. The SNPs with a high frequency of missing data were randomly distributed across accessions. Polymorphism of 123 SNPs could not be unequivocally determined because of a large amount ($10.0%) of missing data (34 SNPs) or because polymorphism detection was due to a single homozygous allele and only heterozygote calls for the alternate allele (89 SNPs). It is possible that the heterozygotes actually represent duplicated genes, or the alternate allele was simply not present as a homozygote. These 89 SNPs were also randomly distributed across the genome. In addition, 36 SNPs failed to produce a genotype call because of poor signals ( Table 1).
The proportion of heterozygous SNPs was assessed within cultivated tomato germplasm and S. pimpinellifolium accessions. The heterozygosity with all scorable markers (7,684 SNPs) was 0.02 for processing, 0.01 for large-fruited fresh market, 0.01 for largefruited vintage, and 0.04 for S. pimpinellifolium accessions. Using only markers that were polymorphic within each sub-population (range 3,700-6,022 polymorphic markers), the processing, fresh market, and vintage accessions showed heterozygosity levels of 0.02, 0.01, and 0.02, respectively. The heterozygosity for S. pimpinellifolium was 0.05.

Genetic Differentiation between Sub-populations
The 410 inbred accessions were first divided into five subpopulations based on a priori knowledge of germplasm pedigree, age, market class, and origin: 141 processing, 122 fresh market, 88 vintage, 43 wild cherry, and 16 S. pimpinellifolium. Principal component analysis (PCA) and linkage disequilibrium (LD) analysis supported separation of the fresh market sub-population into a group of 110 large-fruited accessions and 12 cherry accessions (cultivated cherry). Similarly, the vintage sub-population was divided into 61 largefruited accessions, 15 cherry accessions which clustered with the cultivated cherry sub-population, and 12 landrace accessions ( Figure 1 and Figure S1). This iterative analysis led to the definition of seven sub-populations for further analysis: 141 processing, 110 large-fruited fresh market (hereafter referred to as fresh market), 61 large-fruited vintage (hereafter referred to as vintage), 27 cultivated cherry, 12 landrace, 43 wild cherry, and 16 S. pimpinellifolium.
In the PCA analysis, the first two principal components (PC1 and PC2) explained 22% and 16% of the total variance, respectively. PC1 separated the S. pimpinellifolium accessions from all other sub-populations (P,0.001), while PC2 separated processing, fresh market, and vintage germplasm (P,0.01; Figure 1A). A subsequent PCA was conducted using only the processing, fresh market, and vintage sub-populations ( Figure 1B). The second, more focused analysis validated the significant separation of the sub-populations. In addition, PCA based on only the cultivated accessions suggested further divisions within both processing and fresh market sub-populations ( Figure 1B). Although most of the accessions were clustered into a corresponding sub-population, there were a few accessions that appear to be transitional and/or misclassified accessions. For example, Peto 460 (PI 600920; coordinates 222.1, 24.8 in Figure 1B) and Heinz 1370 (PI 341134; coordinates 4.6, 23.4 in Figure 1B) were grouped with the vintage accessions due to their date of release [25], but were developed as processing tomatoes. Similarly, Rio Grande (coordinates 212.5, 9.1 in Figure 1B) was classified as a fresh market accession, but clustered with the processing accession. This cultivar is an early ''Roma'' type tomato and was originally developed as a processing tomato. Clustering supports the processing origin of these three accessions ( Figure 1B).
Pairwise analysis of F st was used to test the significance of genetic differentiation between sub-populations ( Table 2). In order to test effects of marker choice for this analysis, we estimated pairwise F st using two independent subsets of SNP data derived from the array. Analysis was performed on all 410 inbred accessions for each marker set. In addition a re-sampling analysis was performed by randomly selecting n = 40 for processing, fresh market, and vintage sub-populations. Repetition of this analysis suggests that our conclusions are supported with balanced populations and across multiple marker subsets. Cultivated germplasm, including processing, fresh market, and vintage accessions were all significantly diverged (F st = 0.29 to 0.41, P,0.005) ( Table 2). The cultivated cherry, wild cherry, and landrace accessions were not differentiated from each other but were distinct from all other sub-populations (F st = 0.13 to 0.64, P,0.005). The S. pimpinellifolium accessions were separated from all other sub-populations with estimates of pairwise F st ranging between 0.57-0.81 (P,0.005; Table 2). The F st analysis verified genetic differentiation between sub-populations defined by PCA.
Further divisions within both processing and fresh market subpopulations detected in PCA were also tested by estimating  pairwise F st . The processing accessions can be divided into two groups consisting of 82 and 52 accessions, respectively (Table S3). Seven accessions were not grouped because they were outliers, these tended to be CULBPT accessions which derive from recent crosses to fresh-market material. The two main processing groups were significantly differentiated from each other (F st = 0.27 P,0.005) and distinct from the other germplasm (Table 2). Two groups of the fresh market tomatoes including 61 and 49 accessions each showed significant differentiation between them and from the other sub-populations (F st = 0.10 to 0.76, P,0.005; Table 2).

Levels of Polymorphism within Sub-populations
The level of genetic diversity within each sub-population was measured using allelic richness (A), expected heterozygosity (He), and polymorphic information content (PIC). Within cultivated germplasm, the highest estimates of A, He, and PIC were found in the cultivated cherry accessions (Table 3). Among the remaining cultivated sub-populations, the fresh market accessions showed higher A, while the landraces had the highest He and PIC values. The sub-population of vintage accessions contained the lowest variation for all three descriptors ( Table 3). The further division of accessions within both processing and fresh market sub-populations showed higher estimates of these descriptive statistics relative to the vintage sub-population ( Table 3).
The highest number of polymorphic markers (6,234 SNPs) was identified in the cultivated cherry sub-population. There were 5,909 and 5,650 polymorphic markers in the wild cherry and S. pimpinellifolium sub-populations (Table 4). For the contemporary accessions, 4,648 and 6,022 markers were polymorphic in the processing and fresh market sub-populations, respectively (Table 4). Fewer polymorphic markers were found in the vintage (3,700 SNPs) and landrace (3,159 SNPs) sub-populations. Distribution patterns of polymorphic SNP markers across 12 chromosomes were different between sub-populations (Table 4). For example, chromosome 8 showed proportionally lower SNP numbers for the processing and cultivated cherry accessions, as did chromosome 12 for the large-fruited fresh market and wild cherry tomatoes. Chromosome 10 had lower polymorphism rates for the vintage and landrace groups compared to contemporary processing and large-fruited fresh market sub-populations.
The number of polymorphic markers in sub-populations increases with the size of the population. This fact led to concerns that our estimates of diversity might be skewed by population size differences despite the fact that A and He are adjusted for population size. To address the concern, we performed rarefaction analysis to estimate polymorphic marker accumulation curves [26]. For all seven sub-populations, the curves reached an  The further divisions within both processing and fresh market sub-populations are indicated with a number followed by each sub-population name (e.g. Proc 1 and Proc 2). Seven processing accessions were excluded for this grouping because they were outliners. 2 Allelic richness [94,95]. 3 Expected heterozygosity corrected for sample size [96]. 4 Polymporphic Information Content [97]. doi:10.1371/journal.pone.0045520.t003 asymptote within the population size sampled ( Figure 2). The S. pimpinellifolium, wild cherry, and cultivated cherry sub-populations were the most diverse groups based on the curves. Within the large-fruited cultivated sub-populations, fresh-market accessions were more diverse than vintage accessions. The accumulation curves for the processing and vintage accessions merged at n = 60, despite the fact that they were well separated between n = 10 to 25. This result suggests that differences in number of polymorphisms detected between these sub-populations may be population size dependent.

Minor Allele Frequency of SNP Markers
Minor alleles for 7,310 SNP markers with physical map positions were determined based on genotypic data of 410 inbred accessions. Minor allele frequency (MAF) was then estimated within each sub-population (Table S4). We graphed MAF in order to visualize genetic variation between three representative sub-populations of cultivated tomatoes (processing, fresh market, and vintage) along with S. pimpinellifolium accessions. These plots revealed different MAF patterns over the entire genome with chromosomes 2, 4, 5, 6, and 11 being particularly variable between the cultivated sub-populations ( Figure 3A and Figure 4A). The processing and fresh market sub-populations showed unique MAF patterns on these chromosomes relative to the vintage subpopulation. MAF patterns on chromosome 5 distinguished the processing sub-population from the fresh market sub-population ( Figure 3A and Figure 4A). Common alleles in the S. pimpinellifolium accessions tended to be minor alleles in the cultivated subpopulations ( Figure 3A and Figure 4A). Since the processing and fresh market sub-populations were further divided into two subgroups, respectively, MAF was estimated within each sub-group and graphed (Table S4 and Figure S2). The processing sub-groups were distinguished by unique MAF patterns on chromosomes 5 and 11. The most variable MAF patterns between the large-fruited    Table S9 are positioned based on their coding sequences (arrow) and flanking markers (dotted line). The Y-axis represents allele frequency and the X-axis represents physical positions of the SNPs oriented with respect to the tomato genome sequence [102].   Table S9 are positioned based on their coding sequences (arrow) and flanking markers (dotted line). The Y-axis represents allele frequency and the X-axis represents physical positions of the SNPs oriented with respect to the tomato genome sequence [102]. fresh market sub-groups were found on chromosome 11 ( Figure  S2).

Loci Explaining Variation within Germplasm
PCA loadings measure the correlation between PC and SNP, and provide an estimate of how much each SNP contributes to variance. We extracted loadings from the PCA of only cultivated germplasm ( Figure 1B) and displayed these values relative to physical position in order to visualize SNPs that contribute most to variance in the germplasm panel. There were 778 SNPs with absolute values of .0.02 for PC1. Most of these SNPs were found on chromosomes 4 (24.0%), 5 (32.6%), and 11 (31.1%). For PC2, 709 SNPs with absolute values of .0.02 were found on chromosomes 5 (12.3%) and 11 (52.9%) ( Figure 3B, Figure 4B, and Table S5).
We also investigated loci that may be under positive selection between the three cultivated sub-populations using an F st outlier method based the expected distribution of F st and He [27,28]. We identified 339 candidates for loci under positive selection between processing and fresh market as falling outside of the 95% confidence interval ( Figure 3B, Figure 4B, and Table S6). A high portion of these loci (61.4%) were derived from chromosome 5, while 0.6-10.0% of the SNPs were distributed across the other 11 chromosomes. Comparison between processing and vintage germplasm detected 128 candidates for loci under positive selection ( Figure 3B, Figure 4B, and Table S7). Among these, 57 loci (44.5%) were located on chromosome 4 and 35 loci (27.3%) on chromosome 5. For comparison between fresh market and vintage sub-germplasm, 208 loci were outliers based on a 95% confidence interval ( Figure 3B, Figure 4B, and Table S8). Most of the loci were derived from chromosome 4 (42.6%) and chromosome 11 (43.5%). For all three pairwise comparisons, the candidate loci under positive selection were not randomly distributed within a chromosome.
To visualize the position of candidate genes that may have been selected for during breeding, we superimposed the position of 28 loci onto the physical map of MAF pattern, PCA loading, and F st outlier detection. Candidate loci included genes that affect fruit size, shape, and color, disease resistance, and plant morphology (Table S9). Three genes for fruit shape and size, fw2.2, OVATE, and lc are located on chromosome 2 [29][30][31][32]. Of these genes, only lc appears to be a candidate for a locus under selection based on MAF patterns, PCA loadings, and F st outlier detection of linked SNPs (Figure 3). The OVATE locus is polymorphic within all cultivated sub-populations, and therefore SNPs lack power to discriminate populations. The large fruited allele of fw2.2 is fixed in the cultivated sub-populations. The SUN mutation which controls elongated fruit shape is found on chromosome 7 [33] and is present at a low frequency in both processing and vintage accessions (Figure 4). The genomic region for the SUN mutation contains some SNPs with high absolute values for PCA loadings, though none of these were detected in the F st outlier analysis. Another fruit shape gene, FAS is found in a region of chromosome 11 [34] with high loadings for PC2, and LD between FAS and SNPs in this region may be responsible for detection of F st outliers between processing (which lack FAS) and vintage (which are polymorphic for FAS) germplasm ( Figure 4). Phenotypic data are available in flat file format through the SolCAP website [35] and in searchable form through the Sol Genome Network (SGN) ontology database using advanced search options with the stock number prefix, SCT; stock type, accession; stock editors, SolCAP project; and the organism as either Solanum lycopersicum or S. pimpinellifolium [36]. A detailed analysis of phenotypic data is provided in a separate publication [37].
Fruit color genes affecting skin and flesh color are distributed on chromosomes 1 (y), 3 (Psy1), 6 (B), 8 (gf), and 10 (u) [38][39][40][41][42][43]. Although y, Psy1 and gf are found in regions of the genome with SNPs that have moderately high absolute values associated with PCA loadings and somewhat variable MAF between subpopulations, these regions do not appear to be highly discriminatory ( Figure 3 and Figure 4). Regions on chromosome 1 and 3 were also not coincident with loci identified by the F st outlier approach. The old gold crimson (og c ) allele of the B gene encodes a mutation in the fruit specific lycopene beta-cyclase [41] and is segregating in all three cultivated populations. Thus minor variation in MAF patterns, PCA loadings, and the detection of a SNP under selection between processing and vintage are more likely a reflection of the closely linked SP allele. In contrast, the region of chromosome 8 containing gf contains several SNPs detected as outliers based on F st between processing and vintage germplasm (Figure 4), which is consistent with the absence of gf mutations in processing populations and the presence of the mutant alleles in several vintage accessions. The allele of u gene for uniform ripening appears to fall in a region that is fixed in the processing germplasm. This allele is present at a high frequency in fresh market germplasm and at a low frequency in vintage germplasm ( Figure 4). The region contains several SNPs with high absolute values for PCA loadings and also SNPs detected as F st outliers.
The signals of genetic differentiation on chromosomes 5, 6, and 11 might be due to allelic variation in disease resistance genes, and reflect different introgression histories. Two resistance genes to bacterial disease, Pto and Rx3 on chromosome 5 [44,45] appears to be candidates for loci under selection based on MAF patterns, PCA loading, and F st outlier detection (Figure 3). The region of chromosome 6 containing Cf-2 and Mi1.2 genes [46,47] shows signals of selection distinguishing processing accessions from fresh market and vintage accessions (Figure 3). In general, Cf genes are deployed more frequently in fresh market material while Mi has been widely used in processing germplasm in California, but rarely used in processing accessions in the Midwestern U.S. Chromosome 11 contains one of the oldest introgressions, I for Fusarium resistance [48,49]. This gene falls in a region of the genome that distinguishes contemporary germplasm from vintage germplasm based on MAF patterns ( Figure 4). The region also contains some SNPs with high PC loadings. SNPs were detected as outliers at the 95% confidence level between processing and vintage germplasm. The region of chromosome 11 containing Rx-4/Xv3 and I-2 [50,51] appears to be discriminatory between the cultivated subpopulations based on MAF patterns, PC2 loadings, and outlier SNPs (Figure 4). Other regions with signals for loci under selection include chromosomes 7 (I-3) [48,52], chromosome 9 (Ve1, Tm2 ' 2, Sw5, and Ph-3) [53][54][55][56][57], and chromosome 10 (Ph-2) [58] (Figure 4). The genomic regions for Ph-2 and Ph-3 show MAF patterns that distinguish processing from fresh market and vintage subpopulations ( Figure 4). SNPs were also detected in the regions by PCA loadings and F st outlier analysis. These resistance genes have been deployed in several processing tomatoes [59]. The region of the genome containing Tm2 ' 2 distinguishes fresh-market from vintage and processing accessions based on the presence of SNPs with a MAF of 10-15%, consistent with deployment of this resistance in fresh market accessions ( Figure 4). However, this region was not detected based on PC loadings or F st outlier detection, possibly due to low allele frequencies for markers in linkage disequilibrium with Tm2 ' 2.
Genes affecting plant morphology are distributed across several chromosomes. A region on chromosome 2 that appears to differentiate sub-populations contains the compound inflorescence gene (s) [60] (Figure 3). Chromosomes 5 and 6 contain genes (SP5 and SP, respectively) that control plant habit and may be key selection points in differentiating vintage from contemporary subgermplasm [61] and processing from fresh market [62,63]. The SP5 gene is in a region of the genome containing SNPs with high loadings for PC2, but does not appear to have been detected by F st outlier analysis (Figure 3). The jointless gene (j) on chromosome 11 [64] falls in a region containing SNPs with high absolute values for PCA loadings and outlier SNPs (Figure 4). This region distinguishes processing from fresh market and vintage accessions, reflecting the near fixation of jointless accessions in this germplasm category. The genomic region for the j-2 gene on chromosome 12 [65] contains SNPs with high absolute values for PCA loadings (Figure 4).

Linkage Disequilibrium (LD) Analysis
The extent of LD across each chromosome was analyzed for the processing, large-fruited fresh market, and large-fruited vintage sub-populations. Pairwise r 2 was calculated using 1,572 polymorphic SNP markers with MAF of $0.1 for processing, 1,504 for fresh market, and 700 for vintage accessions. The r 2 values were plotted against the genetic distance, and curves of LD decay were fitted using locally weighted scatterplot smoothing (LOESS) [66] and non-linear regression (NLR) [67]. The LOESS and NLR methods estimated similar LD decay on 9 chromosomes for processing, 8 chromosomes for fresh market, and 11 chromosomes for vintage germplasm. In these chromosomes, the observed difference of the LD decay between the methods ranged from 0 to 4.9 cM (Table 5 and Figures 5, 6, 7). Over 10 cM difference for LD decay estimated by LOESS and NLR was found on chromosome 6 (28.1 cM for processing and 14.1 cM for fresh market); chromosome 8 (20.2 cM for processing); and chromosome 11 (9.6 cM for fresh market and .40 cM for vintage) ( Table 5 and Figures 5, 6, 7). The use of either a fixed r 2 value of 0.2 or a value estimated using the 95 th percentile method resulted in similar values for LD decay over chromosomes in the three subpopulations.
Different patterns of LD decay were observed between chromosomes and sub-populations (Table 5 and Figures 5, 6, 7). Baseline r 2 values estimated using the 95 th percentile method ranged from 0.11 to 0.23. In the processing sub-population, the baseline r 2 value based on the 95 th percentile method was 0.23 and led to estimates of LD decay that ranged from 0.8 cM on chromosome 11 (both LOESS and NLR) to 7.7 cM on chromosome 8 (LOESS) and 35.2 cM on chromosome 6 (NLR) ( Table 5 and Figures 5, 6, 7). LD decay estimated from the 95 th percentile baseline r 2 value of 0.12 in the fresh market subpopulation ranged from 3.8 cM on chromosome 4 to 47.6 cM on chromosome 11 for LOESS, and ranged from 4.1 cM on chromosome 12 to 38 cM on chromosome 11 using NLR. With the 95 th percentile baseline r 2 value of 0.11 in the vintage subpopulation, LD decay occurred over the shortest distance on chromosome 12 (1.9 cM for LOESS and 1.1 cM for NLR), while the greatest distance for LD decay was found on chromosome 1 (9.9 cM for LOESS) and chromosome 3 (11.8 cM for NLR) ( Table 5 and Figures 5, 6, 7). Estimates of LD decay vary more on a chromosome to chromosome basis than those based on the method used to establish the LD cut-off or decay curve.

Discussion
A high density array with 7,720 SNP markers was used to genotype the SolCAP germplasm panel consisting of 410 inbred accessions. This diverse germplasm enabled the development of a cluster file for accurate SNP calling with the GenomeStudio software (Illumina Inc. San Diego, CA, USA). Over 98% of SNP markers generated consistent calls between duplicate samples across 34 accessions, with differences due mostly to DNA quality. Also, 7,375 SNP markers (96%) were polymorphic with ,10% missing data in the entire germplasm. We found that the level of heterozygosity (proportion of heterozygotes) was low, but variable between market classes of germplasm.
The SNPs provide excellent genome coverage, but their distribution is not reflective of chromosome size. For example, chromosome 1 is cytologically one of the largest, yet is underrepresented by SNPs relative to chromosome 11 which is cytologically small. It is possible that this distribution represents a distortion of the true measure of polymorphism, heterozygosity, or genetic diversity due to the sampling of markers. However, the germplasm presented in this study were well represented in the sequencing and SNP discovery pipeline [10] with five of the seven sub-populations contributing to sequenced germplasm. We found more variation in the level of polymorphism between chromosomes within a sub-population relative to the variation between sub-populations. These results suggest that there is little ascertainment bias within the red-fruited species, and that the chromosome to chromosome variation reflects breeding history rather than polymorphism discovery. The possibility of ascertainment bias when applied to more distant germplasm exists, and will be the subject of a future study.
Cultivated germplasm was divided into distinct sub-populations. The genetic differentiation between processing and fresh market germplasm reflects human selection for distinct ideotypes tailored to the needs of specific production systems. The contemporary processing and fresh-market germplasm were also distinct from vintage germplasm, which is consistent with previous findings [6][7][8][9]. The further division within processing germplasm confirms an earlier analysis based on the Bayesian model implemented in STRUCTURE [68] which demonstrated sub-structure consistent with breeding history and environmental adaptation [9]. We have not seen evidence for sub-structure within fresh market germplasm previously [9]. Upon close inspection, the cluster demarked by PC1 coordinates (-10, 10) and PC2 coordinates (230, 215) contained 87% of the fresh market accessions from Florida and 65% of the accessions from North Carolina ( Figure 1B). In contrast, the cluster from PC1 (10, 30) and PC2 (-15, 10) contained 100% of the fresh market accessions from Oregon and California. The Oregon and California accessions were not part of our previous collection [9], and the sub-structure identified in this analysis suggests that diversifying forces may play a role in shaping fresh market as well as processing tomato germplasm. Genetic differentiation within market classes may reflect founder effects in breeding programs, selection for specific traits, environmental adaptation, or a combination of these factors. Cultivated cherry, wild cherry and landrace accessions were not well separated. These results support previous studies showing that Latin American accessions and feral accessions often share alleles [69]. The lack of differentiation between cultivated cherry and wild cherry or landrace accessions was somewhat surprising, but is consistent with a diverse breeding base for the cherry market class.
Genetic diversity for each sub-population was measured using allelic richness, expected heterozygosity, and polymorphic information content. The descriptive statistics in the contemporary subpopulations exceeded levels found in the vintage sub-population, but were lower relative to the cherry sub-populations and S. pimpinellifolium. Although differences between processing and vintage sub-populations may be population size dependent, rarefaction analysis provides strong support for increased variation in fresh-market sub-population relative to vintage sub-population. Tomato has undergone genetic bottlenecks during domestication and through selection after the introduction of the crop into Europe [2,70]. Since the early 1900s, wild relatives have been used to introgress new alleles into cultivated tomato [3][4][5]. This practice is expected to increase allelic diversity in contemporary processing and large-fruited fresh-market germplasm.
The analysis of minor allele frequency (MAF) across the genome demonstrates genetic differentiation between sub-populations in specific chromosome regions. The different MAF patterns between three cultivated sub-populations (processing, fresh market, and vintage) were particularly evident on chromosomes 2, 4, 5, 6, and 11. The same regions of the genome were highlighted by SNPs contributing high absolute values to PCA loadings and by SNPs detected as F st outliers based on a deviation from the expected distribution of F st and He. Thus, these chromosomes appear to be under diversifying selection relative to other regions of the genome.
In order to investigate potential regions of the genome under selection, we superimposed MFA patterns, PCA loadings, and SNPs identified from F st outlier detection with the position of genes affecting fruit shape, size and color, disease resistance, and plant morphology (Figure 3 and Figure 4). Taken together, several regions of the genome containing candidate genes which may be under selection were detected based on coincident MAF patterns, PC loadings, or F st outlier analysis. However, the resolution of this analysis does not provide unequivocal evidence that selection for these candidates explains variation between sub-populations. Given our marker resolution and observed LD decay, candidate gene analysis was not highly informative. In addition, our ability to detect outliers may be influenced by allele frequencies and distribution across the sub-populations. In general, several regions under selection are compatible with the introgression of genes for fruit size and shape (e.g. lc and FAS), fruit color (e.g. gf and u), disease resistance (e.g.Cf-2 and I-2), and plant morphology (e.g. s and SP5).
Our results also suggest several chromosomes or regions of chromosomes with no obvious candidate genes to explain the observed differentiation. Chromosome 4 was identified as highly important for genetic differentiation between the sub-populations of cultivated germplasm. The role of this chromosome is less clear, though multiple loci affecting plant habit (e.g. dmt, Epi, glo, and si) have been mapped relative to morphological markers on this chromosome (http://tgrc.ucdavis.edu). Other areas under selection that are poorly explained by candidate genes include the chromosome 5 centromere which differentiates processing germplasm from the other sub-populations.
Common alleles at many loci in the S. pimpinellifolium accessions are minor alleles in the other sub-populations, including cherry tomato. A recent study with 144 cherry accessions demonstrated a separation into two groups: one close to the cultivated tomato and one showing admixture of cultivated tomato and S. pimpinellifolium [71]. Genetic diversity was higher in the S. pimpinellifolium relative to the wild cherry and cultivated tomato accessions, a result that is consistent with our findings. We expect accessions of S. pimpinellifolium to be relatively diverse due to their range of mating systems (many accessions exhibit facultative outcrossing) and wide geographic distribution in the native region [72]. The wild cherry accessions may be a mixture of feral cultivated accessions and wild progenitors of cultivated tomato [71].
The decay of LD over genetic distance is important to determine the density of markers appropriate for genetic analysis and selection strategies. LD levels vary both within and between species [73]. Previous estimates of LD decay in tomato were based on the entire genome with an average of 6-14 cM in processing accessions [74], 3-16 cM in fresh market accessions [74], and 15-20 cM in commercial European greenhouse accessions [75]. Given the marker density across each chromosome, we estimated LD decay on a chromosome by chromosome basis for processing, fresh market, and vintage accessions. LD decay was variable between chromosomes and sub-populations, suggesting that historical recombination is not uniform across the genome. For example, chromosome 11 showed decay over 0.8 cM for the processing sub-population, and decay over 19.7 cM for the fresh market sub-population. Although the similar estimates of LD decay was found between the LOESS and NLR methods on most of chromosomes, high levels of difference were found on chromosomes 6, 8, or 11 depending on the sub-populations. These chromosomes also showed high levels of non-homogenous distributions for pairwise r 2 values relative to the other chromosomes. This variation may reflect structure within sub-populations due to selection. When the a priori vintage germplasm subpopulation was based on age of variety, regardless of fruit size or geographical origin, the LD decay pattern for chromosome 4 displayed extensive LD and what appeared to be parallel patterns of decay. When the vintage accessions were separated into large fruited vintage and cultivated cherry/landrace groups, the dual pattern of LD decay disappeared and LD was reduced ( Figure S1).
The patterns of LD decay we observed appear to be a consequence of introgression and directional selection through breeding. The range of LD decay also suggests that recombination remains limiting in cultivated tomato because of the inbreeding mating system and an emphasis on backcrossing and pedigree selection in tomato breeding programs [22,76,77]. With LD decay exceeding 1cM, the SolCAP tomato array will be useful for most selection strategies and association studies as the average marker intervals range from 0.8 to 1.6 cM in the tomato genetic maps [19].
High-throughput SNP genotyping has provided a means to both visualize and quantify the effect of human selection on the tomato genome. Selection has reduced genetic diversity in vintage cultivated forms relative to wild and feral forms, and has changed the frequency of predominant alleles. In contrast to domestication, contemporary breeding has increased allelic diversity and heterozygosity in populations relative to vintage tomatoes. Selection has also led to sub-populations which are characterized by distinct haplotype blocks, patterns of allelic diversity, and recombination history. This analysis has highlighted specific regions of the genome that appear to be under selection, with SNPs on chromosomes 2, 4, 5, 6, and 11 clearly distinguishing fresh market and processing lineages. Our findings are not surprising given the history of breeding activities. However, incorporating this information into future breeding strategies offers both challenges and potential for creativity. Going forward, we will want to Figure 5. Linkage disequilibrium (LD) decay on chromosomes 1 to 4. LD measures r 2 against genetic map distance between pairs of SNP markers within each chromosome for processing, fresh market, and vintage sub-populations. Decay curves of locally weighted scatterplot smoothing (LOESS) [66] are represented by red, and decay curves of non-linear regression (NLR) [67] are represented by blue. Horizontal dashed and solid lines indicate the baseline r 2 values estimated using the 95 th percentile method (0.23 for processing; 0.12 for fresh market; and 0.11 for vintage) and a fixed r 2 value of 0.2, respectively. doi:10.1371/journal.pone.0045520.g005 balance a desire to promote recombination and generate new combinations of alleles, with the need to preserve desirable combinations of genes. These analyses also highlight a role for several chromosomes in the differentiation of cultivated tomatoes. A role for chromosome 4 has not been highlighted by previous studies yet this chromosome appears to have regions that have been selected during breeding activities.  (Table S1). Processing and fresh market germplasm represented contemporary accessions, while vintage accessions (sometime referred to as heirlooms) represented early tomato selections that in some cases predate the application of Mendelian principles to crop improvement [7,9]. Cherry tomatoes, often referred to as S. lycopersicum 'cerasiforme', in our germplasm panel were separated into two subpopulations: cultivated and wild accessions. Landraces included Latin American cultivars and represent early domesticates from regions near the centers of origin and domestication. The collection also contained germplasm that was considered commercially relevant, with several inbred lines that are parents of commercial hybrids [78][79][80]. The collection also included the parents of several important recombinant inbred and inbred backcross populations [45,[81][82][83][84], segmental substitution lines [85], and a mutation library [86]. Finally, two accessions that have been the subject of public sequencing efforts, Heinz 1706 and LA1589, were also included.

SNP Genotyping
The SolCAP germplasm panel was genotyped using a tomato array with 7,720 SNPs as implemented in the Infinium assay (Illumina Inc., San Diego, CA, USA). Details of the SolCAP SNP discovery pipeline are described previously [10], as are details of the array [19]. In addition, all SNPs on the array have been incorporated into the SGN database [87], the SNP annotation file is available through the SolCAP website [88], and sequences are available through the National Center for Biotechnology Information Sequence Read Archive (accession number SRP007969).
For each accession of the SolCAP germplasm, genomic DNA was isolated from fresh young leaf tissue according to a modified CTAB method [82]. Double-stranded DNA concentrations were quantified using the PicoGreen assay (Life Technologies Corp., Grand Island, NY, USA) and normalized to 50 ng/ul with 10 mM Tris-HCl pH 8.0, 1 mM EDTA. Genotyping was conducted with 250 ng of DNA per accession following the manufacturer's protocol for the Infinium assay. For SNP calls, the resulting intensity data were loaded in GenomeStudio version 1.7.4 (Illumina Inc., San Diego, CA, USA). In order to determine SNP genotype, we first used the automated cluster algorithm to generate initial calls. Clustering for every SNP was assessed by visual inspection and modified when the default clustering was not clearly defined. Particular attention was paid to a clear definition of the boundaries for heterozygote calls, which were reduced manually for a number of SNPs in order to reduce the number of ambiguous calls. As a result, the rate of alleles with no call was increased slightly.

Data Analysis
Physical positions of 7,666 SNPs were previously determined relative to the tomato genome sequence [19]. The SNPs with physical positions were filtered based on polymorphism and missing data. The 7,323 polymorphic SNPs with ,10% missing data included 13 markers with inconsistent chromosome assignments between physical and genetic map positions. We removed these SNPs and used the 7,310 SNPs for minor allele frequency (MAF) and rarefaction analyses. For principal component analysis (PCA), pairwise F st , descriptive statistics, and F st outlier detection, we used 4,393 polymorphic SNPs (excluding the 13 markers) with ,10% missing data that were genetically mapped in the Moneymaker x LA0121 F 2 population of 184 plants [19]. A second set of 3,473 markers was selected based on genetic position in the EXPEN 2000 genetic map [19] and used for pairwise estimation of F st . Polymorphism of these markers was determined based on the observation of at least one alternative allele in the 410 inbred accessions. For linkage disequilibrium (LD) analysis, the 4,393 SNP markers were filtered for a MAF of $10% within each sub-population.
Principal component analysis. The genetic relationship between sub-populations was analyzed using PCA as implemented in R [89]. GenomeStudio SNP data were converted to proportional scoring where 2 is equal to homozygous for the common allele; 1 is equal to heterozygote; and 0 is homozygous for the rare allele. Missing data were imputed using the R package pcaMethods [90] in which missing data calls are based on the SVDimpute algorithm [91]. PCA was conducted for the entire data set, and subsequently for a data set consisting of only the three major sub-populations of cultivated tomato (processing, fresh market, and vintage). The relationship between accessions was visualized by plotting scores for PCs. Marker contributions to the loadings of each PC were extracted and displayed relative to chromosome position in order to visualize regions of the genome containing markers that maximize variation within the germplasm collection and assuming that these represent regions of the genome with maximum diversity. Significant differences between sub-populations were tested via analysis of variance (ANOVA) for the eigenvalues of the first two principal components. Figure 6. Linkage disequilibrium (LD) decay on chromosomes 5 to 8. LD measures r 2 against genetic map distance between pairs of SNP markers within each chromosome for processing, fresh market, and vintage sub-populations. Decay curves of locally weighted scatterplot smoothing (LOESS) [66] are represented by red, and decay curves of non-linear regression (NLR) [67] are represented by blue. Horizontal dashed and solid lines indicate the baseline r 2 values estimated using the 95 th percentile method (0.23 for processing; 0.12 for fresh market; and 0.11 for vintage) and a fixed r 2 value of 0.2, respectively. doi:10.1371/journal.pone.0045520.g006 Genetic differentiation and diversity. As a measure of population differentiation, pairwise F st [92] was estimated using the Microsatellite analyzer v4.05 [93]. This analysis was conducted using two sets of markers (3,473 and 4,393 SNPs) for all 410 inbred accessions. We also estimated pairwise F st with a reduction of sample size (n = 40) for processing, fresh market, vintage subpopulations in order to estimate F st without bias due to different population sizes. The analysis was iterated for three separate sets of accessions with n = 40 for each market class. The P-value for the pairwise F st was obtained from 10,000 permutations of genotypes and a Bonferroni correction was applied. Genetic diversity within each sub-population was assessed based on allelic richness (A) [94,95], expected heterozygosity (He) [96] and polymorphic information content (PIC) [97] using the 4,393 SNPs for the 410 accessions. A and He were estimated using the MSA software which corrects for sample size [98], while PIC was calculated using the equation: where n is the number of allele and p i is the frequency of the i th allele [97]. Rarefaction analysis. This analysis was used to estimate how the number of polymorphic markers increases relative to sample size [26]. Analysis and graphing was conducted using the species accumulation curve function ''specaccum'' in the R package, vegan [99]. The method ''random'' was applied to estimate means and standard deviations from 100 sub-samples without replacement [26]. This method provides a data summary and permits boxplot methods to be used to graph the curves. The boxplot function was used to superimpose standard deviations.
Minor allele frequency. The minor allele was determined based on the allele calls for 410 inbred accessions. Minor allele frequency (MAF) was then calculated across all chromosomes for each of sub-populations and sub-groups within the processing and fresh market sub-populations. This approach permitted the comparison of allele frequencies between sub-populations. MAF was graphed to visualize genetic variation based on physical map position across 12 chromosomes using the R package ggplot2 [100]. We also positioned genes for disease resistance (13 genes), fruit shape and size (5 genes), fruit color (5 genes), and plant morphology (5 genes) on the MAF plot (Table S9). Physical positions of the genes relative to the tomato genome sequences were determined using coding sequences or flanking marker sequences. These sequences were aligned to the Tomato WGS chromosome v SL2.40 [101,102] using BLAST [103].
Loci under positive selection. In order to identify loci under positive selection between processing, fresh market, and vintage sub-populations, we used an F st -outlier detection method Figure 7. Linkage disequilibrium (LD) decay on chromosomes 9 to 12. LD measures r 2 against genetic map distance between pairs of SNP markers within each chromosome for processing, fresh market, and vintage sub-populations. Decay curves of locally weighted scatterplot smoothing (LOESS) [66] are represented by red, and decay curves of non-linear regression (NLR) [67] are represented by blue. Horizontal dashed and solid lines indicate the baseline r 2 values estimated using the 95 th percentile method (0.23 for processing; 0.12 for fresh market; and 0.11 for vintage) and a fixed r 2 value of 0.2, respectively. doi:10.1371/journal.pone.0045520.g007 Table 5. Chromosome by chromosome linkage disequilibrium (LD) analysis within three representative sub-populations of cultivated tomato. The estimate of the 95th percentile baseline r2 value in each germplasm group was 0.23 in the processing varieties, 0.12 in the fresh market varieties, and 0.11 in the vintage varieties. 2 Logally weighted scatterplot smoothing [66]. 3 Non-linear regression [67]. For NLR, the expected r2 was calculated using the model of Hill and Weir (1988 as implemented in the LOSITAN workbench [27,28]. Outliers were detected based on a distribution of F st and expected heterozygosity (He). Five simulations for each pairwise comparison were run with 10,000 iterations. Simulations were conducted using the neutral mean, forced mean, and a mutation model with infinite alleles. Under these options, LOSITAN estimated a 95% confidence interval, and SNPs that fell outside of this range were identified as under positive selection (higher than expected F st for an estimated He). Linkage disequilibrium (LD) analysis. The extent of LD across each chromosome was measured using the SNP genotypes in three representative groups of cultivated tomatoes (processing, fresh market, and vintage). Pairwise r 2 between markers within each chromosome was then calculated using TASSEL v 2.1 [104] and GGT v 2 [105]. In order to determine the decay of LD, these r 2 values were plotted against the genetic distance (cM) between markers. Curves of LD decay were fitted using locally weighted scatterplot smoothing (LOESS) [66] and non-linear regression (NLR) [67] using R [89]. For LOESS, smoothing parameters between 0.1 and 0.5 were tested, and a final parameter of 0.3 was chosen based on curve fits. For NLR, the predicted r 2 values between adjacent markers were calculated using the model of Hill and Weir [106]. Two methods were chosen to determine baseline r 2 values: a fixed value of 0.2 [107] and the parametric 95 th percentile of the distribution of the unlinked markers [108].