Diversity analysis of cotton (Gossypium hirsutum L.) germplasm using the CottonSNP63K Array

Cotton germplasm resources contain beneficial alleles that can be exploited to develop germplasm adapted to emerging environmental and climate conditions. Accessions and lines have traditionally been characterized based on phenotypes, but phenotypic profiles are limited by the cost, time, and space required to make visual observations and measurements. With advances in molecular genetic methods, genotypic profiles are increasingly able to identify differences among accessions due to the larger number of genetic markers that can be measured. A combination of both methods would greatly enhance our ability to characterize germplasm resources. Recent efforts have culminated in the identification of sufficient SNP markers to establish high-throughput genotyping systems, such as the CottonSNP63K array, which enables a researcher to efficiently analyze large numbers of SNP markers and obtain highly repeatable results. In the current investigation, we have utilized the SNP array for analyzing genetic diversity primarily among cotton cultivars, making comparisons to SSR-based phylogenetic analyses, and identifying loci associated with seed nutritional traits. The SNP markers distinctly separated G. hirsutum from other Gossypium species and distinguished the wild from cultivated types of G. hirsutum. The markers also efficiently discerned differences among cultivars, which was the primary goal when designing the CottonSNP63K array. Population structure within the genus compared favorably with previous results obtained using SSR markers, and an association study identified loci linked to factors that affect cottonseed protein content. Our results provide a large genome-wide variation data set for primarily cultivated cotton. Thousands of SNPs in representative cotton genotypes provide an opportunity to finely discriminate among cultivated cotton from around the world. The SNPs will be relevant as dense markers of genome variation for association mapping approaches aimed at correlating molecular polymorphisms with variation in phenotypic traits, as well as for molecular breeding approaches in cotton.


Background
Cotton (Gossypium spp.), primarily the tetraploid species, Gossypium hirsutum L., is produced in many diverse tropical, subtropical and warm temperate regions of the world. Because cotton cultivation is so extensive, ca. 31 million hectares worldwide in 2016 (USDA-NASS), the societal benefits from plant breeding-based gains in cotton performance are greatly magnified. These include improved natural resource conservation and preservation, reduced reliance on undesirable chemical protectants, more time-and energy-efficient production, and increased profits for growers and cotton-related businesses, as well as local and even national economies. Improvements in baseline crop production and product quality have been possible through breeding of cotton germplasm resources, especially genetically elite cultivars. The use of non-elite germplasm resources for genetic improvement has been increasing in part because molecular markers and sequence polymorphisms facilitate germplasm analysis, classification, categorization, genotyping, genomic comparisons, and various types of marker-assisted selection for breeding and breedingrelated research.
The importance of cotton fiber as a commodity is widely recognized due to its extensive use in textiles. The importance of cottonseed as a source of cooking oil and feed for cattle is less recognized and has been far less studied. Renewed interest in cottonseed for extended use as feed for non-ruminants or even human consumption comes from the increasing need for affordable sources of protein to feed a growing global population [1] and the ability to selectively eliminate gossypol from the seed [2]. Increased demand for feed and food will likely change across time and increase seed usage. Regardless, global climate changes will require cultivars adapted to higher temperatures, decreased precipitation, and/or increased salinity along with new genotypes with resistance to altered profiles of pests and pathogens. To meet these challenges, plant breeders will need to identify novel sources of variation and incorporate them into their breeding programs.
Cotton germplasm resources available worldwide [3] contain beneficial allelic variations that traditional and genomic breeding methods can exploit to develop cultivars adapted to emerging environmental and climate conditions. The USDA National Cotton Germplasm Collection (NCGC) is one of the largest collections of cotton germplasm resources [3]. The NCGC is comprised of over 10,000 accessions representing nine genomes and 45 species originating worldwide. However, the unequivocal differentiation, categorization, and classification of accessions remain challenging for Gossypium germplasm resources [4]. In some cases, race and species designations remain ambiguous [5][6][7][8][9]. Continual characterization and evaluation using the latest technologies are essential to providing the most accurate information to potential germplasm users. The 'Gossypium Diversity Reference Set' (GDRS) is a subset of approximately 20% of the entire NCGC, created in 2013 via proportional representation across taxonomic levels and geographic origins [4]. Simple sequence repeat (SSR) markers were recently used to genotype the GDRS to evaluate the applicability of a core set of markers across multiple genera and species [4] and to identify differences between individual accessions and groups of wild and improved types of G. hirsutum and G. barbadense L. [10]. Based on this information, core sets of G. hirsutum and G. barbadense are being developed for future genetic studies. The GDRS has been evaluated for seed oil, protein, and seed index traits [11], and subsets are being evaluated for other agronomic traits.
Diverse resources within other worldwide cotton germplasm collections have also been characterized with molecular markers, primarily SSRs [7,[12][13][14]. Additionally, elite cultivars from breeding programs are often genotyped to enable molecular comparisons of cultivars, known pedigrees, and the respective breeding germplasm resources in specific countries and growing regions [15][16][17][18][19][20]. The accumulation of knowledge regarding genetic differences is especially useful when combined with knowledge of differences in phenotypes because it allows breeders and other researchers to generate segregating populations of increased usefulness by choosing parents with complementary phenotypic traits and genetic distinctiveness. These populations are then utilized to combine and select favorable alleles; map and determine the genetic basis of a particular phenotype; and launch development of advanced cultivars.
With decreasing resources available for germplasm preservation, optimization of these resources becomes essential. Unrecognized redundancy of germplasm within collections has played a large role in impeding optimal management and use of collections [21]. It has been estimated that only one-third of the total number of accessions in rice germplasm banks may be distinct [22], and this situation to greater or lesser extent is recognized in many germplasm collections. Management of any plant collection requires a multitude of resources to maintain a highly viable seed inventory to ensure seed is available for utilization from the collection at any time. For cotton, the photoperiod sensitivity of many lines requires costly field and nursery production in two or more latitudes. Thus, maintenance of duplicated materials leads to considerable amounts of unnecessary effort and cost. Another concern in the management of germplasm collections and breeding programs is the maintenance of purity. Purity of lines is compromised by outcrossing, much of which cannot be easily detected by phenotypic observations. Accurate genotypic profiles in addition to phenotypic profiles are increasingly relied upon in purity testing of commercial cultivars [23]. Genotypic profiling is a sought after tool for germplasm resource management and is relevant to germplasm collections of cotton, which is self-pollinated yet highly amenable to insect-mediated cross-pollination [24].
Most recently in cotton, SSRs have been the marker of choice for researchers to identify genotypic differences [10,12,13,15,17,25]. SSR markers typically can detect many alleles per locus, resulting in a high polymorphism which allows for utilization of a smaller number of markers. However, genotyping with SSRs is timeconsuming, moderately costly, and difficult to do in a high-throughput manner [26]. In contrast, single nucleotide polymorphisms (SNPs), only have two alleles at each locus, and therefore a single marker is less informative at a locus than an SSR. However, SNPs occur more frequently than SSRs throughout the genome, and greater numbers of markers can be obtained. The benefit of SNPs is that it is possible to genotype a large number of SNPs simultaneously in a high-throughput, cost effective manner due to their bi-allelism. When diversity analyses were conducted with an equal number of SSRs and SNPs in maize, the SSR analysis was generally more informative [26]. But when the number of SNPs was increased, this limitation was overcome, and SNPs were able to resolve differences between extremely similar individuals as well as increase the accuracy of diversity estimates [26,27].
Compared to many field crops, cotton has only recently begun to benefit from the wide availability of SNP markers. Allotetraploid crops like cultivated cotton are especially challenging for the identification of SNPs [28,29]. Most efforts on cotton SNPs to date have focused mainly on discovery and validation [30][31][32][33][34][35][36][37]. These efforts have culminated in the identification of sufficient SNP markers to establish high-throughput genotyping systems, such as the CottonSNP63K array. These systems enable researchers to efficiently analyze large numbers of SNP markers and obtain highly repeatable results [38]. In the current investigation, we aim to utilize this SNP array for analyzing diversity within a predominantly G. hirsutum cultivar set and to identify loci associated with seed nutritional traits.
To obtain a more thorough characterization of the genetic diversity of cotton, we used the CottonSNP63K array to analyze a total of 395 cotton samples provided by the NCGC and collaborators worldwide. The data collected on the samples shows how this technology can be applied 1) to evaluate genetic diversity and population structure between groups of germplasm, 2) to gauge the effectiveness of the array to identify differences among individual G. hirsutum genotypes, 3) to investigate possible redundancies in NCGC accessions, 4) to relate this new technology with results from SSR markers, and 5) to explore the potential of the array for use in genome-wide association studies.

Plant material and genotyping
Genomic DNA was extracted from young leaves of single plants representing a panel of 395 diverse Gossypium genotypes following the protocol described in Hulse-Kemp et al. [38]. DNA from young leaves was extracted using NucleoSpin Plant II kits (MACHEREY-NAGEL, USA), quantified using PicoGreen, and normalized to 50 ng/μL prior to genotyping. Five of these samples (all improved genotypes developed in cotton breeding regions outside the United States) were removed from further analysis due to potential misclassification or admixture (Additional file 1), leaving 390 genotypes for analysis. The final panel included 363 G. hirsutum samples; of which 292 were improved/cultivated or previously cultivated types and 71 were non-cultivated relatives (Additional file 2). The remaining 27 samples were from 10 diploid and tetraploid Gossypium species other than G. hirsutum. The non-G. hirsutum species included six diploid species (G. arboreum, G. amourianum, G. longicalyx, G. raimondii, G. thurberi, and G. trilobum) and four tetraploid species (G. barbadense, G. ekmanianum, G. mustelinum, and G. tomentosum). All plant materials used in this study were obtained from other researchers or national germplasm collections as indicated in Additional file 2.
The improved types of G. hirsutum were selected to span global regions of cotton breeding efforts. Within improved types, 185 cultivars were developed in the United States, and these were further classified into four breeding regions: eastern, mid-south, plains, and western. For a detailed description of criteria used for classification into global and US breeding regions see Hinze et al. [10]. Classification was based on declared passport data at time of submission. To provide continuity with our earlier report [10], genotypes from Africa were separated into two groups, northern Africa and southern Africa, with the equator as the boundary. We theorized that genetic differences observed between African improved germplasm would be based less upon geospatial considerations and more on breeding history. It was thought that genetic groupings might exist based on previous colonization and trade relations. The two regional groups were arbitrary but have been preserved for continuity. Within the improved group of G. hirsutum, 15 examples of genotypes with the same designations were included to examine whether genotypes with the same names were also genetically similar. In 12 examples, the genotypes came from different breeding programs (biological replicates), and in the remaining examples, the genotypes came from the same seed source (technical replicates) (Additional file 3).
Analyses of non-cultivated G. hirsutum germplasm centered around seven geographical landraces: latifolium, marie-galante, morrilli, palmeri, punctatum, richmondi, and yucatanense [5]. These analyses of noncultivated types included a distinct group of "mocó" cotton (G. hirsutum race marie-galante), primarily of Brazilian origin [39]. For consistency with our earlier reports [4,10,11], the accessions from the non-cultivated perennial relatives of G hirsutum, i.e., from the seven geographical landraces, will be further and collectively referred to as "wild" even though it is recognized that the only truly wild landrace is yucatanense, while the remaining six landraces display some traces of domestication and are considered ferals, or relics from ancient pre-Columbian domesticated forms.
SNP genotypes were generated for each panel member according to Hulse-Kemp et al. [38], using the Cot-tonSNP63K array (Illumina, USA), which contains 63,058 SNP Infinium II assays and the cluster file available for tetraploid Gossypium germplasm. The cluster file was developed using 1,156 cotton samples including the subset of plant materials used in this diversity analysis. Genotypes were determined for all samples in the present analysis using the 38,822 SNPs classified as polymorphic by Hulse-Kemp et al. [38]. Approximately 30% of the SNP assays in the CottonSNP63K were designed from 20,000 sequences for detecting polymorphism between G. hirsutum and five other species, namely G. barbadense, G. tomentosum Nuttal x Seemann, G. mustelinum Miers x Watt, G. armourianum, and G. longicalyx, but not among G. hirsutum genotypes.

Data analysis
Standard summary statistics for all SNPs were generated using PLINK v. 1.90b3m (https://www.cog-genomics.org/plink2) [40,41]. SNP allele frequencies were calculated using the '-freq' option. Estimates of expected heterozygosity were calculated using the '-hardy' option. Polymorphic SNPs were defined as those with a minor allele frequency (MAF) greater than 0.01within a defined set of samples. Unique SNPs were defined as those with MAF > 0 in a given group of samples and MAF = 0 in all other groups being compared. The number of homozygous differences between each line was calculated with the 'bcftools gtcheck' command in VCFtools [42].
Summary statistics and diversity analyses were independently evaluated on (1) the full dataset containing multiple Gossypium species, (2) a dataset of only G. hirsutum samples, (3) a dataset of only improved (i.e., cultivated) G. hirsutum, and (4) a dataset of only wild (i.e., landrace) G. hirsutum. The subset of improved G. hirsutum cultivars originating in US breeding programs was also evaluated.
Diversity was analyzed using SNPs with a MAF greater than 0.01 and a genotyping rate greater than 0.90. A genetic similarity matrix for all pair-wise combinations of individuals was calculated using PLINK ('-cluster -matrix' option) on the basis of the genome-wide average proportion of alleles shared which were identical by state (IBS) between any two individuals [41]. Multidimensional scaling (MDS) analysis of the genetic similarity matrix was used to extract the first six dimensions of relationships between cultivars with PLINK using the '-cluster -mds-plot 6' option. Further interpretation of individual ancestry and degree of admixture was estimated using fastSTRUCTURE [43]. The three data sets (2-4), as noted above, were initially run for K = 1 to 10 with the '-prior = simple' option and remaining default parameters. The optimal value of K for these runs was then determined using the 'chooseK' function. If the optimal K was determined to be 1, the process was repeated for K = 1 to 3 with the '-prior = logistic' option. Distruct v. 1.1 as implemented in fastSTRUCTURE was used to generate bar plots to visualize proportions of admixture.
Putatively identical accessions from the NCGC, as determined by the matrix derived from SNP-based IBS values (IBS > 0.98) were compared for phenotypic similarity at the Southern Plains Agricultural Research Center (SPARC) in 2015. Seeds representing each accession were planted in the greenhouse in early April and grown for 3 weeks before transplanting to field plots. Field plots were 10.06 m × 1.02 m with 20 plants per plot to represent each accession. During the growing season, 26 morphological descriptors were scored as a plot average. This group of descriptors included leaf hair, leaf color, leaf shape, stem color, stem glands, stem hair, leaf size, leaf glands, leaf nectaries, bract nectaries, boll nectaries, petal color, pollen color, petal spot, stigma, bract type, bract teeth size, bract teeth number, bract color, boll shape, boll point, boll size, boll color, boll glanding, boll pitting, and fruiting type (for description of these traits, rating scale, and associated digital images, see https:// www.cottongen.org/data/trait/NCGC_rating_scale). These descriptors are primarily inherited as qualitative traits whose expression shows negligible environmental interaction; therefore, characters were scored in a single replicate at a single location.
Among the 395 genotypes of the current investigation, a subset of 195 G. hirsutum accessions (126 improved and 69 wild types) were selected from the GDRS that were previously genotyped with SSR markers [4,10] (Additional file 2). Genotype data of 105 SSR markers were obtained from the Dryad Digital Repository [44] and compared to data from 38,822 SNPs in the current analysis. Three of the improved G. hirsutum accessions were found to be admixtures based on SSR analyses and were removed prior to comparison of SNP and SSR results (Additional file 1). Independent analyses were conducted for the data subsets of a) 192 G. hirsutum accessions and b) 123 improved G. hirsutum accessions. Loci that were monomorphic within the respective subset were removed prior to calculation of genetic similarities. A genetic similarity matrix was calculated for each marker system based on Jaccard's coefficient [45] as implemented in NT-SYS [46]. Genetic similarity values were plotted against each other and were compared with Mantel statistics via the 'MxComp' option in NT-SYS with 5,000 iterations to provide an α = 0.01 level of significance. Principal coordinate analyses (PCoA) were then calculated from Jaccard's similarity values. To analyze SNP data using NT-SYS software, a text file set was generated from the PLINK binary file format using the '-recode 01' option to code the minor alleles as '0' and the major alleles as '1' , and the '-output-missinggenotype 9' option to code missing data as '9'.
Phenotypic data for seed traits including percent oil content, percent protein content and seed index (grams per 100 seeds) were obtained from Hinze et al. [11] for the 195 accessions common to the previous SSR and current SNP study. The trait data was measured on seeds available in the NCGC which were harvested across different locations and years. The SNP genotypes for these lines were categorized and filtered for MAF greater than 0.01 and a genotyping rate greater than 0.90. Binary PED files for genotypes and phenotypes were exported from PLINK and used as input for single-SNP based association testing using the Genome-wide Efficient Mixed Model Association software (GEMMA) [47]. GEMMA was first applied to calculate new relationship matrices that could be used to adjust for population structure for each trait. QTL associations were then analyzed using the univariate linear mixed model equation. Three statistical tests were calculated, including the Wald test, likelihood ratio test, and the score test. Correction for population structure was assessed by plotting the observed versus expected ratios for each phenotype. Obtained p-values were filtered for significance using the Benjamini-Hochberg's False Discovery Rate (FDR) method [48]. The FDR was initially constrained to the level α = 0.05, but values up to 0.15 were used when few results were obtained using the initial α level. Candidate genes located near the determined loci were investigated using alignment information of SNPs to the G. raimondii (D 5 ) reference genome and/or the NBI (Novogene Bioinformatics Institute) G. hirsutum (TM-1, [AD] 1 ) draft genome [49,50].

Minor allele frequencies within and across Gossypium germplasm groups
We used a high-throughput genotyping platform to appraise the diversity of cultivated cotton and its wild relatives (Additional file 2). Of the SNPs identified in this study, 33,507 (86%) were polymorphic across all Gossypium species measured (Table 1). From the 20,000 sequences specifically selected to detect polymorphism between G. hirsutum and five other species, the array enabled detection of 17,954 interspecific polymorphisms (interspecific SNPs), which are useful for introgression research and breeding. Within G. hirsutum, the number of polymorphic SNPs (MAF > 0.01) ranged from 26,299 (68%) within 71 noncultivated landraces to 23,145 (60%) in 292 cultivars; 25,829 (67%) SNPs were polymorphic within this set of G. hirsutum. The distribution of MAF across these groups indicated that improved cultivars, and more specifically, those from the western US breeding region, had the most monomorphic SNPs (40 and 56%, respectively) (Additional file 4). The utility of the SNP array was examined by characterizing SNP allele frequencies across the panel, revealing that nearly 86% of the called SNPs were polymorphic among the 390 cotton samples, with an average minor allele frequency of 0.21 across the entire set. The average MAF when determined for G. hirsutum alone was 0.25, considerably higher than over all Gossypium species tested. These results serve as a reminder that the array was designed for genotyping and discriminating among genotypes of G. hirsutum, their hybrids, and interspecific introgression products [38]. Based on MAF, for an average pairwise combination of any two improved types, ca. 5,650 SNPs would be expected to be detected, and for an average combination of one improved and one wild G. hirsutum, ca. 6,892 SNPs would be detected ( Table 2). The observed number of homozygous SNP differences for the specific cultivars in this study averaged 7,017 SNPs with a maximum of 11,759 SNPs between 'TAMCOT Sphinx' and 'Sealand2' (an interspecific cross between G. barbadense and G. hirsutum) (Additional file 5). For the twelve groups of genotypes with the same names but from different seed sources, we observed a maximum of 4,857 homozygous differences between sources of the 'PD 1' cultivar (Additional file 3).
SNPs were initially characterized by their unique occurrence in several groups of G. hirsutum. In comparisons of non-improved and improved types, a large proportion of SNPs (23,984; 82.6%) were common to both groups, indicating that most genetic variation from wild types was also found in improved G. hirsutum. Whereas 3,135 or 10.8% of the total SNPs found in wild G. hirsutum were unique, only 1,927 or 6.6% of the SNPs observed in improved G. hirsutum were unique (Fig. 1a). Among global breeding regions for improved G. hirsutum, the United States had the highest number of unique SNPs (1,436), followed by the 18 Australian samples (149) and the 13 samples from northern Africa (118) (Additional file 1). Due to unequal sampling sizes among individual global breeding regions, a more appropriate comparison might be between improved accessions of the United States (N = 185) and all other countries (N = 107). A large proportion of SNPs (23,323; 90.0%) (Fig. 1b) were shared between the US and other countries, reflecting the common origins of cultivars and the continuing exchange of germplasm. Within improved G. hirsutum, approximately the same number of unique SNPs was found within the US (1,436) as compared to all other global breeding regions combined (1,152).
Within US improved germplasm, cultivars that were not assigned to a regional breeding group (unclassified, "n/a") possessed almost three times as many unique SNPs (1,393) as cultivars assigned to specific breeding regions (Fig. 1c). The "n/a" group was very diverse and included a wide range of genotypes. Cultivars from the eastern (545) and plains regions (440) had notably higher numbers of unique SNPs when compared to the Calculations were based on the average MAF (including monomorphic SNPs) for each group obtained from Table 1 mid-south (42) and western region (29). The eastern, mid-south, plains, and unclassified regions of the United States were represented by approximately the same number of samples (48,48,43, and 34, respectively) while the western region had the least representation (12). Therefore, the few unique SNPs from the western region could correspond to its low representation within the improved germplasm of the United States. The low number of unique SNPs in relation to the high representation of samples from the mid-south could be indicative of the development and perpetuation of a genotype that is highly adapted and successful to that region and that has adaptive capabilities in other breeding regions. Genetic diversity was greater in wild germplasm of G. hirsutum (H E = 0.232) than in improved germplasm (H E = 0.195) of the species (Table 1). For cotton bred within the US, the greatest genetic diversity (H E = 0.193) was found in the group of improved samples that could not be assigned to one of the four breeding regions. This supports the high number of unique SNPs within this group and suggests the presence of a distinctive combination of SNPs within a very diverse group of germplasm. The lowest diversity was observed in the mid-south (H E = 0.151) which corresponds to the observed low number of unique SNPs.

Genetic similarity and relationships between Gossypium breeding groups
The identical by state (IBS) values between pairs of individuals was used to estimate average genetic similarity within specific breeding groups. For the two groups of G. hirsutum, the highest similarity was observed in cultivars (IBS = 0.709) and the lowest similarity occurred among wild types (IBS = 0.652) ( Table 3). Pairs of cultivars from the United States were only slightly more similar (IBS = 0.683) than pairs from other countries (IBS = 0.671). Within US breeding regions, the most similarity was observed among cultivars from the midsouth (IBS = 0.739) while the most dissimilar cultivars originated within the plains region (IBS = 0.682). Pairs of unclassified US cultivars were also highly dissimilar (IBS  = 0.673), as expected among the assorted genotypes in this group. When comparing across US breeding regions, cultivars from the western and mid-south regions were the most different (IBS = 0.657). These pairwise genetic similarities may also be evaluated by their respective distribution patterns (Fig. 2a, b, c and d). Of particular interest was the relationship between the United States and other countries. The almost complete overlap in the distribution of pairwise similarity (or diversity) between improved cultivars worldwide was readily observed (Fig. 2b). Within the United States, increased genetic distance was more often found between pairs of cultivars from different regions than from the same region (Fig. 2c). This increased distance could be explained by the divergent distributions for IBS between regions (Fig. 2d). Within the wild category, some pairs had high similarity equivalent to or greater than that of improved pairs (Fig. 2a). Approximately 30% of the pairwise comparisons (representing 30 different wild samples) had IBS greater than 0.7. Upon further inspection, the majority of these samples in the pairwise comparisons had no race designation with the remainder being four marie-galante, four mocó, one latifolium, and one morrilli type. Based on geography of origin, several of the samples were from Venezuela, Brazil, Puerto Rico, and the Guadeloupe and Martinique Islands. This increased similarity could be indicative of a founder effect among the wild samples [51] where diversity was lost as the species moved from the Mexico and Guatemala center of diversity outward for domestication in the Caribbean and South America.
To examine genetic population structure and relationships among the major groups of germplasm, we conducted two independent tests of population stratification (MDS and fastSTRUCTURE). Multidimensional scaling (MDS) of pair-wise IBS values was used to visualize the relationships among several groups. First, all Gossypium species were analyzed together, and the SNPs efficiently separated G. hirsutum from all other species along the horizontal axis (Fig. 3). Three other Gossypium samples (not pure G. hirsutum lines) were noted at the edge of the range of wild samples of G. hirsutum. These samples included a synthetic tetraploid of G. hirsutum x G. longicalyx J.B. Hutchinson x G. armourianum Kearney (HLA); Peale1B, an unnamed new species from Wake Atoll, formerly G. hirsutum [52]; and TX-2263, recently confirmed as G. ekmanianum Wittmack [8] rather than a wild type of G. hirsutum. Within G. hirsutum, the wild and improved samples tended to separate along the horizontal axis with greater dispersion noted within the wild group. The improved cultivars generally separated into two clusters along the vertical axis. We were unable to determine a specific cause for this separation; however, it seems that it may be due to background in cultivars coming from the G. hirsutum genotype 'TM-1' [53]. This genotype is widely recognized as the genetic standard for G. hirsutum and thus has been included in a majority of genetic studies, including those for SNP discovery. Independent analysis of cultivars from the United States and other countries revealed a general mixing and overlapping of these two groups (Fig. 4a). Analysis of the 185 samples from United States breeding programs showed a tendency to stratify by breeding region along the horizontal axis; however, no distinct clusters were observed (Fig. 4b).
Further analyses of G. hirsutum diversity structure were run using fastSTRUCTURE software for three data sets: a) all G. hirsutum, b) only improved, and c) only wild or landrace samples. Only one of these analyses was able to determine a value of K that significantly clustered the data into more than one group. This occurred for the wild group where K = 7 was found to be significant, however there did not appear to be a known classification of samples to explain the observed clusters. The broad clustering of the improved samples into two clusters (indicated by MDS) was not detected by fastSTRUCTURE. The groupings suggested at higher K values did not correspond to any prior classifications and was generally consistent with the homogeneity of variation observed between countries in the MDS analyses. While it was not significant in explaining the genetic structure, the  split between wild and improved groups was obvious at K = 2, in an analysis of all G. hirsutum (Fig. 5). Most wild samples did show evidence of varying levels of admixture with cultivated material, as expected since all cultivated material was fundamentally derived from wild germplasm.

Phenotypic evaluation of genetically identical lines
In phenotypic comparisons of accession pairs determined to be highly similar (IBS > 0.98), only 7 of 26 morphological descriptors were found to differ within pairs. The descriptor ratings for leaf and boll nectary distributions differed most frequently among these paired accessions, followed closely by differences in ratings for quantity of stem hair (Table 4). One pair of accessions ('MD51ne' and 'Siokra 104-90') differed for five descriptors even though this pair had perfect genetic similarity (IBS = 1.000). There was no common breeding history to support identical genotypes between these two accessions. The plants used for genotyping and phenotyping grew from seeds obtained from the NCGC, but different plants were used for each evaluation; hence, there was opportunity for a mistake in sampling and/or handling that inadvertently may have led to these observations. Six of the remaining seven pairs differed in one to three descriptors. Only one pair ('Deltapine 66' and 'Paymaster 54') did not have phenotypic differences as indicated by descriptors. Among the highly similar accession pairs, this pair had one of the lowest tested pairwise IBS values (IBS = 0.981).

Comparison between SNP and SSR markers in cotton
The planned commonality of 192 accessions between the current SNP investigation and a previous SSR investigation of diversity allows for defined comparisons to be made. Within this dataset, 26,324 SNPs and 103 SSRs (representing 748 alleles) were polymorphic. A key difference between the two studies is that a DNA bulk of 10 plants was used to obtain SSR data from individual accessions, whereas DNA from a single individual was used in the current study to obtain SNP data. One expected difference due to sampling technique was that SSR data should reveal larger genetic distances among heterogeneous accessions (generally the wild accessions) than one would expect to see in the improved accessions, which are more homogeneous. We compared the genetic relationships among these accessions for the two marker types using NTSYS. The PCoA plots revealed an obvious separation of improved from wild accessions along the horizontal axis for both marker systems ( Fig. 6a  and b). The relationship of improved G. hirsutum accessions from the United States and other countries was non-discriminatory for both marker types (Fig. 6c and  d). When SNP and SSR data were combined (data not shown), the distribution of points in the PCoA plots closely resembled the distribution obtained when using solely SNP data. The Mantel r statistic of 0.798 indicated that there was relatively strong positive correlation between the SNP and SSR Jaccard similarity matrices for the G. hirsutum accessions (Fig. 7a). When comparing the matrices for only the improved accessions, the correlation was slightly lower, with a Mantel r statistic of 0.509 (Fig. 7b). Therefore, similarities among improved accessions calculated using SNPs were positively, but weakly related to similarity based on SSRs.
The array structure affected the calculated relationships among wild and cultivated accessions observed in the SNP marker system in a manner different from the SSRs. In Fig. 7a, the wild accessions (green) were skewed towards higher values based on SNP genetic similarity on the x-axis compared to the y-axis for the SSRs. This effect is likely due to the fact that a large number of markers were included on the array to produce an acceptable number of polymorphisms between any two improved G. hirsutum accessions. While a relatively smaller number of SNPs derived from wild accessions were also included on the array, thus the "true" similarity with these individuals was skewed based on the source of the SNPs chosen for the array. Due to the ability to detect unique novel differences, particularly between the wild accessions, the SSR data set was less skewed and behaved more "realistically" or closer to "true". While the SSRs and SNP array produce different similarity values, the technologies both provide methods for determining relationships between individuals.

Genome-wide association analysis for three cottonseed nutritional traits
Three traits were analyzed for the set of 195 accessions common with Hinze et al. [10] using the Genome-wide Efficient Mixed Model Association (GEMMA) software for detecting QTL. The seed index or weight in grams per 100 seeds was found to have the highest phenotypic variance explained (PVE) with 0.954 (i.e., 95.4% explained), while seed oil content had a moderate PVE of 0.628 and protein content had the lowest PVE of 0.230. Moreover, the estimated standard error for seed index PVE was very low (0.035), whereas the standard errors for both protein and oil content PVE were considerably higher -0.117 and 0.149, respectively. All three statistical tests available for analyses in GEMMA were utilized, including the Wald test, likelihood ratio test (LRT) and the score test for each analysis. Corrected pvalues were analyzed versus expected ratios for each phenotype, and it was observed that the relatedness calculated using genetic similarities was efficient in correcting for population structure (Fig. 8a, b and c). Corrected p-values for the three statistics were analyzed using a false discovery rate of 0.05. Whereas analysis of seed oil content and seed index did not produce any significant loci among the 26,099 SNPs, the analysis of seed protein content identified four significant loci under the Wald test and FDR of 0.05 with p-value ≤ 3.7e-06 (Table 5): i28873Gh, i34975Gh, i20295Gh, and i22490Gh. Two of the loci, i28873Gh and i34975Gh, remained statistically significant based on a more stringent correction of multiple testing with Bonferroni's adjustment and were also significant with the LRT and score test statistics with a more relaxed FDR of 0.15 (p-value ≤ 1.6e-05). These two most highly significant loci were both derived from BAC-associated SNPs [38]. The marker i28873Gh was originally named GH_TBb059I03r236, and i34975Gh was named GH_TBb119J20f639. Marker i28873Gh was mapped to linkage group "Chr 02" (Chromosome 02) in the high-density intraspecific genetic map [38], and the sequences of both markers aligned to a 7.80-7.82 Mb region of the "A02" scaffold (Chromosome 02) on the NBI G. hirsutum draft genome  sequence [50]. There are two candidate genes located near these two markers on the genome sequence: 1) Gh_A02G0521, which is an mRNA for the protein phosphatase 4 regulatory subunit 3 which is a suppressor of MEK (SMEK PPP4R3) and 2) Gh_A02G0522, an uncharacterized mRNA. The other two markers that were identified are i20295Gh which is a gene-associated SNP identified on G. raimondii scaffold_13_30556525 which does not align to any other Gossypium draft sequence [35], and i22490Gh which was identified in a genotyping-by-sequencing analysis by USDA (D.D. Fang) as CFB2569 and aligns to 35.25 Mb region on G. hirsutum D01 (Chromosome 15) [54]. The closest gene nearby this region on D01 is 14Kb away and named Gh_D01G1288. This gene is an mRNA for an unnamed coil family protein that has Myb DNA binding characteristics with a Myb_SANT-line domain, related to Arabidopsis thaliana protein AT2G24960.2.

Discussion
The development of next generation sequencing technologies and the resulting detection of thousands of markers, primarily SNPs, have increased the power available to analyze diversity in germplasm collections [21,55,56], and to use this diversity to make advances in plant breeding programs [57][58][59][60][61]. We have surveyed variation across Gossypium species utilizing the recently developed CottonSNP63K array and a large panel of 390 Gossypium samples comprising 292 improved cultivars of G. hirsutum and small sets of non-improved G. hirsutum and other Gossypium species. This data allowed us to appraise the utility of the array for evaluating genetic diversity and population structure, and to conduct SNPbased characterization and analyses on a significant portion of the USDA National Cotton Germplasm Collection. The observed patterns of diversity among Gossypium species and within G. hirsutum agreed with expectations based on previous studies with molecular markers [4,10]. The markers distinctly separated G. hirsutum from other species and distinguished the wild from improved types of G. hirsutum. Within improved types of G. hirsutum, the observed clustering may be a sign of slight ascertainment bias as has been noted in other SNP studies [62,63] with samples similar to TM-1 being localized towards the bottom of the two clusters of improved G. hirsutum. We found the array to be effective for grouping germplasm using PCoA and STRUCTURE analyses.
We identified a subset of 192 improved and wild cotton accessions that were run in a previous SSR analysis and on the CottonSNP63K array, and a comparison of grouping and discrimination revealed generally similar patterns for the SSR-versus SNP-based distributions. Moreover, combining the SSR data with the SNP dataset resulted in plots without marked changes, indicating that either technology is sufficient for determining relationships between samples. It is nonetheless noteworthy that the CottonSNP63K array and cluster file were primarily designed to discriminate among cultivated types of G. hirsutum and between cultivated G. hirsutum types versus non-cultivated germplasm, including several diploid and tetraploid species [38]. The ability to discriminate among non-cultivated types was not an emphasized criterion in selecting which SNPs to place in the array, but the array seems to function reasonably well for that purpose, too. The findings here indicate those objectives were achieved, in that the array and associated cluster file efficiently discerns differences among G. hirsutum cultivars, and between improved genotypes from wild forms of G. hirsutum and other species. Thus, the SNP array presents a facile means of identifying several thousand or more SNPs between most pairwise combinations of G. hirsutum genotypes. Given that the intraspecific linkage maps average about 3,500 cM [38], the expected Information is provided on p-values associated with three significance tests, as well as known chromosomal locations in three available Gossypium genome sequences density of markers for a given pair of parents would typically exceed one SNP per cM. SNPs have quickly become the preferred marker system for measuring genetic diversity in plants due to the potential to identify a substantial number of SNPs throughout the plant genome and therefore have greater power when associating that information with phenotypic traits of interest. Only recently have SNPs become utilized for genotyping by the cotton research community. To be successful, SNPs need to be dense and polymorphic within the population of interest. The SNPs on the array displayed polymorphism across the range of samples tested. Over 60% of the SNPs in this dataset were found to be polymorphic in the economically important G. hirsutum species. This suggests that this array will likely have high utility for discrimination and association studies within the primarily cultivated cotton species. Specifically, discrimination by the array is best for US and Australian cultivars as they are the source of the SNPs used in development of the array. In the present study, we have shown that the SNP genotype information obtained using CottonSNP63K array has enabled GWAS to identify loci associated with a seed trait. In our analysis of cottonseed nutritional traits, four SNPs were found significantly associated with protein content, with two located in Chromosome 2, a member of the A t subgenome. In a recent association mapping study of seed oil and protein in cotton, 228 SSR markers were used to genotype 180 elite G. hirsutum cultivars [64]. The SSR-based GWAS detected 12 markers across 9 chromosomes associated with seed protein. Interestingly, chromosome 2 that was identified with multiple individual markers in our SNPbased study was not significant in the SSR study. This may be due to much fewer SSRs tested and the corresponding low marker density relative to the number of effective SNP loci, particularly on chromosome 2. Of the eight SSR loci examined on chromosome 2 by Liu et al. [64], none co-localized with our SNPs based on alignment positions to currently available cotton genome sequences.
While multiple markers with significant associations were identified with seed protein content, no significant associations were found for seed oil content or seed index. Detection of significant marker associations is dependent upon statistical power to detect linkage disequilibrium of the marker with the causative allele for the phenotype. Statistical power in GWAS can be influenced by many factors including trait architecture and effect size, number of samples in the population, number of markers, distribution of markers, statistical method utilized, etc. [65,66]. It may be that the effect sizes of these traits are extremely small implying a multi-genic effect which combined with a relatively small number of samples for a standard GWAS analysis would cause the inability to detect significant associations with the seed index and oil content phenotypes. The differences in trait architecture of the phenotypes is observed in the coefficient of variations (CV) and showed that the trait architecture of seed index with CV of 20.9% was very different from that of oil content (8.8%) and protein content (9.1%). This large amount of phenotypic variability in the seed index trait of the sampled population could have possibly caused QTL determination of the loci responsible for the trait to be difficult. The variability in the current study likely resulted because the data were not from a single experiment in which all of the 195 accessions were included. Rather, the data were obtained from seed harvested from a number of different seed increases grown in different locations and years [11], as in common in large germplasm collections. A quantitative trait defined in a range of environments is likely to produce different accession values that are confounded by these non-genetic effects, and it is these accession means that were used in the GWAS experiment. Inheritance studies have shown that protein and oil are both influenced by the environment, with protein being more susceptible to these changes [64,67,68]. Detection of significant associations with GWAS studies has generally been difficult unless a large number of samples are utilized for lowly heritable traits, as those traits that are determined by a large number of very small effect loci are very difficult to identify. Nonetheless, the utilization of standardized genotyping methods, such as possible with the array, allow for easily adding samples to the population and reanalyzing with the additional data to assist in generating additional power for detection in the future.
In addition to the GWAS performed here, an additional analysis using the CottonSNP63K array, followed by a "targeted" approach of a specific set of SNPs rather than genome-wide SNPs was recently published. Zhu et al. [69] identified a gene on chromosome 15 of the D t subgenome that affects leaf shape in cotton; the findings were supported by similarity to genes regulating leaf morphology in other plant species. A separate study targeting SNPs for genetic male sterility (ms) in cotton [70] successfully linked a SNP to the recessive gene ms5 on chromosome 12 (A12) and ms6 on chromosome 26 (D12). Similarly, Ellis et al. [71] used the array to identify markers linked to a viral resistance gene for cotton bunchy top disease. This targeted SNP association analysis identified nearly as good an interval spanning the Cbd resistance locus as was obtained from screening an F 2:3 population. These results demonstrate that genotyping and GWAS can be effectively conducted with SNPs in cotton utilizing the array genotyping platform. As with many quantitative traits, statistical power necessary for identification of all loci responsible for a trait needs to be high to detect the small effect of numerous loci. In instances where a large proportion of the overall phenotype is explained by large numbers of loci, genomic selection utilizing the complete genotypes obtained by the array may provide a promising future avenue of research. An additional option would be to increase power by adding additional samples, as has been done with human and other crop analyses. For example, extensive application of an analogous array to the USDA Soybean Germplasm Collection enabled synthesis of haplotype block maps that are expected to lead to the discovery of many SNP-trait associations for economically important characteristics [72]. The consistency and reliability of genotyping with the array is likely to prove valuable in regard to accumulating high-quality genotypic data sets to analyze these kinds of genetic traits, with data potentially being accumulated across multiple labs, large distances and time periods.
We assessed the utility of the CottonSNP63K array data for distinguishing among G. hirsutum cultivars by comparing representative cultivars from the NCGC and worldwide cotton researchers. Because breeders often use a limited range of material, assessing the relatedness of cultivars can assist with selecting more distantly related lines to maximize variability in breeding programs. Recent SNP-based reports show a lower diversity in cotton relative to other plant species. A diversity study in soybeans (Glycine spp.) showed Chinese landraces of G. max L. (H E = 0.338) with a slightly higher diversity than inbreds (H E = 0.313) [73]. Though still showing higher diversity than cotton, an Italian grape (Vitis spp.) germplasm collection had higher diversity in the primary cultivated species of V. vinifera L. ssp. sativa (H E = 0.345) than the wild form, V. vinifera L. ssp. sylvestris (H E = 0.266) [74]. Our current evaluation with SNPs and previous evaluations with SSRs [10,25] have revealed differences between individual cultivars with limited differences among groups of cultivars from the United States or from other countries. To maximize genetic diversity estimated with the SNP data presented here, a breeder has multiple options. A breeder could choose parents from those analyzed within this manuscript based on the genome-wide genetic similarity values calculated with all markers (Additional file 5) or could choose two cultivars on opposite ends of either the x-or y-axis shown in the MDS plots ( Fig. 4a and b; Additional file 2). If a breeder was interested in genotypes not assayed here but could assign them to a breeding region as defined here, the average expected polymorphisms for a given cross could be estimated based on the category averages. For example, within the US, if germplasm from the eastern region were crossed to germplasm from the plains region, a breeder could expect to observe ca. 5,200 polymorphic SNPs between the two samples (Table 2). Breeders interested in utilizing parents developed within the US, could benefit by using cultivars from different regions in a breeding program, as they are generally less similar than cultivars from within the same region. Of particular interest are the US cultivars that could not be assigned to a specific breeding region based on regional location of breeding programs or pedigree (also includes cultivars with mixed origins). Increased heterozygosity/gene diversity and unique alleles were observed among cultivars in this group compared to cultivars with clear origin because this group was comprised of many unique genotypes. The knowledge provided by the screening of germplasm materials will allow for intelligent design of breeding populations to incorporate desired levels of known diversity particularly for targeting or pyramiding regions of interest that may provide multiple modes of action for desirable traits in the future. With continuing advances in SNP analysis technologies, the diversity data regarding specific pairs or groups of parents could also be used in concert with genetic maps to identify and select ad hoc subsets of markers to deploy for analysis and/or selection of specific families or populations. Many breeders thus might consider running the CottonSNP63K array against their breeding lines to assess their germplasm relative to other germplasm of interest, and select marker subset(s) that will prospectively maintain effectiveness but reduce genotyping costs.
While the current study is one of the first looking primarily at US germplasm, SNPs have been tested to determine their suitability in the examination of distinctness, uniformity and stability (DUS) for cotton varieties in China [75]. DUS examination typically involves a grow-out test with measurement of a set of phenotypic observations; however, the more rapid results provided by molecular markers have been allowed under the International Union for the Protection of New Varieties of Plants (UPOV) system of plant variety protection in specific situations [76]. Twenty-three core SNP markers were identified by Kuang et al. [75] that clustered 30 standard cotton cultivars into groups consistent with known pedigrees. A similar effort was undertaken to verify marker-based distinctness for registering alfalfa cultivars [77]. They compared 2,902 SNP markers with 11 morpho-physiological measures and 41 polymorphic SSRs on a set of 11 alfalfa landrace cultivars. Even though the authors identified inconsistencies for SSR and SNP markers with morpho-physiological traits, they assert that marker-based assays have high potential to substitute for, or complement, morphological distinctness measures in alfalfa cultivars. Many of these are the same issues that arise when officially registering cultivars of any crop and are applicable to identifying accessions in germplasm collections.
The integration of genomic data into germplasm collection documentation systems and its combination with taxonomic and phenotypic data will impact how these genetic resources are conserved as well as how they are used by plant breeders. In the current study, eight pairs of genotypes with high IBS values (putative duplicates based on genotype) were grown in the field and phenotypic descriptors were measured. Results of genotype and phenotype comparisons were mixed. In some cases, cultivars with a highly similar DNA profile for this set of SNP markers also had identical phenotypes, while in other cases, cultivars with a highly similar DNA profile had multiple phenotypic differences. The observed phenotypic differences are too strong to be due to natural environmental effects. It must be noted that unusual observations may be due to the possibility of human errors in any assessments, whether they be based on genotype or phenotype. Therefore, a researcher desiring to use a similar approach to remove highly similar individuals from their lines may wish to verify a small number of markers on independent DNA isolations with simplex SNP marker assays in the lab to ensure that a mistake was not made in sampling prior to more extensive field-based assessment. Based on our data at this time, we assert that any potential duplicates or redundancies within a collection cannot be ascertained solely by genotype or phenotype profiles, but that SNP and/or SSR data can bring greater power for identification of redundancy, solely due to the number of "markers or traits" one can look at. Often with phenotypic characterization, one is only working with 50-75 traits at most. SNP markers may be more appropriate to distinguish among closely related individuals while SSR markers can easily estimate diversity among less related germplasm. When comparing SSR and SNP markers in cotton cultivars, the study of Kuang et al. [75] found that SSR markers tend to correlate cultivars with their geographic origin while SNP markers more consistently correlate cultivars based on kinship or pedigree; however, no rationale was provided to explain this difference. Transcriptome profiles of lines could also potentially be different without causing an easily detectable phenotype in the field and the causal allele in most cases may not be fully represented on the marker profiles generated with the array, so a researcher may also wish to study the transcriptomic profiles of lines in the future prior to concluding lines are redundant. Until more thorough studies are available, a complement of genetic and phenotypic data should be analyzed prior to declaring two accessions as genetically distinct or duplicates.

Conclusions
With increasing SNP discovery projects and the development of the CottonSNP63K array to assay thousands of SNPs, it is anticipated that SNP markers will play an increasingly important role in cotton genetics, germplasm conservation, and breeding applications. In this study, we provide a large genome-wide variation data set for primarily cultivated cotton. Thousands of SNPs in representative cotton genotypes provide an opportunity to finely discriminate among cultivated cotton from around the world. The SNPs will be useful as dense markers of genome variation for association mapping approaches aimed at correlating molecular polymorphisms with variation in phenotypic traits, as well as for molecular breeding approaches in cotton.

Additional files
Additional file 1: Results and discussion in relation to the removal of admixed/misclassified samples. Five samples were removed from overall SNP diversity analysis, and three samples were removed from the comparison of SNP and SSR data. This file presents the original results and discusses why the samples were subsequently removed. For the original diversity analysis, this file includes the MDS figures, Venn diagrams of unique and shared SNPs, and distribution of pairwise IBS values when these samples are included. Likewise, for the comparison of SNP and SSR data, the original principal coordinate analyses and the plots comparing SNP-and SSR-based genetic similarity are shown. (DOCX 325 kb) Additional file 2: List of Gossypium samples genotyped for SNP diversity analyses. The table includes information for seed source, improved/wild classification, assigned categories for improved breeding regions, assigned categories for wild race types, availability of SSR genotypes for SNP-SSR comparison, and values for seed oil, protein, and seed index for samples used in GWAS analysis. In addition, for each sample, the coordinates of the first two MDS dimensions are listed for Figs. 3 and 4a, b. Five samples are included here that were identified as outliers and not used in the SNP analysis. Siokra 104-90 was not a cultivar developed in Australia (I. Wilson, personal communication). It may be a mislabeling of Siokra 1-4/649 which is actually just Siokra 1-4. (XLSX 97 kb) Additional file 3: Individual matrices with numbers of different and identical SNPs for entries with the same name. These values were extracted directly from Additional file 5. Values above the diagonal represent the count of homozygous differences between pairs of samples. Values below the diagonal represent the number of identical SNPs genotyped between pairs. Heterozygous SNPs within samples were not counted in the number of differences between samples. SNP numbers were determined using all SNPs on the CottonSNP63K array.