Assessment of heterosis based on parental genetic distance estimated with SSR and SNP markers in upland cotton (Gossypium hirsutum L.)

Heterosis has been extensively utilized in different crops and made a significant contribution to global food security. Genetic distance (GD) is one of the valuable criteria for selecting parents in hybrid breeding. The objectives of this study were to estimate the GD between parents using both simple sequence repeat (SSR) markers and single nucleotide polymorphism (SNP) markers and to investigate the efficiency of the prediction of hybrid performance based on GD. The experiment comprised of four male parents, 282 female parents and 1128 F1, derived from NCII mating scheme. The hybrids, their parents and two check cultivars were evaluated for two years. Performance of F1, mid-parent heterosis (MPH), and best parent heterosis (BPH) were evaluated for ten agronomic and fiber quality traits, including plant height, boll weight, boll number, lint percentage, fiber length, fiber strength, fiber uniformity, fiber elongation ratio, micronaire, and spinning consistent index. Heterosis was observed in all hybrids and, the traits like plant height, boll number, boll weight and lint percentage exhibited higher heterosis than the fiber quality traits. Correlations were significant between parental and F1 performances. The F1 performances between three hybrid sets (Elite×Elite, Exotic×Elite, and Historic×Elite) showed significant differences in eight traits, including boll number, lint percentage, fiber length, fiber strength, fiber uniformity, fiber elongation ratio, micronaire, and spinning consistent index. The correlation of the GD assessed by both SSR and SNP markers was significantly positive. The cluster analysis based on GD results estimated using SNP showed that all the female parents divided into five groups and the F1 performance between these five groups showed significant differences in four traits, including lint percentage, micronaire, fiber strength, and fiber elongation ratio. The correlation between GD and F1 performance, MPH and BPH were significant for lint percentage and micronaire. Our results suggested that GD between parents could be helpful in heterosis prediction for certain traits. This study reveals that molecular marker analysis can serve as a basis for assigning germplasm into heterotic groups and to provide guidelines for parental selection in hybrid cotton breeding.


Background
Cotton is the most important natural fiber crop in the world and one of the cultivated allotetraploid Upland cotton (Gossypium hirsutum L.) fulfills about 95% of the output of global cotton production [1]. Heterosis or hybrid vigor is used to describe the phenomenon that the F 1 hybrids present superior performance than parents [2]. Utilization of heterosis in cotton has significantly contributed to the yield and fiber quality [3]. The development of hybrid cotton involves the proper selection of parents and the identification of superior heterotic combinations. Screening a large number of parental lines and selecting appropriate parents for crossing and evaluating them in multiple locations is laborious, costly, and timeconsuming. Various methods have been used to predict the hybrid performance depending on the types of hybrids (single cross or three-way cross) and traits which including parental performance, mid-parent value and the general combining ability [4][5][6][7].
With the aim of saving resources, the genetic distance (GD) inferred from molecular markers has been suggested as a promising tool for hybrid performance prediction and recognition of heterotic groups [8][9][10]. Recently, several reports concerning maize, rice, wheat have suggested the possibility of using the molecular markers, such as simple sequence repeat (SSR) and single nucleotide polymorphism (SNP), to select parental materials for heterosis crosses [6,[11][12][13]. According to these literatures, there is a regression of either hybrid performance or heterosis with increasing molecular genetic distance. These studies showed the potential of GD in the prediction of hybrid performance for important traits.
Several studies in cotton have used molecular markers such as restriction fragment length polymorphism (RFLP), randomly amplified polymorphic DNA (RAPD), or SSR to estimate GD among parents and use their values to predict the hybrid performance, heterosis or specific combining ability (SCA) [14][15][16]. But these studies were based on a rather small set of parental lines and the marker density was very low. Because the cotton genome has tremendously redundant sequences, therefore the assessment of cotton GD requires high-density molecular markers.
The present study used 286 Upland cotton accessions to construct 1128 hybrids according to North Carolina (NC II) mating design and investigated ten agronomic and fiber quality traits and heterosis. We used both SSR and SNP markers to estimate the GD between parents. We further analyzed the relationship between GD and heterosis, and assessed the feasibility of the use of SSR and SNP based genetic distances in predicting the hybrid performance and heterosis.

Genetic distance and clustering analysis for the population
In this study, both SSR and SNP markers were used to investigate the genetic distance (GD) between parents. A total of 198 polymorphic SSR markers were distributed on 26 chromosomes. There were 557 polymorphic alleles in 286 parents ranged from one to ten alleles per marker with an average of 2.81. For the SNP markers, with a missing rate greater than 30% and minor allele frequency (MAF) less than 5% were eliminated and a total of 76,654 SNPs were obtained. These SNPs distributed on 26 chromosomes and varied in density at different chromosomes and locations (Fig. S1).
The GD between the parents calculated based on SSR markers showed that the GD between four male parents (Zhong7886, A971, 4133, and SGK9708) and 282 female parents varied from 0.139 to 0.387, with an average of 0.279 (Table 1, Table S1). The F 1 population which crossed from four male parents was named as population A (Zhong7886), C (A971), D (4133), and E (SGK9708) according to their male parents. The mean value of GD assessed by SSR markers in each F 1 populations was E > C > D > A. The GD between parents based on SNP markers showed that the GD varied from 0.137 to 0.375, with an average of 0.242 (Table 1, Table S1). The mean value of GD assessed by SNP markers in each F 1 populations was A > E > D > C. The correlation of the GD assessed by SSR and SNP markers was significantly positive (0.264 ≤ r ≤ 0.375, P < 0.01). Furthermore, 1128 F 1 hybrids clustered into five groups based on GD assessed through SNP markers and named as group I, II, III, IV and V, having 144, 176, 304, 224 and 280 F 1 , respectively ( Fig. 1). From the clustering results by SSR, all the F 1 hybrids could be clustered into three groups, which contained 536, 468, and 124 F 1 hybrids and names as group 1, 2 and 3, respectively (Fig. S2). But the clustering results by SSR was not perfectly match the clustering results by SNP. Although we could find that Group 1 in SSR clustering result included the majority crosses which clustered as Group I and Group III by SNP, Group 2 in SSR clustering result was consisted by crosses which clustered as Group III, Group IV, and Elite×Elite hybrids showed significant lower GD than the other two hybrids sets (Fig. S3). Furthermore, we evaluated the F 1 performance of the Elite×Elite, Exotic×Elite, and Historic×Elite hybrids and made comparisons with parent performances, and the result showed that all the F 1 hybrid performance were significantly higher than parents in all the nine traits except of fiber strength (Fig. 2  Historic×Elite hybrids. However, no significant differences were observed for plant height (PH) and boll weight (BW) between these three hybrid sets. From the above clustering result by SNP, we concluded that all 1128 hybrids could be divided into five groups according to the GD, therefore we compared the F 1 hybrid performance of the each group and parents (Fig. 3). Firstly, seven traits showed significantly higher values in both five F 1 groups than parents except of FL, FE, and FS. Secondly, Group II, IV, and V showed significantly higher LP than group I and III while Group II showed significantly MIC than group I and III. Furthermore, the mean values of group II, III, IV, and V for FL and FE were significantly higher than parent except of group I. For FS, there was no difference between all the F 1 hybrids with parents. Finally, there was no significant differences among each F 1 groups for SCI, BW, BN, FU, and PH. All these results demonstrated that different groups showed varied performances for concerning trait.

Heterosis performance of F 1 hybrids
We compared the mid-parent heterosis (MPH) and bestparent heterosis (BPH) of ten traits in 1128 F 1 hybrids and the results showed that the MPH values ranged from − 18.2 to 75.9%, whereas the BPH values varied from − 31.4 to 47.7%. The mean values of MPH of the ten traits ranged from 0.09 to 14.18%, with an average of 4.36%, and the mean values of BPH ranged from − 4.85 to 3.30%, with an average of − 0.86%. Generally, the mean BPH values were lower than the MPH values for all traits, and approximately 80.9 and 41.6% of the crosses had positive MPH and BPH, respectively (Fig. 4). Among the different F 1 populations, F 1 population derived from the male  Correlation between parent performance, F 1 performance and heterosis The correlation analysis between the performance of parents and the hybrid performance was studied to investigate the effect of the parents on the performance of the hybrids. The result showed that the correlation between parents and F 1 s performance was significantly positive (ranged from 0.459 to 0.843) in the ten traits except BW and BN. Therefore, this result suggested that genetic control of these traits was under additive genes, and the performance of parents can be used to predict the hybrid performance of these eight traits except for BW and BN (Fig. 5).
The performance of parents showed significant negative correlation with MPH of PH, BN, MIC, and FU (ranged from − 0.127 to − 0.670) in all four populations. For BW, FE and SCI, the correlation between parent performance and MPH values showed significant negative association only in population A and E. While, FL showed significant negative correlation between parents and MPH in population D. For LP, significant negative correlation was observed between parents and MPH in population A and D, but showed significant positive correlation in population E. There was no significant correlation observed between parent performance and MPH for FS (Fig. 5).
The correlation statistics between parent performance and BPH showed that only the correlations for MIC (0.184) were significantly positive in all the four populations, but for FS and SCI, the correlations were significantly negative in all the four populations, ranged from − 0.260 to − 0.589. For LP, the correlation between parents and BPH showed significant positive correlation in population C, D and E. For FU, parents and BPH showed significant negative correlation in group A, C and D. For FE, the correlation between parents and BPH showed significant positive correlation only in group A. The correlation for FL between parents and BPH showed significant negative correlation only in group D. While, PH, BW and BN have both Positive and negative correlations in the four populations (Fig. 5).

Correlation between genetic distance and F 1 performance
To understand the effect of genetic distance of the parents on the level of heterosis in hybrids, the correlations between genetic distance and the F 1 performance, MPH, and BPH were calculated.
Based on the correlation between the GD of SSR markers and F 1 performance, the GD SSR was negatively correlated with BW, LP, BN, FL, MIC, and FU in at least one F 1 population, but not significantly correlated with PH, FE and SCI (Table 2). However, GD SSR was positively correlated with FS in the D population. Based on the correlation between the GD of SNP markers and F 1 s performance, GD SNP was negatively correlated with LP, BN, FL, MIC and FE in at least one F 1 population but not significantly correlated with other traits like PH, BW and FS (Table 2). However, GD SNP was only positively correlated with SCI in the C population.
Overall, most of the traits were negatively correlated with GD SSR and GD SNP , and only two traits (FS and SCI) were positively correlated with GD SSR and GD SNP in only one population. Furthermore, GD SNP had more effective power than GD SSR .

Relationship between genetic distance and MPH
The correlation between GD of SSR markers and MPH showed that GD SSR was negatively correlated with FL, FS, MIC, FU, and SCI in population E, but positively correlated with MPH for PH and BW in population E and D, respectively ( Table 3). The correlation results between GD of SNP markers and MPH showed that the GD SNP was positively correlated with the MPH of PH, BN, FS and FU in only one population and positively correlated with BW and SCI in two populations (Table 3). For the MPH of LP, the correlation was positive in the D population but negative in population E.
In summary, the overall analysis results of the correlation between GD SSR and GD SNP in the four populations was inconsistent, and the correlation of group E was stronger than that of other groups.

Relationship between genetic distance and BPH
The correlation results between GD of SSR markers and BPH showed that the GD SSR was negatively correlated with the BPH of LP, FL, FS, MIC, FU, and SCI but positively correlated with the BPH of PH (Table 4). From the correlation results between GD of SNP markers and BPH, we observed that the GD SNP was negatively correlated with the BPH of LP, BN, FL, MIC, and FE, and positively correlated with the BPH of PH and BW (Table 4). In summary, the overall analysis results of the correlation between GD SSR , GD SNP and the BPH of ten traits were consistent. The overall results were consistent with the correlation trends of F 1 s performance, but the correlation was weak.

Genetic distance between parents assessed by SSR and SNP markers
With the rapid development and spread of molecular marker technology, these molecular markers have been used widely in analyses of GD, genetic diversity, population structure, genetic mapping, and linkage mapping. Earlier at the end of the twentieth century, some studies have used RFLP and SSR markers to study the relationship between GD and heterosis, and proposed that the relationship between GD and heterosis could be predicted by genetic differences [17]. Subsequently, a number of studies used RAPD [18], AFLP [19,20], SSR [21][22][23], EST-SSR [24,25], insertion-deletion (InDel) [11] and SNP markers [13,[26][27][28] to study the relationship between GD and heterosis. Previous studies used different molecular marker types and those results were also different, but the GD was not compared. SSR markers amplify products of different lengths according to the different number of tandem repeats in the core sequences of different materials to obtain the different genotypes of the population. The tandem repeats are mainly distributed in the non-coding region. SNP markers represent the whole genomic information of target species. Compared to traditional SSR markers, SNP markers have good genome-wide coverage. In this study, both SSR and SNP markers were used to study the GD between parents. There was a significant positive correlation between these two GDs (r > 0.264, P < 0.05) and we found that the SNP marker was more  Table 3 Correlation coefficients (r) of genetic distance with mid-parent heterosis (MPH) of yield and fiber quality-related traits based on SSR and SNP marker.  accurate and efficient than SSR marker to study the relationship between GD and heterosis.

Plant heterosis prediction based on genetic distance
In recent years, methods have been sought to allow initial selection of parents intended for heterosis crossing.
Previous studies attempted to analyze the relationship between GD and heterosis have resulted in different conclusions in various species, including wheat, sesame, rapeseed, cacao, eggplant, maize, and pearl millet. Few studies have used GD to estimate F 1 performance and heterosis for improving the breeding efficiency on cotton heterosis utilization. In this study, under the condition of grouping according to different male parents, the GD of the two molecular markers were significantly negatively correlated with F 1 s performance and BPH of LP and MIC. The correlation between GD and MPH of each trait was weak. The correlation between GD SNP and F 1 s performance, MPH and BPH was stronger than that of GD SSR . In addition, according to the clustering result by GD based on SNP, we found that all the F 1 s could be divided into five groups, and its average values of GD was 0.295, 0.287, 0.277, 0.275, and 0.261 for Group I, III, IV, V and II, respectively (Fig. 1). Meanwhile, we found that the lint percentage of Group I and III was significantly lower than Group IV, V and II (Fig.  3). Group II, which had the lowest GD, showed more bigger values in lint percentage, micronaire, fiber length, and fiber elongation ratio than Group I and III. All these results indicated that genetic distance between parents can be a valuable indicator for heterosis predication, especially for lint percentage micronaire, fiber length, and fiber elongation ratio and F 1 crosses clustered in Group II had more commercial values in hybrid cotton breeding.
Positive correlations between GD and heterosis were reported in maize, wheat, pearl millet, Brassica napus, Brassica oleracea, cacao, and rapeseed. In maize, the GD between parental components, as determined by the SNP and SilicoDArT markers was significantly correlated with the heterosis effect observed in the majority of the yield structure features, as well as the yield itself [12]. Nie et al. reported a significant correlation between GD and MPH of 1000-grain weight in wheat [13]. In pearl millet, moderate positive significant correlations were found between GD and MPH for grain yield (r = 0.37, p < 0.01) and BPH for grain yield (r = 0.33, p < 0.01), respectively [21]. Nikzad et al. found a positive correlation for the genetic distance of the inbred lines from the common Brassica napus parent with MPH for seed yield (r = 0.31) and hybrid yield (r = 0.26) [23]. Significant correlation was observed between GD and MPH of plant height, gross plant weight, net curd weight, leaf width, curd diameter and total marketable yield in Brassica oleracea [25]. In cacao, a significant positive correlation of 0.39 was found between GD and SCA for yield [26]. Studies in rapeseed showed that GD evaluated by total molecular markers (GD total ) had no correlation with heterosis but GD measured by favoring markers (GD favor ) significantly and positively correlated with the number of seeds per silique, thousand seed weight, seed yield per plant and seed yield per plot for high-check heterosis and sum of parental general combining ability [29].
However, some investigations also showed no or weak correlation between GD and heterosis in wheat, pearl millet, sesame, eggplant, and maize. Nie et al. observed weak associations between the GD based on SNP and MPH or BPH of spikelet number, harvested spikes and yield in Table 4 Correlation coefficients (r) of genetic distance with best-parent heterosis (BPH) of yield and fiber quality-related traits based on SSR and SNP marker. wheat [13]. Chen et al. found GD based on SSR markers poorly correlated with F 1 performance, MPH and SCA in wheat [30]. Gupta et al. found that the GD was not correlated with heterosis of grain yield in pearl millet [21]. Pandey et al. revealed a weak association of GD with F 1 performance in sesame [22]. In eggplant, GD assessed through SNPs showed a diminutive correlation with the hybrid means, heterosis, and SCA values [27]. In a previous study of maize lethal necrosis, a very low and negative correlation was observed between parental lines markerbased genetic distance and heterosis [31]. Betran et al. suggested that heterosis can be better predicted only when GD is smaller than a certain threshold [32]. Moreover, studies have suggested that the correlation is dependent on the investigated germplasm and GD calculation methods [33]. Previous studies showed that the efficiency of predicting heterosis by GD estimates was improved by selecting markers tightly linked to the QTL affecting heterosis of the target trait [34]. This suggested that higher heterosis was not from crosses between parents with largest GD, but mainly from those with intermediate GD. Significance of molecular marker-based GD in prediction of heterosis inevitably depends upon the methods used to calculate GD, molecular marker types, genome coverage of molecular marker, genome region of molecular marker, types of germplasm, breeding system, traits under consideration, and environmental conditions.

PH
In this study, the low and insignificant correlation in certain traits may be due to the inadequate genome coverage, lack of association between markers and traitcontrolling genes and epistasis among the quantitative trait loci.

Conclusions
In this study, we used both SSR and SNP markers to estimate the GD between parents and to investigate the efficiency of the prediction of hybrid performance based on GD. Our study found that all the female parents could be divided into five groups based on GD SNP cluster result and the F 1 performance between these five groups showed significant differences in LP, MIC, FS, and FE. Furthermore, the correlation between GD and F 1 performance, MPH and BPH were significant negative for lint percentage and micronaire. Overall, our results suggested that GD between parents could be helpful in heterosis prediction for LP and MIC and will be beneficial for heterotic group categorization and parental selection in hybrid cotton breeding.

Plant material
A total of 286 Upland cotton cultivars and lines were selected as parents in this study. All the accessions were collected from different ecological regions in China and from 13 different countries, represented a wide range of genetic backgrounds. The accessions used in this study were 136 elite cultivars, 103 historical cultivars (cultivated before 2000) and 47 exotic cultivars from 13 different countries. Among them, four elite cultivars (Zhong7886, A971, 4133, and SGK9708) with excellent comprehensive characters in China were selected as male parents. All the seeds were stored in the National Germplasm Mid-term Bank of the Institute of Cotton Research (ICR), the Chinese Academy of Agricultural Sciences (CAAS). The detailed information of 286 accessions is listed in Table S2.

Field trial
The field experiments were conducted at the Yellow River region and Yangtze River region during 2012-2013 growing season. 1128 F 1 s were divided into four groups (A, C, D, E) according to their male parents. All 286 parents, F 1 s, and three control cultivars (Lumianyan 28, Ruiza 816 and Ezamian 10) were planted at two different experimental sites in two years. The four groups of locations in the same year were all in the same cotton region of China (Yangtze River valley or Yellow River valley). The experiment was conducted in a randomized complete block design, plots consisted of three rows each, 8 m long with a row spacing of 0.25 m. The field management was carried out according to the routine operation of local field production.

Character investigation and data collection
A total of ten yield and fiber quality traits were collected from the middle row of each plot. One week after topping, plant height (PH, cm) was measured from the ground level to the tip in ten randomly selected plants. After attaining 70% of boll opening, ten mature bolls were randomly selected to investigate the boll number per plant (BN, No.) for each plot. After harvesting, boll weight (BW, g) and lint percentage (LP, %) were calculated by 30 bolls. Fiber quality traits including fiber length (FL, mm), fiber strength (FS, cN/tex), fiber length uniformity (FU, %), fiber elongation (FE, %), spinning consistency index (SCI, %) and micronaire (MIC) were determined by Cotton quality Supervision and Inspection Center of China Agriculture and Village Ministry (Anyang).

DNA extraction and genotyping
The fresh leaves of 286 parents were collected in the field, and the genomic DNA was extracted by CTAB method [35]. The concentration and purity of DNA were determined by Nano Drop2000 spectrophotometer, and the quality was determined by 1% agarose gel electrophoresis. A total of 198 polymorphic SSR markers were utilized from previous studies and listed in Table S3 [36].
The DNA concentration of qualified samples were adjusted to 100 ng/μL for restriction-site associated DNA sequencing (RAD-Seq) by Huada Gene Co., Ltd. (Shenzhen, China). The steps were as follows: (1) DNA digestion; (2) add bar-coded adapters; (3) DNA fragmentation; (4) DNA recovery and purification; (5) DNA amplification; (6) DNA recovery and purification; (7) sequenced on Illumina Hiseq 2000 system. The raw reads were aligned with G.hirsutum L. TM-1 reference genome v 1.1 (http://mascotton.njau.edu.cn/info/1 054/1118.htm) by BWA software and the parameters were set to mem-t8. SNP genotypic data were obtained by SNP Calling, with GATK and SAMTools packages [37,38]. The probability of the fragments mapped to the reference genome was 93.4-99.6%, the coverage on the genome was 0.07-7%, and the average sequencing depth was 1.48. The sequencing data had been deposited to NCBI under the accession number: PRJNA353524.

Evaluation of genetic distance and correlation analysis
The genetic distance (GD) of SSR markers between parents was determined according to Nei's et al. by Powermarker 3.25 [39,40]. The formula is GD SSR = 1-2N ab / (N a + N b ), where N ab represent the SSR marker numbers amplified in both sample a and b, and N a and N b represent amplified SSR marker numbers in sample a and b, respectively. The GD of SNP markers between parents were calculated by TASSEL 5.0 based on the identity-bystate (IBS) genetic distance as GD SNP = 1-IBS [41]. The average performance of MPH and BPH of ten yield and fiber quality related traits of 1128 F 1 s were analyzed by Graphpad prism 7.0. The packages ggplot2 and GGally in R software were used to analyze the correlation between GD of SSR and SNP marker and F 1 s performance, MPH, and BPH. Pearson's correlation coefficients (r) were used to analyze the correlation between parent traits and F 1 s performance, MPH, and BPH and tested at P = 0.05 and 0.01.