Genome wide association mapping and candidate gene analysis for hundred seed weight in soybean [Glycine max (L.) Merrill]

Background The hundred seed weight (HSW) is one of the yield components of soybean [Glycine max (L.) Merrill] and is especially critical for various soybean food types. In this study, a representative sample consisting of 185 accessions was selected from Northeast China and analysed in three tested environments to determine the quantitative trait nucleotide (QTN) of HSW through a genome-wide association study (GWAS). Result A total of 24,180 single nucleotide polymorphisms (SNPs) with minor allele frequencies greater than 0.2 and missing data less than 3% were utilized to estimate linkage disequilibrium (LD) levels in the tested association panel. Thirty-four association signals were identified as associated with HSW via GWAS. Among them, nineteen QTNs were novel, and another fifteen QTNs were overlapped or located near the genomic regions of known HSW QTL. A total of 237 genes, derived from 31 QTNs and located near peak SNPs from the three tested environments in 2015 and 2016, were considered candidate genes, were related to plant growth regulation, hormone metabolism, cell, RNA, protein metabolism, development, starch accumulation, secondary metabolism, signalling, and the TCA cycle, some of which have been found to participate in the regulation of HSW. A total of 106 SNPs from 16 candidate genes were significantly associated with HSW in soybean. Conclusions The identified loci with beneficial alleles and candidate genes might be valuable for the molecular network and MAS of HSW. Electronic supplementary material The online version of this article (10.1186/s12864-019-6009-2) contains supplementary material, which is available to authorized users.


Background
The seed weight (SW) of soybean (often denoted by hundred seed weight (HSW)) is an important yield component and positively correlates with seed yield [1,2]. HSW exhibits wider variation ranges [3], and the HSW of the modern elite cultivar (18-20 g) is approximately 6-7-fold greater than that of the wild soybean (3-4 g) [4]. HSW often determines the final utilization of soybean seed. Lower HSW cultivars are desirable for high quality soybean sprouts and natto production, whereas higher HSW cultivars perform well in tofu, edamame and miso production [5]. HSW also affects soybean seed germination viability and seed vigor [6]. As a typical quantitative trait, HSW is controlled by multiple genes with small or large genetic effects, especially additive effects [7,8] and the heritability range is relatively high (44-94%). HSW is significantly affected by environmental conditions including light, temperature, soil moisture and nutrient status [9][10][11] and geographical conditions such as altitude, latitude, longitude and associated climate [9,12,13]. Although HSW has been improved from 3.5 g of wild soybean to 18-20 g of cultivated soybean through traditional methods during long-term breeding [14], breeding soybean cultivars with reasonable and stable HSW through traditional selection methods remains difficult. The traditional selection method requires evaluation in multiple environments over several years and is expensive, time-consuming and labour-intensive.
Marker assisted selection (MAS) can increase the efficiency of the traditional selection method for HSW by improving the allele frequencies of desirable HSW quantitative trait loci (QTL) [15]. Presently, linkage analysis, based on special bi-parental mapping populations, is still extensively applied to dissect the genetic base (or QTL) of HSW. More than 200 QTL have been reported in the Soy-Base databank (www.soybase.org), which are distributed on 20 chromosomes (Chr, or linkage group (LG)) from more than 40 different genetic populations and 50 parental materials. Most of these identified QTL were found in F 2 or recombinant inbred line (RIL) populations [7,[16][17][18][19][20][21][22][23][24][25][26]. Except for these identified QTL in mature seeds, some studies analysed the dynamic QTL of HSW during the different developmental stages, which have increased the understanding of HSW QTL [19,27]. Only a few genes controlling HSW of soybean, have been cloned till date. A phosphatase 2C-1 (PP2C-1) gene was found to contribute to the increase of HSW in transgenic plants through a combination of wholegenome sequencing and an RIL population derived from a cross between a wild soybean ZYD7 and a cultivated soybean HN44 [28]. Among these identified QTL, most spanned fairly large genomic regions due to the relatively low density of molecular markers (http://www.soybase.org), which have a relatively low accuracy, limiting their application in MAS efforts.
Genome-wide association studies (GWAS) have more extensive recombination events and shorter linkage disequilibrium (LD) blocks due to high density of SNP markers used for mapping and wider phenotypic variation available in germplasm. Therefore, GWAS could significantly improve the resolution and accuracy of marker-phenotype associations compared with linkage analysis, based on bi-parental mapping populations. Presently, GWAS have begun to effectively analyze the genetic basis of HSW in soybean. Yan  In the present study, we performed a GWAS of soybean HSW in Northeast China based on 185 tested accessions and 24,180 SNPs. The aim of the present study is to identify QTLs associated with HSW and to screen candidate genes located in peak SNP regions.

Result
Distribution of hundred seed weight in the association panel The phenotypic value was evaluated in three locations for two years to analyse the variation distribution (Additional file 1). The association panel consisted of 185 accessions with HSW varying from 5.64 g to 34.8 g, and an average HSW of 19.60 g was calculated based on the mean observed across the three tested environments in 2015 and 2016 (Additional file 1). Coefficients of variation ranged from 18 to 21% for the different environments, and significant differences among the tested environments were not found for the association panel (Additional file 1). Normal distributions without any significant skewness and kurtosis were observed for the association panel in all tested environments ( Fig. 1 and Additional file 1).

Distribution of SNPs and genetic characteristics of the mapping population
A total of 24,180 SNPs distributed across all 20 chromosomes of the soybean genome with minor allele frequencies > 0.05 and missing data less than 3% were utilized to estimate LD levels in the 185 tested samples. These SNP markers spanned 947.89 Mbp, which accounted for approximately 86.17% of the entire soybean genome. The number of SNPs among the 20 chromosomes was not even, and large variations in the number of SNPs (from 676 in Chr.11 to 1774 in Chr.15) were observed among different chromosomes. The average marker density was approximately 1 SNP per 40.29 kb ( Table 1). The decay distance of the LD between markers was 215.74 kb (Fig. 2a).

Quantitative trait nucleotide (QTN) associated with hundred seed weight by GWAS
A compressed mixed linear model (CMLM) was used to identify association signals using the R package GAPIT. Principal component and kinship analyses of the association panel were evaluated using the whole set of SNPs, which were also considered in the CMLM model. The first three principle components (PCs) accounted for 12.48% of the total genetic variation (Fig. 2b-c). The evaluation of the variation of the first 20 PCs analysis revealed an inflection point at PC3 (Fig. 2b), suggesting that the first three PCs dominated the population structure on the association mapping. A lower genetic relatedness within the population was exhibited from the distribution of the pairwise relative kinship coefficients of the association panel among the 185 tested accessions (Fig. 2d). In the present study, thirty-four association signals were associated with HSW, which were distributed on fifteen chromosomes, including Chr.3, Chr.4, Chr.5, Chr.6, Chr.8, Chr.9, Chr.10, Chr.12, Chr.13, Chr.14, Chr.16, Chr.17, Chr.18, Chr.19, and Chr.20 ( Fig. 3 and Table 2). Among them, five SNP loci (HSW-8-1 on Chr.8, HSW-9-1 on Chr.9, HSW-12-1 on Chr.12, HSW-12-4 on Chr.12, and HSW-16-3 on Chr.16) were identified in more than three environments in this study. Another thirty QTNs were found in less than three environments. Among all 34 QTNs, fifteen signals overlapped or located near the genomic region of the known QTL underlying HSW or soybean yield, and the other seventeen were novel for HSW ( Table 2). The HSW of the tested accessions with different alleles were evaluated, and the results indicated that the HSW of these  accessions with different alleles in these identified QTNs were significantly different (Table 2). Thus, utilization of these appropriate alleles for HSW would be interesting for MAS of soybean cultivars with higher HSW.

Prediction of candidate genes controlling hundred seed weight
Genes located in the 200-kbp flanking regions of each peak SNP were considered candidate genes. A total of 237 genes, derived from 31 QTNs, were located near peak the SNPs from the three tested environments in 2015 and 2016 (Additional file 2). For further clearing the potential functions of these genes, various functional groups were classified based on the Gene Ontology database (http://geneontology.org/). Of these inferred genes, fifty-four had no functional annotations and were derived from protein families with unknown function. Another 183 genes were related to plant growth regulation, hormone metabolism, cell, RNA, protein metabolism, development, starch accumulation, secondary metabolism, signalling, and the TCA cycle (Additional file 2). Among these identified candidate genes, Glyma.03G192300, a starch branching enzyme located near HSW-3-1 of Chr.3, could promote seed weight in crops [31]. Glyma.05G005400 (located near HSW5-1 of Chr.5) and Glyma.08G317300 (located near HSW8-2 of Chr.5), a MYB-domain protein, play key roles in regulating seed development and determining seed weight in plants [32]. Glyma.05G196400 (located near HSW5-2 of Chr.5), one Leucine-rich repeat protein kinase family protein, has been proven a regulator of seed weight in plants [33]. Masatake et al. (2013) reported that HS3, which has the same domain as Glyma.06G201700 (located near HSW6-1 of Chr.6), regulates seed development in plants [34]. Bhatnagar et al. (2017) reported that OsPP2C51 positively regulates rice seed germination and affects seed yield [35]. Both OsPP2C51 and Glyma.12G109500 (located near HSW- To identify the possible roles of candidate genes in the HSW of soybean, gene-based associations were conducted using the GLM method. A total of 3057 SNPs in 237 candidate genes (MAF > 0.10) were obtained among twenty lines (ten higher/lower HSW lines) through genome re-sequencing. A total of 106 SNPs from 16 genes (Glyma.05G196200, 4 SNP; Glyma.12G083500, 4 SNP; Glyma.13G182400, 4 SNP; Glyma.17G178400, 4 SNP; Glyma.20G140200, 4 SNP; Glyma.20G140400, 4 SNP; Glyma.03G192300, 6 SNP; Glyma.05G005200, 6 SNP;  were significantly associated with HSW in soybean (Fig. 4, Additional file 3). Among these 16 identified genes, one gene (Glyma.03G192300) has been reported to be associated with HSW in crops [31]. The other 15 genes were found to be novel for controlling SW in soybean. The two alleles of the peak SNP from each of the 16 genes were analyzed for allelic effects. The SW of soybean accessions with one allele was significantly different from the other for all the 16 SNP peaks (Fig. 4). These beneficial alleles from candidate genes would be helpful for MAS in soybean with high and stable HSW.

Discussion
As the important source for vegetable oil and plant protein, soybean yield is lower in comparison with other major crops. MAS usefulness and effectiveness of a crop depends on two key factors: the accuracy of QTNs and the genetic variation in the selected association panel. The more diverse of the selected association panel in genotype and phenotype, the better universal of QTNs and beneficial alleles obtained by genome-wide association analysis as molecular markers for molecular assisted selection breeding. Soybean was difficult to adapt and fully mature in the unadoptable environments because it has a significant photoperiod response [30]. In this study, a total of 185 landraces or elite cultivars that were derived from six maturity groups (maturity groups 000, 00, 0, I, II and III) [14] were used to evaluate phenotypic variation (Additional file 4). The samples from these six maturity groups were selected since the accessions from different maturity groups have special adaptability to the three tested sites of Northeast China ('Harbin', 'Gongzhuling' and 'Shenyang') with two degrees difference in latitude between each two sites, which could fully reflect the formation of HSW and accurately evaluate the effects of environment on HSW to effectively increase the accuracy of the phenotypic data and QTNs. A previous study showed that environmental factors significantly affected the stability of HSW, especially in different maturity group conditions [9,12,13]. In the preset study, the HSW of most tested samples behaved stable among the three tested environments, which indicated that the variation of HSW was relatively smaller in the appropriate environment (Additional file 1).  proposed a new way to incorporate a superior  [36]. To date, more than 200 HSW QTL have been identified based on different mapping populations through bi-parental hybrid population, and most of these identified QTL were specific to the genetic background. Therefore, the novel QTL/genes still need to be identified for HSW MAS. In this study, 34 QTNs located on 15 chromosomes were found that are associated with HSW in three environments in 2015 and 2016. Among these 34 association signals, fifteen QTNs overlapped with or near the known HSW QTL ( Table 2). Two QTNs (HSW-3-1 of Chr.03 and HSW-13-2 of Chr.13) were significantly associated with HSW, and the association between these two genomic regions and HSW had been repeatedly verified by many previous linkage analysis studies [8,27,37,38]. Similarly, twelve QTNs (HSW-4-1 and HSW-4-2 of Chr.04, HSW-10-1 of Chr.10, HSW-13-3 of Chr.13, HSW-14-1 of Chr.14, HSW-16-1 of Chr.16, HSW-17-1 and HSW-17-3 of Chr.17, HSW-18-1 of Chr.18, HSW-20-1, HSW-20-2 and HSW-20-3 of Chr.20) have also been reported [4,16,20,[39][40][41]. Li et al. (2008) identified one major QTL (Seed weight 23-2 located in Chr.12) that affects HSW using a wild soybeanderived mapping population, which has a similar genomic region to HSW-12-1 in this study [42]. Another 18 QTL that are novel for HSW were identified.

Conclusions
By performing a GWAS of soybean HSW in Northeast China based on 185 tested accessions and 24,180 SNPs, the results showed that 34 QTNs located on 15 chromosomes are associated with HSW in three environments for two years. Among these 34 association signals, fifteen QTNs overlapped with or near the known HSW QTL. Gene based association showed 106 SNPs from sixteen candidate genes were significantly associated with HSW in soybean. Of them, Glyma.03G192300 was previously reported to be important in seed development and the other 15 gene were novel genes for HSW in soybean. The present study provides an invaluable resource and new QTL/genes for further study on the HSW of soybean in the molecular network and molecular assistant selection.

Soybean germplasms and field trials
One hundred and eighty-five samples were selected and collected from the Chinese National Soybean GeneBank to analyse the HSW variation and for subsequent reduced-sequencing. Among these accessions, there were 97 elite varieties, 38 elite lines and 28 landraces from the soybean production areas between 36.23°N and 61. 50°N of China belonged to six maturity groups (maturity groups 000, 00, 0, I, II and III) [14], representing the geographical and ecological diversity of soybean in northern China (Additional file 4). The other 22 accessions were collected from non-Chinese regions. All plant materials tested were planted at three locations including Harbin (45.80°N, 126.53°E, chernozem, active accumulated temperature 2700°C, frost-free period is 135 d, annual precipitation is 500-600 mm), Gongzhuling (43.50°N, 124.82°E, chernozem, active accumulated temperature is around 3010°C, frost-free period is 144 d, annual average precipitation is 500-700 mm), and Shenyang (41.80°N, 123.38°E, chernozem, active accumulated temperature is around 4010°C, frost-free period is 155 d, annual precipitation is 500-800 mm) in 2015 and 2016. For the tested environments, field trials were performed with a single row plot (3 m long and 0.65 m between rows) based on randomized complete block design and three replicates. After reaching full maturity of all plant materials, a total of 10 randomly selected plants from each row in each plot were randomly picked and weighed and the HSW was evaluated.

DNA isolation and SNP genotyping data collection
Genomic DNA of tested samples was isolated via the hexadecyl trimethyl ammonium bromide (CTAB) method and genotyped through a reduced-sequencing method (the specific locus amplified fragment sequencing (SLAF-seq) methodology) [43]. Two restriction digestion enzymes, MseI (EC 3.1.21.4) and HaeIII (EC: 3.1.21.4) (Thermo Fisher Scientific Inc., Waltham, MA, USA), were selected to produce more than 50,000 sequencing tags (approximately 300 bp to 500 bp in length) in each tested sample. In each accession, the sequencing libraries were defined based on the obtained sequencing tags, which spanned unique genomic regions in soybean. The barcode method and Illumina Genome Analyzer II System (Illumina Inc., San Diego, CA, USA) were utilized to obtain the 45-bp sequence reads at both ends of the sequencing tags from each accession library. The alignment between the obtained raw paired-end reads and the reference genome was conducted with BWA software (Version: 0.6.1-r104) [44]. The raw reads in the same genomic position were used to define the SLAF groups using more than 58,000 high-quality SLAF tags from each tested sample. The SNPs were defined based on an MAF ≥ 0.05. The genotype was regarded as heterozygous when the depth of minor allele/the total depth of the sample ≥ 1/3 (Additional file 5).
For twenty lines with extreme phenotypic values of SW, a genome resequencing with 10-fold in depth was conducted on an Illumina HiSeq 2500 sequencer. Paired-end resequencing reads were mapped to the reference genome (Version: Glyma.Wm82.a2) with BWA (Version: 0.6.1-r104) using the default parameters. SAMtools48 (Version: 0.1.18) software was used to convert the mapping results into the BAM format and to filter the unmapped and non-unique reads. Duplicated reads were filtered with the Picard package (picard.sourceforge.net, Version: 1.87). The BEDtools (Version: 2.17.0) coverage Bed program was applied to compute the coverage of sequence alignments. A sequence was defined as absent when coverage was lower than 90% and present when coverage was higher than 90%. SNP detection was performed by the Genome Analysis Toolkit (GATK, version 2.4-7-g5e89f01) and SAMtools. Only the SNPs detected by both methods could be analyzed further. SNPs with allele frequencies lower than 1% in the population were discarded (Additional file 6). SNP annotation was performed based on the reference genome (Version:Glyma. Wm82.a2) using the package ANNOVAR (Version: 2013-08-23).

Population structure evaluation and linkage disequilibrium (LD) analysis
The principal component analysis (PCA) programs were used to analyze the population structure of the association panel through GAPIT software [45]. The squared allele frequency correlations (r 2 ) in TASSEL version 3.0 [46] were used to calculate the LD block across the soybean genome based on SNPs with MAF ≥ 0.05 and missing data ≤10%. In contrast to the GWAS, missing SNP genotypes were not imputed with the major allele before LD analysis. The parameters in the software programs were set with MAF (≥ 0.05) and the integrity of each SNP (≥ 80%).

Genome-wide association analysis
Association signals of HSW were identified based on 24,180 SNPs from 185 tested samples with the compressed mixed linear model (CMLM) with default parameters in GAPIT [45]. The P value was calculated with the Bonferroni method with α ≤ 0.05 (≤2.70 × 10 − 4 ) and was used as the threshold to declare whether a significant association signal existed [47].

Prediction of candidate genes controlling hundred seed weight
Candidate genes, located in the 200-kb flanking genomic region of each peak SNP, were classified and annotated underlying the reference genome. The variation present in exonic regions, splicing sites, 5′UTRs and 3′UTRs, intronic regions, upstream and downstream regions of candidate genes in ten higher HSW lines and ten lower HSW lines was identified from genome re-sequencing data. A gene-based association analysis was conducted using the General Linear Model (GLM) method in TASSEL version 3.0 [46] to identify HSW-related SNPs or haplotypes. Significant SNPs affecting the investigated traits were claimed when the test statistics reached P < 0.01.