Genetic diversity and population structure analysis in cultivated soybean (Glycine max [L.] Merr.) using SSR and EST-SSR markers

Soybean (Glycine max) is an important legume that meets the protein and oil needs of a large share of the world's population. The USDA germplasm collection holds a large number of soybean accessions. Finding and understanding genetically diverse germplasm is a top priority for crop improvement programs. The current study used 20 functional EST-SSR and 80 SSR markers to characterize 96 soybean accessions from diverse geographic backgrounds. Ninety-six of the 100 markers were polymorphic, with 262 alleles (average 2.79 per locus). The molecular markers had an average polymorphic information content (PIC) value of 0.44, with 28 markers ≥ 0.50. The average major allele frequency was 0.57. The observed heterozygosity of the population ranged from 0 to 0.184 (average 0.02), while the expected heterozygosity ranged from 0.20 to 0.73 (average 0.51). The lower observed than expected heterozygosity suggests the likelihood of population structure in the germplasm. The phylogenetic analysis and principal coordinate analysis (PCoA) divided the total population into two major groups (G1 and G2), with G1 comprising most of the USA lines and the Australian and Brazilian lines. Furthermore, the phylogenetic analysis and PCoA divided the USA lines into three major clusters without any specific differentiation, supported by the model-based STRUCTURE analysis. Analysis of molecular variance (AMOVA) showed 94% variation among individuals in the total population, with 2% among the populations. For the USA lines, 93% of the variation occurred among individuals, with only 2% among lines from different US states. Pairwise population distance indicated more similarity between the lines from continental America and Australia (189.371) than Asia (199.518). Overall, the 96 soybean lines had a high degree of genetic diversity.

Introduction
Soybean is the world's fourth most widely grown crop. Its high-quality protein (40%) and vegetable oil (20%) content [1,2], compared to other crops, makes it highly desirable for human and animal consumption and as a biofuel [3]. In addition, soybean plays a vital role in nitrogen fixation during crop rotation [4]. At present, Brazil leads all other soybean-growing nations in production and productivity. Productivity in other major soybean-growing countries has also increased in the last few decades, whereas Pakistan remains behind, mainly due to stagnant yields. Although more than 120.48 million hectares of soybean are grown worldwide, the area under soybean cultivation in Pakistan is negligible. The agro-ecological conditions of the country are favorable for soybean cultivation, yet the crop has failed to attain a suitable position in the current cropping pattern. The country spends about two billion US$ on imports of soybean commodities to fulfil local requirements. Apart from human food products, soybean meal is the main and preferred source of protein for all types of poultry due to the good quality of its protein and amino acids, and it is frequently used in feed by Pakistan's poultry industry. Although the agro-ecological conditions of Pakistan favor soybean production, low genetic diversity has hindered the development of new varieties [5–8]. Several studies based on molecular markers and inbreeding coefficient analysis have revealed genetic uniformity in Brazilian soybean cultivars [9,10]. This limited genetic diversity in elite soybean germplasm indicates that the genes present in current cultivars originated from a small number of accessions. A more varied genetic background is desirable to protect against unexpected pest and disease outbreaks [11,12].
For plant breeders, diverse genetic resources increase the chance of developing new and improved cultivars with desired traits [13]. At present, considering the large number of genes predicted to be involved in the control of agronomic traits, the main focus in developing modern cultivars is to locate the best alleles linked to these traits. Presumably, during soybean domestication and introduction into producing regions, a large number of advantageous alleles were lost as a result of genetic bottlenecks. The accessions chosen for a breeding programme must contain and transmit advantageous rare alleles that are lacking in elite germplasm. As a result, understanding the origins of these alleles is crucial. Accessions that are very different from elite genotypes are likely to offer novel alleles for the desired trait. The difficult part is choosing which accessions from the available germplasm to use in breeding operations. Therefore, knowledge of the genetic diversity of soybean genotypes would help breeders and geneticists understand the structure of the germplasm, choose parents with greater genetic diversity, and accelerate the expansion of genetic resources [14]. Morphological characterization, biochemical markers, and molecular marker techniques are frequently used to assess genetic diversity within and among populations [15]. Morphological and biochemical markers are less reliable than DNA markers because they are subject to significant environmental effects [16]. Developing DNA markers is important for understanding the genetic diversity between and within different crop species [17,18], as they reveal variations in the nucleotide sequence between individuals and are unaffected by environmental variables [19].
Molecular markers such as random amplified polymorphic DNA (RAPD), simple sequence repeats (SSR), expressed sequence tag-derived SSRs (EST-SSR), amplified fragment length polymorphism (AFLP), and single nucleotide polymorphism (SNP) have been used to identify genetic diversity in soybean germplasm [20–26]. SSR markers have been the most widely used for characterizing genes, analyzing genetic diversity, and mapping genetic linkages. SSR markers are very useful for genotype differentiation, pedigree analysis, assessing genetic distances among genotypes, and variety identification because they are short tandem repeats dispersed uniformly across the entire genome with high polymorphic information content (PIC) and reproducibility [27–29]. Functional molecular markers, such as those developed from expressed sequence tags (EST), give direct access to the population diversity of genes important for agriculture, making it easier to link genotype to phenotype. SNPs are among the most important DNA markers because their low levels of recurrent mutation make them evolutionarily stable, and they are therefore well suited to dissecting the genetic basis of complex characters and analyzing genomic evolution processes [30]. While SNPs can be used to assess genetic diversity in agricultural species, they are less preferred than SSRs due to their limited information content, biallelic nature, and high cost [19]. A comparative genetic diversity study on sugar beet cultivars using DArT, SNPs, and SSRs showed that SSR markers had the highest success rate due to their highly polymorphic characteristics [31,32]. Other studies have shown that SSR markers were extremely effective in estimating genetic diversity and association among soybean accessions [16,28–32].
Since soybean is a relatively new crop in Pakistan, local breeding programs focus on creating new, competitive cultivars with excellent production and quality, with limited attention to understanding the extent of diversity in their working germplasm. Therefore, it is important to assess the genetic diversity of soybean germplasm from the USDA to develop new and improved soybean cultivars for Pakistan. Hence, this study analyzed 96 accessions of cultivated soybean from various geographical regions using 80 genomic SSR and 20 functional EST-SSR markers to evaluate their geographic and genetic differences.

Genotyping
Fresh young leaves were used for genomic DNA extraction following the cetyltrimethylammonium bromide (CTAB) method described by Doyle and Doyle [33]. Eighty genomic SSR markers and 20 functional EST-SSR markers distributed uniformly across the soybean genome were selected from the literature (S2 Table) and used to assess genetic diversity among the soybean accessions. A PCR reaction mixture (15 μl) was prepared, comprising 1.5 mM of 10× buffer, 3.5 mM MgCl2, 600 μM dNTPs, 0.6 μM of each forward and reverse primer, and 1 U Taq polymerase with 50–100 ng DNA. The reaction began with initial denaturation at 94˚C for 5 min, followed by 95˚C for 30 sec, 48–55˚C for 1 min, 72˚C for 1 min, and a final extension at 72˚C for 10 min. A gradient thermal cycler (Kyratec Super Cycler) was used to perform the PCR. The PCR products were fractionated by electrophoresis in 2.5% agarose gel containing ethidium bromide for staining and visualized using a UV Analyzer; band sizes were determined from their migration distance relative to a GeneRuler 50 bp DNA ladder (Thermo Scientific, 10416014).

Statistical analysis
Genotypic data obtained from the SSR and EST-SSR markers were scored as 1 or 0 for the presence or absence of a DNA band in the gel (S1 Fig). The expected heterozygosity (He), observed heterozygosity (Ho), genetic distance between accessions (GD), and Shannon's information index (I) were estimated using POPGENE v.1.32 software [34]. The PIC, gene diversity, and allele frequency of markers were calculated using PowerMarker v.3 [35].
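The per-locus statistics named above follow standard textbook definitions, and a minimal sketch may help make them concrete. The code below is an illustration only and is not the POPGENE or PowerMarker implementation; it assumes codominant diploid genotypes, applies no small-sample correction, and uses hypothetical function names (`allele_frequencies`, `locus_summary`) and hypothetical example data.

```python
from collections import Counter
from math import log


def allele_frequencies(genotypes):
    """Return {allele: frequency} for one locus.

    `genotypes` is a list of diploid genotypes, each a 2-tuple of allele
    labels (for SSRs, typically fragment sizes in bp).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def locus_summary(genotypes):
    """Compute per-locus diversity statistics (no small-sample correction)."""
    freqs = list(allele_frequencies(genotypes).values())
    sum_sq = sum(p * p for p in freqs)

    # Expected heterozygosity (gene diversity): He = 1 - sum(p_i^2)
    he = 1.0 - sum_sq

    # PIC (Botstein et al. 1980):
    # PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    cross = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    pic = he - cross

    # Observed heterozygosity: share of individuals with two different alleles
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)

    # Shannon's information index: I = -sum(p_i * ln p_i)
    shannon = -sum(p * log(p) for p in freqs)

    return {
        "major_allele_freq": max(freqs),
        "He": he,
        "Ho": ho,
        "PIC": pic,
        "I": shannon,
    }


# Hypothetical locus: 10 accessions, two alleles (150 bp and 160 bp)
example = [(150, 150)] * 6 + [(160, 160)] * 3 + [(150, 160)]
summary = locus_summary(example)
```

In this hypothetical locus, a single heterozygote among ten accessions gives Ho = 0.1 against He = 0.455, which illustrates how a mostly homozygous panel can show low observed but substantial expected heterozygosity.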

Diversity analysis and population structure
Phylogenetic analysis was conducted on the genotypic data to evaluate the dissimilarity among accessions using the unweighted pair group method with arithmetic mean (UPGMA) in DARwin software. The PHYLIP file obtained from DARwin was used to construct the phylogenetic tree in MEGA6 software [36]. Principal coordinate analysis (PCoA) was conducted using PAST 4.0 software to identify the degree of differentiation between accessions [37]. The model-based software STRUCTURE v.2.3.4 was used to analyze population structure under the admixture ancestry model with correlated allele frequencies [38]. The number of iterations for the burn-in period and Markov chain Monte Carlo (MCMC) was set at 10,000. The online platform STRUCTURE HARVESTER was used to obtain the optimum K value by the Evanno method [39]. Analysis of molecular variance (AMOVA) was used to assess genotypic variation in the population, with accessions from each country considered a single population. Since each population must contain at least two individuals, the accessions from Japan and Iran were considered one population. AMOVA was performed using GenAlEx 6.5 software [40].
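The Evanno approach to choosing K can be sketched in a few lines. The example below is a simplified illustration, not the STRUCTURE HARVESTER code: it assumes ΔK is taken as the absolute second-order difference of the mean ln P(D) across replicate runs, divided by the standard deviation of ln P(D) at that K. The function name `evanno_delta_k` and all likelihood values are hypothetical.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1.

    `lnp_by_k` maps each tested K to the list of ln P(D) values from its
    replicate runs (at least two replicates per K are needed for stdev).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    delta_k = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # second-order difference is undefined at the ends
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue  # delta K is undefined when replicates are identical
        delta_k[k] = second_diff / spread
    return delta_k


# Hypothetical ln P(D) values from three replicate runs at each K
runs = {
    1: [-5000, -5010, -4990],
    2: [-4500, -4505, -4495],
    3: [-4400, -4420, -4380],
    4: [-4350, -4300, -4400],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
```

With these hypothetical values the largest ΔK falls at K = 2. Note that ΔK cannot be computed for the smallest and largest K tested, so the method can never select either end of the tested range.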

Marker informativeness and heterozygosity
Of the 100 markers used in this study, 17 EST-SSRs and 79 SSRs were polymorphic, with 262 amplified alleles, ranging from 2 to 5 alleles per locus (average 2.79). Of the 96 polymorphic markers, the five most polymorphic markers produced five alleles, followed by seven, 41, and 43 markers that produced four, three, and two alleles, respectively (

Genetic relationship among 96 soybean accessions based on origin
The 96 tested soybean accessions were grouped into eight populations based on origin. As Japan and Iran only had a single accession each, they were grouped in a single population (Pop-8

Genetic distance (Nei's measure) analysis
The genetic distance among the 96 soybean accessions from nine regions ranged from 0.079 to 1.232 (S3 Table). PI612157 from Georgia, USA, and PI462312 from India had the greatest genetic distance of 1.232, followed by PI269518C from Pakistan and PI462312 from India with 1.17. These accessions had the highest degree of genetic differentiation based on genetic distance. PI644047 and PI644054, both from Georgia, USA, had the smallest genetic distance of 0.079.
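To show how a distance of this kind is obtained from allele data, the sketch below computes Nei's (1972) standard genetic distance, D = -ln(I), where I is the normalized genetic identity. Which of Nei's measures was used is not specified here, so the choice of the 1972 standard distance is an assumption, as are the function name `nei_standard_distance` and the example data. Each accession is represented by its allele frequencies at each locus, so a homozygous accession has frequency 1.0 for its single allele.

```python
from math import log, sqrt


def nei_standard_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance between two units.

    `freqs_x` and `freqs_y` are lists with one dict per locus, each
    mapping allele label -> frequency. Loci must be in the same order.
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(freqs_x, freqs_y):
        jx += sum(p * p for p in fx.values())                    # identity within x
        jy += sum(p * p for p in fy.values())                    # identity within y
        jxy += sum(p * fy.get(a, 0.0) for a, p in fx.items())    # identity between x and y
    identity = jxy / sqrt(jx * jy)
    if identity == 0:
        return float("inf")  # no shared alleles at any locus
    return -log(identity)


# Hypothetical example: two homozygous accessions scored at four loci,
# sharing the same allele at three of them
acc_a = [{150: 1.0}, {200: 1.0}, {180: 1.0}, {220: 1.0}]
acc_b = [{150: 1.0}, {200: 1.0}, {180: 1.0}, {226: 1.0}]
distance = nei_standard_distance(acc_a, acc_b)
```

Under this definition, two fully homozygous accessions have an identity equal to the fraction of loci at which they share an allele, so larger distances correspond to fewer shared alleles.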

Diversity analysis
The genetic diversity of the 96 soybean accessions was assessed through the following analyses.
Phylogenetic analysis. The phylogenetic analysis identified two major groups and several subgroups (Fig 2). Group 1 (G1) comprised 53 accessions, most of which were from the USA (37) and Brazil (7), while Group 2 (G2) comprised 43 accessions, including accessions from China, Pakistan, India, and Afghanistan. Furthermore, G2 contained the Pakistani check cv. Faisal, while G1 contained cv. Ajmeri. The phylogenetic analysis grouped the 59 USA accessions into nine groups (G1–G9), indicating a high degree of genetic diversity (Fig 3).
Principal coordinate analysis. The PCoA showed that all accessions were distributed across the plot, with 45.97% of the total variation explained by the first six coordinates (Fig 4).

PLOS ONE
Based on their grouping, many USA accessions were similar to Brazilian and Australian accessions, while the accessions from India, China, Pakistan, and Afghanistan were similar to one another, with six Chinese and two Pakistani accessions clustered together. A second PCoA of the USA accessions revealed that the accessions were scattered across the plot without any significant clustering. The first six coordinates explained 51.5% of the total variation (Fig 5).
Population STRUCTURE. Population STRUCTURE was used to 1) identify distinct genetic populations, 2) identify migrants and admixed individuals, and 3) assign individuals to populations [41]. The highest ΔK peak (ΔK = 176.6) occurred at K = 2, indicating that the tested population could be divided into two groups. Two minor peaks, at K = 3 (ΔK = 20.71) and K = 8 (ΔK = 8.38), also occurred (Fig 6A). The accessions with a membership proportion (Q) of 80% or more were considered pure, with the remaining accessions classified as admixture

increasing from 8 to 11 in G1 and 19 to 23 in G2. Three of the 12 Chinese accessions were in G1, with the rest in G2. Five of the eight Pakistani accessions were in G2, with the rest in G1. G1 and G2 each contained two Indian accessions. A similar structure analysis was undertaken for the USA accessions, with an optimum value of K = 9 obtained (Fig 7A). The results were consistent with the phylogenetic analysis, indicating a high level of differentiation in the USA soybean lines (Fig 7B).
AMOVA. The molecular variance observed among individuals within a population was 94% and among populations was only 2% (Table 3). Wright's F-statistics for the tested markers were Fis = 0.961 and Fit = 0.962. The 96 polymorphic markers had a mean fixation index (Fst) of 0.025, indicating low genetic differentiation among populations. The rate of gene flow (Nm) was 9.814, indicating a high rate of gene exchange among populations.
The 59 USA accessions were further grouped by state and analyzed by AMOVA. The percentage of variation among individuals within populations was 93% and among populations was 2%. Wright's F-statistics for the tested SSR markers were Fis = 0.952 and Fit = 0.953. The SSR markers had a mean fixation index (Fst) of 0.024, indicating a very low degree of differentiation among populations. The rate of gene flow (Nm) was 10.36, indicating a very high rate of gene exchange among populations (Table 4). A pairwise population matrix was computed to check the population distance among populations in three continents: America, Asia, and Australia (Table 5). The results indicated greater genetic diversity among the lines from Asia than among those from continental America and Australia.
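The F-statistics and gene flow estimates reported for both AMOVA runs can be cross-checked with two standard relationships: Wright's identity (1 - Fit) = (1 - Fis)(1 - Fst), and the island-model approximation Nm = (1/Fst - 1)/4. The sketch below is a consistency check under the assumption that these are the relationships behind the reported values; the function names are hypothetical.

```python
def fit_from_fis_fst(fis, fst):
    """Wright's identity: (1 - Fit) = (1 - Fis) * (1 - Fst)."""
    return 1.0 - (1.0 - fis) * (1.0 - fst)


def gene_flow_nm(fst):
    """Island-model approximation: Nm = (1 / Fst - 1) / 4."""
    return (1.0 / fst - 1.0) / 4.0


def fst_from_nm(nm):
    """Inverse of the approximation above: Fst = 1 / (4 * Nm + 1)."""
    return 1.0 / (4.0 * nm + 1.0)


# Values reported for all 96 accessions and for the 59 USA accessions
fit_all = fit_from_fis_fst(0.961, 0.025)
fit_usa = fit_from_fis_fst(0.952, 0.024)
nm_all = gene_flow_nm(0.025)
nm_usa = gene_flow_nm(0.024)
```

The reported Fis and Fst values reproduce the reported Fit to three decimals in both cases. The Nm values obtained from the rounded Fst figures (9.75 and about 10.17) are slightly below the reported 9.814 and 10.36, which is consistent with Nm having been computed from unrounded Fst values.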

Discussion
Characterizing germplasm and understanding its genetic diversity are prerequisite steps for developing improved crop cultivars [42]. Plant breeders can broaden the genetic base of locally adapted cultivars by using their knowledge of genetic diversity. Consequently, genetic diversity estimation has become an important method for locating genetically different parents that possess desirable features [42]. The identification of the genetic diversity and genetic structure of the evaluated soybean germplasm using molecular data supported the selection of possible parents based on morpho-biochemical properties, which facilitates long-term breeding and selection operations. To reduce the genetic instability of segregating populations, diverse parents are preferable in soybean crossbreeding [43]. Many studies have assessed the genetic diversity of legume crops using molecular markers [44,45]. The polymorphism observed in these molecular analyses was very high, most likely due to the polymorphic nature of SSRs [46,47]. Thus, molecular markers are a reliable tool for distinguishing soybean populations.
In the present study, 100 uniformly distributed genomic SSR and functional EST-SSR markers were used to explore the genetic diversity among 96 soybean accessions. The results of the phylogenetic analyses, PCoA, population STRUCTURE, and AMOVA indicated high genetic variation among the accessions (Figs 2, 4, and 6; Table 2), with a slightly higher average number of alleles per locus (2.79) than an earlier study by Bisen et al. [48], who reported an average of 2.21 alleles per locus for 50 SSR markers in 38 soybean accessions. The difference in allele numbers may be due to the different sample sizes, number of markers, and genotypes used [49]. A PIC value of 0.5 or more indicates a highly informative marker [50]. Here, PIC values ranged from 0.18 to 0.72 (average 0.44), lower than the average of 0.47 reported elsewhere [51]. Markers with high PIC values can be used to distinguish soybean accessions. Information on Ho and He indicates the extent of genetic variability in the population [5]. This study had a much lower average Ho (0.019) than average He (0.51), which may be due to the highly self-pollinating nature of soybean [52,53]. The average Shannon's index per locus was 0.79, slightly higher than the average of 0.69 per marker reported in soybean by Ullah et al. [54], suggesting that the diversity observed in the present study is slightly higher than that reported previously.
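The gap between Ho and He can be expressed as an inbreeding coefficient, F = 1 - Ho/He, which is a standard way of quantifying the heterozygote deficit expected in a self-pollinating crop. The short sketch below applies this formula to the averages reported here; it is an illustrative check, and the function name `inbreeding_coefficient` is hypothetical.

```python
def inbreeding_coefficient(ho, he):
    """Heterozygote deficit relative to expectation: F = 1 - Ho / He."""
    if he <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - ho / he


# Average values reported in this study
f_avg = inbreeding_coefficient(ho=0.019, he=0.51)
```

The result is about 0.96, close to the Fis of 0.961 reported in the AMOVA, so the two summaries of heterozygote deficit agree.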
The study population was divided into two major groups, G1 and G2 (Fig 2), with most USA lines in G1 and those from other countries in G2. Žulj Mihaljević [43] also tested 42 SSR markers on 97 European soybean accessions, which separated into two sub-groups based on geographic origin, supporting the present findings. Geographic distance and genetic variation are highly correlated, most likely as a result of long-term selection and ecological diversity [55]. These groupings were further supported by PCoA and population STRUCTURE (Figs 2 and 4), suggesting that the USA lines are genetically distinct from lines from other countries and affirming the assumption that USA lines are somewhat distinct from lines from other continents. The similar results obtained by STRUCTURE and PCoA indicate that two separate gene pools were the primary source of the two sub-populations [45]. However, some Brazilian lines were also present in G1, which may be due to their close geographical locations or the free movement of germplasm in this region. In addition, most Pakistani lines clustered with the USA lines in G1, possibly due to their similar origins. In an earlier study, Iqbal et al. [56] also reported that accessions from the USA and Pakistan clustered together. STRUCTURE analysis showed that the ratio of pure USA lines increased when the threshold decreased from 80% to 70% in both groups, in line with the findings of other studies [51, 54, Table 3) but also observed genetic similarity among some lines. Other studies support the high degree of variation among the USA lines [58,59]. The local commercial varieties Faisal soybean and Ajmeri were present in different clusters, indicating that these genotypes were introduced from various origins and assessed throughout the selection process before being made available for commercial cultivation [60]. The assessment by Appiah-Kubi et al. [61] of the genetic diversity among soybean genotypes using 20 SSR markers revealed a close link between genotypes and their geographic origin, which is consistent with these findings. Given the substantial exchange of genetic resources among farming communities, the weak structure of the germplasm may reflect gene flow between the subpopulations [62]. Molecular markers will therefore make it easier for farmers to identify suitable genotypes for crossing, which may lead to the development of new varieties.

Conclusion
Frequent use of closely related cultivars reduces the genetic diversity in germplasm and hinders the breeding of new cultivars with improved traits. The soybean accessions investigated in this study are highly diverse, with a medium to high level of genetic variation across geographic regions. Genetic markers are a useful tool for assessing genetic diversity that is not easily captured through phenotypic evaluation, and could thus be valuable for future breeding programs by adding new varieties to the existing gene pool.