An Improved Genome-Wide Association Procedure Explores Gene–Allele Constitutions and Evolutionary Drives of Growth Period Traits in the Global Soybean Germplasm Population

In soybeans (Glycine max (L.) Merr.), their growth periods, DSF (days of sowing-to-flowering), and DFM (days of flowering-to-maturity) are determined by their required accumulative day-length (ADL) and active temperature (AAT). A sample of 354 soybean varieties from five world eco-regions was tested in four seasons in Nanjing, China. The ADL and AAT of DSF and DFM were calculated from daily day-lengths and temperatures provided by the Nanjing Meteorological Bureau. The improved restricted two-stage multi-locus genome-wide association study using gene–allele sequences as markers (coded GASM-RTM-GWAS) was performed. (i) For DSF and its related ADLDSF and AATDSF, 130–141 genes with 384–406 alleles were explored, and for DFM and its related ADLDFM and AATDFM, 124–135 genes with 362–384 alleles were explored, in a total of six gene–allele systems. DSF shared more ADL and AAT contributions than DFM. (ii) Comparisons between the eco-region gene–allele submatrices indicated that the genetic adaptation from the origin to the geographic sub-regions was characterized by allele emergence (mutation), while genetic expansion from primary maturity group (MG)-sets to early/late MG-sets featured allele exclusion (selection) without allele emergence in addition to inheritance (migration). (iii) Optimal crosses with transgressive segregations in both directions were predicted and recommended for breeding purposes, indicating that allele recombination in soybean is an important evolutionary drive. (iv) Genes of the six traits were mostly trait-specific involved in four categories of 10 groups of biological functions. GASM-RTM-GWAS showed potential in detecting directly causal genes with their alleles, identifying differential trait evolutionary drives, predicting recombination breeding potentials, and revealing population gene networks.


Introduction
Soybean (Glycine max (L.) Merr.) has been cultivated for approximately 5000 years in China [1]. Its planting area has expanded from China as the center of origin to 53 • N and 35 • S worldwide [2,3]. Originally, soybeans were not suitable for environments at high or low latitudes. As soybeans spread worldwide, they became adapted to the change in environment. The spread was facilitated through four routes, i.e., northward to Northeast China and Russia; eastward to the Korea Peninsular and Japan islands; southward to Southeast Asia, South Asia, Africa, and Australia; and lately westward to northern North America, southern North America, and Central and South America [4].
Soybean is a typical short-day and thermophilic crop. Its growth and development are affected by day-length and temperature [5,6]. Garner and Allard [7] discovered the phenomenon of photoperiodism that regulates the growth period of soybeans. Temperature is another environmental factor affecting the soybean growth period [8]. Therefore, these two factors determine the adaptability of soybean varieties to geographic regions. DSF (days of sowing-to-flowering) and DFM (days of flowering-to-maturity), measured by number of days, do not reflect their relationship with day-length and the temperature of the geographic and climate environments directly. Therefore, the two growth period traits were separated into degree and duration of day-length and temperature, i.e., ADL (accumulative day-length) and AAT (accumulative active temperature), by Wang et al. [9]. The results showed that ADL and AAT for DSF and DFM were markedly different among and within the subgroups of geographic regions and MG-sets (maturity group-sets; worldwide, 3 MG-sets grouped from 13 MGs) [10,11].
For the growth period traits of soybean, 12 major genes, including E1-E11 and J, have been reported [10]. Among them, the potential genes of E1-E4, E9, E10, and J have been isolated. Meanwhile, FT genes, E1La, E1Lb, PRR3a/3b, DT1, and DT2, have been also identified as being related to DSF [12][13][14][15][16][17]. In Arabidopsis, thermo-morphogenesis was found to be controlled by various light signaling pathways, biological clock rhythms [18], plant hormone levels and activities [19,20], and epigenetic mechanism and chromatin level regulation [21][22][23]. However, very few studies regarding these factors in soybeans have been reported. Previous genetic studies on the response to day-length and temperature were mainly carried out for DSF, with little attention being devoted to DFM. As DSF and DFM are measured according to the number of days, which varies with the geographic location at which the materials are tested, the substantial factors that should be tested are their required ADL and AAT. Therefore, the exploration of the gene-allele systems of ADL and AAT required for DSF and DFM is essential to understand the exact relationship between the growth period traits and key environmental factors of ADL and AAT.
Two strategies have been used in genetic study of a target trait; one is to find the individual gene(s) for their functions in a specific material(s), while the other is to explore the complete gene-allele system of the trait in a natural population. Plant breeders are especially interested in finding and converging all of the superior alleles on different loci that exist in historical germplasm populations for cultivar development. Genome-wide association study (GWAS) has been widely used for the genetic dissection of complex traits in germplasm populations. However, previous GWAS procedures concentrated on finding a few major loci, such as the general linear model (GLM) and mixed linear model (MLM) approaches [24][25][26], based on single-locus models, and even the multilocus mixed-model (MLMM) [27] and multi-locus random-SNP-effect mixed linear model (mrMLM) [28]. Breeders are likely more interested in exploring the whole quantitative trait locus (QTL)-allele constitution in a population for an optimal genetic functioning. He et al. [29,30] proposed the innovative restricted two-stage multi-locus GWAS (RTM-GWAS) procedure.
RTM-GWAS is characterized by two innovations. First, using the genomic single nucleotide polymorphism linkage disequilibrium block (SNPLDB) as markers to meet the requirement of multiple alleles per locus in a germplasm population, and using the genetic similarity coefficient matrix of whole-genome SNPLDBs to correct the population structure bias for reducing false positives and negatives. Second, using the multi-locus model to identify the complete QTL-allele set to avoid false positives inflated by neighboring loci and false negatives caused by stringent significance correction for the single-locus model, as well as using two-stage analysis with trait heritability as the upper limit to control the calculation of noise and false positives based on incorporated precise experimental design. In this way, the significance level of p = 0.05 under the multi-locus model is equivalent to Bonferroni criterion of 0.05/m (m is the number of markers) under the single-locus model. The RTM-GWAS procedure has been applied to identify QTL-allele systems for growth period, seed quality, resistance to insects, and tolerance to environmental stress traits in soybeans for germplasm populations and bi-parental populations. The identified QTL-alleles were further used for gene annotation, optimal cross-design, and evolution mechanism studies [31][32][33][34][35][36].
Liu et al. [10] used RTM-GWAS to identify 52 QTLs for days of sowing-to-flowering and 59 QTLs for days of sowing-to-maturity in the global soybean population. From the QTLs, 44 and 36 candidate genes were annotated, respectively, and grouped into a similar set of 10 functional groups. A previous study (Su et al., 2023, accepted by Theoretical and Applied Genetics) suggested using gene-allele sequence as genomic markers (GASM) to replace SNPLDB in direct detection of the gene-allele system to avoid the inference from QTLs to genes based on whole genome sequencing. The present study follows this research strategy to use GASM-RTRM-GWAS to identify direct gene-allele systems of DSF and DFM and their required ADL and AAT.
Because RTM-GWAS can identify a relatively complete set of QTL-alleles, Liu et al. [37] explored the evolutionary mechanism from later to earlier in the annual wild and cultivated soybean in China through comparisons between the QTL-allele matrices of geographic subpopulations. This also made it possible to explore the gene network of a trait in a population using Gene Ontology (GO) analysis (http://www.soybase.org (accessed on 1 January 2021)). The protein-protein interaction (PPI) network is a procedure for analyzing how proteins work together in cells to perform cellular functions in a coordinated manner [38][39][40]. PPI calculation and prediction have been applied in plants such as Arabidopsis [41] and rice [42].
The present study aimed to identify the whole-genome gene-allele constitution of DSF and DFM and their required ADL and AAT using GASM-RTM-GWAS. This was used to explore the genetic mechanism in geographic adaptation and MG-set expansion, to predict the genetic recombination potential of the traits, and to explore their gene-allele networks and identify key gene-alleles to improve the traits in the global soybean population. This study is characterized by two features: first, directly identifying the gene-allele systems of growth periods and their response to day-length and temperature through the innovative GASM-RTM-GWAS; and second, targeting the recombination potentials and evolutionary motivators for further world-wide extension of soybeans based on the complete gene-allele information of the growth periods and related eco-traits.

Identification of Gene-Allele Systems of DSF and DFM in Global Soybeans
DSF and DFM were related to the degree and duration of two basic environment factors, ADL and AAT, respectively. From the center of origin to extended geographic regions, the required ADL and AAT for DSF and DFM changed greatly to adapt to the local geographic and sowing seasonal conditions [9] (Table S1, Figure 1a).
Under the G × E model of GASM-RTM-GWAS, 141 and 135 genes with 406 (2-8 or 2.88/gene) and 386 (2-8 or 2.84/gene) alleles were identified on the 20 chromosomes for DSF and DFM, explaining 76.85% and 55.03% of phenotypic variance (PV), respectively, ranging from 0.00 to 6.95%/gene and 0.00 to 7.20%/gene, respectively. There were 124 and 127 genes with R 2 < 1.5%, explaining 25.22% and 32.63% PV, respectively, and 17 and 8 genes with R 2 ≥ 1.5%, explaining 51.63% and 22.40% PV, respectively; therefore, 18,85% and 25.67% of the genetic variation (h 2 −PV gene = 95.70−76.85% = 18.85% and 80.70−55.03% = 25.67%, respectively) were due to a collective of minor genes. The G × E variance contributed 4.20% and 18.80% to the PV for DSF and DFM, respectively, which is relatively low (Table 1, Figure 1b,c). Only seven genes were shared, indicating different genetic systems between the two traits. Comparing the identified QTLs reported by  using a similar population as in the present study, 141 genes vs. 52 QTLs (44 annotated genes with 39 consistent in this study) for DSF and 135 genes vs. 59 QTLs (36 annotated genes with 34 consistent in this study) for DFM were observed. These results indicated that more DSF and DFM genes were identified using GASM-RTM-GWAS than using SNPLDB-RTM-GWAS. In addition, compared with SoyBase (http://www.soybase.org (accessed on 1 January 2021)), 63 (44 loci) and 28 (23 loci) QTLs/genes are similar to the DSF and DFM genes discovered in this study (Tables S2 and S3), while the other 97 and 110 genes explaining 52.45% and 38.14% of the present study PV, respectively, were newly detected. Accordingly, GASM-RTM-GWAS is powerful and effective in identifying gene-allele systems of complex quantitative traits and performed better than SNPLDB-RTM-GWAS. This is due to the GASM markers being more relevant than the SNPLDB markers, meaning that the haplotype number of a GASM fits the allele number of the corresponding causal gene better because GASM targets the same segment of the causal gene in the population.

Identification of Gene-Allele Systems of ADL and AAT Required for DSF and DFM in Global Soybeans
The identified gene-allele results from the GASM-RTM-GWAS G × E model for ADL DSF , AAT DSF , ADL DFM , and AAT DFM are listed in Tables S4-S7, with their summarized results in Table 1 and Table S8. A total of 459 genes were detected for the four traits.
For ADL DFM , 124 genes were detected, with 364 alleles (2-6 or 2.94/gene), explaining 50.24% PV, ranging from 0.00 to 7.18% per gene. There were 119 genes with R 2 < 1.5%, explaining 33.95% PV, and 5 genes with R 2 ≥ 1.5%, explaining 16.29% PV. In addition, 28.26% PV (78.50−50.24% = 28.26%) was due to a collective of minor genes. For AAT DFM , 129 genes with 362 alleles (2-6 or 2.81/gene) were detected, explaining 41.54% PV, with each gene ranging from 0.00 to 5.71%. There were 126 genes with R 2 < 1.5%, explaining 31.20% PV, and 3 genes with R 2 ≥ 1.5%, explaining 10.34% PV. In addition, 29.36% PV (70.90−41.54% = 29.36%) was due to a collective of minor genes. The G ×E variance accounted for 20.90% and 29.00% PV for ADL DFM and AAT DFM , respectively, which is larger than those for DSF (Table 1, Figure 1b,c). Comparing the previously reported DFM genes (QTLs) to the present ADL DFM and AAT DFM genes, assuming DFM and ADL DFM and AAT DFM were interrelated, 33 (23 loci) and 28 (22 loci) previously reported genes (QTLs), respectively, were close to the DFM genes discovered in this study. Therefore, at least 101 ADL DFM and 107 AAT DFM other genes were not related to DFM, but newly detected in the present study, which explains 33.58% and 29.49% of the present total PV, respectively (Tables 1, S6 and S7). large-contribution major gene with R 2 ≥ 1.5%. SC-major gene: small-contribution major gene with R 2 < 1.5%. Unmapped gene: unmapped minor genes. In parentheses of gene rows, the first number is the number of identified genes, followed by a range of single gene contributions to phenotypic variance. In parentheses of allele rows is the range of single allele effects with the unit of d for DSF and DFM, d · h for ADL, and d · • C for AAT.
The above results indicated that the gene-allele systems of ADL DSF , AAT DSF , ADL DFM , and AAT DFM were different from each other, whereas those of ADL DSF and AAT DSF were closely related to DSF and those of ADL DFM and AAT DFM were less closely related to DFM. Table 2 shows that there were 13 shared genes between ADL DSF and AAT DSF , which explained 32.82% and 38.19% of their PV, respectively, indicating that both traits are genetically related and may form the genetic basis for ADL × AAT, causing the DSF variation. There were 16 shared genes between ADL DFM and AAT DFM , which explained only 15.59% and 10.44% of their PV, respectively, and were less closely interrelated at the DFM stage.  g-DSF-08-1 6.07 g-ADL DSF -08-1 6.04 g-AAT DSF -08-1 6.14 g-DSF-02-11 2.08 g-AAT DSF -02-9 2.10 Glyma04g36240 g-DSF-04-3    Note: Gene code: for example, g-DSF-06-4, -06 represents chromosome 6 and -4 represents its order on the chromosome according to its physical position. The position corresponds to the Williams 82 reference genome version 1 (Wm82.a1). §: bold gene code means R 2 of gene ≥1.5%.
The identified genes with their alleles for each trait were arranged in a gene-allele matrix, which consisted of all gene-allele information of the 354 soybean varieties in the global population, including alleles with positive and negative effects (Figure 1d-f). Table 2 summarizes the shared genes among DSF, ADL DSF , and AAT DSF , as well as among DFM, ADL DFM , and AAT DFM . Nine genes were common among DSF, ADL DSF , and AAT DSF , which explained 35.09%, 32.60%, and 37.89% PV, respectively, indicating that ADL and AAT are a large part of the common genetic basis for DSF. In addition, DSF and AAT DSF shared 23 other genes (24.16%), while DSF and ADL DSF shared only 4 other genes (0.38%). This indicates that ADL and AAT are both important to DSF, but AAT has a greater impact on DSF variation (59.25%) than ADL (35.47%). In addition, DSF, ADL DSF , and AAT DSF had 94, 96, and 88 trait-unique genes, respectively, with only 14.75%, 31.65%, and 10.49% contributions to their respective PVs. These trait-unique genes were scattered among the four functional categories of 10 groups (Table 3), indicating that different genes with similar functions were trait-unique for different traits. For example, seven genes related to light and circadian rhythm were present in DSF, six of which were unique, while in ADL DSF , eight genes related to light and circadian rhythm were present, of which six were unique genes (Figure 1g,h).

Differential Contributions of ADL and AAT to DSF and DFM in Global Soybeans
In DFM, ADL DFM , and AAT DFM , eight common genes were found, which explained 11.54%, 11.33%, and 8.63% of their respective PVs. DFM and ADL DFM shared 11 genes (12.77%) and DFM and AAT DFM shared four genes (2.39%). This indicates that ADL DFM (24.31%) contributed more genetically than AAT DFM (13.93%) to DFM, but it was less when compared with that in DSF; therefore, DFM was less genetically influenced by ADL and AAT than DSF. In addition, in DFM, ADL DFM , and AAT DFM , 105, 86, and 96 traitunique genes, respectively, explained 26.59%, 18.43%, and 28.33% of their respective PVs. These trait-unique genes were scattered among the four functional categories of 10 groups (Table 4), indicating that different genes with similar functions were unique for the three traits (Figure 1h,i). Note: The numbers in parentheses represent the sum of R 2 for a certain category or group. The four gene ontology categories with their groups are as follows: Category I: Genes related to flowering, seed and stem development, or response to light and temperature stimulation, including Group 1 , genes related to flower development and growth; Group 2 , genes related to light and circadian rhythm; and Group 3 , genes related to temperature response. Category II: Translocation signal transduction; defense response; and genes related to DNA methylation, transcription, RNA processing, and chromosome modification, including Group 4 , genes related to histone variants and chromosome modification; Group 5 , genes related to DNA methylation, transcription, and RNA processing; and Group 6 , genes related to signal transduction and transport. Category III: Primary metabolism genes related to secondary metabolism, including Group 7 , genes related to plant hormones; Group 8 , genes related to protein and lipid metabolism; and Group 9 , genes related to sugar metabolism. Category IV: Genes related to biological processes and unknown functions, including Group 10 , genes related to other processes or unannotated. Table 4. Changes of genes-alleles from the center of origin to the derived geographic regions and from the primary MG-set to expanded MG-sets.  Note: In the Trait column, GS: geographic sub-population. "O" represents the center of origin; "A" represents Northeast China, far-east of Russia, and southern Sweden; "B" represents the Korea Peninsular and Japan Islands; "C" represents Southeast Asia, South Asia, and Africa; "D" represents northern North America, southern North America, and Central and South America. MG: maturity group; "E" represents the early MG-set (MG 000~0); "P" represents the primary MG-set (MG I~VII); "L" represents the late MG-set (MG VIII~X Therefore, DSF shared more ADL and AAT contributions than DFM, or in other words, the three DFM-related gene systems involving the DFM traits are more independent from each other than those of the DSF-related traits. Furthermore, DSF shared more genes and PVs with AAT DSF than ADL DSF , while DFM shared more genes and PVs with ADL DFM than AAT DFM , indicating that AAT was more important in determining DSF and ADL was more important in determining DFM in global soybeans. The relative importance of ADL DFM to DFM length is a new concept that coincided with the report of Han and Gai [49]. In addition, the three DSF-related traits among the nine shared genes, Glyma06g23580, Glyma10g26450, and Glyma16g03320, are close to the confirmed E1, E4, and LHY1a, respectively, while Glyma01g22830, Glyma02g04190, Glyma08g03210, Glyma13g07110, Glyma17g08460, and Glyma17g09500 have not been reported in previous studies. There were 94, 96, and 88 trait-specific genes in DSF, ADL DSF , and AAT DSF , but they only explained 14.75%, 31.65%, and 10.49% of their respective PVs, thus these specific genes provided small contributions to their PV. For the three DFM-related traits, among the eight shared genes, Glyma19g34740 is close to the Dt1/TFL1b gene, while Glyma02g00371, Glyma07g40260, Glyma08g15870, Glyma12g06580, Glyma12g06950, Glyma13g09470, and Glyma13g25480 have not been reported in previous studies. There were 105, 86, and 96 trait-specific genes in DFM, ADL DFM , and AAT DFM , which explained 26.59%, 18.43%, and 28.33% of their respective PVs, that is, a little more than the contributions of DSF-related traits ( Table 2).

Differentiated Evolutionary Motivators in Geographic Adaptation and MG Expansion for Growth Period Traits of Global Soybeans
Based on the relatively thorough identification of genome-wide genes-alleles, the dynamic allele changes due to the different motivators were calculated from the five geographic submatrices (O, A, B, C, and D) and three MG-set submatrices (MG I-VII or P MG-set, MG 0-000 or E MG-set, and MG VIII-X or L MG-set) for the six traits. Here, the designations of the geographic and MG-set submatrices are listed in Table 4.
In adaptation to geographic regions, the allele changes showed a similar tendency between DSF, ADL DSF , and AAT DSF and DFM, ADL DFM , and AAT DFM . In the three DSF- In the MG-set expansion, the allele changes showed a similar tendency for DSF-and DFM-related traits, but were different from those in geographic adaptation. In the three DSF-related traits, there were 383-405 alleles in the primary MG-set. From the P MG-set to the E and L MG-sets, a dominant share of the alleles, 275-352, were passed down; a share of the alleles (47-109, about 12-27%) were excluded; and a few new alleles (0-1, approximately 0%) emerged. Altogether, the emerged alleles in E and L MG-sets were added to 1 allele with a positive effect, and the excluded alleles in E and L MG-sets were added to 21-30 alleles with 10-15 negative ones and 9-20 positive ones. In the three DFM-related traits, there were 362-384 alleles in the P MG-set. From the P MG-set to the E and L MG-sets, a dominant share of the alleles, 274-332, were passed down; a share of the alleles (52-92, about 13-24%) were excluded; and no new alleles emerged. Altogether, the emerged alleles in E and L MG-sets were added to 0, and the excluded alleles in E and L MG-sets were added to 23-41 with 9-15 negative ones and 10-26 positive ones ( Figure S1b,d).
The trends of the evolutionary gene-allele changes in DSF and DFM due to geographic region adaptation and MG-set expansion in the present study were consistent with ADL DSF and AAT DSF and with ADL DFM and AAT DFM . In both the geographic adaptation and MGset expansion, allele inheritance (or migration) was always the dominant part. However, in geographic adaptation, new allele emergence (or mutation) was a joint major dynamic motivator for individual adapted sub-regions and total sub-regions, while allele exclusion (or selection) appeared in individually adapted sub-regions, but not in the total adapted sub-regions, even almost without allele exclusion (only one allele). Meanwhile, in the MGset expansion, allele exclusion was a major dynamic motivator, even without new allele emergence, in all individual and total expanded MG-sets. Therefore, in the two evolutionary processes, new allele emergence was an active motivator for geographic adaptation, while allele exclusion was an active motivator for MG expansion. This is true for all six traits. In addition to allele inheritance, emergence, and exclusion, allele recombination based on the allele changes should also be an important and constant dynamic motivator, which is shown in the next section.
In both allele emergence and exclusion, the changed alleles might have positive or negative effects, which indicate that the actual mutation and selection were not unidirectional. Both negative and positive alleles were emerged or excluded regardless of geographic adaptation and MG-set expansion, but the final comprehensive results were consistent with the target of the breeding effort.

Genetic Recombination Potential of DSF and DFM and Their Required ADL and AAT in Global Soybeans
To evaluate the genetic recombination potential of the six traits, the 25th percentile and 95th percentile progenies in each of the 62,128 possible crosses (parents' pairs) among the 354 varieties were predicted based on linkage and independent assortment models. The linkage model means that the original linkages among the GASMs are reserved, while the independent assortment model means that the original linkages among GASMs are not considered. The predicted results (Table 5) showed a similar outcome for the six traits under the linkage model, in which a wide range of the predicted 25th and 95th percentile progenies, distributed among the 62,128 parental pairs, were much lower and higher than the extremes of the 354 varieties, respectively. These showed lower and higher parent transgressive segregations in 992-6352 crosses for 25th percentile segregation and 1631-27, 142 crosses for 95th percentile segregation in the six traits (Table 5, Figure 1j). Interestingly, for the six traits in the global population, the predicted results under the independent assortment model were similar to those under the linkage model. This indicates that no extra increment can be expected from further elimination of the linkage drags.  Considering the genetic structure of the accessions of the global population (Figure 1f), few accessions had their gene-allele structure composed of all alleles with the lowest (or highest) effects or all accessions in a complementary mode. As there were 124-141 genes identified for each of the six traits, the recombination potential among the large number of gene-alleles should be sufficiently rich, even more than the actual allele emergence or exclusion. This explains why significant progress has been made during the recent couple of centuries while the mutation rate has usually only been at the 10 −6 level.

Gene Functions of DSF and DFM and Their Required ADL and AAT in Global Soybeans
The identified genes are candidate genes before their functions have been demonstrated. A relatively thorough detection of the gene-alleles makes it possible to explore the functional composition of the gene system for a trait. According to SoyBase (https://www.soybase.org (accessed on 1 January 2021)), the identified 124-141 genes, out of a total of 661 genes of the six traits (some shared among traits) ( Figure S2), were annotated based on their functions, which were grouped further into a same set of four categories of 10 groups of biological processes for each trait, although different genes may involve a similar function among the traits. The detailed category-group list for each trait is presented in Tables 3 and S2-S8, Figure 2a. The total 661 genes were annotated as follows: Category I: Group 1 , 61 genes related to flower development and growth; Group 2 , 31 genes related to light and circadian rhythm; and Group 3 , 20 genes related to temperature response. Category II: Group 4 , 30 genes related to histone variants and chromosome modification; Group 5 , 112 genes related to DNA methylation, transcription, and RNA processing; and Group 6 , 88 genes related to signal transduction and transport. Category III: Group 7 , 64 genes related to plant hormones; Group 8 , 141 genes related to protein and lipid metabolism; and Group 9 , 29 genes related to sugar metabolism. Category IV: Group 10 , 240 genes related to other or unannotated processes. Each trait is defined by the above four categories of 10 groups of genes, including some with still unknown functions. This indicates that the six traits involved in the DSF and DFM responses to degree and duration of day-length and temperature were defined by different gene sets, but with similar category-groups of functions. The four gene ontology categories with their groups are as follows: Category I: Genes related to flowering, seed and stem development, or response to light and temperature stimulation, including Group ①, genes related to flower development and growth; Group ②, genes related to light and circadian rhythm; and Group ③, genes related to temperature response. Category II: Translocation signal transduction; defense response; and genes related to DNA methylation, transcription, RNA processing, and chromosome modification, including Group ④, genes related to histone variants and chromosome modification; Group ⑤, genes related to DNA methylation, transcription and RNA processing; and Group ⑥, genes related to signal transduction and transport. Category III: Primary metabolism genes related to secondary metabolism, including Group ⑦, genes related to plant hormones; Group ⑧, genes related to protein and lipid metabolism; and Group ⑨, genes related to sugar metabolism. Category IV: Genes related to biological processes and unknown functions, including Group ⑩, genes related to other processes or unannotated. (b) Part of the protein-protein interaction (PPI) network associated with ADL and AAT genes for DSF in the WSGP. (c) Part of the protein-protein interaction (PPI) network associated with ADL and AAT genes for DFM in the WSGP. The blue ball represents the top 10% of nodes of the betweenness centrality (BC) value. The four gene ontology categories with their groups are as follows: Category I: Genes related to flowering, seed and stem development, or response to light and temperature stimulation, including Group 1 , genes related to flower development and growth; Group 2 , genes related to light and circadian rhythm; and Group 3 , genes related to temperature response. Category II: Translocation signal transduction; defense response; and genes related to DNA methylation, transcription, RNA processing, and chromosome modification, including Group 4 , genes related to histone variants and chromosome modification; Group 5 , genes related to DNA methylation, transcription and RNA processing; and Group 6 , genes related to signal transduction and transport. Category III: Primary metabolism genes related to secondary metabolism, including Group 7 , genes related to plant hormones; Group 8 , genes related to protein and lipid metabolism; and Group 9 , genes related to sugar metabolism. Category IV: Genes related to biological processes and unknown functions, including Group 10 , genes related to other processes or unannotated. (b) Part of the protein-protein interaction (PPI) network associated with ADL and AAT genes for DSF in the WSGP. (c) Part of the protein-protein interaction (PPI) network associated with ADL and AAT genes for DFM in the WSGP. The blue ball represents the top 10% of nodes of the betweenness centrality (BC) value.

Advantages of GASM-RTM-GWAS in Exploring the Gene-Allele System and Gene Network
To explore the complete genetic system of a trait, a large germplasm population has to be used, especially for understanding the multiple allele composition. RTM-GWAS has been demonstrated to be powerful [32][33][34][35][50][51][52][53][54]. In the present study, it was an innovation to use the genomic marker GASM to replace another genomic marker, SNPLDB, to identify the gene-allele system directly to avoid inference of candidate genes from QTLs, while also merging the two-step process into a single step. As indicated above, GASM-RTM-GWAS was more powerful than SNPLDB-RTM-GWAS, in that more genes were identified compared with the results in [10], because the number of GASM-alleles better fit that of the causal genes owing to the extra-large SNPLDB segments and corresponding extra alleles being avoided. An additional advantage is that the GASM marker fits all populations that use the same reference genome, making the gene-allele results comparable among different studies.
The relative completeness and accuracy in identifying genes with their alleles from GASM-RTM-GWAS made possible the identification of further results on the population. These include the establishment of the gene-allele matrix to demonstrate the gene-allele structure of each variety and the whole population [37,55], population evolutionary genetic study through direct comparisons among the matrices of derived subpopulations, optimal cross prediction of the population and subpopulations [56], gene network exploration of the population, and the identification of major genes with their major alleles. Without the relatively thorough identification of the gene-allele system, these extended results are not possible. In addition, more genetic information can be obtained from the identified gene-allele system, such as the allele frequency changes in the evolutionary processes and G × E interactions of each gene-allele as well as the whole gene-allele system.
In addition, because the R 2 and probability (-lg(p) value) of all of the identified genes in global soybeans were explored relatively thoroughly using GASM-RTM-GWAS, the importance of the identified genes and their alleles in the gene-allele system was evaluated [30]. For example, in the present global soybean population, 141 and 135 DSF and DFM genes were identified, respectively, and the individual phenotypic contributions were evaluated. Among these genes, six "E" genes (E1, E2, E3, E4, E9, and E10 [43,44,46,[57][58][59]), DT1 [17], J [16], and FT [13] family genes were found near the present detected genes; especially, Glyma06g23580 for DSF was close to E1. Likewise, Glyma04g07430 and Glyma04g04810 were near the J gene, and Glyma19g34740 was near the Dt1 (TFL1b) gene. However, some of the previously identified genes were not found to be important in the present global soybean germplasm population; therefore, the present results may have provided a check for further evaluating the relative importance of the previously identified DSF and DFM genes/alleles, which was impossible owing to the lack of multiple allele information.

Comparison of Gene-Allele Matrices between Ancestral and Filial Subpopulations as a New but Simple Approach in Exploring Evolutionary Drives
In the present study, the active factors of geographic adaptation and MG-set expansion included new allele emergence and old allele exclusion in addition to allele inheritance or migration. In each geographic region, new alleles emerged, with some being shared among the regions, and the total emerged alleles in the extended regions were less than the sum of the extended complex regions for all six traits, but old alleles were excluded in each region, with the sum of all regions being almost zero or no excluded alleles for the extended complex regions in comparison with those in the O region (ABCD vs. O in Table 4). Likewise, in each new MG-set, no new allele emerged, with the total emerged allele being zero, while old alleles were excluded in each MG-set, with some overlapping in new MG-sets, and the total excluded alleles were less than the sum of the total new MG-sets for all six traits (EL vs. P in Table 4). About 90% of alleles in the extended regions or expanded MG-sets were inherited from the O region or primary MG-set, while 0-11% of alleles emerged or were excluded, which caused the soybean to adapt to all environments around the world and new MG-sets to form. However, the phenotypic changes, such as DSF ranging from 26.4 to 70.6 d (37.0 to 68.5% changes) among the eco-regions and 23.0 to 77.5 d (41.2 to 98.2% changes) among MG-sets (Table S1, Figure S3), were not consistent with the allele changes of 0-11% (including both positive and negative alleles). Therefore, allele recombination based on allele changes is an important genetic drive, especially when accompanying gene interactions. Therefore, in addition to allele inheritance (migration), the population evolution motivator was not due only to new allele emergence, but to new allele emergence plus old allele exclusion and allele recombination. New gene interactions might form, which implies remarkable potential for evolution in a self-pollinated germplasm population (in a cross-pollinated population, the population genotypic structure maintains equilibrium depending on gene frequencies).
The above results were obtained from comparisons among gene-allele matrices between the original subpopulation (O for geographic region and P for MG-set) and the derived subpopulations, which were based on a relatively thorough identification of genes with their alleles through GASM-RTM-GWAS. This analysis would have been impossible without the relatively complete gene-allele identification. In previous studies on evolutionary mechanisms, the comparison of gene frequencies was usually used, but the emerged and excluded genes with their alleles were not detected, and thus were not included in those comparisons. In other words, the important events/information in evo-lution studies were neglected. Therefore, the comparison of gene-allele matrices among related consecutive populations based on GASM-RTM-GWAS is a new and exact, but simple approach to explore evolutionary drives or evolutionary mechanisms [10,37,51]. Furthermore, in addition to new allele emergence and old allele exclusion based on GASM-RTM-GWAS, the changes in allele frequency can be evaluated to explore natural vs. artificial selection functions.

Gene Interaction Network and Important Gene-Alleles of the Growth Period Eco-Traits of Global Soybeans
To understand how genes with different functions work together, the identified genes combined with the information collected from the STRING data center [60] were used for PPI analysis [40]. The 174 DSF, ADL DSF , and AAT DSF genes demonstrated PPI effects, forming four PPI networks; three were completely connected with obvious node genes, while one was partially connected but without an obvious node gene ( Figure 2b). However, the 175 DFM, ADL DFM , and AAT DFM genes formed a larger network containing 145 (82.9%) genes that were completely connected with nodes, while the remained genes were partially connected without a node as another separate network (Figure 2c). This indicates that the regulatory modes were different between DSF and DFM, but both regulation patterns showed similar characteristics.
In the PPIs of DSF and DFM, seven and nine genes were related to light and circadian rhythm, respectively, indicating that light and circadian rhythm played an important role not only in DSF, but for both. While 12 genes related to temperature response were in the DSF PPI, only 5 genes were in the DFM PPI, indicating that temperature was important in the whole growth process, but was more important in DSF. In addition, the other genes in the PPIs of DSF and DFM were the protein-and lipid-metabolism-related genes and DNA methylation, transcription, and RNA-processing-related genes, which accounted for a large portion, followed by plant-hormone-related genes and signal-transduction-and transportation-related genes.
In addition, the genes with high betweenness centrality [61] (BC, the measure of node/hub importance) values were important to PPI. In DSF, the top five genes with high BC (Glyma03g40780, Glyma03g0239, Glyma06g13320, Glyma02g45790, and Glyma04g16180) were involved in genes related to protein and lipid metabolism; light and circadian rhythm; temperature response; DNA methylation, transcription, and RNA processing; signal transduction; and transport. In DFM, the top five high BC genes (Glyma12g35580, Glyma13g22420, Glyma18g05730, Glyma18g01330, and Glyma20g27950) were mainly involved in histone variants and genes related to protein and lipid metabolism. These results suggest that the genetic systems of the six growth-period-related traits are a complex gene system composed of different gene functions or networks, which should be further explored (Figure 2b,c).
For the identification of important genes and their alleles from the 661 genes and 1876 alleles, the alleles that can meet two or more of the following criteria were considered: (i) genes with a large contribution to PV (R 2 ≥ 1.5%), (ii) genes with a top 10% BC score in the PPI network, (iii) genes with new emerged alleles, and (iv) genes shared among traits. For DSF, ADL DSF , and AAT DSF , 62 alleles from 20 genes were nominated, including 35 inherited and 27 emerged alleles. Among DFM, ADL DFM , and AAT DFM , 31 alleles from 12 genes were nominated, including 22 inherited and 9 emerged alleles. It is evident that the nominated important genes and alleles were from a global soybean population, and some of the previously reported genes (but without allele information) were included in the global nomination, while most of the previously reported genes were not included. Therefore, the presently nominated genes and alleles are particularly worth studying further (Table S9).

Plant Materials and Field Experiments
A total of 354 soybean varieties from 27 countries were selected as a representative sample of the world soybean germplasm population (WSGP) from the germplasm storage at the National Center for Soybean Improvement, Nanjing Agricultural University, Nanjing, China. According to a previous study by Liu et al. [4], the source of the materials was grouped into five regions, coded as O, A, B, C, and D (please refer to Table 4 notes for the detailed codes of the sub-regions). "O" represents the materials that came from the center of origin of the soybean in China, including HCHN and SCHN; "A" represents the materials from the northern dissemination route, including NCHN, RUFE, and SSWE; "B" represents the materials from the eastern dissemination route, including KORP and JPAN; "C" represents the materials from the southern dissemination route, including SEAS, SASI, and AFRI; and "D" represents the materials from the Western dissemination route, including NNAM, SNAM, and CSAM. The MG-set system of soybeans was first established in 1944 when MG I-MG VII was defined, and then the earlier MGs (MG 000-MG 0) and later MGs (MG VIII-MG X) were developed and added to the system. The soybean accessions include MG 000-X, with 13 MGs altogether [11], which were grouped into three MG-sets and designated as E (000-0), P (I-VII), and L (VIII-X) MG-sets. The

Measurement of Growth Period Traits and Their Required ADL and AAT
According to Fehr and Caviness [62], the emergence date (Ve), first flowering date (R1), and maturity date (R8) were recorded for all of the tested samples in the experiments, from which DSF and DFM were calculated.
Th day-length and temperature data were obtained from the Public Service of Nanjing Meteorological Bureau. The daily maximum and minimum temperature were used to calculate the daily average temperature, from which the daily active temperature was calculated as the accumulated daily average temperature over all of the days, except for those days whose average temperature was less than 10 • C. Likewise, the daily sunrise and sunset time were obtained, from which the day-length was calculated from their difference.
The required ADL and AAT for DSF and DFM were calculated from the summation of daily day-length and daily active temperature, which were designated as ADL DSF and AAT DSF and ADL DFM and AAT DFM , respectively.

Statistical Analysis
For DSF and DFM and their required ADL and AAT of the four environments, eight blocks were calculated using SAS/STAT 9.4 software package (SAS Institute Inc., Cary, NC, USA). The PROC UNIVARIATE was used to perform descriptive statistical analysis of the traits and the significance between the subpopulations was calculated using the t-criterion. PROC GLM was used to perform variance analysis and regression analysis for the total data. The statistical model for analysis of variance is as follows: y ijk = µ + t i + g k + r j(i) + (gt) ik + ε ijk , where µ is the population mean, g k is the effect of the kth genotype, t i is the effect of the ith environment, r j(i) is the jth block effect in the ith environment, (gt) ik represents the G × E effect between the ith genotype and jth environment, and ε ijk is the residual. The heritability was calculated as follows: h 2 = σ 2 g / σ 2 g + σ 2 ge /n + σ 2 ε /rn , where σ 2 g is the genetic variance, σ 2 ge is the variance of G × E, σ 2 ε is the error variance, n is the number of environments, and r is the number of replications in each environment. The genetic coefficient of variation is calculated as GCV = σ g /µ, where µ is the population mean [63,64].
In testing the significance of allele exclusion, the binomial probability of sampling error is estimated as P 0 = C n 0 p 0 q n , where P 0 is the sampling probability of all individuals without this allele in the subgroup; p = allele probability of the excluded allele, q = 1 − p, n = subgroup size. The p-value here is estimated using the allele frequency of the old subgroup of a specific allele [10].

SNP Genotyping and GASM Assembly
The restriction-site-associated DNA sequencing (RAD-seq) was conducted at BGI Tech, Shenzhen, China for genotyping the WSGP, which has been reported in detail by Liu et al. [4]. A total of 97,706 SNPs were confirmed for the 354 soybean varieties. All SNPs in a gene were assembled directly into GASM based on the reference genome Williams 82.a1.v1.1 (http://www.soybase.org (accessed on 1 January 2021)) [65]. The GASMs were generated using RTM-GWAS software with the haplotype/allele MAF less than 0.01 being superseded with a highest frequency haplotype/allele [30]. In total, 7801 GASMs with 18,111 haplotypes/alleles, each GASM with 2-8 haplotypes/alleles, in an average of 2.32 /GASM were obtained in the WSGP ( Figure S4).

Identification of Gene-Alleles of DSF and DFM and Their Required ADL and AAT Using GASM-RTM-GWAS
Based on the GASMs, the RTM-GWAS procedure [30] was used to identify the geneallele systems of DSF, DFM, ADL DSF , AAT DSF , ADL DFM , and AAT DFM . At both stages, the top 10 principal vectors of the genetic similarity coefficient matrix were constructed based on the GASMs and were used as covariates for the correction of the population structure bias. At the first stage of RTM-GWAS, using the general linear regression in the single-locus model, 6530, 6507, 6574, 6193, 6529, and 6550 GASMs were pre-selected at the significance level of p < 0.05 for the six traits, respectively. At the second stage of RTM-GWAS, the stepwise regression featured the forward selection and backward elimination under the multi-locus model and identified 141, 135, 130, 130, 124, and 129 genes with 406, 390, 384, 384, 364, and 362 alleles, respectively, for a total of 789 genes and 2290 alleles, or 661 genes with 1876 alleles when duplicates were not included for the six traits. As this experiment is used for population genetic research, it is necessary to excavate the complete gene system of relevant traits as much as possible, and the default significance level of p < 0.05 is still used in the second step. If you want to obtain relevant genes for gene-cloning-related experiments, you can use the significance level after Bonferroni correction (a/m) as the significance level of each step in multiple stepwise regression to identify significantly associated genes, where a and m are significance level (0.05) and the number of candidate markers, respectively. The RTM-GWAS software was publicly obtained from https://github.com/njau-sri/rtm-gwas or https://gitee.com/njau-sri/rtm-gwas (accessed on 1 January 2021) [30]. The identified gene system was annotated according to SoyBase (http://www.soybase. org (accessed on 1 January 2021)) based on the reference genome Wm82.a1.v1.1, from the following databases: GO (Gene Ontology, http://geneontology.org/docs/downloadontology/ (accessed on 1 January 2021)) and TAIR (Arabidopsis Information Resource, http://www.arabidopsis.org/ (accessed on 1 January 2021)).

Analysis of Genetic Motivators in Geographic Adaptation and MG-Set Expansion of Global Soybeans
To reveal the gene-allele changes and evolutionary motivators of the six eco-traits from the center of origin to the various geographic regions and from the primary MG-set (I-VII) to the emerged-early MG-set (000-0) and emerged-late MG-set (VIII-X), the whole population gene-allele matrix for each trait was separated into their component submatrices. The inherited, excluded, and emerged alleles were calculated and compared directly to those of the center of origin and of the primary MG-set to evaluate the relative importance of the evolutionary motivators.

Prediction of Genetic Recombination Potentials of DSF and DFM and Their Required ADL and AAT in Global Soybeans
Based on the gene-allele matrix, the possible parent crosses with their progenies among all the accessions were simulated. A total of 2000 homozygous progenies were calculated under the linkage and independent assortment models [30]. The 25th and 95th percentile of a cross were used as the recombination potential indicator for comparisons between the crosses. To determine the transgressive crosses, low-parent heterosis and high-parent heterosis were calculated as LPT = F 25th −LPV and HPT = F 95th −HPV, respectively, where F 25th and F 95th are the 25th and 95th percentile value of a cross population, respectively, while LPV and HPV are the low and high parent values, respectively.

Gene Functional Annotation and Gene Network Analysis
The identified genes were annotated by GO analysis for their functional groups according to SoyBase (http://www.soybase.org (accessed on 1 January 2021)). To explore the gene interactions, protein sequences of the genes were obtained from the phytozome database (http://phytozome.jgi.doe.g.,ov (accessed on 1 January 2021)), then STRING (http://string-db.org (accessed on 1 January 2021)) [66] was used to generate PPI networks [67]. In the PPI network analysis, proteins are described as nodes/hubs and edges. The importance of each node is expressed by its centrality, including betweenness, degree, and closeness, whereas the edge is usually undirected and unweighted [61,[68][69][70].

Conclusions
The DSF and DFM of the global soybean germplasm population were traced to their required accumulative day-length and active temperature to explore their eco properties. (i) Through an improved genome-wide association study with gene-allele sequence as markers (GASM-RTM-GWAS), the six gene-allele systems for DSF, DFM, ADL DSF , AAT DSF , ADL DFM , and AAT DFM (124-141 genes with 362-406 alleles per trait) were explored. DSF shared more ADL and AAT contributions than DFM, while AAT contributed more than ADL to DSF, but vice versa for DFM. (ii) The genetic adaptation from the origin to the geographic sub-regions was characterized by allele emergence (mutation), while genetic expansion from primary maturity group (MG)-sets to early/late MG-sets featured allele exclusion (selection) without allele emergence in addition to inheritance (migration). (iii) Optimal crosses with transgressive segregations in both directions were predicted for breeding purposes, and allele recombination in soybean is an important evolutionary drive. (iv) Genes of the six traits were mostly trait-specific involved in four categories of 10 groups of biological functions. Therefore, GASM-RTM-GWAS showed potential in directly detecting causal genes with their alleles, identifying differential trait evolutionary drives, predicting recombination breeding potentials, and revealing population gene networks.