Analyzing Medicago spp. seed morphology using GWAS and machine learning

Alfalfa is widely recognized as an important forage crop. To understand the morphological characteristics and genetic basis of seed morphology in alfalfa, we screened 318 Medicago spp., including 244 Medicago sativa subsp. sativa (alfalfa) and 23 other Medicago spp., for seed area size, length, width, length-to-width ratio, perimeter, circularity, the distance between the intersection of length & width (IS) and center of gravity (CG), and seed darkness & red–green–blue (RGB) intensities. The results revealed phenotypic diversity and correlations among the tested accessions. Based on the phenotypic data of M. sativa subsp. sativa, a genome-wide association study (GWAS) was conducted using single nucleotide polymorphisms (SNPs) called against the Medicago truncatula genome. Genes in proximity to associated markers were detected, including CPR1, MON1, a PPR protein, and Wun1(threshold of 1E−04). Machine learning models were utilized to validate GWAS, and identify additional marker-trait associations for potentially complex traits. Marker S7_33375673, upstream of Wun1, was the most important predictor variable for red color intensity and highly important for brightness. Fifty-two markers were identified in coding regions. Along with strong correlations observed between seed morphology traits, these genes will facilitate the process of understanding the genetic basis of seed morphology in Medicago spp.

seed characteristics and their genetic determinants 14 .To have a comprehensive idea of alfalfa seed morphology, 244 alfalfa germplasms in addition to 23 other Medicago spp.were evaluated for various seed morphological traits such as seed area, length, width, length-to-width ratio (LWR), perimeter, circularity, the distance between the intersection of length and width (IS) and center of gravity (CG), and seed darkness & red-green-blue (RGB) intensities (86,294 seeds for seed size-related traits and 25,440 seeds for color-related traits).We analyzed correlations within the collection to explore potential relationships between seed morphological traits.Building upon the phenotypic data collected of each trait, we combined it with 8,565 single nucleotide polymorphisms (SNPs) across the Mt5.0 reference genome of M. truncatula to conduct GWAS 15 .Tracing the top candidate SNPs to the reference genome, we identified candidate genes possibly directly connected with seed morphology.Furthermore, validation of the GWAS results and identification of additional marker-trait associations within coding sequences and the entire genome was accomplished using machine learning models.

Materials and methods
Seed phenotypic evaluation in Medicago spp.
The plant Medicago spp.genetic resources used in the described research were provided, free of cost, by the USDA ARS National Plant Germplasm System, Plant Germplasm Introduction and Testing Research Unit following request submission and justification through the publicly accessible GRIN-Global online platform 16 .Overall, we obtained 318 accessions of Medicago spp., including 244 Medicago sativa subsp.sativa (alfalfa) and 23 other species from the germplasm collection (complete list available in Supplementary Data S1).
Seeds were evaluated for seed area size (mm 2 ), length (mm), width (mm), LWR, perimeter (mm), circularity (0-1 range, 0: not circular to 1: complete circle), the distance between the intersection of length & width (IS) (points where the width and length hit the boundary for the seed parameter) and center of gravity (CG) (the point where the seed's mass is concentrated) and seed darkness & brightness/RGB values (0-255 range, 0: absence of light and highest intensity of RGB colors and darkness to 255: full light), following the methodology of previous studies 13,17 .Briefly, seed images (~ 100 seeds per accession) were acquired with Canon imageRUN-NER ADVANCE C7270 (Canon Inc, Tokyo, Japan), followed by analysis through the SmartGrain (version 1.3) high-throughput phenotyping software.The seed image analysis was conducted for seed area size, length, width, LWR, perimeter, circularity, and distance between IS and CG 17 .For every image, any errors produced by the SmartGrain program were fixed manually.Seed darkness and brightness/RGB values were obtained using a multi-point function in ImageJ version 1.54d 18 .Overall, 86,294 seeds for seed size-related traits and 25,440 seeds for color-related traits were analyzed.

Statistical analysis
Statistical analysis followed the methods described by Ahn et al. 13 .To compare all possible accession pairs for each trait, Tukey's HSD test was employed using JMP Pro 15 (SAS Institute, Cary, NC, USA).Additionally, one-way ANOVA was performed for each trait.To explore potential correlations between the traits, Pearson's correlation coefficients were calculated for all possible pairs using JMP Pro 15.We also utilized principal component analysis (PCA) and clustering variables analysis with the phenotypic data from all traits using JMP Pro 15.

SNP identification
For the genotyping of each accession, we obtained genome sequencing reads from 189 phenotyped accessions generated through genotyping-by-sequencing (GBS), sourced from the NCBI SRA database under the Bioproject PRJNA287263 19 .The Mt5.0 reference genome of Medicago truncatula (GCF_003473485.1)was downloaded from NCBI (https:// www.ncbi.nlm.nih.gov/ datas ets/ genome/).The fastq reads were aligned to the reference genome using the Bwa-mem2 program with the 'mem' method.Subsequently, the alignments (BAM files) were merged into a single BAM file using the SamTools v1.17 20 .SNP variants were called as diploid using freebayes v1.3.6 21 , with the following parameters: -min-base-quality 10, -min-supporting-allele-qsum 10, -read-mismatch-limit 3, -min-coverage 5, -min-alternate-count 4, -mismatch-base-quality-threshold 10.The SNPs were further filtered based on the following criteria: SNPs should be biallelic, have a missing genotype rate of 10% or less, exhibit a minor allele frequency of at least 0.05, and have a read supporting the allele of at least 5. Finally, a single variant call format file with 8565 SNPs for 189 accessions was generated for downstream analyses.This population with 189 accessions exhibited an expected heterozygosity of 0.285.For accessions with available genotype data, the population structure and genetic diversity was evaluated by Zhang et al. 19 .

Genome-wide association study
We conducted genome-wide association analysis using the R-package, Genome Association and Prediction Integrated Tool (GAPIT) version 3 22 by employing the method known as settlement of mixed linear models under progressively exclusive relationship (SUPER) 23 .SUPER method uses the framework of a standard MLM approach.An MLM can be described using Henderson's matrix notation as follows: where y is the vector of observed phenotypes; v is the marker effect; β is an unknown vector containing fixed effects, including the population stratification (Q) and the intercept; u is a vector of size n (number of individuals) of random additive genetic effects; W, X, and Z are the known design matrices for marker, fixed and random effects, respectively; and e is the unobserved vector of residuals.The u and e vectors are assumed to be normally distributed with zero mean and G and R covariance matrices, respectively.G = 2Kσ 2 a , where K is a known kinship matrix with element K i,j (i, j = 1, 2, . . ., n) calculated from genetic markers, and R = Iσ 2 e , where www.nature.com/scientificreports/I is the identity matrix and is the unknown residual variance.Population stratification was corrected by principal component analysis (PCA).The optimal number of principal components was determined through a Bayesian information criterion-based analysis within GAPIT.For the association analysis in GAPIT, the following parameters were used: PCA.total = 3 and model = SUPER while leaving all other settings at their default values.Significant associations of SNPs with the traits of interest were identified using a threshold of − log10(p-value) > 4.
Subsequently, the associated regions in the genome were defined as extending 10 kb upstream and downstream of the significant SNPs.Within these regions, candidate genes were identified based on the gene annotation of the Mt5.0 reference genome of Medicago truncatula (GCF_003473485.1).

Machine learning
The VCF file was converted to numeric format using the `gt2num` function from the bwardr R package (v1.0, https:// github.com/ etnite/ bwardr).One hundred eighty-eight monomorphic markers were removed using the `nearZeroVar` function from the R package caret v.6.0-94 24.The genotypic matrix was merged with 11 phenotypic traits to run three machine learning methods: random forest (RF), support vector machine with the linear kernel (SVM), and extreme gradient boosting with the linear kernel (XGB).All models were trained using a resampling with ten-fold cross-validation, and the root mean squared error (RMSE) was used to select the optimal model based on the smallest value.Importance scores for predictor variables (markers) were calculated on trained models using the `varImp` function, and the values were scaled from 0 to 100.

Seed morphologies
Two-tailed ANOVA detected significant differences (p < 0.0001) for all evaluated seed traits among the Medicago spp.accessions (raw data available in Supplementary Data S1).The extensive morphological variations are shown in Fig. 1, presented in scatter plots of seed area, circularity, and brightness across Medicago spp., and highlight both the substantial diversity between species and the wide range within alfalfa itself.Further illustrating this point, Table 1

Correlations among the traits
Pearson's correlation analysis found intercorrelations among most evaluated seed morphology traits, with a few notable exceptions.Blue color intensity stood apart, showing no significant correlations with area, perimeter, and length (Fig. 4 & Table 2).Similarly, IS and CG displayed no significant correlation with brightness, red, and green color intensity.Brightness, red, and green color intensities show moderate correlations with the three seed shape-related parameters: Area, perimeter, and length, indicating a meaningful separation between blue and other colors, as shown in Fig. 5.A PCA plot encompassing all traits unveiled that two major PCs (PC1 & PC2) capture 81% of the overall variation (Fig. 5).Circularity was distinctly separated from the other two major groups.The partial contribution of variables plotted in Fig. 6 indicates that PC1 mainly comprises area size, perimeter, length, and width.Both color intensity-and seed morphology-related traits contributed to PC2.A majority of PC3 consists of LWR and circularity.Likewise, clustering analysis identified three clusters based on seed size, color, and shape (Table 3).

Genome-wide association studies
Because of the mixed ploidy among different species of Medicago included in this study, the SNP variants were called diploid, and genome-wide association studies (GWAS) were conducted on 189 accessions using 8,565 biallelic SNPs.GWAS identified five SNPs exceeding the significance threshold of − log10(p-value) = 4. Five different markers were associated with three different traits (Table 4).Markers S2_9100992 and S2_9101010, separated by 18 bp, were associated with the trait distance between IS and CG (IS to CG).These two markers were located in the gene MtrunA17_Chr2g0289631.Two markers (S4_29996363 and S6_30969602) were associated with seed width.Marker S4_29996363 was in a region coding for a vacuolar fusion protein, MON1, and marker MtrunA17_Chr6g0476151 was in a region coding for a tetratricopeptide-like helical domain protein.Finally, marker S7_33375673, associated with red color intensity (red) was located 99 bp upstream of the coding region for a wound-induced protein, Wun1.Manhattan plots for these three traits are presented in Supplementary Fig. 1.
The PCA plot based on GBS data showed a single main cluster, indicating limited population differentiation (Supplementary Fig. 2).Linkage disequilibrium plots of top SNPs show that linkage was significant (R 2 > 0.3) in a continuous region of approximately 9 kb surrounding marker S4_29996363, the top SNP for seed width (Supplementary Fig. 3. Interestingly, this SNP occurred in the intronic region of MON1, which occupies the majority (~ 7.1 kb) of the linked region.

Machine learning
The importance scores predictor variables (markers) among three machine learning models, and 11 phenotypic traits were compared using Pearson correlation, identifying three clusters of traits highly correlated by their importance scores calculated with SVM (Fig. 7a).Cluster one groups importance scores for brightness, blue, green, and red color intensity; cluster two groups length-to-width ratio and circularity; and cluster three groups area size, length, perimeter length, and width (Fig. 7b).The importance scores for predictor variables (markers) were calculated and compared with markers associated with GWAS.Marker S7_33375673, associated with red color intensity (Red) by GWAS, was the most important predictor variable for the RF and SVM models (Table 5).
The importance scores of the other markers were moderate, averaging around 38.8.However, importance scores for markers associated with GWAS were similar across different traits.For instance, marker S7_33375673 had an importance score of 89.3 and 89.2 with RF and SVM for brightness.
The relative importance of predictor variables was filtered to retain markers with an importance score greater than 70.In total, 301 markers were selected, with the number of markers per trait ranging from 9 to 49.Among these, 117 markers were identified uniquely, while 184 markers were identified multiple times by different models and/or for different traits (Supplementary Table 1).By model, SVM had the highest number of markers retained (182), while XGB had the lowest (54), and by trait, circularity had the highest number of markers retained (43), while width had the lowest (6) (Supplementary Table 2).Twenty-nine markers with relative importance > 70 by more than one model were selected as highly important predictor variables.For example, marker S8_46680823 had a relative importance greater than 70 for area size, length, distance between IS and CG, and perimeter length, identified by different models (Supplementary Table 3).Interestingly, marker S8_46680823 is located in the coding region of an RNA helicase (MtrunA17_Chr8g0389051). Fifty-two unique markers were located in coding regions and were annotated using Mt5.0 reference genome of Medicago truncatula (Supplementary Table 4), identifying six transcription factors, four genes with glycosylation signaling function, and eight related to plant development.

Discussion
Studies emphasize the importance of seed size and shape in breeding for improved yield and quality.In plant breeding, it is crucial to determine the size and shape of seeds, as well as their correlation, to enhance yield or quality 10 .The priority in crop domestication has been on large seeds for a long time, as they have been established as crucial components for high yield 25,26 .Pattern analysis showed that seed weight significantly correlated with multiple traits such as seed perimeter, length, width, and thickness in Sophora moorcroftiana 27 .Along with seed size and weight, seed color plays an important role in crops.Research in wheat indicates that variations in seed color may arise from genetic diversity 28 .Brown seeds' germination parameters and seedling performance in alfalfa were significantly lower than green and yellow seeds under different salt stress 9 .Also, brown seeds were less resistant to salt stress 9 .In addition, seed shape is an important trait often used in plant identification and classification 10 .A recent study in sorghum identified potential correlations between seed shape (circularity) and seed size (area size, length, and width) 13 .Alfalfa [Medicago sativa (L.) subsp.sativa], a member of the Fabaceae, has been cultivated as a forage crop since the beginning of recorded history 29 .Though extensively studied for traits like flowering time 30 , forage quality 31 , abiotic stress response 32 , and leaf size 33 , alfalfa's seed morphology-associated traits and their correlations have received less attention.Given alfalfa's critical role as a forage crop, deciphering the genetic mechanisms shaping seed architecture necessitates a deep understanding of seed morphology and its link to molecular markers.To address this gap, we explored 11 essential seed morphology traits in 318 accessions of Medicago spp., focusing on alfalfa.We identified significant phenotypic variations (Figs. 1, 2, 3 and Table 1) and numerous correlations among traits (Figs. 4, 5, 6 and Tables 2, 3).Cluster analysis of seed morphology-related traits (Table 3) revealed three distinct clusters representing seed size, color, and shape.The partial contribution of these traits to principal components (Fig. 6) further highlighted similar patterns of correlations.Notably, these results closely align with sorghum seed morphology, suggesting the universality of correlations between seed morphology-associated traits across diverse plant species.We detected deviations of blue color intensity from brightness, red, and green, but the underlying cause (genetic diversity or technological limitations) remains unclear.
The GWAS analysis revealed five markers with − log10(p-values) greater than 4 associated with three traits.For seed width, GWAS identified markers S4_29996363 and S6_30969602.Marker S4_29996363 was located in a gene coding for a vacuole fusion protein, MON1.Vacuoles, which can occupy up to 90% of plant cell volume, play crucial roles in plant growth, development, and various functions such as storage, transport, and stress response 34 .MON1 interacts with calcium caffeine zinc sensitivity1 protein in processes such as vacuolar trafficking, vacuole biogenesis, and plant growth 35 .Marker S6_30969602 was located within a gene annotated as a tetratricopeptide-like helical domain protein.While much has been learned about the molecular functions of these proteins over the past decade, including their roles within the cell and their physiological functions during plant growth and development 36 , there are no reports of this gene being associated with seed width.Additionally, the GWAS identified the marker S7_33375673 associated with seed red intensity.This marker was located 99 bp upstream of a gene annotated as Wound-induced protein, Wun1.Wun1 plays an important role in plant responses to diverse stresses, including wounding (biotic) and high salt (abiotic) 37 .
While several advances have been made to increase statistical power and reduce computing time in GWAS, all models are based on mixed linear models (MLMs).In this study, we employed the Settlement of MLM Under Progressively Exclusive Relationship (SUPER) method for GWAS.The SUPER method divides the genome into equal-sized segments to select the most influential SNP, considered as a pseudo quantitative trait nucleotide (QTN).These pseudo QTNs are exempt from the kinship matrix (K).Since the K is derived from a smaller number of markers compared to the K in MLM, which is derived from all markers, the confounding between K and some markers becomes more severe 23 .However, despite this method, we only identified five markers associated with three out of 12 traits.Several possible explanations exist: 1) MLMs in GWAS test each SNP separately to capture the main genetic effect, neglecting the minor effects of markers, 2) GWAS cannot capture SNP-SNP interactions as it only tests for the marginal effects of SNPs, overlooking the contribution of multiple markers to phenotypic response, and 3) GWAS requires control of type I errors through stringent methods such as Bonferroni or false discovery rate corrections, which may lead to underpowered detection of candidate markers.On the other hand, machine learning (ML) models applied to GWAS can enhance the detection of markers associated with traits of interest.ML has the capability to capture the combined minor effects of multiple genetic markers, allowing for the development of multi-locus methodologies that consider all SNPs in the model to estimate the importance scores of predictor variables (SNPs) 38 .In this study, we implemented three ML models, confirming the significance of marker S7_33375673 not only for seed red intensity but also for seed brightness.Additionally, we identified 29 markers with relative importance greater than 70 by more than one model, thereby expanding the repertoire of markers and candidate genes related to seed-related traits.Among the ML models, Support Vector Machines (SVM) exhibited lower shrinkage of estimated importance for predictor variables, representing 60% of the SNPs with an importance score greater than 70.In summary, machine learning can enhance the ability to detect new genetic associations with various traits, addressing the challenges posed by the complex genetic architecture of quantitative traits.The identified candidates are promising leads due to their known functions related to plant development.However, the GWAS analysis conducted in this study was limited by the availability of genotypic data due to the relatively low sequencing data coverage.In M. truncatula, the mean linkage disequilibrium has been shown to extend to less than 5 kilobases 39 .We identified 8,565 SNPs, meaning variants with potentially significant associations with seed morphology traits could have been missed in our analysis.While selfing was used to produce the seeds, their tetraploid nature may still introduce some degree of genetic heterogeneity compared to seeds from diploid plants, which could reduce the power of the GWAS analysis.To address this potential limitation, we employed a larger sample size (~ 100 seeds per accession) to increase the statistical power of the analysis.On the other hand, machine learning models identify highly important SNPs related to their phenotypic traits.This highlights the importance of including machine learning models in association studies.Future studies will investigate the associations between seed morphology traits and molecular markers through machine learning to expand the identification of genetic targets in alfalfa to improve the yield and quality of this vital forage crop.

Conclusion
In this study, we explored seed morphological diversity and correlations between multiple traits among Medicago spp., focusing primarily on alfalfa.We evaluated seed morphology traits in a diverse panel of 318 Medicago spp.accessions and present a comprehensive set of phenotype data.High phenotypic diversity and extensive correlations were identified among the tested accessions across the traits.PCA and clustering variables were generated to uncover commonalities between phenotypes, mainly revealing three groups: Seed size, shape, and color.Among seed colors, blue intensity exhibited minor differences from the others.GWAS analysis based on morphological traits identified multiple SNPs potentially associated with seed morphologies.Candidate genes for seed morphology trait IS and CG, seed width, and red intensity included CPR1, MON1 and a PPR protein, and Wun1, respectively.Some GWAS-associated markers were further validated with ML models, such as marker S7_33375673, which was upstream of wound-induced protein 1 (Wun1).Finally, a number of additional markers were identified using ML models, which shows the value of applying algorithms to assess marker association with complex traits.Further research is therefore warranted to elucidate the specific genes underlying seed morphological diversities in Medicago spp.Table 5.Comparison of the relative importance of markers associated with GWAS.Importance scores for markers associated with GWAS were calculated using three machine learning models: random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB).The importance scores for predictor variables were scaled from 0 to 100, and only values greater than 10 were included.GWAS values correspond to − log10(p-values).Traits tested included distance between IS and CG (IS to CG), length-to-width ratio (L-W ratio), perimeter length (Per length), and blue, green, or red color intensity (Blue, Green, or Red).

Figure 1 .
Figure 1.Scatter plots of seed area, circularity, and brightness across Medicago spp.(a) M. scutellata has the largest seed area, followed by M. ciliaris, M. arborea, and M. bonarotiana.Conversely, M. sativa subsp.caerulea and M. monspeliaca exhibit the smallest seed area across tested Medicago spp.(b) M. orbicularis displays the most circular shape followed by M. cancellata and M. sativa subsp.falcata.In contrast, M. truncatula, M. murex, and M. littoralis have the least circular shapes.(c) PI 619,434 (M.sativa subsp.sativa) is the brightest accession across all Medicago spp., followed by M. truncatula, M. littoralis, M. arabica, and other M. sativa subsp.sativa accessions.M. ciliaris and M. muricoleptis present the darkest seeds.

Figure 2 .
Figure 2. Comparison of seed area sizes for accessions PI 197,356, PI 287,999, PI 386,287, and PI 604,218.(a) PI 197,356 (Medicago scutellata) has the largest seed area size among all evaluated Medicago spp.(b) PI 287,999 (Medicago monspeliaca) exhibits the smallest seed area size of all evaluated Medicago spp.(c) PI 386,287 has the largest seeds in area size in Medicago sativa subsp.sativa.(d) PI 604,218 has the smallest seeds in area size in Medicago sativa subsp.sativa.The scale bars indicate 1 cm.

Figure 3 .Figure 5 .
Figure 3.Comparison of seed brightness for PI 660,361, PI 498,767, PI 619,434, and PI 468,014.(a) PI 660,361 (Medicago truncatula) has the second brightest seeds across all evaluated Medicago spp.accessions (b) PI 498,767 (Medicago ciliaris) possesses the darkest seeds in the population of Medicago spp.(c) PI 619,434 has the brightest seeds in all evaluated Medicago spp.(d) PI 468,014 has the darkest seeds in Medicago sativa subsp.sativa.The scale bars indicate 1 cm.

Figure 6 .
Figure 6.Partial contribution of seed morphology-related traits to principal components in Medicago spp.The chart displays the partial contribution of each seed morphology-related trait to the first three principal components (PC1, PC2, and PC3).

Figure 7 .
Figure 7. Pearson correlation of predictor variable importance for trained machine learning models.(a) The relative importance of 8377 predictor variables (markers) was calculated for 11 phenotypic traits using three machine learning models: random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB).(b) Clusters of highly correlated traits with correlation values.Pearson correlation scores were colored with red indicating 1 and blue indicating − 1.

Table 1 .
The top five accessions for each seed morphological trait in M. sativa subsp.sativa are displayed.

Table 3 .
Cluster variables analysis among the seed morphology-related traits.Three clusters were formed based on seed characteristics: Size, color, and shape.

Table 4 .
15st of the top SNPs and candidate genes identified by GWAS.Candidate genes (Gene ID) were identified based on the gene annotation of the reference genome15.Traits associated were the distance between IS and CG (IS to CG), seed width (Width), and red color intensity (Red).Gray cells correspond to markers at the same gene.*Marker was located at 99 bp upstream of the coding region.