Molecular marker development and genetic diversity exploration in Medicago polymorpha

Medicago polymorpha L. (bur clover), an invasive plant species of the genus Medicago, has been traditionally used in China as an edible vegetable crop because of its high nutritive value. However, few molecular markers for M. polymorpha have been identified. Using the recently published high-quality reference genome of M. polymorpha, we performed a specific-locus amplified fragment sequencing (SLAF-seq) analysis of 10 M. polymorpha accessions to identify molecular markers and explore genetic diversity. A total of 52,237 high-quality single nucleotide polymorphisms (SNPs) were developed. These SNPs were mostly distributed on pseudochromosome 3, least distributed on pseudochromosome 7, and relatively evenly distributed on five other pseudochromosomes of M. polymorpha. Phenotypic analysis showed that there was a great difference in phenotypic traits among different M. polymorpha accessions. Moreover, clustering all M. polymorpha accessions based on their phenotypic traits revealed three groups. Both phylogenetic analysis and principal component analysis (PCA) of all M. polymorpha accessions based on SNP markers consistently indicated that all M. polymorpha accessions could be divided into three distinct groups (I, II, and III). Subsequent genetic diversity analysis for the 10 M. polymorpha accessions validated the effectiveness of the M. polymorpha germplasm molecular markers in China. Additionally, SSR mining analysis was also performed to identify polymorphic SSR motifs, which could provide valuable candidate markers for the further breeding of M. polymorpha. Since M. polymorpha genetics have not been actively studied, the molecular markers generated from our research will be useful for further research on M. polymorpha resource utilization and marker-assisted breeding.


INTRODUCTION
The genus Medicago consists of approximately 87 species including the most important forage crop species M. sativa and the legume model species M. truncatula (Steele et al.,

Sample collection and field experiment
In China, M. polymorpha mainly distributes in five major cultivation areas (Jiangsu province, Zhejiang province, Anhui province, Yunnan province, and Shanghai) with very limited distribution elsewhere. For SLAF-seq analysis, a total of 10 representative M. polymorpha individual accessions with distinct phenotypic characteristics were collected from these five different geographical regions. Among them, six accessions came from Jiangsu province , and the other four accessions from the other four areas (MP-6 from Zhejiang, MP-8 from Shanghai, MP-3 from Anhui, and MP-9 from Yunnan). For plant growth form, MP-3 and MP-4 grown in lawn were prostrate, MP-13 grown in farmland was semi erect, and the other seven M. polymorpha accessions grown in farmland were erect (Table 1).
Each living accession was taken to the experimental fields of Yangzhou University, Jiangsu Province, China (119 40′E, 32 35′N) for ball planting. After the seeds of each accession were harvested in May, field experiments for phenotypic trait measurement were conducted in a randomized complete block design with three biological replications in October at Yangzhou University. There were two rows for each accession. The seeds of each genotype in each replication were sown in drill with a spacing of 0.5 m in rows and 1 m between two accessions. No fertilizer was used, but timely drainage between the plot ditch, artificial weeding, and timely irrigation were utilized.

Phenotype measurement and data statistical analysis
At the beginning of each accession's flowering time in April of the following year, six different agronomic traits (leaf length (cm), leaf width (cm), plant height (cm), branch number (n), stem diameter (cm) and stem-leaf ratio) were evaluated on 10 plants per replication. Branch number is the number of branches at the base (including the main stem), and plant height is the absolute height from the base of the main stem to the growing point of the main stem. Stem diameter refers to the diameter of the base of the main stem, measured with a vernier caliper. Analysis of agronomical data was carried out using analysis of variance (ANOVA) and the general linear model (GLM) of SAS (V.9.4). Cluster analysis was conducted using the Squared Euclidean Distance method with SPSS software (V.19.0).

Extraction of genomic DNA and enzyme solution design
Young healthy leaves of each accession were collected, frozen in liquid nitrogen, and then stored at −80 C before DNA extraction. Total high-quality genomic DNA was extracted using a minor modified CTAB method (Murray & Thompson, 1980). DNA concentration and quality were examined with an ND-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and by electrophoresis in 1% agarose gels with a lambda DNA standard. DNA concentrations in all samples were adjusted to the same level for the construction of SLAF libraries. To obtain adequate SLAF tags, the restriction enzyme combination "RsaI + HaeIII" (NEB, Ipswich, MA, USA) was selected based on (a) lots of SLAF tags, (b) fewer restriction fragments with repeat sequences, (c) an even distribution across chromosomes, and (d) simulated fragments aligning uniquely to the reference genome.

SLAF library construction and sequencing
SLAF-seq was performed with slight modifications as described previously (Sun et al., 2013).
First, for the in-silico prediction of the number of markers produced by different enzymes, marker-discovery experiments were designed by analyzing the M. truncatula A17 (NCBI: taxid 3880; https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=3880) using prediction software independently developed by Biomarker Technologies Corporation (Beijing, China). Based on the result of the preliminary experiment, RsaI and HaeIII restriction enzymes were selected as the best combination. To further evaluate the accuracy of the 'RsaI + HaeIII' combination enzyme digestion experiment, Arabidopsis thaliana ecotype Columbia was used as the control for sequencing. Genomic DNA was then digested with restriction enzyme RsaI (NEB, Ipswich, MA, USA), ligated to RsaI adapter, and then digested by additional enzyme HaeIII (NEB). Polymerase chain reactions (PCR) were carried out using the diluted restriction-ligation samples, dNTP, Taq DNA polymerase (NEB), and Rsa I-primer containing a barcode. According to the position of Arabidopsis reads in the genome and the length distribution of inserted fragments in Arabidopsis thaliana ecotype Columbia, DNA fragments of 314-414 bp for M. polymorpha were selected using the Blue Pippin automated DNA size selection system. We classified these as SLAF tags and then sequenced them using the Illumina HiSeq 2500 system (Illumina, Inc., San Diego, CA, USA) with 125 bp pair-end. The sequencing was performed by Biomarker Technologies Corporation (Beijing, China).

Molecular marker development
SLAF marker identification and genotyping were performed as described by Sun et al. (2013). High-quality SLAF tags were obtained from 10 M. polymorpha genotypes. SLAFs with two to four tags were considered polymorphic SLAFs and potential markers. Polymorphic SLAF tags showed sequence polymorphisms between different accessions. Before mapping, low-quality reads with quality scores less than 30 were removed, and high-quality reads with terminal 5-bp and barcode sequences were discarded.
All high-quality SLAF paired-end sample reads of the 10 accessions were then aligned to the newly assembled reference genomes of M. polymorpha 'Huaiyang Jinhuacai' (Cui et al., 2021; https://ngdc.cncb.ac.cn/gwh/Assembly/17540/show) using the Burrows-Wheeler alignment tool (BWA) software (Li & Durbin, 2009) with default parameters. Variants at each SLAF locus were called independently for each accession by GATK (v.4.0) (McKenna et al., 2010), and then the genotypes of all accessions (n = 10) were joined into a VCF file with the criteria of integrity >90%, minor allele frequency (MAF) >0.05, and max missing 0.5. The filtered high-quality SNPs were further used to reconstruct the phylogenetic tree with the maximum likelihood estimation method in MEGA-X .

Genetic diversity analyses
The assessments of genetic diversity, including minor allele frequency (MAF), observed heterozygosity (Ho), expected heterozygosity (He), Nei diversity index (Nei), Shannon's information index (I), and polymorphic information content (PIC), were calculated through Perl programming.

Phenotypic characterization
There were significant differences in the phenotypic traits across the different M. polymorpha accessions. As shown in Table 2, the leaf length of M. polymorpha materials in this study varied from 1.38 to 2.40 cm, with a variation amplitude of 1.02 cm, and the leaf width varied from 1.45 to 2.23 cm, with an amplitude of 0.78 cm. Among all accessions, the leaf length and width of MP-6 from Wenling, Zhejiang province were the largest, and this material is also known as "Wenling big leaf" in the local area. We found that the germplasm characteristics of the MP-6 large leaves were not changed due to alloplanting. The leaf length and width of MP-4 from Danyang, Jiangsu province were the smallest, followed by MP-3 from Anhui, which retained more wild characteristics. The plant height of M. polymorpha accessions showed obvious differentiation. For example, the plant heights of MP-3 and MP-4 were only about 4 cm, and the lateral branches crept close to the ground because the main stem growth was significantly inhibited. However, other materials grew upright, ranging in height from 20 to 25 cm, except for MP-13. Moreover, the lateral branches of MP-3 and MP-4 were very developed and totaled up to 4, which was significantly more than that of the other materials. Stem diameter ranged from 1.71 to 2.08 cm, with an average of 1.89 cm. MP-9 from Chuxiong, Yunnan province had a stem-to-leaf ratio, of 1.33, indicating that it had the most leaves, while MP-4 had the fewest leaves. The variation coefficient showed that the plant height of the M. polymorpha accessions changed the most (43.15%), followed by branch number (33.42%), and stem-leaf ratio and stem diameter changed the least (9.60% and 4.84%, respectively). Additionally, as the correlation analysis in Table 3 shows, leaf length and leaf width were significantly positively correlated, and leaf length and leaf width were both significantly positively correlated with plant height. Branch number, stem diameter, and stem-leaf ratio were positively correlated with each other, indicating that the M. polymorpha accessions with thicker stem had greater branch numbers.

Cluster analysis of phenotypic traits
Clustering 10 M. polymorpha accessions based on their phenotypic traits (Fig. 1A) revealed three groups and a high value of genetic diversity among various accessions. The first group contained 70% of the studied accessions from all geographical regions, including MP-5, MP-6, MP-7, MP-8, MP-9, and MP-11. The second group was comprised of 10% of the accessions assessed. The third group was made up of 20% of the genotypes, including MP-3 and MP-4. MP-13 was assigned to a separate cluster due to its approximate average value for all phenotypic traits among the evaluated genotypes.

SLAF library validity and SLAF-seq
Since the genome of M. polymorpha was not published at the time of this study, we determined that the "RsaI + HaeIII" double-enzyme digestion scheme was the best enzyme  digestion scheme for constructing the M. polymorpha SLAF tag library based on the prediction results of electronic digestion of the M. truncatula genome by Biomarker Technologies Corporation (Beijing, China). The selected size for this method was 314-414 bp using the Blue Pippin system. We predicted that there were 117,528 SLAF tags distributed in the genome with 6.19% located in repetitive regions. To further evaluate the efficiency of the restriction enzyme digestion scheme, Arabidopsis thaliana was selected as an example. Through the detection of the remaining restriction enzyme restriction sites in Arabidopsis read inserts, we found that the ratio of Arabidopsis paired-ended reads mapped to their reference genome was 83.41%, and the enzyme digestion efficiency was 97.79%. This indicates that the double-enzyme digestion scheme should be effective for the SLAF library construction of M. polymorpha. SLAF sequencing of the 10 M. polymorpha accessions generated 6.26 Gb clean reads, with a Q30 percentage of 84.97% and GC content of 38.29% (Table S1). Finally, a total of 87,207 SLAF tags were identified, and a single accession contained 79,008 to 85,727 tags (mean 82,336 tags). The read numbers for SLAF tags ranged from 598,481 to 1,869,872 (mean 944,305), and the sequencing coverage ranged from 7.40-fold to 21.88-fold (mean 11.37-fold). Of the identified 87,207 SLAF markers, 25,616 (29.37%) were polymorphic and 70.53% were non-polymorphic, and only 0.1% were repetitive (Table S2). The integrity values of SLAFs from M. polymorpha accessions were all above 90% (Table S3). The high quality of these data promised the reliability of the subsequent analyses.

Development of SNP and SSR markers
Only polymorphic SLAF markers were further selected as high quality SNPs. PCR duplicates were not removed after alignment since the removal would result in many sites in the results not covered by reads, or very low coverage. Moreover, SLAF sequencing does not sequence the entire genome, but sequences the digested fragments according to the restriction sites. Therefore, data filtering strategies used the following parameters: QD >2, MQ >40.0, MQRankSum >−12.5, ReadPosRankSum >−8. After aligning the fragments to the M. polymorpha reference genome, a total of 52,237 SNPs with integrity > 90% and minor allele frequency (MAF) >0.05 were identified in this study (Table S4). The MAF distribution spectrum is listed in Table S5. These SNP variants were mostly distributed on pseudochromosome 3 (22.21%), least on pseudochromosome 7 (7%), and nearly evenly distributed on the five other pseudochromosomes (~14.16% for each pseudochromosome).
In order to provide a theoretical basis for the development of SSR molecular markers in M. polymorpha, the distribution characteristics of the SSR sequence were analyzed using MISA software. Results showed that 195,753 complete SSR sequences were screened, with relative densities of 428/Mb. The total length of SSR sequences was 3,611,698 bp and accounted for 0.79% of the total genome sequence length. Among the one to six different nucleotide repeats units, the single nucleotide repeat units were the most, followed by dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeat units. SSR repeat unit types in the M. polymorpha genome were abundant and had great potential for developing polymorphic markers and in the genetic diversity and molecular marker-assisted breeding of M. polymorpha.

Genetic diversity of M. polymorpha based on SNP markers and agronomic traits
Phylogenetic analysis based on high-quality SNPs also showed that all M. polymorpha accessions were divided into three clades (Fig. 1B), consistent with morphological clustering (Fig. 1A). The first clade contained MP-2, MP-5, MP-6, MP-7, MP-8, MP-9, and MP-11. The second clade included only MP-13, and the third clade had two accessions, MP-3 and MP-4. When M. truncatula was used as the outgroup, the rooted evolutionary trees ( Fig. 2A) suggested that MP-3 and MP-4 were the most related to the putative ancestral species of the M. polymorpha population. Furthermore, PCA was performed and the first, second, and third principal components explained 14.88%, 12.75%, and 11.92% of the genetic diversity, respectively. The first principal component roughly separated the three groups (Fig. 2B). For the three clusters (Fig. 1A), the He, Nei, I, and PIC values of group I were the highest, followed by group III, and then group II (Table 4). These results indicated that group I possessed the highest genetic diversity, while group II had the lowest genetic diversity.

DISCUSSION
M. polymorpha is native to the areas surrounding the Mediterranean Sea and has been naturalized in most parts of the world. Molecular marker development and diversity evaluation of M. polymorpha accessions collected from different regions in China can provide useful information concerning sustainable conservation and the utilization of genetic diversity. However, previous genomic studies on M. polymorpha have been limited compared to applicable molecular markers in other major crops. Genomic data published by Cui et al. (2021) provided researchers novel insight into M. polymorpha molecular marker development. In this study, we used 10 M. polymorpha accessions collected from different regions of China to sequence genome-wide distributed SLAFs for polymorphism detection and genotyping for the first time.
In recent years, SLAF-seq technology has been widely used for high-throughput SNP marker development, QTL mapping, high-density genetic map construction, the genetic divergence analysis, and genome-wide association analyses of important agronomic traits in major crops. Wang et al. (2019) used SLAF-seq to construct a genetic map including 7,033 SNP loci that covered 3,353.15 cM with an average distance between consecutive markers of 0.48 cM, and total 13 stable QTL associated with six cottonseed quality traits were detected, Wei et al. (2020) constructed a SNP-based genetic map of Eggplant based on large-scale SNP markers discovered by SLAF-seq technology and QTL Analysis. Huang et al. (2021) identified 60,495 polymorphic SNPs in a subset upland cotton MAGIC population containing 372 lines using SLAF-seq technology, and subsequently successfully detected eight genes with known functions associated with important agronomic traits.   Li et al. (2021) investigated the genetic diversity and population structure of weedy and cultivated broomcorn millets by using the SLAF-seq technology. Wang et al. (2022) identified a candidate region of pmHYM in the chromosome 7BL of wheat landrace cultivar Hongyoumai using SLAF and BSR-seq methods. Likewise, Xue et al. (2022) constructed a high-density genetic map containing 9,980 SLAF markers using SLAF-seq, and identified two QTLs related to palmitic acid content in soybean. In summary, previous research has indicated that SLAF-seq technology is a highly efficient method used in molecular marker development and diversity evaluation analysis. In our study, we identified a total of 25,616 polymorphic SLAF tags containing 52,237 high-consistency SNPs with MAF >0.05 and integrity >0.9. The identified SNPs were mostly distributed on pseudochromosome 3 (22.21%), least on pseudochromosome 7 (7%), and nearly evenly distributed on the other pseudochromosomes of M. polymorpha. Moreover, 195,753 complete SSR sequences were screened, with relative densities of 428/Mb. SSR repeat unit types in the M. polymorpha genome were abundant and showed great potential for developing polymorphic markers, as well as important application value in the genetic diversity and molecular marker-assisted breeding of M. polymorpha.
In order to evaluate the genetic diversity of M. polymorpha, a variety of molecular markers have been developed, including isozyme, random amplified polymorphic DNA (RAPD), and SSR (Paredes et al., 2002;Chu et al., 2010;Emami-Tabatabaei et al., 2021). However, these studies had the shortcomings of having a limited number of markers and being time-consuming and laborious. In this study, we developed more than 50,000 high-quality SNPs of M. polymorpha using SLAF-seq for the first time. Our results showed that M. polymorpha exhibited a considerable level of nucleotide diversity. However, M. polymorpha is a niche species, and there has not been a standardized collection of germplasm resources in most areas of China. Due to this limitation, we collected only 10 M. polymorpha accessions in China under the existing conditions. Nevertheless, previous studies have shown that small samples tend to produce an excess number of intermediate frequency alleles than large samples (Ramírez-Soriano & Nielsen, 2009;Tenaillon et al., 2001), and analyses of genetic diversity based on small populations can still be important (Nayanakantha, Singh & Gupta, 2010;Zhang et al., 2019), especially for M. polymorpha.
The evaluation of genetic diversity with both agronomic traits and molecular markers can be useful to analyze genetic diversity within species with more precision, as well as in phylogenetic studies (Ambreen et al., 2018). Co-analysis of agronomic traits and molecular data have extensively been used for genetic diversity analysis in various crops such as Triticum urartur (Wang et al., 2017) and safflower (Golkar, Arzani & Rezaei, 2011), but it has not been studied in M. polymorpha. In this study, we analyzed the phenotypic characteristics of M. polymorpha materials and found that there were great differences in the phenotypic traits across different M. polymorpha accessions. Cluster analysis of all 10 M. polymorpha accessions based on their phenotypic traits indicated that these accessions could be grouped into three groups, which was basically consistent with their phenotypic traits. Moreover, phylogenetic and PCA analyses based on SNP markers also showed that all M. polymorpha accessions were divided into three clades, which was consistent with morphological clustering results.
Large numbers of SNP markers discovered by SLAF-seq technology in M. polymorpha germplasm collected from different regions of China can provide an important breeding resource and lay the foundation for future genetic analysis. Previous studies have shown that using a combination of SNP markers and agronomic traits is an effective method for evaluating genetic diversity (Golkar & Nourbakhsh, 2019), which was also confirmed in our study.

CONCLUSION
In this study, we applied SLAF sequencing technologies to develop molecular markers and explore the genetic diversity of M. polymorpha germplasm collected from different regions in China. More than 50,000 high-confidence SNP markers in a panel of 10 M. polymorpha accessions were first developed. These SNP markers were evenly distributed in each chromosome, with the most on pseudochromosome 3 (22.2%) and the least on pseudochromosome 7 (7.0%). Cluster analysis of all M. polymorpha accessions based on SNP markers (phylogenetic analysis and PCA analysis) and phenotypic traits both suggested that they could be divided into three groups. Subsequent genetic diversity analysis for these 10 accessions also validated the effectiveness of the molecular markers of M. polymorpha germplasm in China. Additionally, we performed SSR mining analysis to identify polymorphic SSR motifs, which helped to provide valuable candidate markers for the subsequent breeding of M. polymorpha. Our study will provide important information about the molecular markers of M. polymorpha and will be useful for further research on M. polymorpha resource utilization and marker-assisted breeding.