RNA-Seq and genetic diversity analysis of faba bean (Vicia faba L.) varieties in China

Background Faba bean (Vicia faba L) is one of the most important legumes in the world. However, there is relatively little genomic information available for this species owing to its large genome. The lack of data impedes the discovery of molecular markers and subsequent genetic research in faba bean. The objective of this study was to analyze the faba bean transcriptome, and to develop simple sequence repeat (SSR) markers to determine the genetic diversity of 226 faba bean varieties derived from different regions in China. Methods Faba bean varieties with different phenotype were used in transcriptome analysis. The functions of the unigenes were analyzed using various database. SSR markers were developed and the polymorphic markers were selected to conduct genetic diversity analysis. Results A total of 92.43 Gb of sequencing data was obtained in this study, and 133,487 unigene sequences with a total length of 178,152,541 bp were assembled. A total of 5,200 SSR markers were developed on the basis of RNA-Seq analysis. Then, 200 SSR markers were used to evaluate polymorphisms. In total, 103 (51.5%) SSR markers showed significant and repeatable bands between different faba bean varieties. Clustering analysis revealed that 226 faba bean materials were divided into five groups. Genetic diversity analysis revealed that the relationship between different faba beans in China was related, especially in the same region. These results provided a valuable data resource for annotating genes to different categories and developing SSR markers.


INTRODUCTION
Faba bean (Vicia faba L.) is one of the most important crops worldwide. It is grown in China, Ethiopia, Egypt, the Andean States of South America, Europe, and Australia (Duc et al., 2010). Approximately 4.8 million tons of faba beans were grown in 2017, which ranks fourth in production among cool-season crops (FAOSTAT, 2018). Faba beans are a good source of protein for humans, especially those in West Asia and North Africa, and is an affordable livestock feed in developed regions, such as Europe and Australia (Alghamdi et al., 2012;Johnston et al., 1999a;Johnston et al., 1999b). Understanding the soil nitrogen fixation of faba bean crop rotation is important for improving green agriculture practices for environmental preservation (Stoddard et al., 2009).
Faba bean is a partially allogamous diploid (2n = 12) species (Raina & Ogihara, 1995), with a genome size of approximately 13 Gb, which is the largest genome among legume crops (Bennett & Smith, 1982;Johnston et al., 1999a;Johnston et al., 1999b). Although this is an important crop, there is less attention focused on faba bean than other legumes owing to its genome size and six pairs of remarkably large chromosomes.
Research on the faba bean genome is challenging in comparison with other legumes. However, considerable progress has been made in the mapping of genes using genetic diversity analysis for crops with important agronomy characteristics. Previous genetic linkage maps were constructed with different populations and molecular markers. The initial molecular markers of faba bean focused on fragment length polymorphisms (RFLPs) ( Van de Ven et al., 1991;Torres, Weeden & Martín, 1993), and randomly amplified polymorphic DNA (RAPDs) (Link et al., 1995;Avila et al., 2003). The first linkage map of faba beans was constructed with RFLP, RAPD and isozyme markers by Van de Ven et al. (1991). Second generation molecular markers, such as simple sequence repeats (SSRs) (Pozarkova et al., 2002;Roman et al., 2004), amplified fragment length polymorphisms (AFLP) (Zeid, Schon & Lin, 2003;Zong et al., 2009;Zong et al., 2010), sequence-characterized amplified regions (SCAR) Gutierrez et al. (2007), and inter-simple sequence repeats (ISSR) (Wang et al., 2012;Hou et al., 2018) made high density linkage maps and genetic diversity analysis possible. For example, an integrated approach was first used to develop SSR markers from selected regions of the faba bean (Pozarkova et al., 2002), which were then mapped by Roman et al. (2004). Different spring and winter germplasms derived from China and abroad were compared with AFLP markers (Zong et al., 2009;Zong et al., 2010).
With the rapid development of next-generation sequencing technology (NGS), numerous expressed sequence tags (ESTs) have become publicly available and provide useful information for EST-based markers (Ma et al., 2011;Kaur et al., 2012a;Kaur et al., 2012b;Yang et al. , 2012;Beier et al., 2017;Mokhtar & Atia, 2019). This has provided a sound foundation for comparative genome analysis with other legume species (Chapman et al., 2009). The second-generation sequence technology has been foundation for sequencing faba bean genomes and researchers have made some initial sequencing efforts. Additional high density physical map and map-based cloning is possible for faba bean. Ma et al. (2011) developed 21 EST-derived SSRs markers; these polymorphic markers were used for gene tagging and MAS in faba bean. Yang et al. (2012) released high-throughput microsatellite markers using next generation sequencing. Kaur et al. (2014) first reported high-density genetic mapping with SNP-based markers in faba bean ascochyta blight resistance. The first consensus map for the faba bean contained 687 SNP markers on six linkage groups, each presumed to correspond to one of the faba bean chromosomes (Webb et al., 2016). Density enhancement genetic linkage map of the faba bean was conducted by screening 5,325 SSR genomic primers and 2,033 EST-SSR primers (Yang et al., 2019). These are useful for marker-assisted selection and breeding in this important legume crop. Carrillo-Perdomo et al. (2020) reported the most saturated consensus genetic map with 1,728 SNP markers distributed in six linkage groups using three mapping populations. Recently, Mokhtar et al. (2020) released the Vicia faba omics database (VfODB), which included ESTs, EST-SSRs, mtSSRs, microRNA-target markers and genetic maps in the faba bean. This powerful database will be a helpful platform for the molecular breeding of the faba bean.
However, there are not many identified faba bean markers that can be used in genetic studies and there is an urgent demand to develop various kinds of molecular markers to enhance marker development and breeding strategies in the future. To date, RNA-seq technology has been an effective tool to generate genome-wide transcriptome profiles and obtain substantial SSRs, especially for species whose genome has not been sequenced (Garg & Jain, 2013;Zhang et al., 2011;Xiao et al., 2013).
RNA-seq has been successfully used in various species, including rice (Xu, Gao & Wang, 2012), maize (Davidson et al., 2011), chickpea (Garg et al., 2011), field pea (Sudheesh et al., 2015, and faba bean (Yang et al., 2020). RNA-Seq will provide a feasible approach to develop SSR markers for both genetic diversity and mapping analysis of the huge, non-sequenced faba bean genome.
Previous studies have reported EST-SSR markers in faba bean, however, the numbers that can be used in genetic analysis and linkage mapping are still limited. In order to conduct genetic analysis, we first used four faba bean varieties for RNA sequencing; this data was used (1) to obtain faba bean transcriptome information and get a better understanding of the annotated genes, (2) to develop SSR markers and validate their polymorphisms, (3) to verify the relationship between different faba bean germplasms in China.

Plant materials
Four different varieties of faba bean, Qinghai11 (indeterminate growth), Qingcan16 (determinate growth), Qingcan 18 (zero tannin content) and YL01 (high tannin content) with extreme phenotype, were selected for transcriptome sequencing. The four varieties were grown in the experimental station of the Qinghai Academy of Agricultural and Forestry Sciences. The young leaves of different varieties were collected for three biological repeats and the RNA was extracted.
A total of 226 samples (Table S1) derived from different regions of China were used for the development of SSR markers and genetic diversity research. The 226 accessions were the main varieties or special germplasm resources derived from different provinces of China. These accessions will be used for the construction of the core germplasm for the China faba bean in the future. Among these, 63 varieties were collected from the northwest, 79 varieties were sampled from the southwest, 15 varieties were found in northern China, 54 varieties came from eastern China, three varieties were found in southern China, and the remaining 12 varieties were distributed in central China.
The 226 samples were provided by the Qinghai Academy of Agriculture and Forestry Sciences and were planted in the experiment field in Qinghai in March 2020.

RNA extraction and cDNA library construction
RNA extraction and cDNA library construction samples were isolated from the young leaves of four different varieties: Qinghai11, Qingcan16, Qingcan 18, and YL01. The total RNA was extracted using Trizol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions and were stored at −20 • C for later analysis. DNase I was used to treat the RNA samples to remove DNA contamination. RNA samples were mixed according to the equivalent concentration of each variety. The concentration and integrity of the total RNA was examined and degraded using a 2100 BioAnalyzer (Agilent, Santa Clara, CA, USA) and agarose gel electrophoresis. The total RNA was processed using oligo (dT) magnetic beads. The selected RNA was used to create a full-length cDNA library. An Agilent 2100 BioAnalyzer and ABI Steponeplus real-time PCR system were used to test the constructed libraries. The resulting products were used for sequencing.

Data filtering and de novo assembly
The Illumina NovaSeq 6000 platform was used for paired-end sequencing. Raw reads were filtered via SOAPnuke (v1.4.0) (Chen et al., 2017a;Chen et al., 2017b). The reads were first preprocessed to eliminate joint contamination. The low-quality reads were discared (reads with ambiguous N bases >5% and more than 15% of bases Q <20 bases). Clean data were obtained after removing adapter-containing, poly-N-containing, and low-quality reads from the raw data. The Q20, Q30, and the GC content of the clean data were calculated.
The raw data was also calculated by RNASeqPower (https://bioconductor.org/packages/ release/bioc/html/RNASeqPower.html) (Hart et al., 2013). The sequencing depth of RNA Seq-Power package in R was 32 and the sample number was 3 (Table S2).
The genomic DNA from 226 faba bean varieties was extracted according to the DS (sodium lauroylsarcosine) protocol (Song & Langridge, 1991;Song & Henry, 1994). DNA quality and quantity were tested by 1.5% agarose gel electrophoresis. The developed SSR markers were first verified using 50 randomly-selected faba bean accessions, including 16 varieties from the Qinghai province, 20 varieties from the Yunnan province, 10 varieties from the Sichuan province, and four varieties from the Jiangsu province.
PCR was carried out in a S1000 Thermal Cycler (BIO-RAD, Hercules, CA, USA) in volumes of 10 µL. The PCR system contained 4.0 µL 2×PCR Master Mix (TaKaRa, Shiga, Japan), 2.0 µL DNA (10 ng/ µL), 1.0 µL of each primer pair (2.0 pmol/µL), and 2.0 µL ddH 2 O. The PCR conditions were as follows: denaturation at 94 • C for 5 min, followed by 30 cycles of 94 • C for 30 S, 55 • C for 30 s, 72 • C for 30 S, and a final extension for 10 min at 72 • C. PCR products were heated and denatured at 95 • C for 5 min. A total of 5 µL of each sample was loaded with a 6% denaturing polyacrylamide gel using the silver staining method (Bassam, Anolles & Gresshoff, 1991).

Genetic diversity analysis
SSR data were scored according to the migration rate of the PCR amplified fragments at each SSR locus and the data format was converted according to the requirements of the analysis software. Popgene V1.32 software was used to calculate the allele number (NA) and effective number of alleles (NE). The polymorphism information content (PIC) and Nei's genetic diversity (H) were calculated using Power Marker V3.25 software. The genetic diversity analysis of faba bean was tested using Nei's genetic distance and the unweighted pair group method using averages (UPGMA) (non-weighted averages) method.

RNA sequencing and sequence annotation
A total of 92.43 Gb of paired-end raw data were obtained from the 12 faba bean samples (four accessions with three replicates) using the Illumina NovaSeq 6000 platform. A total of 89.31 Gb of clean reads were obtained after filtering the reads for quality. The Q20 percentage of the reads was over 98% and the Q30 percentage was over 94% (Table S3). All clean reads were assembled with Trinity and then Tgicl was used to cluster the transcripts to eliminate redundancy and obtained the initial unigene. Tgicl was used for multiple samples to cluster the unigene of each sample to eliminate redundancy and obtain the final unigene for subsequent analysis (Table S4). After clustering, the quality indexes and the length distribution of the unigene are shown in Table S4. A total of 133,487 unigenes were finally obtained with a GC content of 38.77%, and the mean length of the assembled unigenes was 1,334 bp (N50 = 1,858 bp).
TransDecoder software was used to identify candidate coding regions (CDS) in the unigenes. The longest open reading frame was extracted, and then the Pfam protein homologous sequences were searched by Blast comparison between the SwissProt database and Hmmscan to predict the CDS. The predicted total number of CDS was 71,282 with a GC content of 41.86%, and CDS length from 12,783 bp to 297 bp (N50 = 1,359 bp).

Gene function classification
Gene function annotation resulted in 133,487 unigenes. The numbers and percentage of annotation results for the seven databases, including NR, NT, PFAM, Swissprot, GO, KEEG and KOG, are shown in Table 1 (Fig. 2A). The largest number of genes (88,629; 66.40%) were annotated by NR, and the fewest were annotated by Pfam, (62,986; 47.19%) ( Fig. 2A). Venn diagram of gene annotation with five databases, including NR, NT, PFAM, GO, and KOG, showed that a total of 44,188 genes could be annotated (Fig. 2B).
After GO annotation, the unigenes were classified into three categories: biological process, cellular component, and molecular function (Fig. 3A). Among these, 83,391 (46.9%) unigenes for molecular function accounted for the majority of the unigenes, followed by 52,210 (29.3%) unigenes for cellular components, and 42,359 (23.8%) unigenes for participating in biological processes, covering a comprehensive range of GO categories.
In the molecular function category, catalytic activity (39,008) and binding (35,451) were the two main categories. Moreover, 4,478 unigenes were assigned to transporter activities. However, few unigenes were involved in transcription regulator (1,482), structural molecule activity (1,404), molecular function regulator (859), and antioxidant activity (419). The rest of the unigenes were related with cargo receptor activity, molecular carrier activity, nutrient reservoir activity, and protein tag and translation regulator activity. Among the cellular components, unigenes of membrane part (21,296), cell (18,105), and organelle part (9,888) were the dominant unigenes. However, only a few unigenes were involved in the membrane (860), extracellular region (776), protein-containing complex (523), virion part (393), and symplast (283). The remaining unigenes participated in the supramolecular complex, organelle, and nucleoid. The biological processes unigenes were as follows: unigenes of the cellular (18,246), biological regulation (7,015), cellular component organization (4,664) processes, metabolic (4,025) and localization (3,985) accounted for the majority of the unigenes. However, only a few unigenes were determined to be a to response to stimulus (2,196), multicellular organismal process (1,091), and multi-organism process (717). The remaining unigenes were associated with the other biological process, including rhythmic process, nitrogen utilization, and growth. A total of 68,068 KOG-annotated unigenes were classified according to KOG group and divided into 25 groups after KOG annotation (Fig. 3B). Among these genes, the number associated with only the general function prediction was the highest (15,721); the number in the signal transduction mechanisms ranked second (7,707); and the number associated with function unknown ranked third (5,771).
BlastX was used to match a single gene to KEGG to assess the inclusiveness of the transcriptome library and the efficiency of the annotation program. The results showed that a total of 69,648 unigenes were classified according to their participation in KEGG  metabolic pathways, including cellular processes (3,374), environmental information processing (4,361), genetic information processing (15,477), metabolism (42,034), and organismal systems (2,405) (Fig. 3C). The largest number of genes were the global and overview maps, which involved in the metabolism branch, with 15,875 genes. According to the E-value distribution in the NR database, 88,574 genes were found to be related to genes annotated in other species (Fig. 4), of which 34,014 (38.4%) genes were related to Medicago truncatula, and 15,617 (17.6%) genes were similar to Cicer Arietinum. A total of 14,010 (15.8%) genes were related to Trifolium pratense, 12,516 (14.1%) genes were related to Trifolium subterraneum, and 2,009 (2.2%) genes were related to Pisum sativum. A total of 2,623 transcription factor-related genes were detected by searching the NR datasets (Fig. 5), which were divided into 58 categories. The similarity genes included 318 MYB-related genes, 219 BHLH-related genes, 172 C3H-related genes, 169 AP2-EREBP-related genes, 128 WRKY-related genes, and 117 NAC-related genes.

SSR polymorphism analysis
A total of 5,200 EST-SSR markers were developed from the 133,487 unigenes after filtering (Table S5). To validate their polymorphisms for faba bean accessions, one set of 200 markers was chosen from the above list of loci. Among the selected 200 EST-SSR markers, 103 pairs of SSR markers produced clear amplicons with the expected sizes in nearly all of the 226 faba bean varieties. Among the successful markers, 103 (51.5%) polymorphic genic SSR markers were identified, containing 39 di-, 47 tri-, two tetra-, and one penta-, while the other 14 markers were monomorphic. A total of 103 polymorphic EST-SSR markers were then used to test the polymorphism of 226 faba bean varieties ( Fig. 6; Table S6). A total of 498 polymorphic loci were scored with an average number of 4.84 alleles per locus. The most abundant polymorphism locus was T_DN30600 with 10 alleles. Only two alleles were detected for T_DN20824, T_DN32959, T_DN27103, T_DN31063, T_DN24924, T_DN15279, T_DN34099, and T_DN19478.
The polymorphism information content (PIC) varied from 0.1273 (T_DN34099) to 0.7913 (T_DN28326), with an average value of 0.4959 (Table 3). The proportion of the PIC value was greater than the average proportion of 47.57%. We observed that the effective number of alleles (Ne), Shannon's index (I), observed heterozygosity (Ho) and expected heterozygosity (He) were 2.4581, 1.0278, 0.0122 and 0.5512, respectively. The number of effective alleles (Ne) was distributed from 1.158 to 5.372 and the average number was 2.4581. The Shannon's index (I) was arranged from 0.263 to 1.902 with an average of 1.0278. The mean values of observed heterozygosity (Ho) and expected heterozygosity (He) were 0.0122 and 0.5512, respectively. The results indicated that the selected markers showed high diversity between faba bean varieties and could be used for further genetic analysis. Based on the genotype data obtained from 103 pairs of SSR markers, the population genetic structure of 226 domestic faba bean materials was analyzed using Structure2.3.4 software. It has been shown that when K = 2, the value of K is the maximum, and 226 samples were divided into two subgroups. A total of 194 (85.84%) samples, whose Q value >0.6, were divided into two groups. Group 1 contained 96 samples (42.48%) and Group 2 contained 98 samples (43.36%), respectively (Fig. 7). The Q value of the remaining 32 materials was ≤ 0.6 (14.16%). The result of Q value analysis revealed that there were no clear group attributive characteristics and a mixed population was formed.

Clustering analysis
A 0/1 matrix was established based on 498 loci amplified from 103 pairs of SSR primers with NTSYSpc Version 2.1 software. UPGMA clustering analysis of 226 faba bean germplasms was obtained (Fig. 8). The clustering results showed that 226 faba bean samples could be divided into five categories. Among these, Tongcan no.5 (H0005091; Jiangsu, China) and the beiyuanga faba bean (H0005277; Gansu, China) belonged to separate categories, respectively. Only two faba bean germplasms, including the erqing faba bean (H0001645) from the Sichuan province and big faba bean (H0001724) derived from the Guizhou province, were classified into the third category. The fourth class included 113 samples, which came from the majority of the faba bean planting provinces of China, including Sichuan, Gansu, and Qinghai. The remaining 109 materials were clustered into the fifth class.

DISCUSSION
Our transcriptome analysis revealed 92.43 Gb of the overall sequencing data and provided a set of 133,487 faba bean unigenes using Illumina sequencing technology. A high-quality transcriptome assembly and annotation for faba bean was obtained in this study, which will provide a powerful tool for further gene mining of faba beans. Our analysis also showed the possibility of using the RNA-Seq technique for SSR development in faba bean. A total of 16,202 SSR loci and 5,200 developed SSR markers in this study will provide valuable information for genetic diversity, linkage mapping of genes with important agronomy traits, and marker-assisted selection in faba beans.
Faba beans are an important legume for human consumption and livestock feed, therefore, the planting area and production of faba bean have increased year by year. The   planting range and yield of faba bean has ranked the fourth after pea (Pisum sativum L.), chickpea (Cicer arietinum L.), and lentil (Lens culinaris Medik.) according to FAO statistics (FAO, 2018).
Recently, more attention has been paid to the development of genomic and genetic resources of faba bean, owing to its increasing demands. The large genome size and its repetitive DNA (>85%) (Novák et al., 2020) have hampered the development of the faba bean in research areas including the sequencing of the whole genome, fine mapping, and map-based cloning.
Previous studies have reported transcriptome analysis and provided genetic resources for faba bean (Arun-Chinnappa & McCurdy, 2015;Ray, Bock & Georges, 2015;Braich et al., 2017;Cooper et al., 2017;Khan et al., 2019). Considerable progress has been made in some areas and previous studies have led to the DNA sequence data being released in public databases from previous reports (Mokhtar et al., 2020).
A genome-wide transcriptome map reported by Arun-Chinnappa & McCurdy (2015) generated 69.5 M reads and 65.8 M were used for assembly following trimming and quality control. A total of 17,160 unigenes were generated, of which 80.6% were successfully annotated against GO terms. Ray, Bock & Georges (2015) sequenced approximately 1.2 million expressed transcripts and assembled these into contigs by transcriptome analysis.  The de novo assembly of 606.35 M high-quality pair-end clean reads yielded 164,679 unigenes of leaf tissues (Khan et al., 2019). Gene annotation regulated energy metabolism, transmembrane transporter activity, and secondary metabolites according to the GO and KEGG databases.
The development of different sequencing technologies has led to an increasing number of databases and expressed sequence tags (ESTs), EST-SSRs, mtSSRs, microRNA-target markers, and single nucleotide polymorphism (SNP) markers reported in faba bean (Mokhtar et al., 2020). Thus, close genetic mapping, quantitative trait loci (QTL) mapping, and identification of candidate genes in faba bean are possible. EST-SSRs are a useful tool for faba bean genetic analysis compared with traditional SSRs because they are part of the genes. Additionally, EST-SSRs represent high conservation of expressed sequences and can transfer between related species.
In this study, we obtained 16,202 SSR loci from transcriptome analysis and developed a total of 5,200 EST-SSR markers from the 133,487 unigenes. These EST-SSR markers provide not only important genomic resources for basic research but also chances to construct closely linked mapping of agronomic traits for faba bean. The obtained SSR loci number in this study was lower than 18,327 loci reported by Alghamdi et al. (2018). However, the numbers of developed SSR markers was much more than previous reports, such as 2,932 SSRs revealed by Gnanasambandam et al. (2012) and 802 SSRs obtained by Kaur et al. (2012a); Kaur et al. (2012b). The SSR frequency and distribution analysis revealed that single base repeats were the dominant type of SSR sequences, accounting for 60.1% of the total number of SSR sequences. A/T had the largest number of repeats (9,261), accounting for 98.83% of single nucleotide repeats. The (AG/CT) n di-nucleotide repeat occurred 74.3%, which was the highest frequency in our study, and was similar to results reported by Kareem et al. (2021) and El-Rodeny et al. (2014). Simultaneously, SSR frequency and distribution in this study were consistent with the high (AG/CT) n repeats in other beans, such as the adzuki bean (Chankaew et al., 2014) and common bean (Blair et al., 2014). The results in this study also verified that SSR loci existed differently in various species and the screening criteria may also influence the distribution frequency of SSR loci.
In our study, the population genetic structure analyzed by Structure2.3.4 software showed that 226 faba bean samples were divided into two groups. However, the distancebased method clustered the samples into five groups. Our results revealed that the categories of germplasm from different geographical sources was not specially obvious. Faba bean materials from the same region were not able to be divided into the same group, and materials from different sources were clustered together, which was consistent with the results of Abid et al. (2015). This may be a result of gene exchange among germplasm resources from different regions and the mutual introduction of different regions in China. These results may also relate to the screening of SSR primers.

CONCLUSION
In conclusion, this transcriptomic analysis of four faba bean varieties with different phenotypes provided a valuable set of genomic data for characterizing genes. In addition, the developed SSR markers in this study will lay a basis for further genetic studies and molecular marker-assisted breeding of faba bean.