The draft genomes of five agriculturally important African orphan crops

Abstract Background The expanding world population is expected to double the worldwide demand for food by 2050. Eighty-eight percent of countries currently face a serious burden of malnutrition, especially in Africa and south and southeast Asia. About 95% of the food energy needs of humans are fulfilled by just 30 species, of which wheat, maize, and rice provide the majority of calories. Therefore, to diversify and stabilize the global food supply, enhance agricultural productivity, and tackle malnutrition, greater use of neglected or underutilized local plants (so-called orphan crops, but also including a few plants of special significance to agriculture, agroforestry, and nutrition) could be a partial solution. Results Here, we present draft genome information for five agriculturally, biologically, medicinally, and economically important underutilized plants native to Africa: Vigna subterranea, Lablab purpureus, Faidherbia albida, Sclerocarya birrea, and Moringa oleifera. Assembled genomes range in size from 217 to 654 Mb. In V. subterranea, L. purpureus, F. albida, S. birrea, and M. oleifera, we have predicted 31,707, 20,946, 28,979, 18,937, and 18,451 protein-coding genes, respectively. By further analyzing the expansion and contraction of selected gene families, we have characterized root nodule symbiosis genes, transcription factors, and starch biosynthesis-related genes in these genomes. Conclusions These genome data will be useful to identify and characterize agronomically important genes and understand their modes of action, enabling genomics-based, evolutionary studies, and breeding strategies to design faster, more focused, and predictable crop improvement programs.

plants of special significance to agriculture, agroforestry, and nutrition) could be a partial solution. Results: Here, we present draft genome information for five agriculturally, biologically, medicinally, and economically important underutilized plants native to Africa: Vigna subterranea, Lablab purpureus, Faidherbia albida, Sclerocarya birrea, and Moringa oleifera. Assembled genomes range in size from 217 to 654 Mb. In V. subterranea, L. purpureus, F. albida, S. birrea, and M. oleifera, we have predicted 31,707,20,946,28,979,18,937, and 18,451 protein-coding genes, respectively. By further analyzing the expansion and contraction of selected gene families, we have characterized root nodule symbiosis genes, transcription factors, and starch biosynthesis-related genes in these genomes. Conclusions: These genome data will be useful to identify and characterize agronomically important genes and understand their modes of action, enabling genomics-based, evolutionary studies, and breeding strategies to design faster, more focused, and predictable crop improvement programs.
Keywords: orphan crops; food security; whole-genome sequencing; transcriptome; root nodule symbiosis; transcription factor Background The world's population is expected to reach 9.8 billion by 2050. Ensuring a sustainable food supply to meet the energy and nutritional needs of the expanding population is one of the greatest global challenges [1]. Approximately 88% of countries currently face a serious burden of malnutrition [2]. To overcome this burgeoning food and nutritional challenge, the use of potential crop plants (both model and non-model) appears to be a better choice. Throughout history, humans have relied on an astonishing variety of plants for energy and nutrition; from 390,000 known plant species, around 5,000-7,000 have been cultivated or collected for food [1,2]. However, in the present century, fewer than 150 species are commercially cultivated for food purposes, and just 30 species provide 95% of human food energy needs. More than half of the protein and calories we obtain from plants are acquired from just three "megacrops": rice, wheat, and maize [3]. This narrow range of dietary diversity is partly a result of decades of intensive research focused on just a few species, which has successfully led to the production of high-yielding varieties of these major crops, usually cultivated under high-input agricultural systems. However, in some regions, we are now witnessing a drastic decrease in their yields, and the question has been raised as to whether rice and wheat (in particular) are currently making enough breeding progress to meet the challenge. All three megacrops are high-energy carbohydrate sources but are limited in protein content. Even if these crops can meet the energy requirement of the increasing world population, they cannot meet the nutritional requirement for active health by themselves [2].
To diversify the global food supply, enhance agricultural productivity, and tackle malnutrition, it is necessary to diversify and focus more on crop plants that are utilized in rural societies as a local source of nutrition and sustenance but have so far received little attention for crop improvement. These landraces (Traditional plant varieties) tend to be locally adapted and can often provide a rich source of nutrition, yet they have largely been ignored by modern interventions. The goal of the African Orphan Crops Consortium [4], an international public-private partnership, is to sequence, assemble, and annotate the genomes of 101 plants that contribute to traditional African food supplies by 2020. These neglected or orphan plants have been seldom studied by scientists but are of major importance in many African countries. They are usually grown by smallholder farmers, either for consumption or local sale, and are a major food source for 600 million rural Africans [5,6]. In this study, we sequenced and assembled draft genomes of five African orphan plant species (Fig. 1), which are highly important to augment food and nutritional security in Africa.
Vigna subterranea (Bambara groundnut; National Center for Biotechnology Information [NCBI]: txid115715), belonging to the Fabaceae family, is a leguminous plant species that originated in West Africa and is cultivated in sub-Saharan areas, particularly Nigeria [7,8]. With good nitrogen-fixing ability and drought tolerance, on average the seeds contain 63% carbohydrate, 19% protein, and 6.5% fat, thereby making bambara groundnut a complete food. Approximately 165,000 tons of this species are produced in Africa each year, but yields are low because efforts to improve Bambara have been neglected for many years [9]. The genomes of mung bean and adzuki bean, which also belong to the Vigna genus, have been published [10,11].
Moringa oleifera (Moringa; NCBI: txid3735) is a highly nutritious, fast-growing, and drought-tolerant tree that is indigenous to northern India, Pakistan, and Nepal [12]. Presently, this species is ubiquitously distributed throughout tropical and subtropical countries, and in particular, covers the major agroecological region in Nigeria. The leaves are rich in protein, minerals, beta-carotene, and antioxidant compounds, which are generally used as nutritional supplements and in traditional medicine. The seeds are used to extract oil, and seed powder can be used for water purification [13,14]. There are varying reports of Moringa production. India is the largest producer of Moringa with an annual production of 1.1-1.3 million tons of tender fruits from an area of 38,000 ha. In Limpompo province, Moringa is cultivated in relatively small areas (0.25-1 hectares), with seed yields of 50-100 kg/ha [15]. Prior to this study, a draft genome of M. oleifera from Yunnan (China) was reported [16], which estimated a similar genome assembly size and gene numbers to our version.
Lablab purpureus (Dolichos bean or hyacinth bean; NCBI: txid35936), a member of the Fabaceae family, is one of the most ancient (>3,500 years) domesticated and multipurpose legume species, which is used as an intercrop in livestock systems. Although it has large agromorphological diversity in south Asia, its origin appears to be African [17]. It is rich in protein, has good nitrogen-fixing ability, and is highly adaptable to diverse environmental conditions [18]. Limited production data are available, suggesting that yields are low. In southwestern parts of Bangladesh, Lablab is reported to have a total production area of approximately 48,000 hectares [17]. In other areas, it has a similarly relatively low production area; e.g., Kenya, approximately 10,000 hectares [19] and Karnataka, India, 79,000 hectares [20].
Faidherbia albida (apple-ring acacia; NCBI: txid138055) is the only tree species in the Faidherbia genus (Fabaceae). Its distinctive key features, such as reverse phenology (leaves grow in the long, dry season and shed during the rainy season) and nitrogenfixing ability, mean that F. albida has been planted as a key agroforestry species in traditional African farming systems for hun- dreds of years [21]. It originated in the Sahara or eastern and southern Africa, then spread across semi-arid tropical Africa, and later to the Middle East and Arabia. Estimates suggest that during the last decade, the tree was cultivated over an area of 300,000 hectares [22]. Average pod production ranges from 6-135 kg per tree per year in the Sudanian zone. In Mana Pools, Zimbabwe, two trees averaged 161 kg per tree in one year [23]. This yield per unit area is about 2,000-3,000 kg/ha, assuming a density of ∼20 mature trees per hectare [24]. Sclerocarya birrea (Marula; NCBI: txid289766) belongs to the Anacardiaceae family and is a traditional fruit tree found in southern Africa, mostly south of the Zambezi river [25]. Fruits are eaten fresh or are used to produce juices and wine, which has substantial socioeconomic and commercialization importance. The seeds of the fruits are rich in nutrition and oil content (56%) and are often consumed raw. It is estimated that the total value of the commercial marula trade is worth USD $160,000 per year to rural communities [26], with values per tree ranging from 315 kg (17,500 fruits) to 1,643 kg (91,300 fruits) [26,27]. A survey in north-central Namibia showed that, on average, there are 5.33 farms per household, with a total of 13,278 fruiting trees.
Considering the limited systematic efforts to improve the breeding of these understudied tropical crops to date, making their genomic data available will provide much-needed impetus to conduct basic and applied translational research to improve and develop them as important, sustainably cultivated food crops. These efforts will be vital for directly or indirectly improving nutrition for the increasing urban populations in the regions where these crops are grown.

Sample collection, library construction, and sequencing
Genomic DNA was extracted either from a tree (F. albida, M. oleifera) or from nursery plantlets (V. subtarranea, L. purpureus, S. birrea) grown at the World AgroForestry Center campus in Kenya using a modified Cetyl TrimethylAmmonium Bromide (CTAB) method [28].
Extracted DNA was used to construct paired-end libraries (insert size ranging from 170 to 800 bp) and mate-pair libraries (insert size >2 kb) following Illumina (San Diego, CA) protocols. Subsequently, sequencing was performed on a HiSeq 2000 platform (Illumina) using a shotgun sequencing strategy to generate more than 100 Gb raw data for each species (see Additional file 1: Table S1). Data were filtered using SOAPfilter (v2.2) [29] as follows: (1) small insert size reads were discarded; (2) Polymerase Chain Reaction (PCR) duplicates and adapter contamination were discarded; (3) reads with ≥30% low-quality bases (quality score ≤15) were removed; (4) bases of low quality were trimmed from each end of the reads; and (5) reads with ≥10% uncalled ("N") bases were removed. At the end, more than 100 × high-quality reads were obtained for each species according to their estimated genome size (see Additional file 1: Table S1).
RNA for transcriptome sequencing was extracted from different tissues of V. subterranea, L. purpureus, F. albida, and M. oleifera. The RNA was extracted using the PureLink RNA Mini Kit (Thermo Fisher Scientific, Carlsbad, CA) according to the manufacturer's instructions. For each sample, RNA libraries were constructed by following the TruSeq RNA Sample Preparation Kit (Il-  lumina) manual and were then sequenced on the Illumina HiSeq 2500 platform (paired-end, 100-bp reads), generating ∼36 Gb of sequence data for each species. Data were then filtered using a similar method to that used in DNA filtration, with a slight modification: (1) reads with ≥10% low-quality bases (quality score ≤15) were removed and (2) reads with ≥5% uncalled ("N") bases were removed (see Additional file 1: Table S2). All the transcriptome data from different tissues were compiled, and the combined version was used to check the completeness of the wholegenome sequence assembly.

Evaluation of genome size
Clean reads of the paired-end libraries were used to estimate genome sizes (insert size 250 bp and 500 bp). k-mer frequency distribution analysis was performed using the following formula: where Num represents the read number of reads used, Len represents the read length, K represents the k-mer length, and K Dep refers to where the main peak is located in the distribution curve [30].
k-mer distributions of F. albida, S. birrea, and M. oleifera showed two distinct peaks (see Additional file 1: Fig. S1), where the second peak was confirmed as the main one for each of the species. The genome sizes of V. subterranea, L. purpureus, F. albida, S. birrea, and M. oleifera were predicted as 550, 423, 661, 356, and 278 Mb, respectively (see Additional file 1: Table S3).

De novo genome assembly
For de novo genome assembly, SOAPdenovo2 (SOAPdenovo2, RR ID:SCR 014986) [29] was used for constructing contigs, followed by scaffolding, and finally gap filling. To build contigs, libraries ranging from 170 to 800 bp were used to construct de Bruijn graphs with the parameters "pregraph -d 2 -K 55," and contigs were subsequently formed with the parameters "contig -g -D 1" to delete links with low coverage. In the scaffolding step, paired-end and mate-pair information was used to order the contigs with parameters "scaff -g -F" and "map -g -k 55." Finally, to fill the gaps within scaffolds, GapCloser version 1.12 (GapCloser, RRID:SCR 015026) [29] was used with the parameters "-l 150 -t 32" using the pair-end libraries. Finally, total assembled lengths of 535.05, 395. 47
To evaluate the completeness of genes in the assemblies, unigenes were generated from the transcript data of each species using Bridger software with the parameters "-kmer length 25min kmer coverage 2" [32] and then aligned to the corresponding assembly using Basic Local Alignment Search Tool (BLAST)like alignment tool (BLAT, RRID:SCR 011919) [33]. The results indicated that each of the assemblies covered about 90% of the expressed unigenes, suggesting that the assembled genomes contained a high percentage of expressed genes (Table 3).
To confirm the accuracy of the assemblies, some of the paired-end libraries were mapped to the genome assemblies, and the sequencing coverage was calculated using SOAPaligner, version 2.21 (SOAPaligner/soap2, RRID:SCR 005503) [34]. Sequencing coverage showed that >99% of the bases had a sequencing depth of more than 10× and confirmed the accuracy at the base level (see Additional file 1: Fig. S2). GC content and average depth were also calculated with 10 kb non-overlapping windows. The distribution of GC content indicated a relatively pure single genome without contamination or GC bias (see Additional file 1: Fig. S3). The GC content of each sequenced genome was also compared with that of a related species. As expected, close peak positions showed that the related species were similar in GC content (see Additional file 1: Fig. S4).

Repeat annotation
Repetitive sequences were identified using RepeatMasker (version 4.0.5) [35], with a combined Repbase and a custom library obtained through careful self-training. The custom library comprised three parts: MITEs (miniature inverted repeat transposable elements), LTRs (long terminal repeats), and an extensive library that was constructed as follows. First, the annotated MITE library was created using MITE-hunter [36] with default parameters. Then, a library of LTR elements with lengths of 1.5-25 kb and two libraries of terminal repeats ranging from 100 to 6,000 bp with ≥85% similarity were constructed using LTRharvest [37] integrated in Genometools (version 1.5.8) [38] with parameters "-minlenltr 100, -maxlenltr 6000, -mindistltr 1500, -maxdistltr 25000, -mintsd 5, -maxtsd 5, -similar 90, -vic 10." Subsequently, For de novo prediction evidence, a series of training sets was made to optimize different ab initio gene predictors. Initially, a set of transcripts was generated by a genome-guided approach using Trinity with the parameters "-full cleanup, -jaccard clip,genome guided max intron 10000, -min contig length 200." The transcripts were then mapped back to the genome using PASA (version 2.0.2) [44], and a set of gene models with real gene characteristics (e.g., size and number of exons/introns per gene, features of splicing sites) was generated. Complete gene models were picked for training Augustus [45]. Genemark-ES (version 4.21) [46] was self-trained with default parameters. The first round of MAKER-P was run based on the evidence as above, with default parameters except "est2genome" and "protein2genome" being set to "1," yielding only RNA and protein-supported gene models. SNAP [47] was then trained with these gene models. Default parameters were used to run the second and final rounds of MAKER-P, producing the final gene models.
The number of protein-coding genes identified in each species was 31,707 in V. subterranea, 20,946 in L. purpureus, 28,979 in F. albida, 18,937 in S. birrea, and 18,451 in M. oleifera. Compared to the other sequenced species in the same genus [10,11], V. subterranea has more genes than mung bean (22,427) but fewer than adzuki bean (34,183). Various gene structure parameters were compared to the related species of each sequenced genome, as summarized in Table 5 and Additional file 1: Fig. S5. BUSCO evaluation showed that at least 85% of 1,440 core genes could be identified across all the species, suggesting an acceptable quality of gene annotation for the five sequenced genomes (see Additional file 1: Table S4).
Non-coding RNA genes in the sequenced genomes were also annotated. Using BLAST, ribosomal RNA (rRNA) genes were searched against the A. thaliana rRNA database or by searching for microRNAs (miRNA) and small nuclear RNA (snRNA) against the Rfam database (Rfam, RRID:SCR 004276; release 12.0) [48]. tRNAscan-SE (tRNAscan-SE, RRID:SCR 010835) was also used to scan for tRNAs [49]. The results are summarized in Table 6.
Moreover, 8,184 gene families of S. birrea, M. oleifera, C. papaya, C. sinensis, and T. cacao were clustered (Fig. 2B), of which 365 gene families containing 798 genes were specific to M. oleifera and 362 gene families containing 796 genes were specific to S. birrea. KEGG pathway enrichment analysis of paralog genes was also conducted (Additional file 1: Tables S6, S7). Functional annotation revealed that in V. subterranea, these paralogs corresponded mainly with carbon fixation, zeatin biosynthesis, and glyoxylate and dicarboxylate metabolism. However, for L. purpureus, the fatty acid elongation pathway was enriched, while in F. albida, pathways corresponding to plant-pathogen interactions and cyanoamino acid metabolism were enriched. In S. birrea, enrichment occurred in plant-pathogen interaction, starch and sucrose metabolism, and fatty acid biosynthesis pathways. In M. oleifera, pathways related to fatty acid and diterpenoid biosynthesis and with cyanoamino acid metabolism were enriched. Using Gene Ontology (GO) analysis, paralog genes in V. subterranea, L. purpureus, F. albida, M. oleifera, and S. birrea were enriched in ion binding, metabolic processes, disease resistance, cell components, and biological processes, respectively.

Phylogenetic analysis and estimation of divergence time
We identified 141 single-copy genes in the 14 species used for the above analysis and subsequently used them to build a phylogenetic tree. Coding DNA sequence alignments of each single-copy family were generated following protein sequence alignment with MUSCLE (MUSCLE, RRID:SCR 011812) [41]. The aligned coding DNA sequences of each species were then concatenated to a supergene sequence. The phylogenetic tree was constructed with PhyML-3.0 (PhyML, RRID:SCR 014629) [ [11,63]. Based on the tree constructed using single-copy-family genes, the divergence time between F. albida and Papilionoideae was predicted to be 79.1 (70.0-87.0) Mya. This is a little different from a previous prediction of the origin of legumes based on two gene markers (matk and rbcL) [64]. The divergence time between M. oleifera and C. papaya was predicted to be 65.4 (59.2-71.1) Mya and 67.9 (53.6-77.3) Mya between S. birrea and C. sinensis (Fig. 1).
Subsequently, to evaluate gene gain and loss, CAFE (CAFE, RR ID:SCR 005983) [65] was employed to estimate the universal gene birth and death rate, λ, under a random birth and death model using the maximum likelihood method. Results for each branch of the phylogenetic tree were estimated and are represented in Fig. 1.
GO enrichment analysis was also conducted on gene pathways in expanded families in the lineage of each sequenced species (Additional file 1: Tables S8, S9). Terms related to energy and nutrient metabolism were commonly distributed in the enrichment output of V. subterranea, L. purpureus, M. oleifera, and S. birrea, e.g., proton-transporting two-sector ATPase complex, cyclase activity, nutrient reservoir activity, and carbohydrate derivative binding.
In F. albida, expanded gene families were related to signal transfer or regulation, e.g., signaling receptor activity, phosphatase regulator activity, and regulation of response to stimulus. Furthermore, the regulatory factors GLABRA3, ENHANCER OF GLABRA 3, AUX1, LAX2, and LAX3 [66][67][68], which are related to the formation of root hairs and lateral roots, were identified in these families. As a traditional agroforestry tree in Africa, F. albida was previously reported to have a root system architecture that displays wide variation under different environmental factors (soil depth, nutrient amount, or water reservoirs) [69]. This suggests its adaptability to the complex environment, which requires signal transferring and regulation. The results obtained from the GO enrichment analysis were consistent with the biological characteristics of F. albida.

Identification of protein, starch, and fatty acid biosynthesis-related genes
Using the amino acid, starch, and fatty acid synthesis genes in soybean [11,71] as bait, we performed an ortholog search in V. subterranea, L. purpureus, F. albida, S. birrea, M. oleifera, G. max, Triticum aestivum, Zea mays, and O. sativa (Additional file 1: Tables S10-S13). Vigna subterranea is a good source of resistant starch (RS) [72], which has the potential to protect against diabetes and reduce the incidence of diarrhea and other inflammatory bowel diseases [73]. High amylose levels can contribute to RS. Previously, studies have shown that deficiency in SSIIIa (soluble starch synthase gene) decreases amylopectin biosynthesis and increases amylose biosynthesis by a granule-bound starch synthase (GBSS) encoded by the Wx gene in O. sativa indica [74]. Downregulation of the soluble starch synthase SSII and of SBE leads to higher levels of RS in barley [75]. Interestingly, in V. subterranea, two out of four GBSSs underwent expansion, suggesting their vital role in controlling starch synthesis (Fig. 4) at the transcriptional and post-transcriptional level. No expansion in GBSS was observed in the genomes of L. purpureus, F. albida, S. birrea, or M. oleifera; in V. subterranea, soluble starch synthase was not expanded. Therefore, we speculate that the expansion of GBSS might be why V. subterranea is rich in RS. Similarly, differences in the copy numbers of choline kinase, a key factor in fatty acid synthesis and storage, were found be-tween the four legumes (V. subterranea, 7; F. albida, 4; L. purpureus, 2; and G. max, 5) and between two orphan species (S. birrea, 1, and M. oleifera, 3). Choline kinase is the first enzyme in the cytidine diphosphate-choline pathway that is involved  in lecithin biosynthesis [76,77]. Based on these observations, we inferred that all the factors required to synthesize lecithin are present in V. subterranea. However, gene expression data remain lacking in terms of the GBSS and choline kinase genes in these the five species. More transcriptomic analysis and chemical tests are required to uncover the mechanisms of their nutrition metabolism.

Identification of the root nodule symbiosis pathway
Legumes (Fabaceae) are well known for their ability to fix nitrogen; an important trait to replenish nitrogen supplies in soil and agricultural systems. Being part of the human food production chain, legumes have a major impact on the global nitrogen cycle. Nitrogen-fixing plants can fix nitrogen through root nodule symbiosis (RNS) using symbiotic nitrogen-fixing bacte-ria. In a previous report, RNS was revealed to be restricted to Fabales, Fagales, Cucurbitales, and Rosales, which together form the monophyletic nitrogen-fixing clade. This suggests a predispositional event in their common ancestor, which enabled their subsequent evolution [78]. Despite this genetic predisposition, many leguminous members of the nitrogen-fixing clade are nonfixers [79]. This has raised the question as to whether the nodulation trait evolved independently in a convergent manner or originated from a single evolutionary event followed by multiple losses. The answer to this question cannot be explained with current genomic approaches because available genomic information of nodulating species is, at present, limited to a single subfamily, the Papilionoideae, in the Fabaceae. Although the Mimosoideae subfamily within the Fabaceae also contains nitrogen-fixing species, none of its members have been genome sequenced.    In this analysis, we identified 16 RNS signal (Sym) pathway genes in three legumes (V. subterranea, L. purpureus, and F. albida) and two non-legumes (S. birrea and M. oleifera). First, we collected the protein sequences of previously reported genes in the Sym pathways of L. japonicus and M. truncatula [80] (Fig. 5). Using these sequences as bait, we predicted the Sym genes in V. subterranea, L. purpureus, F. albida, S. birrea, and M. oleifera through reciprocal best hits generated by a BLASTP search with an E-value of 1e-5 (Table 8). To verify this prediction with syntenic analysis, all-versus-all BLASTP results were subjected to MCSCANX [81] with default parameters to generate syntenic blocks. The result showed that among the legumes, all of the components in the pathway were conserved except for MtNFP/LjNFR5, Lj-CASTOR, CCaMK, MtCRE1/LjLHK1, and NF-YA2, while many components were missing in the non-legumes. Among the three legumes, the orthologous genes MtNFP/LjNFR5, LjCASTOR, and MtIPD3/LjCYCLOPS were absent in F. albida. As previously reported, the expression of NIN is lower in the ipd3-mutant line [82]; analysis of the M. truncatula mutant C31 showed that the Nod Factor Perception gene is essential in Nod factor perception at early stages of the symbiotic interaction [83]. Meanwhile, the function of IPD3 was proved to be partly redundant, which means it is likely that other proteins phosphorylated by CCaMK can partially fulfill this role when IPD3 is absent [82]. Differences in the components of the RNS pathway (Table 8), together with the relatively weak nitrogen-fixing ability [84] of F. albida, is thus a good reference for RNS diversification research.

Conclusion
This comprehensive study reports the sequencing, assembly, and annotation of five genomes of underutilized plants in Africa, along with details of their key evolutionary features. The draft genomes of these species will serve as an important complementary resource for non-model food crops, especially the leguminous plants, and will be valuable for both agroforestry and evolutionary research. Improving these underutilized plants using genomics-assisted tools and methods could help to bring food security to millions of people.

Availability of supporting data
The raw data from our genome project was deposited in the NCBI Sequence Read Archive database with Bioproject IDs PR-JNA453822 and PRJNA474418. Assembly and annotation of the five genomes and other supporting data, including BUSCO results, are available in the GigaDB repository [85], and the data reported in this study are also available in the CNGB Nucleotide Sequence Archive (CNSA: https://db.cngb.org/cnsa; accession number CNP0000096). All genome annotations described here are also available at http://bioinformatics.psb.ugent.be/orcae/ AOCC. Figure S1: K-mer (K = 17) analysis of five genomes. Figure S2: Distribution of sequencing depths of the assembly data. Figure S3: The GC content. Figure S4: Comparison of GC content across closely related species. Figure S5: Statistics of gene models in Vigna subterranea, Lablab purpureus, Faidherbia albida, Moringa oleifera and Sclerocarya birrea. Figure S6: Expansion and contraction of gene families. Table S1: Statistics of the raw and clean data of DNA sequencing. Table S2: Summary statistics of the transcriptome data in four species. Table S3: Estimation of genome size based on k-mer statistics in five species. Table S4: BUSCO evaluation of the annotated protein-coding genes in five species. Table S5: Analysis of gene families of different species. Table S6: Enriched pathways of unique paralogs genes in families.  Table S9: Enriched pathways of genes in families with expansion.

Additional files
Table S10: The copy numbers of protein biosynthesis-related genes in each species.   The copy numbers of fatty acid synthesis and storage-related genes in each species.
Table S13: The copy numbers of fatty acid degradationrelated genes in each species.
Table S14: Numbers of transcription factors in the studied species.