A Chromosome-level Assembly of a Wild Castor Genome Provides New Insights into Adaptive Evolution in a Tropical Desert

Wild castor grows in the high-altitude tropical desert of the African Plateau, a region that has high ultraviolet radiation, strong light and extremely dry conditions. Some genes or alleles persist in the wild castor genome to prevent the species being vanished at extreme condition, but usually be decayed under domestication due to the relaxation of natural selection. To investigate the potential genetic basis of adaptation to both highland and tropical deserts, we generated a chromosome-level genome sequence of the wild castor accession WT05 with scaffold and contig N50 sizes of 31.93Mb and 8.96Mb, respectively. Compared with cultivated castor and other Euphorbiaceae plants, wild castor exhibits positive selection and gene family expansion for genes involved in DNA repair, photosynthesis and abiotic stress responses. Genetic variations associated with positive selection were identified in several key genes, such as LIG1, DDB2, and RECG1, involved in nucleotide excision repair. Moreover, a study of genomic diversity among wild and cultivated accessions revealed genomic regions containing selection signatures associated with the adaptation to extreme environments. The identification of the genes and alleles with selection signatures provide insights into the genetic mechanisms underlying the adaptation of wild castor to the high-altitude tropical desert, and would serve as targets for directing improvement of modern castor varieties. , some encode We also identified four genes (Chr01m0050, Chr03m2352, Chr05m2 118, Chr07m3025) encoding homologs of a tetratricopeptide repeat (TPR)-like superfamily protein 53 , a light-sensing related gene 54 (AT5G10620), the photosystem related gene LHB1B3 55 and the photoacclimation gene MBS1. The function of the gene MBS1, which was identified as a key player in singlet oxygen ( 1 O 2 ) signaling, has been well studied. Knockout in Arabidopsis results in hypersensitivity to photooxidative stress, whereas overexpression leads to tolerance to intense light 56 . Evidence also proved that MBS1 couples with β-cyclocitral to induce transport of singlet oxygen ( 1 O 2 ) to the nucleus, ultimately leading to photoacclimation 57 . These results reflect a microcosm of the adaptive evolution of castor in arid, high-light tropical deserts.


Introduction
Considering the extreme variations in genome size in Euphorbiaceae species, we investigated dynamic changes in LTR-RTs during the evolution processes and tried to explain the large variations in the genome size of species in the Euphorbiaceae family. Wild or cultivated castor (350 Mb), compared with the other three important economic species of Euphorbiaceae, has a relatively small genome. LTR proliferation occurred~1.0 Mya, and the most recent amplification was estimated to have occurred 0.2~0.5 Mya, according to the corresponding values of the highest sharp peak and foremost relatively minor fluctuating peak, respectively. More specifically, physic nut (genome size =416 Mb, 59.35%) 12 experienced another two short LTR proliferations at 2.4 and 3.6 Mya; cassava (genome size = 582 Mb, 50.34%) 13 has a broader peak at~1.0 Mya than castor.
Additionally, for tung tree, with a G-scale genome size of 1.2 Gb and repeat sequence nearly 58.74%, we found that LTRs remained active from 1.0 to 2.0 Mya ( Supplementary Fig. 6). The trend of repeat amplification was consistent with the trend of genome size amplification while the length of gene region did not change much. Specially, the ratio of Gypsy-type LTRs of G-scale genome (tung 14 and rubber tree) is nearly two-fold amplification than castor. (Supplementary   Table 8). Similarly, in the genome study of desert poplar (Populus Euphratica), which also found that it is the extensive expansion of Gypsy elements led to its rapid increase in genome size 15 . Therefore, we infer that LTR amplification led to genome-size variations in Euphorbiaceae species.

Comparation with castor reference genome
Both wild and the reference cultivar have a similar genome size but the quality of assembling was quite distinct. First, the number of scaffolds assembled in the WT05 and NSL4733 genomes was 135 and 25,828, respectively. The contig N50 and scaffold N50 lengths of the WT05 genome were 425 (8963.1 vs 21.1) and 64 (31927.7 vs 496.5) times longer than those of the NSL4733 genome, respectively (Table 1) Table 9). These results show that the newly assembled genome had high sequence homology and chromosome integrity, which greatly improved the quality of the castor genome ( Fig. 2a and Supplementary Fig. 7-8). Additionally, the genome sequence similarity between the two versions was estimated to be 99.16%, suggesting that the two genomes have not diverged much yet (Supplementary Table 10).
A better genome assembly would allow us to annotate the structure and function of genes more accurately. By comparing gene annotation between the two genomes, we first found that the number of genes annotated in the NSL4733 genome was greater than that in the WT05 genome; however, the minimum, maximum and average lengths of CDSs in the NSL4733 genome were shorter than those in the WT05 genome (Supplementary Table 11). This result reflected incomplete gene annotation in the NSL4733 genome, likely caused by the fragmented sequence assembly. For instance, the genes Chr03m1425 and Chr01m0783 in the WT05 genome were annotated as containing 9 and 14 exons, respectively, which was validated by RNA-seq data from five castor tissues, whereas in the NSL4733 genome, only 3 and 6 exons were annotated in these two genes, respectively. Detailed examination showed that these two genes were located at the ends of the shorter scaffolds, and thus, the missing exons were the result of an incomplete assembly ( Supplementary Fig. 9). Furthermore, in the process of gene annotation of the WT05 genome, large RNA-seq datasets from 17 castor samples were collected for correcting the gene annotation. Some truncated genes in the previous NSL4733 version were reannotated as complete genes in the new annotation. For example, the Chr09m1125 sequence contains two short genes (30064.t000012 and 30064.t000013) in the NSL4733 version; a similar result was obtained for the Chr10m1108 gene ( Supplementary Fig. 10). These results indicate that the gene annotation has vastly improved in the newly obtained WT05 genome, providing accurate genetic information for evolutionary and functional genomic studies on castor.
Taking advantage of the newly obtained WT05 genome, we reannotated two families of important genes in castor, namely, ricin-coding genes and genes involved in ricinoleic acid synthesis. First, we identified 25 ricin-related genes that are distributed in 5 scaffolds, including 8 ribosome-inactivating proteins (RIPs) with both ricin A and B chains, 9 ricin A chain proteins and 8 ricin B chain proteins. Specifically, 22 of the 25 genes were concentrated in three segments of chromosome 8 ( Supplementary Fig. 11a). In contrast, 28 ricin genes scattered along 17 scaffolds in NSL4733 assembling. Moreover, two set of truncated adjacent gene pairs are supposed to derive from two pseudogenes (Supplementary Table 12 and Supplementary Fig. 11b, c).
Based on the annotation, we attempted to uncover the mechanism underlying the high toxicity of castor. Ricin has been identified as a type II ribosome-inactivating proteins (RIPs) containing an active domain (Ricin toxin A chain) and a binding lectin domain (Ricin toxin B chain) connected by a disulfide bond, which depurinates adenine in a specific residue of the ribosomal RNA and allows ricin bind to cell surface then enter cell through endocytosis, respectively. Notably, there are 8 copies with intact RIPs in the WT05 genome, whereas there were 2 and 1 homologous genes that were found in rubber tree and tung tree, respectively (Fig. 2b, c). Further sequence alignment among these homologs revealed that one motif with 43 amino acids located in the middle of the RTA chain was highly divergent between castor and other plants without ricin, including rubber tree, tung tree, oil palm and tea tree (Supplementary Table 13). Previous research show that one of key active sites(Tyr123) were located in the sequence of RTA that is involved in depurinating adenine in a specific residue of the 28S ribosomal RNA and the mutation is able to result in sevenfold decrease in enzyme activity 16 ( Supplementary Fig. 12). These results suggest that this 43-residue motif in the RTA region plays an important role in the mechanism of action of ricin.
Furthermore, we investigated the expression profiles of the ricin genes by integrating RNA-seq data from different castor tissues, which showed a tissue-specific expression pattern. Eight RIPs with both ricin A and B chains are widely expressed across different tissues. The RIPs genes Chr08m1661, Chr08m1663, Chr08m1667, and Chr08m1657 are specifically and highly expressed in seeds at different developmental stages, and the RIP gene Chr08m1621 is mainly expressed in leaves and stems. The RIP genes showed clearly higher transcriptional activity in seeds than in other tissues, consistent with the observation that castor seeds have higher toxicity than other tissues. In contrast, the genes encoding only ricin A or B chain prefer to have relatively low or no expression across the tissues ( Fig. 2d and Supplementary Fig 13). Due to the lack of some conserved motifs in these genes, it is not clear whether they still have RIP function. These comprehensive expression profiles provide a good reference for ricin gene functional research.
On the other hand, we annotated 301 genes related to fatty acid synthesis and reconstructed the ricinoleic acid synthesis pathway (Supplementary Data 2 and Supplementary Fig. 14). We diagrammed the fatty acid metabolic pathway with the corresponding genes involved in ricinoleic acid synthesis and integrated transcriptome help to identified several key genes (Fig. 2e) shows differential transcript abundance (log10 TPM scale) across different castor tissues and seed developmental stages. In detail, several genes including ACC, MCMT, EAR, KAS, SAD, LACS, PDAT, GPAT, DGAT, Oleosins and FAH12, were highly expressed in the seeds but not in the roots, stems and leaves, which is consistent with the enrichment of ricinoleic acid in castor seeds (Fig.   2e). Specifically, in the pathway of ricinoleic acid synthesis, we found that five key genes, namely, ACC, EAR, KASII, SAD and FAH12, showed low expression in the early seed developmental stage (EDS) and the highest expression levels in the middle seed developmental stage (MDS), then decreasing to the initial expression level (LDS), followed by no or weak expression in the stage of dormancy (DS). This expression trend was consistent with the accumulation of fatty acids in castor seeds 17 .
The genome assembly and gene annotation in WT05 is greatly improved the quality of reference genome of castor, which allows us to identify genetic variations and perform GWAS analysis more accurately. Taking advantageous of the newly-obtained WT05 genome, we reanalyzed the resequencing data from 385 Chinese castor lines that have been published in 2019. We identified genetic variations, and performed GWAS with 9 agricultural traits. We totally identified 2218 SNPs that significantly correlated to 9 agricultural traits (P<1.0×10 −6 ), of which 602 SNPs were not able to be identified in previous analysis 10 . This GWAS analysis not only validated many of the known controlling loci, but also detected a large number of new candidate markers associated with agricultural traits that were unable to be detected in previous analysis ( Fig. 3f and Supplementary Fig. 15). For examples, we detected one novel signal in chromosome 3, in which 44 SNPs are significantly associated with hundred-grain weight. These SNPs are located in upstream 3.25~17.6kb region (scattered in Chr03:2564~2565.6kb) of the gene LOC107262598 that was annotated as the homolog of the gene ICESLEEPER 2 in rice. ICESLEEPER 2 was reported to be associated with amount of seeds and the mutant trend to produce empty panicles, resulting very few seeds in rice 18 . Another new signal with 3 SNPs that are significantly associated with seed volume is located in the upstream 1.6~1.8 kb of the gene LOC8281893 in chromosome 8. This gene encodes UDP-glucuronic acid decarboxylase (OsUXS) and plays important role in rice seed development at a certain stage 19 . Besides, there are more novel SNPs associated with some agricultural traits that was listed in the supplementary Data 3. Therefore, the WT05 genome provides a high quality reference for studies of population genetics and molecular breeding of castor.

Gene family expansion associated with photosynthesis
To investigate the phylogenetic position of castor bean in Euphorbiaceae species, especially the differentiation time between wild and cultivated castor bean, we constructed a phylogenetic tree for 7 species, namely, wild castor (WT05), cultivated castor (NSL4733), cassava, physic nut, rubber tree, tung tree, and flax, with Arabidopsis thaliana as an outgroup, using 622 single-copy gene families. As expected, the wild castor is most closely related to cultivated castor and the tree topology is consistent with previous research 20 . The divergence time between the wild castor and each species was estimated and is shown on the tree (Fig. 3a). For the divergence of wild and cultivated castor, we used the 10906 collinear genes from a total of 722 syntenic blocks between two genomes to calculate the synonymous substitution rate (Ks) distribution, and the results showed peaks at 0.002 to 0.004. According to the substitution rate of 1.3×10 -8 mutations per site per year, we estimated the divergence time to be from 0.077~0.154 Mya (Fig. 3b). Since this divergence time is much earlier than the cultivation time (~1000 years) of castor, we speculated that the wild castor WT05 is not a direct ancestor of the cultivar NSL4773.  Table 15). As an example, one significantly amplified gene family, photosystem II rection center protein B(PSBB) 21 , which is involved in photosynthesis, light reaction, and photosynthetic electron transport in photosystem I, contained four copies in the wild genome, while the cultivated castor had only two copies of PSBB ( Supplementary Fig. 16a). A similar result was found when we compared the gene family between castor and the other five Euphorbiaceae species. Sixteen gene families containing 95 genes were significantly expanded (P < 0.05) in the wild castor genome, one of which contained 4 genes involved in photosynthesis (Supplementary Table 16-17). We verified the accuracy of the copy number amplification events by transcriptome alignment for different tissues from castor ( Supplementary Fig. 16b). These results suggested that the expansion of photosynthesis-related genes in wild castor could be associated with the adaptation to intense light in the desert region.

Positive selection associated with DNA repair
Sunlight is essential for plant growth and constantly replenishes energy through photosynthesis; thus, plants cannot survive without light. However, wild castor plants grow in the desert region on the African plateau, so they must tolerate ultrastrong ultraviolet radiation, which inevitably causes DNA lesions to varying degrees. Considering the impact of intense UV radiation or high light intensity in tropical desert areas on DNA damage. It is hypothesized that wild castor has developed strong DNA repair systems to adapt to intense UV radiation during long-term  Table 18).
GO enrichment analysis also revealed that these positively selected genes (PSGs) were enriched in several categories associated with DNA repair, cellular response to DNA damage stimulus, response to stress, cellular response to stress and cellular response to stimulus (Supplementary   Table 19 and Fig. 4b). This result suggested that there are indeed many genes related to DNA repair undergoing positive selection during long-term adaptive evolution of the wild castor genome. Similarly, according to previous genome studies of Crucihimalaya himalaica 24 , Tibetan antelope 25 , Tibetan chickens 26 and ectothermic snakes 27 , some genes involved in DNA repair were also identified as being under positive selection pressure in order to the adaptation to high altitude environment. These results consistently suggested that the evolution of DNA repair system is an important common mechanism for organisms to the adaptation to extreme environments.
Here, we identified 21 PSGs associated with DNA damage and repair, 9 of which play key functions in DNA repair pathways, including NER, BER, and MR (Supplementary Table 20). For example, the gene LIG1 (DNA LIGASE 1) plays a key role in both DNA replication and excision repair pathways, which could repair both single-and double-strand break lesions 28 (Fig. 4a). Three amino acid replacements were identified in the exons of the gene (L115Y; D138S; P293V), which was also confirmed by transcriptome data from roots, stems, leaves, seeds and flowers (Fig. 4c).
To explore whether these replacements were located in the protein domain, we further simulated the protein three-dimensional (3D) structure to examine the possible effects of the mutations on the enzyme structure using Phyre2 29 . As a result, 616 residues (77% of the protein sequence) were modeled with 100.0% confidence by the single highest scoring template, and the crystal structure was similar to the structure of human DNA ligase I 30 . We found that all three amino acid replacements were located in the DNA-binding domain (DBD) (Fig. 4d). Previous studies have Abiotic stresses such as high temperature, drought and high salinity are also typical features in tropical desert areas. Here, we identified a group of PSGs that are potentially involved in stress responses (Supplementary Table 22). First, we identified one gene, RcuChr03m1916, encoding a homolog of heat shock transcription factor A2 (HSFA2). Upregulation of HSFA2 tends to improve heat tolerance in Arabidopsis thaliana 38 . Its homolog HSFA1 (heat shock transcription factor A1) has been reported to confer resistance to heat stress 39 . Another gene, RcuChr02m0839, encodes a homolog of the Arabidopsis chaperone protein AtDJB1, which belongs to the DNAJ heat shock protein family and participates in osmotic stress tolerance through its effects on ABA signaling pathways 40 . Additionally, we identified a gene encoding a zinc finger protein that is reported to have a functional role in salt tolerance in rice 41 , Arabidopsis thaliana 42 , cotton 43 and poplar 44 ; the genes RcuChr10m1484 and RcuChr05m0192 encode DI19 (DROUGHT-INDUCED PROTEIN19) 45 and ERD (early-responsive to dehydration stress) 46 , respectively, which play roles in drought resistance. These results suggest that these genetic variations in these PGSs could be closely associated with environmental adaptability.

Selection signatures in the wild castor population
The wild castor population is growing under the strong pressure of natural selection in the tropical desert area. Consequently, some genomic regions or genes associated with environmental adaptation in the wild castor population are expected to evolve with high conservation under natural selection pressure. Based on this principle, we calculated the ratio of the genetic diversity In the tropical desert region, wild castor is exposed to not only UV radiation but also drought stress. Here, we identified four genes associated with abiotic stress responses, including RcuChr05m3234, RcuChr09m2155, RcuChr09m2100 and RcuChr05m2127. Gene RcuChr05m3234 encodes the homolog of SnRK2, which is a member of sucrose nonfermenting 1-related protein kinases (SnRK2) in Arabidopsis. This gene can be activated by ionic (salt) and nonionic (mannitol) osmotic stress 47 . The SnRK2 gene family, as a protein kinase, plays important roles in the activation of stress response signals, such as signals associated with the response to salt, drought and osmotic stress 48 . Our comparative genomics analyses identified one nonsynonymous SNP mutation located at position 29292173 of chromosome 5, within the gene SnRK2, which resulted in an amino acid change from phenylalanine to serine (P223S) (Fig. 5c-d).
Based on the simulated 3D structural model, we found that P223S is located within the kinase domain of SnRK2, an adjacent structural motif on the C-lobe of SnRK2.6, surrounding the potential position for the activation loop (Fig 5e). Previous research suggested that the two adjacent site mutations F226D and I230D can result in complete dissociation between SnRK2.6 and ABI1 49 (Fig. 5e). Furthermore, the statistics revealed that the number of allelic mutations in the corresponding locations was obviously different between wild and cultivated populations. (1) (Fig. 5f). Moreover, examination of the expression levels in castor leaves showed that the expression in wild castor was higher than that in cultivated castor (Fig. 5b).
Another gene, RcuChr09m2155, encodes the homolog of TPS1, which is involved in trehalose biosynthesis. This gene has been reported to be involved in drought resistance in drought-tolerant cassava, physic nut and castor crops. In cassava, previous research found that higher amounts of trehalose contribute to higher drought stress tolerance 50 Fig. 18). Expression profiles revealed that TPS1 in wild castor leaves had a higher expression level than that in cultivated castor leaves (Fig. 5b).
Both of these genes have been suggested to play potential roles in drought and salt stress responses. Genetic variations are found in UTRs or intronic regions. We found that the expression levels of the wild alleles were higher than those of cultivated alleles (Fig. 5b and Supplementary Data 5).
We also identified four genes (Chr01m0050, Chr03m2352, Chr05m2 118, Chr07m3025) encoding homologs of a tetratricopeptide repeat (TPR)-like superfamily protein 53 , a light-sensing related gene 54 (AT5G10620), the photosystem related gene LHB1B3 55 and the photoacclimation gene MBS1. The function of the gene MBS1, which was identified as a key player in singlet oxygen ( 1 O2) signaling, has been well studied. Knockout in Arabidopsis results in hypersensitivity to photooxidative stress, whereas overexpression leads to tolerance to intense light 56 . Evidence also proved that MBS1 couples with β-cyclocitral to induce transport of singlet oxygen ( 1 O2) to the nucleus, ultimately leading to photoacclimation 57 . These results reflect a microcosm of the adaptive evolution of castor in arid, high-light tropical deserts.

Discussion
Wild castor plants growing in tropical desert regions of the African Plateau are exposed to a variety of abiotic or biotic stresses under harsh environmental conditions, such as drought, salinity, and, especially, UV damage. To adapt to unique conditions, wild castor has developed a series of self-defense systems that provide valuable germplasm resources and advantageous characteristics, such as resistance to UV damage and drought. These traits are highly valued in castor breeding. In this work, we assembled one chromosome-level genome of wild castor, providing a high-quality reference for genomic studies of castor. Furthermore, through comparative genomic analyses with cultivated castor and other Euphorbiaceae plants, we revealed that a great number of genes associated with stress responses, especially in responses to UV-induced DNA damage and repair, have undergone positive selection and harbor many advantageous alleles and variations for castor improvement.
The wild castor WT05 genome was assembled at the chromosome level, with high consistency and integrity, greatly improving the quality of the reference genome for castor. Moreover, based on the completeness of the WT05 genome, the gene structure in this castor genome is annotated more precisely than that in the previous version of the castor genome. Additionally, we performed careful gene functional annotation and characterized two classes of important genes in castor, including genes associated with ricin and fatty acid biosynthesis. Taking advantageous of the WT05 genome, we identified genetic variations based on the resequencing data of 385 Chinese castor lines that have been published in 2019, and performed GWAS analysis with 9 agricultural traits. We detected more novel SNPs that are significantly associated with agricultural traits, which was not able to be found in the previous study. All of these results confirmed that the WT05 genome version is a marked improvement over the previous version, thus providing a better reference for studies on castor.
The intense UV radiation environment frequently leads DNA damage by inducing nucleotide structure lesions such as intra-or inter-strand cross-links, cleavage of phosphodiester bonds and single/double-strand DNA breaks (SSBs/DSBs), which inevitably cause errors in transcription or translation, probably resulting in highly cytotoxic lesions and even potentially lethal lesions 23, 58 .
Via consistent efforts made in previous studies, DNA repair mechanisms have been well Additionally, a number of genes related to drought also underwent gene family expansion or positive selection in the wild castor genome, including some key genes involved in stress responses, such as SnRK2.6, GST8 and TPS1. These results provide novel insights into the molecular mechanisms underlying the adaptation of wild castor to abiotic stresses and provide a set of genes and alleles as potential targets for castor improvement.
In summary, we assembled a chromosome-level genome of wild castor, providing a high-quality

Gene Family Expansion and Contraction.
We Each significantly expanded and contracted gene family was determined by comparing the cluster size differences between the ancestor and each species, and a P value of or below 0.05 was considered significant. GO enrichment analyses of genes were conducted using the BiNGO application installed in CytoScape software (3.7.2) 91

Genes under Positive Selection
To increase the power of tests for positive selection, we applied the branch site model implemented in the codeml program with the following parameters to estimate the dN/dS substitution rates (ω value): positive model: model=2, NS sites=2, fix_omega=0, omega=1; null model: model=2, NSsites=2, fix_omega=1, omega=1. We also deleted all gaps (clean data = 1) from the alignments to lower the effect of ambiguous bases on the inference of positive selection.
A foreground branch was specified as the clade of WT05. A likelihood ratio test (LRT) was conducted to determine whether positive selection occurs in the foreground branch. LRT was calculated according to the following formula: LRT=2 x |Pos_lnL-Null_lnL|. The significance value (P value) was calculated by the chi-square test, which was conducted by chi2 in the PAML (v4.9) 98 package, and the degree of freedom was set to 2. In this study, PSGs were inferred when the P value was less than 0.05 and contained at least one positively selected site with a high probability (P value ≥0.95) according to the Bayes empirical Bayes (BEB) test. The functional annotation of PSGs in WT05 was also conducted using the same approach and the same P value cutoff as that used for the gene family expansion and contraction analysis.

Sequence alignment and variant detection
Twenty-six wild-type and 159 cultivated sample reads were aligned to the newly assembled genome WT05. The same pipeline and the parameters were used to call variants 10

GWAS analysis and identification of the candidate genes.
The association analysis was performed with the Efficient Mixed-Model Association expedited (EMMAx) program 101 . The significance threshold for the associated SNPs was chosen as − log10P > 6. Functional annotation of the candidate genes were base on the homologs function from closely related species or the model species Arabidopsis thaliana via sequence blast.

Data availability
The assembled genome sequences have been deposited at the NCBI under BioProject PRJNA589181. The transcriptome sequencing data were deposited in the Sequence Read Archive database under the accession number SAMN15783672-SAMN15783680.