Chromosome-level genome assembly of a xerophytic plant, Haloxylon ammodendron

Abstract Haloxylon ammodendron is a xerophytic perennial shrub or small tree that has a high ecological value in anti-desertification due to its high tolerance to drought and salt stress. Here, we report a high-quality, chromosome-level genome assembly of H. ammodendron by integrating PacBio’s high-fidelity sequencing and Hi-C technology. The assembled genome size was 685.4 Mb, of which 99.6% was assigned to nine pseudochromosomes with a contig N50 value of 23.6 Mb. Evolutionary analysis showed that both the recent substantial amplification of long terminal repeat retrotransposons and tandem gene duplication may have contributed to its genome size expansion and arid adaptation. An ample amount of low-GC genes was closely related to functions that may contribute to the desert adaptation of H. ammodendron. Gene family clustering together with gene expression analysis identified differentially expressed genes that may play important roles in the direct response of H. ammodendron to water-deficit stress. We also identified several genes possibly related to the degraded scaly leaves and well-developed root system of H. ammodendron. The reference-level genome assembly presented here will provide a valuable genomic resource for studying the genome evolution of xerophytic plants, as well as for further genetic breeding studies of H. ammodendron.


Introduction
Desertification is a serious threat to arid and semi-arid regions, which cover more than 40% of the global land area and are home to about 1 billion people. 1,2 This has become a major global crisis. Desertification has affected 10-20% of these drylands, and the desertification rate is likely to increase rapidly in the face of dramatic climate change and human activities. 3,4 Deterioration of plant cover is the prime indicator of desertification. However, some plant species have developed their own adaptation strategies to overcome water and nutrient shortages in harsh desert environments. [5][6][7][8] These extremophytes have played important roles in anti-desertification projects worldwide. Therefore, studying genetic mechanisms of plant adaptation to extreme desert environments will help us to develop more drought-resistant crops, which is beneficial for future desertification control. In the last few decades, molecular basis of drought and salt tolerance has been extensively studied in the model plant Arabidopsis thaliana. 9 With the rapid development of sequencing technologies, our understanding of drought and salt stress response in plants has advanced by transcriptome analysis of a large number of xerophytic plants. 10 However, genome sequencing of xerophytic plants, which is crucial for comprehensive understanding of their genetic mechanisms of adaptive evolution, has been limited to very few species, including one gymnosperm (Welwitschia mirabilis), one dicot species (Ammopiptanthus nanus), and three monocots (Cleistogenes songorica, Oropetium thomaeum, and Aloe vera). [11][12][13][14][15] Haloxylon ammodendron (Amaranthaceae), a xerophytic perennial shrub or small tree with a high tolerance to drought and salt stress, dominates many sandy and saline areas of Asian deserts. 16 It plays an important role in the maintenance of the structure and function of the desert ecosystem via sand fixation, wind control, and microclimate amelioration. 17 In addition, this species has a high economic value because it has traditionally been used as livestock feed and firewood, 18 and it is the host plant of a rare traditional Chinese medicinal plant, 'desert ginseng', Cistanche deserticola. 19 A few xeromorphic and halomorphic characteristics have been developed in this species, including highly degraded scaly leaves, C 4 photosynthesis in the succulent stem, a deep root system, and high salt content in the main stem and branch. To date, molecular-level studies of H. ammodendron's response to abiotic stressors have been limited to the transcriptome 20-23 and a limited number of genes. [24][25][26] A comprehensive understanding of the adaptive evolution of H. ammodendron has been hampered by the lack of genome information.
In this study, we assembled a chromosome-level genome assembly of H. ammodendron (2n ¼ 2x ¼ 18) 27 with PacBio's high-fidelity (HiFi) reads and Hi-C reads. Using this high-quality genome, we investigated the evolutionary history and molecular mechanisms underlying the desert adaptation of H. ammodendron. The genome assembly presented here will provide a valuable genomic resource for studying the genome evolution of xerophytic plants. This genome is also beneficial for further genetic breeding studies of this ecologically and economically important species.

Plant materials and sequencing
All plant materials used in this study were harvested from an adult plant of H. ammodendron growing in Yinchuan, Ningxia Province, northwestern China (Fig. 1A). The plant samples were authenticated by Dr Lei Zhang from North Minzu University, China. For wholegenome DNA sequencing, total genomic DNA was extracted from fresh leaves using the cetyl trimethylammonium bromide method. 28 First, paired-end libraries with an insert size of 350 bp were constructed and sequenced on an Illumina HiSeq 2500 platform following the manufacturer's instructions (Illumina, San Diego, CA, USA). Second, SMRTbell libraries were constructed using the PacBio 15-kb protocol (Pacific Biosciences, CA, USA), including DNA fragmentation, DNA damage repair and end repair, hairpin adapter ligation, and DNA purification. Long-read HiFi sequencing was then performed using circular consensus sequencing (CCS) mode on a PacBio Sequel II platform. Third, more than 2 g of young leaves were collected for the Hi-C experiment. Hi-C libraries were constructed as described previously, 29 including chromatin extraction and digestion, DNA ligation, and purification. The DNA was sheared to a mean fragment size of $350 bp and sequenced on an Illumina HiSeq X Ten platform.
To aid in genome quality assessment and gene prediction, fresh leaves and stems from the same individual of H. ammodendron were collected for RNA sequencing (RNA-seq). Total RNA was extracted from these tissues, and a DNA-free DNA removal kit (Thermo Fisher Scientific, USA) was used to remove residual DNA. The RNA-seq libraries were prepared and sequenced on an Illumina HiSeq 2500 platform. RNA-seq reads were filtered using Trimmomatic v0.36 30 with default parameters.

Genome assembly and assessment
The genome size of H. ammodendron was estimated based on a 17mer frequency distribution of Illumina short reads generated by Jellyfish v2.2.9 31 with parameters '-m 17 -s 3G -C'. Low-frequency K-mers (<4Â) were removed and the genome size was calculated based on the formula Genome_size ¼ (Total number of k-mersÀNumber of low-frequency k-mers)/Homozygous peak depth. Raw HiFi-sequencing data were processed using CCS v4.2.0 (https:// ccs.how), with the main parameters of min-passes ¼ 3 and minrq ¼ 0.99. The high-quality HiFi reads were then assembled into contigs using hifiasm v0.14, 32 with default parameters. Further, ALLHiC 33 was used to anchor these contigs into pseudochromosomes based on Hi-C data. Placement and orientation errors exhibiting obvious discrete chromatin interaction patterns were manually adjusted according to the contact maps plotted by Juicebox v1.8.8. 34 Genome quality and completeness were assessed by a combination of short reads mapping, transcript alignment, benchmarking universal single-copy orthologs (BUSCO) analysis, 35 and LTR assembly index (LAI) scores. 36 First, the Illumina reads used in genome size estimation were mapped to the assembly using BWA v0.7.17, 37 and the read depth distribution was calculated by SAMTOOLS v1.5. 38 Second, RNA-seq reads were assembled into unigenes by Trinity v2.8.4 39 under the mode without reference, and all unigenes were mapped to the genome using BLAT v36. 40 Third, a BUSCO completeness score was calculated using BUSCO v3.02 in genome mode with the Embryophyta odb10 dataset. Finally, LAI scores were calculated in a sliding window of 3 Mb with a 300 kb step size across the entire genome using LTR_retriever v2.8. 41

Genome annotation
The genome of H. ammodendron was annotated using procedures similar to those used in two previous studies. 42,43 Homology searches of repeats were performed against the assembly by RepeatMasker v4.0.7 44 based on a comprehensive database, including a de novo repeat library generated by RepeatModeler v1.0.11 45 and the 'Viridiplantae' repeat library from the Repbase database v22.11. 46 We also performed de novo detection of intact long terminal repeat retrotransposons (LTR-RTs) by LTR_Finder v1.06 47 and LTRharvest v1.5.10, 48 with the main parameters of min LTR length ¼ 100 bp, max LTR length ¼ 7,000bp, and min LTR similarity ¼ 90%. Finally, LTR_retriever was used to integrate the LTR predictions, remove false positives, and estimate the insertion times of intact LTR-RTs. We then predicted protein-coding genes based on the repeat-masked genome. First, a homology-based prediction was conducted by aligning the protein sequences of Suaeda aralocaspica v1.0, 49 57 was used to detect likely proteincoding regions based on a comprehensive transcriptome database comprising, de novo and genome-guided RNA-seq assembly. Third, a de novo prediction was performed using Augustus v3.2.3 58 with parameters trained with high-quality gene models selected from the PASA predictions with an exon number !3 and CDS length !1,500bp. Finally, all gene predictions were integrated into a final gene set using EvidenceModeler (EVM) v1.1.1. 59 The final gene set was searched against PlantTFDB v5.0 60 to identify transcription factor (TF) genes.
Functional annotation was performed by aligning the proteincoding genes against the SwissProt and TrEMBL databases 61 using DIAMOND v0.9.22. 62 Protein motifs and domains were annotated using InterProScan v5.31 63 by searching multiple publicly available databases. Gene ontology (GO) IDs for each gene were retrieved with Blast2GO v2.5. 64 Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathway mapping was achieved using the KEGG Automatic Annotation Server (https://www.genome.jp/ kaas-bin/kaas_main).

Comparative genomic analysis
The protein sequences of H. ammodendron and seven other sequenced plant species, namely six Amaranthaceae species (S. aralocaspica, B. vulgaris, Beta patula, 65 C. quinoa, S. oleracea, and A. hypochondriacus) and one outgroup species (A. thaliana), were used for the phylogenetic analysis. Single-copy orthogroups among the eight species were distinguished using OrthoFinder v2.3.11 66 with the DIAMOND aligner and the Markov cluster algorithm. For each single-copy orthogroup, protein sequences were aligned by MAFFT-LINSI v7.313. 67 Conserved sites were extracted from the concatenated alignments of all single-copy genes using Gblocks v0.91b. 68 A maximum likelihood phylogeny was constructed based on the conserved sites using RAxML v8.0.17 69 under the PROTGAMMAILGX model with 100 bootstrap replicates.
Divergence times among the eight species were estimated using MCMCTREE implemented in PAML v4.9e 70 with the options 'independent rates' and 'F84' models. The calibration points for the A. thaliana-A. hypochondriacus divergence (111-131 million years ago, Mya) and S. oleracea-C. quinoa divergence (18-57 Mya) were obtained from the TimeTree database (http://www.timetree.org). Gene family expansions and contractions were further determined using CAFE v3.1. 71 To investigate gene families that are possibly involved in desert adaptation in H. ammodendron and xerophytes, gene family clustering was performed using OrthoFinder based on the protein sequences of H. ammodendron, S. oleracea, A. thaliana, C. songorica, and O. thomaeum. Gene families specific to xerophytes (H. ammodendron, C. songorica, and O. thomaeum) and specific to H. ammodendron were extracted using an in-house python script.

Whole-genome and tandem gene duplication analysis
The protein sequences of H. ammodendron and S. aralocaspica were self-aligned and aligned with each other using DIAMOND. Syntenic blocks within and between genomes were identified using MCScanX v1.1, 72 with default parameters. For each syntenic gene pair, the non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) were calculated using the 'add_ka_and_ks_to_collinearity.pl' script from MCScanX. The prevalence of whole genome duplication (WGD) events was examined by checking the Ks distributions of paralogs and orthologs within and between the two species. A Ka/Ks ratio >1 indicates a positive selection.
Tandem gene duplications were identified as homologous gene pairs that (i) are located within 10 consecutive genes on the same pseudochromosome; (ii) are located within 100 kb of each other; and (iii) have an identity !50% and a coverage !70% obtained by BLASTP.

Gene expression analysis
Previously published RNA-seq data of 5% PEG-6000 (Osmotin mimics drought stress) treated H. ammodendron seedlings and controls 22 were downloaded from the NCBI database under the BioProject PRJNA312463. Both the drought-treated seedlings and the controls had three biological replicates. RNA-seq reads were filtered by Trimmomatic and mapped to the assembly using HISAT2 v2.1.0. 73 Fragments per kilobase per million (FPKM) was calculated to evaluate the gene expression level using Cufflinks v2.2.1. 74 Genes that exhibited >2-fold change in FPKM values between droughttreated seedlings and controls were identified as up-or down-regulated genes.

High-quality genome assembly of H. ammodendron
Before assembly, a total of 49.4 Gb Illumina data were generated and subjected to genome size estimation (Supplementary Table S1). In total, 44,091,563,459 k-mers (k ¼ 17) were identified, and a major peak was observed around a k-depth of 61 ( Supplementary Fig.  S1). After removing low-frequency k-mers, the genome size of H. ammodendron was estimated to be 708.7 Mb, with a high degree of heterozygosity (1.04%).
To generate a high-accuracy and high-continuity genome assembly, we obtained a total of 22.6 Gb (31.9Â genome coverage) HiFi long reads, with a maximum and average read length of 64.7 and 15.8 kb, respectively (Supplementary Table S1 and Fig. S2). De novo assembly of these HiFi reads resulted in a preliminary assembly comprising 132 contigs for a total length of 685.3 Mb (96.7% of the estimated genome size). Contigs longer than 1 Mb totalled 675.0 Mb (98.5% of the total assembly length), indicating the high continuity of the assembly. Coverage depth distribution of Illumina reads revealed that there were very few duplicate haplotypes in the assembly ( Supplementary Fig. S3).
To reconstruct a chromosome-level assembly, we generated 84.7 Gb (119.6Â genome coverage) Hi-C clean reads (Supplementary Table S1) and anchored the contigs into pseudochromosomes using the Hi-C scaffolding approach. Ultimately, 682.3 Mb sequences (99.6% of the entire assembly) were assigned to nine pseudochromosomes ( Fig. 1B and C), corresponding to the haploid chromosome number of H. ammodendron. The lengths of pseudochromosomes ranged from 65.5 to 88.7 Mb (Supplementary  Table S2), and all pseudochromosomes showed a well-organized diagonal pattern for intra-chromosomal interactions ( Supplementary  Fig. S4). The final genome assembly contained very few gaps, with an average gap number of 7.2 per pseudochromosome. The contig and scaffold N50 values of this assembly reached 23.6 and 76.2 Mb, respectively (Table 1 and Supplementary Table S3).
To validate the quality of the genome assembly, we first mapped the Illumina reads back to the assembly and obtained an overall mapping rate of 98.6% and a 10-fold minimum genome coverage of 90.4% (Supplementary Table S4). We also assembled the RNA-seq reads into 306,064 unigenes, 82.7% of which had a length coverage of >90% within a single scaffold (Supplementary Table S5). Furthermore, we observed a BUSCO completeness score of 97.8% and an average LAI score of 19.8 for the entire genome (Supplementary Table S6 and Fig. S5). This evidence together indicated that the H. ammodendron assembly attained a high level of quality and completeness.

Annotation of the H. ammodendron genome
We predicted 318.6 Mb of repetitive sequences representing 46.5% of the H. ammodendron genome (Supplementary Table S7). The putative repetitive elements were more likely to be enriched in the central regions of chromosomes ( Supplementary Fig. S6). The percentage of repetitive sequences in H. ammodendron was comparable to that of three other Amaranthaceae species, B. vulgaris  Fig. S7). In addition, we identified 125.1 Mb (18.3% of the genome) of unclassified repetitive sequences, which might be H. ammodendron-specific.
After masking all repetitive sequences, we identified a total of 41,647 protein-coding genes, 97.2% (40,467) of which were located on the nine pseudochromosomes, with an overall gene density of 59.3 gene per Mb (Supplementary Table S2). A BUSCO completeness score of 94.6% was obtained at the gene model level, indicating the high completeness of gene annotation (Supplementary Table S4). Compared to other Amaranthaceae species, H. ammodendron had the shortest average transcript length, which was mainly caused by intron length reduction (Supplementary Table S9). Of the protein-coding genes, 39,032 (93.7%) were functionally annotated in at least one publicly available database, and 21,946 (52.7%) genes were assigned to GO terms (Supplementary Table  S10). Among the protein-coding genes, 1,340 TF genes were predicted and classified into 56 distinct gene families (Supplementary  Table S11).

Evolutionary history of H. ammodendron
We identified 467 strictly single-copy orthogroups among H. ammodendron and seven other plant species (Supplementary Table S12). A wellsupported species tree was recovered based on the multiple sequence alignments of these single-copy genes ( Fig. 2A). The results showed that H. ammodendron clustered together with another desert plant S. aralocaspica with 100% bootstrap support ( Supplementary Fig. S8). This topology remained robust when we excluded the allotetraploid species C. quinoa from the phylogenetic analysis ( Supplementary Fig. S8). The divergence time between H. ammodendron and S. aralocaspica was estimated to be around 42 Mya (Fig. 2A). However, the genome size of H. ammodendron (709 Mb) was significantly larger than that of S. aralocaspica (467 Mb). To gain further insights into the evolutionary history underlying genome size differences between these two species, we examined the prevalence of whole-genome/tandem duplication events and LTR amplifications, which were considered to be the major factors that contributing to genome size expansion. [75][76][77] We identified a total of 31,220 collinear blocks within and between the genomes of H. ammodendron and S. aralocaspica. The Ks distribution of orthologs between these two species showed a major peak around 0.47, which is younger than the peak identified from the analysis of paralogs within H. ammodendron ($0.69), indicating that no independent WGD event had occurred in the H. ammodendron genome after its split from S. aralocaspica (Fig. 2B). However, we found that the H. ammodendron genome displayed a high level of tandem duplication (2,343 arrays containing   Table S12). The Ks distribution of these duplicated genes suggests a recent burst of tandem duplication events in the H. ammodendron genome (Supplementary Fig. S9). Furthermore, we identified 6,357 intact LTR-RTs with a total length of 48.7Mb (7.1% of the genome) in the H. ammodendron genome. In contrast, only 1,682 intact LTR-RTs with a total length of 13.0 Mb (2.9% of the genome) were identified in the S. aralocaspica genome. Approximately 52.9% of the intact LTR-RTs inserted into H. ammodendron were younger than 2 million years, with median insertion times of 1.02 and 0.56 Mya for Copia and Gypsy elements, respectively (Fig. 2C). Interestingly, the recent burst of LTR-RTs and gene duplications was coincided with the aridification process of Asian inland in late Cenozoic, which was possibly driven by the rapid uplift of the Tibetan Plateau or global cooling. 78,79 These results indicate that both recent substantial amplification of LTR-RTs and tandem gene duplication in the H. ammodendron genome may have contributed to its genome size expansion and arid adaptation. In addition, we identified 344 significantly (P < 0.01) expanded gene families (containing 4,762 genes) in the H. ammodendron genome, which were functionally related to 'defense response', 'response to water deprivation', 'meristem maintenance', and 'root development' (Supplementary Fig. S10).

Molecular mechanisms underlying desert adaptation of H. ammodendron
A recent study speculated that a low genomic GC content was possibly related to plant adaptation to harsh nutrient-and water-limited conditions. 11 We observed a total GC content of 35.4% for the H. ammodendron genome, which was lower than that of nonexrophytic species in Amaranthaceae, such as S. oleracea (37.9%), B. vulgaris (35.9%), and C. quinoa (36.9%). The overall GC content of protein-coding genes in the H. ammodendron genome was 36.7%, with 1,669 genes exhibiting a GC content of <32.0% (Fig. 3A). These low-GC genes were found to be highly enriched in 'seed growth', 'response to wounding', 'cellular response to osmotic stress', 'floral meristem determinacy', 'regulation of meristem development', and 'hyperosmotic salinity response' (Supplementary Fig.  S11). Reduction in the GC content of these genes may have contributed to the desert adaptation of H. ammodendron.  C. songorica, and O. thomaeum) (Fig. 3B). The xerophyte-specific gene families were highly enriched in 'response to heat', 'negative regulation of response to salt stress', 'positive regulation of abscisic acid biosynthetic process', 'stomatal closure', 'lateral root morphogenesis', and 'negative regulation of leaf development' (Supplementary Fig. S12). We also identified 2,003 gene families (containing 13,961 genes) that are specific to H. ammodendron (Fig. 3B). These H. ammodendron-specific genes were markedly enriched in 'response to water deprivation', 'embryo development ending in seed dormancy', 'response to wounding', 'circadian rhythm', 'response to brassinosteroid', and 'regulation of stomatal closure' (Supplementary Fig. S13). Gene expression analysis identified 875 up-regulated and 3,616 down-regulated genes in 5% PEG-6000 treated H. ammodendron seedlings compared with the controls. Among these differentially expressed genes, 22 are specific to the three xerophytes, 1,796 are specific to H. ammodendron (Fig. 3C), 257 exhibit a GC content of <32.0%, and 839 are located in the significantly expanded gene families. These genes may play important roles in the direct response of H. ammodendron to waterdeficit stress.
Desert plants usually develop smaller and narrower leaves than plants living in temperate habitat to limit water loss by transpiration, and obtain nutrients from soil through their well-developed roots for optimal growth. 8 Brassinosteroids (BRs) are plant steroid hormones known mainly for promoting organ growth through their combined effect on cell expansion and division. 80,81 Arabidopsis mutant plants with known defects in the biosynthesis or perception of BRs develop small leaves. 82 We identified 71 genes that were involved in metabolism (GO: 0016131) or biosynthesis (GO: 0016132) of BRs in the H. ammodendron genome (Fig. 3D and Supplementary Table S13). Among these genes, 31 were tandemly duplicated, and six were under positive selection (Fig. 3D). We also identified eight genes that were involved in the biosynthesis of strigolactones (SLs; GO: 1901601; Supplementary Table S13), which are a novel class of plant hormones controlling shoot branching in seed plants. 83,84 In rice, treatment with SL biosynthesis inhibitor stopped elongation of seminal roots in wild-type plants under stress conditions. 85 Based on Interproscan annotations, we identified 1 SHORT-ROOT (SHR), 1 SHORT-ROOT-like (SHR-like), 1 SCARECROW (SCR), and 19 SCARECROW-like (SCR-like) genes in the H. ammodendron genome (Supplementary Table S13), which may play important roles in the root development of H. ammodendron. Both the SHR and SCR Arabidopsis mutant plants showed dramatically shorter roots than wild-type. 86,87 SHR-like and SCR-like genes are also found to be closely linked to root formation in several plants. 88,89 Among the genes identified above, nine BR metabolism/biosynthesis genes, one SHR gene, and two SCR-like genes were differentially expressed after drought treatment (Supplementary Table S13). These genes may contribute to the degraded scaly leaves and well-developed root system of H. ammodendron, which further promotes its adaptation to drought stress. These candidate genes provide essential information for future functional studies of the organ development of H. ammodendron.

Conclusion
In this study, we presented a chromosome-level genome assembly of H. ammodendron with high-level quality and completeness. We predicted 41,647 high-confidence protein-coding genes in the genome assembly. No clear evidence of a recent WGD event was found in the H. ammodendron genome. However, both the recent substantial amplification of LTR-RTs and tandem gene duplication may have contributed to its genome size expansion and arid adaptation. An ample amount of low-GC genes was closely related to functions that may contribute to the desert adaptation of H. ammodendron. Gene family clustering together with gene expression analysis identified differentially expressed genes that may play important roles in the direct response of H. ammodendron to water-deficit stress. We also identified several genes possibly related to the degraded scaly leaves and welldeveloped root system of H. ammodendron. The present highquality reference genome sequence will accelerate desert adaptation studies of xerophytic plants as well as further genetic breeding studies of H. ammodendron.

Supplementary data
Supplementary data are available at DNARES online.