A Chromosome-Level Genome Assembly of Toona ciliata (Meliaceae)

Abstract Toona ciliata Roem is an important timber species in the Toona genus of the Meliaceae family and an endangered species due to over-cutting and a low rate of natural regeneration in China. Although molecular markers have been applied to studying population genetic diversity, the absence of a reliable reference genome limits in-depth genetic conservation and evolutionary studies of this species. Here, we reported a high-quality assembly of the whole genome sequence of T. ciliata. The total assembled genome has 520.64 Mb in length anchored on 28 chromosomes (contig N50 = 4.48 Mb). A total of 42,159 genes were predicted after the ab initio, homology-based, and transcriptome analyses. A total of 41,284 protein-encoding genes (97.92%) were functionally annotated and 1,246 non-coding RNAs were identified in the T. ciliata genome. Phylogenomic analysis showed that T. ciliata was divergent at 15.06 (6–25) Ma from T. sinensis of the same genus Toona. This whole genome sequence provides a valuable resource to study the genetic conservation and molecular evolution of T. ciliata in the future.


Introduction
Toona ciliata belongs to the monophyletic genus Toona of the Meliaceae family (Sun et al. 2014). The species, aka Chinese mahogany, is an important tropical and subtropical species and has great socio-economic values, such as the high-quality wood for furniture and the leaves for medicinal material (Edmonds 1993;Liao et al. 2009). It is naturally distributed in Pakistan and western India, Southeast Asia, southern China, Malaysia, and eastern Australia, and considered as an endangered species due to over-cutting and a low rate of natural regeneration (inbreeding depression) in China (Liang et al. 2011).
Previous studies on genetic diversity and molecular evolution of T. ciliata were based on molecular markers. Muellner et al. (2010) used the sequences of nuclear ITS and cpDNA segments (trnS-trnG, psbB, psbT, and psbN genes) to infer the evolutionary relationship of T. ciliata with other species of the Meliaceae family. Other molecular

Significance
Toona ciliata is an important timber species in the genus Toona of the Meliaceae family and an endangered species at Grade II in China. Previous molecular and evolutionary studies of this species were restricted due to the absence of reference genome sequence. In this study, we sequenced the whole genome of T. ciliata, which would provide a valuable resource to study the evolution of T. ciliata and develop molecular markers for studying genetic conservation of this endangered species in the future. marker studies included the use of the sequence-related amplified polymorphisms and simple sequence repeats to analyze population genetic structure and mating systems. These studies indicated that T. ciliata had a high level of population genetic differentiation and significant effects of isolation by distance in its natural distribution in China (Li et al. 2015;Zhan et al. 2019), and a predominant outcrossing system, with selfing and inbreeding (Zhou et al. 2020). However, exploration of molecular markers is limited for our in-depth understanding of the molecular evolution of this species. Here, we reported the high-quality chromosome-level sequences of T. ciliata genome assembled by combining nanopore and Hi-C sequencing analyses. This is alternative to T. sinensis in the genus Toona whose genome sequence was recently assembled (Ji et al. 2021).
For a phylogenetic comparison, we selected four species used by Ji et al. (2021), including Arabidopsis thaliana, Eucalyptus grandis, Salix purpurea, and Prunus persica, and six different angiosperm plant species (Citrus maxima, Citrus reticulata, Populus tremula, Glycine max, Amborella trichopoda, and T. sinensis) for providing the further context of analysis. Although two species in the Meliaceae family, Azadirachta indica (Krishnan et al. 2012) and Xylocarpus granatum (GenBank accession: GCA_019650275.1), were sequenced, the downstream genomic analysis was limited because gene annotations (gff files) were not provided. These two species were not included for phylogenetic analysis. Our phylogenetic analysis helps to view the evolutionary divergence of T. ciliata from T. sinensis and other land plant species.

Genome Assembly
Genome survey was performed with K-mer analysis (K = 21) using three 350 bp-library datasets. The haploid genome size was estimated to be 253.36 Mb in length, and repetitive sequences accounted for 35.89% of the genome size. Genomic heterozygosity was estimated to be 11.90% and the GC content was 34.15%. The karyotype study confirmed that the sample tree has 56 chromosomes 2n = 56 (supplementary fig. S1, Supplementary Material online), consistent with the previous findings (Singh 1951;Styles and Vosa 1971;Mehra et al. 1972).
With the Nanopore sequencing platform (Biomarker Biotechnology Company, Beijing), we obtained 66.02 Gb raw sequence data. After filtering, we obtained 62.85 Gb clean data, with the sequencing depth of about 120.72×, the length of reads N50 of 26.95 kb, and the average read length of 20.25 kb. Distribution of the read sizes was summarized in supplementary table S1, Supplementary Material online.
After corrections of the clean data with Canu and the assembly contigs with Racon and Pilon, we obtained the genome of T. ciliata, which contained 324 contigs, with the N50 length of 4,484,018 bp and a total length of 520,643,266 bp. The GC content was 32.73% (Table 1).
Hi-C assembly with LACHESIS and manual adjustment and inspection were showed in supplementary table S2, Supplementary Material online. A total of 518,944,513 bp (99.67% of the total assembled genome) was anchored on 28 chromosome groups, with the scaffold N50 of 17,615,381 bp in length. Among the sequences located on chromosomes, the sequence length with the order and direction determined was 497,824,081 bp, accounting for 95.93% of the total length of the mapped chromosomes. Figure 1a shows the distribution of gene density, repeat sequence density, GC content, and collinearity within and among chromosomes of T. ciliata. Figure 1b shows the heat map analysis where 28 chromosomes were clearly distinguished. The completeness of genome assembly was assessed using CEGMA v2.5 and BUSCO v4.0 software with the eukaryotic core gene database and embryophyta_odb data set 9. A total of 450 CEGs were assembled with 98.25% completeness, and a total of 1,360 BUSCOs were assembled with 94.44% completeness, 1.11% fragmentation, and 4.44% missing. Material online). The gene annotations were also evaluated by BUSCO analysis with embryophyta data set 9. Overall, 1,572 complete BUSCOs (97.40%) were identified in gene annotation, including 1,045 single-copy (64.75%), 527 duplicated BUSCOs (32.65%), and 22 fragmented BUSCOs (3.6%). In total, 20 genes (1.24%) were recognized as missing BUSCOs in our genome. The transcriptome data were evaluated by Hisat2, and 90.74% of RNA-seq clean data were mapped to our predicted exons.
A total of 41,284 protein-encoding genes (97.92%) were functionally annotated in the T. ciliata genome from  (Table 1). In addition, 1,246 non-coding RNAs were identified, including 676 tRNAs, 218 rRNAs, 149 miRNA, 72 snRNA and 131 snoRNA. Finally, a total of 148 pseudogenes were predicted, with 242,503 bp in total. Toona ciliata had more genes that had four or more copy numbers than T. sinensis.

Gene Family and Phylogenetic Analyses
A phylogenetic tree was constructed using 1,276 singlecopy genes from the whole genome where divergent times among species were estimated using MCMCTree, calibrated with the known fossil records of three pairs of plant species divergent times ( fig. 1c). The divergent time was 15.06 (6-25) Ma between T. ciliata and T. sinensis, and 76.29(75-92) Ma between T. ciliata and A. thaliana. Our estimates of the divergent times between T. ciliata and T. sinensis were overlapped with the previous results (7-49 Ma) derived from genetic markers (Muellner et al. 2010;Cavers et al. 2013;Koecke et al. 2013;Koenen et al. 2015). The divergent time was generally longer between T. ciliata and T. sinensis in genus Toona than between C. maxima and C. reticulata in genus Citrus, 6.66 (2-13) Ma. The divergent times between T. sinensis and E. grandis (101-111 Ma) were comparable with the results (107.7-111.9 Ma) of Ji et al. (2021). However, the divergent times between T. sinensis and A. thaliana were less than those obtained by Ji et al. (2021). As expected, A. trichopoda, the earliest divergent species in angiosperms, had the largest divergent times from all other species investigated. Toona ciliata was divergent at more than 80 Ma but <100 Ma from P. tremula, S. purpurea, G. max, and P. persica.

Sample Collection, DNA Extraction, and Genome Sequencing
The sample for genome sequencing was collected from one clone of the individual growing in Pupiao, Baoshan City, Yunnan Province, China (25.04N, 99.06E) and identified as T. ciliata var. ciliata. Figure 1d shows the individual cultivated in a pot in South China Agricultural University. Young and healthy leaves of the 1-year-old plant were collected for DNA isolation. Genomic DNA (gDNA) was extracted using the cetyltrimethylammonium bromide method (Doyle 1987). The concentration and purity of the gDNA were determined using a Nanodrop 2000 spectrophotometer and a Qubit fluorometer. DNA integrity was evaluated on a 0.5% agarose gel.
Genome sequencing and assembly were carried out by Biomarker Biotechnology Company in Beijing. The Nanopore reads were filtered and corrected using Canu (Koren et al. 2017). Nanopore sequencing library was constructed using a total of 9 μg gDNA to select larger fragment sizes (>10 kb) using a Blue Pippin Automatic Nucleic Acid Recovery System. The standard ONT library prep protocol was applied with a Ligation Sequencing Kit (SQK-LSK109) (Deamer et al. 2016). The raw reads were filtered with the thresholds of Q-value >7 and the minimum length of read fragments >500 bp. The high-quality reads were used to assemble the genome.

Genome Size and Assembly
Preliminary genome survey was performed with K-mer analysis using three 350 bp-library datasets, including estimation of haploid genome size, proportion of repetitive sequences, genomic heterozygosity, and GC content. Our karyotype study was done to determine chromosomes of the sample.
We constructed Hi-C fragment libraries from 300 to 700 bp insert sizes (Vaser et al. 2017), and sequenced through Illumina platform. Adapter sequences of raw reads were trimmed, and low-quality pair-end reads were removed for clean data. The final valid reads were selected after the removal of the invalid read pairs, including dangling-end and self-cycle, re-ligation, and dumped products using Hic-Pro v2.10.0 (Servant et al. 2015).
The assembly results were assessed from three aspects: (1) the mapped rate (%) of clean reads on the reference genome sequence with bwa-mem software (Li 2013) for double-ended sequencing and bwa software for shorter sequences (Li and Durbin 2009); (2) the CEGMA(Core Eukaryotic Genes Mapping Approach) v2.5 (default parameters) database that contained 458 conserved key genes in eukaryotes (Parra et al. 2007) was used to evaluate the integrity of the final genome assembly; (3) BUSCOv4.0 software (Simão et al. 2015) was used to evaluate the integrity of the genome assembly by using OrthoDB V9 embryophyta database containing 1,440 conserved core genes. Parameters used with BUSCO were: -evalue 1e−03 (E-value cutoff for BLAST searches), -sp arabidopsis (reference species).
GenBlastA v1.0.4 program (She et al. 2009) was used to scan the whole genome after masking the predicted functional genes. Putative candidates were then analyzed by searching for non-mature mutations and frame-shift mutations using GeneWise v2.4.1 (Birney et al. 2004).

Gene Family and Phylogenetic Analyses
We compared genomic sequences of 11 species (A. thaliana, A. trichopoda, C. maxima, C. reticulata, E. grandis, G. max, P. persica, P. tremula, S. purpurea, T. sinensis, and T. ciliata), with an emphasis on the evolutionary divergence between T. ciliata and T. sinensis. Genome sequences of these species except T. ciliata were downloaded from different databases (supplementary table S6, Supplementary Material online).
The protein sequences of 11 species were classified by gene family with Orthofinder V2.4 software (Emms and Kelly 2019), and the comparison method was diamond while E-value was 0.001. PANTHER V15 database (Mi et al. 2019) was used to annotate the obtained gene families. GO and KEGG enrichment analyses were carried out for the gene family unique to T. ciliata by clusterProfile v3.14.0 (Yu et al. 2012).
The predicted protein sets were condensed to include a single peptide sequence for each gene by filtering out redundant alternative splicing events with Gblocks V0.91 (Talavera and Castresana 2007). The single-copy genes were used to construct phylogenetic tree by the ModelFinder (Kalyaanamoorthy et al. 2017), and the optimal model was JTT + F+I + G4 with the maximum likelihood (ML) method. The number of bootstraps was set to 1,000. By combining the known divergent times of multiple species derived from TIMETREE (http://www.timetree.org/), the divergent times among species were calculated using the MCMCTREE module in PAML v4.9 (Yang 1997;Puttick 2019).

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.