The chromosome-level draft genome of Dalbergia odorifera

Abstract
Background: Dalbergia odorifera T. Chen (Fabaceae) is an International Union for Conservation of Nature red-listed tree. This tree is of high medicinal and commercial value owing to its officinal, insect-proof, durable heartwood. However, the lack of a reference genome has hindered studies of heartwood formation.
Findings: We present the first chromosome-scale genome assembly of D. odorifera, based on Illumina paired-end sequencing, Pacific Biosciences single-molecule real-time sequencing, 10x Genomics linked reads, and Hi-C technology. We assembled 97.68% of the 653.45 Mb D. odorifera genome with scaffold N50 and contig sizes of 56.16 and 5.92 Mb, respectively. Ten super-scaffolds corresponding to the 10 chromosomes were assembled, with the longest scaffold reaching 79.61 Mb. Repetitive elements account for 54.17% of the genome, and 30,310 protein-coding genes were predicted from the genome, of which ∼92.6% were functionally annotated. The phylogenetic tree showed that D. odorifera diverged from the ancestor of Arabidopsis thaliana and Populus trichocarpa and then separated from Glycine max and Cajanus cajan.
Conclusions: We sequenced and report the first chromosome-level de novo genome of D. odorifera. This work provides valuable genomic resources for research on heartwood formation in D. odorifera and other timber trees. The high-quality assembled genome can also be used as a reference for comparative genomics analysis and future population genetic studies of D. odorifera.

the preparations ? Answers: Thank you for the reminder. Details of the library preparation have been added.
Can you detail the preparation of the 10X sequencing library? What quantity of DNA was used in the library preparation? What is the size distribution of the DNA fragments used? Answers: Thank you for the reminder. Details of the 10X library preparation have been added.
Line 293. Please add some information about the Hi-C library preparation or add a reference. What enzyme was used? Answers: The enzyme used was DpnII, which cuts at "GATC". The reference has been added.
Lines 298-299. You used the Hi-C data to organize the scaffolds into pseudomolecules. Did you detect misassemblies after 10X scaffolding? Answers: We used the Hi-C data to organize the scaffolds into pseudomolecules after 10X scaffolding, so misassemblies could not be detected with the 10x Genomics linked reads themselves. However, the Hi-C results allowed us to correct misassemblies that may have been introduced by 10x scaffolding. The Hi-C interaction heatmap indicates that the clustering result is good.
Line 327. Sequences produced from the RNA sequencing libraries are assembled. Can you detail the RNA extraction method? Did you extract RNA from the fresh leaves too? Answers: We have added detailed information about the RNA sequencing libraries. Thank you for your comments.
Line 407. You performed analysis on RNA sequencing, but you mention replicates (line 193). Did you use the three replicates of each sample in your analysis? Have you checked the reproducibility of the results using the replicates? Answers: According to the editor's request, in-depth analysis is not required for a Data Note. We have made some adjustments to the manuscript. Thank you for your comments.
Typos: Line 398: repeated gene pairs. Line 198: "co-expressed" instead of "co-expression". Line 224: "has" instead of "was". Answers: Revised. Thank you for your comments.
Reviewer #2: The manuscript describes the genome assembly, annotation, and transcript sequences of D. odorifera, together with a very preliminary analysis of the genome and some of the medicinally important genes. The quality of the genome assembly and annotation is very high. However, the other analyses are routine. A major problem in assessing this manuscript is that I could not find the figure legends in the main text. The genome assembly statistics table must show the step-by-step improvement for each sequencing technique. The approximate time of the WGD is not mentioned. I did not find any comparative structural genomic analysis between D. odorifera and the nearest species (MCScanX?). At the same time, what is special about the sequences of the medicinally important genes of D. odorifera, vis-a-vis the nearest species, that allows it to produce novel compounds? To publish in GigaScience, I believe some useful biological analysis and information are required. A more informative and biologically oriented manuscript would be suitable for this journal.
Answers: Thank you for your comment. Following the editor's advice, this manuscript was converted to the Data Note format. We have made some adjustments to the manuscript according to your suggestions. Figure legends have been added to the text, and the statistics of the assembly process have been added to the assembly statistics table. The approximate time of the WGD is now described.
The collinearity analysis between Dalbergia odorifera and Arachis duranensis was performed using jcvi (https://github.com/tanghaibao/jcvi). First, we identified all published legume genomes listed at https://www.plabipd.de/plant_genomes_pa.ep. According to the divergence times from http://www.timetree.org/, Arachis duranensis is currently the closest legume species to Dalbergia odorifera with a published genome, so we performed a collinearity analysis between Arachis duranensis and Dalbergia odorifera. Because pictures cannot be uploaded in the response-to-reviewers field, we sent the collinearity figure to the editor by e-mail. However, owing to the long genetic distance (~40 MYA) and low genome collinearity between Arachis duranensis and Dalbergia odorifera, comparative genomics is not yet suitable for verifying the Dalbergia odorifera assembly. According to the editor's request, in-depth analysis is not required for a Data Note. Comparative genomics is usually recommended, but we do not necessarily need to do any biological experimentation to meet this requirement. Thanks again for your valuable comments.

D. odorifera is an ideal biological model to study the mechanism underlying high-quality heartwood (HW) formation due to its insect-proof, durable, fragrant, beautiful HW [1]. HW is defined as the central wood layers of a tree (Additional file 2: Fig. S1). This tissue, containing nonliving cells and nonfunctioning xylem tissue, can affect tree health, with broader implications for forest health [2]. The natural durability of wood as well as the biological, technological, and aesthetic parameters of wood and wood products depend on the presence, quality, and quantity of HW, which is strongly affected by external stimuli [3]. Flavonoids, which are the major compounds found in D. odorifera, are a main class of secondary metabolites that strongly affect various properties of HW, including durability and the color of wood products [2]. In addition, flavonoids are crucial for plant resistance against pathogenic bacteria and fungi, and flavonoid production can be induced by fungal invasion [4]. It is worth noting that carbohydrates can also affect flavonoid accumulation and the formation of phenolic extractives, which contribute to the natural durability of wood during HW formation [5]. Apart from its excellence as a wood product, the HW of D. odorifera, which is known as "JiangXiang" in traditional Chinese medicine, has been included in the Chinese Pharmacopoeia for decades and is widely used to dissipate stasis, stop bleeding, and relieve pain. D. odorifera HW is also used to treat blood stagnation syndrome, ischemia, swelling, necrosis, and rheumatic pain in Korea [6]. Due to its great medicinal and commercial value, D. odorifera is becoming more and more rare: Only

Despite the commercial interest and increasing demand for D. odorifera, the lack of a genome sequence for this species has limited analysis of the mechanism underlying HW formation in D. odorifera, which has seriously hampered conservation and breeding efforts. Advances in sequencing and assembly technology have made it possible to obtain chromosome-level reference genome sequences for organisms once thought to be intractable, including forest trees, which always have high heterozygosity.
Herein, we used Illumina short reads, Pacific Biosciences single-molecule real-time sequencing long reads, Hi-C data, and 10× Genomics linked-read data to assemble the first chromosome-level genome of D. odorifera. We revealed the genomic features of D. odorifera, including repeat sequences, gene annotation, and evolution. This high-quality genome provides the fundamental genetic information needed to study durable HW formation in D. odorifera and related species.

leaves were snap frozen in liquid nitrogen, followed by preservation at -80 °C in the laboratory prior to DNA extraction. To obtain the whole-genome sequences, genomic DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method [9]. The quality and quantity of the isolated DNA were checked by electrophoresis on a 1% agarose gel and a NanoPhotometer® spectrophotometer (IMPLEN, CA, USA), and the DNA was then accurately quantified using a fluorometer (Life Technologies, CA, USA).
In order to generate a chromosome-scale assembly, four different technologies were applied: Illumina paired-end sequencing, Pacific Biosciences single-molecule real-time sequencing, 10× Genomics linked reads, and Hi-C technology. The NEB Next® Ultra DNA Library Prep Kit (NEB, USA) was used to construct the Illumina paired-end library. Genomic DNA (0.5 µg) was fragmented, end-repaired, and ligated to adaptors. The ligated fragments were fractionated on agarose gels and purified by PCR

Table S2). For 10× Genomics library preparation, purified high-molecular-weight genomic DNA of high quality was incubated with Proteinase K and RNaseA for 30 minutes at 25°C. DNA was further purified, indexed, and partitioned into bar-coded libraries that were prepared using the GemCode kit (10× Genomics, Pleasanton, CA). Following the GemCode procedure, 1.0 ng of DNA was used for gel beads in emulsion (GEM) reactions in which DNA fragments were partitioned into molecular reactors to extend the DNA and to introduce specific 14-bp partition bar codes. Subsequently, GEM reactions were polymerase chain reaction (PCR)-amplified. The PCR cycling protocol was as follows:

× coverage) data were generated. The enzyme used in the Hi-C library was DpnII, which cuts at "GATC". All the raw sequence data generated by the Illumina platform were filtered by removing reads with adapters, reads in which N bases exceeded 10%, and reads in which low-quality bases (≤ 5) exceeded 50%. All sequence data are summarized in Table 1.

To fully assist genome annotation, five tissues (flower, leaf, root, seed and stem)

Genomics linked-read data using fragScaff v140324 [16] software, and the resulting scaffolds were further connected to super-scaffolds by Hi-C technology using the methods described by Bickhart et al. [17].
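
As an illustration of the read-filtering criteria listed above, the following minimal Python sketch applies the three rules to a single read. It is not the pipeline used in this study: the function name `keep_read`, its parameters, and the example adapter sequence are assumptions made for illustration, and quality scores are assumed to be already decoded to integer Phred values.

```python
def keep_read(seq, quals, adapters=(), max_n_frac=0.10, low_q=5, max_low_frac=0.50):
    """Return True if a read passes the three filters described in the text.

    seq      -- read sequence as a string
    quals    -- per-base Phred quality scores (integers), same length as seq
    adapters -- adapter sequences to screen for (hypothetical, user-supplied)
    """
    if not seq or len(seq) != len(quals):
        raise ValueError("seq and quals must be non-empty and of equal length")

    seq = seq.upper()

    # Rule 1: discard reads that contain an adapter sequence.
    if any(adapter.upper() in seq for adapter in adapters):
        return False

    # Rule 2: discard reads in which N bases make up more than 10%.
    if seq.count("N") / len(seq) > max_n_frac:
        return False

    # Rule 3: discard reads in which more than 50% of bases have quality <= 5.
    low_quality = sum(1 for q in quals if q <= low_q)
    if low_quality / len(quals) > max_low_frac:
        return False

    return True


# Example with a made-up read; the adapter string is only an illustration.
example_adapter = "AGATCGGAAGAGC"
print(keep_read("ACGTACGTACGTACGTACGT", [30] * 20, adapters=(example_adapter,)))
```

Exact substring matching is the simplest possible adapter check; production trimmers usually allow mismatches and partial overlaps at read ends.
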
According to the Hi-C clustering results, the genome sequences were divided into 10 chromosome clusters (Fig. 1; Additional file 1: Table S7; Additional file 2: Fig. S6). Each technology greatly improved the assembly quality (Additional file 1: Table S9). All these processes yielded a final draft D. odorifera genome assembly with a total length of 638.26 Mb, a scaffold N50 of 56.16 Mb, and a longest scaffold of 79.61 Mb (Table 2). The N50 of this assembly is among the best of the recently completed Fabaceae genomes (Additional file 2: Fig. S7).

To assess the quality of the genome assembly, we mapped paired-end reads with short insert sizes onto the assembly using BWA (BWA, RRID:SCR_010910) v

were the most abundant (comprising 37.7% of the genome), followed by DNA transposons (9.16% of the genome; Table 3).
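
The scaffold N50 reported above is the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. A minimal Python sketch of that definition follows, assuming only a list of scaffold lengths; the function name `n50` and the toy lengths are illustrative and are not taken from the D. odorifera assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until the
    running total reaches at least half of the total assembled length; the
    length of the sequence that crosses that point is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")

    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 20, half = 10; 6 + 5 = 11 reaches the halfway point.
print(n50([2, 3, 4, 5, 6]))  # 5
```

The same function applies to contigs or scaffolds; only the input list changes.
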

Protein-coding gene prediction and ncRNA prediction
Three approaches were employed to predict the protein-coding genes in the D. odorifera genome: homologous comparison, ab initio prediction, and RNA-seq-based annotation. For homologous comparison, the reference protein sequences from the Ensembl and NCBI databases for seven species, Arabidopsis thaliana, Populus trichocarpa, Eucalyptus grandis, Medicago truncatula, Arachis duranensis, Malus domestica, and Glycine max (Additional file 1: Table S1), were aligned against the D. odorifera genome using TBLASTN (TBLASTN, RRID:SCR_011822) v2.2.15 [27] with an E-value of 1e-5 and the "-F F" option. All TBLASTN hits were concatenated after filtering low-quality records. The sequence of each candidate gene was further extended upstream and downstream by 1,000 bp to represent the entire gene region. Gene structures were predicted using GeneWise (GeneWise, RRID:SCR_015054) v2.4.1 [28]. Genes predicted in a homology-based manner were viewed as the "Homology-set". RNA-sequencing (RNA-seq) data derived from 5 tissues (Additional file 1: Table S11)

were constructed, 9,108 of which were common among species (Additional file 2: Fig. S11). In addition, we identified 12,092 gene families shared among five Fabaceae species, while 577 gene families were unique to D. odorifera (Fig. 2A; Fig. 2B). The 1,211 species-specific genes in the unique families were significantly overrepresented in the categories of regulation of replication and repair, such as mismatch repair, DNA replication, nucleotide-excision repair, and homologous recombination (Additional file 2: Fig. S12).

A. thaliana, D. odorifera and C. cajan, and D. odorifera and G. max. The 4DTV values of orthologous gene pairs in the collinear segments were calculated and used to construct a frequency distribution map. The 4DTV plot indicated that after the ancient so-called γ WGD event shared by core eudicots [52], D. odorifera had undergone a new round of WGD (Fig. 2D)

In this study, we presented the first genome of D. odorifera and described its genetic
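
The 4DTV statistic mentioned above is the fraction of fourfold-degenerate third-codon positions at which two aligned coding sequences differ by a transversion. The sketch below computes the uncorrected value under the standard genetic code for two codon-aligned, gap-free sequences. The function name `four_dtv` and the toy sequences are assumptions made for illustration; the software and any multiple-substitution correction used in this study are not specified in this section.

```python
# First two bases of the codon families in which any third base encodes the
# same amino acid (standard genetic code): Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def four_dtv(cds_a, cds_b):
    """Return the uncorrected 4DTV of two codon-aligned coding sequences.

    Returns None when the pair shares no fourfold-degenerate site.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")

    sites = 0
    transversions = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()

        # A site counts only if both codons belong to the same fourfold family.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        base_a, base_b = codon_a[2], codon_b[2]
        if base_a not in "ACGT" or base_b not in "ACGT":
            continue

        sites += 1
        # A transversion swaps a purine (A/G) for a pyrimidine (C/T) or vice versa.
        if (base_a in PURINES) != (base_b in PURINES):
            transversions += 1

    return transversions / sites if sites else None


# Toy pair: CTA/CTC is a transversion, GGT/GGT is unchanged, ATG is not fourfold.
print(four_dtv("CTAGGTATG", "CTCGGTATG"))  # 0.5
```

Applied to every orthologous or paralogous pair in the collinear blocks, the resulting values can be binned into a frequency distribution of the kind the text describes.
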

Table 3. Classifications of transposable elements (TEs) predicted by each method
Figure: Circos plot showing the characterization of the D. odorifera genome
Figure: Evolution of the D. odorifera genome
been assessed by our reviewers. Although it is of interest, we are unable to consider it for publication in its current form. The reviewers have raised a number of points which we believe would improve the manuscript and may allow a revised version to be published in GigaScience.
For a Data Note in-depth analysis is not required, but we do require some validation. Comparative genomics is a good way to do this and is usually recommended, but you do not necessarily need to do any biological experimentation to meet this requirement.
Answers: Thank you for the reminder. The collinearity analysis was performed using jcvi (https://github.com/tanghaibao/jcvi). First, we identified all published legume genomes listed at https://www.plabipd.de/plant_genomes_pa.ep. According to the divergence times from http://www.timetree.org/, Arachis duranensis is the closest legume species to Dalbergia odorifera with a published genome, so we performed a collinearity analysis between Arachis duranensis and Dalbergia odorifera. However, owing to the long genetic distance (~40 MYA) and low genome collinearity between Arachis duranensis and Dalbergia odorifera, comparative genomics is not yet suitable for verifying the Dalbergia odorifera assembly.