Draft genome of the famous ornamental plant Paeonia suffruticosa

Abstract Tree peony (Paeonia Sect. Moutan) is a famous ornamental plant, with huge historical, cultural, and economic significance worldwide. In this study, we reported the ~13.79 Gb draft genome of a wide‐grown Paeonia suffruticosa cultivar “Luo shen xiao chun,” representing the largest sequenced genome in dicots to date. Phylogenetic analyses based on genome sequences demonstrated that P. suffruticosa was placed as sister to Vitales, and they together formed a clade that was sister to Rosids, weakly supporting a relationship of ((Saxifragales and Vitales) and Rosids). The identification and expression analysis of MADS‐box genes based on the genome assembly and de novo transcriptome assembly of P. suffruticosa revealed that the function of C class genes was restricted in flower development, which might be responsible for the stamen petalody in tree peony cultivars. Overall, the first sequenced genome in the family Paeoniaceae provides an important resource for the origin, domestication, and evolutionary study as well as cultivar breeding in tree peony.

tivars worldwide, and China alone has more than 1,000 cultivars . The origin of these cultivars and relationships among them has attracted much attention, but remains unclear due to lack of detailed records of the complicated crossbreeding between wild species and cultivars during the long domestication history (Haw, 2001;Zhou et al., 2014). The numerous different flower colors and shapes of tree peony cultivars represent high genetic diversity and have been used for cultivars' classification in early studies (Zhou, Zhang, & Zhao, 2007).
But based on molecular markers, the genetic groups of cultivars were not necessarily related to flower colors (Guo, Hou, & Zhang, 2009), while different provenances might be the more important factor contributing to the genetic differences (Liu & Lu, 2009;Yuan, Cheng, & Zhou, 2011). The high genetic diversity, along with wide geographic distribution, has made tree peony a fascinating model for studying the mechanisms of diversification and adaptation in plants.
The first high-density genetic map of tree peony was constructed using genotyping by specific-locus amplified fragment (SLAF) sequencing (Cai, Cheng, Wu, Zhong, & Liu, 2015). It contained 1,189 SLAF markers, spanning 920.699 cM with an average distance of 0.774 cM between adjacent markers. The genetic information of P. suffruticosa available to date includes three linkage maps (Cai et al., 2015;Guo et al., 2017;Zhang et al., 2019), 2,415 expressed sequence tags (ESTs) deposited to the NCBI database, and six RNA-seq datasets. Although analysis of the expressed sequences had contributed a lot to our understanding of the mechanisms of flower bud development (Shu et al., 2009), reblooming (Zhou, Cheng, Wang, Zhong, & He, 2013), prolonging vase life of cut flowers (Zhang et al., 2014), and different color formation in petals (Zhang, Cheng, Ya, Xu, & Han, 2015a) or leaves (Luo, Shi, Niu, & Zhang, 2017), our comprehensive and in-depth understanding of the genetic basis underlying the numerous flower morphological differences, oil production, and medicinal use is still limited. The availability of a reference genome sequence of P. suffruticosa would be helpful for the integration of multi-omics data across studies to enable more in-depth research into the biology and genetics of tree peony. Furthermore, a fully annotated genome of P. suffruticosa would serve as a foundation for cloning of important horticultural traits-related genes, identification of new varieties, and conservation of endangered varieties, as well as to promote more efficient breeding of tree peony.
In the present study, we report a de novo assembly and annotation of the P. suffruticosa genome, with an estimated genome size of ~13.66-15.76 Gb using PacBio's Single Molecule, Real-Time Technology (SMRT). Furthermore, we analyzed its phylogenetic relationship with closely related plants based on the genome sequences and reported a comprehensive analyses of the MADS-box gene family in this tree peony cultivar.

| Plant materials and sequencing
An individual of P. suffruticosa "Luo shen xiao chun" (Figure 1) grown in the peony resource spectrum of Luoyang Academy of Agriculture F I G U R E 1 A flowering plant of P. suffruticosa "Luo shen xiao chun." and Forestry Sciences (N34°39' latitude, E112°27' longitude, Luoyang, China) was selected and the voucher specimen was deposited in the Herbarium of China National GeneBank with a code number "HCNGB_00009295". The genomic DNA was isolated from the leaves with a standard CTAB extraction method (Murray & Thompson, 1980). A 20kb library was constructed as described previously (Pendleton et al., 2015). Approximately 20 µg of high-quality genomic DNA was sheared to ~20 kb targeted size and assessed with an Agilent 2,100 Bioanalyzer. Shearing of genomic DNA was followed by damage repair and end repair, blunt-end adaptor ligation, and size selection with a Blue Pippin system (Sage Science). A total of 114 SMRT cells and 177 SMRT cells were sequenced on PacBio RS II system and PacBio Sequel system, respectively. In total, 96.1 million subreads (894 Gb) were generated with an N50 of 14.5 kb and a mean length of 9.3kb. Furthermore, one paired-end library was constructed according to the standard protocol provided by BGI (BGI-Shenzhen) and sequenced on the BGISEQ-500 platform (Goodwin, McPherson, & McCombie, 2016), with a read length of 100 bp, generating a total of 673 Gb clean data.
Total RNA was extracted from the root, stem, shoot, leaf, flower, and flower bud tissues collected from the same individual using a rapid CTAB-based method as described previously (Gambino, Perrone, & Gribaudo, 2008). Paired-end libraries were constructed using standard protocol provided by BGI (BGI-Shenzhen) and then sequenced on the BGISEQ-500 platform, with a read length of 100 bp. In total, 45.71 Gb raw data were obtained and after filtering by SOAPnuke (Version 1.5.6) (https ://github.com/BGI-flexl ab/ SOAPnuke), there were 6.85 ~ 8.46 Gb clean data for each sample (Supplementary Table S1).

| Estimation of P. suffruticosa genome size
A total of 520 Gb high-quality clean reads obtained from the BGISEQ-500 platform were subjected to 17, 19, 21, and 23 kmer frequency distribution analyses using Jellyfish (Marcais & Kingsford, 2011). The frequency graph (Supplementary Figure S1) was drawn and the P. suffruticosa genome size was calculated using the formula: genome size = kmer_Number/Peak_Depth. For 17 kmer, the total number of kmers was 519,130,610, and 870, and the peak depth was 38. The P. suffruticosa genome size was estimated to be 13.66 Gb, and the data used in 17 kmer analysis was about 46.53× coverage of the genome. In 19, 21, and 23 kmer analyses, the genome size was estimated to be 14. 35,14.77,and 15.76 Gb,respectively (Supplementary Table S2).
Since the assembling of highly repetitive genome is sensitive to program parameters in FALCON pipeline, we tuned the parameter values of length_cutoff_pr and overlap_filtering_setting to explore the alternative assemblies (Supplementary Table S3). Finally, we obtained seven different assembly versions (Supplementary Table   S4). Among them, the N50 length ranged from 48.4 kb in version 7 to 76.7kb in version 1, and the assembly size ranged from 11.5 Gb in version 2 to 13.8 Gb in version 6. In an overall view, the assembling result showed that no assembly was undoubtedly better than another one. To choose the most suitable genome assembly for functional genomic studies, we further evaluated the completeness of the seven assemblies by comparing them against a set of 1,440 conserved plant genes in BUSCO embryophyta_odb9 dataset using BUSCO v2.0 (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) pipeline. The completeness score ranged from 57.5% in version 3 to 61.2% in version 6 (Supplementary Table S5).
We observed that the completeness score was not necessarily accordant with the contiguity (contig N50) among the seven assembly versions. Although version 1 has the highest N50 value (76.7 Kb), it captured less genome sequence than other versions and has a relatively smaller completeness score (58.3%). In fact, the assembly version 6 and 7 contained more total sequence than version 1 to 5, which might be due to the set of a smaller value of parameter "length_cutoff_pr = 6,000." We chose the assembly version 6 for further improvement because it had the highest completeness score and contained the most total sequences. High-quality BGISEQ-500 reads were mapped to this assembly with BWA-MEM (Li, 2013) with default parameters, and high-quality mapped reads (MAQ >20) were further used to polish the assembly with Pilon (https ://github. com/broad insti tute/pilon/ wiki) with default parameters. Finally, the obtained assembly was named V_final, which had a total length of 13.79 Gb with an N50 length of 49.94 Kb.
In addition, four publicly available RNA-seq datasets for tree peonies (NCBI Short Read Archive, accession number: SRX336125, SRX314813, SRX698348, and SRX2439581) were aligned to the final assembly version using BLAT (Kent, 2002) for further validation.
Gene structure was predicted using GeneWise (Birney, Clamp, & Durbin, 2004). For transcriptome-based prediction, clean RNA-seq reads generated in this study were mapped to the assembly using TopHat v2.1.0 (Trapnell, Pachter, & Salzberg, 2009) and assembled into transcripts using Cufflinks (Trapnell et al., 2012), then open reading frames were predicted with hidden Markov model (HMM)based training parameters. Results derived from the above methods were integrated by EVM (Haas et al., 2008) to produce a consensus gene set.

| Gene family and phylogenetic analysis
For gene family analysis, OrthoMCL (Li, Stoeckert, & Roos, 2003) was used to construct orthologous gene families on all the proteincoding genes of P. suffruticosa and 7 sequenced plant species (O. sativa (Ouyang et al., 2007), S. lycopersicum, C. roseus (Kellner et al., 2015), G. max, P. persica, and V. vinifera, Kalanchoë fedtschenkoi (Yang et al., 2017)). Before OrthoMCL, BLASTP was performed to find similar matches from different species with an E-value cutoff of 1.0e-05. The number of gene families in each species was calculated based on the composition of the OrthoMCL clusters. Genes that were single copy in an OrthoMCL cluster for all species analyzed were selected to construct phylogenetic trees using two methods.
For the concatenation-based method, the protein sequences were aligned using PRANK software version 170,427 (http://wasab iapp. org/softw are/prank/ ) and was then trimmed using Phyutility version 2.2.6 (Smith & Dunn, 2008) to remove poorly aligned regions with more than 30% missing data. The alignments were concatenated into one supermatrix file, which was used to reconstruct maximum-likelihood (ML) phylogenetic tree using IQ-TREE version 1.5.5 (Nguyen, Schmidt, Haeseler, & Minh, 2015) with automatic model selection and 1,000 bootstrap replicates. For coalescence-based method, individual ML gene trees were reconstructed from the CDS alignments using RAxML version 8.2.11 (Stamatakis, 2014) with GTRGAMMA model and 500 bootstrap replicates. The gene trees were used to reconstruct a species tree using ASTRAL v5.5.9 (Mirarab et al., 2014), with 1,000 bootstraps.
Species divergence times were estimated using MCMCTREE in PAML version 4.9 (Yang, 2007), based on the coalescent phylogenetic tree reconstructed from 511 single copy genes using ASTRAL.
The ML estimates of branch lengths were obtained using CODEML programs in PAML under the JONES + gamma substitution models with the gamma priors set at 0.5. Two priors, the overall substitution rate (rgene gamma) and rate-drift parameter (sigma2 gamma), were set at G (1, 4.3) and G (1, 4.5). The correlated rates were used to specify the prior of rates among internal nodes (clock = 3 in MCMCTREE). The parameters of the birth-death process for tree generation with species sampling were fixed at BDparas = 1 1 0. A loose maximum bound for the root was set at <10.0 (=1,000 Ma).
Markov chain Monte Carlo (MCMC) approximation with a burn-in period of 5,000,000 cycles was obtained, and every 5,000 cycles was taken to create a total of 10,000 samples. To diagnose possible failure of the Markov chains to converge to their stationary distribution, two replicate MCMC runs were performed with two different random seeds for each analysis. The stationarity of the chains and convergence of two runs were monitored by Tracer version 1.7 ( Rambaut, Drummond, Xie, Baele, & Suchard, 2018) (https ://github. com/beast-dev/trace r/relea ses/tag/v1.7.1). Divergence time estimates in TimeTree (Hedges, Marin, Suleski, Paymer, & Kumar, 2015) database were used for selecting the calibration priors.

| Identification of MADS-box genes from P. suffruticosa genome assembly and de novo transcriptome assembly
We identified putative MADS-box genes in P. suffruticosa genome using Then, the clean reads were assembled using Trinity v2.4.09 (Grabherr et al., 2011) with a minimum contig length setting to 150 bp for each sample. We identified putative coding sequences (CDSs) within each transcript with TransDecoder (https ://trans decod er.github.io/) using default parameter settings. We merged the CDSs of all samples and removed redundant sequences using CD-HIT-EST v4.6 (Li, 2006) with parameters "-c 0.8 -r 0." The putative MADS-box genes were identified from these CDSs using the same methods described above.
To further classify these genes into subfamilies, two individual phylogenetic trees for type I and type II genes were constructed using MADS-box protein sequences from P. suffruticosa and Arabidopsis. Multiple sequence alignment was performed using the Clustal X program (Larkin et al., 2007), and phylogenetic trees were then constructed using MEGA5 software (Tamura et al., 2011) with the neighbor-joining (NJ) method. Bootstrap values were calculated with 1,000 replicates to evaluate the support of the nodes.

| Expression analysis of P. suffruticosa MADSbox genes based on genome assembly and de novo transcriptome assembly
The expression profiles of the identified MADS-box genes in different tissues of P. suffruticosa were analyzed using the transcriptome data generated in this study. Because the two gene sets obtained from the genome assembly and from de novo transcriptome assembly were largely different, we calculated the gene expression level for both separately. The filtered clean reads were mapped to the two gene sets using BOWTIE2 v2.2 (Langmead & Salzberg, 2012). The gene expression level was first quantified using RSEM program (Li & Dewey, 2011) and then was normalized by calculating the FPKM value for comparison between different samples.

| Genome assembly and completeness assessment
Based on k-mer analyses, the P. suffruticosa genome was estimated to be ~13.66-15.76 Gb in size (Supplementary Figure S1). It is the largest genome in the sequenced dicots to date and presents a big challenge for genome sequencing and assembly. To assemble the P. suffruticosa genome, a total of 894 Gb third-generation long reads were generated using PacBio RS II system and PacBio Sequel system, representing ~67× coverage of the genome. Although PacBio reads have a relatively high error rate of ∼15%, de novo assembly using these data was proved to be accurate enough with a deep coverage, typically >50× (Berlin et al., 2015). Falcon pipeline (Chin et al., 2016) was used to assemble the genome with significant parameter tuning, and seven different assembly versions were obtained. Completeness assessment of all the assemblies was performed using BUSCO (Simao et al., 2015) to choose the best assembly, followed by polishing with high-quality short reads. The final assembly version spanned 13.79 Gb in 499, 810 contigs (N50 = 49.94 kb) (Table 1). Completeness assessment showed that 66.1% of the expected 1,440 plant conserved genes were detected as complete (Supplementary Table S5), which was comparable to that of recently sequenced mega-genomes (>10 Gb), such as 53% of sugar pine  and 74% of Ginkgo biloba genome (Guan et al., 2016). Additionally, RNA sequence reads generated from six different tissues were mapped to our genome assembly by TopHat v2.1.0 (Trapnell et al., 2009) and the average mapping ratio was 73.2% (Supplementary Table S6). We also mapped four publicly available RNA-seq datasets to this assembly and found that the mapping ratio   Table S10). In addition to protein-coding genes, we predicted 1,215 tRNA, 1,510 rRNA, 960 microRNA (miRNA), and 1,055 small nuclear RNA (snRNA) genes in our assembly (Supplementary Table S11).

| Gene families and phylogenic reconstruction
Using OrthoMCL (Li et al., 2003), the predicted 35,687 protein-coding genes in P. suffruticosa were assigned into 10,882 gene families consisting of 22,279 genes, while 13,408 genes were not organized into groups which might be mis-annotated or derived from lineage-specific expansion. Among these gene families, 1,794 were unique in P. suffruticosa compared with other seven plant species (Supplementary Table S12).
To infer phylogenetic relationships, a total of 511 single copy orthologs corresponding to the eight species were extracted from the clusters and were used to reconstruct phylogenetic trees with coalescence-based and concatenation-based method, respectively (Supplementary Figure S4a,b). The resulting species trees demonstrated a conflict on the placement of K. fedtschenkoi. In the species tree reconstructed from concatenated protein sequences, K. fedtschenkoi was placed as sister to the eudicots. In the coalescent species tree, K. fedtschenkoi was placed as sister to a clade of Vitales + Rosids, which was consistent with the APG IV tree (Chase et al., 2016). However, in both species trees reconstructed from the two methods, P. suffruticosa was placed as sister to Vitales, and they together formed a clade that was sister to Rosids. As the conflict between the concatenation-based tree and coalescence-based tree might indicate complicated evolutionary histories of genes in K. fedtschenkoi, we performed phylogenetic analyses excluding K. fedtschenkoi using the same two methods above.
The resulting tree topologies (Supplementary Figure S4c,d) were congruent, supporting P. suffruticosa as sister to Vitales.

| Identification of MADS-box genes from P. suffruticosa genome assembly and de novo transcriptome assembly
Using two methods of homology search of Arabidopsis MADS-box proteins and HMMER search of MADS-box domain profile, we identified 52 putative MADS-box genes in P. suffruticosa genome assembly, including 36 type I and 16 type II genes (Table S13).
Based on the phylogenetic analysis, 36 type I genes were divided into Mα and Mγ subgroups, which contained 19 and 17 members, respectively (Supplementary Figure S5). The Mβ MADS-box genes are important in endosperm development (Masiero, Colombo, Grini, Schnittger, & Kater, 2011;Zhang et al., 2017) Table S14) were identified using the same methods described above. According to the phylogenetic tree, the missing MADS-box genes of Mβ, SVP, ANR1, and AGL13 subfamilies in genome-based prediction were recovered from transcriptome-based prediction. No FLC orthologous gene was identified, indicating that this subfamily may have been lost in P. suffruticosa. In consideration of the possibility that the two MADS-box gene sets may have overlaps, we used the 32 transcriptome-derived MADS-box genes as queries to BLASTP against the 52 genome-derived MADS-box genes. As listed in Supplementary   Table S15, there were only seven pairs of genes showed high protein identity (>90%). If considering each of the seven pairs of genes to be identical, we obtained a total of 77 MADS-box genes in P. suffruticosa in this study, including 44 type I and 33 type II genes. Of these type II genes, 18 were assigned to be ABCDE genes, including 5 A class genes, 5 B class genes, 2 C/D class genes, and 6 E class genes.

| Expression of MADS-box genes in different tissues
Transcriptome data showed that the expression of the type I MADSbox genes, except PsuMADS5 and TRINITYpsu2, were all too weak to detect in the six different organs (Figure 3b). Since the six organs did not include the seed samples, we supposed that these type I genes might function in the seeds development as they do in Arabidopsis (Bouyer et al., 2011;Kang, Steffen, Portereiko, Lloyd, & Drews, 2008).
In contrast to the weak expression of type I genes, most of the type II genes had a moderate or high expression in certain tissues (Figure 3a,b). The SOC1 gene (TRINITYpsu18) was highly expressed in the vegetative tissues, in accordance with previous studies in tree peony (Zhang, Li, et al., 2015b). The SVP gene (TRINITYpsu16) was widely expressed in vegetative tissues and in flower bud, indicating that this gene may play multiple roles in tree peony development.
Most of the ABCDE genes exhibited the highest expressions in the flower bud and flower of P. suffruticosa, which conformed their important roles in flower development (Figure 3a,b). The A class genes (PsuMADS11, TRINITYpsu29, TRINITYpsu31, and TRINITYpsu33) were highly expressed in almost all six organs, implying they not only play the normal A class gene function in flower but also have multiple functions in other organs. Of the B class genes, AP3 gene (PsuMADS32 or TRINITYpsu14, they were nearly identical) and B-PI gene (TRINITYpsu13) had similar expression profile in flower and flower bud, implying they act as heterodimers on the formation of petals and stamens as they do in Arabidopsis and in other core eudicots (Riechmann, Krizek, & Meyerowitz, 1996;Wuest et al., 2012). The expression of another PI gene (PsuMADS6) could not be detected in neither flower organ, implying this gene had lost the B class gene function. The two B-TM6 genes (PsuMADS17 and PsuMADS24, and TRINITYpsu15 was nearly identical to PsuMADS17) had moderate to high expression in flower organs and stem, and weak expression in other three vegetative organs, which indicates that they may play multiple roles in tree peony.
There were two C class genes (PsuMADS1 and TRINITYpsu22, and TRINITYpsu21 was nearly identical to PsuMADS1), which were homologous to Arabidopsis AGAMOUS (AG) gene (Supplementary Figure S6). Transcriptome data showed that the two C class genes were expressed at low level in flower organs. In the six E class genes, PsuMADS18 (TRINITYpsu24 and TRINITYpsu25 were nearly identical to PsuMADS18), TRINITYpsu30, TRINITYpsu32, and TRINITYpsu34 had moderate to high expression in flower organs, while PsuMADS28 and PsuMADS38 were barely expressed in any of the six organs.

| D ISCUSS I ON
The genome assembly of Paeonia suffruticosa presented in this study represents the largest genome sequenced in the dicots, to date. It is a big challenge to assemble so large genome with high ratio of heterozygosity and repetition. The draft genome was fragmentary and was far from completed. The most possible reason for the poor assembly is that most of the PacBio subreads, with an N50 of 14.5 kb and a mean length of 9.3kb, cannot span the repetitive regions across the large chromosome. Another reason for the poor assembly may be the PacBio data were insufficient. Although ~67× PacBio subreads were used, the error-corrected data used for next assembling step in FALCON pipeline was about 20×. So, the key to improve the genome assembly of Paeonia suffruticosa is to get more and longer sequencing reads, which should be taken into account in improvement of this genome assembly in future.
The draft genome of P. suffruticosa is the first genome sequenced for the Paeoniaceae, and the second sequenced species in Saxifragales. The coalescent phylogenetic tree (Supplementary Figure S4 (a)) placed the two sequenced species of Saxifragales into two different clades, implying paraphyly of the order of Saxifragales.
According to this coalescent tree, the position of P. suffruticosa supports a relationship of ((Saxifragales and Vitales) and Rosids) and the position of K. fedtschenkoi supports a relationship of ((Rosids and Vitales) and Saxifragales). But the monophyly of Saxifragales has been strongly supported by molecular data in previous studies (Soltis et al., 2013;Soltis, Soltis, Endress, & Chase, 2005). Moreover, the proportion of gene trees that supported the tree topologies at each node showed that the positions of P. suffruticosa and K.
fedtschenkoi in the coalescent tree were both weakly supported. ing sepals, A + B + E specifying petals, B + C + E specifying stamens, and C + E specifying carpels (Litt & Kramer, 2010;Theissen & Saedler, 2001;Wellmer, Graciet, & Riechmann, 2014). According to the ABCDE model of flower development, B and C class MADS-box genes determined the identity of petal and stamen. In this study, we identified five nonredundant B class genes and two C class genes from the combined dataset of genome assembly and transcriptome assembly. Except one B-PI gene, the other four B class genes were all highly expressed in flower. It was worth noting that the sequences of the two B-TM6 genes had been proved to be different among wild species and cultivars of tree peonies, which was explained to be related to stamen petalody and different flower shape formation in tree peonies (Shu et al., 2012 (Yanofsky et al., 1990). The loss-of-function or restricted expression of the C function genes has shown to play a central role in the production of excessive numbers of petals in many different species, such as in Thalictrum thalictroides (Galimba et al., 2012), Prunus lannesiana (Liu, Zhang, Liu, Li, & Lu, 2013), Camellia japonica (Sun et al., 2014), and in rose (Dubois et al., 2010). During plant domestication, the causal mutations for convergent changes in key traits are likely to be located in particular genes (Lenser & Theissen, 2013). So, we suppose that the restriction of C class gene function may be responsible for the stamen petalody in tree peony cultivars. In addition, when the function of C class genes is restricted, the expression of TM6 genes can assure the normal development of carpels. It is reasonable to infer that the combined activity of AP3/PI and TM6 genes determines the formation of petal and stamen as well as the conversion between them when the C class gene function is restricted.

| CON CLUS ION
This study presents the first genome in the family Paeoniaceae and the largest eudicots genome sequenced to date. This genome is also an important addition to Saxifragales genomic resources, which will facilitate the research of the phylogeny of this highly diverse clade.
By integrating this genome with transcriptome data, we have demonstrated the use of this genome to explore the molecular mechanism underlying the flower development specified in this ornamental plant and suggested a modified BC model in the formation of petal and stamen. It can be expected that this genome will aid in deciphering the formation of specific and important traits in tree peony, such as various flower colors, oil accumulation in seeds, biosynthesis pathways of pharmacologically active metabolites, and adaption under domestication.

ACK N OWLED G M ENTS
We are grateful to all participants of the Agriculture Department at BGI. We also thank the experts at the CNGB (China National