Chromosome-level reference genome and alternative splicing atlas of moso bamboo (Phyllostachys edulis)

Abstract Background Bamboo is one of the most important nontimber forestry products worldwide. However, a chromosome-level reference genome is lacking, and an evolutionary view of alternative splicing (AS) in bamboo remains unclear despite emerging omics data and improved technologies. Results Here, we provide a chromosome-level de novo genome assembly of moso bamboo (Phyllostachys edulis) using additional abundance sequencing data and a Hi-C scaffolding strategy. The significantly improved genome is a scaffold N50 of 79.90 Mb, approximately 243 times longer than the previous version. A total of 51,074 high-quality protein-coding loci with intact structures were identified using single-molecule real-time sequencing and manual verification. Moreover, we provide a comprehensive AS profile based on the identification of 266,711 unique AS events in 25,225 AS genes by large-scale transcriptomic sequencing of 26 representative bamboo tissues using both the Illumina and Pacific Biosciences sequencing platforms. Through comparisons with orthologous genes in related plant species, we observed that the AS genes are concentrated among more conserved genes that tend to accumulate higher transcript levels and share less tissue specificity. Furthermore, gene family expansion, abundant AS, and positive selection were identified in crucial genes involved in the lignin biosynthetic pathway of moso bamboo. Conclusions These fundamental studies provide useful information for future in-depth analyses of comparative genome and AS features. Additionally, our results highlight a global perspective of AS during evolution and diversification in bamboo.

Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "Then we aligned the moso chromosomes to the rice genome and we obtained a mean coverage of ~59.77%" 6. p. 5 L13: About bamboo BAC sequences, I suggest mentioning that these sequences are derived from other bamboo specie (Ph. heterocycla) in this section or in the additional table S6.
Response: We appreciate this observation. In fact, the old Latin name, Ph. heterocycla, is a synonym of Ph. edulis and the both Latin names indicate the same bamboo (moso bamboo). Therefore, the BAC sequences mentioned in our manuscript are also derived from moso bamboo. 7. p. 5 L24: Provide correct reference -"…we predicted 51,074 high-quality protein-coding loci… (Additional Table S10)". Instead of Table S10, it should be Table S11.
Response: Thank you very much for pointing out this error. We have re-organized and renumbered the Additional Table, as follows: "we predicted 51,074 high-quality protein-coding loci with intact structures in moso bamboo (Additional Table S17)" 8. p. 5 L30: Provide correct reference -"… ~17% of the gene models were precisely refined (Additional Table S11)". Instead of Table S11, it should be Table S12.
Response: Thank you very much for pointing out this error. We have revised the table and re-organized and re-numbered the Additional Table, as follows: "According to our results, ~17% of the gene models were precisely refined by the UTR addition and internal structural adjustment (Additional Table S19)." 9. p. 5 L36: Provide correct reference -Regarding the annotation using BUSCO, the reference should be Additional Table S13 instead of S12 in " (Fig 1d and Additional Table  S12)".
Response: Thank you very much for pointing out this error. We have added the reference of BUSCO, and re-organized and re-numbered the Additional Table, as follows: "According to the completeness assessment of the annotation using BUSCO [1], moso bamboo (95.2%) was more complete than Z. mays (92.2%) but close to O. sativa (95.6%) ( Fig. 1d and Additional Table S20)" 10. p. 7 L27: The additional table reference for the enrichment analysis should be S25 instead of S26.
Response: Thank you very much for pointing out this error. We have re-organized and re-numbered the Additional Table, as follows:  "As the functional implication of AS genes, the enrichment analysis result showed 885 genes, which alternatively spliced in all samples, significantly enriched in RNA metabolic processing, mRNA processing, RNA processing and RNA splicing in the processes (Additional Table S25)." 11. p. 7 L32: "…which account for one-third of the AS events (termed as among-tissue)." According to additional Fig S18, the AS events classified as among-tissue correspond to two-third of the AS events.
Response: Thank you very much for pointing out this error. We are sorry for the typo error and we have revised the sentence, as follows: "Since AS possess strong specificity to different tissues or developmental stages, we identified 181,105 tissue-specific AS events (67.57%), which account for two-thirds of the AS events (termed as among-tissue)." 12. p. 8 L43-45: The sentence "the distribution of the TE genes in the 8 datasets was examined. A substantially negative correlation" could be '…was examined and a substantially negative…' Response: Thank you for this excellent suggestion. We have revised this sentence, as follows: "Moreover, the distribution of the TE genes in the 8 datasets was examined and a substantially negative correlation was observed, indicating that the more conserved genes had more TE insertions." 13. p. 10 L28: "…a higher percentage of IR (38.22%) and other AS types (total 28.18%) were observed in bamboo." In order to avoid misunderstanding, I suggest clarifying that 'other AS types' represent a set of AS events except the main AS types already mentioned in the manuscript.
Response: Thank you for this excellent suggestion. We have added the related descriptions in the Analysis part, as follows: "In subsequent analyses, we defined the four main AS types represented intron retention (IR), alternative 3' splice site donor (A3SS), alternative 5' splice site acceptor (A5SS), and exon skipping (ES), and we also defined the other AS types represented some AS types except the above four main AS types." 14. p. 16 L50: Please, provide release number of the pfam-A.hmm database.
Response: Thank you for this excellent suggestion. We have added the related information, as follows: "The filtered sequences were subsequently analyzed by hmmsearch using the Pfam-A.hmm database (released 2017/03/31)." Response: We appreciate this observation. We have added the related description in the figure legend of Figure 2, as follows: "IR, A3SS, A5SS, and ES represents intron retention, alternative 3' splice site donor, alternative 5' splice site acceptor, and exon skipping, respectively." 16. Figure S3: In the figure legend "…The while boxes…" should be '…The white boxes…' Response: Thank you very much for pointing out this error. We are sorry for the typo error and we have revised the figure legend of Figure S3, as follows: "The white boxes in the BAC represent ambiguous bases (Ns) and the yellow line represent well aligned sequences between the BAC and the sequences." 17. Fig. S17: Is the x-axis data label named 'AS' correct?
Response: Thank you very much for pointing out this error. We are sorry for the typo error. The second pillar in the X-axis should be 'IR' instead of 'AS'. Therefore, we have revised the figure.
18. Table S1: Please provide the correct number of libraries in the 'Total' description.
Response: Thank you very much for pointing out this error. The total number should be 61 and we have revised the table 19. Table S3: Asterisk with description in the legend is not shown in the table.
Response: Thank you very much for pointing out this error. We have added asterisks in the Table S3 20. Table S28: I recommend excluding the words 'totally' and 'were' in the table legend.
Response: Thank you for this excellent suggestion. We have revised the table legend of  Table S28, as follows: "Additional Table S28. One hundred and forty genes of lignin biosynthesis pathway experimentally validated collected from public studies" 21. Some figures and tables citation are missing in the manuscript, such as Fig. 1a, 1b, and 3d; and additional table S23.
Response: Thank you for this excellent suggestion. We have added the figures and tables citation in the manuscript and reorganized tables citation in the Additional Files, as follows: "Then, the Hi-C assembly was generated with total length reached 1.91 Gb as well as contig and scaffold N50 length with 53.29 Kb and 79.90 Mb based on the Hi-C data and the improved WGS assembly ( Fig. 1a and 1b)." "Additionally, compared with the AS events among the genes expressed in samples with different specificities (maxTs) (for details, see Methods), the maxTs obviously increased from D8 to D1, representing an enhancement in the sample specificity from a highly conserved gene dataset to a poorly conserved dataset (Fig. 3d)." "Additionally, for the transcriptomic analysis, approximately 379 Gb and 5 Gb of raw data were produced from the Illumina and PacBio platforms, respectively (Additional Tables S2-7)" Dataset 22. The PacBio reads (IsoSeq) must also be submitted to GiGADB or SRA and their accession number provided in the manuscript.
Response: We appreciate this observation. We have provided the SRA accession number (SRR7032261-69) for Iso-Seq data in the manuscript, as follows: "RNA-Seq raw sequence data for the 26 samples and Iso-Seq raw sequence data for a mixture sample were deposited in NCBI Short Read Archive database under the accession numbers: SRX2408703-28 and SRR7032261-69, respectively." Responses to the comments of Reviewer #2 Reviewer #2: Zhao et al reported a much improved genome assembly of moso bamboo, and characterized its alternative splicing (AS) atlas. While I find the assembly result very impressive, I have several questions on the methodology. 1. According to the method text in the "Additional File", the RNA reads were mapped onto the genome by "BLAT" and refined by HISAT. This is a rather unconventional way to map RNA-seq data to a reference genome. Why not just use HISAT2? BLAT was designed to align transcripts, not individual RNA-seq reads. Also, I'm not aware of the adjustment function in HISAT. Please make sure the read mapping was done correctly because this is fundamental to the AS analysis.
Response: Thank you very much for pointing out this error. We are sorry for the unclear and confusion description of the RNA-Seq analyses in our Additional File. Indeed, as your mentioned, correctly mapping is fundamental for AS analyses, we double-checked our shell scripts and found the aligning RNA-Seq reads only used HISAT2 (release 2.0.4) rather than HISAT and BLAT. We have revised the sentence in the Additional File, as follows: "Similarly, RNA-Seq data, a kind of high-throughput expressed data, were mapped to the genome to identify exon-intron splicing junctions and refine the alignment of RNA-Seq reads to the genome, using HISAT2 (version 2.0.4) [2]." 2. One important conclusion the authors made is that the conserved genes tend to have more AS events. It is however unclear to me how the authors measured the degrees of conservation. The authors did a gene family classification, and "obtained 8 datasets of orthologous genes representing different levels of conservation (Fig. 3a) designated dataset8 (more conserved genes) to dataset1 (bamboo-specific genes) based on a phylogenetic relationship of 8 selected species." But there was no further explanation. What does the "most conserved genes" in dataset8 entail? Based on presence or absence? And what is the difference between, say dataset8 and dataset7? How gene families were clustered was also unexplained (at least I couldn't find it). These are critical details that are missing.
Response: Thank you for this excellent suggestion. Based on the genome-wide identification of orthologous genes in the selected 8 plants (Amborella trichopoda, A. thaliana, Elaeis guineensis, B. distachyon, O. sativa, Spirodela polyrhiza, S. bicolor and Ph. edulis) and the species divergence time in a phylogeny tree (Fig. 3a), we identified eight orthologous gene datasets. For instance, dataset8 (D8) represents common orthologous genes in the selected 8 plants, which were located in an early divergence time in the phylogeny tree. D7 represents common orthologous genes in the selected 7 plants except A. trichopoda (the specie with the earliest divergence time) and D7 doesn't contain orthologues genes in D8, and so on. Thus, D1 represents bamboo-specific orthologous genes, which were located in later divergence time. According to a previous study [3], we obtained the divergence times of genes based on the presence and absence of orthologs in the phylogeny. In our subsequent study, thus, we considered the bamboo-specific genes (D1) as a poorly conserved gene dataset and the common genes in all selected plants (D8) as a highly conserved gene dataset, and the degree of conservation decreased monotonically from D8 to D1. Lastly, we have revised the related descriptions and Fig. 3a, as follows: "Evolutionary analysis of AS in moso bamboo Based on the genome-wide identification of orthologous genes in the selected 8 plants (Amborella trichopoda, A. thaliana, Elaeis guineensis, B. distachyon, O. sativa, Spirodela polyrhiza, S. bicolor and Ph. edulis) and the species divergence time in a phylogeny tree (Fig. 3a), we identified eight orthologous gene datasets. For instance, dataset8 (D8) represented common orthologous genes in the selected 8 plants, which were located in an early divergence time in our constructed phylogeny. D7 represented common orthologous genes in the selected 7 plants except A. trichopoda (the specie with the earliest divergence time) and D7 doesn't contain orthologues genes in D8. And so on. Thus, D1 represented bamboo-specific orthologous genes, which were located in later divergence time. According to a previous study [3], we obtained the divergence times of genes based on the presence and absence of orthologs in the phylogeny. In our subsequent study, thus, we considered the bamboo-specific genes (D1) as a poorly conserved gene dataset and the common genes in all selected plants (D8) as a highly conserved gene dataset, and the degree of conservation decreased monotonically from D8 to D1." 3. A species phylogeny was reconstructed from 8 genomes, but no information is provided about how this was done. What are the "single-copy orthologous genes"? What methods and programs you used for phylogenetic reconstruction?
Response: Thank you for this excellent suggestion. We have revised the related methods in the Additional File, as follows: "S3.1 Orthologous Gene and Phylogenetic The identification of orthologous gene clusters was considered as a fundamental aspect of genome evolution. Single-copy gene families and multi-gene families were identified by orthMCL (version 2.0.9) [4] among Ph. edulis and other 7 plant species, including Amborella trichopoda (version 1.0) from Amborella Genome Database (amborella.huck.psu.edu), Elaeis guineensis (GCF_000442705.1) from NCBI database, Arabidopsis thaliana (TAIR10), Brachypodium distachyon (version 3.1), Oryza sativa (version 7.0), Spirodela polyrhiza (version 2) and Sorghum bicolor (version 3.1) from the ENSEMBL database. The statistic of the gene family clustering in the 8 species was showed in Additional Table S24. The comparison of gene family clustering was provided in Additional Fig. S7. Afterwards, all single-copy genes were used to construct the phylogenetic tree by PhyML (version 3.0) [5] specifying a HKY85 substitution model with a gamma distribution across sites (Additional Fig. S8)." 4. Species divergence time was estimated, but again, the authors provided no methodological detail. How was the molecular clock estimated? Did you test the validity of assuming a molecular clock (e.g. relative rate test)? Further, the time calibrations listed in the Additional File need citations; they also to me look like secondary calibrations rather than "fossil time".
Response: Thank you for this excellent suggestion. We are sorry for the unclear and confusion description in the analysis of the species divergence time. Indeed, we estimated the species divergence time using calibration time rather than fossil time and we have revised the related Method in the Additional File, as follows: "S3.3 Estimation of Divergence Time In order to estimate the divergence time between Ph. edulis and the other 7 sequenced plant genomes, a Bayesian relaxed molecular clock approach was used to estimate the divergence time using MCMCTREE in PAML (version 4) [6]. Calibration times were gained from a previous study [7] Fig. 3A, it is scientifically incorrect (or at least very confusing). The x-axis is apparently in unit of substitution/site, which is a branch length measurement. It does not make sense to have a terminal tip linked (by vertical dashed line) to a branch length value. There was also no "divergence times" information in this figure and the legend should be revised.
Response: Thank you for this excellent suggestion. we have re-made the Fig.3a. 6. The expansion of lignin biosynthesis genes could be due to whole genome duplication (WGD), but WGD was not discussed. Are the two decoupled?
Response: Thank you for this excellent suggestion. According to the additional analysis of the divergence time of lignin biosynthesis genes, we have added an explanation about the expansion of lignin biosynthesis genes in the aspect of WGD in the Analysis and Discussion, respectively, as follows: In the section of Analysis "Additionally, we calculated the synonymous substitution rate analysis for 13 gene families evolved in the lignin biosynthesis using the yn00, which was a package in PAML to estimate synonymous and nonsynonymous substitution rates. Then, the Ks rate was translated to the divergence time by the formula T=Ks/2r (r=6.5×10-9). As shown in Additional Fig. S22, the result indicated that the divergence time of the lignin biosynthesis genes occurred at the 5~16 million year ago (Mya), which correspond to the whole genome duplication (WGD) time 7~12 Mya in the moso bamboo genome [8]." In the section of Discussion "Combined with the results of the divergence time of the lignin biosynthesis genes and our previous study [8], we estimated the occurrence of a putative WGD event at 7~12 Mya in the moso bamboo genome, suggesting that there might have been a tetraploidization event during bamboo history [8]. Then, the ancient tetraploid moso bamboo evolved into a current diploid moso bamboo. Additionally, WGD could provide more gene copies, which facilitated evolving the genes with new functions [9]. Therefore, the expansion of the lignin biosynthesis genes in moso bamboo could be due to the occurrence of WGD event." Some other comments are listed below. The manuscript has no page number, and the line numbers does not match the actual lines and restart in each page, which make the review difficult. Anyway, I tried my best to point out where in the text I was referring to. Abstract 7. Line 18 -what is "additional abundance data"? You meant sequencing data?
Response: We appreciate this observation. Indeed, the data means the sequencing data and we have revised the sentence in the Abstract, as follows: "Here, we provide a chromosome-level de novo genome assembly of the moso bamboo (Phyllostachys edulis) using additional abundance sequencing data and hybrid-combined de novo assembly strategies." 8. Line 31 -"dramatic evolutionary characteristics" is too dramatic and unclear. Please be specific or take out this sentence.
Response: Thank you for this excellent suggestion. We have removed the sentence in the Abstract, as follows: "Via comparison with orthologous genes in related plants, we observed that the AS genes are concentrated in more conserved genes that tend to accumulate higher expressed transcripts and share less specificity. 9. Line 39 -what does "bamboo's specificity in being a woody plant" mean? Please clarify "specificity".
Response: Thank you for this excellent suggestion. Our result indicated moso bamboo has the features of woody bamboo in the grass family based on the analysis of the lignin biosynthesis pathway. To properly express the meaning, we have revised the sentence in the Abstract, as follows: "Furthermore, gene family expansion, abundant AS and positive selection were identified in crucial genes involved in lignin biosynthesis, indicating that moso bamboo is a woody plant in the grass family. Background 10. Line 20 -change "investigated" to "been carried out".
Response: Thank you very much for pointing out this error. We have revised the sentence, as follows: "Only a limited number of genome-wide studies have been investigated in bamboo." 11. Line 46 -change "is responsible" to "is partly responsible".
Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "Species-specific AS is partly responsible for a wide variety of biodiversity with limited repertoires of protein coding genes" 12. Line 46 -take out "our colorful dynamic world full of".
Response: Thank you for this excellent suggestion. We have removed the part, as follows: "Species-specific AS is partly responsible for a wide variety of biodiversity with limited repertoires of protein coding genes." 13. Line 6 "between conservation and AS" -What conservation? Sequence conservation? Gene functional conservation? Amino acid conservation? Protein structural conservation?
Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "We performed a genome-wide investigation to determine the relationship between amino acid conservation and AS and examine the evolution of AS status of genes that are involved in the lignin biosynthesis." 14. Line 6 -change "between evolution and the AS status of genes …" to "examine the evolution of AS status of genes …" Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "We performed a genome-wide investigation to determine the relationship between amino acid conservation and AS and examine the evolution of AS status of genes that are involved in the lignin biosynthesis." Data description 15. Line 21 -change "different strategies" to "different sequencing strategies".
Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "For the assembly of the moso bamboo genome, approximately 603.3 Gb genome data with different sequencing strategies were generated." Analyses 16. Line 53 -"assembly" statistics.
Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "Compared with those of our previous version [8], the assembly statistics and quality of the new WGS assembly were obviously improved (Additional Tables S9-10)." 17. Line 34 -change "was higher than" to "more complete than" Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "According to the completeness assessment of the annotation using BUSCO, moso bamboo (95.2%) was more complete than Z. mays (92.2%) but close to O. sativa (95.6%)." 18. Line 58 -what is "post-regulation level"? you meant post-translational level?
Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "To facilitate the genome-wide investigation of the AS landscape in moso bamboo and comprehensively identify the factors that influence AS at the post-translational level, we performed high-throughput RNA sequencing (RNA-Seq) using the Illumina HiSeq-4000 platform." 19. Line 25 -"RNA from a mixture of … " this sentence is unclear.
Response: Thank you for this excellent suggestion. We have revised the sentence, as follows: "The full-length cDNA sequencing of alternatively spliced isoforms (Iso-Seq) used RNA from a mixture of 26 samples." 20. Line 48 -"…uniform AS events…" You meant "unique AS events"?
Response: We appreciate this observation. The number of the total AS events identified in our study were counted after removing repeated AS events in all 26 samples. Therefore, the word "unique" properly expressed the meaning and we have revised the sentence, as follows: "In total, 266,711 unique AS events were identified in 25,225 AS genes, accounting for ca. 49.39% of all annotated genes."