Targeted enrichment of novel chloroplast-based probes reveals a large-scale phylogeny of 412 bamboos

The subfamily Bambusoideae belongs to the grass family Poaceae and has significant roles in culture, economy, and ecology. However, the phylogenetic relationships based on large-scale chloroplast genomes (CpGenomes) were elusive. Moreover, most of the chloroplast DNA sequencing methods cannot meet the requirements of large-scale CpGenome sequencing, which greatly limits and impedes the in-depth research of plant genetics and evolution. To develop a set of bamboo probes, we used 99 high-quality CpGenomes with 6 bamboo CpGenomes as representative species for the probe design, and assembled 15 M unique sequences as the final pan-chloroplast genome. A total of 180,519 probes for chloroplast DNA fragments were designed and synthesized by a novel hybridization-based targeted enrichment approach. Another 468 CpGenomes were selected as test data to verify the quality of the newly synthesized probes and the efficiency of the probes for chloroplast capture. We then successfully applied the probes to synthesize, enrich, and assemble 358 non-redundant CpGenomes of woody bamboo in China. Evaluation analysis showed the probes may be applicable to chloroplasts in Magnoliales, Pinales, Poales et al. Moreover, we reconstructed a phylogenetic tree of 412 bamboos (358 in-house and 54 published), supporting a non-monophyletic lineage of the genus Phyllostachys. Additionally, we shared our data by uploading a dataset of bamboo CpGenome into CNGB (https://db.cngb.org/search/project/CNP0000502/) to enrich resources and promote the development of bamboo phylogenetics. The development of the CpGenome enrichment pipeline and its performance on bamboos recommended an inexpensive, high-throughput, time-saving and efficient CpGenome sequencing strategy, which can be applied to facilitate the phylogenetics analysis of most green plants.


Background
The subfamily Bambusoideae belongs to the grass family Poaceae and exhibits substantial phenotypic diversity, with 1642 species in 125 genera, three tribes, and 15 subtribes, which have been classified into~75 clades [1]. The Bambuseae consists of tropical woody bamboos (Bambuseae), temperate woody bamboos (Arundinarieae) and herbaceous bamboo tribe (Olyreae). Bambusoideae predominantly distributed in the Old World, such as China, Japan, Thailand, Indonesia, and the countries of Southeast Asian. As one of the most ecologically and industrially valuable tribes of Bambusoideae, woody bamboos were used for furniture, paper, fiber textiles, and fuel [2]. In total, about 500 bamboos are distributed in Asia, spanning a wide geographic and temperature range. However, infrequent, incongruent, and unpredictable flowering events as well as unstable vegetative characteristics, severely restricted the identification and classification of woody bamboos. The phylogenetic relationships based on more massive amounts of woody bamboos remain elusive due to the lack of extensive and high-quality genomic resources.
The chloroplast genome (CpGenome) is an essential resource for the study of plant evolution [3]. This organelle is one of the most technically accessible regions of the genome. The chloroplast genomic DNA of green plants commonly exhibits a conserved genome structure that contains two copies of inverted repeat (IR) separating the small single-copy region (SSC) and the large single-copy region (LSC) [2,4,5]. The CpGenome has been a popular source of reconstructing the phylogeny of green plants, and many chloroplast DNA loci are contributing to the development of plant taxonomy. To obtain chloroplast DNA suitable for whole chloroplast genome sequencing, it can be traditionally enriched by using the sucrose gradient centrifugation method [6], the high salt method [7], long PCR technology by using primers [8]. The characters of the strategies above are the use of physical methods to extract chloroplast DNA or the need for high quality, sufficiently extracted cellar DNA and the appropriate primers. With the development of sequencing technology, next-generation sequencing (NGS) has the advantageous characteristics of high-throughput and efficient, resulting in a rapid increase in the amount of sequencing data. Chloroplast DNA generally accounts for only about 0.5-13% of the whole genome [9]. But, the chloroplast DNA sequencing data from the whole genome sequencing (WGS) data produced a lot of "useless" data except for "useful" ones, consuming much of the sequencing capacity and reducing the efficiency of parallelly chloroplast sequencing. The above methods for obtaining chloroplast DNA sequencing data cannot meet the needs of large-scale CpGenome sequencing, which significantly restricts and hinders the in-depth research of plant genetics and evolution.
In this study, the main goals were: (1) To develop and evaluate a pipeline to target-enrich and assembly the chloroplast data of bamboos. (2) To obtain high-quality and high coverage of bamboo CpGenomes by the pipeline, to reconstruct a phylogenetic tree, and to promote phylogenetic knowledge of bamboo. (3) To share the new sequenced bamboo CpGenomes, allowing researchers to quickly compare suspect chloroplast data and explore the bamboo CpGenomes.

Species selection for probe design and evaluation
To improve the variability and versatility of the probes, we selected 567 representative species from the 3654 published CpGenomes species (collected from NCBI, Released Dec 2018) to design and evaluate probes for a targeted enrichment strategy of CpGenomes (Supplementary Table S1 and S2). Among the 567 species, 22 are bamboo species. For data preprocessing, we elucidated our approach in a flow chart ( Supplementary Figure S1). A phylogenetic tree (Supplementary Figure S2) was constructed based on the 567 complete CpGenomes, which spanned the phylogenetic diversity of 7 major clades, including 40 orders and 57 families. The model species in each clade were selected as core candidates. Thus, a total of 99 CpGenomes, including 6 bamboo CpGenomes, were chosen as the representative species for the probe design (Table 1), and the remaining (468 CpGenomes) were chosen as test data further to assess the efficiency of the probes for chloroplast capture. The species for probe design and the species for probe evaluation were different genera but belong to the same family (e.g., Danthonia and Chionochloa, both are Poaceae).

Construction of non-redundant chloroplast reference
Using the CpGenome of Arabidopsis thaliana as the initial reference sequence (as a database sequence), other selected CpGenomes (as query sequences) were aligned to the database sequence by BLAST+ v2.2.25 software with default parameters. The sequences with more than 90% identity were masked from the query sequences. Then, the resulting sequences were subjected to a secondary round masking of redundant sequences, which were identified by an all-against-all BLAST+. Finally, a non-redundant chloroplast reference, as a pan-chloroplast genome (pan-CpGenome), was obtained by iterative analysis. Sequences with high similarity (> = 90%) were masked with "Ns", and others were highly divergent sequences in the pan-CpGenome (Supplementary File F1). The visualization of the alignment of 98 CpGenomes to Arabidopsis thaliana

Universal probes designed for bamboo CpGenomes
The regions of the pan-CpGenome sequences which have not been masked to "Ns" were extended by 40 bp on both sides for the design of the probes. Each region was divided into K-mers of 90 bp in length and the melting temperatures of the K-mers were calculated [11]. A comprehensive score of uniqueness, frequency, melting temperature, and GC content was calculated for each probe by Primer3 v2.4.0 [12]. The probes with the highest comprehensiveness scores were selected in 20 bp window and slid along the target region at the fixed interval. For ensuring high coverages of the probe sequences in the target region, the target region was covered at least 2 times by these selected probes. Finally, a total of 180,519 DNA oligonucleotides were synthesized by a CustomArray B3 Synthesizer (CustomArray, Washington, DC, USA) according to the manufacturer's instructions and dissolved in 10× TE buffer (pH = 8.0).

Taxa sampling
All sampled species covering more than 30 genera (Supplementary Table S4)  Totally, 358 bamboo samples, mainly from young leaves, were collected. All samples were frozen in liquid nitrogen immediately and were preserved in ultra-low temperature refrigerator at − 80°C, followed by DNA extraction.

DNA extraction and target enrichment sequencing for bamboos
A total of 358 woody bamboo samples were sampled and sequenced in this study (Supplementary Table S4), as a practical application of target enrichment sequencing and an evaluation of the capture efficiency. Genomic DNA from each sample was extracted using the CTAB method and fragmented to a peak size of 200 bp using a Covaris E220 sonicator (Covaris, Woburn, Massachusetts, USA), followed by the end-repair, addition of base "A", and adapter ligation. DNA fragments of the desired size (200 bp) were selected on an agarose gel and hybridized to the probes for 72 h. The probes captured DNA fragments were recycled by magnetic beads coated with streptavidin, which interacted with the biotin on the probes to wash away the uncaptured DNA fragments. The captured DNA fragments were sequenced on the BGISEQ-500 platform at Beijing Genomics Institute, Shenzhen, China. High-quality reads ranging from 1 Gb to 9 Gb with 100 bp paired-end were acquired for each sample. For data preprocessing, we illuminated our method in a flow chart (Supplementary Figure S1). SOAPfilter (v2.2) [13] was applied to remove low-quality reads and adaptors in the following criteria (1) reads with > 10% base of N; (2) reads with > 40% of lowquality reads (value <=10); (3) reads contaminated with adaptors and produced by PCR duplication. A CpGenome of Phyllostachys edulis (downloaded from NCBI, accession number: HQ337796.1) was used as a reference for assembly using MITObim (V1.8) [14]. In this way, we finally recovered the complete CpGenomes of all 358 samples. Additionally, the plastid genomes were annotated in the current standard web-based program DOGMA [15] (http://dogma.ccbb.utexas.edu/,).

Development of universal chloroplast probes for bamboos
From the 3654 CpGenomes collected from NCBI, 567 high-quality CpGenomes were selected for probe development and divided into two datasets, with 99 CpGenomes for probe design and 468 CpGenomes for probe evaluation. Considering the applicability and robustness of the probes designing for bamboos, and the diversity of CpGenomes, the 99 CpGenomes were selected from different families. Details of the related methods were provided in Supplementary Figure S1. A 15 Mb pan-CpGenome was assembled based on the alignment to Arabidopsis thaliana (Supplementary File F1). The comparison analysis showed the CpGenomes had great variations across species (Fig. 1) Table S3). The probes sequences were available in Supplementary File F2. All the designed probes had excellent uniqueness, with an average 1 time while being aligned with the pangenome. The probes were mostly distributed in the range of 70-80% melting temperatures and 30-40% GC content (Supplementary Figure S3). To assess the broad spectrum of the probes, the BLAST+ program was employed to align the probes to the 468 complete CpGenomes for evaluating the probes. The average coverage ratio in the 468 complete CpGenomes was 90.54% (Supplementary Table S8). In bamboos, the coverage ratio was all over 93.00%, with an average coverage of 94.78% ( Fig. 2b and Supplementary Table S8). Moreover, some orders such as Magnoliales, Pinales, Poales also had high coverage.

Probe-based targeted enrichment and assembly of bamboo CpGenomes
A total of 358 fresh woody bamboo samples collected from China were included (Supplementary Table S4) and used to evaluate capture efficiency. A total of 1G-9G raw reads were obtained, and low-quality reads and adaptors were filtered in data preprocessing ( Fig. 2c and Supplementary Table S9). Clean and high-quality reads were used for reference-guided assemblies by MITObim and recovered nearly complete CpGenomes for the 358 bamboo species. The assembled CpGenomes ranged from 139,664 to 140,064 base pairs (bp), and the LSC regions varied from 83,496 bp to 83,845 bp in length (Supplementary Table S9). The CpGenomes were annotated with approximately 121 genes, including around 113 unique genes encoding 80 proteins, 4 ribosomal RNAs, and 29 transfer RNAs, exhibiting a higher degree of conservation.
We detected 15 overlapped bamboo CpGenomes that were present in both the in-house and published data (Fig. 2d). To assess the target enrichment, we mapped the raw reads to the corresponding CpGenome released previously and compared assembled bamboo CpGenome to corresponding released ones. The results showed more than 45.77% in average of the raw reads from inhouse bamboo CpGenomes can be mapped to the corresponding published CpGenomes, and the mapping depth was higher than 1200×. Alignment with the published CpGenomes, the coverage of assembled CpGenomes was greater than 98.59% (Fig. 2d and Supplementary Table S10).

A phylogenomic relationship based on 412 bamboo CpGenomes
For comprehensively collecting bamboo CpGenomes, 69 bamboo CpGenomes from NCBI were acquired, resulting in a total of 412 non-redundant bamboo CpGenomes after removing redundancy (Supplementary Table S6). We reconstructed a phylogenetic tree of bamboos based on the concatenated sequences of 76 protein-coding genes in the 412 bamboo CpGenomes. Phylogenetic analyses supported the relationship of (Arthrostylidiinae (Bambusinae, Olyreae)). We classified different clades in the phylogenetic tree based on previous studies [18,19]. The pattern of (XI((VIII, IV)VI)((IX, III) (VII, V))) was provided in Arthrostylidiinae (Supplementary Figure S4). Most of the newly sequenced species distributed in Clade V, Phyllostachys was a representative genus in bamboo, with the clade embed into Clade V, which was the sister clade of Bashania fargesii. There are some non-Phyllostachys species were found in Phyllostachys genus clade. The Phyllostachys genus clade was divided into two groups based on the phylogenetic tree. Phyllostachys edulis, the most planted bamboo in China, distributed in Phy-II (Fig. 3). The sequences from NCBI clustered with corresponding in-house sequences. For example, Phyllostachys edulis sequence from NCBI clustered with in-house sequences of Phyllostachys edulis f epruinosa, Phyllostachys edulis f exaurita, Phyllostachys edulis f flexuosa, et al.

China Bamboo database in CNGB
The data that support the findings of this study have been deposited into CNGB Sequence Archive (CNSA) [20] of China National GeneBank DataBase (CNGBdb) [21] to facilitate the accumulation of knowledge on bamboo phylogeny. Researchers can download the raw data and assembled CpGenome sequences from CNGB through Project ID: CNP0000502 (https://db. cngb.org/search/project/CNP0000502/). Moreover, researchers can search for all assembled bamboo plastid genomes in this study through web-based BLAST+ service (https://db.cngb.org/blast/). The available plastid genome sequences of bamboos and the corresponding BLAST+ server can promote researchers to explore the complex and elusive history of bamboo evolution. Fig. 2 Evaluation of the pipeline performance in woody bamboos. a A dot plot provides the average depth (×) and coverage ratio of the 99 plant CpGenomes used to design the probes. The red and blue dots represent bamboos and other plant species, respectively. The black lines represent the average depth (×) and a coverage ratio of the bamboo species. b A dot plot provides log10(cover length) and the coverage ratio of the 468 plant CpGenomes used to evaluate the probes. The red and blue dots represent bamboos and other plant species, respectively. The black lines represent log10(cover length) and the coverage ratio of the bamboo species, respectively. c A box plot of gene number, genome size, and raw bases (bp) of the sequenced bamboos CpGenomes in this study. d Evaluation of mapping and coverage of the probes compared to the in-house and released bamboo CpGenomes. The mapping ratio represents the proportion of reads obtained by the probes aligned with the released bamboo CpGenomes. Mapping coverage represents the proportion of the assembled CpGenomes based on the probes aligned with the released bamboo CpGenomes

Discussion
CpGenome provides an essential resource for plant evolution As an essential component of plant organelles and photosynthesis organs, chloroplasts have a simple structure, the small genome size (~110-165 kb) containing 90-110 protein-coding genes [22] and highly conserved gene region across species, due to their nonrecombinant, haploid and uniparentally [23]. The genomic characterization of various aspects of chloroplasts has led to an important role in the research of plant origin, evolution and phylogenetic analysis relationship between different plant species [24,25]. Many studies had been reported using chloroplast genes to construct phylogenetic trees of plants. For example, Jansen et al [26] used 81 chloroplast genes to estimate relationships among the major angiosperm clades; Saarela et al [27] found weak support for Amborella as the Fig. 3 A species tree of Phyllostachys clade based on 76 chloroplast genes. The species tree divided into 2 parts, which labeled with different background colors. Numbers at the node indicated the bootstrap values and bootstrap values lower than 80 were concealed. The red, purple, grey, and blue blocks in the tree represented P. sect. Heteroclada species, P. sect Phyllostachys species, unlabeled and non-Phyllostachys species, respectively. Name with 'LOC' stands represented newly sequenced sequences in this study basal-most angiosperm lineage using 17 plastid genes and the nuclear gene phytochrome C (PHYC). With the deepening of chloroplast research, more and more researchers are focusing on the complete chloroplast sequence [28][29][30]. Kane et al [31] suggested that the whole CpGenome could serve as an ultra-barcode for identifying plant varieties.
Hybridization-based probes for target enrichment in large-scale CpGenome sequencing Chloroplast DNA can be traditionally acquired by the sucrose gradient centrifugation method [6] or the high salt method [7]. Another method was to amplify the entire chloroplast DNA from the whole cellular DNA base on a long PCR technology by primers, which were designed on conserved sequences [8]. These methods were not suitable for large-scale samples due to the large amount of labor and material resources required to obtain chloroplast DNA, and the labor-intensive method used to prepare chloroplast DNA. Chloroplast reads also can be identified from WGS reads by aligning the WGS data with the reference CpGenome. It is a demanding bioinformatics technique and requires a closely related reference CpGenome. The method was not suitable for the species that are not closely related or have poor quality reference genome sequences. Moreover, to assemble only CpGenome based on this method, a great deal of useless sequencing data was thus generated, consuming much of the sequencing capacity and reducing the efficiency of parallelly chloroplast sequencing, since the chloroplast DNA sequencing data represents only a small fraction of WGS. Therefore, most existing methods for obtaining DNA and sequencing data suitable for whole CpGenomes cannot meet the needs of large-scale CpGenome sequencing, greatly limiting and hindering the in-depth research of plant genetics and evolution.
Target enrichment before sequencing is a useful method that allow for in-depth analysis of specific portions of the genome. Moreover, a group of universal probes covering whole CpGenome in a tribe species can make target enrichment strategy exert it's advantages. Large scale CpGenomes target enrichment by universal probes can provide cost-effective, high density, and high coverage.

Efficiency target enrichment and comparative analysis of CpGenomes for different clades
More than 3000 chloroplast genomes have been released recently [32], since the first reported sequencing of the complete CpGenome of Nicotiana tabacum [33]. We chose the 99 representative CpGenomes, including 6 bamboo CpGenomes from 3654 CpGenomes published to design probes. These vascular plants included 7 clades (Lycopodiophyta, Moiliformopses, Gymnosperms, Basal angiosperms, Monocots, Eudicot, and Magnoliidae), belonging to 57 families and 40 orders. The alignment of the CpGenomes of 7 clades to Arabidopsis thaliana CpGenome may show the CpGenome structure variation during evolution and indicating differences among different clades (Fig. 1). Structure variation indicated the pan-CpGenome derived from CpGenomes of distinct clades was essential for constructing greater applicability of pan-CpGenome with more divergent sequences. In 146-150 kb, 124-129 kb, and 88-92 kb, Poaceae had alignment gaps compared to the rest of Monocots, ANA grade, Magnoliids, and Eudicots. Moreover, Ferns, Horsetails, Gymnosperm, and Lycophytes indicated fragment sequences at the corresponding positions. It may suggest the corresponding CpGenome regions completed in angiosperm during evolution and uniquely lost in Poaceae after Angiosperm. However, the phenomenon should be further tested on the basis of broad-spectrum reference and amplification samplings.
In pan-CpGenome construction, unique sequences were selected, and the final pan-CpGenome size was~15 Mb. A total of 180,519 probes were designed and synthesized using a new hybridization-based approach to enrich chloroplast DNA fragments. Evaluation of the quality of the probes and pan-CpGenome showed a high mapping ratio, which was stable and efficient in bamboo CpGenomes. Besides bamboos, the amplified plant CpGenomes expanded variational sequences and universality of the probes in the pan-genome construction step. Thus, the probes also had high mapping rates in some orders, such as Malvales, Rosales, Pinales and Poales, et al, and indicated the applicability of the probes in these clades. Conversely, lower mapping rates were found in Nymphaeales, Solanales, Schizaeales, Lamiales, et al, which may due to inadequate and poor corresponding CpGenomes materials in pan-Genome constructing. It can be solved by amplifying corresponding CpGenomes to expand divergent sequences in pan-CpGenome or decreasing parameter restriction. Comparing of the assembled CpGenome with its published counterparts demonstrated a mapping coverage of over 98%, further confirming the efficiency of the probes in enriching chloroplast DNA fragments. In general, this pipeline of pan-CpGenome construction, pan-CpGenome-based probes design, and CpGenome enrichment showed its performance in bamboo CpGenomes and recommended a strategy of large-scale CpGenomes acquiring to green plants.

Bamboo CpGenomes could provide additional information on large-scale phylogenetic relationships
There are more than 500 bamboo species in China, which play significant roles in economy, ecology, culture, aesthetics, and technology [34,35]. Bambusoideae is one of three subfamilies in Poaceae known as the BEP clade [36]. Bamboo remains one of the most challenging groups for plant taxonomists and field botanists [37], due to infrequent, incongruent, unpredictable flowering events, and diversity vegetative characters, which may result from frequent hybridization occurred in bamboos [37,38]. As a useful strategy in phylogenetics and classification of species, phylogenetic analysis based on sequences has been performed in bamboos over the past decades. Extensive sampling and sequencing of the plastid genome has been a remarkable effort in genetic, phylogenetic, and classification analysis of bamboo. We have constructed a phylogenetic tree of 412 samples, covering more than 300 species, 40 genera, which is the largest sampling project of bamboo in China and provides a large-scale phylogenetic tree of bamboos. According to the phylogenetic tree, XI (Ampelocalamus calcareus) is the earliest diverging Arthrostylidiinae species, consistent with previous studies [18,19,39]. The phylogenetic tree supports (Arundinarieae (Bambuseae, Olyreae)) pattern, and the pattern is consistent with previous studies based on smaller-scale plastid sequences, suggesting a nonmonophyletic lineage of woody bamboos [36,[40][41][42]. The results also showed the stability of the pattern, which may no change under amplified sampling. Differently, phylogenetic trees using nuclear sequences suggested the basal position of Olyreae in Bambusoideae and showed a monophyletic origin of the woody characteristic of bamboo [37,43]. For clarifying the confliction, the analysis should focus on changes in gene duplications and genome structure caused mainly by multiple hybridizations in bamboo, by performing largely amplified sampling and genome-wide sequences. Additionally, there is a fundamental demand for bamboo life trees, especially in China, which has the world's largest areas of bamboo plantation [34].
The Phyllostachys genus, with 59 species, is the most economically important among bamboos [44][45][46]. Phyllostachys edulis is the most significant Phyllostachys species, accounting for ∼73.8% bamboo-growing regions in China (4.43 million ha), and is the most abundant non-wood resource [34]. This study included 102 Phyllostachys CpGenome sequences, covering more than 90% Phyllostachys species, and provides an unprecedented opportunity to expand taxonomic knowledge of Phyllostachys genus. Traditionally, Phyllostachys genus can be divided into two groups, P. sect. Phyllostachys and P. sect. Heteroclada, based on morphological features such as inflorescences and rhizomes et al. [47,48] But there is a controversy in this classification due to some in-between morphological features of two groups [44,47]. Compared to the traditional taxonomy, the species tree we constructed exhibited different phylogenetic relationships in P. sect. Phyllostachys and P. sect. Heteroclada, specifically the two groups of species intermixed in the species tree. Incongruence between morphological taxonomy and the phylogenetic tree may be due to complex evolutionary processes or taxonomic treatments. Totally, 13 non-Phyllostachys species, such as Indocalamus pedalis, Oligostachyum oedogonatum, Pleioblastus solidus, et al were found in Phyllostachys genus Clade. They are all scattered in Phy-II. The existence of numerous non-Phyllostachys species may indicate non-monophyly of the Phyllostachys genus. It is supporting the non-monophyly thesis of Phyllostachys genus based on previous studies of plastid sequences [38,49,50] and conflicting with previous results based on non-genome wide nuclear sequences or morphological features [44,47,48]. The classification should be treated carefully because of the evolutionary complexity of bamboos. Moreover, The incongruence between plastid and nuclear gene phylogenies in Arundinarieae was found in the previous study [18]. Though the species tree we constructed supports more than 90% species coverage of Phyllostachys, the taxonomy of Phyllostachys clade should be further tested within the phylogenies based on genome-wide nuclear genes.

Conclusions
A practical and large-scale approach to CpGenome acquisition will promote plant genetics and phylogenetics. We recommend a universal probe-based CpGenome enrichment pipeline, which successfully applied to bamboo CpGenomes, and 358 woody bamboo CpGenomes were acquired. Moreover, the universal probes we designed for bamboo exhibited a broad spectrum, which may also be applicable in Magnoliales, Pinales, Poales et al. We also reconstructed a phylogenetic tree of bamboos in China based on CpGenomes which supported the non-monophyly of the genus Phyllostachys. For promoting evolution, phylogenetic and population studies, we uploaded the sequences to CNGB to provide a BLAST+ server. For further research, we will explore many divergent hotspot regions associated with repeat sequences of LSC regions, such as tRNA clusters, which can be used as genetic markers for phylogenetic studies.
Additional file 1: Figure S1. A flow chart provided for data analysis in this study.