Chloroplast genome data of five Amygdalus species: Clarifying genome structure and phylogenetic relationships

Amygdalus species have considerable ecological and economic value; however, the phylogenetic relationships within Amygdalus remain controversial. In this study, we sequenced and assembled the chloroplast (cp) genomes of five Amygdalus species: Prunus communis, P. mongolica, P. pedunculata, P. triloba, and P. mira. We then conducted comparative genomic analyses and reconstructed their phylogenetic relationships. Genome length ranged from 157,870 to 158,451 bp, and 131 genes were annotated (86 protein-coding genes, 37 tRNAs, and 8 rRNAs). Additionally, 49–57 simple sequence repeats were detected, most of them in the large single-copy region and with an A/T base preference. Comparative genomic analyses revealed high similarity in genome structure, gene order, and gene content. However, we identified four highly divergent sequences: trnR-UCU-atpA, ndhC-trnV-UAC, ycf4-cemA, and rpl32-trnL-UAG. The phylogenomic analysis grouped the Amygdalus species together; within this group, P. pedunculata, P. triloba, and Prunus tangutica formed one branch, and P. mongolica and Prunus davidiana formed another. This study improves our understanding of the genetic relationships within Amygdalus and provides a basis for the development and utilization of Amygdalus resources.


Value of the Data
• The chloroplast genome sequences of five Amygdalus species provide an improved understanding of the genetic relationships within Amygdalus.
• The SSR loci and sequence variations identified here provide candidate molecular markers for studies of population diversity and evolution.
• The chloroplast genome data provide a basis for the development and utilization of Amygdalus resources.

Data Description
The genus Amygdalus has been classified into subgenus Amygdalus and subgenus Persica [1]. Subgenus Amygdalus is mainly distributed in the Mediterranean region and central-eastern Asia, with the exception of P. triloba, which is widely distributed in northwest China [2]. P. triloba, P. pedunculata, P. mongolica, and P. communis belong to subgenus Amygdalus; their kernels are rich in oil and protein and can be used as a high-quality oil and protein resource [3,4]. The shells can also be used as fuel and as an adsorbent for heavy metals and pigments [5]. P. mira, which belongs to subgenus Persica, is mainly distributed in the Yarlung Zangbo Grand Canyon and the tributary basins of the Tibetan Plateau [6]. The kernels of P. mira are rich in oleic acid, linoleic acid, and fat-soluble components and are used in traditional Chinese medicine to treat disease [7].
The classification of the genus Amygdalus has always been controversial [8]. Based on morphological classification, P. communis, P. mongolica, P. tangutica, P. triloba, and P. pedunculata were placed in subgenus Amygdalus, whereas P. mira, P. davidiana, P. ferganensis, P. kansuensis, and P. persica were placed in subgenus Persica [1]. However, using cp genomes, Wang et al. [9] proposed that Prunus tenella, P. pedunculata, and P. triloba should be placed in the genus Prunus L., whereas P. communis, P. mongolica, and P. tangutica should be assigned to subgenus Amygdalus L. Yazbek et al. [10], who investigated the phylogeny of Prunus subg. Amygdalus using plastid and nuclear genes, found that P. triloba and P. pedunculata should be excluded from subgenus Amygdalus.
Chloroplasts (cp) are semi-autonomous organelles with a conserved, maternally inherited genome that is distinct from the nuclear genome. The cp genome of terrestrial plants varies in size from 100 to 200 kb and generally contains 110–130 genes, most of which are involved in photosynthesis, transcription, and translation [11]. Most higher-plant cp genomes have a quadripartite structure, consisting of a pair of inverted repeats (IRs) that separate a small single-copy (SSC) region from a large single-copy (LSC) region [12]. Owing to these features, the cp genome plays an important role in species identification, phylogenetic analysis, and molecular marker development [13,14]. The rapid development of next-generation sequencing technologies and phylogenetic genomics has led to the sequencing of cp genomes from multiple Prunoideae species, and these genomes are widely used in molecular evolution and phylogenetic research [15–17]. However, cp genome data for Amygdalus species remain insufficient. Here, we sequenced and assembled the complete cp genomes of five Amygdalus species (P. communis, P. mongolica, P. pedunculata, P. triloba, and P. mira) and performed phylogenetic analysis on them, aiming to improve understanding of the genetic relationships within Amygdalus and to provide a basis for the development and utilization of Amygdalus resources.
The total cp genome lengths of P. mira, P. communis, P. mongolica, P. pedunculata, and P. triloba were 158,153, 157,870, 158,451, 157,948, and 158,388 bp, respectively (Fig. 1). Each cp genome exhibited a typical quadripartite structure, comprising two IRs (26,373–26,931 bp), one LSC (86,144–86,525 bp), and one SSC (18,966–19,211 bp). The overall GC content ranged from 36.72 % in P. mongolica to 36.78 % in P. pedunculata. GC content was higher in the IR regions (42.55 % in P. mira to 42.60 % in P. communis) than in the LSC regions (34.57 % in P. communis to 34.62 % in P. pedunculata) and the SSC regions (30.27 % in P. mongolica to 30.46 % in P. communis), suggesting that the two IR regions were relatively stable.
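The two quantities reported above, GC content and total length of a quadripartite genome, can be illustrated with a minimal Python sketch. The function names, the toy sequence, and the region lengths are hypothetical and chosen only for illustration; they are not taken from the data set or from the software used in this study.

```python
def gc_content(seq: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total genome length: one LSC, one SSC, and two copies of the IR."""
    return lsc + ssc + 2 * ir


# Toy sequence (hypothetical, not from the data set): 4 of 10 bases are G or C.
toy = "ATGCGCATTA"
print(f"GC content: {gc_content(toy):.2f} %")

# Hypothetical region lengths that fall within the reported ranges.
print(quadripartite_length(lsc=86_200, ssc=19_000, ir=26_400))
```

Applied separately to the LSC, SSC, and IR sequences of an assembled genome, the same GC function would give region-level values comparable to those reported above.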
We annotated 131 genes in each of the five Amygdalus cp genomes, including 86 protein-coding genes, 37 transfer RNAs (tRNAs), and eight ribosomal RNAs (rRNAs) (Table 1). These genes were in the same order across all five cp genomes.
Functional analysis classified the 131 genes into four categories: photosynthesis-related, self-replication-related, other, and unknown function (Table 2). A total of 18 genes were duplicated in the IR regions. Furthermore, 18 genes contained one intron each, while two genes (clpP and ycf3) contained two introns (Table S1). In addition, the GC content of rRNA (55.5 %) and tRNA (ranging from 53.3 % in P. communis to 53.41 % in P. mira) was higher than that of protein-coding genes (ranging from 37.61 % in P. triloba to 37.65 % in P. communis) (Table S2).

Table 2
Genes identified in the chloroplast genomes of the five Amygdalus species.
The cp genome sequences of the five Amygdalus species were aligned using P. persica as a reference; the results showed that variation was higher in intergenic spacers (IGS) than in coding sequences (CDS) (Fig. 4). We identified four highly divergent sequences in the IGS: trnR-UCU-atpA, ndhC-trnV-UAC, ycf4-cemA, and rpl32-trnL-UAG. These sequences are potential candidate molecular markers for Amygdalus species.
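Sliding-window percent identity is one common way to locate divergent regions in an alignment against a reference. The sketch below is a minimal illustration of that general idea, not the tool or parameters used in this study; the function name, window sizes, and toy sequences are hypothetical.

```python
def window_identity(ref: str, query: str, window: int = 100, step: int = 50):
    """Percent identity between two aligned sequences in sliding windows.

    Both sequences must come from the same alignment (equal length, gaps as '-').
    Columns containing a gap are counted as mismatches.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    if not ref:
        raise ValueError("empty alignment")
    results = []
    last_start = max(len(ref) - window, 0)
    for start in range(0, last_start + 1, step):
        r = ref[start:start + window].upper()
        q = query[start:start + window].upper()
        matches = sum(1 for a, b in zip(r, q) if a == b and a != "-")
        results.append((start, 100.0 * matches / len(r)))
    return results


# Toy aligned pair (hypothetical): one substitution in the first window.
ref = "ACGTACGTAC"
qry = "ACGTTCGTAC"
print(window_identity(ref, qry, window=5, step=5))
# → [(0, 80.0), (5, 100.0)]
```

Windows with low identity across several species would be the candidates for highly divergent regions such as the four intergenic spacers listed above.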
We compared the IR regions of the cp genomes of the five Amygdalus species and two related species (P. persica and Pyrus pyrifolia) (Fig. 5). Although cp genome structure and gene organization were highly conserved, IR expansions and contractions resulted in slight variations at the LSC/IRb and SSC/IRa borders. The genes rps19, ycf1, and ndhF were located near the IR/LSC and IR/SSC boundaries. Among them, ycf1 was detected at the LSC/IRa boundary, with 1041–1051 bp of ycf1 lying in the IRa region, and rps19 was located at the LSC/IRb boundary, with a fragment of 182–265 bp in the IRb region. The ndhF gene was located at the IRb/SSC border, with a fragment of 1, 2, 5, and 8 bp in P. triloba, P. pedunculata, P. communis, and P. mira, respectively; however, ndhF in P. mongolica did not overlap the IRb/SSC boundary.
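The fragment sizes above describe how many bases of a boundary-spanning gene fall inside an IR. Given annotated coordinates, that is a simple interval overlap, sketched below in Python. The function name and all coordinates are hypothetical and serve only to illustrate the arithmetic.

```python
def bases_within(gene_start: int, gene_end: int,
                 region_start: int, region_end: int) -> int:
    """Number of bases of a gene that lie inside a region.

    All coordinates are 1-based and inclusive.
    """
    overlap_start = max(gene_start, region_start)
    overlap_end = min(gene_end, region_end)
    return max(0, overlap_end - overlap_start + 1)


# Hypothetical IRb spanning positions 86,201-112,600 (not from the data set).
irb = (86_201, 112_600)

# Hypothetical gene straddling the LSC/IRb border: 200 bp fall inside IRb.
print(bases_within(86_020, 86_400, *irb))

# Hypothetical gene ending before the border: no overlap.
print(bases_within(85_000, 86_200, *irb))
```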
Our phylogenetic tree based on complete cp genome data was strongly supported: 23 of 24 nodes had 100 % bootstrap values (Fig. 6).
To further ascertain the phylogenetic position of Amygdalus, a phylogenetic analysis was carried out on 26 species. The species fell into five groups: eurosids I, eurosids II, gymnosperms, basal angiosperms, and euasterids I (Fig. 6, Table S4). The phylogenomic analysis grouped the Amygdalus species together. Furthermore, four Persica species (P. mira, Prunus kansuensis, Prunus ferganensis, and P. persica) clustered into a separate branch; P. pedunculata, P. triloba, and Prunus tangutica formed one branch, and P. mongolica and Prunus davidiana formed another. These results are consistent with previous reports [2,17].

Experimental Design, Materials and Methods
Young and fresh leaves were obtained from all five species. P. mira samples were collected from Bengga, Tibet; P. pedunculata and P. triloba were collected from Inner Mongolia; P. mongolica and P. communis were collected from the Gansu and Henan Provinces, respectively. Samples were stored at −80 °C until analysis.
Total genomic DNA was extracted using a Plant Genomic DNA Kit (Tiangen, Beijing, China). An Illumina HiSeq X high-throughput platform (Illumina, San Diego, CA, USA) was used for DNA sequencing. Library preparation and sequencing were completed by BGI Genomics (Shenzhen, China).
The LSC, SSC, and IR border sequences in the Amygdalus cp genomes were compared against those in the cp genomes of P. persica (NC_014697) and Pyrus pyrifolia (NC_015996). The IR-SC boundaries of the cp genomes were visualized in IRscope.

Limitations
None.

Fig. 1. Assembly, size, and features of the chloroplast genomes from five Amygdalus species. The dark gray area in the inner circle represents genomic GC content, whereas the light gray area indicates AT content.

Fig. 2. (A, left) Number of dispersed repeats in the five Amygdalus species. (B, right) Number of long repeat sequences, clustered by length, in the five Amygdalus species.

Fig. 4. Sequence alignment of chloroplast genomes from the five Amygdalus species. The y-axis represents percent identity from 50 to 100 %.

Fig. 5. Comparison of LSC, SSC, and IR border regions across the five Amygdalus species.

Fig. 6. Maximum-likelihood (ML) phylogenetic tree of 26 species based on complete chloroplast sequences. The bootstrap values are marked at the tree nodes.

Table 1
Chloroplast genomes of the five Amygdalus species.