Comprehensive Analysis of Rhodomyrtus tomentosa Chloroplast Genome

In the last decade, several studies have relied on a small number of plastid genomes to deduce deep phylogenetic relationships in the species-rich Myrtaceae. Nevertheless, the plastome of Rhodomyrtus tomentosa, an important representative of the genus Rhodomyrtus (DC.), has not yet been reported. Here, we sequenced and analyzed the complete chloroplast (CP) genome of R. tomentosa, a circular molecule 156,129 bp long with 37.1% GC content. This CP genome displays a typical quadripartite structure with two inverted repeats (IRa and IRb) of 25,824 bp each, separated by a small single-copy region (SSC, 18,183 bp) and a large single-copy region (LSC, 86,298 bp). The CP genome encodes 129 genes, including 84 protein-coding genes, 37 tRNA genes and eight rRNA genes, together with three pseudogenes (ycf1, rps19, ndhF). Most protein-coding genes use the universal ATG start codon; the exceptions are psbL and ndhD. A premature termination codon (PTC) was found in one protein-coding gene, atpE, a feature rarely reported in plant CP genomes. Phylogenetic analysis revealed that R. tomentosa has a sister relationship with Eugenia uniflora and Psidium guajava. In conclusion, this study identified unique characteristics of the R. tomentosa CP genome, providing valuable information for further investigations on species identification and on the phylogenetic relationships between R. tomentosa and related species.


Introduction
The family Myrtaceae has over 3000 species distributed predominantly in tropical and subtropical regions of Australia and America [1]. Within this family, Rhodomyrtus tomentosa, an evergreen shrub of the genus Rhodomyrtus, is commonly found in east and southeast Asia, including southern China, Japan, and Thailand [2]. R. tomentosa is an important plant in traditional Chinese medicine and has a long history of clinical application. Its leaves, fruits and roots have all been used as alternative medicines, with different medicinal efficacies [3]. In addition, its fruits are among the most popular wild foods. The medicinal and nutritional properties of R. tomentosa have drawn the attention of researchers in recent years [4,5]. The major chemical components of R. tomentosa include hydrolyzable tannins, phloroglucin, flavonoids, and triterpenes [6], which possess antioxidant, anti-inflammatory, anti-tumor, antibacterial and other biological activities [7,8].
In addition to nuclear DNA, eukaryotic cells contain organelles that carry independent genetic material, namely mitochondria and chloroplasts (CPs). CPs contain the
The CP genome of R. tomentosa encodes a total of 114 different genes (Table 2), of which 15 are duplicated in the IR regions, giving 129 genes in all. These 129 genes comprise 84 protein-coding genes, 37 tRNA genes and eight rRNA genes. Three pseudogenes (ycf1, rps19 and ndhF) are located around the IR-SSC, IR-LSC and SSC-IR boundaries, respectively. Four protein-coding genes, seven tRNA genes and four rRNA genes are duplicated in the IR regions. The coding regions constitute 56.7% of the genome; the remainder consists of non-coding regions, including introns, pseudogenes and intergenic spacers.
Table 2 (fragment). Other genes: accD, ccsA, cemA, clpP, matK (5 genes). Total: 129. * indicates that the gene contains one intron; ** indicates two introns; (×2) indicates that the number of the repeat unit is 2.
In the R. tomentosa CP genome, 18 genes contain introns (Table 3), which may participate in regulating gene expression and in enhancing the expression of exogenous genes at specific sites and times in the plant [15]. Of these, six are tRNA genes and 12 are protein-coding genes. Most contain a single intron, while ycf3 and clpP contain two. The rps12 gene is unusual, containing one 5' exon and two 3' exons. The 5' exon is located in the LSC region, while the 3' exons are located in the IR regions, which is consistent with the CP genomes of Psidium guajava, Eugenia uniflora and Eucalyptus grandis [21]. The three pseudogenes (ycf1, rps19 and ndhF) are located at the IRb/SSC, IRa/LSC and SSC/IRa boundaries, respectively. Because these genes span the borders of the inverted repeats, they cannot be fully duplicated, and the truncated copies lose the ability to encode complete proteins, which is why they are classified as pseudogenes.
One protein-coding gene (atpE) with a premature termination codon (PTC) was identified during annotation. To validate this finding, the raw reads were mapped onto the assembled R. tomentosa sequence, and the mapping was inspected in the Integrative Genomics Viewer (IGV) to examine variable loci. The mapping rate at the atpE locus was higher than 99%, suggesting that the variation at this locus is genuine and results in a PTC. PTCs lead to changes in the encoded protein. Because CP genomes are relatively conserved, especially within a family, three species from the Myrtaceae were selected as controls: Psidium guajava, Eugenia uniflora and Eucalyptus grandis. The atpE genes of these species were extracted using CLC Sequence Viewer (version 8) and compared with that of R. tomentosa (Figure 2). The premature termination of the atpE gene in R. tomentosa resulted in the absence of an amino acid relative to the three closely related control species.
The atpE gene encodes a subunit of the chloroplast ATP synthase complex, which participates in the photosynthetic phosphorylation necessary for plant growth [22]. As such, this gene is a critical component of the CP genome. Although PTCs are commonly reported in the genetics literature, few studies have identified PTCs in plant CP genomes. Nonsense point mutations often result in nonfunctional proteins, even when the gene is properly transcribed and translated [23]. More precisely, the effect of a nonsense mutation depends on its proximity to the original stop codon and on the degree to which functional subdomains of the protein are affected. Some genetic disorders, such as thalassemia, result from nonsense point mutations [24-26].
With this in mind, the discovery of a PTC in the R. tomentosa atpE gene may establish a foundation for further studies at the protein level through cloning and expression. Future work on CP transcription and translation is needed to verify the presence and function of PTCs in atpE and potentially in other genes.
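To make the notion of a premature termination codon concrete, the following is a minimal Python sketch. It is not the procedure used in this study, which relied on annotation, read mapping and inspection in IGV. It translates a coding sequence with the standard codon assignments and reports the first in-frame stop codon that precedes the final codon. The function names and the toy sequences are hypothetical and are not atpE sequences.

```python
from itertools import product

# Standard codon assignments (bases ordered T, C, A, G); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon.

    An incomplete trailing codon is ignored; codons containing
    ambiguous bases are translated as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def find_premature_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop codon
    located before the last codon, or None if there is none."""
    protein = translate(cds)
    position = protein.find("*", 0, max(len(protein) - 1, 0))
    return None if position == -1 else position


if __name__ == "__main__":
    # Toy sequences for illustration only.
    intact = "ATGGCTTTAAATTAA"     # M A L N *
    truncated = "ATGGCTTAAAATTAA"  # M A * N *
    print(find_premature_stop(intact))     # None
    print(find_premature_stop(truncated))  # 2
```

A reported index only flags a candidate; as in the text above, confirmation would still require checking the supporting reads.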

Identification of Long Repeats (LRs) and Simple Sequence Repeats (SSRs)
Repetitive sequences in CP genomes have been a major focus of research. Repeated sequences are abundant in the CP genome and are distributed mainly in intergenic spacers and introns [27]. Long repeats, with lengths greater than 30 bp, may promote chloroplast genome rearrangement and increase population genetic diversity [28]. To examine these functions and obtain a comprehensive picture of the long repeats in the R. tomentosa CP genome, the long repeats in the CP genomes of four other species, Psidium guajava, Eugenia uniflora, Eucalyptus grandis and Melastoma candidum, were selected for comparison according to their relatedness. The three Myrtaceae species were used to compare and analyze the conserved and unique characteristics of CP genomes among different genera of the same family. M. candidum, which belongs to another family within the order Myrtiflorae, is the most closely related species outside the Myrtaceae whose CP genome sequence is available from NCBI, and was used to analyze differences between species of different families. The results revealed the repeat structure of the five species: there are 38 (14 forward, 24 palindromic), 31 (14 forward, 15 palindromic, 2 reverse), 33 (16 forward, 16 palindromic, 1 reverse), 30 (16 forward, 14 palindromic) and 49 (22 forward, 19 palindromic, 4 reverse, 4 complement) long repeats (LRs) in R. tomentosa, Psidium guajava, Eugenia uniflora, Eucalyptus grandis and Melastoma candidum, respectively. Notably, there are no reverse or complement repeats in R. tomentosa, as in Eucalyptus grandis, whereas complement repeats occur only in M. candidum, the one species of the five that is not a member of the Myrtaceae. Thus, differences in LRs reveal genetic diversity, which is consistent with the functional analysis of LRs (Figure 3).
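The four repeat classes counted above can be illustrated with a short Python sketch. It assumes the usual definitions, under which the second copy of a repeat equals the first copy itself (forward), its reverse, its complement, or its reverse complement (palindromic). The function name and the toy sequences are hypothetical, and the sketch only classifies a given pair of copies; it does not locate repeats in a genome as a dedicated program such as REPuter does.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first: str, second: str) -> str:
    """Classify the relationship between two repeat copies.

    The checks are ordered, so a pair that satisfies more than one
    definition (for example a self-complementary sequence) is assigned
    to the first matching class.
    """
    first, second = first.upper(), second.upper()
    complement = first.translate(COMPLEMENT)
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    if second == complement:
        return "complement"
    if second == complement[::-1]:
        return "palindromic"
    return "none"


if __name__ == "__main__":
    copy = "AACGT"
    for other in ("AACGT", "TGCAA", "TTGCA", "ACGTT"):
        print(other, classify_repeat(copy, other))
```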
Simple sequence repeats (SSRs) consist of short repeated motifs of 1 to 6 bp and are widely distributed in intergenic regions, introns and even protein-coding regions. The high mutation rates of these regions also reflect genetic diversity [29]. CP SSRs, which are widely used in phylogenetic and population genetic analyses [30], are important sources for developing molecular markers. A total of 282 SSRs were identified in the R. tomentosa CP genome (Table 4), including 173 mononucleotide, 37 dinucleotide, 63 trinucleotide and nine tetranucleotide repeats. In addition, 98.8% of the mononucleotide SSRs are of the A/T type, which is consistent with previous reports that polyadenine (polyA) and polythymine (polyT) repeats are more frequent than polycytosine (polyC) and polyguanine (polyG) repeats among CP SSRs in many plants [31].
In different organisms, synonymous codons occur at different frequencies, a phenomenon known as codon preference [30,32]. For highly expressed genes, codon preference is closely related to tRNA abundance. A better understanding of codon preference will facilitate further studies on the base composition bias of DNA sequences, the identification of optimal codons, and the design of expression vectors that improve the efficiency of protein synthesis [33].
In the R. tomentosa CP genome, the protein-coding genes comprise 23,939 codons in total, of which 2,724 (11.38%) encode leucine and 286 (1.19%) encode cysteine. These are the most and least frequently used amino acids, respectively, of the 20 amino acids that can be used for protein biosynthesis by the tRNAs found in the R. tomentosa CP genome. The relative synonymous codon usage (RSCU) value (Figure 4 and Table S1) increases with the number of codons that encode a given amino acid. As illustrated, the codons of most amino acids show usage preferences, the exceptions being methionine and tryptophan, each of which is encoded by a single codon. This phenomenon has also been found in the CP genomes of other species [10,34].
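For readers unfamiliar with RSCU, the following minimal Python sketch computes it under the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. Codon usage in this study was analyzed with MEGA; the function name, the exclusion of stop codons, and the toy input below are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard codon assignments (bases ordered T, C, A, G); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds_list):
    """Return {codon: RSCU} for the sense codons of the amino acids observed."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed in the input
        expected = total / len(codons)  # equal usage of all synonyms
        for c in codons:
            values[c] = counts[c] / expected
    return values


if __name__ == "__main__":
    toy = ["ATGTTATTATTACTTTGG"]  # M L L L L W, with leucine as TTA x3 and CTT x1
    result = rscu(toy)
    for codon in ("TTA", "CTT", "CTG", "ATG", "TGG"):
        print(codon, round(result[codon], 2))
```

Because methionine and tryptophan each have a single codon, their RSCU is always 1 under this definition, which is why they show no preference.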
In addition, RNA editing is very common in plant CP genomes. Its core functions include modifying mutations and correcting and regulating translation [35]. RNA editing sites in the R. tomentosa CP genome were predicted for 35 genes with the Predictive RNA Editor for Plants (PREP) program, and the results for 20 of these genes are summarized in Table S2. In total, 64 RNA editing sites were identified in the R. tomentosa CP genome; the amino acid conversion from serine to leucine occurred most frequently, while that from threonine to methionine occurred least often.

Contraction and Expansion of IRs in the R. tomentosa CP Genome
As mentioned above, the typical quadripartite structure of the CP genome includes two different single-copy regions and two IR regions [36]. Although the inverted repeat regions (IRa and IRb) are the most conserved regions of the CP genome, contraction and expansion at the borders of the IR regions are hypothesized to explain size differences between CP genomes [37,38]. A comparison between R. tomentosa and four other closely related species may explain size differences between their respective CP genomes.
As presented in Figure 5, the IR/SSC and IR/LSC boundaries of R. tomentosa (MK_044696) were compared to those of Psidium guajava (NC_033355), Eugenia uniflora (NC_027744), Eucalyptus grandis (HM_347959) and Melastoma candidum (NC_034716). The length of the IR regions in the five CP genomes showed a modest expansion, ranging from 25,824 to 26,390 bp. The IR regions expanded to partially include rps19, ycf1 and ndhF, correspondingly creating truncated ψrps19, ψycf1 and ψndhF copies at the IRa/LSC, IRb/SSC and SSC/IRa junctions, respectively. A long ycf1 pseudogene is present in all species and has been used to analyze CP genome variation in plants [28,38]. Moreover, rps19, which has been reported to be one of the most abundant transcripts in the CP genome, is present in most of the species compared, the exceptions being Eugenia uniflora and Eucalyptus grandis. The ndhF gene, related to photosynthesis, was found to be 67 bp, 112 bp, 209 bp away from the IRb/SSC border in R. tomentosa, P. guajava, Eugenia uniflora, and Eucalyptus grandis, respectively. Among the species compared, the trnH gene lies at the longest distance (32 bp) from the LSC border in the R. tomentosa CP genome.
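As a quick consistency check on the region lengths reported for R. tomentosa, the quadripartite structure implies that the LSC, the SSC and the two IR copies together account for the whole genome. The Python sketch below works through that arithmetic using only the values stated in the abstract; the variable and function names are illustrative.

```python
# Region lengths in bp, as reported for the R. tomentosa CP genome.
LSC = 86_298
SSC = 18_183
IR = 25_824       # length of each inverted repeat (IRa and IRb)
GENOME = 156_129

# Quadripartite structure: LSC + SSC + IRa + IRb should equal the genome length.
total = LSC + SSC + 2 * IR
assert total == GENOME, f"regions sum to {total}, expected {GENOME}"


def share(length: int) -> float:
    """Percentage of the genome occupied by a region, to one decimal place."""
    return round(100 * length / GENOME, 1)


if __name__ == "__main__":
    print("LSC share (%):", share(LSC))
    print("SSC share (%):", share(SSC))
    print("IRa + IRb share (%):", share(2 * IR))
```

The same check applied to the other genomes in the comparison would show how much of any size difference is attributable to the IRs rather than to the single-copy regions.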

Comparative CP Genomic Analysis
The whole CP genome sequence of R. tomentosa (MK_044696) was compared to those of Psidium guajava (NC_033355), Eugenia uniflora (NC_027744), Eucalyptus grandis (HM_347959), and Melastoma candidum (NC_034716) using the mVISTA program (Figure 6). The two IR regions were less divergent than the LSC and SSC regions, as is also observed in most plants [6,39]. Moreover, the non-coding regions were more variable than the coding regions, and the divergent regions may provide candidate DNA barcodes for future studies. In the coding regions, most genes were relatively conserved except for matK, accD, ndhF, ycf1 and ycf2. These divergence hotspots of the four plant CP genome sequences provide abundant information for developing molecular markers for phylogenetic analyses and plant identification of Myrtaceae species.
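The kind of identity profile shown in such plots can be approximated by a sliding-window percent identity over two sequences that have already been aligned. The Python sketch below is a simplified illustration of that idea and is not the mVISTA algorithm; the function name, the window and step sizes, and the toy alignment are assumptions.

```python
def window_identity(ref: str, other: str, window: int = 100, step: int = 50):
    """Return (start, percent_identity) for windows along a pairwise alignment.

    Both sequences must be pre-aligned to the same length, with '-' for gaps.
    A column counts as a match only if both bases are equal and not a gap.
    """
    if not ref or len(ref) != len(other):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    results = []
    last_start = max(len(ref) - window, 0)
    for start in range(0, last_start + 1, step):
        a = ref[start:start + window].upper()
        b = other[start:start + window].upper()
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / len(a)))
    return results


if __name__ == "__main__":
    # Toy alignment with a single mismatch in the first window.
    reference = "ACGTACGTAC"
    compared = "ACGTTCGTAC"
    for start, identity in window_identity(reference, compared, window=5, step=5):
        print(start, identity)
```

Windows whose identity falls well below that of their neighbours correspond to the divergence hotspots discussed above.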

Phylogenetic Analysis of the R. tomentosa CP Genome
The availability of a complete CP genome provides abundant sequence information that can be used to study the molecular evolution and phylogeny of plants [8,40]. To identify the evolutionary position of R. tomentosa, the whole CP genomes of 17 species, four from the Myrtaceae and 13 from other families, were used to reconstruct a phylogenetic tree with the maximum likelihood (ML) method. Figure 7 shows that most nodes were strongly supported, with bootstrap values (BP) of 100%. R. tomentosa exhibited a sister relationship with Eugenia uniflora and Psidium guajava, and these species then grouped with Eucalyptus grandis. These four species all belong to the Myrtaceae and clustered distinctly from the other families, which could help reveal the relationships between different families and orders. In addition, the branching of this phylogenetic tree was highly consistent with the Angiosperm Phylogeny Group (APG) IV classification system, a modern classification of angiosperms based on molecular systematics. This classification differs from that of Flora of China, a series of books that summarizes the systematic classification of vascular plants (ferns and seed plants) in China.
Due to the limited number of Myrtaceae CP genome sequences deposited in databases, phylogenetic relationships among Myrtaceae plants based on CP genome sequences can be difficult to determine. Therefore, more data are needed to evaluate the phylogenetic relationships of Myrtaceae plants in the future.
Figure 6. Sequence identity plot comparing the chloroplast genome of Rhodomyrtus tomentosa with three others using mVISTA. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percentage identity ranging from 50 to 100%.

Plant Material, DNA Extraction and Sequencing
Fresh leaves of Rhodomyrtus tomentosa were collected from the Medicinal Botanical Garden of Guangzhou University of Chinese Medicine. Total genomic DNA was extracted from clean leaves using a DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). The purity and integrity of the extracted genomic DNA were assessed by ultraviolet spectrophotometry and gel electrophoresis. DNA samples with good integrity and purity were submitted for library construction and sequencing on an Illumina HiSeq 2000 platform (Illumina Inc., San Diego, CA, USA).

Chloroplast Genome Assembly and Annotation
Trimmomatic (v0.36, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany) was used to filter and trim low-quality reads. The complete sequence of the Psidium guajava chloroplast genome was downloaded from NCBI and served as a reference. Based on their coverage and similarity, CP-like reads were extracted and assembled using the ABySS 2.0 program to form a complete chloroplast genome sequence. BLASTn was used to align the assembly against itself in order to locate the precise positions of the quadripartite structure. To verify the assembly, the four junctions between the IR regions and the LSC/SSC regions were confirmed by PCR amplification.
The preliminary gene annotation of the R. tomentosa CP genome was performed using the GeSeq online tool (https://chlorobox.mpimp-golm.mpg.de/geseq.Html) with default parameters [41]. The annotation was further examined and revised manually using CLC Sequence Viewer (version 8), which was used to compare the CP genome of R. tomentosa with that of the related species Psidium guajava. Because the sequences at both ends of an exon are relatively conserved in intron-containing genes, the chloroplast introns can be predicted from the revised annotation file. Organellar Genome DRAW (OGDRAW) (v1.2, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany) [42] was used to construct a detailed map of the CP genome. Finally, the whole CP genome of R. tomentosa was deposited in GenBank, with an accession number of MK_044696.2.1.

Genome Structure and Genome Comparison
Codon usage and GC content were analyzed using Molecular Evolutionary Genetics Analysis software (MEGA 6.06, Tokyo Metropolitan University, Tokyo, Japan) [43]. Thirty-five protein-coding genes of the R. tomentosa chloroplast genome were used to predict potential RNA editing sites with the online Predictive RNA Editor for Plants (PREP) suite [44], using a cutoff value of 0.8. MISA and REPuter (https://bibiserv.cebitec.uni-Bielefeld.de/session) were used to identify SSRs and LRs, respectively, in the R. tomentosa CP genome [45]. For comparison among genomes, the mVISTA program (http://genome.Lbl.gov/vista/index.shtml) was used to align the CP genome of R. tomentosa with the CP genomes of Psidium guajava, Eugenia uniflora and Eucalyptus grandis [46].
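To illustrate what an SSR search involves, the following Python sketch scans a sequence for perfect tandem repeats of 1 to 6 bp motifs. It is a simplified stand-in and not the MISA implementation; the minimum repeat numbers in MIN_REPEATS are hypothetical, since the thresholds used in this study are not stated here, and the function names and toy sequence are likewise assumptions.

```python
# Hypothetical thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {1: 8, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(unit: str) -> bool:
    """True if the motif is not itself a repetition of a shorter motif."""
    return not any(
        len(unit) % k == 0 and unit == unit[:k] * (len(unit) // k)
        for k in range(1, len(unit))
    )


def find_ssrs(seq: str):
    """Return non-overlapping perfect SSRs as (start, motif, copies),
    scanning left to right and keeping the longest repeat at each position."""
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for k, min_copies in MIN_REPEATS.items():
            unit = seq[i:i + k]
            if len(unit) < k or set(unit) - set("ACGT") or not is_primitive(unit):
                continue
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies and (best is None or copies * k > len(best[0]) * best[1]):
                best = (unit, copies)
        if best:
            hits.append((i, best[0], best[1]))
            i += len(best[0]) * best[1]  # continue after the reported repeat
        else:
            i += 1
    return hits


if __name__ == "__main__":
    toy = "GC" + "A" * 10 + "GC" + "AT" * 5 + "CC" + "AAG" * 4 + "C"
    for start, motif, copies in find_ssrs(toy):
        print(start, motif, copies)
```

Counting the hits by motif length gives the mono-, di-, tri- and tetranucleotide tallies of the kind reported in Table 4, although the exact numbers depend on the thresholds chosen.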

Phylogenetic Analysis
A total of 17 complete CP genome sequences were downloaded from the GenBank (NCBI) database. Nucleotide alignments were subjected to phylogenetic analyses with maximum likelihood (ML) using the GTR + G substitution model, which was selected based on model screening. Bootstrap analysis was conducted with 1000 replicates and TBR branch swapping. In addition, Cinnamomum camphora was set as the out-group.
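Bootstrap support of the kind described here is obtained by resampling alignment columns with replacement and rebuilding the tree from each pseudo-replicate. The Python sketch below shows only the resampling step, with tree inference left to a dedicated ML program; the function name and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Return one bootstrap pseudo-replicate of an alignment.

    Columns are sampled with replacement, and the same sampled columns
    are used for every taxon so that homologous sites stay together.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")

    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][c] for c in columns)
        for name in names
    }


if __name__ == "__main__":
    toy = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "TCGTAC"}
    rng = random.Random(42)  # fixed seed so the example is reproducible
    replicates = [bootstrap_alignment(toy, rng) for _ in range(1000)]
    print(len(replicates), "pseudo-replicates of length", len(replicates[0]["taxon_a"]))
```

The support value of a node is then the percentage of replicate trees in which that node is recovered.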

Conclusions
In conclusion, the complete CP genome of Rhodomyrtus tomentosa was obtained using high-throughput sequencing; it is 156,129 bp in length and encodes 129 genes. Further analysis of genome structure and genome characteristics revealed that the gene structure and gene content of the R. tomentosa CP genome are conserved. The phylogenetic analysis indicated that R. tomentosa has a sister relationship with Eugenia uniflora and Psidium guajava. These results provide valuable information for further investigations on species identification and the evolution of R. tomentosa and its related species.