Gene Losses and Variations in Chloroplast Genome of Parasitic Plant Macrosolen and Phylogenetic Relationships within Santalales

Macrosolen plants are parasitic shrubs, several of which are important medicinal plants used as folk medicine in some provinces of China. However, reports on Macrosolen are limited. In this study, the complete chloroplast genome sequences of Macrosolen cochinchinensis, Macrosolen tricolor and Macrosolen bibracteolatus are reported. The chloroplast genomes were sequenced on the Illumina HiSeq X platform. The length of the chloroplast genomes ranged from 126,621 bp (M. tricolor) to 129,570 bp (M. cochinchinensis), with a total of 113 genes, including 35 tRNA genes, eight rRNA genes, 68 protein-coding genes, and two pseudogenes (ycf1 and rpl2). The simple sequence repeats mainly comprised A/T mononucleotide repeats. Comparative genome analyses of the three species detected the most divergent regions in the non-coding spacers. Phylogenetic analyses using maximum parsimony and maximum likelihood strongly supported the idea that Loranthaceae and Viscaceae are monophyletic clades. The data obtained in this study are beneficial for further investigations of Macrosolen with respect to evolution and molecular identification.


Introduction
The traits of trophic specialization in all parasitic plants are described as "parasitic reduction syndrome". At the genetic level, parasitic reduction syndrome includes the functional and physical reduction of heterotrophs' plastid genomes, where rampant gene loss and an acceleration of molecular evolutionary rates occur [1,2]. Considering the partial or complete absence of their photosynthetic capacity, parasitic plants have to absorb organic nutrients, inorganic nutrients, and water from their hosts [3]. Most parasitic plants are included in the order Santalales and the families Orobanchaceae and Orchidaceae [2]. The first complete chloroplast genome of a parasitic plant was obtained from Epifagus virginiana, and all of its photosynthesis and energy producing genes have been lost [4]. Petersen

Complete Chloroplast Genomes of Three Macrosolen Species
The lengths of the three chloroplast genomes ranged from 126,621 bp (M. tricolor) to 129,570 bp (M. cochinchinensis), with a typical quadripartite structure consisting of a pair of IRs (24,445 bp) separated by the LSC (70,692-73,052 bp) and the SSC (5320-5724 bp) regions (Figure 2). The three chloroplast genomes were found to be highly conserved in GC content, gene content and gene order (Table 1 and Table S1). All three species comprised 113 genes, including 68 protein-coding genes, 35 tRNAs, eight rRNAs and two pseudogenes (ycf1 and rpl2). A total of 17 genes were found to be repeated genes, and 79 were found to be unique genes in the chloroplast genomes. Three genes (clpP, ycf3 and rps12) contained two introns, whereas 10 genes (atpF, rpoC1, rpl2, rpl16, petB, petD, trnA-UGC, trnI-GAU, trnK-UUU and trnL-UAA) had only one intron (Table 2 and Table S2).

Figure 2. Gene map of the complete chloroplast genome of three Macrosolen species. Genes outside the large ring circle are transcribed in a counter-clockwise direction, and genes inside the circle are transcribed clockwise. The same color represents the same category of genes. Deep grey in the inner circle represents GC content, and lighter grey represents A/T content.

Table 2 (fragment). Other genes: accD, clpP **, matK, ccsA, cemA (number: 5). * One or two asterisks following genes indicate one or two contained introns, respectively. (×2) indicates that the number of the repeat unit is two. The numbers in parentheses at the line of 'Number' indicate the total number of repeated genes.

Codon Usage Analyses and RNA Editing Sites
Relative synonymous codon usage (RSCU) is the ratio between the use and expected frequencies for a particular codon and a measure of nonuniform synonymous codon usage in coding sequences [32]. On the basis of the sequences of protein-coding genes, the codon usage frequency was estimated for the chloroplast genome of the three Macrosolen species (Figure 3). All the protein-coding genes were found to consist of 21,581, 21,598 and 21,520 codons in the chloroplast genomes of M. cochinchinensis, M. tricolor and M. bibracteolatus, respectively (Table S3). Figure 3 shows that the RSCU value increased with the increase in the quantity of codons which coded for a specific amino acid. Most of the amino acid codons show preferences except for methionine and tryptophan. Potential RNA editing sites were also predicted for 29 genes in the chloroplast genomes of the three species. A total of 39 RNA editing sites were identified (Table S4). The amino acid conversion from serine (S) to leucine (L) occurred most frequently, whereas that from proline (P) to serine (S) and from threonine (T) to methionine (M) occurred the least.
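The RSCU definition above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name `rscu`, the hand-built standard genetic code table and the toy input are all assumptions made for the example. Each codon's observed count is divided by the mean count over its synonymous family, which is why single-codon amino acids (methionine and tryptophan) always give a value of 1 and cannot show a preference.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed for this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} for the codons of every amino acid seen in cds_list."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families (one family per amino acid).
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed
        expected = total / len(codons)  # count expected under uniform usage
        for c in codons:
            values[c] = counts[c] / expected
    return values


if __name__ == "__main__":
    # Toy CDS: Met, Phe (TTT), Phe (TTT), Phe (TTC), Trp
    for codon, value in sorted(rscu(["ATGTTTTTTTTCTGG"]).items()):
        print(codon, round(value, 3))
```

A value above 1 indicates a codon used more often than expected under uniform synonymous usage, and a value below 1 indicates one used less often.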

Figure 4 shows the comparison of the boundaries of the LSC/IR/SSC regions of the three Macrosolen species. The LSC/IR/SSC boundaries and gene contents in the chloroplast genomes of the three species were found to be highly conserved, sharing the same structure and differing only in length. In the three species, the rpl2 gene, which is a normal functional gene, crossed the LSC/IRa boundary, whereas an rpl2 pseudogene with a length of 1268 bp formed in the IRb region. The SSC/IRb boundaries of M. cochinchinensis, M. tricolor and M. bibracteolatus were found to be located within the complete ycf1 gene, and ycf1 pseudogenes with lengths of 2457, 2455 and 2448 bp, respectively, were found to be produced in IRa.

Simple Sequence Repeats (SSRs) and Repeat Structure Analyses
A simple sequence repeat (SSR), which is also known as microsatellite DNA, is a tandem repeat sequence consisting of one to six nucleotide repeat units [22]. SSRs are widely used as molecular markers in species identification, population genetics, and phylogenetic investigations due to their high polymorphism level [33,34]. A total of 238, 226 and 217 SSRs were identified in the chloroplast genomes of M. cochinchinensis, M. tricolor and M. bibracteolatus, respectively (Table 3). Amongst all SSRs, mononucleotide repeats were the most numerous, with 169, 166 and 162 detected in M. cochinchinensis, M. tricolor and M. bibracteolatus, respectively. Amongst these mononucleotide repeats, A/T was found to be the most frequent SSR. Mononucleotide and dinucleotide SSRs exhibited a clear base preference and mainly contained A/T units. Long repeat sequences were defined as those >30 bp, and these repeats were mainly distributed in the intergenic spacer and intron sequences. M. cochinchinensis presented the highest number of long repeats, comprising six forward, seven palindromic, four reverse and one complement repeats (Figure 5). M. tricolor presented two types of repeats, comprising six forward and nine palindromic repeats. M. bibracteolatus presented seven forward, six palindromic and two reverse repeats.
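To make the SSR definition concrete, the following Python sketch scans a sequence for perfect tandem repeats with unit lengths of one to six nucleotides. It is only an illustration under stated assumptions: the text does not name the SSR detection tool or its thresholds, so the minimum repeat counts in `MIN_REPEATS`, the function names and the toy sequence are all hypothetical.

```python
# Minimum number of repeat units per unit length (assumed values, not from the study).
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    thresholds = min_repeats or MIN_REPEATS
    seq = seq.upper()
    hits = []
    for k, n in sorted(thresholds.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                # Count how many times the motif repeats in tandem from position i.
                count = 1
                while seq[i + count * k:i + (count + 1) * k] == motif:
                    count += 1
                if count >= n:
                    hits.append((i, motif, count))
                    i += count * k  # jump past the repeat just recorded
                    continue
            i += 1
    return sorted(hits)


if __name__ == "__main__":
    toy = "GC" + "A" * 10 + "CG" + "AT" * 5 + "GGC"
    for start, motif, count in find_ssrs(toy):
        print(f"pos={start} unit={motif} repeats={count}")
```

Requiring a primitive motif prevents a mononucleotide run such as A10 from also being reported as an AA or AAA repeat. Positions are 0-based.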

Comparative Genomic Analyses
The complete chloroplast genomes of the three species were compared using the mVISTA program, with that of M. cochinchinensis as the reference. As shown in Figure 6, ycf1 and ccsA were found to be the most variable genes. Apart from these, the genes were found to be highly conserved, and most of them showed similarities of >90%. The variations in the coding regions were smaller than those in the noncoding regions. Amongst the three chloroplast genomes, the most divergent regions were found to be localized in the intergenic spacers, such as trnF-trnM. The rRNA genes of the three species were highly conserved, and almost no variations were observed. The K values (sequence divergence between species) were calculated, and sliding-window plots of the K values were constructed with DnaSP [35] (Figure 7). Figure 7 shows that the sequence divergence between M. tricolor and M. cochinchinensis was much higher than the other two K values. M. bibracteolatus and M. tricolor showed a small divergence (K < 0.05). The LSC and SSC regions were more divergent than the IRs. Two mutational hotspots with high K values were found, located in the LSC and SSC regions. Combining the gene locations with the mVISTA result, the two hotspots were identified as trnF-trnM and ycf1.
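The sliding-window divergence analysis can be illustrated with a short Python sketch. It is a simplified stand-in for the DnaSP calculation rather than a reimplementation of it: the Jukes-Cantor correction, the window and step sizes, and the function names `jc_distance` and `sliding_k` are assumptions, and the input is taken to be two sequences that are already aligned to equal length.

```python
import math


def jc_distance(a, b):
    """Jukes-Cantor corrected divergence between two aligned sequences.

    Columns containing gaps or ambiguous bases are ignored. Returns None
    when no comparable column remains.
    """
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return None
    p = sum(x != y for x, y in pairs) / len(pairs)  # proportion of differing sites
    if p >= 0.75:
        return float("inf")  # correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


def sliding_k(a, b, window=600, step=200):
    """Return (window_start, K) pairs along a pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, max(len(a) - window, 0) + 1, step):
        k = jc_distance(a[start:start + window], b[start:start + window])
        results.append((start, k))
    return results


if __name__ == "__main__":
    ref = "ACGT" * 5
    other = ref[:2] + "A" + ref[3:]  # one substitution at position 2
    for start, k in sliding_k(ref, other, window=10, step=5):
        print(start, round(k, 4))
```

Windows with markedly higher K than their neighbours correspond to candidate hotspots of the kind reported here for trnF-trnM and ycf1.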

Figure 6. Sequence identity plot comparing the three chloroplast genomes with M. cochinchinensis as a reference by using mVISTA. Grey arrows and thick black lines above the alignment indicate genes with their orientation and the position of their IRs, respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50% to 100%.

Phylogenetic Analyses
To analyze the phylogenetic relationships of Macrosolen in Santalales, we constructed phylogenetic trees using 58 common protein-coding genes of 16 species and the matK genes of 15 species by the MP and ML methods with 1000 bootstrap replicates. The MP and ML trees had the same topology whether they were constructed from the common protein-coding genes or from the matK genes (Figure 8). All nodes in all the phylogenetic trees received a bootstrap value of >50%. All four phylogenetic trees showed that the three Macrosolen species are sister to S. jasminodora (Olacaceae). M. cochinchinensis, M. tricolor and M. bibracteolatus were grouped into one branch with a well-supported bootstrap value (100%). The three species within the genus Viscum grouped with Osyris alba (Santalaceae), and all Santalales species were clustered within a lineage distinct from the outgroup. As shown in Figure 8, the trees constructed from the common protein-coding genes also received higher bootstrap values than the trees constructed from the matK genes.
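The bootstrap support values mentioned above come from resampling alignment columns with replacement and rebuilding the tree on each pseudo-replicate. The Python sketch below shows only the resampling step, under stated assumptions: the function names, the dictionary-based alignment format and the fixed random seed are hypothetical, and tree inference itself (MP or ML) is left to dedicated phylogenetics software.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: dict mapping taxon name to an aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates are reproducible.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # Draw column indices once, then apply the same draw to every taxon.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def replicates(alignment, n=1000, seed=0):
    """Yield n bootstrap pseudo-alignments (n=1000 matches the replicate count above)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield bootstrap_alignment(alignment, rng)


if __name__ == "__main__":
    toy = {"taxon_a": "ACGTAC", "taxon_b": "ACGAAC", "taxon_c": "TCGAAC"}
    for rep in replicates(toy, n=3, seed=1):
        print(rep)
```

The support for a clade is then the percentage of replicate trees in which that clade appears.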

Discussion
Numerous variations occur in the chloroplast genomes of parasitic plants. However, only a small number of plants within Santalales have been studied. In this study, the complete chloroplast genomes of M. cochinchinensis, M. tricolor and M. bibracteolatus from Santalales were assembled, annotated and analyzed. Compared with the chloroplast genome of the model plant Nicotiana tabacum, all the ndh genes were lost in the three species, and the infA gene, which codes for a translation initiation factor, was also missing. These cases were similar to those of T. chinensis and T. sutchuenensis [7]. The rpl16 and ycf15 genes were lost in the three species, but they were still present in T. chinensis as pseudogenes (Figure 9). However, compared with the results reported by Shin et al. [36], different gene contents of the chloroplast genome were observed in M. cochinchinensis. That study reported that M. cochinchinensis contains the exon 1 fragment of the ndhB gene and a fragment of the infA gene, whereas the rpl36 gene is completely lost. According to our results, however, the rpl36 gene is still present in the chloroplast genome, and M. cochinchinensis has lost the infA gene and all ndh genes. The number of tRNA genes also differed between the two studies: we annotated 35 tRNA genes, whereas the previous study obtained only 30. The evolution of the chloroplast genome in parasitic plants, particularly nonphotosynthetic holoparasites, can lead to significantly reconfigured plastomes [21]. The losses of ndh genes are associated with nutritional status or extensive rearrangements of chloroplast structures [37], and they have occurred in the reported chloroplast genomes of parasitic plants [7]. Our study also showed that ndh genes were lost in the transition from autotrophy to heterotrophy [38].
The Santalales order consists of a small number of autotrophic species and a large number of parasitic species, which are root or aerial (stem) parasites [39]. According to the Engler system, Santalales consists of seven families. We downloaded sequences from the five families of Santalales that were available in the National Center for Biotechnology Information (NCBI) at that time, together with two species as outgroups, to analyze the phylogenetic relationships of Macrosolen in Santalales. The present study showed that Loranthaceae is closely related to Olacaceae, whereas Viscaceae is closely related to Santalaceae and Opiliaceae. These results are similar to those of previous studies [13,14]. All the phylogenetic results strongly support that Loranthaceae and Viscaceae diverged independently from each other.
As folk medicines in China, M. cochinchinensis, M. tricolor and M. bibracteolatus have been used to treat diseases for a long time, and their dried stems and branches with leaves are used as the medicinal parts. However, Macrosolen species are similar in appearance, especially when they are processed into medicinal slices, which makes them difficult to identify. The identification of parasitic medicinal materials has rarely been reported. Though phytochemical approaches have played an important role in species identification [26], they are inadequate because their results are affected by the growth environment and harvest period. Molecular characterization has shown an improved specificity for plants [23,26]. In our study, mutational hotspots such as the ycf1 gene, the ccsA gene and the trnF-trnM intergenic region were identified as potential sites for the identification of Macrosolen species.

Plant Materials
All the samples in this study were collected from the Guangxi Province of China. Fresh leaves of M. cochinchinensis and M. tricolor were collected from Qinzhou city, and fresh leaves of M. bibracteolatus were collected from Chongzuo city. The three samples were identified by Yonghua Li from the College of Pharmacy, Guangxi University of Traditional Chinese Medicine. The collected fresh leaves were stored at −80 °C until use.

DNA Extraction, Sequencing and Assembly
All the methods in this article were based on the methods of Zhou et al. [40]. Total genomic DNA was extracted from samples using the DNeasy Plant Mini Kit with a standard protocol (Qiagen Co., Hilden, Germany). The DNA was sequenced according to the manufacturer's manual for the Illumina HiSeq X. Approximately 6.2 Gb of raw data from M. cochinchinensis, 6.5 Gb of raw data from M. tricolor, and 6.3 Gb of raw data from M. bibracteolatus were generated with 150 bp paired-end read lengths. The software Trimmomatic (version 0.39, Institute for Biology, Aachen, Germany) [41] was used to filter the low-quality reads of the raw data, and the Q value was defined as Sanger. Then, all the clean reads were mapped to the database on the basis of their coverage and similarity. Burrows-Wheeler Aligner (BWA-MEM, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, UK) was used in chloroplast genome assembly to generate the BAM files. The depth was calculated using Samtools (Medical Population Genetics Program, Broad Institute, Cambridge, MA, USA) and plotted using Rscript (with the smoothScatter function). The accuracy of the assembly at the four boundaries between the SSC, LSC and IR regions of the chloroplast sequences was confirmed through PCR and Sanger sequencing using the validated primers listed in Table S5. The assembled complete chloroplast genome sequences of M. cochinchinensis, M. tricolor and M. bibracteolatus were submitted to the NCBI under the accession numbers MH161424, MH161425 and MH161423, respectively. The raw data of the three species were also submitted to the NCBI. The BioProject ID of this study is PRJNA587349. The SRA accession ID of M. tricolor is SRR10442639, that of M. bibracteolatus is SRR10442640, and that of M. cochinchinensis is SRR10442641.

Genome Comparison and Phylogenetic Analyses
The whole-genome alignment of the chloroplast genomes of the three Macrosolen species was performed and plotted using the mVISTA program (http://genome.lbl.gov/vista/mvista/submit.shtml) [42]. Gene content comparison was analyzed by CPGAVAS2 (Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing,