Comparative analysis of the complete chloroplast genome sequences in psammophytic Haloxylon species (Amaranthaceae)

The Haloxylon genus belongs to the Amaranthaceae (formerly Chenopodiaceae) family. The small trees or shrubs in this genus are referred to as the King of psammophytic plants, and perform important functions in environmental protection, including wind control and sand fixation in deserts. To better understand these beneficial plants, we sequenced the chloroplast (cp) genomes of Haloxylon ammodendron (HA) and Haloxylon persicum (HP) and conducted comparative genomic analyses on these and two other representative Amaranthaceae species. Similar to other higher plants, we found that the Haloxylon cp genome is a quadripartite, double-stranded, circular DNA molecule of 151,570 bp in HA and 151,586 bp in HP. It contains a pair of inverted repeats (24,171 bp in HA and 24,177 bp in HP) that separate the genome into a large single copy region of 84,214 bp in HA and 84,217 bp in HP, and a small single copy region of 19,014 bp in HA and 19,015 bp in HP. Each Haloxylon cp genome contains 112 genes, including 78 coding, 30 tRNA, and four ribosomal RNA genes. We detected 59 different simple sequence repeat loci, including 44 mono-nucleotide, three di-nucleotide, one tri-nucleotide, and 11 tetra-nucleotide repeats. Comparative analysis revealed only 67 mutations between the two species, including 44 substitutions, 23 insertions/deletions, and two micro-inversions. The two inversions, with lengths of 14 and 3 bp, occur in the petA-psbJ intergenic region and rpl16 intron, respectively, and are predicted to form hairpin structures with repeat sequences of 27 and 19 bp, respectively, at the two ends. The ratio of transitions to transversions was 0.76. These results are valuable for future studies on Haloxylon genetic diversity and will enhance our understanding of the phylogenetic evolution of Amaranthaceae.


INTRODUCTION
The eudicot clade comprises approximately 75% of all flowering land plant species, including major subclades: rosids, asterids, Saxifragales, Santalales, and Caryophyllales (APG III, 2009). Haloxylon species, which include psammophytic small trees or shrubs, are positioned phylogenetically in the Amaranthaceae Juss of the Caryophyllales Perleb among core eudicots (APG III, 2009;Pyankov et al., 2001;Akhani, Edwards & Roalson, 2007). The Haloxylon genus has about 11 species, with a distribution from the Mediterranean through Central Asia and into China (Zhu, Mosyakin & Clemants, 2004). Two Haloxylon species, which are known as the King of psammophytic plants, are found in the deserts of northwest China and, play important roles in environmental protection, including wind control and sand fixation (Zhu, Mosyakin & Clemants, 2004;Jia & Lu, 2004). These precious psammophytic woody plants can adapt to harsh environmental conditions, such as drought, desert, high temperature, and sand storms. However, populations of Haloxylon plants have been threatened in China in past decades as a result of decreased underground water, overgrazing, and over exploitation of agriculture.
Because of the environmental significance of these plants and their declining numbers, genetic research on Haloxylon germplasm resources has garnered significant interest (Song & Jia, 2000;Sheng et al., 2004;Sheng et al., 2005;Zhang et al., 2006a;Zhang et al., 2006b). However, Haloxylon plants possess only fine green assimilating shoots, without leaves, making the evaluation of their phenotypic diversity difficult. Further, the detection of genetic diversity within Haloxylon germplasm resources has been slowed by a lack of morphological markers (Sheng et al., 2004;Sheng et al., 2005;Zhang et al., 2006a;Zhang et al., 2006b;Wang et al., 2009;Suo et al., 2012a). A recent study by Long et al. (2014) used RNA-seq data to elucidate the Haloxylon transcriptome, providing a valuable sequence resource for further genetic and genomic studies; however, genetic information for members of the Haloxylon genus, and how they might differ from one another, is limited.
Each leaf cell of plants contains 1,000-10,000 chloroplasts (cp), which are key organelles for photosynthesis and other biochemical pathways such as the biosynthesis of starch, fatty acids, pigments, and amino acids (Dong et al., 2013b;Raman & Park, 2016). Since the first cp genome of Nicotiana tabacum was sequenced in 1986, around 800 complete cp genome sequences have been made available in the National Center for Biotechnology Information organelle genome database. These data are valuable sources of genetic markers for phylogenetic analyses, genetic diversity evaluation, and plant molecular identification (Dong et al., 2012;Dong et al., 2013a;Dong et al., 2013b;Dong et al., 2014;Ni et al., 2016;Suo et al., 2012b).
There are two published complete cp genome sequences (Spinacia oleracea and Beta vulgaris subsp. vulgaris) from members of the Amaranthaceae family (Li et al., 2014;Schmitz-Linneweber et al., 2001). However, the determination of the cp genome from Haloxylon plants is of further significance for potentially enhancing our understanding of their adaptability to severe desert environmental conditions, and their genomic evolution within the Amaranthaceae. Here, we report the complete cp genomes from two Haloxylon species, H. ammodendron and H. persicum, including patterns of nucleotide substitutions, microstructural mutation, and simple sequence repeats (SSRs). We further performed genomic comparative analyses on these and two other representative Amaranthaceae species, to better understand the evolutionary relationships within this family.

Chloroplast genome sequencing
The HA and HP cp genomes were sequenced using the short-range PCR method reported by Dong et al. (2012), Dong et al. (2013a) and Dong et al. (2013b. The PCR protocol was as follows: preheating at 94 • C for 4 min, 34 cycles at 94 • C for 45 s, annealing at 55 • C for 40 s, and elongation at 72 • C for 1.5 min, followed by a final extension at 72 • C for 10 min. PCR amplification was performed in an Applied Biosystems VeritiTM 96-Well Thermal Cycler (Model#: 9902, made in Singapore). The amplicons were sent to Shanghai Majorbio Bio-Pharm Technology Co., Ltd. (Beijing) for Sanger sequencing in both the forward and reverse directions using a 3730xl DNA analyzer (Applied Biosystems, Foster City, CA, USA). DNA regions containing poly structures or that were difficult to amplify were further sequenced using newly designed primer pairs for confirming reliable and high quality sequencing results.

Chloroplast genome assembling and annotation
The cp DNA sequences were manually confirmed and assembled using Sequencher (v4.6) software, and cp genome annotation was performed using the Dual Organellar Genome Annotator (DOGMA) (Wyman, Jansen & Boore, 2004). BLASTX and BLASTN searches were utilized to accurately annotate the protein-encoding genes and to identify the locations of the transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs). Gene annotation information from other closely related plant species was also used for confirmation when the boundaries of the introns or exons could not be precisely determined because of the limited power of BLAST in cp genome annotation (e.g., for some short exons of 6-9 nt in length, such as in the case of rps16, pet B, and pet D). Promoter, intron, and exon boundaries, as well as the location of stop codons for all protein-encoding genes, have been identified accurately. The cp genome map was drawn using Genome Vx software (Conant & Wolfe, 2008) (http://wolfe.ucd.ie/GenomeVx/), and the cp genome sequences have been deposited to GenBank with the following accession numbers: KF534478 for HA and KF534479 for HP (https://www.ncbi.nlm.nih.gov/nuccore/?term=Haloxylon+chloroplast+genome).

Repeat structure analysis
Gramene Simple Sequence Repeat Identification Tool software (http://www.gramene. org/db/markers/ssrtool) (Benson, 1999) was utilized to search for simple sequence repeat loci in the cp genome sequences, with the threshold value of repeat number as ≥10 for mono-nucleotide repeats, ≥5 for di-nucleotide repeats, ≥4 for tri-nucleotide repeats, and ≥3 for tetra-nucleotide, penta-nucleotide, or hexa-nucleotide repeats.

Gene content analysis and comparative genomics
The mVISTA program was employed in Shuffle-LAGAN mode (Frazer et al., 2004) to compare the complete HA and HP cp genomes. These were aligned using MUSCLE software (Thompson et al., 1997) and were manually adjusted using Se-Al 2.0 (Rambaut, 1996). Variable sites in the cp genome were calculated using DnaSP (DNA Sequences Polymorphism version 5.10.01) software (Librado & Rozas, 2009), and the genetic distance (p-distance) was computed using MEGA 6.0 software (Tamura et al., 2013). Based on the aligned sequence matrix, the micro-structure events were checked manually and were further divided into three categories: (i) microsatellite-related insertions/deletions (indels), (ii) non-microsatellite-related indels, (iii) and inverted sequences. Using the HA cp genome sequence as the standard reference, the size, location, and evolutionary direction of the microstructure events were counted. The proposed secondary structures of the inverted regions in the cp genomes of HA and HP were analyzed using mfold software (Zuker, 2003). The complete cp genome sequences of S. oleracea (GenBank accession number AJ400848.1, Spinacia L.) (Schmitz-Linneweber et al., 2001) and B. vulgaris subsp. vulgaris (GenBank accession number KJ081864.1, Beta vulgaris subsp. vulgaris) (Li et al., 2014), two closely related species in the Amaranthaceae family, were downloaded from GenBank databases (www.ncbi.nlm.nih.gov). These were used for comparison with the complete cp genomes of HA and HP.

Genome features
Similar to the typical cp genome structure in other higher plants, the Haloxylon cp genome is a double-stranded, circular DNA molecule of 151,570 bp in length in HA and 151,586 bp in length in HP. It also includes a large single copy region (LSC) of 84,214 bp in HA and 84,217 bp in HP and a small single copy region (SSC) of 19,014 bp in HA and 19,015 bp in HP; these are separated by a pair of inverted repeats (IR) (24,171 bp in HA and 24,177 bp in HP) (Fig. 1). The GC content in this IR region is 43.0% in HA and 42.7% in HP, and the GC content in the LSC and SSC regions is 34.4% (LSC) and 29.7% (SSC) in HA and 34.5% (LSC) and 29.7% (SSC) in HP (Table 1). Among the four Amaranthaceae species included in our analyses, which represent three genera, the longest cp genomes (151,570 bp for HA and 151,586 bp for HP) are 1,935 bp to 1,951 bp larger than the shortest one (149,635 bp for B. vulgaris subsp. vulgaris) (Li et al., 2014). The size of the S. oleracea cp genome (150,725 bp) (Schmitz-Linneweber et al., 2001) is intermediate (Table 1). Notably, the cp genomes of HP and HA are quite similar in size; the HP cp is only 16 bp longer than that of HA, with minor differences between them. There are a total of 112 genes in the Haloxylon cp genome, including 78 coding genes, 18 of which are duplicated genes in the IR region, 30 tRNA genes, and four ribosomal RNA genes (16S, 23S, 5S, 4.5S) ( Fig. 1 and Table S1). Based on their predicted functions, these genes can be divided into three categories, (1) genes related to transcription and translation; (2) genes related to photosynthesis; (3) genes related to the biosynthesis of amino acids, fatty acids, etc., and some functionally unknown genes (Table S1). The S. oleracea cp also contains the same 78 protein-coding genes, whereas the cp in B. vulgaris has 79. This species contains an additional gene (rpl23), which is a pseudogene in the other species ( Fig. 1 and Table S1). There are 17 genes harboring introns in the cp genomes of the four Amaranthaceae species analyzed (one class I intron, trnL-UUA, and 16 class II introns), and two of these genes, ycf 3 and clpP, contain two introns each (Table 2).
Several angiosperm lineages have lost introns from the rpl2 gene independently (Downie et al., 1991), which could also be regarded as a characteristic feature of the core members of the Caryophyllales (Logacheva et al., 2008). In each of the four Amaranthaceae cp genomes in our analysis, the rpl2 gene has lost its intron. Some authors have proposed that intron loss is not always a dependable marker of phylogenetic relationships (Millen et al., 2001;Dong et al., 2013b;Raman & Park, 2016), and further study, including the sampling of more taxa, is needed to clarify this issue.

Expansion and contraction of the border regions in Haloxylon cp genomes
To analyze these Amaranthaceae species at the genome-level, the sequences of all the four cp genomes were plotted using the VISTA program (Frazer et al., 2004), using the annotation of HA as a reference (Fig. 2). Similar to other angiosperms, we observed that the IR region is more conserved in these species than the LSC and SSC regions.
The expansion and contraction of the border regions between the two IR regions and the single copy region have contributed to genome size variations among plant lineages (Dong et al., 2013b;Goremykin et al., 2003;Ni et al., 2016). Therefore, we next compared the exact  (812) 152 (152) Notes. rps12 is trans-spliced with the 5 end located in the LSC region and the duplicated 3 end in the IR regions.
IR border positions and their adjacent genes among the four Amaranthaceae cp genomes (Fig. 3). From these data, we see that the IRa/LSC border is generally located upstream of the trnH-GUG gene. The distance between the IRa/LSC border and the trnH-GUG gene is 1 bp in the Haloxylon cp genomes and 2 bp in Beta genus, with no separation in Spinacia (Fig. 3). The IR region is expanded by 763 bp and enters the 5 end of the ycf 1 gene in Haloxylon species, whereas it is expanded by 1,427 bp and 1,492 bp, respectively, in Spinacia and Beta. Except for the expansion of the ycf 1 gene, the IR region extends to the rps19 gene in all of four Amaranthaceae cp genomes. The rps19 pseudogene was not observed in this study. Although there are expansions or contractions of IR regions observed among the investigated species of the Amaranthaceae, they contribute little to the overall size differences in the cp genomes. The exon at the 5 end of the rps12 gene is located in the LSC region, and the intron and 3 -end exon of the gene are situated in the IR region in all four Amaranthaceae species.

Indels and SNPs
Indel and single nucleotide polymorphism (SNP) sites are important molecular features valuable for development of DNA markers that are useful for plant identification and genetic analysis of population structure (Dong et al., 2012;Dong et al., 2013a;Dong et al., 2013b;Dong et al., 2014;Suo et al., 2012b;Suo et al., 2015;Suo et al., 2016). We detected 23 indels in the cp genome sequence alignment of HA and HP, including 16 indels caused by microsatellite repeat variations and seven non-microsatellite-related indels (Table 3). Most of the indel events occurred in non-coding regions (21/23). A large portion of the indels related to microsatellite repeat variations are characterized by a single base mutation; six insertions of this type were observed in the HA cp genome. The non-microsatellite-related indels were found to contain mostly five to six variable base sites, and two insertions of this type were detected in the HA cp genome.

Figure 3 Comparison of the junction positions between the single copy and IR regions among four Amaranthaceae genomes.
Forty-four SNPs were detected in the HA and HP cp genomes (Table 4), which is considerably less than what was found between the cp genomes of other closely related plant species, including Oryza sativa and Oryza nivara (159 SNPs, Masood et al., 2004), Machilus yunnanensis and Machilus balansae (231 SNPs, Song et al., 2015), Citrus sinensis and Citrus aurantiifolia (330 SNPs, Su et al., 2014), Panax ginseng and Palax notoginseng (464 SNPs, Dong et al., 2014), and Solanum tuberosum and Solanum bulbocastanum (591 SNPs, Chung et al., 2006). Of note, the indel and SNP mutation events in the Haloxylon cp genomes were not randomly distributed, but rather, clustered as ''hotspots'' (Shaw et al., 2007;Worberg et al., 2007). It is likely that such mutational dynamics created the highly variable regions in the genome (Suo et al., 2012b;Song et al., 2015).

Patterns of nucleotide substitutions
Overall, the differences between the HA and HP cp genomes are minor, with a genetic distance of 0.00029 between them (Table 4). In total, 44 variable nucleotide sites were detected, 23 of which were found in intergenic regions, six in introns, and 15 in proteinencoding regions.
We also found that the probability of occurrence for the various nucleotide substitutions is different, depending on the mutation, as shown in Fig. 4. The most frequently occurring mutations are from A to C and from T to G (12 times each); mutations from A to T and from T to A exhibited the lowest frequency (only one occurrence of each). The ratio of transitions (Ts) and transversions (Tv) was 0.76 in the cp genome of Haloxylon species.
In the gene-encoding regions of the HA and HP cp genomes, a total of 15 variable base sites were detected in 11 protein-encoding genes. Specifically, we found one mutation in each of the following genes: atpA, atpI, mat K, ndhF, ndhI, psbC, rpoB, rps15, and rps3. Two genes, rpoC2 and ycf 1, each contained three mutation sites (Table 5). These mutations included six Ts and nine Tv. Ten nonsynonymous substitutions occurred simultaneously in seven genes (Table 5).

Repeat structure feature
Simple sequence repeats (SSRs) are also called microsatellites. Within the cp genomes of HA and HP, 59 different SSR loci were detected. Of these, 44 loci are mono-nucleotide repeats, three are di-nucleotide repeats, one is a tri-nucleotide repeat, and 11 are tetra-nucleotide repeats; penta-nucleotide repeats or those containing a higher number of nucleotide repeats were not detected. Among the SSR loci detected, the most frequently observed repeats were A/T and AT/TA, accounting for 77.97% of the total number of SSR loci (Table  6). By comparison, in the cp genomes of M. yunnanensis and M. balansae, 36 SSR loci were identified (Song et al., 2015).

Inversions
Inversions are important events in the evolution of plant cp genomes. Smaller inversions are less frequent in these genomes, and they are generally associated with hairpins (Fig. 5). Most inversions are found in spacers and introns, and in most cases, the presence/absence of Table 4 The nucleotide substitution patterns present in the two Haloxylon chloroplast genomes.

Region Location H. ammodendron H. persicum
atpA The nucleotide substitution patterns in the two Haloxylon chloroplast genomes. The patterns were divided into six types, as indicated by the six non-strand-specific base-substitution types (i.e., numbers of G to A and C to T sites for each respective set of associated mutation types). The H. ammodendron chloroplast genome was used as a standard.
inversions is highly homoplastic during cp genome evolution (Kim & Lee, 2005;Catalano, Saidman & Vilardi, 2009), even at the population level (Quandt & Stech, 2004). A sequence alignment of the Haloxylon cp genomes revealed that an inversion event of 14 bp and one of 3 bp occur in the pet A-psbJ intergenic region and in the rpl16 intron, respectively. The two inverted sequences are predicted to form secondary hairpin structures, with repeat sequences of 27 bp and 19 bp at the two ends, respectively (Fig. 5).

Pseudogenes
Pseudogenes have been defined as nonfunctional regions of genomic DNA that originally derived from functional genes (Balakirev & Ayala, 2003). These are evolutionary relics of functional components in the genome that provide important information regarding the    history of the gene and genome evolution (Balakirev & Ayala, 2003;Zou et al., 2009;Choi & Park, 2015). The rpl22 and rps18 genes are putative pseudogenes in the Paeoniaceae (Dong et al., 2013b), whereas the atpB gene is a pseudogene in Aster spathulifolius. Conversely, the rpl22, rps18, and atpB genes are predicted to be normal and functional in the Haloxylon species, whereas rpl23 is a present as a pseudogene in the Haloxylon cp genomes ( Fig. 1 and Table S1).

CONCLUSIONS
Two Haloxylon cp genomes were sequenced and characterized for the first time, and we found that they share the same overall organization and gene content found in most angiosperm cp genomes, including that of the closely related Spinacia and Beta species. The location and distribution of repeat sequences and differing nucleotide mutation sites between the two cp genomes were identified. The LSC/IRB/SSC/IRA boundary regions of the Amaranthaceae cp genomes were compared, and lightly intense variations were identified within the genus Haloxylon. The complete Haloxylon cp genome sequences reported here enhance the genomic information available for the Amaranthaceae family and further contribute to the study of germplasm diversity. These data represent a valuable source of markers for future research on Haloxylon population genetics.