Complete chloroplast genome of Camellia japonica genome structures, comparative and phylogenetic analysis

Camellia is an economically, ecologically and phylogenetically valuable genus in the family Theaceae. The frequent interspecific hybridization and polyploidization makes this genus phylogenetically and taxonomically under controversial and require detailed investigation. Chloroplast (cp) genome sequences have been used for cpDNA marker development and genetic diversity evaluation. Our research newly sequenced the chloroplast genome of Camellia japonica using Illumina HiSeq X Ten platform, and retrieved five other chloroplast genomes of Camellia previously published for comparative analyses, thereby shedding lights on a deeper understanding of the applicability of chloroplast information. The chloroplast genome sizes ranged in length from 156,607 to 157,166 bp, and their gene structure resembled those of other higher plants. There were four categories of SSRs detected in six Camellia cpDNA sequences, with the lengths ranging from 10 to 17bp. The Camellia species exhibited different evolutionary routes that lhbA and orf188, followed by orf42 and psbZ, were readily lost during evolution. Obvious codon preferences were also shown in almost all protein-coding cpDNA and amino acid sequences. Selection pressure analysis revealed the influence of different environmental pressures on different Camellia chloroplast genomes during long-term evolution. All Camellia species, except C. crapnelliana, presented the identical rate of amplification in the IR region. The datasets obtained from the chloroplast genomes are highly supportive in inferring the phylogenetic relationships of the Camellia taxa, indicating that chloroplast genome can be used for classifying interspecific relationships in this genus.


Introduction
Camellia, containing about 280 species, is a genus with high economic, ecological and phylogenetic values in the family Theaceae [1,2]. It is native to Southern, Eastern Asia and China, which possess more than 80% of the species and are the center of species diversity [3]. Besides the abundance in species diversity and phylogenetic significance, people pay more attention to this genus, for their commercial and ornamental values. been proven to be beneficial for human health as they contain over 700 chemical constituents [1,4]. Camellia is also known as ornamental trees for urban gardening. The cultivation history of Camellia has been at least 1300 years in China [3]. Today, a group of yellow flowers named golden Camellia, e.g. C. chrysantha, are grown for ornamental purposes, with thirteen to sixteen petals of a flower and blooming several times in a year. Many other Camellia species, e.g. C. japonica, also had local uses. The other most economically valuable species, C. oleifera and C. reticulata, are used for edible oil and cooking in China [5,6]. At present, more than 3 million hectares are used for Camellia oil production, and the yield of the Camellia nearly 164,000 tons of edible oil [3,7]. Although the Camellia is native to Asia, because of its variety use, the cultivated species are now found all over the world [1,[8][9][10]. However, the genus Camellia is phylogenetically and taxonomically under controversial that detailed investigation is required, as a result of frequent interspecific hybridization and polyploidization. Whereas classification of the genus Camellia is traditionally based on morphology [11][12][13][14][15][16], the result of this systematics is often unreliable and made lots of controversy as morphology is often affected by environmental factors [2]. As a result, it is urgent to seek other methods for rebuilding the classification of Camellia.
Being relatively stable and not easily affected by the environment, molecular methods can provide useful information for taxonomic classification and phylogenetic. Molecular methods, e.g. DNA and RNA sequences [10,[17][18][19][20][21][22][23][24], internal transcribed spacer [10,18], simple sequence repeats (SSR) [25,26], ribosomal DNA [27] and several DNA loci [28], have been involved to better understand the evolution of the Camellia. A number of studies focus on the taxonomy, species identification and phylogenetics of the Camellia, but still have not get a satisfied resolution. A recent study used complete chloroplast genomes in several Camellia species [2,[29][30][31] and got more information of this species. Camellia japonica has been present in Qingdao, Shandong province since the tertiary, it has evolved in dependently after that. A recent research shows that the early flower development sequence placed C. japonica (Naidong) in a most primitive branch of the phylogenetic tree compared to other species [32]. The taxonomy of C. japonica with other Camellia species is in dispute.
In plant cells, the chloroplast is not only the most important and universal organelle, but also one of the major genetic systems (the other two are nucleus and mitochondria). It is involved in photosynthesis and associated with metabolism, such as fatty acid and amino acid synthetic pathways [33][34][35]. As an independent organelle, the chloroplast has its own genome. It has a covalently closed circular DNA structure and exists in multi-copies in plant cell. It has a conserved circular DNA arrangement [36]. Since the chloroplast genome is self-replicating and has a relatively independent evolutionary process, it has been used for resolving the source populations during species evolution [8,34,[37][38][39][40][41][42][43][44][45].
Here, we report newly sequenced complete chloroplast genomes of C. japonica using next generation sequencing technology and genomic comparative analysis with other five published chloroplast genome sequence download from the NCBI. This study aims to deeply analyze the chloroplast genomes of six Camellia species and to determine their (especially the C. japonica) phylogenetic positions.

Plant materials and genomic DNA isolation
We collected fresh leaves from an adult C. japonica tree growing in the Daguan Island (Jimo District Qingdao city, Shandong, China) (N36˚14 0 , E120˚46 0 , Altitude 10m). The leaves were dried immediately with silicagel. Total genomic DNA was extracted according to Wiland-Szymańska [46].

Chloroplast genome sequencing, assembly and annotation
We used an ultrasonicator to randomly fragment the extracted genomic DNA into 400-600 bp. The NEBNext Ultra DNA Library Prep Kit was used to construct an Illumina paired-end cpDNA library. Paired-end sequencing (2 × 150 bp) was run on an Illumina HiSeq X Ten platform. After filtering the raw data, the cp genomes were assembled according to the following steps. Frist the clean reads were used to assembled into contigs using SOAPdenovo 2.017. Then, the contigs were aligned to the relative species (C. sinensis JQ975030) and get the relative location of the contig sequences and the structure diagram of cp genomes were obtained. The software Gap Closer 1.12 were used to fill the gaps. Finally, the complete cp genome sequence were obtained. The chloroplast genome sequences were annotated with CpGAVAS software and DOGMA software, and then manually corrected.

Molecular marker development
We conducted a sliding window analysis and used DnaSP (DNA Sequences Polymorphism version 5.10.01) software to calculate the nucleotide diversity (Pi) of the six complete Camellia chloroplast genomes [47]. We set the step size to 200 bp with a window length of 600bp.
The mVISTA [48] program was used to compare the complete chloroplast genome of C. japonica to other five published chloroplast genomes of the genus Camellia, i.e., C. huana (KY_626042), C. crapnelliana (KF_753632), C. azalea (KY_856741), (KY_626042), C. liberofilamenta (KY_626041) with the shuffle-LAGAN mode [49], so that inter-and intra-specific variations were shown. We used Mega 6.0 software [50] to determine the variable and parsimonyinformative base sites across the complete chloroplast genomes, and LSC, SSC and IR regions of the six chloroplast genomes.
DnaSP software was used for manual detection of insertions/deletions. To estimate selection pressures, the 78 protein coding genes in the chloroplast genomes were combined. We used PAML with the yn00 program to calculate the Ka and Ks rates of the combined sequences [51].

Selection pressure analysis
We used KaKs_Calculator 2.0 [52] to determine the Ka and Ks values of genes containing SNP variations, so that we can analyze how different Camellia have evolved under different environmental pressures. We also analyzed the codon preference, and mapped them by R software.
We also used PAUP v4b10 to analyze maximum parsimony (MP). We treated gaps as missing, and character states as unordered. We selected MULPARS option when performing Heuristic search. Further steps include tree bisection-reconnection branch swapping, and random stepwise addition with 1,000 replications.

Basic characteristics of the Camellia chloroplast
A total of with 10.44 Gb clean data were generated by Illumina HiSeq X Ten platform. After data filtering with mean Q20 higher than 94.70% and the mean length was 150 bp. The chloroplast genome sizes of the six Camellia species ranged from 156,607 bp (C. japonica) to 157,166 bp (C. luteoflora). The structure of all chloroplast genomes is quadripartite, which is typical of angiosperm cpDNA. Each chloroplast genome consists of a large single copy region (86,258-86,719bp) and a small single copy region (18,203-18,406bp), separated by two inverted repeat regions (25,967-26,077bp) ( Table 1). The GC content of three Camellia species (C. japonica, C. huana and C. liberofilamenta) was 37.32% and the others were 37.30%. The average of GC content is almost similar among the six Camellia species. The well-conserved genomic structure resembled those of other higher plants, including gene number and gene order (Fig 1 and Table 2). The complete C. japonica cp genome sequence has been submitted to GenBank with the accession number PRJNA510919.

Comparative analysis of the Camellia chloroplast genomes
Chloroplast simple sequence repeats (cpSSRs) play a crucial role in studying phylogeny and population genetics [55]. We analyzed cpSSRs in the chloroplast genomes (S1 and S2 Tables).  The number of cpSSRs ranged from 67 (C.azalea) to 74 (C.huana) among the six camellia taxa. The number of nucleotide repeats had no significant difference among the six camellia taxa (Fig 2). The majority of the 420 SSR loci reside in LSC regions (286 loci, 68.10%). Only a minor portion are located in the SSC regions (72 loci, 17.14%) and IR regions (62 loci, 14.76%). Same as previously reported the SSR loci exhibited a significantly variable distribution among all regions in each of the six Camellia chloroplast genomes [34,56]. The lowest value (17) was

Chloroplast gene gain-loss events
Despite high conservation of chloroplast genome sequences, structural variations, gene loss, and metastasis occur in some species as a result of evolution. This study compared nineteen Camellia species (Table 4). We found that lhbA and orf188, followed by orf42 and psbZ, were readily lost during evolution. The results also showed that psaJ, psbF, psbH and psbZ were lost in C.danzaiensis, compared with the other eighteen species. Moreover, C.japonica had lost trnfM-CAU gene compared the other species. Gene loss events also occur in other plants, e.g. the loss of infA in the Fagales chloroplast genome [55], and the loss of rpl32 in the Paeonia obovata chloroplast genome [57]. Among the rpsl6, ndh, infA, and ycf2 genes, some have disappeared in some angiosperms, and in some legumes, gene loss events have occurred to all of them [58].

Analysis of codon preference
69.59% of the Camellia chloroplast genome sequence was gene coding, of which the vast majority was protein coding. The analytic varieties provided by statistical analyses of all protein-coding cpDNA and amino acid sequences showed obvious codon preferences. It also showed the similarity of protein codons in the six Camellia species, of which AAA, ATT, GAA, AAT, and TTT had the highest frequencies, and the TGA, TAG, TAA, TGC, CGC had the lowest frequencies (Fig 3). The third codon showed a high A/T preference, which is a common phenomenon in higher plant chloroplast genomes [59][60][61]. The pattern of the codon preference has important significance in studying species evolution. We used the relative synonymous codon usage (RSCU) as a relative intuitionistic to measure the extent of codon bias [62]. The synonymous codon preference is partitioned into four models: high preference (RSCU>1.3), moderate preference (1.2�RSCU�1.3), low preference (1.0<RSCU<1.2) and no preference (RSCU�1.0).
Among the protein-coding chloroplast genes in the six Camellia species, the 20 amino acids were encoded by 64 codons, in which most of the amino acids had codon preferences except tryptophan (Fig 4). As a total, 30 codon preferences were identified, with 18 amino acids and one stop codon involved. Among the preferred codons, 70.00% exhibited high preferences. This result further revealed the relative conservation of Camellia chloroplast genomes, as high codon preference is also a common phenomenon in higher plants.

IR contraction analysis
In the chloroplast genome, the IR region is considered as the most conserved region. However, genome size variations often occur in its expansion/contraction regions among various plant lineages, which can be used to study the phylogenetic classification of plants [63]. We compared the IR-SSC and IR-LSC boundaries information of six Camellia were compared (Fig 5). The LSC/IRa boundaries was located within the coding region of rps 19 and created a pseudogene of 279bp at LSC/IRa border. The ycf1 gene spanned the IRb/SSC region and the length of ycf1 was from 936bp to 1069bp. Furthermore, the TrnH-GUG gene (75bp) was located in the LSC. However, the gene trn-GUU and ndhf was not observed in Camellia except C. crapnelliana, that means they contribute little to the overall size variations in the chloroplast genomes of Camellia plants.

Genome divergence between the Camellia species
A sequence identity analysis based on mVISTA was performed between six Camellia species, and the reference was the C. japonica chloroplast genome (Fig 6). The aligned sequences that exhibit high similarity showed higher conservation than the remaining sequences across the whole chloroplast genome. Lower divergence levels were exhibited in IR and coding regions than in SC and non-coding regions, respectively. We conducted a sliding window analysis and DnaSP software to calculate the nucleotide diversity of the six complete Camellia chloroplast genomes (Fig 7) Among the six Camellia taxa, C. japonica had the most nucleotide substitutions and insertions/deletions, while C.   huana had least nucleotide diversity, and the smallest numbers of nucleotide substitutions and insertions/deletions. According to the chloroplast genome sequence alignment of the six Camellia taxa, six hyper-variable regions, trnS-trnR, petN-psbM, trnF-ndhJ, petA-psbJ, rpl32-trnL, ycf1 were discovered (Fig 7). These six sequences could be used as DNA markers for classification and revealing the genetic divergence of the Camellia taxa, with a high discrimination success ranging from 60% to 100% (Table 5). The sequences of the petN-psbM and ycf1 are two most rapidly evolving regions were able to discriminate all the taxa investigated in this study. In those most rapidly evolving regions, 121 and 122 variable base sites were detected, respectively, of which, 61 and 62 informative base sites, made up 2.86-2.99% in each of the sequences. Comparatively, the commonly recommended DNA fragments (rbcL and matK) achieved only 40% and 80% of discrimination success respectively.

Phylogenetic analysis
Previous studies have fairly well resolved the relationships between Camellia species, but have not well studied the position of Camellia [29,30,64]. Six data partitions including coding regions, large single-copy region, the small single copy region, IR region, the inverted repeat region, introns and spacers and the complete cp DNA sequences from the 19 Camellia were used for phylogenetic analyses. All six datasets produced similar phylogenetic trees with moderate to high support, whereas the IR dataset had poor support (Fig 8). The reconstructed The phylogenetic tree reveals that C. japonica is most related with C. oleifera. Furthermore, the phylogenetic result is consistent with the section-level classification by Raven [65]. The chloroplast resource will be helpful for the conservation, taxonomy, and breeding programs of the genus Camellia.

Chloroplast genome variation and evolution
In this study, Illumina next-generation technology was used to completely sequence the chloroplast genome of C. japonica and compared with the previously reported chloroplast genomes in Camellia. The chloroplast genomes of C. japonica displayed the typical quadripartite structure of flowering plants, were conservative in gene order and gene content, in comparison with the most lineages of angiosperms. The chloroplast genome sizes ranged from 156,607 to 157,166 bp in length. IR regions are considered as the most conserved region, which considered to be the primary mechanisms affecting length variation of angiosperm chloroplast genomes. Only minor variations were detected at the SC/IR boundaries of six Camellia. Occurrence of indels was the main factor effecting the variation of the length in Camellia chloroplast genomes. the Camellia chloroplast genomes contained more AT content than GC content, which is a common phenomenon in higher plant chloroplast genomes [59][60][61].
SSRs are widely used in phylogenetic analyses and population genetics and polymorphism investigations. A total of 420 SSR loci were identified and the number of SSRs ranged from 67 to74 in Camellia. the mono-nucleotide repeats are the most common SSRs in chloroplast genomes, which make more contributions to the genetic variation than the longer SSRs. Since the structure of chloroplast genomes are conservative, SSR primers are transferable across  species and genera. Information involving SSRs in this study will provide useful sources for estimating the phylogenetic relationships among species and genera.

Potential cpDNA barcodes
Camellia is the largest genus in its family, including more than 280 species all over the world. For effective exploration, conservation, and domestication, accurately identified wild species would provide a clear genetic background of this genus. However, the taxonomic inventory of genus Camellia is still under controversial, because of the vast amount of species with extensive global distribution and interspecific hybridization. DNA barcoding has been widely used in identify unknown species [66]. The rbcL and matK is considered as core universal DNA barcodes in many species. Therefore, genomic comparative researches of more complete chloroplast genome sequences have become necessary for developing variable DNA barcodes. These mutation "hotspot" regions can be used to develop novel DNA barcodes [67]. The six potential mutational hotspots (trnS-trnR, petN-psbM, trnF-ndhJ, petA-psbJ, rpl32-trnL, ycf1) identified in this study could be suitable barcodes for plant classification in Camillia. In previous reports, the gene ycf1 was recommended as core DNA barcode for plants because of the high divergence [68]. Ycf1 gene has been widely applied in plant phylogeny and DNA barcoding studies [69][70].
Recently, using the chloroplast genome as a super-barcode for plant species identification was discussed [71]. The analyses on chloroplast genome sequence divergence showed that it may indeed be useful as a super-barcode for species identification of Camellia. Further research is necessary to investigate whether these hyper-variable regions or complete chloroplast genome sequences could be used as reliable and effective DNA barcodes for species of Camellia. The results obtained in this study have significant value for future studies on global genetic diversity assessment, phylogeny, and population genetics of Camellia.

Perspectives of persimmon research in future
It is important to elucidate the genetic relationship of Camellia taxa for germplasm conservation, breeding strategies of Camellia. The accurate classification of sect. Thea have widely been acknowledged to be complex. For example, the taxonomy of C. pubicosta still has a dispute. Min et al considered that the C. pubicosta belongs to sect. Corallina [71], while Chang and Huang insisted it belongs to sect. Thea. [29,72]. In our research the C. pubicosta was close to C. sinensis and C. grandibracteata supporting C. pubicosta might be classified into sect. Thea. Previous studies reported that species of sect. Thea can be divided into two groups, agreeing with the locule ovary number [73,74]. However, our results showed that the classification of this species was not entirely consistent with previous studies [74,75]. For instance, the C. taliensis and C. cuspidate, C. grandibracteata and C. sinensis were supported as monophyletic respectively. However, the C. taliensis and C. grandibracteata have 5 ovaries, while C. cuspidata and C. sinensis have 3 ovaries.
The C. japonica population in Qingdao, Shandong province is the only one in temperate areas in China. While this population has been present in this area since the tertiary, after the quaternary glacier most thermophyilic species extinction or migration to warmer regions. In contrast, C. japonica adapted to temperate climate. Since then, it has evolved independently and no gene exchanges with the distribution center species. Zhang et al considered that the C. japonica was the relative evolutionary species. The results of phylogenetic analysis support that C. japonica and C.oleifera as monophyletic, However the C. japonica have 2-3 ovaries and the C.oleifera have 3-5 ovaries. Our results indicated that the classification of Camellia species using locule ovary number may be reconsidered. The combination of traditional classification methods, molecular markers and sequencing of more complete cp genomes of Camellia are necessary to solve the problem of Camellia classification in the future research.

Conclusions
We reported the complete chloroplast genome sequences of C. japonica were reported based on the Illumina HiSeq X Ten platform. C. japonica chloroplast genomes exhibited a typical quadripartite and circular structure with 156607bp.We investigated the variation of repeat sequences, SSRs among the six complete Camellia cp genomes. Selection pressure analysis revealed the influence of different environmental pressures on different Camellia chloroplast genomes during long-term evolution. Obvious codon preferences were shown in almost all protein-coding cDNA and amino acid sequences. Lower divergence levels were exhibited in IR and Coding regions than in SC and Non-coding regions, respectively. The results of phylogenetic showed that C. japonica has the closest relationship with C. oleifera. Therefore, chloroplast genome resources will be helpful for taxonomic studies, conservation, and breeding programs of the genus Camellia.
Supporting information S1