Comparison of intraspecific, interspecific and intergeneric chloroplast diversity in Cycads

Cycads are among the most threatened plant species. Increasing the availability of genomic information by adding whole chloroplast data is a fundamental step in supporting phylogenetic studies and conservation efforts. Here, we assemble a dataset encompassing three taxonomic levels in cycads, including ten genera, three species in the genus Cycas and two individuals of C. debaoensis. Repeated sequences, SSRs and variations of the chloroplast were analyzed at the intraspecific, interspecific and intergeneric scale, and using our sequence data, we reconstruct a phylogenomic tree for cycads. The chloroplast was 162,094 bp in length, with 133 genes annotated, including 87 protein-coding, 37 tRNA and 8 rRNA genes. We found 7 repeated sequences and 39 SSRs. Seven loci showed promising levels of variations for application in DNA-barcoding. The chloroplast phylogeny confirmed the division of Cycadales in two suborders, each of them being monophyletic, revealing a contradiction with the current family circumscription and its evolution. Finally, 10 intraspecific SNPs were found. Our results showed that despite the extremely restricted distribution range of C. debaoensis, using complete chloroplast data is useful not only in intraspecific studies, but also to improve our understanding of cycad evolution and in defining conservation strategies for this emblematic group.

the chloroplast using a genome skimming approach, resulting in high resolution phylogenies and allowing for estimations of timing of historical diversification, biodiversity and extent of genomic divergence 14,20,21 .
It is well known that genetic diversity can greatly vary between taxa, due to either different intrinsic characteristics (e.g. reproductive system, genome size and organization) or to extrinsic features (e.g. endemic vs. widespread species, young vs. old species) 17,18,22 . In addition, the hypothesis of a linear accumulation of mutations in sequences across time (i.e. a molecular clock) has been refuted in many groups [23][24][25] .
In this study, we analyse the C. debaoensis chloroplast as a reference together with molecular data of other Cycas species to identify potential DNA-barcode loci, and compared generic-level chloroplast features in cycads, to highlight the evolutionary history of this group. We also compared the chloroplast features of two individuals of C. debaoensis to provide new resources for marker development in this endangered species.

Results and Discussion
Genome size and features. Using genome skimming and reference-guided assembly, we reconstructed the 162,094 bp long chloroplast genome C. debaoensis_Jiang_DB-2015. The complete cp genome was submitted   Table 2). The tufA gene, found in gymnosperms and hornworts and inherited from green algae is coding for a nonfunctional protein synthesis elongation factor (723-bp long in C. taitungensis and Ginkgo) 17 . Interestingly, this gene was 1 bp longer in C. debaoensis_Jiang_DB-2015 than in other Cycas (Table 2).
Repeat and SSR analysis. Using REPuter, seven repeats were found in the chloroplast of C. debaoensis_ Jiang_DB-2015, which were three forward (F) and four palindrome (P), with no reverse and complement repeats discovered ( Table 3). The repeats were mainly distributed in the intergenic spacers of transfer RNA genes, some of them being located in transfer RNA itself (Table 3). Interspecific comparison and analysis in broader Cycadaceae showed that C. revoluta had the highest number of repeats (24), while C. debaoensis contained the fewest (7), and C. taitungensis contained an intermediate number of repeats (16) (Supplementary Table S1, Fig. 2). In contrast, the comparison of simple sequence repeats (SSRs) revealed a relative conservatism in their numbers, with congeneric species showing similarities in both numbers and spatial patterns of SSRs occurrence (Fig. 3). This is of particular value in Cycadaceae, where species are usually scarcely distributed, and in which diagnostic morphological characters are often poorly defined or absent. These generic molecular biomarkers have the potential to provide useful diagnostic data in redefining complex paraphyletic and polyphyletic species groups in the family 18,22,26 , aiding directly in conservation efforts. In other groups, the presence and nature of repeats have been shown to be of great value in evolutionary and population analyses 27,28 . Microsatellites (SSRs) are useful markers for population genetics, conservation of endangered species and species delineation 22,29-31 , as previously highlighted for C. debaoensis 10,12 . There were 39 SSRs in the chloroplast of C. debaoensis, 34 (87%) and 5 (13%) mono-and di-nucleotides SSRs, respectively (Table 4). These SSRs were mainly distributed in the IGS region (29; 74%), and the other 26% were distributed in CDS genes (Table 4). C. revoluta and C. taitungensis had similar SSRs numbers and locations than C. debaoensis, most of them being mononucleotides and distributed in the IGS (Fig. 3). Interestingly, C. revoluta lacked some SSR patterns (G and GA mono-and di-nucleotides, respectively) (Table S2). These diagnostic SSRs can be used in combination with nuclear SSRs developed in the genus for Cycadaceae conservation or reintroduction, species biodiversity assessments and phylogenetic studies in native or introduced areas 12,30-33 .

Cycads phylogenetic reconstruction and comparison.
In the maximum likelihood (ML) phylogenetic tree, all but two nodes were highly supported (bootstrap support ≥ 95), with the accessions of C. debaoensis closely related to the other Cycas species (Fig. 4A). Cycas spp. diverged first in the Cycadales, followed by Dioon, a clade containing Zamia, Ceratozamia and Stangeria. Bowenia diverged from the remaining Zamiaceae with relatively high support (bootstrap support 80%), Macrozamia being as sister to a clade containing both Encephalartos and Lepidozamia (Fig. 4A). This chloroplast phylogeny confirms the division of Cycadales into two suborders, each of them being monophyletic in our analyses (Fig. 4B), but contradict the current family delimitations 34 , with the family Stangeriaceae being polyphyletic with high support, in agreement with the most recent phylogenetic 13 Transfer RNAs Table 2. Genes present in the C.debaoensis (KU743927) chloroplast genome. (× 2) Two gene copies in the IRs. a Gene containing one intron. b Gene containing two introns.
work in cycads 35 . However, we found each genus in Stangeriaceae grouping with one of the clade in Zamiaceae, contrary to Salas-Leiva et al. 35 , in which Bowenia diverged in a basal position for all cycads but Cycas and Dioon, and Stangeria group with Zamia (Microcycas being absent from our dataset). C. debaoensis diverged in a basal position of the genus, with C. revoluta and C. taitungensis being closely related. In addition, the branch leading to Cycas is longer than the branches leading to the other genera, in accordance with the hypothesis of a recent diversification during the Miocene 19 . These results demonstrate the suitability and efficiency of using complete chloroplast sequences in reconstructing the evolutionary history of cycads, as previously demonstrated for other groups 36,37 .
Although the clear delineation of the genera in cycads are mostly due to the lengths of the sequences provided by complete chloroplast sequencing, the variability was unevenly distributed along the circular molecule (Fig. 5). Indeed, the ribosomal RNA genes as well as the region between psbA and rpoC1 genes (0-25 kb) and between ndhC and rpl20 (50-72 kb) of the cp sequences exhibited relatively few variations among cycads (Fig. 5). A cluster of four ndh genes (120-124.5 kb) appeared to be strikingly conserved. Overall, the level of variation increased with taxonomic distance, meaning that regions showing polymorphisms among Cycas species, exhibited higher variation among different cycad genera. However, some notable well-defined (< 500 bp long) polymorphic    regions departed from this assertion in ycf1 and ycf2 IRa. Indeed, the polymorphisms in these regions were higher among Cycas than among cycads as a whole. Interestingly, three SNPs were located in other regions (trnL-trnF, clpP intron 2 and ycf2 IRb) that exhibited a continuous increase in polymorphism levels across the family. ycf1 has been recently proposed as a barcode locus 38 , despite it not being present in all genera 39 , and was identified with clpP among the most variable loci in Parthenium spp. 40 . Finally, trnL-trnF was previously used in Cycas phylogenetic studies 41 , but also in species identification of trees and ferns 42,43 . Here, we stress the need to further assess these loci as potential more informative substitutes to the official barcode loci 44 .

Comparative interspecific chloroplast genomic analysis.
Focusing on the three Cycas species available in GenBank, the mVISTA results showed that the four chloroplasts were highly conserved; however, the coding regions appeared to be globally more variable than the non-coding regions ( Fig. 5 and Suppl. Fig. 1). Furthermore, the coding regions, e.g. rpoB, psbC, clpP (intron), ycf1, and ycf2; psbA-trnH and trnL-trnF intergenic spacers showed promising levels of variations for further development in applications such as DNA-barcoding or phylogenetic reconstruction. Cycas have been considered as a difficult group for DNA-barcoding 45 . Previously, it was reported that the psbA-trnH spacer was highly variable in Cycadales except Cycas 45 , but trnL-trnF was used in Cycas for phylogenetic studies 41 . Although rbcL + matK were chosen as a two-locus DNA-barcode for their universality and efficacy in land plant 44 , it was not variable enough in Cycas (Fig. 5 and Suppl. Fig. 1). rpoB, rpoC1, and non-coding  (e.g. atpF-atpH, psbK-psbI, and rpl32-trnL) regions have been shown to be variable enough at higher taxonomic levels 46 , but also within cycads 47,48 . In light of this, psbC, clpP (intron), ycf1, and ycf2 should be considered as candidates for future phylogenetic studies in Cycas.

Intraspecific comparison.
Comparing the two individuals of C. debaoensis, we found 10 SNPs and 1 indel, plus one "N" position in our data that didn't allow us to confirm its status (Table 5). Their genomic positions are indicated in Figs 1 and 5. Six and four SNPs were located in IGS and coding regions, respectively, and one indel in the clpC intron. The genetic distance between the two individuals was 0.005553% (i.e. ≈ 1 SNPs/indels per 20 kb). This result is consistent with previous studies showing low within-population diversity in addition to limited gene flow among populations in C. debaoensis 10,11 . Whittall et al. 49 found a comparable level of intraspecific divergence in pines, irrespective of the rarity of the considered species. However, in the pest species Jacobaea vulgaris (Asteraceae), the intraspecific divergence was four times higher 50 , perhaps due to it's short generation times as opposed to those prevailing in slow growing and long-lived trees. Further studies are still needed to determine whether or not intraspecific genetic diversity is linked to geographic ranges or the intrinsic characteristics of the taxonomic group.

Conclusions
Comparing genomic diversity at different taxonomical, but also spatial and temporal scales, we were able to reconstruct a robust phylogeny for cycads, and to identify regions showing promising levels of variation at three levels (familial, generic and intraspecific rank). These regions can provide useful and alternative loci for species identification and population-based studies for conservation, ecology and evolution. Despite their restricted geographic ranges, we showed that several, potentially diagnostic intraspecific variations can be found in the chloroplasts of different individuals C. debaoensis, including 10 SNPs and 1 indel in as of yet unstudied regions.
Comparing results from the three scales, four regions appeared to be variable at the three considered taxonomic scales, namely ycf3, clpP, psbD and the trnL-trnF IGS. Therefore, we recommend future studies in cycads further evaluate these loci in details. We expect that by providing and highlighting these new resources to the plant research community, it will allow for development of new diagnostic markers and innovative conservation strategies in this iconic, but highly threatened taxonomic group, especially in the case of C. debaoensis. The raw data were imported in Geneious R9 (Biomatters Ltd, Auckland, New Zealand), and a cp genome was assembled according to Hinsinger and Strijk 51 . Raw reads were trimmed according to their 5′ and 3′ -end quality, then a reference-guided assembly was performed, using the available cp of Cycas debaoensis (KM459003) as a reference for the mapping step. The cp genome annotation was transferred from C. debaoensis using the implemented function in Geneious R9 and their boundaries were manually checked. The circular cp genome map was generated using the Organellar Genome Draw program (OGDRAW). Intergeneric comparisons and phylogenetic reconstruction. Following the recommended best practices for complete organellar sequencing 52 , we performed a phylogenetic analysis to confirm the accuracy of our reconstructed plastid and sample identification. We retrieved all the Cycadales available in GenBank (accessed 2016/02/15), representing all the ten genera in the order except Microcycas 6 . To this Cycadales dataset, we added the sequences of Gingko biloba (NC016986), Pinus strobus (NC026302), Araucaria heterophylla (NC026450), Taxus mairei (NC020321) and Cupressus sempervirens (NC026296) as outgroups.

Materials and Methods
Sequences were aligned using MAFFT 53 with default options. We used the jModelTest 54 implementation in CIPRES 55 , and set the substitution model accordingly. We built a maximum likelihood tree using PHYML 56 , with a 012310 + G + F model using four gamma categories and 1000 bootstrap replicates. The ML tree was built using all positions. In addition, to identify regions with substantial variability, the complete cp genomes of nine cycad genera were compared using mVISTA 57 , with C. debaoensis (KM459003) as a reference for the annotations.
Sequence divergences among cycads were estimated using the Kimura 2-parameter model 58 , implemented in MEGA6 59 . Codon positions included were 1st + 2nd + 3rd + Noncoding. All positions containing gaps and missing data were excluded prior to analyses.
Interspecific comparisons. The complete cp genomes of C. debaoensis and two other species in cycas (C. revoluta and C. taitungensis) were compared using mVISTA 57 , as described above.
For each species, repeats (forward, palindrome, reverse and complement sequences) were identified using REPuter 60 with 30 bp and sequence identity greater than 90%. Simple sequence repeats (SSRs) of C. debaoensis and the two other species were detected using MISA 61 by setting the minimum number of repeats to 10, 5, 4, 3, 3 and 3 for mono-, di-, tri-, tetra-, penta-and hexa nucleotides, respectively. Sequence divergences among the Cycas species were estimated as described above.
Intraspecific genome comparison. The two complete chloroplasts of C. debaoensis (KU743927, KM459003) were aligned in Geneious R9 (Biomatters Ltd, Auckland, New Zealand) using the MAFFT algorithm 53 , and differences were identified using the "Find Variations/SNPs" function and checked individually. We recorded substitutions and indels separately, as well as their location in the chloroplast genome (e.g. SSRs/repeats, coding region/rRNA/tRNA/IGS). Sequence divergence extent between the two individuals was estimated as described above.