Application of genotyping-by-sequencing data on inferring the phylogeny of Curcuma (Zingiberaceae) from China CURRENT

Background: Genotyping-by-sequencing (GBS), as one of the next generation sequences, has been applied to large scale genotyping in plants, which is poor in morphological differentiation and low in genetic divergence among different species. Curcuma is a significantly medicinal and edible genus. Improvement efforts of phylogenetic relationships and disentangling species are still a challenge due to poor morphology and lack in a reference genome. Result: A high-throughput genomic sequence data which was obtained through GBS protocols was used to investigate the relationships among 8 species with 60 total samples of Curcuma. Through the use of the ipyrad software, 437,061 loci and 997,988 filtered SNPs without reliance upon a reference genome were produced. After quality control (QC) of the filtered SNPs, 1,295 high-quality SNPs were used to clarify the phylogenetic relationships among Curcuma species. Based on these data, a supermatrix approach was used to speculate the phylogeny, and the phylogenetic trees and the relationships were inferred . Conclusions: Varying degrees of support can be explained, as well as the diversification events for Chinese Curcuma. The diversification events showed that the third intense uplift of Qinghai–Tibet Plateau (QTP) and formation of the Hengduan Mountains may speed up Curcuma interspecific divergence in China. The PCA suggested the same topology of the phylogenetic tree. The genetic structure analysis revealed that extensive hybridization may exist in Chinese Curcuma. Additionally, the GBS will be a promising approach for the phylogenetic and systematic study in the future.


Background
Curcuma L. (Zingiberaceae) is a genus of 40 species of aromatic herbs which occur in subtropical and tropical Asia.About 10 Curcuma species are distributed in China, of which 6 species are treated as vital Chinese folk herbal medicine in Traditional Chinese Medicine (TCM), and the extraction of rhizomes exhibits activity of anti-inflammatory, anticancer and HIV-1 protease inhibitory [1].
In recent years, the phylogeny of Curcuma has been given much attention, and the strong recovery support among the major lineages within the genus were topics of research [2][3][4].But its phylogeny and taxonomy also have different controversy and rifts, because of the poor diagnostic characters (aerial part, underground part, floral morphology, and so on) [5].The morphological differences in all species of the genus are neither unique nor universal.The cytological study of Curcuma had proved that extensive polyploidy exists in this genus.The basic chromosome numbers in different populations have different features [6].So far, on account of the influence of polyploidy on the complexity of species, the relationships in Curcuma have been unknown.With the development of molecular biotechnology in the past several decades, the phylogenetic studies of Curcuma have resulted in much new progress [7][8][9].Unfortunately, DNA regions still have some limitations in species identification, particularly sister or closely related species, and they show a low genetic divergence between species [10].However, the plants with a complex genome also show uncertain relationships in species-level and species boundaries [11] Next generation sequencing (NGS) has made great progress in recent years, and high-throughput sequencing stands in the breach [12][13][14].It comprised a number of procedures such as restriction-site associated DNA sequencing (RAD-Seq) and genotyping-by-sequencing (GBS), producing large amounts of genomic sequence data quickly and effectively [15][16].Recently, many systematic biologists have begun to pay attention to how to use the reduced-representation of genomic approaches to solve some problems which are hard to work out in traditional ways.These methods are very useful to phylogenetic studies, which not only produce many genome wide phylogenetic information locis in lack of a reference genome or non-model species, but also simplify the laboratory protocols, saving cost and efficiency, particularly in the species lineages less than 60 million years old [17][18][19].Although large numbers of single-nucleotide polymorphism (SNP) markers that analyzed population structure and genetic diversity were provided by a NGS-based method, the methods also have some drawbacks and uncertainties including short sequence reads (50-100 bp).There might appear some mistakes in sampling error, mutation and complex bioinformatics analysis, making it hard to apply to deeper timescale studies.But nonetheless, NGS has still successfully helped many researchers to understand the phylogeny for some representative family members of plants [20][21][22][23].
Genotyping-by-sequencing (GBS) approach obtained numerous sequence tags from individual species and the tags could be used for polymorphism detection of multispecies coalescent-based tree inference methods and polymorphism detection without a reference genome sequence.Thus, it is almost fit for any organism [24].The data based on GBS, using a multispecies coalescent-based approach have been shown to be highly useful for some plant studies taking gene history variation into account [25][26].
In this study, the main objectives of this study were (1) to test GBS performance for inferring the phylogeny of Chinese Curcuma; (2) to evaluate phylogenetic data analysis the lineages clade; (3) to obtain a resolved phylogeny with traditional maximum likelihood (ML), maximum parsimony (MP) and Bayesian methods; (4) to evaluate the divergence time and population structure of several Chinese Curcuma.The work will become a plowing ahead in this genus, and determine if the data base on GBS can help the phylogeny of Curcuma.

Sequences discovery and characterization
From the project, the sequences of 8 species a total of 61 samples (Table 1) returned 53.15 Gb, with an average of 892.2 Mb per sample.In order to obtain reliable data (Q20≥93.74%,Q30≥85.10%, some low quality data were left out.The NCBI BioProject ID: PRJNA557061can have access to Raw Reads. Using the denove clustering method in ipyRAD, 437,061 unique GBS locis across all 60 samples were revealed.After quality control (QC) of the filtered SNPs by Plink, filtering SNPs reduced the total to 1,295 base pairs.

Maximum parsimony inference
The maximum parsimony method was used to analyze the datasets.Fig. 1 was bootstrap 50% majority-rule consensus tree.The bootstrap values were marked on the top of branch in each species group.
The same species from different areas were clustered together, and species group had the highest bootstrap support (8 out of 13 clades with 100%, 6 out of 13 clades with 80-99%, and none<50%).(BS = 100).Compared to the previous study, these analyses got different clades [3].

Maximum likelihood inference
The maximum likelihood method got the same clades for MP tree.The ML analyses had a higher bootstrap support (8 out of 13 clades with 100% and none<90%) (Fig. 1

Bayesian inference
The Bayesian tree (Fig. 2) showed the same cluster and clades to MP tree and ML tree.The Bayesian analysis had a higher support than MP tree (12 out of 15 clades with 100%, 3 out of 15 clades with 85-99%, and none<85%).The sister clades which differentiated to the species group C. wenyujin and C. aromatica had a principle of maximum differentiation support in MP (BS = 99), ML (BS = 75) and Bayesian (PP = 85) tree.

PCA inference
The PCA based on high-quality SNPs could reflect the clustering and relationship among different individuals in a visual graphic from 60 Curcuma sample.Obviously, the same species had a better cluster.The genetic background could be reflected by the distance of species.In Fig. 3

Population structure inference
On the strength of the number of different assumed ancestral kinships (K),, the composition of ancestor species was shown in a visual graphic by the Bayesian algorithm.The structure was run K from 3-12 for 20 times (Fig. S1).With a list as a sample, the proportion of different color in samples expressed different ancestors.The population structure analysis was similar to that of the phylogenetic tree and PCA.All species had the same individual colors (Fig. 4).When K = 3, the results suggested that the Chinese Curcuma samples branches could be classified into three groups

Divergence time inference
Based on the GBS dataset in the chronophylogenetic analyses (Fig. 2), the major divergence times and events of Chinese Curcuma speciesoccurred in the Miocene (~7.45 Mya).The C. longa was an independent branch, splitting at ~6. 43

Phylogeny and biological implication
In this study, the phylogeny and evolution of Chinese Curcuma were explained by using GBS data firstly, which was never applied to Zingiberaceae.The phylogenetic relationships of eight main Chinese Curcuma species are reasonable.Unfortunately, our samples are only a subset of the species in Curcuma.It is always impossible to compare with previous results directly, because of the limited and not matched taxon samples.However, our study supports several hypotheses about the origin of Curcuma [4].Based on GBS data, we inferred the earliest appearance of Curcuma in China during the Miocene, a period when large-scale orogenesis and other geological events had an influence on the development and speciation of the flora in this area [27][28].The interspecific divergence events of Chinese Curcuma occurred in the late Miocene.As the third intense uplift of Qinghai-Tibet Plateau (QTP) and formation of the Hengduan Mountains began in the Pliocence [29][30], it changed the geographical environment and climate in Asia [31][32].These events may speed up Curcuma interspecific divergence, when it arrived at China.Hybridization and polyploidy are ubiquitous in plants and play an important role in speciation and diversification [33][34].The cytomixis and chromosome doubling can promote the production of a large number of new phenotypes in a short time, and polyploidy tend to be more adaptive than parents [35][36][37].Hybrids and polyploidization are found in the regions where there are species overlap and changes in the living environment.In the study, Chinese Curcuma may have a hybrid origin.According to previous studies about molecular data, Curcuma was likely to hybridize among distantly related species [4], and our results seem to support this opinion.Because our samples are only a subset of the species in Curcuma, it is difficult to discuss their origin further.In addition, there was a complex history of Curcuma.

Phylogeny and biological implication of the phylogeny of Curcuma
Many researches have shown that the "next-generation" as a new sequence approach and inference method have started to conquer some disputed problems in the relationship of closely phylogenetic lineages and estimate the recent divergence times and events [38,20].This study was the first to use GBS to analyze the phylogeny of Curcuma, which produced level of genomic sequence data as massive as possible.Based on the sequence data, it aims to generate high-quality SNPs to infer the evolution and phylogenetic relationships of the radiation of Chinese Curcuma.Over 400,000 locis and 997,988 filtered SNPs were produced from 8 species 60 samples and 1 outgroup.The high-quality SNPs "corverage" among all samples were obtained with the method of quality control (QC) of the filtered SNPs.Compared to many previous studies of this group, the quality of our dataset had a significant increase.The concatenated supermatrix date from QC of the filtered SNPs returned a strong support for the 8 species relationships of Chinese Curcuma.From fabricated different phylogenetic tree, the trees showed fairly consistent topology and recovered a similar clades with high support.A previous study demonstrated that several mistakes will appear in inference by using a single phylogenetic tree method [39].Especially, in the multispecies coalescent supermatrix data and model, there might exist inappropriate conclusions [40].Some studies noted that the approach of obtaining numerous sequence data is relatively ideal for establishing and reconstructing the phylogeny [41][42].Researches in molecular evolutionary biology often lack in locis and variable sites for inference.Therefore, more locis and variable sites may offer an opportunity for studies of complex groups mechanisms and history indeed.Due to the short age of locis about 100 base, a bias needs to overcome [43][44].
With the development of sequencing technology, the deficiency was apt to be solved and recovered by more accurate and longer data locis from more population samples.The new sequencing technology will prove more accurate and better phylogeny [13].For example, the Paired-end Illumina sequencing technology becomes more available and reliable, and it obtains longer sequences and locis from GBS methods [45].It is also significant to improve our comprehension about the context of variation when the sample genetically differentiated in slight and it had a large geographical distribution.Furthermore, the improved model studies have increasingly shown that a small number of samples in continuous geographic ranges can be misleading evidence that a population does not exist [46].The species phylogenetic tree inference also suggested that a better and optimal group size can be estimated from more samples [47].In this study, 7 out of 8 of the species had multiple accessions, and greater sample in species overcame some problems by sequence errors.

Conclusions
According to the study, the next step of the job with Chinese Curcuma is to collect more species samples from different geographical distribution, especially Curcuma kwangsiensis, and from subgenus Curcuma and other subgenus of Curcuma in the world [2].Besides, the use of nextgeneration sequencing methods such as ddGBS-seq and ddRad-seq can produce longer reads.The transcriptome sequencing methods have succeeded in recovering the phylogeny in recently radiated genus and mining for phylogentic study [48].
While the step to use new sequencing techniques is expected to improve the reliability of phylogenetic tree analysis and inference methods, these methods need to combine with simulated and authentic data to determine the best method.The Curcuma has a complex evolutionary history.
In order to make sure the reliability of its relationship and origin, besides depending on advances in sequencing technology, more taxa will play a crucial role.

Material and sequencing
The total 60 specimens which belong to 8 Curcuma species were used in this study, and Hedychium coronarium was used as outgroup (Table 1).All species collecting information were obtained in Table 1 and identified by Prof. Ruiwu Yang of Sichuan Agricultural University.All materials used in this study were stored at Sichuan Agricultural University.The genomic DNA isolation was carried out on fresh leaves by CTAB method.The GBS libraries for each sample were prepared using the protocol outline as previously described [24,49].The PstI (CTGCAG) as a restriction enzyme was used to digest all DNA samples.The library produced the DNA fragments that were ligated by a barcode adaptor and the sticky ends were corrected by a common adaptor.After then, a Qiagen MinElute 96-well PCR purification kit was used in the clean-up step to clean up the products.After PCR, the PicoGreen and a qPCR machine were used to examine the quality about the PCR products.All samples were pooled into a single GBS library.The genome research and biocomputing about the GBS library were carried out on Illumina Hiseq PE150 sequencer on Novogene Bioinformatics Technology

Clustering
The raw reads were performed using the ipyRAD 0.7.29 pipeline [50] from the Illumina FASTQ files.
Following seven sequential steps, the ipyRAD pipeline can obtain species or higher variation across clades in clustering and alignment method based on specific parameters in the ipyRAD documentation (https://github.com/dereneaton/ipyrad).This is unlike other pipelines [51].The RAD parameter settings were as follows: Sequences were clustered within samples by 90% similarity via the uclust function in USEARCH [52].Clusters of less than 10 sequences were discarded and the minimum number of individuals per cluster was set to 5. Any locus that was heterozygous among more than two samples was discarded.The rest of clusters were processed as loci.The phylogenetic matrix was assembled with the loci.

Concatenation-based species tree inference
The dataset used to analyze the phylogeny tree was the high-quality SNPs via quality control of the filtered SNPs QC by PLINK.The maximum likelihood (ML) and maximum parsimony (MP) method with booststrap in the software PAUP* 4.0a134 [53][54] were used to generate the phylogenetic tree.The dataset were implemented in RAxML8.2.8 [55], substitution models GTR for datasets used the Akaike Information Criterion (AIC) in MrModeltest [56] after calculation in PAUP* 4.0a134 [54].The MrBayes v.3.1.2was used to analyze the dataset based on Bayes theorem [47].

Phylogenetic structure and divergence time inference
Principal component analysis (PCA) is a purely mathematical method that reflects the clustering between groups and is based on the degree of SNPs in difference individual by EIGENSOFT [57].The 1,295 high-quality SNPs from 60 Curcuma sample were obtained by the computer software ADMIXTURE [58].In this study, 60 Curcuma samples were analyzed by STRUCTURE.The kinships (K) value was from 2-12 by ADMIXTURE.BEAST1.7.5 (Bayesian evolutionary analysis by sampling trees) [47] is an efficient and flexible computer software package based on the Bayesian method for molecular sequence mutation evolution analysis.It was the first software package using the relaxed molecular clock to estimate divergence time in the phylogenetic tree [59].On the basis of the highquality SNPs, the BEAST was used to estimate Chinese Curcuma species divergence time, and the Bayesian tree was dated by setting divergence time between Hedychium coronarium and Curcuma as Tables Table 1 Sample of species names, with source details and total loci after processing with ipyRAD, of the 61 samples used in phylogenetic analyses.

Supplementary Files
This is a list of supplementary files associated with this preprint.Click to download. FigS1.pdf The group C. longa formed a monophyletic branch (BS = 100); the group C. wenyujin (BS = 100) was sister to C. aromatica (BS = 100); the group C. amarissimaBS = 100) was sister to C. sichuanensis (BS = 100); the group C. kwangsiensis first clustered with C. yunnanensis (BS = 97)and C. phaeocaulis , different species had their own individual locations which were not overlapped.The C. longa was far from other species.The C. wenyujin was close to C. aromatica.The C. sichuannensis and C. amarissima formed a closely related cluster.The group of C. phaeocaulis, C. yunnanensis and C. kwangsiensis had a close cluster.The PCA suggested the same topology of the phylogenetic tree (MP, ML and Bayesian).
. The C. longa had an individual gene pool (blue).Most gene pool (green) comprised the C. amarissima, C. sichuanensis, C. kwangsiensis, C. yunnanensis and C. phaeocaulis.The C. wenyujin and C. aromatica were composed of most gene pool (red).With the number of ancestor increasing, more species groups were distinguished clearly.When K = 3-12, the C. longa could be distinguished by constitution of gene pool, and the results could be the origin of ancestors, which were different from other groups.The C. amarissima, C. sichuanensis, C. kwangsiensis, C. yunnanensis and C. phaeocaulis had a close relationship, and they could be distinguished in K = 5.The C. kwangsiensis and C. aromatica consisted of two or more gene pools, and most likely hybrid origin.
Mya, and intraspecific diversification was diverged at ~2.45 Mya during the late Pliocence and Quaternary.Another branch consisted of C. amarissima, C. sichuanensis, C. kwangsiensis, C. yunnanensis and C. phaeocaulis.The clusters A (C. amarissima and C. sichuanensis)andB (C.kwangsiensis, C. yunnanensis and C. phaeocaulis) were estimated to be ~4.83Mya.The divergence of the A lineages started ~3.57Mya in the Pliocence, and then intraspecific diversification was estimated around ~1.89 Mya for C. sichuanensis and ~0.67 Mya for C. amarissima.The B lineages were diverged at ~3.74 Mya, the C. phaeocaulis split at ~1.45 Mya and the C. yunnanensis split at ~0.81 Mya.The branchwhich consisted of C. wenyujin and C. aromatica was diverged at ~6.08 Mya during the Miocene and Pliocence; subsequent diversification events ~1.17 Mya diverged the species C. wenyujin and ~0.56 Mya diverged the species C. aromatica.

Figures
Figures

Figure 1 Phylogeny
Figure 1 Phylogeny tree of Chinese Curcuma resulting from maximum likelihood and maximum parsimony (ML bootstrap values/MP bootstrap values ).The phylogeny tree shows all 61 samples this study, different species with different colors.The bootstrap values (maximum likelihood bootstrap values/maximum parsimony bootstrap values) in every species clade.

Figure 2 Phylogeny
Figure 2 Phylogeny tree resulting from the Bayesian inference and divergency time of species clade in BEAST (Divergence time/Bayesian support ).The A lineages was consisted of C. amarissima, C. sichuanensis, C. kwangsiensis, C. yunnanensis and C. phaeocaulis.And B was made up (C.kwangsiensis, C. yunnanensis and C. phaeocaulis.Different geologic age with different color.

Figure 3 The
Figure 3The principle components analysis of Chinese Curcuma.Principal component analysis (PCA) of the 60 samples.

Figure 4 Population
Figure 4 Population genetic structure of Chinese Curcuma.When the kinship (K) =3, 5, 7 and 9, each vertical bar represents a Curcuma sample and different color represents different ancestral background in putative.