Identification of species and materia medica within Saussurea subg. Amphilaena based on DNA barcodes

Saussurea is one of the most species-rich genera in the family Asteraceae, where some have a complex evolutionary history, including radiation and convergent evolution, and the identification of these species is notoriously difficult. This genus contains many plants with medical uses, and thus an objective identification method is urgently needed. Saussurea subg. Amphilaena is one of the four subgenera of Saussurea and it is particularly rich in medical resources, where 15/39 species are used in medicine. To test the application of DNA barcodes in this subgenus, five candidates were sequenced and analyzed using 131 individuals representing 15 medical plants and four additional species from this subgenus. Our results suggested that internal transcribed spacer (ITS) + rbcL or ITS + rbcL + psbA-trnH could distinguish all of the species, while the ITS alone could identify all of the 15 medical plants. However, the species identification rates based on plastid barcodes were low, i.e., 0% to 36% when analyzed individually, and 63% when all four loci were combined. Thus, we recommend using ITS + rbcL as the DNA barcode for S. subg. Amphilaena or the ITS alone for medical plants. Possible taxonomic problems and substitutes for medicinal plant materials are also discussed.


INTRODUCTION
Saussurea is one of the most species-rich genera in Asteraceae and the taxonomic identification of these species is notoriously difficult (Lipschitz, 1979). Recent radiation, widespread hybridization, and convergent evolution have combined to make the delimitation of these species extremely complicated (Wang et al., 2009). Among the 289 recognized species in the ''Flora of China'' (FOC), many are very challenging to differentiate, with one or several morphologically similar species (Shi & Raab-Straube, 2011). For example, about nine current widely accepted species are suspected to be conspecific with S. taraxacifolia (Chen, 2015). Since the publication of FOC, the newly described species have totaled more than 60 species (Chen, 2015;Wang et al., 2014;Xu, Hao & Xia, 2014;Chen & Wang, 2018), with an average of 10 species every year, which is a far higher number than that of other genera. These new species have mostly been separated from the known species and at least 10 of them bear the prefix ''pseudo'' to indicate their similarity in terms of morphology (Chen, 2014;Chen & Yuan, 2015;Wang et al., 2014).
This taxonomic problem particularly affects S. subg. Amphilaena, which is one of the four subgenera of Saussurea, where these species are defined mainly based on the self-transparent and colorful bract that subtends the synflorescence (Fig. 1) (Lipschitz, 1979;Raab-Straube, 2017). This character is a well-known adaptation to high altitudes and it occurs in a number of angiosperm genera from different families (Omori, Takayama & Fls, 2000). Within S. subg. Amphilaena, it has also been documented that this character was derived multiple times and some of the species showing very high similarity, such as S. involucrata and S. obvolata, are actually distantly related according to molecular phylogeny (Wang et al., 2009). In addition, this subgenus is considered to be a result of a recent radiation in the Qinghai-Tibet Plateau where 35 of the total number of 38 species have been recorded (Raab-Straube, 2017). This type of process usually produces many closely related species where one species might resemble several other species, thereby yielding a number of complexes (Simões et al., 2016).
Complex taxonomy undoubtedly causes problems with identification, and among the 38 species recognized in the latest monograph, at least 13 species are widely misidentified. For example, S. orgaadayi was long misidentified as S. involucrata (Smirnov, 2004), although both species were described many years ago and the latter is one of the most famous plants in China because of its beauty and usage in traditional Chinese medicine (Chik et al., 2015). In addition, eight species within the S. obvallata complex have been recognized as single species since the establishment of S. obvallata (Raab-Straube, 2017).
Evidently, misidentification can lead to a misunderstanding of biodiversity. In some cases, these errors can even be deadly harmful for humans given that many Saussurea species are used in medicine (Chik et al., 2015;Li, Zhu & Cai, 2000;Yang et al., 2005). In addition to S. involucrata, 14 other species have been formally recorded as medically useful in S. subg. Amphilaena (Table 1) (Cao et al., 2016;Chen, Pei & Zhao, 2010;Jiang, Luo & Xu, 2010;Li, 1999). However, the authentication of species is time-consuming and it requires a specialist taxonomist in most cases. Moreover, some species are found only in areas that are difficult to access, possibly because of their excessive consumption. For example, S. involucrata is currently listed as second-class protected plants due to over-exploitation (Fu & Jin, 1992), while S. wettsteiniana and S. velutina are both endemic to a few mountains in Sichuan, China, and they are difficult to obtain due to their restricted distributions (Shi & Raab-Straube, 2011). Thus, possible substitutes for these species are urgently needed to be ascertained.
DNA barcoding is a rapid and reliable technique for identifying species based on variations in the sequence of short standard DNA regions. Phylogenetic studies based on these fragments can also help to identify substitute plants. However, the selection of the fragments used for DNA barcoding is a controversial problem. The Plant Working Group of the Consortium for the Barcode of Life (CBOL) proposed using a combination of rbcL and mat K as a ''core barcode'' for identifying land plants ). Subsequently, trnH-psbA and the nuclear ribosomal internal transcribed spacer (ITS) were proposed as supplementary barcodes for land plants (Kress et al., 2005;Li et al., 2011). In addition, trnK was found to outperform mat K in some studies (Cao et al., 2010;Müller & Borsch, 2005). Previously, the sequences used in DNA barcodes for Saussurea species have been rather limited and only five species have been reported with DNA sequences. Among these species, none have been reported more than two populations, which is obviously insufficient for DNA barcode studies (Wang et al., 2009). Thus, in this study, we performed extensive investigations in the field, and we sequenced five DNA barcode candidates in chloroplasts (mat K, trnH-psbA, trnK, and rbcL) and the nuclear ITS. Our main aims were: (i) to evaluate the application of these DNA barcodes in S. subg. Amphilaena; (ii) to develop an objective method for identifying medically important Saussurea species; and (iii) to explore the possible taxonomic problems and potential substitutes for some rare herbs.

Taxon sampling
In total, 20 species were sampled in the present study, including 18 from the 38 species recognized in the latest monograph on S. subg. Amphilaena (Raab-Straube, 2017), one recently published species, S. bogedaensis (Chen & Wang, 2018), and a Jurinea species, which was selected as an outgroup according to a previous study (Wang et al., 2009). Photos of some species are presented in Fig. 1. Our sample focus on medical resources and 15 species formally recorded in the medical literature were included in the analyses (Table 1). For most of the species in the ingroup, we collected from two or more populations, with more than three individuals from each population. In total, we collected 132 individuals and their details are listed in Table 2.

Data analysis
We constructed 31 datasets for ITS, psbA-trn H, mat K, and trnK, either individually or in different combinations. For the combination of ITS and each chloroplast loci, incongruence length difference (ILD) was preferred to test the incongruence (Farris et al., 1995) using PAUP version 4b10 (Swofford, 2003). For each dataset, the inter-and intraspecific genetic divergences were calculated as described by Meyer & Paulay (2005) and used to determine whether a barcoding gap was present. For each dataset, best close match (BCM) and two tree-based methods comprising neighbor-joining (NJ) and Bayesian inference (BI) were employed to analyze the five single markers and their different combinations. BCM analysis was conducted using the SPIDER package in R (Brown et al., 2012). NJ trees were constructed using PAUP with the Kimura two-parameter model (Swofford, 2003). Support for nodes was assessed based on 100,000 bootstrap replicates. BI analysis was implemented using MrBayes on XSEDE (v3.2.6) (Ronquist et al., 2012) and the optimal models for each marker were determined according to Akaike's information criterion with jModelTest2 in XSEDE (v2.1.6) (Darriba et al., 2012). Species were considered to be identified successfully if individual samples of a species clustered in species-specific monophyletic clades.

RESULTS
The PCR amplification ranged from about 73% (trnK) to 93% (ITS), while sequencing success rates from about 95% for the three chloroplast loci to 100% for the ITS, as shown in Table 4. The length after alignment, the variable sites, the interspecific or intraspecific genetic distance for each locus as well as the p values of ILD test between ITS and each chloroplast locus are also listed in Table 4. The mean intraspecific genetic distances for each species based on ITS and the four cp markers combined are listed in Table 5, and those for the mean interspecific genetic distances are shown in Table 6. The distributions of the intraspecific and interspecific distances for each species based on the five separate  (131) 19 (131) 19 (131) 19 (131) 19 (131) Interspecific distance mean ( Fig. 2. In general, the mean interspecific distances were higher than the intraspecific distances for the five markers. However, the ranges of the intra-and interspecific distances overlapped for all the barcodes tested in this study. The discriminatory powers of all the loci both individually and in different combinations based on the three methods are listed in Table 7 (Figs. S1-S59). In general, BCM achieved higher success rates, followed by NJ and BI, but there were a few exceptions. Among the results obtained with a single barcode, ITS (84.2-93.2%) had the highest species Table 6 The pairwise distances (%) of ITS (lower left) and the combined chloroplast loci (upper right) from 19 species of Saussurea. (1) S. bogedaensis, (2) S. bracteata, (3) S. erubescens, (4) S. globosa, (5) S. involucrate, (6) S. iodostegia, (7) S. luae, (8) S. nigrescens, (9) S. glandulosissima, (10) S. orgaadayi, (11) S. phaeantha, (12) S. polycolea, (13) S. pubifolia, (14) S. sikkimensis, (15) S. tangutica, (16) S. uniflora, (17)  discriminatory power, followed by trnK (15.8-36%), mat K (10.5-16.8%), and trnH-psbA (5.2-27%). Among the combinations of two barcodes, ITS + rbcL had the highest discriminatory success (89.5-100%), whereas that of mat K and rbcL, which was suggested as the core barcode by CBOL (CBOL Plant Working Group, 2009), was only 10.5-25.6%. The three-region combination of ITS + rbcL + trnH-psbA recovered the highest number of monophyletic species (18) in the NJ tree (94.7%). Only five species were successfully discriminated (26.3%) by either the NJ or BI trees using the combination of all four cp markers, i.e., mat K + rbcL + trnH-psbA + trnK.

Proposed DNA barcodes for S. subg. Amphilaena
Among the fragments tested in the present study, ITS obtained a much higher success rate compared with the other loci. In addition, all of the combinations without ITS yielded much lower success rates, regardless of the method used (Table 7). Moreover, the rate of successful PCR (92.7%) was more or less higher for ITS than the other fragments (72.9-91.6%). It has also been reported that this fragment is highly efficient in other Asteraceae genera (Gao et al., 2010;Gong et al., 2016). However, an intrinsic problem with this fragment is that an individual may have undergone recent hybridization, thereby resulting in multiple mosaic sites (Li et al., 2011). In S. subg. Amphilaena, two species failed to form monophyletic clades in the BI and NJ trees, which could be attributed to the presence of multiple mosaic sites (Fig. 3). However, ITS performed better than the other fragments in S. subg. Amphilaena, and thus we propose that this fragment should be the first or best choice when selecting only one of the current candidates. We found that it was difficult to identify the best second choice after ITS. TrnK performed much better than rbcL in terms of its efficiency when used individually, but its combination with ITS obtained contradictory results, i.e., ITS + trnK was inferior to ITS + rbcL in terms of efficiency. This contradictory result was unexpected and it is not common in other taxa (Cao et al., 2010;Müller & Borsch, 2005). We attributed this result to higher degree of congruence of the concatenated sequences of rbcL and ITS (P = 0.12 for ILD test), in compare to trnK and ITS (P = 0.001). But it might derive from some other mechanisms, such as the higher rate of mutation for trnK that could have caused differentiation within species, but not high enough to form distinct genetic differentiation among species, and thus a failure to cluster as a monophyletic group in line with species (Naciri, Caetano & Salamin, 2012;Petit & Excoffier, 2009). Therefore, we suggest that using trnK alone is problematic and instead we propose to use rbcL as complementary to ITS because this  (Fig. 4). The two loci comprising trnH -psbA and mat K were affected by the same problem as trnK, with higher mutation rates and barcode efficiencies compared with rbcL when used individually, but lower efficiency when combined with ITS. Thus, their combination with ITS + rbcL failed to significantly increase the success rate and lower results were even obtained in some cases (Table 7). However, among the combinations without ITS, the combination with higher mutation rates was more efficient than those with lower mutation rates, e.g., trnK + trnH-psbA was better than mat K + rbcL, which was proposed previously as the core DNA barcode for plants . Therefore, if ITS is subjected to hybridization, we propose that the priority order should be the following: trnK > trnH-psbA > mat K > rbcL. Moreover, the combination with more loci performed better than that with less loci. However, even the combination of all four loci was not sufficient to discriminate each species and new fragments should be considered.

Insights into taxonomic problems based on DNA barcodes
Most of the analyses failed to identify the species within two groups, i.e., S. luae vs. S. publifolia and S. globosa vs. S. erubescens (Figs. 3-5; Table 7). We found that these failures might have been attributable to taxonomic problems. For the first group, we found that S. luae was rather heterogeneous in terms of the ITS sequences. Some cp sequences were slightly differentiated compared with S. velutina, but the others were closer to those in S. glandulosissima or S. uniflora (Fig. 5). By contrast, the ITS sequences lacked variance and after excluding the mosaic sites, they were closely related in S. pubifolia or S. bracteata (Fig. 3). These nuclear-cytoplasmic inconsistencies suggest that hybridization may have occurred among these species. The second group comprising S. globosa and S. erubescens was often confused in previous studies because the latter resembles a smaller form of S. globosa, which has various forms across its distribution (Raab-Straube, 2017). In agreement with the morphology, the genetic distance between the cp sequences within S. erubescens was zero whereas that within S. globosa was 0.04% (Table 5), which is even larger than that between S. erubescens and S. globosa (Table 6). The ITS sequences had a very similar pattern and the rich mosaic sites in both species also indicated differentiation accompanying substantial gene flow (Naciri, Caetano & Salamin, 2012). Both the BI and NJ methods found that S. globosa formed a clade within which S. erubescens nested as a monophyletic clade (Fig. 3). Based on these results, we propose that S. globosa might be a species with a series of differentiated populations where S. erubescens represents one of the most obvious. The current delimitation might need revision on the basis of extensive morphological as well as genetic diversity across the distribution range of both species.

Identification of the medicinal species and the potential substitutes
All of the known medically important species could be identified using our proposed DNA barcodes, i.e., ITS + rbcL or ITS alone (Table 7;  . Moreover, some species such as S. bogedaensis, S. glandulosissima, S. polycolea, S. wettsteiniana, and S. orgaadayi could be identified with the cp DNA barcodes (Fig. 5). This high rate of success was unexpected because some species such as the two species in the S. obvallata complex (S. glandulosissima and S. sikkimensis) have been morphologically confused for many years and they were only separated very recently (Raab-Straube, 2017). Their distinction is indicative of difference in bioactive components. Therefore, our results caution against their indiscriminating usage in medicine.
Barcode sequences can also help to identify substitutes for medically useful species because closely related species might possibly share the same or similar secondary metabolites and bioactivities (Zhou et al., 2014). Thus, we propose that nine of the 15 medically useful species might be substituted by their close relatives according to the molecular phylogenetic context. Six of these species, which formed three groups, are also morphologically similar, i.e., S. involucrata and S. orgaadayi or S. bogedaensis, S. globosa and S. erubescens, and S. wettsteiniana and S. glandulosissima (Fig. 3) (Raab-Straube, 2017). Among the remaining three species, S. bracteata appears to be closely related to S. pubifolia whereas S. iodostegia and S. nigrescens are closely related to each other according to phylogenetic tree (Fig. 3). These affinities were not expected according to their morphology, but they are possibly due to convergent evolution or radiation in Saussurea (Wang et al., 2009). Secondary metabolomes or bioactivities are wanted to confirm their similarity.

CONCLUSION
Based on the sequence statistics, inter-and intraspecific distances, SPIDER, and phylogenetic analyses, it is concluded that internal transcribed spacer (ITS) + rbcL or ITS + rbcL + psbA-trnH could distinguish all of the species, while the ITS alone could identify all of the 15 medical plants. However, the species identification rates based on plastid barcodes were low, i.e., 0% to 36% when analyzed individually, and 63% when all four loci were combined. Thus, we recommend using ITS + rbcL as the DNA barcode for S. subg. Amphilaena or the ITS alone for medical plants.