Identification of the Plant Family Caryophyllaceae in Korea Using DNA Barcoding

Caryophyllaceae is a large angiosperm family; many of its species are used as ornamental or medicinal plants in Korea, and several endangered species are managed by the government. In this study, we used DNA barcoding for the accurate identification of Korean Caryophyllaceae. A total of 78 taxa (n = 215) were sequenced for three chloroplast regions (rbcL, matK, and psbA–trnH) and the nuclear ribosomal internal transcribed spacer (ITS). In the neighbor-joining tree, identification accuracy was generally higher when using ITS (>73%) than when using only chloroplast regions (<62%). The highest resolution was found for rbcL + ITS (77.6%), although resolution varied according to the genus. Among the genera that included two or more species, five genera (Eremogone, Minuartia, Pseudostellaria, Sagina, and Stellaria) were successfully identified. However, the species of five other genera (Cerastium, Gypsophila, Dianthus, Silene, and Spergularia) showed relatively low resolutions (0–61.1%). In the cases of Cerastium, Dianthus, and Silene, ambiguous taxonomic relationships among unidentified species may have contributed to such low resolutions. In contrast, Gypsophila and Spergularia have been identified well in previous studies. Our findings indicate the need for taxonomic reconsideration in Korea.


Evaluation of DNA Barcodes
The mean interspecific pairwise distance was highest in psbA-trnH (0.3245), followed by ITS (0.2259), matK (0.1140), and rbcL (0.0291) (Table 2). The mean intraspecific genetic distance was also highest in psbA-trnH (0.0078), followed by ITS (0.0037), matK (0.0019), and rbcL (0.0004) (Table 2). Mean interspecific distances were thus 42-73 times larger than mean intraspecific distances. Among the combined regions, psbA-trnH + ITS showed both the highest mean interspecific pairwise distance (0.2502) and the highest mean intraspecific distance (0.0048) (Table 2). However, intraspecific and interspecific pairwise distances overlapped for all regions (Figure 2); as a result, there was no distinct barcode gap.
The ability to identify species using the DNA barcode regions was also evaluated by phylogenetic analysis, specifically the Neighbor-Joining (NJ) method (Figures 3-5). Generally, a higher degree of species resolution was observed in cases using ITS (>73%) than in those using only chloroplast regions (<62%) (Table 3), and the best resolution was shown by ITS + rbcL (77.63%) (Figure 5). The best close match analysis showed results similar to those of the phylogenetic analysis (Table 3). Overall, a higher success rate was observed for combined DNA barcode regions than for a single region. In particular, cases using ITS (>69%) showed a higher degree of species identification ("correct" in Table 3) than those using only chloroplast regions (<68%). In contrast to the phylogenetic analysis, the combination of all four regions and ITS + matK + psbA-trnH showed the best resolution (76.74%), while a lower resolution was observed for ITS + rbcL (70.69%). As with the NJ and best close match analyses, better species partitions were observed in the ASAP analysis when using combined regions than when using a single region.
Under the best asap-score, the largest number of species partitions was obtained when using all regions (number of subsets, 40; asap-score, 8.00). In addition, ASAP exhibited better resolution in cases using ITS than in those using only chloroplast regions (Figures S1-S14).
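The ratio of interspecific to intraspecific distance reported for the single regions can be checked directly from the mean values quoted in the text. The following is a minimal sketch that uses only those four pairs of means from Table 2; the variable and function names are illustrative and not part of the original analysis.

```python
# Mean K2P distances quoted in the text (Table 2):
# region -> (mean interspecific, mean intraspecific)
mean_distances = {
    "psbA-trnH": (0.3245, 0.0078),
    "ITS": (0.2259, 0.0037),
    "matK": (0.1140, 0.0019),
    "rbcL": (0.0291, 0.0004),
}


def inter_intra_ratio(inter: float, intra: float) -> float:
    """Ratio of mean interspecific to mean intraspecific distance."""
    return inter / intra


ratios = {
    region: inter_intra_ratio(inter, intra)
    for region, (inter, intra) in mean_distances.items()
}

for region, ratio in ratios.items():
    print(f"{region}: interspecific mean is {ratio:.1f} times the intraspecific mean")
```

The smallest and largest ratios round to 42 and 73, matching the range given in the text. A large ratio of means does not by itself imply a barcode gap, because the gap depends on whether the two distance distributions overlap, which is what Figure 2 shows.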
Based on the best-resolution tree (ITS + rbcL) (Figure 5), most genera were identified on all NJ trees, except for Minuartia and Silene. The ability for species identification varied according to the genus. Among the genera that included two or more taxa, the species of five genera (Eremogone Fenzl, Minuartia, Pseudostellaria Pax, Sagina L., and Stellaria) each formed a clade, indicating successful identification (Figure 5). Only one individual each of Eremogone capillaris (Poir.) Fenzl and Minuartia arctica (Steven ex Ser.) Graebn. was analyzed on the NJ tree; these species were considered distinguishable owing to sufficient branch divergence from related species. In contrast, species of five genera (Cerastium, Gypsophila, Dianthus, Silene, and Spergularia (Pers.) J. Presl & C. Presl) were only moderately distinguished or not distinguished at all (0-61.11%). Two species of Gypsophila (Gypsophila oldhamiana and Gypsophila pacifica Kom.) showed no separation from each other (0%), whilst Cerastium and Spergularia also showed low resolution (33.33%). Cerastium glomeratum Thuill. was separated from the two subspecies of Cerastium fontanum Baumg. (subsp. hallaisanense (Nakai) J.S. Kim and subsp. vulgare (Hartm.) Greuter & Burdet); however, these subspecies were not distinguishable from each other. Similarly, in Spergularia, only Spergularia rubra (L.) J. Presl & C. Presl could be distinguished from Spergularia bocconei (Scheele) Asch. & Graebn. and Spergularia marina (L.) Griseb. The DNA barcodes showed better discriminatory power for Dianthus (60.00%) and Silene (61.11%). In Dianthus, Dianthus armeria L., Dianthus barbatus L. var. asiaticus Nakai, and Dianthus japonicus Thunb. each clustered into its own clade. However, Dianthus chinensis var. serpens Y. N. Lee, Dianthus chinensis var. morii (Nakai) Y. C. Chu, Dianthus superbus L. var. superbus, and Dianthus superbus var. speciosus Rchb. were each sequenced from only one individual, which diverged on a branch separate from other taxa.
In addition, the majority of Dianthus species showed multiple sequence types in ITS; therefore, we concluded that these four taxa may not be distinguishable. In Silene, Silene capitata Kom., Silene firma, Silene seoulensis Nakai, Silene baccifera (L.) Roth, Silene gallica L., Silene repens Patrin ex Pers., Silene antirrhina L., Silene koreana Kom., and Silene takesimensis Uyeki & Sakata each clustered into its own clade, consistent with morphological identification. Only one individual was sequenced in the case of Silene conoidea L., although it was clearly separated from the others. Although Silene aprica Turcz. ex Fisch. & C. A. Mey. was distinguishable at the species level, it was more difficult to distinguish at the intraspecific level.

Discussion
In this study, we examined four DNA barcoding regions in Korean Caryophyllaceae species. Ideal DNA barcodes allow for easy amplification and have sufficient variable sites to identify species [22]. Across taxa, all regions were well amplified with universal primers, and sequences were obtained successfully. These results met the criterion of easy amplification. However, when considering discriminatory ability, combinations including ITS were deemed optimal as DNA barcodes, with resolutions of 71.43 to 75.32% in the phylogenetic analyses (Table 3). The best close match analysis supported this result, although there were some differences between the two analyses (Table 3). This identification resolution was comparable to that of other angiosperm taxa [45-47]. ITS is well known to have many informative sites; therefore, this region has previously been suggested as a DNA barcode in other plants because of its high resolution [47-51]. These results may have been influenced by the mutation rate of ITS being higher than that of chloroplast genes (rbcL and matK) [52,53]. Similar to the results of previous studies, our results showed a higher discriminatory power when using ITS (>72%) than when using only chloroplast DNA regions (<61%) (Table 3). A low amplification rate of ITS has occasionally been reported, depending on the taxon [54], although such a problem was not observed in this study.
Although the addition of the psbA-trnH region (74.03%) did not improve species identification when compared with using only ITS (74.03%), this region did enhance discriminatory power at the genus level (Table 3). In the ITS and ITS + rbcL trees, Silene oliganthella Nakai ex Kitag. was not well distinguished from the congeneric species Silene fasciculata Nakai and Silene jenisseensis Willd. because of its short branch (Figures 4 and 5). Meanwhile, Silene oliganthella diverged deeply from these species with a long branch on the ITS + psbA-trnH tree (Figure S15). However, such an interpretation requires caution, considering that only one sample of Silene oliganthella was included in the present study. In addition, with regard to taxonomy, the species has been regarded as a variety or synonym of Silene jenisseensis [55]. Although considerable sequence variation in psbA-trnH contributed to divergence among Silene species, simultaneously high variation within species hindered precise identification in the case of Pseudostellaria. Sequence variations within species were detected, although some were shared with individuals of other species. Such a low resolution of psbA-trnH has also been observed in other lineages [56,57]. In addition, bi-directional sequencing of psbA-trnH often failed, which was thought to be due to long mononucleotide repeats [58,59] or a potential loop structure [60]. In fact, slippage after long mononucleotide repeats was observed in Dianthus and Saponaria, whilst a pair of inverted repeats that could form a cruciform shape was found in Cerastium, Sagina, and Spergularia (Figure 1). Therefore, in this study, we suggest ITS as a DNA barcode for Korean Caryophyllaceae species when considering both cost efficiency and discriminatory power, as in previous DNA barcoding studies [47-51]. Furthermore, adding rbcL is recommended for better resolution (Table 3).
In addition, psbA-trnH could be used as a supplementary region for identification in genera such as Silene. This is plausible considering the high degree of species identification in the best close match analysis (Table 3). Among the cases where two DNA barcode regions were combined, this was also supported by the ASAP result, in which the highest number of species partitions (37) was observed for ITS + psbA-trnH (under the best asap-score).
Most of the genera in Korea were accurately identified in our phylogenetic tree, with the exception of Minuartia and Silene, which were identified as polyphyletic groups (Figure 5). However, this was not due to a lack of discriminatory power in the DNA regions used in this study. Molecular phylogenies have previously suggested that these two genera are polyphyletic [44,61]. In the case of Minuartia, the three species (Minuartia arctica, Minuartia laricina, and Minuartia verna (L.) Hiern var. leptophylla (Rchb.) Nakai) analyzed in this study each belong to different sections within the genus (summarized in [61]). The molecular phylogeny also showed that each of these species is included in a different clade [61]. In fact, Minuartia laricina and Minuartia verna var. leptophylla were not included in that analysis, but their position in the phylogenetic tree could be inferred based on closely related taxa. As a result, Minuartia arctica and Minuartia laricina have been treated as Cherleria arctica (Steven ex Ser.) A. J. Moore & Dillenb. and Pseudocherleria laricina (L.) Dillenb. & Kadereit, respectively [61,62]. Minuartia verna var. leptophylla was not mentioned in the previous study [61]; it may be treated as a taxon of Sabulina Rchb. Silene has also been suggested as a polyphyletic group. In the corresponding molecular phylogeny, Silene armeria L. (=Atocion armeria (L.) Raf.) was placed within the sister group of Silene. Lychnis species were placed within Silene and were therefore treated as a section of Silene. In light of the polyphyly of Minuartia and Silene, further investigation may be necessary to evaluate the validity of their current taxonomic classification in Korea.
Although most species of Caryophyllaceae were identified well using DNA barcodes, five genera (Cerastium, Gypsophila, Dianthus, Silene, and Spergularia) showed a low resolution (0-61.11%). Such a low resolution may have been caused by wide morphological variation within species and hybridization among species. In the case of Cerastium, the subspecies of Cerastium fontanum (subsp. hallaisanense and subsp. vulgare) were not distinguished, and their taxonomic distinction has been considered ambiguous [63]. Similar to Cerastium, the unidentified Silene species in the NJ tree (Silene takesimensis, Silene fasciculata, and Silene oliganthella) were thought to be indistinct taxa [64]. For Dianthus, many species could have intercrossed both naturally and through cultivation (summarized in [65]). Indeed, multiple peaks in the ITS region are frequently detected in the genus, which can make identification difficult. Therefore, the low rate of identification in these genera is not surprising considering the taxonomic similarities among their taxa. In contrast, Spergularia bocconei can be distinguished from Spergularia marina by the dense glandular hairs on its stems and leaves [16], in addition to having smaller capsules and seeds than the latter [66]. Gypsophila pacifica and Gypsophila oldhamiana can also be distinguished by their leaf (ovate vs. oblong), inflorescence (diffuse vs. dense), and flower morphologies (pedicels, 2-5 mm vs. 5-10 mm; apex of petals, rounded vs. truncate or retuse; stamens and styles, shorter than petals vs. longer than petals) [3]. Spergularia and Gypsophila species have been well identified using ITS2 and trnL-F [67], and ISSR and RAPD [68]. These previous findings contrast with our results. There might be various reasons for this discrepancy, such as genetic variation across different regions or misconceptions about the taxonomic classification of these two genera. However, it is difficult to determine the exact cause based on the current study alone.
Therefore, further investigation of morphological and genetic variations is necessary to better understand the relationships between these two genera in Korea.

Taxon Sampling
A total of 215 individuals representing 78 taxa across 17 genera were obtained through field surveys and from dried specimens in the herbarium of the National Institute of Biological Resources (NIBR). Leaves sampled from the field were immediately dried using silica gel. We attempted to cover the majority of species in Caryophyllaceae with reference to the National List of Species of Korea [4]. For plants that grow in North Korea, we utilized samples collected from nearby regions, such as China and Russia. To represent genetic variation within species, we collected three or more samples per taxon from different populations, excluding plants that grow in North Korea. Sample information is provided in Table S1.

DNA Extraction, Polymerase Chain Reaction (PCR), and Sequencing
Total genomic DNA of the abovementioned plants was extracted from silica-dried leaves using NucleoSpin® Plant II (Macherey-Nagel, Düren, Germany), according to the manufacturer's protocol, although the incubation times during cell lysis and elution of DNA were manually modified. The concentration of extracted DNA was subsequently measured using a Synergy LX microplate reader (BioTek Instruments, Winooski, VT, USA).
PCR was carried out to amplify DNA barcode regions, and each reaction mixture contained approximately 10 ng of DNA, 10 µL of AccuPower® Taq PCR PreMix (Bioneer, Daejeon, Republic of Korea), distilled water, and an appropriate volume of primers (usually 0.3 µM) in a total volume of 20 µL. The primers were selected with reference to previous studies so that they could be applied to the taxa of Caryophyllaceae (Table 4). After initial denaturation at 95 °C for 3 min, the reaction was run for 35 cycles of denaturation at 95 °C for 30 s, annealing at 52 °C (for rbcL, matK, and psbA-trnH) or 55 °C (for ITS) for 30-45 s, and extension at 72 °C for 45-80 s according to product size. The final extension was performed at 72 °C for 7 min. PCR results were then confirmed via high-resolution capillary electrophoresis on a QIAxcel Advanced Instrument (Qiagen, Hilden, Germany). Successfully amplified products were sequenced using an ABI 3730XL sequencer (Applied Biosystems, Foster City, CA, USA), and sequencing was performed by Macrogen (Seoul, Republic of Korea).
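The cycling conditions above can be written down as a small data structure, which also makes it easy to estimate programmed hold time. The sketch below is illustrative only: the class and function names are hypothetical, and because the text gives annealing and extension times as ranges (30-45 s and 45-80 s) without stating which value was used for which region, the two calls simply evaluate the shortest and longest combinations. Ramp time between temperatures is not included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: int
    seconds: int


def pcr_program(anneal_c: int, anneal_s: int, extend_s: int, cycles: int = 35):
    """Build the program described in the text (hypothetical helper)."""
    initial = Step("initial denaturation", 95, 3 * 60)
    cycle = [
        Step("denaturation", 95, 30),
        Step("annealing", anneal_c, anneal_s),
        Step("extension", 72, extend_s),
    ]
    final = Step("final extension", 72, 7 * 60)
    return initial, cycle, cycles, final


def hold_time_minutes(program) -> float:
    """Sum of programmed hold times in minutes, excluding temperature ramping."""
    initial, cycle, cycles, final = program
    total_s = initial.seconds + cycles * sum(s.seconds for s in cycle) + final.seconds
    return total_s / 60


# Shortest and longest settings allowed by the stated ranges (illustrative pairing).
shortest = hold_time_minutes(pcr_program(anneal_c=52, anneal_s=30, extend_s=45))
longest = hold_time_minutes(pcr_program(anneal_c=55, anneal_s=45, extend_s=80))
print(f"programmed hold time: {shortest:.1f} to {longest:.1f} min")
```

Under these assumptions the programmed hold time falls between roughly 71 and 100 minutes per run.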
Sequence characteristics such as length, number of variable sites, and G/C content were measured using Geneious R 10.2.6 (Biomatters), whilst the number of parsimony-informative sites was measured using PAUP 4.0b10 [75]. To evaluate the DNA barcoding regions, the barcode gap was checked, which indicated whether or not interspecific and intraspecific distances overlapped. In this process, for each combination of the four regions, the pairwise genetic distance matrix among individuals was calculated based on the Kimura 2-parameter (K2P) method [76] using MEGA 11 [77]. To identify species, the NJ tree was also constructed using the K2P method in MEGA 11, with the bootstrap value of each node calculated with 2000 replications. Amaranthus spinosus L. was used as an outgroup in this NJ analysis (ITS: KY968964, matK: MF159529, rbcL: MF135474, psbA-trnH: MF143791). The NJ analysis uses a distance matrix based on nucleotide differences between pairs of taxa [78], making it well-suited to DNA barcoding. Compared with other methods of phylogenetic tree construction, the NJ analysis is relatively insensitive to issues such as multiple substitutions and missing data, and it is also faster [78,79]. The success of species identification was determined by whether individuals within the species were clustered into a clade. In this process, the forma was not counted as an individual taxon. When only one individual within a species was analyzed, we decided manually by considering branch length.
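The distance calculations in this study were run in MEGA 11; the following is only a minimal sketch of the underlying idea, not the MEGA implementation. It computes the standard K2P distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences, and then compares the largest intraspecific distance with the smallest interspecific distance as a simple barcode gap check. The function names, the handling of gaps (sites with any non-ACGT character are skipped), and the toy sequences are all assumptions made for illustration.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """K2P distance between two aligned sequences of equal length."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases pairwise
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent for K2P.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def barcode_gap(records):
    """records: list of (species, aligned_sequence) pairs.

    Returns (max intraspecific, min interspecific, gap present?).
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(k2p_distance(s1, s2))
    max_intra = max(intra, default=0.0)
    min_inter = min(inter, default=math.inf)
    return max_intra, min_inter, min_inter > max_intra


# Hypothetical toy alignment, not data from this study.
records = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "ACGTTCGGAC"),
    ("sp_B", "ACGTTCGGAC"),
]
max_intra, min_inter, has_gap = barcode_gap(records)
print(f"max intraspecific = {max_intra:.4f}, min interspecific = {min_inter:.4f}, gap = {has_gap}")
```

In the toy data the two distributions do not overlap, so a gap is reported; in the real dataset the text reports overlap for all regions, which corresponds to the third return value being false.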
Along with the NJ analysis, we used the "best close match" function in TAXONDNA to assess the success of species identification [80]. This analysis assigned a query to the species name of its best-matching barcode, regardless of the degree of similarity between the query and barcode sequences, within a distance threshold of 95% [80]. Additionally, we utilized Assemble Species by Automatic Partitioning (ASAP), an unsupervised Operational Taxonomic Unit (OTU) picking method based on pairwise sequence distance [81]. ASAP suggests optimal species partitions with an asap-score that reflects the confidence level of the clustering. Generally, a lower score implies a better partition. This program is available online (https://bioinfo.mnhn.fr/abi/public/asap/asapweb.html (accessed on 5 May 2023)). In this process, we used default parameters, except for the substitution model (K2P method).
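The logic of a best close match test can be sketched as follows. This is not the TAXONDNA code: it is a simplified illustration in which the distance threshold is passed in as a number rather than derived from the data, an uncorrected p-distance stands in for the K2P distance to keep the example short, and all names and sequences are hypothetical. The outcome labels follow the usual categories of this kind of test, with "correct" corresponding to the column of the same name in Table 3.

```python
def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites (stand-in for K2P)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def best_close_match(query_species, query_seq, references, threshold,
                     distance=p_distance) -> str:
    """Classify one query against reference barcodes.

    references: list of (species, aligned_sequence), excluding the query itself.
    threshold: largest distance still accepted as a match (assumed given).
    """
    scored = [(distance(query_seq, seq), species) for species, seq in references]
    best = min(d for d, _ in scored)
    if best > threshold:
        return "no match"
    best_species = {species for d, species in scored if d == best}
    if best_species == {query_species}:
        return "correct"
    if query_species in best_species:
        return "ambiguous"
    return "incorrect"


# Hypothetical reference barcodes, not data from this study.
references = [("sp_A", "ACGTACGTAT"), ("sp_B", "ACGTTCGGAC")]
print(best_close_match("sp_A", "ACGTACGTAC", references, threshold=0.15))
```

The share of queries classified as "correct" is then the identification success rate for that region or combination of regions.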

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/plants12102060/s1, Figure S1: Species partitions through ASAP algorithm based on rbcL; Figure S2: Species partitions through ASAP algorithm based on matK; Figure S3: Species partitions through ASAP algorithm based on psbA-trnH; Figure S4: Species partitions through ASAP algorithm based on internal transcribed spacer (ITS); Figure S5: Species partitions through ASAP algorithm based on ITS + matK; Figure S6: Species partitions through ASAP algorithm based on ITS + psbA-trnH; Figure S7: Species partitions through ASAP algorithm based on ITS + rbcL; Figure S8: Species partitions through ASAP algorithm based on matK + psbA-trnH; Figure S9: Species partitions through ASAP algorithm based on matK + rbcL; Figure S10: Species partitions through ASAP algorithm based on psbA-trnH + rbcL; Figure S11: Species partitions through ASAP algorithm based on ITS + matK + rbcL; Figure S12: Species partitions through ASAP algorithm based on ITS + psbA-trnH + rbcL; Figure S13: Species partitions through ASAP algorithm based on rbcL + psbA-trnH + rbcL; Figure S14: Species partitions through ASAP algorithm based on ITS + matK + psbA-trnH + rbcL; Figure S15: The neighbor-joining phylogenetic tree of the Korean Caryophyllaceae based on combined sequence from ITS + psbA-trnH; Table S1: List of species sequenced in this study and GenBank accession number.

Data Availability Statement:
The four chloroplast genomes, newly sequenced in this study, were archived in NCBI with accession numbers OQ150537-OQ150751 and OQ172318-OQ172962.