Testing the effectiveness of rbcLa DNA-barcoding for species discrimination in tropical montane cloud forest vascular plants (Oaxaca, Mexico) using BLAST, genetic distance, and tree-based methods

DNA-barcoding is a species identification tool that uses a short section of the genome that provides a genetic signature of the species. The main advantage of this novel technique is that it requires a small sample of tissue from the tested organism. In most animal groups, this technique is very effective. However, in plants, the recommended standard markers, such as rbcLa, may not always work, and their efficacy remains to be tested in many plant groups, particularly from the Neotropical region. We examined the discriminating power of rbcLa in 55 tropical cloud forest vascular plant species from 38 families (Oaxaca, Mexico). We followed the CBOL criteria using BLASTn, genetic distance, and monophyly tree-based analyses (neighbor-joining, NJ, maximum likelihood, ML, and Bayesian inference, BI). rbcLa universal primers amplified 69.0% of the samples and yielded 91.3% bi-directional sequences. Sixty-three new rbcLa sequences were established. BLAST discriminates 80.8% of the genus but only 15.4% of the species. There was nil minimum interspecific genetic distances in Quercus, Oreopanax, and Daphnopsis. Contrastingly, Ericaceae (5.6%), Euphorbiaceae (4.6%), and Asteraceae (3.3%) species displayed the highest within-family genetic distances. According to the most recent angiosperm classification, NJ and ML trees successfully resolved (100%) monophyletic species. ML trees showed the highest mean branch support value (87.3%). Only NJ and ML trees could successfully discriminate Quercus species belonging to different subsections: Quercus martinezii (white oaks) from Q. callophylla and Q. laurina (red oaks). The ML topology could distinguish species in the Solanaceae clade with similar BLAST matches. Also, the BI topology showed a polytomy in this clade, and the NJ tree displayed low-support values. We do not recommend genetic-distance approaches for species discrimination. Severe shortages of rbcLa sequences in public databases of neotropical species hindered effective BLAST comparisons. Instead, ML tree-based analysis displays the highest species discrimination among the tree-based analyses. With the ML topology in selected genera, rbcLa helped distinguish infra-generic taxonomic categories, such as subsections, grouping affine species within the same genus, and discriminating species. Since the ML phylogenetic tree could discriminate 48 species out of our 55 studied species, we recommend this approach to resolve tropical montane cloud forest species using rbcLa, as an initial step and improve DNA amplification methods.


INTRODUCTION
A biodiversity inventory is crucial as a first step to protecting species and ecosystems. A significant portion of global biodiversity remains unnamed. Recent estimations indicate that 8.7 million species of multicellular organisms occur on Earth, but about 20% of those species have been described using morphological approaches since 1750 (Mora et al., 2011). Thus, it is urgent to speed up the species identification process (Hvistendahl, 2016). In the early 21 st century a molecular technique DNA barcoding, was proposed to identify species using short-standardized sequences and only requiring a small sample of tissue (Hebert et al., 2003). Cytochrome oxidase 1 (CO1) successfully discriminates against many animal species but does not resolve plant species. The Consortium for the Barcode of Life's (CBOL) plant working group evaluated several plastid DNA regions based on universality, sequence quality, and species discrimination, recommending using a core of a two-locus combination of partial genes rbcLa + matK as the plant barcode ( CBOL Plant Working Group1 et al., 2009). Such a universality has not been found in all plant groups, and other studies suggest using additional loci (Kress & Erickson, 2007;Fazekas et al., 2008;China Plant BOL Group et al., 2011;Pang et al., 2012). Moreover, matK may work very well for orchid species (Lahaye et al., 2008) but not for certain fern groups (Trujillo-Argueta et al., 2021). Furthermore, in some angiosperm genera, such as Salix (Percy et al., 2014) and Quercus (Piredda et al., 2011), plastid markers might not work at all.
On average, the resolution of the tested DNA barcoding markers for plants is not as high as barcode markers used for many animal groups (Fazekas et al., 2008;CBOL Plant Working Group, 2009). Of the possible plant markers, rbcLa appears to be one of the best plant barcodes, because of its successful amplification and sequencing. Although far from perfect, the resolution of rbcLa was shown to be better than matK when tested both barcodes in wild arid plants in the United Arab Emirates (Maloukh et al., 2017) and when tested alone, in plants of Saudi Arabia (Bafeel et al., 2012). Also, rbcLa can be a valuable tool to identify species in conditions in which other methods are impractical. For instance, this marker was successfully used for studying root diversity patterns in old-field communities in Ontario, Canada (Kesanakurti et al., 2011). This kind of research is encouraging, but more studies are needed to explore the resolution potential of this marker for species in ecosystems other than those of temperate regions. The Neotropics are considered the richest region in biodiversity (Gaston & Williams, 1996;Thomas, 1999). Several barcoding studies have been performed in neotropical animals (e.g., Hajibabaei et al., 2006). However, barcoding plant studies in this area are scarce. The available studies are often limited to a few plant groups such as orchids (Lahaye et al., 2008) or ferns (Nitta, 2008;Trujillo-Argueta et al., 2021).
In Mexico, the tropical montane cloud forest (TMCFs) is a top priority ecosystem for conservation due to its high diversity, endemism richness, and anthropogenic threats (Villaseñor, 2010;Toledo-Aceves et al., 2011). Due to the reproductive biology of plants, the universality of DNA barcodes has been difficult to achieve when dealing with several taxa, therefore, some authors suggest to develop a DNA barcode library locally to be used for conservation and ecological studies (Lahaye et al., 2008;De Groot et al., 2011). Our study is part of a long-term project to characterize DNA barcodes of tropical plants of southern Mexico. We choose rbcLa as the first marker to study because (1) it is widely used in phylogenic analyses, (2) it has been helpful for ecological studies (i.e., Kesanakurti et al., 2011), and is one of the markers with more published sequences in public databases. The aim of this study is to evaluate the performance of the plant core DNA barcode rbcLa, using universal primers for vascular plants, as a first stage of DNA barcoding analysis in an unexplored tropical montane cloud forest of the Mixteca Baja, Oaxaca, Mexico. We followed the three above-mentioned CBOL criteria and built a barcode library of native plant species for this region. Sequences obtained of these species were submitted to The Barcode of Life Data System (BOLD) and GenBank. BOLD is a bioinformatic workbench devoted to acquiring, storing, analyzing, and publishing DNA barcode records (Ratnasingham & Hebert, 2007).

Description of study site
We conducted the field study in a tropical montane cloud forest at different successional stages in the San Miguel Cuevas, Santiago Juxtlahuaca Municipality, Mixteca Baja (17 15′00.96″N, 98 02′57.34″, centroid coordinates), which belongs to the Western physiography region of the state of Oaxaca, in southern Mexico (Ortíz-Pérez, Hernández Santana & Figueroa Mah-Eng, 2004). The climate is semi-humid, temperate to semi-warm (1,382 mm and 16.8 C, mean annual precipitation and temperature, respectively, Fernandez-Eguiarte, Zavala-Hidalgo & Romero-Centeno, 2020), with soils rich in organic matter, steep topography, and a mean altitude of 2,187 m.

Studied species
We carried out random walks thought the forest area and collected one hundred samples of fresh vascular plants, which position was georeferenced with a GPS. In the field, we took digital photographs of each sampled plant and its main structures. The number of samples collected per taxon was one and occasionally two. A small sample of fresh leaf tissue (2-5 g) was also collected from the same specimen, placed in a sealed plastic bag, and kept fresh until stored at −20 C in a lab freezer. Voucher plants were pressed flat for standard herborization (drying, sanitizing, identification, mounting, labelling, and shelf-storing (2/55) hold some type of concern; Daphnopsis tuerckheimiana the status of Near Threatened (NT) and Oreopanax sanderianus of Vulnerable species (VU). Fresh plant vouchers were first identified to family level and then determined by the following specialists to species level: Daniel Tejero-Díez, UNAM FES Iztacala, México, lycopod and ferns; Sergio Zamudio, Institute of Ecology, Veracruz, México, Berberidaceae;

DNA alignment
rbcLa sequence chromatograms were assembled into contigs using CodonCode Aligner v.9.0.1 http://www.codoncode.com/aligner, and the resulting nucleotide sequences were manually edited. Consensus sequences were generated and aligned using MUSCLE (Edgar, 2004). These alignments were examined by eye and corrected in case of base ambiguity.

BOLD and genebank
Three files were included in the metadata submitted to BOLD: (1) Specimen data file including detailed voucher information, scientific names of the taxa sampled, collection dates, geographical coordinates, elevation, collectors, identifiers, and habitat.
(2) An image file was submitted with high-quality specimen images from each plant.
(3) A trace file was submitted along with primers and the direction of sequences. DNA sequences, edited, aligned, using CodonCode software, and in FASTA format, were uploaded to BOLD and referenced by Sample IDs. Metadata and DNA sequences submitted to BOLD were registered under project name "Diversity of a humid temperate forest in Oaxaca, Mexico" project code DVHTF (http://www.boldsystems.org). DNA sequences were also submitted to the GenBank.

Species differentiation
We used three strategies to evaluate species discrimination: (a) The Basic Local Alignment Search Tool for nucleotide (BLASTn) method (Altschul et al., 1990). This tool compares the query sequence against the GenBank sequence database available online by the National Center for Biotechnology Information (NCBI) https://www.ncbi.nlm.nih.gov. Identification at the genus level was considered successful when all hits with the maximum percent identity scores >99% involved a single genus. Species identification was considered successful only when the highest maximal percent identity included a single species and scored >99% (Bafeel et al., 2012;Abdullah, 2017).
(b) Genetic divergence. Interspecific and intraspecific distances were obtained in MEGAX (Kumar et al., 2018). Genetic distance was inferred from 1,000 replicates, and the evolutionary distances were computed using the Kimura two-parameter method with gaps/missing data treatment adjusted using pairwise deletion. The K2P genetic distances percentages of families, genera, and species were analyzed in the Barcode of Life Data Systems (BOLD, www.boldsystems.org) (Ratnasingham & Hebert, 2007).
NJ trees were constructed using MEGAX (Kumar et al., 2018) inferred from 1,000 replicates, and the evolutionary distances were computed using the Kimura two-parameter method with gaps/missing data treatment adjusted using pairwise deletion. Following Braukmann et al. (2017), we used all species but duplicates to avoid bias created by an unequal number of sequences per species. ML analyses were run on the IQ-TREE web server (http://iqtree.cibiv.univie.ac.at) with settings of automatic substitution model and ultrafast bootstrap analysis. Internal node support was calculated using 1,000 bootstrap replicates. Tree inference using Bayesian analysis was run on MrBayes 3.2.2 on XSEDE via the CIPRES supercomputer cluster (www.phylo.org) with two runs, MrBayes block of four chains, 2 h maximum time to run, and the nucleic acid selection settings for this web server. The resultant ML and BI trees were visualized in the interactive Tree of Life (iTOL) (Letunic & Bork, 2019). We evaluated which of the tree-based methods (NJ, ML, and MB) recovered more monophyletic species with a bootstrap/posterior probabilities support of >70% (De Groot et al., 2011).

Studied taxa with DNA amplification and sequencing success
We could successfully amplify 69% (69/100) of the botanical samples collected and obtain high-quality bidirectional sequences (>250 bp) in 91.3% (63/69) of the samples collected, using the standard primers of the CCDB for rbcLa. The total number of specimens with high-quality bidirectional sequences includes 55 species, of which 29.1% (16/55) were herbs, and 70.9% (39/55) were woody plants (trees and shrubs). These species belong to 38 families and 48 genera. Twenty-seven families include one species, and 11 families include two to five species (Table 1).

BLAST
Only 47.3% (26/55) of our studied species had a previous rbcLa sequence register in the GenBank database (Fig. 1). We contributed in this study with 13 new species in the GenBank Taxonomy Database (Daphnopsis selerorum, Solanum nigricans, Comarostaphylis longifolia, Daphnopsis tuerckheimiana, Deppea guerrerensis, Vallesia aurantiaca, Myrsine juergensenii, Quercus martinezii, Miconia glaberrima, Passiflora quadraticordata, Cestrum commune, Zeugites hintonii and Salvia clarckcowanii), since none of this species had a previous register with rbcLa nor with any other gene sequence. Using previously published records in the GenBank of rbcLa sequence for our species (26) and BLASTn, we got 100% resolution in all the families, 80.77% (21/26) in the genera, and only 15.38% (4/26) at the species level (Fig. 1). Just four species, Monnina xalapensis, Cnidoscolus aconitifolius, Iresine diffusa, and Lophosoria quadripinnata, displayed the highest score BLAST match to a single species with more than 99% identity. Most of our rbcLa sequences matched from 2-12 species with ≥99% maximal percent identity; and seven species, Alnus acuminata, Solanum hispidum, Quercus laurina, Quercus callophylla, Pinus montezumae, Osmanthus americanus, and Physalis phyladelphica, matched the rbcLa sequences in the GenBank with >30 different species. The highest score BLAST match for our species are shown in Table 2. A specimen data file, image file, and trace file(s) were submitted to BOLD along with edited and aligned sequences for each of our 63 botanical samples (55 species and eight different duplicates) and can be accessed through the BOLD DNA database (http://www.boldsystems.org) under the 'DVHTF' project. Sixty-three new sequences generated by this study for rbcLa with their BOLD Process ID, and GenBank Accession numbers, are shown on Table 1.

Genetic divergence
The distribution of intra-and interspecific K2P distances across all taxon pairs of our 55 species of plants of The Mixteca Baja, Oaxaca, tropical montane cloud forest, obtained   Intergeneric rbcLa genetic distances of the tropical montane cloud forest species in the Mixteca Baja, Oaxaca, Mexico, found in multi-species genera. The Bold Process ID is below the scientific names.
The mean genetic divergence observed in the studied families with two or more genera is shown in Table 4. The highest mean divergence values were observed in the Ericaceae, Euphorbiaceae, and Asteraceae families.

Monophyly tree-based analyses
Phylogenetic tree-based analysis using Neighbor-Joining (Fig. S1), Maximum Likelihood (Fig. 3), and Bayesian Inference tree (Fig. S2) were reconstructed to evaluate our 55 species discrimination using the rbcLa barcode region. In all cases, we used ferns and a lycopodium as outgroups. These tree-based methods evaluated which tree rendered the greatest species resolution and whether the barcode sequences generated monophyletic species (Table 5). NJ and ML phylogenetic trees resolved 100% of monophyletic species using rbcLa. Nevertheless, the clade support value >70% with a bootstrap of 1,000 replicates yielded the most robust phylogeny in the ML tree (87.3%, 48/55) than the one obtained in the NJ tree (70.9%, 39/55). Therefore, we present the ML phylogenetic tree (Fig. 3). Although the BI tree showed the highest clade support value (92.7%, 51/55), this tree did not resolve all 55 species as monophyletic species; one polytomy was observed in the Quercus clade and another in the Solanaceae clade (Fig. S2).

DISCUSSION
Our study reveals the advantages and limitations of the rbcLa barcode region for species identification of vascular plant species of a neotropical montane cloud forest. First, the amplification was not universal, since near 30% of our samples did not amplify. However, bi-directional sequencing was highly successful from those samples that we could amplify. Using BLAST as an identification tool for genus level is convenient as it proved accurate in most cases. However, this was not usually the case for many of the studied species. Finally, in selected genera, this marker helped distinguish infra-generic taxonomic categories, such as subsections, and helps to group affine species within the same genus. Below, we discuss in detail these issues.
Multiple factors can cause the absence of DNA amplification in some samples. Since we could amplify rbcLa in several species, the possibilities of methodological failures or problems with the reactants or the lab equipment used are unlikely. One possible cause of  the amplification failure is DNA degradation in some samples, as those were collected in the field and brought to the lab. During this time, the tissues may become degraded in some species. This appears to be a plausible explanation for cases in which DNA from tissue samples was successfully amplified in one individual but not in another of the same species. This is the case of Solanum nigricans (this study) and Dryopteris wallichiana, which could not be amplified in this study but were successfully amplified in a previous study using samples from different plants (Trujillo-Argueta et al., 2021). Another possibility is that the pair of rbcLa universal premiers used may not work for certain species. Our rbcLa amplification success (69%) could be increased using the alternative set of universal primers proposed by CCDB for gene barcode rbcL.
Our sequencing success (91.30%) was high and similar to those reported in other works. In a study of root diversity patterns using plastid gene rbcLa, Kesanakurti et al. (2011) registered 96% amplification success with 85% sequencing success. In another study that identified Sicily's most threatened plant taxa, the amplification and sequencing successes were 96% and 95%, respectively (Giovino et al., 2016). In a study of the temperate flora of Canada, the use of rbcLa gave a 91.4% sequencing success (Burgess et al., 2011).
Our BLAST results were higher for genus discrimination (80.77%) than the values obtained for species differentiation (15.38%). Results from other regions and species are variable. For example, in wild, arid plants, discrimination at genus and species levels were lower than ours: 50% and 8%, respectively (Bafeel et al., 2012), but higher in a comprehensive study of the local flora of Canada (91% and 44%, Braukmann et al., 2017). In a study of threatened species of Sicily, the discrimination at the genus level was lower (52%) but higher at the species level (48%) than our results (Giovino et al., 2016). The peculiarities of the biology of the studied species may also account for the observed discrimination variability. Part of our low percent species discrimination results using BLASTn can be explained by low marker resolution, as was noticed in those species that matched their rbcLa sequence with more than 30 different species in the GenBank database (Alnus acuminate, Solanum hispidum, Quercus laurina, Quercus callophylla, Pinus montezumae, Osmanthus americanus and Physalis phyladelphica). Another explanation is misidentified voucher specimens in public DNA databases, an issue that several authors have acknowledged (e.g., Burgess et al., 2011;Abdullah, 2017). Since it is customarily to describe species based on morphological characteristics, it is possible that hybridization and polyploidy, which are common in plants, may contribute to decreasing barcoding species discrimination (Fazekas et al., 2008;Hollingsworth, 2011). Since more than half of the species in this study (52.72%) lacked comparative data in the GenBank database, it is necessary to increase the DNA barcode database, particularly for tropical wild plant species. Indeed, we contributed to new 63 rbcLa sequences to BOLD, its metadata, and the GenBank database. Although 42 of our species already had a rbcLa sequence on the GenBank database, new records on these species might help discover new haplotypes or geographical variants (Hajibabaei et al., 2007). Even if rbcLa does not have high species discrimination, it does for genus discrimination, which for some ecological studies might be enough (e.g., Kesanakurti et al., 2011).
Our distribution of intra-and interspecific genetic divergence (Fig. 2) agrees with the premise that a DNA barcode must exhibit high interspecific but low intraspecific divergence (Lahaye et al., 2008). The percent interspecific divergence of this study (0.65) is similar to those reported in other hotspot diversity areas such as the Mediterranean Basin (0.89) (Giovino et al., 2016) and Southern Africa (0.82) (Lahaye et al., 2008). The lack of genetic divergence observed in the three genera of trees: Quercus (Q. martinezii, Q. laurina, and Q. callophylla); Oreopanax (O. sanderianus and O. xalapensis), and Daphnopsis (D. selerorum and D. tuerckheimiana) concurs with Smith & Donoghue (2008). These authors found that the rates of molecular evolution are low in woody plants with long generation times compared to herbs. In the case of oaks (Quercus), several attempts have been made to identify species in Italy, using different plastid barcodes without success since hybridization and polyploidy are expected to be high in this group (Piredda et al., 2011). Null genetic divergence obtained in Oreopanax and Daphnopsis (Table 3) is of concern since Oreopanax sanderianus and Daphnopsis tuerckheimiana are on the red list of IUCN. The highest values of genetic distance found in the Ericaceae (5.57%), Euphorbiaceae (4.59%), and Asteraceae (3.3%) families that hold many herbs and shrubs species agree with the assumption that the rbcLa barcode has a better species differentiation for non-tree species. Moreover, a study conducted in a subalpine forest in Southwest China found a better DNA barcode resolution for herbs than for tree species (Tan et al., 2018). However, more studies are needed to confirm this trend in other species and localities.
The phylogenetic arrangements found in our study using barcode rbcLa concur with the recent Angiosperm Phylogeny Group classification (APG IV) of flowering plants (The Catalogue of Life Partnership, 2017). The percent monophyletic species resolution obtained in this study using NJ (100%), ML (100%), and BI (85.45%) phylogenetic trees, was higher compared to 17% of species resolution found in arid wild plants using ML trees (Bafeel et al., 2012), barcoding the biodiversity of Kuwait (58%) using NJ trees (Abdullah, 2017) and the 71.8% registered in two biodiversity hotspots of Mesoamerica and Southern Africa, using ML and BI trees (Lahaye et al., 2008). Our ML phylogenetic tree showed the most robust phylogeny (87.27%), Ocotea helicterifolia, Quercus callophylla, Quercus laurina, Iresine difussa, Berberis lanceolata, Moussonia deppeana, and Osmanthus americanus, could not be resolved as monophyletic species with a clade bootstrap support value >70%.
Most of these species are trees in agreement with the assumption that rates of molecular evolution are low in woody plants compared to herbs (Smith & Donoghue, 2008). For those species that could not be differentiated with the ML tree, we suggest the addition of a second barcode.
Species discrimination can be improved by using tree-based phylogenetic methods rather than BLAST analysis and genetic distance approaches. For instance, using NJ and ML phylogenetic trees, it was possible to differentiate Quercus martinezii from Q. laurina and Q. callophylla (Fig. S1, Fig. 3) despite the null genetic divergence observed in Quercus using BLAST. Based on an updated infrageneric classification of the oaks (Denk et al., 2017), Q. martinezii belongs to the white oaks (subsection Quercus), while Q. callophylla and Q. laurina belong to the red oaks (subsection Lobatae). In the Solanaceae family, three out of the five studied species (Physalis philadelphica, Solanum hispidum, and Solandra maxima) share high similitude with at least 30 species using the best BLAST match results. Furthermore, using our best BI tree, we observed a polytomy in the Solanaceae clade (Fig. S2), and a low discrimination value in the NJ tree. However, these species could be resolved with our ML phylogenetic tree. Taxonomic species are usually described based on morphological characteristics that can easily be altered by local adaptation, phenotypic plasticity, or neutral morphological polymorphism, which may cause a single variable species to be classified as many species (e.g., Gemeinholzer & Bachmann, 2005). On the other hand, very recent divergence and little differentiation might contribute to the inability of barcoding to separate species in some cases (Birch et al., 2017).

CONCLUSIONS
DNA barcoding using rbcLa can be a promising identification tool primarily at the family and genus level for vascular plant species of the neotropical montane cloud forest. We identify three major problems with the use of this technique. First, the lack of a universal amplification capability is probably associated with DNA degradation in some cases, but without ruling out other factors requiring further study. Second, the inability to detect certain morphological species is probably not related to rbcLa itself but to biological (e.g., polyploidy and hybridization) and technical (misidentifications or taxonomic misclassifications) problems. Third, the few available registers in the BOLD and GenBank databases (more than half of our species, 52.72%, did not have previous rbcLa sequence records). Indeed, we contributed new 13 species to the GenBank Taxonomy Database and 63 new sequences for rbcLa in BOLD and GenBank. We found preliminary evidence suggesting that the ability of the marker to discriminate species is not randomly distributed among taxa. Herb and shrub species in the Asteraceae, Ericaceae and Euphorbiaceae families showed the highest genetic distance using rbcLa, which can be helpful to distinguish congeneric species. Contrastingly, we detected nil genetic divergence among congeneric species in long-lived tree genera, Quercus, Oreopanax, and Daphnopsis. Nonetheless, the accuracy for discriminating species can be substantially improved using tree-based analysis. While BLAST and genetic distance approaches could not differentiate Quercus species, NJ and ML could successfully separate white oaks (Quercus martinezii) from red oaks (Q. callophylla and Q. laurina). Also, most species in the Solanaceae family that showed unsuccessful BLAST results and low genetic distance could be discriminated against with ML phylogenetic tree. The ML phylogenetic tree showed the most robust phylogeny (87.27%) of all our 55 studied species of the tropical montane cloud forest of San Miguel Cuevas in Oaxaca state, Mexico. The establishment of this local barcode database will be valuable for a broad range of potential ecological, conservational, and phylogenetic applications.