Unraveling the plant diversity of the Amazonian canga through DNA barcoding

Abstract The canga of the Serra dos Carajás, in Eastern Amazon, is home to a unique open plant community, harboring several endemic and rare species. Although a complete flora survey has been recently published, scarce to no genetic information is available for most plant species of the ironstone outcrops of the Serra dos Carajás. In this scenario, DNA barcoding appears as a fast and effective approach to assess the genetic diversity of the Serra dos Carajás flora, considering the growing need for robust biodiversity conservation planning in such an area with industrial mining activities. Thus, after testing eight different DNA barcode markers (matK, rbcL, rpoB, rpoC1, atpF‐atpH, psbK‐psbI, trnH‐psbA, and ITS2), we chose rbcL and ITS2 as the most suitable markers for a broad application in the regional flora. Here we describe DNA barcodes for 1,130 specimens of 538 species, 323 genera, and 115 families of vascular plants from a highly diverse flora in the Amazon basin, with a total of 344 species being barcoded for the first time. In addition, we assessed the potential of using DNA metabarcoding of bulk samples for surveying plant diversity in the canga. Upon achieving the first comprehensive DNA barcoding effort directed to a complete flora in the Brazilian Amazon, we discuss the relevance of our results to guide future conservation measures in the Serra dos Carajás.


| INTRODUCTION
Conservation efforts depend on a detailed knowledge of the biodiversity in the area of interest, although this is rarely available for megadiverse regions (Alroy, 2017; Hopkins, 2007; Milliken et al., 2010; Myers et al., 2000). The Amazon basin is a vast and diverse biome, being exceptionally important for the maintenance of the biodiversity in the Neotropical region over time (Antonelli et al., 2018). Although the region is undoubtedly one of the most important ecosystems on the planet, harboring an estimated one quarter of all extant plant species, there is a lack of knowledge about a huge portion of the Amazon ecosystems (BFG, 2018; Fearnside, 2002; Hopkins, 2007; Milliken et al., 2010; Morim & Lughadha, 2015). In addition, along its massive geographic area, the Amazon basin is composed of several different centers of endemism (see Silva et al., 2005 and references within), which are important for the resilience of the forests in the face of the disturbing effects of direct anthropogenic impacts and climate change (Levine et al., 2016).  Skirycz et al., 2014; Viana et al., 2016), with a high floristic heterogeneity among sites. Such ironstone outcrops have been exploited over the years mainly for iron ore mining activities (Skirycz et al., 2014), and robust biodiversity surveys are necessary to ensure species protection through effective conservation efforts in the presence of industrial activities, especially in view of the climate change scenarios predicted for the region (Giannini et al., 2020; Levine et al., 2016; Miranda et al., 2019).
Plant surveys in the Serra dos Carajás started in the 1970s, as detailed by Viana et al. (2016). However, a project to publish its flora, the Flora of the canga of Carajás (FCC), was carried out in just under 4 years, yielding the first complete Flora for a region of the Brazilian Amazon (Mota et al., 2018). This project provided complete floristic treatments for 116 angiosperm families, comprising approximately 900 species (Mota et al., 2018), a number considerably higher than the initial estimate of around 600 species (Viana et al., 2016). Other vascular plant groups detailed in the FCC included 175 ferns in 22 families, 11 lycophytes in three families (Salino et al., 2018), and a single gymnosperm, Gnetum nodiflorum Brongn. (Gnetaceae), a liana widely distributed in the Brazilian Amazon.
The systematic collection of DNA samples was taken on board as part of the floristic initiative of the FCC project (Mota et al., 2018), as the availability of genetic and genomic data of plants was seen from the outset as extremely important. Such a measure would ensure the correct identification of the species, which had been authenticated by taxonomist specialists and backed by a deposited voucher, thus guiding more effectively all conservation efforts for the area.
The application of DNA barcodes (Hebert et al., 2003) et al. (2018), respectively. However, it is well known that the development of DNA barcodes is not as straightforward for plants as for other eukaryotes, such as animals and fungi (Fazekas et al., 2009; Hebert et al., 2016; Hollingsworth et al., 2016). The main problems associated with DNA barcoding of plant species arise from the considerably slower pace of evolution of the organelle genomes and the limited universality of some chloroplast DNA (cpDNA) markers, mainly those with higher nucleotide substitution rates within the plastomes, such as the matK gene (Hollingsworth et al., 2011). Also, there is a difficulty in standardizing which cpDNA regions will function as reliable plant DNA barcodes, since several authors have reported variable success rates using different markers (e.g., rpoB, rpoC1, atpF-atpH, psbK-psbI, and trnH-psbA) (e.g., Fazekas et al., 2008), although the combination of the rbcL and matK sequences has been recommended as the core barcoding loci (CBOL Plant Working Group, 2009; Kress, 2017). Besides organelle markers, some regions of the nuclear genome, such as the internal transcribed spacers (ITS1 and ITS2) of the 35S rRNA gene, yield useful DNA barcodes for plants (Chen et al., 2010; Hollingsworth et al., 2011).
Furthermore, the generation of DNA barcodes at the species level enables the use of composite samples for detection of species from a given environment, known as DNA metabarcoding. This approach has been regarded as a robust, fast, and cost-effective method for automated multispecies identification (Deiner et al., 2017; Zinger et al., 2019). For plants, ITS2 has been one of the main markers of choice for surveying multiple species at once, considering the methodological advantages of using this DNA barcode, such as the ease of standardizing PCR conditions and a smaller amplicon size (~450 bp) in comparison with other frequently used regions (Chen et al., 2010; Gous et al., 2019; Richardson et al., 2015). Thus, a curated DNA barcode library and well-established analytical procedures can provide the basis for the successful application of DNA metabarcoding for monitoring biodiversity (Adamowicz et al., 2019; Dormontt et al., 2018; Kress, 2017).
To the best of our knowledge, there is no other DNA barcoding approach directed to the complete flora of any other region in the Amazon basin. Hence, we describe DNA barcodes for vascular plant species mainly focusing on the canga of the Serra dos Carajás, also including plants from other areas in the Brazilian state of Pará that are relevant to an understanding of the biodiversity composition of this mountain range as a whole. We tested the potential of eight commonly used DNA barcode regions and then chose the most suitable markers for a broader application of the DNA barcoding approach in the area, in order to provide robust tools to assess genetic diversity data of the flora of the Amazon basin. Here we followed two main premises: (a) the highest possible marker universality, considering the diversity of taxonomic groups in the canga; and (b) a reasonable standardization and automation of the protocols for sample processing and analyses. Moreover, we aimed to test the potential of DNA metabarcoding analyses with ITS2 for future applications in the Serra dos Carajás, taking advantage of the DNA barcode library developed here.

| Plant materials for the DNA barcode procedures
Young leaf tissues were preferentially sampled for the DNA extractions, although other vegetative or reproductive structures were employed when needed, as in the case of species of Cactaceae and Eriocaulaceae, for instance. A total of 1,179 specimens of vascular plants from 120 families, 343 genera, and 577 species were collected in the Serra dos Carajás and other relevant regions in Eastern Amazon, state of Pará, Brazil (Table A1), as part of the FCC project (Mota et al., 2018; Salino et al., 2018; Viana et al., 2016), under ICMBio/MMA permit numbers 47856-2, 48272-6, 53990-1 and 63324-1. Approximately 55% of those samples (645 specimens from 96 families, 243 genera, and 370 species) were used to test seven different cpDNA regions (the genes matK, rbcL, rpoB, and rpoC1, and the intergenic spacers atpF-atpH, psbK-psbI, and trnH-psbA) and the ITS2 intergenic region. The remaining 534 samples were barcoded only after the selection of the two best markers (rbcL and ITS2), as detailed below. The vouchers of all sampled specimens were deposited at the MG herbarium (Museu Paraense Emílio Goeldi, Belém, Pará, Brazil) (Table A1).
The remaining collected tissues (147, ca. 13%) were dried in silica gel and then stored at room temperature (~25°C) until processing.

| DNA extraction
For the DNA extractions, we established an efficient automated protocol for all plant materials, considering the high diversity of taxonomic groups observed in the canga of the Serra dos Carajás.
Approximately 20 mg of fresh plant tissue (or ~10 mg for silica-dried samples) was separated in 96 racked 1.2-ml collection microtubes (Axygen) with two 3 mm tungsten carbide beads (Qiagen). The samples were frozen in a deep freezer (−80°C) for 18-20 hr and then ground in a TissueLyser II (Qiagen) for 1 min at 30 Hz. Then, 600 µl of extraction buffer (2% w/v CTAB, 0.1 mM Tris-HCl, 20 mM EDTA, 1.4 M NaCl) was added to the ground material and the samples were incubated for 40 min at 60°C in a water bath. The collection microtubes were centrifuged for 1 min at 2,900 × g to eliminate debris, and 300 µl of the supernatant was transferred to a 96 deep-well U-bottom plate. Afterward, an automated extraction was performed in a QIAcube HT (Qiagen) with the "Q protocol V1" of the QIAamp 96 DNA Kit (Qiagen), with minor modifications regarding the sample preparation step, which was carried out without the VXL buffer and included an incubation for 30 s after adding 350 µl of binding buffer ACB and mixing six times. Also, for some difficult samples, the DNA extractions were performed using the CTAB protocol I described by Weising et al. (2005), with minor modifications (0.5-1.0 g of leaf tissue and 10 ml of the extraction buffer, with the addition of 4% w/v PVP and 0.2% v/v β-mercaptoethanol), followed by the selective precipitation of polysaccharides described by Michaels et al. (1994).

| DNA barcode generation and phylogenetic analyses
The PCR conditions and sequencing reactions were performed as described in Babiychuk et al. (2017), using the primers listed in Table A2. We used PIPEBAR to process all trace files (*.ab1 and *.phd) to generate the assembled consensus of the forward and reverse sequences. Afterward, as an initial check for problematic sequences (from either mislabeled or contaminated samples) generating unusual specimen groupings, considering mainly order and family affiliations, the sequences were aligned with MAFFT 7.388 using the algorithm Auto (Katoh & Standley, 2013) for each marker separately. Then, phylogenetic trees based on maximum likelihood (ML) were constructed with RAxML 8.2 (Stamatakis, 2014) as implemented in the CIPRES portal (http://phylo.org), using the substitution model GTR + G and rapid bootstrapping with 1,000 replicates. Furthermore, we performed BLASTn searches in the GenBank database (http://blast.ncbi.nlm.nih.gov/Blast.cgi) for additional quality control to avoid problematic sequences, especially in the case of the intergenic regions, which were considerably more difficult to align due to the high taxonomic diversity among the sampled specimens.
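As a rough illustration of the alignment and tree-building step, the sketch below assembles command lines for a local MAFFT and RAxML 8 run that mirror the settings named above (automatic strategy selection, GTR + G, rapid bootstrapping with 1,000 replicates). It is only a sketch under stated assumptions: the analyses reported here were run through the CIPRES portal, and the file names, seed value, and the `raxmlHPC` binary name are hypothetical. The commands are printed rather than executed.

```python
import shlex


def mafft_command(fasta_in):
    """Command for a MAFFT alignment with automatic strategy selection."""
    return ["mafft", "--auto", fasta_in]


def raxml_command(alignment, run_name, replicates=1000, seed=12345):
    """Command for a RAxML 8 rapid-bootstrap run plus best-scoring ML tree search."""
    return [
        "raxmlHPC",             # assumed binary name; installations differ
        "-f", "a",              # rapid bootstrap analysis and ML search in one run
        "-m", "GTRGAMMA",       # GTR model with gamma-distributed rate heterogeneity
        "-p", str(seed),        # random seed for parsimony starting trees (assumed)
        "-x", str(seed),        # random seed for rapid bootstrapping (assumed)
        "-N", str(replicates),  # number of bootstrap replicates
        "-s", alignment,        # input alignment
        "-n", run_name,         # suffix for output files
    ]


# One alignment and one tree per marker; file names are hypothetical.
for marker in ["rbcL", "ITS2"]:
    aligned = f"{marker}.aln.fasta"
    print(shlex.join(mafft_command(f"{marker}.fasta")), ">", aligned)
    print(shlex.join(raxml_command(aligned, marker)))
```

Building the commands as lists keeps each option visible and makes it easy to pass them to `subprocess.run` once the tools are installed.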
Finally, we tested the phylogenetic resolution by counting monophyletic species with at least 70% of bootstrap support, considering only those with more than one sampled specimen. ML trees were constructed in RAxML as described above, using six different ma-

| Barcode analysis
To test the barcode resolution (as the percentage of correctly as-  Table A1.

| Metabarcoding analysis
To assess the potential of using metabarcoding analysis with bulk samples for surveying plant diversity in the canga in future monitoring approaches, we sampled all discernible plant specimens within an approximate 10 m radius in six plots, including two markedly different vegetation types (forest groves and open rupestrian vegetation; Table A3), near the end of the dry season (27 and 28 September 2017), which lasts from May to October (see Viana et al., 2016).
Although virtually all plants were sterile, field activities are considerably safer in the ironstone fields during the dry season (e.g., Sodré et al., 2020). For each sampled locality, pieces of young leaves of approximately 1 cm² were collected in a 50-ml Falcon tube containing 30 ml of the 2% CTAB-NaCl saturated buffer and then stored as previously described.
The procedures for DNA extraction using CTAB and selective precipitation of polysaccharides were as described above, except for the amounts of leaf tissue (8 g) and extraction buffer (15 ml) per sample. Likewise, the amplification of the ITS2 region followed the same PCR conditions as before, with minor modifications, including 1× TBT-PAR buffer (Samarakoon et al., 2013) and using the primers ITS2-S2F (Chen et al., 2010), with the adapter Ion A, and ITS4 (White et al., 1990), with the adapter trP1. Then, PCR products were purified with the kit Agencourt AMPure XP Beads (Beckman Coulter), following the manufacturer's instructions. Each of the six different libraries (one library per collection plot) was prepared by pooling four independent PCR replicates and sequenced using the Ion PGM platform (Thermo Fisher).
Raw data from the single-end sequencing run were processed using FASTX Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) and the R package DADA2 (Callahan et al., 2016) to correct sequencing errors and infer exact amplicon sequence variants (ASVs) (equivalent to OTU determination). An ASV table was created, and representative sequences were assigned to taxa with BLASTn using our ITS2 library as a local reference database, based on minimum similarity and coverage settings (-perc_identity 95 and -qcov_hsp 70).
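The taxonomic assignment step can be pictured as a filter over BLASTn hits followed by a best-hit choice per ASV. The sketch below is a minimal illustration, not the pipeline used here: it assumes a tab-separated hit table with the columns query, subject, percent identity, query coverage per HSP, and bit score, and the ASV and reference names are hypothetical. Choosing the highest bit score among passing hits is also an assumption.

```python
import csv
import io

# Hypothetical BLASTn tabular hits with columns:
# query ASV, reference sequence, percent identity, query coverage per HSP, bit score
BLAST_TSV = (
    "ASV1\tSpeciesA_voucher1\t99.1\t100\t540\n"
    "ASV1\tSpeciesB_voucher7\t96.0\t98\t500\n"
    "ASV2\tSpeciesC_voucher3\t93.5\t100\t410\n"
    "ASV3\tSpeciesD_voucher2\t97.2\t65\t300\n"
)


def assign_taxa(tsv_text, min_identity=95.0, min_coverage=70.0):
    """Keep hits passing both thresholds, then pick the best-scoring hit per ASV."""
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        query, subject = row[0], row[1]
        identity, coverage, bitscore = (float(value) for value in row[2:5])
        if identity < min_identity or coverage < min_coverage:
            continue  # hit fails the similarity or coverage threshold
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


assignments = assign_taxa(BLAST_TSV)
print(assignments)
```

In this toy table only ASV1 receives an assignment: ASV2 falls below 95% identity and ASV3 below 70% coverage, so both would remain unclassified.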
Finally, we used the LULU curation algorithm to collapse erroneous ASVs, with default settings: a minimum relative co-occurrence of 0.95 and a minimum similarity threshold of 84% (Frøslev et al., 2017). Additionally, downstream analyses were performed with the R package phyloseq v1.26.1 (McMurdie & Holmes, 2013), with an object built from the curated ASV table, using data from taxonomy assignments and sampling plots. Fabaceae-matK, rpoB, and rpoC1) (Table A1) (Table A1).

| Amplification and sequencing success of barcodes
From our complete sampling, considering the 645 specimens used in the initial test with eight markers, plus the 534 remaining samples barcoded using only rbcL and ITS2, we obtained valid sequences of at least one of the eight markers for 538 out of the 575 sampled species (93.56%), totaling 1,130 specimens (95.84%) and 2,729 DNA barcodes (Table A1). After searching for previous records in the BOLD database, we observed that 344 (63.94%) of those species were barcoded for the first time in the present work (Table A1).
In addition, 33 out of the 323 genera with species barcoded here (10.22%) did not have any sequence available in the BOLD data-

| Barcode resolution
Considering the initial test with the eight markers, we observed lev-

| Phylogenetic resolution
Among the phylogenetic trees obtained from the six used matrixes   (Table 2).
Additionally, most of the species correctly identified in the barcode resolution analysis were recovered as monophyletic.
Nevertheless, some of the species correctly identified by the DNA barcodes (with barcode resolution) were not resolved in the phylogenies, such as Clitoria falcata Lam. (Fabaceae), which was correctly identified in the BLAST analyses with both rbcL and ITS2, although appearing as polyphyletic in all six trees. Correspondingly, the opposite situation, in which the species were monophyletic in all trees but without barcode resolution, was also observed, as in the case of Lindernia brachyphylla Pennell (Linderniaceae).
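The tree-based criterion (a species counts as resolved when all of its specimens form a clade with at least 70% bootstrap support, considering only species with more than one specimen) can be sketched as follows. This is a simplified illustration under stated assumptions: the tree is a hand-written, rooted, nested structure rather than parsed RAxML output, and the specimen labels are hypothetical.

```python
def leaves(node):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    _support, children = node
    found = set()
    for child in children:
        found |= leaves(child)
    return found


def clades(node):
    """Yield (leaf set, bootstrap support) for every internal node."""
    if isinstance(node, str):
        return
    support, children = node
    yield frozenset(leaves(node)), support
    for child in children:
        yield from clades(child)


def resolved_species(tree, min_support=70):
    """Map each species with more than one specimen to True if it forms a supported clade."""
    specimens_by_species = {}
    for leaf in leaves(tree):
        species = leaf.split("|")[0]
        specimens_by_species.setdefault(species, set()).add(leaf)
    support_of = dict(clades(tree))
    result = {}
    for species, specimens in specimens_by_species.items():
        if len(specimens) < 2:
            continue  # single-specimen species are not evaluated
        support = support_of.get(frozenset(specimens))
        result[species] = support is not None and support >= min_support
    return result


# Internal nodes are (bootstrap support, [children]); labels are "species|specimen".
toy_tree = (None, [
    (95, ["sp_a|1", "sp_a|2"]),
    (60, ["sp_b|1", "sp_b|2"]),
    (80, ["sp_c|1", (90, ["sp_c|2", "sp_d|1"])]),
])

status = resolved_species(toy_tree)
print(status)
print(f"{100 * sum(status.values()) / len(status):.2f}% of species resolved")
```

Here sp_a is resolved, sp_b forms a clade with insufficient support, sp_c is not monophyletic, and sp_d is skipped because it has a single specimen.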

| Metabarcoding analysis
The ITS2 high-throughput amplicon sequencing generated 4,465,309 raw reads from the composite samples of the six plots (Table A3) in the Serra dos Carajás. After the quality control step, 2,269,135 high-quality reads remained, yielding an average length of 314 bp.
A total of 508 different ASVs were observed in the metabarcoding analysis after sequence filtering, then being grouped into 41 ASVs classified to the species level, considering 95% and 70% of sequence similarity and coverage, respectively, resulting in 34 identified species, belonging to 33 genera, 21 families, and 14 orders (Figure 3).
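To show how ASV read counts translate into per-plot species proportions of the kind summarized in a relative abundance plot, the sketch below collapses ASVs to their assigned species and normalizes by the plot total. The plot names, ASV identifiers, species labels, and read counts are all hypothetical, and the calculation is an assumption about one common way to derive such proportions, not a description of the phyloseq workflow used here.

```python
from collections import defaultdict

# Hypothetical curated ASV table rows: (plot, ASV, assigned species, read count)
asv_rows = [
    ("plot1", "ASV1", "Species A", 600),
    ("plot1", "ASV2", "Species A", 200),
    ("plot1", "ASV3", "Species B", 200),
    ("plot2", "ASV3", "Species B", 50),
    ("plot2", "ASV4", "Species C", 150),
]


def relative_abundance(rows):
    """Sum reads per species within each plot and divide by the plot total."""
    reads = defaultdict(lambda: defaultdict(int))
    for plot, _asv, species, count in rows:
        reads[plot][species] += count
    result = {}
    for plot, per_species in reads.items():
        total = sum(per_species.values())
        result[plot] = {species: n / total for species, n in per_species.items()}
    return result


abundance = relative_abundance(asv_rows)
for plot, per_species in abundance.items():
    print(plot, per_species)
```

Several ASVs can map to the same species, which is why the counts are summed per species before normalizing.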
Malpighiales was the most representative order, with nine species,  Although defined as one of the two core barcode regions alongside rbcL (CBOL, 2009), matK performed poorly in our samples, with amplification and/or sequencing problems in approximately three-fourths of the tested specimens. We obtained even worse results for trnH-psbA, with less than 20% of our samples generating valid sequences, which is surprising since this intergenic region has been one of the preferred alternative barcode markers in several studies (e.g., Erickson et al., 2014; Lahaye et al., 2008). As noted above, it is important to emphasize that many samples were successfully amplified, although the cpDNA intergenic regions presented unsatisfactory sequence data recovery, especially in the case of trnH-psbA. Throughout the history of plant DNA barcoding, there have been several reports of methodological problems with most of the regions tested so far, as frequently reported for matK, which depends on several PCR optimizations for different taxa (e.g., CBOL, 2009; Fazekas et al., 2008; Ghorbani et al., 2017; Liu et al., 2015). On the other hand, the almost fully universal nature of many primers designed to amplify and sequence portions of the rbcL gene, including the primer pair we used here, makes this marker the safest choice among the known options in terms of building a comprehensive barcode library for a given flora, even taking into account its lower polymorphism levels among closely related species (Hollingsworth et al., 2011).
Nuclear rDNA-based sequences have been successfully used as DNA barcodes for fungi, especially the ITS region, which is largely employed as the official barcode region for the group (Badotti et al., 2017; Schoch et al., 2012; Wurzbacher et al., 2019). Several authors have emphasized the enormous potential of the ITS components for plant barcoding, which are also frequently regarded as highly informative for resolving phylogenetic relationships (e.g., Liu et al., 2015; Saha et al., 2017; Vasconcelos et al., 2018). Nevertheless, reports of problems with sequence recovery of the complete ITS (including its three regions: ITS1, rDNA 5.8S, and ITS2) are not rare for plants, mainly due to issues related to paralogs and pseudogenes (Álvarez & Wendel, 2003; Feliner & Rosselló, 2007). Gonzalez et al. (2009), for instance, obtained poor sequencing results for ITS, with only 41% of the sampled Amazonian trees being successfully barcoded by the authors. On the other hand, the smaller ITS2 region has been indicated as one of the best regions for plant barcoding, presenting a high rate of sequencing success even for lower quality DNA samples (Chen et al., 2010; Kuzmina et al., 2012; Ramalho et al., 2018). Likewise, our data showed the usefulness of ITS2 as the second-best tested marker in terms of sequence recovery, with valid barcodes for 81.04% of the species and 81.33% of the samples and performing relatively close to rbcL (91.45% of the species and 79.38% of the samples). The availability of sequences of a given marker in public repositories is essential for an effective inventory of plant diversity, and ITS2 has been one of the most frequently used barcode regions for angiosperms so far, accounting for 26.7% of the ca. 340,000 sequences available in the BOLD database (up to 20 January 2021), only behind rbcL and matK, with 35.8% and 31.6%, respectively.

| Species resolution
Assessing the levels of species discrimination in DNA barcoding approaches is undoubtedly important, although comparing results from different analyses is not as straightforward as one may assume. The first (and perhaps the most important) considerations are related to the study area and sampling coverage. DNA barcoding of specific local floras within a well-delimited geographic area, such as the campo rupestre on canga of the Amazon ironstone fields, for instance, may appear to be more limited in scope than studying the plant diversity . Also, there are basically two main approaches to assess the capability of correctly identifying species (species resolution) of DNA barcode markers. The first is search-based using BLAST (barcode resolution) (e.g., Burgess et al., 2011), and the second one is tree-based, which considers phylogenetic relationships (phylogenetic resolution) (e.g., Gonzalez et al., 2009), both with advantages and drawbacks (as discussed below). Therefore, we preferred to use both evaluation approaches.
FIGURE 3 Relative abundance of the observed species in the DNA metabarcoding analysis with bulk samples collected in six different canga plots in the Serra dos Carajás, as detailed in Table A3
At first glance, the barcode resolution may seem a more attractive approach, as noticeably higher values were obtained for the two best markers both individually and combined (rbcL: 75.00%, ITS2: 89.45%, and rbcL + ITS2: 86.06%), when compared with the phylogenetic resolution (rbcL: 62.30%, ITS2: 75.16%, and rbcL + ITS2: 66.02%). Moreover, using pairwise identity (or other related parameters of a BLAST search) to determine a correct sequence assignment (and consequently species identification) in DNA barcoding approaches is quite straightforward and practical, especially when handling a large volume of data. On the other hand, the importance of employing a parameter that reflects evolutionary relationships is obvious, as the inclusion of phylogenetic reconstructions with DNA barcoding data enables several other analytical inferences (Erickson et al., 2014; Kress, 2017; Kress et al., 2015; Miller et al., 2016). Therefore, besides assessing the levels of species discrimination of rbcL and ITS2 when barcoding the FCC, we also Furthermore, the discrimination levels obtained for both markers (separately and combined) were in accordance with previous results for rbcL and ITS2 (e.g., Burgess et al., 2011; Kress et al., 2009; Parmentier et al., 2013), although relatively higher than observed for other diverse floras, as reported by Gonzalez et al. (2009) and Liu et al. (2015). The fact that both species discrimination approaches used here were overly sensitive to sampling coverage is noteworthy, as the analyses considering only specimens with both barcodes provided higher resolution values. This difference was especially strong in the case of the phylogenetic resolution of the combination rbcL + ITS2, with a relative increase of 20.95% in the proportion of resolved species in the reduced sampling in comparison with the complete sampling (from 66.02% to 79.85%; Table 2).
This difference occurred due to the exclusion of specimens from species and/or genera that present either more complex evolutionary histories or problematic taxonomy.
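The 20.95% figure reported for rbcL + ITS2 is a relative change, which differs from the gain expressed in percentage points. A short worked check, using only the two resolution values given above, makes the distinction explicit:

```python
complete_sampling = 66.02  # % of species resolved, rbcL + ITS2, complete sampling
reduced_sampling = 79.85   # % of species resolved, specimens with both barcodes only

# Absolute gain, in percentage points
point_gain = round(reduced_sampling - complete_sampling, 2)

# Relative gain, as a percentage of the complete-sampling value
relative_gain = round(100 * (reduced_sampling - complete_sampling) / complete_sampling, 2)

print(f"Gain: {point_gain} percentage points, or {relative_gain}% in relative terms")
```

The gain is therefore 13.83 percentage points, which corresponds to the 20.95% relative increase.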

| DNA barcodes and conservation
Biodiversity indexes provided by DNA barcoding data have an undeniably important role in better directing conservation efforts, as the effectiveness of maintaining ecological services of biodiversity hotspots can be greatly enhanced by including phylogenetic diversity parameters in the decision-making process (Diniz et al., 2021;Forest et al., 2007). However, as mentioned before and pointed out by Kress (2017), properly populating the public databases with plant DNA barcodes has not been an easy task, being "one of the biggest challenges for the next decade". The difficulties in achieving such an important goal are especially evidenced by considering the actions needed to ensure proper conservation planning in such an immense (and still poorly known) area as the Amazon basin. Hence, the data presented here are strategic as the first and only genetic data available for several plant species of the region.
In addition, it is essential to pay extra attention to endemic and/or rare species of such a unique Amazon vegetation as the campo rupestre on canga of the Serra dos Carajás, as in the case of the morning-glory Ipomoea cavalcantei and the quillwort Isoetes cangae, for instance. Both species present a very limited geographic distribution in the canga, with studies based on DNA barcoding data investigating their genetic diversity status for the first time (Babiychuk et al., 2017; Nunes et al., 2018), followed by further populational analyses (Babiychuk et al., 2019; Dalapicolla et al., 2021; Lanes et al., 2018). of Melastomataceae with ITS2 were only slightly better than for rbcL (one barcoded species), considering the universal protocols used. Thus, we acknowledge the crucial need for developing more directed protocols aiming at problematic taxa, which will be our next step toward accomplishing a DNA barcode library with full coverage for the flora of the Amazonian canga.
As mentioned above, inventorying species through DNA-based tools has consistently gained ground over the years, achieving further importance with the development of multispecies identification approaches based on high-throughput sequencing technologies (Deiner et al., 2017; Kress et al., 2015). Several authors have pointed out the many advantages of using DNA metabarcoding for monitoring biodiversity, especially considering the robustness and efficiency of this analytical system (Bush et al., 2020; Deiner et al., 2017; Zinger et al., 2019). Certainly, the effectiveness of metabarcoding can be greatly affected by the completeness of the reference DNA barcode library (Alsos et al., 2018); thus, care must be taken for its use for iden-

| Concluding remarks
Although DNA barcoding methods are well established for plant species, and the novelty of the approach is thus limited, our study brings a considerable amount of novel sequencing data for a unique flora within the Amazon basin, whose genetic resources are still poorly characterized. Furthermore, the value of DNA barcoding data to guide conservation efforts in the Serra dos Carajás has also been demonstrated in the ecological context by helping to identify the importance of some plant taxa acting as nutrient providers for animal communities in ferruginous caves (Ramalho et al., 2018).
While the more polymorphic nature of the marker ITS2 makes it more suitable for species identification in most cases of the genera with more than one species in the canga of the Serra dos Carajás, there were some cases in which rbcL was better for discriminating species, such as within the genus Neea (Nyctaginaceae). Besides, there is excellent species coverage with rbcL in the available DNA barcode libraries, being especially crucial in the cases of species without any genetic information available. Therefore, the importance of rbcL as a plant barcode marker is unquestionable, and our choice of implementing ITS2 together with rbcL as primary barcodes for the highly diverse flora of the Serra dos Carajás covers all three principles of DNA barcoding.
In the case of the metabarcoding analysis, our goal was to test the method's viability when studying the diverse flora of the Amazon ironstone fields, aiming to establish a starting point and basal parameters for future large-scale studies in the region, using both bulk sampling and environmental DNA (eDNA) approaches (Oliveira et al., 2019). Hence, the ongoing development of the DNA barcode libraries for the region will be essential for the optimization of reforestation in decommissioned mining sites in the region, as well as fast and robust vegetation surveys in untouched native areas.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

OPEN RESEARCH BADGES
This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.17605/osf.io/5xt3u.

DATA AVAILABILITY STATEMENT
All DNA sequences generated for this work may be accessed through the BOLD accession numbers indicated in the