Examination of sequence variations in partial mitochondrial 12S gene amongst damselfish species as references for DNA barcoding

Abstract Accurate species identification, based on DNA barcoding, can be achieved when sufficient sequence variations are present amongst species in the sampled marker. In general, the ability to discriminate species decreases with shorter sequences; however, shorter regions have a merit in amplification success by the polymerase chain reaction. In either case, it is important to investigate sequence variations amongst species before barcoding to understand its reliability and limitations. In this study, we investigate how accurately short, but hypervariable portion of the mitochondrial 12S ribosomal RNA (12S) gene (MiFish region with approximately 180 bp) is used to identify each species in diversified pomacentrid fishes compared with the longer region of the same gene (approximately 750 bp). We prepared three datasets with 301 sequences of the MiFish region for 150 species, the same 301 of sequences of the longer 12S region and 476 sequences of the MiFish region for 183 species. Neighbour-joining (NJ) analyses and genetic distance analyses revealed several indistinguishable pairs of species in these DNA regions. Although the number of such pairs was larger in the MiFish region, 83.6% (153 of 183) of species possessed respective unique sequences even in the MiFish region (versus 96.0% [144 of 150 species] in the longer 12S region). A part of indistinguishable pairs of species might have caused by mitochondrial DNA introgressions and taxonomically unresolved problems. Our analysis clarified the effectiveness and limitations of species identification using DNA barcoding for Pomacentridae and the sequences we provided here contribute to the expansion of references for pomacentrid mitochondrial 12S sequences.


Introduction
DNA barcoding is a rapid species identification tool that uses sequences of the short fragment of DNA (Hebert et al. 2003, Hebert andGregory 2005).Comprehensive sequence reference is vital for accurate and adequate species identification using DNA barcoding, in which a given sequence in query would demonstrate high consistency with one of the references.When sufficient sequence references are available, DNA barcoding serves as an excellent identification tool, if interspecific variation is greater than intraspecific variation (Meyer andPaulay 2005, Stoeckle andThaler 2014).Nevertheless, this condition may be violated in a group that has a short elapsed time from the common ancestor or that has an incomplete reproductive isolation causing genetic introgressions (Ballard andWhitlock 2004, Rubinoff et al. 2006).To apply DNA barcoding to such a group, it is necessary to examine in advance whether there are certain degrees of sequence difference in the target DNA region amongst all pairwise combinations of the species involved.
Taxonomists are not the only potential users of DNA barcoding; this technology has become useful for various fields of biology (Joly et al. 2014, Kress et al. 2015).Especially, recent popularisation of next-generation sequencers enables researchers to utilise environmental DNA (eDNA) metabarcoding analysis, by which we could understand the species composition in a given region from water, soil and aerosol samples (Ruppert et al. 2019).In conventional DNA barcoding, a partial sequence with approximately 650 bp of the mitochondrial cytochrome oxidase subunit I gene (COI) has been used and a large amount of relevant data has been compiled in the Barcode of Life Data (BOLD) Systems across animals (Ratnasingham and Hebert 2007).In contrast, much shorter DNA regions (typically < 200 bp) are used in eDNA metabarcoding, because a short fragment of DNA is much easier to amplify in the polymerase chain reaction (PCR), especially when the template DNAs are degraded (Ficetola et al. 2008, Shokralla et al. 2012, Freeland 2017).For species discrimination using short sequences in DNA barcoding, hypervariable regions with conservative flanking regions are required and the utilisation of mtDNA genes other than COI has been proposed (Collins andCruickshank 2013, Zhang et al. 2020).A partial sequence of the mitochondrial 12S ribosomal RNA gene (12S) is such a region and often used to analyse eDNA in water samples (Miya et al. 2015, Valentini et al. 2016, Taberlet et al. 2018).
For eDNA metabarcoding for fishes, Miya et al. (2015) proposed an excellent highly versatile primer pair termed MiFish, which was designed for amplifying a hypervariable region with an average length of 172 bp within the 12S rRNA gene.The performance of MiFish primers is apparently high enough to detect species of most groups of fish and these primers have been used for biodiversity monitoring of fishes in various aquatic environments in and around six continents worldwide (Miya et al. 2020).Nonetheless, many authors (Miya et al. 2015, Morita et al. 2019, Takeuchi et al. 2019, Miya et al. 2020) mentioned that sequence variations in the region amplified using MiFish primers are insufficient for accurate species identification in several groups of fish, including tunas (Thunnus), eels (Anguilla) and salmonid species.
Damselfishes (family Pomacentridae) are a major component of the coral reef fishes (Depczynski et al. 2007).Currently, 430 species are recognised in this family (Fricke et al. 2024) and new species are still described at a growing pace (e.g.Allen and Allen (2021), Allen et al. (2022), Prokofiev and Astakhov (2023)).This family contains relatively recent radiated lineages (Cooper et al. 2009) and it is believed that there are no diferences in sequence differences between some combinations of species in these groups, especially in a short fragment of DNA.Therefore, we explored whether damselfish species could be discriminated using short mitochondrial DNA sequences.In this study, we determined the sequences of approximately 40 species of damselfish (Pomacentridae) collected in the coral reef of the Ryukyu Islands, Japan and investigated sequence variations in the MiFish region amongst pomacentrid fishes, based on large datasets compiled with other available sequences.

Collection of sequences
We used 44 pomacentrid species of 12 genera that we collected in the shallow waters of coral reefs around Okinawa-jima Island of the Ryukyu Archipelago, Japan, for sequence analysis.We examined more than one specimen for each species whenever possible, considering possible intraspecific variations (Table 1).Before genetic examinations, we attempted accurate species identification, based on morphology for each specimen following (Nakabou 2013, Allen et al. 2015, Wibowo et al. 2017, Wibowo et al. 2018and Tang et al. 2021).The pectoral fins on the right side were collected from the specimens and stored in 99% ethanol until use.The voucher specimens were fixed with 10% formalin and transferred to 99% ethanol and representative specimens were deposited in the Ichthyological Collection of the Okinawa Churaumi Aquarium (OCF-P10591 to OCF-P10681).The remaining specimens are stored in the authors' personal collection shown as POM in Table 1.Examination of sequence variations in partial mitochondrial 12S gene amongst ...

Genus
Examination of sequence variations in partial mitochondrial 12S gene amongst ...   1Total DNA was extracted from the tissue samples using DNeasy Blood & Tissue Kit (Qiagen), according to the manufacturer's instructions.A fragment of the first half of the mitochondrial 12S ribosomal RNA (12S) gene (approximately 750 bp), which covered the entire length of the MiFish region inside, was amplified by standard PCR using Takara Ex Taq (Takara).The primers used for PCR were L709 (5ʹ-TACACATGCAAGTCTCCGCA-3ʹ) and H1552 (5ʹ-ACTTACCGTGTTACGACTTGCCTC-3ʹ) (Inoue et al. 2000).The cycling profile in PCR amplification consisted of an initial denaturation step at 94℃ for 90 s followed by 30 cycles of denaturation at 94℃ for 30 s, annealing at 49℃ for 30 s, extension at 72℃ for 45 s and final extension at 72℃ for 5 min.The PCR products, after confirmation using 1% agarose gel electrophoresis, were purified by PEG precipitation and then cycle-sequenced using BigDye v. 3.1 (Applied Biosystems) with a cycling profile of initial denaturation at 96℃ for 60 s followed by 25 cycles of denaturation at 96℃ for 10 s, annealing at 50℃ for 5 s and final extension at 60℃ for 4 min.The direct sequence was commissioned to an external company (Macrogen Japan).The sequences we determined are available at the International Nucleotide Sequence Database (INSD) (accession numbers LC814855 to LC814984).In addition to the sequences we determined, homologous sequences of pomacentrid fish were obtained from INSD.We first obtained sequences of the first half of the mitochondrial 12S gene (725-773 bp) for 154 species.These sequences were also used to generate a separate dataset of the MiFish region (166-199 bp) by cutting off redundant regions from the sequences.We further obtained sequences for 39 species for which only sequences of the MiFish regions are available in the INSD.By combining with the sequences we determined by ourselves, we obtained 330 longer sequences of the first half of the 12S region (725-773 bp) for 165 species and 523 sequences of the MiFish region for 194 species (Table 1).Furthermore, 20 sequences of a possibly undescribed species of the genus Pomacentrus , which was referred to as "Minamiiso-suzumedai" by Motomura (2020), were included in our data.In this study, it was treated as "P.sp.minamiiso".

Preparation of datasets
To detect sequences with unreliable labels due to various errors (Claver et al. 2022), we tentatively constructed Neighbour-joining trees for sequences of the longer 12S region and those of the MiFish region, based on the Tamura-Nei distance after sequence alignment using the MEGA-X software (Kumar et al. 2018).The pairwise distances for conspecific sequences and those for interspecific, but congeneric, sequences were calculated for the longer 12S and MiFish sequences using the MEGA-X software.Then, the sequences that were assigned to strange positions in the trees and those demonstrating apparently low sequence similarity to other conspecific or congeneric sequences were scrutinised and excluded when they were uncertain.The species that were subjected to taxonomic revisions after sequence deposition to INSD were carefully consulted and scientific names were replaced with others in 36 cases according to the latest taxonomic system described by Tang et al. ( 2021) (Table 1).
After removing erroneous sequences, we prepared three datasets.The first was the "longer 12S dataset (725-773 bp)" that consisted of 301 sequences for 150 species (8.8% of the equences we obtained from INSD was excluded after scrutinisation; Table 1), of which 53 species have more than one sequence.It covered 70.2% (80/114) of pomacentrid species known from the Japanese waters and 36.1% of species known in the world.The second was the dataset of the MiFish region (166-199 bp) consisting of the same 301 sequences for the 150 species that was generated by cutting off redundant regions from the first dataset.The third dataset was of the MiFish region, but including additional sequences retrieved from INSD.This dataset consisted of 476 sequences for 183 species (9.3% of the sequences we obtained from INSD was excluded; Table 1), with 87 species having more than one sequence and covered 89.5% (102/114) of Japanese species and 43.8% of species in the world.

Evaluating the ability of DNA barcoding
NJ analyses and genetic distance calculations were performed for the three datasets to investigate the capability of the two DNA regions (the longer 12S and the MiFish regions) for species identification.We first evaluated their capabilities according to placement of the conspecific sequences in the NJ trees.The NJ trees, based on the Tamura-Nei distance, were drawn and subjected to the bootstrap analysis with 1,000 pseudoreplications using the MEGA-X software.Ratnasingham and Hebert (2013) recognised the following four categories in evaluation of the performance of DNA barcoding of a given DNA region, based on recognition of the operational taxonomic unit (OTU) using the NJ tree framework: 1) Match, in which all sequences of a given species were placed collectively and exclusively in a single OTU, indicating that the barcoding would be successful for that species; 2) Split, in which the sequences of a given species were assigned to more than one OTU, but not intermixed with other species, indicating that barcoding would be a failure; 3) Merge, in which all sequences of a given species were placed in a single OTU together with another species, indicating that barcoding would be a failure; and 4) Mixture, in which the sequences of a given species were assigned to more than one OTU and at least one sequence is shared by another species, indicating that barcoding would be a failure.We adopted these criteria in evaluations of the capabilities of the longer 12S and MiFish regions for the pomacentrid species.
It is a fact that the NJ analyses approach is not applicable to species for which only a single sequence was included in the datasets, although we could consider barcoding a failure when the single sequence of a given species is shared by another (i.e.equivalent to "Merge" in the abovementioned four categories).In other cases, we focused on the genetic distance of a given species with its most closely-related species compared with the genetic distance generally observed within a species.Uncorrected p-distance was calculated for combinations of the relevant sequences using the MEGA-X software.
For setting an upper threshold of genetic distance within a species, we referred to genetic distances within each of the 53 and 87 species for which more than one sequence is available in the longer 12S dataset and the third dataset of the MiFish region with the larger samples.Our analyses demonstrated that the maximum intraspecific genetic distances were < 0.01 for 86.8% (longer 12S) or 80.5% (MiFish) of the species, respectively (see "Results").Thus, we tentatively included 0.01 in the genetic distance as a threshold and judged a given species as "success" for barcoding when the genetic distance between that species and its closest relatives was > 0.01.

Evaluation of sequence difference, based on the NJ analysis
The NJ trees constructed, based on the longer 12S and MiFish datasets, are depicted in Figs.1-3.In the longer 12S NJ tree, conspecific sequences of 49 of the 53 species for which more than one sequence was available exclusively formed respective clades and were judged "Match" (Fig. 1).The bootstrap values for these 49 clades were 64-99 (96.7 on average).Sequences of the remaining four species were classified as "Mixture".In addition to these, 2 of 97 single-specimen species shared identical sequences with other species and were classified as "Merge".
In the NJ tree constructed, based on the MiFish dataset for the same samples, the sequences of 41 species exclusively formed respective clades (Fig. 2).Of the remaining 12 species, two were classified as "Split," four were classified as "Merge" and six were classified as "Mixture".In addition to these, 12 single-specimen species shared identical sequences with other species and were classified as "Merge".
Fig. 3 shows the NJ tree constructed, based on the third dataset with 476 MiFish sequences for 183 species.Conspecific sequences of 71 of the 87 species for which more than one sequence was examined exclusively formed respective clades.The bootstrap values for these clades were 37-100 (85.7 on average).Of the remaining 16 species, two were classified as "Split", seven were classified as "Merge" and seven were classified as "Mixture" (Table 2).In addition to these, 13 of 96 single-specimen species shared identical sequences with other species and were classified as "Merge".

Evaluation using genetic distance
Fig. 4 shows an overview of the genetic distance at the intraspecific and interspecific levels.In the longer 12S dataset, the average number of base substitutions amongst conspecific sequences was 3.3 bp (equivalent to 0.004 in the uncorrected p-distance), ranging from 0 to 20 bp (0-0.026 in the genetic distances).Of the 53 species for which sequences of more than one individual were available, 47 species (86.8%) exhibited genetic distances < 0.01.The remaining six species were Chrysiptera biocellata (up to 0.012 in genetic distances), C. cyanea (0.023), C. rex (0.023), Neopomacentrus cyanomos (0.018), Pomacentrus taeniometopon (0.026) and P. sp. minamiiso (0.023).Amongst these species, C. rex, P. taeniometopon and P. sp minamiiso formed respective mixture clades, whereas the remaining three species formed respective unique clades in the NJ analysis.sequence variations with genetic distances < 0.01.At the between-congeneric species level, the base substitutions were 0-41 bp, being equivalent to 0-0.248 of the uncorrected p-distance.These values were larger than those in the longer 12S dataset for most genera examined.Three genera, viz., Amphiprion, Chrysiptera and Pomacentrus, contained several pairs of species with p-distance < 0.01.In the longer 12S dataset, 29 of 97 single-specimen species exhibited genetic distances > 0, but < 0.01 with the closest species, suggesting that their sequence differences from other species were not sufficiently large for stable and accurate identification.In the MiFish datasets with the same sample series, 16 of 97 single-specimen species exhibited genetic distances > 0, but < 0.01 from the respective closest species.In the third dataset, 20 of 96 single-specimen species exhibited genetic distances > 0, but < 0.01 from the respective closest species (Table 2).These species that fell below the threshold could be discriminated from any other species according to the sequences in the present datasets, but they were judged as species that require special cautions in identification, based on DNA barcoding.

Discussion
Accurate species identification using DNA barcoding can be achieved when sufficient sequence differences exist between all pairwise combinations of species in the targeted DNA region (Ward 2009).Whether sequence differences exist between species depends on the region and length of the DNA fragments used and, therefore, we investigated sequence differences amongst pomacentrid species in two DNA regions, viz. the first half of mitochondrial 12S DNA (725-773 bp) and the MiFish region (166-199 bp).Our analyses confirmed that > 96.0% of the 150 pomacentrid species examined in this study can be discriminated by sequences of the longer 12S DNA region.Although our evaluation of sequence differences for single-specimen species using genetic distance cautioned that the degree of sequence difference of 31 species (20.7% of all species examined) is not sufficiently large for reliable species identification, it is useful for the tentative identification of these species.Exceptions are three pairs of species, viz.Abudefduf sexfasciatus and A. vaigiensis, Pomacentrus caeruleus and P. caeruleopunctatus and P. taeniometopon and P. sp. minamiiso, which share at least one identical sequence.
When the MiFish dataset was used for the same 150 species, an additional five pairs and two groups of species could not be discriminated from each other because of the lack of variable sites (Table 2) and 24 of the 150 species (16.0%) could not be identified.An additional 16 species exhibited genetic distances < 0.01 from the respective closest relatives in the MiFish region and their identification requires some caution.Therefore, the longer 12S region exhibits better performance in discriminating pomacentrid species.However, the MiFish region has a merit for a higher probability of successful PCR amplification, especially when the quality of template DNAs is low (Deagle et al. 2006, Bylemans et al. 2018, Wei et al. 2018).Our analysis on the third dataset suggested that we could identify 83.6% of pomacentrid species to the species level using the MiFish region.This hypervariable region is useful in eDNA metabarcoding for pomacentrid fishes with some caution, as shown in Table 2.
Another merit of the MiFish region is a greater number of sufficient sequence references.Currently, we could find 1.5 times more sequences of 1.2 times more species of Pomacentridae in INSD than that in the longer 12S region.This advantage would be significant in accurate species identification, compensating for its possible limitation of less variable sites.
A possible failure of species identification might also arise from genetic introgression through hybridisation (Ward et al. 2009, Collins andCruickshank 2013).In that case, the barcoding does not function well for the involved species irrespective of the length and region of DNA.In pomacentrid fishes, hybridisation has been reported for Abudefduf vaigienis and A. sexfasciatus (Montanari et al. 2016, Bertrand et al. 2017). Bertrand et al. (2017) demonstrated that these two species could be discriminated using nuclear DNA markers, but shared identical sequences in the mitochondrial cytochrome b gene in a broad geographical area, suggesting extensive mtDNA introgression.The absence of sequence differences between these two species in our datasets appears to be this case.Nevertheless, hybridisation appears to be rare amongst benthic nesting fishes, including the members of Pomacentridae, that engage in deliberate pair spawning and parental care for the eggs laid on a substrate until hatching (Yaakub et al. 2006).Only a few cases of hybridisation have been reported for the pomacentrid fish (e.g.van Herwerden and Doherty (2006), Coleman et al. (2014), He et al. (2019)).Therefore, the adverse effects of hybridisation on DNA barcoding for this group are considered limited.
Another source of obstacles in DNA barcoding is taxonomic confusion.Our examinations for DNA sequence differences amongst pomacentrid fish revealed taxonomic confusion in P. taeniometopon and its relatives.Motomura (2020) recognised a possibly undescribed species of the genus Pomacentrus from the Japanese waters, which is referred to as Pomacentrus sp.minamiiso.According to Nakabou (2013), P. sp minamiiso morphologically resembles P. taeniometopon, but is distinguished from the latter in having pointed and denser suborbital protrusions.Kato (2011) mentioned that P. sp minamiiso differs from P. taeniometopon in having a black spot at the top of the operculum.We recognised four morphotypes in our specimens collected from Okinawan waters, viz. 1) typical P. taeniometopon that lacks both pointed and denser suborbital protrusions and black spot at the top of the operculum, 2) a form having both, 3) a form having pointed and denser suborbital protrusions only and 4) a form having a black spot at the top of the operculum only (Table 1).Our DNA analysis supported recognitions of the first and third forms as respective clades, but the remaining two were intermixed with each other.The NJ tree constructed, based on the MiFish region with larger samples, further suggested that another species, P. adelus, was included in this species complex (Fig. 3).Further taxonomic studies are warranted for this group.
Another case of possible taxonomic confusion involves Chrysiptera rex and its relatives.Our analyses demonstrated that the genetic distance within C. rex was very large up to 0.023 in the longer 12S dataset and 0.048 in the MiFish dataset.Two morphologically resembling species, C. chysocephala and C. caesifrons, were described previously (Allen et al. 2015).According to the sequences obtained from INSD, C. rex and C. chysocephala are confused with each other (Fig. 3).Although two sequences (FJ616 332 and JN935816), labelled as C. rex, appear to be C. chysocephala or C. caesifrons, we refrain from replacing their labels because of the lack of any information concerning voucher and locality for these sequences.It is desired to revise the label of the relevant sequences in INSD for future DNA barcoding.
The family Pomacentridae is a diversified group containing more than 430 extant species (McCord et al. 2021).Our examinations for sequence variations in the longer 12S and MiFish regions of the mitochondrial DNA confirmed that 96.0% and 83.6%, respectively, of the examined species can be identified to the species level, demonstrating the effectiveness and limitation of DNA barcoding for Pomacentridae using these DNA regions.Our study also revealed more sequence references for the MiFish region, suggesting the possible advantage of this region in DNA barcoding (Weigand et al. 2019).In contrast, the longer 12S region is advantageous in terms of higher resolution in species discrimination.Considering that the latter region contains the MiFish region inside, we recommend the accumulation of longer sequences unless the cost required for this is much different.DNA barcoding, using mitochondrial DNA, is potentially interfered with pseudogenes or numts, and one of the steps to avoid such a confusion is careful comparisons with closely-related published sequences (Song et al. 2008). Gold et al. (2021) indicated that it is meaningful to expand the DNA database for a particular regional fauna in a step-by-step manner.The sequences we provided in this study contribute to the expansion of references for the Japanese Pomacentridae.Further studies are required for filling the gap of taxon sampling for more successful DNA barcoding.
Examination of sequence variations in partial mitochondrial 12S gene amongst ...

Figure 1 .
Figure 1.The Neighbour-joining tree constructed, based on the longer 12S dataset for 301 sequences of 150 pomacentrid species.Identical sequences are not merged and, thus, all the 301 sequences appear as terminal OTUs.Those sharing identical sequences are expressed as branch length of zero.The bootstrap values are shown for species were judged "Match".

Figure 2 .
Figure 2. The Neighbour-joining tree constructed, based on the MiFish dataset for 301 sequences of 150 pomacentrid species.The bootstrap values are shown for species were judged "Match".

Figure 3 .
Figure 3.The Neighbour-joining tree constructed, based on the third dataset of the MiFish region for 476 sequences of 183 species.The bootstrap values are shown for species were judged "Match".

Figure 4 .
Figure 4. Intraspecific and interspecific variations in the sequences of the longer 12S and MiFish regions for each genus of Pomacentridae.

Table 1 .
List of the sequences we obtained by sequencing and from International Nucleotide Sequence Database (INSD).Sequences with voucher numbers are those determined here, based on specimens examination.Sequences without any marks in the columns of Datasets 1-3 are those taken from INSD, but excluded after scrutinisation for certainty.Note that species names of some sequences taken from INSD were revised in this list followingTang et al. (2021)taxonomic revision and that the original names deposited in INSD are given in the rightmost column

Species Voucher number GenBank accession codes Dataset 1 Dataset 2 Dataset 3 Original names deposited in INSD
Examination of sequence variations in partial mitochondrial 12S gene amongst ...

Genus Species Voucher number GenBank accession codes Dataset 1 Dataset 2 Dataset 3 Original names deposited in INSD
Examination of sequence variations in partial mitochondrial 12S gene amongst ... Examination of sequence variations in partial mitochondrial 12S gene amongst ... Examination of sequence variations in partial mitochondrial 12S gene amongst ... Examination of sequence variations in partial mitochondrial 12S gene amongst ... Examination of sequence variations in partial mitochondrial 12S gene amongst ... Examination of sequence variations in partial mitochondrial 12S gene amongst ... Examination of sequence variations in partial mitochondrial 12S gene amongst ...