Metabarcoding data of mitochondrial cytochrome c oxidase subunit 1 gene from bulk community of aquatic organisms collected from Nara Prefecture, Japan

Riverine metabarcoding data were obtained from the Takamigawa River, a tributary of the Kinokawa River, in Nara Prefecture (Central Honshu, Japan). We extracted DNA from bulk community samples of aquatic organisms, most of which could not be morphologically identified at species level due to their small body size (0.12 - 2 mm length). A partial coding region of the mitochondrial cytochrome c oxidase subunit 1 gene (cox1) was amplified using PCR, and the amplicon was subjected to high-throughput parallel sequencing (Illumina MiSeq). The 313 bp paired-end sequence reads were classified into operational taxonomic units (OTUs), their species boundaries were delineated using the Generalised Mixed Yule Coalescent (GMYC) method, and taxonomic names of the GMYC species were assigned using basic local alignment search tool (BLAST) against International DNA Databases (INSD: GenBank, ENA, and DDBJ).


a b s t r a c t
Riverine metabarcoding data were obtained from the Takamigawa River, a tributary of the Kinokawa River, in Nara Prefecture (Central Honshu, Japan). We extracted DNA from bulk community samples of aquatic organisms, most of which could not be morphologically identified at species level due to their small body size (0.12 -2 mm length). A partial coding region of the mitochondrial cytochrome c oxidase subunit 1 gene ( cox1 ) was amplified using PCR, and the amplicon was subjected to high-throughput parallel sequencing (Illumina MiSeq). The 313 bp paired-end sequence reads were classified into operational taxonomic units (OTUs), their species boundaries were delineated using the Generalised Mixed Yule Coalescent (GMYC) method, and taxonomic names of the GMYC species were assigned using basic local alignment search tool (BLAST) against International DNA Databases (INSD: GenBank, ENA, and DDBJ).

Value of the Data
• The metabarcoding result enables us to evaluate the efficacy of DNA-based biodiversity assessment in comparison with conventional morphology-based species identification data collected from the same localities. • The new OTU clustering algorism that compare the read abundances between neighbouring OTUs is available as an R script, which could be used widely to process amplicon sequence data. • The results are beneficial for researchers working on both DNA metabarcoding and conventional morphology-based faunal surveys. • The data will serve as a reference barcode sequence data of benthic fauna in mountain streams in the Central Honshu, Japan.

Data Description
Raw sequence reads of partial coding regions of cox1 in riverine metagenomes are deposited in DNA Data Bank of Japan (DDBJ). DDBJ accession numbers DRR311861 and DRR311862 were obtained from the run environment (slow riffle) of Aridoshi (ARD), DRR311863-DRR311865 from the rapid environment of Shisetsu (SST), and DDR311866-DRR311868 from the run environment of Shisetsu (SST). There were 472,697 reads in total of DRR311861-DRR311868, and they were classified into 888 OTUs.
Phylogenetic tree of the 888 OTUs was reconstructed by MrBayes 3.2.7a based on the GTR + I + G molecular clock model, and the species boundaries were delineated using the Generalised Mixed Yule Coalescent (GMYC) method ( Fig. 1 ). A total of 369 GMYC species were presumed from 888 OTUs.
Representative sequences of the GMYC species (the most abundant OTUs in the species clusters; marked with GMYC IDs in Fig. 1 ) were queried on INSD, and the sequence data showing the highest similarity were retrieved ( Table 1 ). According to the critical similarity scores for species identification [1] , OTUs showing similarity scores (percent identity) greater than 95% are considered to be properly deduced. In the present results, 137 out of 369 GMYC species were assigned to INSD with > 95% similarity. The DNA barcode references of the remaining unassigned GMYC species may not have been registered in INSD yet. These unassigned GMYC species can be extrapolated to coarser level taxonomic names (i.e., genus, family, or order) according to the critical similarity scores proposed by Inai et al. [1] .

Collection of specimens
Benthic organisms were collected using a hand net (mesh size: 200 per inch) in a perturbed riverbed by kicking at two sampling stations: ARD (on 17 March 2017) and SST (on 29 June 2017). Sampling was performed in the run environment (slow riffle) of both stations ARD and SST, and in rapid environment of SST (three community samples in total). The specimens were immersed in 70% ethanol until DNA was isolated. DNA of the three bulk community samples (one from ARD and two from SST) were separately extracted using conventional proteinase K treatment/phenol extraction/ethanol precipitation, as previously described [3] .

DNA sequencing
The DNA libraries for sequencing were constructed by 2-step tailed PCR. The first PCR amplifications of the target gene (partial coding region of cox1 ) were performed using PrimeSTAR® HS DNA Polymerase (Takara Bio Co. Ltd., Kusatsu, Japan) for 35 cycles of 98 °C-10 s / 40 °C-5 s / 72 °C-60 s with an initial incubation at 95 °C for 1 min and a final elongation at 72 °C for 7 min, using a set of primers IntF and HCOm [2] . The nucleotide sequences of the primers (containing the linker for the second PCR) were as follows: IntF, ACACTCTTTCCCTACACGACGCTCTTCCGATCTGGWACWGGWTGAACWGTWTAYCCYCC; HCOm, GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAHACTTCNGGGTGKCCRAARAATCA.
The amplified fragments were purified by AMPure XP (Beckman Coulter, Brea, CA, USA). The second PCR was performed using ExTaq® DNA Polymerase (Takara Bio Co. Ltd.) for 12 cycles of 94 °C-30 s / 60 °C-30 s / 72 °C-30 s with an initial incubation at 94 °C for 2 min and a final elongation at 72 °C for 5 min, with a set of DNA primers containing octanucleotide tag for indexing. The nucleotide sequences of primers were as follows: 2 nd -F, AATGATACGGCGACCACCGAGATCTACAC-[octanucleotide tag]-ACACTCTTTCCCTACACGAC GC; 2 nd -R, CAAGCAGAAGACGGCATACGAGATTAGCTGCAGTGACTGGAGTTCAGACGTGTG.
Two technical replicates for ARD and three technical replicates for each of two SST samples were prepared with different tags. The tag sequences for DRR311861, DRR311862, DRR311863, DRR311864, DRR311865, DRR311866, DRR311867, and DRR311868 were TCGACTAG, TTCTAGCT, CC-TAGAGT, GCGTAAGA, CTATTAAG, AAGGCTAT, GAGCCTTA , and TTATGCGA , respectively. Quality of          Fig. 1 is divided into 1a and 1b, both of which are linked at the node indicated by the closed circle ( •). groups with a single nucleotide difference were combined each other to form an OTU, and a sequence with the largest number of reads within the OTU was defined as the OTU representative sequence. A total of 888 OTUs were obtained and subjected to GMYC analysis. The R script used to produce OTUs and OTU representatives in FASTA format are deposited in Mendeley Data (DOI: 10.17632/pfhf4kt37s.1 ).

Estimation of GMYC species
A phylogenetic tree of 888 OTUs was reconstructed using Markov chain Monte Carlo (MCMC) methods by MrBayes 3.2.7a with the GTR + I + G molecular clock model. GMYC species boundaries were estimated using SPLITS package in R.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Operational taxonomic units of cox1 amplified from bulk community DNA samples in Takami gawa River, Nara, Japan (Original data) (Mendeley Data). cox1 amplicon sequences of riverine metagenomes (Original data) (DDBJ (DNA Data Bank of Japan)).