Data on taxonomic annotation and diversity of 18S rRNA gene amplicon libraries derived from high throughput sequencing

This Data in Brief article provides supporting information for the research article entitled “Protistan community composition in anoxic sediments from three salinity-disparate Japanese lakes” by Kataoka and Kondo (2019) [1]. It summarises the 18S rRNA gene sequences obtained from the anoxic sediments of three lakes in two seasons using high-throughput sequencing (MiSeq, Illumina). Supergroup-level taxonomy was compared between a SINA search against the SILVA database and a BLASTn search against the PR2 database. Alpha diversity was calculated for each sample, and beta diversity was calculated among the six amplicon libraries. The partial sequence length between the primers 574*f and 1132R (Hugerth et al., 2015) was compared between the forward read and the combined read.


Data
Raw reads from MiSeq were quality controlled and grouped into OTUs at the 98% sequence similarity level, and OTUs consisting of only one sequence (singletons) were then removed (Table 1). Methods for annotating the taxonomic path of the representative 18S rRNA gene sequence of each OTU were compared to identify a suitable method for assigning supergroup-level taxonomy (Table 2). Alpha diversity was compared by calculating a rarefaction curve for each sample (Fig. 1), and beta diversity was determined by similarity profile analysis of all samples (Fig. 2). The partial sequence length between the forward and reverse primers was compared between independently generated query sequences (Fig. 3).
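A rarefaction curve of the kind referred to above plots the expected number of OTUs against the number of reads subsampled. The following is a minimal sketch in Python, assuming Hurlbert's analytical rarefaction formula and a hypothetical list of per-OTU read counts. The function names and the example counts are illustrative only and are not taken from the article.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of OTUs in a random subsample of n reads.

    Uses Hurlbert's formula: sum over OTUs of
    1 - C(N - N_i, n) / C(N, n), where N is the total read count
    and N_i is the read count of OTU i.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the total read count")
    denom = comb(total, n)
    # comb() returns 0 when n exceeds (total - c), so abundant OTUs contribute 1.
    return sum(1 - comb(total - c, n) / denom for c in counts)


def rarefaction_curve(counts, step):
    """Return (depth, expected richness) pairs from 0 reads up to all reads."""
    total = sum(counts)
    depths = list(range(0, total, step)) + [total]
    return [(n, expected_richness(counts, n)) for n in depths]


if __name__ == "__main__":
    # Hypothetical read counts for four OTUs in one sample.
    sample = [50, 30, 15, 5]
    for depth, richness in rarefaction_curve(sample, 20):
        print(depth, round(richness, 3))
```

Subsampling every read recovers every OTU, so `expected_richness([5, 3, 2], 10)` returns 3.0. A curve that flattens before reaching the full depth suggests that sequencing effort was sufficient for that sample.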

Experimental features
Genomic DNA was extracted from anoxic sediment in lakes. Amplicons were generated using a primer set of 574*f and 1132R.

Data source location
Lakes Hiruga and Suigetsu in the Mikata Lake Group in Fukui Prefecture and Lake Biwa in Shiga Prefecture, Japan.

Data accessibility
Analysed data are presented in the article. Raw

Value of the data
Comparing methods for annotating the taxonomic path of 18S rRNA gene sequences is valuable because the sequences in public databases are still insufficient for identifying diverse eukaryotic microbes. Information on the partial sequence length between the forward and reverse primers is valuable for understanding protistan composition in natural environments inhabited by unknown microbes. These data provide a rare example of alpha and beta diversities of protistan genotypes in lacustrine sediments.

Experimental design, materials, and methods
Lacustrine sediments were collected from the southern basin of Lake Biwa and the central basins of Lake Suigetsu and Lake Hiruga using an Ekman-Birge-type bottom sampler (RIGO, Saitama, Japan) [1]. Surface sediment was subsampled from the 0–5 cm depth using a syringe with the needle end cut off. Total nucleic acids were extracted from 0.5 g sediment samples using a FastDNA Spin Kit for Soil (MP Biomedicals, LLC, Solon, OH) according to the manufacturer's instructions. An amplicon library for high-throughput sequencing of protistan 18S rRNA genes was constructed using a primer set targeting the V4–V5 hypervariable region of these genes: 574*f (5′-CGGTAAYTCCAGCTCYV-3′) and 1132R (5′-CCGTCAATTHCTTYAART-3′) [2]. PCR amplification was performed in a 25 µL reaction mixture containing 1× KAPA HiFi HotStart ReadyMix (KAPA Biosystems), 0.3 µM of each primer and 3 µL of ten-fold diluted gDNA, corresponding to 0.4–1.3 ng of gDNA, under the following cycling conditions: heating to 94 °C for 3 min to activate the hot-start DNA polymerase; 30 cycles of 94 °C for 30 s, annealing at 51 °C for 30 s and elongation at 72 °C for 45 s; then a final elongation at 72 °C for 7 min. Amplicons of the expected length of 560 bp, as determined by agarose gel electrophoresis, were purified and labelled with an index primer set attached to both the 5′ and 3′ ends (NEBNext Multiplex Oligos, New England BioLabs), then sequenced using the MiSeq Reagent Kit v3 for 2 × 300 bp (Illumina, CA, USA). All generated sequence reads were de-multiplexed according to the index primers and processed using the software package Claident ver. 0.2.2017.07.26 [3], as previously described with a minor modification [4]. To generate paired-end sequences, forward and reverse reads whose ends overlapped by >50 bp were combined using VSEARCH. Combined reads of >400 bp length with a quality value of >30 were used to establish operational taxonomic units (OTUs) at a 98% cut-off level. OTUs detected as a single read across all samples (singletons) were omitted because they were too numerous, accounting for 21%–43% of OTUs (Table 1).
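The singleton filter described above can be illustrated with a short Python sketch. It assumes a hypothetical OTU table stored as a dictionary that maps each OTU name to its read counts per sample. The function name, table layout and example values are assumptions made for illustration and do not reproduce the Claident workflow used for the data.

```python
def remove_singletons(otu_table):
    """Drop OTUs represented by a single read across all samples.

    otu_table maps an OTU name to a list of read counts, one per sample.
    Returns the filtered table and the fraction of OTUs that were removed.
    """
    kept = {otu: counts for otu, counts in otu_table.items() if sum(counts) > 1}
    n_removed = len(otu_table) - len(kept)
    fraction_removed = n_removed / len(otu_table) if otu_table else 0.0
    return kept, fraction_removed


if __name__ == "__main__":
    # Hypothetical counts for four OTUs in three samples.
    table = {
        "OTU_1": [10, 0, 3],
        "OTU_2": [1, 0, 0],   # singleton: one read in total
        "OTU_3": [0, 1, 1],   # two reads in total, so it is kept
        "OTU_4": [0, 0, 1],   # singleton
    }
    filtered, fraction = remove_singletons(table)
    print(sorted(filtered), fraction)
```

An OTU with one read in each of two samples is retained here, because the criterion is the total read count across all samples rather than the count within any one sample.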
A representative sequence of each OTU was filtered to split the sequences into ribosomal RNA (rRNA) and non-rRNA genes using riboPicker [5], and both rRNA and non-rRNA sequences were identified using the SINA programme [6] with reference to the SILVA database (SSURef_NR99_132 [7]). The taxonomic path for both rRNA and non-rRNA sequences was also obtained from the top hit of a BLASTn search [8] with reference to the PR2 database (ver. 4.10.0 [9]). A p-value cut-off of 1 × 10⁻⁵⁰ was used to remove non-rRNA genes [10]. To focus on potentially heterotrophic protists, fungal and autotrophic sequences were removed according to the PR2 taxonomy path. Rarefaction curves were calculated using the vegan package, ver. 2.4 [11]. Similarity profile analysis was conducted using the clustsig package, ver. 1.1. Dissimilarity was calculated from the relative abundance of sequence reads using the Bray-Curtis index, and significantly distant samples were clustered using Ward's method. All statistical analyses were conducted using R software ver. 3.3.2 (http://cran.rproject.org).

Fig. 2. Similarity profile analysis to detect significant clusters (p < 0.05). Dissimilarity was calculated from the relative abundance of sequence reads using the Bray-Curtis index, and significantly distant samples were clustered using Ward's method.
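The Bray-Curtis step can be sketched as follows. This is a minimal Python illustration that assumes hypothetical per-sample read counts and covers only the dissimilarity calculation on relative abundances. The Ward clustering and the similarity profile permutation test, which were run in R for the data, are not reproduced, and all names below are illustrative.

```python
def relative_abundance(counts):
    """Convert read counts for one sample into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads")
    return [c / total for c in counts]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    if len(u) != len(v):
        raise ValueError("samples must cover the same OTUs")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities between all samples.

    samples maps a sample name to a list of read counts per OTU,
    with OTUs in the same order for every sample.
    """
    rel = {name: relative_abundance(counts) for name, counts in samples.items()}
    return {(a, b): bray_curtis(rel[a], rel[b]) for a in rel for b in rel}


if __name__ == "__main__":
    # Hypothetical read counts for two OTUs in three libraries.
    libraries = {"A": [3, 1], "B": [1, 1], "C": [0, 5]}
    for pair, value in dissimilarity_matrix(libraries).items():
        print(pair, round(value, 3))
```

Converting to relative abundance first means that two libraries with the same composition but different sequencing depth have a dissimilarity of 0.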