Development of a cost-effective metabarcoding strategy for analysis of the marine phytoplankton community

We developed a cost-effective metabarcoding strategy to analyze phytoplankton community structure using the Illumina MiSeq system. The amplicons (404–411 bp) obtained by end-pairing of two reads were sufficiently long to distinguish algal species and provided barcode data equivalent to those generated with the Roche 454 system, but at less than 1/20th of the cost. The original universal primer sequences targeting the 23S rDNA region and the PCR strategy were both modified, and this resulted in higher numbers of eukaryotic algal sequences by excluding non-photosynthetic proteobacterial sequences supporting effectiveness of this strategy. The novel strategy was used to analyze the phytoplankton community structure of six water samples from the East/Japan Sea: surface and 50 m depths at coastal and open-sea sites, with collections in May and July 2014. In total, 345 operational taxonomic units (OTUs) were identified, which covered most of the prokaryotic and eukaryotic algal phyla, including Dinophyta, Rhodophyta, Ochrophyta, Chlorophyta, Streptophyta, Cryptophyta, Haptophyta, and Cyanophyta. This highlights the importance of plastid 23S primers, which perform better than the currently used 16S primers for phytoplankton community surveys. The findings also revealed that more efforts should be made to update 23S rDNA sequences as well as those of 16S in the databases. Analysis of algal proportions in the six samples showed that community structure differed depending on location, depth and season. Across the six samples evaluated, the numbers of OTUs in each phylum were similar but their relative proportions varied. This novel strategy would allow laboratories to analyze large numbers of samples at reasonable expense, whereas this has not been possible to date due to cost and time. In addition, we expect that this strategy will generate a large amount of novel data that could potentially change established methods and tools that are currently used in the realms of oceanography and marine ecology.


INTRODUCTION
Traditional phytoplankton surveys have relied mainly on microscopic observations, which require significant time and trained experts to obtain accurate data. This has created a bottleneck and is a major consideration in research planning. For these reasons, there is a need for a faster and more reliable standard method of performing detailed analysis of phytoplankton community structures. Alternatively, phytoplankton communities can be roughly analyzed based on different pigments using flow-cytometry (Dubelaar et al., 2007), which can sometimes be integrated with digital imaging systems (Campbell et al., 2010;Kachel & Wietzorrek, 2000;Olson, Shalapyonok & Sosik , 2003). However, such strategies aimed at replacing microscopic observations are still far from reliable for discriminating the numerous phytoplankton taxa.
To improve results from phytoplankton community structure analyses, molecular techniques are now being applied. The most widely used DNA markers include two protein-coding genes, protein D1 of photosystem-II reaction center (psbA) and a large subunit of the ribulose-1,5-diphophate carxoylase/oxygenase (rbcL) or plastid 16S rDNA region (Kirkham et al., 2013;Man-Aharonovich et al., 2010;McDonald et al., 2007;Paul, Alfreider & Wawrik, 2000;Zeidner et al., 2003). In particular, the 16S rDNA sequence has been well documented in the GenBank database (http://www.ncbi.nlm.nih.gov/) and more than 6,490 phytoplanktonic 16S sequences are currently deposited in the PhytoREF database (http://phytoref.org). Although there is still debate, universal primers targeting 16S rDNA are now most widely used Herlemann et al., 2011). However, it has been difficult to design a universal primer set that will amplify all phytoplankton taxa from cyanobacteria to eukaryotic algae in the 16S rDNA region, and most studies have analyzed specific taxonomic groups, especially for the bacterial communities (Asudi et al., 2016;Cruaud et al., 2014;Kitamura et al., 2016;Le Bescot et al., 2015;Logares et al., 2014;Massana et al., 2015;Valenzuela-González et al., 2016;Vierheilig et al., 2015). Although the sequence information from the 23S rDNA region is only a subset of that for 16S rDNA in the database, this region is considered an important marker for phytoplankton community structure when designing better universal primer sets that will cover most phytoplankton taxa (Folmer et al., 1994;Sherwood & Presting, 2007).
The rapid pace of technological advancement in DNA sequencing and bioinformatics analysis has meant that researchers can now analyze phytoplankton community structure simply by massive sequencing of DNA barcodes that are amplified from environmental samples, a process known as ''metabarcoding.'' In fact, several trials have already been conducted to analyze algal community structure using next generation sequencing (NGS) technology and generating large amounts of data for large survey areas at low cost, as compared with the Sanger method which uses dideoxynucleotide analogs (Eiler et al., 2013;Steven, McCann & Ward , 2012). These studies were dependent on the Roche 454 platform because its maximum read length (approximately 700 bp) is sufficient to identify to the species level. However, the cost for sequencing with the Roche 454 system remains high, and the Illumina platform has been introduced to replace this system and produce the same amount of sequence information at less than 1/100th of the cost (Luo et al., 2012).
In fact, virtually all studies of microbial community structure are now being conducted using the Illumina sequencer (Caporaso et al., 2012;Degnan & Ochman, 2012;Jiang et al., 2013;Logares et al., 2014;Schmidt et al., 2013;Shakya et al., 2013;Staley et al., 2013) Still, phytoplankton community structure analysis using the Illumina system has yet to be widely applied, mainly due to the lack of universal primers that are suitable for the Illumina sequencer.
In this study, we developed a novel strategy of targeting the 23S rDNA region for phytoplankton community analysis using the MiSeq system (Illumina, San Diego, CA, USA). Performing large reads (404-411 bp) by paired-end sequencing (2X 300 bp) enabled us to analyze phytoplankton community structure at reduced cost (less than 1/20th of the cost with the Roche 454 system) and with high resolution. Our modified ''universal primers'' targeting the 23S rDNA region achieved a high level of coverage, amplifying most taxa of prokaryotic and eukaryotic phytoplankton while excluding the non-photosynthetic proteobacterial sequences. The reliability of this NGS strategy was confirmed by analyzing phytoplankton samples from different locations, depths, and seasons in the East/Japan Sea.

Sample collection and DNA extraction
Sampling was carried out in May and July 2014 under permitting (Department of Fisheries Resource Management-1728Management- & 2506 for the project ''Long-term change of structure and function in marine ecosystems of Korea,'' funded by the Ministry of Oceans and Fisheries, Korea. Field water samples from the surface (0 m) and at 50 m depth were collected using 10 L Niskin bottles at two sampling stations in the southwestern East/Japan Sea, Korea. Among various sample site options, station I (StI) was chosen as representative coastal water, whereas station II (StII) was in open sea. Samples were labeled with collection date, location, and depth, as follows: MIS: surface water from station I in May 2014 Station I surface water; MIIS: surface water from station II in May 2014; MIID: 50 m water from station 2 in May 2014; JIS: surface water from station I in July 2014; JIIS: surface water from station II in July 2014.
Water temperature and salinity were measured using an SBE-43 unit (Sea-Bird Electronics, USA). One liter of water was collected from each sampling site and filtered using a 0.45-µm GH Polypro membrane filter (Pall Corporation, New York, NY, USA).
Chlorophyll-a concentration was measured using a Turner 10AU fluorometer (Turner Design Co., San Jose, CA, USA) as described by Parson, Maita & Lalli (1984). The filtered sample was ground into a fine powder using a mortar and pestle with liquid nitrogen. The total DNA from the homogenized phytoplankton was extracted using a DNeasy R plant mini kit (Qiagen, Hilden, Germany) following the manufacturer's instructions. Extracted DNA was quantified and qualified using a NanoDrop spectrophotometer ND-1000 (Thermo Scientific, Waltham, MA, USA) and was stored at −70 • C until it was used.

Library preparation and sequencing
The ''universal'' primers for the mini-barcode were redesigned to target the plastid 23S rDNA region (Sherwood & Presting, 2007). According to multiple sequence alignments of 1,683 23S rDNA sequences (545 from eukaryotes and 1,138 from prokaryotes), several variations were detected in the forward primer. Based on these findings, the forward primer to be used were modified to increase specificity for phytoplankton taxa (Table 1). To lower the PCR-based bias, a nested PCR strategy with low cycle numbers was adopted. Two forward primers (A23SrVF1:GGACARAAAGACCCTATG and A23SrVF2: CARAAAGACCCTATGMAGCT) and two reverse primers (A23SrVR1:AGATCAGCCTGT TATCC and A23SrVR2: TCAGCCTGTTATCCCTAG) were designed and synthesized (Macrogen, Republic of Korea). The first and second PCR amplifications were performed under the following cycling conditions: initial denaturation at 94 • C for 3 min, followed by 15 cycles at 94 • C for 30 s, 55 • C for 30 s, and 72 • C for 30 s, with a final extension at 72 • C for 3 min. The first PCR reaction mixture (40 µL) contained 10 ng of template, 2 µL of each primer (20 pmol), 4 µL of dNTPs (10 mM), 0.4 µL Ex Taq Hot Start Version (2 U) (Takara Bio Inc., Shiga, Japan), and 4 µL 10X buffer. The first amplicon was purified using the AccuPrep R Gel Purification Kit (Bioneer, Republic of Korea) and eluted with 20 µL elution buffer. The second PCR reaction mixture was the same as the first except that 2 µL of purified PCR product from the first PCR amplification was used. As with the first round, 2 µL of each primer was used for the second PCR (20 pmol). The products were separated by 1.5% agarose gel electrophoresis and stained with loading star (Dynebio, Sungnam, Republic of Korea). Amplicons of the expected sizes (approximately 410 bp) were purified using the AccuPrep R Gel Purification Kit (Bioneer, Daejeon, Republic of Korea). A library was constructed from the NGS data using the TruSeq R Sample Preparation kit V2 (Illumina, USA). The quality and quantity of the library were measured using a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and sequencing was performed using Illumina MiSeq (2 X 300 bp pair-ends) (Illumina, USA).

Evaluation of the modified algal plastid 23S rDNA universal primers
As noted, the first step of the study was to modify the original ''universal'' primers targeting the plastid 23S rDNA region such that they excluded the non-photosynthetic proteobacteria (Table 1). Amplicons obtained from between the original universal primers and between the modified primers were compared (Table 2). In total, 185,773 (63 OTUs) and 625,917 (106 OTUs) amplicons were obtained using the original and modified primers, respectively (Table 2). Thirty-three OTUs were commonly identified by both primers. Thirty OTUs were exclusively identified from the sequence data generated with original universal primers, and all were proteobacteria. Seventy-three OTUs were exclusively identified from the sequence data generated with the modified primers, and all were eukaryotic algae. The difference in OTU numbers indicated that the taxonomic proportions of amplicons differed between the two sets of sequence data. Regarding the amplicons generated using the original universal primers, more than 84% were either unknown (43.9%) or proteobacteria (41.06%) according to the sequence data (Table 2). In contrast, regarding the amplicons generated using the modified primers, less than 24% were unknown (19.52%) or proteobacteria (4.55%), which reflected the presence of more algal amplicons and the elimination of non-photosynthetic proteobacteria ( Table 2). The proportions of eukaryotic algal sequences belonging to the Haptophyta and Dinophyta were also greater when the modified primers were used. Collectively, the original universal primers were improved by the modifications that were made to the sequences.

Construction of the algal ribosomal large subunit metabarcode database
To determine whether the modified 23S primers and PCR conditions could be directly applied to the phytoplankton survey, the described six samples were chosen and analyzed according to the different locations, seasons, and depths of the East/Japan Sea. After processing the raw reads, the number of end-paired amplicons obtained from each sample ranged from 290,082 to 639,836, and averaged 438,896 (Table 2). After several data processing steps, 345 OTUs were obtained from six samples combined (Table 2). Despite relatively small numbers of OTUs, only 95 (27.5%) matched the sequences previously listed in the GenBank database, indicating much lower numbers of the plastid 23S rDNA sequences compared with those of 16S rDNA in the database. In addition to the 95 OTUs identified, the remaining 250 OTUs were added to the PHYTOGEN database after assigning taxonomic ranks (phyla and classes) based on sequence similarity.
To determine the cut-off value for phylum similarity, the OTUs obtained were screened from 90% to 99% (Fig. 2). The data revealed minimal differences in phyla between 90% and 97%, whereas the presence of significant unclassified OTUs increased as the cut-off was shifted 97-99% (Fig. 2). Considering this, the optimal cut-off for similarity was set at 97% for assigning the phylum of the OTUs. This indicated that the modified primers successfully amplified most phyla of the eurkayotic and prokaryotic algae in the samples, including the Dinophyta, Rhodophyta, Ochrophyta, Chlorophyta, Streptophyta, Cryptophyta, Haptophyta, and Cyanophyta (Fig. 3). Phyla not found in this study, such as Bigyra, Ciliophora, Katablepharidophyta, and Telonemia, either had ambiguous taxonomies or were rare at our sample sites. Sequences that had <97% identity with the PYTOGEN database were classified as ''unknown,'' and the proportions in this category ranged from 2.34% to 17.03% depending on the sample (Table 3). To ascertain whether further analysis of the OTUs at class level was possible, similarity from 97% to 100% was tested (Fig. 2B). While there was considerable difference from 98% to 100%, no significant difference was identified below 98%, indicating that 98% was the optimal similarity cut-off point for assigning class. We were unable to assign family level or below mainly due to the relatively small numbers of plastid 23S sequences in the database.

Figure 2 Composition of phytoplankton by the different cut-off similarity of OTUs. (A) Each bar
shows the ratio of phytoplankton phyla according to cut-off similarity from 90% to 99%. (B) Each bar shows the ratio of phytoplankton classes according to cut-off similarity from 90% to 99%.

Figure 3 Coverage of obtained OTUs in algal phyla.
Phylogenetic tree was constructed by the Neigbhor-joining (NJ) algorithm using Molecular Evolutionary Genetics Analysis (MEGA ver 6.0). The evolutionary distances were computed using the Kimura 2-parameter method.

Community structure analysis for phytoplankton of the East/Japan Sea
More OTUs were present in the samples collected from the open sea than in those collected from the coastal seas, and this was mainly to the prokaryotic population, including cyanobacteria and proteobacteria (Table 3). The costal surface water sample in July (JIS) contained the lowest number of OTUs (266), whereas the open surface water sample in July (JIIS) contained the highest number of OTUs (325) ( Table 3). The numbers of exclusively identified OTUs at each sample site were very low, ranging from 10 to 24 (approximately 5% of the total OTUs [ Fig. 4B]). The numbers of OTUs in each phylum were similar in each sample, and only their relative proportions varied (Table 3). For example, the numbers of OTUs belonging to the Ochrophyta were similar (109-123) among the six samples, but the corresponding abundance proportion varied from 7.32% to 86.40%. Species belonging to the Ochrophyta comprised >80% of those identified in the coastal water samples collected in May and July (Table 3), and most belonged to class Coscinodiscophyceae (Table 4). of cyanobacteria and proteobacteria (Table 3). Cyanobacteria occupied approximately 20% of the phytoplankton community in the surface water of the open sea samples from May and July, whereas the corresponding proportions in the coastal water samples were both 1%. Although the proportions of algal communities in the coastal surface waters and those in the deep waters of the open sea were similar at the different sampling times, a considerable difference was detected in the algal communities of the surface water of the open sea between May and July (Table 3). The proportion of species belonging to phylum Chlorophyta increased from 2.3% in the May open-sea surface water sample to 23.91% in July, whereas those in phylum Ochrophyta decreased from 36.66% in May to 7.32% in July (Table 3). The major differences between the May and July open-sea deep water samples were increases in proportions of species belonging to the Dinophyta (from 2.33% to 13.01%) and Ochrophyta (from 17.74% to 28.48%) (Fig. 2).

Reliability of the modified universal primers for algal community structure analysis
The modified universal primers were more effective at identifying prokaryotic and eukaryotic phytoplankton taxa, and effectively excluded proteobacterial sequences. Compared with 16S primers, the modified primers will be more useful for analyzing phytoplankton community structure, as they allow for identification of more photosynthetic phytoplankton. When data produced by both 16S and 23S primers are compared, it will be possible to assess for correlations between autotrophic phytoplankton species and heterotrophic bacteria in the ocean. In the present study, we also found that all the sequences belonging to the Chlorophyta contained a GGACAA sequence rather than GGACAG at the 5 end of the original universal forward primer region (Table 1). This finding enabled us to modify the primers such that we were able to detect a greater proportion of Chlorophyta in the surface water of the open sea, and this supports the value of the modified universal primers (Fig. 4). In addition, the nested PCR using a low cycle number and modified universal primers yielded more eukaryotic algal sequences than the original primers did. Use of a single primer set often results in exaggerated stochastic effects due to the exponential nature of PCR amplification and variable primer efficiency (Caragine et al., 2009;Pinto & Raskin, 2012). The original 23S primers, which cannot discriminate heterotrophic proteobacteria from other photosynthetic phytoplankton species (whereas 16S primers can do so), yielded fewer eukaryotic algal sequences when the protobacterial population was large (Table 2). To overcome those drawbacks and to present better quantitative phytoplankton community structure, it is necessary to use reduced PCR cycles and different primer sets.
In the present study, a nested PCR strategy with low cycle number reduced the stochastic effects of PCR and yielded more photosynthetic algal species sequences presenting the better phytoplankton community structure for each sample examined. In conclusion, the performance of the modified 23S universal primers was superior to that of the original universal primers designed for marine phytoplankton community structure analysis. Of the 345 OTUs identified in this study, only 95 (27.5%) matched the sequences previously listed in the GenBank database, and this is an important limitation of the 23S primers. This is mainly because most phytoplankton researchers have paid more attention to the molecular taxonomy of the 16S rDNA region than the 23S rDNA region. Although many more 16S rDNA sequences have been deposited in the database, it is difficult to design the ''universal primers'' that will amplify all the prokaryotic and eukaryotic algal taxa that correspond to the 16S rDNA region. In fact, most studies that have used the universal primers have been conducted to determine the biodiversity of a specific taxon or group (Asudi et al., 2016;Cruaud et al., 2014;Kitamura et al., 2016;Logares et al., 2014;Valenzuela-González et al., 2016;Vierheilig et al., 2015). In contrast, our current results with the modified primers that target plastid 23S rDNA indicate that these can be used as ''universal primers'' for analyzing algal community structure because of their large coverage and the fact that they amplify both prokaryotic and eukaryotic algal species. For this reason, if PHYTOGEN is operated properly documenting the plastid 23S rDNA sequences this would be a useful database for understanding the marine phytoplankton community. Without doubt, the first step to improving the PHYTOGEN database would be to increase the sequence data with species description. Since a great deal of 16S rDNA data have already exist, one of the fastest ways to increase the sequence information is to perform massive sequencing the long DNA fragments that contain both 16S rDNA and 23S rDNA regions within the same sample. Given that the cost for sequencing is steadily decreasing, it is expected that the availability of plastid 23S rDNA sequences will increase dramatically and this will enable researchers to ultimately obtain extensive, useful 23S rDNA data without any further morphological analysis. In addition, the PHYTOGEN database will need to be updated periodically to add new findings for 23S rDNA to the GenBank database. Once PHYTOGEN has been updated and higher resolution data become available, we will produce data with better resolution and our dependence on molecular surveys for marine phytoplankton evaluations will increase.

Effects of economic metabarcording strategy on marine phytoplankton studies
After processing the raw reads, the numbers of end-paired amplicons obtained from each respective sample ranged from 290,082 to 639,836, and averaged 438,896 (Table 3). Compared with the previously reported average amplicon numbers produced by the Roche 454 pyrosequencing system (Eiler et al., 2013), equivalent data have been obtained at less than 1/100th of the cost using the Illumina platform (Mardis, 2008). Based on this, several studies have since used the MiSeq platform to analyze bacterial community structure using 16S primers (Caporaso et al., 2012;Schmidt et al., 2013); however, ''universal'' primers that are suitable for the Illumina platform (400 bp-500 bp) and cover prokaryotic and eukaryotic phytoplankton have not been reported. The modified primers we designed for this study are not only optimized for the Illumina platform with high resolving length (404-411 bp), but also exhibit large taxonomic coverage, amplifying prokaryotic as well as eukaryotic phytoplankton. In our present study, compared to the Roche 454 system, we obtained similar data at 1/20th the cost using the Illumina MiSeq system after trimming the raw data with high quality (6 bp overlapping sequences and omitting any mismatches). The cost would be further reduced to reach a similar conclusion by the increased indices or the less strict trimming algorithm. Considering the low cost of the metabarcoding strategy, it is possible for a single laboratory to run a year-round phytoplankton community survey handling the several thousand samples, which has never been imagined before. This means that researchers should spend more time in the laboratory analyzing the huge amount of data using various bioinformatics tools instead of collecting samples in the field.
This economic metabarcoding strategy enables large amounts of data to be analyzed at one time and will allow researchers to obtain novel knowledge. We recognize that certain features of the sampling methods should be changed; specifically, the filtered water volume should be 10 L per sample, as opposed to the 0.5 L volume used in our phytoplankton survey. A previous study showed that OTU richness was greater with increased filtered water volume (Padilla et al., 2015); therefore, increasing the water volume per sample would generate more accurate results, less variability and higher OTU numbers. If the proportion of each OTU is linked to the total carbon values, we would also be able to estimate the carbon biomass of each OTU. We could construct a spatio-temporal database of phytoplankton community changes on a daily or even hourly basis. Most of all, this new strategy would be especially useful for analyzing phytoplankton community structure and its contribution to primary production in ocean. One long-term question is whether picoplankton are expected to gradually increase as the oceans become warmer (Morán et al., 2010). These data would provide novel information about the paradoxical nature of phytoplankton (Hutchinson, 1961) and its ecosystem functions in marine food webs, which have not been investigated to date.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This research was part of the project titled ''Long-term change of structure and function in marine ecosystems of Korea,'' funded by the Ministry of Oceans and Fisheries, Korea. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
• Do-Hwan Ahn performed the experiments.
• Hyun Park performed the experiments, wrote the paper, reviewed drafts of the paper.
• Hyun-Woo Kim conceived and designed the experiments, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper.

Field Study Permissions
The following information was supplied relating to field study approvals (i.e., approving body and any reference numbers):

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.2115#supplemental-information.