Using Next-Generation Sequencing for DNA Barcoding: Capturing Allelic Variation in ITS2

Internal Transcribed Spacer 2 (ITS2) is a popular DNA barcoding marker; however, in some animal species it is hypervariable and therefore difficult to sequence with traditional methods. With next-generation sequencing (NGS) it is possible to sequence all gene variants despite the presence of single nucleotide polymorphisms (SNPs), insertions/deletions (indels), homopolymeric regions, and microsatellites. Our aim was to compare the performance of Sanger sequencing and NGS amplicon sequencing in characterizing ITS2 in 26 mosquito species represented by 88 samples. The suitability of ITS2 as a DNA barcoding marker for mosquitoes, and its allelic diversity in individuals and species, was also assessed. Compared to Sanger sequencing, NGS was able to characterize the ITS2 region to a greater extent, with resolution within and between individuals and species that was previously not possible. A total of 382 unique sequences (alleles) were generated from the 88 mosquito specimens, demonstrating the diversity present that has been overlooked by traditional sequencing methods. Multiple indels and microsatellites were present in the ITS2 alleles, which were often specific to species or genera, causing variation in sequence length. As a barcoding marker, ITS2 was able to separate all of the species, apart from members of the Culex pipiens complex, providing the same resolution as the commonly used Cytochrome Oxidase I (COI). The ability to cost-effectively sequence hypervariable markers makes NGS an invaluable tool with many applications in the DNA barcoding field, and provides insights into the limitations of previous studies and techniques.

In recent years, DNA barcoding has gained popularity as a method to taxonomically identify unknown specimens. DNA barcoding has significant benefits compared to traditional morphological identification, such as differentiating similar-looking species and identifying immature or damaged specimens. The accuracy and efficiency of DNA-based species identification makes it particularly suitable to vector surveillance and biosecurity programs, where specimen identification informs public health predictions and vector control decisions (Armstrong and Ball 2005;Batovska et al. 2016).
Mitochondrial Cytochrome Oxidase I (COI) is the most common barcoding gene used for animals; however, other ribosomal or nuclear genes are also often used and offer certain advantages. While mitochondrial genes are highly variable and easier to amplify due to their high copy number, nuclear genes exhibit variable rates of substitution that can provide greater resolving power (Lin and Danforth 2004). The Internal Transcribed Spacer 2 (ITS2) region of ribosomal DNA has a high copy number and a higher evolutionary rate than most nuclear DNA; however, it can cause alignment problems due to the presence of variable copies (alleles) within individuals (Álvarez and Wendel 2003). When multiple species are being barcoded, it is beneficial to use a ribosomal or other nuclear gene in addition to a mitochondrial gene. These loci are unlinked and evolve separately, thereby giving independent estimates of genetic lineages and relationships. Numerous mosquito barcoding studies have used both mitochondrial and nuclear markers to distinguish species (Foley et al. 2007;Paredes-Esquivel et al. 2009;Puslednik et al. 2012;Bourke et al. 2013). ITS2 is often used in mosquito studies as its high evolutionary rate makes it useful for investigating closely related species, such as those in the Anopheles genus (Wilkerson et al. 2004;Marrelli et al. 2006;Walton et al. 2007;Sum et al. 2014).  Currently, Sanger sequencing is the most commonly used method to acquire DNA barcodes (Taylor and Harris 2012). While Sanger sequencing can generate reads up to 1000 bases, there are complications that limit its application in DNA barcoding. Sanger sequencing can only process a single DNA template per sample and multiple templates can cause uninterpretable results. This means that DNA regions with size-variable alleles within an individual cannot be sequenced. For instance, the size of ITS2 varies in many mosquito species, so it must first be cloned and then sequenced in order to overcome this problem (Wesson et al. 1992;Severini et al. 1996;Aspen et al. 2003). Pseudogenes can also create similar issues and have been detected in some mosquito species (Cywinska et al. 2006;Wang et al. 2012). Sanger sequencing also struggles with long homopolymeric and repetitive regions, often causing difficulties with microsatellites that may be found in barcoding markers, such as ITS2 (Banerjee et al. 2007).
NGS technologies are able to achieve massive parallel sequencing of single DNA molecules, resulting in high-throughput data, unlike Sanger sequencing. Parallel sequencing means that multiple templates are not a challenge, and all regions and variants of the gene are sequenced, allowing the detection of pseudogenes, contaminants, or allelic variation within individual insects (Shokralla et al. 2014). Although wholegenome sequencing was one of the initial uses of NGS technology, a variety of methods have since been developed that can generate multilocus sequence data for various applications such as genotyping and phylogenetics. One method, amplicon sequencing, involves tagging a unique barcode onto PCR-amplicons that are then pooled with other tagged amplicons and sequenced (McCormack et al. 2013). This multiplexing capability makes NGS a cost-effective method for DNA barcoding.
In this study, we utilize both traditional Sanger sequencing and NGS amplicon sequencing to characterize a region of ITS2 known to contain microsatellites and indels in mosquitoes. The success of both methods is compared in order to determine their applicability in sequencing hypervariable genes such as ITS2. The variability found within ITS2 is also analyzed and its utility as a DNA barcoding marker for mosquitoes is assessed, compared with previous estimates based on the mitochondrial DNA locus COI (Batovska et al. 2016).

Specimen collection and identification
Mosquitoes were collected as part of the Victorian Arbovirus Disease Control Program (VADCP). The methods used to collect, store, and morphologically identify the mosquitoes are described in Batovska et al. (2016). In total, 88 mosquito specimens were used in this study, representing 26 species and 12 genera. Specimen information, including trapping locations, can be found in Supplemental Material, Table S1. These specimens were previously used for a DNA barcoding project based upon COI (Batovska et al. 2016), and the same taxonomic classification has been employed here. Further details of the specimens used for these projects are recorded in the Mosquitoes of Australia -Victoria (MOAV) project on the Barcode of Life Database (BOLD, www.boldsystems.org). DNA isolation and ITS2 amplification A single leg from each mosquito was used for DNA isolation. Magnetic bead-based nucleic acid extraction was performed using the MagMAX DNA Multi-Sample Kit (Life Technologies, Gaithersburg, MD). Sample homogenization and protocol adjustments are described in Batovska et al. (2016). DNA extractions were stored at 220°. A region of ITS2 (350-600 bp) was amplified using the novel primer pair ITS2-MOS-F (59-GCTCGTGGATCGATGAAGAC-39) and ITS2-MOS-R (59-TGCTTAAATTTAGGGGGTAGTCAC-39). These new primers are located in the conserved ribosomal DNA regions flanking ITS2 and were designed from sequences obtained from GenBank (Wesson et al. 1992;Toma et al. 2000;Linton et al. 2001) using Primer3 version 0.4.0 (Untergasser et al. 2012). PCR reactions consisted of a total volume of 25 ml: 15.3 ml 1 · bovine serum albumin (BSA), 2.5 ml 10 · ThermoPol Reaction Buffer (New England Biolabs, Beverly, MA), 2 ml 2.5 mM dNTPs, 1.25 ml of each 10 mM/L primer, 0.2 ml 1.0 U Taq DNA Polymerase, and 5-15 ng template DNA. The ITS2 PCR cycle was as follows: 94°for 2 min; 40 cycles of 94°for 30 sec, 51°for 45 sec, and 72°for 45 sec; and then finally 72°for 1 min. The PCR products were verified on a 2% agarose gel.

Sanger sequencing and data analysis
Size-verified ITS2 PCR products were enzymatically purified and sequenced commercially in the forward direction on an ABI3730XL by Macrogen (Korea). To determine the overall quality of the sequences, the percentage of bases with a PHRED quality score (Q) above 20 was obtained from the beginning of the sequence to the final adenine peak. For quality trimming, bad quality bases were automatically removed from the 59 and 39 ends using a 0.01 error probability limit in Geneious version 8.1 (www.geneious.com, Kearse et al. 2012). Any heterozygous bases found in the chromatograms were manually assessed and the base determined using the International Union of Pure Applied Chemistry (IUPAC) codes. Quality-trimmed Sanger sequences were aligned with ClustalW, sequence divergence was calculated (p-distance values), and a bootstrap neighbor-joining tree (1000 replicates) was created using MEGA version 6 (Tamura et al. 2007).

NGS and data analysis
The size-verified ITS2 PCR products were purified using AMPure XP beads (Beckman Coulter, Brea, CA) with a 0.8 · beads ratio in all instances. Universal Y-shape adaptors were ligated onto the individual PCR products and excess adaptors were removed using AMPure XP beads. Unique 8 bp barcodes with Illumina P5 and P7 adapters were added onto the ligated products via PCR (· 18 cycles). An AMPure XP bead clean was performed and each sample was then quantified using a NanoDrop 1000 (Thermo Scientific, Waltman, MA). A single equimolar pool was created and then quantified using a Qubit 1.0 fluorometer (Thermo Scientific). Library size was checked with a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Any remaining adaptor dimers were removed with a further AMPure XP bead clean-up. The pooled sample was sequenced on a MiSeq platform with 2 · 250 bp reads using version 3 chemistry (Illumina, San Diego, CA).
The demultiplexed MiSeq data were trimmed to remove adapters and low quality sequence reads using a custom Perl script. Reads were removed if they had three or more consecutive ambiguous "N" bases; three or more nucleotides (nt) with # Q20; a median Q score of , 20; or a read length of , 150 nt. Overlapping paired reads were merged using PEAR version 0.9.4 (Zhang et al. 2013). Contigs (alleles) were assembled with CAP3 version date 12/21/07 (Huang and Madan 1999) using a sequence identity of 95%. The allele with the largest number of reads attributed to it was considered to be the most common allele. All alleles produced were aligned using ClustalW in Geneious. Sequence divergence was calculated (p-distance values), and a bootstrap neighbor-joining tree (1000 replicates) was created using MEGA version 6 (Tamura et al. 2007). Phobos version 3.3.12 (Mayer 2006) was used to detect microsatellites in the sequences. Microsatellites were defined as 1-6 bp with at least five repeats (Selkoe and Toonen 2006). Using ClustalW, the most common allele for each sample was aligned to the corresponding Sanger sequence in order to verify if the sequence matched. Coverage was also calculated, which was defined as the percentage of the allele covered by good quality Sanger sequence. Sanger and NGS ITS2 neighborjoining trees were compared to determine how many individuals were not resolved monophyletically; these included individuals that formed a complex with closely related species or did not resolve monophyletically with conspecifics. A two-by-two table was generated to compare this data, and overall percent agreement was calculated.

Data availability
All 382 NGS allele sequences are accessible on GenBank (accessions KU495620-KU495708: most common alleles; KX865970-KX866263: other alleles). The most common alleles for each of the specimens are also available on the MOAV project on BOLD (accessions in the range of MOAV001-15-MOAV116-15), as is all associated specimen information. The raw sequencing files for all 88 samples have been deposited at the NCBI SRA database under project ID PRJNA343434.
Individual specimen details are found in Table S1, including trapping locations. Figure S1 compares Sanger and average NGS sequence lengths for all individual specimens. Figure S2 shows the location of microsatellites in all 382 NGS allele sequences. The neighbor-joining analysis of the 88 Sanger sequences can be seen in Figure S3.

Sanger sequencing
In total, 88 Sanger sequences were generated for ITS2, one for each sample in the forward direction. The quality of the sequences varied among species (Table 1). Culex cylindricus and Aedeomyia venustipes had consistently good quality sequences, with 91-93 and 92-94% of untrimmed bases . Q20, respectively; whereas other species had poor quality sequences, such as Dobrotworskyius rubrithorax with only 59-63% of bases . Q20 (Table 1). Interestingly, some species had both good and bad quality sequences, including Cx. pipiens form molestus sequences ranging between 33 and 93% of bases . Q20. After quality trimming, sequences were 95-523 bp in length, with the average length and SE for each species shown in Figure 1A, and the length for each individual shown in Figure S1.

NGS
After bioinformatic analysis, NGS generated 382 unique sequences (alleles) for the 88 samples. When the most common allele was aligned to the corresponding trimmed Sanger sequence, there was a 100% sequence identity match in 85 out of the 88 samples. Of the three remaining samples, the Sanger sequence had a 100% sequence identity match with the second most common allele. The difference in read number between the two most common alleles for these three samples ranged from 1.5 to 27.8% of all reads used for CAP3 analysis.
The alignments also revealed that the Sanger sequences post quality trimming covered a mean of 68.2% (range 19.1-90.7%) of the allele lengths. The mean coverage for each species can be seen in Figure 1A, and the coverage for each individual sample is listed in Table 1 and shown in Figure S1. While coverage varied among species with the aforementioned difference in Sanger sequence quality, generally all alleles were longer than the corresponding Sanger sequence. The only Figure 3 The location of microsatellites within 26 ITS2 sequences (shown from 100 bp), each representative of a species. The size of the light blue labels are proportionate to repeat length, whereas the number inside the label is representative of the repeat type. Microsatellites within all 382 alleles can be found in Figure S2. species with a Sanger sequence longer than the allele length was Anopheles annulipes ( Figure 1A). This occurred because the targeted ITS2 region in this species is , 500 bp, which meant that the bioinformatic process could not create full-length alleles (the MiSeq read length generated was 2 · 250 bp, so the reads do not overlap the entire fragment and therefore the PEAR program cannot join them). Since the allele lengths are not an accurate representation of the full length of ITS2 in An. annulipes, they were not included in the mean coverage calculation.
It should be noted that Ad. venustipes had alleles longer than 500 bp (range 509-522 bp). When the raw reads for these samples were aligned to the corresponding Sanger ITS2 sequences, reads were identified that began after the primer region. This is indicative of degradation or incomplete extension of the ITS2 amplicons prior to library preparation, and explains how alleles longer than 500 bp could be formed. However, it appears there is a limit to this process, given that An. annulipes has a larger ITS2 region (601 bp, Bower et al. 2009) but does not assemble into a single contig.
When examining the Sanger sequences post quality trimming, poor quality sequences were found to have resulted from a variety of allelic differences, including: SNPs, microsatellite regions, and indels. These variable regions were successfully sequenced with NGS and are present in the alleles for each individual. Examples of the effect of different types of polymorphisms in both the Sanger sequences and the alleles are shown in Figure 2.

Variation within ITS2 alleles
The ITS2 alleles were found to have many repetitive regions, with microsatellites ranging from 1 to 4 bp in length (Figure 3). In total, the alleles contained 929 homopolymers, 90 dinucleotide, 12 trinucleotide, and four tetranucleotide microsatellites. Many of these microsatellites were specific to species or genera ( Figure S2).
The presence of microsatellites and indels led to variation in ITS2 length within and between individuals ( Table 1). The mean allele length for each species is compared in Figure 1A. Tripteroides atripes had the shortest ITS2 alleles (356-359 bp), while Ad. venustipes had the longest (509-522 bp).
The number of ITS2 alleles per individual also varied. Most individuals (n = 61, 69%) had one to three alleles, however the two members of the Dobrotworskyius genus had substantially more (Table 1). The number of alleles detected in Db. rubrithorax ranged from 24 to 35 per individual, resulting in a total of 114 alleles for the species; while the closely related Db. alboannulatus also had high allele numbers, with 11-17 per individual and a total of 53 for the species (Figure 1, B and C).

ITS2 neighbor-joining analysis
The neighbor-joining analysis of 88 ITS2 sequences derived from Sanger sequencing resulted in only 18 of the 26 species resolving monophyletically ( Figure S3). All of the Culex species failed to form distinct groups (except for Cx. cylindricus), as did Ochlerotatus sagax, and the Dobrotworskyius genus (total of 24 individuals, Table 2). In comparison, NGS sequences provided much greater species resolution, with the neighbor-joining analysis of the 382 alleles resulting in 24 species forming distinct groups on the tree (Figure 4). The only species that failed to form distinct groups were Cx. pipiens form molestus and Cx. quinquefasciatus (total of eight individuals, Table 2). When the number of monophyletically resolved individuals were compared (Table 2), Sanger and NGS had an estimate of agreement of 73%.

Sanger vs. NGS
Our results show that Sanger sequencing is not an appropriate method for characterizing ITS2 in the majority of mosquito species tested in this study. As shown in Figure 2, there were a variety of polymorphisms in ITS2 that hampered Sanger sequencing, leading to short sequences post quality trimming that are not suitable for DNA barcoding analyses ( Figure S1 and Table 1 and 2). When sequenced with NGS, areas with sequence polymorphisms were fully recovered, revealing the many ITS2 alleles present in most mosquito species.
Interestingly, some individuals had good quality Sanger sequences, yet had multiple alleles produced from the NGS data (Table 1). This is likely to be due to an abundance of a single dominant allele. While there can be multiple ITS2 alleles present in an individual, often only one or two are abundant. This was demonstrated by Song et al. (2012) when ITS2 variants from 178 plant species were characterized and it was found that, on average, three of the most prevalent variants accounted for 91% of all ITS2 copies per species. A similar result was obtained when Symbiodinium ITS2 variants were sequenced, and the most common copy was on average 20 times more prevalent than the second most common copy (Arif et al. 2014). Therefore, with the other alleles present in low proportions, Sanger sequencing is able to produce a clean chromatogram. The detection of the rarer alleles in our study demonstrates the sensitivity of the NGS method, as well as generating a more comprehensive sequence data set and resource for these mosquito species.
Conversely, some individuals had disrupted Sanger sequences but only one allele identified in the NGS data (Table 1). One possible cause is the parameters set for CAP3 when assembling the NGS reads, which only produced a contig if sequences were more than 95% similar, meaning that if there was , 5% difference there would still only be one allele. Even this small amount of variation could cause issues with Sanger sequencing considering a single base pair frameshift mutation can disrupt the process. When an individual that had a poor quality Sanger sequence and a single allele was reanalyzed with a 98% cut-off set for CAP3, multiple alleles were produced, indicating that the CAP3 parameters are indeed likely to be responsible for the incongruent data. Other contributors could include the well-documented secondary structure in ITS2, quantification issues, and preferential sequencing n "Nonmonophyly" included individuals that formed a complex with closely related species or did not resolve monophyletically with conspecifics. These counts provided an estimate of agreement between Sanger and NGS of 73%. NGS, next-generation sequencing.
of primer dimers, which can all result in poor Sanger sequences even if there is only a single ITS2 allele (Wesson et al. 1992;Bertrand et al. 2014).
Regardless of the presence of any polymorphic allele issues, the NGS method consistently produced sequences that were closer to the full amplicon length compared to Sanger sequencing ( Figure 1A and Figure S1). The only instance in which the Sanger sequence was longer was in the An. annulipes species. The ITS2 region used is larger than 500 bp in this species, and so the 250 bp paired-end reads generated in this study could not overlap to form the full ITS2 sequence, highlighting a limitation of the NGS method. However, it should be noted that regions marginally larger than the overlapping NGS read length can sometimes be assembled, as seen with Ad. venustipes, which had a 509-522 bp ITS2 region. With other markers, the problem could be solved by ensuring the amplicon being used is no larger than the overlapping NGS read length; but in ITS2, suitably conserved PCR priming sites are limited to the flanking ribosomal DNA regions, so reduction of the amplicon size is not feasible. When dealing with size-variable markers, another option is to sequence in only one direction; however, the actual sequence length that is useful for taxonomic assignment would need to be determined (Bertrand et al. 2014;Arif et al. 2014).
Another advantage of NGS over Sanger sequencing is cost. If Sanger sequencing was used to generate the same level of data as NGS, it would require extensive cloning and sequencing of alleles, which would cost more than a single MiSeq run. Furthermore, Sanger sequencing requires each sample to be processed individually, whereas NGS can sequence all samples simultaneously using internal barcodes attached to Illumina adaptors. The lowdepthofcoverage required means thenumber ofsamples can be upscaled to hundreds or thousands, which greatly reduces cost (Adey et al. 2010).

ITS2 variation
The ability to sequence through genetic polymorphisms, such as microsatellites and indels, allowed NGS to characterize ITS2 allelic diversity in individual mosquitoes. A total of 382 unique ITS2 alleles were documented, with a mean of 4.3 alleles per individual (range 1-35, Table 1). Concerted evolution is a process that reduces sequence variation among repeats within an individual; however, the presence of intragenomic ITS2 variation is well-established (Bazzicalupo et al. 2013;Bertrand et al. 2014;Arif et al. 2014). In our study, incomplete concerted evolution appears to be even more prevalent in particular species, with 167 of the 382 alleles attributed to the eight mosquitoes from the Dobrotworskyius genus ( Figure 1C). Furthermore, although the Dobrotworskyius species were recovered in this study, the large differences between alleles within the genus is indicative of incomplete lineage sorting between species (Figure 4 and Figure 5A). Divergent copies of ITS2 sequences may have been retained over long periods of time and could be useful in elucidating evolutionary relationships (Vargas et al. 1999;Wissemann 2002). In addition to the number of alleles detected, the length of ITS2 also varied among species, ranging between 356-522 bp. Length was specific to species and caused difficulty in alignment, leading to large differences between some genera and particularly between subfamilies (Figure 5C). Ancient divergences lead to accumulated differences between Anophelinae and Culicinae species, however the degree of distance between these subfamilies using ITS2 is inflated when compared to mitochondrial genes (Cywinska et al. 2006;Versteirt et al. 2015;Batovska et al. 2016). Alignment-free sequence analysis methods could possibly overcome some of the difficulty associated with comparing alleles of different sizes (Little 2011).
The alignment of alleles revealed that the differences in length were caused by the presence of numerous indels and microsatellites, which were often specific to species (Figure 3 and Figure S2). These polymorphisms are likely to be caused by DNA replication slippage, and their accumulation is furthered by the slow rate of concerted evolution in ITS2 (Hancock and Vogler 2000). Their fast mutation rate might make them useful for population genetic studies, and species-specific microsatellites could be employed to improve genetic resolution and define boundaries for species determination (Arif et al. 2014).
ITS2 as a barcoding marker for mosquitoes When using Sanger sequencing, ITS2 is not a suitable barcoding marker for the species examined in this study, with only 69% of species resolving monophyletically ( Figure S3 and Table 2). Due to the polymorphisms found in ITS2 in these species, Sanger sequencing cannot characterize the region well, apart from when An. annulipes species are being sequenced ( Figure 1A and Figure S1). However, when using NGS, ITS2 can separate species with a 96% success rate (Figure 4), despite having overlapping conspecific and congeneric differences, and no clear barcoding gap (Figure 5, A and B). The only species that could not be separated were Cx. quinquefasciatus and Cx. pipiens form molestus (Figure 4). These species are part of the Cx. pipiens complex, which is known to be difficult to distinguish using standard barcoding methods (Laurito et al. 2013;Pfeiler et al. 2013;Batovska et al. 2016). In comparison to COI as a marker in the same specimens/species (Batovska et al. 2016), ITS2 has similar resolution, with both markers successfully distinguishing all species with the exception of the Cx. pipiens complex. Mitochondrial and nuclear genes evolve independently, therefore the concordance between the two datasets observed here helps to confirm the established COI species and genera (Lin and Danforth 2004;Reinert et al. 2009;Batovska et al. 2016). In terms of utility as a DNA barcoding marker for mosquitoes, COI appears preferable to ITS2 as it lacks the alignment issues caused by microsatellites and indels. However, ITS2 does serve as a useful secondary marker that can now be efficiently obtained using NGS.

Conclusions
This study found that NGS is the most suitable method for characterizing the ITS2 region in mosquitoes and this trend is likely to extend into many other insect species and genera. Multiplexing made NGS more efficient and cost-effective than Sanger sequencing, and polymorphic regions were successfully sequenced, revealing the large diversity of ITS2 alleles present in mosquitoes. The data produced by this study captures the variation in number and size of ITS2 alleles among species, as well as the abundant microsatellites and indels present. Future studies could use this kind of data to develop markers or investigate evolutionary histories, and lineages with unusual patterns could be examined further, such as the Dobrotworskyius genus. As a DNA barcoding marker, ITS2 functions as well as COI; however, COI is a superior marker with fewer analytical limitations and complexities. The lack of depth required for NGS amplicon sequencing means that, in the future, a panel of alternative genes could be simultaneously sequenced in a variety of mosquito species in order to find a marker that could successfully separate all species and create well-supported phylogenies. The gene with best resolution may then be used for bulk DNA barcoding, where large pools of mosquitoes could be sequenced and identified, thereby significantly reducing identification time in surveillance programs (Hajibabaei et al. 2011). The power and versatility of NGS is revolutionizing many scientific fields, including DNA barcoding, where allelic variation is no longer a limitation but rather an area for discovery.