Adapterama I: Universal Stubs and Primers for Thousands of Dual-Indexed Illumina Libraries (iTru & iNext)

Next-generation DNA sequencing (NGS) offers many benefits, but its use is limited by the time and costs associated with: 1) start-up (i.e., doing NGS for the first time), 2) buy-in (i.e., getting any data from a run), and 3) sample preparation. Although many researchers have focused on reducing sample preparation costs, few have addressed the first two problems. Here, we present iTru and iNext, dual-indexing systems for Illumina libraries that help address all three of these issues. By breaking the library construction process into re-usable, combinatorial components, we achieve low start-up, buy-in, and per-sample costs, while simultaneously increasing the number of samples that can be combined within a single run. We accomplish this by extending the Illumina TruSeq dual-indexing approach from 20 (8+12) indexed adapters that produce 96 (8×12) unique combinations to 579 (192+387) indexed primers that produce 74,304 (192×387) unique combinations. We synthesized 208 of these indexed primers for validation, and 206 of them passed our validation criteria (99% success). We also used the indexed primers to create hundreds of libraries in a variety of scenarios. Our approach reduces start-up and per-sample costs by requiring only one universal adapter, which works with indexed PCR primers to uniquely identify samples. Our approach reduces buy-in costs because: 1) relatively few oligonucleotides are needed to produce a large number of indexed libraries; and 2) the large number of possible primers allows researchers to use unique primer sets for different projects, which facilitates pooling of samples during sequencing. Although the methods we present are highly customizable, resulting libraries can be used with the standard Illumina sequencing primers and demultiplexed with the standard Illumina software packages, thereby minimizing instrument and software customization headaches.
In subsequent Adapterama papers, we use these same iTru primers with different adapter stubs to construct double- to quadruple-indexed amplicon libraries and double-digest restriction-site associated DNA (RAD) libraries. For additional details and updates, please see http://baddna.org.


DNA insert is generally referred to as "library preparation". Library preparation of genomic DNA, in its most common form, involves randomly shearing DNA to a desired size range (e.g., 200-600 bp); end-repairing and adenylating the sheared DNA; adding synthetic, double-stranded adapters onto each end of the adenylated DNA molecules using T/A ligation; and using limited-

the individual molecules are clonally amplified, and up to four separate sequencing reactions take place sequentially, each creating a separate sequencing read (Fig. 4). After sequencing, computer software matches the observed index sequence for each molecule to a list of samples with expected indexes (i.e., using a sample sheet; Supplemental File 2) and parses the bulk data back into its component parts (i.e., demultiplexed, e.g., using bcl2fastq [Illumina 2013]).

In practice, the history and current status of Illumina indexing strategies is quite complicated

indexes were designed to be robust to substitution sequencing errors. Deletions, however, are the primary errors of oligonucleotide synthesis (i.e., synthesis of the adapters and/or primers used to make the indexed libraries). It is, therefore, desirable to have indexes that are robust to insertions and deletions (indels) as well as substitutions, thus conforming to an edit-distance metric and limiting the assignment of sequences to the wrong sample (Faircloth and Glenn 2012). When index sets have distances ≥3, then error correction can be employed, but this distance criterion is frequently violated (Faircloth and Glenn 2012).

Building upon earlier in-house and external efforts, Illumina introduced a product (Nextera kits) that used an i5 index and an i7 index (i.e., dual-indexing; see Box 1, Fig. 1, and below), each of which were longer (8 nt) and, at that time, conformed to the edit-distance metric. Nextera adapters use the same sequences for interaction with the flow-cell (i.e., P5 and P7, Fig. 1

cost of production, inventory, and quality control (QC) (i.e., it is less expensive to produce, maintain stocks of, and do QC on 20 primers than 96), and c) the universality of the approach: dual-indexing is compatible with both full-length adapters (e.g., TruSeqHT libraries) and universal adapter stubs and primers (e.g., Nextera, iNext, or iTru).

Illumina-compatible Libraries
Illumina's libraries have been the industry's gold standard for sequence quality on Illumina platforms, but their library preparation kits are among the most expensive available. The number of indexes offered by Illumina has been limited to ≤48 and the number of dual-index combinations ≤96, until the relatively recent release of additional indexes for the Nextera system,

up costs dramatically and can lead to contamination from previous oligonucleotides that were purified on the same HPLC columns; b) relies on hairpin suppression of molecules with identical adapter ends (instead of using a Y-yoke adapter) which is efficient with smaller inserts (e.g., <200 bp) but loses efficiency with increasing insert length; and c) relies on blunt-ended ligation, which allows the formation of chimeric inserts, a danger that increases with insert length. The F-2011 method introduced the idea of "on-bead" library preparation, which increases efficiency and reduces costs; thus, many commercial kits have subsequently incorporated similar on-bead library preparation approaches. Limitations of the F-2011 method include use of: a) custom NEB reagents, not in the standard catalog or available in small quantities; b) large volumes of enzymes; and c) Illumina adapters and primers, which increase costs and limit the number of samples that can be pooled.

Our approach builds upon many of the previous approaches introduced by Illumina, MK-2010, F-2011, Rohland & Reich (2012), and others to develop library preparation methods for genomic DNA that overcome many of these limitations. We describe adapters, primers, and library construction methods that produce DNA molecules equivalent to and compatible with Illumina's TruSeqHT libraries (and, separately, Nextera libraries, see Supplemental File 1; Table 2). Our method extends the number of available index combinations from 8 × 12 to 192 × 387, while maintaining a minimum edit-distance of three between all indexes. We demonstrate the effectiveness of our combinatorial indexing primers by controlled quantitative PCR experiments, and we demonstrate the utility of our system by preparing and sequencing iTru libraries from organisms with varying genome size and DNA quality.
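To make the minimum edit-distance criterion concrete, the following Python sketch computes the Levenshtein distance (substitutions, insertions, and deletions) between index sequences and reports the smallest distance found within a set. It is an illustration only, not the procedure used to design the tags (see Faircloth and Glenn 2012 for that); the function names and the example tags are hypothetical and are not taken from the iTru sets.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions, and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion from a
                curr[j - 1] + 1,                   # insertion into a
                prev[j - 1] + (base_a != base_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def min_pairwise_distance(tags):
    """Smallest edit distance between any two tags in the set."""
    return min(edit_distance(a, b) for a, b in combinations(tags, 2))


# Hypothetical 8 nt tags, for illustration only (not real iTru indexes).
example_tags = ["AACCGGTT", "TTGGCCAA", "ACGTACGT", "GGTTAACC"]

if __name__ == "__main__":
    print("minimum pairwise edit distance:", min_pairwise_distance(example_tags))
    print("index combinations:", 8 * 12, "->", 192 * 387)
```

A set whose minimum pairwise distance is at least three satisfies the criterion stated above; a set containing two tags that differ by a single indel or substitution would not.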

Methods
Adapter and primer design
We modified the Illumina TruSeq system by dividing the adapter components into two parts: 1) a universal Y-yoke adapter "stub" that comprises parts of the Read 1 and Read 2 primer binding sites plus the Y-yoke, and 2) a set of amplification primers (iTru5, iTru7), parts of which are complementary to the Y-yoke stub and which also contain custom sequence tag(s) for sample indexing (Figs. 1, 3; Table 3; Supplemental File 4). The iTru Y-yoke adapter has a single 5' thymidine (T) overhang and can be used in standard library preparations that produce insert DNA with single 3' adenosine (A) overhangs. We designed a large set of indexed amplification primers (iTru5, iTru7; Supplemental File 4) that contain a subset of our custom 8 nt sequence tags (from Faircloth & Glenn 2012), as well as an initial set that incorporated the TruSeq HT indexes (i.e., D5xx for iTru5 and D7xx for iTru7) which could serve as controls. We grouped the iTru primers with our sequence tags into clearly identifiable, numbered sets (100 or 300 series) that are compatible with 8 nt tags in the standard Illumina TruSeqHT primers, as well as Illumina v2 8 nt tags (including the 6 nt tags converted to 8 nt via addition of invariant bases from the adapter). We also created several additional numbered sets (200 or 400 series) of iTru primers that are compatible with all other primers and sequence tags in our iTru system, but which are not compatible with all Illumina indexes. We then balanced the base composition of all iTru primers in all numbered sets in groups of eight for iTru5 or 12 for iTru7, because balanced base composition is critical for successful index sequencing (Illumina 2016b; see Discussion for additional information on combining small numbers of libraries).
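One way to inspect the base composition of a group of index tags is to tabulate the bases observed at each position across the group. The Python sketch below does this and lists positions that show fewer than a chosen number of distinct bases. The threshold (`min_distinct`), the function names, and the example tags are assumptions made for illustration; they are not the balancing rule used to build the iTru sets.

```python
from collections import Counter


def base_composition_by_position(tags):
    """Return one Counter of bases for each position across the tags."""
    length = len(tags[0])
    if any(len(tag) != length for tag in tags):
        raise ValueError("all tags must have the same length")
    # zip(*tags) yields the column of bases found at each position
    return [Counter(column) for column in zip(*tags)]


def unbalanced_positions(tags, min_distinct=2):
    """0-based positions with fewer than min_distinct different bases.

    min_distinct is a hypothetical threshold chosen for illustration."""
    composition = base_composition_by_position(tags)
    return [pos for pos, counts in enumerate(composition)
            if len(counts) < min_distinct]


if __name__ == "__main__":
    # Hypothetical 4 nt tags, for illustration only.
    group = ["ACGT", "CGTA", "GTAC", "TACG"]
    for pos, counts in enumerate(base_composition_by_position(group)):
        print(pos, dict(counts))
    print("unbalanced positions:", unbalanced_positions(group))
```

Running the same check on a planned pool of libraries, rather than on a whole numbered set, shows whether a small subset of primers still covers several bases at every index position.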
We ordered the components of our Y-yoke adapter stubs and iTru primers from Integrated DNA Technologies (IDT, Coralville, IA, USA). We modified the adapter stub sequence by phosphorylating the 5' end of the iTru_R2_stub_RCp oligonucleotide (Figure 1; Table 3), and we modified each of the iTru primer sequences by adding a phosphorothioate bond (Eckstein 1985) before the 3' nucleotide of each sequence to inhibit degradation due to the exonuclease activity of proof-reading polymerases (Skerra 1992), which are commonly used in library preparation. Following initial small-scale orders, we ordered the entire complement of iTru primers, placing the iTru5 and iTru7 primers into every other column (iTru5) or row (iTru7) of 96-well plates, with 0.625 or 1.25 nmol aliquots in replicate plates (Supplemental Files 4, 5). We hydrated newly synthesized primers to 10 µM in the plate and diluted them to 5 µM prior to use (Supplemental File 6).
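The hydration and dilution steps reduce to simple concentration arithmetic, sketched below in Python. The aliquot amounts (0.625 and 1.25 nmol) and concentrations (10 µM and 5 µM) come from the text; the function names and the 20 µL working volume are hypothetical, and Supplemental File 6 remains the reference for the actual protocol.

```python
def resuspension_volume_ul(amount_nmol: float, target_um: float) -> float:
    """Volume (µL) needed to hydrate a dried oligo to target_um (µM).

    nmol divided by µmol/L gives mL; multiply by 1000 to get µL."""
    return amount_nmol / target_um * 1000.0


def dilution_volumes_ul(stock_um: float, target_um: float, final_ul: float):
    """Split final_ul into (stock, diluent) volumes using C1*V1 = C2*V2."""
    stock_ul = target_um * final_ul / stock_um
    return stock_ul, final_ul - stock_ul


if __name__ == "__main__":
    for nmol in (0.625, 1.25):
        print(nmol, "nmol ->", resuspension_volume_ul(nmol, 10.0), "µL for 10 µM")
    # 20 µL is a hypothetical working volume chosen for illustration.
    print("10 µM -> 5 µM:", dilution_volumes_ul(10.0, 5.0, 20.0))
```

Under these assumptions, a 1.25 nmol aliquot needs 125 µL to reach 10 µM, a 0.625 nmol aliquot needs 62.5 µL, and the 5 µM working stock is a 1:1 dilution of the plate stock.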

Validation of iTru Primers by Quantitative PCR (qPCR)
To determine whether our indexed iTru5 and iTru7 primers were biasing amplification, we selected a subset of iTru7 (n=160) and iTru5 (n=48) primers for qPCR validation. To validate the iTru primers, we prepared a pool of adapter-ligated chicken DNA using an inexpensive,

Because we needed to run multiple plates of qPCR to test all of the primers, we included the iTru5 set 2 primer A (iTru5_02_A) and the iTru7 set 2 primer 1 (iTru7_02_01) on all plates to provide a baseline reference for iTru5 or iTru7 primer performance. We determined the threshold cycle (C_T) using the default settings of the StepOnePlus, we averaged C_T values from replicate runs, and we calculated Delta C_T for each iTru primer using two approaches. First, we evaluated the relative performance of all iTru5 and iTru7 primers by subtracting the C_T of the iTru5 or iTru7 primer being tested from the average C_T of all iTru5 or iTru7 primers. Second, we evaluated the performance of all iTru5 and iTru7 primers by subtracting the baseline reference C_T of iTru5_02_A from the C_T of the iTru5 primer being tested and by subtracting the baseline reference C_T of iTru7_02_01 from the C_T of the iTru7 primer being tested. We expected that unbiased primers would not deviate from the average and/or baseline performance by more than 1.5 PCR cycles (i.e., an absolute Delta C_T ≤1.5), a value that should encompass the stochasticity seen between independent PCR reactions as a result of small, unavoidable primer concentration and other amplification performance differences.
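The two Delta C_T calculations can be sketched in Python as follows. The input is assumed to be a mapping from primer name to its C_T value already averaged across replicate runs; the function names, the primer names other than iTru7_02_01, and all C_T values are hypothetical and serve only to show the arithmetic and the 1.5-cycle threshold.

```python
from statistics import mean


def delta_ct_vs_average(ct_by_primer):
    """Approach 1: average C_T of all primers minus the tested primer's C_T."""
    average = mean(ct_by_primer.values())
    return {name: average - ct for name, ct in ct_by_primer.items()}


def delta_ct_vs_baseline(ct_by_primer, baseline_name):
    """Approach 2: tested primer's C_T minus the baseline reference C_T."""
    baseline = ct_by_primer[baseline_name]
    return {name: ct - baseline for name, ct in ct_by_primer.items()}


def flag_deviating(delta_ct, max_cycles=1.5):
    """Primers whose Delta C_T deviates by more than max_cycles cycles."""
    return sorted(name for name, d in delta_ct.items() if abs(d) > max_cycles)


if __name__ == "__main__":
    # Hypothetical replicate-averaged C_T values, for illustration only.
    ct = {"iTru7_02_01": 10.0, "primer_B": 10.5, "primer_C": 13.0}
    print(flag_deviating(delta_ct_vs_average(ct)))
    print(flag_deviating(delta_ct_vs_baseline(ct, "iTru7_02_01")))
```

Note that the two approaches have opposite signs by construction, so the threshold is applied to the absolute value of Delta C_T.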

DNA samples
To test the performance of both our Y-yoke adapters and the iTru system in a variety of library preparation scenarios, we prepared libraries from DNA of various types and quality. As a simple, known source of control DNA, we used Escherichia coli K-12 strain MG1655 (hereafter E. coli; Roche 454, Inc.) which has a high-quality genome sequence available (GenBank accession NC_000913; 4.6 Mb) and which is commonly used for quality control of sequencing libraries. To examine how our iTru system performed with DNA of varying quality and complexity, we also prepared iTru libraries from DNA that we isolated from a diverse array of six species (three sharks, a tarantula, jellyfish, and coral). We isolated each of these DNA sources using a variety of techniques commonly used in many labs, including commercial kits, salting out, or CTAB Phenol-Chloroform extraction (Table 4; also see Supplemental File 1 for additional details about testing iNext). We felt that these samples represented the range of species, sampling conditions, and DNA isolation techniques that are commonly encountered in model and non-model organism studies, and the taxa we sampled included particularly challenging specimens (i.e., tarantula, coral and jellyfish) that have previously performed poorly with commercial library preparation kits. Before library preparation, we fragmented E. coli genomic DNA to 400-600 bp using a Covaris S2 (Covaris, Woburn, MA, USA), and we fragmented genomic DNA (normalized to 23 ng/µL) to 400-600 bp using the Bioruptor UCD-300 sonication device (Diagenode, Denville, NJ, USA).

Library construction
Prior to library preparation, we annealed the iTru adapter sequences to form double-stranded, Y-yoke adapters by mixing equal volumes of the iTru_R1_stub and iTru_R2_stub_RCp oligos at 100 µM, supplementing the mixture with 100 mM NaCl, heating the solution to 98°C for 2 min in a thermal cycler, and allowing the thermal cycler to slowly cool the mixture to room temperature (Supplemental File 7).

We prepared iTru libraries from E. coli using kits, reagents, and protocols from Kapa Biosystems (Wilmington, MA, USA), with minor modifications to the manufacturer's instructions. The largest change we made was to ligate the universal iTru adapter stubs (Table 3,

After sequencing, we demultiplexed reads using Illumina software (bcl2fastq v1.8-2.17; Illumina 2013). We then imported reads to Geneious 6.1.7-R9.0.4, and trimmed adapters and low-quality bases (<Q20). We removed reads with inserts of <125 bases prior to all downstream analyses. We mapped E. coli reads back to NC_000913 using the Geneious mapper (fastest setting, single iteration). We assembled reads from the eukaryotic libraries using the Geneious assembler (fastest setting), and we extracted contigs of 250 to 450 bp from eukaryotic libraries of tarantula, jellyfish, and coral for downstream microsatellite searches using msatCommander

Following initial validation of the iTru primers and the utility of the iTru library preparation approach, we put the iTru system into an extensive test phase in which we routinely used this approach for library construction within our own labs while we also made all components of the

which we used on E. coli) on the P5 side with iTru7 primers on the P7 side. The second batch (n=111) combined iTru5 primers on the P5 side with iTru7 primers on the P7 side. The first batch allowed us to assess iTru7 performance separate from that of iTru5, while the iTru7+iTru5 libraries allowed us to assess performance of the full iTru system relative to all other combinations. For all remaining libraries within the other projects, each group followed the protocols for iTru library preparation described above using combinations of only iTru5 and iTru7 primers.

suggesting that our iTru indexed amplification primers amplify successfully (98.7% success for iTru7; 100% success for iTru5) and perform similarly to one another. There were two iTru7 primers that failed to amplify during their initial tests, iTru7_401_07 and iTru7_209_04. We

Table 4). Using a genome skimming approach, we sequenced the mitogenomes of the shark and coral samples to an average coverage of 33x and 50x, respectively. We used the contig assemblies from our tarantula, jellyfish, and coral samples to design primer pairs targeting >100 microsatellite loci in each taxon. Although the variance in the number of sequencing reads returned per library was higher among these samples than the E. coli libraries, these results demonstrate that the iTru system can be used to prepare libraries from DNA of different organisms extracted using different purification approaches, even DNA that produced very poor results with commercial kits (data not shown).

Larger-scale Tests
Our beta test allowed us to collect sequence data from many different iTru5 and iTru7 primers

After testing the iTru system in several labs, we made several changes in our approach. The most significant of these were: 1) to modify our original naming scheme so that researchers can easily identify sets of iTru7 primers that are compatible or incompatible with TruSeq indexes, and 2) to increase the amount of iTru5 and iTru7 aliquoted into plates after oligo synthesis (from 0.625 nmol to 1.25 nmol), which reduced library amplification failures that resulted from improper hydration of low-quantity primers in specific wells of plates. The naming scheme and concentrations used in all supplemental files and the naming scheme we used in the Methods section reflect these changes to minimize confusion. After making these changes, we and others have successfully produced libraries and sequencing reads from the vast majority of iTru5 and iTru7 primers detailed in the supplemental files, and we have no evidence suggesting that any of the primer sequences will not work correctly. The original sets of iTru7 primers (sets 00-13) exist, but they have mixed compatibility with Illumina indexes; thus, we encourage beta users to exhaust those primers quickly and adopt the new sets.

It is important to note that the iTru5 and iTru7 primers are grouped into "balanced" sets

Although all researchers endeavor to conduct mistake-free experiments, foul-ups are certain to occur. In addition to simple record-keeping errors, a very common mistake is flipping the orientation of one of the strip tubes containing iTru primer aliquots. Thus, it is critical to have the capacity to quickly and easily determine what index sequences and combinations are present within a sequencing run. We have developed a small and fast Python program (Supplementary File 17) that can count the indexes within a file of reads that were not assigned to specific samples during demultiplexing (i.e., the undetermined reads from bcl2fastq).
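The idea of counting indexes among undetermined reads can be sketched in a few lines of Python. This is not the program in Supplementary File 17; it is a minimal illustration that assumes each FASTQ record spans exactly four lines and that the index sequence(s) appear as the last colon-delimited field of the header line (e.g., `1:N:0:ACGTACGT+TGCATGCA`). The function names and the example file name are hypothetical.

```python
from collections import Counter
import gzip


def count_indexes(lines):
    """Count index strings found in FASTQ header lines.

    Assumes 4-line FASTQ records and that the index is the text after
    the last ':' in each header line (an assumption, see lead-in)."""
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:  # first line of each record is the header
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


def count_indexes_in_file(path):
    """Same as count_indexes, reading a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_indexes(handle)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python count_indexes.py Undetermined_R1.fastq.gz
    for index, n in count_indexes_in_file(sys.argv[1]).most_common(20):
        print(index, n)
```

Sorting by count brings unexpected but abundant index combinations to the top, which is where a flipped strip tube or a sample sheet error would show up.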

Other Applications and Future Modifications
It is possible to use the iTru system for a variety of purposes beyond what we describe here. For example, we have used the iTru system for making RNAseq libraries using Kapa library kits, but any approach that yields double-stranded template molecules with a single adenosine can be used with no significant modifications to what we have described. One of the attractive features of our system is that it separates the primers and stubs into more manageable units. We have also used several of the approaches described above to modify the iTru system for use with amplicon sequencing and RADseq studies. In subsequent Adapterama papers, we use these same iTru primers with different adapter stubs to construct double- to quadruple-indexed amplicon libraries

for the purpose of further manipulation or NGS library construction. In this paper we will make use of double-stranded DNA adapter stubs (see below).

barcodes - see index or tag; this term is also used to mean a DNA sequence that can be used to identify the taxon from which a sample derives, thus we avoid using this ambiguous term.

The color scheme follows Figure 1, except that the sequences derived from the complementary

Table 3. iTru and iNext adapter stub oligonucleotides and tagged primer sequences. All sequences are given in 5' to 3' orientation. To make it clear which portions are constant among all tagged primers, as well as to identify function, the tagged primers are given in three pieces (the invariant 5' end, the tag sequence which varies among primers, and the invariant 3' end), but the primers are obtained as a single contiguous fusion of these three pieces. Complete balanced sets of primers are available as Supplemental Files (4, 15)

overhangs is ligated to stubby Y-yoke adapters with G overhangs (cf. Fig. 2). The C overhangs prevent chimeric ligation of genomic DNA molecules, and the G overhangs prevent ligation

position), which allows ligation of stubs to genomic DNA. During limited cycle PCR, iNext5 and iNext7 primers anneal to the ends of the Y-yoke adapters to produce full-length, double-indexed molecules (cf. Figs. 2, S3, and S4).

Illumina still supports libraries with a single index; the i7 index (i.e., Indexing Read1) is always used in these instances. If libraries of this type are mixed with iTru, or any other dual-indexing libraries, and both index sequencing reads are obtained from the pool, an i5 sequence will be generated, but different strands and thus positions will be sequenced based on which instrument (indexing read2 primer) is used. The i5 sequence obtained will be GTGTAGAT from NextSeq and MiniSeq, whereas the sequence ACACTCTT is obtained from MiSeq and HiSeq ≤2500 instruments. HiSeq ≥3000 instruments initially generate the sequence GTGTAGAT, but that is

NextSeq High Output run, and the target for each sample was 1.0% of the total reads generated across the partial run (blue). The heat map shows deviations from the optimal percentage.

Figure S12. The percentage of reads generated for each combination of iTru5 and iTru7 from a study of 183 carangimorph fish lineages. Data were generated from a partial, PE150, Illumina NextSeq High Output run, and the target for each sample was 0.5% of the total reads

[Figure: adapter sequence diagram showing the annealed adapter strands (5' and 3' ends marked) and the positions of the i7 and i5 indexes]