Comprehensive profiling of retroviral integration sites using target enrichment methods from historical koala samples without an assembled reference genome

Background. Retroviral integration into the host germline results in permanent viral colonization of vertebrate genomes. The koala retrovirus (KoRV) is currently invading the germline of the koala (Phascolarctos cinereus) and provides a unique opportunity for studying retroviral endogenization. Previous analysis of KoRV integration patterns in modern koalas demonstrate that they share integration sites primarily if they are related, indicating that the process is currently driven by vertical transmission rather than infection. However, due to methodological challenges, KoRV integrations have not been comprehensively characterized. Results. To overcome these challenges, we applied and compared three target enrichment techniques coupled with next generation sequencing (NGS) and a newly customized sequence-clustering based computational pipeline to determine the integration sites for 10 museum Queensland and New South Wales (NSW) koala samples collected between the 1870s and late 1980s. A secondary aim of this study sought to identify common integration sites across modern and historical specimens by comparing our dataset to previously published studies. Several million sequences were processed, and the KoRV integration sites in each koala were characterized. Conclusions. Although the three enrichment methods each exhibited bias in integration site retrieval, a combination of two methods, Primer Extension Capture and hybridization capture is recommended for future studies on historical samples. Moreover, identification of integration sites shows that the proportion of integration sites shared between any two koalas is quite small.

149 step strategy, as previously described (Kircher, Sawyer & Meyer, 2012). A quality control 150 strategy (Meyer et al., 2008) was also applied, which consisted of qPCR to quantify the product 151 after each step of library amplification. Based on qPCR results, three samples for which DNA 152 quality was too poor for analysis were excluded from further processing.

153
In the first round of amplification, AmpliTaq Gold, a non-proof reading enzyme, and 154 indexing primers (Table S1) Table S6 summarize the computational pipeline used for the identification 242 of KoRV integration sites. For its implementation, both existing software and customized perl 243 scripts were used that made use of BioPerl (Stajich et al., 2002). Because the nested primers or 244 bait were designed near the ends of LTR, the primer extension products would include either the 245 first 49 bp of the 5' LTR or the last 82 bp of the 3' LTR, which are designated "LTR ends" in 246 Figure 2A. All sequences with a KoRV flank should contain an LTR end, as a result of the 247 primer extension ( Figure 2B). Therefore, KoRV integration sites could be identified as the 248 sequence beyond the KoRV LTR end since all integration sites would be attached to an LTR 249 sequence. However, due to DNA degradation in museum samples, some primer extension 250 products may not have a complete LTR end. Furthermore, minor deletions at the end of the 251 integrated LTRs may be present (Fields, Knipe & Howley, 1996); for example, a 19 bp deletion 252 was found in a KoRV provirus . To get around these potential issues, 253 identification of the LTR ends relied on sequentially selecting sample sequences that contain 254 defined LTR segments; this was done in separate steps for the 5' and 3' flank-containing 255 sequences. The LTR end was divided into two segments, designated A and B ( Figure 2B): the B 256 segment corresponds to the last 19 bp of the LTR and is referred to as 5B or 3B in the 5' and 3' 257 LTR ends, respectively. The A segment is the remaining section of the LTR end, which has a 258 length of 30 bp in the 5' end (5A) and 63 bp in the 3' end (3A).

259
Initially, sequences containing either of the two A regions in the KoRV LTR end (5A or 260 3A in Figure 2B) were identified. For this step, optimal local pairwise sequence alignments Manuscript to be reviewed 301 with a match to the wallaby scaffolds or the koala data, the LTR sequences were trimmed and 302 were then concatenated with the KoRV flanks (obtained in previous steps) for further analysis.
303 Sorting of sequences representing different integration sites 304 All sequences with matches to the different segments of the 3' and 5'LTR ends and/or to 305 wallaby scaffolds or koala HiSeq data from each of the enrichment techniques were collected.
306 The sequences matching 3' and 5' LTR ends were kept separate, resulting in a total of six 307 different data sets for further analysis (two data sets each for the PEC, SPEX and hybridization 308 capture). LTR ends were removed from all sequences in these data sets. Before using these 309 sequences to identify shared and unique integration sites, all KoRV flanks were sorted into three 310 categories by length (Table 3) 339 clusters were evaluated to be of high quality, the sequence was further analyzed. The parameter 340 combinations for optimal clustering and related all against all BLAST are listed in Table 4.

341
Singletons and non-singleton clusters containing sequences derived from a single 342 individual koala were considered to represent unique integration sites. Clusters containing 343 sequences shared by more than one koala were considered to represent shared integration sites 344 (Table S4 and  357 Identified sequences for either one of the two computations were considered to be KoRV 358 integration sites. Sequences shorter than 15 bp are too short for efficient mapping or BLAST; 359 however, because they contained an LTR end, were included in the KoRV specific enrichment 360 statistics (Table 2), although they were not further analyzed.   Table 3). An additional 15,822 400 sequences were less than 4 bp; these could not be further analyzed since their length was shorter 401 than the target site duplication, these are listed as KoRV flanks shorter than 4 bp in Table 3.  (Table 3).  (Table 3).  (Table 3).
428 Summary of computational data processing 429 At each step of our bioinformatics pipeline, we recorded for each experiment the number 430 of sequences that met our screening criteria (Figure 3). The mean length, minimum length and 431 maximum length of sequences were also calculated at each step (   (Table S4).

505
Among 5' integration sites, three were shared between the museum samples in this study 506 and those used in . Two integration sites were found to be shared between 507 the museum samples of this study and Pci-SN265, and two integration sites were found shared  Table 1).
545 Several samples in that study failed to yield PCR products but were successful here, likely 546 because shorter sequences, less than 100 bp, are easily retrieved by the methods applied by the 547 current study.

548
Hybridization capture found the greatest number of 5' integration sites, which included 549 nearly all integration sites identified by SPEX and 86.71% of the integration sites identified by 550 PEC (Figure 5). In contrast, for the 3' LTRs, PEC yielded the most integration sites including 551 85.07% and 91.67% of the integration sites identified by SPEX and hybridization capture 552 respectively. The results were generally consistent across individuals and with the data pooled 553 (Table S6) 592 Generally, the low number of shared integration sites between the three studies can be 593 due to the varying level of intensiveness for KoRV flank retrieval, which can potentially miss 594 many shared integration sites. Given the independent aims and methods used across the three 595 studies, statistical modeling of shared KoRV integration sites through time was only performed 596 for the ten museum koalas in this study (Article S1 Fig S2). While the number of koalas 597 examined is few, a statistically significant increase in integration site sharing was observed over Manuscript to be reviewed Primer extension reactions of the biotinylated oligos were performed only to these hybridized libraries. Biotinylated oligos were collected by magnetic beads together with the hybridized targeted libraries. The single stranded libraries with target sequences were dissociated with biotinylated oligos and were eluted for subsequent PCR amplification. In contrast for SPEX, DNA extracts are directly denatured to be single stranded and mixed with the same biotinylated oligos used in PEC for 1 minute at 55°C. Similar as in PEC, primer extension reactions of the biotinylated oligos were performed only to the single molecules (target sequences) hybridized with biotinylated oligos. These hybrid molecules were collected using magnetic beads. The original single stranded target molecules were washed away and the biotinylated oligos with 3' extension were eluted off the beads and were treated with a poly C tailing reaction. These poly C tailed molecules were amplified using primers with a 5' overhang of the Illumina sequencing adaptor. Through this process, the SPEX products were constructed into Illumina libraries without an additional library preparation step. These SPEX-Illumina libraries were then used in an index PCR and a further amplification step. As shown, SPEX requires at least one more amplification step than HC or PEC, which may explain the high level of clonality in the SPEX result. Manuscript to be reviewed . Panel A illustrates that the genome of the koala retrovirus (KoRV) has two identical long terminal repeats (LTRs) on both ends. The primers or baits can bind to both LTRs, so two categories of products exist: A) products extending into the flanks from primer extension a; B) products extending into the middle of KoRV genome from primer extension b. In principle, there should be an equal number of sequences for the two categories. Panel B indicates that the KoRV LTRs contain three components, U3, R and U5. For SPEX, primers were partially nested. All primers are 20 bp long, and there is a 8 bp-overlap between the inner primers (3.1 and 5.1) and outer primers (3.2 and 5.2) respectively. To avoid known polymorphisms in the LTR, the 3' end of outer primers are 17 bp from the 5' end of LTR and 50 bp from the 3' end of LTR. Since the 5' LTR and 3' LTR of the same KoRV are identical products can also extend into the KoRV genome. The 5' and 3' flanks can be distinguished by their linked LTR end, with the 5' flank linked to 5' LTR and 3' flank linked to 3' LTR. Considering the longest deletion found at the end of LTR is 19 bp, the LTR end was divided into two segments for subsequent computational identification: the B region representing the last 19 bp of the LTR, and the A region representing the rest of LTR end. . The pipeline was run separately for each data set obtained by three different techniques. For the key steps, the number of sequences retained is indicated in parentheses for each technique in this order from left to right: PEC, SPEX and hybridization capture. After processing NGS reads, KoRV integration sites were identified in a two-step analysis of KoRV LTR ends, next to the host DNA flanking KoRV. The first round of selection targeted the A region of the LTR end and its output, was used for subsequent identification of the B region. The LTR ends of all sequences were trimmed off, and only sequences longer than four bp were considered. Using a sequence clustering approach, unique vs. shared integration sites were sorted into clusters. The consensus of each non-singleton cluster was computed using a multiple sequence alignment. These consensus sequences and singleton sequences were queried against wallaby genomic scaffolds and koala Illumina Hiseq reads to determine whether they represented KoRV flanking sequences. At the same time extension products into the KoRV genome were identified. Manuscript to be reviewed