Accurate Detection of HPV Integration Sites in Cervical Cancer Samples Using the Nanopore MinION Sequencer Without Error Correction

During the carcinogenesis of cervical cancer, the DNA of human papillomavirus (HPV) is frequently integrated into the human genome, which might be a biomarker for the early diagnosis of cervical cancer. Although the detection sensitivity of virus infection status increased significantly through the Illumina sequencing platform, there were still disadvantages remain for further improvement, including the detection accuracy and the complex integrated genome structure identification, etc. Nanopore sequencing has been proven to be a fast yet accurate technique of detecting pathogens in clinical samples with significant longer sequencing length. However, the identification of virus integration sites, especially HPV integration sites was seldom carried out by using nanopore platform. In this study, we evaluated the feasibility of identifying HPV integration sites by nanopore sequencer. Specifically, we re-sequenced the integration sites of a previously published sample by both nanopore and Illumina sequencing. After analyzing the results, three points of conclusions were drawn: first, 13 out of 19 previously published integration sites were found from all three datasets (i.e., nanopore, Illumina, and the published data), indicating a high overlap rate and comparability among the three platforms; second, our pipeline of nanopore and Illumina data identified 66 unique integration sites compared with previous published paper with 13 of them being verified by Sanger sequencing, indicating the higher integration sites detection sensitivity of our results compared with published data; third, we established a pipeline which could be used in HPV integration site detection by nanopore sequencing data without doing error correction analysis. In summary, a new nanopore data analysis method was tested and proved to be reliable in integration sites detection compared with methods of existing Illumina data analysis pipeline with less sequencing data required. It provides a solid evidence and tool to support the potential application of nanopore in virus status identification.


INTRODUCTION
Human papilloma virus (HPV) is the major cause of cervical cancer, and HPV16 and 18 are the two most prevalent highrisk HPV types worldwide. Cervical cancer is the second most common malignant cancer in females; 570,000 women are diagnosed with cervical cancer and 311,000 die of this disease each year (Bray et al., 2018). Although it is known that persistent HPV infection causes cervical cancer, it remains unclear how HPV induces carcinogenesis and what exactly plays the most important function in the process. After HPV infection, HPV proteins E6 and E7 are expressed, which inhibits tumor protein 53 and retinoblastoma protein, disrupts cell proliferation and many other biological processes, and induces cervical cancer (Balasubramaniam et al., 2019). In addition, HPV DNA could also integrate into human DNA, which is an early and important event during carcinogenesis (Nicolas et al., 2004;Pett and Coleman, 2010;Schmitz et al., 2012;Christiansen et al., 2015;Bodelon et al., 2016;Walline et al., 2017), and involves epigenetic mechanisms that affect the expression of key genes in the tumor transformation process. Maria et al. (2015) proposed a scheme of modifications and alterations generated by HPV integration into the host genome that can lead to carcinogenesis. This recent evidence shows that HPV integration often preferably affects those genes that are continuously expressed during DNA transcription and repair to induce carcinogenesis (Oyervides-Muñoz et al., 2018). In addition, the progression to cancer can be explained by viral DNA integration into tumor suppressor genes; this integration into host DNA inactivates those genes leading to uncontrolled growth . Understanding viral oncogenesis is critical for the clinical management of HPVpositive cancer (Morgan et al., 2017).
An increasing number of studies have indicated that detecting HPV DNA integration has become mainstream in HPV oncogenic research worldwide, leading to the discovery of HPV integration into the human genome (Svetlana et al., 2008;Hu et al., 2015;Zhang et al., 2016;Shen et al., 2017;Ibragimova et al., 2018;Warburton et al., 2018). Hu et al. (2015) found that HPV randomly integrates into the human genome with the tendency to integrate into genomic hot spots. Svetlana et al. (2008) found that HPV integration sites significantly vary among HPV types. Warburton et al. (2018) found that HPV integration generates a super-enhancer-like element composed of tandem interspersed copies of the viral upstream regulatory region and a cellular enhancer, which drive high levels of oncogene expression. Zhang et al. (2016) and Shen et al. (2017) discovered that HPV integrated into the human genome can significantly increase related gene expression. Ibragimova et al. (2018) and Koneva et al. (2018) discovered that patients with integrated HPV status have a worse survival rate compared to those with episomal HPV. The increasing number of studies about HPV integration has stimulated the idea that HPV integration status may be a biomarker for the diagnosis, progression, and survival prediction, and even as a biomarker for cancer screening (Han et al., 2015;Liu et al., 2018;Grayson et al., 2019;Jiang et al., 2019). Harlé et al. (2019) found the same integration sites and DNA deletion regions between tongue and anal cancer of the same patient. Campitelli et al. (2012) analyzed the serum signal of HPV integration sites and found that it could be used as a liquid biopsy marker to monitor the effects of individual treatment plans. HPV integration status and locus discovery have become increasingly important not only for discovering the mechanism underlying virus infection, but also for application in clinical diagnosis and treatment to decrease the incidence and improve cervical cancer treatment.
Different methods have been used to study HPV integration sites including amplification of papillomavirus oncogene transcripts (APOT), detection of integrated papillomavirus sequences by ligation-mediated PCR (DIPS-PCR), and nextgeneration sequencing (NGS). Das et al. (2012) found 48 integration sites by APOT, which is a method based on RNA level detection of HPV integration; therefore, the results greatly rely on the sample quality. Luft et al. (2001) discovered 22 integration sites by DIPS-PCR including known and new integration sites. Although it is a method based on DNA detection, the potential for identifying new integration sites is limited. With the development of NGS, whole genome sequencing has been used for virus integration sites detection (Tuna and Amos, 2016;Chae et al., 2018). However, it requires large amounts of sequencing data, and thus is not applicable in clinical usage, which requires fast and accurate results. To date, the best way to detect integration sites is the probe capture sequencing method. After enrichment of virus genomic material using this method, the fusion fragment of human and HPV sequence is isolated and further sequenced by NGS. Many studies have used this method to identify large amount of integration sites with good accuracy (Chandrani et al., 2015;Liu et al., 2016a,b;Li et al., 2018Li et al., , 2019. Although NGS is powerful for discovering HPV integration sites, the quality of the results relies on many factors including the probe capture efficiency, sequencing depth, and analysis method. In addition, the requirement for rapid and accurate results for clinical application is not met with this method. Therefore, the development of a new method for HPV integration detection with high accuracy and prompt reporting capacity is crucial for future clinical application.
In this study, we developed a novel method to detect HPV integration sites using the third generation nanopore sequencing platform. We compared our detection results with previously published and newly generated Illumina results. Our results showed that third-generation sequencing technology can be used to detect HPV integration sites.

Sample Collection
A fresh tissue specimen was collected from a patient with IIIb squamous cell carcinoma who had undergone surgery at Anyang Cancer Hospital (Henan province, China) in 2009. The patient was 56 years old and had HPV16-positive cancer. This study received ethical approval from the Institutional Review Board of the hospital. Individual informed consent had been collected from this study participant.

Illumina Library Preparation
Sheared DNA was used to make genomic DNA libraries using the NEBNext R UltraII TM DNA Library Prep Kit for Illumina R (E7370L) and NEBNext R Multiplex Oligos for Illumina R [E6609L; New England Biolabs (NEB), Ipswich, MA, United States] as recommended by the manufacturer. In short, 200 ng DNA was sheared in a 50 µL volume using the Covaris M220 Focused-ultrasonicator (Covaris, Cambridge, MA, United States) for 340 s with a duty cycle of 10%, cycles per burst of 200, and peak power of 75. Then, fragmented DNA was end-repaired and A-tailed using the NEBNext R UltraII TM Mix and buffer, and ligation of DNA adapters was done using the NEBNext R UltraII TM Ligation Master Mix and Enhancer. Then, 0.9 × AMPure XP beads (Beckman Coulter Inc., Skyesville, MD, United States) were used to remove all traces of adapter dimers. The adapter-ligated DNA was amplified with six cycles of PCR using the NEBNext R UltraII TM Q5 Master Mix. The generated library was quantified with the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) using the Qubit 3.0 fluorometer (Invitrogen, Carlsbad, CA, United States), and run on the Qsep1 TM biosystem (BiOptic Inc., NTC, Taiwan) for quality analysis. The final size of the electrophoresis fragment was about 320 bp.

Targeted Capture and Illumina Sequencing
Human papillomavirus probes were designed by Integrated DNA Technologies (Integrated DNA Technologies, Coralville, LA, United States), according to the full-length genome of HPV16. Overall NGS target enrichment with xGen Hybridization and a Wash Kit was conducted by Integrated DNA Technologies, which is briefly described below and performed as detailed by the plate standard protocol for xGen R hybridization capture of DNA libraries from Integrated DNA Technologies. In brief, wholegenomic libraries were hybridized with HPV probes (Integrated DNA Technologies), absorbed onto the beads via Dynabeads MyOne Streptavidin T1 (Thermo Fisher Scientific), after which the uncaptured DNA fragments were removed by washing. Then, the eluted fragments containing the targeted gene were enriched by 15 cycles of PCR to generate libraries for sequencing. The captured library was quantified with the Qubit dsDNA HS Assay Kit using a Qubit 3.0 fluorometer (both from Thermo Fisher Scientific) and run on the Agilent 2100 TapeStation (Agilent Technologies, Santa Clara, CA, United States) for quality analysis prior to sequencing. DNA libraries were sequenced using the NextSeq platform (Illumina) with 150 bp paired-end reads.

Library Preparation for Nanopore Sequencing
The sheared DNA was used to make genomic DNA libraries using the NEBNext R UltraII TM DNA Library Prep kit for Illumina R (E7370L) and NEBNext R Multiplex Oligos for Illumina R (E6609L), as recommended by the manufacturer, followed by DNA shearing with the M220 Focused-ultrasonicator (Covaris). Then, 200 ng genomic DNA was sheared into longer fragment sizes of about 800 bp in a 50 µL volume on the Covaris M220 (Covaris) for 50 s with a duty cycle of 20%, cycles per burst of 200, and peak power of 50 according to the manufacturer's instructions. Then, the fragmented DNA was end-repaired and A-tailed using the NEBNext R UltraII TM Mix and buffer, and ligation of adapters was done using the NEBNext R UltraII TM Ligation Master Mix and Enhancer, after which 0.9X AMPure XP beads (Beckman Coulter) were used to remove all traces of adapter dimers. The adapter-ligated DNA was amplified with six cycles of PCR using the NEBNext R UltraII TM Q5 Master Mix. The generated library was quantified with the Qubit dsDNA HS Assay Kit using the Qubit 3.0 fluorometer (both from Thermo Fisher Scientific) and run on the Qsep1 TM biosystem (BiOptic) for quality analysis. The final size of the electrophoresis fragment ranged from 250 to 1500 bp. The overall capture experiment with xGen Hybridization and Wash Kit was conducted by Integrated DNA Technologies as detailed by the standard protocol for xGen R hybridization capture of DNA libraries from Integrated DNA Technologies with slight modifications. Briefly, the HYB program extends for 4 to 6 h that incubating the tubes at 65 • C with a heated lid set at 105 • C. The captured library was quantified with the Qubit dsDNA HS Assay Kit using the Qubit 3.0 fluorometer (both from Thermo Fisher Scientific) and run on the Agilent 2100 Tape Station (Agilent Technologies) for quality analysis prior to nanopore sequencing. DNA libraries were sequenced using MinION TM MK1B device from Oxford Nanopore Technologies (ONT; Oxford, United Kingdom).

Nanopore Sequencing
The capture of HPV 16 probe-template duplexes was done using the Ligation Sequencing Kit (SQK-LSK108) with Native Barcoding Expansion (EXP-NBD103) from ONT, and performed as detailed by the 1D Native barcoding genomic DNA standard protocol from ONT. Then, 310 ng purified amplicon DNA was dA-tailed using the NEBNext Ultra II End Repair/dA Tailing module (E7546S; NEB) at a temperature of 20 • C for 30 min and at 65 • C for 30 min using the thermal cycle. Then the DNA was purified with AMPure XP beads (A63881; Beckman Coulter), and 60 µL (1×) of re-suspended AMPure XP beads was added by pipetting and thoroughly mixed on a rotator mixer for 5 min. The beads were separated on a magnet to remove the supernatant. The beads were kept on the magnet and washed twice using 200 µL of 70% ethanol without disturbing the pellet, followed by re-suspension of the pellet in 25 µL nuclease-free water. Then, 275 ng end-prepared DNA was ligated with native barcode NB04 from ONT using the NEB Blunt/TA Ligase Master Mix (M0367S; NEB), and then incubated at 25 • C for 15 min. Following the barcode ligation reaction, the DNA was cleaned again with 1X Agencourt AMPure XP beads (Beckman Coulter) and eluted in 26 µL. The resulting DNA was pooled with another unrelated barcoded library by equivalent weight. Then, 543 ng pooled barcode DNA was used to perform the adapter ligation step, and 20 µL Barcode Adapter Mix, 10 µL Quick T4 DNA Ligase, and 20 µL of NEB Next Quick Ligation Reaction Buffer (5X; E6056S; NEB) were added in that order to the pooled barcoded 50 µL DNA. The reaction was incubated at room temperature for 15 min, and then cleaned using 0.4X Agencourt AMPure XP beads (Beckman Coulter). Then, the beads were washed twice with 140 µL Adapter Bead Binding, re-suspended in 15 µL Elution Buffer, and incubated for 10 min at room temperature before pelleting in a magnetic rack. The prepared library was combined with 35 µL running buffer with Fuel Mix, 25.5 µL Library Loading beads and 12 µL DNA library (∼158 ng) and loaded into the SpotON sample port of the R9.4 flowcell. All nanopore sequencing runs were conducted using the MinION TM MK1B device following the recommended sequencing protocols by ONT. The captured library was sequenced individually using the FLO-MIN-106 flow cell. MinKNOW software (v18.07.18) was used to control the MinION device following a 24-h run script, and sequencing data were collected in real-time and processed into basecalls using the Metrichor TM agent.

Bioinformatic Analysis
The nanopore sequencing results were basecalled using the EPI2ME interface (v. 2.59.1896509). For passed 1D reads, quality scores (Q-score ≥ 7) and length distributions were evaluated using FastQC. Obtained fastq files were converted to fasta files using the FASTX-toolkit 1 . These fasta files were aligned to the Illumina adapter database by LAST (version: lastal 956; Frith et al., 2010;Kiełbasa et al., 2011), which alignment parameters were mismatch cost: 1, gap extension cost: 1, gap existence cost: 1, and minimum score for gapped alignments: 45. Then these reads were divided into two group reads, which we defined that the only library (OL) structure reads and multiple library (ML) structure reads ( Figure 2B). The ML reads were split to the OL reads by the Illumina adapter sequences. All OL reads were aligned to the human genome (GRCh37/hg19) and HPV genome (NC_001526.2) databases using Blat (version: v36; Kent, 2002), with parameters were stepSize = 5, repMatch = 2,253, minScore = 20, and minIdentity = 0. We also defined the highest score alignment result as the best OL reads alignment result to find the breakpoint of HPV integration. We used Samtools (version: v1.9) software to analyze the HPV16 genome sequencing depth. In each one OL reads, we defined the breakpoint of HPV integration which the gap and overlap between the best human alignment result and the best HPV alignment result less than 10 bp. The breakpoint was annotated to gene function using ANNOVAR 2 and an in-house program with human reference and HPV16 reference, respectively. The sequencing reads generated by Illumina sequencer were analyzed using the SEGF pipeline (Xu et al., 2018). A diagram of the integration site distribution was made using Illustrator software. Gene function analysis was applied using the DAVID online analysis tool (Huang et al., 2009a,b).

PCR Verification
To validate the selected HPV integration into the human genome detected by nanopore sequencing, primers were designed, one of which was derived from the human genome at the potential site of integration, and the other of which was against HPV sequences suspected of being near the site of integration within the HPV genome. Both primers were designed 100∼400 bp away from the detected integration sites based on the nanopore targeted-capture reads, and 13 HPV integration sites (n ≥ 5) were selected. PCR was performed using each primer set for DNA that was previously used for HPV capture. As a template, 1 µL (30 ng) genomic DNA solution was used in the subsequent PCR. The PCR reaction mix was prepared in a total volume of 20 µL containing 4 µL 5X Phoenix Hot Start Taq Reaction Buffer, 2 µL dNTP (2.5 mM), 0.5 µL of each forward and reverse primer (10 mM), 0.2 µL Phoenix Hot Start Taq DNA Polymerase (500 U), and 11.8 µL nuclease-free water (not DEPC-Treated). The PCR conditions were as follows: 5 min at 95 • C; 35 cycles of 30 s at 94 • C; 60 s at 50-60 • C for annealing; and 60 s at 72 • C; followed by 72 • C for 1 min. PCR products were analyzed by electrophoresis on 2.0% agarose gels, purified, and Sanger sequenced.

Nanopore and Illumina Sequencing Results
The captured library was constructed and sequenced by ONT MinION. The sequencing run generated 381,475 (149.1 Mb) sequencing reads from 977 active pores. To obtain high-quality reads, the raw reads were filtered using the Metrichore 1D base calling program and kept for further analysis if a Qscore ≥ 7 was obtained. In total, 333,028 1D reads (130.2 Mb) were retained with the read lengths ranging from 67 to 7,404 bp, with a mean read length of 502.8 bp, and the quality score ranged from 7 to 17, with a mean value of 12.8. The distribution diagram of the read lengths and quality scores are shown in Figure 1 and other sequence details are summarized in Table 1. The captured Illumina library was sequenced by Illumina sequencing machine and 11,093,630 raw reads (1.67 Gb raw bases) were generated. After data cleaning, 11,093,535 clean reads (1.64 Gb) were obtained with an average length of 151 bp. The summary of sequencing results from the two platforms are listed in Table 1 and the sequence coverage of HPV16 genome by Illumina and nanopore sequencer was shown in Supplementary Figure S1. It showed that HPV16 full genome was covered by both sequencing platforms and Illumina result provided obviously higher sequence depth than nanopore. The sequence depth of the most HPV16 genome regions are even both on Illumina and nanopore results,  Although it showed a significant decrease of the sequence depth in this region, it provided more than 100× sequencing depth of it.

Identification of HPV Integration Sites by Nanopore and Illumina Platforms
The sequencing data from nanopore sequencing were analyzed according to the pipeline described in the Materials and Methods, shown in Figure 2A. There were 12,951 sequence reads in the human genome and 389,839 reads in the HPV genome. Next, the potential integrated reads were analyzed using BLAST with the human and HPV genome sequences to determine the exact integration breakpoints. Without performing error correction analysis, 7,859 reads were identified in both the human and HPV sequences. To filter out the chimeric product from the library construction process, we used the number of overlapped bps as filter criteria. The HPV and human part of the same sequencing read were identified with different bp-length gaps. Only reads, whose gap was shorter than 10 bp length, were kept for further analysis. A total of 339 integration sites were finally identified (Supplementary Table S1 There were 60 breakpoints in total were included. The distribution of all breakpoints in the different chromosomes is shown in Figure 3A. Chromosomes chr20, chr2, chr6 were the three chromosomes with 12, 11, and 8 breakpoints, respectively. Both chr1 and chrX had 7 and 7 integration sites each and the rest of the chromosomes had ≤5 integration sites. The identified read numbers of each integration site were highly different, with a range change from 2 to 406. There were 31.7% (19 of 60) integration sites had more than 10 reads. The top 19 abundant integration sites were distributed in chromosomes 20(6), 6(4), X (4), 11(3), and 2(2). Among the top five most abundant breakpoints, There were 3 located in chromosome 20, HPV16:2804,chr20:32516985 (406 reads), HPV16:7139,chr20:32478733 (169 reads), HPV16:4276,chr20:32502143(51 reads), and two were in chrX, HPV16:5534,chrX:20464412 (132 reads), HPV16:3163,chrX: 20462930 (50 reads). Figure 3B indicates the integration site distribution in the HPV genome. Within the 60 identified integration sites, 18 were in the L2 gene, 12 in L1, 11 in E1, 7 in E2, and <4 in the each of the other genes (E6, LCR, E5, and E7). On the human genome, there are 38 integration sites in the intergenic region, 18 in the intronic region, 2 in the ncRNA region, 1 in the exonic and 1 in the downstream region of the genes, as indicated in Figure 3C. To further look at the breakpoints in the same chromosome, in chr20, most breakpoints (10 of 12) located in the intergenic region of two genes CHMP4B, RALY-AS1, one of the other two breakpoints in intergenic region of KIF3B, ASXL1, and the other is in the intronic region of gene ATRN. With the exception of chr20, other chromosomes also have obvious cluster tendency of integration sites, like six integration sites of chrX in the intergenic region of genes RPS6KA3 and CNKSR2. Three breakpoints in chr6 were in the intronic or downstream of gene CAGE1. 10 regions of human genome were identified with more than two breakpoints clustered together, indicating there were integration sites cluster in the human genome. In addition, 14 breakpoints without any cluster tendency were observed. Illumina identified 1,718 integration sites, and only the integration sites with ≥ 3 reads were collected for further analysis. All sites are listed in Supplementary Table S2. The characteristic analyses of these integration sites are summarized in Figure 3.

Integration Sites Difference of All Three Data Sets
Capture-based Illumina identified 54 integration sites with ≥ 3 reads sequenced, and four sites (HPV16:7619,chr2:99404946 and HPV16:6558,chr2:99431140, HPV16:850,chr20:32515476, and HPV16:462,chr20:30942164) with less than three but overlapping with nanopore results showed in Table 2. Altogether, there are 54 integration sites were included for further analysis. A previously published paper (Liu et al., 2016a) reported 19 integration sites from the same sample we used. All integration sites were compared and the overlapping sites by either method were recorded. The numbers are shown in the Venn diagram in Figure 4A, and the detailed list of integration sites is shown in Figure 4B and Within the integration sites identified by all the three methods, Illumina, nanopore and Liu paper, with relative reliable abundance, nanopore identified the most validated integration sites. All the unique integration sites and their detailed information are listed in Table 2. To test the accuracy of the new integration sites identified by our data, we performed Sanger sequencing. Together, 14 integration sites were chosen for verification, and 13 were PCR-amplified and successfully sequenced, indicating the true positive of these integration sites, which were mostly identified by both nanopore and Illumina platforms. Detailed information on the designed primers and sequencing results with chromas images are summarized in Supplementary Table S3 and Supplementary Figure S2. The verified integration sites show different coexisting situations. Eight coexisted in nanopore and Illumina integration sites datasets, which indicating the advantage of our integration detection pipeline of both platforms compared with previouse published pipeline. There were 9 integration sites with more than 10 identified reads in each of the three datasets, and four integration sites with less than 10 reads in all three datasets were verified. The high verification rate (92.86%) of integration sites also indicated the high integration sites detection accuracy. Also, four sites were verified to exist in the intronic region, and the remaining nine integration sites were found to exist in the intergenic region. The PCR gel image and the sequencing result were shown in Figure 5 and Supplementary  Table S3, respectively.

Affected Gene Function Analysis
All the integration sites identified by the three platforms were combined to the final unique integration sites list ( Table 2) for further analysis with nanopore reads (n ≥ 2) and Illumina reads (n ≥ 3). A total of 83 unique integration sites were found, each of which were annotated, and further function classification and pathway analyses were conducted. If the integration sites located in intergenic region, the closer gene was used for analysis. The genes were clustered into eight biology processes including positive regulation of transcription from RNA polymerase II promoter (6 genes), negative regulation of transcription from RNA polymerase II promoter (6 genes), regulation of transcription from RNA polymerase II promoter (5 genes), transcription from RNA polymerase II promoter (4 genes), negative regulation of retinoic acid receptor signaling pathway (2 genes),establishment of skin barrier (2 genes), proximal/distal pattern formation (2 genes) and embryonic limb morphogenesis (2 genes) and six cytoplasm (13 genes), nucleoplasm (8 genes), extracellular exosome (8 genes), protein binding (6 genes), DNA binding (2 genes) and actin binding (2 genes). Detailed information on the function classification results is shown in Supplementary Figure S3 and Supplementary Table S4.

DISCUSSION
Most studies have indicated that nanopore sequencing results contain about a 10-15% error rate (Nagarajan and Pop, 2013;Melanie et al., 2015), which greatly limits the application of the nanopore platform in the genome study, especially in clinical applications. Although some studies have indicated that error correction can significantly improve the assembly result and help with finding small variations in the genome (Lu et al., 2016;Sović et al., 2016;Xiao et al., 2017), choosing appropriate error correction software can be challenging for researchers who are not good at bioinformatics analysis. In addition, different error correction software provides different results that can potentially affect the study results. In our study, we combined two steps of different sequence matching analyses, Last and Blat, and after comparing the results with previous verified integration sites, we obtained 16 correct integration sites in 19 positive ones, strongly indicating that our data analysis method is appropriate for integration site discovery analysis by using nanopore sequencing data. Last can determine the human and virus merged sequence promptly and based on the candidate reads, and Blat can help identify the exact break point position accurately. With stringent criteria used for filtering, we filtered out the chimeric PCR product and only kept the correct structured integration sites. Therefore, the results were compatible with the positive one. However, to obtain high capture efficiency when enriching integration sites by using an HPV probe, we only used 500 bp reads for library construction. It has been reported that genome structure might be complex if two integration sites located closely in a genome region (Michael et al., 2014;Walline et al., 2016) and has been found in long sequencing fragment, e.g., 5 kbp (Cretu Stancu et al., 2017). Although we intended to discover the whole structure of virus integration in the human genome as paper reported (Adey et al., 2013;Akagi et al., 2014), the results did not support our hypothesis mainly due to the reason that the reads were not long enough to detect multiple integration sites within one region. To improve it, we need to either try longer reads library construction by HPV probe capture in the future.
With the comparison of the three integration datasets, Illumina, Nanopore, and Liu et al., we found that although overlapping results were found by all of them, there were still unique breakpoints identified by each dataset. Our Illumina data contained 22 sites that were not identified by nanopore and previous studies. Although nanopore results provided good coverage of published data, it still missed some sites by our data analysis, possibly caused by the differences in the library construction step. The capture efficiency of the probes might be different when using different lengths of DNA fragments; therefore, some integration sites may not be seen in the other platform. We also noticed that most of the unmatched integration sites had relatively low read numbers covered. Among the 22 unique Illumina breakpoints, There were 13 breakpoints with only three reads being identified. We also compared all of the breakpoints identified between the two platforms; those with only one read had few overlaps. Therefore, we hypothesize that the capture efficiency of probes with low abundant reads varies more compared with high abundant reads. Since it is difficult to verify the low read counts by Sanger sequencing or other methods, we are not sure if the 280 integration sites we found with only one read are real integration sites or not. Since each data analysis method has its limitations based on the mechanism of the method, our conclusion was drawn based on the methods we chose for this study. There might be false positive or false negative integration sites existing in the list. However, due to the technical difficulty of measuring the exact sensitivity and specificity of the method in detecting virus integration sites, the overlapped integration sites in two different data sets and the verified integration sites by Sanger sequencing provided more   (Liu et al.), Nanopore reads number cut-off value is 2, and Illumina reads number cut-off value is 3. There is no cut-off value for integration sites which overlapped by different sequencing platforms. Lane of GeneDetail_refGene listed the distance of integration sites with the nearby genes. S:Sanger Verified.
Frontiers in Genetics | www.frontiersin.org valuable information for our conclusion. The other integration sites we identified and summarized in the list provided more reference values. Our data indicated that most of the integration sites exist in the intergenic region of the human genome, concordant with previous studies (Bodelon et al., 2016;Pinatti et al., 2017;Brant et al., 2018;Koneva et al., 2018;Li et al., 2019). Since human genome was identified with large portion (∼60%) of intergenic region, our results identified about 40% integration sites located in intergenic region of human genome, which showed no significant difference. Therefore, it could be concluded that the way that integration sites distributed in human genome was affected by the nature of genome functional structure. We hypothesized that another underlying mechanism might be that the integration of HPV genome to human non-exon region of genes wouldn't cause significant phenomenon of losing the function of the genes, which therefore could keep the host cell for living, instead of dying instantly. We also found that integration happened in non-coding RNA (ncRNA), which plays many important functions, for example, long ncRNA interacts with p53 protein (Liu Y. et al., 2019). Insertion of ncRNA might cause more serious function interruption; therefore, the incidence is not very high. Some genes have been found to be hot spot genes, in which HPV has the tendency to insert, for example tumor protein 63 (TP63), which plays very important roles in carcinogenesis (Ojesina et al., 2014). Akagi et al. (2014) identified integration happened in TP63. In this study, we identified hotspots in the intergenic region of two genes CHMP4B and RALY-AS1. The integration cluster tendency was very strong, with 10 integration sites located in 76 unique integration sites identified, as well as in several other chromosome regions. Why does the virus integrate into several specific regions and is it random or specific? Further studies are needed to answer these questions.
The significantly increased gene expression of HMGA2 and TP63 has been reported in neoplastic samples where HPV is integrated into their introns, whereas LRP1B is under expression with HPV integration in their flanking region (Michael et al., 2014;Hu et al., 2015). Therefore, a combination of HMGA2, LRP1B, and TP63 as potential biomarkers may be useful for screening during triage of HPV-positive patients, particularly for detecting CIN2+ (Jiang et al., 2019).
In recent integration sites analysis, most of reported genes affected by viral genome integration are related to cellular repair pathways, tumor suppression, growth, and cell proliferation, and in some cases, code for transcription factors (Oyervides-Muñoz et al., 2018). Several studies aiming to discover viral integration sites in genome of host cells have demonstrated frequent in the MYC, TMEM49, and FANCC genes (Zhang et al., 2016). In another report, POU5F1B, FHIT, KLF12, KLF5, LRP1B, and LEPRL1 were found to be recurrent sites for integration (Hu et al., 2015). Interestingly, FHIT gene has been associated translocation in cancer while the LEPREL1 gene has also been associated with breast cancer development. In our study, we found viral integration in the PDGFD, ELP4, and ATRN genes, with roles in the regulation of cell proliferation, transcription, and DNA-templated and inflammatory response, which were not reported by other neoplastic studies.
We believe that nanopore has obvious advantages compared with Illumina sequencing, as reported previously (Xiao et al., 2018;Liu Q. et al., 2019). In this study, we generated about 150 M data by nanopore and obtained better results than Illumina sequencer, which generated 1.6 G data and was ten times greater than nanopore. Besides the sequencing data amount, the instant data analysis capability would make the application more efficient and prompter, which is important for clinical usage. These will bring huge potential for nanopore application in multiple areas of clinical diagnosis, especially pathogen detection. We think our method developed for HPV and its integration site detection will become an important tool in the research or clinical related application field in the future.
There was only few newly published paper using nanopore platform to identify HPV integration sites (Quan et al., 2019;Van Arsdale et al., 2019). The first study revealed that using nanopore sequencing could simultaneously detect HPV infection and microbiota composition promptly and accurately. The other showed that long-range DNA sequencing utilizing an Oxford Nanopore MinION flowcell yielded 3.56 × haploid genome coverage including three reads encompassing the HPV70 DNA insertion in BCL11B. The latter also provides evidence of feasibility in using nanopore in clinical applications of virus integration detection, which confirmed our conclusion in another way.

CONCLUSION
By using nanopore sequencing technology, we successfully enriched the virus DNA sequence and virus integrated human genome sequence with novel discoveries compared with previous published. These data strongly suggest that nanopore can be used for both prompt virus detection and infection status discovery with comparable accuracy of Illumina but with far more less required sequencing data, and without the need for error correction.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the Genbank database repository, PRJNA591355, https://www.ncbi.nlm.nih. gov/sra/PRJNA591355.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Institutional Review Board of the Peking University School of Oncology, China. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
GT and YL acquired the funding for this study. BM and GT conceptualized the study. YL and JjL obtained the samples, patient consent letters, and ethical approval letter for this study. WY, JL, and WW conducted the experiments and collected the data. BM, RD, and JdL analyzed most of the results. BM and WY wrote the initial draft of the manuscript. BM, YL, JY, WW, and GT revised the manuscript. All authors discussed the results and reviewed the manuscript.