Viral integration detection strategies and a technical update on Virus-Clip

Oncovirus infection plays a crucial role in human malignancies. Certain oncoviruses can cause structural variations in the human genome known as viral genomic integration, which can contribute to tumorigenesis. Existing viral integration detection tools differ in their underlying algorithms, each pinpointing different aspects or features of the viral integration phenomenon. We discuss the major procedures in performing viral integration detection. More importantly, we provide a technical update on Virus-Clip to facilitate its usage with either the hg19 or hg38 human genome build and the adoption of multi-thread mode for faster initial read alignment. By comparing the execution of Virus-Clip using single-thread and multi-thread modes of read alignment on targeted-panel sequencing data of HBV-associated hepatocellular carcinoma patients, we demonstrate that multi-thread mode markedly reduces execution time, with negligible difference in memory usage. Taken together, with the current update, Virus-Clip will continue to support the in silico detection of oncoviral integration for a better understanding of various human malignancies.


Introduction
Oncovirus infection is a major risk factor for human cancers (Muller-Coan et al., 2018). Some oncoviruses can integrate into the human genome, leading to a structural variation known as viral genomic integration. Infections with common oncoviruses, e.g., hepatitis B virus (HBV), human papillomavirus (HPV) and Epstein-Barr virus (EBV), are known to cause viral integration events, and they are involved in the carcinogenesis of liver cancer, cervical cancer and nasopharyngeal carcinoma, respectively. Viral integration may result in genome instability, disruption of human genes, aberrant human gene expression, and/or expression of chimeric oncogenic proteins, all of which can contribute to tumorigenesis.
Hepatocellular carcinoma (HCC), a major form of primary liver cancer, serves as an illustration. It is a prevalent cancer and one of the leading causes of cancer death worldwide (El-Serag, 2011; Villanueva and Llovet, 2014). It has a poor prognosis and only a few effective treatment options are available. Despite years of effort in studying the molecular mechanism of HCC carcinogenesis, current understanding of this lethal disease is still limited, with high recurrence and metastasis being the major hurdles to a cure. Among the identified etiological risk factors for HCC (Ho et al., 2016), which include chronic viral infections (HBV and hepatitis C virus), chronic alcohol consumption, non-alcoholic fatty liver disease and non-alcoholic steatohepatitis, chronic HBV infection accounts for around 50% of cases (Llovet et al., 2021). One of the distinctive features of the HBV genome is that it can integrate into the human genome, which in turn disrupts endogenous tumor suppressors and other regulatory genes, or enhances the activity of proto-oncogenes. The resulting imbalance between oncogenic and tumor-suppressive signals may enhance cell survival and proliferation, reduce apoptosis, and lead to HCC development (Ho et al., 2016).
Given the prominent role of viral integration in certain oncovirus-driven human cancers, it is important to characterize the underlying oncogenic mechanisms. With the wide adoption of next-generation sequencing (NGS), a more systematic and unbiased survey of viral integration events has become possible. Over the past decade, different computational tools have emerged to detect viral integration and determine the exact breakpoint position of the human-virus chimera. Existing tools differ in their underlying algorithms, each pinpointing different aspects or features of viral integration.

Materials and Methods
Target-panel sequencing data on HBV-associated HCC patients
We executed Virus-Clip using the target-panel sequencing data (Sze et al., 2021) of two selected HBV-associated HCC cases, in which HBV integration events had been detected at the KMT2B and TERT genes, respectively.

Performance evaluation of Virus-Clip
Virus-Clip was executed using single-thread and multi-thread modes for the initial read alignment with BWA-MEM. We compared the performance of the two modes in terms of execution time, the number of CPUs used, and memory usage.
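The comparison described above can be approximated with a small timing harness. The following is a minimal sketch and not the script used in this study: the helper names `bwa_mem_command` and `run_and_measure` and all file names are hypothetical, and it assumes a Unix-like system with `bwa` on the PATH and a reference that has already been indexed. The `-t` option of `bwa mem` sets the number of threads.

```python
import resource  # Unix-only standard library module
import subprocess
import sys
import time


def bwa_mem_command(reference, fastq_files, threads=1):
    """Build a BWA-MEM command line; -t sets the number of threads."""
    return ["bwa", "mem", "-t", str(threads), reference, *fastq_files]


def run_and_measure(command, stdout=subprocess.DEVNULL):
    """Run a command and return (elapsed_seconds, peak_child_rss).

    ru_maxrss is the largest resident set size among child processes waited
    for so far (kilobytes on Linux, bytes on macOS). It never decreases, so
    each mode should be measured in a fresh Python process.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True, stdout=stdout)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


if __name__ == "__main__":
    # Harmless demonstration with a trivial command instead of BWA-MEM.
    seconds, peak = run_and_measure([sys.executable, "-c", "pass"])
    print(f"elapsed: {seconds:.3f} s, peak child RSS: {peak}")
```

To compare the two modes, one would pass, for example, `bwa_mem_command("virus.fa", ["sample_R1.fq", "sample_R2.fq"], threads=8)` to `run_and_measure` with `stdout` set to an open SAM output file, and repeat with `threads=1` in a separate process.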

Availability of Virus-Clip
The latest version of Virus-Clip is available at https://github.com/dwhho/Virus-Clip.

Read alignment and extraction of chimeric pairs and/or soft-clipped reads
Most viral integration detection tools take FASTQ or BAM files as input. After low-quality sequencing data are filtered out, reads are mapped to the human and/or virus genome, and the tools search for potentially useful reads that indicate viral integration events. The most frequently used read aligners for viral integration detection are different variants of BWA and BLASTN. BLASTN can map with greater accuracy and at shorter read lengths (with a word size of 11 by default) (McGinnis and Madden, 2004), but it is relatively time consuming compared with other aligners. BWA, on the other hand, can align longer and paired-end reads at much faster speed. To shorten the time for read alignment, some algorithms, e.g., BATVI and SurVirus, apply an initial k-mer candidate filtering strategy to narrow down the possible set of input reads.
In terms of alignment strategy, there are three main ways to extract chimeric pairs and/or soft-clipped reads (Chen et al., 2019b). Strategy 1 'Human-Virus': Raw reads are first mapped to the human genome, and the partially mapped or unmapped reads are then aligned to the virus genome. Vy-PER, HGT-ID and VIcaller use this strategy. Strategy 2 'Virus-Human': This is similar to strategy 1 but is performed in the reverse order. Virus-Clip, ViralFusionSeq and BATVI employ this strategy. Strategy 3 'Human+Virus': Raw reads are aligned to a hybrid genome concatenating the human and virus genomes. Tools using this strategy include VirusSeq, ViFi and VirTect. The remaining tools adopt a combinatorial approach: VirusFinder and VERSE combine strategies 1 and 3, while SurVirus and SummonChimera integrate strategies 2 and 3. Because of the huge difference in size between the human and virus genomes, the choice of the initial reference genome (to which reads are first mapped) makes a significant difference to the execution time. Tools that initially align reads to the virus reference genome, e.g., Virus-Clip, can substantially speed up the alignment and minimize the required computational resources.
The genetic variability of the virus genome and virus-induced host genome instability are rate-limiting factors for detecting viral integration. To improve mapping ability, VERSE uses short reads to iteratively modify the reference genomes with SNPs and indels, so as to build a customized reference genome. ViFi applies phylogenetic methods, using a collection of profile Hidden Markov Models (HMMs) to derive evolutionary relationships between known viral strains and novel or mutated viral strains, in order to identify viral reads from any region of the virus family of interest.
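Whichever strategy is used, the partially mapped reads are typically recognized by soft clipping (the `S` operation) in the CIGAR string of a SAM/BAM alignment. The sketch below shows one way the clipped portion of a read could be pulled out for re-alignment to the second genome. It is an illustration only, not the code of any tool named above, and the function name `soft_clipped_portions` is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_portions(cigar, sequence):
    """Return (side, clipped_sequence) for each soft-clipped end of a read.

    Hard-clipped bases (H) are not stored in the read sequence, so they are
    dropped before the ends of the CIGAR string are inspected.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    portions = []
    if ops and ops[0][1] == "S":
        portions.append(("left", sequence[: ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S":
        portions.append(("right", sequence[-ops[-1][0]:]))
    return portions
```

Under a 'Virus-Human' strategy, for example, the read would first be aligned to the virus reference, and the portions returned here would be the candidates for alignment to the human genome.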

Quality control of aligned and extracted reads
To eliminate potential false positives and ambiguous viral integration events, quality control of the supporting reads is crucial. Tools have procedures to remove low-quality reads before detecting integration breakpoints. More recent tools, e.g., SurVirus, VIcaller, and ViFi, pay much more attention to quality control than earlier tools such as VirusSeq, ViralFusionSeq, and VirusFinder. Each read alignment in BAM/SAM format carries a mapping quality (MAPQ) score indicating the confidence with which the read is aligned to the reference genome. To improve the accuracy of viral integration detection, MAPQ scores are used to filter out low-quality alignments. ViFi removes reads with a MAPQ score <10, while HGT-ID and VIcaller discard reads with a MAPQ score <20. In addition, PCR duplicates can also cause false positives. Detection tools such as BATVI, SurVirus and VIcaller remove redundant reads before subsequent analysis.
The human genome contains over 50% repetitive sequences (de Koning et al., 2011; Hannan, 2018; Lander et al., 2001), e.g., tandem repeats, satellite DNA and transposable elements, and there may be high sequence similarity between human and virus repetitive sequences. Detecting viral integration in repetitive regions can therefore be challenging, since aligners often fail to map reads correctly there, and more false positives tend to be identified in those regions. Vy-PER, SurVirus, ViFi, BATVI and VIcaller have dedicated strategies to reduce such artefacts. Soft-clipped reads that contain chimeric human-virus sequences (one end can be mapped to the reference genome while the other cannot) are critical for indicating viral integration and suggesting the exact breakpoint positions. However, some soft-clipped sequence portions are too short to be unambiguously aligned. Hence, ViFi, VirTect, and VIcaller only extract soft-clipped sequences longer than a certain threshold, while Virus-Clip only retains events whose soft-clipped sequence portions can be specifically realigned, so as to reduce false positives. Soft-clipped sequence portions that map to neither the human nor the virus genome are not always invalid. They may remain unmapped because the sequence is too short, or because of a short random sequence insertion. With this in mind, BATVI has a dedicated strategy to rescue putative soft-clipped reads whose soft-clipped portions are either too short to be realigned or contain a short random sequence inserted at the viral integration site. Alternatively, ViFi attempts to rescue reads that are viral but might be unmapped owing to evolutionary divergence from the virus reference genome, by using an ensemble of HMMs.
Taken together, the tools differ in the viral integration features they consider, which results in variable performance (efficiency and accuracy) in viral integration detection.
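The two filters described above, a MAPQ cutoff and duplicate removal, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure of any specific tool: the `Alignment` record and `filter_alignments` function are hypothetical, and duplicates are approximated as alignments sharing the same reference, position and sequence, whereas dedicated duplicate-marking programs use more elaborate criteria.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """A minimal, hypothetical stand-in for one SAM/BAM alignment record."""
    name: str
    reference: str
    position: int
    mapq: int
    sequence: str


def filter_alignments(alignments, min_mapq=20):
    """Drop alignments below min_mapq and keep the first of each duplicate set."""
    seen = set()
    kept = []
    for aln in alignments:
        if aln.mapq < min_mapq:
            continue  # low-confidence placement on the reference
        key = (aln.reference, aln.position, aln.sequence)
        if key in seen:
            continue  # treated as a likely PCR duplicate (simplification)
        seen.add(key)
        kept.append(aln)
    return kept
```

The default `min_mapq=20` mirrors the cutoff reported above for HGT-ID and VIcaller; passing `min_mapq=10` would correspond to the ViFi criterion.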

Integration of candidate discovery and determination of integration breakpoints
Once high-quality read pairs or chimeric reads have been identified, they are used to derive the exact integration breakpoint positions. Tools either cluster reads to determine breakpoints or rely on additional structural variation detection programs. Most of the existing tools follow the first approach. With regard to tumor heterogeneity, some viral integration sites are shared among samples, and these are believed to carry a higher degree of accuracy than singleton ones. Nevertheless, directly pooling all sequencing data from different samples can be computationally intensive and time consuming. In particular, soft-clipped reads are pivotal for locating the exact breakpoint positions of viral integration events, and this strategy is used in tools such as Virus-Clip. Alternatively, if soft-clipped reads are absent, tools such as BATVI assemble sequences to determine the breakpoints. Given that there could be mismatches and gaps around the breakpoints, tools such as VirTect apply local HMM realignment to improve accuracy.
In summary, the procedures for identifying viral integration breakpoints are similar among the tools. They mainly differ in the thresholds or refining strategies used to improve the precision of candidate breakpoints.
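As a rough illustration of the read-clustering approach, the sketch below derives a candidate breakpoint coordinate from each soft-clipped alignment and keeps the coordinates supported by a minimum number of reads. It is a simplified sketch, not the algorithm of any tool discussed here: the function names are hypothetical, positions are assumed to be 1-based leftmost aligned coordinates as in the SAM format, and the default support threshold of 3 is an illustrative assumption.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def breakpoints_from_read(reference, pos, cigar):
    """Return the (reference, coordinate) pairs implied by soft-clipped ends.

    A left clip places the breakpoint at the first aligned base (pos);
    a right clip places it at the last aligned base.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    found = []
    if ops and ops[0][1] == "S":
        found.append((reference, pos))
    if len(ops) > 1 and ops[-1][1] == "S":
        found.append((reference, pos + ref_len - 1))
    return found


def call_breakpoints(reads, min_support=3):
    """Count supporting reads per breakpoint and keep the well-supported ones.

    reads is an iterable of (reference, pos, cigar) tuples.
    """
    counts = Counter()
    for reference, pos, cigar in reads:
        counts.update(breakpoints_from_read(reference, pos, cigar))
    return {bp: n for bp, n in counts.items() if n >= min_support}
```

Exact-coordinate counting is the simplest possible clustering; a practical implementation would likely tolerate small offsets around each breakpoint to absorb mismatches and gaps.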

Technical Update of Virus-Clip
Virus-Clip was developed as a fast and memory-efficient computational tool for detecting viral integration and determining the breakpoint position at single-base resolution. It takes raw reads in FASTQ format as input and can handle both single- and paired-end reads from ordinary NGS sequencers. Unlike most of the other tools that were developed at that time, Virus-Clip adopted the 'Virus-Human' mapping strategy, i.e., initially performing read alignment to the virus reference genome. This simple yet important optimization greatly enhances the efficiency of the entire viral integration detection process. Besides, the installation of Virus-Clip is relatively easy, and we have provided the necessary setup instructions. This is important because some tools have been reported to fail installation owing to complex compilation and/or execution errors (Chen et al., 2019b). Furthermore, BLASTN is employed to map the candidate soft-clipped sequence portions that are putatively of human origin. With a default minimal input length of 11 bp, it can effectively reduce false positives by discarding short candidate soft-clipped sequence portions (which have low discriminative power or a high chance of random match owing to their short length). Another useful feature of Virus-Clip is the integrated annotation function, which can determine the affected human genes without the need for additional annotation by another tool.
The previous version of Virus-Clip was developed solely for the genome build hg19 and was assumed to be run in single-thread mode. Therefore, in the current updated version (Fig. 2), we have revised Virus-Clip to allow either genome build hg19 or hg38 to be used as the human reference. Although Virus-Clip running in single-thread mode can already achieve good efficiency, given the increasingly large size of NGS data, we have provided instructions for using multi-thread mode in the initial read alignment. Indeed, when Virus-Clip was tested on the target-panel sequencing data of two HBV-associated HCC patients, the empirical comparison of single-thread and multi-thread modes demonstrated a significant improvement in execution time with multi-thread mode, while there was negligible difference in memory requirement (Tab. 2). These results justify the adoption of multi-thread mode in the initial read alignment of Virus-Clip when computational resources are available. Regarding the precision of the viral integration events identified by Virus-Clip, our previous reports on whole-transcriptome (Ho et al., 2015) and targeted sequencing (Sze et al., 2021) data show empirically that it can achieve a good success rate of experimental confirmation using a threshold of at least 3 supporting reads.
Taken together, with the current update of Virus-Clip, we hope to continue delivering a simple and useful bioinformatics tool that has all-round performance in terms of simplicity of installation, execution efficiency, low requirement of computational resources, and good experimentally validated accuracy of detection. We believe Virus-Clip will continue to facilitate the detection of oncoviral integration and the study of the related human malignancies.