Detection and quantification of viral RNA in human tumors using open source pipeline: viGEN

Introduction An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome (RNA-seq) data from tumor samples. We present an open source bioinformatics pipeline viGEN that combines existing well-known and novel RNA-seq tools for not only detection and quantification of viral RNA, but also variants in the viral transcripts. Methods The pipeline includes 4 major modules: The first module allows to align and filter out human RNA sequences; second module maps and count (remaining un-aligned) reads against reference genomes of all known and sequenced human viruses; the third module quantifies read counts at the individual viral genes level thus allowing for downstream differential expression analysis of viral genes between experimental and controls groups. The fourth module calls variants in these viruses. To the best of our knowledge, there are no publicly available pipelines or packages that would provide this type of complete analysis in one open source package.

them are unaware that they harbor the virus [1]. Infection can also occur in high risk groups like injection drug users and health care workers [1,3].
Apart from HBV and HCV infections, other risk factors for liver cancer include heavy alcohol consumption, obesity, diabetes, tobacco smoking, certain rare genetic conditions, or cirrhosis (scarring of the liver) [1]. Liver disease triggered by obesity and diabetes is called nonalcoholic fatty liver disease (NAFLD) [2]. Even though these risk factors are known, it's not clearly understood how these normal liver cells become cancerous [4].

Existing methods of screening and treatments
For patients who are at high risk of liver cancer, screening using ultrasound, and also a blood test for alpha-fetoprotein (AFP, protein made by the liver and yolk sac of the developing fetus and normally found in fetal blood) is done every 6-12 months [5]. Currently, detection of Hepatitis B is done using serology tests and involves measurement of HBV specific antigen and antibody Detection of Hepatitis C virus is typically done using Enzyme immunoassay (EIA) to detect hepatitis C antibody; or Hepatitis C RNA assays are used to determine the viral load. Genetic testing is then done to check for the type of Hep C infection, which can be of six types (genotypes 1 through 6). In the US, baby boomers (born 1945-1965) are encouraged to get tested, as they may be treated with antiviral drugs to prevent progression to cancer [8].
Vaccine is available to protect against HBV infection, but not for HCV. Even though vaccines exist, it can only protect against infections if they are administered before the person is exposed to the cancer-promoting virus [9]. Recent advances in screening have helped in early detection of cirrhosis. In recent times, quantitative assays for HBsAg and HBeAg are also being used in identifying patients likely to respond to anti-HBV treatments, although more work is needed to standardize these new assays [10]. When early detection is successful, treatments include surgery to remove part of the liver (partial hepatectomy) or liver transplant [1]. Other treatments include cryosurgery and radiation therapy for cases where cancer has not metastasized. For patients where cancer has metastasized, targeted therapy, chemotherapy or clinical trials are tried. So early detection of liver disease is crucial, but has been challenging since the symptoms may not appear until the cancer has advanced [4].

Viral mechanisms of action
HBV is an enveloped partially double-strand DNA virus in the hepadnavirus family, and is a known oncogenic virus. In a HBV infection, the virus enters the bloodstream and infects the liver cells by active viral replication. During this time, the HBV genome integrates into the host chromosome and becomes the basis for chronic infection. HBV DNA integration may offer a selective growth advantage on infected cells and promote tumor progression. The integration sites of HBV DNA usually occur in genes involving growth control or cell signaling. When HBV DNA integrates into the host, chromosomal instability is also increased (e.g., large deletions, amplification, and translocations). Integrated HBV DNA sequences are found in about 80% of human Hep B induced liver cancer [11].
HCV is an enveloped single-strand RNA virus in the flavivirus family. Unlike HBV, HCV does not have a reverse transcriptase, so is not able to integrate into the genome of hepatocytes. HCV causes liver cancer by an indirect pathway when it produces chronic inflammation, cell death, proliferation and cirrhosis in the liver. Acute HCV infection patients that have large CD4+ and CD8+ T cell response in their blood are known to have a better chance of recovery. In contrast, patients who lack T cell response seems to indicate that the patient will develop chronic HCV disease [11].

Opportunities with Next generation sequencing
The popularity of next-generation sequencing (NGS) technology has exploded in the last decade.
NGS technologies are able to perform rapid sequencing, and in a massively parallel fashion [12].
In recent years, applications of NGS technologies in clinical diagnostics have been on the rise [13,14]. Amongst the various NGS technologies, whole-transcriptome sequencing, also called RNAseq has been very popular with methods and tools being actively developed. Exploring the genome using RNA-seq gives a different insight than looking at the DNA since the RNA-seq would have captured actively transcribed regions. Every aspect of data output from this technology is now being used for research, including detection of viruses and bacteria [15][16][17].
They are also independent of prior sequence information, and require less starting material compared to conventional cloning based methods, making it a powerful and exciting new technology in virology [12].
Our pipeline viGEN combines existing and novel RNA-seq tools to not only detect and quantify read counts at the individual viral genes level, but also detect viral variants from human RNA-seq data. The input file to our pipeline is a fastq [18] file, so our viGEN pipeline can be extended to work with genomic data from any NGS technology. Our pipeline can also be used to detect and explore other types of microbes as long as the sequence information is available in NCBI [19].
There are a number of existing pipelines that detect viruses from human transcriptome data.
Of these, very few pipelines offer quantification at the gene/CDS level. A comprehensive comparison of these pipelines is provided in Table 1 [26], allowing users to perform analysis on either their local computer or the cloud.

METHODS
In this paper, we used RNA-seq data from a publicly available human liver cancer data from the TCGA collection [27]. The raw genomic data was downloaded after obtaining special access from NCBI dbGAP (http://www.ncbi.nlm.nih.gov/gap). Existing well-known and novel RNA-seq tools were used to detect and quantify viral RNA at the genome and gene/CDS level. Once the viral genomes were detected, it allowed for downstream differential expression analysis of viral genes between experimental and controls groups. The viral variants detected in our pipeline can also give more insight into the mutations in these viruses and their impact on the tumor. The viGEN pipeline includes 4 major modules:

Module 1: Viral genome level analysis (filtered human sample input)
In Module 1 (labelled as 'filtered human sample input'), the human RNA sequences were aligned to the human-reference genome using RSEM [29] tool using the Globus Genomics platform [30].
The RSEM tool filters out all the sequences the aligned to the human genome, and outputs all sequences that did not align to the human genome (hence the name 'filtered human sample input'). These un-aligned sequences were taken and aligned to the 'viral reference file' using popular alignment tools BWA [31] and Bowtie2 [32]. Figure 1 shows an image of our viGEN pipeline.

Module 2: Viral genome level analysis (unfiltered human sample input)
In Module 2 (labelled as 'unfiltered human sample input'), the human RNA seq sequences were directly aligned to the 'viral reference' using BWA and Bowtie2 using the Globus genomics platform without any filtering.
The reason for using two methods to obtain the viral genomes in human RNA-seq data (Module 1 and Module 2) was to allow us to be as comprehensive as possible in viral detection.
The aligned reads from Module 1 and 2 were in the form of BAM files [33], from which read counts were obtained for each viral genome species (referred to as 'genome level counts') using Samtools idxstats [34] and Picard BAMIndexStats [35] tools. Only those virus species that had average genome count more than a minimum threshold (set to 100 reads) across samples in each sub-group (Hep B, Hep C, HepB+HepC) were selected for the next step of the pipeline.
Once the viral genomes were detected, it allowed us to examine them through a genome browser. We also checked the genome level counts to see if the Hepatitis B and C viruses were detected from this output, and compared it with the information from viral serology tests.

Module 3: Viral gene/CDS level analysis
The BAM files from Module 1 and 2 (from Bowtie2 and BWA) were input into the Module 3 (referred to as 'viral gene/CDS level analysis'), which calculated quantitate read counts at the individual viral genes level. We found existing RNA-seq quantification tools to be not sensitive enough for viruses, and hence developed our own algorithm for this module. Our in-house algorithm used region-based information from the general-feature-format (GFF) files [36] of each viral genome, and the reads from the BAM file. It created a summary file, which had a total count of reads within or on the boundary of each region in the GFF file. This is repeated for each sample and for each viral GFF file. At the end, a matrix is obtained where the features (rows) are regions from the GFF file, and the columns are samples.
The read count output from Module 3 (viral gene/CDS module) allowed for downstream differential expression analysis of viral genes between experimental and controls groups. In this pilot study, we examined the differences between "Dead" and "Alive" samples at the viraltranscript level for each sub-group using Bioconductor tool EdgeR [37] in R (http://www.Rproject.org). Cox proportional hazards (Cox PH) regression model [38] was also applied to look at overall survival time and event in the Hepatitis B sub-group.

Module 4: Viral variant calling module
The BAM files from Module 1 and 2 (from Bowtie2) were also input to Module 4 to detect mutations in these viruses (referred to as 'viral-variant calling module'). The BAM files were first sorted coordinate-wise using Samtools [34]; PCR duplicates were removed using tool Picard [35], then the chromosomes in the BAM file were ordered in the same way as the reference file using  [41], was also used.
The current version of our pipeline uses Varscan2. Low quality and low depth variants were flagged, but not filtered out, in case these low values were attributed to low viral load. Once the variants were obtained, they were merged to form a multi-sample VCF file. Only variants that had a variant in at-least one sample were retained. PLINK [42] was used to perform case-control association test (Fishers Exact Test) to compare 'alive' and 'dead' samples in the HepB and HepC groups.

Tutorial in Github
Since access to TCGA raw data is controlled access, we could not use this dataset to create a publicly available tutorial. So we looked for publicly available RNA-seq dataset to demonstrate our pipeline with an end-to-end workflow. We chose one sample (SRR1946637) from publicly https://github.com/ICBI/viGEN/. This tutorial has also been provided as Additional File 1.

Detection of Hepatitis B and C viruses at the genome level
We used our viGEN pipeline to get genome-level read counts obtained from viruses detected in the RNA of human liver tissue. We then checked to see if Hepatitis B and C viruses were detected from this output, and how it compared with the information from viral serology tests. We

Landscape of viruses at the genome level
We explored the genome level BAM files (from Module 1) through the IGV genome browser to see if we could find any interesting patterns. We looked at the landscape of HBV genome, HCV genomes and HERV K113 genome across all the samples.
In the landscape of HBV, we found the samples could be grouped under one of three patterns  patients who are 'Alive' seem to have more pileups that those who 'Died' although this conclusion cannot be strongly made since there are only a small number of samples in each pattern. In the landscape of HCV (NC_004102.1), we found two types of patterns ( The figures in Additional File 3 shows these above mentioned pattern types. We also matched these patterns to Dead/Alive status and summarized our findings in the tables in Additional File 3.
The range patterns in these viral landscapes indicate that information can be extracted in a meaningful way from the read information, and it adds to the validity of our approach.

Comparing 'dead' and 'alive' samples in the HepB subgroup using viral gene/CDS data
To get a more detailed overview of the viral landscape, we examined the human RNA-seq data to detect and quantify viral gene/CDS regions. We then examined the differences between 'Dead' and 'Alive' samples at the viral-transcript level on the Hepatitis B sub-group.
Out of 25 patients, 16 were alive (baseline group), and 9 dead (comparison group) as per the clinical data. The significant results are shown in Table 4.  (c) The overall model is significant with p-value < 0.05 from the Log rank test (also called Score test). using variant caller Varscan2, were the same. We collated the significant common results (p value <= 0.05) in Table 6. Among these results, we saw several missense and frameshift variants in Gene X of the Hepatitis B genome (nucleotide 1479), Gene P (2573, 2651, 2813), and a region that overlaps Gene P and PreS1 (nucleotides 2990, 2997, 3105, 3156). All these variants were found mutated more in the cases than controls. Other significant common results included

Detection of Hepatitis B and C viruses at the genome level
We used our viGEN pipeline to get genome-level read counts obtained from viruses detected in the RNA of human liver tissue. In our results, HBV was detected in 20% of the samples. This is in concordance with earlier analyses of TCGA liver cancer cohort study [16,47], which detected the HBV virus in 23% and 32% (with typically low counts range) of cases respectively.
It has also been reported that the viral gene X (HBx) was the most predominately expressed viral gene in liver cancer samples [47] which is in concordance with our findings where the peak number of reads were observed for gene X region of the HBV genome (see Figure 2).
We also compared the HBV and HCV detection from our data with the viral serology tests (Table 2A and 2B). We see some, but not a lot of concordance. There could be several reasons for the differences we see between RNA from tissue and serology: (a) We should remember that we are looking at viruses detected in RNA of human cancer tissue, and it is well known that this landscape is different from blood (which is used for the serology tests) or normal liver. According to [48], lab tests prove that HBV DNA replication and HBsAg are generally detected in different hepatocytes, while HBV DNA replication is generally, but not consistently seen in hepatocytes with HBcAg.
(b) If the viral DNA is integrated into the host (seen in acute infection stages and often precedes tumor development), in spite of having antigen/antibody markers in blood (causing serology test to be positive), it will not produce any RNA particles, causing low viral load in RNA [48,49].
(c) The tumor site acts like a viral reservoir, which allows the virus to accumulate and be stable, and allows for replication of virus. This makes the virus hard to detect through serology, and might be detectable when examining the tumor site [50][51][52].
(d) A patient could be in a 'HBV carrier state', which is characterized by the presence of HBsAg in the serum, low or undetectable levels of HBV DNA, normal aminotransferase activity and lack of HBeAg [53]. That means that in this stage, low levels of HBV DNA could cause low viral load in RNA even through the serology test is positive. In this stage, use of immunosuppressive therapies can lead to reactivation of infection [53].
(e) Serology tests are known to be un-reliable when the immune system becomes dysfunctional and may also explain the false positives seen in the results [54,55].
These results show that even though the genome-level viral counts detected through human RNA-seq are not a 100% match to viral serology data, it gives a good overview of the viral landscape in the tumor sample, and demonstrates the complex dynamics of infection and immune response. These results also indicate that a deeper look inside these viruses is warranted.

Comparing 'dead' and 'alive' samples in the HepB subgroup using viral gene/CDS data
To get a more detailed overview of the viral landscape, we examined the human RNA-seq data to detect and quantify viral gene/CDS regions. We then examined the differences between 'Dead' and 'Alive' samples at the viral-transcript level on the Hepatitis B sub-group (Table 4).
From the differential expression analyses, the two most informative results were (1) a region of the Hepatitis B genome that produced the HBeAg protein was overexpressed in the 'dead' patients and (2)  showing that antigens HBeAg and HBcAg were overexpressed in 'dead' patients compared to 'alive patients' makes sense, indicating that these patients never recovered from acute infection.
The results also indicate a higher level of HBsAg in the 'alive' patients compared to the 'dead' patients. The highest levels of HBsAg in the virus are known to occur in the 'immunotolerant phase'. This pattern is seen in patients who are inactive carriers of the virus i.e. they have the wild type DNA, and the virus has been in the host for so long, that the host does not see the virus as a foreign protein in the body, and hence there's no immune reaction against the virus. In this phase, there is known to be minimal liver inflammation and low risk of disease progression [58][59][60]. This could explain why we saw higher level of HBsAg in the 'alive' patients compared to the 'dead' patients.
Also among the significant results were three regions from the Human endogenous retrovirus K113 (HERV K113) genome (with negative log fold change) that were overexpressed in the 'alive' patients. Two of these regions were Sequence-tagged sites (STS) and the third region was in the gag-pro-pol region that has frameshifts. HERV could protect the host from invasion from related viral agents through either retroviral receptor blockade or immune response to the undesirable agent [61].
Overall, we found that our results from viral-gene/CDS level make biological sense, with much of the results validated through published literature.

Comparing 'dead' and 'alive' samples in the HepB subgroup using viral-variant data
We performed variant calling on the viral data to see if it can add valuable information to the tumor landscape in humans. We then compared the 'dead' and 'alive' samples at the viral-variant level on the 25 patients in the Hepatitis B sub-group.
Among the significant results (Table 6) included variants in Gene C (nucleotide 1979, 2396) and variants in PreS2 region (nucleotide positions 115, 126 and 148). The Gene C region creates the pre-capsid protein, which plays a role in regulating genome replication [62]. The mutation in the 2396 position lies in a known CpG island (ranging from 2215-2490), whose methylation level is significantly correlated with hepatocarcinogenesis [63]. Mutations in PreS2 are associated with persistent HBV infection, and emerge in chronic infections. The PreS1 and PreS2 regions are known to play an essential role in the interaction with immune responses because they contain several epitopes for T or B cells [64].
Mutations in the 1762/1764 positions of the X gene are known to be associated with greater risk of HCC [64] [65], and is independent of serum HBV DNA level [65]. This mutation combination is also known to be associated with hepatitis B related acute-on-chronic liver failure [66]. It is predicted that mutations associated with HCC variants are likely generated during HBVinduced pathogenesis. The A1762T/G1764A combined mutations was shown to be a valuable biomarker in the predicting the risk of HCC [64] [65]; and are often detected about 10 years before the diagnosis of HCC [64].
Among the significant common results to both, were a few variants of the Human endogenous retrovirus K113 complete genome (HERV K113). These variants map to frameshift and missense mutations in the putative envelope protein of this virus (Q779_gp1, also called 'env'). Studies have shown that this envelope protein mediates infections of cells [67]. HERV K113 is a provirus and is capable of producing intact viral particles [68]. Studies have shown a strong association between HERV-K antibodies and clinical manifestation of disease and therapeutic response [69] [70]. It is hypothesized that retroviral gene products can be 'reawakened' when genetic damage occurs through mutations, frameshifts and chromosome breaks. Even though the direct oncogenic effects of HERVs in cancer are yet to be completely understood, it has shown potential as diagnostic or prognostic biomarkers and for immunotherapeutic purposes including vaccines [70].

CONCLUSION
With the decreasing costs of NGS analysis, our results show that it is possible to detect viral sequences from whole-transcriptome (RNA-seq) data in humans. Our analysis shows that it is not easy to detect DNA and RNA viruses from tumor tissue, but certainly possible. We were able to not only quantify them at a viral-gene/CDS level, but also extract variants.

DECLARATIONS
Availability of data and material -The TCGA liver cancer dataset was used in the analysis and writing of this manuscript. The data can be obtained from https://cancergenome.nih.gov/ -Since access to TCGA raw data is controlled access, we could not use this dataset to create a publicly available tutorial. So we looked for publicly available RNA-seq dataset to demonstrate our pipeline with an end-to-end workflow. We chose one sample (SRR1946637) from publicly available liver cancer RNA-seq dataset from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). This dataset is also available through EBI SRA (http://www.ebi.ac.uk/ena/data/view/SRR1946637). The dataset consists of 50 Liver cancer patients in China, and 5 adjacent normal liver tissues. We downloaded the raw reads for one sample, and applied our viGEN pipeline to it. A step-by-step workflow that includesdescription of tools, code, intermediate and final analysis results are provided in Github: https://github.com/ICBI/viGEN/.

Project details:
Project name: viGEN Project home page: https://github.com/ICBI/viGEN/ Operating system(s): The R code is platform independent. The shell scripts can run on Unix, Linux, or iOS environment Programming language: R, bash/shell