A host subtraction database for virus discovery in human cell line sequencing data

The human cell lines HepG2, HuH-7, and Jurkat are commonly used for amplification of the RNA viruses present in environmental samples. To assist with assays by RNAseq, we sequenced these cell lines and developed a subtraction database that contains sequences expected in sequence data from uninfected cells. RNAseq data from cell lines infected with Sendai virus were analyzed to test host subtraction. The process of mapping RNAseq reads to our subtraction database vastly reduced the number non-viral reads in the dataset to allow for efficient secondary analyses.


Introduction
Host subtraction is the bioinformatics process of filtering reads derived from host DNA and RNA (Daly et al., 2015). Host subtraction enriches the non-host component of sequence datasets and is especially attractive for assays involving highthroughput sequencing technologies that generate short reads in high volume, where data reduction can realize cost savings. Following host subtraction, remaining reads can be mapped to references and counted, or used as queries to sequence databases, or possibly assembled to reconstruct novel transcript or genome sequences.
Host subtraction resources are needed for human cell lines that are widely used for RNA virus propagation. This includes three human cell lines Jurkat, HuH-7, and HepG2 commonly used to grow viruses or to amplify viruses suspected to be present in clinical isolates. The Jurkat line, derived from human T cells, supports replication of HIV and some Herpesviruses. Jurkat cells have been described to harbor Xenotropic murine leukemia virus-related virus (XMRV), a now famous gammaretrovirus incorrectly proposed as a causative agent of human prostate cancer and chronic fatigue syndrome (Cingöz et al., 2012). The HuH-7 and HepG2 lines, both derived from liver cells, are widely used for virus research among multiple virus families. These cell lines are permissive to the growth of several RNA viruses. Huh7 cells support Hepatitis C virus (HCV) replication (Lindenbach et al., 2005), though not all Huh-7 lines are permissive to HCV (Sainz et al., 2009). Huh-7 cells are also permissive for Dengue virus (Lin et al., 2000). In addition, Vesicular Stomatitis Virus (VSV), of the Rhabdoviridae family, replicates efficiently in HepG2 and HuH-7 (Marozin et al., 2008). HepG2 cells are permissive to Influenza virus (Ollier et al., 2004) and have been shown to harbor cell-line specific chromosomal rearrangements (Wong et al., 2000). Notably, Jurkat, HuH-7, and HepG2 cell lines are all permissive to the Paramyxovirus Sendai virus (SeV). While SeV does not cause disease in humans, infection in mice results in pneumonia in mice (Faísca & Desmecht, 2007). More importantly, SeV is extensively used as a model to study virus-host interactions since it has a similar genomic organization to pathogenic viruses that include Ebola, Marburg, Hendra and Nipah Virus.
An ideal subtraction resource for any cell line would include a complete genome sequence. Less expensive alternatives are available for cell lines derived from humans. One subtraction alternative is a reference human genome supplemented with a collection of cell-line-specific sequences. To this end, we have developed a sequence subtraction database (SDB) that permits enrichment of viral sequences through the computational depletion of host sequences. We developed an SDB, named SDB1, representing three human tissue cell lines that are often used for the detection of RNA viruses that infect humans. Reducing the size of the NGS dataset by removing host background reads allows the researcher to perform subsequent analyses (e.g. de novo assemblies, read mapping, homology searches) more efficiently and with less computational requirements.

Cell line sequencing
Frozen HepG2 cells were obtained from ATCC, part number ATCC HB-8065, lot # 61983117. The HepG2 cell line is derived from a liver hepatocellular carcinoma of a 15-year-old Caucasian male. Frozen HuH-7 cells were obtained from JCRB Cell Bank, part number JCRB0403, lot # 08062010. HuH-7 is described as a well-differentiated human hepato-cellular carcinoma cell line derived from the liver or gallbladder of a 57-year-old male Japanese patient, who died in 1985. Frozen Jurkat cells were obtained from ATCC, part number ATCC TIB-152, lot # 6213515. This Jurkat cell line was derived from peripheral blood of a 14-year-old boy who was diagnosed with acute T cell leukemia. Cells were maintained following the standard recommended protocol per cell line.
A single genomic DNA library was prepared per cell line. Genomic DNA was isolated from the cell line using a Qiagen genomic DNA isolation kit. Bioanalyzer analysis confirmed high molecular weight DNA was recovered. After Blue Pippin size selection, fragments appeared to be 290bp. NextGen paired end barcoded genomic library construction was performed with a NEBNext whole genome library prep kit. NextGen Library quantification and normalization was performed by qPCR. Each library was test sequenced with one run on an Illumina MiSeq, and then sequenced with two runs of an Illumina NextSeq 500 using the Illumina High Output Kit. Reads were demultiplexed, which removed barcodes and sequencing adapters.
Four RNA libraries were prepared per cell line. Cells were either mock infected or infected with SeV as previously described (Dilley et al., 2017). In each condition, libraries were either treated to deplete ribosomal RNA or total RNA was subjected to library construction. Libraries were multiplexed and sequenced using one run on an Illumina NextSeq 500 using the Illumina Mid Output Kit.
SDB construction DNA and RNA sequence reads were trimmed of adapter using CutAdapt 1.8.1. RNA sequence reads were also trimmed of low quality bases, adapter sequences, and resulting short reads using Trimmomatic 0.35. The DNA read sets were not filtered based on length, quality, or number of ambiguous base calls.
Reads were mapped with bowtie2 (Langmead & Salzberg, 2012) version 2.2.5 with stringent parameters designed to identify and remove only those read pairs with full-length reference agreement. Reads were mapped to the NCBI GRCh38.7 human genome reference sequence using bowtie2 in global alignment mode (--end-to-end) with low sensitivity (--fast) and stringent output settings (--no-unal --no-mixed --no-discordant

SDB tests
The RNAseq subtraction step used a sensitive mapping strategy. Although it did not use a splice-aware aligner, it relied on sensitive local alignments, the selection of at most one best alignment per read, and the subtraction of both reads of a pair if either read mapped. The mapping used bowtie2 with parameters "--sensitive-local" and "--no-unal" to map read pairs to the subtraction database SDB1. The mapping step can be parallelized with one pair of input read files per job. Each job requires RAM approximately equal to twice the database size. Our runs against the 3.3 GB SDB1 reserved 8 GB RAM and 4 threads.
For reads characterized by BLAST, BLASTN 2.2.31+ from the NCBI BLAST+ package was used to search the NCBI nt database. BLAST was run with parameters "-outfmt7 qseqid sseqid pident length mismatch gapopen evalue bitscore staxids" to capture taxon ID and "-max_target_seqs 1" to retain the top hit per query sequence.

Construction of the Subtraction Database
To obtain cell line genomic sequence, each cell line was separately subjected to DNA sequencing. For each of three cultured human cell lines, one DNA sequencing library was created with a ~300 bp insert size. Each library was test sequenced on an Illumina MiSeq platform to generate small volumes of 2x300 bp read pairs; in these pairs, reads were expected to overlap. Each library was then sequenced on two runs of an Illumina Next-Seq platform to generate high volumes of 2x150 bp read pairs. Reads were trimmed with CutAdapt. The result offered 105 Gbp of sequence in 346 M read pairs, or 28X average coverage of the human genome per cell line. Further details and public accessions are provided in Table S1. The reads were filtered to eliminate pairs that provide sequence that is already present within a standard human genome reference sequence. This was done by mapping the NextSeq genomic read pairs with stringent parameters. This step filtered 94% of read pairs.
To further characterize the mapped reads, the reference mapping was scanned for high-coverage and low-coverage areas. High coverage would indicate sequence present in the cell line at higher copy number than their representation in the reference sequence. For example, coverage over 100X would represent at least 3X higher representation in the cell line than the reference. With this criterion for high coverage, the number of reference bases at high coverage was 8.6 Mbp per cell line. The average high-coverage interval was 106 bp, which is shorter than the 150 bp read length, suggesting that short tandem repeats may be expanded in the cell lines. The mapping was also scanned for low-coverage areas. Low-coverage regions shorter than read pairs would indicate rearrangements within the cell line genomes. The combined span of reference bases with coverage less than 2X was 96 Mbp and the average low-coverage interval was 265 bp. Further details are provided in Table S2.
To capture sequences present in any of the three cell lines, but absent from the reference, the unmapped reads were assembled into Cell Line Contigs (CLCs). The assembly generated about 3 Mbp in 8 K CLCs for each cell line. CLC size statistics are provided in Table S3. The average CLC size was 380 bp though there were a few large CLCs from each cell line. The largest CLC from each cell line had a partial alignment to the largest CLC from the other two lines. Analysis of the two largest contigs per cell line showed similarity to other human or primate sequences in the public databases and redundancy between cell lines (Table S6).
The subtraction database named SDB1 was constructed by concatenating FASTA representations of the CLCs, the human genome reference, the PhiX genome, the UniVec database, and a collection of Mycoplasma complete genomes. The database construction process is summarized in Figure 1a. Table S4 shows the number of sequences and number of bases per data source.

Subtraction and detection of Sendai virus
The subtraction database SDB1 was used to filter host sequence from RNAseq data using the process summarized in Figure 1b. The subtraction process was assessed with a test designed to emulate a cell-based RNA virus detection assay. In this assay, environmental samples are analyzed to determine which RNA viruses are present, if any. To overcome the presumed low titer of viral RNA, this assay uses virus-permissive cell lines to amplify viral load. After exposure to environmental samples, the cells in culture are grown for sufficient time for viral replication. RNA is harvested, optionally depleted of rRNA, and sequenced. The rRNA depletion step is employed to enrich the non-host RNA and increase the number of non-host RNAseq reads generated. After sequencing, a sequence analysis step involves alignment of the RNAseq data to reference databases for taxonomic classification and quantification. Classification of host cell reads is uninformative so all computational investment in classifying them represents overhead cost. The goal of the subtraction step is to reduce the overhead without loss of sensitivity. Although the taxon of the virus was known a priori, the same process could be used to detect unknown viruses in an uncharacterized sample.
This test used SDB1, our subtraction database that includes genomic DNA sequence derived from the HepG2, HuH-7, and Jurkat human cell lines. This test also used RNAseq from those cell lines. RNAseq reads were generated from multiple samples and mapped to SDB1. This test used uninfected cells and cells infected with the Sendai virus (SeV). After the growth period, cells were treated for rRNA depletion, or left untreated as a control. Twelve libraries were generated in total, representing the three cell lines under four different conditions: infected-and-depleted, infected-and-not-depleted, mock-infectedand-depleted, and mock-infected-and-not-depleted. The libraries were barcoded, multiplexed, and sequenced on a NextSeq platform to generate 2x150 bp read pairs. RNAseq read counts per library are provided in Table S1.
The sequence data were subjected to a subtraction process by mapping to SDB1. The mapping used local alignments to allow host spliced RNA sequences to map to host genomic DNA sequence. The results are shown in Table 1. The read counts after trimming ranged from 19 to 27 million per library (Table 1, column A). The rate of subtraction ranged from 89% to nearly 100% per library (Table 1, column B). The relative contribution by each type of SDB1 sequence is shown in Table S5. As expected, the human genome reference sequence subtracted the most reads. The CLCs from the human cell lines subtracted 0.19% to 1.34% of reads per library. Contrary to expectation, the cell line origin of the RNAseq was not predictive of the cell line whose CLCs would subtract the most reads. Instead, the CLCs derived from HuH-7 consistently subtracted more RNAseq reads than those from HepG2 or Jurkat. It is possible that the CLCs derived from HuH-7 capture larger portions of transcriptional units that are partially represented in the other CLCs.
Next, the RNAseq reads that did not map to SDB1 were extracted for analysis. These "non-SDB" reads were mapped Contigs from cell line gDNA are combined with host and contaminant references to form the subtraction database. During application (b), cells are exposed to an infected sample. RNA virus replication accompanies cell growth. Subsequent RNAseq data is filtered to remove close matches to the subtraction database. Remaining data are characterized by the more costly process of aligning to a reference database.
to SeV genome reference sequences. As expected, the libraries with the largest SeV complement were the libraries that had been infected with Sendai virus and depleted of rRNA. When expressed as a portion of initial reads, the SeV complement was 5% to 10% for these RNAseq libraries (Table 1, column B). When expressed as a portion of the non-SDB reads, the SeV complement was 84% to 99% (Table 1, column C). This demonstrates that the SDB subtraction step left RNAseq data that was highly enriched for the viral sequence.
In all the datasets derived from infected libraries, SeV was the majority component of the non-SDB reads (Table 1, column D). This was true even for the non-depleted libraries.
When expressed as a portion of initial reads, the SeV complement was 0.47% to 0.96% for non-depleted libraries. When expressed as a portion of non-SDB reads, the SeV complement was 78% to 95% for non-depleted libraries. While the non-depleted libraries did contain fewer SeV reads in absolute terms, the portions of reads identified as SeV after subtraction were almost as high as those for the depleted libraries. This suggests that the SDB subtraction can provide a viable alternative to rRNA depletion for the detection of RNA viruses.
In the datasets from cells that were mock infected, the portion of non-SDB reads mapping to SeV was non-zero (Table 1, column D). In one case, it was 24%. These results suggest that taxonomic classification by mapping is noisy and that false positive virus identifications could result. Our data suggests a threshold for avoiding false positives, namely that a virus is suspected only if the post-SDB virus read count is as least 0.10% of initial reads. Additional data and additional replicates would assist selection of appropriate thresholds for desired levels of specificity.
The viral taxon was known beforehand and mapping non-SDB reads to SeV references only confirmed a priori knowledge. In a discovery situation, the non-SDB reads could be characterized by BLASTn homology search against the NCBI nt database. With our data, SeV was the taxon with the most hits for all libraries. This result indicates that SeV could have (B) At least 89% of reads from every sample mapped to the subtraction database named SDB1. (C) Non-SDB reads were mapped to SeV references. Expressed as a fraction of initial reads, the SeV complement was 5% to 10% in infected samples and less than 1% in the controls. (D) Expressed as a percentage of non-SDB reads, the SeV complement was 78% to 95% in infected-but-not-depleted samples. The complement was larger in the infected-and-depleted samples, and smaller in the mock-infected samples. This suggests that analysis of non-SDB reads could support the detection of known viruses in an uncharacterized sample. The apparent false positive enrichments (e.g. 24.24% SeV in HEP/none/none) can be discounted by applying a minimum requirement for 0.10% viral reads out of initial reads.  Table 2. From every library, a majority of reads were subtracted by mapping to SDB1 (Table 2, column B). After subtraction, the non-SDB reads were mapped to reference genome sequences for Ebola (EBOV) and Marburg (MARV). The EBOV complement of the non-SDB reads was 57% to nearly 100% in EBOV-infected libraries and lower in the other libraries (Table 2, columns C-D). The MARV complement of the non-SDB reads was 39% to 75% in MARV-infected libraries and lower in the other libraries (Table 2, columns E-F). These results confirm that SDB1 can provide effective enrichment of other viruses within independent cell cultures of HuH-7.

Discussion
The establishment of host sequence databases from commonly used cell lines, especially for lines often used to propagate viruses, is a critical control to ensure experimental results are attributed to the specific virus being tested. Therefore, we analyzed the human cell lines HepG2, HuH-7, and Jurkat in order to increase their utility for amplifying and detecting viruses in clinical or environmental samples. We developed a subtraction database (Daly et al., 2015) for the computational removal of host reads from RNAseq datasets. The database consists of the GRCh38 reference human genome sequence plus cell line specific sequences and potential contaminant sequences. The database contained DNA sequence from uninfected cells. We sequenced genomic DNA, rather than mRNA, in order to capture cell line genes whose expression may be limited to the viral propagation stage. We demonstrated utility by mapping RNAseq reads from cells infected with RNA viruses, as well as control cells, to the database. We were able to subtract host sequences and enrich each dataset for the non-host complement.

Supplementary material
The following are provided as an Excel workbook (Click here to access the data): Table S1: Characteristics and accessions of human cell line sequencing data.      The subtraction process reduced the computational cost that would be incurred by characterization of every read. The subtraction process removed host reads which often represented 99% of the data. The subtraction process used mapping software which imposed approximately 1% of the computational cost compared to characterization by BLAST. However, the identification of known sequences could be accomplished at lower cost using alignment-free K-mer matching software such as Kraken (Wood & Salzberg, 2014). Our subtraction process relied on the Bowtie2 (Langmead & Salzberg, 2012) mapping software. The process could be modified to use instead a splice-aware aligner such as HiSat2 (Kim et al., 2015). Whereas our highly sensitive process subtracts all read pairs with even one partial alignment, a splice-aware process might achieve higher specificity by subtracting only those RNA pairs with full-length alignments including spliced alignments. Being a faster mapper, HiSat2 might also reduce the computational cost of subtraction.
Our subtraction database included novel sequences specific to the three cell lines. Since it is not yet economical to generate complete genome and transcriptome assemblies of each subject cell line, we generated short reads at low coverage and retained only those reads that did not map to the human reference. Subsequent de novo assembly of the retained reads yielded 8 Mbp of novel sequence per cell line. The resulting cell line contigs (CLCs) were included in the subtraction database.
The marginal value of the CLCs was low. The CLCs were responsible for only 0.19% to 1.34% of RNAseq subtraction per sequencing experiment. It thus appears that the cost of CLC construction outweighed the benefit, at least for the cells and viruses tested. Investigators constructing subtraction databases for other human cell lines should evaluate the utility of existing public sequences prior to cell line sequencing. CLCs could be more beneficial on cell lines harboring transcribed endogenous viral sequence, on cell lines derived from non-human species for which prior sequences are lacking, on human cancer cell lines harboring divergent genomes, and on cells suspected of harboring transfection-induced chromosomal rearrangements, as was reported for transfected HepG2 cells (Livezey et al., 2002). CLCs could inform the search for marker sequences for cell line authentication (Almeida et al., 2016).
Characterization of the CLCs remains for future work. The largest CLCs had full-length alignments to complete sequences of clones from chromosomes of human or other primates, (Table S6) and partial alignments to CLCs from the other cell lines used here. Our coverage analysis of reads that did map to the reference suggested copy number variation and structural variation in the cell line genomes. It is likely that many CLCs capture cell-line specific chromosomal breakpoints such as those reported in HepG2 (Wong et al., 2000) or cell-line specific retrotransposed insertions.

4.
I offer some brief comments that could be addressed to make the paper more accessible and also justify the work. A schematic of the process used in the study. There was a very detailed description that could have been shortened by providing a schematic. This would enhance the paper, by making it easier to read and understand. What is the marginal advantage of the data in the paper? If one were to just use mapping to the reference human genome (hg38), I assume most of the host-derived reads would get eliminated. How many more reads are eliminated using the database provided in this paper? If this is not a big number, then maybe it is not worth the work? There were some contigs assembled that do not map to the reference human genome? What are these? Some hints on where they come from? Is the DNA or mRNA data most useful? It was not clear to me at the end if one or the other was more useful in the subtraction.

Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly I have published with 2 of the authors on this author list, however due to my large Competing Interests: number of publications I do not feel this is a conflict of interest. I also wish for the readers to know that Jason Miller contacted me personally, requesting for me to review this article.

Reviewer Expertise: Genomics, bioinformatics.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
In this article, the authors present a host subtraction database for human cell line sequencing. The purpose of this approach is subtract non-viral reads vom RNA-seq samples of virus-infected cells to improve efficiency of subsequent analyses. Their substraction database is targeted specifically towards three commonly used human cell lines: HepG2, HuH-7 and Jurkat.
In brief, their approach was as follows: For creation of the host subtraction database (SDB): -Perform DNA sequencing for the cell lines considered -Align reads to the human reference genome -Assemble unaligned reads into contigs The SBD was then created as a concatenation of FASTA sequences of the assembled contigs, the human reference genome, the UniVec Database (nucleic acid sequences which may be of vector origin), the PhiX genome (used as control in Illumina sequencing) and a collection of Mycoplasma genomes The subtraction process was then performed by aligning reads with bowtie2 against the SDB and then removing all read pairs if at least one read in the pair could be mapped with a local alignment to the SDB.
Evaluation of their approach was performed using RNA-seq data of the three cell lines either infected or not infected with Sendai Virus (SeV) as well as Huh-7 cells infected with Ebola or Marburg virus or not infected. They show that after host subtraction, virus reads make up the largest portion of the remaining reads. They also show that the combined runtime of first running bowtie2 and then BLAST on the remaining reads, e.g. to identify the virus, is only around 20% of the time that would be required if BLAST were run on all reads.
While the article is well written and the method is well described, there are a number of issues that need to be addressed: 1) The advantage of adding contigs assembled from DNA sequencing to the SDB is relatively small. Only between 0.19% and 1.34% of reads were subtracted in the SeV test case. However, the effort for obtaining these contigs is substantial as it requires DNA sequencing and contig assembly. Thus, the SDB cannot be easily extended to other cell lines without additional experiments. To show that this additional effort is warranted, the authors should at least show the following: That removing reads mapping to the contigs before post SDB mapping has a substantial influence on the results by either resulting in reduced numbers of false positive mappings to the Sendai genome or substantially reducing runtime of the total pipeline (including SDB mapping and potentially BLAST analysis). That a similar effect cannot be obtained by augmenting the human genome by publicly available human sequence data not part of the reference genome, e.g. transcript sequences from Ensembl or RefSeq or alternatively publicly available RNA-seq data for the considered cell lines. In particular, the latter might even outperform DNA sequencing for the cell lines as assembly of reads not mapping to the human genome would result in novel transcript sequences to which RNA-seq reads map better with the (unspliced) read aligner used in this approach. Furthermore, it would allow easily extending the SDB approach to other cell lines for which RNA-seq data already is/becomes available.
2) There should be some justification for the use of an unspliced read aligner, i.e. bowtie2, for mapping RNA-seq reads against the SDB, rather than a dedicated RNA-seq mapping program that also allows identifying spliced read alignments, such as STAR or HISAT2. In particular, the latter is both fast and requires relatively little memory. While removing reads for which bowtie2 identifies a local alignment likely identifies all sequences originating from the SDB sequences, there is a potential that too many reads may identifies all sequences originating from the SDB sequences, there is a potential that too many reads may be removed, e.g. in case of local similarities between SDB sequences and the virus in consideration. A safer approach would be to only exclude reads that can be aligned completely or almost completely (including gaps due to splicing). Regarding this point, the authors should show that: there is a substantial speedup of using local alignment with bowtie2 compared to fast RNA-seq alignment with HISAT2 or STAR that results do not differ much between the two approaches, i.e. the bowtie2 approach does neither miss many mappings a dedicated RNA-seq mapper would find nor reports many mappings not found using an RNA-seq mapper. If the latter is the case, they should show that these additional mappings missed by an RNA-seq mapper are valid mappings to the human cell lines and influence subsequent results.
Finally, DNA and RNA sequencing data for the considered cell lines should be submitted to a public database to ensure full reproducibility of their approach.

Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Yes We agree that the manuscript should clearly convey the finding that the CLCs offered little advantage on these cells and these viruses. The manuscript already noted that the CLCs subtracted only 0.19% to 1.34% of RNA reads, providing "a small amount of mapping sensitivity." However, the extensive treatment of CLCs in the Discussion seemed to emphasize their role. We deleted almost three paragraphs about CLCs from the Discussion. We inserted what is now the third paragraph of Discussion which includes the statement, "It thus appears that, for the cells " and viruses tested, the cost of CLC construction outweighed the benefit.
With this change incorporated, it is important to note the Huh7, HepG2, and Jurkat cell lines are commonly used for viral propagation. Cell lines often have genomic abnormalities not previously reported in public databases. From this standpoint, generation of sequence data and the corresponding CLCs provides a valuable reference dataset to the community, if only to show consistency with public domain data. The first sentence of the revised Discussion now says host sequence databases provide "a critical control to ensure experimental results are attributed " to the specific virus being tested.
One can question whether an essentially equivalent resource could be reproduced at lower cost. Performing that experiment would go beyond the scope of this F1000Resarch paper, which announces an existing resource that readers may download without charge. The question is certainly relevant for construction of future subtraction databases. Therefore, we modified the manuscript to raise the issue. The new third paragraph of Discussion includes the statement, " Investigators constructing subtraction databases for other human cell lines should " It also lists evaluate the utility of existing public sequences prior to cell line sequencing. types of cell lines for which sequencing could reasonably provide more benefit than it did for us.

Reviewer Comment #2.
This comment is focused on the mapping of RNA reads to genomic sequences. It says the process may remove too many reads because the Bowtie2 mapper is not splice-aware. It says the manuscript should demonstrate that Bowtie2 performs better than alternatives such as HiSat2.
Our process employs high sensitivity by design. It uses Bowtie2 parameterized for sensitive alignments and it subtracts both RNA reads of any pair that shows any detectable alignment, including pairs with only one partial alignment to one read. We considered HiSat2 during the design phase, but we reasoned that a HiSat2-based process would be less sensitive if it only subtracted reads with partial alignments explained by splicing. One could probably design and build several alternate processes, test them on our data, and characterize all the reads that subtract differentially. Such testing might be informative, but it would go beyond the scope of this F1000Research manuscript, which describes one process that is already functional, characterized, and available. Since many readers may wonder about our choice of mapper, we modified the manuscript to point out that one could implement a splice-aware variant of our process. The new text, added to the second paragraph in the revised Discussion, says, "Our subtraction process relied on the Bowtie2 mapping software. The process could be modified to use instead a splice-aware aligner such as HiSat2. Whereas our highly sensitive process subtracts all read pairs with even one partial alignment, a splice-aware process might achieve higher specificity by subtracting only those RNA pairs with full-length alignments including spliced alignments. Being a faster mapper, HiSat2 might also reduce the computational spliced alignments. Being a faster mapper, HiSat2 might also reduce the computational " cost of subtraction.

Reviewer Comment #3.
This comment is focused on data availability of sequence reads.

Competing Interests:
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias You can publish traditional articles, null/negative results, case reports, data notes and more The peer review process is transparent and collaborative Your article is indexed in PubMed after passing peer review Dedicated customer support at every stage For pre-submission enquiries, contact research@f1000.com