Reanalysis of single-cell RNA sequencing data does not support herpes simplex virus 1 latency in non-neuronal ganglionic cells in mice

ABSTRACT Most individuals are latently infected with herpes simplex virus type 1 (HSV-1), and it is well-established that HSV-1 establishes latency in sensory neurons of peripheral ganglia. However, it was recently proposed that latent HSV-1 is also present in immune cells recovered from the ganglia of experimentally infected mice. Here, we reanalyzed the single-cell RNA sequencing (scRNA-Seq) data that formed the basis for that conclusion. Unexpectedly, off-target priming in 3’ scRNA-Seq experiments enabled the detection of non-polyadenylated HSV-1 latency-associated transcript (LAT) intronic RNAs. However, LAT reads were near-exclusively detected in mixed populations of cells undergoing cell death. Specific loss of HSV-1 LAT and neuronal transcripts during quality control filtering indicated widespread destruction of neurons, supporting the presence of contaminating cell-free RNA in other cells following tissue processing. In conclusion, the reported detection of latent HSV-1 in non-neuronal cells is best explained using compromised scRNA-Seq datasets. IMPORTANCE Most people are infected with herpes simplex virus type 1 (HSV-1) during their life. Once infected, the virus generally remains in a latent (silent) state, hiding within the neurons of peripheral ganglia. Periodic reactivation (reawakening) of the virus may cause fresh diseases such as cold sores. A recent study using single-cell RNA sequencing (scRNA-Seq) proposed that HSV-1 can also establish latency in the immune cells of mice, challenging existing dogma. We reanalyzed the data from that study and identified several flaws in the methodologies and analyses performed that invalidate the published conclusions. Specifically, we showed that the methodologies used resulted in widespread destruction of neurons which resulted in the presence of contaminants that confound the data analysis. We thus conclude that there remains little to no evidence for HSV-1 latency in immune cells.

sequencing (scRNA-Seq) technologies provides a unique opportunity to study latency and reactivation in neuronal latency models (7), as well as to provide evidence for whether HSV latency is established in non-neuronal cells.A recent study by Wang et al. (8) addressed the latter and concluded that, in addition to neurons, HSV-1 establishes latency in immune cells that are present in the TG of HSV-1 experimentally infected mice.Here, we present a reanalysis of the scRNA-Seq data used by Wang et al. (8) and demonstrate significant problems with the quality of their scRNA-Seq datasets and conclude that these data cannot be used to support their conclusion.
During latency, HSV-1 gene expression is highly restricted and limited to the latency-associated transcript (LAT) and associated miRNAs (9)(10)(11)(12)(13)(14).The primary LAT transcript is 8.3 kb in size, capped, and polyadenylated (15)(16)(17)(18).Splicing produces stable 1.5 and 2 kb LAT intron lariats that accumulate to high levels in sensory neurons, while the highly unstable 6.3 kb spliced polyadenylated RNA is rapidly processed into viral microRNAs (19,20) (for a comprehensive review of LAT, see (21)).Two major claims are reported by Wang et al.: (i) in addition to neurons, various types of immune cells recovered from TG of experimentally infected mice express HSV-1 LAT, and (ii) the presence of LAT in these cells indicates that HSV-1 can establish latency in non-neuronal cells present in the TG.The core data supporting these claims was obtained by drop let-based scRNA-Seq analysis (10X Genomics platform) of TG from uninfected C57BL/6 mice (dataset: "Uninf-1") and two biological replicate groups of C57BL/6 mice infected via the corneal route with 2 × 10 5 plaque-forming units/eye of HSV-1 strain McKrae 35 days earlier (datasets: "Inf-1" and "Inf-2").Each biological replicate was obtained by pooling the CD45 POS leukocyte-enriched dissociated cells from paired left and right TG from 15 animals (i.e., 30 TG per biological replicate).To examine the claims by Wang et al. (8) in more detail, we aimed to reproduce their analyses.However, none of the (raw) scRNA-Seq datasets, barcode matrices, or analyses scripts are linked to the published article.We were later able to obtain the raw data from the study (SRA PRJNA937697, GEO GSE225839) via the handling editor of Science Advances.What follows is a reanalysis of the data presented by Wang et al. (8) using the same filtered barcode matrices that served as input for their scRNA-Seq analysis.

Quality control of scRNA-Seq datasets
Isolation of dissociated, single cells from organs requires mechanical and/or enzymatic tissue dissociation, typically followed by removal of dead cells and (if needed) further purification of cells of interest by magnetic bead-or flow cytometry-based cell sorting.Quality control (QC) filtering of the obtained scRNA-Seq datasets is therefore a critical first step (22).The Chromium Single Cell 3' v3.1 Reagent Kit (10X Genomics) used by Wang et al. (8) for library preparation is designed to capture polyadenylated RNAs and prime reverse transcription using a poly(T) primer that also includes the barcode and unique molecular index (UMI) sequences.QC filtering involves the identification and removal of doublets, as well as an assessment of cell viability in each of the samples.This latter is achieved by measuring, for each individual cell (i) the number of unique genes detected, (ii) the total number of RNA molecules (UMI) recovered and (iii) the proportion of reads derived from mitochondrial RNAs (23,24) (Fig. 1).In the original matrix count files generated by Wang et al. (8), the dataset designated Uninf-1 had 1,822 distinct genes detected per cell and a median total RNA count of 6,879, while 83% cells had mitochondrial RNA fractions <15%.By contrast, the Inf-1 and Inf-2 datasets showed different results with a median of 1,979 and 558 distinct genes per cell, a median total RNA count of 7,148 and 929, and only 51% and 33% of cells having a mitochondrial RNA fraction <15% (Fig. 1A), respectively.These data indicate higher proportions of dead and dying cells within both the Inf-1 and Inf-2 datasets with Inf-2 particularly severely impacted.Next, we applied filters on mitochondrial RNA content and unique gene counts, according to the parameters described by Wang et al. (8), namely that cells were only retained if between 300 and 9,000 distinct genes were detected, and the proportion of mitochondrial reads present was below 15%.At this stage, we observed large numbers of low-quality cells filtered out of each data set (Fig. 1B).This resulted in 3,608 cells for Uninf-1 (reduced from 4,206, a loss of 14%), 3,158 cells for Inf-1 (reduced from 6,155, a loss of 49%), and 5,660 cells for Inf-2 (reduced from 17,014, a loss of 67%).

Clustering and annotation of single cells
One of the most contentious components of any scRNA-Seq analyses is the reduction of multidimensional into simple two-dimensional figures through either t-distributed RNAs.(B) Quality control filtering of these datasets dramatically reduced the total number of cells available for analysis, indicative that many dead/dying cells were present in the original single-cell suspensions of Inf-1 and Inf-2.Filtering parameters removed cells with less than 300 or more than 9,000 distinct expressed genes, and cells for which more than 15% of reads derived from mitochondrial RNAs.Inset: Number of cells before and after filtering.stochastic neighbor embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP) techniques.While this has been reviewed elsewhere (22), it is worth noting that significant care must be taken when trying to interpret these data.We performed integration and clustering in a similar manner to Wang et al. (8), matching as many parameters as possible (see Methods section) (Fig. 2A).A deeper analysis of the clusters revealed significant differences in the relative proportions of cell types present in each dataset with some clusters being almost entirely derived from a single dataset (e.g., Cluster 0 and to a lesser extent cluster 2, Fig. 2B).
A second challenging component of scRNA-Seq analyses is the process of assigning specific cellular identities to a given cluster.This is typically achieved by identifying distinct markers within a given cluster and comparing this to a well-described reference database of cell identities.Here, we used the same annotation tool and reference databases as Wang et al. (8) (SingleR (28), with (MouseRNAseqData (29) and ImmGen Data (30) from the celldex package (https://github.com/LTLA/celldex)(Fig. 2C and D).This analysis (i) produced generally convergent results (Fig. 2E) and (ii) demonstrated that while most clusters could be identified with high confidence (i.e., more than 90% of cells present are predicted to have the same identity) other clusters are reported as mixtures of cell types (i.e., clusters 0, 3, 5, and 6) and assigning singular cell identities to these can only be done with low confidence (Fig. 2C and D).Subsequently, an analysis using representative markers for the cell types present in each cluster further confirmed the division between high confidence and low confidence cluster identities (Fig. 2F).We further analyzed Cluster 0 to better understand why no robust cellular identity could be assigned.Notably, this cluster derived almost entirely from the Inf-2 dataset (Fig. 2B), and when compared to other clusters, it was characterized by containing cells with high proportions of mitochondrial reads and low numbers of detectably expressed genes per cell (Fig. 3A and B).To test the hypothesis that most cells in this cluster were dying/dead, we summarized the expression of 40 cell death markers (31), and again when compared to other clusters, determined these to be predominantly expressed in the Inf-2 derived cells in Cluster 0 (Fig. 3C; Table S1).This is particularly relevant in the context of the original Wang et al. (8) analyses as the major conclusions in that study were derived from the analysis of cells in this cluster.Finally, we examined the possibility that ambiguous cell types and lack of clustering by cell type could be related to doublets present in the samples, or due to differences in integration methods between our analysis and Wang et al.While not performed by Wang et al., the identification and removal of doublets is considered an important quality control step during scRNA-Seq data processing.We used DoubletFinder (32) and identified just 847 doublets across all three datasets (representing 6.8% of total cells).Doublets were identified in multiple clusters, and importantly, very few were present in Cluster 0, representing only 1.2% of cells in that cluster (Fig. S1).We therefore concluded that doublets could not explain the issues with Cluster 0 and sample Inf-2.Secondly, while the integration of datasets is not strictly required and is mostly useful for multimodal datasets, we performed integration after doublet removal and found that, while this resulted in 12 clusters rather than 13, the overall cell type classifications did not change (Fig. S2).In conclusion, these data indicate that (i) Inf-1 and Inf-2 datasets are not valid biological replicates, (ii) clusters 0, 3, 5, and 6 cannot be assigned a specific cell identity with high confidence (Fig. 2), and (iii) Inf-2 derived cells in Cluster 0 are likely undergoing programmed cell death (Fig. 3).

Off-target capture enables profiling of LAT intron lariats by scRNA-Seq
We next switched focus to the reported detection of HSV-1 LAT transcripts in multiple clusters.Of particular note is that the stable HSV-1 LAT 1.5 and 2 kb introns are not polyadenylated, and that the mature LAT RNA is highly unstable (15,18).Thus, one would not expect to detect intron-derived RNAs by 3' scRNA-Seq, in which the 3' oligo d(T) adapter is designed to prime poly(A) tails.However, reanalysis of the raw fastq files from Wang et al. (8) demonstrated that 74%-92% of viral reads (representing <0.005% of all reads) aligned to the LAT intron, while the remaining reads mapped at low density throughout the rest of the HSV-1 genome (Fig. 4A and B; Table S2).Closer examination of read alignments across the LAT locus showed consistent alignments that were associated with short adenosine homopolymers located within the intron and a much smaller peak at the 3' end of the mature LAT (Fig. 4C).Taken together, these data show that off-target priming in 3' scRNA-Seq experiments (33) results in the detection of non-polyadenylated HSV-1 LAT introns.Similar results have been observed in other scRNA-Seq studies of HSV-1 latently infected ganglia, indicating that the 3' scRNA-Seq approach is compatible with studies of HSV-1 latency models (7, 34), however, the efficiency of this off-target priming remains unknown.

Loss of HSV-1 LAT during filtering suggests cell-free RNA contamination
LAT reads were not universally detected in the Inf-1 and Inf-2 datasets, but instead were 10 times more abundant in the Inf-2 dataset (Fig. 4B).In addition, most LAT reads in the base dataset (filtered barcode matrices) were excluded during the initial quality control process (Fig. 4D).Notably, most cells with LAT reads in the filtered dataset only contained a single LAT read as determined by the UMI present in each read (Fig. 4D).Subsequent analysis of the individual clusters demonstrated that the vast majority of cells designated as LAT POS were associated with Cluster 0 and were almost exclusively from the Inf-2 dataset (Fig. 5A).Similarly, the relative expression of LAT was highest in Cluster 0 (Fig. 5B).Because (i) this cluster is composed of dead/dying cells and (ii) HSV-1 LAT introns accumulate to high levels in neurons (35), we hypothesized that high background levels of cell-free RNA -originating from HSV-1-infected neurons that were damaged during TG tissue processing -could be the source of LAT reads in non-neuronal cells.To test this hypothesis, we compared the number of reads aligning to HSV-1 LAT and several cell  S1).
type-specific markers in the pre-(Fig.1A) and post-filtered (Fig. 1B) datasets.Strikingly, this analysis demonstrated a significant loss of both HSV-1 LAT and neuronal markers during filtering that was not observed for any of the other major cell types present (Fig. 5C; Fig. S3).Thus, extensive death of (HSV-1 latently-infected) neurons during TG processing is the most likely source of ambient RNA contamination (36,37).

DISCUSSION
The recent study by Wang et al. (8) has challenged the long-held dogma that her pes simplex virus type 1 (HSV-1) exclusively establishes latency in neurons.However, overturning existing dogmas necessitates robust and rigorous evidence that is suppor ted by well-controlled independent experiments using orthologous methodologies.A key premise of Wang et al. (8) is that the presence of HSV-1 LAT RNA in a cell is sufficient to conclude that the virus has established latency in the infected cell.HSV-1 latency can be operationally defined as the presence of viral DNA in host cells in the absence of virus particle production, provided that the genome is maintained in a reactivatable state that enables the production of new infectious HSV-1 particles (38,39).Transcriptional activity of the latent HSV-1 genome is repressed with exclusive expression of LAT driven by a neuron-specific promoter (11,13,15).Neurons are the only cell type in which HSV-1 latency has been clinically and experimentally demonstrated in both human and mouse ganglia (2, 9, 40-43).Interestingly, not all HSV-1-infected TG neurons express LAT in the HSV-1 mouse model ( 44), and it is unclear whether all neurons harboring HSV-1 DNA express LAT in human TG (45,46).This raises the question of whether all neurons containing HSV-1 DNA support virus reactivation.Thus, even if low-abundant LAT reads were detected in non-neuronal cells, this is not by the currently accepted definitions (38,39), conclusive evidence of HSV-1 latency.Principally, our reanalysis of the single-cell RNA sequencing (scRNA-Seq) data from Wang et al. ( 8) reveals significant problems with the quality of their scRNA-Seq data sets.We conclude from our reanalyses that the reported detection of LAT reads from non-neuronal cells is best explained by cell-free RNA originating from latently-infected neurons that were damaged during tissue processing.Moreover, we here specifically establish that the Inf-1 and Inf-2 datasets cannot be considered biological replicates, with the Inf-2 dataset in particular showing evidence of extensive cell destruction during tissue processing.We have also shown that integration of these datasets yields multiple cell clusters that cannot be assigned a specific cellular identity.One of these clusters is dominated by low-quality HSV-1 LAT-expressing cells that are almost entirely derived from the Inf-2 dataset.A deep analysis of this cluster identified markers of multiple cell types including neurons and expression of a large number of programmed cell death markers.The significant loss of reads associated with HSV-1 LAT and neuronal markers during QC filtering further supports extensive neuronal cell death during tissue processing and the release of both neuronal RNAs and HSV-1 LAT into the homogenized single-cell suspension.Such RNA is easily bound on the surface of other cell types and thus carried into the droplets in which cell lysis and the initial steps of scRNA-Seq library preparation take place (36,37).
As we had to specifically request that the scRNAseq datasets be made available to us post-publication (and these data are not linked to the publication on the journal website), we feel obligated to reiterate the FAIR (Findability, Accessibility, Interoperabil ity, and Reusability) data principles.These guidelines provide a framework to increase transparency and promote the reuse of data by the scientific community (47,48), which in turn will accelerate scientific discoveries and provide the opportunity for dataset correction by other analyses.Many funding agencies, universities, and scientific journals aim to promote open science by recommending or requiring researchers to adhere to the FAIR principles and open-access publishing.In the context of scRNA-Seq experiments, this means that all raw sequence files, metadata, raw and filtered matrices, including all code/scripts used for analysis, need to be deposited in publicly available data repositories (e.g., GEO, SRA, and GitHub).References to the location of the data should be provided in the relevant sections of the article.It would have been helpful for Wang et al. (8) to include references to their scRNA datasets in the manuscript to follow journal guidelines, and to provide sufficient details in the Methods section to reproduce all aspects of their data analysis, e.g., the Cell Ranger parameters used for aligning raw sequence data and the construction of the hybrid genome reference were not described.While Methods sections are often written in a concise manner, it is increasingly common, and generally required by journals, that authors make available all scripts used for the analysis of the original data presented.Additionally, we recommend that authors demonstrate the impact of both QC filtering steps, such as those shown in (Fig. 1) and cluster labelling strategies (Fig. 3), on each individual biological replicate.By showing this as Supporting Data in manuscripts describing scRNA-Seq data, it becomes easier for experts in the field and other interested parties to evaluate the results.
In summary, our reanalysis of recently published scRNA-Seq data of HSV-1-infected mouse TG does not support the reported detection of HSV-1 LAT RNA in non-neuronal cells.While studies investigating the virus and host factors contributing to viral latency and reactivation at the single-cell resolution will undoubtedly advance our understand ing of these processes, it is critical to adhere to the accepted best practices for the design and analysis of scRNA-Seq data and to share both the datasets and code used for analysis.

Data sourcing
The raw data files (FASTQ) associated with the original study of Wang et al. (8) (SRA PRJNA937697) were downloaded from the sequence read archive using fastq_dump from the SRA tool kit v2.10.9 (https://github.com/ncbi/sra-tools). Filtered count matrices generated by Wang et al. (8) were downloaded from the Gene Expression Omnibus archive (GEO GSE225839) in order to reproduce the analyses.It was not possible to reproduce the Cell Ranger (10X Genomics) analysis as the original authors did not provide sufficient detail on how reference genomes and annotations were generated, nor what parameters were used when running Cell Ranger.

Preprocessing of Wang et al. filtered count matrices
Filtered count matrices from all three datasets (Uninf-1, Inf-1, and Inf-2) were imported and analyzed using the Seurat package (v5) in R (v4.3.2) (54).Since the Wang et al. (8) original analysis scripts were not provided, we attempted to use parameters from the manuscript where possible to recreate the analysis.However, our analysis differed from Wang et al. in two ways.Firstly, although Wang et al. mentions integration of the three sample datasets, no information was provided about the method or parameters used.While the integration of datasets is not strictly required and is generally more useful for multimodal datasets, we tested whether integration affected clustering and cell-type annotation results.Integration was performed using the workflow described in https://satijalab.org/seurat/archive/v4.3/integration_introduction with default settings.Results were compared against a simple merge() of the datasets.Secondly, Wang et al. (8) provided RNA-and UMI-level filter metrics for LATcells only, and it was unclear what filters were used on the full dataset.We chose not to apply different filters to different subsets of data as this could not reasonably be rationalized and remains a significant error of the original analysis.Mitochondrial filtering was performed at the same level as Wang et al. (8) (15%), and only cells with 300-9,000 features (i.e., distinct transcript identities) were retained.

FIG 1
FIG 1 Quality control of scRNA-Seq datasets derived from pools of HSV-1 latently-infected mouse trigeminal ganglia.scRNA-Seq datasets generated by Wang et al. (8) were obtained from uninfected ("Uninf-1) and two biological replicates of HSV-1-infected C57BL/6 mice ("Inf-1" and "Inf-2").Each replicate was obtained by pooling dissociated cells -composed of a 1:1 mixture of CD45-enriched cells and the original cell suspension -from left and right TG from 15 animals (30 ganglia).(A) Using the filtered barcode matrices generated by Wang et al. (8), the quality of each dataset was assessed by (left) the number of unique genes detected per cell, (middle) the total number of RNA molecules (UMI) recovered per cell and (right) the proportion of reads per cell derived from mitochondrial

Full 4 FIG 2
FIG 2 Clustering and annotation of cell populations.(A) Aggregated tSNE plot of all three datasets identifies 13 distinct clusters (0-12).(B) The proportion and total number of cells in each cluster shown differs between datasets e.g., cluster 0 is almost entirely composed of cells from the Inf-2 dataset.(C and D) SingleR was used to perform unbiased cell type recognition using both (C) MouseRNAseq and (D) lmmGen databases.For each cluster, the maximum proportion of cells given the same identity (analogous to a confidence score) is shown above the bar plot.(E) Both databases yielded similar results and confidence scores.Notably, clusters 0, 3, 5, and 6 could not be adequately resolved into a single dominant cell type.(F) Bubble plot showing both the proportion of cells in each cluster that express a particular cell type-specific marker (25-27) and the expression level of that marker.

FIG 3
FIG3 Further evaluation of cell quality within each cluster.For each cluster, and segregated by dataset, we determined (A) the proportion of mitochondrial reads per cell, (B) the number of distinct genes expressed per cell, and (C) the aggregated expression level of a selection of cell death markers (TableS1).

FIG 4
FIG 4 Abundant detection of non-polyadenylated LAT introns.(A) Coverage plot denoting the distribution of HSV-1 reads in the raw (unfiltered) Inf-1 and Inf-2 datasets.Black lines represent the HSV-1 strain McKrae genome, while grey boxes indicate open reading frames and thin lines indicate introns.Top and bottom panels represent HSV-1 reads aligning to the forward and reverse strand of the genome.Both copies of the LAT locus are indicated in green.(B) Reads aligning to the HSV-1 genome comprised only a small proportion (< 0.006%) of the Inf-1 and Inf-2 datasets, while most of these reads (74%-92%) aligned to the LAT introns located in the LAT locus.(C) Coverage plot of the LAT locus confirms that the majority of HSV-1 LAT reads aligned next to short adenosine homopolymers (blue vertical bars) located within the intron, indicating off-target capture.(D) The majority of cells with LAT reads contained only a single copy of LAT (i.e., a single UMI) and most of these were removed during the QC filtering step.

FIG 5
FIG 5 Abundant loss of reads associated with HSV-1 LAT and neuronal markers during filtering.(A) tSNE plot from Fig. 2A split by sample shows that the majority of LAT expression (in green) mapped to Cluster 0, which was exclusively present in the Inf-2 dataset.(B) Violin plot showing log normalized expression of LAT in each cluster.(C) For HSV-1 LAT and representative markers of different cell types, we determined the total number of reads present in the pre-filtered (Fig. 1A) and post-filtered (Fig. 1B) datasets.Data for a wider selection of markers is shown in Fig. S3.