Host transcriptional responses in nasal swabs identify potential SARS-CoV-2 infection in PCR negative patients

Summary We analyzed RNA sequencing data from nasal swabs used for SARS-CoV-2 testing. 13% of 317 PCR-negative samples contained over 100 reads aligned to multiple regions of the SARS-CoV-2 genome. Differential gene expression analysis compares the host gene expression in potential false-negative (FN: PCR negative, sequencing positive) samples to subjects with multiple SARS-CoV-2 viral loads. The host transcriptional response in FN samples was distinct from true negative samples (PCR & sequencing negative) and similar to low viral load samples. Gene Ontology analysis shows viral load-dependent changes in gene expression are functionally distinct; 23 common pathways include responses to viral infections and associated immune responses. GO analysis reveals FN samples had a high overlap with high viral load samples. Deconvolution of RNA-seq data shows similar cell content across viral loads. Hence, transcriptome analysis of nasal swabs provides an additional level of identifying SARS-CoV-2 infection.


INTRODUCTION
The SARS-CoV-2 pandemic continues to disrupt everyday life, with over 296,496,809 confirmed cases and over five million deaths (WHO 07-JAN-2022) (Organization, 2020). One of the initial responses in the biomedical field was the development of diagnostic tests for the rapid and sensitive detection of SARS-CoV-2 infection, resulting in various molecular PCR and antigen-based tests (Islam and Iqbal, 2020;Liu et al., 2020;Okamaoto et al., 2020). Many of these tests continue to operate under Emergency Use Authority regulations from the FDA in the US. However, many anecdotal reports of people with COVID-19 symptoms with negative tests are common. Here, we investigated the host transcriptional response of potential PCR false-negative subjects using RNAseq data from a shotgun transcriptome sequencing study . Indeed, multiple studies from biospecimens show activation of immune responses (alpha and gamma interferon responses) and bioenergetic responses to (Okamaoto et al., 2020) SARS-CoV-2 infection, as well as viral load-dependent changes in the host transcriptome Xiong et al., 2020;Sajuthi et al., 2020). Studies of SARS-CoV-2 and other viruses suggest that the host immune response to a virus may be sufficient to assist in the diagnosis of an infection (Zhang et al., 2021). We used published sequencing data from clinical specimens obtained from 670 patients tested for SARS-CoV-2 in the New York area in early 2020 (first wave), resulting in 192 positive and 389 negative test results via quantitative PCR assay (phs002258.v1.p1). Here, we show evidence for false-negative SARS-CoV-2 detection in 42 patients based on RNA-sequencing coverage of the SARS-CoV-2 genome and the host transcriptome.

RESULTS
Identifying potential false-positive SARS-CoV-2 infections using RNA-seq Subjects at New York Presbyterian Hospital-Weill Cornell Medical Center were clinically assessed for SARS-CoV-2 infection using a quantitative PCR (qPCR) test. The qPCR test included an amplicon for the E (envelope) gene to detect B lineage beta-coronaviruses and a second amplicon for the S (spike) gene that uniquely detects SARS-CoV-2 . The qPCR cycle threshold (Ct) value of the S amplicon is inversely related to the sample's amount of starting viral material, with a limit of detection cutoff at Ct R 40. Using RNA-seq data generated from the same nasal swabs, we measured the association between RNA-seq reads aligned to the SARS-CoV-2 genome and Ct values ( Figure 1A). We found an inverse linear relationship, r 2 = 0.69, between Ct values and SARS-CoV-2 reads in subjects deemed positive by PCR (166). In contrast, of the 317 PCR-negative subjects, Ct R 40, we found 42 subjects with over 100 SARS-CoV-2 aligned reads that represent potential false negatives.
We checked the mapping of our data by comparing the relative STAR aligned counts to the human and SARS-CoV-2 genome with a parallel analysis using Kraken2. We observed a similar correlation of reads mapping to both human ( Figure 1B) and SARS-CoV-2 ( Figure 1C) genomes using both the STAR alignment and Kraken2 tools (Pearson correlation of raw data r 2 = 0.49 and 0.99, respectively). There was no correlation between the number of SARS-CoV-2 aligned reads and human aligned reads ( Figure 1D) or the total number of sample reads ( Figure 1E and 1F). The number of gene counts aligned to the SARS-CoV-2 genome was proportional to the number of gene counts as a fraction of the total generated RNA-seq reads (data not shown).
We then classified the subjects into viral load groups based on qPCR Ct and the number of SARS-CoV-2 reads via RNA-seq (Table 1). An additional six samples were found to be false positives and 72 samples tested positive for other respiratory virus based on Kraken2 analysis; we removed these samples from subsequent analyses ( Figure 1G).
Next, we used Bedtools genomecov to investigate the relative coverage of the SARS-CoV-2 genome (Figure 2A) in each group. In high and medium viral load samples, we observed a consistent 5' to 3' coverage of the SARS-CoV-2 genome, whereas in low viral load samples, we observed more coverage variation (Figure 2B). For the false-positive samples, we observed variable coverage of the genome (only 7 samples had high enough coverage for strain analysis). We also investigated the relative coverage of the SARS-CoV-2 transcripts in each group using RSeQC geneBody coverage. In high and medium viral load samples, we observed a consistent 5' to 3' coverage of SARS-CoV-2 transcripts, whereas in low viral load samples, we observed more coverage variation ( Figure 2C). Finally, for the false-positive samples, we observed variable coverage of the transcripts. The low fragmented coverage of false-positive samples is suggestive of low viral load levels or potentially viral fragments indicative of potential post-infection shedding.
Next, we focused on the 42 false-negative samples and analyzed the number of SARS-CoV-2 aligned transcripts for each annotated SARS-CoV-2 gene. We detected only a few reads aligning to Orf1ab.1, Orf6, and Orf7b; in contrast, we observe a range of expression values of E, M, N Orf10, Orf1ab, Orf3a, and S genes. In the majority of the false-negative samples, the relative abundance of the highest expressed viral genes varied within each sample ( Figure 2D). Together, our analyses clearly identified viral RNA from multiple genomic regions of SARS-CoV-2 in 42 subjects, who were determined uninfected by PCR.

Viral clade analysis
We identified 42 potential false-negative patient samples. From these, we were able to assemble genomes for strain analysis from seven samples  which were submitted to GISAID. Nextclade analysis identified 19B (1), 20B (1), and 20C (5) strains of the SARS-CoV-2 virus. Compared to the PCR-positive samples (148), we observe relatively higher numbers of 19B and 20B and a similar proportion of 20C strains (71% vs 76%); however, the proportions are likely influence by the low number of samples passing QC ( Table 2). The other false-negative samples did not pass QC for genome analysis (data not shown). Due to the low numbers, we were unable to determine whether the strain of virus had an impact on the ability of PCR to detect its presence or not.
Directional response of host gene expression identified viral load-dependent signatures and similarities between low viral load and false-negative subjects We hypothesized that false-negative samples likely reflect a low viral load condition. To test our hypothesis, we used the nasal swab RNAseq data to both profile the host gene expression response and compare the response across our subject groups (Table 1). We defined differential gene expression as an absolute fold change > 1.2 with a false discovery rate <0.05 using negative samples (both by qPCR and RNAseq) as the comparator group Law et al., 2014). In the host RNA from high viral load subjects, we observed a skewed signature of differential gene expression with over 10-fold more upregulated to downregulated genes (518 up / 47 down, Figures 3A and 3B). This pattern was reversed in both medium viral  Figures 3C-3H). Next, we compared the commonality of the differentially expressed genes and found that false-negative samples had the most overlap with low viral load subjects (28 genes), followed by medium viral load (27 genes), and only one gene in common with high viral load subjects (Figure 4). Among the PCR-positive groups, 59 genes were commonly regulated across viral loads, 197 genes were shared between high and medium viral load, whereas 275 were shared between medium and low viral load. In contrast, only two genes were uniquely shared between high and low viral load ( Figure 4). These data suggest that subjects with a high viral load, as determined by nasal swab qPCR, have a distinct directional host transcriptional response compared to other SARS-CoV-2-positive subjects.

Functional analysis of host gene expression identified host immune and inflammatory responses in false-negative samples
Using our viral load classification, we found the directionality and uniqueness of host expression changes at the individual gene level (Figures 3 and 4). Functional analysis of RNAseq data leverages collections of genes that are related to common pathways, functions, and cellular localizations to identify specific gene sets that are enriched across the transcriptome. We hypothesized that functional analysis of differential gene expression would identify host responses common to SARS-CoV-2 infection as well as host pathways sensitive to viral load. We used fGSEA and topGO to compare the differential gene expression patterns at gene ontology (GO) level.
Our fGSEA analysis ( Figure 5A) found that the high viral titer group had the largest enrichment of GO terms, 2,606. The medium titer group had 766 enriched GO terms, 66% of which overlapped with the high titer group. The low titer and false-negative groups had the fewest number of enriched GO terms, 66 and 390, respectively. We identified 23 common GO pathways enriched across all subjects with detectable SARS-CoV-2, via RNAseq, compared to negative subjects ( Figure 5A). Remarkably, the 23 common pathways were all upregulated compared to the negative controls and shared inflammatory and immune response pathways observed in viral infections, including in patients with COVID-19 ( Figure 5B) Xiong et al., 2020;Sajuthi et al., 2020). In addition to the 23 common pathways, there were 188 common pathways between high load, medium load, and false-negative subjects. Surprisingly, there were only six common pathways enriched in both the false-negative and low viral load groups.

Viral load linked to unique functional host responses at the transcriptional level
Differences in viral load may reflect the infection time course, the effectiveness of host response, or the effectiveness of treatments. Regardless, it is clear from both linear (PCA) and non-linear (tSNE)  Figure S1A). iScience Article dimensionality reduction of the host RNAseq data ( Figure S2) that the high load samples are the most homogeneous compared to the other groups. As expected, the fGSEA analysis of high viral load samples identified pathways associated with defense responses to viral infections, but also protein targeting to iScience Article membranes, and mRNA catabolic processes. We also used topGO, a similar enrichment analysis of GO terms, that considers the hierarchical structure of ontologies to increase accuracy. For high viral load samples, topGO analysis shows enrichment of pathways associated with sensory perception signaling pathways followed by RNA and DNA processing (supplementary datas). Our analysis of medium and low viral responses using topGO revealed similar sensory perception and RNA silencing pathways.
Next, we investigated the chromosomal enrichment of genes in the false-negative (PCR negative-sequence positive) samples and observed two significant enrichments sites on chr1p21 and chr8q21, which were also observed in the medium dataset, but not the high dataset (supplementary datas). Gene set enrichment analysis with topGO again reveals enrichment in sensory perception pathways and RNA/ DNA regulatory mechanisms in PCR negative-sequence positive samples (supplementary datas).

Cell population mixtures were consistent across nasal swab samples
Differences in cell populations can influence differential gene expression profiles in biosamples (Bruning et al., 2016). Given the differential gene expression profiles and the results of our functional analyses, we used the RNA-Seq deconvolution tool MuSiC (Wang et al., 2019) and a single-cell airway dataset (Lukassen et al., 2020) to deconvolve our data to predict cell type proportions. We identified cilliated1, cilli-ated2, goblet, FoxN4, and basal3 cell types as the largest contributors ( Figure 6A). We analyzed the cell populations across all samples using principal component analysis ( Figure 6B), hierarchical clustering ( Figure 6C), and tSNE ( Figure S2). Neither method revealed patterns of cell proportions that correspond to viral load status. Finally, we used 2-way ANOVA with cell type and viral load as main factors and did not observe any interaction between cell type and viral load (interaction between PCR status and cell type = 0.06, no group show significant interaction, p = 1.0: post-hoc Tukey test: Figure 6D and S1). These data suggest differences in cell type populations do not explain the differential gene expression we observed.

DISCUSSION
Analysis of RNA sequencing data obtained from nasal-pharyngeal samples collected during SARS-CoV-2 testing in the New York Region revealed 42 out of 317 (13%) PCR-negative samples had detectable SARS-CoV-2 genomic material, suggesting they were false negatives (F-N) (Figure 1). RNA sequencing data from these potential F-N samples aligned to multiple SARS-CoV-2 genes across the SARS-CoV-2 genome (Figure 2B), suggesting this was not just a single region being detected (erroneously). Gene expression analysis of F-N samples shows a downregulation of gene expression response that was similar to the response of patients with low and medium viral loads (Figure 3), although the genes were different between groups (Figure 4). Gene Ontology analysis showed similar biological pathways are regulated by F-N samples and all SARS-CoV-2 positive patients ( Figure 5). Finally, the cellular content of the swabbed samples (as determined by deconvolution) was not different between false-positive samples and other SARS-CoV-2positive samples ( Figure 6). Together, these data support our observation that 13% of PCR-negative samples were false negatives.
Patients showed different host transcriptome responses depending on viral load (Figure 3). Overall responses changed from increased gene expression to decreased gene expression as the viral load reduced (Figure 3). The host response in the F-N samples was most similar to the Med and Low viral response groups (downregulation). However, there were few common regulated genes overlapping between all three groups ( Figure 4); out of over 1,000 genes regulated in Med and Low samples, only 22 were common to iScience Article Med and F-N samples and only 21 were common to Low and F-N samples. Despite few genes overlapping between samples of varying vial load, the biology of regulated genes appears similar among samples (Figure 5). Using both FGSEA and TopGO tools (supplementary datas), we see a concordance of biological pathways regulated by all viral conditions. Interestingly, we see the highest (2.4% of all terms from all samples) overlap with the high viral load group (High, 70), followed by the medium and low viral load group iScience Article ( Figure 5A). This may suggest a biphasic gene response to viral load, or that different genes converge to regulate the same biological pathways.
Different cell proportions may influence gene expression profiles from bulk tissues (Bruning et al., 2016). Deconvolution of RNA seq data using MuSiC did not identify any effects of viral status on the cell proportions of the samples (Figure 6). Attempts to predict the viral status of F-N samples using Random Forest and Neural Network models of top expressed genes were unsuccessful (best models were 5/42 called, data not shown, script provided). However, this may not be that surprising given the variability of gene expression in the samples using PCA and tSNE, even following filtering for the top 5, 10, and 100 most significant DEGs (based on LIMMA adj. p values) between groups does not cleanly resolve the viral status samples (High, Med, Low, F-N from Negative sequencing samples; Figure S2). Together, these data suggest that SARS-CoV-2 infection can be identified by host tissue responses to infection, and that gene expression patterns regulate common biological response pathways to viral infection, but these responses are not due to cell composition of the samples.
The rapid development of PCR-based testing for SARS-CoV-2 contributed to helping monitor and control the disease (Islam and Iqbal, 2020;Liu et al., 2020;Okamaoto et al., 2020). PCR is commonly referred to as the gold standard for SARS-CoV-2 testing (Brooks and Das, 2020;Drame et al., 2020). Any shortcomings of PCR testing are usually limited to failure to detect past infection (Yong et al., 2020), although concerns about the accuracy of early PCR tests were raised (Fang et al., 2020;Drame et al., 2020). This study suggests that patients who were tested for SARS-CoV-2 but received a negative PCR test may have been infected with SARS-CoV-2. While we do not know how many of the F-N patients were symptomatic, or asymptomatic, our data show that prior to filtering, 11% of PCR-negative samples showed evidence of SARS-CoV-2 infection using RNA sequencing ( Figure S1A). Earlier tests relied on a single SARS-CoV-2 gene for identification, whereas later tests used multiple genes. However, even multi-gene PCR tests are vulnerable to mutations resulting in suboptimal amplification of the target amplicon (such as the S-gene drop out observed in some tests (Volz et al., 2021)). Based on calculations of true positive rates taking sequencing to be the true value, we calculate the accuracy of PCR to be 90% (specificity 97% and sensitivity 79%) ( Figure S1C) (Trevethan, 2017); this suggests that PCR is better at detecting if a person does not have SARS-CoV-2 infection. This compares close to a recent review (Zitek, 2020) of nasal-pharyngeal RT-PCR tests (reporting a specificity and sensitivity of 98.8% and 78.2%, respectively based on data in ). Conversely, if we consider PCR to be the true standard, then sequencing in this study had a sensitivity of 97% and specificity of 87%. Therefore, sequencing could be deemed a better test for detecting infected people, but more costly and too slow for widespread use. It is not clear whether this analysis extends to other PCR assays but warrants further investigation. Collectively, these data suggest the number of reported COVID-19 cases is likely lower than the actual number of individuals who have been infected with the virus. iScience Article Limitations of the study Our analysis suggests that over 10% of patients with negative PCR results may have been infected with SARS-CoV-2. Clearly, the impact of this estimate on positivity rates would depend on whether the testing was performed as a clinical assay or for surveillance. Furthermore, it is not clear whether people tested were symptomatic, at what stage of the infection they were at, severity, or their outcome. All these confounding factors may affect the expression of genes and may account for the high variability in expression profiles we observe ( Figure 3) and the challenge of predicting COVID status based on gene expression alone (Figure S2). Even when we compare control (PCR negative) samples, we observe high variability in the data (see Figure S2). Nasal-pharyngeal transcriptomes are less studied than other tissues, such as blood, hence the natural variation of this tissue sample is less well understood. Regardless of absolute gene expression, it iScience Article was clear that biologically related genes were regulated in the disease (Figure 5), and sequencing enabled us to identify those signatures, as well as reads from across the SARS-CoV-2 genome. Taken together, this study shows over 10 % of PCR-negative samples were likely positive for SARS-CoV-2 suggesting estimate of prevalence of COVID-19 may be underestimated worldwide. More large-scale clinical data to validate these approaches are recommended in future work.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

INCLUSION AND DIVERSITY
One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in their field of research or within their geographical location. One or more of the authors of this paper self-identifies as a gender minority in their field of research. One or more of the authors of this paper self-identifies as a member of the LGBTQIA+ community. iScience 25, 105310, November 18, 2022 iScience Article iScience Article