Alterations in the host transcriptome in vitro and in vivo following severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection


 Background. Exploring alterations in the host transcriptome following SARS-CoV-2 infection is not only highly warranted to help us understand molecular mechanisms of the disease, but also provide new prospective for screening effective antiviral drugs, finding new therapeutic targets, and evaluating the risk of systemic inflammatory response syndrome (SIRS) early.Methods. We downloaded three gene expression matrix files from the Gene Expression Omnibus (GEO) database, and extracted the gene expression data of the SARS-CoV-2 infection and non-infection in human samples and different cell line samples, and then performed gene set enrichment analysis (GSEA), respectively. Thereafter, we integrated the results of GSEA and obtained co-enriched gene sets and co-core genes in three various microarray data. Finally, we also constructed a protein-protein interaction (PPI) network and molecular modules for co-core genes and performed Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis for the genes from modules to clarify their possible biological processes and underlying signaling pathway. Results. A total of 11 co-enriched gene sets were identiﬁed from the three various microarray data. Among them, 10 gene sets were activated, and involved in immune response and inflammatory reaction. 1 gene set was suppressed, and participated in cell cycle. The analysis of molecular modules showed that 2 modules might play a vital role in the pathogenic process of SARS-CoV-2 infection. The KEGG enrichment analysis showed that genes from module one enriched in signaling pathways related to inflammation, but genes from module two enriched in signaling of cell cycle and DNA replication. Particularly, necroptosis signaling, a newly identified type of programmed cell death that differed from apoptosis, was also determined in our findings. Additionally, for patients with SARS-CoV-2 infection, genes from module one showed a relatively high-level expression while genes from module two showed low-level. Conclusions. We identified two molecular modules were used to assess severity and predict the prognosis of the patients with SARS-CoV-2 infection. In addition, these results provide a unique opportunity to explore more molecular pathways as new potential targets on therapy in COVID 19.


Introduction
The epidemic of Coronavirus Disease 2019 (COVID-19) caused by SARS-CoV-2 is one of the public health emergencies all over the world and severely affects human health. As of June 16, 2020, more than 8,300,000 human cases were con rmed, with 446,000 deaths and about 5.3% estimated mortality rate.
The rapidly increasing of patients, especially the critical or lethal patients, brought a big challenge to the public health. Generally, the infection with SARS-CoV-2 is spread from person to person via respiratory droplets, however, other special transmission vectors, such as aerosol, fecal-oral route, and intermediate fomites from both symptomatic and asymptomatic patients during the incubation period, are also be our focuses 1. In terms of the characteristics of symptoms, it has been demonstrated that majority of patients with SARS-CoV-2 infection have u-like symptoms as their initial symptom. But non-speci c gastrointestinal symptoms such as dyspnea and diarrhea are also observed in several patients 2. In terms of the characteristics of patient population, male, aged over 65 and smoking patients are generally at a greater risk of disease progression, which result in the critical or lethal condition and present with severe pneumonia, acute respiratory distress syndrome (ARDS), multiple organ failure, and even death 3. Based on published data from the Chinese Center for Disease Control and Prevention, more than 80% patients were classi ed as mild, and 20% cases were classi ed as severe or critical illness. Alarmingly, the mortality was up to nearly 50% in patients with critical illness 4.
According to current evidences, the pathogenesis of lethal SARS-CoV-2 infection is associated with dysregulated in ammatory responses, tissue injury, and multiple organ dysfunction (MODS), but the details remain obscure. Therefore, at a gene level, exploring how disease exactly changes from mild to severe is not only highly warranted to help us understand disease progress, but also provide new prospective for screening effective antiviral drugs, nding new therapeutic targets, and evaluating the risk of systemic in ammatory response syndrome (SIRS) early. For this purpose, we designed this study of integrated bioinformatics analysis based on GEO (https://www.ncbi.nlm.nih.gov/geo/) database 5,6 using gene expression pro le dada of lung tissues from healthy individuals andpatients with SARS-CoV-2 infection, with a view to discovering valuable mRNA biomarkers potential pathogenic mechanisms, and nally in order to guiding clinical diagnosis and treatment.

Methods
We downloaded GSE148815 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc = GSE148815), GSE150316 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi), and GSE147507 7 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc = GSE147507) datasets from GEO database. In vivo level, we extracted the gene expression data of the adults in GSE148815 as normal healthy samples, and the data of patients with SARS-CoV-2 infection in GSE150316 as infected samples. Then, we integrated two sets of Raw-Counts data to compose the patient-microarray data used in the study. In vitro level, we extracted the gene expression data of normal human bronchial epithelial (NHBE) cell line, and A549 (lung adenocarcinoma) cell line to form two cell line-microarray data (NHBE-data and A549-data).
In order to reduce the effect of errors caused by gene chip on the pattern of results and our conclusions, data preprocessing was performed according to the following procedures using R software (R 3.6.1, https://cran.r-project.org/): standardized microarray data utilizing DESeq2 package 8. log2 transformed. handled missing or unknown values. checked and adjusted batch effects. corrected value of log fold change (logFC) to reduce the error caused by gene with a constant expression like zero. The detailed methods of data processing were described in the supplementary Text S1. For the microarray data was downloaded from a public database, the patient consent or ethics committee approval was not necessary.

Differential analysis
Differential analysis was performed in the three microarray data using DESeq2 package to identify the differentially expressed genes (DEGs) between the samples with SARS-CoV-2 infection and noninfection. And a logFC and a statistical p value were obtained. When a gene's logFC>0, we de ned it as high-expression in the SARS-CoV-2 infected group; when logFC<0 as low-expression. The result was visualized as a heat map with hierarchical clustering using pheatmap package (https://CRAN.Rproject.org/package = pheatmap).
GSEA and core genes identi cation GSEA was performed with the results of DEGs using phenotype labels "SARS-CoV-2 infected" versus "non-infected" by the clusterPro ler package 9 to elucidate relevant biological signi cance. Firstly, we converted the gene symbols to Entrez IDs, and then comparatively analyzed with the data of hallmarks gene set (Version 7.1) 10 downloaded from the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/) 11. And the results with a p value<0.05 as a bubble chart. After that, we obtained co-enriched hallmarks gene sets and their co-core genes from the GSEA results of the three microarray datasets.

PPI network and molecular modules construction
The PPI network of co-core genes was constructed in the STRING database (https://string-db.org/) through the following setting: meaning of network edges was set as "con dence", active interaction sources were all, minimum required interaction score was selected as "highest con dence (0.4500)" and nally display simpli cation was to hide disconnected nodes in the network 12,13. The PPI-data was downloaded and then identi ed key molecular modules by using MCODE plug-in Cytoscape software (version 3.7.1, https://cytoscape.org/), respectively. In MCODE, lters were based on the parameters as "Network Scoring ticked Include Loops and Degree Cutoff = 2," "Cluster Finding ticked Haircut, Node Score Cutoff = 0.2, K-Core = 2, and Max. Depth = 100" 14.
KEGG pathway enrichment analysis KEGG pathway enrichment analysis was conducted with genes in each molecular module to identify the biological functions of modules using clusterPro ler 9 package. Adjusted p value<0.05 was considered to be statistically significant, and the results was visualized by GOplot package. In addition, we used patient-data to evaluate the expression levels of genes from the modules in patients with SARS-CoV-2 infection compared to normal healthy individuals, and the results were shown as box plot.

Results
The data processing work ow was illustrated in Figure 1. The results corresponding to data processing were explained in Text S1, which re ected that the gene expression used in our study was uniformly distributed.

Differential analysis
The heatmaps ( A549-data ( Figure 3C). We merged the three GSEA results and obtained 11 hallmark gene sets coenriched, including 10 activated and 1 suppressed ( Figure 3D), and the details were listed in Table 1.
Then, we extracted all core genes in the co-enriched gene sets from the three datasets (499 core enriched genes in patient-data, 293 in NHBE-data, and 438 in A549-data), and obtained 56 co-core genes ( Figure  3D) that were described in Table S1.
From these ndings, we found that in samples with SARS-CoV-2 infection, in ammation-related hallmark gene sets like IL6-Jak-Stat3-signaling, IL2-Stat5-signaling, complement, coagulation, angiogenesis, Kras-signaling-up, and TGF--signaling were activated while cell cycle-related hallmarks gene set like E2F-targets was suppressed. Additionally, we also found in ammation-related genes like IL6, ICAM1, IL1B, and TNF were involved in the co-enriched gene sets. These ndings indicated that, to some extent, in ammatory genes contribute greatly to the biological process of disease progression and the interaction of molecular function in samples with SARS-CoV-2 infection.

PPI network and molecular models
The PPI network for 56 co-core genes was built through STRING database and the result contained 54 nodes and 234 edges ( Figure 4A). Through MCODE, we identi ed two key molecular modules, and their gene make-ups were shown in Figure 4B-C ( Figure 4B as module one, Figure 4C as module two). To our surprise, nearly a third of the co-core genes were members of module one suggesting that module one plays a crucial role in the PPI network, which deserved special attention.
KEGG pathway enrichment analysis 50 of various KEGG pathways were enriched by the genes from the module one (Table S2), including multiple speci c infection-related signaling pathways caused by malaria, legionellosis, measles, tuberculosis, in uenza A, and hepatitis B. In addition, it also included multiple in ammation-related signaling pathways like TNF signaling pathway, IL-17 signaling pathway, PI3K-Akt signaling pathway, Jak-Stat signaling pathway ( Figure 5A). For genes from module two, four pathways ( Figure 5B) related to cell cycle and DNA replication were identi ed. Finally, through the results of gene expression level ( Figure  5C), we found that, for patients with SARS-CoV-2 infection, genes from module one showed a relatively high-level expression while genes from module two showed low-level. Remarkably, our data source was from patients who died due to SARS-CoV-2 infection. These results signi ed that high expression of genes in module one and low expression of genes in module two might predict a worse prognosis in patients with in SARS-CoV-2 infection.

Discussion
As obligate intracellular parasites, viruses must rely on host cell for replication, and its infection also cause changes in the morphology and function of host cells. SARS-CoV-2 is a monopartite, singlestranded, and positive-sense RNA virus with a genome size of 29,903 nucleotides, making it the secondlargest known RNA genome 15. Severe COVID-19 was characterized by respiratory failure caused by hyper-in ammation, which not only bring great di culties to clinical treatment, but also increase mortality for disease 16. It is now well established that a cytokine storm syndrome (CSS) involved in molecules like interleukin 6 (IL6), interleukin 8 (IL8), E-cadherin, MCP-1, and VEGF, probably may be the reason for hyper-in ammation in severe cases 2,17.
Investigating the differences in genetic level during the disease occurrence and development not only help us understand disease more clearly, but also guide for developing countermeasures. In the present study, candidate datasets were consisted of two parts, in vivo level (patient-data) and in vitro level (NHBEdata and A549 data), and they were annotated by the same platform of GPL18573 [Illumina NextSeq 500 (Homo sapiens)], which eliminated the adverse effect of the bias of gene coverage and algorithm usage from different sequencing platforms on the results and conclusion. After completing the differential analysis and GSEA, we found that in samples with SARS-CoV-2 infection, in ammation-related hallmark gene sets like IL6-Jak-Stat3-signaling, IL2-Stat5-signaling, complement, coagulation, angiogenesis, Krassignaling-up, and TGF--signaling were activated while cell cycle-related hallmark gene set like E2Ftargets was suppressed both in vivo and vitro level. Subsequently, in ammation-related genes were found in the core genes of those co-enriched gene sets, such as IL6, ICAM1, IL1B, CSF1, TLR2, VEGFA, TNF, and so on. We constructed PPI network for the co-core genes, and analyzed molecular modules through MCODE, and identi ed two key molecular modules. Module one consisted of 17 genes: BCL2A1, BCL2L1,  SOD2, PTGS2, TNF, PLAUR, MYC, LIF, VEGFA, LOX, IL6, CSF1, C3, TLR2, ICAM1, IL1B, and TNFAIP3, which were mainly involved in in ammatory response pathway and complement system cascade, such as such as TNF signaling, IL-17 signaling, PI3K-Akt signaling, Jak-Stat signaling, NOD-like receptor signaling, Toll-like-receptor signaling, and complement signaling. Module two consisted of 9 genes: EZH2, PTTG1, DLGAP5, RFC3, DUT, RPA2, CCNB2, PCNA, and MCM4, were mainly involved in cell cycle and DNA replication signaling. Finally, by assessing the level of gene expression in the patient-data, we found genes from module one showed a relatively high-level expression while genes from modules two showed low-level. All of these ndings may provide novel evidences selecting effective mRNA biomarkers to evaluate disease progression and predict prognosis for patients with SARS-CoV-2 infection. Also, these ndings improve our understanding of additional targets for anti-SARS-CoV-2 agents.
Cascade reactions of in ammation and immunity are the two key aspects in the pathogenic process of virus infecting the host, which often requires multiple cells and cellular components to participate, and lead to changes in the level of gene expression. It is no exception for SARS-CoV-2. GSEA, based upon functional class scoring methods, is capable of solving biological problem by focusing on gene sets rather than individual genes 11. In our study, we did use this method for analyzing genetic changes related to in ammation and immune response after SARS-CoV-2 infection, which preserved gene-gene correlations, and provided an improved understanding of the biological functional enrichment in the groups of high-level and low-level expressed genes 18. For hallmarks gene sets, we found two gene sets involved in two signi cant molecules in the interleukin super family, gene set of IL6-Jak-Stat3-signaling (genes were up-regulated by IL6 via Stat3 during acute phase response) and IL2-Stat5-signaling (genes up-regulated by Stat5 in response to IL2 stimulation), were activated in samples of SARS-CoV-2 infection. Several studies have been reported that IL6, as one of the genes involved in cytokine release syndrome (CRS), was used to assess severity of COVID 19 like respiratory failure, ARDS, and adverse clinical outcomes 19-24. Similar to these conclusions, we identi ed IL6 not only was at a high-level expression in data of cell lines and cases died due to SARS-CoV-2 infection, but also interacted the most with other genes in the PPI network, which con rmed again its value of as an active cytokine with a wide range of biological functions and as an effective biomarker for the prognosis of COVID 19. IL2 is a secreted cytokine that was important for the proliferation of T and B lymphocytes25,26. Here, no core role of IL2 was observed even though we found activated IL2-Stat5-signaling in the infected samples, but IL2related genes with the example of CSF1 and LIF, were among the co-core genes with a high-level expression, which revealed that the action of IL2 might be as a 'homeostatic' cytokine, and mainly involved in the immune response but not in CRS. Moreover, the gene sets of TNFA-signaling (genes regulated by NF-κB in response to TNF) and Kras-signaling-up (genes up-regulated by Kras activation) were also activated in patients with COVID 19, which further uncover the roles of genes and cytokines such as NF-κB, TNF, and Kras in the pathogenesis and progression of SARS-CoV-2 infection. These factors together with IL6 might lead to the possibility of severe and/or fatal status for COVID 19. Notably, gene set of epithelial-mesenchymal-transition (EMT) (genes de ning EMT) was represented in activated state in SARS-CoV-2 infected samples, which implied the potential of SARS-CoV-2 to lead to pulmonary interstitial brosis (PIF) as the disease progress. PIF was also an important feature of COVID 19 cases with poor prognosis, as previously reported 27,28.Apart from this, gene set of complement (genes encoding components of the complement system, which is part of the innate immune system) and gene set of coagulation (genes up-regulated during formation of blood vessels) were also found as activated status in our result, which indicated that the disease evolution of SARS-CoV-2 infection was a multiple pathway, complicated process, comprising of various dynamic changes in the genome.
E2F transcription factors play critical roles in the control of transcription, cell cycle and apoptosis29-31.
For samples with SARS-CoV-2 infection, we found that the gene set of E2F targets was suppressed obviously, which indicated that cell cycle disorders might occur in COVID 19.
From a clinical perspective, this might be related to hypoxia caused by SARS-CoV-2 infection in human bodies 32, because hypoxia could directly affect cell differentiation and energy metabolism 33. From gene level, we identi ed that genes like PCNA, CCNB2, PTTG1, as were at a low-level expression in the samples of SARS-CoV-2 infection. According to literature report, PCNA and PTGG1 involved in regulating the biological function of p21 that was a CDK inhibitor to trigger cell-cycle arrest in the G1 and G2 phases 34,35. CCNB2 involved in the formation of CCNB2/CDK1 complex that controlled separase activity through inhibition of phosphorylation and regulated the biological process of G2 phase of cell cycle 36.
Incorporating previous studies, we proposed that in the process of SARS-CoV-2 infecting the host, genetic changes caused cell cycle disorders, which might manifest as hyper-active S-phase and hypoactive G2 phase facilitating viral DNA replication.
Other co-core genes, such as ICAM1, a leukocyte adhesion molecule, mainly encoded a cell surface glycoprotein typically expressed on endothelial cells and immune cells. As reported, ICAM1 plays an important role in enhancing CD16+ monocyte adhesion to the endothelium 37, and regulating IL6/Akt/Stat3/NF-κB-dependent pathway 38. For roles in viral infection, ICAM1 had been determined to involve in DC-mediated HIV-1 transmission to CD4(+) T cells 39 and regulate interferon-gamma and IL17 in hepatitis B virus infection 40. Our results showed that ICAM1 participated in the signaling pathways of NF-κB, IL17, and TNF, which was in line with similar previous studies. TNFAIP3, induced by the TNF, plays a role in inhibiting the activation of NF-κB as well as TNF-mediated apoptosis. Literature data reported TNFAIP3 was closely associated with the replication of viruses like in uenza A 41, and hepatitis B virus 42. In our study, we identi ed TNFAIP3, TNF, and IL1B were involved in the signaling pathway of necroptosis.
Similar to apoptosis, necroptosis was executed via distinctive signaling mechanism comprising a cascade of speci ed proteins, resulting in regulated necrotic cell death 43. Physiologically, necroptosis induced an innate immune response as well as premature assembly of viral particles in cells infected with virus that abrogates host apoptotic machinery, which was advantageous for the host. On the other hand, necroptosis was also deleterious because it can cause various diseases such as sepsis, neurodegenerative diseases and ischemic reperfusion injury 43. Qin et al. observed that necroptosis of the pulmonary epithelium was associated with severe H7N9 infection leading to ARDS, and the nal conclusion indicated that necroptosis inhibition might be a novel therapy for H7N9 in uenza virus 44. However, the bene ts of necroptosis to the host may sometimes be outweighed by the potentially deleterious hyperin ammatory consequences of activating this death modality in pulmonary and other tissues45,46. All this evidence remined us that necroptosis signaling was a double-edged sword in the defense to microbial infection, or even might be the likely culprit of the critical and/or lethal SARS-CoV-2 infected cases. This also provided a novel clue for necroptosis signaling in the treatment of COVID 19 and served it as a potential therapeutic target. BCL2A1, one of the pro-in ammatory cytokines, encoded a member of the BCL-2 protein family. In our study, we found BCL2A1 participated in the signaling of NF-κB and apoptosis, which were similar to previous reports in the literature 47,48. Nevertheless, more details of how BCL2A1 affected the pathogenesis of SARS-CoV-2 infection still needed to be explored.
For signaling pathway, besides mentioned above, signaling pathway of AGE-RAGE, MAPK, cytokinecytokine receptor interaction, EGFR tyrosine kinase inhibitor resistance, viral protein interaction, were enriched in our results. From these ndings, we believed that in the course of SARS-CoV-2 infection, the dysregulation was gene-speci c rather than pathway-speci c. Hence, we tried to construct molecular modules not only for accurate prediction but also for the evaluation of the effects of genes on COVID 19 patients' prognosis. Our data indicated that in severe and/or fatal SARS-CoV-2 infection cases, immune responses tended to be gentle while in ammatory cascades were hyper-activated, which might be attributable to aberrant gene expression. However, in terms of how genes found in our results participated in the pathogenesis of SARS-CoV-2, it is still an unresolved problem, and further needed to verify in vivo data.

Conclusions
In conclusion, with integrated bioinformatics analysis for microarray datasets, we identi ed two molecular modules were used to assess severity and predict the prognosis of the patients with SARS-CoV-2 infection. In addition, these results provide a unique opportunity to explore more molecular pathways as new potential targets on therapy in COVID 19.   The GSEA results (A-C) and distributions of co-enriched gene sets and co-core genes in the three microarray data. (A) is the result from patient-data, (B) is from NHBE-data, (C) is from A549-data. Upper half of (D) is Venn plot for distribution of co-enriched gene sets, and lower half of (D) is Venn plot for distribution of co-core genes. Patients, patient-data. NHBE, normal human bronchial epithelial cell line.

Figure 4
The cluster heat map of the three microarray data. (A) is for patient-data, (B) is for NHBE-data, (C) is for A549-data. Red represents gene expression with a high-level, green represents gene expression with a lowlevel, and black represents gene expression with an intermediate-level.