SCovid: single-cell atlases for exposing molecular characteristics of COVID-19 across 10 human tissues

Abstract SCovid (http://bio-annotation.cn/scovid) aims to provide a comprehensive resource of single-cell data for exposing molecular characteristics of coronavirus disease 2019 (COVID-19) across 10 human tissues. COVID-19, an epidemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been found to be accompanied by multiple-organ failure since it was first reported in December 2019. To reveal tissue-specific molecular characteristics, COVID-19 has been studied widely, especially at single-cell resolution. However, these studies remain relatively independent and scattered, limiting a comprehensive understanding of the impact of the virus on diverse tissues. To this end, we developed a single-cell atlas of COVID-19. First, we collected 21 single-cell datasets of COVID-19 across 10 human tissues, paired with control datasets. Then we constructed a pipeline for analysing these datasets to reveal molecular characteristics of COVID-19 based on manually annotated cell types. The current version of SCovid documents 1 042 227 single cells from 21 single-cell datasets across 10 human tissues, 11 713 stably expressed genes and 3778 significant differentially expressed genes (DEGs). SCovid provides a user-friendly interface for browsing, searching, visualizing and downloading all detailed information.


INTRODUCTION
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been an ongoing global health threat since the outbreak began in late 2019 and had infected more than 190 million people worldwide as of 21 July 2021 (1). Research on isolating, sequencing and cloning the virus, the development of diagnostic kits and the testing of candidate vaccines is proceeding rapidly (2)(3)(4)(5)(6). However, key questions remain about the pathophysiology of COVID-19 (7).
With in-depth case studies of COVID-19, accumulating evidence indicates that COVID-19 can result not only in acute respiratory distress syndrome but also in multiorgan involvement. SARS-CoV-2 binds to angiotensin converting enzyme 2 (ACE2) receptors present in vascular endothelial cells, lungs, heart, brain, kidneys, intestine, liver and other tissues, which directly injures these organs (8). For example, emerging data from autopsy studies demonstrated that COVID-19 is accompanied by acute interstitial pneumonia (AIP), diffuse alveolar damage (DAD) and microvasculature involvement with pulmonary vessel hyaline thrombosis, haemorrhage, vessel wall oedema, intravascular neutrophil trapping and immune cell infiltration (9)(10)(11). In addition, gastrointestinal symptoms associated with COVID-19 vary widely but can include loss of appetite, nausea, vomiting, diarrhoea and generalized abdominal pain (12). ACE2 expression in cardiac tissue is also significantly elevated, which may facilitate myocarditis caused by viral infection (13-15). To reveal tissue-specific molecular characteristics, COVID-19 has been studied widely, especially at single-cell resolution. Using single-cell RNA sequencing of SARS-CoV-2-infected colon and ileum organoids, Triana et al. identified a subgroup of enterocytes as the prime target of SARS-CoV-2 and found a lack of positive correlation between infection susceptibility and ACE2 expression, which indicates that SARS-CoV-2 suppresses the immune response (16). Moreover, by analyzing the peripheral blood mononuclear cells (PBMCs) of COVID-19 patients, Arunachalam et al. revealed that various cell types exhibit unique pro- and anti-inflammatory responses (17). As the rapid spread of COVID-19 has prompted urgent research on the disease, numerous COVID-19-related databases have emerged.
GISAID (18), Nextstrain (19), GESS (20) and the European Nucleotide Archive (21) collected SARS-CoV-2 strains from patients all around the world and provided tools to analyse sequences. , LitCovid (23) and BioRxiv & MedRxiv summarized the literature on the latest progress in COVID-19 research. DrugBank (24), DockCoV2 (25) and COVID19 Drug Repository (26) predicted drugs with potential therapeutic effects and were well cross-linked to external databases, which could speed up the discovery of therapeutic drugs. Coronavirus3D (27), CoV3D (28) and RCSB PDB (29) annotated and visualized high-resolution structures of coronavirus proteins and their complexes. In addition, various single-cell databases such as CancerSEA (30), CellMarker (31) and TISCH (32) continue to emerge. However, none of these databases focuses on the molecular characteristics of COVID-19 patients. Therefore, we developed SCovid, a single-cell atlas for exposing molecular characteristics of COVID-19 across 10 human tissues. The database is freely available at http://bio-annotation.cn/scovid.
Considering the technical noise of the assay, we removed low-quality cells and lowly expressed genes from each COVID-19-related scRNA-seq dataset before further analysis, using the following criteria: (i) cells with fewer than 200 detected genes, as well as genes expressed in fewer than three cells, were removed; (ii) liver cells with more than 50% mitochondrial gene content, as well as cells from other tissues with more than 20% mitochondrial gene content, were removed. For each dataset, we used the R package 'Seurat' (v3.2.3) (35) for data integration, clustering, dimensionality reduction and visualization. For these analyses, the function 'SCTransform' was used to integrate and scale the data. Then, PCA was performed using variable feature genes, and the principal components (PCs) identified with the function 'ElbowPlot' were used to cluster the dataset. Next, each cluster was annotated on the basis of known cell type-specific marker genes among the DE genes of that cluster, which were obtained with the 'FindMarkers' function. Subsequently, we performed UMAP to reduce the dataset to two dimensions, and finally the cells were visualized on the website. We also analysed scRNA-seq expression, including DE genes and pathway enrichment. First, for each cell type, MAST (v1.16.0) (36) was used to calculate differentially expressed genes (DEGs) between cells from samples with COVID-19 and cells from controls. Then, up/down-regulated genes within the top 5% of |log2FC| and with P value < 0.05 were regarded as significant DEGs, which were visualized by volcano plot. Next, GO enrichment analysis of these up/down-regulated significant DEGs was performed for each cell type using the R package clusterProfiler (37).
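The filtering and DEG-selection rules above can be sketched in a few lines. The following Python snippet is only an illustration of the thresholds stated in the text, not the Seurat/MAST pipeline used to build SCovid. The function names, the tuple layout and the tissue label "liver" are hypothetical, and the sketch assumes that the top 5% is ranked by |log2FC| over all tested genes of one cell type, which the text does not specify.

```python
from math import ceil


def passes_cell_qc(n_genes_detected, pct_mito, tissue):
    """Return True if a cell survives the QC thresholds stated in the text.

    Cells with fewer than 200 detected genes are dropped; the mitochondrial
    cutoff is 50% for liver cells and 20% for all other tissues.
    """
    mito_cutoff = 50.0 if tissue == "liver" else 20.0
    return n_genes_detected >= 200 and pct_mito <= mito_cutoff


def keep_gene(n_cells_expressing):
    """Genes expressed in fewer than three cells are removed."""
    return n_cells_expressing >= 3


def select_significant_degs(results, top_fraction=0.05, p_cutoff=0.05):
    """Pick significant DEGs for one cell type.

    results: list of (gene, log2fc, p_value) tuples (hypothetical layout).
    Assumption: genes are ranked by |log2FC| across all tested genes, the
    top 5% are retained, and of those only genes with P < 0.05 are kept.
    """
    if not results:
        return []
    ranked = sorted(results, key=lambda r: abs(r[1]), reverse=True)
    # round() guards against floating-point noise before taking the ceiling
    n_top = ceil(round(top_fraction * len(ranked), 9))
    return [
        (gene, log2fc, p, "up" if log2fc > 0 else "down")
        for gene, log2fc, p in ranked[:n_top]
        if p < p_cutoff
    ]
```

Under these assumptions, a liver cell with 250 detected genes and 30% mitochondrial content is kept, whereas a lung cell with the same values is discarded.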
An overview of the SCovid database is shown in Figure 1. The current version of SCovid documents 1 042 227 single cells from 21 single-cell datasets across 10 human tissues (intestine, blood, pancreas, lung, brain, airway, heart, kidney, liver and lymph node), 11 713 stably expressed genes (217 495 associations) and 3778 significant DEGs (8898 associations). Each dataset in SCovid contains detailed information on data source, sample source, grouping information, single-cell number and cell types. Each DEG entry contains log2FC, P value and visual information. Figure 2 shows the number of genes in each dataset. Figure 3 shows the most frequently occurring significant DEGs in these 21 datasets, which might be potential cell type-specific markers.

USER INTERFACE
We provide a user-friendly web interface for visualizing the datasets in a few flexible steps, as shown in Figures 4 and 5. All datasets are organized by tissue type. Users can browse datasets by clicking the corresponding tissue icon or 'Tissue' hyperlink on the 'Home' page, or by clicking a specific tissue name in the navigation menu on the 'Browse' page (Figure 4A and B). After a dataset is selected, for example 'Delorey TM. (Liver)', all the detailed and visual information is retrieved, including 'Detailed description', 'UMAP', 'Cell proportion', 'DEGs in cell types' and 'Expression profile'.
• Detailed description. The 'Detailed description' section contains the dataset name, tissue type, accession number, number of cells, cell types, sample source and relevant publication information (Figure 4C). Additionally, the accession number and publication title are hyperlinks that users can follow.
• UMAP. A UMAP visualization of the selected dataset is displayed in the 'UMAP' section, with colored points representing different cell types (Figure 4D).
… violin plot and a UMAP projection plot for the specific gene (Figure 4I). The GO enrichment bar plots display GO classifications of up/down-regulated genes; hovering over any bar pops up detailed information including ontology aspect, term ID, term description, P value and gene symbols.
• Expression profile. The 'Expression profile' section provides a heatmap showing the expression profile of high-variance genes in different cell types (Figure 4F). The individual tiles in the heatmap are colored in proportion to gene expression values. Rows of the matrix correspond to genes and columns correspond to cells.
• Data search. On the 'Search' page, SCovid offers two sections: 'Search DEG in all tissues' and 'Search cell type' (Figure 5A). For a gene, users can enter its symbol to query the related DEG information in all tissues and cell types, and a table is returned as described above for the 'Browse' page (Figure 5B and C). By selecting a cell type, users can query its detailed DEGs and enriched GO terms in a tissue based on one dataset (Figure 5D and E).
• Data download. All data in SCovid, including the expression profiles of DEGs and the variation information of all stably expressed genes and DEGs, can be downloaded from the 'Download' page.

SUMMARY AND FUTURE PERSPECTIVES
Since the outbreak of COVID-19 in December 2019, databases covering literature collection, SARS-CoV-2 genome sequences, protein structures and drug prediction have appeared in succession, but none of them focuses on the molecular characteristics of COVID-19 patients. Given the high accuracy and cellular specificity of single-cell sequencing, we collected 21 single-cell datasets of COVID-19 across 10 human tissues, paired with control datasets, to reveal molecular characteristics of COVID-19 based on manually annotated cell types. We further developed a database system, SCovid, to provide a user-friendly interface for browsing, searching, visualizing and downloading stably expressed genes, significant DEGs and functional analysis of these significant DEGs based on cell types across tissues. The current version of SCovid documents 1 042 227 single cells from 21 single-cell datasets across 10 human tissues, 11 713 stably expressed genes and 3778 significant DEGs. Each dataset in SCovid contains detailed information on data source, sample source, grouping information, single-cell number and cell types. Each DEG entry contains log2FC, P value and visual information. SCovid is a powerful and high-quality database for molecular characteristics of COVID-19. Biologists can access the variation information of genes of interest in specific cell types of different tissues, as well as the enriched pathways of differential genes in those cell types. Bioinformaticians can use machine learning methods to predict tissue-specific driver genes and therapeutic drugs for COVID-19. Although single-cell data on COVID-19 are currently limited, research on COVID-19 is expected to grow substantially, since there is at present no effective way to completely stop the spread of the virus.
Meanwhile, the research focus has gradually shifted from virus strains to the molecular characteristics of COVID-19 patients, which means that genomics, epigenomics and proteomics data on COVID-19 will continue to emerge. Therefore, we will keep following the latest data and construct unified analysis pipelines so as to update our database continuously.