Analyzing single cell transcriptome data from severe COVID-19 patients

Summary We describe the protocol for identifying COVID-19 severity specific cell types and their regulatory marker genes using single-cell transcriptomics data. We construct COVID-19 comorbid disease-associated gene list using multiple databases and literature resources. Next, we identify specific cell type where comorbid genes are upregulated. We further characterize the identified cell type using gene enrichment analysis. We detect upregulation of marker gene restricted to severe COVID-19 cell type and validate our findings using in silico, in vivo, and in vitro cellular models. For complete details on the use and execution of this protocol, please refer to Nassir et al. (2021b).

Download the datasets mentioned in 'Deposited data' section of key resources table.
2. Sample preparation for qPCR from clinical samples.
a. Collect samples from nasopharyngeal swabs of COVID-19 patients and control group in 3 mL sterile viral transport medium (VTM) tube, following standard sample collection protocol. To maximize the viability of samples, transport all the samples at 2 C-8 C within 72 h post collection and store at À80 C until use.
CRITICAL: Institutional permission and approval to handle SARS-CoV-2 should be taken and any material that has been in contact with the virus should be handled and discarded off in accordance with local regulations.
b. Keep the following ready. i. RNA samples from SARS-CoV-2 infected patients and control group.
ii. cDNA preparation from samples.

MATERIALS AND EQUIPMENT
Bioinformatics analysis All the bioinformatics analyses were carried out either on the institute (MBRU) server (Linux OS, AMD EPYC 7402 24-Core Processor, 132 GB RAM) or PC (iMac, 4.3 GHz Intel Core i7 processor, 16 GB RAM). The minimum configuration needed for carrying out this analysis is Linux OS, 16 GB RAM and around 30 GB of available storage space.

OPEN ACCESS
Analyze the h5ad output file generated from the above step using Scanpy library (in Python). Perform principal component analysis (PCAs), select the highly variable PCAs for cluster detection (PhenoGraph) and calculate the cluster connectivity using the partition-based graph abstraction (PAGA) algorithm. The python codes are described in the GitHub repository at Github data: https://github.com/theislab/scanpy. The control dataset 2 can be downloaded as h5ad object and similar steps for filtering, normalization, PCA calculation and clustering can be done using Scanpy. Use UMAP (uniform manifold approximation and projection) to visualize the clusters in reduced dimensional space, and then compare the cluster topography with the UMAP from the original studies. Identify the number of PCAs using elbow plot for data clustering and decide on the highly variable features to be used for downstream analysis. b. COVID single cell dataset processing using Seurat: Once you download the data from GSE145926, process it using Seurat in R, Github data: https://github.com/zhangzlab/ covid_balf/ seurat_integration.R). The script 'seurat_integration.R' carries out the following functions. i. Load data and create Seurat object.
2. Prepare signature gene marker dataset and ascertain cell type identity of clusters. This section describes steps in marker dataset preparation and assigning cell type identity to clusters. a. Dataset preparation.
Marker dataset compilation will depend on the type of single cell data you are processing. Here, we composed an in-house database consisting of canonical markers for cell types associated to human lung region using a combination of literature search (Deprez et al., 2020;Grant et al., 2021;Madissoon et al., 2019;Reyfman et al., 2019), PanglaoDB (Franzen et al., 2019), and CellMarker (Zhang et al., 2019) databases. Deprez et al. provided the single cell map for human healthy airway, constituted by 77,969 cells from 35 different locations, which were clustered into seventeen broad cell types. We extracted the unique markers from the heat map representing the marker gene expression across various cell types. Furthermore, we collected the marker genes from lung single cell map (Madissoon et al., 2019) consisting of 57,020 cells, classified into 25 different cell types. Similarly, we enlisted the lung specific markers from other largescale datasets, which characterized single cell heterogeneity of lung tissues. The steps for compiling marker genes are shown in Figure 1. In order to download the marker gene files from PanglaoDB and CellMarker database, first install curl on your PC through terminal window executing the following codes.
Next, open either R-studio or R-terminal and execute the following code which will: i. Download the data from PanglaoDB and CellMarker database (You can either download the file or use the file we have deposited in Zenodo). ii. Select human, lung data which have sensitivity above 0.5 and keep the genes associated with each cell type in a new list. iii. Identify only unique list of genes for each cell type and combine all the cell types (collected from literature as well as databases - Figure 1) to generate the final database. In our study, we compiled 966 unique marker genes representing 38 cell types (Nassir et al., 2021b). cmdb<-read.csv("Human_cell_markers.txt", header=T, sep="\t") cmdb_filtered<-filter(cmdb,cancerType=="Normal") cmdb_filtered<-filter(cmdb_filtered,tissueType=="Lung") cmdb1<-split(as.character(cmdb_filtered$geneSymbol),cmdb_filtered$cellName) <-c("CD3D", "CD3E", "CD3G", "CD4", "CD8A","CD4", "CD68", "CD8A") lis<-names (pdb1) lis2<-names (cmdb1) #making list of all cell types  CRITICAL: It is critical to exhaustively search the publications for marker database creation as the quality of this database will be reflected directly on cell type assignment. Only going through the databases will hardly take few minutes to hours. It is the literature search which is time consuming, due to the exponential rate at which single cell analysis is being performed on different human tissues.
b. Cell type assignment Subsequently, use the marker dataset to find the cluster identity ( Figure 2) employing these steps: i. Extract and plot (boxplot) the normalized expression for all the genes belonging to a specific cell type across all the clusters for the given dataset. Alternatively, you can also calculate the median expression for all the clusters per cell type in the database. The following code can be used for calculating the median. ii. Compare the median expression across the cell types per cluster and to the 99th percentile expression value from the datasets.
iii. Assign the cell type identity based on the highest normalized median expression value for each cluster.
3. Prepare comorbid gene list and identify subtype with upregulated comorbid gene expression.
This section describes steps in compiling the list of COVID-19 comorbid gene lists and identifying the cluster with maximum number of enriched comorbid gene set.
We collated a list of COVID-19 associated genes to analyze and compare their expression among control, moderate and severe COVID-19 clusters. Since COVID-19 disease pathology is associated with enormous release of pro-inflammatory cytokines, we included the list of cytokines and cytokine receptor genes, in addition to the rare infection and syndromic genes accounting for the genetic predisposition, lung channelopathy genes, differentially expressed genes from COVID-19's BALF transcriptome (Zhou et al., 2020) and genes associated with key comorbid conditions such as chronic obstructive pulmonary disorder, cardiovascular disease, hypertension and diabetes. The list of cytokine and cytokine-receptor genes was retrieved from the Immunology Database and Analysis Portal (Bhattacharya et al., 2018). The monogenic infection gene list was collected from (Casanova, 2015). Syndromic genes (category S) were collected from the Simons Foundation Autism Research Initiative (https://gene.sfari.org/). Primary comorbid disease genes associated with COVID-19 were extracted from GWAS catalog (https://www.ebi.ac.uk/gwas/), where the selected genes had a probability < 10 -7 . The normalized expression of these gene sets was plotted for all the clusters The script below will yield cluster specific boxplots for comorbid gene expression.
Mark clusters for which the normalized median expression of the comorbid condition associated gene sets was higher than 99th percentile expression value from the dataset. Compare the boxplots and choose the cluster in which most of the comorbid conditions is upregulated as your subtype of interest, as demonstrated in Figure 2 and Figure S5 of (Nassir et al., 2021b).

Downstream analysis.
This section describes identifying genes restricted to cell type of interest.
Using step 3, we identified a severe cluster which has maximum enrichment of the comorbid genes, severe cluster 11, which was assigned the identity as monocyte-derived alveolar macrophages (MoAM), indicated by the presence of CCL3L1 -MoAM_CCL3L1. One of the most important steps next is to identify the marker genes that are restrictively expressed in this cluster. Please follow the steps below. a. Use FindAllMarkers to compute the DEGs for each cluster using different tests -Wilcoxonranked sum test, t-test.
b. Select the genes which are differentially expressed (highest probability, combining all the three tests). c. Rank the genes using their Log of fold change in descending order. d. Use top 20 genes and map them using feature plot and dot plot. Analyze the expression pattern and select the genes which are upregulated in severe cluster and have negligible expression in all other clusters in the dataset. Using this approach, we identified the most differentially regulated gene that demarcated cluster 11 in samples from severe COVID-19 patients, FCGR3B (Figure 3).

Pathway enrichment.
This section describes pathway enrichment of differentially expressed genes and visualization of pathways using Cytoscape tool.
We used the DEGs from severe cluster 11 for enrichment analysis and mapping. For enrichment analysis, please follow the script at Github data: https://github.com/MBRULab/2021_Nassir-etal/blob/ main/geneoverlap.R. This script takes the list of genes from severe cluster 11 as input and scans the KEGG (Kyoto Encyclopedia of Genes and Genomes) and GO (Gene Ontology) databases to identify %R #R version will be printed on your screen >setwd("$/PATH_TO_WORKING_DIRECTORY_WHERE_db.R_IS_SAVED") > comorbidexpression.R >q() >FindAllMarkers(seu_object, test.use= ''wilkox'') ll OPEN ACCESS the overlapping pathways. The input file required (KEGG and GO file and gene list file) are deposited in Zenodo. Here, the intersection of your set of genes over KEGG and GO pathway genes is computed. This script generates a csv file with the following information: pathway id, pathway name, pvalue and Odds ratio. This file is used as the input to Cytoscape where two main plugins are used: 'Enrichment map' and 'Autoannotate'. Cytoscape is an open-source network visualization software which can be downloaded from https://cytoscape.org/download.html. Further, 'App Manager' present in the 'APPS' tab within the Cytoscape can be used to add these plugins (Figure 4). Then, use the Enrichment map and input the above-mentioned files using 'Generic/gprofiler' type of analysis. Set the FDR cut off (<0.01) OR p value cut off (<0.001) (after clicking the advanced icon). After building the map, use the auto annotate plugin and refine the network map.
CRITICAL: This script restricts the pathway to size 50 to 1,000 genes (which means only pathways having more than 50 and less than 1,000 genes are included -to account for housekeeping genes), that can be modified on line 17 of the script. CRITICAL: We consider a pathway intersection > 5 as positive match. It means that pathway will be considered only if more than 5 genes are included in it. This can be relaxed (by reducing the number) on line 26 of the script.

In silico validation.
This section describes validation of findings using other single cell and bulk RNA-seq datasets.
Once the cell type and specific marker gene has been identified, you will need to validate your findings, for which you need to decide and download required validation data sets. We used single cell BALF, bulk PBMC and bulk nasopharyngeal datasets for validation. This step involved comparing the expression pattern of FCGR3B in control and patient data and then calculating the significance using t-test p-values. The validation script deposited at Github data: https://github.com/MBRULab/ 2021_Nassir-etal (validation_BALF_SC.R, validation_nasph.R, validation_pbmc.R) can be directly used for this purpose. . If using R-studio, copy and paste the code from GitHub in new R-Script. Select the whole script and press the 'RUN' icon and observe your outputs in the console window. Else, you can save this code as 'validation.R', open R console on terminal by writing R, change and set directory and execute validation.R as shown below. Quit the screen after running the code successfully. life-support treatment (e.g., mechanical ventilation)) using RNeasy 96 QIAcube HT Kit (-QIAGEN USA) as per the manufacturer's instructions.
CRITICAL: Pay attention to avoid RNase contamination.
d. Transfer 2 mL of Universal Transfer Medium (UTM) to a clean, sterile tube and centrifuge at 1,800 g for 3 min. Carefully remove 1,650 mL of supernatant without disturbing the pelleted squamous and respiratory epithelial cells. e. Resuspend the pellet with remaining 350 mL of Universal Transfer Medium (UTM) and transfer to S Block (RNeasy 96 QIAcube HT Kit, QIAGEN USA). f. Determine quality and quantity of the extracted RNA using NanoDropä 8000 Spectrophotometer (Thermo Fisher Scientific, USA). g. Reverse transcribe RNA to cDNA using a High-Capacity cDNA Kit (Applied Biosystems, Foster City, CA) according to the manufacturer's protocol. Prepare master mix for reverse transcriptase reaction in a microcentrifuge tube. RT Master Mix for cDNA Generation h. Mix the reagents by briefly pipetting or gently vortexing. Label tubes and pipette 10 mL of master mix into PCR strip tubes. Add 10 mL RNA to respective PCR strip tube and gently mix by pipetting 5-10 times. i. Flash spin PCR strip tubes in a mini centrifuge at 2,000 g for approximately 10 s. Place tubes into thermocycler and run the following program: cDNA generation PCR steps j. Thaw TaqMan Fast Advanced Master Mix on ice. Thaw primer/probe set on ice, protected from light. Label one microcentrifuge tube for each primer/probe set and mix the master mix and primer/probes.

Reagent
Volume per reaction (mL) Note: Use technical duplicates for each sample and primer/probe set.
k. Prepare the above mixture and mix by pipetting slowly up and down. Dispense 9 mL of master mix reaction into a 96-well plate. Add 1 mL cDNA to the 96-well plate containing the reaction mixture. Cover the loaded 96-well plate with an adhesive cover and centrifuge the 96-well plate in a mini plate spinner at 500 g for approximately 20 s. Select the reagent type as TaqMan reagents. Choose the ramp speed and plate setup. Define the specific probe and samples, fluorescence reporter and the quencher for the experiment. Assign the probes and samples to selected wells.
CRITICAL: Include a no-template control. Prior to initializing the run, assign target genes with the appropriate fluorophore (GAPDH-VIC, FFAR2-FAM).
l. Set up the following program: PCR cycling condition m. Data Analysis -Review the QC summary (select amplification plot to show amplification curves in all wells. Specific PCR products should display a single sharp peak in the melting curve rather than multiple peaks or single broader peaks (non-specific PCR product)). n. Calculate relative levels of mRNA gene expression using the 2 ÀDDCT method and plot fold change. Relative gene expression = 2 ÀDDCT , where DCT is the difference in CT value of target gene with respect to housekeeping gene, DDCT is the difference in CT value of target gene of patients with respect to control. 8. Gene expression in spike protein stimulated obese subjects. This section describes culturing lung epithelial cells, monocyte isolation and culturing followed by real time PCR analysis.
a. Lung epithelial cell culture. i. Purchase Normal human primary bronchial epithelial (NHBE) cells from non-obese and obese subjects from a commercial source (key resources table). ii. Culture NHBE cells in BEGM media (Lonza, MD, USA) supplemented with 1% antibiotic antimycotic solution (Wisent, QC, CA) in 12 well tissue culture plates coated with Type 1 Rat tail collagen (Sigma-Aldrich, Ontario, Canada). iii. Grow cells to 90% confluency and starve using BEBM Basal Medium (Lonza, MD, USA) supplemented with 1% antibiotic antimycotic solution (Wisent, QC, CA) over night. iv. The next day, stimulate cells with 1 mg/mL of SARS-CoV-2 spike protein (S1+S2) for 3 h. v. Collect the culture media in microtubes, centrifuge at 5,000 g for 5 min. Aliquot supernatants into fresh microtubes and freeze at À80 C. b. Monocyte isolation and culture. i

EXPECTED OUTCOMES
Using this protocol, it is possible to process scRNA-seq data to identify cell types enriched for any gene set and determine markers associated with disease/pathway. The protocol will guide you to perform single cell clustering and data analysis, pathway analysis, real time PCR and cell culture.

LIMITATIONS
Potential confounders that can affect the outcome of this kind of study are lack of sample replicates and the batch effect arising from different samples. Batch effect is technical source of variation which affects the data quality and the subsequent downstream analysis (Tran et al., 2020). Here, our analysis was conducted independently using batch corrected data for each single cell (SC) transcriptome dataset without combining them. Hence, our analysis did not have any mix of batch affect. However, while using data from different sources and integrating them, batch effect should be corrected. The inclusion of sample replicates is subjected to availability of patient samples and cost incurred in single cell RNA sequencing. The small sample size of this cohort is a limitation (31 severe COVID-19 patients and 11 healthy controls). Although we did qPCR validation for FCGR3B using nasopharyngeal swabs (due to availability), the ideal validation sample would be flow cytometry sorted macrophages from BALF region to match the observed upregulated FCGR3B expression in MoAM of BALF region using single-cell RNA-seq data in silico.

TROUBLESHOOTING Problem 1
Unable to reproduce the UMAP which matches the original work.

Potential solution
In the preprocessing steps, the single cell transcriptome data will be available in different formats (Seurat, h5ad, loom and so on). It is important to use the same steps and the single cell analysis tool as used by original work and then converting the final output to the desired format for further processing. The overall topology of the UMAP will be preserved for comparison purpose.

Problem 2
Assigning of the cell type identity will be affected if the marker dataset is limited and consists only of a few canonical markers. As discussed in Figure S2 (Nassir et al., 2021b), limited markers will lead to vague cell type identification, for example -many clusters might represent same cell types, viz. macrophage representing many clusters.

Potential solution
It is always better to refer multiple resources (Human cell atlas, databases like CellMarker and PanglaoDB and latest peer reviewed research articles). A comprehensive marker database will aid in unbiased cell type mapping. Each gene should be present only once defining a specific cell type. In case of conflicting genes/cell types, that gene/cell type should be removed to maintain a unique cell type marker database.

Problem 3
Low RNA yield from nasopharyngeal swabs.

Potential solution
Some of the swab Universal Transfer Medium (UTM) may not contain enough number of cells, so, if the RNA yield is less, this step can be repeated until we get minimum number of cells accumulated to extract required amount of RNA. Steps involved in the compilation of unique marker gene database using various sources Marker gene information is present in literature in the form of heat map (Gene expression plotted against cell types using color gradients for expression levels), dot plot (circles denoting the gene expression across various cell types), tables or databases. In case of dot plot, the size and color intensity are proportional to the level of expression in the percent of cells and degree of expression, respectively (in ascending order). In order to constitute the final marker database, the cell types and gene list were combined from all the sources retaining only unique genes into the list for each cell type (Removing any duplicate genes within inter and intra cell types).

Problem 4
Non-standard gene names will lead to improper downstream analysis.

Potential solution
It's better to retrieve gene names from NCBI/Gene Cards. Care should be taken especially while retrieving from literature. For example, the same gene can be mentioned as ILR2 and IL2RA. Such conflicts should be carefully checked and resolved by comparing with NCBI nomenclature.  Alternatives: 1. Different clustering tools, cell type databases, parameters or statistical tests can be used for analyzing single-cell RNA-seq data. 2. The severe, moderate and control data can be either clustered separately or merged depending on the objective of the study. 3. Automated cell type identification tools can be used instead of the cell type annotation described in the protocol.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, [Dr. Mohammed Uddin, Email: mohammed.uddin@mbru.ac.ae].

Materials availability
This study did not generate new unique reagents.

Data and code availability
Complete datasets used for this analysis are deposited in Zenodo data: https://doi.org/10.5281/ zenodo.5809990.
Analyses were conducted in R and Python; the accession number for all codes reported in this paper have been deposited as Github data: https://github.com/MBRULab/2021_Nassir-etal.
Any additional information required to reanalyze the data reported in this paper is available through personnel contact with the corresponding author upon request.