A network-based systems biology approach for identification of shared Gene signatures between male and female in COVID-19 datasets

The novel coronavirus (SARS-CoV-2) has expanded rapidly worldwide. Now it has covered more than 150 countries worldwide. It is referred to as COVID-19. SARS-CoV-2 mainly affects the respiratory systems of humans that can lead up to serious illness or even death in the presence of different comorbidities. However, most COVID-19 infected people show mild to moderate symptoms, and no medication is suggested. Still, drugs of other diseases have been used to treat COVID-19. Nevertheless, the absence of vaccines and proper drugs against the COVID-19 virus has increased the mortality rate. Albeit sex is a risk factor for COVID-19, none of the studies considered this risk factor for identifying biomarkers from the RNASeq count dataset. Men are more likely to undertake severe symptoms with different comorbidities and show greater mortality compared with women. From this standpoint, we aim to identify shared gene signatures between males and females from the human COVID-19 RNAseq count dataset of peripheral blood cells using a robust voom approach. We identified 1341 overlapping DEGs between male and female datasets. The gene ontology (GO) annotation and pathway enrichment analysis revealed that DEGs are involved in various BP categories such as nucleosome assembly, DNA conformation change, DNA packaging, and different KEGG pathways such as cell cycle, ECM-receptor interaction, progesterone-mediated oocyte maturation, etc. Ten hub-proteins (UBC, KIAA0101, APP, CDK1, SUMO2, SP1, FN1, CDK2, E2F1, and TP53) were unveiled using PPI network analysis. The top three miRNAs (mir-17–5p, mir-20a-5p, mir-93–5p) and TFs (PPARG, E2F1 and KLF5) were uncovered. In conclusion, the top ten significant drugs (roscovitine, curcumin, simvastatin, fulvestrant, troglitazone, alvocidib, L-alanine, tamoxifen, serine, and doxorubicin) were retrieved using drug repurposing analysis of overlapping DEGs, which might be therapeutic agents of COVID-19.


Introduction
Coronaviruses (CoVs) belong to the Coronaviridae family, one of eight families whose members infect humans and vertebrates. CoVs made up of single-stranded RNA. The upper respiratory tract is the main region of humans infected by the CoVs [1][2][3]. However, some other regions, such as the gastrointestinal, hepatic, and central nervous systems of humans, can also be infected by the CoVs. In 2002-2003 severe acute respiratory syndromes associated with coronavirus (SARS-CoV-1) was emerged in China and spread to the other four countries. SARS-CoV-1 infected around 8000 cases with a case-fatality ratio (CFR) of 11% [4]. The Middle East respiratory syndrome-associated coronavirus (MERS-CoV) is another type of CoVs emerged in 2012. The number of reported deaths was 858 out of 2494 infected cases by MERS-CoV with a higher CFR, 34% [5]. The novel coronavirus-2019 (COVID- 19), also known as SARS-CoV-2, has been declared as a pandemic by the world health organization (WHO) [6,7]. It has given global challenges and threats to the whole human with an enormous loss of lives worldwide [8]. It first appeared in Wuhan, China, in December 2019 [4]. There are several variants of SARS-CoV-2 such as B.1.1.7 (alpha), B.1.351 (beta), P.1 (gamma), B.1.427 (epsilon) and B.1.617.2 (delta) [9,10]. The alpha variant was the first outbreak in the United Kingdom (UK) in November 2020. The beta variant was first detected in South Africa in October 2020. Gamma variant, also known as Brazilian variant, was first detected in January 2021. The delta variant of SARS-CoV-2 was detected in late 2020 in India. Alpha variant and delta variant are both more transmissible than the original virus identified in China. Most people infected with the COVID-19 show mild to moderate respiratory illness (like cold, fever, and cough), and no special treatment is required [11][12][13]. Older adults and people with different comorbid diseases such as cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to experience serious respiratory illness and they may need hospitalization, intensive care, or a ventilator to support them breath, or people even die [14,15]. Therefore, candidate drugs and vaccines of COVID-19 are urgently needed. The traditional de novo drug discovery procedure is expensive and requires a long time. Drug repurposing is another way to explore the candidate drugs amid the existing drugs using the bioinformatics and integrative systems biology approach, which could shorten the time and expense compared to the traditional procedure. However, the identification of biomarkers is so important for further downstream analysis like drug discovery. Also, it is very challenging because there is a large number of a gene relative to the small number of samples. In general, biomarkers are correlated and sometimes show the same activities in the complex regulatory networks and pathways. So it is necessary to understand the underlying mechanism and functions of biomarkers [16][17][18].
Demographic variables such as age and sex are the main risk factors of COVID-19 disease. Men are more likely to undergo severe symptoms with different comorbidities and show more significant mortality than women [19,20]. One of the main reasons is that men are more likely to participate in smoking and drinking alcohol [21]. Other reasons are chromosomal factors (sex-specific hormones and steroids) and gender-specific factors (behaviors and social activities). Recently, Blanco-Melo et al. revealed transcriptional signatures and pathways of SARS-CoV-2 by identifying differentially expressed genes (DEGs) from the RNA sequencing (RNA-Seq) data [22]. Previous studies also examined DEGs and molecular gene ontology and pathway analysis using lung epithelial cells [8]. The premeditated HIV, Ebola, and malaria drugs have been tested to prevent COVID-19 [23,24]. However, the absence of vaccines and proper drugs against the COVID-19 has increased the mortality rate worldwide. Therefore, common biomarkers between males and females may play an important role in discovering drugs against the COVID-19. No specific studies were performed to identify biomarkers from gene expression levels by considering the sex differences using a robust approach. Hence, in this paper, we aim to identify shared gene signatures between males and females from the human RNAseq dataset in blood. To conduct gene ontology (GO) and pathway enrichment analysis, the overlapped differentially expressed genes (DEGs) or biomarkers between males and females were employed to conduct gene ontology (GO) and pathway enrichment analysis. These analyses revealed that the mutual DEGs are involved in various BP categories such as nucleosome assembly, DNA conformation change, DNA packaging, and different KEGG pathways such as cell cycle, ECM-receptor interaction, progesterone-mediated oocyte maturation, etc. Finally, the ten hub genes (UBC, KIAA0101, APP, CDK1, SUMO2, SP1, FN1, CDK2, E2F1, and TP53) were revealed from protein-protein interaction (PPI) network analysis and underwent in the online databases to explore the candidate drugs of COVID-19.

Data acquisition and identification of differentially expressed genes from peripheral blood cells of SARS-CoV-2
The RNA-Seq dataset of COVID-19 was retrieved from Gene Expression Omnibus (GEO) [25] with the accession number GSE152418 under the platform GPL24676 [26]. This dataset comprises 34 samples. Among them, 17 samples came from peripheral blood cells with SARS-CoV-2 infected patients, and 17 samples came from peripheral blood cells of healthy control people. There is an outlier sample in peripheral blood cells with SARS-CoV-2; therefore, we have discarded this sample from this dataset [27]. Among the 16 samples of SARS-CoV-2 infected patients, 7 came from males, and 9 came from female patients. For distinct identification of DEGs between males and females, firstly, we divided the whole dataset into two independent datasets. One dataset consists of 7 male SARS-CoV-2 infected patients, and the other datasets consist of 9 female SARS-CoV-2 infected patients, where the number of healthy control people was the same in both datasets (17). However, information on variants of SARS-CoV-2 infected patients is unavailable in the original publication that provided this dataset [26]. For robust identification of DEGs from both male and female datasets, we employed a robust voom approach [28]. The DEGs were identified using adjusted p-value <0.05 and absolute log2 fold change (FC) ≥1 [29]. The overlapped DEGs between male and female dataset was used for further downstream analyses.

Gene ontology and pathway enrichment analysis
To decode identified DEGs' biological functions and pathways, we used Database for Annotation, Visualization and Integrated Discovery (DAVID) and Metascape [30,31]. Different gene ontology (GO) categories such as biological process (BP), cellular component (CC), molecular functions (MF), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were mined using this database. Adjusted p-value <0.05 of the hypergeometric test was used to declare significant categories. We also considered a minimum number of 2 genes in each category as the cut-off.

Protein-protein interaction analysis and hub-protein identification
Protein-protein interaction (PPI) has been carried out using the online tool NetworkAnalyst [32]. STRING was used to construct the PPI. The hub-proteins were extracted with a high confidence level (900) of PPI and based on degrees>30. The hub-protein were then visualized using the GeneMANIA web-server [33].

DEGs-miRNA network analysis to identify the potential micro RNAs
The DEGs-miRNA interaction analysis was performed using the miRTarBase database and visualize via NetworkAnalyst [32]. This database collected fifty thousand miRNA-target interactions. To determine the significant core-miRNAs, we considered the degree of interaction >30.

DEGs-transcription factor regulatory network analysis
The DEGs-TFs interaction analysis was conducted using the JASPAR database and visualize via NetworkAnalyst. JASPAR is a collection of large curated and non-redundant DNA binding TFs respiratory [34]. We retrieved hub-TFs with a cutoff value of degree>40.

Hub proteins specific drug repositioning
The Drug Signatures Database (DSigDB) [35] and DrugMatrix [36] were used via Enrichr [37] for the identification of potential drug candidates using hub proteins. A collection of 22,527 gene sets comprises 17,389 distinct compounds covering 19,531 genes contain in DSigDB.

Sensitivity and specificity analysis of the hub proteins
Sensitivity and specificity analysis has been performed by different classification methods using the R package MLSeq and caret. This Table 1 Gene Ontology (GO) enrichment analysis using 1341 overlapping DEGs between male and female dataset. GO   analysis was accomplished to verify the sample classification performance of the identified hub proteins based on two independent datasets.

Results
Identification of differentially expressed genes from peripheral blood cells of SARS-CoV-2.
We applied a robust voom approach to identify the DEGs from the male and female SARS-CoV-2 dataset. A total of 2067 and 1900 DEGs were identified from the male and female datasets, respectively. There are 1341 overlapping mutual DEGs identified between males and females, depicted in a Venn diagram of Fig. 1A. These overlapping DEGs were then used for further downstream analyses. The overlapping DEGs between male and female has been shown in a circus at the gene level (Fig. 1B). The DEGs between SAR-CoV-2 infected male and female patients versus healthy control have been shown in volcano plots of Fig. 1C and D, respectively. In these figures, the green and red colors represent the down-regulated and up-regulated DEGs.

Functional annotation and pathway enrichment analysis
Genes like to interact with each other, and they do not work alone. Sometimes most of the genes show similar biological functions and pathways. To interpret the biological mechanism of 1341 overlapping DEGs, we performed GO and KEGG pathway enrichment analysis. GO analysis revealed that BPs are mainly enriched in nucleosome assembly,   Table .1. From KEGG pathway enrichment analysis, we uncovered various pathways such as cell cycle, ECM-receptor interaction, progesterone-mediated oocyte maturation, alcoholism, and so on that are statistically significantly using a hypergeometric test with adjusted p-value <0.05. The top ten KEGG pathways have been summarized in Table .2 and plotted against enrichment ratio using a dot diagram in Fig. 2A. In addition, from Metascape, we explored that overlapping DEGs were significantly enriched in the immune system process, response to stimulus, metabolic process, developmental process (P < 0.05, Fig. 3A and Fig. 3B).

Determination of hub-proteins using protein-protein interaction analysis
Ten hub-proteins (UBC, KIAA0101, APP, CDK1, SUMO2, SP1, FN1, CDK2, E2F1 and TP53) were discovered using protein-protein interaction (PPI) analysis. The PPI network has been shown in Fig. 4A. The hub-proteins with higher degrees of interaction determined using the topological analysis of PPI have also been displayed in Fig. 4B. The logarithmic values of RNASeq count expression of hub-proteins were shown in a heatmap plot in Fig. 2B.

Identification of potential transcription factors
Transcription factors (TFs) are the proteins that regulate the transcription of genes from DNA to mRNA by binding DNA sequences. Therefore, the gene-TFs regulatory network has been carried out in Fig. S2 to revealed key TFs. The core TFs were found from this analysis are PPARG, E2F1, KLF5, FOXC1, and GATA2.

Identification of candidate drugs based on hub proteins
The top ten significant candidate drug agents identified by Enrichr are roscovitine, curcumin, simvastatin, fulvestrant, troglitazone, alvocidib, L-alanine, tamoxifen, serine, and 2-Butanone. They were summarized in Table .3. The other significant drug agents were also retrieved from this database such as water, tamoxifen, doxorubicin, roscovitine, ns-398, vinblastine, aflodac, resveratrol, rapamycin, mechlorethamine. The top 20 drug candidates have been presented in   Table S1.

Sensitivity and specificity analysis of the hub proteins
To investigate the sample discriminative performance of the identified ten hub-genes we computed various performance measures such as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy (ACC) by the six classifiers (SVM, PLDA, PLDA2, NBDLDA, voomDLDA, voomNSC). To execute this task, we randomly divided this dataset (GSE152418) into a training dataset and a test dataset. The training dataset consists of 9 control and 8 COVID-19 samples. The rest of the samples belong to the test dataset. After that, ten hub-genes were selected from both datasets to construct the reduced training and test datasets. We performed 5-fold crossvalidation to train the six classifiers, and the above performance measures were recorded. The average values of these performance indices were presented in Table .4. The accuracy values in this table indicate that the discrete distribution-based methods such as NBDLDA, PLDA2, voomDLDA, and voomNSC performed well than SVM. The boxplot of the accuracies (Fig. 5A) also demonstrates the same results as Table .4. The proposed ten hub-genes were ranked according to their importance using SVM (Fig. 5B). We also showed count gene expression values of ten hub-genes between control, male and female sample in Fig. 6. This figure depicts that three hub-genes (CDK2, E2F1, and TP53) were downregulated and seven hub-genes (UBC, KIAA0101, APP, CDK1, SUMO2, SP1, FN1) were up-regulated.

Discussion
Though sex-specific biomarker identification is crucial for developing drugs or therapies from the RNASeq count gene expression level of COVID-19 disease, none of the studies considered it yet. Men are more likely to become a serious condition in the presence of different comorbidities than women, and the mortality rate of COVID-19 is larger in men than women. Therefore, biomarkers may also be different between males and females of COVID-19. From this point of view, in this study, we aimed to identify the common DEGs between males and females of COVID-19. We identified 1341 overlapping DEGs between males and females. These DEGs were then undergone for GO and KEGG pathway analysis using DAVID and Metascape to explore the biological mechanisms of DEGs. From GO analysis we revealed that the top three categories BP (nucleosome assembly, DNA conformation change, DNA packaging), CC (DNA packaging complex, nucleosome, chromosomal region), and MF (protein heterodimerization activity, heparin-binding, sulfur compound binding) were significantly enriched. Using meta-Scape, we discovered that DEGs are involved in different cancer pathways, immune system, response to stimulus, metabolism process, and so on. DAVID also divulged some important KEGG pathways such as cell cycle, ECM-receptor interaction, progesterone-mediated cyte maturation etc. using the overlapping DEGs. Furthermore, we conducted PPI to identify hub-proteins, DEGs-miRNA to identify potential miRNAs, gene-TFs to determine core TFs. Ten hub-proteins (UBC, KIAA0101, APP, CDK1, SUMO2, SP1, FN1, CDK2, E2F1, and TP53) were identified using PPI topological network analysis. UBC is also known as Ubiquitin C is a protein-coding gene. Phlyctenulosis and cystic fibrosis diseases were found to associate with UBC [38]. Disease-associated with KIAA0101 (protein-coding gene) is thyroid carcinoma and heart conduction [39]. The GO annotation related to amyloid-beta precursor protein (APP) is protein binding and enzyme binding. Different neuro-diseases such as app-related and Alzheimer Alzheimer's disease are associated with this gene [40]. CDK1 (Cyclin-dependent kinase 1) is the protein-coding gene. The diseases related to genes involve retinoblastoma, breast cancer, and glioblastoma multiforme. Pathways associated with these genes include ACC = accuracy, LACC = lower limit of ACC, UACC = upper limit of ACC, PPV = positive predictive value, NPV = negative predictive value. the ATM pathway and cell cycle [41]. SUMO2 is a protein-coding gene, and it is related to Gordon holmes syndrome [42]. The GO annotation related to SP1 includes DNA-binding transcription factor activity, and Huntington's disease was associated with this gene [43]. Heparin-binding and protease binding are found from GO annotation of the FN1 gene [44]. CDK2 (Cyclin-dependent kinase 2) is also a protein-coding gene. Diseases associated with CDK2 include breast Cancer and glioblastoma multiforme [45]. Linked diseases of E2F transcription factor 1 (E2F1) are retinoblastoma and glioblastoma multiforme. Pathways of this gene include regulation of activated PAK-2p34 by proteasome-mediated degradation and E2F transcription factor network [46]. The GO annotation of the TP53 gene includes DNA-binding transcription factor activity and protein heterodimerization activity [47]. The top three miRNAs (mir-17-5p, mir-20a-5p, mir-93-5p) and TFs (PPARG, E2F1 and KLF5) were uncovered. Finally, using drug repurposing analysis top ten significant drugs (roscovitine, curcumin, simvastatin, fulvestrant, troglitazone, alvocidib, L-alanine, tamoxifen, serine, and doxorubicin) were retrieved for therapeutic targets of COVID-19. Most of the drugs are FDA-approved and have been used to treat different types of cancer diseases.

Conclusions
The novel coronavirus (SARS-CoV-2) has expanded rapidly in today's world. SARS-CoV-2 mainly affects the respiratory systems of humans that can lead up to severe illness or even death with comorbidities. Still, drugs of other diseases have been used to treat the COVID-19. Nevertheless, in the absence of vaccines and proper drugs against COVID-19 has increased the mortality rate. Furthermore, COVID-19 infected men are more likely to experience severe illness than women. Hence, sex might be a major risk factor of COVID-19, and biomarkers related to the sex might be useful for discovering drugs against the COVID-19. Therefore, this paper attempts to identify the biomarkers by considering the sex effects using a robust voom approach. A total of 1341 overlapping DEGs were identified between males and females datasets. Using these DEGs' PPI analysis, we explored ten hub-proteins (UBC, KIAA0101, APP, CDK1, SUMO2, SP1, FN1, CDK2 E2F1, and TP53) that are involved in some important and interesting pathways of cancer-related disease. In sum, using these hub-proteins ten significant candidate drugs (roscovitine, curcumin, simvastatin, fulvestrant, troglitazone, alvocidib, L-alanine, tamoxifen, serine, and doxorubicin) were retrieved that might be therapeutic targets of COVID-19 disease. Our future work will be covered with a network-based gene coexpression analysis based on COVID-19 datasets.

Author contributions
MSJ conceived and designed the study; MSJ analyzed data; MSJ wrote the draft manuscript; MRR and MRA reviewed and edited the manuscript.