Transcriptomic analysis of differential gene expression in Staphylococcus aureus-induced pneumonia in pediatrics based on microarray analysis


Staphylococcus aureus is a leading cause of about 80% of infections in humans and up to 60–70% of hospital-acquired infections. Pneumonia is a broad term for a range of disorders that result in infection of the lung parenchyma and are caused by a number of organisms. It is a common disease that affects people from all walks of life. Identifying the etiologic agent of pneumonia is critical for providing appropriate therapy and maintaining epidemiological records. Transcriptional profiling of patients infected with S. aureus is pivotal to the analysis of differentially expressed genes in their blood. This study analyzed the gene expression dataset GSE30119, available from the Gene Expression Omnibus (GEO), testing the hypothesis that patient clinical heterogeneity will be reflected in transcriptional profile heterogeneity. The study comprised 143 pediatric subjects: 44 healthy individuals, 81 pneumonia-free patients, and 18 patients with pneumonia. We identified a total of 54 genes related to S. aureus infection and 612 genes associated with pneumonia. According to Gene Ontology (GO) functional and pathway enrichment analyses, the S. aureus infection-associated genes are predominantly involved in the innate immune response, calcium-mediated signaling, neutrophil extracellular trap formation, and formation of fibrin clot (clotting cascade), whereas the genes associated with pneumonia are enriched in the adaptive immune response, inflammatory response, interferon alpha/beta signaling, TCR signaling, and Gα(i) signalling events. This study reports differentially expressed genes and their biological activities in relation to S. aureus infection and pneumonia, and it may shed more light on the underlying molecular mechanisms and potentially important gene signatures in pneumonia development.

Despite being a brief illness, pneumonia is associated with chronic conditions that have protracted consequences (Quinton et al., 2018).
Pneumonia has a wide range of risk factors, from poor nutrition to microbial infections (bacteria, viruses, and fungi). Causative microorganisms include Streptococcus pneumoniae, Haemophilus influenzae, coronavirus, rhinovirus, human metapneumovirus, human bocavirus, parainfluenza, respiratory syncytial virus, Mycoplasma pneumoniae, Staphylococcus aureus, Moraxella catarrhalis, and Chlamydophila pneumoniae. The advent of molecular diagnostic techniques such as polymerase chain reaction (PCR), microarrays, and next-generation sequencing (NGS) has helped in detecting the pathogens associated with pneumonia (Bhuiyan et al., 2018). NGS, however, has proven superior to traditional diagnostic techniques. In severe pneumonia, NGS may lead to a quick and effective diagnosis with a better clinical outcome than traditional detection approaches. One study demonstrated that NGS may swiftly provide etiological evidence for patients with severe pneumonia, guide clinical care, and ultimately reduce mortality (Li et al., 2021).
Because Staphylococcus aureus is a common cause of hospital-acquired infection and pneumonia is recognized as the second most common hospital-associated infection, we sought to study biomarkers that may be linked to pneumonia in Staphylococcus-infected patients. With the emergence of antibiotic-resistant strains of Staphylococcus aureus, e.g. methicillin-resistant Staphylococcus aureus (MRSA), the number of infections arising from it has also increased. A longitudinal study of roughly 10 million pneumonia cases requiring hospitalization found that Staphylococcus aureus pneumonia was identified as the primary diagnosis in just 1.08 percent of the cases (Jacobs and Shaver, 2017). Staphylococcus aureus has long been recognised to play a major role in the development of pneumonia, and its importance as a pneumonia pathogen was recently demonstrated in an observational study in intensive care units across Europe (Paling et al., 2020). Furthermore, SARS-CoV-2 patient morbidity and death have recently been linked to Staphylococcus aureus pneumonia (Lai et al., 2020). The fact that S. aureus is multidrug-resistant adds to the problem's complexity. The nares and extranasal locations, such as the epidermis, perineum, and throat, have been shown to be colonised by S. aureus, particularly MRSA (Gagnaire et al., 2017). The absence of nasal colonization has been linked to a reduced risk of future MRSA infection (Kapali et al., 2021). When the nares are colonized, S. aureus has the opportunity to hide from the host's defenses, which can lead to infection if those defenses are breached (Ajayi, 2018). This study was based on the hypothesis that "transcriptional profile heterogeneity will reflect patient clinical heterogeneity" and also sought to identify gene signatures that may serve as biomarkers of Staphylococcus infection in humans.
It aims to investigate the biomarker panel of pneumonia caused by Staphylococcus aureus.

The original submitter-supplied dataset GSE30119 was obtained from GEO (http://www.ncbi.nlm.nih.gov/geo/) and is based on the GPL6947 Illumina HumanHT-12 V3.0 expression beadchip platform. The data, submitted by Banchereau et al. (2012), come from the study "Genome-wide analysis of whole blood transcriptional response to community-acquired Staphylococcus aureus infection in vivo". Total RNA was extracted from whole blood (lysed in Tempus tubes) drawn from pediatric patients with acute community-acquired Staphylococcus aureus infection and was used for gene expression microarrays. A total of 143 samples are included in this dataset, comprising 44 healthy individuals, 81 pneumonia-free patients, and 18 patients with pneumonia.

Differential gene expression analysis
Data preprocessing was performed using GEO2R (https://www.ncbi.nlm.nih.gov/geo/geo2r), which was applied to screen differentially expressed genes (DEGs) between the following groups: Staphylococcus infection (SI) vs. healthy (H), pneumonia-free (PF) vs. healthy, and pneumonia infection (PI) vs. healthy. GEO2R is a web-based tool that allows users to compare two or more groups of samples in a GEO Series to find genes that are differentially expressed under different experimental settings. The results are supplied as a table of genes ordered by significance, together with a collection of plots that aid in visualizing differentially expressed genes and evaluating data set quality. GEO2R compares the original submitter-supplied processed data tables using the GEOquery and limma R packages from the Bioconductor project. Bioconductor is an open-source software project, based on the R programming language, that provides tools for analyzing high-throughput genomic data. The GEOquery package parses GEO data into R data structures that other R tools can use.
Log transformation was applied to the data. An adjusted P < 0.01 and |log2 fold change (FC)| > 1 (i.e., at least a twofold change in either direction) were used as the thresholds for each comparison group.
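The thresholding step can be illustrated with a minimal Python sketch. It is not the GEO2R implementation; the `GeneResult` record, the `select_degs` helper, and the toy gene names and values are hypothetical and serve only to show how the adjusted P-value and |log2 FC| cutoffs stated above split a results table into upregulated and downregulated genes.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    """One row of a differential expression table (hypothetical layout)."""
    symbol: str
    log2_fc: float
    adj_p: float


def select_degs(rows, adj_p_cutoff=0.01, log2_fc_cutoff=1.0):
    """Return (upregulated, downregulated) gene symbols passing both cutoffs."""
    significant = [r for r in rows if r.adj_p < adj_p_cutoff]
    up = [r.symbol for r in significant if r.log2_fc > log2_fc_cutoff]
    down = [r.symbol for r in significant if r.log2_fc < -log2_fc_cutoff]
    return up, down


# Toy values, not taken from GSE30119.
rows = [
    GeneResult("GENE_A", 2.3, 0.0004),   # significant, more than twofold up
    GeneResult("GENE_B", -1.6, 0.002),   # significant, more than twofold down
    GeneResult("GENE_C", 0.4, 0.0001),   # significant but small change
    GeneResult("GENE_D", 1.8, 0.03),     # large change but not significant
]
up, down = select_degs(rows)
```

Only GENE_A and GENE_B survive in this toy table, because a gene must pass both the significance cutoff and the fold-change cutoff.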

Venn Diagram Analysis of DEGs
A Venn diagram of the DEGs from the comparison groups was constructed using Venny (http://bioinfogp.cnb.csic.es/tools/venny/index.html), and the similarities and differences among the three comparison groups were examined. DEGs shared by all three comparison groups were taken as genes associated with S. aureus infection. The other DEGs, observed in pneumonia vs. healthy but not in pneumonia-free vs. healthy, were identified as pneumonia-associated genes.
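The Venn-diagram logic reduces to two set operations, sketched below in Python. The function name `classify_degs` and the toy gene labels are hypothetical. The sketch also assumes that membership in the SI vs. healthy list plays no role in defining pneumonia-associated genes, which the text does not state explicitly.

```python
def classify_degs(si_vs_h, pf_vs_h, pi_vs_h):
    """Split DEG sets into infection-associated and pneumonia-associated genes."""
    # Genes differentially expressed in all three comparisons.
    infection_genes = si_vs_h & pf_vs_h & pi_vs_h
    # Genes seen in pneumonia vs. healthy but absent from pneumonia-free vs. healthy.
    pneumonia_genes = pi_vs_h - pf_vs_h
    return infection_genes, pneumonia_genes


# Toy gene labels, not taken from the study.
si = {"G1", "G2", "G3", "G4"}
pf = {"G1", "G2", "G5"}
pi = {"G1", "G2", "G4", "G6"}
infection_genes, pneumonia_genes = classify_degs(si, pf, pi)
```

By construction the two resulting sets cannot overlap, since every infection-associated gene is in the pneumonia-free list and every pneumonia-associated gene is not.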

Functional, pathway enrichment Analysis
To perform enrichment analysis of the DEGs, the Metascape tool for annotation, visualization, and integrated discovery (http://metascape.org) was used. All genes in the genome were used as the enrichment background. Terms with a P-value below 0.01, a minimum count of 3, and an enrichment factor above 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) were collected and grouped into clusters based on their membership similarities. P-values were calculated from the cumulative hypergeometric distribution (Karp et al., 2021), and, to account for multiple testing, q-values were calculated using the Benjamini-Hochberg procedure (Menyhart et al., 2021). For hierarchical clustering of the enriched terms, Kappa scores (Gu and Huebschmann, 2021) were used as the similarity metric, and sub-trees with a similarity of > 0.3 were considered a cluster. Protein-protein interaction enrichment analysis was performed on each gene list using the STRING database (Szklarczyk et al., 2016). Only physical interactions in STRING (physical score > 0.132) were used. The resulting network comprises the proteins that have at least one physical interaction with another member of the list. The Molecular Complex Detection (MCODE) algorithm was used to identify densely connected network components in networks containing between 3 and 500 proteins.
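The term-level statistics described above can be sketched in plain Python. This is an illustration of the general technique and not Metascape's code. The function names, the background size of 20,000 genes, and the other toy counts are assumptions made for the example.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): probability of k or more term genes in a list of n genes
    drawn from a background of N genes, K of which belong to the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enrichment_factor(k, n, K, N):
    """Observed count divided by the count expected by chance."""
    return k / (n * K / N)


def keep_term(k, n, K, N):
    """Apply the three filters: P < 0.01, count >= 3, enrichment factor > 1.5."""
    return (hypergeom_tail(k, n, K, N) < 0.01
            and k >= 3
            and enrichment_factor(k, n, K, N) > 1.5)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


# Toy numbers: 5 of 54 list genes fall in a 200-gene term, background 20,000.
p_value = hypergeom_tail(5, 54, 200, 20000)
factor = enrichment_factor(5, 54, 200, 20000)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In the toy case about 0.54 term genes would be expected by chance, so observing 5 gives an enrichment factor of roughly 9.3 and the term passes all three filters.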

Protein-protein Interaction analysis
Protein products of the differentially expressed genes were obtained from the STRING database (http://string-db.org/) and used to construct a network of protein-protein interaction profile. This database is one of the Cytoscape software 3.

Identification of DEGs
The gene expression dataset GSE30119 was downloaded from the GEO database. DEGs between the disease and healthy samples were determined using the GEO2R tool. As presented in Fig. 1, a total of 821 DEGs were identified across all the comparison groups using the threshold of P < 0.05 and |log2 FC| > 1, including 488 upregulated genes and 333 downregulated genes. The top 10 up- and downregulated genes for each comparison group are listed in Table 1.

Pathway and process enrichment analysis
For each given gene list, pathway and process enrichment analysis was carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, and Reactome Gene Sets. All genes in the genome were used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) were collected and grouped into clusters based on their membership similarities. The 54 genes associated with Staphylococcus infection identified by differential gene expression analysis are enriched in several pathways and processes, as presented in Table 2. The table shows only the top 16 clusters with their representative enriched terms (one per cluster), while Table 3 shows the top 20 clusters associated with pneumonia infection. "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10. The enrichment category for staphylococcus-infection related genes (

Table 3 Top 20 clusters with their representative enriched terms (one per cluster).
"Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.

Protein-protein Interaction Analysis
Of the 54 identified genes related to Staphylococcus infection, C-X-C motif chemokine ligand 8 (CXCL8) has the most significant interaction with the other proteins (Table S1); its expression, however, was downregulated. Cathelicidin antimicrobial peptide (CAMP) and lipocalin 2 (LCN2) show the highest expression values and, intriguingly, also interact strongly with each other, as shown in Fig. 4. About 613 genes are related to pneumonia infection in this study (Table S3), with carcinoembryonic antigen related cell adhesion molecule 8 (CEACAM8) and Fc fragment of IgE receptor Ia (FCER1A) emerging as the most upregulated and downregulated, respectively (Table 1c). However, FCER1A is not shown in the network in Fig. 5 because there are no known functional or physical protein associations for it. In the network graph, nodes represent proteins and edges indicate both functional and physical protein associations among the nodes. The sources from which the interactions were obtained include text mining of the literature, empirical studies, databases, co-expression, neighborhood, gene fusion, and co-occurrence studies. The minimum required interaction score was set at 0.4, the default value in the STRING database. Nodes without any connection were excluded from the network.
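The network construction rules can be sketched in Python as follows. This is an illustration, not the STRING implementation. The function names and the placeholder proteins P1 to P5 are hypothetical, the 0.4 cutoff is treated as inclusive, and "most connected" is taken to mean the largest number of interaction partners, all of which are assumptions.

```python
from collections import defaultdict


def build_network(edges, min_score=0.4):
    """Keep edges scoring at least min_score. Proteins left without any
    qualifying edge never enter the adjacency map, so they are excluded."""
    adjacency = defaultdict(set)
    for protein_a, protein_b, score in edges:
        if protein_a != protein_b and score >= min_score:
            adjacency[protein_a].add(protein_b)
            adjacency[protein_b].add(protein_a)
    return dict(adjacency)


def most_connected(adjacency):
    """Return the protein with the most partners (ties broken alphabetically)."""
    return min(adjacency, key=lambda name: (-len(adjacency[name]), name))


# Toy edge list with placeholder protein names and scores.
toy_edges = [
    ("P1", "P2", 0.92),
    ("P1", "P3", 0.71),
    ("P1", "P4", 0.45),
    ("P2", "P3", 0.40),
    ("P4", "P5", 0.15),  # below the cutoff, so P5 ends up unconnected
]
network = build_network(toy_edges)
hub = most_connected(network)
```

In the toy network P1 has three partners and is reported as the hub, while P5 is dropped because its only edge falls below the score cutoff.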

Discussion
This study was based on the hypothesis that "transcriptional profile heterogeneity will reflect patient clinical heterogeneity" and also sought to identify gene signatures that may serve as biomarkers of Staphylococcus infection in humans. Our goal was to identify the genes that are differentially expressed in pneumonia induced by Staphylococcus aureus. One of the most common uses of sequencing data is differential gene expression (DGE) analysis. This method is widely used in sequencing data analysis because it enables the identification of differentially expressed genes across two or more conditions. Because result formats vary with the tool of choice and the results files contain many pieces of information, interpreting DGE findings can be difficult and time consuming (Wang et al., 2019). In the ICU, Staphylococcus aureus is the second most prevalent cause of pneumonia. Toxins and enzymes produced by the bacterium highlight its virulence, causing significant lung tissue damage. Clinical signs are insufficient to distinguish Staphylococcus aureus pneumonia from pneumonia caused by other pathogens, and clinical diagnosis suffers from the same limitations as for other bacterial causes of pneumonia (Hooper and Smith, 2012).
The comparison groups set for differential analysis in this study include Staphylococcus aureus-infected patients, Staphylococcus-infected patients with pneumonia, and Staphylococcus-infected patients without pneumonia. Infections present other than pneumonia included bacteremia, osteomyelitis, suppurative arthritis, pyomyositis, empyema, and abscess. Downregulation of gene expression is an indication of the inhibitory activity of the pathogen while the genes whose expression were Cathelicidin is an antibacterial peptide of the cathelicidin family. It is a small molecule (composed of 12-100 amino acids) with broad antibacterial activity that is thought to play a role in innate immunity as the first line of defense against microbes (Iacob and Iacob, 2014). When cathelicidin is produced enzymatically, it has an N-terminal prosequence followed by a C-terminal variable sequence with strong antimicrobial activity. This antimicrobial peptide group is called cathelicidin because the structure of the prosequence is extremely similar to that of a protein called cathelin. Although the exact mechanism of CAMP (Cathelicidin Antimicrobial Peptide) gene regulation is unknown, cathelicidin is reported to be upregulated

Conclusions
The molecular mechanism of infection and the involvement of the host defense against pneumonia induced by Staphylococcus aureus were critically examined. However, because the study was carried out on pediatric patients, the results may not generalise to other age groups. A comparative study is needed to compare and contrast the mechanisms involved in other members of the population.

The black points (NO) stand for genes that have a fold change less than 1.0. The blue points represent genes whose fold change is lower than -1.0 and whose p-value is lower than 0.05 (downregulated genes). The red points represent genes with a p-value lower than 0.05 and a fold change higher than 1.0 (upregulated genes).

Bar graph of top-level Gene Ontology biological processes enriched in Staphylococcus aureus infection-associated genes (colored by p-values). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.

Bar graph of top-level Gene Ontology biological processes enriched in pneumonia infection-associated genes (colored by p-values). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.

Figure 6
Protein-protein interaction network of Staphylococcus infection-related genes. Blue nodes are proteins whose expression was downregulated, while red nodes are those whose expression was upregulated.

Protein-protein interaction network of pneumonia infection-related genes. Blue nodes are proteins whose expression was downregulated, while red nodes are those whose expression was upregulated. CEACAM8 has the highest fold change in expression value. Two clusters appear in the network: one comprises essentially only downregulated proteins (right), while the other contains upregulated and downregulated proteins and a protein whose expression did not change significantly.
