Identi cation of differentially expressed genes and hub genes of human hosts with tuberculosis through an integrated bioinformatics strategy


 Background: Tuberculosis is a chronic infectious disease caused by Mycobacterium tuberculosis. Until now, molecular mechanisms underlying the occurrence, development and prognosis of tuberculosis have not been fully characterized. The aim of the study was to identify hub genes involved in tuberculosis.Methods: We used four microarray datasets (GSE51029, GSE52819, GSE54992, and GSE65517) from the Gene Expression Omnibus (GEO) and GEO2R software to identify differentially expressed genes (DEGs) between samples from humans infected with M. tuberculosis and a healthy control group (using cutoffs of LogFC > 1 and p value < 0.05). DEGs shared by the four microarray datasets were further identified. Next, we carried out functional enrichment analysis using the Gene Ontology (GO) and Kyoto Encyclopaedia of Genes and Genomes (KEGG); Then, the host hub genes with a relatively high number of connections to other DEGs, were identified by Cytoscape. Other bioinformatics methods are also performed, including protein–protein interaction (PPI) network analysis and construction of miRNA–hub gene networks and transcription factors (TF)–hub gene networks. Finally, the expression of the host hub genes was verified using the reverse transcription polymerase chain reaction(RT–PCR).Results: A total of 46 DEGs were identified. GO analysis showed that the biological functions of DEGs were mainly in immune response regulation, cytokine/chemokine activity, and receptor ligand activity. DEGs were significantly enriched in membrane rafts, the mitochondrial outer membrane, cytoplasmic vesicle cavities, and nuclear chromatin. KEGG enrichment analysis showed involvement of the genes in the NOD-like receptor and toll-like receptor signaling pathways. Five highly differentially expressed hub genes – STAT1, TLR7, CXCL8, CCR2, and CCL20 – were identified. Finally, based on NetworkAnalyst's database, we constructed miRNA–hub gene networks and TF–hub gene networks.Conclusions: In summary, bioinformatics analyses were used to identify DEGs to find potential biomarkers that may be associated with tuberculosis. This study provides a set of candidate DEGs and five important hub genes that can be potential for the early detection, prognostic determination, risk assessment, and targeted therapy of tuberculosis.


Background
Tuberculosis (TB) is a chronic infectious disease caused by Mycobacterium tuberculosis, spread through the inhalation of droplets from the coughs or sneezes of an infected individual.TB can involved in many organs, but pulmonary TB is the most common infection. According to the World Health Organization (WHO), an estimated 10 million people worldwide were infected with TB in 2020, of which 7.1 million were newly diagnosed with TB(1). The incidence of TB varies from less than ve to more than 500 cases per 100,000 people per year, and TB remains a deadly disease even in developing countries with wellestablished healthcare systems.
Infection with M. tuberculosis causes clinical signs and symptoms when host defense is reduced or cellmediated allergy is increased (2). Respiratory symptoms include cough, sputum, hemoptysis, chest pain, and varying degrees of chest tightness or shortness of breath. However, these symptoms are relatively nonspeci c, and cannot be used to de nitively diagnose TB. Sputum smear microscopy, bacterial culture and M. tuberculosis isolation are the most traditiona and classical diagnostic tools for TB. However, these tools need a lot of time, and their accuracy is not high, it is di cult to realize the early diagnosis and effective treatment of TB patients (3). The tuberculin skin test (TST) and the interferon (IFN) -γ release assay (IGRA)are widely used in the diagnosis of M. tuberculosis infection (4). However, TST and IGRA are essentially unable to distinguish between active and latent TB infections (5). As an intracellular pathogen, M. tuberculosis largely depends on its ability to destroy the host's macrophage innate immune defense system(6). Therefore, there is an urgent need for potential biological markers for e cient diagnosis and treatment, and bioinformatics research on macrophages stimulated by M. tuberculosis is particularly important.
Microarray technology has rapidly developed in recent years, and is widely used to more accurately diagnose various diseases, compare gene expression levels, predict disease progression, and provide prognosis evaluation (7,8). As a result, a large amount of genes expression microarray data have been published on public database platforms, including genomic data on TB; these databases could be integrated to identify the molecular mechanisms of disease. This novel approach could signi cantly improve molecular disease prediction and reveal opportunities for drug-based molecular targeting and molecular therapy (9). It could not only provide new insights into the molecular and cellular processes involved in the pathogenesis of TB, but also establish much-needed rapid, sensitive, and effective methods to better diagnose and treat TB (10).
With the development of genomics technology, there is a large amount of data in the eld of TB research (11). So far, many related studies have explored potential biomarkers of TB using bioinformatics, but the e ciency and accuracy of diagnosis and treatment of the disease remain unsatisfactory. To further explore sensitive and speci c TB biomarkers, we screened the DEGs between the M. tuberculosis infection groups and the healthy control groups in four separate GEO datasets, and then performed GO function enrichment analysis, KEGG pathway enrichment analysis and PPI network construction and module analysis. The results of this study may help to explore potential targets for the diagnosis and treatment of TB.

Microarray dataset
The GEO database (http://www.ncbi.nlm.nih.gov/geo) is a free public genomics database containing a variety of data including microarray and next-generation sequencing data. We used the following keywords and medical subject heading (MeSH) terms to speci cally identify human tuberculosis datasets in the GEO database: ("tuberculosis" [MeSH Terms] OR tuberculosis[All Fields]) AND "Homo sapiens" [Organism]. The inclusion criteria of gene expression pro le are as follows: i)the datasets must include samples from both peripheral blood mononuclear cells infected with M. tuberculosis and non-infected controls; ii) the sample size must be six or greater; iii each sample must be supported by a extremely large number of data. The interactive network tool GEO2R is used to analyze the gene expression data of the microarray and nd DEGs (12). In this study, the selected datasets of M. tuberculosis infection groups and healthy control groups were rst analyzed by GEO2R. Subsequently, the analysis results are downloaded in Microsoft Excel format, and the genes that meet the following conditions: |log FC| > 1 and p values < 0.05,was considered DEGs (13). Finally, we used the enrichment analysis tool FunRich (version 3.1.3) to show the intersection of DEGs. In addition, we used the R language tool (version 4.0.3) and R Studio to draw a heatmap and a correlation circle map for the DEGs. The gene expression matrix data used to draw these maps were derived from the GSE51029 dataset from the GEO database.

Function and pathway enrichment analysis of DEGs
We use Metascape online software for GO analysis and KEGG pathway enrichment analysis (14), and to further explore the primary biological functions of the identi ed DEGs through functional enrichment analysis based on GO and KEGG databases (15). The purpose of GO analysis is to identify the key biological characteristics of genes, gene products and sequences, including biological processes (BP), cell components (CC) and molecular functions (MF)(16). KEGG pathway enrichment analysis was performed to provide a complete set of biologically interpreted genome sequence and protein interaction network information (17).The CC, BP and MF categories and the KEGG pathway were classi ed and presented in the form of bubble charts. These bubble charts were drawn based on the p value using the ggplot2 function in the R software package, using a cutoff value for statistical signi cance of p < 0.05.

Protein-protein interaction (PPI) network construction and module analysis
The STRING database (http: // www. string-db. org/) is an online tool used to identify and predict physical and functional interactions between genes or proteins(18). These interactions include physical and functional associations, data mainly come from computational predictions, high-throughput experiments, automatic text mining and co-expression networks (19). We mapped the DEGs identi ed in our analysis to the PPI network from the STRING database with a threshold interaction score > 0.4 to build a PPI analysis network for these DEGs. Analysis of DEGs in the context of protein interactions can help clarify the biochemical complexes or signal transduction components that control biological output (20), and PPI analysis is very important to explain the underlying molecular mechanisms of key cell activities in pathogenicity(10).
Next, we used Cytoscape software (version 3.7.2) to visualize the PPI network of DEGs. Each node in the network represents a gene, protein or molecule. The connections between nodes represent the interaction of these biological molecules and can be used to identify genes that are differentially expressed Interactions between encoded proteins and signal pathway relationships (9). Subsequently, the modules of the PPI network were screened through Metascape online tool and the MCODE plugin in Cytoscape software (21).The default parameters were as follows: degree cutoff = 2, node score cutoff = 0.2, k-score = 2, and maximum. depth = 100. As another plugin of Cytoscape (22), Cytohubba studies important nodes in the network through 11 methods, of which MCC has shown satisfactory performance, and the rst ve genes are selected as the key hub genes.

Construction of miRNA-hub gene networks and TF-hub gene networks
NetworkAnalyst (https: //www. networkanalyst. ca/) is a comprehensive network visualization analytical platform for gene expression analysis (23). In order to construct miRNA or TF regulatory network that interacts with hub genes in the post-transcription stage (24,25), we applied NetworkAnalyst to integrate miRNA databases(26). In the study, the targeted miRNA-hub genes were de ned according to the positive results of ≥ 3 miRNA-target predicting databases, including TargetScan, miRanda, PicTar, and PITA. The targeted TF-hub genes were de ned according to the positive results of databases ENCODE (http://cistrome.org/BETA/) (27,28). Finally, we visualized target miRNA-hub gene and TF-hub gene networks by employing Cytoscape software.

Con rmation of gene expression using RT-PCR
We selected the immune cell line THP-1 to con rm gene expression in a RT-PCR assay, as it has previously been successfully used to investigate TB immune defense, antigen presentation, and phagocytosis (29). During the stimulation phase of cell culture, the standard strain M. tuberculosis H37Rv was used to infect the macrophages induced and differentiated from human THP-1 monocytes in vitro (30), Trizol reagent was used to extract total RNA from cells at 24h, 48h, and 72h, respectively.
We used the PrimeScript RT kit (Takara, Dalian, China) according to the manufacturer's protocol to reverse-transcribe total RNA into complementary DNA (cDNA) strands, and used the company's SYBR® Premix Ex Taq II (2×) to perform RT-PCR using the CFX96 PCR detection system (BioRad, California, USA). The con guration system and sample addition operations are performed on ice, and two replicate wells are made for detecting the target genes and internal reference gene of each sample. The reaction volume of 25 µL contains 12.5 µL SYBR® Premix Ex Taq II (Tli RNaseH Plus) (2×), 1 µL forward primer, 1 µL reverse primer, 1 µL cDNA template, and 8.5 µL RNase-free dH2O. The ampli cation conditions were as follows: 95°C for 15 s, then 40 cycles, in which the denaturation process at 95°C lasts for 5 s, and the annealing process at 58.5°C (β-actin) lasts for 30 s. SYBR® TB Green Premix PCR mix and primers speci c to our target gene are used to effectively amplify the target region. The 2 −△△CT method was used to calculate gene relative expression and perform statistical analysis (31,32). The primer sequences were shown in Table 2, and the β-actin gene was used as an internal reference gene. The four microarray expression datasets of GSE51029, GSE52819, GSE54992 and GSE65517 were obtained from the GEO database. The above data sets were uploaded to GEO2R and standardized ( Fig. 1) to screen DEGs between the M. tuberculosis infection groups and the healthy control groups, and create a volcano map of the distribution of these DEGs in four datasets. The differential expression of multiple genes in the 2 sets of sample datas in each array is shown in Fig. 2.
According to the criteria of |log FC|> 1 and p values < 0.05, we used FunRich software to display the intersection of DEGs in the four microarray expression datasets, a total of 46 overlapping genes were found (Fig. 3A), which were regarded as candidate DEGs and used for further analysis. We use the pheatmap software package in R Studio to visualize the heatmap of 46 DEGs (Fig. 3B). It is worth mentioning that by using the corrplot and circlize software packages in R Studio to visualize 46 DEGs, the generated correlation circle map can show the correlation of multiple genes in one picture (Fig. 3C). The outer circle represents genes, and the line between the two genes represents the correlation coe cient.
Positive correlation is shown in red, and negative correlation is shown in green. The darker the color, the more signi cant the correlation.

Enrichment analysis of GO and KEGG pathways
Enrichment analysis is the core of most existing gene annotation portals (33). In the enrichment analysis process, the input gene list is compared with thousands of gene sets, which are de ned by their participation in speci c biological processes, protein localization, enzyme functions, pathway members, and other characteristics. We used RSQLite, org.Hs.eg.db, clusterProiler and other software packages in R software to analyze and visualize the GO function and KEGG pathway enrichment of the 46 identi ed DEGs. The results are shown in Fig. 4.
The top 20 most important items of enrichment analysis are illustrated in the form of bubble diagrams. The size and color of the bubbles indicate the number of DEGs and the importance of enrichment in the enrichment analysis of GO and KEGG pathways, respectively. The rst 5 important terms of GO enrichment analysis indicate that in the BP category, DEGs participate in the defense response to the virus, the regulation of the immune effect process, the cellular response of interferon-γ, the regulation of the innate immune response and the chemokine-mediated signal pathway (Fig. 4A). For the MF category, DEGs are related to cytokine receptors, cytokine activity, G protein-coupled receptor binding, receptor ligand activity and signal receptor activator activity (Fig. 4B). For the CC category, DEGs are signi cantly enriched in membrane rafts, mitochondrial outer membrane, cytoplasmic vesicle cavities, endocytic vesicles and nuclear chromatin (Fig. 4C).
For KEGG pathway enrichment analysis, the rst several important KEGG pathways of DEGs include NODlike receptor signaling pathway, in uenza A, chemokine signaling pathway, COVID-19, toll-like receptor signaling pathway and nuclear factor (NF)-kappa B signaling pathway. In order to show in detail the position and importance of DEGs in the pathway, based on the KEGG enrichment analysis, the pathview software package in R Studio was further used to visualize the 46 DEGs identi ed. The pathway diagrams are stored as the additional les. 3.3 PPI network construction analysis and hub gene selection PPI network analysis has been regarded as a useful tool for exploring biological responses in health and disease (34). According to the analysis we conducted using the STRING database and Cytoscape software, the 46 candidate DEGs were placed in a PPI network complex, including 46 nodes and 83 edges. The PPI enrichment p value was less than 1.0e − 16 , and the con dence score was greater than 0.4. Finally, Cytoscape is used to visualize the PPI network of these DEGs(35) (Fig. 5). After deleting isolated and partially connected nodes, a complex PPI network was successfully constructed (Fig. 5A).
The online software Metascape applies the mature complex recognition algorithm MCODE, which can automatically extract protein complexes embedded in large-scale networks(36). It is worth mentioning that Metascape has obvious advantages in PPI network module analysis and GO enrichment analysis (37), and can directly obtain relevant data. We used Cytoscape to visualize the PPI module analysis results generated by Metascape. The MCODE plugin of Metascape lters the module network in the PPI network again (Fig. 5B-C), and the options are set as default parameters. Host hub genes are de ned as genes that play a vital role in diverse biological processes, and usually regulate the activity of other genes(38). We used the Cytohubba plugin, which uses maximal clique centrality (MCC) to study important nodes in the network and select the most relevant genes as the target hub genes(39). The top ve hub genes, which had the highest degree of interaction with other genes in the PPI, were identi ed (Fig. 5D), these included signal transducer and activator of transcription 1-α/β (STAT1), toll-like receptor 7 (TLR7), C-X-C motif chemokine ligand 8 (CXCL8), C-C chemokine receptor type 2 (CCR2), and C-C motif chemokine 20 (CCL20).

Construction and analysis of the miRNA-hub gene network and TF-hub gene networks
NetworkAnalyst is used to screen targeted miRNA and TF for hub genes (40). For these ve identi ed hub genes, the top three miRNA targeted DEGs are TLR7 regulated by 39 miRNAs, CXCL8 regulated by 36 miRNAs, and STAT1 regulated by 19 miRNAs. The miRNA that controls the largest number of hub genes (three genes) was found to be hsa-mir-335-5p (Fig. 6A), and other important miRNAs are shown in Table 3. In the TF hub gene network analysis, when the selected 5 hub genes were imported into the ENCODE of the TF database, only two of the ve genes could be su ciently analyzed, namely CCL20 regulated by 39 transcription factors and CCR2 regulated by 2 transcription factors ( Fig. 6B).

Veri cation of hub gene expression levels using RT-PCR
In order to verify the accuracy of the prediction results, this study used RT-PCR to detect the mRNA expression levels of the 5 host hub genes at 24h, 48h, and 72h after M. tuberculosis infection of the cells. At least 3 biological replicates were performed for each experiment. The results showed that the expressions of hub genes is basically consistent with the heatmap results drawn by the gene expression matrix (GSE51029_series_matrix) data. Speci cally, compared with the PBS groups, gene expression, as indicated by mRNA levels, of STAT1, CCL20, and CXCL8 in the M. tuberculosis-infected groups were all upregulated at the three time points (Fig. 7A-C), while TLR7 and CCR2 expression was downregulated at all three time points (Fig. 7D-E).

Discussion
TB is an infectious disease that seriously harms human health. It is caused by M. tuberculosis that is parasitic in macrophages (41). As the human body's rst line of immune defense, macrophages can kill M. tuberculosis through phagocytosis, oxidative stress, acidi cation and antigen presentation (42). Due to the lack of speci c auxiliary examination indicators, it is di cult to realize the early diagnosis and effective treatment of TB patients with the traditional TB diagnosis scheme.
The method of bioinformatics is helpful to analyze the expression of key genes to reveal the potential molecular mechanism of the biological behavior of TB, and to provide novel views for elucidating the pathogenesis of TB. Microarray technology allows us to explore the host genetic changes and gene expression related to TB, and has proven to be a useful method for identifying new biomarkers in other diseases (43).In this study, four GEO microarray data sets (GSE51029, GSE52819, GSE54992 and GSE65517) were integrated to identify DEGs between PBMC of TB patients and healthy persons to offset the false positive rate in the analysis of independent datasets. According to the criteria of |log FC|> 1 and p values < 0.05, this study used FunRich software to display the intersection of DEGs in the four microarray expression datas, and found a total of 46 overlapping DEGs that can be considered as candidates. The GO function and KEGG pathway enrichment analysis, PPI network analysis and hub gene selection, the construction of miRNA-hub genes network and target TF-hub genes network, and the veri cation of the expression of hub genes at the mRNA level were carried out successively.
According to the results of GO enrichment analysis, for the BP category, DEGs participate in the defense response to the virus, the regulation of the immune effect process, the cellular response of IFN-γ, the regulation of the innate immune response and the chemokine-mediated signal pathway. Previous studies have shown that during infection, macrophages encounter M. tuberculosis before being stimulated by IFN-γ produced by T-helper 1 (Th1) cells (44). However, IFN-γ stimulation is necessary for the complete activation of antibacterial and antigen presentation functions in macrophages (45). For the MF category, DEGs were found to be associated with cytokine/chemokine receptor binding, cytokine/chemokine activity, G protein coupled receptor binding, receptor ligand activity, and signal receptor activator activity. M. tuberculosis is known to induce host proin ammatory mediators that play an important role in disease control(46), including chemokines, which are small molecular weight proteins involved in immune regulation and in ammation (41). For the CC category, DEGs were found to be signi cantly enriched in membrane rafts, the mitochondrial outer membrane, cytoplasmic vesicle cavities, endocytic vesicles and nuclear chromatin. KEGG enrichment analysis showed that NOD-like and toll-like receptor signaling pathways were associated with M. tuberculosis infection. Innate immune cells are known to use various pattern recognition receptors, such as toll-like receptors (TLRs), C-type lectin receptors, and NOD-like receptors to respond to pathogen components when performing a variety of biological functions (47,48). Previous research using eperimental models of TB have emphasized the importance of TLRs in the prevention of M. tuberculosis infection. (49). In addition, antigen recognition of NOD-2 (nucleotidebinding oligomerization domain 2), a member of the NOD-like receptor family, is also crucial in conferring immunity against viruses or bacteria, which may include M. tuberculosis (50,51). This suggests that the coordinated triggering of TLRs and NOD-2 may lead to a stronger and lasting immune response, thereby limiting the growth of M. tuberculosis (52), which would be consistent with our results showing enrichment of these pathways during M. tuberculosis infection.
By constructing a PPI network and analyzing it with MCODE and Cytohubba in Cytoscape, ve hub genes were identi ed, including STAT1, TLR7, CXCL8, CCR2 and CCL20. Previous studies have reported that in the early stage of TB infection, STAT1 can promote downstream apoptotic factors to activate transcription through phosphorylation (53). At the same time, STAT1 plays an important role in the polarization of macrophages to the M1 type, involved in the immune response to viruses and bacteria, including M. tuberculosis. Polarized M1 macrophages have been shown to eliminate M. tuberculosis infection more effectively than Polarized M2 macrophages (54). It is also reported that after ssRNA upregulates TLR7, the number of M. tuberculosis in macrophages is signi cantly reduced, and the macrophage viability is signi cantly increased, indicating that TLR7 can effectively inhibit the growth of M. tuberculosis and increase the viability of macrophages (42). Kane et al. found that broblasts have a previously unrecognized role in regulating TB in ammation via a CXCL8-dependent contribution to immune cell recruitment and mycobacterial killing in granulomas (55). In the study of Dunlap et al., the mouse model provided evidence that the CCR2 axis is essential for protective immunity against the emerging M. tuberculosis lineage infection(56). Another report showed that CCL20 is overexpressed in monocytes infected by M. tuberculosis and inhibits the production of reactive oxygen species (ROS) (57).Therefore, prior research provides biologically plausible mechanisms by which these genes may be involved in immune responses to M. tuberculosis infection, supporting our results.
To study the molecular mechanisms of potential hub gene disorders, it is necessary to search for potential miRNAs through bioinformatics methods. The miRNA is an endogenous non-coding RNA molecule with a length of 18-22 nt that targets the 3'UTR region of a gene. It can regulate gene expression at the post-transcriptional level to degrade or inhibit the translation of target genes(58).
MiRNAs are known to regulate protein translation inhibition or targeted mRNA cleavage (59). More and more evidences have showed that miRNAs are closely related to the occurrence and development of cancer and other major diseases. In our analysis, the three DEGs most associated with miRNA regulation were CXCL8, TLR7, and STAT1. At the same time, we observed 10 miRNAs, and found that their targeting involves at least 2 hub genes. Among these, hsa-mir-335-5p was found to be associated with the greatest number of genes; however, there are relatively few previous studies on this miRNA. One study found that the gain or loss of hsa-miR-335-3p function can lead to changes in the expression of GATA4 and NKX2-5 markers during the cardiac differentiation of human embryonic stem cells(60). In addition, hsa-miR-335-3p has been identi ed as an upstream regulator of two modules related to the recurrence of osteosarcoma patients(61). The results of bioinformatics analysis found that hsa-mir-335-5p has high potential value used as a new biomarker. Our study also established a TF-hub genes regulatory network to further explore the molecular regulatory mechanisms underlying TB(62). TFs are the primary regulators of gene expression and are associated with pathogenesis in TB. In our study, we also found several TFs that interact closely with hub genes, including max-like protein X (MLX), transcription factor DP 1 (TFDP1), retinoid X receptor alpha (RXRA), zinc nger protein 197 (ZNF197), glucocorticoid modulatory element binding protein 2 (GMEB2), and tripartite motif containing 22 (TRIM22). The complex interactions between TFs and hub genes have made a huge contribution to the development of the disease. Our analysis, which identi es miRNAs and TFs associated with newly identi ed hub genes, provides potential candidates for the development of therapeutic targets and exploration of the biological mechanism of TB in future research.

Conclusions
In summary, this study identi ed several candidate DEGs, including ve key hub genes, associated with M. tuberculosis infection, some of which may be potentially useful biomarkers for TB. These results may provide a basis for screening candidate therapeutic agents and biomarkers for diagnosis of TB and could inform future research on molecular mechanisms involved in this disease. However, further in vitro and in vivo studies are needed to con rm the predicted functions of these DEGs in TB-related physiological and pathological processes. Experimental models in studies exploring the molecular mechanisms of TB could then potentially be constructed based on these genes and may enable improvements in early detection, prognosis, disease risk estimation, and targeted therapies in TB.

Declarations
Ethics approval and consent to participants Not applicable.

Consent for publication
Not applicable.
Availability of data and materials