A three‐tiered integrative analysis of transcriptional data reveals the shared pathways related to heart failure from different aetiologies

Abstract Heart failure (HF) is the end stage of most heart disease cases and can be initiated from multiple aetiologies. However, whether the molecular basis of HF has a commonality between different aetiologies has not been elucidated. To address this lack, we performed a three‐tiered analysis by integrating transcriptional data and pathway information to explore the commonalities of HF from different aetiologies. First, through differential expression analysis, we obtained 111 genes that were frequently differentially expressed in HF from 11 different aetiologies. Several genes, such as NPPA and NPPB, are early and accurate biomarkers for HF. We also provided candidates for further experimental verification, such as SERPINA3 and STAT4. Then, using gene set enrichment analysis, we successfully identified 19 frequently dysregulated pathways. In particular, we found that pathways related to immune system signalling, the extracellular matrix and metabolism were critical in the development of HF. Finally, we successfully acquired 241 regulatory relationships between 64 transcriptional factors (TFs) and 17 frequently dysregulated pathways by integrating a regulatory network, and some of the identified TFs have already been proven to play important roles in HF. Taken together, the three‐tiered analysis of HF provided a systems biology perspective on HF and emphasized the molecular commonality of HF from different aetiologies.

Thus, exploring the commonalities will help achieve better understanding of the aetiology of HF.
The rapid development of high-throughput 'omics' technologies (such as DNA microarrays and next-generation sequencing) has resulted in the increasing availability of transcriptional data. 3 The availability of these data for HF provides a good opportunity to employ computational systems biology approaches to advance our understanding of the mechanisms underlying the development of HF. 4,5 Studies based on transcriptional and interactome data have identified 60 common functional modules related to HF 6 and analysed the differences in the pathogenesis of HF arising from different aetiologies. 7 By examining the expression profiling of miRNAs in failing human hearts, Zhu et al identified miR-340 as a key miRNA contributing to the progression of HF. 8 By integrating miRNA-target interactions and differentially expressed genes/lncRNAs, a recent study investigated the function of lnRNAs in HF and identified some lncRNAs that were verified to show a strong diagnostic power for HF. 9 Although some studies have been carried out to decipher the molecular mechanisms of HF based on transcriptional data, a systematic research study that comparatively analyses HF from different aetiologies through integration of vastly available omics data and curated pathways has not been performed.
To address this issue, we perform a three-tiered data analysis

| Overview of the data analysis procedure
To understand the molecular commonalities of HF from different aetiologies, we performed a three-tiered data analysis ( Figure 1). First, we started the analysis at the gene level, in which differentially expressed genes (DEGs) were inferred for each disease. Then, we extended our analysis to the pathway level, where we identified frequently dysregulated pathways across HF from different aetiologies. Finally, by integrating the regulatory network, we identified TFs regulating the expression of the frequently dysregulated pathways.

| Data collection and pre-processing
Transcriptional data related to HF resulting from 11 different aetiologies were collected from the GEO (Gene Expression Omnibus) database. 10 Normalized data were directly downloaded from the GEO database. Probe sets were mapped to their corresponding gene symbol according to the annotation files from GEO, and replicated probes of the same gene were averaged. Curated pathways were gathered from the KEGG pathway, 11 Reactome pathways, 12 BioCarta pathways, Pathway Interaction Database, 13 Sigma-Aldrich gene sets, Signal Transduction KE gene sets, Signaling Gateway gene sets and SuperArray gene sets from the molecular signatures database (MSigDB, v6.1). 14 The curated pathways were downloaded in GMT format. Of the available pathways, we used those from C2:CP (canonical pathways). After excluding pathways that were too large or too small (>300 genes or <5 genes, respectively) and removing overlapping pathways (overlap ratio > 0.8), 610 pathways were kept for further analyses.

| Differential expression analysis
DEGs between disease samples and the corresponding control samples were inferred using the function RankProducts in the BioConductor package RankProd. 15 For each disease, the fold changes (FCs) of genes between the disease and control samples were first translated to the ranks of genes. Then, the combined rank of each gene from multiple comparisons was defined as the rank product. Independent permutated expression data were used to calculate the null density of the rank product and to determine the p-value and the percentage of false-positive predictions (pfp) F I G U R E 1 Overview of the workflow of the three-tiered data analysis. First, we collected HF-related transcriptional data from the GEO database, regulatory network data from RegNetwork and HTRIdb, and curated pathways from MSigDB. Then, we performed three-tiered data analysis: gene-centric differential expression analysis, pathway-centric enrichment analysis and network-centric regulatory analysis associated with each gene. Finally, genes with pfp values less than 0.05 were defined as differentially expressed.

| Machine learning analysis
Random forest (RF) classification models were built using the 'ran-domForest' package (https://cran.r-proje ct.org/web/packa ges/ rando mFore st/) in R with genes (features) in columns and samples in rows. We utilized a 10-fold cross-validation procedure to assess the performance of the classification models. Samples were randomly partitioned into 10 parts with approximately equal number of samples. Nine parts were used to train the RF classifier, and the remaining one part was used to test the performance. The value of the area under the curve (AUC) from the receiver operating characteristic (ROC) curve was used to assess the prediction accuracy of the RF model. The higher AUC value, which ranges from 0 to 1, indicates better prediction performance. After all samples have been used as the testing set, the predicted values were imported into the R package PRROC to visualize the ROC curves. 16

| Gene ontology enrichment analysis
Gene ontology (GO) enrichment analysis for DEGs was performed with BiNGO (version 3.03), a plugin in Cytoscape. Using the whole annotation of human genes as the reference set, GO terms with Benjamini-Hochberg (BH)-adjusted P-values less than 0.05 were extracted as significantly enriched.

| Gene set enrichment analysis
The pre-ranked gene set enrichment analysis (GSEA) tool 17 [GSEA PreRanked (1000 permutations, minimum term size of 5, maximum term size of 300)] was used to determine whether the curated pathways exhibited statistically significant, concordant difference between HF and normal tissue samples. Briefly, genes were first ranked based on their FCs between disease samples and the corresponding normal samples. Then, ranked genes were used as the input for GSEA PreRanked. Finally, curated pathways with P-values less than 0.05 were identified as significant.

| TF-pathway regulation analysis
To identify TFs regulating the dysregulated pathway, we first obtained TF-target regulatory relationships from two databases, RegNetwork 18 and HTRIdb. 19 HTRIdb is an open access database on transcription factor binding sites. Then, for each TF and dysregulated pathway, Fisher's exact test was used to test the enrichment of TF targets in the dysregulated pathway, and a P-value was obtained.
We also calculated the proportion of TF targets in the pathway and obtained a ratio. Finally, the TF was predicted to regulate the dysregulated pathways when the BH-corrected P-value was less than 0.05 and the ratio was larger than 0.2.

| Overview of gene expression in HF from different aetiologies
To understand how genes are expressed during HF, we collected 505 samples from 11 microarray studies measuring gene expression during HF from 11 different aetiologies from the GEO database 10 ( Table 1, Table S1). Among the 505 collected samples, 414 samples detected gene expression in patients with HF, and the remaining 91 samples were from controls. The DEGs between the disease samples and their corresponding control samples were inferred using the RankProd package, which was developed from the rank product method. 15 Rank product is a nonparametric statistical method for identifying DEGs (up-regulated or down-regulated) based on the estimated pfp. By keeping genes with a pfp of less than 0.05, we obtained 6685 DEGs that were differentially expressed in at least one of the 25 comparisons (Table S2; see Materials and Methods section for details).
As shown in Figure 2, only approximately 3.6% (242/6685) of the DEGs were differentially expressed in more than 12 comparisons. This indicated that only a small fraction of DEGs were frequently differentially expressed during the progression of HF from different aetiologies.

| Identification of a common gene expression signature of HFs from different aetiologies
To identify the genes that were frequently differentially expressed (named frequently differentially expressed genes, FDEGs) during HF from different aetiologies, we selected genes that were differentially expressed in more than 60% (15/25) of comparisons and obtained 111 such genes (Table S2). Approximately 60% (67/111) of FDEGs were classified as up-regulated since these genes were generally more up-regulated in all comparisons, whereas the remaining 44 genes were classified as down-regulated. Next, we investigated whether the two conditions (HF and normal control) could be successfully discriminated using the FDEGs by employing a RF classifier. in serum has already been associated with the pathophysiology of HF. 23 The other FDEGs (such as STAT4 and SERPINA3) with undefined roles in HF were good candidates for further experimental verification as these genes were frequently differentially expressed in HF from different aetiologies.

| Suppressed immune responses in HF
GO annotation analysis showed that the GO term 'defence response' was significantly enriched in down-regulated FDEGs. We wondered whether FDEGs were significantly enriched in defence genes. Thus, we first obtained 2209 genes that were involved in the immune response from two databases, namely InnateDB 24 and the Immunogenetic Related Information Source (IRIS). 25 After removing genes without expression values from 2209 immune response genes, 2056 genes were kept for further analysis. Statistical analysis showed that FDEGs were significantly enriched in these immune-related genes (Fisher's exact test, P = 1.01 × 10 −06 ; Figure 3B). Further investigation showed that immune-related FDEGs were significantly enriched in down-regulated FDEGs (Table S2, Fisher's exact test, P = 1.04 × 10 −08 ) rather than in up-regulated FDEGs (Fisher's exact test, P = 0.13; Figure 3C). Moreover, the top 5 FDEGs (ie SERPINA3, FCN3, NPPA, CCL2 and PLA2G2A) were all involved in the immune system. All of these 5 genes except NPPA were down-regulated in the majority of conditions (Table 2). Briefly, these results indicated that immune systems were suppressed in the progression of HF.

| Gene set enrichment analysis reveals dysregulated biological pathways in HF
To identify the biological pathways that were frequently influenced by HF from different aetiologies, we obtained 1329 curated pathways from the MSigDB database. 14 After excluding pathways that were too large or too small, overlapping pathways and disease-related pathways, 26 610 pathways were kept for further analysis (see Materials and Methods section). Pathway enrichment analysis of these curated pathways performed with GSEA helped us to obtain a P-value per pathway per condition. GSEA is a computational method that identifies gene sets (eg biological pathways) that show a statistically significant, concordant difference between two biological states. 17 Based on GSEA, we found that 610 pathways were all dysregulated in at least one of the 414 disease samples (Table S4) To identify the pathways that were frequently dysregulated in HF from different aetiologies, we selected pathways that were dysregulated in more than 60% of the HF samples 27 and get 19 frequently dysregulated pathways in Table 3. As shown in Figure

| Potential TFs regulating frequently dysregulated pathways
To gain an in-depth understanding of how TFs regulated the ex-  Table S6. These genes may serve as a vital role in regulating these dysregulated pathways and potentially affect the initiation and progression of HF. was unclear. In this study, we integrated transcriptional profiles and pathway information to investigate the molecular commonalities of HF from 11 different aetiologies.

| D ISCUSS I ON
Some previous analyses explored HF by using transcriptional data and interaction networks. 6,7,[32][33][34][35][36] However, using small-scale transcriptional data and interaction networks with high positive rates may lead to constrained results. In this work, we performed a three-tiered transcriptional data analysis by integrating large-scale transcriptional data and curated pathways, which can produce more solid results. The advantages of integrating curated pathways include reducing the complexity by grouping thousands of DEGs into just several hundred pathways and increasing the explanatory power by identifying impacted curated pathways with specific functions.
Our approach not only successfully uncovered several key genes (such as NPPA, NPPB and FRZB) already involved in HF but also pro- represent the most abundant group of essential amino acids that cannot be synthesized de novo. 46 BCAA catabolic deficiency was proposed as a novel metabolic feature in HF with a broad impact on the progression of pathological remodelling and dysfunction. 47 Multiple studies have already shown HF-related changes in cardiac ECM, including the accumulation in glycoproteins, collagens and proteoglycans. 48 We noticed that a disease pathway 'genes involved in diabetes pathways' was also identified as a frequently down-regulated

F I G U R E 6
The 241 regulatory relationships between 64 TFs and 17 frequently dysregulated pathways. Circle and triangle nodes represent frequently dysregulated pathways and TFs, respectively. TF-pathway regulatory relationships were predicted using Fisher's exact test pathway (Table 3). HF is closely related to diabetes: patients with HF are at higher risk of developing diabetes. 49 The enrichment of canonical pathway 'genes involved in diabetes pathways' in HF further confirmed the close connection between HF and diabetes. In addition, our analysis also discovered several biological pathways with potential roles in HF from different aetiologies, for example the 'PDGFR-beta signalling pathway'. Chintalgattu et al found that PDGFR-beta knockout mice exposed to load-induced stress resulted in HF and showed that cardiomyocyte PDGFR-β signalling plays a vital role in stress-induced cardiac angiogenesis. 50 It is reasonable to speculate that the PDGFR-beta signalling pathway may regulate angiogenesis in the heart, which substantially contributes to HF through several different mechanisms.
It has been long recognized that immune system activation or dysregulation plays a significant role in the development and progression of HF. 51 In this work, both gene-centric differential expression analysis and pathway-centric enrichment analysis revealed that immune system-related genes and pathways were sig-  (Table 3). These results further confirmed the importance of the immune system in HFs and that their role in HF was independent of the aetiologies of HF.
Generally, TFs play key roles in regulating the expression of encoding genes and tend to regulate genes within the same pathways.
In this study, we also predicted 64 potential TFs regulating 17 dysregulating pathways. Some predicted TFs, such as TP53 and NFKB1, have already been reported involving in the progression of HF TP53 was proven to be a master regulator of the cardiac transcriptome and a key molecule, which triggered the development of HF. 52,53 NFKB is a pleiotropic TF involved in different signalling pathways and strongly implicated in the development of cardiac remodelling, hypertrophy and HF. 54-56 NFKB1 belongs to the NFKB TF family, and it has been reported that NFKB1 polymorphism is associated with the heart function in patients with HF from different aetiologies. 57 The other predicted TFs with unknown roles in HF are good candidates for further experimental verification, such as ETS1 and EGR1.
It is known that ETS1 is important in heart development. Moreover, a previous study found that patients with congenital heart disease had a de novo frameshift mutation in ETS1. 58 In the present study, we found that ETS1 is predicted to regulate the expression of fourteen dysregulating pathways. We speculated that ETS1 participated in the progression of HF by regulating the expression of these dysregulated pathways. EGR1 is an early-response TF that can be rapidly induced by various environmental stimuli. It was predicted to regulate 8 dysregulated pathways and identified as a FDEG (Table S2). In the previous studies, EGR1 was found to involve in multiple cardiovascular pathobiology including cardiac hypertrophy, atherosclerosis, ischaemic pathology and angiogenesis. 59 Furthermore, the expression level of EGR1 can discriminate between chronic HF patients and control patients. 59 Therefore, we speculated that ETS1 might be a potential biomarker of HF.
Finally, we recognized some limitations in this work. Our results are based on currently available data and should be interpreted with caution. First, our analyses are limited by the availability of pathway and gene expression information. Therefore, some genes with potential roles in HF are ignored in this work, as these genes are not detected on the microarray or included in the curated pathways. Second, the analysis of transcriptomes is often not enough to reflect the level of pathway activity, this weaken the conclusions that can be drawn from our results. Third, further molecular biological experiments are needed to confirm the function of these key genes and TFs, and how they involve in the progression of HF.
In summary, we performed a three-tiered transcriptional data analysis to explore the molecular commonalities of HF from different aetiologies. Our analyses indicate that HF from different aetiologies is associated with 111 FDEGs and 19 frequently dysregulated pathways.
It is hoped that our current analyses can provide new insight to understand the molecular mechanisms of HF from different aetiologies.