Keywords
biomolecular networks, co-expression networks, edge relevance, hierarchical stochastic block model, missing and spurious edges, module detection, network-based omics analysis
This article is included in the Bioinformatics gateway.
biomolecular networks, co-expression networks, edge relevance, hierarchical stochastic block model, missing and spurious edges, module detection, network-based omics analysis
High-throughput measurement techniques are advancing and become ever less expensive, enabling the screening of multiple biological data layers in single patients as almost standard clinical diagnostic tools. The wealth of the biological data can only be understood if treating the measured entities – gene promotors, mRNA, metabolites, proteins and their activity – not separately, but in their network context1. Thereby, one method to capture the interdependencies of the intracellular machinery relies on the hypothesis that strongly connected molecular entities are either co-activated or co-repressed, i.e. their measured abundances should be correlated2–4. Fully connected, weighted biomolecular networks can be established, in which each node corresponds to a molecular entity and is connected to each other node by an edge. The weight of the edge is the correlation between the measurements and is considered to represent how strongly the nodes are connected, interact or regulate each other.
Approaching a network-level analysis of a biological system by correlation-based interactions has the advantage that it does not require a priori knowledge, and thus it is focused on the interaction profile for which evidence can be found in the measurements of the considered condition. However, this is a blessing and a curse: correlation-based networks suffer from the fact that the biological measurements are inherently noisy, even more so for small sample sizes as in rare diseases or in personalized medicine. This affects the values of the edge weights and the results of network reduction, and can be deleterious for subsequent analyses of the networks. In these cases, considering in addition alternative sources of edge relevance that are based on more global characteristics of the system might be beneficial and could make the network representations of the system and their analysis more robust.
One of the commonly performed analyses of correlation-based networks relies on the observation that biological networks exhibit a modular structure, in which tightly regulated modules are loosely connected to other modules. An important readout from correlation-based networks therefore is the biological and functional characterization of condition-specific modules, i.e. communities of co-regulated entities. A plethora of methods for module detection in networks has been proposed5–14. Also for the inference of edge relevance, or edge prediction, as being tightly linked to the problem of the detection of missing or spurious edges, numerous methods have been suggested15–23. In this work, we showcase a method that is able to derive, within a single framework, modules as well as scores of edge relevance: representing the biomolecular correlation-based networks as stochastic block models (SBMs)24.
SBMs are the simplest form of generative network models for community structures and can also accommodate hierarchies25, which are key to convey robustness to biological function. Other generative model approaches relying on scale-freeness of the network architecture have been used for edge prediction in protein-protein interaction networks15, and the SBM has been used for representing other types of networks17,18,26,27, but not yet for biomolecular networks. In generative network model approaches, the network is described by a stringent mathematical framework based on statistical assumptions on network characteristics. For the case of the stochastic block model, this step delivers already the modular structure of the network as the nodes are assigned to blocks according to their connectivity properties to all other nodes. In contrast to other module detection methods, blocks in the SBM are not necessarily formed by tightly intra-connected entities but by entities which interact similarly with the nodes from all other blocks. Therefore, comparing SBM-derived modules between networks representing different (e.g. biomedical) conditions could be especially informative to shed light on regulatory changes. In a second step, the mathematical representation of the networks by SBMs is exploited to estimate edge probabilities. Specifically, it is assessed whether the existence of an edge in the network improves or reduces the fit of the network to the SBM. The resulting edge confidence scores are based on global network structure and can be used as alternative measure of edge relevance.
Here we showcase the SBM-based analysis (overview in Figure 1A) for networks of three different molecular types, derived from transcriptomic, proteomic and metabolomics data of breast cancer tumours. We assessed which of the different versions of SBM fits each data layer best. Then, we investigated whether the SBM representation is able to capture functionally relevant structures in our biomolecular networks. In detail, we determined the agreement between the predicted blocks and independent biomolecular functional annotations from databases. Finally, we took advantage of the description of the networks as SBMs for the computation of an edge confidence score for each edge as measure of edge relevance. The edge confidence scores can be exploited to re-establish erroneously removed edges or to remove spurious edges, or they could serve by themselves for deriving disease-relevant differences when comparing groups of patients.
All code is freely available on https://gitlab.com/biomodlih/sbm-for-correlation-based-networks.
Breast cancer mRNA expression from RNAseq was obtained from the TCGA BRCA cohort via RTCGA28 downloading TCGA level 3 preprocessed BRCA files (search term: mRNAseq_Preprocess) on Nov 2, 2017. We used the normalized RSEM values. Protein data was obtained via the CPTAC homepage. We used the first replicate of samples measured in duplicates. We employed the unshared log ratio value for each sample to maximize reliability of protein identification. Clinical data for both TCGA and CPTAC data was retrieved and evaluated using the RTCGA package. Specifically, we used the following files for mRNA, protein and clinical data, respectively:
gdac.broadinstitute.org_BRCA.mRNAseq_Preprocess.Level_3.2016012800.0.0/ BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt
TCGA_Breast_BI_Proteome_CDAP_Protein_Report.r3/Protein_data/CDAP/TCGA_Breast_BI_Proteome.itraq.tsv
gdac.broadinstitute.org_BRCA.Merge_Clinical.Level_1.2016012800.0.0/BRCA.clin.merged.txt.
For both mRNA and protein data, we only used samples from patients whose entry "patient.breast_carcinoma_estrogen_receptor_status" in their clinical data was "negative". In addition, we restricted our analysis to solid tumour samples, i.e. TCGA sample identifiers ending with 01. This gave rise to 237 samples for the mRNA data and 36 samples for the protein data. Metabolite data was used from the Excel file provided in 29 using the 67 samples containing ERn in their label.
We removed missing values in the mRNA data by replacing them by -10 to account for the fact that they arose from logarithmizing read counts of zero. In the protein data, we removed the 2195 genes which had more than 20% of missing values over the considered samples (i.e. more than 7 NAs among the 36 samples). The metabolite data did not contain missing values as imputation had been performed before29.
We used the correlation computation from the Hmisc package30 (function rcorr) to determine Spearman’s correlation of the measurements of each pair of entities (mRNA, protein or metabolite) over all samples. Only pairwise complete observations were employed. Unless stated otherwise, we neglected self-edges.
The Hmisc package30 was used to determine the p-value associated to the correlation (significance of the correlation being different from zero). We corrected for multiple testing using Bonferroni (mRNA expression) or Benjamini-Hochberg31 (protein, metabolite) correction. Correlations were considered significantly different from zero for corrected p-values lower than 0.05 (mRNA, metabolite) or 0.01 (protein). For the reduction by imposing a scale-free architecture of the reduced network, we employed the pick_hard_threshold function of the WGCNA package3 with the default requirement (0.85) on goodness of fit to a power-law degree distribution of the nodes.
We converted the adjacency matrices of the reduced networks to edge lists in csv format, added the disconnected nodes and fed the resulting networks as graphs into the Python graph-tool32 framework. Using an agglomerative multi-level Markov Chain Monte Carlo algorithm33, that module allows for determining a (potentially hierarchical) partition, b, of the network, G, which minimizes the description length, DL, of the SBM34:
Thereby, P(A|C, D) denotes the probability of A given C and D, and λ captures all parameters of the model apart from the partition, such as the number of edges between blocks and degree distribution parameters for the degree-corrected SBM, in their planar or hierarchical version. Note that the above relationship holds only under a microcanonical formulation of the priors, i.e. hard constraints imposed on the values of the model parameters λ by the structure of the network G the SBM is fitted to (see 34), which enables considering only one value for the parameter λ for a given partition b and thus
Because the probability of the network itself, P(G), is constant for a fixed observed network, the description length is monotonously inversely related to the probability that partition b is responsible for the observed network G, P(b|G). Therefore, finding the partition which minimizes the description length is equivalent to finding the partition with maximal posterior probability, P(b|G). Furthermore, as can be seen in its second term in the first line, the description length DL contains a "penalty" term for the complexity of the SBM description. Thus, it can be used to distinguish which of the four examined SBM types (classical, hierarchical, degree-corrected, hierarchical+degree-corrected) is most suitable to describe the network G. Due to the large sizes of the networks, sampling over the posterior distribution is costly. Therefore, we decided to compare different SBM versions using only the estimated maximum of the posterior, i.e. at the partition b with the lowest minimal description length, in the expression of the posterior odds ratio34. Assuming that both SBM types that are compared, SBM1 and SBM2, are equally probable a priori, P(SBM1) = P(SBM2), we obtain for the posterior odds ratio Λ34:
We can consider the SBM type with the lowest minimal description length most likely; the distance of the posterior odds ratio to 1 determines how much more likely it is than the other SBM type.
We performed 500 runs with random initial partitions for each of the four SBM types and each of the six networks. The SBM type with lowest minimal description length was used for further analyses.
Within the mRNA and protein networks, biological annotation of the blocks and overrepresentation was performed using Reactome pathways and the package ReactomePA35. We restricted our analysis to Reactome pathways containing at least 10 and at most 500 annotated genes, all entities of the network were used as background. We employed Benjamini-Hochberg multiple-testing correction. Pathways were considered enriched for default settings (p-value < 0.05, q-value < 0.2), and only human pathways occurring in the file from the Reactome database, https://reactome.org/download/current/ReactomePathwaysRelation.txt (downloaded June 6, 2018), were used. This file was also employed as representation of the Reactome hierarchy to determine distances between Reactome pathways.
For the metabolite dataset, we mapped the pubchem IDs and metabolite names to KEGG compound IDs using the MBRole webserver version 236 and merged them semi-manually, thereby preferring metabolite names (in case of mismatch with pubchem record) and KEGG IDs with pathway annotation. We downloaded the human KEGG pathway annotation for the mapped metabolites from MBRole and used it as user-defined annotation for overrepresentation analysis with the enricher function from the package clusterProfiler37. We only considered pathways with a minimal size of 2 and otherwise the same settings as for mRNA and protein overrepresentation analysis.
We employed two distance measures of Reactome pathways considering the graph given by the Reactome familial hierarchy tree: (i) the distance in terms of number of steps necessary in the Reactome hierarchy graph to reach from one pathway to the other (distance 1), and (ii) the hierarchy level of the lowest common ancestor, or least common subsumer, for ontology graphs (distance 2). We incorporated an artificial top-Reactome pathway into the hierarchy to connect all pathways with each other and to have our distance measures well-defined. It lies at the 10th hierarchy level, so the largest distance between two pathways is 10 for distance 1 and 18 for distance 2 (nine steps to the highest level and nine back). For all comparisons, only blocks with at least one overrepresented pathway were considered. For comparing the distances of Reactome pathway annotations between blocks, i.e. to associate a distance to a pair of blocks, we median-averaged the distances over all combinations of Reactome pathways associated to the two blocks. For distances of pathway annotations within blocks, we median-averaged the distances between all possible combination of different Reactome pathways associated to the block. The trivial distances of zero for the distance of a Reactome pathway to itself were omitted, as well as blocks with only one enriched Reactome pathway for intra-block distances. Distances between blocks in the SBM hierarchy, as employed in Figure 3D, were defined analogously to the distance measure (i) based on the graph of the SBM hierarchy.
Alternative clustering using the WGCNA package3 was performed following the WGCNA tutorial on clustering. We employed the full correlation matrices including self-edges. First, the correlation values were scaled to values between zero and 1 (by (1+corr)/2), and the soft threshold delivering scale free network topology was determined using the pickSoftThreshold.fromSimilarity function with default settings. Power estimates were 8, 16 and 12 for the mRNA, protein and metabolite network, respectively. We calculated the dissimilarity using TomSimilarity on the soft thresholded correlation matrix, used hclust with method "average" and cutreeDynamic with "deepSplit" parameter of 4 (mRNA) or 2 (else), "pamRespectsDendro" set to FALSE and "minClusterSize" of 3 (for mRNA, metabolite) or 4 (for proteins) to make the clustering most similar to the one obtained from the SBM.
We aim to derive edge confidence scores for each single edge in the network exploiting the representation of the network as SBM. Let us consider a fixed network given as graph G. If δG is a set of edges which do or do not occur in the network G, the probability that these edges belong to the observed network (for edges missing in G) or do not belong to the observed network (for edges from G), P(δG, G), can be written as27:
for b being partitions of the network G (please refer to Fit to SBM for further information on notation). The derivation assumes that the original network, G and the altered network with edges in δG added or removed, G + δG, has been generated by some SBM type (and all probabilities are conditional on that SBM type), and that the set δG has been chosen by some uniform distribution among all possible edges. The proportionality factor between the expressions depends on the network G and the number of edges in δG, and thus can be in particular neglected if only comparing edge confidences between single edges of a network. Because we aim to score all edges, and due to the sizes of the networks making computations slow, we refrained from sampling over the posterior distribution of the partitions and instead employed the single-point approximation for edge prediction proposed in 27:
It resorts to neglecting the summands for all partitions except for the one, b*, which contributes most to the posterior distribution, b* = maxbP(b|G). In addition, the approximation relies on the assumption that the estimated optimal partition for the representation of G by the SBM is the same for G and its altered version with added or removed edge, G + δG, which is reasonable for our application case of single edge predictions, i.e. for δG being composed of a single putatively missing or spurious edge. The term P(b*|G) can be considered constant for a fixed G and SBM type, so we can shift it to the proportionality factor. Considering the microcanonical formulation of the SBM (see Fit to SBM), it becomes clear that the edge predictions for δG directly depend on the difference between the description length of the original network G with partition b*, DLG,b*, and the description length of the altered network G + δG with partition b*, DLG+δG,b*27:
The difference between the description lengths, DLG,b* − DLG+δG,b*, was computed via the function get_edges_prob from graph-tool32. Note that as of time of writing, get_edges_prob does not work for weighted SBMs with real-normal edge covariate (see filed issue #452 at graph-tool.skewed.de). In addition, this function employs the natural (instead of the dual) logarithm and consequently, the obtained value has to be scaled by log2(e) to obtain the plain difference of description lengths.
In order to make clear that neither these edge predictions nor their dual (or natural) logarithm correspond to actual probabilities, we used the term edge confidence scores or simply edge scores throughout the manuscript.
We showcase the methods for data from a subgroup of breast cancer patients: those with estrogen receptor negative (ER-) tumours. ER status is predictive of patient outcome, with ER- leading to unfavourable prognoses, and its assessment is part of standard breast cancer patient screening38–40. The Cancer Genome Atlas (TCGA) initiative has provided data on the mRNA level (from RNAseq) for several hundred patient tissue samples; mass-spectrometry-based proteomic data (4plex iTRAQ) are available for a subgroup of 36 patients41. Metabolomics data have been measured by GC-TOF-MS in a different breast cancer cohort study for 67 samples29.
We treated each measured entity as node and established a correlation-based, weighted, biomolecular network for each single measurement layer. Therein, each pair of nodes is connected by an edge for which the weight is determined by Spearman’s correlation of the measurements of the nodes (over the samples), delivering values between -1 and 1. Thus, an edge that connects nodes with a correlation close to 1 or close to -1 represents strong positive regulation and strong negative regulation, respectively; edges connecting nodes with a correlation close to zero represent weak or absent regulation. We employed Spearman’s correlation because it captures also non-linear relationships between measurements, and it is robust to outliers and any monotonous transformation (e.g. logarithmization). We dealt with missing values in the data by replacing them by small values (NAs due to log transformation of zero counts in the mRNA data) or removing entities with >20% missing values for all samples (for the protein data). Subsequently, we computed the correlation only considering pairwise complete observations. The distributions of the computed correlation values for the three data types are shown in Figure 1B.
The resulting biomolecular networks capture the relationships of the intracellular machinery, and their analyses deliver important insights on altered regulations in disease states2,3. However, because these networks are fully connected, i.e., every entity is connected to each other entity within a layer, the networks become very large and their analyses difficult. A common approach is to reduce the networks, either by selecting a subset of entities as nodes prior to network establishment, mainly by using criteria on abundance, or by using the assumption that weak regulations are less important for the biological network and can be omitted without impairing the represented function of the network. We decided on the latter approach in order not to bias our choice of considered molecular entities.
We used two different techniques of network reduction by thresholding: In the first, we only kept edges for which the correlation was significantly different from zero ("sign. corr."), i.e., the regulation being sufficiently strong. We adapted the significance thresholds for the different molecular data layers to their respective sizes. Larger networks were subjected to more stringent cut-offs in order to achieve an appropriate degree of reduction (Figure 1C, employed thresholds marked in blue). In the second reduction method, we removed weak links until the reduced network met a criterion of scale-freeness (3, "scale-free"). Scale-freeness is considered a key property of self-organized networks and thus also of biological molecular networks1. In scale-free network formation, highly connected nodes tend to attract more connections than lowly connected nodes leading to a degree distribution following a power-law; the degree distributions of the reduced networks are shown in Figure 1D. Both reduction techniques are hard-thresholding techniques meaning that edges are removed from the networks. The resulting six reduced networks, two for each data layer, were used in a binary form, i.e., weight information was discarded after reduction. Some characteristics of the original and reduced networks are shown in Table 1; the node count of the reduced networks coincides with the number of entities considered after missing value removal. In the following sections, we describe how to analyze these different homogeneous cancer networks by fitting them to SBMs.
Biological networks are known to be modular and hierarchical. Different molecular entities, such as genes, mRNAs, and proteins, are interconnected and form different modules to fulfill a specific function. Modularity can convey more robustness to the overall system, e.g. by preventing perturbations in single modules to spread fast and to cause erroneous behavior in other modules and thus functions. Hierarchies capture two characteristics of biological systems: (i) the ordered combination of functions, i.e., multiple simple functions resulting in more complex behavior or responses; and (ii) the inherent levels of complex organization of life, from single molecules to cell organelles, cells, tissue, organs and whole organisms.
The stochastic block model (SBM) is the simplest form of a generative network model based on communities, i.e., group structures of the nodes. Thereby, nodes are assigned to blocks according to their connectivity properties (Figure 2A left); the block associations of two nodes fully determine the probability of an edge between them. A shortcoming of the classical SBM for representations of real networks is that nodes within one block need to have similar degrees. The degree-corrected version of the SBM42 accounts for that and enables different degrees for nodes within a block. Another extension of the SBM is its hierarchical version, in which the blocks are further partitioned into blocks of higher levels25 (Figure 2A right). This model is especially suitable to represent large networks with many nodes. Fitting biological networks to hierarchical (also called nested) SBMs is therefore most appropriate.
We assessed which of the four following types of stochastic block models could best represent the biological networks: the classical SBM, the degree-corrected SBM, the hierarchical SBM or the degree-corrected and hierarchical SBM (Figure 2B). We used the Python module graph-tool32 to fit SBMs to the networks, i.e., to find which partition (basal and/or hierarchically ordered) describes the network best as SBM.
Note that we also examined the performance of weighted stochastic block models for our purpose of edge prediction as they have been successfully employed before for non-biomolecular networks18,27,43. However, the characteristics of the optimal weighted SBM (number of blocks) were severely impacted by prior assumptions on the edge weight distributions, and the derived edge confidence scores did not coincide well with evidence on edge relevance given by the correlations of the edges for fully-connected weighted networks (see Figure S144). Taking in addition the increased computational effort for fitting the weighted SBM compared to the binary version into account, we restricted our further analyses to non-weighted SBMs.
The model fit is performed following the rationale of Occam’s razor: The simplest model describing the data is the best. Thus, we searched for the partition that minimized the description length, i.e., the amount of information necessary to describe the network as an SBM. Additional information required to capture degree-correction and/or hierarchies compared to the classical SBM is thereby taken into account. Consequently, it can be directly concluded which of the four SBM model types is most appropriate for a certain network: The one with the lowest minimal description length. The optimization of the description length runs via an agglomerative Markov Chain Monte Carlo algorithm33. It is non-deterministic and multiple initiations of the underlying partition of the SBM are required to obtain globally instead of locally optimal partitions. We performed optimizations for 500 initial partitions for each network and SBM type.
For all four mRNA and protein networks, the classical SBM delivered the worst fit and the degree-corrected, hierarchical SBM fitted best (Figure 2B left and middle). Degree correction did not prove necessary to describe the metabolite networks, for which the hierarchical SBMs fitted slightly better than the classical SBM (Figure 2B right). A graphical representation of the best fitting SBMs in circular layout, showing the blocks from the lowest layer (for mRNA, protein) or the metabolites (metabolite network) and their connections in color, and the hierarchical structure on top in black, is given in Figure 2C.
We wanted to assess how well the biological content of the biomolecular networks is captured in the stochastic block models, i.e., if the clustering of the nodes into the blocks is biologically meaningful. To that purpose, we estimated whether the blocks show common biological function based on Reactome or KEGG pathways. In particular, we performed overrepresentation analyses of Reactome pathways for the blocks at each level of the SBMs of the four protein and mRNA networks, and overrepresentation of KEGG pathways for the blocks predicted for the two metabolite networks. We find that a high percentage of blocks in each level has at least one associated Reactome or KEGG pathway, i.e., the genes of the pathway are overrepresented within the block, they occur more frequently than expected by chance (Figure 3A bars). Except for the highest hierarchy levels that consist only of a few blocks, this percentage is decisively higher than for a random clustering of exactly the same structure (results for 3 random clusterings shown as black crosses in Figure 3A).
Many blocks, however, have not only one but multiple Reactome pathways assigned. If blocks are biologically meaningful, we would expect to observe similar Reactome pathways within blocks but less similar pathways when comparing the pathways of different blocks.
The dissimilarity of Reactome pathways is naturally represented by their distance in the Reactome familial hierarchy structure; the more distant pathways are, the less connected are their biological functions. We used two closely related measures to assess it: (i) the distance of two pathways is the length of the shortest path in the hierarchy tree from one pathway to the other, and (ii) the higher the hierarchy level of the lowest common ancestor (least common subsumer) of two pathways, the more distant they are. Please note that we restricted this analysis to the mRNA and protein networks because the KEGG pathways employed for metabolite data are only assorted into a very shallow hierarchy.
We compared the distances of Reactome pathways using the average of the two distance measures over pairs of Reactome pathways associated with one block (within blocks) and over pairs of Reactome pathways associated with two different blocks (between blocks) for the lowest hierarchy level blocks, as shown in Figure 3B. Thereby, for all four networks and both measures, we observe significantly smaller distances of Reactome pathways within blocks than between blocks (Welch’s test p < 0.01). This suggests that biological function is coherent within and distinct between blocks, thus further enhancing the notion that SBMs represent well the biological function of the networks.
We also examined the relationships of the hierarchies within the SBMs to that of the Reactome pathways. In analogy to the distance of Reactome pathways, we defined the distance of blocks within a hierarchical SBM by counting the number of steps necessary to reach a block from the other (via their lowest common higher-level block). We compared the distance of each pair of blocks within the SBM on the lowest level to the average of the pairwise distances of their associated Reactome pathways (Figure 3C). Therein, we found only a weak positive correlation for Reactome pathway distance measure (i), that is even further reduced if neglecting intra-block coherence, i.e., neglecting blocks with distance zero. We concluded that the hierarchy within the SBM does not strongly coincide with the hierarchy within the Reactome pathways.
In order to assess how the clustering by SBM relates to established clustering techniques in correlation-based networks, we also performed module detection by using the WGCNA package3. Note that for this approach, no model reduction is necessary, which means that we obtain only one result for each data layer. After soft thresholding to enforce scale-free network architecture, we employed WGCNA module detection to obtain comparable numbers of modules as for the SBM-based approach: 333 (mRNA), 109 (protein) and 12 (metabolite) for WGCNA, which are similar to the 438, 113 and 15 blocks detected by SBM for the corresponding scale free networks. Overall, the WGCNA clusterings show a larger diversity of module sizes than those obtained for the SBM approach (Figure 3D). For all three data layers, a higher percentage of WGCNA-derived modules have a biological annotation compared to the blocks from the SBM (compare Figure 3E to 3A). However, for the mRNA layer, many of the blocks also have an enriched pathway annotation for randomly shuffled gene names which hints on reduced significance of the WGCNA results and better performance for the modules detected by fitting to SBM. For protein and metabolites, in terms of detection of biologically annotated modules, the WGCNA approach seems to provide slightly better results than the SBM approach.
The observed differences could stem from conceptual differences in the approaches: Instead of detecting clusters of entities with highly positively correlated abundances only as in WGCNA, nodes from the same block are characterized by common connectivity characteristics in the SBM. Clearly, as proteins interact and bind directly for complex formation or regulation, and metabolites are interconverted into one another, entities which act together tend to have similar abundances and thus the modularization by WGCNA shows good results. For the detection of modules for mRNAs whose interaction can be considered less direct, assorting entities with similar connectivity patterns as in the SBM-derived modules is beneficial. In addition, the WGCNA framework, as most other module detection methods, cannot aid in assessing edge relevance - which is enabled by the fit of the networks to SBMs.
The description of the biomolecular networks as SBMs were exploited to determine a confidence score for each existing or non-existing edge of the network. This score would capture how well the existence (or non-existence) of the edge fits to the network’s description by the SBM: Erroneously kept or removed edges lead to a worse fit of the SBM to the network, and removing or reinstalling these edges lead to fit improvement. Indeed, under certain assumptions (see Methods), the derived edge confidence score is proportional to the actual absolute probability that the edge belongs to the network, and thus signifies relative edge relevance. Therefore, the confidence score could be used to predict whether an edge is spurious (or missing)17,18,27. A high missing edge confidence score suggests that the edge in question is missing and should be restored, it has a high relevance; a high spurious edge confidence score suggests that the edge in question is spurious and should be removed from the network, it has a low relevance. The SBM-based edge confidence scores rely on global network connectivity characteristics and complement the correlation-based weights of the edges which stem from the local, measured characteristics of their nodes.
For each putatively missing and spurious edge, we used the Python module graph-tool32 to compute its edge confidence score (Figure 4). Thereby, we took advantage of the following fact: Entities of degree zero, i.e., nodes that have no connection to any other node in the network after reduction, are indistinguishable to the SBM. Consequently, also all putatively missing edges connecting any of these nodes to a specific second node are only distinguishable by this second node, and thus carry the same missing edge confidence score. This reduces the number of scores we need to compute considerably, depending on the degree of reduction (see counts of entities of degree zero in Table 1). We display the scores for the different types of missing edges in different histograms (Figure 4A, B middle and bottom). For very big networks with a low degree of reduction (the mRNA and protein networks reduced by significance of correlation), it is still computationally not feasible to compute the scores for all missing edges. We therefore resorted to computing it for as many missing edges as we have existing edges in the network (i.e. approx. 2.4×106 for the protein network, 4.6×106 for the mRNA network, see Table 1), and chose those with highest absolute edge weights (Figure 4B left and middle).
Recall that the edge confidence scores are relative, i.e., they serve for comparing relevance between edges only. In addition, computation of the scores relies on the assumption that the partition of the originally fitted SBM is correct for the network. Different edge confidence scores might be obtained for SBMs with different partitions but with similarly good fit to the network. We neglect this complication for the sake of computational efficiency. Still, we have to keep both facts in mind for the interpretation and usage of the edge confidence scores. For example, an evident threshold for declaring an edge as relevant ("missing") or not relevant ("spurious") would be to have the respective edge confidence score larger than zero, as this indicates an improvement in fit quality if adding or removing the edge, respectively. However, we observe an imbalance of our computed scores which is inherent to the approach: Because having less edges reduces the complexity of the underlying network, removing edges preferentially reduces also the amount of information required to describe the SBM, i.e., its description length turns smaller, its goodness of fit improves. Therefore, the edge confidence score distributions of missing edges are shifted to the left - the great majority of edges would be predicted as not missing for the threshold of zero, they are not relevant and should be left out (Figure 4A, B, 2nd line). Conversely, the spurious edge confidence score distributions are shifted to the right - the great majority of existing edges would be predicted as spurious, they are not relevant and should be removed for an edge score threshold of zero (Figure 4A, B, top). Both measures point to making the networks smaller.
In order to determine whether the edge confidence scores are overall a reasonable assessment of edge relevance, we compared the predicted scores to the edge weights of the edges as derived directly from Spearman’s correlation of the measurements of the nodes. It is important to note that these edge weights (correlations) were used exclusively for the reduction of the correlation-based networks. The edge correlations were at no point provided to the SBM, neither during the fitting to the SBM (except for the weighted version and only for Figure S1, Figure S244) nor during the computation of the edge scores. Thus, they are close to an independent validation of the edge relevance, edges with large correlation being more relevant to the system encoded by the network than edges with a correlation close to zero.
Indeed, for all six networks and SBMs, we find an overall positive correlation between the (absolute) edge correlations and the missing edge confidence scores: Edges with high correlation are preferentially predicted as missing, i.e., relevant to the network, also in terms of edge confidence score (Figure 5). Similarly, we find an overall negative correlation between (absolute) edge correlations and spurious edge confidence scores: Edges with high correlation are preferentially not predicted as spurious, i.e., they are predicted as relevant to the network, also in terms of edge confidence score. Consequently, the comparison to the edge correlation suggests that SBM-derived edge confidence scores could be used as additional information for assessing the relevance of edges for multiple omics correlation-based networks.
Using example cases of correlation-based transcriptomic, proteomic and metabolomics networks from breast cancer tumour samples, here we show how stochastic block models can be employed for the analysis of biomolecular networks. The networks can be best represented by the hierarchical version of the stochastic block models. This gives rise to biologically meaningful separation of the biomolecules into many functionally relevant blocks. In addition, the SBM framework enables the computation of edge confidence scores that can be used to predict missing or spurious links.
The representation of the networks by SBMs poses a challenge: The model fit and the derivation of edge confidence scores can be computationally very demanding, especially for networks with many edges. This was the case here for the mRNA networks and the protein network reduced by significance of correlation. Therefore, the approach of network analysis by SBM seems most suitable for smaller and/or sparser networks. On the other hand, it delivers two opportunities. First, modules derived from blocks in SBMs are not only defined as clusters of tight relationships, i.e. co-expression clusters, as obtained with other clustering approaches3,9,14. In an SBM, nodes from the same block are characterized by common connectivity characteristics, i.e., common interaction profiles with nodes from the same and from other blocks. Consequently, comparing SBM-derived modules between conditions naturally points towards detecting altered regulations. This has not been investigated here, the biological role and disease-relevance of the blocks in our example case of ER- breast cancer still need to be assessed.
Second, the derived edge predictions can reproduce the interaction strengths as estimated from measurements. SBM-based edge confidences tend to score missing edges higher that have high correlation, i.e., that would be considered a strong regulation, an important edge. Analogously, existing edges in the network with low correlation tend to be predicted as spurious with high confidence, i.e., as dispensable. Keep in mind that the correlations corresponding to the edges were not provided at any point to the SBM construction. Thus, the SBM approach proves very strong in delivering relevant edge predictions. A further biological validation of the predicted edge confidence scores, for example by comparison to interaction databases, would be an interesting next step.
Still, a question remains: How should the edge confidence scores be translated into edge predictions to alter the network, i.e. to actually remove or re-install edges? There is a natural threshold for declaring an edge as missing or spurious, namely if the respective confidence score would be larger than zero. However, due to the minimal description length approach for SBM-based edge relevance assessment, the confidence scores are shifted towards reducing the networks as much as possible (Figure 4), such that this natural threshold for deriving the prediction of actual missingness or spuriousness from the confidence scores is not valid. Further examinations on other possible thresholds are required. However, for relative comparisons of relevance between edges in a network the SBM-based scores are suitable.
Additional directions of SBM-based analysis of biomolecular networks remain to be explored.
(i) For edge relevance assessment in other network types, it has been proposed that edge predictions with SBMs are more reliable if resorting to an ensemble of good fits instead of using the best fit only27. Due to the sizes of the employed correlation-based biomolecular networks, this approach is computationally not of practical relevance here, but it would be an interesting point to assess in the future.
(ii) We considered the weighted version of the SBM that could be an interesting option for the analysis of biomolecular networks because it enables representing fully-connected weighted networks as SBM without prior reduction. It could prove exciting especially for smaller networks, e.g. primary metabolites, or in more targeted data analysis approaches. However, in our case, we found hints that the weighted version of the SBM might not be appropriate for the task of edge prediction from SBM-based confidence scores for fully-connected networks (see Figure S144). For reduced networks, the results for weighted SBMs seem more promising (see Figure S244). A final assessment on the usefulness of the weighted version of the SBMs is still pending, as long as nothing is known about interaction strength distributions between and within modules: The fitted SBM strongly depends on the prior assumptions chosen for these distributions. An extensive analysis of different choices would be required, which is beyond the scope of this study.
We propose to employ SBM-based modules and edge confidence scores as additional pieces of information to make the results of follow-up analyses more robust that rely on relationships between nodes, i.e., on edges and their strength. Edges and their strength play a key role in interpreting biomolecular effects, e.g. of mutations or drugs. If a drug “activates a protein”, this typically means alteration of the protein’s interaction strengths with other proteins or the DNA. The “mutation of a gene” may severely alter the binding properties of the corresponding translated protein to interaction partners. In sum, interactions and their strengths are at the heart of biologically relevant alterations in biomolecular networks, and characterizing the former is required in order to understand the effects of the latter. The SBM framework enables assessing the edge relevance or interaction strength on the basis of consistent, global network characteristics, instead of on the basis of correlation of measurements. Especially for personalized analyses which generally rely on a characterization by only few error-prone measurements of each molecule, this will be crucial to derive more reliable predictions.
In this study, data from TCGA and CPTAC was used. Proteomics data stem from 41, metabolomics data stem from 29. The metabolomics raw data can only be obtained upon accessing the cited article29; processed data (Spearman’s correlations and associated p-values) can be found in the gitlab repository below.
Code used to perform the analyses together with a detailed work-flow documentation: https://gitlab.com/biomodlih/sbm-for-correlation-based-networks
Archived code as at time of publication: http://doi.org/10.5281/zenodo.262041944
License: GNU GPLv3 license.
Zenodo: Analysis of correlation-based biomolecular networks from different omics data by fitting stochastic block models. http://doi.org/10.5281/zenodo.262041944. This project contains the following extended data files:
Figure S1: The weighted SBM seems not appropriate for edge prediction from edge confidence scores for fully connected networks
Figure S2: Edge predictions for a reduced weighted network with planar or hierarchical weighted SBMs
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
This research was funded by Luxembourg National Research Fund, FNR, SINGALUN project. KB acknowledges funding by an Add-on Fellowship for Interdisciplinary Life Sciences of the Joachim Herz Stiftung.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, functional genomics, transcriptomics
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Statistics, Computational biology, Data analysis, Genomics.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics, systems biology
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 2 (revision) 27 Aug 19 |
read | read | |
Version 1 14 Apr 19 |
read | read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)