Disparity-filtered differential correlation network analysis: a case study on CRC metabolomics

Abstract
Differential network analysis has become a widely used technique to investigate changes of interactions among different conditions. Although the relationship between observed interactions and biochemical mechanisms is hard to establish, differential network analysis can provide useful insights about dysregulated pathways and candidate biomarkers. The available methods to detect differential interactions are heterogeneous and often rely on assumptions that are unrealistic in many applications. To address these issues, we develop a novel method for differential network analysis, using the so-called disparity filter as network reduction technique. In addition, we propose a classification model based on the inferred network interactions. The main novelty of this work lies in its ability to preserve connections that are statistically significant with respect to a null model without favouring any resolution scale, as a hard threshold would do, and without Gaussian assumptions. The method was tested using a published metabolomic dataset on colorectal cancer (CRC). Detected hub metabolites were consistent with recent literature and the classifier was able to distinguish CRC from polyp and healthy subjects with great accuracy. In conclusion, the proposed method provides a new simple and effective framework for the identification of differential interaction patterns and improves the biological interpretation of metabolomics data.


Abbreviations
AUROC: area under the receiver operating characteristic curve
CRC: colorectal cancer
CV: cross-validation
FDR: false discovery rate
FNR: false negative rate
LC-MS/MS: liquid chromatography coupled with tandem mass spectrometry
NPV: negative predictive value
PLS-DA: partial least squares-discriminant analysis
PPV: positive predictive value
VIP: variable importance in projection

Introduction
Moreover, although the metabolic networks of several organisms, including eukaryotes, bacteria and archaea, show a power-law degree distribution and a high clustering coefficient [1], metabolite correlation networks have not been fully characterized in terms of network topology [16].
In this paper, we define and apply a new, simple but effective workflow to conduct a weighted differential correlation network analysis on metabolomics data. We also propose a classification model based on the information in the inferred network, which measures the explanatory power of the resulting network model and evaluates whether the identified differential associations have discriminative power. The main novelty of this work lies in the use of the so-called disparity filter [17] to reduce the network to its connection backbone. Once the differential network is built, this filter allows the identification of edges that are relevant with respect to a null model, exploiting the heterogeneity present in the intensity (weights) of the differential links, at both the global and local levels, without down-playing small-scale interactions. The advantage of the disparity filter is that the obtained backbone preserves almost all nodes of the initial network and a large fraction of the total weight, while considerably reducing the number of links. Moreover, this procedure does not need any assumption on the distribution of the data.
For the classification task, we chose to adopt the partial least squares-discriminant analysis (PLS-DA) method, since it is one of the most popular classification methods in metabolomics. To the best of our knowledge, the application of the disparity filter to differential network analysis and its coupling with PLS-DA are original in the metabolomic literature.
We applied our method to a publicly available metabolomic dataset on colorectal cancer (CRC) [18], which includes 234 serum samples from healthy controls and patients with colorectal cancer or polyps. The proposed differential network analysis allowed the identification of hub metabolites and differential association patterns that were consistent with current knowledge and with the original paper. The PLS-DA classifiers were able to distinguish CRC samples from both healthy and polyp samples with good accuracy, suggesting that the identified differential networks can be meaningful and have discriminative power.

Architecture and implementation
In metabolomic studies at least two phenotypes are compared, e.g., cases versus controls. The differential network analysis framework proposed here consists of three main steps: the inference of a weighted differential association network (based, for the sake of simplicity, on Pearson's correlation), the reduction of this network to its backbone by application of the disparity filter, and the construction of a model that exploits the information obtained from the differential network to classify the two phenotypes.

The weighted differential correlation network
For each pair of metabolites i and j, we define the differential association measure as follows:

$$\mathrm{diff}(i,j) = \left| \rho_{\mathrm{cases}}(i,j) - \rho_{\mathrm{controls}}(i,j) \right|^{\beta}$$

where $\rho_{\mathrm{cases}}(i,j)$ and $\rho_{\mathrm{controls}}(i,j)$ are the Pearson's correlation coefficients for metabolites i and j in the two conditions, respectively, and the exponent $\beta > 1$ is chosen to push low differences towards 0, while conserving high values. This differential quantity is a power function of the absolute change of correlation across the two conditions.

The statistical significance of each differential association is assessed using a 1000-fold permutation test: briefly, for each permutation, the samples are randomly allocated to one or the other status, in order to remove the relation between the metabolites' correlations and the phenotypes. The statistical significance (p-value) of $\mathrm{diff}(i,j)$ is estimated as the proportion of the permuted differential associations that are greater than the observed value calculated on the original data. We then define a differential network in which the nodes are metabolites and two metabolites are linked if and only if the p-value of their differential association measure is lower than the fixed cut-off of 0.05. Note that the network inferred in this way is naturally weighted on the edges, since each edge carries the value of the corresponding differential association measure.

Despite the limitation that Pearson's correlation considers only linear interactions, this method infers a differential network using a widely known statistical tool, which makes the model interpretable, and it does not require any conditions on the distribution of the data.
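The measure and its permutation test can be illustrated for a single metabolite pair. The following is a minimal sketch in plain Python, not the authors' R implementation: the function names, the 0/1 label coding and the default exponent are assumptions made for illustration, and the measure is taken to be the absolute difference of the two Pearson correlations raised to a power greater than 1.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def diff_association(xi, xj, labels, beta=4):
    """|r_cases - r_controls| ** beta for one metabolite pair.

    labels uses 1 for cases and 0 for controls (an assumed coding).
    """
    cases = [k for k, lab in enumerate(labels) if lab == 1]
    controls = [k for k, lab in enumerate(labels) if lab == 0]
    r_cases = pearson([xi[k] for k in cases], [xj[k] for k in cases])
    r_controls = pearson([xi[k] for k in controls], [xj[k] for k in controls])
    return abs(r_cases - r_controls) ** beta


def permutation_pvalue(xi, xj, labels, beta=4, n_perm=1000, seed=0):
    """Observed measure and the share of label shuffles that exceed it."""
    rng = random.Random(seed)
    observed = diff_association(xi, xj, labels, beta)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign samples to the two groups at random
        if diff_association(xi, xj, shuffled, beta) > observed:
            exceed += 1
    return observed, exceed / n_perm
```

An edge between the two metabolites would be kept when the returned p-value is below 0.05, with the observed measure as its weight. The sketch assumes that no group ever has zero variance for either metabolite, since the correlation is undefined in that case.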

The disparity filter
Given the complex nature of correlation in metabolomics, a filtering step is necessary to extract the relevant information from the inferred differential network. Filtering techniques yield a reduced but more relevant representation while preserving the key differential connections. For this purpose, we propose the use of the disparity filter presented in [17], a network reduction method that exploits the weighted nature of the differential network and operates at all the scales present in the system. The disparity filter analyses the edges at the level of each node and preserves only those whose weights are unexpectedly high, that is, those that represent significant deviations with respect to a null hypothesis of uniform randomness. As a result, this filtering technique significantly reduces the total number of edges while keeping a large fraction of the total weight and, unlike a global threshold filter, it preserves the form of the weight distribution and the clustering coefficient.
The filtering method starts by normalizing the weights $w_{ij}$ of the $n_i$ edges linked to a certain metabolite i, as follows:

$$p_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}}$$

where the sum runs over the $n_i$ neighbours k of node i. The heterogeneity in the local distribution of the weights of the edges incident on i is characterized by the disparity measure [17]:

$$Y(n_i) = n_i \sum_{j} p_{ij}^{2}$$

This function has been extensively used in complex network theory and it characterizes local heterogeneity. If all the links have the same weight, we are in a situation of perfect homogeneity and $Y(n_i) = 1$, whereas for perfect heterogeneity, i.e., when one of the links carries all the weight, $Y(n_i) = n_i$.
In real networks, we usually observe an intermediate behaviour, in which the disparity measure is proportional to a power function of the node's degree with exponent close to 1/2. As reported in [17], this is the situation in which the disparity filter is most useful.
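The two limiting cases of the disparity measure can be checked numerically. The sketch below is only an illustration under the assumption that the measure is the node degree times the sum of the squared normalized weights, as in [17]; the function names are hypothetical.

```python
def normalized_weights(weights):
    """Divide each edge weight of a node by the node's total weight."""
    total = sum(weights)
    return [w / total for w in weights]


def disparity(weights):
    """Degree times the sum of squared normalized weights of one node."""
    p = normalized_weights(weights)
    return len(p) * sum(q * q for q in p)


# Four equal weights: perfect homogeneity, the measure equals 1.
homogeneous = disparity([1.0, 1.0, 1.0, 1.0])
# One edge carries all the weight: perfect heterogeneity, the measure
# equals the degree (4 here).
heterogeneous = disparity([1.0, 0.0, 0.0, 0.0])
```

Intermediate weight profiles give values between 1 and the node degree, which is the regime described above for real networks.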
After normalizing the weights, the disparity filter proceeds by identifying which links of each node i should be preserved in the network. The null model used for this discrimination is based on the null hypothesis that the $n_i$ normalized weights are produced by a random assignment from a uniform distribution. All the edges that reject the null hypothesis, i.e., those with weights not compatible with the null model, can be considered as significant deviations due to the network-organizing principles. By imposing a cut-off $\alpha$, the relevant edges for a node i will be those whose normalized weights $p_{ij}$ satisfy the relation [17]:

$$\alpha_{ij} = 1 - (n_i - 1) \int_{0}^{p_{ij}} (1 - x)^{n_i - 2} \, dx = (1 - p_{ij})^{n_i - 1} < \alpha$$

By lowering the parameter $\alpha$, we can filter out the links progressively, focusing on more relevant edges. The network backbone is therefore obtained by preserving all the edges that satisfy the above criterion for at least one of the two nodes they connect, while discarding the rest.
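The filtering rule can be sketched for a small weighted network. This is an illustrative implementation, not the code used in the study: it assumes the null-model criterion of [17] in its closed form, in which an edge with normalized weight p at a node of degree n is significant when (1 - p) ** (n - 1) is below the cut-off, and it represents the network as a hypothetical dictionary mapping node pairs to weights.

```python
from collections import defaultdict


def disparity_backbone(edges, alpha=0.3):
    """Keep edges that are significant for at least one of their endpoints.

    edges maps an undirected pair (i, j) to a positive weight.
    """
    strength = defaultdict(float)  # total weight incident on each node
    degree = defaultdict(int)      # number of edges incident on each node
    for (i, j), w in edges.items():
        for node in (i, j):
            strength[node] += w
            degree[node] += 1

    backbone = {}
    for (i, j), w in edges.items():
        for node in (i, j):
            p = w / strength[node]
            # Probability, under uniformly random weights, of a normalized
            # weight at least as large as p. For a degree-1 node this is 1,
            # so such a node never retains an edge on its own.
            p_value = (1.0 - p) ** (degree[node] - 1)
            if p_value < alpha:
                backbone[(i, j)] = w
                break
    return backbone


# Toy star network: one dominant edge and three weak ones.
toy = {("a", "b"): 10.0, ("a", "c"): 1.0, ("a", "d"): 1.0, ("a", "e"): 1.0}
kept = disparity_backbone(toy, alpha=0.3)
```

In this toy network only the dominant edge between a and b survives at a cut-off of 0.3, while a looser cut-off such as 0.9 keeps every edge.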

The classification model
Once the differential network analysis has been performed, it is also interesting to study the explanatory power of the proposed differential method and to evaluate whether the extracted information can be useful for classification. With this purpose in mind, we define a novel set of features that "translates" the properties of the filtered differential network's backbone. It consists of:
- the nodes that are connected to at least one other metabolite in the backbone,
- the interaction term, i * j, for each preserved edge in the backbone linking metabolites i and j.
This novel set of features can then be used to train a classification model. For this purpose, we chose PLS-DA, since it is widely adopted by the metabolomics community and, being a dimensionality-reduction technique, it can handle the intrinsic multicollinearity of the novel dataset. However, other suitable classification methods may be chosen.
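The construction of the network-based feature set can be sketched for a single sample. The example assumes that the interaction term is the product of the two metabolite values; the function name, the feature naming scheme and the numbers are hypothetical.

```python
def network_features(sample, backbone_edges):
    """Build backbone-derived features for one sample.

    sample maps metabolite names to (log-transformed) values;
    backbone_edges is a list of (i, j) pairs kept by the filter.
    """
    # Metabolites connected to at least one other metabolite in the backbone.
    nodes = sorted({m for edge in backbone_edges for m in edge})
    features = {m: sample[m] for m in nodes}
    # One interaction term per preserved edge.
    for i, j in backbone_edges:
        features[f"{i}*{j}"] = sample[i] * sample[j]
    return features


# Hypothetical values, for illustration only.
sample = {"lactate": 2.0, "tryptophan": 3.0, "orotate": 5.0}
features = network_features(sample, [("lactate", "tryptophan")])
```

Metabolites that do not appear in the backbone, such as orotate in this toy sample, are left out of the feature set.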
For the evaluation of the proposed method, we used a published metabolic dataset on colorectal cancer (CRC) [18]. The dataset is publicly available at the NIH Common Fund's National Metabolomics Data Repository (NMDR) website, the Metabolomics Workbench, https://www.metabolomicsworkbench.org, where it has been assigned the Project ID PR000226. The data can be accessed directly via its Project DOI: https://doi.org/10 .21228/jib-2021-0030.
Briefly, metabolomics was performed on serum samples of 234 subjects (both healthy subjects and patients) undergoing either colonoscopy or CRC surgery; samples were collected after overnight fasting and bowel preparation. The groups consisted of healthy controls (n = 92), CRC patients (n = 66), and patients with colorectal polyps (n = 76), based on colonoscopy examination results. Patients were age- and gender-matched in each group. A targeted liquid chromatography-tandem mass spectrometry (LC-MS/MS) approach was used for comprehensive CRC serum metabolic profiling under a standard operating procedure. In total, 113 metabolites were reliably detected [18].
In this analysis, after a log-transformation, we randomly split the dataset into training (70%) and test set (30%) with equal balance among the groups and applied the proposed method for differential network analysis to the training set.
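A group-balanced random split of this kind can be sketched as follows. The function name, the seed and the rounding rule are assumptions, so the resulting group sizes need not match those of the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices so each group keeps the same train fraction."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_group[lab].append(idx)

    train, test = [], []
    for lab in sorted(by_group):
        idxs = by_group[lab]
        rng.shuffle(idxs)
        cut = round(train_fraction * len(idxs))  # per-group training size
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


# Group sizes taken from the dataset description.
labels = ["healthy"] * 92 + ["CRC"] * 66 + ["polyp"] * 76
train_idx, test_idx = stratified_split(labels)
```

Splitting within each group, rather than over the pooled samples, is what keeps the group proportions the same in the training and test sets.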
For the differential association measure, the power function parameter was set to 4 to obtain a scale-free distribution of the weights. Moreover, we recall that the disparity filter was applied to a network containing only statistically significant edges, as assessed by the permutation test. Therefore, in order to considerably reduce the number of links while maintaining most of the total weight, we set the cut-off of the disparity filter to 0.3. The topology of the networks was then exploited to identify key metabolites in the differential networks. Two centrality measures were considered: the node degree, i.e., the number of edges incident on a node, and the betweenness, i.e., the number of shortest paths between pairs of other nodes that pass through a node [19]. In simple terms, nodes with high degree are usually referred to as "hubs", since they have more connections than the rest of the nodes in the network, while nodes with high betweenness may be considered "bottlenecks", since they are crucial in controlling the information flow.
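The two centrality measures can be computed on a backbone with a few lines of plain Python. The sketch below uses Brandes' algorithm for betweenness and treats the backbone as undirected and unweighted, which is an assumption since the text does not state whether edge weights enter the shortest paths. When several shortest paths join a pair of nodes, each contributes a fraction.

```python
from collections import defaultdict, deque


def degree_and_betweenness(edges):
    """Degree and unnormalized betweenness for an undirected edge list."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    nodes = sorted(adj)

    betweenness = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0.0)
        dist = dict.fromkeys(nodes, -1)
        sigma[s], dist[s] = 1.0, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]

    # Each unordered pair was visited from both of its endpoints.
    betweenness = {v: b / 2.0 for v, b in betweenness.items()}
    degree = {v: len(adj[v]) for v in nodes}
    return degree, betweenness


# Path a-b-c-d: the two inner nodes lie on the shortest paths.
deg, btw = degree_and_betweenness([("a", "b"), ("b", "c"), ("c", "d")])
```

In the four-node path, b and c each have degree 2 and betweenness 2, while the end nodes have degree 1 and betweenness 0.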
The two differential network analyses (CRC versus healthy subjects and CRC versus polyp subjects) were performed on the training set. Then, for each comparison, we considered the novel dataset, enriched with the network backbone information as explained above, and we trained a PLS-DA model on the training set, aiming to classify subjects according to their phenotype. To avoid overfitting and improve the reproducibility of the results, the classifier was validated through 20-times repeated 10-fold cross-validation. The optimal number of dimensions for the PLS-DA model was determined by cross-validation. We then tested the final model on the remaining subjects (test set). The performance during cross-validation and testing was measured by accuracy and AUROC. The variable importance in projection (VIP) measure was adopted to evaluate the contribution of each variable to the classification models.
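The resampling scheme can be sketched independently of the classifier. The generator below yields the training and held-out indices of a 20-times repeated 10-fold cross-validation; keeping the class proportions equal across folds is an assumption, as are the function name and the seed, and the PLS-DA fit itself is not shown.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=20, seed=0):
    """Yield (training, held_out) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_group[lab].append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for lab in sorted(by_group):
            idxs = by_group[lab][:]
            rng.shuffle(idxs)  # new random partition at every repetition
            # Deal the samples of each class across the folds in turn.
            for pos, idx in enumerate(idxs):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            held_out = sorted(folds[k])
            training = sorted(i for f in folds[:k] + folds[k + 1:] for i in f)
            yield training, held_out


# Toy labels: 30 samples of one class and 20 of the other.
splits = list(repeated_stratified_kfold([0] * 30 + [1] * 20))
```

Each of the resulting 200 splits would be used to fit the model on the training indices and score it on the held-out ones, with accuracy and AUROC then averaged over the splits.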
The analyses were conducted using the R software version 4.0.5.

Results
The proposed workflow was applied to a previously published metabolomic dataset that involved patients with CRC versus healthy controls or subjects with colorectal polyps.

CRC versus healthy subjects
The differential network of the statistically significant differential correlations (by permutation test) between CRC and healthy subjects consists of 111 nodes and 450 edges (Figure 1A). The degree distribution showed high variability, ranging from 0 to 21 edges per node, with an average degree of 8. At the local level, the heterogeneity of the nodes presented a skewed distribution and, as a function of the node degree, the disparity measure $Y(n_i)$ was found to be proportional to $n_i^{1/2}$ (Figure 1B). This means that, for most of the nodes, the incident weight is concentrated on a small number of links, while the remaining connections carry just a small fraction of the node's strength. In this situation, the disparity filter is particularly useful, extracting structures that are impossible to detect using the more common global threshold filter.
The network's backbone obtained after applying the disparity filter with a threshold of 0.3 consisted of one connected component of 100 nodes and 158 edges, preserving just 35% of the edges but 64% of the total weight. By analysing the topology of the extracted backbone, we were able to detect 7 central metabolites, which are both hubs and bottlenecks for the differential interactions of the resulting network (Figure 2A). These metabolites are related to amino acid metabolism (glutaric acid, kynurenic acid and tryptophan) and energy metabolism (lactate and adenosine monophosphate), plus glucuronic acid and glycocholate. Although these results are in agreement with the published findings, it is worth noting that only two of these metabolites (glycocholate and kynurenic acid) showed a significant difference in distribution (Mann-Whitney test p-value < 0.05) and were therefore considered relevant in the original paper [18, p. 4123]. The other five metabolites, while not significantly different in concentration and therefore not detectable with standard analysis, were central in the differential network, suggesting a role of these metabolites in CRC occurrence.
To obtain a predictive model and at the same time to assess the explanatory power of the resulting network model, we trained a network-based PLS-DA for the classification of CRC and healthy subjects, as detailed in the previous section. The classification model was able to distinguish the two groups (Figure 2B), with high accuracy and AUROC (95% and 0.98 from cross-validation; 80% and 0.81 on the independent test set, respectively). The other performance measurements on the test sub-cohort, such as the F1 score, are reported in Table 1; they suggest that the differential information extracted with the disparity filter is meaningful and plays a significant role in the discrimination. As shown in Table 2, the variables with higher discriminative power according to VIP score were also significantly different in distribution (Mann-Whitney test p-value < 0.05, after FDR adjustment).

CRC versus polyp subjects
In the comparison between CRC and polyp subjects, we observed a similar performance of the proposed algorithm. The differential network of the statistically significant differential correlations consists of 112 nodes and 462 edges (Figure 3A). The average node degree was 7, ranging from 0 to 30 edges per node and, regarding the disparity measure, we found that $Y(n_i) \propto n_i^{3/5}$ (Figure 3B). This again supports the usefulness of the disparity filter in the analysis of this type of data.
After applying the disparity filter with a threshold of 0.3, we filtered out 64% of the edges, while preserving 65% of the total weight. Figure 4A shows the most central metabolites according to degree and betweenness. Among them, there are metabolites related to energy metabolism (lactate, oxalic acid), amino acids (orotate, 2-aminoadipate) and the purine metabolism pathway (xanthosine, inosine monophosphate), consistent with the original study [18]. Only two of these central metabolites (glycochenodeoxycholate and orotate) showed a significant difference in distribution (Mann-Whitney test p-value < 0.05), indicating once again that the methodology proposed here offers a complementary view with respect to standard statistical analysis.
The PLS-DA classifier was able to distinguish the two groups with an accuracy of 95% and an AUROC of 0.97 from cross-validation. Despite the expected slight decrease in performance, the classification model also obtained good results on the independent test set (accuracy 80%, AUROC 0.83; see Table 1). Among the top 25 features that contribute the most to this classification according to VIP score (Table 3), there were 6 nodes and 19 edges, most of them significantly different between the two groups (Mann-Whitney test p-value < 0.05, FDR adjusted), confirming that the detected relevant interaction terms play a role in the discrimination in this case as well.

Figure 4: Panel (A), the importance of nodes preserved in the differential network's backbone between CRC and polyp subjects, characterized by betweenness centrality (Btw, y-axis) and node degree (x-axis). Key nodes with high degree and high betweenness (degree * betweenness > 0.5) are labelled with their metabolite names. Panel (B), scores plot of the PLS-DA classification model between CRC subjects and polyp controls.

Conclusions
A complex disease phenotype, like cancer, alters different biological mechanisms that interact in a network [1]. Here, we developed a simple but effective framework to perform differential network analysis and applied it to a published CRC metabolomics dataset. By focusing on differential interactions rather than differential concentrations, differential network analysis offers a complementary perspective with respect to standard analysis techniques and has become an important tool for the analysis of the underlying pathophysiological processes. The evaluation of the CRC dataset showed that the proposed method provides useful insights into the backbone of the differential interactions between two phenotypes (presence or absence of cancer) and that the extracted features can be used to classify subjects. Compared to the original analyses, this methodology revealed important alterations of the interaction network occurring in CRC with respect to both healthy and polyp subjects, and it allowed the identification of several novel metabolites that were central in the differential information flow, although further validation will be necessary.
In conclusion, the proposed method is an easy-to-use novel approach for reconstruction and analysis of differential association networks and may constitute a first step towards inferring causal relationships and discovering novel candidate biomarkers.