The ‘un-shrunk’ partial correlation in Gaussian graphical models

Background In systems biology, it is important to reconstruct regulatory networks from quantitative molecular profiles. Gaussian graphical models (GGMs) are one of the most popular methods to this end. A GGM consists of nodes (representing the transcripts, metabolites or proteins) inter-connected by edges (reflecting their partial correlations). Learning the edges from quantitative molecular profiles is statistically challenging, as there are usually fewer samples than nodes ('high-dimensional problem'). Shrinkage methods address this issue by learning a regularized GGM. However, it remains open how the shrinkage affects the final result and its interpretation.

Results We show that the shrinkage biases the partial correlation in a non-linear way. This bias does not only change the magnitudes of the partial correlations but also affects their order. Furthermore, it makes networks obtained from different experiments incomparable and hinders their biological interpretation. We propose a method, referred to as 'un-shrinking' the partial correlation, which corrects for this non-linear bias. Unlike traditional methods, which use a fixed shrinkage value, the new approach provides partial correlations that are closer to the actual (population) values and that are easier to interpret. This is demonstrated on two gene expression datasets from Escherichia coli and Mus musculus.

Conclusions GGMs are popular undirected graphical models based on partial correlations. The application of GGMs to reconstruct regulatory networks is commonly performed using shrinkage to overcome the 'high-dimensional problem'. Besides its advantages, we have identified that the shrinkage introduces a non-linear bias in the partial correlations. Ignoring this effect of the shrinkage can obscure the interpretation of the network and impede the validation of earlier reported results.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04313-2.

A GGM consists of a network structure of nodes (representing genes, transcripts, metabolites, or proteins), which are inter-connected by edges reflecting significant partial correlations. Partial correlations measure linear associations between pairs of random variables, where the contribution from the remaining variables is adjusted for. Unlike relevance networks (RNs), which are based on Pearson's correlation, GGMs remove the spurious correlations caused by confounded variables (e.g. when genes share a common regulator). Compared to Bayesian networks (BNs), GGMs scale up more efficiently to large network analyses and often yield comparable network reconstruction accuracies [5]. Although the edges in a GGM are undirected, there are various methods to learn their directions [6, 7]. These features have made GGMs a popular tool in bioinformatics and biomedical studies of colon cancer [8], immunological diseases [9], diabetes [10], respiratory diseases [11], functional connectivity between brain regions [12], and chronic mental disorders [13].
Partial correlations can be computed from the (standardized) inverse of the covariance matrix (i.e. the precision matrix). In practice, the covariance matrix is unknown and has to be estimated from data. The estimated covariance matrix must be well-conditioned to ensure that its inverse exists, and that numerical (or estimation) errors are not magnified during its computation. The sample covariance, as obtained from a dataset of n samples and p variables, is (i) invertible and well-conditioned when n is greater than p, (ii) invertible but ill-conditioned when n is comparable to p, and (iii) not invertible when n is smaller than p [14]. The last case is known as the 'high-dimensional problem', 'small n, large p', or 'n ≪ p'. This scenario is common in omics studies, where often a large set of genes, proteins or metabolites is quantified in relatively few samples.
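Both points above — partial correlations as the standardized negated precision matrix, and the rank deficiency of the sample covariance when n < p — can be illustrated with a minimal numpy sketch. The covariance values and dimensions below are illustrative choices, not data from the paper:

```python
import numpy as np

def partial_correlations(cov):
    """Standardize the negated inverse covariance (precision) matrix."""
    omega = np.linalg.inv(cov)
    d = 1.0 / np.sqrt(np.diag(omega))
    pcor = -omega * np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Markov chain X -> Y -> Z: all marginal correlations are non-zero,
# but X and Z are conditionally uncorrelated once Y is adjusted for.
C = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])
P = partial_correlations(C)
print(abs(P[0, 2]) < 1e-10)   # True: the spurious X-Z correlation vanishes

# 'Small n, large p': a centered sample covariance has rank <= n - 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))          # n = 5 samples, p = 10 variables
S = np.cov(X, rowvar=False)           # 10 x 10 matrix of rank <= 4
print(np.linalg.matrix_rank(S) < 10)  # True: the inverse does not exist
```

The second half shows why case (iii) blocks the direct computation of partial correlations: the sample covariance of 5 samples over 10 variables is singular by construction.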
Shrinkage approaches are widely applied to deal with the 'high-dimensional problem'. They produce a more stable estimator at the cost of some bias. The most popular shrinkage approaches for estimating GGMs are the graphical least absolute shrinkage and selection operator (gLASSO) [15] and the Ledoit-Wolf (LW) shrinkage [14, 16]. gLASSO estimates a 'shrunk' precision matrix using L1 regularization, which forces some entries to be equal to zero. The LW-shrinkage estimates a 'shrunk' covariance (or 'shrunk' correlation) matrix, which is invertible and hence allows the indirect computation of the partial correlations. Although both approaches have been successfully applied in bioinformatics, to the best of our knowledge, it has not yet been studied in the literature how the shrinkage affects (biases) the estimated partial correlations.
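The LW construction — a convex combination of the sample covariance and a target matrix — can be sketched in a few lines. The shrinkage value and the diagonal target below are illustrative assumptions; the point is that any positive weight on a positive-definite target restores invertibility:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 50                      # high-dimensional: n < p
X = rng.normal(size=(n, p))
C_sm = np.cov(X, rowvar=False)     # singular: rank <= n - 1 < p

lam = 0.2                                  # shrinkage value in (0, 1)
T = np.diag(np.diag(C_sm))                 # target: diagonal matrix of variances
C_shrunk = (1 - lam) * C_sm + lam * T      # LW convex combination

# The smallest eigenvalue is bounded below by lam times the smallest
# variance, so the 'shrunk' estimator is invertible and well-conditioned.
print(np.linalg.eigvalsh(C_sm).min() <= 1e-10)   # True: sample cov is singular
print(np.linalg.eigvalsh(C_shrunk).min() > 0)    # True: shrunk cov is invertible
```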
In this paper, we present an improvement of the 'shrunk' partial correlations inferred with the LW-shrinkage [17]. While in our previous work we focused on the estimation of p-values accounting for the shrinkage [18], here we study a fundamental source of bias, namely the effect of the shrinkage value on the 'shrunk' partial correlation. Most importantly, we show that this effect is non-linear, so that the magnitudes and also the order of the estimated partial correlations change when the shrinkage value varies. This has unexpected consequences for the results, as GGMs learnt from different experiments (e.g. datasets differing in sample size or number of nodes) have unequal shrinkage values, and thus are affected differently. Therefore, to compare studies, or to decide whether partial correlations are relevant, users require (i) a 'shrunk' test of significance, and (ii) an accurate assessment of the 'shrunk' partial correlation coefficients. The first point was addressed in our recent study [18]; the second is the focus of this article.

Analysis of simulated data
Here we demonstrate the advantages of the new 'un-shrunk' estimator over the 'shrunk' estimator (which uses a fixed shrinkage value λ). The evaluation consists of comparing the magnitudes of the estimates to their actual (population) values, and assessing their order using the Area Under the Receiver Operator Curve (AUROC). For the 'shrunk' and 'un-shrunk' estimators we compute the shrinkage-based p-values via Eq. (13) with λ equal to the optimal value and to zero, respectively [18]. We used Cohen's criterion to threshold the magnitude of the partial correlations [24]; this criterion establishes a cut-off of 0.1 below which a correlation coefficient is classified as weak. In total, 2702 data sets were simulated (see Additional file 1: Table S2).
First, we study the shrinkage distortion of the partial correlation coefficients. We create a (random) network structure and simulate 10 datasets for each sample size n = 10, 20, …, 90. Figure 1 shows the average 'shrunk' and 'un-shrunk' partial correlations as a function of n. The size of the symbols is proportional to the standard error to reflect their uncertainty.
The new 'un-shrunk' partial correlation approaches the actual (population) values as n increases, while the 'shrunk' counterpart stays biased. In Additional file 1: Figure S2 we see how a GGM structure (i.e. its edges) changes with varying λ. We used p = 10 and n = 1000 (a large sample size) to reduce the sampling variability, such that the observed effect on the order of the edges can be attributed to λ. Additional file 1: Table S1 lists the edges sorted by their magnitudes. Additional file 1: Figure S3 compares the performance of both methods for different combinations of p and n for a single partial correlation equal to 0.5. The results are presented in the form of a heatmap over a p-n grid. The color scale shows the L1 distances to the population value of 0.5. We see that the proposed 'un-shrunk' estimation is consistently closer to the population value. In Additional file 1: Figure S4 we illustrate how, in general, partial correlations are deflated when the sample size is very small, e.g. n = 10.
To assess the order of the estimates we use the Area Under the Receiver Operator Curve (AUROC). The AUROC shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR) for every threshold; a higher AUROC means that the order of the estimates is superior. Tables 1 and 2 report the AUROC for the 'un-shrunk', the 'shrunk', and the gLASSO [15, 19] estimators for a random network with 1% and 3% of true positives and simulated data with varying sample sizes. It can be seen that the 'un-shrunk' estimates have a significantly higher AUROC than their 'shrunk' counterparts in most of the cases.

Fig. 1 Partial correlation versus sample size. We create a fixed network structure for p = 50, and simulate data for sample sizes n ranging from 10 to 90. Panels a and b show the average 'shrunk' and 'un-shrunk' partial correlations (over 10 simulations). The proposed 'un-shrunk' partial correlations are closer to their actual values.
Next, we study whether the methods produce comparable networks in datasets with the same partial correlation patterns but different sample sizes n (and thus different optimal shrinkage values λ). We proceed as follows: (i) simulate a network structure, (ii) simulate a dataset from this network, (iii) create a copy of the dataset and concatenate it with the first dataset. The concatenated dataset encodes the same information as the first dataset but with double the sample size. Besides the number of samples, every other characteristic of the two datasets remains fixed. Figure 2 shows that the new 'un-shrunk' partial correlations are comparable (differences in partial correlations are in the order of 10⁻⁷).

Table 1 Area Under the Receiver Operator Curve (AUROC)

The table provides the average AUROC scores across 25 data sets for a network with p = 100 nodes, 1% of true positive edges, and varying sample size n. *Two-sided t-tests between the 'un-shrunk' and 'shrunk' methods are statistically significant at 0.05. **STAR model selection is used from the R package 'huge' [15, 19]

Table 2 Area Under the Receiver Operator Curve (AUROC)

The table provides the average AUROC scores across 25 data sets for a network of p = 100 nodes, 3% of true positive edges, and varying sample size n. *Two-sided t-tests between the 'un-shrunk' and 'shrunk' methods are statistically significant at 0.05. **STAR model selection is used from the R package 'huge' [15, 19]

An additional comparison of ROC curves can be found in Additional file 1: Figure S5. There it can be observed that the novel method is particularly superior for the strongest partial correlations (the left-most part of the curve).
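The AUROC reported above can be computed directly from its rank interpretation: the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge. The sketch below uses toy scores and labels, not the paper's simulations:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that a true edge outranks a non-edge
    (ties counted as half), equivalent to the area under the ROC curve."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# |partial correlation| as the ranking score for candidate edges
scores = np.array([0.8, 0.6, 0.5, 0.3, 0.1, 0.05])
truth = np.array([1, 1, 0, 1, 0, 0])   # hypothetical ground-truth edges
print(auroc(scores, truth))            # 8 of 9 (positive, negative) pairs ordered correctly
```

A perfect ranking (all true edges above all non-edges) yields an AUROC of 1.0, and a random ranking about 0.5.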

Data
Escherichia coli microarray data This data set consists of Escherichia coli microarray gene-expression measurements from a study of the temporal stress response upon expression of recombinant human superoxide dismutase (SOD) [20]. SOD expression was induced by isopropyl β-D-1-thiogalactopyranoside (IPTG), a lactose-analogue inducer of the lac operon, and measured at 8, 15, 22, 45, 68, 90, 150, and 180 min. The authors identified 102 out of 4289 protein-coding genes as differentially expressed in one or more samples after induction. Data pre-processing included log2-ratio transformation with respect to the first time point. The final data set consists of expression values of 102 genes at 9 time points and was obtained from the R package GeneNet version 1.2.13. Accessed May 15, 2020.

Fig. 2 Datasets were simulated to encode the same associations, differing only in their sample sizes (and thus their optimal shrinkage values). We proceed as follows: (i) simulate a network structure, (ii) simulate a dataset from this network, (iii) create a copy of the first dataset and concatenate them. The new concatenated dataset encodes the same association structure as the first, but has double the sample size. The optimal shrinkages are 0.42 and 0.54. In grey: the 'shrunk' estimates. In red: the new 'un-shrunk' estimates. Unbiased/comparable estimates must be around zero. The new 'un-shrunk' method provides coefficients that are directly comparable, with differences in the order of 10⁻⁷.

Mus musculus RNA sequencing data
This dataset corresponds to single-end RNA-Seq reads from 21 male mice from two strains (B6, n = 10 and D2, n = 11), and is available at ReCount: http://bowtie-bio.sourceforge.net/recount/ under the identifier 21455293 [21]. Accessed May 15, 2020. Genes with low count averages across samples (less than 5) were excluded. After correcting by strain type, 223 genes out of 9431 were identified as differentially expressed using the R package limma and Benjamini-Hochberg (BH) adjusted p-values < 0.05 [22, 23]. We applied upper-quartile normalization, log2-transformation and a correction by strain type using linear models. The final data set consists of 223 genes with 21 samples.

Effects of human superoxide dismutase (SOD) protein expression on transcript expression in E. coli

Following previous works, the dataset is treated as static, and nominal p-values are considered significant at the 0.05 level [17, 18]. The optimal shrinkage used in the 'shrunk' approach is 0.18.
A Volcano plot of partial correlations and their p-values is presented in Fig. 3. The panels are segmented into four regions using a threshold of |pcorr|= 0.1 and p-values = 0.05. Points in the outer quadrants are simultaneously the strongest and most significant edges.
The 'shrunk' network has 126 significant edges at the 0.05 level (involving 61 genes), with only 15 stronger than 0.1. Following previous analyses, we proceed further only with the significant edges, regardless of their magnitudes. The 'un-shrunk' network has 78 significant edges that are stronger than 0.1 (involving 22 genes). The enriched Gene Ontology (GO) terms are reported in Additional file 1: Tables S4a-b [25, 26]. The complete set of 102 genes was set as background, which mapped to 96 proteins.
The 102 genes retrieved 289 reported interactions (p-value = 1.0 × 10⁻¹⁶) in STRINGdb. The expected number of interactions for a random set of proteins of similar size was 105. The connected genes for the 'shrunk' method mapped onto 58 proteins with 108 interactions (expected = 36, p-value = 1.0 × 10⁻¹⁶). For the 'un-shrunk' method, they mapped onto 41 proteins with 23 interactions (expected = 23, p-value = 6.69 × 10⁻¹²). Both methods are enriched for molecular processes related to (i) the lactose metabolic process (GO:0005988) and (ii) galactitol transport (GO:0015796). The network structures can be found in Additional file 1: Figure S6.

Discussion
GGMs are undirected graphical models that represent pairwise partial correlations in the form of a network. They are widely used in many fields because they are computationally fast and simple to interpret. Despite this, the estimation of partial correlations from gene-expression data is challenging whenever there are fewer samples than genes. This motivated the development of estimators based on shrinkage; however, we have observed an unexpected effect of the shrinkage, which we have investigated here.
In particular, we have identified a bias in the 'shrunk' partial correlations. The bias is a non-linear effect caused by the shrinkage value, which modifies the magnitudes and the order of the partial correlations, see Figs. 5, S1, and S2. Both the Ledoit-Wolf shrinkage and gLASSO estimates can be subject to this non-linear effect. As partial correlations are full-conditional correlations, the shrinkage value affects the configuration of all random variables. As the order is affected non-linearly, re-scaling the 'shrunk' partial correlations (e.g. dividing them by 1 − λ) is not sufficient (see Additional file 1: Figure S1). Consequently, GGMs learnt from different experimental conditions use different shrinkage values, and are not comparable.
To correct for this bias, we have introduced the concept of 'un-shrinking' the partial correlation. On the theoretical side, the 'un-shrunk' partial correlation is a generalization of the classical partial correlation, and it is defined even for singular matrices. For the well-conditioned case, the shrinkage effect can be considered as a transformation of the original dataset into a new 'shrunk' dataset, see Fig. 5b. Hence, 'un-shrinking' could be interpreted as the limit of the partial correlations when the 'shrunk' data points approach their original values. For the corrected 'un-shrunk' estimator, p-values were computed with the shrinkage-based test using a shrinkage value equal to zero [18]. On the applied side, the 'un-shrunk' partial correlation is easy to interpret, because its magnitude is between −1 and 1 and it is closer to its actual (population) value (Figs. 1, 2, and S3). Additionally, it provides a superior trade-off between the true positive rate (TPR) and the false positive rate (FPR), as reported in Tables 1 and 2, and Additional file 1: Figure S5.

Fig. 5b We simulated data (n = 10) from a random network (p = 8), and the effect of the shrinkage at the data level is obtained via the Singular Value Decomposition of the data matrix, see Additional file 1: S3. In black: the original data points. In grey: the data points changing their positions for λ ∈ (0, 1). The dots' sizes are proportional to λ. In white: the 'shrunk' data points for the optimal shrinkage of 0.65.
Our empirical results show that there are local (edge-wise) shrinkage distortions in the GGM. For the E. coli dataset, the strongest edges (in both models) were lacA-lacZ, lacY-lacZ, and lacA-lacY, all related to the lac operon (which was induced by IPTG in the experiment). The result is in agreement with previous analyses [17, 18]. For the M. musculus dataset, all samples come from healthy mice. Consequently, it is expected that the new method did not retrieve any enriched GO term, whereas the 'shrunk' approach found an inflammation-related term. Moreover, as the non-linear effect on the partial correlations translates into a non-linear effect on their p-values, the final inferred network structures are different, see Additional file 1: Figures S6-S7.
In Additional file 1: Figure S3 we see that for every p and n ≥ 30, the 'un-shrunk' coefficients are closer to their actual values. From this figure, the new method seems particularly superior when p/n > 2. For very small samples, e.g. n = 10 or 20, both methods are suboptimal because the model's assumptions are not necessarily fulfilled, see Additional file 1: Figure S4a. Very often in bioinformatics the data are transformed (e.g. log-transformed or scaled). The transformed data are approximately a sample from a Gaussian distribution, where the approximation improves for larger samples; this mismatch is not necessarily negligible for very small samples. For instance, the Law of Large Numbers ensures that the sample mean converges to the population mean as n approaches infinity, and the Central Limit Theorem states that the scaled sample mean is asymptotically normal, with an error that depends on n (see the Berry-Esseen theorem [27]). The same applies to the sample correlation [28], and consequently, for very small samples the distributional assumptions of GGMs are not always met.
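The small-sample instability described above can be made concrete with a short Monte Carlo sketch (illustrative, not from the paper): the sampling distribution of a correlation estimate tightens around the population value as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.5
cov = np.array([[1.0, rho], [rho, 1.0]])

def corr_estimates(n, reps=2000):
    """Sample correlations from `reps` bivariate Gaussian datasets of size n."""
    r = np.empty(reps)
    for i in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        r[i] = np.corrcoef(x, rowvar=False)[0, 1]
    return r

r10, r100 = corr_estimates(10), corr_estimates(100)
# The estimator is far noisier at n = 10, so asymptotic (normal)
# approximations are correspondingly less reliable there.
print(r10.std() > r100.std())   # True
```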
To the best of our knowledge, this is the first study aiming to de-regularize such estimates. While the 'un-shrunk' partial correlation can be found algebraically, we employed an approximation to achieve feasible computational costs for large-scale applications. In principle, large samples can cause weak associations to be statistically significant, and small samples can cause strong associations to be non-significant. A proper discussion should therefore report both the magnitude and the significance of the estimates, as they provide different pieces of information. In this case, the fact that the partial correlations are 'shrunk' (biased) must be accounted for before drawing conclusions from the results. While one may be tempted to say that the 'shrunk' estimates are just less variable, due to the bias-variance trade-off their magnitudes and order are biased. Ignoring the shrinkage value would divorce the analysis from the original data (and its biological meaning), which could obscure the interpretation and impede the validation of earlier reported results (e.g. in biomarker discovery).

Methods
In this section, we review Gaussian graphical models (GGMs) and the Ledoit-Wolf (LW) shrinkage [14, 16]. We illustrate how the shrinkage modifies the network structure (i.e. the magnitudes and order of the partial correlations) in a non-linear way. To overcome this pitfall, we propose the 'un-shrunk' partial correlation and discuss its properties. Throughout the text, uppercase bold letters denote matrices (e.g. P is the matrix of partial correlations) and the hat symbol denotes statistical estimators (e.g. P̂ is an estimator of P).

The 'shrunk' partial correlation
The partial correlation is a measure of (linear) association between Gaussian variables, where confounding effects coming from the other variables are removed (i.e. a full-conditional correlation). A GGM is represented by a matrix P of partial correlations, where the element P_ij is the partial correlation between the i-th and j-th variable [29]. Partial correlations can be computed via the relationship

P_ij = −Ω_ij / √(Ω_ii Ω_jj),    (1)

where Ω is the inverse of the covariance matrix C (or, equivalently, of the correlation matrix R, see Additional file 1: S1). For a dataset D that consists of p variables and n samples, C is a p × p matrix that can be estimated from data, e.g. by the sample covariance matrix C_SM. However, estimating C is challenging when n is comparable to, or smaller than, p, as the estimator then becomes ill-conditioned (numerically unstable) or non-invertible.
The LW-shrinkage estimator C^[λ] is a convex combination of C_SM and a target matrix T (e.g. a diagonal matrix of variances), defined as

C^[λ] = (1 − λ) C_SM + λ T,    (2)

where λ ∈ (0, 1), also called the shrinkage value, represents the weight allocated to T. The inverse of C^[λ], denoted by Ω^[λ], can then be plugged into Eq. (1), yielding

P^[λ]_ij = −Ω^[λ]_ij / √(Ω^[λ]_ii Ω^[λ]_jj).    (3)

This is the 'shrunk' partial correlation [17], and an edge in the GGM is selected according to its magnitude and/or statistical significance [18]. The operations involved in Eq. (3) (i.e. matrix inversion, square roots, and standardization) suggest that P^[λ]_ij is a non-linear function of λ. In addition, the value of λ is usually optimized by minimizing the mean square error (MSE) between C^[λ] and C, but this λ does not necessarily minimize the MSE between P^[λ] and P.
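The pipeline of Eqs. (2)-(3) and the non-linearity in λ can be sketched with numpy. The equicorrelated toy matrix below and its closed-form shrunk partial correlation, (1 − λ)ρ / (1 + (1 − λ)ρ), are our own illustrative derivation for this special case, not taken from the paper:

```python
import numpy as np

def shrunk_pcor(R, lam):
    """Eqs. (2)-(3): shrink the correlation matrix towards the identity,
    invert it, and standardize the negated precision matrix."""
    omega = np.linalg.inv((1 - lam) * R + lam * np.eye(len(R)))
    d = 1.0 / np.sqrt(np.diag(omega))
    P = -omega * np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P

# 3 x 3 equicorrelated matrix: the shrunk partial correlation has the
# closed form c / (1 + c) with c = (1 - lam) * rho, clearly non-linear in lam.
rho = 0.9
R = np.full((3, 3), rho)
np.fill_diagonal(R, 1.0)
for lam in (0.0, 0.25, 0.5, 0.75):
    c = (1 - lam) * rho
    assert abs(shrunk_pcor(R, lam)[0, 1] - c / (1 + c)) < 1e-10

# Non-linearity: the value at lam = 0.5 is not the midpoint between the
# value at lam = 0 and the limit 0 at lam -> 1.
p0, p5 = shrunk_pcor(R, 0.0)[0, 1], shrunk_pcor(R, 0.5)[0, 1]
print(abs(p5 - p0 / 2) > 0.01)   # True
```

Because the dependence on λ is non-linear, no single linear rescaling can map 'shrunk' coefficients obtained at different λ onto a common scale.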

Pitfalls of the 'shrunk' partial correlation
Let us consider a four-variable covariance matrix C and its inverse Ω. As λ increases, the values of the two 'shrunk' partial correlations change: Figure 5a shows that for λ greater than 0.3, P^[λ]_12 becomes weaker than P^[λ]_34, and their relative order reverses. Although the shrinkage in Eq. (2) is linear in C_SM, operations like matrix inversion and standardization in Eq. (3) propagate the effect of λ to P^[λ] in a non-linear way. An equivalent plot for the gLASSO estimator [15, 19] is presented in Additional file 1: Figure S1.

Additional properties
Partial correlations can be found from the covariance matrix C or, equivalently, from the inverse of the correlation matrix R, see Additional file 1: S3.1. Without loss of generality, we now switch to the correlation matrix R (the standardized C). Using Eq. (2), R (instead of C) can be 'shrunk' towards its diagonal (T is the identity matrix). For small sample sizes, the sample correlation matrix R_SM (the standardized C_SM) is not positive definite: some eigenvalues of R_SM can be (i) near zero, (ii) equal to zero, or (iii) slightly negative. This translates into R_SM being (i) ill-conditioned, (ii) singular, or (iii) indefinite, respectively. The case of an indefinite matrix arises from numerical inaccuracies that can make zero eigenvalues slightly negative.
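Shrinking R towards the identity acts only on its spectrum: each eigenvalue α_k is mapped to (1 − λ)α_k + λ while the eigenvectors are unchanged. A short numpy check of this fact (with illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 12))            # n < p: singular sample correlation
R_sm = np.corrcoef(X, rowvar=False)
alpha = np.sort(np.linalg.eigvalsh(R_sm))

lam = 0.3
# Shrinking towards the identity leaves the eigenvectors untouched and
# maps every eigenvalue alpha_k to (1 - lam) * alpha_k + lam.
R_shrunk = (1 - lam) * R_sm + lam * np.eye(12)
alpha_shrunk = np.sort(np.linalg.eigvalsh(R_shrunk))
print(np.allclose(alpha_shrunk, (1 - lam) * alpha + lam))   # True
```

In particular, zero eigenvalues of R_SM become exactly λ, which is why the 'shrunk' matrix is invertible for any λ > 0.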

Shrinking the data
Traditionally, the shrinkage is interpreted as a modification of the statistical model (a 'shrunk' covariance/correlation/partial correlation), while the data remain unchanged. However, most research questions need to be interpreted in terms of the dataset. We therefore propose to discuss the shrinkage from a different perspective, namely at the data level. To this end, we use the Singular Value Decomposition (SVD) of the data matrix D, and the fact that the shrinkage only modifies the eigenvalues of C_SM = (1/(n − 1)) D^t D, while the eigenvectors stay identical (see Additional file 1: S2):

α^[λ]_k = (1 − λ) α_k + λ, k = 1, 2, …, p,    (5)

so that the eigenvalues of the inverse (R^[λ])⁻¹ are the reciprocals

1/α^[λ]_k, k = 1, 2, …, p.    (6)

As singular values are the positive square roots of the eigenvalues α^[λ]_k given in Eq. (5) (up to the factor n − 1), we can derive the SVD of the 'shrunk' data matrix D^[λ] as

D^[λ] = U diag(√((n − 1) α^[λ])) V^t,    (7)

where U and V are the matrices of (left and right) singular vectors of D, and the singular values are replaced by their 'shrunk' counterparts. This relationship allows us to study the shrinkage effect at the data level: analyzing the original dataset D with a 'shrunk' model C^[λ] is equivalent to analyzing D^[λ] with the classical model C_SM. To illustrate this, we generate data from a network (p = 8, n = 10) and investigate what happens to the original data points as the shrinkage increases; see Fig. 5b and Additional file 1: S3 for more details.
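The data-level equivalence just described can be verified numerically. The sketch below uses an uncentered sample covariance and an identity target for algebraic simplicity (both assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 10, 4, 0.4
D = rng.normal(size=(n, p))
C_sm = D.T @ D / (n - 1)     # uncentered sample covariance (simplifying assumption)

# SVD of the data: eigenvalues of C_sm are alpha_k = s_k**2 / (n - 1).
U, s, Vt = np.linalg.svd(D, full_matrices=False)
alpha = s**2 / (n - 1)
alpha_shrunk = (1 - lam) * alpha + lam       # shrunk spectrum, identity target

# 'Shrunk' data matrix: same singular vectors, shrunk singular values.
D_shrunk = U @ np.diag(np.sqrt((n - 1) * alpha_shrunk)) @ Vt

# Analyzing D with the shrunk model equals analyzing D_shrunk classically.
C_shrunk = (1 - lam) * C_sm + lam * np.eye(p)
print(np.allclose(D_shrunk.T @ D_shrunk / (n - 1), C_shrunk))   # True
```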

The 'un-shrunk' partial correlations
Here we propose the concept of the 'un-shrunk' partial correlation, which we define as the limit of P^[λ]_ij as λ approaches zero,

P^[0]_ij = lim_{λ→0} P^[λ]_ij.    (8)

A proof that P^[λ]_ij is a continuous and bounded function of λ, as well as a general proof of the existence of this limit, can be found in Additional file 1: S4 and S5. The key idea is that there is no divergence in Eq. (8). To illustrate this, we start with the eigen-decomposition

Ω^[λ] = (R^[λ])⁻¹ = V diag(1/α^[λ]) V^t,    (9)

where V is the matrix whose columns are the eigenvectors of R^[λ], and diag(1/α^[λ]) is the diagonal matrix with entries 1/α^[λ]_k (k = 1, 2, …, p). Let us assume that R_SM is singular and recall two facts from the previous subsection. First, R_SM has a zero eigenvalue α_k = 0, which is transformed into α^[λ]_k = λ by Eq. (5), so that the corresponding eigenvalue of Ω^[λ] is 1/α^[λ]_k = 1/λ by Eq. (6). Second, R_SM and R^[λ] have the same eigenvectors, because the shrinkage only changes the eigenvalues (see Additional file 1: S2). Substituting Eq. (9) into Eq. (3) and factorizing out the term 1/λ gives

P^[λ]_ij = −(V diag(λ/α^[λ]) V^t)_ij / √((V diag(λ/α^[λ]) V^t)_ii (V diag(λ/α^[λ]) V^t)_jj).    (10)

Any singularity disappears by cancelling the term 1/λ. As α^[λ]_k ≥ λ, the diagonal elements λ/α^[λ]_k in Eq. (10) have limits equal to (i) zero for α_k ≠ 0, or (ii) one for α_k = 0, see Equations (S43, S48, S52, S55, S58). In this sense, the ratio that defines the 'shrunk' partial correlation in Eq. (10) does not diverge when removing the shrinkage. For a general proof of the existence of this limit, see Additional file 1: S5. We propose this limit as a generalization of the classical partial correlation.
The idea resembles a classical example from Calculus: the limits at zero of x⁻¹ and x⁻² are both infinite, while the limit of their ratio x⁻¹/x⁻² = x is finite (i.e. zero). For illustration, let us consider a 3 × 3 correlation matrix of ones (all variables are maximally correlated). The matrix R is singular, as some of its eigenvalues are zero, and Eq. (2) gives

R^[λ] = (1 − λ) R + λ I,

whose off-diagonal elements equal 1 − λ. Using Eq. (3) we obtain

P^[λ]_ij = (1 − λ)/(2 − λ),

and we see that P^[0]_ij = lim_{λ→0} P^[λ]_ij = 1/2. Here it is worth noting that a simple linear rescaling by (1 − λ) is not sufficient to remove the shrinkage effect (see Additional file 1: Figure S1): in this example, P^[λ]_ij/(1 − λ) would become 1/(2 − λ), which for λ = 1/3 gives 3/5 = 0.6, and for λ = 2/3 gives 3/4 = 0.75. More toy examples can be found in Additional file 1: S6.
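The toy example above can be confirmed numerically: for the singular matrix of ones, the 'shrunk' partial correlation follows (1 − λ)/(2 − λ) and approaches 1/2 as λ → 0.

```python
import numpy as np

def shrunk_pcor_12(R, lam):
    """Off-diagonal 'shrunk' partial correlation via Eqs. (2)-(3)."""
    omega = np.linalg.inv((1 - lam) * R + lam * np.eye(len(R)))
    d = 1.0 / np.sqrt(np.diag(omega))
    return -(omega * np.outer(d, d))[0, 1]

R = np.ones((3, 3))   # singular: all variables maximally correlated
lams = np.array([0.5, 0.1, 0.01, 0.001])
vals = np.array([shrunk_pcor_12(R, l) for l in lams])

# Matches the closed form (1 - lam) / (2 - lam) from the text ...
print(np.allclose(vals, (1 - lams) / (2 - lams)))        # True
# ... and tends to the 'un-shrunk' limit of 1/2 as lam -> 0.
print(abs(shrunk_pcor_12(R, 1e-6) - 0.5) < 1e-3)         # True
```

Note that although R itself is not invertible, R^[λ] is invertible for every λ > 0, so the limit is taken over well-defined quantities.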

Practical implementation
From a mathematical perspective, Eq. (8) can be computed by means of the analytical results in Equations (S43, S48, S52, S55, S58). However, these results are impractical, as numerical inaccuracies make the elements of V unreliable and often render zero eigenvalues slightly positive or negative. To circumvent these numerical issues, we apply a simple approximation. We compute the 'shrunk' partial correlation P^[λ]_ij for different values of λ ∈ (0, 1), and apply Fisher's transformation to ensure they are approximately normally distributed. Finally, we fit weighted smoothing splines through these points. Considering that small shrinkage values cause uncertain estimates, we use weights according to the reciprocal condition number of the correlation matrix. By extrapolating the fitted function to λ = 0, the limit in Eq. (8) is approximated. For more details, we refer the reader to Additional file 1: S7.
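The extrapolation idea can be sketched on the toy example from the previous subsection, where the exact curve P^[λ] = (1 − λ)/(2 − λ) is known. A low-order polynomial fit is used here as a simplified stand-in for the paper's weighted smoothing splines, and no condition-number weights are applied (both simplifications of this sketch):

```python
import numpy as np

# pcor(lam) for the singular toy example has the exact form (1 - lam)/(2 - lam);
# pretend we can only evaluate it at moderate shrinkage values.
lams = np.linspace(0.05, 0.4, 30)
pcors = (1 - lams) / (2 - lams)

# Fisher z-transform, fit a cubic polynomial through the transformed points,
# extrapolate to lam = 0, and transform back.
z = np.arctanh(pcors)
coeffs = np.polyfit(lams, z, deg=3)
unshrunk = np.tanh(np.polyval(coeffs, 0.0))
print(abs(unshrunk - 0.5) < 0.01)   # True: close to the exact limit of 1/2
```

In the actual method, the fit is weighted so that estimates at small λ (where the matrix is nearly singular and the inversion is unstable) contribute less to the extrapolation.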
P-values are computed from the probability density of the 'shrunk' partial correlation,

f(r; λ, df) = ((1 − λ)² − r²)^((df − 3)/2) / ((1 − λ)^(df − 2) Beta(1/2, (df − 1)/2)), for |r| ≤ 1 − λ,    (11)

where Beta denotes the beta function, and df are the degrees of freedom, which can be found via Maximum Likelihood Estimation. Further details can be found in our previous work [18]. The parameter λ is the optimal shrinkage value for the 'shrunk' partial correlations.
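A quick sanity check of this null density (assuming the scaled form of the classical (1 − r²)^((df−3)/2) density, with support shrunk to |r| ≤ 1 − λ; the specific λ and df below are illustrative):

```python
import numpy as np
from math import gamma

def beta_fn(a, b):
    # Beta function via the Gamma function
    return gamma(a) * gamma(b) / gamma(a + b)

def shrunk_null_density(r, lam, df):
    """Assumed scaled-beta form of the null density of the 'shrunk'
    partial correlation, supported on |r| <= 1 - lam."""
    s = 1.0 - lam
    norm = s**(df - 2) * beta_fn(0.5, (df - 1) / 2)
    return (s**2 - r**2)**((df - 3) / 2) / norm

lam, df = 0.2, 7
r = np.linspace(-(1 - lam) + 1e-9, (1 - lam) - 1e-9, 200001)
dens = shrunk_null_density(r, lam, df)
area = np.sum((dens[1:] + dens[:-1]) * np.diff(r)) / 2   # trapezoidal rule
print(abs(area - 1.0) < 1e-3)   # True: integrates to one over its support
```

At λ = 0 this form reduces to the classical null density of a correlation coefficient, so the 'un-shrunk' p-values are obtained from the same family with the shrinkage set to zero.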