Constrained Covariance Matrices With a Biologically Realistic Structure: Comparison of Methods for Generating High-Dimensional Gaussian Graphical Models

Emmert-Streib, Frank; Tripathi, Shailesh; Dehmer, Matthias

doi:10.3389/fams.2019.00017

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 12 April 2019
Sec. Systems Biology Archive
Volume 5 - 2019 | https://doi.org/10.3389/fams.2019.00017

Frank Emmert-Streib^1,2^*

Shailesh Tripathi^1,3

Matthias Dehmer^3,4,5

¹Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
²Institute of Biosciences and Medical Technology, Tampere, Finland
³Faculty for Management, Institute for Intelligent Production, University of Applied Sciences Upper Austria, Steyr, Wels, Austria
⁴Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria
⁵College of Computer and Control Engineering, Nankai University, Tianjin, China

High-dimensional data from molecular biology possess an intricate correlation structure that is imposed by the molecular interactions between genes and their products, which form various types of gene networks. This fact is particularly well known for gene expression data, because a sufficient number of large-scale data sets are available that are amenable to a sensible statistical analysis confirming this assertion. The purpose of this paper is twofold. First, we investigate three methods for generating constrained covariance matrices with a biologically realistic structure. Such covariance matrices play a pivotal role in designing novel statistical methods for high-dimensional biological data, because they allow one to define Gaussian graphical models (GGM) for the simulation of realistic data, including their correlation structure. We study local and global characteristics of these covariance matrices and of the derived concentration/partial correlation matrices. Second, we connect these results, obtained from a probabilistic perspective, to statistical results of studies aiming to estimate gene regulatory networks from biological data. This connection sheds light on the well-known heterogeneity of statistical estimation methods for inferring gene regulatory networks and provides an explanation for the difficulties in inferring molecular interactions between highly connected genes.

1. Introduction

High-throughput technologies changed the face of biology and medicine within the last two decades [1–3]. Whereas traditional molecular biology focused on individual genes, mRNAs and proteins [4], nowadays, genome-wide measurements of these entities are standard. As an immediate consequence, transcriptomics, proteomics, and metabolomics data are high-dimensional containing measurements of hundreds and even thousands of molecular variables [5–10]. Aside from the high-dimensional character of these data, there exists a non-trivial correlation structure among the covariates, which establishes considerable problems for the analysis of such data sets [11–13]. The reason for the presence of the correlation structure is due to the underlying interactions between genes and their products. Specifically, it is well-known that there are transcriptional regulatory, protein, and signaling networks that represent the blueprint of biological and cellular processes [14–20].

In order to design new statistical methods, which are urgently needed to cope with high-dimensional data from molecular biology, simplifying assumptions are usually made regarding the characteristics of the data. For instance, one of the most frequently made assumptions is the normal behavior of the covariates [21–24]. That means the distribution of the variables is assumed to follow a univariate or multivariate normal distribution [25]. This assumption is reasonable because by applying a z-transformation to data with an arbitrary distribution one can obtain (standard) normally distributed data [26]. For this reason, a z-transformation is usually applied to the raw data as a preprocessing step. Because we investigate in this paper high-dimensional data with a complex correlation structure, we focus in the following on multivariate normal distributions. Using univariate distributions in this context would require the additional assumption of a vanishing correlation structure between the covariates, so that the multivariate distribution can be approximated sensibly by a product of univariate distributions, i.e., $p (x_{1}, \dots, x_{p}) = \prod_{i = 1}^{p} p (x_{i})$.

To fully specify a multivariate normal distribution, a vector of mean values and a covariance matrix are needed. From the covariance matrix follows the correlation matrix, which provides information about the correlation structure of the variables. For instance, for data from molecular biology measuring the expression of genes, it is known that the correlation in such data sets is neither vanishing nor random, but is imposed by biochemical interactions and bindings between proteins and RNAs forming complex regulatory networks [27, 28]. For this reason, it is not sufficient to merely specify an arbitrary covariance matrix in order to simulate gene expression data from a normal distribution for investigating statistical methods, because such a covariance matrix is very unlikely to possess a biologically realistic correlation structure. In fact, it is known that biological regulatory networks have a scale-free and small-world structure [29, 30]. For this reason, several algorithms have been introduced that generate constrained covariance matrices representing specific independence conditions, as given by the graph structure of gene networks. If, for instance, a gene regulatory network or a protein interaction network is chosen for such a network structure, these algorithms generate covariance matrices from which one can simulate data with a correlation structure that is consistent with the structural dependency of such biological networks and, hence, is close to real biological data [31, 32]. Here “consistent” means that for multivariate normal random variables there is a well-known relation between the components of the inverse of their covariance matrix and their partial correlation coefficients, discussed formally in section 2. This relation establishes a precise connection between a correlation structure in the data and a network structure. As a result, such a constrained covariance matrix establishes a Gaussian graphical model (GGM) [33, 34] that can be used to simulate data for the analysis of, e.g., methods to identify differentially expressed genes or differentially expressed pathways, or for the inference of gene regulatory networks [11, 35–37], to name just a few potential areas of application.

The major purpose of this paper is to study and compare three algorithms that have been introduced to generate constrained covariance matrices. The algorithms we are studying are the Iterative Proportional Fitting (IPF) algorithm [38], an orthogonal projection method by Kim et al. [37], and a regression approach by Hastie, Tibshirani, and Friedman (HTF) [39]. Data generated by such algorithms can be used to simulate, e.g., gene expression data from DNA microarrays to test analysis methods for identifying differentially expressed genes [22, 40] or differentially expressed pathways [41–43], or for inferring gene regulatory networks [44, 45]. Furthermore, we connect these results, obtained from a probabilistic perspective, to statistical results of studies aiming to infer gene regulatory networks. This connection sheds light on the known heterogeneity of statistical estimation methods for inferring gene regulatory networks.

The paper is organized as follows. In the next section, we present the methods we are studying and the necessary background information. This includes a description of the three algorithms IPF, Kim, and HTF for generating constrained covariance matrices and a brief description of the networks we use for our analysis. In sections 3 and 4, we present our numerical results and discuss the observed findings. Furthermore, we place the obtained results into a wider context by discussing the relation to network inference methods. The paper finishes in section 5 with a summary and an outlook on future studies.

2. Methods

Multivariate random variables, X ∈ ℝ^p, from a p-dimensional normal distribution, i.e., X ~ N(μ, Σ), with mean vector μ ∈ ℝ^p and a positive-definite p × p real covariance matrix Σ, have a density function given by

\begin{array}{l} p (x) = \frac{1}{{(2 π)}^{\frac{p}{2}} | Σ |^{\frac{1}{2}}} \exp \left( - \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ) \right) . & (1) \end{array}

For such normal random variables there is a simple relation between the components of the inverse covariance matrix, Ω = Σ⁻¹, (also called “precision” or “concentration matrix”) and conditional partial correlation coefficients [46] (chapter 5). This relation is given by

\begin{array}{l} ρ_{i j \mid N \setminus \{i, j\}} = - \frac{ω_{i j}}{\sqrt{ω_{i i} ω_{j j}}} . & (2) \end{array}

Here ρ_ij|N\{ij} is the partial correlation coefficient between genes i and j conditioned on all remaining genes, i.e., N\{ij}, where N = {1, …, p} is the set of all genes. Furthermore, ω_ij are the components of the concentration matrix Ω. That means that ρ_ij|N\{ij} = 0, i.e., genes i and j are conditionally independent of each other given all remaining genes,

\begin{array}{l} X_{i} ⊥ X_{j} \mid \{\text{all remaining genes}\}, & (3) \end{array}

if and only if ω_ij = 0. The relation in Equation (3) is also known as the Markov property [46] (chapter 3). In the following, we abbreviate the notation for such partial correlation coefficients as

\begin{array}{l} ψ_{i j} = ρ_{i j \mid N \setminus \{i, j\}}, & (4) \end{array}

and denote the entire partial correlation matrix by Ψ.
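The mapping from a concentration matrix to partial correlations in Equation (2) can be illustrated with a minimal Python sketch. The function name, the convention of setting the diagonal of Ψ to 1, and the 3 × 3 example matrix are assumptions made for illustration only; they are not taken from the packages used in this paper.

```python
import math


def partial_correlations(omega):
    """Map a concentration matrix Omega to the partial correlation matrix Psi (Equation 2)."""
    p = len(omega)
    # Diagonal entries are set to 1 by convention (an assumption of this sketch).
    psi = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                psi[i][j] = -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
    return psi


# Hypothetical concentration matrix for a chain 1 - 2 - 3: omega_13 = 0 (no edge).
omega_example = [[2.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 2.0]]
psi_example = partial_correlations(omega_example)
```

In this example the entry for the missing edge between variables 1 and 3 is zero in Ψ, in line with Equation (3), whereas the two edges receive non-zero partial correlations.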

A multivariate normal distribution that is Markov with respect to an undirected network G is called a Gaussian graphical model (GGM) [33, 34, 46], also known as “graphical Gaussian model,” “covariance selection model,” or “concentration graph model.” This means that all conditional independence relations that can be found in Σ⁻¹ are also present in G [46] (chapter 3). Hence, such a Σ⁻¹ can be considered as consistent [or faithful [47]] with all conditional independence relations in G.

2.1. Generation of a Random Covariance Matrix Using Conditional Independence for a Given Graphical Model

In the following, we briefly describe the three algorithms IPF, Kim, and HTF [37, 38, 46, 48] that we use for generating constrained covariance matrices that are consistent with a given graph structure by obeying its independence relations.

Kim Algorithm: The Kim algorithm [37] applies iteratively orthogonal projections to generate a covariance matrix with the desired properties. A formal description of this algorithm is as follows:

Algorithm 1 Generation of a constrained covariance matrix using the Kim algorithm

We are providing an R package with the name mvgraphnorm that contains an implementation of the Kim algorithm. The package is available from the CRAN repository.

Before we continue, we would like to emphasize that in the following, we use the notation W and V to indicate covariance matrices. However, the important difference is that W is unconstrained whereas V is consistent with conditional independence relations given in a network G.

IPF Algorithm: The working principle of the Iterative Proportional Fitting algorithm [38] is as follows. Let us assume that X is a p-dimensional random variable from a normal distribution with mean μ = 0 and covariance matrix Σ, and that a sample covariance matrix W is estimated from a sample of size m. Suppose we partition the vector X into X_a and X_b, for randomly selected index vectors a and b. Then these vectors, X_a and X_b, jointly follow a normal distribution with mean μ = 0 and covariance matrix

\begin{array}{l} V_{a \cup b, a \cup b} = \left( \begin{matrix} V_{a a} & V_{a b} \\ V_{b a} & V_{b b} \end{matrix} \right) . & (5) \end{array}

Furthermore, the marginal distribution of X_a is normal with covariance matrix V_aa, and the conditional distribution of X_b given X_a = x_a is also normal, namely $N (V_{b a} {(V_{a a})}^{- 1} x_{a}, V_{b b} - V_{b a} V_{a a}^{- 1} V_{a b})$ [46]. Let us assume that f is a given density function and g is the density function of a Gaussian graphical model with a similar marginal distribution as f.

The iterative proportional fitting (IPF) algorithm [38] adjusts iteratively the joint density function of X_a and X_b. This can be written in general form as,

\begin{array}{l} g_{a b}^{t + 1} = g_{b | a}^{t} f_{a}, & (6) \end{array}

corresponding to the (t + 1)th iteration step. In this notation, the expectation value of X, for $g_{a b}^{t + 1}$ , is given by,

\begin{array}{l} 𝔼 [X | g_{a b}^{t + 1}] = 0, & (7) \end{array}

which remains zero for all iteration steps t. For this reason, we do not need to consider update equations for this expectation value. In contrast, the variance of X, for $g_{a b}^{t + 1}$ , is given by

\begin{array}{l} V^{t + 1} = (\begin{matrix} V_{a a}^{f} & V_{a a}^{f} {(B_{b | a}^{t})}^{T} \\ B_{b | a}^{t} V_{a a}^{f} & (V_{b b | a}^{t} + B_{b | a}^{t} V_{a a}^{f} {(B_{b | a}^{t})}^{T}) \end{matrix}), & (8) \end{array}

with $B_{b | a}^{t} = V_{b a}^{t} {(V_{a a}^{f})}^{- 1}$ [46].

The IPF algorithm, formalized in Algorithm 2, provides iterative updates for the components of the covariance matrix V^{t+1}, given by Equation (8). In this algorithm, the first step is to generate a sample covariance matrix W, and V is initialized as the identity matrix with the same number of rows and columns as W. In the second step, the maximal cliques of a given graph G are identified. Here a clique is defined as a fully connected subgraph of G. Next, the components of the partitioned covariance matrix are iteratively updated in order to become consistent with the independence relations in G. This is accomplished by utilizing the identified cliques. This procedure is iterated over all cliques until the algorithm converges, as specified by a scalar threshold parameter δ, with δ ≪ 1.

Algorithm 2 Generation of a constrained covariance matrix using the IPF algorithm
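The clique-wise updating described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper or in qpgraph: it uses the textbook form of IPF in which the concentration matrix K = V⁻¹ is updated on each clique c by K_cc ← K_cc + (W_cc)⁻¹ − (V_cc)⁻¹, the maximal cliques are supplied by hand rather than computed from G, and the function names and the 3 × 3 input matrix are hypothetical.

```python
def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def submatrix(m, idx):
    """Rows and columns of m restricted to the index list idx."""
    return [[m[i][j] for j in idx] for i in idx]


def ipf(W, cliques, max_sweeps=100, delta=1e-10):
    """Return (V, K): a covariance matrix V and K = V^-1 with zeros outside the cliques."""
    p = len(W)
    K = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]  # V starts as identity
    for _ in range(max_sweeps):
        largest_change = 0.0
        for c in cliques:
            target_inv = inverse(submatrix(W, c))            # (W_cc)^-1
            current_inv = inverse(submatrix(inverse(K), c))  # (V_cc)^-1
            for a, i in enumerate(c):
                for b, j in enumerate(c):
                    change = target_inv[a][b] - current_inv[a][b]
                    K[i][j] += change
                    largest_change = max(largest_change, abs(change))
        if largest_change < delta:  # convergence threshold, delta << 1
            break
    return inverse(K), K


# Hypothetical input: chain graph 0 - 1 - 2 with maximal cliques {0, 1} and {1, 2}.
W_demo = [[2.0, 0.8, 0.5],
          [0.8, 1.5, 0.6],
          [0.5, 0.6, 1.0]]
V_demo, K_demo = ipf(W_demo, [[0, 1], [1, 2]])
```

Entries of K outside all cliques are never touched and therefore stay exactly zero, while the clique marginals of V are fitted to those of W; for the chain graph this fixes the non-edge covariance at V_01 V_12 / V_11.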

HTF Algorithm: We call the following algorithm HTF because it was proposed by Hastie, Tibshirani, and Friedman [39]. Algorithm 3 shows pseudocode for this algorithm.

Let us assume, we have a p-dimensional random variable, X ∈ ℝ^p, sampled from a normal distribution with mean μ and covariance matrix Σ, and a sample covariance matrix S estimated from m samples. The log likelihood for the (unconstrained) concentration matrix Ω is given by,

\begin{array}{l} L (Ω) = \log \det Ω - \mathrm{trace} (S Ω), & (9) \end{array}

which is maximized for Ω = S⁻¹, the sample estimate of Σ⁻¹.

The HTF method uses a regression approach for each node by selecting its neighbors as predictor variables, utilizing model-based estimates of the predictor variables. For this approach, Lagrange multipliers are included in Equation (9) for the non-edge components of a given graph structure,

\begin{array}{l} L (Ω) = \log \det Ω - \mathrm{trace} (S Ω) - \sum_{(j, k) \notin E} γ_{j k} ω_{j k} . & (10) \end{array}

Here (j, k) ∉ E means that there is no edge between these two variables, i.e., A_jk = 0. We maximize this likelihood by setting its first derivative with respect to Ω to zero, which gives

\begin{array}{l} Ω^{- 1} - S - Γ = 0 . & (11) \end{array}

Here Γ is the matrix of Lagrange parameters with non-zero values for the non-edge components of a given graph structure.

Algorithm 3 Maximum likelihood estimation of independence of a sample covariance matrix for a given graph using HTF algorithm.

Because one would like to obtain W = Ω⁻¹, we can write this identity in a form partitioned into two major components,

\begin{array}{l} \left( \begin{matrix} W_{11} & w_{12} \\ w_{21}^{T} & w_{22} \end{matrix} \right) \left( \begin{matrix} Ω_{11} & ω_{12} \\ ω_{21}^{T} & ω_{22} \end{matrix} \right) = \left( \begin{matrix} I & 0 \\ 0^{T} & 1 \end{matrix} \right) . & (12) \end{array}

Here the first component consists of p − 1 dimensions and the second component of just one. That means, e.g., W₁₁ and I are (p − 1) × (p − 1) matrices, w₁₂ and ω₁₂ are (p − 1)-dimensional vectors, and w₂₂ and ω₂₂ are scalar values.

The iterative algorithm of HTF repeats the steps given in Equations (11–18). At each step, one selects one of the p variables randomly for the partitioning given in Equation (12). This variable defines w₁₂ and ω₁₂, whereas the remaining variables define W₁₁ and Ω₁₁. For reasons of simplicity, we select in the following the last variable.

From the upper-right block of Equation (12), W₁₁ω₁₂ + w₁₂ω₂₂ = 0, we obtain the following expression

\begin{array}{l} w_{12} = - W_{11} ω_{12} / ω_{22} . & (13) \end{array}

Setting β = −ω₁₂/ω₂₂, so that w₁₂ = W₁₁β, and placing w₁₂ into the upper-right block of Equation (11), namely,

\begin{array}{l} w_{12} - s_{12} - γ_{12} = 0, & (14) \end{array}

leads to

\begin{array}{l} W_{11} β - s_{12} - γ_{12} = 0 . & (15) \end{array}

This system is solved only for the q components in β that are not equal to zero, i.e., q = |{i | β_i ≠ 0}|, which can be written as

\begin{array}{l} {W_{11}}^{*} β^{*} - s_{12}^{*} = 0 & (16) \end{array}

Here it is important to note that $β^{*}, s_{12}^{*} \in ℝ^{q}$ and ${W_{11}}^{*}$ is a q×q matrix. From this, ${\hat{β}}^{*}$ is given by

\begin{array}{l} {\hat{β}}^{*} = {W_{11}^{*}}^{- 1} s_{12}^{*} & (17) \end{array}

and the overall solution β′ follows from padding ${\hat{β}}^{*}$ with zeros in the remaining p − 1 − q components given by I_p = {i | β_i = 0}. Finally, this is used to update w₁₂ in Equation (13), leading to

\begin{array}{l} w_{12}^{'} = W_{11} β^{'} . & (18) \end{array}

The above steps are iterated, for each variable, until the estimates for w₁₂ converge.
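The iteration over Equations (16–18) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the qpgraph implementation: W is initialized with S, the variables are visited in a fixed order instead of randomly, convergence is judged by the largest change of any entry within a sweep, and the function names and the example matrices are hypothetical.

```python
def solve(a, b):
    """Solve a x = b for a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def htf(S, adjacency, max_sweeps=100, tol=1e-10):
    """Return W whose inverse has zeros on the non-edges given by the adjacency matrix."""
    p = len(S)
    W = [list(map(float, row)) for row in S]  # initialization with S (assumption)
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(p):  # variable j plays the role of the last variable in Equation (12)
            rest = [i for i in range(p) if i != j]
            neighbors = [i for i in rest if adjacency[i][j] == 1]
            beta = {i: 0.0 for i in rest}  # zero padding for non-neighbors
            if neighbors:
                W_star = [[W[a][b] for b in neighbors] for a in neighbors]
                s_star = [S[a][j] for a in neighbors]
                for i, value in zip(neighbors, solve(W_star, s_star)):  # Equation (17)
                    beta[i] = value
            for i in rest:  # Equation (18): w_12' = W_11 beta'
                updated = sum(W[i][k] * beta[k] for k in rest)
                largest_change = max(largest_change, abs(updated - W[i][j]))
                W[i][j] = W[j][i] = updated
        if largest_change < tol:
            break
    return W


# Hypothetical input: chain graph 0 - 1 - 2, so (0, 2) is the only non-edge.
S_demo = [[2.0, 0.8, 0.5],
          [0.8, 1.5, 0.6],
          [0.5, 0.6, 1.0]]
A_demo = [[0, 1, 0],
          [1, 0, 1],
          [0, 1, 0]]
W_htf = htf(S_demo, A_demo)
```

In this example the entries of W on edges and on the diagonal stay equal to those of S, and only the non-edge entry changes, here to S_01 S_12 / S_11, which makes the corresponding entry of W⁻¹ vanish.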

The qpgraph package [48] provides implementations of the IPF and HTF algorithms.

Common Step of IPF and HTF: The IPF and HTF algorithms have in common that they are based on the random initialization of a covariance matrix W that is obtained from a (parametric) Wishart distribution [49]. More precisely, assume X₁, X₂, …, X_m are m samples from a p-dimensional normal distribution N(0, Σ), then

\begin{array}{l} W = X^{T} X \sim \mathrm{Wishart}_{p} (Σ, n) & (19) \end{array}

is from a Wishart distribution. Here n is the degrees of freedom and Σ is a p × p matrix. The expectation value of W is given by,

\begin{array}{l} 𝔼 [W] = n Σ . & (20) \end{array}

In order to obtain a covariance matrix W from a Wishart distribution given by $\mathrm{Wishart}_{p} (\frac{1}{n} Σ, n)$, the Bartlett decomposition can be utilized, given by [49–51],

\begin{array}{l} W (r) = L (r) A A^{T} L (r)^{T} . & (21) \end{array}

Here L(r)L(r)^T is obtained from a Cholesky decomposition of $\frac{1}{n} Σ (r)$ and A is defined by,

\begin{array}{l} A = \left( \begin{matrix} \sqrt{c_{1}} & 0 & 0 & \cdots & 0 \\ n_{21} & \sqrt{c_{2}} & 0 & \cdots & 0 \\ n_{31} & n_{32} & \sqrt{c_{3}} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_{p 1} & n_{p 2} & n_{p 3} & \cdots & \sqrt{c_{p}} \end{matrix} \right) & (22) \end{array}

Here n_ij ~ N(0, σ), for i ∈ {2, …, p} with i > j, and $c_{i} \sim χ^{2} (p + 1 - i)$, i.e., a chi-squared distribution with p + 1 − i degrees of freedom, with i = 1, …, p. For reasons of simplicity, Σ(r) can be defined as

\begin{array}{l} Σ (r) = \left( \begin{matrix} 1 & r & r & \cdots & r \\ r & 1 & r & \cdots & r \\ r & r & 1 & \cdots & r \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r & r & r & \cdots & 1 \end{matrix} \right) & (23) \end{array}

which results in a constant correlation coefficient r, with 0 ≤ r ≤ 1, between all variables. For this reason, we write the covariance matrix, and the resulting L(r) and W(r) matrices, explicitly as a function of the parameter r.

The IPF and the HTF algorithm use a randomly generated W(r) covariance matrix, as shown above, as initialization matrix. Due to the fact that this matrix is a function of r, with 0 ≤ r ≤ 1, both algorithms depend on this parameter in an intricate way. In the results section, we will study its influence.
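The construction of W(r) in Equations (21–23) can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: the sub-diagonal entries of A are drawn as standard normal variates, the chi-squared variates are drawn as Gamma(k/2, 2) variates, the degrees of freedom are passed as a free parameter (choosing them equal to p reproduces the p + 1 − i of Equation 22), and r < 1 is required so that the Cholesky factor exists. All function and parameter names are hypothetical.

```python
import math
import random


def cholesky(m):
    """Lower-triangular L with L L^T = m, for a symmetric positive-definite matrix m."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = m[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L


def sample_initial_covariance(p, r, dof, rng):
    """Draw W(r) = L A A^T L^T with L L^T = Sigma(r) / dof (Equations 21-23)."""
    sigma = [[(1.0 if i == j else r) / dof for j in range(p)] for i in range(p)]
    L = cholesky(sigma)
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # Chi-squared variate with (dof - i) degrees of freedom for zero-based i,
        # drawn as a Gamma variate with shape (dof - i) / 2 and scale 2.
        A[i][i] = math.sqrt(rng.gammavariate((dof - i) / 2.0, 2.0))
        for j in range(i):
            A[i][j] = rng.gauss(0.0, 1.0)  # standard normal (assumption)
    M = [[sum(L[i][k] * A[k][j] for k in range(p)) for j in range(p)] for i in range(p)]
    # W = (L A)(L A)^T is symmetric by construction.
    return [[sum(M[i][k] * M[j][k] for k in range(p)) for j in range(p)] for i in range(p)]


W_init = sample_initial_covariance(p=4, r=0.3, dof=4, rng=random.Random(0))
```

Passing an explicit random number generator keeps the draw reproducible; a matrix sampled this way could then serve as the initialization W(r) for IPF or HTF.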

2.2. Generating Networks

For reasons of comparison, we study in this paper three different network types. Specifically, we use scale-free networks, random networks, and small-world networks [52, 53] for our analysis. Because there are various algorithms that allow the generation of each of these network types [54, 55], we select three network models that have been widely adopted in biology: (1) the preferential attachment model of Barabási and Albert (BA) [56] to generate scale-free networks, (2) the Erdős-Rényi (ER-RN) model [57, 58] to generate random networks, and (3) the Watts-Strogatz (WS) model [59, 60] to generate small-world networks. A detailed description of how such networks are generated can be found in [15].

Because the only reason for generating these networks is to study the characteristics and properties of the three covariance-generating algorithms, the particular choice of the network generation algorithms is not crucial. Each of these algorithms results in undirected, unweighted networks that are sufficiently distinct from each other to allow studying the influence of these structural differences on the generation of the covariance matrices.

Specifically, we added random networks for a baseline comparison because this type of network is classic, having been studied since the 1960s [57, 58]. In contrast, scale-free networks and small-world networks are much newer models [61] that have been introduced to mimic the structure of real-world networks more closely. For our study, it is of relevance that various types of gene networks, e.g., transcriptional regulatory networks, protein networks, or metabolic networks, have been found to have a scale-free or small-world structure [29, 30]. That means that, in order to produce simulated data with a realistic biological correlation structure, an algorithm should be capable of producing data with such a characteristic.

Furthermore, each algorithm allows one to generate networks of a specific size (number of nodes) to study the effect of the dimensionality.

2.3. Implementation

We performed our analyses using the statistical programming language R [62]. For the IPF and HTF algorithms we used the qpgraph package [48] and for the Kim algorithm we developed our own package called mvgraphnorm (available from CRAN). The networks were generated using the R package igraph [63] and the networks were visualized with NetBioV [64].

3. Results

3.1. Consistency of Generated Covariance Matrices With G

We begin our analysis by studying the overall quality of the algorithms IPF, Kim, and HTF by testing how well the independence relations in a given graph, G, are represented by the generated covariance matrices, respectively the partial correlation matrices.

In order to evaluate this quantitatively, we generate a network, G, that we use as an input for the algorithms. Then each of the three algorithms results in a constructed covariance matrix Σ_IPF, Σ_Kim and Σ_HTF from which the corresponding concentration matrices are obtained by,

\begin{array}{l} Ω_{\mathrm{IPF}} (G) = Σ_{\mathrm{IPF}}^{- 1} (G), & (24) \end{array}

\begin{array}{l} Ω_{\mathrm{Kim}} (G) = Σ_{\mathrm{Kim}}^{- 1} (G), & (25) \end{array}

\begin{array}{l} Ω_{\mathrm{HTF}} (G) = Σ_{\mathrm{HTF}}^{- 1} (G) . & (26) \end{array}

The partial correlation matrices Ψ_IPF(G), Ψ_Kim(G), and Ψ_HTF(G) follow from the concentration matrices and Equation (2). Here we included the dependency of the concentration and partial correlation matrices on G explicitly to emphasize this fact. However, in the following, we will neglect this dependency for notational ease.

We use the partial correlation matrices and compare them with G to check the consistency of the constructed structures. In order to do this, we need to convert a partial correlation matrix into a binary matrix, because G is binary. However, for numerical reasons, all three algorithms usually do not produce components of the partial correlation matrices that are exactly zero, i.e., ψ_ij = 0, but produce values of slightly larger magnitude. That means we cannot just filter a partial correlation matrix by

\begin{array}{l} ψ^{'}_{i j} = \begin{cases} 0 & \text{if } | ψ_{i j} | \leq θ \\ 1 & \text{if } | ψ_{i j} | > θ \end{cases} & (27) \end{array}

with θ = 0; instead, a threshold that is slightly larger than zero, i.e., θ > 0, is needed. For this reason, we use the following procedure to assess the compatibility of Ψ with G:

1. Obtain the indices from the adjacency matrix A(G) of G for all edges and non-edges, i.e.,

\begin{array}{l} I_{e} = \{ (i, j) \mid A {(G)}_{i j} = 1 \}, & (28) \end{array}

\begin{array}{l} I_{n e} = \{ (i, j) \mid A {(G)}_{i j} = 0 \} . & (29) \end{array}

2. Identify the sets of all elements of Ψ that belong to edges and non-edges, i.e.,

\begin{array}{l} ‖ Ψ (\text{edge}) ‖_{I} = \{ | ψ_{m} | \mid m \in I_{e} \}, & (30) \end{array}

\begin{array}{l} ‖ Ψ (\text{non-edge}) ‖_{I} = \{ | ψ_{m} | \mid m \in I_{n e} \} . & (31) \end{array}

Here ||X||_I is the set of absolute values of X and ||Ψ(edge)||_I and ||Ψ(non-edge)||_I are the sets of such elements.

3. Calculate a score, s, as the difference between the minimal element in ||Ψ(edge)||_I and the maximal element in ||Ψ(non-edge)||_I, i.e.,

\begin{array}{l} s = \min (‖ Ψ (edge) ‖_{I}) - \max (‖ Ψ (non-edge) ‖_{I}) . & (32) \end{array}

4. If the score s is larger than zero, i.e., s > 0, then Ψ is consistent with all independence relations in G. In this case we can set θ = max(||Ψ(non-edge)||_I) to filter the partial correlation matrix.

We want to remark that for s ≤ 0 the algorithm would result in false positive edges and hence, would indicate an imperfect result. In general, the larger s the further is the distance between the edges and non-edges and the better is their discrimination.
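The four steps above can be sketched in Python as follows. The function names and the small example matrices are hypothetical, and the sketch assumes that only off-diagonal pairs (i, j) with i < j enter the edge and non-edge sets and that both sets are non-empty.

```python
def consistency_score(psi, adjacency):
    """Return (s, theta) following steps 1-4; theta is None when s <= 0."""
    p = len(psi)
    edge_values, non_edge_values = [], []
    for i in range(p):
        for j in range(i + 1, p):  # off-diagonal pairs only (assumption)
            target = edge_values if adjacency[i][j] == 1 else non_edge_values
            target.append(abs(psi[i][j]))
    s = min(edge_values) - max(non_edge_values)  # Equation (32)
    theta = max(non_edge_values) if s > 0 else None
    return s, theta


def binarize(psi, theta):
    """Apply the filter of Equation (27) to the off-diagonal entries."""
    p = len(psi)
    return [[1 if i != j and abs(psi[i][j]) > theta else 0 for j in range(p)]
            for i in range(p)]


# Hypothetical partial correlation matrix for the chain graph 0 - 1 - 2,
# with a numerically small but non-zero value on the non-edge (0, 2).
adjacency_demo = [[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]]
psi_demo = [[1.0, 0.4, 1e-9],
            [0.4, 1.0, -0.3],
            [1e-9, -0.3, 1.0]]
s_demo, theta_demo = consistency_score(psi_demo, adjacency_demo)
recovered = binarize(psi_demo, theta_demo)
```

With θ set to the largest non-edge value, filtering the example matrix recovers the adjacency matrix of the chain graph exactly, which corresponds to the case s > 0.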

We studied a large number of BA, ER-RN, and WS networks with different parameters and different sizes. For all networks, we found that all three algorithms represent the independence relations in G perfectly, which means that for all three algorithms we find FP = FN = 0 (results not shown) and

\begin{array}{l} \min (‖ Ψ (edge) ‖_{I}) - \max (‖ Ψ (non-edge) ‖_{I}) > 0. & (33) \end{array}

In Figure 1, we show exemplary results for a BA network of size 100. More precisely, we show the distribution of the absolute partial correlation values for the three different methods and different parameter settings (see x-axis). In this figure, an “e” corresponds to the partial correlation values for edges, i.e., ||Ψ(edge)||_I, and “ne” for non-edges, i.e., ||Ψ(non-edge)||_I.

Figure 1. Distribution of absolute partial correlation values, ||Ψ||, for different methods. For G, we used a BA network of size 100. An “e” corresponds to the partial correlation values for edges and “ne” for non-edges.

We would like to emphasize that the algorithm by Kim is parameter free, whereas IPF and HTF depend on a parameter r (see section 2). Interestingly, for IPF/e and HTF/e with r = 0.6 the median partial correlation values are larger than 0.3. In contrast, for r = 0.0 these methods result in median partial correlation values around 0.05. Hence, this parameter allows one to influence the correlation strength.

Furthermore, for all three algorithms one can see that the maximal partial correlation values for non-edges are close to zero.

3.2. Global Structure of Covariance Matrices and Influence of Network Structures

Next, we zoom into the structure of the generated covariance matrices and the resulting concentration and partial correlation matrices in more detail. For this reason, we study distances between elements in these matrices. More precisely, we define the following measures to quantify such distances,

\begin{array}{l} d_{a} (1; Ω) = \min (‖ Ω_{a} (\text{edges}) ‖_{I}) - \max (‖ Ω_{a} (\text{non-edges}) ‖_{I}), & (34) \end{array}

\begin{array}{l} d_{a} (2; Ω) = \mathrm{median} (‖ Ω_{a} (\text{edges}) ‖_{I}) - \mathrm{median} (‖ Ω_{a} (\text{non-edges}) ‖_{I}), & (35) \end{array}

\begin{array}{l} d_{a} (1; Ψ) = \min (‖ Ψ_{a} (\text{edges}) ‖_{I}) - \max (‖ Ψ_{a} (\text{non-edges}) ‖_{I}), & (36) \end{array}

\begin{array}{l} d_{a} (2; Ψ) = \mathrm{median} (‖ Ψ_{a} (\text{edges}) ‖_{I}) - \mathrm{median} (‖ Ψ_{a} (\text{non-edges}) ‖_{I}) . & (37) \end{array}

Here an “a” means either the algorithm IPF, Kim, or HTF.

The first measure, d_a(1;Ω), gives the distance between the smallest element in ||Ω_a(edges)||_I and the largest element in ||Ω_a(non-edges)||_I, where, e.g., ||Ω_a(edges)||_I corresponds to all elements in the concentration matrix that belong to an edge in the underlying network, as given by G [see the similar definition for the partial correlation matrix in Equations (30, 31)]. That means, formally,

\begin{array}{l} ‖ Ω (\text{edge}) ‖_{I} = \{ | ω_{m} | \mid m \in I_{e} \}, & (38) \end{array}

\begin{array}{l} ‖ Ω (\text{non-edge}) ‖_{I} = \{ | ω_{m} | \mid m \in I_{n e} \} . & (39) \end{array}
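The two kinds of distances in Equations (34–37) differ only in the summary statistic that is applied to the edge and non-edge sets, so a single helper covers both Ω and Ψ. The following Python sketch is illustrative only; the function name and the 4 × 4 example are hypothetical, and only off-diagonal pairs with i < j are used.

```python
from statistics import median


def edge_distances(matrix, adjacency):
    """Return (d1, d2) for a concentration or partial correlation matrix."""
    p = len(matrix)
    edges, non_edges = [], []
    for i in range(p):
        for j in range(i + 1, p):  # off-diagonal pairs only (assumption)
            (edges if adjacency[i][j] == 1 else non_edges).append(abs(matrix[i][j]))
    d1 = min(edges) - max(non_edges)        # Equations (34) and (36)
    d2 = median(edges) - median(non_edges)  # Equations (35) and (37)
    return d1, d2


# Hypothetical matrix for the chain graph 0 - 1 - 2 - 3.
adjacency_chain = [[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]]
matrix_chain = [[1.0, 0.5, 0.01, -0.02],
                [0.5, 1.0, -0.3, 0.03],
                [0.01, -0.3, 1.0, 0.4],
                [-0.02, 0.03, 0.4, 1.0]]
d1_demo, d2_demo = edge_distances(matrix_chain, adjacency_chain)
```

In this example the edge values are 0.5, 0.3, and 0.4 and the non-edge values are 0.01, 0.02, and 0.03, so d1 is about 0.27 and d2 about 0.38.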

In Figures 2–4 we show results for the algorithms IPF, Kim, and HTF for BA, ER-RN, and WS networks of different sizes, ranging from 25 to 500 nodes. Due to the fact that all three algorithms result in a perfect reconstruction of the underlying networks, as discussed at the beginning of the results section, the entities d_a(1;Ω), d_a(2;Ω), d_a(1;Ψ), and d_a(2;Ψ) are always positive (as can be seen from the figures).

Figure 2. Effect of size differences in the elements of concentration and partial correlation matrices for the IPF algorithm. Shown are differences between values for edges and non-edges in dependence on the network size and network type. All values are averaged over 50 independent runs.

Asymptotically, for large network sizes, the values of the four measures decrease monotonically, except for d_Kim(2;Ω) of the Kim algorithm (Figure 3B). Furthermore, the structure of the underlying network has a larger influence for the Kim algorithm than for the IPF and HTF algorithms, because the values for d_Kim(1;Ω) and d_Kim(2;Ω) do not overlap for the three different network types.

Figure 3. Effect of size differences in the elements of concentration and partial correlation matrices for the Kim algorithm. Shown are differences between values for edges and non-edges in dependence on the network size and network type. All values are averaged over 50 independent runs.

The results from this analysis show clearly that the three algorithms have different working characteristics. First, the IPF and HTF algorithms are only weakly affected by the topology of the underlying network, and this effect even decreases for larger network sizes; see, e.g., Figures 2, 4. In contrast, the Kim algorithm shows a clear dependency on the network topology, because all three curves for BA, ER-RN, and WS networks are easily distinguishable from each other within, at least, one standard error; see Figure 3. Second, the distances between the median values of the concentration matrix, given by d_a(2;Ω), show a different behavior, because they are increasing. This reflects the different scale of the elements of the concentration matrix generated by the IPF and HTF algorithms on one side and the Kim algorithm on the other.

Figure 4. Effect of size differences in the elements of concentration and partial correlation matrices for the HTF algorithm. Shown are differences between values for edges and non-edges in dependence on the network size and network type. All values are averaged over 50 independent runs.

In order to clarify the latter point, we show in Figure 5 the possible range of these values. Specifically, we show normalized results for edges,

\begin{array}{l} r_{a} (edge) = \max (‖ Ω_{a} (edges) ‖_{I}) - \min (‖ Ω_{a} (edges) ‖_{I}), & (40) \end{array}

as a function of the network size n, i.e., r_a(edge, n). The results are normalized by dividing r_a(edge, n) by the maximal value obtained over all studied network sizes, i.e., r_a(edge, n)/max_n (r_a(edge, n)), so that the curves for all three algorithms can be shown in the same figure. One can see that the range of possible values in r_a(edge, n) increases for the Kim algorithm but decreases for IPF and HTF.
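The range in Equation (40) and its normalization over network sizes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that ‖·‖_I denotes the element-wise absolute value, and the function names `edge_range` and `normalize_over_sizes`, the toy matrix, and the second range value are hypothetical.

```python
def edge_range(omega, edges):
    """r_a(edge): max minus min of |omega_ij| over all edges (i, j)."""
    values = [abs(omega[i][j]) for i, j in edges]
    return max(values) - min(values)


def normalize_over_sizes(ranges_by_size):
    """Divide each r_a(edge, n) by the maximum over all studied sizes n."""
    largest = max(ranges_by_size.values())
    return {n: r / largest for n, r in ranges_by_size.items()}


# Toy concentration matrix for the path graph 0 - 1 - 2 (illustrative values only).
omega = [
    [2.0, -0.8, 0.0],
    [-0.8, 2.0, 0.3],
    [0.0, 0.3, 2.0],
]
edges = [(0, 1), (1, 2)]

r_small = edge_range(omega, edges)  # |-0.8| - |0.3|
# Hypothetical range for a second network size, used only to show the normalization.
normalized = normalize_over_sizes({3: r_small, 10: 2.0 * r_small})
```

After normalization, the largest range over all sizes equals one, so curves from algorithms whose elements live on different scales become comparable.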

Figure 5. Range of (normalized) values of concentration matrices for the three methods IPF (red), Kim (green), and HTF (blue). All values are averaged over 10 independent runs.

The situation changes when one uses values of the control parameter r of the IPF and HTF algorithms that are larger than zero. To investigate this quantitatively, we repeat the above analysis for the IPF and HTF algorithms, now setting r = 0.3 and r = 0.6. The results of this analysis are shown in Figure 6. The first two columns show results for r = 0.3, whereas the third column presents results for the IPF algorithm for r = 0.6. For these parameters, d_a(2, Ω) and d_a(2, Ψ) (see Figures 6D–F,J–L) are nearly constant, even for small network sizes. Furthermore, these distances are much larger than for r = 0.0 (see Figures 2–4). Another difference is that the distances d_a(1, Ω) and d_a(1, Ψ) (see Figures 6A–C,G–I) increase with the size of the networks, except for the BA networks (red curves). This also indicates that for r > 0 the topology of the underlying network G has a noticeable effect on the resulting concentration and partial correlation matrices, in contrast to the results for r = 0.0 (see Figures 2–4). Overall, the parameter r gives the IPF and HTF algorithms additional flexibility that considerably widens the observable spectrum of behaviors.

Figure 6. Effect of size differences in the elements of concentration and partial correlation matrices. (A–L) Differences between values for edges and non-edges in dependence on the network size and network type. All values are averaged over 50 independent runs.

3.3. Local Structure of Covariance Matrices and Heterogeneity of Its Elements

Finally, we investigate the local structure of covariance matrices. In Figures 7A–C we show a BA network with 100 nodes. The color of the edges codes the value of the elements of the (normalized) concentration matrices obtained for IPF, Kim, and HTF. Specifically, we map these values, from low to high, to the colors blue, green, and red. The three networks show that the coloring differs markedly, implying a significant difference in the rank order of the elements of the concentration matrices.

Figure 7. Local structure and heterogeneity of concentration/partial correlation matrices. Top: In (A–C) estimates of the concentration matrix are shown for a BA network with 100 nodes. The color of the edges corresponds to the value of the elements of the (normalized) concentration matrices, obtained for IPF, Kim, and HTF. The colors blue, green, and red correspond to low, average, and high values. Bottom: Normalized mean values of Ω (D,E) and Ψ (F,G) for BA and ER-RN networks of size 100. All values are averaged over 50 independent runs.

Next, we study the heterogeneity of the elements in the concentration and partial correlation matrices. More precisely, we aim to quantify the values of the elements of the concentration/partial correlation matrices that belong to edges with a certain structural property. For simplicity, we use the degree (deg) of the two nodes connected by an edge to distinguish edges structurally from each other. Specifically, we calculate for each edge an integer value, v, given by

\begin{array}{l} v (i, j) = \deg (i) + \deg (j) . & (41) \end{array}

Here deg(i) is the degree of node i, corresponding to the number of (undirected) connections of this node. This allows us to obtain the expectation value of the concentration/partial correlation elements in a network with a particular value of v, e.g., for v = d,

\begin{array}{l} 𝔼 [‖ Ω (edges) ‖_{I} | for edges with v = d], & (42) \end{array}

\begin{array}{l} 𝔼 [‖ Ψ (edges) ‖_{I} | for edges with v = d] . & (43) \end{array}
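The degree-sum of Equation (41) and the conditional means of Equations (42) and (43) can be sketched as follows. This is only an illustration under stated assumptions: ‖·‖_I is taken to be the element-wise absolute value, the expectation is estimated by a plain average over the edges of one network, and the function name `degree_sum_means` and all numerical values are hypothetical.

```python
from collections import defaultdict


def degree_sum_means(matrix, edges):
    """Mean |matrix_ij| over edges, grouped by v(i, j) = deg(i) + deg(j)."""
    degree = defaultdict(int)
    for i, j in edges:  # undirected edges: each one adds to both endpoints
        degree[i] += 1
        degree[j] += 1
    groups = defaultdict(list)
    for i, j in edges:
        groups[degree[i] + degree[j]].append(abs(matrix[i][j]))
    return {v: sum(vals) / len(vals) for v, vals in sorted(groups.items())}


# Toy network with 5 nodes: node 0 is a small hub, node 4 hangs off node 3.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
toy_values = {(0, 1): 0.2, (0, 2): 0.4, (0, 3): -0.1, (3, 4): 0.6}

# Symmetric matrix holding the (made-up) off-diagonal elements for the edges.
omega = [[0.0] * 5 for _ in range(5)]
for (i, j), value in toy_values.items():
    omega[i][j] = omega[j][i] = value

means = degree_sum_means(omega, edges)  # keys are the degree-sums 3, 4, and 5
```

The same function applies unchanged to a partial correlation matrix Ψ, because only the off-diagonal elements that belong to edges are read.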

In Figures 7D–G we show results for BA and ER-RN networks with 100 nodes. The results are averaged over 50 independent runs. For ease of presentation, we normalize the results for the IPF, Kim, and HTF algorithms independently of each other, by dividing by the maximal values obtained for different network sizes. This allows a representation of all three algorithms in the same histogram, despite the fact that the algorithms result in elements on different scales. Overall, we observe that edges with a higher degree-sum are systematically associated with lower expectation values of the elements of the concentration/partial correlation matrix. Because all three algorithms, even for different values of the parameter r, lead to similar results, our findings hint that this is a generic behavior that does not depend on the underlying network topology or algorithm. In summary, these results reveal a heterogeneity of the values of the concentration/partial correlation matrices.

4. Discussion

4.1. Origin of Inferential Heterogeneity of Gene Regulatory Networks

It is interesting to note that the presented results in Figure 7 follow a similar pattern as results for the inference of gene regulatory networks from gene expression data. More precisely, in previous studies [65–68] it has been found that inferring gene regulatory networks from gene expression data leads to a heterogeneity with respect to the quality (true positive rate) of the inferred edges. That means it has been shown that edges that are connecting genes with a high degree are systematically more difficult to infer than edges connecting genes with a low degree. This has been demonstrated for a number of different popular network inference methods and different data sets and, hence, is method independent [65–68]. In addition, more general structural components of networks have been investigated, e.g., network motifs by using local network-based measures [65, 68]. Also for these measures a heterogeneity in the inferability of edges has been identified.

The important connection between these results and our study is that the results presented in Figures 7D–G provide a theoretical explanation for the heterogeneity in the network inference. In order to understand this connection, we would like to emphasize the double role of the covariance matrix in this context. Suppose there is a GGM with a multivariate normal distribution given by N(μ, Σ) consistent with a network G. Then, by sampling from this distribution, we create a data set, D(m) = {X₁, …, X_m}, with X_i~N(μ, Σ), consisting of m samples. The data set D(m) can then be used for estimating the covariance matrix of the distribution from which the data have been sampled, resulting in

\begin{array}{l} S (D (m)) = \frac{1}{m - 1} \sum_{i = 1}^{m} (X_{i} - \bar{X}) {(X_{i} - \bar{X})}^{T}, & (44) \end{array}

with $\bar{X} = \frac{1}{m} \sum_{i = 1}^{m} X_{i}$. Asymptotically, i.e., for a large number of samples, we clearly obtain

\begin{array}{l} Σ = \lim_{m \to \infty} S (D (m)), & (45) \end{array}

as a converging result.
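The convergence in Equation (45) can be illustrated numerically with a small sketch. It is not part of the original analysis: the two-dimensional Σ, the sample sizes, the random seed, and the helper names `sample_covariance`, `draw_bivariate_normal`, and `max_abs_error` are all assumptions chosen for illustration.

```python
import math
import random


def sample_covariance(samples):
    """Sample covariance S(D(m)) with the 1/(m - 1) factor of Equation (44)."""
    m = len(samples)
    dim = len(samples[0])
    mean = [sum(x[k] for x in samples) / m for k in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for x in samples:
        centered = [x[k] - mean[k] for k in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += centered[a] * centered[b]
    return [[entry / (m - 1) for entry in row] for row in cov]


def draw_bivariate_normal(m, rho, rng):
    """Draw m samples from N(0, Sigma) with Sigma = [[1, rho], [rho, 1]]."""
    scale = math.sqrt(1.0 - rho * rho)  # lower-right entry of the Cholesky factor
    samples = []
    for _ in range(m):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((z1, rho * z1 + scale * z2))
    return samples


def max_abs_error(estimate, target):
    """Largest element-wise deviation between two matrices."""
    return max(abs(e - t)
               for row_e, row_t in zip(estimate, target)
               for e, t in zip(row_e, row_t))


rng = random.Random(0)
sigma = [[1.0, 0.5], [0.5, 1.0]]
errors = {
    m: max_abs_error(sample_covariance(draw_bivariate_normal(m, 0.5, rng)), sigma)
    for m in (20, 50000)
}
```

For the large sample size the deviation from Σ is expected to be small, whereas for m = 20 it is typically much larger; this finite-sample error is what matters for the argument that follows.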

The double role of the covariance matrix is that it is (1) a population covariance matrix for generating the data, and (2) a sample covariance matrix estimated from the data. Both coincide in the limit, but not in practice, where the number of samples m is finite. For this reason, asymptotically, i.e., for m → ∞, there is no heterogeneity in the inference of edges with respect to the error rate, because, as we saw at the beginning of the results section of this paper, Σ allows a perfect (error-free) inference of the network G, due to the fact that Σ is the population covariance matrix of a GGM consistent with G. However, for a finite number of samples this is not the case, as we know from a large number of numerical studies (e.g., [69]), because for finite data sets we are not able to estimate Σ without errors. Hence, the results of gene regulatory network inference studies mirror the results shown in Figures 7D–G, because the decaying normalized mean values of ∥Ω(edges)∥_I and ∥Ψ(edges)∥_I, respectively, are indicative of a decaying signal strength, and smaller signals are more difficult to infer in the presence of noise (measurement errors) than larger signals.

Based on the results of our paper (especially those in Figures 7D–G), we can provide an answer to the fundamental question of whether the systematic heterogeneity observed for the inference of gene regulatory networks is due to the imperfection of the statistical methods employed for estimating the sample covariance matrix S, or whether this systematic heterogeneity is already present in the population covariance matrix Σ. Our results provide evidence that the latter is the case, because our study did not rely on any particular network inference method. Hence, this provides a probabilistic explanation for the statistical observations from numerical studies.

4.2. Computation Times

Finally, we present information about the time it takes to generate constrained covariance matrices that are consistent with a given graph structure. The following execution times have been obtained with a 1.6 GHz Intel Core i5 processor with 8 GB RAM.

In Table 1, we show the execution times for the algorithms IPF, Kim, and HTF. We would like to emphasize that the shown execution times refer only to the generation of one constrained covariance matrix and do not include any other analysis component. One can see that there are large differences between the three algorithms and that HTF is considerably faster than the other algorithms. For instance, for generating a constrained covariance matrix of dimension m = 500, HTF is almost 12 times faster than Kim and 3 times faster than IPF.

Table 1. Average computation times for the algorithms IPF, Kim, and HTF.

The parameter r also has an influence on the execution time. For instance, for HTF it takes 2.6 times longer to generate a constrained covariance matrix of dimension m = 500 with r = 0.3 than with r = 0.0. For r = 0.6 this effect is increased by a further factor of 3.5. Hence, utilizing the additional flexibility of this parameter increases the computation times significantly.

In summary, the three simulation algorithms are sufficiently fast to study problems up to a dimension of D ~ O(10³−10⁴). Essentially all simulation studies for the inference of gene regulatory networks are performed for such dimensions, e.g., [70–72], because it has been realized that such network sizes are sufficient to study the problems occurring in high dimensions. Hence, all three algorithms can be used for this analysis.

Beyond this application domain, it is interesting to note that GGMs in general are also studied numerically up to a dimension of D ~ O(10³−10⁴); see, e.g., [73, 74]. Hence, for essentially all application domains the three algorithms can be used to study high-dimensional problems, but the HTF algorithm could be favored for reasons of computational efficiency.

5. Conclusion

In this paper, we investigated three different methods for generating constrained covariance matrices. Overall, we found that all methods generate covariance matrices that are consistent with a given network structure, containing all independence relations among the variables. For a parameter of r = 0.0 for the IPF and HTF algorithms, we found that the Kim algorithm leads to favorable results. However, for r > 0, the IPF and HTF algorithms result in a spectrum of possible distributions that is considerably broader than that of the Kim algorithm. This extra flexibility could be an advantage for simulation studies.

Regarding computation times of the algorithms, we found that Kim performs slowest. For the IPF algorithm, the execution times can be extended by some outliers that considerably slow down the execution. The HTF and IPF algorithms perform similarly, with slight advantages for HTF, which is overall fastest. Taken together, the HTF algorithm is the most flexible and fastest algorithm and should be the preferred choice for applications.

Aside from the technical comparisons, we found that the generated concentration and partial correlation matrices possess a systematic heterogeneity, independent of the algorithm and the underlying network structure used to provide the independence relations. This heterogeneity is similar to the well-known systematic heterogeneity in studies inferring gene regulatory networks by employing statistical estimators for the covariance matrix [65–67]. Hence, the empirically observed higher error rates for molecular interactions connecting genes with a high node-degree seem to be due not to deficiencies of the inference methods but to the smaller signal strength in such interactions, as measured by the components of the concentration matrix (Ω) or the partial correlation matrix (Ψ). The implication of this finding is that perturbation experiments, rather than novel inference methods, are required to transform an interaction network into a more amenable form that can be measured. To accomplish this, the simulation algorithms studied in this paper could be utilized for setting up an efficient experimental analysis design.

Author Contributions

FE-S conceived this study. FE-S and ST performed the analysis. FE-S, ST, and MD wrote the paper. All authors approved the final version of the manuscript.

Funding

MD thanks the Austrian Science Funds for supporting this work (project P 30031).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Robert Castelo and Ricardo de Matos Simoes for fruitful discussions.

References

1. Lander ES. The new genomics: global views of biology. Science. (1996) 274:536–9. doi: 10.1126/science.274.5287.536

2. Nicholson J. Global systems biology, personalized medicine and molecular epidemiology. Mol Syst Biol. (2006) 2:52. doi: 10.1038/msb4100095

3. Quackenbush J. The Human Genome: The Book of Essential Knowledge. Curiosity Guides. New York, NY: Imagine Publishing (2011).

4. Beadle GW, Tatum EL. Genetic control of biochemical reactions in neurospora. Proc Natl Acad Sci USA. (1941) 27:499–506. doi: 10.1073/pnas.27.11.499

5. Dehmer M, Emmert-Streib F, Graber A, Salvador A. (Eds.) Applied Statistics for Network Biology: Methods for Systems Biology. Weinheim: Wiley-Blackwell (2011).
