Node-Structured Integrative Gaussian Graphical Model Guided by Pathway Information

Up to date, many biological pathways related to cancer have been extensively applied thanks to outputs of burgeoning biomedical research. This leads to a new technical challenge of exploring and validating biological pathways that can characterize transcriptomic mechanisms across different disease subtypes. In pursuit of accommodating multiple studies, the joint Gaussian graphical model was previously proposed to incorporate nonzero edge effects. However, this model is inevitably dependent on post hoc analysis in order to confirm biological significance. To circumvent this drawback, we attempt not only to combine transcriptomic data but also to embed pathway information, well-ascertained biological evidence as such, into the model. To this end, we propose a novel statistical framework for fitting joint Gaussian graphical model simultaneously with informative pathways consistently expressed across multiple studies. In theory, structured nodes can be prespecified with multiple genes. The optimization rule employs the structured input-output lasso model, in order to estimate a sparse precision matrix constructed by simultaneous effects of multiple studies and structured nodes. With an application to breast cancer data sets, we found that the proposed model is superior in efficiently capturing structures of biological evidence (e.g., pathways). An R software package nsiGGM is publicly available at author's webpage.


Introduction
Genomic data have been extensively applied to analyze disease mechanism on the basis of predictive signatures from DNA alterations (e.g., genotyping and mutation), RNA transcription (e.g., gene or isoform expression and fusion transcripts), and gene regulation by epigenetic changes (e.g., methylation, protein-DNA interaction, and miRNA expression). In particular, gene regulation is a complicated system that builds on tens of thousands of cellular components' interactions and diverse activities across multiple layers. Biological networks are the most popularly used data resource to sketch this interconnectivity of gene regulations. High-throughput genomic technologies are paving the way toward systematically characterizing diverse types of biological networks and suggestive of underlying gene regulation mechanisms. And yet a complete inference of network's complexity has been a long concern in the field of systems biology.
To circumvent the shortcoming of single feature-based analysis, the activity of a gene or of a whole biological process in a disease can be assessed by sets of genes (a.k.a. gene set enrichment analysis or pathway analysis). In doing so, a bulk of pathways have been identified through many cancerrelated researches [1]. Pathway information demonstrates cellular functions and biological processes or represents a unique signature of deregulation of a given gene [2]. For example, the pathway or signature associated with the activity of a given oncogene is defined as the set composed of those genes most differentially expressed by perturbation of oncogenes [3][4][5]. Importantly, the usage of pathway information is increasingly prevalent in biomedicine. For instance, target drug associated with potential pathway is taken as a practical solution to overcome the traditional drug discovery that usually adopts the one-drug-one-target approach. This strategy takes into account the fact that the disease occurrence is usually the result of complex interactions of molecular events.
In recent years, large-scale genomic data generated from relevant biological experiments or clinical hypotheses have increasingly soared, as high-throughput experiment technologies have markedly advanced [6]. Such increasing genomic data has been publicly available in data repositories (e.g., Gene Expression Omnibus and Sequence Read Archive). This abundance of biological experiments poses a new challenge of multiple data in regard to exploring and validating biological signatures and pathways. More precisely, a question of network analysis often relates to how to characterize underlying transcriptomic patterns or molecular mechanisms across disease subtypes or between case-control groups, because it is commonplace that biological signals are not coherently present across studies. Generally a single network [7][8][9] is found to accurately estimate underlying dependency with an adjustment of gene perturbation effects (e.g., polymorphic genotype alteration [10,11]). Nonetheless, these methods hardly discover network patterns of subtle signals and dynamic features in the midst of coupled networks under diverse conditions. Moreover, single networks potentially generate many potential false positive signals (edges) attributed to experimental biases and errors. To address this challenge, the recent trend of data analysis has been in the spotlight to data integration allowing for multiple data to achieve a more accurate network inference. To this end, many have proposed methods to combine multiple networks based on unified model [12][13][14]. This approach is also known as integrative analysis and is analogue to traditional meta-analysis.
The joint Gaussian graphical model (JGGM; Danaher et al. [12]) focuses on incorporating nonzero edge effects (i.e., off-diagonal entries of precision matrix) to combine multiple studies in view of integrative analysis. This model, however, inevitably is dependent on post hoc analysis when validating biological significance. Therefore it is interesting to combine not only DNA and/or transcriptomic changes but also pathway information as such well-ascertained biological evidence. Normally we perform post hoc analysis to see if the estimated gene networks are enriched for any pathways. Contrary to this, it is also sensible to estimate gene networks, with an adjustment of pathway information. It is common that we hardly combine pathway information in spite of its biological significance. To the best of our knowledge, no method has been proposed that can accommodate overlapping node structures, mainly due to overlapped gene annotations of pathway gene sets. To tackle this problem, we propose a new graphical model called "node-structured integrative Gaussian graphical model (nsiGGM)" jointly leveraging a priori knowledge of pathway information. This method allows for overlapping group lasso problems, making it possible to integrate overlapped genes of pathways. It is worthwhile for biological pathways to intervene the network estimation to reveal true gene regulatory network. The nsiGGM builds on prespecified structured nodes with multiple genes as building blocks in the stage of estimating a precision matrix. The implementation rule employs 1 / 2 lasso penalty of structured input-output lasso model [15], in order to estimate sparse precision matrix that accounts for simultaneous effects of multiple studies and structured nodes. With an application to simulated and breast cancer genomic data, the proposed model is found to be superior in efficiently capturing transcriptional modules predefined by pathway database. A software package (nsiGGM) is publicly available at author's webpage (https://sites.google.com/site/sunghwanshome/). This paper is outlined as follows. In Section 2, we review background knowledge of the standard and joint Gaussian graphical models. In addition, we propose the nodestructured integrative Gaussian graphical model (nsiGGM). In Section 3, we describe an implementation strategy that is primarily based on the input-output lasso. In Section 4, we compare performance of our proposed methods with other methods using real breast cancer data (TCGA) and simulated data. In Section 5, conclusions and further studies are discussed.

Method
In this section, we briefly discuss methodological backgrounds on the Gaussian graphical models (GGM) aiming at constructing gene networks. In what follows, we propose the node-structured integrative Gaussian graphical model (nsiGGM) that can accommodate a priori biological knowledge (e.g., pathway data or targeted predictive genes of miRNA).

Gaussian Graphical Models for Gene Networks.
A Gaussian graphical model demonstrates the conditional dependency of multiple random variables, 1 , . . . , , with a graph = ( , ), where = {1, . . . , } is a set of nodes and is a set of edges indicating that nodes are linked and conditionally dependent. Let follow the multivariate Gaussian distribution (0, Σ), where Σ is a × covariance matrix. Let Σ −1 = Θ denote the inverse covariance matrix (also known as a precision matrix). More precisely, each nonzero off-diagonal element implies conditional dependency between the th and th nodes given all the other variables, , = 1, . . . , , whereas the covariance Σ presents marginal dependencies without considering other variables. This model is also called a GGM [16]. The graphical lasso [9,17] produces a sparse Gaussian graphical model constructed in nonpenalized edges in Θ. The graphical lasso minimizes the negative log-likelihood with the 1 lasso penalty: where tr( ) is the trace of matrix , is the sample covariance matrix, and ‖Θ‖ 1 = ∑ ∑ | | is the regularization parameter Computational and Mathematical Methods in Medicine 3 adjusting the degree of sparsity. The optimal value for can be chosen by cross-validation or the Bayesian information criterion (BIC; Schwarz [18]; Yuan and Lin [8]).

Joint Gaussian Graphical Models for Combining Multiple
Studies. In this section, we revisit the joint Gaussian graphical models (JGGM) proposed by Danaher et al. [12]. Simply put, the JGGM combines multiple studies and constructs multiple networks in a unified model. Let denote the number of studies in our data and {Σ −1 } = (Σ −1 1 , . . . , Σ −1 ) the true precision matrices. Consider genomic data of studies, (1) , . . . , ( ) , each of which consists of samples with common features, where ≥ 2. We assume that ∑ =1 observations are independent and that those of each data set follow the multivariate normal distribution as ( ) ∼ ( , Σ ) for 1 ≤ ≤ . It is well known in metaanalysis that multiple data sets are of common associations and genomic characteristics among features (e.g., genetic association intensity). It, therefore, is worth estimating precision matrices across studies in parallel rather than separate estimation. To this end, we assume that the features within each data set are centered and take the form of a penalized log-likelihood with the group sparsity-inducing 2 penalty that maximizes (2) with respect to {Θ}: subject to {Θ} = (Θ (1) , . . . , Θ ( ) ) being positive definite, where ( ) = (1/ )( ( ) ) ( ) is the sample covariance matrix of ( ) and 1 , 2 are nonnegative tuning parameters. It is interesting to note that the 2 -penalty captures similarity across the precision matrices. Due to this property, the penalty terms of (2) are also referred to as the joint graphical lasso (JGL). Moreover, the 1 penalty induces estimated precision matrices to be sparse.

Node-Structured Integrative Gaussian Graphical Model.
In this section, we propose an integrative graphical model that can accommodate a priori known structure of genomic features. Learning gene networks, the sparseness of precision matrix can be guided to some extent by known feature modules (e.g., pathway information). Typically data integration allows picturing the interplay of underlying biological factors. In this regard, it is worthwhile accommodating known feature module information ascertained in previous experiments. In doing so, we seek to integrate a priori feature module information to be embedded across multiple networks via an additional 2 group penalty. The following objective function is taken to minimize where g is a subset of off-diagonal entry indices of Θ for 0 ≤ ≤ , = {g 1 , . . . , g }, is the number of a priori feature modules, and ‖Θ ( ) g ‖ 2 = (∑ ( , )∈g ( ) 2 ) 1/2 . Importantly, it is noted that elements of g can be overlapped (e.g., duplicated genes of two different pathways). The third penalty, ‖Θ ( ) g ‖ 2 adjusted by 3 ≥ 0 pertains to structured feature modules (i.e., structured node in networks) on the basis of a priori known information. Here, unbiased regularization to each feature should be taken into consideration, in the sense that the feature overlapping inevitably comes into play.
In what follows, we present a toy example to demonstrate how a priori information constructs feature modules in Θ ( ) g . In Figure S1, in Supplementary Material available online at https://doi.org/10.1155/2017/8520480, we take an example of networks consisting of 5 common nodes (e.g., genomic features) across three studies. In Figure S1A, the second penalty with 2 captures matched up common edges (e.g., 14 , (2) 14 , (3) 14 ) identical to the joint graphical lasso. Besides, the third group lasso penalty with 3 accommodates the six edges of the three features in a predefined module Θ ( ) g 1 so that feature regulatory effects can be further modeled in the context of data integration (see Figure S1B). Importantly note that this module structure (e.g., pathway) is priorly known knowledge. It is interesting that this approach is in line with the integrative cluster [19] that allows for cisregulatory effects and target gene prediction for miRNAs. In the case of multiple modules in network, suppose that we are given a set of five genes { } and a precision matrix { } for 1 ≤ , ≤ 5. Let a priori information generate two feature modules defined as Module 1,  4) is simultaneously present in both g 1 and g 2 , implicating that a suitable implementation is required for regularization to the overlapped component (3,4). To estimate solutions to (3), we apply the structured input-output lasso [15] that can handle overlapped features, making it possible to learn a model allowing for both single-node effects across studies and predefined node structures (e.g., pathway modules). Inspired by integrative nature of this method, we call this graphical model the node-structured integrative Gaussian graphical model (nsiGGM). When it comes to tuning the penalty parameters ( 1 , 2 , and 3 ), the BIC is applied to determine the optimal sparseness of networks' edges.

Structured Alternating Directions Method of Multipliers
Algorithm. In this section, we delineate the implementation strategy for the nsiGGM. We solve problem (3) by using structured alternating directions method of multipliers algorithm (sADMM). The alternating directions method of multipliers algorithm (ADMM) was previously introduced to tackle the problem of the JGL [12]. Similar to the JGL, the sADMM proposed in spirit of the ADMM is designed to adopt the structured input-output lasso in order to embed node structures into the model. We first reformulate (3) with (Θ) and as maximize Θ, where . . , ( ) ) that satisfies positive definiteness. Boyd et al. [20] proposed the scaled augmented Lagrangian to solve problem (4) by where { } = ( (1) , . . . , ( ) ) are dual variables and ‖ ‖ denotes the Frobenius norm of matrix (i.e., ‖ ‖ = √∑ ∑ 2 ). The sADMM algorithm repeatedly solves the three-step optimization with respect to Θ ( ) , ( ) , and ( ) , starting with initial values of the related parameters: Θ ( ) = , ( ) = 0, and ( ) = 0 for 1 ≤ ≤ . The iteration is repeated until convergence as follows: In Θ-step for 1 ≤ ≤ , update In -step, for = 1, . . . , , update ( ) that minimizes where ( ) ( ) = Θ ( ) ( ) + ( ) ( −1) . To find the optimal solution of (7), we directly apply the structured input-output lasso [15] to (7) using both coordinate descent algorithm and KKT conditions considered to boost up the computational speed. For more details, see [15]. In -step, for = 1, . . . , , update ( ) as ( ) Update repeatedly the three parameters until convergence by a stopping rule below: Putting together, Algorithm 1 encapsulates the structured alternating directions method of multipliers algorithm.

Simulated Data.
In this section, we carry out experimental studies to assess performance of the nsiGGM. In brief, the following describes how we generate simulated data. The experimental scheme is largely motivated by Chun et al. [14]. Let be the total number of studies, each containing true signal genes = 40 for a priori module (e.g., pathway genes) and sample size ( ) = 100, where 1 ≤ ≤ (=3). Starting off with edges of signal genes, we first generate network edges of 100 nodes subject to the scale-free network structures, the most commonly observed structures in biology, being simulated by applying the Barabasi Albert algorithm [21]. Subsequent to this, we randomly added four edges to impose random effects. Constructing network structures, we simulate the precision matrices by setting values of the offdiagonals sampled from Unif(−0.1, 0.1) and by setting the diagonal elements with ∑ ̸ = | ( ) , |. The process is repeated until Θ ( ) becomes a positive definite matrix. For simulating ( ) , we first consider a scenario such that no covariate incurs dependency among genes. Thus, this is an ideal experiment scenario in that any conditional dependency is not taken into consideration to the model. We simulated ( ) , where each th row of ( ) was randomly sampled from (2) Update Θ ( ) , ( ) and ( ) until convergence for = 1, . . . , : (ii) -step For = 1, . . . , , update ( ) that minimizes Algorithm 1: The structured alternating directions method of multipliers algorithm.   sensitivity, specificity, and Youden index were benchmarked by comparing the JGGM [12] and GGM [16]. Youden index is defined as Sensitivity + Specificity − 1, ranging from −1 to 1. In principle, the higher the Youden index, the higher the prediction accuracy.
In Table 1, Youden index of the nsiGGM appears to be clearly declining as noise edges increase in number and yet is consistently larger than that shown in the JGGM and GGM. This is mainly due to the fact that the JGGM suffers low specificity (0.8685-0.8733) compared to the nsiGGM (0.9433-0.9481). In contrast, the JGGM slightly outperforms, when 30 and 40 noises are augmented, the nsiGGM for sensitivity at the expense of poor specificity. Taken together, it is clear to say that the nsiGGM is superior to the JGGM and GGM in detecting the true underlying pathway sets.

Application to Genomic Data.
In this section, we demonstrate applications to three mRNA expression profiles for breast cancer. We collected two microarray profiles from Desmedt et al. [22], Wang et al. [23], and TCGA cancers data from TCGA's web portal (https://cancergenome.nih.gov/), where we retrieved mRNA data of breast carcinoma (BRCA). We matched up features across all studies and filtered out probes by the rank sum of mean and standard deviation (SD < 0.99; Wang et al. [24]), which selected 106 genes. Table 2 delineates detailed information of miRNA expression data. In what follows, we examine if the nsiGGM is suited to improve accuracy for detecting pathway genes. We collected gene sets from exploring the Molecular Signatures Database (MSigDB) v2.5 gene set collections [25], consisting of at least 11 genes of 106 genes, of which 53 distinct genes belong to the 11 pathways presented in Table 3. To evaluate a detection rate of pathway genes, we define an evaluation benchmark, (⋅) as follows: (Φ ) = ∑ ∈Φ ( th gene of th network's node belongs to any of given pathways) # of total selected nodes , where Φ is a set of gene indices, whose genes form th network and (⋅) is an indicator function. Comparing the JGGM, we examine whether the nsiGGM effectively captures the existing pathway structures better than the JGGM in context of connectivity and proportions of identified pathway genes. To first appearances, the nsiGGM effectively represents modules well enriched with pathway genes in Figure 1, as compared to those of the JGGM in Figure 2. In support of this notion, given that we observed ∑ (Φ ) of nsiGGM = 0.573 and ∑ (Φ ) of JGGM = 0.521, where 1 ≤ ≤ 3, it is not surprising to say that the nsiGGM can facilitate constructing gene networks biologically more enriched for pathway gene sets than the JGGM. Table 3 enumerates the pathway genes discovered by the nsiGGM, each being highlighted by bold and underlined characters (note: asterisks represent pathway genes identified by only the nsiGGM not by JGGM). Interestingly, there are many pathways genes monitored by the nsiGGM, but not by the JGGM. Focusing on the cell signaling pathway, we particularly notice that EREG [26], SLC1A1 [27], STC2 [28], GAD1 [29], and TRH [30] are genes not selected by the JGGM but nonetheless previously were monitored in signaling pathways. Importantly, Hou et al. [28] showed that STC2 inhibited tumorigenesis and metastasis of breast cancer cells, indicating that STC2 may inhibit epithelialmesenchymal transition (EMT) at least partially through the PKC/Claudin-1-mediated signaling in human breast cancer cells. Therefore, STC2 can be taken into consideration as a potential biomarker for metastasis and targeted therapy in human breast cancer. Besides, signaling through glutamate receptors in regard to SLC1A1 has been reported in human cancers [31]. In support of this evidence, it is also well known that increases in SLC1A1 expression subject to hypoxiainducible factors (HIFs) possibly contributes to increased efflux of glutamate, by which glutamate transporters and receptors are regulated to activate key signal transduction pathways that promote cancer progression [27]. Therefore, it is clear to say that the nsiGGM is superior in detecting genes capable of implicating the functional process of human cancers in essence.

Conclusion and Discussion
In this article, we propose a new graphical model called nodestructured integrative Gaussian graphical model (nsiGGM) jointly learning Gaussian graphical models with an emphasis of prior knowledge of pathway information. It is highlighted that this method allows us to handle overlapping group lasso problems, making it possible to integrate overlapped pathway gene sets. With applications to experimental and real data, we verified outstanding numerical performance of the nsiGGM and analytical capability of inducing biological significance related to breast cancer. And yet it might be controversial whether prior knowledge too excessively determines the network structures. Despite apprehension to overly guided network structures, a priori known information can be still acceptable in that the nsiGGM selects tuning parameters on the basis of the likelihood-based BIC. The proposed nsiGGM is highly subject to computational complexity in nature, mainly due to the coordinate decent algorithm to tackle the sparse overlapping group lasso. Since the sparse overlapping group lasso applied here deals with both study-specific effects and prior knowledge, the optimization becomes inevitably complicated. Our current package is implemented in R and the routine flows can be further expedited via C/C++ in the future. Currently, the prior knowledge of regulatory structure is accommodated to an unidirectional graphical model. It is also interesting that we impose the prior knowledge to directional networks instead, so that the presence or absence of directional edges