Estimation of multiple networks in Gaussian mixture models

We aim to estimate multiple networks in the presence of sample heterogeneity, where the independent samples (i.e. observations) may come from different and unknown populations or distributions. Specifically, we consider penalized estimation of multiple precision matrices in the framework of a Gaussian mixture model. A major innovation is to take advantage of the commonalities across the multiple precision matrices through possibly non-convex fusion regularization, which for example makes it possible to achieve simultaneous discovery of unknown disease subtypes and detection of differential gene (dys)regulation in functional genomics. We embed in the EM algorithm one of two recently proposed methods for estimating multiple precision matrices in Gaussian graphical models. We demonstrate the feasibility and potential usefulness of the proposed methods in an application to glioblastoma subtype discovery and differential gene network analysis with a microarray gene expression data set. We also conduct realistic simulation studies to evaluate and compare the performance of various methods.


Introduction
We consider the problem of estimating multiple networks in the presence of sample heterogeneity; that is, the samples come from several populations with different Gaussian distributions, but it is unknown which samples are from which distributions. The precision matrix of each distribution corresponds to a network. This is related to, but differs from, the usual task of inferring and contrasting multiple networks in Gaussian graphical models, where it is known which samples are from which distributions (Guo et al. 2011 [11]; Danaher et al. 2014 [4]; Zhu et al. 2014 [38]). Although Gaussian mixture models are widely used for model-based clustering, our primary goal is the estimation and comparison of cluster-specific precision matrices, for which existing model-based clustering methods ([18, 8, 37]) are not suitable. The existing model-based clustering methods either specify a common precision matrix or estimate multiple unconstrained cluster-specific precision matrices; due to the lack of a fusion penalty or other mechanisms, the cluster-specific precision matrix estimates are either exactly the same or completely different. On the other hand, in many applications one would expect both commonalities and differences among the cluster-specific precision matrices. Accounting for their commonalities not only improves statistical estimation efficiency through information borrowing, but also enhances interpretability by focusing on the few possible changes across the cluster-specific precision matrices.
Our proposed methods were motivated by genomic applications to disease subtype discovery while accounting for differential gene expression and/or differential gene regulation across (unknown) disease subtypes. This is in contrast to existing methods allowing for only differential gene expression in disease subtype discovery (Verhaak et al. 2010 [33]). Arguably, a biologically more interesting problem is not only detecting differential gene expression, but also discovering gene dysregulation, across the to-be-discovered disease subtypes, which will facilitate understanding disease mechanisms and thus developing individualized treatments.
Our approach is in the framework of multivariate Gaussian mixture modeling (McLachlan and Peel 2001 [18]). The majority of the existing literature on mixture modeling focuses on regularizing only the mean parameters with diagonal covariance matrices (Pan and Shen 2007 [23]; Wang and Zhu 2008 [34]; Xie et al. 2008 [36]), though some authors (Zhou et al. 2009 [37]; Hill and Mukerjee 2013 [12]; Wu et al. 2013 [35]) have started considering regularization of the covariance parameters too; none of these works, however, touches on the key issue of identifying both common and varying substructures of the precision matrices across the components of a mixture model. Since these methods always give different networks for different populations unless a common network is assumed, they do not address the question of interest here: which parts of the networks change with the populations. To address this question, we propose embedding one of the current methods for estimating multiple Gaussian graphical models (Danaher et al. 2014 [4]; Zhu et al. 2014 [38]) in the EM algorithm (Dempster et al. 1977 [6]) for the Gaussian mixture model, so that the existing algorithms can be used effectively in the M-step.
Since these methods apply a fusion penalty to shrink multiple networks towards each other, they are not only statistically more efficient through information borrowing, but also facilitate interpretation by identifying differential network substructures. In particular, due to the use of a non-convex penalty, the method of Zhu et al. (2014) [38] strives to uncover the commonalities among multiple networks while maintaining their unique substructures.
Due to the connections to and differences from our current problem, we briefly review the literature on Gaussian graphical models without sample heterogeneity; that is, where it is known that the samples come from the same Gaussian distribution. Gaussian graphical models are commonly used to describe conditional dependence relationships between interacting variables for continuous multivariate data. They are widely applied to reveal the structures of gene regulatory networks ([9, 7]), protein interaction networks ([10, 30, 15]) and brain functional connectivity ([13, 15]). Each network or graph consists of a set of nodes representing variables (e.g. genes) and a set of edges; an edge between two nodes indicates the conditional dependency of the two nodes given all other nodes. In Gaussian graphical models, the edges between nodes are determined by the non-zero off-diagonal elements of the precision matrix (the inverse of the covariance matrix). Therefore, reconstructing the graph is equivalent to estimating the precision matrix in the Gaussian graphical model. Friedman et al. (2008) [10] proposed the graphical lasso method to estimate the (inverse) covariance matrix, providing an efficient algorithm to directly maximize the L1-penalized log-likelihood. While the graphical lasso is fast, it focuses on estimating only a single graph; it ignores the structural similarities of multiple graphs when applied to estimate each graph separately. Recent works aim to recognize possible commonalities among multiple graphs. Peterson et al. (2015) [24] proposed a Bayesian approach to estimating multiple Gaussian graphs, placing a Markov random field prior on the edges and a spike-and-slab prior to control the similarity between graphs. Qiu et al. (2015) [25] proposed a kernel method for joint estimation of multiple Gaussian graphs. Guo et al. (2011) [11] proposed to control the sparsity of the off-diagonal elements of the precision matrices and to use the L1 penalty to control the differences between the off-diagonal elements for each pair of precision matrices. Danaher et al. (2014) [4] proposed the joint graphical lasso algorithm, which uses the L1 penalty to regularize both the sparsity and the differences between corresponding off-diagonal elements for each pair of precision matrices. Mohan et al. (2014) [21] extended the joint graphical lasso by taking a node-based approach to estimating multiple Gaussian graphs. Recently, Zhu et al. (2014) [38] proposed a regularized maximum likelihood method for estimating multiple precision matrices: in addition to seeking sparseness with a non-convex penalty on the off-diagonal elements of each precision matrix, it also imposes a non-convex fusion penalty on the differences between each pair of related precision matrices, which can be flexibly specified.
The rest of this paper is organized as follows. In Section 2 we introduce our proposed new methods for estimating component-wise precision matrices in the framework of a Gaussian mixture model. Section 3 presents simulation studies to demonstrate the promising performance of our proposed methods, followed in Section 4 by an application to a glioblastoma gene expression data set. We conclude in Section 5 with a summary of our findings.

Gaussian mixture model
We assume that each of n iid p-dimensional observations, x_1, x_2, . . ., x_n, comes from a Gaussian mixture distribution with probability density function

f(x_j; Θ) = Σ_{i=1}^g π_i f_i(x_j; θ_i),  (1)

where g is the number of components (or populations), π_i is the prior probability for component i with π_i ≥ 0 and Σ_{i=1}^g π_i = 1, θ_i = (μ_i, V_i) is the set of the mean and covariance matrix parameters for cluster i, and f_i is a multivariate normal density with component-specific mean μ_i and covariance matrix V_i. Since each component corresponds to a cluster, we will refer to components and clusters interchangeably. The primary goal here is to estimate the cluster-specific precision matrices W_i = V_i^{-1}, though identifying the clusters is often of interest either as a direct or side product.
Given the data, the log-likelihood is

log L(Θ) = Σ_{j=1}^n log [ Σ_{i=1}^g π_i f_i(x_j; θ_i) ],

where Θ = {(π_i, θ_i) : i = 1, 2, ..., g} denotes the set of all unknown parameters. An Expectation-Maximization (EM) algorithm [6] is often used to obtain the maximum likelihood estimates. For high-dimensional data, it is often beneficial to use the maximum penalized likelihood estimator based on a penalized log-likelihood

log L_P(Θ) = log L(Θ) - p_λ(Θ),  (2)

where p_λ(Θ) is to be specified as a penalty on all or a subset of the parameters.
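As a concrete illustration, the observed-data and penalized log-likelihoods above can be computed directly from their definitions. The sketch below is illustrative only (not the paper's implementation): it uses univariate components for clarity, and the L1 penalty on the means is just one example of a choice for p_λ(Θ); all function names are our own.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_loglik(xs, pis, mus, vars_):
    """log L(Theta) = sum_j log( sum_i pi_i * f_i(x_j; theta_i) )."""
    total = 0.0
    for x in xs:
        total += math.log(sum(p * normal_pdf(x, m, v)
                              for p, m, v in zip(pis, mus, vars_)))
    return total

def penalized_loglik(xs, pis, mus, vars_, lam):
    """log L_P(Theta) = log L(Theta) - p_lambda(Theta), here with an
    illustrative L1 penalty on the mean parameters."""
    return mixture_loglik(xs, pis, mus, vars_) - lam * sum(abs(m) for m in mus)
```

With `lam = 0` the penalized version reduces to the plain log-likelihood, matching (2).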
Various penalties have been proposed to achieve better performance in different contexts.

New method 1: With a convex penalty
A zero entry W_{i;kl}, the (k, l)th entry of W_i, indicates conditional independence between the kth and lth variables in cluster i given the other variables. Estimating multiple cluster-specific precision matrices can reveal changes of dependency structures across multiple clusters. To facilitate detecting structural changes, a penalty is imposed on the differences between the corresponding entries across multiple precision matrices. We propose using the joint lasso and fused graphical lasso (FGL) penalty of Danaher et al. (2014) [4] on the precision matrices W_i:

p_λ(Θ) = λ_1 Σ_{i=1}^g Σ_{k≠l} |W_{i;kl}| + λ_2 Σ_{i<i'} Σ_{k,l} |W_{i;kl} - W_{i';kl}|,  (3)

where λ_1 and λ_2 are nonnegative tuning parameters. In addition to achieving sparseness as in the graphical lasso, FGL also encourages identical entries across cluster-specific precision matrices. This feature helps to reveal both common and cluster-specific network structures, in addition to improving statistical estimation efficiency by borrowing information across the multiple networks.
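The FGL penalty described above is straightforward to evaluate for a given set of precision matrices. Below is a minimal sketch (our own illustration, not the JGL package's implementation); the sparsity term sums absolute off-diagonal entries, while the fusion term sums absolute elementwise differences over all matrix pairs.

```python
def fgl_penalty(Ws, lam1, lam2):
    """Fused graphical lasso penalty on a list of g p x p precision matrices
    (nested lists): lam1 * sum_i sum_{k != l} |W_i[k][l]|
                  + lam2 * sum_{i < i'} sum_{k, l} |W_i[k][l] - W_i'[k][l]|."""
    p = len(Ws[0])
    g = len(Ws)
    sparsity = sum(abs(W[k][l]) for W in Ws
                   for k in range(p) for l in range(p) if k != l)
    fusion = sum(abs(Ws[i][k][l] - Ws[j][k][l])
                 for i in range(g) for j in range(i + 1, g)
                 for k in range(p) for l in range(p))
    return lam1 * sparsity + lam2 * fusion
```

Note that, following the FGL convention, the fusion term here also covers diagonal entries.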
Note that Danaher et al. (2014) used the above penalty in the context of Gaussian graphical modeling, where it is known which observations are from which Gaussian distribution, differing from our Gaussian mixture modeling. Nevertheless, we will show how to apply their proposed ADMM algorithm ([1]) (as implemented in the R package JGL) in the M-step of an EM algorithm in the current context.
We denote the new method that incorporates the fused graphical lasso penalty of the joint graphical lasso (JGL) into our Gaussian mixture modeling as New-JGL.

New method 2: With a non-convex penalty
In the context of Gaussian graphical modeling, Zhu et al. (2014) [38] proposed the following non-convex penalty function for the W_i's:

p_λ(Θ) = λ_1 Σ_{i=1}^g Σ_{k≠l} J_τ(W_{i;kl}) + λ_2 Σ_{i<i'} Σ_{k≠l} J_τ(W_{i;kl} - W_{i';kl}),  (4)

where λ_1, λ_2 and τ are nonnegative tuning parameters, and J_τ(z) = min(|z|, τ) is the truncated lasso penalty (TLP) (Shen et al. 2012 [28]). The two penalties serve the corresponding sparseness and fusion roles as in JGL. However, in contrast to FGL in (3), only off-diagonal elements, not diagonal elements, are penalized for their differences in (4). The non-convex TLP reduces the bias induced by the lasso penalty, because no additional penalty is imposed once |z| > τ in J_τ(z). In the current context, the TLP can do better in maintaining the magnitudes of non-zero entries, or of differences between two unequal entries. The scaled TLP, J_τ(z)/τ, approximates the L_0 function, I(z ≠ 0), as τ tends to 0+. Like FGL, this method is able to detect element-wise heterogeneity across multiple networks, for example in identifying signaling network changes across distinct cancer subtypes.
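The TLP and its use in the sparsity-plus-fusion penalty (4) can be sketched as follows (an illustration in our own notation, not the authors' implementation). The key properties are that J_τ caps the penalty at τ, so large entries incur no further shrinkage, and that only off-diagonal entries enter the fusion term.

```python
def tlp(z, tau):
    """Truncated lasso penalty J_tau(z) = min(|z|, tau)."""
    return min(abs(z), tau)

def sp_penalty(Ws, lam1, lam2, tau):
    """Non-convex penalty in the spirit of (4): TLP on off-diagonal entries
    (sparseness) and on off-diagonal pairwise differences (fusion)."""
    p = len(Ws[0])
    g = len(Ws)
    sparsity = sum(tlp(W[k][l], tau) for W in Ws
                   for k in range(p) for l in range(p) if k != l)
    fusion = sum(tlp(Ws[i][k][l] - Ws[j][k][l], tau)
                 for i in range(g) for j in range(i + 1, g)
                 for k in range(p) for l in range(p) if k != l)
    return lam1 * sparsity + lam2 * fusion
```

Because `tlp(5, 0.1) == tlp(10, 0.1)`, doubling a large entry adds no extra penalty, which is exactly the bias-reduction property discussed above.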
We propose using the same non-convex penalty (4) in our current context of Gaussian mixture modeling, and will demonstrate that the algorithm of Zhu et al. (2014) can be applied in the M-step of an EM algorithm for our purpose.

We denote the new method that incorporates the structural pursuit (SP) penalty (4) into our Gaussian mixture modeling as New-SP (New-Structural-Pursuit).

Computing
We develop an EM algorithm to obtain the maximum penalized likelihood estimates (MPLEs). In particular, we will demonstrate how to use an existing Gaussian graphical modeling algorithm in the M-step of the EM algorithm for a penalized Gaussian mixture model.
We introduce z_ij as the indicator of whether x_j belongs to component i: z_ij = 1 if x_j comes from component i and z_ij = 0 otherwise. The z_ij's are treated as missing data. If the z_ij's were observed, the complete-data penalized log-likelihood would be

Σ_{i=1}^g Σ_{j=1}^n z_ij [log π_i + log f_i(x_j; θ_i)] - p_λ(Θ),

where p_λ(Θ) is a penalty on the parameters; typically only the mean parameters μ_i and/or covariance matrices V_i are penalized, which is assumed throughout. Define the posterior probability of x_j belonging to component i as ρ_ij = P(z_ij = 1 | x_j; Θ). With the current estimate Θ^(r) at iteration r, the E-step calculates

ρ_ij^(r) = π_i^(r) f_i(x_j; θ_i^(r)) / Σ_{i'=1}^g π_{i'}^(r) f_{i'}(x_j; θ_{i'}^(r)),

Q_P(Θ; Θ^(r)) = Σ_{i=1}^g Σ_{j=1}^n ρ_ij^(r) [log π_i + log f_i(x_j; θ_i)] - p_λ(Θ).

In the M-step, we find π_i^(r+1), μ_i^(r+1) and W_i^(r+1) that maximize Q_P. Using a Lagrange multiplier η to enforce the constraint Σ_{i=1}^g π_i = 1, we omit the terms without the π_i's and rewrite Q_P as

L(π, η) = Σ_{i=1}^g Σ_{j=1}^n ρ_ij^(r) log π_i + η (Σ_{i=1}^g π_i - 1).

Taking the partial derivative of L(π, η) with respect to π_i and setting it to 0, we arrive at the updating formula

π_i^(r+1) = Σ_{j=1}^n ρ_ij^(r) / n.

To update μ_i, if there is no penalty on μ_i, we take the derivative of Q_P with respect to μ_i and set it to 0, obtaining

μ_i^(r+1) = Σ_{j=1}^n ρ_ij^(r) x_j / Σ_{j=1}^n ρ_ij^(r).

On the other hand, if the lasso penalty is imposed on μ_i, then its updating formula involves soft-thresholding the above quantity (e.g., Pan and Shen 2007 [23]). Finally, to update V_i or, equivalently, W_i = V_i^{-1}, we only need to consider the terms related to W_i in Q_P:

(1/2) Σ_{j=1}^n ρ_ij^(r) [log det W_i - tr(S_i^(r) W_i)] - p_λ(W_i),

with

S_i^(r) = Σ_{j=1}^n ρ_ij^(r) (x_j - μ_i^(r+1))(x_j - μ_i^(r+1))' / Σ_{j=1}^n ρ_ij^(r)

as a weighted sample covariance matrix.
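The E-step responsibilities and the closed-form M-step updates for π_i and μ_i can be sketched as follows. This is an illustrative skeleton only (univariate components for brevity, not the pGMM implementation); `npdf`, `e_step` and `m_step` are names of our own choosing.

```python
import math

def npdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(xs, pis, mus, vars_):
    """rho[j][i] = pi_i f_i(x_j) / sum_i' pi_i' f_i'(x_j)."""
    rho = []
    for x in xs:
        w = [p * npdf(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        s = sum(w)
        rho.append([wi / s for wi in w])
    return rho

def m_step(xs, rho, g):
    """Closed-form updates: pi_i = sum_j rho_ij / n,
    mu_i = weighted mean, and a rho-weighted variance playing the
    role of the weighted sample covariance S_i^(r)."""
    n = len(xs)
    pis = [sum(rho[j][i] for j in range(n)) / n for i in range(g)]
    mus = [sum(rho[j][i] * xs[j] for j in range(n)) / (n * pis[i])
           for i in range(g)]
    vars_ = [sum(rho[j][i] * (xs[j] - mus[i]) ** 2 for j in range(n))
             / (n * pis[i]) for i in range(g)]
    return pis, mus, vars_
```

Iterating `e_step` and `m_step` until convergence gives the unpenalized EM; in the penalized case, the variance/precision update is replaced by a call to a penalized graphical-model solver, as discussed next.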
Typically there is no closed-form solution for updating W_i (or V_i) when one of them is penalized. However, we can take advantage of the existing methods for penalized Gaussian graphical models; below we point out the connection.
If we knew the cluster label of each observation x_j, as in Gaussian graphical modeling, then the penalized log-likelihood for the W_i's would be

Σ_{i=1}^g (n_i/2) [log det W_i - tr(S_i W_i)] - p_λ(W),  (14)

where n_i is the sample size for cluster i, and S_i is the sample covariance matrix for cluster i. Correspondingly, in the current context of Gaussian mixture modeling, the part of the Q_P function relevant to the W_i's in the EM algorithm is

Σ_{i=1}^g (1/2) Σ_{j=1}^n ρ_ij^(r) [log det W_i - tr(S_i^(r) W_i)] - p_λ(W).  (15)

To maximize Q_P, we use the soft assignment, instead of the hard assignment, of each observation x_j to a cluster. Specifically, setting n_i = Σ_{j=1}^n ρ_ij^(r) and S_i = S_i^(r), maximizing expression (15) is equivalent to maximizing (14). Since there already exist efficient computational algorithms to maximize the penalized log-likelihood (14), they can be applied directly in the M-step.

For comparison, we briefly review some existing penalized model-based clustering methods. Pan and Shen (2007) [23] proposed an L1 penalty for the mean parameters,

p_λ(Θ) = λ_1 Σ_{i=1}^g Σ_{k=1}^p |μ_ik|,

where μ_ik is the mean of the kth variable for component i. Under the L1 penalty, small estimates of the mean parameters are shrunken to exactly zero. If, for a given variable k, μ_ik = 0 for all components i, then this variable has no effect on clustering. Hence this penalty is used for variable selection, not for inferring cluster-specific networks. Zhou et al. (2009) [37] relaxed the diagonal covariance matrix assumption and adopted unconstrained covariance/precision matrices. To regularize the parameters in the precision matrix, they proposed a penalty function of the form

p_λ(Θ) = λ_1 Σ_{i=1}^g Σ_{k=1}^p |μ_ik| + λ_2 Σ_{j,l} |W_{j,l}|,

where W = V^{-1} is the common precision matrix (or inverse covariance matrix). The first term in the above penalty function aims at variable selection as in Pan and Shen (2007), while the second term uses the L1 penalty to promote the sparseness of the precision matrix. For penalized covariance matrix estimation, they used the graphical lasso algorithm of Friedman et al. (2008) and maximized the objective function

log det W - tr(S̃ W) - λ Σ_{j,l} |W_{j,l}|,

where λ = 2λ_2/n and S̃ is a weighted sample covariance matrix based on the soft assignments of all the samples, as for S_i^(r). Zhou et al. (2009) also considered the case where each component i in the mixture model has an unconstrained covariance matrix V_i. They then proposed the following penalty function to regularize the means and the cluster-specific covariance matrices:

p_λ(Θ) = λ_1 Σ_{i=1}^g Σ_{k=1}^p |μ_ik| + λ_2 Σ_{i=1}^g Σ_{j,l} |W_{i;j,l}|,

where W_{i;j,l} is the (j, l)th entry of W_i. They again used the graphical lasso algorithm to obtain the estimates of the cluster-specific precision matrices in the M-step of the EM algorithm.
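As a sanity check on the soft-assignment connection noted above: when the responsibilities ρ_ij degenerate to 0/1 cluster indicators, the weighted quantities n_i = Σ_j ρ_ij and S_i^(r) reduce exactly to the per-cluster sample sizes and sample covariances of the known-label case. A quick numerical check (univariate, purely illustrative):

```python
def weighted_stats(xs, rho, i):
    """n_i = sum_j rho[j][i], plus the rho-weighted mean and variance
    for component i (the univariate analogue of S_i^(r))."""
    n_i = sum(r[i] for r in rho)
    mu_i = sum(r[i] * x for r, x in zip(rho, xs)) / n_i
    s_i = sum(r[i] * (x - mu_i) ** 2 for r, x in zip(rho, xs)) / n_i
    return n_i, mu_i, s_i

# Hard (0/1) assignments: first three points in cluster 0, last two in cluster 1.
xs = [1.0, 2.0, 3.0, 10.0, 12.0]
hard_rho = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
```

With these hard assignments, `weighted_stats(xs, hard_rho, 0)` returns the sample size 3, mean 2.0 and (divide-by-n) variance 2/3 of the first three points, exactly as in (14).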
Our methods are implemented in an R package called pGMM that will be freely downloadable on CRAN.

Simulations
Due to the unknown truth for real data, it is difficult to draw definitive conclusions on the relative performance of various methods.As an alternative, we conducted simulations to evaluate and compare the performance of the methods in both clustering (i.e. the assignments of the samples to clusters) and precision matrix estimation.
To mimic real data, we used the models fitted to the glioblastoma gene expression data by the method of Zhou et al. (2009) and by our proposed new methods as the true models to generate simulated data; in this way, we avoided possible biases from using only one true model that might favor one of the methods. In each case, there were 4 clusters with n = 173 or n = 346 observations and p = 20. We then applied the usual non-penalized model-based clustering as implemented in the R package mclust (Fraley and Raftery 2006 [8]), the methods of Pan and Shen (2007) [23] and Zhou et al. (2009) [37], and our two proposed new methods. To measure the accuracy of parameter estimation for the precision matrices, we used the average entropy loss (EL) and average quadratic loss (QL),

EL = (1/g) Σ_{i=1}^g [tr(W_i^{-1} Ŵ_i) - log det(W_i^{-1} Ŵ_i) - p],  QL = (1/g) Σ_{i=1}^g tr(W_i^{-1} Ŵ_i - I)^2.

To measure the accuracy of estimating zero or non-zero entries and grouping structures in the precision matrices, following Zhu et al. (2014) [38], we used the average numbers of false positives and false negatives for sparseness pursuit (FPV and FNV: truly zero entries estimated as non-zero, and truly non-zero entries estimated as zero), and the average numbers of false positives and false negatives for grouping (FPG and FNG: unequal pairs of corresponding entries estimated as equal, and equal pairs estimated as unequal), with the grouping measures averaged over the pairs of precision matrices, where C(g, 2) is the combinatorial number of choosing 2 from g.
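The entropy and quadratic losses in their standard forms can be computed as follows. For brevity, the sketch below handles diagonal precision matrices only (an assumption for illustration, since the trace and log-determinant then reduce to sums over diagonal entries); it is not the simulation code used in the paper.

```python
import math

def entropy_loss_diag(w_true, w_est):
    """EL = tr(W^{-1} What) - log det(W^{-1} What) - p, for diagonal
    precision matrices given as lists of diagonal entries."""
    p = len(w_true)
    ratio = [e / t for t, e in zip(w_true, w_est)]
    return sum(ratio) - sum(math.log(r) for r in ratio) - p

def quadratic_loss_diag(w_true, w_est):
    """QL = tr((W^{-1} What - I)^2) for diagonal precision matrices."""
    return sum((e / t - 1.0) ** 2 for t, e in zip(w_true, w_est))
```

Both losses are zero when the estimate equals the truth and grow as the estimate departs from it, which is why smaller values indicate more accurate precision matrix estimates.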
Table 1 shows the results for n = 173 based on 50 simulations for each set-up. With the true model being the model fitted by the method of Zhou et al. (2009), the method of Zhou et al. (2009) itself gave the highest Rand index, suggesting the best clustering accuracy. However, it did not give the lowest average entropy loss (EL) and quadratic loss (QL) for precision matrix estimation, though the differences were not large. Recall that the true model here was based on four largely differing cluster-specific precision matrices, which might not favor fusing the cluster-specific precision matrices. Impressively, our method New-SP gave the second highest Rand index, quite close to that of Zhou et al. (2009); more importantly, New-SP gave the most or second most accurate estimates of the cluster-specific precision matrices, with the lowest average EL and second lowest QL. In addition, it also gave low false positive rates for sparseness and grouping, but high false negative rates. It is noted that New-JGL also performed well.

Table 1
Simulation results with n = 173 and the true model being that estimated by one of the three methods based on the glioblastoma dataset. The means (standard deviations) of the Rand index (RI), adjusted Rand index (aRI), average entropy loss (EL), average quadratic loss (QL), average false positives for sparseness pursuit (FPV), average false negatives for sparseness pursuit (FNV), average false positives for grouping (FPG) and average false negatives for grouping (FNG) are reported.

On the other hand, if the true model was the one fitted by New-SP, then New-SP was the clear winner for both clustering and precision matrix estimation, followed by New-JGL. This was the case when the cluster-specific precision matrices differed while sharing some commonalities. Finally, if the true model was the one fitted by New-JGL, the winners were New-JGL and New-SP, followed by mclust and the method of Pan and Shen (2007) (where a common diagonal precision matrix was assumed).
We also investigated the sensitivity of the EM algorithm to its starting values. For the set-up with the true model being the one fitted by New-JGL, instead of using the K-means output as the starting value for New-JGL and New-SP, we used randomly generated numbers as the starting value. The resulting Rand index values for New-JGL and New-SP decreased from 0.926 and 0.930 to 0.635 and 0.757 respectively, confirming the importance of using good starting values for the EM algorithm. However, the estimation errors for the precision matrices were less affected: for example, the mean EL for New-JGL and New-SP increased from 22.533 and 23.493 to only 23.964 and 25.743, respectively, still lower than those of the other methods.

Table 2
Simulation results with n = 346 and the true model being that estimated by one of the three methods based on the glioblastoma dataset. The means (standard deviations) of the Rand index (RI), adjusted Rand index (aRI), average entropy loss (EL), average quadratic loss (QL), average false positives for sparseness pursuit (FPV), average false negatives for sparseness pursuit (FNV), average false positives for grouping (FPG) and average false negatives for grouping (FNG) are reported.
Next we doubled the sample size in the simulations. With the increased sample size, the proposed method New-SP became the clear overall winner, followed by New-JGL (Table 2). Although mclust performed well in the last two set-ups (with the true model being that fitted by New-SP or New-JGL), it did not work well in the first set-up. Again it is noted that the two new methods largely outperformed the method of Zhou et al. (2009) [37] for estimating the cluster-specific precision matrices, perhaps due to the former two's use of fusion penalties for borrowing information across the multiple cluster-specific precision matrices.


Glioblastoma gene expression data
Verhaak et al. (2010) [33] studied a gene expression data set of glioblastoma tumor and normal samples. They used a consensus hierarchical clustering method to identify four disease subtypes. Due to the limitation of the clustering method, the conditional dependencies between genes within each cluster were ignored and thus not revealed. This leaves room for our new methods and others to explore possible dependency relationships among the genes. Furthermore, the four identified clusters, albeit biologically reasonable, are by no means perfect, which bears importance when one uses their sample assignments as a reference to compare various methods.

To be practically focused, we restricted our analysis to the gene expression data from the 173 core samples used by Verhaak et al. (2010) [33], and we selected only 20 genes related to cell signaling pathways. Some of these genes were demonstrated to be altered in Figure 4B of Brennan et al. (2013) [2] and Figure 5 of McLendon et al. (2008) [19]. Specifically, genes EGFR, PDGFRA and FGFR3 are members of the RTK signaling pathway. RASGRP3 and RRAS are downstream targets of the RTK signaling pathway. PIK3C2B, PIK3R1, PIK3R3, PIK3IP1 and AKTIP are components of the PI3K/AKT signaling pathway. NFIB is a downstream target of the RTK and PI3K/AKT signaling pathways. CDKN3, CDK4, CDKN1A, CDKN2C and CCND2 are involved in RB signaling pathways and play important roles in cell cycle regulation. CASP1 and CASP4 are important genes in cell apoptosis.

Estimated networks
We applied our method New-SP to the glioblastoma gene expression data set. Trying g = 1, 2, 3, 4, 5 clusters, it selected four clusters/subtypes. Each cluster showed a distinct conditional dependency structure among the genes, though their overall structures were similar (Figure 1). This suggests distinct cell signaling network changes across the disease subtypes. A closer examination of the estimated precision matrices reveals that the conditional dependencies among the receptor kinases and their downstream target genes were altered. The PI3K/Akt signaling pathway plays an important role in cell survival and proliferation in glioblastoma ([3, 22]). One of the estimated networks shows that the AKTIP gene was conditionally correlated with CDKN2C, a gene encoding a cyclin-dependent kinase inhibitor that regulates cell growth; however, this link was lost in the other three estimated networks. Similarly, PIK3IP1 and AKTIP were not conditionally correlated with CDKN1A, CDKN2C and CDKN3 in one or more estimated networks, while the network in the bottom left of Figure 1 preserved most of the connections. The PI3K/Akt signaling pathway is reported to be upstream of CCND2, a gene encoding the cell cycle regulating protein Cyclin D2 ([20]). Only one out of the four subtypes demonstrated a conditional dependence between AKTIP and CCND2. The changes in these links collectively suggest dysregulation of cell growth by the PI3K/Akt signaling pathway in some subtypes of glioblastoma.
Gene IDH1 is known to have a higher mutation frequency in some glioblastoma subtypes, and here it exhibited cluster-specific associations with FGFR3, which was also reported to have mutations in the glioblastoma subtypes classified by Verhaak et al. (2010) [33]. We found that IDH1's expression was positively correlated with that of FGFR3 in only one cluster, suggesting possibly altered co-expression in the other clusters. IDH1 mutation is reported to cause widespread changes in histone and DNA methylation and to potentially promote tumorigenesis ([32, 16]). CCND2 was found to be amplified in IDH1-mutant medulloblastoma subtypes ([29]). Therefore, the abnormal IDH1 gene level and its disconnection with CCND2 observed in the estimated network pointed to possible roles of IDH1 in oncogenesis in certain subtypes of glioblastoma.
For comparison, we applied Zhou et al.'s method, with cluster-specific covariance matrices, to the glioblastoma gene expression data set. Among g = 1, 2, 3, 4, 5 clusters, it selected four clusters. The estimated cluster-specific precision matrices demonstrated cluster-specific dependencies among the genes (Figure 2). The networks estimated by Zhou et al.'s method confirmed that the conditional correlation between IDH1 and CCND2 was lost in one network, as also estimated by the New-SP method. The conditional correlation between AKTIP and CCND2 was present in three subtypes, though it was weak in one of them. Compared to the networks estimated by New-SP, the dependency changes across the clusters estimated by Zhou et al.'s method were much more dramatic, possibly reflecting large variations of the estimates in the absence of information borrowing across clusters.
The New-JGL method also yielded four clusters (Figure 3). Like the networks estimated by the New-SP method, the networks estimated by New-JGL shared some structural similarity. The AKTIP-CCND2 correlation was found in two out of four subtypes, although the correlation in one subtype was weak; this agreed with the correlation in the networks estimated by the New-SP method. Unlike in Figure 1, the conditional correlation between IDH2 and CCND2 was present in all four subtypes, though the magnitude of the correlation was small.
The non-penalized mclust yielded four clusters with a common covariance matrix, suggesting that the differences among the four cluster-specific covariance matrices were possibly subtle. This was also reflected in the overall similarity across the four cluster-specific estimates of the two new methods.

Sample cluster assignments
Using the cluster assignments in Verhaak et al. (2010) [33] as the reference, the agreement between our New-SP method and the reference, as measured by the Rand index, was 0.747, and by the adjusted Rand index, 0.354 (Table 3). Its performance was compared with several other methods. First, the method of Pan and Shen (2007) [23] is based on a common diagonal covariance matrix (Table 3). Although the method of Pan and Shen yielded a slightly higher Rand index, this was possibly due to the bias of the reference clustering method, which ignored varying within-cluster dependencies and would in turn favor the results of Pan and Shen (2007). More importantly, the common diagonal covariance matrix assumed and estimated by that method cannot be used to examine possibly varying within-cluster dependency structures. Finally, the two new methods seemed to perform better than the two other methods.
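The Rand index used above measures pairwise agreement between two partitions: the fraction of sample pairs that both clusterings place together or both place apart (the adjusted Rand index additionally corrects for chance agreement). A minimal sketch of the plain Rand index, for illustration:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree,
    i.e. both put the pair in the same cluster or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for j, k in pairs:
        same_a = labels_a[j] == labels_a[k]
        same_b = labels_b[j] == labels_b[k]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)
```

Note that the Rand index is invariant to relabeling of the clusters, which is why identical partitions with swapped labels still score 1.0.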

Model assessment
To check the goodness-of-fit of a final model, we propose using the parametric bootstrap, which was used by McLachlan and others to select the number of components in a Gaussian mixture model ([17, 26]). For example, for our real data, the New-SP method selected a final model with four components, each with a component-specific precision matrix, called the alternative model here; it may be of interest to compare this alternative model with a null (or reduced) model with four components but a common precision matrix, which can be obtained by forcing a large λ_2 value (while the other tuning parameters are selected as before). We generated 50 bootstrap samples from the fitted null and alternative models respectively, then fitted the two models to the bootstrap samples; finally, we compared their corresponding cross-validated (CV) log-likelihood values, as shown in Figure 4. For the bootstrap samples, in both cases fitting the alternative model yielded a higher mean CV log-likelihood; however, the difference between the two fitted models was larger when the bootstrap samples were generated from the alternative model, as expected.
Since the CV log-likelihood difference between the two fitted models based on the original data was larger than that from the bootstrap samples generated from the alternative model, there was some evidence to support the alternative model. Nevertheless, perhaps due to the relatively small sample sizes and the shrinkage of the four component-specific precision matrices towards each other (as imposed by the fusion penalty even in the alternative model), the difference between the two models was not overwhelming, and caution must be taken not to over-interpret their differences.
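The parametric-bootstrap loop described above has a simple generic structure: repeatedly simulate a dataset from a fitted model, refit, and record a goodness-of-fit score to build a reference distribution. In this sketch, `fitted_sampler` and `fit_and_score` are hypothetical placeholders standing in for the fitted mixture model and the CV log-likelihood computation; the toy example below just scores the sample mean under a fitted N(0, 1) model.

```python
import random
import statistics

def parametric_bootstrap(fitted_sampler, fit_and_score, n_boot=50, seed=1):
    """Generic parametric bootstrap: draw n_boot datasets from a fitted model
    (fitted_sampler) and record a goodness-of-fit score on each (fit_and_score).
    The observed score is then compared against the returned reference list."""
    rng = random.Random(seed)
    return [fit_and_score(fitted_sampler(rng)) for _ in range(n_boot)]

# Toy illustration: null model N(0, 1), score = sample mean of each dataset.
sampler = lambda rng: [rng.gauss(0.0, 1.0) for _ in range(100)]
score = lambda data: statistics.mean(data)
reference = parametric_bootstrap(sampler, score, n_boot=200)
```

In the paper's setting, the score would be the CV log-likelihood difference between the alternative and null fits, and the observed difference on the original data is judged against the bootstrap reference distribution.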

Discussion
We have presented a new approach to the estimation of multiple networks in the context of a penalized Gaussian mixture model. The primary goal is to estimate and compare cluster-specific network changes, though automatic cluster discovery is often of interest too. For the primary goal, it is necessary to encourage equality of the entries across the cluster-specific precision matrices while maintaining their differences, if any, which is best accomplished by fusion with a non-convex penalty such as the TLP adopted in our proposed method New-SP ([28], [38]). Note that standard and existing penalized model-based clustering methods are not suitable for our primary goal: due to the lack of fusion penalties, the existing methods cannot highlight a few major differences across multiple precision matrix estimates, in addition to their loss of estimation efficiency from the absence of information borrowing. Both of our proposed methods pursue sparseness and fusion for multiple precision matrices in the framework of Gaussian mixture modeling. Our approach takes advantage of existing methods that use convex or non-convex penalties to regularize the parameters of the unconstrained precision matrices in Gaussian graphical models; those methods assume that it is known which samples are from which Gaussian distributions, differing from our current context with unknown sample heterogeneity.
We applied the methods to a real data set containing gene expression profiles of glioblastoma patients. Using the New-SP method, the samples were partitioned into four disease subtypes, as reported in Verhaak et al. (2010) based only on differential gene expression. Importantly, our method reconstructed disease subtype-specific gene networks, suggesting candidates for possibly subtype-specific gene dysregulations that can be followed up in further biological experiments. Since the truth is unknown for the real data, we resorted to realistic simulations mimicking the real data to evaluate the methods; these demonstrated that our method New-SP, based on the non-convex TLP, gave the best overall performance in both clustering (i.e. subtype discovery) and network estimation when the sample size was at least moderately large, followed by the other proposed method, New-JGL, based on the convex (fused) Lasso penalty. The better performance of New-SP over New-JGL is likely due to the non-convex TLP adopted in the former, as demonstrated in Shen et al. (2012) [28] for regression and single precision matrix estimation and in Zhu et al. (2014) [38] for estimating multiple precision matrices in Gaussian graphical models. On the other hand, New-JGL is simpler and faster than New-SP, and thus can be used for larger problems and/or to provide a quick preliminary solution; in particular, we advocate using the results of New-JGL (or any other method with a convex penalty) as a starting value for New-SP, so that the latter can be regarded as a refinement of the former. We also note that the partition rules discussed in Zhu et al. (2014) can be used to speed up the new methods for high-dimensional data.
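To make the contrast between the two penalty types concrete, here is a minimal sketch (the function names are ours, not the paper's) of the truncated Lasso penalty (TLP) of Shen et al. (2012) next to the ordinary Lasso penalty:

```python
def lasso_penalty(x):
    # The Lasso penalty |x| grows without bound, so large true
    # differences between precision-matrix entries keep being
    # shrunk (bias that never vanishes).
    return abs(x)

def tlp_penalty(x, tau):
    # Truncated Lasso penalty J_tau(x) = min(|x|, tau) / tau:
    # behaves like a rescaled Lasso for |x| <= tau, then flattens
    # at 1, leaving large values essentially unpenalized.
    return min(abs(x), tau) / tau
```

Applied as a fusion penalty to pairwise differences of precision-matrix entries across clusters, the flat region is what allows small differences to be set exactly to zero while leaving large, genuine differences intact.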
We emphasize that the existing methods for estimation of multiple networks, including the two used here (Danaher et al. 2014; Zhu et al. 2014 [4, 38]), are based on Gaussian graphical models without sample heterogeneity; that is, it is assumed to be known which Gaussian distribution each sample comes from. In our target applications and other settings, this sample homogeneity assumption may not hold. For example, in clinical genomic studies, due to disease heterogeneity, the assumption that all the gene expression profiles of cancer patients come from the same Gaussian distribution is not realistic. To discover unknown disease subtypes, clustering or unsupervised learning becomes useful, which will facilitate personalized medicine. To our knowledge, existing clustering methods for gene expression data have focused on detecting differential mean expression levels across clusters or disease subtypes, as demonstrated in Verhaak et al. (2010) [33]. However, in addition to differential gene expression, there are possibly differential gene regulations or dysregulations across disease subtypes. If the disease subtypes are known, then inferring differential gene regulations can be treated as estimating multiple precision matrices in Gaussian graphical models, as handled by many existing methods; otherwise, as discussed here, both the disease subtypes and the possibly differential precision matrices must be inferred simultaneously based on a Gaussian mixture model.
Our methods are implemented in an R package pGMM that will be available on CRAN.

Fig 1. Estimated cluster-specific networks based on 173 core samples using the new method New-SP.

Fig 2. Estimated cluster-specific networks based on 173 core samples using the method of Zhou et al. (2009).

Fig 3. Estimated cluster-specific networks based on 173 core samples using the new method New-JGL.
Given a method for penalized estimation of a precision matrix in a Gaussian graphical model, we can incorporate it into the M-step of our EM algorithm to obtain an update for W_i. Zhou et al. (2009) used this idea in applying the graphical Lasso (Friedman et al. 2008 [10]) in their penalized model-based clustering with unconstrained covariance matrices. We applied the R functions of Danaher et al. (2014) and Zhu et al. (2014) in the M-step for the two proposed new methods, respectively. Different choices of p_λ(W_i) lead to different penalized maximum likelihood estimates of W_i and corresponding algorithms. For comparison, we briefly review two existing penalized mixture modeling methods (Pan and Shen 2007; Zhou et al. 2009 [23,37]). The method of Pan and Shen (2007) specifies each component in the Gaussian mixture model as a multivariate normal with a common diagonal covariance matrix.
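The idea of plugging a penalized precision-matrix estimator into the M-step can be sketched in a few lines. The toy EM loop below uses scikit-learn's `graphical_lasso` for the precision update, in the spirit of Zhou et al. (2009); it has no fusion penalty, and all names and hyper-parameters are illustrative rather than the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso

def em_glasso_gmm(X, K, alpha, n_iter=30, mu_init=None, seed=0):
    """Plain EM for a K-component Gaussian mixture where the M-step
    precision update applies the graphical lasso to each cluster's
    responsibility-weighted covariance (a sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    if mu_init is None:
        # Crude initialization at K randomly chosen samples
        mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    else:
        mu = np.array(mu_init, dtype=float)
    cov = [np.cov(X.T) + 1e-3 * np.eye(p) for _ in range(K)]
    prec = [np.linalg.inv(c) for c in cov]
    for _ in range(n_iter):
        # E-step: posterior responsibilities of each sample for each cluster
        dens = np.column_stack([
            pi[k] * multivariate_normal(mu[k], cov[k], allow_singular=True).pdf(X)
            for k in range(K)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, then a graphical-lasso precision update
        nk = resp.sum(axis=0)
        pi = nk / n
        for k in range(K):
            mu[k] = resp[:, k] @ X / nk[k]
            d = X - mu[k]
            S = (resp[:, k, None] * d).T @ d / nk[k]   # weighted covariance
            cov[k], prec[k] = graphical_lasso(S + 1e-6 * np.eye(p), alpha=alpha)
    return pi, mu, prec, resp
```

Replacing the `graphical_lasso` call with a joint, fusion-penalized update over all K weighted covariances is exactly where the routines of Danaher et al. (2014) and Zhu et al. (2014) enter in the two proposed methods.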

Table 3. Rand Index (RI) and adjusted Rand Index (aRI) for the glioblastoma gene expression data with 20 genes by various methods. The class assignments given in [33] are used as the reference.

… a Rand index of 0.749 and an adjusted Rand index of 0.358. For the purpose of comparison, we also examined the clustering results of the method when forcing 4 clusters, which led to a Rand index of 0.780 and an adjusted Rand index of 0.439 (Table 3).
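For reference, the two agreement indices reported in Table 3 can be computed with scikit-learn; the label vectors below are hypothetical, not the glioblastoma assignments:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

# Hypothetical reference subtypes and estimated cluster labels
reference = [0, 0, 1, 1, 2, 2]
estimated = [0, 0, 1, 2, 2, 2]

ri = rand_score(reference, estimated)            # fraction of agreeing pairs
ari = adjusted_rand_score(reference, estimated)  # chance-corrected version
```

For these toy labels, 12 of the 15 sample pairs agree, giving RI = 0.8, while the chance correction yields aRI = 1.2/2.7 ≈ 0.444; the aRI is 0 in expectation for random labelings and 1 for a perfect match, which is why both indices are reported.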