High-dimensional generalized linear models incorporating graphical structure among predictors

In this paper, we propose a sparse generalized linear model incorporating graphical structure among predictors (sGLMg), which extends the work of [37], where the structure information among predictors is exploited to improve the performance of linear regression. In linear regression there is an explicit expression linking the coefficients to the predictor graph measured by the precision matrix; this structure does not carry over to generalized linear models, however, because an explicit expression for the coefficients of a generalized linear model is usually hard to obtain. To incorporate the graphical structure among predictors into generalized linear models, we make use of sufficient reduction techniques to re-establish the relationship between the coefficients and the precision matrix. Oracle inequalities for the sGLMg estimator are also presented, and the finite sample performance of the proposed method is examined via numerical simulations and a breast cancer data analysis.


Introduction
With the development of science and technology, problems involving high dimensionality have become increasingly important in recent years. Regularization is fundamental in the analysis of high-dimensional data. A well-known example of regularization is the Lasso [33]; however, in some applications, such as ANOVA or multi-task regression, the selection of important predictors corresponds to the selection of groups of predictors. As a natural extension of the Lasso, the group Lasso, which was proposed by [1] and further developed by [38], exploits a weighted sum of ℓ2 norms of the coefficients associated with each group of features and leads to feature selection at the group level. For more details about the group Lasso we refer to [13]. An obvious limitation of the group Lasso is its non-overlapping group structure, which introduces a barrier to its applicability in practice, where features may belong to more than one group. A solution to this problem is the overlapping group Lasso, which was proposed by [15] and further studied by [27] in the linear regression setting.
In many studies, discrete data, such as categorical or count data, are frequently encountered. Generalized linear models (GLMs, [22]) are the most commonly used regression models for discrete data. Regularization methods have been proposed to handle the small-n, large-p problem in GLMs [30,23,29,3,40]. The main shortcoming of these procedures is that the group structure of the predictors must be pre-specified, which is not always possible in practice. The graphical structure of the predictors, however, can be obtained from prior information; for example, in biological studies, the massive information about gene interaction can be used to construct a predictor graph in which nodes represent genes and edges indicate regulatory relationships [20,31]. If such prior information is not available, we can construct the predictor graph by sparse estimation of the covariance (or precision) matrix of the predictors [39,11,6]. On the other hand, it is reasonable to assume that two neighboring genes in a network are more likely to participate together in the same biological process than two genes far apart in the network [28,16]. In particular, as demonstrated in cancer marker discovery [7], changes in the expression of some causal genes governing metastatic potential (e.g., ERBB2 and MYC) may be only subtle and nonsignificant, while some of their neighbors show much stronger alterations. Hence, it is reasonable and helpful to take the neighbors of a predictor in the graph as a group.
There are now many methods that utilize the graphical information of predictors. Recently, Yu and Liu [37] proposed a node-by-node method to incorporate the graphical information among predictors in the linear regression model. Motivated by the least squares estimator, they note that the true coefficients of the model can be expressed as

β_0 = ΩΣ_xy,  (1)

where β_0 is the vector of true coefficients, Ω = Σ^{-1} is the precision matrix, and Σ_xy is the cross-covariance vector between the predictors and the response. Thus, there is a natural relationship between the coefficients and the graphical structure in the linear regression model, since the graphical structure of the predictors can be defined by the precision matrix Ω. However, this strategy does not carry over directly to generalized linear models, because a closed-form expression for the estimator of a generalized linear model, such as the logistic or Poisson regression model, is usually not available in practice. Hence, how to incorporate the graphical structure among predictors into generalized linear models becomes an interesting and challenging problem. In this paper, we note that the true coefficients of a generalized linear model can also be expressed in a form analogous to (1) by using sufficient dimension reduction (SDR, [8]) techniques:

β^* = Ω(E(X | Y = y) − μ)/η(y),  (2)

where β^* is the vector of true coefficients of the generalized linear model, μ is the expectation of the predictors and η(y) is a function of y, a given value of the response. Based on equation (2) and motivated by Yu and Liu [37], we propose a sparse generalized linear model incorporating graphical structure among predictors (sGLMg) to model the graphical structure information of the predictors in generalized linear models. The oracle inequality and the model selection consistency of the proposed sGLMg method are presented in this paper under the assumption that the predictor graph G is given.
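The identity (1) underlying the approach of [37] can be checked numerically at the population level. The following minimal Python sketch (all quantities and names are illustrative, not from the paper) builds a covariance matrix Σ, a coefficient vector β, and verifies that Ω Σ_xy recovers β exactly, since Cov(X, Y) = Σβ when Y = X'β + ε with ε independent of X.

```python
import numpy as np

# Population-level check of beta_0 = Omega * Sigma_xy (equation (1)).
# Sigma, beta_true and the dimension p are our own illustrative choices.
rng = np.random.default_rng(0)

p = 5
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)       # a positive-definite covariance matrix
Omega = np.linalg.inv(Sigma)          # precision matrix

beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.5])

# If Y = X' beta + eps with eps independent of X, then
# Sigma_xy = Cov(X, Y) = Sigma @ beta, so Omega @ Sigma_xy recovers beta.
Sigma_xy = Sigma @ beta_true
beta_recovered = Omega @ Sigma_xy

print(np.allclose(beta_recovered, beta_true))  # True
```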
In fact, when the graphical structure G is unknown, it can also be obtained by sparse estimation of the covariance (or precision) matrix of the predictors [39,11,6].
In the simulation studies, we compare the two cases where the graphical structure of the predictors is given and where it is estimated. The sGLMg method proposed in this paper can utilize the neighborhood information of the graph directly, and many popular methods, such as the adaptive Lasso, group Lasso and ridge regression, are included as special cases. The remainder of this paper is organized as follows. In section 2, we introduce the proposed sGLMg model. In section 3, we study the theoretical properties of sGLMg. In sections 4 and 5, Monte Carlo simulation studies and a breast cancer data analysis are conducted to examine the finite sample performance of the sGLMg method. Finally, we conclude the paper with some discussion in section 6.

Sparse generalized linear models incorporating graphical structure
We consider the generalized linear models (GLMs) introduced by [22]. Let F be a probability distribution on R not concentrated on a point and let (X, Y) be a pair of random variables, where X ∈ R^p and Y ∈ R. We assume that X follows some multivariate distribution with mean 0_{p×1} and covariance matrix Σ. The conditional expectation of Y given X is modeled through g(E(Y | X)) = β^*_0 + β^{*'}X, where g is the so-called link function. In fact, the link function g can be any strictly monotone differentiable function; here we only consider canonical link functions, that is, g(μ) = β^*_0 + β^{*'}X. The standard linear regression model is obviously an example of a GLM; other common examples of GLMs include the logistic regression model, the Poisson regression model, the gamma model and the exponential (or Weibull) model.
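As a concrete instance of the canonical-link setup above, the following Python sketch (function names are ours) checks that, for the logistic model, the GLM loss −yθ + φ(θ) with normalization function φ(θ) = log(1 + e^θ) coincides with the Bernoulli negative log-likelihood.

```python
import numpy as np

# Sketch: for the logistic model with canonical link, the GLM loss
#   l(b0, b; x, y) = -y * theta + phi(theta),  theta = b0 + b'x,
# with phi(theta) = log(1 + exp(theta)), equals the Bernoulli negative
# log-likelihood. Function names are illustrative, not from the paper.

def glm_loss(theta, y):
    return -y * theta + np.log1p(np.exp(theta))

def bernoulli_nll(theta, y):
    p1 = 1.0 / (1.0 + np.exp(-theta))     # P(Y = 1 | x)
    return -(y * np.log(p1) + (1 - y) * np.log(1 - p1))

theta = 0.7
for y in (0, 1):
    assert np.isclose(glm_loss(theta, y), bernoulli_nll(theta, y))
print("logistic GLM loss matches Bernoulli negative log-likelihood")
```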
Let (X_1, Y_1), ..., (X_n, Y_n) be i.i.d. copies of the population (X, Y), where X_i = (X_{i,1}, ..., X_{i,p})', i = 1, ..., n. We consider the high-dimensional regression setting and make the following assumptions:
• (A1) the variable X is almost surely bounded by a constant K, that is, there exists a constant K > 0 such that ‖X‖_∞ ≤ K a.s.;
• (A2) the true parameter belongs to Int(Θ), where Int(Θ) denotes the set of all interior points of Θ;
• (A3) the sample size n and the number of predictors p satisfy log(2p)/n ≤ 1.

Remark:
Technically, Assumption (A1) is not a reasonable condition when the graphical structure among the predictors is considered, since in that case it is usually assumed that X follows a multivariate normal distribution so that the graphical structure can be conveniently measured by the precision matrix. In fact, however, Assumption (A1) may be relaxed so that X is only bounded by a term of order O(n/log(n)), in the light of the method described in [32]. In this paper, for the sake of simplicity, we assume that X is bounded by a constant rather than by a bound proportional to n/log(n). Further, we note that Assumption (A1) is also a technical condition required in [3,34].
The log-likelihood for GLMs is given by Σ_{i=1}^n [Y_i(β_0 + β'X_i) − φ(β_0 + β'X_i)]. We denote the loss function for GLMs by ℓ(β_0, β) := ℓ(β_0, β; x, y) := −y(β_0 + β'x) + φ(β_0 + β'x). Notice that ℓ(β_0, β) is convex in β (as φ is convex). The associated risk is denoted by Pℓ(β_0, β) := E ℓ(β_0, β; X, Y) and the empirical risk by P_n ℓ(β_0, β) := (1/n) Σ_{i=1}^n ℓ(β_0, β; X_i, Y_i), and it is obvious that (β^*_0, β^*) = arg min_{(β_0,β)∈Ξ} Pℓ(β_0, β). From Theorem 2.1 of [9] and Condition 3.1 of [19] we have

γ := (E(X | Y = y) − μ)/η(y) = Σβ^*,  (3)

where Ω is the precision matrix which measures the partial correlations among the predictors; then by (3) we know that β^* can be reformulated as β^* = Ωγ, and thus

β^* = Σ_{j=1}^p γ_j (ω_{1j}, ω_{2j}, ..., ω_{pj})'.  (4)

Notice that, by (4), β^* can be formulated as the sum of p parts, {(γ_j ω_{1j}, γ_j ω_{2j}, ..., γ_j ω_{pj})' : 1 ≤ j ≤ p}. For the jth part, (γ_j ω_{1j}, γ_j ω_{2j}, ..., γ_j ω_{pj})', there is a common factor γ_j. If γ_j is equal to 0, then all the components in the jth part of β^* are 0 simultaneously. On the other hand, if γ_j is not zero and the graphical structure of the predictors is defined by Ω, then the support of (γ_j ω_{1j}, γ_j ω_{2j}, ..., γ_j ω_{pj})' is N_j, the set consisting of predictor j and its neighbors in the predictor graph. From the above analysis, we conclude that it is reasonable to incorporate the graphical structure of the predictors into generalized linear models. Furthermore, the results in [41] and [14] indicate that it is indeed helpful to incorporate the graphical structure among the predictors in logistic regression. In this paper, we treat the neighbors of each predictor as a group; since the predictor graph can generally not be represented by non-overlapping groups, these groups overlap. Therefore, we consider a latent decomposition of β^* into p parts based on the overlapping groups N_1, N_2, ..., N_p, where the jth part can be viewed as the effect arising from the marginal correlation between the jth predictor and the response variable.
If they are uncorrelated, W_k^(j) will be zero for each k ∈ N_j, and the components in the set {W_k^(j) E_kj : k ∈ N_j} will be zero together. Thus, it is reasonable to use the group Lasso penalty to encourage the selected components in each part to be zero or nonzero simultaneously once the candidate nonzero components in each part have been selected based on N_1, N_2, ..., N_p. Based on this idea, which is motivated by [37], given the graph of the predictors and the training data (X_1, Y_1), ..., (X_n, Y_n), we propose the following sparse generalized linear model incorporating graphical structure among predictors (sGLMg).
Here, λ is a tuning parameter which can be determined by cross-validation, and d_j is the positive weight for the jth group; the choice of d_j will be discussed in section 4.
The optimization problem (6) can be solved by the predictor duplication method proposed in [26]. More precisely, let W_{N_j}^(j) and X_{iN_j} denote the |N_j| × 1 sub-vector of W^(j) and the |N_j| × 1 sub-vector of X_i with indices in N_j, respectively, where i = 1, ..., n, j = 1, ..., p. Let X̃_i = (X_{iN_1}', X_{iN_2}', ..., X_{iN_p}')' and W = (W_{N_1}^(1)', ..., W_{N_p}^(p)')'; then it is easy to verify that β'X_i = W'X̃_i. Therefore, the optimization problem (6) is equivalent to the ordinary group Lasso problem (7). There are now many efficient R packages, such as grplasso [23], grpreg [4] and gglasso [36], that can be used to solve the optimization problem (7). Recently, Zeng and Breheny [40] developed an R package called grpregOverlap, based on grpreg, which can be used to solve the overlapping group Lasso directly. When several groups in a set F ⊂ {1, ..., p} coincide, the corresponding sub-vectors {W_{N_j}^(j) : j ∈ F} are indistinguishable and therefore the decomposition of β is not unique (i.e. {W^(1), W^(2), ..., W^(p)} is not unique). In this case, the individual vectors in {W_{N_j}^(j) : j ∈ F} cannot be estimated stably; however, we can estimate Σ_{j∈F} W_{N_j}^(j) directly and stably using the penalty term, and different decompositions of β lead to the same estimate of β.
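The predictor duplication step above can be sketched as follows: the design matrix is expanded by stacking the columns of X indexed by each neighborhood N_j, turning an overlapping group Lasso on β into an ordinary group Lasso on the duplicated coefficients W. This is a minimal illustration with made-up neighborhoods; packages such as grpregOverlap perform this expansion internally.

```python
import numpy as np

# Sketch of the predictor-duplication trick. Variable names and the
# example neighborhoods are ours, purely for illustration.

def duplicate_predictors(X, neighborhoods):
    """Return the expanded design matrix X_tilde and a group-id vector
    mapping each duplicated column back to its neighborhood j."""
    blocks, group_ids = [], []
    for j, N_j in enumerate(neighborhoods):
        blocks.append(X[:, N_j])          # duplicate the columns in N_j
        group_ids.extend([j] * len(N_j))
    return np.hstack(blocks), np.array(group_ids)

X = np.arange(12, dtype=float).reshape(3, 4)   # n = 3, p = 4
# Each N_j contains predictor j and its neighbors in the graph.
neighborhoods = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]

X_tilde, group_ids = duplicate_predictors(X, neighborhoods)
print(X_tilde.shape)   # (3, 10): one column per (j, k in N_j) pair
```

An ordinary (non-overlapping) group Lasso solver applied to X_tilde with groups given by group_ids then solves the overlapping problem on the original X.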

Theoretical properties of sGLMg
In this section we study the theoretical properties of the proposed sGLMg, and oracle inequalities for the sGLMg estimator are presented in the finite sample setting. Given the predictor graph G and positive weights d_j, for β ∈ R^p we denote

‖β‖_{G,d} := inf { Σ_{j=1}^p d_j ‖W^(j)‖_2 : Σ_{j=1}^p W^(j) = β, supp(W^(j)) ⊆ N_j }.  (8)

Note that ‖β‖_{G,d} defined in (8) is similar to the latent group Lasso penalty defined in [26]; however, the two are very different in motivation, since our proposed method is a graph-based penalization problem. To give a geometric illustration of this norm we consider the case p = 4 with N_3 = {3, 4} and N_4 = {3, 4}; then the unit ball of the norm ‖β‖_{G,d} in R^4 has two circular sets of singularities, corresponding to the cases where only (β_1, β_2) or only (β_3, β_4) is nonzero, and two spherical sets of singularities, corresponding to the cases where only (β_1, β_2, β_3) or only (β_1, β_3, β_4) is nonzero. For the graph of a similar norm in R^3, see Figure 2 of [26]. Thus, the minimization in (6) is equivalent to (9). Note that an optimal decomposition of β attaining ‖β‖_{G,d} always exists, but may not be unique [26]. We introduce the following notation: denote by J^* = {j : β^*_j ≠ 0} the true nonzero coefficient set and by J^{*c} = {j : β^*_j = 0} the true zero coefficient set. Let s^* = |J^*| denote the number of true nonzero coefficients. For each β ∈ R^p, we denote by W(β) the set of all optimal decompositions of β.
For β ∈ R^p, let K_G(β) denote the number of nonzero W^(j) in an optimal decomposition of β that has the minimal number of nonzero W^(j). Denote K_G = sup_{supp(β)⊂J^*} K_G(β). It is easy to check that K_G = s^* if the graph G has no edges, and K_G = K_0 if G consists of disconnected complete subgraphs and J^* is the union of the node sets of K_0 of those disconnected subgraphs. Denote N_max = max{|N_j| : j = 1, ..., p}, the size of the largest neighborhood. We make the following assumption on the neighborhoods N_j; this condition assumes that predictors connected to a useful predictor are also useful.

The sub-gradient conditions for sGLMg
We introduce the following sub-gradient conditions for the problem (9).

The connections between sGLMg and some existing methods
Some existing methods, such as the adaptive Lasso, group Lasso and ridge regression, can be recovered as special cases of the proposed sGLMg method when the given predictor graph has certain special structures. The following proposition shows these connections.

Proposition 2.
• (a) If the predictor graph has no edges, the proposed sGLMg method is identical to the adaptive Lasso method for each tuning parameter λ; • (b) If the predictor graph is composed of T disconnected complete subgraphs, the proposed sGLMg method is the same as the ordinary group Lasso method for each λ; • (c) If the predictor graph is a complete graph, the proposed sGLMg method has the same nonzero solution set as ridge regression, i.e. for each nonzero solution obtained by ridge regression (resp. sGLMg), sGLMg (resp. ridge regression) can obtain the same solution using a different tuning parameter.
The proof of this proposition parallels that in [37]. Proposition 2 indicates that the proposed sGLMg method is much more general than the adaptive Lasso, group Lasso and ridge regression and can deal with arbitrary predictor graph structures.

The oracle inequalities for sGLMg
In this section we study the finite sample properties of the estimator of the proposed sGLMg method and present oracle inequalities for its estimation and prediction error. For each β ∈ Ξ, we need concentration inequalities for the empirical process P_n ℓ(β), i.e. we need an appropriate lower bound on (P_n − P)(ℓ(β̂_n) − ℓ(β^*)). To do this, we decompose the empirical process into a linear part and a part which depends on the normalization function φ, i.e.
In addition, we need the following assumption on β^*. We define the events A and B below, where a = 8b + ε_n and ε_n = 1/n. First, we want to show that the events A and B occur with high probability, i.e., we will give a lower bound for the probabilities of the events A and B, which is equivalent to proving concentration inequalities for the linear and nonlinear parts of the empirical process.
Note that the proof of this lemma relies on Assumption (A1); the details of the proof can be found in [3].
where A > 1 and C_{K,b} is the same as in Lemma 1.
To prove the concentration inequality for the nonlinear part of the empirical process, we need the boundedness assumption on X, and we show that the study of φ can be restricted to a suitable compact set. Since φ is Lipschitz on this compact set, we can use concentration results for Lipschitz loss functions [18] to bound the probability of the event B. Thus, a lower bound for the probability of the event B can be obtained.
where A ≥ 1; then there exists a constant C such that the stated bound holds. After obtaining the lower bounds for the probabilities of the events A and B, the following corollary can be easily inferred.
where μ and C are universal constants and A ≥ √2. The definition of C_{K,b} is the same as in Lemma 1.
Thus, according to Theorem 1 and Corollary 1 we can deduce upper bounds for the linear and nonlinear parts of the empirical process, i.e., Theorem 3. On the event A, Theorem 3 shows that the difference between the linear part of the empirical process and its expectation is bounded above by the tuning parameter multiplied by the norm (defined in (8)) of the difference between the sGLMg estimator and the true parameter; note that the norm defined in (8) is associated with the predictor graph. A similar result can be stated for the nonlinear part of the empirical process; the key to the proof is the following lemma, which shows that the sGLMg estimator β̂_n lies in a neighborhood of the target parameter β^* on the event A ∩ B.

Lemma 2. On the event A ∩ B, the estimator β̂_n lies in a neighborhood of β^* of radius depending on ε_n. An upper bound for (P_n − P)(ℓ_φ(β^*) − ℓ_φ(β̂_n)) can then be obtained directly from the definition of the event B and Lemma 2.
Following the restricted strong convexity condition for M-estimators in [25], after establishing the concentration of the loss function around its mean we also need to ensure that the loss function is not too flat, i.e., that there exists ε > 0 such that the stated curvature condition holds. Notice that the boundedness assumption on the components of X is not required to obtain this kind of strong convexity; however, we do need it to establish Theorem 1. As stated in [25], if the covariates have sub-Gaussian tails and the covariance matrix is positive definite, then the loss function satisfies a restricted strong convexity property with high probability. Therefore, the primary condition for proving the oracle inequalities for the sGLMg estimator rests on the correlation between the covariates, i.e., on the behaviour of the Gram matrix (1/n) Σ_{i=1}^n X_i X_i', which is necessarily singular when p > n. Meier et al. [23] showed that the group Lasso is consistent under the logistic regression model and gave an upper bound for the prediction error under the assumption that E(XX') is nonsingular. Blazère et al. [3] presented oracle inequalities for the estimation and prediction error of generalized linear models under the group stabil condition, which is similar to the restricted eigenvalue conditions in [24] and [21]. However, the group structure in [3] must be non-overlapping and specified in advance. The same stabil conditions were used by [27] and [37], who proved the theoretical properties of the overlapping group Lasso and of the linear regression model incorporating graphical structure among predictors, respectively. In this paper, we present oracle inequalities for the estimation and prediction error of the proposed sGLMg under conditions similar to those discussed above.
For a given graph G, positive weights d_j and a subset J ⊂ {1, 2, ..., p}, denote by Ω(β, J) the set of all optimal decompositions of β satisfying the stated support condition for all ε > 0. Denote by Σ := E(XX') the p × p covariance matrix and consider the following assumption. Note that Assumption (A6) plays a key role in the proof of the oracle inequalities for the sGLMg estimator. This assumption is similar to the restricted strong convexity in [25], which is considered a key condition for ensuring fast convergence rates and good theoretical properties of regularized M-estimators in the high-dimensional setting. The next theorem is the most important result in this paper; it presents finite sample bounds for the estimation and prediction error of the proposed sGLMg estimator.
where μ is a universal constant, A ≥ √2 and the definition of C_{K,b} is the same as in Lemma 1; then, for any optimal solution β̂_n of problem (9), the stated bounds hold. The results presented in Theorem 5 are very general, and several existing results are closely connected with it when the predictor graph has special structure. For example, the oracle inequalities for the prediction and estimation error of GLMs obtained by the group Lasso and Lasso methods in [3] are special cases of Theorem 5. In fact, when the given graph G has no edges, we have K_G = s^* and ‖β̂_n − β^*‖_{G,d} = ‖β̂_n − β^*‖_1 if d_j = 1 for j = 1, ..., p; Theorem 5 then indicates that the estimation and prediction error results in [3] for the Lasso method can be re-derived (their Theorem III.8). When the predictor graph G consists of disconnected complete subgraphs and J^* is the union of the node sets of K_0 of those disconnected subgraphs, we have K_G = K_0; in this setting, the results presented in [3] for the group Lasso can also be recovered (their Theorem III.6). In addition, the results on the linear regression model in [2], [24], [21] and [37] are also connected with Theorem 5.
Notice that these bounds hold with probability of order 1 − O(exp(−n)); under similar conditions, the estimation error bound for the linear model in [37] holds with probability of the same order. Compared to the linear model, the term log(p) appearing in the GLM bounds is the price to pay for having a large number of factors and not knowing where the nonzero ones are.

Model selection consistency
In this section we discuss model selection consistency for the case of fixed dimension p. For every β ∈ R^p, denote by β_{J^*} and β_{J^{*c}} the sub-vectors of β with indices in J^* and J^{*c}, respectively.

Theorem 6. Assume that assumptions (A2) and (A4) hold. Suppose the tuning parameter λ and the weights d_j are chosen such that √n λ → 0 and n^{(γ+1)/2} λ → ∞ for some γ ≥ 0, where I_{J^*}(β) denotes the sub-matrix of I(β) consisting of the entries with row and column indices in J^*, and I(β) is the Fisher information matrix of the model.
Note that Theorem 6 shows that the proposed sGLMg method is model selection consistent in the fixed-p case. It also provides a guideline on how to choose the positive weights d_j. In fact, the choice of weights for overlapping groups is much more important and complicated than in the case of disjoint groups. Obozinski et al. [26] made a thorough study of the choice of weights for overlapping groups and proposed some guidelines. They suggest considering weights of the form d_j = m_j^γ, where m_j = |N_j| is the number of predictors in the neighborhood N_j and γ ∈ (0, 1/2); γ = 0 and γ = 1/2 correspond to the two extreme cases in which only the largest or only the smallest groups are active, respectively. Furthermore, they give a critical value γ = log(2)/(2 log(3)), which is the smallest value for which it is possible to select two singletons only. In our simulation studies, we choose d_j = m_j^γ with γ = log(2)/(2 log(3)).
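The weight rule d_j = m_j^γ with γ = log(2)/(2 log(3)) is simple to compute; the sketch below (neighborhood sizes are illustrative) evaluates it. Note that this γ makes a group of size 3 receive weight exactly 3^γ = √2.

```python
import math

# Sketch of the weight choice d_j = m_j ** gamma from [26], with the
# critical value gamma = log(2) / (2 * log(3)) used in the simulations.
gamma = math.log(2) / (2 * math.log(3))

neighborhood_sizes = [2, 3, 3, 2]              # m_j = |N_j|, illustrative
weights = [m ** gamma for m in neighborhood_sizes]
print([round(w, 4) for w in weights])
```

Larger neighborhoods receive larger weights, so densely connected groups are penalized more heavily.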

Simulation study
We consider the logistic model. To examine the performance of the proposed sGLMg, we compare it with some popular penalized methods: the Lasso, ridge regression, the adaptive Lasso and the elastic net. In the simulations, the predictor graph is defined by the precision matrix of the predictors. The performance of sGLMg using both the estimated predictor graph and the oracle true predictor graph is evaluated in all examples; we denote by sGLMg-O the sGLMg method using the true predictor graph. The response Y of the logistic regression is generated with Y ∈ {0, 1} and P(Y = 1|X) = exp(Xβ^*)/(1 + exp(Xβ^*)). We divide the data into three separate subsets: a training set, a validation set and a testing set. All models are fitted on the training set only; the validation set is used to choose the tuning parameter and the testing set is used to evaluate the different methods. We use the notation ·/·/· to denote the sample sizes of the training, validation and testing sets, respectively. For each example, we consider three cases: (A) 40/40/400, (B) 80/80/400 and (C) 120/120/400. For each case, we repeat the simulation 100 times. The predictor graph is estimated by the graphical Lasso method [11] using only the training data in all cases.
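The data-generating mechanism above can be sketched in a few lines of Python. The dimension p and the coefficient vector below are our own illustrative choices; the split sizes follow case (A).

```python
import numpy as np

# Sketch of the logistic data-generating mechanism:
# P(Y = 1 | X) = exp(X beta) / (1 + exp(X beta)).
# p and beta are illustrative; sizes follow case (A), 40/40/400.
rng = np.random.default_rng(1)

p = 10
beta = np.zeros(p)
beta[:3] = 1.0                          # a few useful predictors

def simulate(n):
    X = rng.normal(size=(n, p))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    Y = rng.binomial(1, prob)
    return X, Y

(X_tr, Y_tr), (X_val, Y_val), (X_te, Y_te) = (simulate(n) for n in (40, 40, 400))
print(X_tr.shape, X_val.shape, X_te.shape)   # (40, 10) (40, 10) (400, 10)
```

Models would then be fitted on (X_tr, Y_tr), tuned on the validation pair, and scored on the testing pair.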
To evaluate the different methods, we use several measures, including the nonzero match ratio (NMR) and the zero match ratio (ZMR): NMR (resp. ZMR) checks whether the estimated coefficients of two connected useful (resp. useless) predictors are both nonzero (resp. zero). Note that we compute NMR and ZMR only when there is at least one edge connecting two useful predictors and at least one edge connecting two useless predictors; thus, these two ratios are well defined and always lie between 0 and 1. Table 1 and Table 2 show the comparison of estimation, prediction and model selection performance for Example 1. The results indicate that ridge regression obtains better estimation than the Lasso, adaptive Lasso and elastic net, but ridge regression cannot select predictors automatically. Compared with the Lasso, adaptive Lasso and ridge regression, the elastic net achieves the best overall estimation, prediction and model selection by combining the ℓ1 and ridge penalties. Specifically, the estimation error of the elastic net is almost the same as that of ridge regression, and its prediction is better than that of the Lasso and adaptive Lasso. Although the elastic net has a relatively higher FPR than the Lasso and adaptive Lasso, its FNR is the smallest of the three. Compared with all the other methods, the proposed sGLMg method delivers the best estimation and prediction performance. For the cases with smaller sample sizes (A and B), the proposed sGLMg (sGLMg-O) method has a slightly higher FPR than the adaptive Lasso; however, as the sample size increases (C), sGLMg (sGLMg-O) achieves the lowest FPR among all methods. In particular, the FNR of the proposed sGLMg (sGLMg-O) method is much lower than that of the other methods.
The reason is that the proposed sGLMg (sGLMg-O) method uses the information in the predictor graph, so predictors connected in the graph have a much greater chance of being selected or removed simultaneously.
When the signal strength is weak, the sGLMg (sGLMg-O) method tends to select more predictors as significant. As the signal strength becomes stronger, the FPR of the proposed sGLMg (sGLMg-O) method decreases gradually, eventually becoming even smaller than that of the adaptive Lasso. The results of Table 10 in Appendix A indicate that the model selection performance of the proposed sGLMg (sGLMg-O) method can become the best among the compared methods when the signal is increased appropriately. For this example, since the estimated predictor graph is almost the same as the true predictor graph, the performance of the sGLMg method is similar to that of sGLMg-O. The performance comparison for Example 2 is displayed in Table 3 and Table 4. As in Example 1, ridge regression performs better in estimation and prediction than the other methods when the sample size is relatively small. However, the proposed sGLMg (sGLMg-O) method can achieve better estimation or prediction than ridge regression for relatively large sample sizes: for case C, sGLMg (sGLMg-O) obtains better estimation than ridge regression, and for cases B and C, sGLMg (sGLMg-O) achieves better prediction than ridge regression.
Among the Lasso, adaptive Lasso and elastic net methods, the adaptive Lasso and the elastic net achieve the lowest FPR and FNR, respectively. The proposed sGLMg (sGLMg-O) method has the lowest FNR among all methods, although its FPR is a little higher than that of the other methods, mainly because of the weak signal strength. The results of Table 11 in Appendix A show that the model selection performance of the proposed sGLMg (sGLMg-O) method can be significantly improved if the signal intensity is increased; in particular, the FNR is almost 0 when the sample size is relatively large. The other methods, however, benefit only a little from increasing the signal strength; for example, the FPR of the elastic net worsens when the signal strength is increased. Overall, the sGLMg-O method performs better than the sGLMg method in estimation, prediction and model selection. Table 5 and Table 6 display the results for Example 3. The proposed sGLMg (sGLMg-O) method delivers the best estimation and prediction performance among the compared methods (excluding ridge regression). Note that none of the methods performs well in model selection here; in particular, the FNR of all methods is quite high, although the proposed sGLMg (sGLMg-O) method still has the lowest FNR. These results indicate that model selection is more difficult for generalized linear models than for the linear regression model when the predictor graph is complicated; the reason may be that generalized linear models involve a larger number of unknown factors than the linear regression model. As in the previous two examples, the proposed sGLMg method has the best overall results for estimation, prediction and model selection compared with the other methods.
The comparison results on NMR and ZMR for the cases with sample sizes 40/40/400, 80/80/400 and 120/120/400 are shown in Tables 7, 8 and 9. Compared with the other methods, the proposed sGLMg-O method achieves the best performance in NMR and competitive performance in ZMR. From the above analysis, the reason the sGLMg-O method obtains a lower ZMR than the other methods may be that the proposed sGLMg method tends to select more predictors as significant, which lessens the zero match ratio. The proposed sGLMg method achieves very competitive performance in NMR and ZMR when the predictor graph is not very complicated (Examples 1 and 2). Even in these cases, however, the NMR achieved by the Lasso and adaptive Lasso is almost 0, because the Lasso tends to select only some predictors from a set of highly correlated predictors. When the predictor graph is complicated (Example 3), the NMR achieved by all methods is not very good, but the proposed sGLMg method is still the best. The NMRs and ZMRs of sGLMg-O indicate that the proposed sGLMg method incorporates the predictor graph information to group the predictors and makes efficient use of most edges between useful predictors and between useless predictors. Therefore, the proposed sGLMg method can choose connected useful predictors simultaneously and exclude connected useless predictors jointly.
In conclusion, the simulation results indicate that when the group structure is unknown, our proposed sGLMg method can efficiently use the structural information among predictors to group them, and it performs well in estimation, prediction and model selection.
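The selection metrics reported in the simulations can be computed from the supports of the true and estimated coefficients. Below is a minimal Python sketch; the function name and the exact definitions of NMR and ZMR (read here as per-replication indicators, to be averaged over replications, that the non-zero and zero supports are recovered exactly) are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def selection_metrics(beta_true, beta_hat, tol=1e-8):
    """Per-replication model selection metrics.

    FPR/FNR: coordinate-wise false positive / false negative rates.
    NMR/ZMR: indicators that the non-zero (resp. zero) support is
    recovered exactly (assumed definitions, averaged over replications).
    """
    s_true = np.abs(beta_true) > tol   # true support
    s_hat = np.abs(beta_hat) > tol     # estimated support
    fpr = np.mean(s_hat[~s_true]) if (~s_true).any() else 0.0
    fnr = np.mean(~s_hat[s_true]) if s_true.any() else 0.0
    nmr = float(np.all(s_hat[s_true]))    # all true non-zeros selected
    zmr = float(np.all(~s_hat[~s_true]))  # all true zeros excluded
    return fpr, fnr, nmr, zmr

# toy example with beta* = (3, 3, 3, 0, 0, 0), as in Example 1
beta_true = np.array([3.0, 3.0, 3.0, 0.0, 0.0, 0.0])
beta_hat = np.array([2.8, 3.1, 0.0, 0.5, 0.0, 0.0])
print(selection_metrics(beta_true, beta_hat))  # (1/3, 1/3, 0.0, 0.0)
```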

Application
In this section, we consider a real example to compare the performance of our proposed sGLMg method with Lasso, adaptive Lasso, ridge regression and Elastic net. The breast cancer data consist of 22,283 gene expression levels for 133 subjects, including 34 subjects with pathological complete response (pCR) and 99 subjects with residual disease (RD). The dataset was analysed by [12] and is available at http://bioinformatics.mdanderson.org/pubdata.html. The pCR state is defined as no evidence of viable, invasive tumor cells remaining in the surgical specimen, and it is considered to indicate a high chance of long-term cancer-free survival, justifying its use as a surrogate marker of chemosensitivity [17]. Thus, it is of considerable interest to study the response states of the patients (pCR or RD) to neoadjuvant (preoperative) chemotherapy. [10] and [6] apply linear discriminant analysis to predict whether or not a subject can achieve the pCR state by estimating the inverse covariance matrix (or precision matrix) of the gene expression levels. In this paper, we follow the same analysis scheme as [10] and [6] to estimate the precision matrix, and then compare the performance of our proposed sGLMg method with Lasso, adaptive Lasso, ridge regression and Elastic net based on the estimated precision matrix.
To estimate the precision matrix, we randomly divide the data into training and testing sets of sizes 112 and 21, respectively, and repeat the whole process 100 times. A stratified sampling method is used to maintain a similar class proportion in the training and testing datasets: each time we randomly select 5 pCR subjects and 16 RD subjects from the corresponding groups to form the testing data (both roughly 1/6 of the subjects in each group), and the remaining subjects constitute the training set. For each training set, a two-sample t test is performed between the two groups, and the 113 most significant genes, those with the smallest p-values, are selected as the predictors. We note that the training sample size n = 112 is slightly smaller than the variable dimensionality p = 113, which allows us to examine the performance when p > n. Then a gene-wise standardization is performed by dividing the data by the corresponding standard deviations estimated from the training dataset. Finally, we estimate the precision matrix Ω from the training data by the graphical Lasso [11], and the predictor graph G is read off the estimated precision matrix. Note that all models are fitted on the training data and evaluated by the mean squared error (MSE) computed on the testing data, with the tuning parameters of the different methods chosen by 10-fold CV. Figure 1 shows the box plot of the mean squared errors of the different methods over the 100 splits. The results indicate that our proposed sGLMg method achieves a better MSE than the Lasso, adaptive Lasso and Elastic net methods and a slightly worse one than ridge regression, which is consistent with the simulation results.
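The screening and graph-estimation steps above can be sketched in Python on synthetic data. The sizes, the penalty level alpha, and the synthetic X and y below are placeholders (not the paper's data or settings), and scipy's ttest_ind and scikit-learn's GraphicalLasso are used as stand-ins for the two-sample t test and the graphical Lasso of [11].

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Synthetic stand-in for the breast cancer data: X holds expression
# levels, y in {0, 1} encodes pCR vs. RD; sizes kept small here.
n, p_full, p_keep = 60, 200, 20
X = rng.normal(size=(n, p_full))
y = (rng.random(n) < 0.3).astype(int)

# 1) Gene screening: two-sample t test between the classes; keep the
#    p_keep genes with the smallest p-values (the paper keeps 113).
_, pvals = ttest_ind(X[y == 1], X[y == 0], axis=0)
keep = np.argsort(pvals)[:p_keep]
X_sel = X[:, keep]

# 2) Gene-wise standardization by the training standard deviations.
X_sel = X_sel / X_sel.std(axis=0, ddof=1)

# 3) Precision matrix via the graphical Lasso; the predictor graph G
#    is the off-diagonal non-zero pattern of the estimate.
gl = GraphicalLasso(alpha=0.3).fit(X_sel)
Omega = gl.precision_
G = (np.abs(Omega) > 1e-8) & ~np.eye(p_keep, dtype=bool)
print(Omega.shape, G.sum() // 2, "edges")
```

In the paper's pipeline this is repeated over the 100 stratified splits, with the downstream models fitted on the same training data.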

Conclusion
In this paper, we propose the sparse generalized linear model incorporating graphical structure among predictors (sGLMg), which can be used to analyse data with a sparse graphical or overlapping structure under generalized linear models. For the sGLMg model, the overlapping structure does not need to be specified in advance; it can be derived from the graphical structure among the predictors. Even when the graphical structure is unknown, it can be constructed from a sparse estimate of the covariance matrix of the predictors. Since a closed-form estimator for generalized linear models usually cannot be obtained, the graphical structure among predictors cannot be incorporated into the estimation of GLMs in the same way as in the linear regression model, where the least squares estimator can be written as the product of the inverse covariance matrix, which defines the predictor graph, and a vector relating the predictors to the response. In this paper, we use sufficient dimension reduction techniques to show that the estimators of GLMs can also be formulated as the product of the inverse covariance matrix and a vector, so the graphical structure of the predictors can likewise be incorporated into GLMs. To exploit the neighborhood information of the graph, we apply a node-by-node strategy that converts the graphical structure into an overlapping group structure. Furthermore, our proposed method is very general: some popular methods such as Lasso, group Lasso and ridge regression are included as special cases. The theoretical results we obtained remain valid when the overlapping group structure is pre-specified.

Table 10 Performance comparison of model selection for β* = (3, · · · , 3, 0, · · · , 0) in Example 1
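The node-by-node conversion from a predictor graph to overlapping groups can be sketched as follows. The function name and the convention that group j consists of node j together with its neighbours are our illustration of the strategy; the paper's exact construction may differ in details such as group weights.

```python
import numpy as np

def graph_to_groups(G):
    """Node-by-node conversion of a predictor graph to overlapping groups.

    G is a symmetric boolean adjacency matrix without self loops; group j
    collects predictor j and its neighbours, so adjacent nodes appear in
    each other's groups and the groups overlap.
    """
    p = G.shape[0]
    return [sorted({j} | set(np.flatnonzero(G[j]))) for j in range(p)]

# A small path graph 0 - 1 - 2: the middle node's group covers all nodes.
G = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=bool)
print(graph_to_groups(G))  # [[0, 1], [0, 1, 2], [1, 2]]
```

With groups built this way, an overlapping group penalty encourages connected useful predictors to enter the model together, matching the behaviour observed in the simulations.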
Lemma S.2. Let $X_1, \dots, X_n$ be independent random variables on $\mathcal{X}$ and $f_1, \dots, f_p$ be real-valued functions on $\mathcal{X}$ satisfying, for all $i = 1, \dots, n$ and all $j = 1, \dots, p$, $\mathbb{E} f_j(X_i) = 0$ and $|f_j(X_i)| \le a_{ij}$. Then, for every $t > 0$,
$$\mathbb{P}\Big(\max_{1 \le j \le p}\Big|\sum_{i=1}^n f_j(X_i)\Big| \ge t\Big) \le 2p \exp\Big(-\frac{t^2}{2 \max_{1 \le j \le p} \sum_{i=1}^n a_{ij}^2}\Big).$$
Proof of Lemma S.2. The lemma follows directly from Hoeffding's inequality applied to each $j$, together with a union bound over $j = 1, \dots, p$.
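As an illustrative sanity check (our own, not part of the proof), the standard Hoeffding-type union bound for this setting, P(max_j |Σ_i f_j(X_i)| ≥ t) ≤ 2p exp(−t² / (2 max_j Σ_i a_ij²)) for zero-mean f_j(X_i) with |f_j(X_i)| ≤ a_ij, can be verified numerically. The simulation below draws f_j(X_i) uniformly on [−a_ij, a_ij]; all constants are arbitrary.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, p, t, reps = 50, 10, 20.0, 2000

# bounds a_ij; f_j(X_i) ~ Uniform[-a_ij, a_ij] is zero mean and bounded
a = rng.uniform(0.5, 1.5, size=(n, p))

hits = 0
for _ in range(reps):
    f = rng.uniform(-a, a)                    # |f_j(X_i)| <= a_ij
    if np.max(np.abs(f.sum(axis=0))) >= t:
        hits += 1
empirical = hits / reps

# Hoeffding + union bound over the p columns
bound = 2 * p * math.exp(-t**2 / (2 * (a**2).sum(axis=0).max()))
print(empirical, "<=", bound)
```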