Model selection of hierarchically structured covariates using elastic net

Abstract: Hierarchically associated covariates are common in many fields, and it is often of interest to incorporate their information in statistical inference. This paper proposes a novel way to explicitly integrate the information of a given hierarchical tree of covariates in high-dimensional model selection. Specifically, a set of hierarchical scores is introduced to quantify the hierarchical positions of the terminal nodes of the given hierarchical tree, where a terminal node represents either a single covariate or a group of covariates. These scores are then used to weight the corresponding penalty terms in a model selection approach. We show that the proposed estimation approach has a hierarchical grouping property, namely, two highly correlated covariates that are close to each other in the hierarchical tree are more likely to be included or excluded together in the model than those which are far away. We also prove model selection consistency of the proposed estimator both between and within groups. The theoretical results are illustrated by simulation and also a real data analysis on the Systemic Lupus Erythematosus (SLE) dataset.


Introduction
Consider the ordinary linear regression model with n observations and p covariates:

y = Xβ + ε, (1.1)

where y is an n-dimensional response vector, X = (x_1, . . . , x_p) is an n × p deterministic design matrix, β = (β_1, . . . , β_p)^T is the vector of regression coefficients, and ε is a vector of independent random errors. We assume that p is very large and β is sparse, a setting for which a large body of literature on consistent model selection has developed over the past two decades. Examples include the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001; Fan and Lv, 2011), SIS (sure independence screening) and the two-scale method (Fan and Lv, 2008), MCP (Zhang, 2010), and many others. In many cases, there are high correlations among the covariates x_i. A number of publications have shown that this correlation should be taken into account to produce stable and consistent results; see, e.g., the Elastic Net (Enet) method (Zou and Hastie, 2005), the OSCAR (octagonal shrinkage and clustering algorithm for regression) approach (Bondell and Reich, 2008), the Mnet method (Huang et al, 2010), and the SLS method, among others. Indeed, structured information in covariates is important in covariate selection. For instance, Yuan and Lin (2006) studied a Group-Lasso method that introduced the concept of group sparsity, where individual covariates are grouped for selection. The benefit of grouping is greater regularization and improved prediction power due to stability in the presence of highly correlated covariates. Similarly, there are publications on the use of hierarchical information in covariate selection. In those studies, the purpose of the hierarchical structure is to impose a prioritization on the covariates and their ancestors being selected.
Examples include the composite absolute penalty (CAP) (Zhao et al, 2009), structured covariate selection with sparsity-inducing norms (Jenatton et al, 2011), structured covariate selection and estimation (Yuan et al, 2009), a LASSO for hierarchical interactions (Bien et al, 2013), and learning with structured sparsity. In this work, we focus on a different type of hierarchical structure which is particularly common in hierarchical clustering analysis: instead of representing the priority of the covariates, the hierarchy contains the information on the split (or merge, if agglomerative) sequences of clusters (i.e., groups of covariates). To avoid ambiguity, throughout the rest of this paper, "hierarchical structure" refers to this type of information. For instance, in cell biology and genetic studies, genes are often organized into groups based on their biological characteristics or genetic functions, and in many studies the groups are organized in hierarchical layers; see, e.g., (Nei, 1973; Beibbarth and Speed, 2004). See Subsection 1.1 for additional examples. It is prudent to utilize this type of information in our analysis to seek important gene predictors. However, to the best of our knowledge, the question of how to utilize such hierarchy information in model selection problems is still open, even though hierarchical structures are common in many applications. This paper addresses that question. Specifically, we introduce a novel hierarchical scoring system to quantify the hierarchical positions and, based on it, develop a new high-dimensional model selection approach to incorporate the information of a given covariate hierarchical tree. The advantage of our approach is that the hierarchical information can influence the final outcome of the analysis, leading to better scientific and statistical interpretations which are fully consistent with the given hierarchical structure.
Our motivation comes from the peripheral-blood mononuclear cell (PBMC) study reported in Chaussabel et al. (2008). The objective of the study is to eliminate the trivial genes and identify the important genes which can be used to predict the Systemic Lupus Erythematosus disease-activity index (SLEDAI) among 4779 potential candidates with 47 individual samples. According to Chaussabel et al. (2008), those 4779 genes are distributed among 28 modules (groups). The transcripts within each module are highly correlated. On top of these 28 modules, we can also obtain a hierarchical structure with each terminal node representing a single module; see Figure 1. The challenge is how to take advantage of the information presented in the hierarchical structure and use it in a model selection procedure which selects covariates at both group and individual levels. We use an Enet-type penalty throughout to illustrate our developments. This simplifies our presentation and keeps our focus on the main goal of incorporating the information contained in a covariate hierarchical tree. An Enet estimator is

β̂ = argmin_β ||y − Xβ||_2^2 + λ_1 ||β||_1 + λ_2 β^T β; (1.2)

cf. Zou and Hastie (2005). The Enet approach encourages sparsity and grouping simultaneously. It also consistently selects the true model under certain conditions; cf. Jia and Yu (2010). Here and throughout the paper, we use the l_1, l_2 and l_∞ norms. In particular, for a vector a = (a_1, . . . , a_p)^T, they are denoted by ||a||_1 = Σ_i |a_i|, ||a||_2 = (Σ_i a_i^2)^{1/2} and ||a||_∞ = max_i |a_i|, respectively. For a matrix A, we also denote ||A||_2 = sup_{||a||_2 = 1} ||Aa||_2 and ||A||_∞ = sup_{||a||_∞ = 1} ||Aa||_∞.
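As a concrete illustration, the Enet criterion (1.2) and the norms above can be expressed numerically; this is a minimal sketch, and the helper name `enet_objective` is ours rather than the paper's.

```python
import numpy as np

def enet_objective(beta, y, X, lam1, lam2):
    """Elastic-net criterion (1.2): ||y - X b||_2^2 + lam1*||b||_1 + lam2*b^T b."""
    resid = y - X @ beta
    return resid @ resid + lam1 * np.abs(beta).sum() + lam2 * beta @ beta

# The three vector norms used throughout the paper:
a = np.array([3.0, -4.0, 1.0])
l1 = np.abs(a).sum()          # ||a||_1 = 8.0
l2 = np.sqrt((a ** 2).sum())  # ||a||_2 = sqrt(26)
linf = np.abs(a).max()        # ||a||_inf = 4.0
```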
In order to integrate the information of the covariate hierarchical tree, we propose a scoring system to quantify the positions of the terminal nodes in the given hierarchical tree. For ease of presentation and to avoid other complications, we first start with a simple case in which each terminal node of the hierarchical tree represents only one covariate. In this setting, each covariate is assigned a score derived from the hierarchical structure. The set of scores quantifies the hierarchical information among the covariates, and we integrate them into the Enet penalty function. It can be shown that the resulting procedure not only performs model selection and estimation simultaneously, but also enjoys a desired feature, called the hierarchical grouping property, which can be generally described as follows: • Two highly correlated covariates which are "close" to each other in the hierarchical tree will more likely be included in or dropped from the model together than those that are "far away".
A formal definition of this hierarchical grouping property will be provided in Section 2.1. In practice, the terminal nodes of many hierarchical trees contain multiple covariates; see, e.g., Breiman et al. (1984) and also our motivating example of the SLE study mentioned above. We extend our development for the simple setting in the first part to the more complicated case in which each terminal node can contain multiple covariates (thus a terminal group of covariates). Given a hierarchical structure, we assign each terminal group a score and prove that, with an appropriate choice of group penalty function, the resulting procedure still retains the hierarchical grouping property for the terminal nodes. That is, the procedure will include or exclude together those highly correlated terminal groups (nodes) which are "close" in the hierarchical structure. Furthermore, we prove that the proposed estimator has model selection consistency at both levels, i.e., between terminal groups and within terminal groups.
The rest of the paper is organized as follows. The remainder of this section (Section 1.1) introduces several terminologies and notations used throughout the paper. In Section 2, we define the "hierarchical grouping property" and construct the "hierarchical scores" along with the corresponding "hierarchical Elastic Net" estimator for the simplified case in which each terminal node represents only one covariate. We also prove that the proposed estimator has the "hierarchical grouping property" and can provide consistent results in model selection. Section 3 extends the results to the general case in which each terminal node corresponds to a group of covariates. The computational algorithm is presented in Section 4. Numerical studies, including simulations and a real data analysis, are carried out in Section 5. Further remarks, including potential extensions of the method to other types of penalty functions, are provided in Section 6. We relegate technical proofs to the Appendix.

Terminologies and notations
Consider a hierarchical structure represented by an upside-down tree; for example, Figure 2 (a) or (b). The top of the tree is the root, and we refer to the nodes at the bottom of the upside-down tree as terminal nodes. The nodes (splits) in between are the internal nodes which, viewed from the bottom to the top, show how individual covariates are grouped and how the groups are merged into supergroups. The concept of depth is defined for each node (split) as its position in the sequence of splitting when viewed from the top to the bottom. The root node, which represents the first split, has depth 1. Then the node representing the second split has depth 2, etc. See the trees in Figure 2 (a) and (b), in which we have marked the depths of their splits on the vertical coordinate. We assume in this paper that the hierarchical tree structure is known and all internal nodes have distinct depths, i.e., the split (or merge) is sequential. In real applications, the depth values can be obtained in different ways. For instance, in a biological evolutionary tree, each splitting of lineages of species can be arranged in chronological order. In this case, the depth values can be derived from the time of origination of each new species. Also, classical divisive hierarchical clustering algorithms recursively divide one of the existing clusters into two sub-clusters at each iteration (cf., e.g., Breiman et al. (1984)). In this case, the iteration order can be used to define the depths. How to handle the uncertainty of the tree structure (especially when the tree is generated from a computer algorithm) is not a topic of the current paper; Section 6 contains further discussions. In addition, following Zou and Hastie (2005) and others, we assume that each covariate has been standardized to have l_2-norm √n before data analysis:

Σ_{i=1}^n x_ij^2 = n, for j = 1, . . . , p. (1.3)
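The standardization (1.3) can be sketched as follows; centering the columns before rescaling is conventional but is our assumption here ((1.3) itself only fixes the column norms), and `standardize` is a hypothetical helper name.

```python
import numpy as np

def standardize(X):
    """Center each column, then rescale so that sum_i x_ij^2 = n, as in (1.3).
    (Centering is conventional; (1.3) only constrains the column norms.)"""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return np.sqrt(n) * Xc / np.sqrt((Xc ** 2).sum(axis=0))
```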

Hierarchical covariate selection when a terminal node contains only a single predictor
This section considers the simple case in which each terminal node of the given hierarchical tree contains only one covariate. We design in Subsections 2.1 and 2.2 a set of positive hierarchical scores s_j for the covariates x_j, j = 1, . . . , p, to reflect the structure of a given tree. We use these hierarchical scores as a set of weights on the Enet penalty terms in (1.2), and propose the following Hierarchical Enet (HEnet) estimator:

β̂ = argmin_β ||y − Xβ||_2^2 + λ_1 Σ_{j=1}^p s_j |β_j| + λ_2 Σ_{j=1}^p s_j β_j^2. (2.1)

Here, (λ_1, λ_2) are the tuning parameters and S = diag{s_1, . . . , s_p}. In the special case when all s_j ≡ 1, (2.1) reduces back to the conventional Enet estimator (1.2). This Hierarchical Enet (HEnet) estimator has several desirable properties.

Hierarchical grouping property
We define below the notion of ancestors, which will be used to determine the closeness of a pair of covariates in a given tree.

Definition 2.1 (Ancestors & closeness).
For a covariate x_j in a given hierarchical structure, we define the ancestors of x_j, denoted by A_{x_j}, as the set of depths associated with the splits in the hierarchical tree that lead to the terminal node of x_j. We also define the ancestors of a set of covariates B = {x_i, i ∈ I}, denoted by A_B, as the common ancestors of the x_i over i ∈ I, i.e., A_B = ∩_{i∈I} A_{x_i}. In addition, we say x_j is closer to x_i than to x_k in the hierarchical tree if x_i and x_j share more ancestors than x_i and x_k, i.e., |A_{x_i} ∩ A_{x_j}| > |A_{x_i} ∩ A_{x_k}|. For example, the ancestors of covariate x_1 (corresponding to terminal node 1) in Figure 2 (a) are A_{x_1} = {1, 3, 5}, while those of x_3 are A_{x_3} = {1, 3}. Now, we formally state the hierarchical grouping property as follows.
Definition 2.2 (Hierarchical grouping property). We call predictor covariates grouped if they are selected or dropped together by a model selection procedure. A hierarchical grouping property refers to: P1. For two covariates that share the same ancestors, if they are also highly correlated, then they tend to be selected or dropped together by the model selection procedure. P2. For any three covariates, if x_j is closer to x_i than to x_k in the hierarchy tree, then x_i and x_j tend to be grouped together with a higher chance than x_i and x_k.
The P1 property is similar to the conventional grouping property discussed in Zou and Hastie (2005) and Bondell and Reich (2008): highly correlated covariates tend to be selected or dropped together from the estimated model. The P2 property is in compliance with the hierarchical structure: a pair of covariates that are closer (i.e., share more ancestors) in the hierarchy will be more likely grouped in the estimated model than a pair that is farther away (i.e., shares fewer ancestors), provided that both pairs have the same correlation. Here, the interpretation of "more likely" is that |β̂_i − β̂_j| has a smaller upper bound than |β̂_i − β̂_k|. To achieve the hierarchical grouping property, the hierarchical scores s_i used in our proposed approach (2.1) need to satisfy certain conditions, as discussed in the following. Recall that the conventional Enet estimator in (1.2) has the grouping property. In particular, by taking derivatives of the right hand side of (1.2) with respect to β_j and β_k, respectively, we have, assuming β̂_j β̂_k ≠ 0,

−2x_j^T(y − Xβ̂) + λ_1 sgn(β̂_j) + 2λ_2 β̂_j = 0, −2x_k^T(y − Xβ̂) + λ_1 sgn(β̂_k) + 2λ_2 β̂_k = 0.

Under the condition β̂_j β̂_k > 0, subtracting the above two equations and combining with the fact that ||y − Xβ̂||_2 ≤ ||y||_2, we have

|β̂_j − β̂_k| ≤ (||y||_2 / λ_2) √(2n(1 − φ_jk)), (2.2)

where φ_jk = cor(x_j, x_k). The inequality (2.2) indicates that as φ_jk → 1, |β̂_j − β̂_k| → 0; thus, the grouping property of Enet is realized. Similarly, for the HEnet approach (2.1), taking derivatives of the right hand side yields, after a simple calculation,

−2x_j^T(y − Xβ̂) + λ_1 s_j sgn(β̂_j) + 2λ_2 s_j β̂_j = 0. (2.3)

Thus, parallel to (2.2), we have

|β̂_j − β̂_k| ≤ (√n ||y||_2 / λ_2) (√(s_j^2 + s_k^2) / (s_j s_k)) √(1 − ϕ_jk φ_jk), (2.4)

where ϕ_jk = 2 s_j s_k / (s_j^2 + s_k^2), s_(p) = max_{1≤i≤p} s_i and s_(1) = min_{1≤i≤p} s_i; note that √(s_j^2 + s_k^2)/(s_j s_k) ≤ √2 / s_(1). For the bound (2.4) to respect the hierarchy, the scores need to satisfy the following two conditions: C1. Covariates sharing the same set of ancestors receive the same score, so that ϕ_jk = 1 for such pairs. C2. If x_j is closer to x_i than to x_k in the tree, then ϕ_ij ≥ ϕ_ik. Formally, we have the following lemma; a proof is provided in the Appendix.

Lemma 2.1. If the scores s_1, . . . , s_p satisfy Conditions C1 and C2, then the HEnet estimator (2.1) has the hierarchical grouping property of Definition 2.2.

The remaining question is how to construct a set of hierarchical scores s_i that satisfies Conditions C1 and C2. We provide a method to construct such scores next in Section 2.2.

Construction of hierarchical scores
For a given terminal node with covariate x_i, we define a binary vector v_i ∈ R^{p−1} such that, for l = 1, . . . , p − 1, the lth element of v_i is

v_il = 1 if l ∈ A_{x_i}, and v_il = 0 otherwise. (2.5)

Here, A_{x_i} is the set of ancestors of x_i defined in Section 2.1. For example, corresponding to the terminal node x_1 in Figure 2 (a), A_{x_1} = {1, 3, 5}; thus, by (2.5), the binary vector is v_1 = (1, 0, 1, 0, 1). Similarly, corresponding to the predictor x_3 in Figure 2 (a), A_{x_3} = {1, 3} and, by (2.5), the binary vector is v_3 = (1, 0, 1, 0, 0). We define the hierarchical score s_i for the terminal node of the predictor x_i as

s_i = (Σ_{l=1}^{p−1} v_il τ^{−l})^α. (2.6)

Here, τ and α are positive constants to be further explained. This set of scores is bounded above by max_{1≤i≤p} s_i ≤ (τ − 1)^{−α}. The following theorem states that the scores s_i defined in (2.6) satisfy Conditions C1 and C2 for any τ ≥ 3 and α > 0. A proof of the theorem is given in the appendix. Note that the requirement that τ ≥ 3 and α > 0 also ensures that the mapping from v_i to s_i is one-to-one.
Theorem 2.1. Suppose all the internal nodes have different depths in the given tree. If τ ≥ 3 and α > 0, then the scores s_i defined in (2.6) satisfy Conditions C1 and C2.
From (2.6), we can show that the absolute score difference of two different predictors (for example, |s_1 − s_3| in Figure 2 (a)) is a decreasing function of τ. So the choice τ = 3 numerically maximizes the differentiation among the scores of the covariates. Based on this consideration, we fix τ = 3 in all of our numerical studies.
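The binary vectors (2.5) and the scores can be computed as in the sketch below; it assumes the closed form s_i = (Σ_l v_il τ^{−l})^α for (2.6), and the helper names `binary_vector` and `hierarchical_score` are ours.

```python
def binary_vector(ancestors, p):
    """v_i from (2.5): the lth entry is 1 iff depth l is an ancestor of x_i."""
    return [1 if l in ancestors else 0 for l in range(1, p)]

def hierarchical_score(ancestors, tau=3, alpha=1.0):
    """Hierarchical score s_i; the form (sum_l v_il * tau^(-l))^alpha is assumed."""
    return sum(tau ** (-l) for l in ancestors) ** alpha

# Figure 2 (a): A_{x1} = {1, 3, 5}, A_{x3} = {1, 3}
v1 = binary_vector({1, 3, 5}, p=6)   # [1, 0, 1, 0, 1]
v3 = binary_vector({1, 3}, p=6)      # [1, 0, 1, 0, 0]
s1 = hierarchical_score({1, 3, 5})
s3 = hierarchical_score({1, 3})      # s1 - s3 = tau^(-5) when alpha = 1
```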
For the parameter α > 0, we treat it as a tuning parameter. In particular, the score s_i is decreasing in α and it also has the following properties: • When α → 0, all scores s_i → 1 and thus all scale weights ϕ_ij → 1. In this case, the hierarchical tree structure is not taken into account in (2.3), and the HEnet estimator (2.1) is just the conventional Enet estimator (1.2). • When α → ∞, only covariates sharing the same ancestors have ϕ_ij = 1, and otherwise ϕ_ij → 0. So only covariates with the same ancestors are considered for grouping, and the hierarchical structure is strictly enforced.
Clearly, the tuning parameter α controls the extent to which the hierarchical structure impacts the regression parameter estimates. More details on how we choose α in our developments can be found in Sections 4 and 5.
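The effect of α on the scale weights ϕ_ij = 2s_i s_j/(s_i^2 + s_j^2) can be checked numerically; the sketch below again assumes the closed-form score above, and the helper names `score` and `phi` are ours.

```python
def score(ancestors, tau=3, alpha=1.0):
    # assumed closed form of the hierarchical score (2.6)
    return sum(tau ** (-l) for l in ancestors) ** alpha

def phi(si, sj):
    """Scale weight phi_ij = 2*s_i*s_j / (s_i^2 + s_j^2); equals 1 iff s_i = s_j."""
    return 2 * si * sj / (si ** 2 + sj ** 2)

A1, A3 = {1, 3, 5}, {1, 3}  # different ancestor sets (Figure 2 (a))
phi_small = phi(score(A1, alpha=0.01), score(A3, alpha=0.01))  # near 1: tree ignored
phi_large = phi(score(A1, alpha=50.0), score(A3, alpha=50.0))  # well below 1: tree enforced
```

As α grows, ϕ_ij shrinks for pairs with different ancestor sets, so the hierarchy exerts more influence on the grouping bound.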

Theoretical results
The proposed HEnet estimator in (2.1) can be re-expressed as

β̂ = argmin_β ||y − Xβ||_2^2 + λ_1 Σ_{j=1}^p s_j (|β_j| + δ β_j^2), (2.7)

where δ = λ_2/λ_1 is a new tuning parameter replacing λ_2. We formally state the result derived in Section 2.1 in the following theorem.
Theorem 2.2 (Hierarchical grouping property). Let β̂ be the estimator in (2.7). Suppose β̂_i β̂_j > 0; then

|β̂_i − β̂_j| ≤ (√n ||y||_2 / (λ_1 δ)) (√(s_i^2 + s_j^2) / (s_i s_j)) √(1 − ϕ_ij φ_ij).

Theorem 2.2 entails the hierarchical grouping property for the proposed HEnet estimator. For instance, suppose x_i and x_j are highly correlated, i.e., φ_ij ≈ 1. If x_i and x_j are also close (i.e., ϕ_ij ≈ 1) in the hierarchical structure, it follows that |β̂_i − β̂_j| ≈ 0. Thus, the upper bound in Theorem 2.2 provides the same quantitative description of the grouping effect as in Zou and Hastie (2005) and Bondell and Reich (2008). In the case when ϕ_ij < 1, the ϕ_ij term channels the information of the hierarchical structure into the grouping process.
We show next that the proposed HEnet estimator in (2.1) or (2.7) also has model selection consistency under a sparsity assumption. Without loss of generality, we assume that the first q true parameters satisfy β⁰_j ≠ 0 for j ∈ {1, . . . , q} and the remaining p − q true parameters satisfy β⁰_j = 0 for j ∈ {q + 1, . . . , p}. Also, let X_(1) and X_(2) be the first q and last p − q columns of the design matrix X, respectively, and Σ_ij = X_(i)^T X_(j)/n, for i, j ∈ {1, 2}. Write β⁰_(1) = (β⁰_1, . . . , β⁰_q)^T, and let S_(1) and S_(2) be the diagonal score matrices for the first q and the last p − q covariates, respectively. We define below a Hierarchical Elastic Irrepresentable Condition (HEIC):

HEIC. There exists a positive constant η < 1 such that

||S_(2)^{−1} Σ_21 (Σ_11 + (λ_1 δ/n) S_(1))^{−1} S_(1) (sgn(β⁰_(1)) + 2δ β⁰_(1))||_∞ ≤ 1 − η. (2.8)

Intuitively, the HEIC condition imposes a regularization constraint on the regression coefficients of X_(2) on X_(1), weighted by the hierarchical scores. The HEIC condition is an extension of the simple Elastic Irrepresentable Condition (EIC) proposed by Jia and Yu (2010): EIC. There exists a positive constant η < 1 such that

||Σ_21 (Σ_11 + (λ_2/n) I_q)^{−1} (sgn(β⁰_(1)) + (2λ_2/λ_1) β⁰_(1))||_∞ ≤ 1 − η.

In particular, when all the scores s_i ≡ 1, we have S_(1) = I_q and S_(2) = I_{p−q}, and in this case the HEIC condition is exactly the EIC condition. Jia and Yu (2010) proved model selection consistency of the conventional Enet estimator under EIC. Similarly, under HEIC, we have the following theorem. A proof of the theorem is given in the appendix.
Theorem 2.3. Under the HEIC condition, the HEnet estimator β̂ is sign consistent, i.e.,

P(sgn(β̂) = sgn(β⁰)) → 1 as n → ∞,

provided that the tuning parameters λ_1 and δ are chosen at suitable rates (specified in the appendix). Conversely, if the HEnet estimator β̂ is sign consistent for some (λ_1, δ), then (2.8) holds with η = 0, i.e.,

||S_(2)^{−1} Σ_21 (Σ_11 + (λ_1 δ/n) S_(1))^{−1} S_(1) (sgn(β⁰_(1)) + 2δ β⁰_(1))||_∞ ≤ 1.

Since, for any fixed α in (2.6), all the s_i are bounded, the incorporation of the hierarchical scores in the Enet-type penalty does not introduce additional complexity in the theoretical development compared to the original problem considered in Jia and Yu (2010), except that extra terms involving the hierarchical scores must be carried along.
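For a toy design, the left-hand side of the irrepresentable-type condition can be evaluated directly; the display form here follows our reconstruction of (2.8), and `heic_stat` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def heic_stat(X, beta0, s, lam1, delta):
    """Left-hand side of the HEIC-type condition (reconstructed form of (2.8));
    sign consistency calls for this quantity to stay below 1."""
    n = X.shape[0]
    A = np.flatnonzero(beta0)          # support {j : beta0_j != 0}
    B = np.flatnonzero(beta0 == 0)
    X1, X2 = X[:, A], X[:, B]
    S1, S2 = np.diag(s[A]), np.diag(s[B])
    Sig11, Sig21 = X1.T @ X1 / n, X2.T @ X1 / n
    b1 = beta0[A]
    inner = np.linalg.solve(Sig11 + (lam1 * delta / n) * S1,
                            S1 @ (np.sign(b1) + 2 * delta * b1))
    return np.abs(np.linalg.inv(S2) @ Sig21 @ inner).max()

# Orthogonal toy design: columns have squared norm n, and Sigma_21 = 0
X_toy = 2.0 * np.eye(4)
stat = heic_stat(X_toy, np.array([1.0, -0.5, 0.0, 0.0]),
                 np.array([0.3, 0.2, 0.5, 0.4]), lam1=0.1, delta=1.0)
```

With an orthogonal design Σ_21 = 0, so the statistic vanishes and the condition holds trivially.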

Hierarchical covariate selection when a terminal node contains multiple predictors
When p is large, building a large hierarchical tree with each terminal node representing only one covariate typically causes overfitting problems. Many researchers have suggested pruning large hierarchical trees to prevent overfitting while still capturing the important structures; cf. Breiman et al. (1984). The terminal nodes of a pruned tree often contain multiple covariates. In the example of the SLE dataset described in Section 1, all 4779 covariates (genes) are divided into 28 modules (terminal nodes), with each module (node) having more than one gene. Indeed, the terminal nodes in most hierarchical clusterings contain more than one member, whether by pruning, by natural structure, or by other means. The grouping by terminal nodes in a hierarchical tree imposes an additional layer of complication on our model selection problem. Let us call the group of covariates formed by a terminal node a terminal group. We need to consider two levels of model selection: selection of the terminal groups and selection of the covariates within each terminal group. We define an important covariate as any covariate with a non-zero true coefficient, and an important terminal group as any terminal group with at least one important covariate. Our goal of model selection is to identify both the important terminal groups (terminal nodes) and the important covariates (inside each important terminal node), while taking into account the given hierarchical tree structure.
In this section, we extend the developments in Section 2 to deal with this more general and also more challenging case. Under this context, the scoring scheme proposed in Section 2.2 is applied to the hierarchy of the terminal groups (terminal nodes), and thus every terminal group (terminal node) has an assigned score.
As pointed out by a reviewer, HEnet could be directly applied in this case by duplicating the weights for the covariates within each group. This is similar to using the lasso for a sparse group lasso problem (Simon et al, 2013), ignoring the grouping structure. Although we have not carefully investigated the effect of using HEnet in this case, we believe ignoring the group structure would be inferior, as has been demonstrated in the group lasso literature.

Group Hierarchical Enet
Consider again the linear model (1.1). Suppose now the p covariates are clustered in a hierarchical tree with K terminal nodes. Let G_k, k = 1, . . . , K, be the K non-overlapping subsets of the covariate indices {1, . . . , p} corresponding to the K terminal nodes. Denote by p_k = |G_k| the size of the kth subset G_k, so that Σ_{k=1}^K p_k = p. To simplify the notation and without loss of generality, we assume the p × 1 vector of regression coefficients β = (β_1, . . . , β_p)^T is arranged according to the terminal groups, so that we can rewrite it as β = (β_11, . . . , β_1p_1, . . . , β_K1, . . . , β_Kp_K)^T. Following Wang et al (2009), we introduce a coefficient γ_k for the terminal group G_k by reparameterizing the β's as

β_kj = γ_k θ_kj, j = 1, . . . , p_k, k = 1, . . . , K. (3.1)

Here, the parameter γ_k ≥ 0 is a group-level coefficient for the terminal group G_k, and the parameters θ_kj, j = 1, . . . , p_k, reflect the different coefficients within the group. The decomposition (3.1) leaves (1.1) unchanged under (γ_k, θ_k) → (cγ_k, θ_k/c) for any non-zero constant c ≠ 0, so we have a parameter identifiability problem. To resolve the problem, we impose the following constraint:

||X_k θ_k||_2^2 = n for each k with γ_k > 0, (3.2)

where X_k is the submatrix of X containing the covariates in G_k. An explicit expression of γ_k and θ_kj can be derived from (3.1) and (3.2). Specifically, when there is at least one nonzero β_kj in the kth group,

γ_k = ||X_k β_k||_2 / √n and θ_kj = β_kj / γ_k. (3.3)

Otherwise, in the case with β_kj = 0 for all j = 1, . . . , p_k, we set γ_k = 0 and θ_kj = 0 for j = 1, . . . , p_k. Without loss of generality, we assume in our theoretical derivation that γ_k > 0, k = 1, . . . , r, for the first r clusters and γ_k = 0, k = r + 1, . . . , K, for the remaining K − r clusters. Furthermore, we let q_k = |{(k, j) : β_kj ≠ 0}| be the number of nonzero coefficients in the kth terminal group and write q = Σ_{k=1}^K q_k. In other words, q_k is the number of important covariates in the kth terminal node and q is the total number of important covariates in the entire model. We assume a sparsity condition that q ≪ n. Similar to the HEnet discussed in Section 2, we propose the following penalized estimator, subject to the constraint (3.2):

(γ̂, θ̂) = argmin_{γ ≥ 0, θ} ||y − Σ_{k=1}^K γ_k X_k θ_k||_2^2 + λ_1 Σ_{k=1}^K s_k (γ_k + δγ_k^2) + λ_2 Σ_{k=1}^K s_k γ_k ||θ_k||_1. (3.4)
Here, λ_1, λ_2 and δ are the tuning parameters, and θ_k = (θ_k1, . . . , θ_kp_k)^T. In (3.4), the first penalty term Σ_{k=1}^K (γ_k + δγ_k^2) s_k is an Enet-type penalty on the terminal-group-level (terminal-node-level) parameters γ_k, which encourages sparse group selection. The second penalty term Σ_{k=1}^K γ_k ||θ_k||_1 s_k is a LASSO-type penalty on the coefficients θ_k, which encourages selection of the important covariates within each terminal group (terminal node).
When each terminal node contains a single covariate, plugging the above reparameterization into (3.4) leads back to (2.7) (with adjusted tuning parameters).
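The reparameterization (3.1)–(3.3) under the constraint (3.2) can be sketched as follows; this assumes the reading γ_k = ||X_k β_k||_2/√n of (3.3), and `group_reparam` is a hypothetical helper.

```python
import numpy as np

def group_reparam(Xk, beta_k):
    """(3.3), as read here: gamma_k = ||X_k beta_k||_2 / sqrt(n) and
    theta_k = beta_k / gamma_k, so that beta_k = gamma_k * theta_k
    and ||X_k theta_k||_2^2 = n (the identifiability constraint (3.2))."""
    n = Xk.shape[0]
    gamma = np.linalg.norm(Xk @ beta_k) / np.sqrt(n)
    if gamma == 0.0:
        return 0.0, np.zeros_like(beta_k)
    return gamma, beta_k / gamma
```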

Hierarchical grouping property
Denote by x̄_k = Σ_{j=1}^{p_k} θ̂_kj x_kj, for k = 1, . . . , K, where the θ̂_kj are the elements of θ̂. The definition (3.4) and the constraint (3.2) imply that ||x̄_k||_2^2 = n for {k : γ̂_k > 0}. We may view x̄_k as an "overall predictor vector" that represents the terminal group G_k. We define a correlation of the terminal groups (terminal nodes) G_k and G_k′ as φ_kk′ = cor(x̄_k, x̄_k′) = Σ_{j=1}^{p_k} Σ_{j′=1}^{p_k′} θ̂_kj θ̂_k′j′ cor(x_kj, x_k′j′), a weighted average of the correlation coefficients of all pairs of individual predictors between nodes G_k and G_k′. By treating the x̄_k's as the x_i's in (2.7), we provide below in Theorem 3.1 a hierarchical grouping property for the terminal groups (terminal nodes). A proof of Theorem 3.1 is given in the appendix.
Theorem 3.1. Suppose γ̂_k γ̂_k′ > 0 and that the predictors are orthogonal within each group, i.e., x_kj^T x_kj′ = 0 for j ≠ j′. Then the term ||s_k x̄_k − s_k′ x̄_k′||_2 / √n is bounded by s_(K) √(2(1 − ϕ_kk′ φ_kk′)), and

|γ̂_k − γ̂_k′| ≤ (√n ||y||_2 / (λ_1 δ)) (√(s_k^2 + s_k′^2) / (s_k s_k′)) √(1 − ϕ_kk′ φ_kk′) + (λ_2 / (2λ_1 δ)) (√p_k + √p_k′).

Here, s_(K) = max_{1≤k≤K} s_k and ϕ_kk′ = 2 s_k s_k′ / (s_k^2 + s_k′^2), for any 1 ≤ k, k′ ≤ K. Comparing with Theorem 2.2, the additional term in the inequality of Theorem 3.1 comes from the derivative of the Lasso penalty on the individual-level coefficients. This additional term is dominated by the term governing the grouping of terminal nodes when λ_2 max_{1≤k≤K} √p_k = o(n).
Thus, asymptotically, we have a similar statement of the hierarchical grouping property as in the simplified case discussed in Section 2. In particular, if two terminal groups, say k and k′, are highly correlated (i.e., φ_kk′ ≈ 1) and are also close in the hierarchical tree (i.e., ϕ_kk′ ≈ 1), then we have |γ̂_k − γ̂_k′| ≈ 0. Using the terminology of Definition 2.2, the terminal nodes G_k and G_k′ are grouped.
In the simple case with only one covariate for each terminal node, Theorem 3.1 reduces back to Theorem 2.2. So it is a generalization of Theorem 2.2.
Theorem 3.1 only concerns γ̂_k and γ̂_k′. We can also compare β̂_k with β̂_k′, where β̂_a (a = k or k′) is the p_a-vector of coefficients in terminal group a. A proof of Theorem 3.2 is given in the appendix.

Theorem 3.2. Let β̂^(1)_k be an r × 1 subvector of β̂_k and β̂^(1)_k′ be an r × 1 subvector of β̂_k′, both of which contain r nonzero coefficients, with r ≤ min{q_k, q_k′}. Let X^(1)_k and X^(1)_k′ be the corresponding submatrices of X_k and X_k′ containing the columns associated with the nonzero coefficients, where X_k and X_k′ are the submatrices of X containing the predictors in the terminal groups k and k′, respectively. Suppose also that sgn(β̂^(1)_k) = sgn(β̂^(1)_k′). Then ||β̂^(1)_k − β̂^(1)_k′||_2 admits an upper bound analogous to that of Theorem 3.1, where ||·||_F for a matrix denotes its Frobenius norm.

Model selection consistency
To show that the GHEnet approach also has model selection consistency, we express (3.4) in terms of β. By plugging (3.3) into (3.4), we have

β̂ = argmin_β ||y − Xβ||_2^2 + λ_1 Σ_{k=1}^K s_k (||X_k β_k||_2 / √n + (δ/n) ||X_k β_k||_2^2) + λ_2 Σ_{k=1}^K s_k ||β_k||_1, (3.5)

where β_k = (β_k1, . . . , β_kp_k)^T. Let us denote the true regression coefficients by β⁰_kj = γ⁰_k θ⁰_kj for k = 1, . . . , K, j = 1, . . . , p_k. Also, the true parameter values satisfy γ⁰_k > 0 for k = 1, . . . , r and γ⁰_k = 0 for k = r + 1, . . . , K. We note that this proposal is reminiscent of the sparse group lasso, with the important differences that we take the hierarchical structure information into account and that there is an extra ridge-type penalty arising from the elastic-net formulation.
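Under the same reading of (3.3), the β-form criterion can be evaluated directly; the display form of (3.5) is assumed here, and `ghenet_objective` is a hypothetical helper.

```python
import numpy as np

def ghenet_objective(beta_groups, y, X_groups, s, lam1, lam2, delta):
    """beta-form GHEnet criterion (assumed form of (3.5)): the group-lasso-type
    terms ||X_k beta_k||_2 / sqrt(n) play the role of gamma_k from (3.3)."""
    n = len(y)
    fit = y - sum(Xk @ bk for Xk, bk in zip(X_groups, beta_groups))
    obj = fit @ fit
    for Xk, bk, sk in zip(X_groups, beta_groups, s):
        gam = np.linalg.norm(Xk @ bk) / np.sqrt(n)
        obj += lam1 * sk * (gam + delta * gam ** 2) + lam2 * sk * np.abs(bk).sum()
    return obj
```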
Define A = {(k, j) : β⁰_kj ≠ 0}, the index set of all important covariates; B_1 = {(k, j) : γ⁰_k > 0, β⁰_kj = 0}, the index set of non-important covariates within important terminal groups; and B_2 = {(k, j) : γ⁰_k = 0}, the index set of covariates in non-important terminal groups. Set B = B_1 ∪ B_2, which is the index set of all non-important covariates. Furthermore, corresponding to the set A, we denote the |A| × |A| diagonal score matrix S_A = diag{s_k : (k, j) ∈ A}, the n × |A| design matrix X_A = (x_kj, (k, j) ∈ A), and the vector of nonzero coefficients β_A = (β_kj, (k, j) ∈ A)^T. Similarly, corresponding to the set B, we have S_B = diag{s_k : (k, j) ∈ B} and X_B = (x_kj, (k, j) ∈ B). Finally, we denote Σ_AA = X_A^T X_A / n and Σ_BA = X_B^T X_A / n; similar notations are used when B is replaced by B_1 or B_2.
We propose below a Generalized Hierarchical Elastic Irrepresentable Condition (GHEIC), an extension of the HEIC discussed in Section 2.3: GHEIC. There exist positive constants η and η′ with η′ < η such that conditions (3.6) and (3.7) hold, where Σ̄ = Σ_AA + (λ_1 δ / n) S_A^{−1}. The second condition (3.7) in GHEIC mimics the HEIC condition by imposing a regularization constraint on the hierarchical-score-weighted regression coefficients of X_B on X_A. The first condition (3.6) in GHEIC is introduced because of the L_2-norm term in the penalty function (3.5). Also, in the GHEIC, we only require Σ̄ to be invertible instead of Σ_AA. We further note that (3.6) can be trivially satisfied if we choose λ_2 large enough, although λ_2 must also satisfy condition (a) in Theorem 3.3 below. A simple sufficient condition for both (3.6) and (3.7) is available. Theorem 3.3 below establishes the covariate selection consistency of the GHEnet estimator. A proof of the theorem is given in the appendix.

Computational algorithm
In this section, we focus on computational issues and modify two existing computing algorithms in the literature to obtain the HEnet and GHEnet estimators proposed in (2.7) and (3.4). Algorithm A below modifies the conventional Enet algorithm of Zou and Hastie (2005) to obtain the HEnet estimator. Algorithm B follows Wang et al (2009) and provides an algorithm for the GHEnet estimator. Since the hierarchical scores proposed in (2.6) are bounded and positive, the algorithms can be justified following simple algebra and the existing literature (i.e., Zou and Hastie (2005) and Wang et al (2009)); thus the proofs are omitted here.
We have the following two algorithms for a given data set (y, X) and known hierarchical scores (s_1, . . . , s_p).

Algorithm A (HEnet estimation)
Step 1: Define an artificial data set (y*, X*) by

y* = (y^T, 0_p^T)^T, X* = (X^T, (λ_1 δ)^{1/2} S^{1/2})^T,

where S = diag(s_1, . . . , s_p).
Step 2: Solve the weighted lasso problem β̂ = argmin_β ||y* − X*β||_2^2 + λ_1 Σ_{j=1}^p s_j |β_j|, whose solution is the HEnet estimator (2.7).
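A minimal sketch of Algorithm A follows; it assumes the augmentation y* = (y^T, 0^T)^T, X* = (X^T, (λ_1δ)^{1/2}S^{1/2})^T and reduces the weighted l_1 penalty to a plain lasso by the rescaling β̃_j = s_j β_j. The coordinate-descent solver and all helper names are ours.

```python
import numpy as np

def soft(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(y, X, lam, iters=200):
    """Coordinate descent for ||y - X b||_2^2 + lam * ||b||_1."""
    p = X.shape[1]
    b = np.zeros(p)
    col2 = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # partial residual
            b[j] = soft(X[:, j] @ r, lam / 2.0) / col2[j]
    return b

def henet(y, X, s, lam1, delta):
    """Algorithm A sketch: absorb the ridge part lam1*delta*sum s_j*b_j^2 into
    augmented data, then solve the weighted lasso via column rescaling."""
    p = X.shape[1]
    X_star = np.vstack([X, np.diag(np.sqrt(lam1 * delta * s))])
    y_star = np.concatenate([y, np.zeros(p)])
    X_tilde = X_star / s               # with b_tilde = S b, X_tilde b_tilde = X_star b
    return lasso_cd(y_star, X_tilde, lam1) / s
```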

Algorithm B (GHEnet estimation)
Step 1: Obtain an initial value γ^(0)_k for each γ_k; for example, γ^(0)_k = 1 for all k. Set m = 1.
Step 2: (a) Given γ^(m−1), minimize (3.4) over θ subject to the constraint (3.2), obtaining θ^(m); set θ^(m)_kj = 0 for all j whenever γ^(m−1)_k = 0. (b) Given θ^(m), minimize (3.4) over γ, obtaining γ^(m).
Step 3: Set m = m + 1, and repeat Steps 2 and 3 until the algorithm converges.
Note that setting θ^(m)_kj = 0 in Step 2(a) for {k : γ^(m−1)_k = 0} will not cause any problem for the minimizations, because the objective function is independent of θ_kj for such k. The two minimizations in Steps 2(a) and 2(b) of Algorithm B can be solved using a constrained quadratic programming algorithm. In addition, the values of the tuning parameters λ = (λ_1, δ) in Algorithm A or (λ_1, δ, λ_2) in Algorithm B are critical, and the results are certainly sensitive to their choice. In practice, they can be chosen via the AIC criterion AIC(λ) = log(||y − Xβ̂(λ)||_2^2 / n) + 2||β̂(λ)||_0 / n, or the BIC criterion BIC(λ) = log(||y − Xβ̂(λ)||_2^2 / n) + log(n)||β̂(λ)||_0 / n. Here, ||β̂||_0 is the number of non-zero β̂'s. In our numerical studies in the next section, the tuning parameters are chosen using the BIC criterion, which appears to perform satisfactorily.
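The BIC criterion above can be sketched as follows; `bic` is a hypothetical helper, and in practice β̂(λ) would come from Algorithm A or B evaluated over a grid of λ.

```python
import numpy as np

def bic(y, X, beta_hat):
    """BIC(lambda) = log(||y - X beta||_2^2 / n) + log(n) * ||beta||_0 / n."""
    n = len(y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    return np.log(rss / n) + np.log(n) * np.count_nonzero(beta_hat) / n

# Toy comparison: a sparser but badly fitting estimate loses to a better fitting one
X = np.eye(4)
y = np.array([1.0, 1.0, 0.0, 0.0])
bic_underfit = bic(y, X, np.array([1.0, 0.0, 0.0, 0.0]))
bic_good = bic(y, X, np.array([0.9, 0.9, 0.0, 0.0]))
```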
In the above algorithms, we assume that the hierarchical scores s_i are provided. In our numerical examples, the hierarchical trees are given and we use (2.6) to calculate s_i with τ = 3 and a given α. As discussed in Section 2.2, the parameter α can be viewed as a tuning parameter controlling the degree to which the hierarchical structure is allowed to impact our estimation. It may be selected based on empirical studies; an illustrative example is provided in the numerical studies section below.

Simulation studies
Two simulation examples are carried out to evaluate the performance of the proposed methods. In Example 1, we consider a small-scale dataset to test the HEnet estimator proposed in Section 2, in which each terminal node contains only one covariate. In Example 2, we investigate the performance of the GHEnet estimator proposed in Section 3 on a relatively large dataset; this second example mimics the motivating Systemic Lupus Erythematosus (SLE) data example, in which each terminal node contains multiple covariates. In both examples, several simulation settings are considered to cover different scenarios and α values. The number of repetitions in each simulation setting is 1000.

Example 1
The data in this example consist of n = 20 samples with p = 6 covariates. The true parameter is β = (0, 0, 0, 3, 3, 3)ᵀ and σ = 1, so the true model is y = 3x₄ + 3x₅ + 3x₆ + ε. We consider two scenarios with different hierarchical and covariance structures.

Scenario 1. The covariates have the balanced hierarchical structure in Figure 2(a). Each row of the design matrix X is generated from a multivariate normal distribution with Cov(x_i, x_j) = 0.9^{1{i≠j}} for i, j ∈ {1, 2, 3} and for i, j ∈ {4, 5, 6}; any other pair (i, j) is uncorrelated.

Scenario 2. The covariates have the unbalanced hierarchical structure in Figure 2(b). Each row of the design matrix X is generated from a multivariate normal distribution with Cov(x_i, x_j) = 0.9^{1{i≠j}}.
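Under our reading of Scenario 1 (correlation 0.9 within each of the two blocks {x₁, x₂, x₃} and {x₄, x₅, x₆}, zero between blocks), one replicate of the simulated data can be generated as follows; the seed and zero between-block correlation are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 6

# Block covariance: correlation 0.9 within {x1,x2,x3} and {x4,x5,x6},
# zero between the two blocks (our reading of Scenario 1).
Sigma = np.zeros((p, p))
for block in [range(0, 3), range(3, 6)]:
    for i in block:
        for j in block:
            Sigma[i, j] = 1.0 if i == j else 0.9

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.array([0, 0, 0, 3, 3, 3], dtype=float)
y = X @ beta + rng.standard_normal(n)  # sigma = 1
```

Scenario 2 differs only in that Sigma has 0.9 in every off-diagonal entry.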
In our proposed approach, the tuning parameter α controls how much impact the hierarchical tree has on the model selection results. When α = 0, s_i ≡ 1; in this case the tree has no impact and HEnet reduces to the conventional Enet. Note that the scale weight defined in (2.4), φ_jk = 2 s_j s_k / (s_j² + s_k²), is a decreasing function of α. In addition, φ_jk = 1 for α = 0, and lim_{α→∞} φ_jk = 0 for any pair (x_j, x_k) with different ancestors. We plot φ_jk against α for all pairs (j, k) in Figure 3. For each plot, the upper shade corresponds to the 75% or higher quantiles of all φ_jk and the lower shade to the 50% or lower quantiles, at each fixed α over the range 0 ≤ α ≤ 40. We use the plots to assist us in choosing the tuning value α empirically. For Scenario 1, we first pick α = 15 (marked with a vertical red line), which gives the largest range of φ_jk while keeping the smallest φ_jk away from 0. We then pick α = 8 (marked with the second vertical red line) to test the impact of a different range of φ_jk on the model selection. Similarly, we choose α ∈ {5, 10} for Scenario 2. We also include the case α = 0.1 in both scenarios to illustrate that, for small α, the hierarchy contributes almost nothing to the model selection result.
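The behavior of the scale weight φ_jk is easy to verify numerically. The sketch below implements the formula from (2.4) and, assuming for illustration that the scores enter through a power transformation s = s₀^α (consistent with the α > 0 discussion in the paper), checks that equal scores give φ = 1 and that φ decreases as α grows:

```python
import numpy as np

def phi(s_j, s_k):
    """Scale weight from (2.4): phi_jk = 2*s_j*s_k / (s_j^2 + s_k^2)."""
    return 2.0 * s_j * s_k / (s_j ** 2 + s_k ** 2)

# Illustrative base scores for two terminal nodes with different ancestors;
# the specific values 0.5 and 0.8 are assumptions of this sketch.
def phi_at_alpha(alpha, s0_j=0.5, s0_k=0.8):
    return phi(s0_j ** alpha, s0_k ** alpha)
```

By the AM–GM inequality, φ_jk always lies in (0, 1], with equality exactly when s_j = s_k, matching the α = 0 case where all scores equal 1.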
We benchmark our proposed HEnet estimators, with several α values, against three existing estimators: Enet, Lasso and OSCAR. The numerical results are summarized in Table 1. The notation PS_ij denotes the frequency with which x_i and x_j are selected together, and PD_ij the frequency with which x_i and x_j are dropped from the model together. For Scenario 1, PS_45, PS_46 and PS_56 are very close under Enet, Lasso, OSCAR and HEnet with α = 0.1, because the pairwise correlations are the same among the important covariates and (almost) no hierarchical structure is considered. When α increases to 8 and 15, the hierarchy impacts the model selection result. Specifically, in Figure 2(a), among the (3 choose 2) = 3 pairs of the three important covariates {x₄, x₅, x₆}, the pair {x₅, x₆} is closer than the other two. This structure leads to a higher value of PS_56 than PS_45 and PS_46; that is, {x₅, x₆} is more likely to be selected together than the other two pairs of important covariates. Similar results are found in PD_12, PD_13 and PD_23 with α ∈ {8, 15} for pairs of unimportant covariates; in particular, {x₁, x₂} is more likely to be dropped together than the other two pairs, because these two covariates are closer in the hierarchical structure.
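The co-selection frequencies PS_ij and PD_ij in Table 1 can be computed from the replicate-level selection results with a couple of matrix products. A minimal sketch, assuming the per-replicate selections are stored as a boolean matrix:

```python
import numpy as np

def coselection_freqs(selected):
    """Joint selection/drop frequencies across simulation replicates.

    selected: boolean array of shape (n_reps, p), one row per replicate,
    True where a covariate entered the fitted model.
    Returns (PS, PD): p-by-p matrices where PS[i, j] is the fraction of
    replicates selecting both x_i and x_j, and PD[i, j] the fraction
    dropping both, as tabulated in Table 1.
    """
    sel = selected.astype(float)
    n_reps = sel.shape[0]
    PS = sel.T @ sel / n_reps             # both selected
    PD = (1 - sel).T @ (1 - sel) / n_reps  # both dropped
    return PS, PD
```

The diagonal of PS gives each covariate's marginal selection frequency, a useful sanity check alongside the pairwise entries.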
For Scenario 2, we observe similar results. PS_45, PS_46 and PS_56 are very close under Enet, Lasso, OSCAR and HEnet with α = 0.1. On the hierarchical tree in Figure 2(b), the pair {x₅, x₆} is closer than the other two pairs of important covariates, so PS_56 is higher than PS_45 and PS_46 with α = 5 or 10. Among the unimportant covariates, {x₂, x₃} is closer than the other two pairs and is thus more likely to be dropped together. Overall, the above results clearly illustrate the hierarchical grouping property of our proposed estimator.
We use two measures, sensitivity and specificity, to evaluate the covariate selection performance. Sensitivity is the proportion of the important covariates that are selected, and specificity is the proportion of the unimportant covariates that are excluded. From Table 1, HEnet shows no significant advantage in Scenario 1 in terms of sensitivity and specificity, while HEnet with α = 5 or 10 outperforms the other methods in Scenario 2.
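These two measures reduce to simple counts over the true and selected support sets. A minimal sketch using the standard definitions:

```python
import numpy as np

def sens_spec(selected, important):
    """Covariate selection sensitivity and specificity.

    selected, important: boolean arrays over the p covariates.
    Sensitivity: fraction of important covariates that were selected.
    Specificity: fraction of unimportant covariates that were excluded.
    """
    sens = np.sum(selected & important) / np.sum(important)
    spec = np.sum(~selected & ~important) / np.sum(~important)
    return sens, spec
```

In Example 1 the "important" mask is simply β ≠ 0, i.e., covariates x₄, x₅, x₆.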

Example 2
The second example consists of n = 50 samples with p = 4000 covariates, a size similar to the SLE dataset. The covariates are distributed among 20 terminal groups (G₁, . . . , G₂₀) with 200 covariates in each group. The true parameters are β_{1,j} = β_{2,j} = β_{3,j} = 2 for j = 1, . . . , 10, and all other β_{k,j} = 0; also, σ = 0.25. Thus the true model is y = 2 Σ_{k=1}^{3} Σ_{j=1}^{10} x_{kj} + ε. Each row of the design matrix X is generated from a multivariate normal distribution with covariance Cov(x_{ki}, x_{k'j}) = 0.7^{1{i≠j}} · 0.9^{1{k≠k'}}, for k, k' = 1, . . . , 20 and i, j = 1, . . . , 200. We assume the hierarchical structure on top of the 20 terminal groups shown in Figure 4. We still denote by PS_ij and PD_ij the frequencies with which terminal groups G_i and G_j are selected or dropped together, respectively; here a terminal group is selected if and only if at least one covariate in the group is selected. To demonstrate the hierarchical grouping property of our proposed GHEnet estimator, we calculate PS_ij and PD_ij for the important group pairs {i, j} = {1, 2}, {1, 3}, {2, 3} and for the pairs among the unimportant groups {4, . . . , 20}. We benchmark our proposed estimator with α ∈ {0, 0.1, 0.2, 0.3, 0.4, 1, 20} against Group Lasso and Sparse-Group Lasso (SGL), where the covariates are pre-grouped into the same terminal groups (as a set of parallel groups) with the hierarchical information ignored. The choice of α is based on an empirical study similar to that in Example 1 (the details are omitted to avoid repetition). We use GEnet to denote GHEnet with α = 0, i.e., with no hierarchical structure considered. Note that GEnet is also similar to SGL, with an added advantage in dealing with strongly correlated covariates attributable to the ridge penalty.
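The covariance in this example factorizes over groups and within-group indices, so it can be built as a Kronecker product. The sketch below uses small dimensions (K = 4 groups of m = 5 covariates) so it runs quickly; the paper's setting is K = 20, m = 200, where the same construction yields a 4000 × 4000 matrix:

```python
import numpy as np

K, m = 4, 5  # illustrative; the paper uses 20 terminal groups of 200 covariates

# A[k, k'] = 0.9 ** 1{k != k'}  (between-group factor)
A = np.where(np.eye(K, dtype=bool), 1.0, 0.9)
# B[i, j]  = 0.7 ** 1{i != j}   (within-group factor)
B = np.where(np.eye(m, dtype=bool), 1.0, 0.7)

# Sigma[(k, i), (k', j)] = 0.9**1{k != k'} * 0.7**1{i != j}
Sigma = np.kron(A, B)
```

Covariates are ordered group by group, so index k*m + i corresponds to the i-th covariate of group G_{k+1}.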
From Table 2, PS_12, PS_13 and PS_23 are not materially different under GEnet, Group Lasso and SGL, because the between-group correlations are the same among the important groups and no hierarchical structure is considered. When α increases, the fact that {G₁, G₃} is closer than the other two pairs ({G₁, G₂} and {G₂, G₃}) in Figure 4 leads to a higher value of PS_13 than PS_12 and PS_23. This indicates that {G₁, G₃} is more likely to be selected together than the other two pairs of important groups. Similar results are found for the unimportant group pairs PS_ij and PD_ij with {i, j} ⊂ {4, . . . , 20} (due to space limitations, these results are not reported in Table 2 but are available upon request to the first author). These results demonstrate that the proposed GHEnet estimator indeed has the hierarchical grouping property. We also report in Table 2 the sensitivity and specificity at both the terminal-group and individual levels to assess model selection performance. Both GHEnet and Group Lasso perform well, with the terminal-group specificity reaching 100%. GHEnet outperforms Group Lasso in covariate selection specificity, since Group Lasso tends to select a larger model and cannot perform selection within given groups. GHEnet performs slightly better than SGL in terms of group and covariate specificity. The terminal-group sensitivity of GHEnet is better than those of Group Lasso and SGL for small α values. For relatively large α, however, the terminal-group sensitivity of GHEnet is relatively small, as expected, since the hierarchical tree structure is used in, and impacts, the selection process. The covariate selection sensitivity is also relatively low for GHEnet, because GHEnet performs model selection at the covariate level and its performance deteriorates when the correlation between the covariates is high. This behavior is consistent with the HEnet example above.
On the other hand, Group Lasso and SGL either select all the covariates or a large number of covariates within the selected groups, resulting in high covariate sensitivity. In addition, and as one might anticipate, incorporating group structure is generally detrimental to covariate-level sensitivity, and incorporating hierarchical information is generally detrimental to overall group sensitivity. GEnet, which ignores the hierarchical structure, works well in this example in terms of variable selection; group 2 is far from groups 1 and 3 on the tree, and thus with large α group 2 tends not to be selected together with the other two groups.
Example 2 used groups of equal sizes for illustration. Based on our observations from a number of cases (including the real data example), the performance under unbalanced group sizes appears to be similar, as long as the group sizes are not extremely unbalanced. In the interest of space, an example with unbalanced group sizes is omitted from the simulation study.
In summary, the above simulation studies in both examples have demonstrated the hierarchical grouping property and model selection consistency of our proposed estimators.

Analysis of SLE dataset in PBMC study
In the blood genomic studies reported in Chaussabel et al. (2008), PBMC samples were obtained from n = 47 individuals with Systemic Lupus Erythematosus (SLE). Transcriptional profiles were generated with Affymetrix U133A and U133B GeneChips (> 44000 probe sets). The gene intensity signal was assessed and normalized using Microarray Suite, Version 5.0, for each probe set, and a logarithmic transformation was then applied to the gene intensity levels. Among these 44000+ transcripts, the 4779 considered "present" were selected as the input to the module-construction algorithm, and a total of 28 modules (terminal groups) were formed. Within each module, the transcripts are coordinately expressed, i.e., highly correlated, and usually have similar functions. A hierarchical clustering algorithm with "complete" linkage was applied to the correlation structure among these 28 modules, and the resulting hierarchical tree is shown in Figure 1. On average, each terminal node represents about 170 transcripts. The goal is to identify the modules, and the transcripts within these modules, that are potentially related to the individual's disease index, the SLE disease-activity index (SLEDAI). Our approach is regression analysis under the framework of Section 3: we treat the SLEDAI as the response variable and all the transcripts as predictors. The interaction between different modules can be captured by the hierarchical clustering. We deploy our proposed GHEnet procedure to perform gene selection at both the module and individual levels. Unlike in the simulation studies, we do not know which modules or transcripts are truly important in this dataset, so we cannot report PS_ij, PD_ij, sensitivity or specificity for our model selection. To examine the impact of the hierarchical scores, we perform a sensitivity analysis over different α values. Table 3 Part I summarizes the selected modules and numbers of selected genes for α ∈ {0, 0.01, 0.1, 5, 10}.
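The module-level tree in Figure 1 comes from complete-linkage clustering of the module correlation structure. The sketch below shows that step with scipy on hypothetical stand-in data (random module summaries of shape 47 × 28; the real module values are not available in this excerpt), using the common 1 − correlation distance as an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# hypothetical stand-in for the 28 module-level expression summaries
modules = rng.standard_normal((47, 28))

corr = np.corrcoef(modules, rowvar=False)   # 28 x 28 module correlations
dist = 1.0 - corr                            # correlation-based distance
np.fill_diagonal(dist, 0.0)

# complete linkage, as used for the tree in Figure 1
Z = linkage(squareform(dist, checks=False), method="complete")
labels = fcluster(Z, t=4, criterion="maxclust")  # e.g. cut into 4 clusters
```

The linkage matrix Z encodes the full tree, from which the hierarchical scores of Section 2.2 would be derived.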
Table 3 Part II lists the functionality of these modules. The choice of α is based on an empirical study similar to the one that produced Figure 3 in Example 1. From Table 3 Part I, we can see how the identified modules evolve as α increases. The module identification results are the same for α = 0 and α = 0.01. As α increases, the hierarchical structure starts to impact the module selection results as expected: modules #7, #20 and #12 are dropped sequentially. This detailed sequential information can help medical researchers better design confirmatory experiments.
We would like to comment that this example is used to demonstrate how a hierarchical structure can be taken into account in a real data analysis and how it can impact the analysis outcomes. The actual impact and implications depend on whether the hierarchical structure is informative with respect to the underlying true covariate structure. Without input from experts with domain knowledge, it is difficult to interpret the biological meanings and implications of the analytic results.

Concluding remarks
In this paper, we develop a new methodology to incorporate hierarchical structures among the covariates in model selection problems. We construct hierarchical scores from a known hierarchical tree and use them as weights on the penalty function in the Elastic net approach. The resulting estimator is proved to have a desirable hierarchical grouping property while providing consistent model selection. Although we present our idea through the Elastic net penalty, the construction of the hierarchical scores is independent of the choice of penalty function; we believe the scores can be combined with other penalty functions that encourage grouping, for example OSCAR and related procedures.
There are a few questions worth further investigation regarding the construction of hierarchical trees and the scoring weights. First, we have found one specific way of constructing hierarchical scores s_i that possess the desired properties. It is conceivable that other weights with the same properties exist, but the s_i constructed in Section 2.2 are the only set we have found so far that is both easy to construct and intuitive. Second, the internal nodes are assumed to have distinct depths throughout the development. Sometimes some of these depths are tied with equal values, in which case our scoring method may not directly apply. For example, in a full binary tree in which the ancestors of all terminal nodes have the same depths, the present method would assign the same score to all terminal nodes. These equal weights do not reflect that some pairs of terminal nodes are closer than others, which may be viewed as undesirable depending on the application. To overcome this problem, one may perturb the tied depth values to reflect the hierarchy of closeness before applying our method; it is also of interest to investigate whether an appropriate scoring method can be found without altering these tied values. Third, we treat α as a tuning parameter because the proposed methodology is not intended to improve model selection performance per se, but rather to provide a practical way to incorporate hierarchical information. The parameter α controls the impact of the hierarchical information on the model selection outcome. Users of this approach are recommended to try different levels of α to produce sequential information, which can help better design, for example, confirmatory experiments for medical researchers. On the other hand, in some situations it may be desirable to have a data-driven way of choosing α, which we leave as future work.
Fourth, we assumed the hierarchical tree is given and fixed, while in practice it is typically constructed from data and thus random; how to take into account the uncertainty in the tree structure is an important but challenging problem. Finally, the theoretical results in this paper focus only on model selection consistency. In some applications, the identification of important covariates is perhaps more important than the estimation of the coefficient values. One could also study the convergence rate (or even the minimax rate) of the estimators, for which sparse eigenvalue conditions may be needed.
By the definition of the binary vector v, if x_j and x_k have the same ancestors, then v_j = v_k, and it follows immediately that s_j = s_k. Conversely, when τ ≥ 3, there is a one-to-one correspondence between the score s_i and its binary vector representation v_i; thus, if s_j = s_k, then v_j = v_k, and the predictors x_j and x_k have the same ancestors. This proves that Condition C1 holds for the set of scores s_i. Now, without loss of generality, assume that the ancestor set of two predictors, say x_i and x_k, is a subset of the ancestor set of the predictors x_i and x_j. There exist two integers L₁ and L₂ with 1 ≤ L₂ < L₁ such that the set {l | L₂ < l ≤ L₁ and v_i(l) ≠ v_k(l)} is not empty.
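The one-to-one correspondence invoked above can be illustrated concretely. The score definition (2.6) is not reproduced in this excerpt, so the encoding below, which reads the binary ancestor vector as the digits of a base-τ integer, is only a hypothetical stand-in that shares the injectivity property (distinct digit strings give distinct values; the role of τ ≥ 3, rather than τ ≥ 2, is in the gap condition C2, not in injectivity itself):

```python
from itertools import product

def score(v, tau=3):
    """Hypothetical base-tau encoding of a binary ancestor vector v.

    An illustrative stand-in for the paper's score (2.6): distinct
    binary vectors of a common length map to distinct values.
    """
    s = 0
    for bit in v:
        s = s * tau + bit  # append one base-tau digit
    return s

# every binary vector of length 8 receives a distinct score
vectors = list(product([0, 1], repeat=8))
scores = {score(v) for v in vectors}
```

Equal scores therefore force equal vectors, which is exactly the direction of Condition C1 that needs the correspondence.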
Since v_i(l) = v_j(l) for l = 1, . . . , L₁, we obtain the first inequality. Also, denote L* = inf{L₂ < l ≤ L₁ | v_i(l) ≠ v_k(l)}, so that v_i(l) = v_k(l) for l = 1, . . . , L* − 1. One can show that, when τ ≥ 3, the required bound holds, and hence Condition C2 holds for the set of s_i when τ ≥ 3. Finally, for any α > 0, the new score s_i is simply a power transformation of the score above; because the power function is monotone and the scores are positive, Conditions C1 and C2 continue to hold for the new scores with any α > 0.
So, by the Chebyshev inequality and Condition (a), we obtain the required bound. Since the limit has a Gaussian distribution centered at 0, there is a non-vanishing probability that the first element is positive. Therefore, inequality (A.6) fails with positive probability, which contradicts the sign consistency assumption. This completes the proof.
It follows that the first inequality in Theorem 3.1 holds. The second inequality follows from ‖θ_k‖₁ ≤ √p_k ‖θ_k‖₂ = √p_k. The third inequality follows from simple algebra and the fact that ‖y‖₂ = O(√n).