Publicly Available Published by De Gruyter July 22, 2017

Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study

  • Haixiang Zhang, Yinan Zheng, Grace Yoon, Zhou Zhang, Tao Gao, Brian Joyce, Wei Zhang, Joel Schwartz, Pantel Vokonas, Elena Colicino, Andrea Baccarelli, Lifang Hou and Lei Liu

Abstract

In this article, we consider variable selection for correlated high dimensional DNA methylation markers as multivariate outcomes. A novel weighted square-root LASSO procedure is proposed to estimate the regression coefficient matrix. A key feature of this method is tuning-insensitivity, which greatly simplifies the computation by obviating cross validation for penalty parameter selection. A precision matrix obtained via the constrained ℓ1 minimization method is used to account for the within-subject correlation among multivariate outcomes. Oracle inequalities of the regularized estimators are derived. The performance of our proposed method is illustrated via extensive simulation studies. We apply our method to study the relation between smoking and high dimensional DNA methylation markers in the Normative Aging Study (NAS).

1 Introduction

With the development of modern technology for data collection, high-dimensional data have become increasingly common in many scientific research fields, e.g. genome-wide studies (Lin et al., 2015), biomedical sciences (Mukherjee et al., 2015), and economics and finance (Basu and Michailidis, 2015). In these situations, the number of parameters is larger than the sample size, rendering traditional statistical procedures inappropriate. More recently, correlated data with high dimensional multivariate responses are often encountered in omics studies. Our motivating example is the Normative Aging Study, where methylation markers are taken as the multivariate outcomes. The methylation of DNA, where methyl groups are added to DNA at binding sites typically referred to as cytosine-phosphate-guanine (CpG) islands, can affect gene expression. DNA methylation levels measured from probes close to one another are correlated (Moen et al., 2013), resulting in high dimensional multivariate outcomes. Correlation may also exist for DNA methylation markers having similar functions, e.g. those related to an exposure such as smoking. It is thus necessary to account for the within-subject correlation in the estimation procedure.

Our objective is to conduct selection of regression coefficients in high dimensional multivariate DNA methylation markers. There are two challenging issues for such a study: (1) how to conduct variable selection in the high dimensional setting; and (2) how to tackle the correlation among multivariate outcomes. For the first challenge, many studies have focused on penalized methods, such as the least absolute shrinkage and selection operator (LASSO, Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD, Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), the adaptive LASSO (Zou, 2006), and the minimax concave penalty (MCP, Zhang, 2010). The penalized approach has been applied to many research topics, e.g. linear models (Wang and Leng, 2007; Huang et al., 2011; Fan and Lv, 2014), generalized linear models (van de Geer, 2008; Jiang et al., 2016), survival models (Fan and Li, 2002; Bradic et al., 2011; Lin and Lv, 2013). Recently, Belloni et al. (2011) proposed a pivotal square-root LASSO method, which does not rely on the knowledge of the standard deviation for the error term. Later, Belloni et al. (2014) developed a self-tuning square-root LASSO method in high-dimensional nonparametric regression analysis. Liu and Wang (2017) proposed a new procedure for optimally estimating high dimensional Gaussian graphical models using the square-root LASSO. For more topics on variable selection, please refer to Bühlmann and van de Geer (2011).

There is limited research tackling the second challenge, especially when the responses are high-dimensional. Rothman et al. (2010) proposed an iterative algorithm for variable selection and estimation in high-dimensional multivariate regression using the LASSO penalty. Sofer et al. (2014) considered variable selection for high-dimensional multivariate regression using the penalized likelihood method, but the dimensionality of the multiple responses is still much smaller than the sample size. Liu et al. (2015) proposed a calibrated multivariate regression method for high-dimensional multivariate regression models, but they only considered the uncorrelated error structure. Li et al. (2015b) and Wilms and Croux (2017) studied the group LASSO for high-dimensional multivariate linear regression models. Most of these existing methods use cross-validation to choose tuning parameters over a full regularization path, which is computationally expensive and may waste valuable training data. To deal with this problem, we extend Belloni et al. (2011)'s (unweighted) square-root LASSO for a single outcome to a weighted square-root LASSO for multivariate outcomes. The main advantage of our procedure over existing methods is its tuning-insensitive property, which makes it significantly faster than cross-validation-based tuning. Another advantage is that we can use the entire dataset for variable selection, which may potentially yield a better model (Bishop et al., 2003).

The rest of the paper is organized as follows. In Section 2, we introduce the model and the weighted square-root LASSO procedure for multivariate linear regression with high-dimensional responses. In Section 3, we establish an error bound for the proposed estimator. In Section 4, we develop an efficient algorithm and conduct Monte Carlo simulations to assess the performance of our method. An empirical analysis of DNA methylation in the Normative Aging Study is presented in Section 5. Some concluding remarks are given in Section 6. All technical proofs are relegated to the Appendix.

2 Model and estimation

Consider the sparse, high-dimensional multivariate linear regression model

(1) $Y_i = BX_i + \epsilon_i, \quad i = 1, \cdots, n,$

where $Y_i=(Y_{i1},\cdots,Y_{ip})'$ is a p-dimensional response, e.g. DNA methylation markers; $X_i=(X_{i1},\cdots,X_{iq})'$ is a q-dimensional covariate vector; $B=(B_1,\cdots,B_p)'\in\mathbb{R}^{p\times q}$ is a sparse regression coefficient matrix with $B_k=(\beta_{k1},\cdots,\beta_{kq})'\in\mathbb{R}^{q}$, k = 1, ⋯, p; and $\epsilon_i=(\epsilon_{i1},\cdots,\epsilon_{ip})'\in\mathbb{R}^{p}$ is a random error term with mean 0 and covariance matrix $\Sigma_p$. Throughout this article, we assume that the predictor dimension q is fixed, but the response dimension p could be larger than n. The sparsity of the regression coefficient matrix B means that only a small number of coefficients are non-zero. Our interest is to estimate the coefficient matrix B and establish oracle inequalities for the corresponding estimators, while accounting for the correlated outcomes.

Assume that (Yi, Xi) are independently and identically distributed (i.i.d.) observations, i = 1, ⋯, n. If ϵiN(0, Σp), the negative log-likelihood function is given by

(2) $\ell(B,\Omega_p)=\frac{1}{2n}\left\{np\log(2\pi)+n\log|\Omega_p^{-1}|+\sum_{i=1}^{n}(Y_i-BX_i)'\Omega_p(Y_i-BX_i)\right\},$

where $\Omega_p=\Sigma_p^{-1}$ is the precision matrix (Cai et al., 2011). Denote $(B_1',\cdots,B_p')'$ as $\beta=(\beta_1,\cdots,\beta_d)'$, where d = pq. Let $\mathcal{S}=\{j:\beta_j\neq 0\}$ be the true model with size $s=|\mathcal{S}|$. We use $\Omega_p$ to account for the within-subject correlation (Sofer et al., 2014), and propose the following criterion function:

(3) $Q(\beta;\Omega_p)=\frac{1}{np}\sum_{i=1}^{n}(Y_i-BX_i)'\Omega_p(Y_i-BX_i).$

In practice, the precision matrix Ωp can be estimated using the constrained ℓ1 minimization method (Cai et al., 2011), which has been implemented in the R package flare (Li et al., 2015a). Basically, the estimation of the sparse inverse covariance matrix (precision matrix) Ωp can be obtained by the following optimization problem:

$$\min \|\Omega_p\|_1, \quad \text{subject to} \quad \|\Sigma_n\Omega_p - I\|_\infty \le \gamma,$$

where γ > 0 is a tuning parameter and $\Sigma_n=\frac{1}{n}\sum_{i=1}^{n}(Y_i-\tilde{B}X_i)(Y_i-\tilde{B}X_i)'$ with $\tilde{B}$ being a consistent estimator (e.g. the ridge estimator).
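To make the column-wise structure of this optimization concrete, here is a minimal Python sketch of a CLIME-style estimator solved as a linear program with SciPy. This is illustrative only: the flare package uses its own, more efficient solver, and the function name `clime` and the symmetrization rule (keep the smaller-magnitude entry, as in Cai et al., 2011) are our own rendering.

```python
import numpy as np
from scipy.optimize import linprog

def clime(sigma_n, gamma):
    """Column-wise CLIME sketch: for each j, solve
        min ||w||_1  subject to  ||sigma_n @ w - e_j||_inf <= gamma
    as a linear program with w = u - v, u >= 0, v >= 0."""
    p = sigma_n.shape[0]
    omega = np.zeros((p, p))
    for j in range(p):
        e_j = np.zeros(p)
        e_j[j] = 1.0
        c = np.ones(2 * p)                  # objective: sum(u) + sum(v) = ||w||_1
        A = np.hstack([sigma_n, -sigma_n])  # A @ (u, v) = sigma_n @ (u - v)
        A_ub = np.vstack([A, -A])           # encode the two-sided sup-norm bound
        b_ub = np.concatenate([gamma + e_j, gamma - e_j])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * (2 * p), method="highs")
        omega[:, j] = res.x[:p] - res.x[p:]
    # symmetrize by keeping the entry with the smaller magnitude
    return np.where(np.abs(omega) <= np.abs(omega.T), omega, omega.T)
```

For instance, with $\Sigma_n = I$ and γ = 0.1, each column solution shrinks the diagonal to 0.9 and sets off-diagonals to 0.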

Belloni et al. (2011) proposed an unweighted square-root LASSO for β with a scalar outcome. They showed that the penalty (tuning parameter) is pivotal, i.e., it does not rely on the knowledge of the error variance, nor do we need to pre-estimate it. Consequently, the estimation of β is insensitive to the tuning parameter, greatly simplifying the computation which often resorts to cross validation for tuning parameter selection. Furthermore, the square-root LASSO method achieves near-oracle performance for the estimation of β. In comparison, the ordinary LASSO needs to estimate the error variance (Belloni et al., 2011) to achieve near-oracle performance, which is a very challenging issue in high dimensional data.

Motivated by Belloni et al. (2011), the corresponding weighted square-root LASSO version for correlated outcomes in Model (1) is defined as

(4) $\hat{\beta}=\arg\min_{\beta\in\mathbb{R}^{d}}\left\{\sqrt{Q(\beta;\Omega_p)}+\lambda\sum_{j=1}^{d}w_j|\beta_j|\right\},$

where λ > 0 is the tuning parameter and $w_j$ is a known weight, j = 1, ⋯, d. We can set $w_j=1/|\tilde{\beta}_j|$ along the lines of Zou (2006), with $\tilde{\beta}_j$ being the ridge estimator, j = 1, ⋯, d. Of note, we choose the ridge estimator rather than the ordinary least squares estimator used in the adaptive LASSO, since p could be larger than n in Model (1). To obtain $\hat{\beta}$ in (4), we consider the optimization problem

(5) $\hat{\beta}=\arg\min_{\beta\in\mathbb{R}^{d},\,\rho\ge 0}\left\{\frac{Q(\beta;\Omega_p)}{2\rho}+\frac{\rho}{2}+\lambda\sum_{j=1}^{d}w_j|\beta_j|\right\}.$

For ρ > 0, we have $\frac{Q(\beta;\Omega_p)}{2\rho}+\frac{\rho}{2}\ge\sqrt{Q(\beta;\Omega_p)}$ by the arithmetic-geometric mean inequality, so the objective function in (5) is an upper bound of that in (4). The equality is attained if and only if $\rho=\sqrt{Q(\beta;\Omega_p)}$. Similar to Proposition 3.1 of Liu and Wang (2017), the optimizations in (4) and (5) yield the same solution $\hat{\beta}$. This relationship between (4) and (5) provides an efficient algorithm, as described below.

For given λ, we have the following procedure:

Step 0. Compute the precision matrix $\hat{\Omega}_p$ (using R package flare; Li et al., 2015a) and $\hat{\beta}_{ridge}$ (using R package glmnet; Friedman et al., 2010). Set $\beta^{(0)}=\hat{\beta}_{ridge}$ and $\rho^{(0)}=\sqrt{Q(\beta^{(0)};\hat{\Omega}_p)}$.

Step 1. Solve the optimization problem via coordinate descent algorithm (Friedman et al., 2010)

$$\beta^{(k+1)}=\arg\min_{\beta\in\mathbb{R}^{d}}\left\{\frac{Q(\beta;\hat{\Omega}_p)}{2\rho^{(k)}}+\frac{\rho^{(k)}}{2}+\lambda\sum_{j=1}^{d}w_j|\beta_j|\right\}.$$

Step 2. Update

$$\rho^{(k+1)}=\sqrt{Q(\beta^{(k+1)};\hat{\Omega}_p)}.$$

Step 3. Repeat Steps 1 and 2 until convergence.

Of note, $\hat{\Omega}_p$ is consistent (Cai et al., 2011) and is kept unchanged in the iterative updating procedure, so the objective function in Step 1 is convex in β, which ensures fast convergence of the algorithm. Lemma 1 below presents an explicit expression for λ, while the optimal value of the tuning parameter in (4) will be evaluated in Section 4 via empirical studies. The estimation procedure has been implemented in R (available upon request).
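As an illustration only (the authors' implementation is in R and is available upon request), the following Python sketch mirrors Steps 0–3 for a generic whitened regression: it assumes the stacked response y and design X have already been multiplied by $\Lambda^{1/2}$, takes the weights w and penalty lam as given, and alternates a coordinate-descent update of β (Step 1) with the closed-form update of ρ (Step 2).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the building block of coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_sqrt_lasso(X, y, w, lam, n_iter=100, tol=1e-8):
    """Sketch of the iterative algorithm in Section 2. With rho fixed, the
    objective  Q(beta)/(2 rho) + rho/2 + lam * sum_j w_j |beta_j|,
    Q(beta) = ||y - X beta||^2 / N, is minimized by coordinate descent;
    then rho is updated to sqrt(Q(beta))."""
    N, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)       # per-coordinate curvature (assumed > 0)
    r = y - X @ beta                    # current residual
    rho = np.sqrt((r @ r) / N)          # Step 0 analogue (beta^(0) = 0 here)
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(d):              # Step 1: one pass of coordinate descent
            r += X[:, j] * beta[j]      # remove j-th contribution from residual
            z = X[:, j] @ r / N
            beta[j] = soft_threshold(z, lam * w[j] * rho) / (col_sq[j] / N)
            r -= X[:, j] * beta[j]
        rho = np.sqrt((r @ r) / N)      # Step 2: closed-form rho update
        if np.max(np.abs(beta - beta_old)) < tol:
            break                       # Step 3: stop at convergence
    return beta
```

Because λ and ρ are fixed within each pass, every coordinate update is a standard soft-thresholding step, which is why no cross-validation loop is needed.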

3 Theoretical results

In this section, we will establish the oracle inequality for $\hat{\beta}$ defined in (4). The following lemma lays the foundation for the tuning-insensitive property of the weighted square-root LASSO procedure, and is motivated by Bickel et al. (2009)'s choice of the penalty level for the LASSO. Denote $W_{\min}=\min\{w_1,\cdots,w_d\}$, $W_{\max}=\max\{w_1,\cdots,w_d\}$ and N = np. We first have the following lemma.

Lemma 1

Let $\lambda=\frac{c}{W_{\min}}\sqrt{\frac{2a\log d}{N}}$ with c > 1 and a > 2. Define

(6) $\Omega=\left\{\lambda\ge\frac{c}{W_{\min}}\left\|\nabla Q^{1/2}(\beta;\Omega_p)\right\|_\infty\right\},$

then we have

$$P(\Omega)\ge 1-\frac{1}{\sqrt{2\pi a\log d}}\,d^{\,1-a\left(1-2\sqrt{(a-1)\log d/N}\right)}-d^{\,1-a},$$

where ∇ is the gradient.

Similar to Belloni et al. (2011), the prediction norm is given as $\|e\|_{2,N}^{2}=\frac{1}{N}e'\mathbb{X}'\Lambda\mathbb{X}e$, where 𝕏 and Λ are defined in the Appendix and $e\in\mathbb{R}^{d}$. To derive the oracle inequalities, we define the compatibility factor (Huang et al., 2013) and restricted eigenvalue (RE, Bickel et al., 2009) as

$$\kappa(\xi,\mathcal{S})=\inf_{0\neq e\in\mathcal{C}(\xi,\mathcal{S})}\frac{s^{1/2}\|e\|_{2,N}}{\|e_{\mathcal{S}}\|_1}\quad\text{and}\quad RE(\xi,\mathcal{S})=\inf_{0\neq e\in\mathcal{C}(\xi,\mathcal{S})}\frac{\|e\|_{2,N}}{\|e\|_2},$$

respectively, where $\mathcal{C}(\xi,\mathcal{S})=\{e\in\mathbb{R}^{d}:\|e_{\mathcal{S}^c}\|_1\le\xi\|e_{\mathcal{S}}\|_1\}$ with $\xi=\frac{cW_{\max}+W_{\min}}{W_{\min}(c-1)}$. Hereafter, $\mathcal{A}^c$ denotes the complement of a set $\mathcal{A}$, and $v_{\mathcal{A}}=(v_j:j\in\mathcal{A})$ for a vector v. Then, we have the following conclusion.

Lemma 2

Denote $\hat{e}=\hat{\beta}-\beta$; then on the event Ω, we have $\hat{e}\in\mathcal{C}(\xi,\mathcal{S})$. That is, the ℓ1 norm of the coefficients of the irrelevant variables is bounded by a multiple of that of the relevant ones.

The following theorem gives the upper bound for the estimation error.

Theorem 1

Let c > 1 and $\xi=\frac{cW_{\max}+W_{\min}}{W_{\min}(c-1)}$, and suppose that $\frac{W_{\max}\lambda s^{1/2}}{\kappa(\xi,\mathcal{S})}\le\zeta<1$ and $Q(\beta;\Omega_p)\le K^2$ with K > 0. Then on the event Ω,

$$RE(\xi,\mathcal{S})\,\|\hat{\beta}-\beta\|_2\le\|\hat{\beta}-\beta\|_{2,N}\le\frac{2(W_{\max}+W_{\min}/c)\lambda s^{1/2}K}{(1-\zeta^2)\,\kappa(\xi,\mathcal{S})}.$$

Remark 1

The condition Q(β; Ωp) ≤ K2 with K > 0 is mild, since Q(β; Ωp) converges to 1 in probability as n → ∞.

Remark 2

This result and Lemma 1 show that by choosing $\lambda=\frac{\eta}{W_{\min}}\sqrt{\frac{\log d}{N}}$ with $\eta=c\sqrt{2a}>2$, the obtained $\hat{\beta}$ achieves the near-oracle rate of convergence (Belloni et al., 2011). Since the choice of η does not rely on any unknown parameters or quantities, we call this property tuning-insensitivity. Empirically, we find that setting η = 8 works well in most cases we encountered, which will be verified via simulation.
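As a small illustration of the tuning-insensitive choice (the function name and interface here are our own), the penalty level in Remark 2 can be computed directly, with no cross-validation loop:

```python
import numpy as np

def tuning_lambda(w, n, p, q, eta=8.0):
    """Penalty level lambda = (eta / W_min) * sqrt(log(d) / N) from Remark 2,
    with d = p*q coefficients and N = n*p stacked observations."""
    d, N = p * q, n * p
    return eta / np.min(w) * np.sqrt(np.log(d) / N)
```

The default η = 8 follows the empirical recommendation verified in Section 4.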

4 Simulation studies

In this section, we conduct simulation studies to validate the proposed methodology. We assume that X follows $N_q(0,\Sigma_X)$, where q = 2 and $\Sigma_X=(\sigma_{ij})$ is given by $\sigma_{ij}=0.7^{|i-j|}$. The random error term is generated from $N_p(0,\Sigma_E)$; we consider the following two settings for the error covariance:

Case (a). AR(1) error covariance: $\Sigma_{E,st}=0.8^{|s-t|}$.

Case (b). Fractional Gaussian Noise (FGN) error covariance:

$$\Sigma_{E,st}=\frac{1}{2}\left((|s-t|+1)^{2H}-2|s-t|^{2H}+\big||s-t|-1\big|^{2H}\right)$$

with Hurst parameter H = 0.9, so the correlation is (0.74, 0.63, 0.58, 0.55, 0.52, 0.50, 0.49, 0.48, 0.46, 0.45) for distances |s − t| = 1, ⋯, 10. It is noted that the inverse error covariance for Case (a) is a tri-diagonal sparse matrix, while Case (b) has a dense inverse error covariance. Set β = (1, 0.6, 0.3, 1.2, 0.8, 0.5, 0, ⋯, 0)', i.e., the first six elements are non-zero, while the rest are all 0. We take p = 200 and 300, respectively. All simulation results are based on 100 replications with n = 100 and 200.
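The two error-covariance settings can be sketched as follows (illustrative Python, not the authors' R code); with H = 0.9 the FGN entries reproduce the lag-1 correlation of about 0.74 quoted above.

```python
import numpy as np

def ar1_cov(p, rho=0.8):
    """Case (a): AR(1) covariance, Sigma_{E,st} = rho^{|s-t|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def fgn_cov(p, H=0.9):
    """Case (b): fractional Gaussian noise autocovariance,
    Sigma_{E,st} = 0.5 * ((k+1)^{2H} - 2 k^{2H} + |k-1|^{2H}), k = |s-t|."""
    k = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    return 0.5 * ((k + 1) ** (2 * H)
                  - 2 * k ** (2 * H)
                  + np.abs(k - 1) ** (2 * H))
```

Errors for one subject can then be drawn as, e.g., `np.random.multivariate_normal(np.zeros(p), fgn_cov(p))`.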

According to Theorem 1, the tuning parameter is $\lambda=\frac{\eta}{W_{\min}}\sqrt{\frac{\log d}{N}}$ with some η > 2. To choose a suitable λ, we consider various values of η, and the corresponding performance is plotted in Figures 1–4. We can see that the tuning parameter's effect is limited, and the procedure has better properties when η ∈ [6, 10]. Thus, we suggest η = 8 for the weighted square-root LASSO (WSR-LASSO) method in both the simulation and the real application. For comparison, we also consider the LASSO and the square-root LASSO (SR-LASSO), i.e. (4) with weights wj = 1.

Figure 1: Correct model rate for WSR-LASSO with AR(1) error covariance.

Figure 2: Model error for WSR-LASSO with AR(1) error covariance.

Figure 3: Correct model rate for WSR-LASSO with FGN error covariance.

Figure 4: Model error for WSR-LASSO with FGN error covariance.

Tables 1 and 2 report the results, which include the rate at which the correct model is selected (CMR), $I\{\hat{\mathcal{S}}=\mathcal{S}\}$; the false positive rate (FPR), $|\hat{\mathcal{S}}\setminus\mathcal{S}|/|\hat{\mathcal{S}}|$; the false negative rate (FNR), $|\mathcal{S}\setminus\hat{\mathcal{S}}|/(d-|\hat{\mathcal{S}}|)$; and the model error (ME), $\mathrm{tr}[(\hat{B}-B)\Sigma_X(\hat{B}-B)']$ (Yuan and Lin, 2007). It can be seen that LASSO and SR-LASSO have similar performance, in line with the conclusion of Belloni et al. (2011). Moreover, the WSR-LASSO method has a higher rate of selecting the correct model and a smaller model error than LASSO and SR-LASSO. The results on the false positive and negative rates also suggest that the proposed method is preferred over LASSO and SR-LASSO in practice.
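For concreteness, the four criteria can be computed as below (an illustrative Python sketch; the function names are ours):

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """CMR, FPR and FNR from Section 4, based on the supports S-hat and S
    of the estimated and true coefficient vectors (dimension d)."""
    d = beta_true.size
    S_hat = set(np.flatnonzero(beta_hat != 0))
    S = set(np.flatnonzero(beta_true != 0))
    cmr = float(S_hat == S)                    # I{S-hat == S}
    fpr = len(S_hat - S) / max(len(S_hat), 1)  # |S-hat \ S| / |S-hat|
    fnr = len(S - S_hat) / (d - len(S_hat))    # |S \ S-hat| / (d - |S-hat|)
    return cmr, fpr, fnr

def model_error(B_hat, B, Sigma_X):
    """Model error tr[(B-hat - B) Sigma_X (B-hat - B)'] (Yuan and Lin, 2007)."""
    D = B_hat - B
    return np.trace(D @ Sigma_X @ D.T)
```

Averaging these quantities over the 100 simulation replications gives the entries reported in the tables.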

Table 1

Simulation results on model selection with AR(1) error covariance.

                       n = 100                            n = 200
  p    Method        CMR    FPR     FNR     ME          CMR    FPR     FNR     ME
  200  LASSO         0.02   0.6121  0.0006  0.4203      0.10   0.5806  <10−4   0.2435
       SR-LASSO      0.02   0.5924  0.0005  0.4282      0.07   0.5973  <10−4   0.2405
       WSR-LASSO     0.41   0.1437  0.0008  0.2213      0.69   0.0708  0.0004  0.1337
  300  LASSO         0.00   0.6072  0.0004  0.4747      0.02   0.6205  0.0001  0.2534
       SR-LASSO      0.01   0.5857  0.0005  0.4795      0.01   0.6468  0.0001  0.2476
       WSR-LASSO     0.38   0.2956  0.0002  0.2389      0.68   0.0634  0.0003  0.1178
  1. CMR: rate the correct model is selected, $I\{\hat{\mathcal{S}}=\mathcal{S}\}$; FPR: false positive rate, $|\hat{\mathcal{S}}\setminus\mathcal{S}|/|\hat{\mathcal{S}}|$; FNR: false negative rate, $|\mathcal{S}\setminus\hat{\mathcal{S}}|/(d-|\hat{\mathcal{S}}|)$; ME: model error, $\mathrm{tr}[(\hat{B}-B)\Sigma_X(\hat{B}-B)']$.

Table 2

Simulation results on model selection with FGN error covariance.

                       n = 100                            n = 200
  p    Method        CMR    FPR     FNR     ME          CMR    FPR     FNR     ME
  200  LASSO         0.02   0.5360  0.0003  0.6467      0.06   0.4504  0.0002  0.2891
       SR-LASSO      0.03   0.5696  0.0004  0.6287      0.05   0.4508  0.0001  0.2869
       WSR-LASSO     0.33   0.1055  0.0014  0.3242      0.60   0.0692  0.0007  0.1769
  300  LASSO         0.03   0.5436  0.0004  0.6887      0.09   0.4922  <10−4   0.3762
       SR-LASSO      0.02   0.5239  0.0003  0.6741      0.07   0.4980  <10−4   0.3640
       WSR-LASSO     0.33   0.2634  0.0006  0.2828      0.55   0.0917  0.0006  0.2356
  1. CMR: rate the correct model is selected, $I\{\hat{\mathcal{S}}=\mathcal{S}\}$; FPR: false positive rate, $|\hat{\mathcal{S}}\setminus\mathcal{S}|/|\hat{\mathcal{S}}|$; FNR: false negative rate, $|\mathcal{S}\setminus\hat{\mathcal{S}}|/(d-|\hat{\mathcal{S}}|)$; ME: model error, $\mathrm{tr}[(\hat{B}-B)\Sigma_X(\hat{B}-B)']$.

5 Application

We apply our proposed methodology to the DNA methylation (DNAm) data from the US Department of Veterans Affairs' Normative Aging Study (NAS). We exclude participants who (i) were non-white or had missing information on race, to minimize potential confounding effects of genetic ancestry, or (ii) had any cancer diagnosis or a history of stroke or coronary artery disease, as their blood methylation profiles could have been affected. A total of 169 individuals with samples collected at their first blood draw remain for analysis.

We are interested in the effect of smoking on DNA methylation. Gao et al. (2015) conducted a literature review of DNA methylation changes in response to active smoking exposure in adults. Based on this review, we consider methylation markers at a total of 151 cytosine-phosphate-guanine (CpG) dinucleotides that had been reported multiple (≥2) times in the literature. In our data, the correlations among the 151 CpG sites range over [−0.6440, 0.9369], indicating high correlations among these CpGs.

We are interested in the effect of smoking pack-years (packyr) on these DNAm markers. We also include age and BMI in the model. In total, we need to estimate 453 (151 × 3) regression coefficients. We use the proposed method in Section 2 with tuning parameter λ = 0.0026.

In Table 3, we compare our results to the original 151 CpGs listed by Gao et al. (2015). In Table 4, we report the selected CpGs and coefficient estimates. Thirty-three CpGs among a total of 151 are selected by our method in the NAS data. We can see that CpGs reported more frequently in the literature are also more likely to be chosen by our method in the NAS data. For example, among 8 CpGs reported at least 7 times, 5 (62.5%) are selected by our method from the NAS data, while only 15 out of 89 (16.9%) CpGs reported twice in the literature are selected. The decreasing trend in Table 3 shows the consistency of our method with the literature. Of note, our method correctly identifies the top two CpGs, cg03636183 and cg05575921 (located in the F2RL3 and AHRR genes), which have been reported 12 and 11 times in the literature, respectively.

Table 3

Model selection comparison with Gao et al. (2015).

  Frequency CpGs reported     ≥7          5–6         3–4           2
  Total no. CpGs              8           12          42            89
  N (%) identified in NAS     5 (62.5%)   4 (33.3%)   10 (23.8%)    15 (16.9%)
  1. Frequency CpGs reported: the frequency with which CpGs were identified in the literature, per the review by Gao et al. (2015); Total no. CpGs: total number of CpGs reported in the literature; N (%): number (percentage) of CpGs selected by our method in the NAS data.

Table 4

Variable selection and estimation results for CpGs.

  CpG          Gene name   CHR       β̂1         β̂2         β̂3
  cg03636183   F2RL3        19   −0.0021     0.0029     0.0067
  cg05575921   AHRR          5   −0.0056     0.0111     0.0083
  cg06126421   *             6   −0.0014     0.0006     0.0094
  cg21566642   *             2   −0.0019     0         −0.0058
  cg06644428   *             2   −0.0037    −0.0085    −0.0306
  cg03991871   AHRR          5   −0.0021     0.0146     0.0237
  cg23576855   AHRR          5   −0.0013     0          0.0145
  cg25189904   GNG12         1   −0.0027     0         −0.0092
  cg08709672   AVPR1B        1    0.0011     0.0008     0
  cg12803068   MYO1G         7    0.00164    0          0.0267
  cg01692968   *             9   −3 × 10−5  −0.0104    −0.0051
  cg06060868   SDHA          5    4 × 10−6   0.0116     0.0078
  cg11207515   CNTNAP2       7    0.0004    −0.0049     0
  cg11231349   NOS1AP        1    0.0005     0.0080     0.0149
  cg22851561   C14orf43     14    0.0003     0          0
  cg23771366   PRSS23       11   −0.0007    −0.0015    −0.0111
  cg23916896   AHRR          5   −0.0010    −0.0092    −0.0135
  cg26963277   KCNQ1        11   −0.0003     0.0133     0.0141
  cg01500140   LIM2         19    0.0008     0.0072     0.0113
  cg03274391   *             3    0.0027     0          0.0001
  cg03604011   AHRR          5    0.0035    −0.0202    −0.0143
  cg04716530   ITGAL        16    3 × 10−5   0.0111     0.0056
  cg07465627   STXBP4       17    0.0002    −0.0052    −0.0092
  cg11902777   AHRR          5   −0.0011    −0.0209    −0.0204
  cg13039251   PDZD2         5    0.0031     0          0.0185
  cg15187398   MOBKL2A      19   −0.0001    −0.0038     0
  cg16201146   *            20    3 × 10−5   0.0032     0.0129
  cg17619755   VARS          6    0.0006     0          0.0082
  cg17924476   AHRR          5    0.0011     0         −0.0102
  cg23480021   *             3    0.0012     0          0.0157
  cg23667432   ALPP          2    0.0002     0.0009     0.0092
  cg23973524   CRTC1        19    0.0009     0          0.0057
  cg26764244   GNG12         1   −0.0007    −0.0101    −0.0183
  1. β̂1, packyr; β̂2, age; β̂3, BMI; * denotes CpGs in the intergenic region.

6 Concluding remarks

We proposed a weighted square-root LASSO method for high-dimensional multivariate regression models. We estimated the precision matrix by the CLIME method to account for the correlations between responses and obtained oracle inequalities for the estimator. Simulation studies were provided to illustrate the proposed procedure. We applied the method to study the relation between smoking and high dimensional DNA methylation markers.

There are several topics for future research. First, the estimation of high-dimensional Gaussian graphical models is an active area of research (Cai et al., 2011; Cai and Yuan, 2012; Fan et al., 2013). It is of great interest to consider the joint estimation of the regression coefficients and the precision matrix in (1); a possible solution is to add a penalty term in (4) for the elements of the precision matrix. Second, although the dimension of the covariates q is assumed fixed, it is straightforward to extend the proposed procedure to the high-dimensional covariates setting; the main difficulty lies in the computational burden due to the ultra-high dimensional parameters. Third, statistical inference for the weighted square-root LASSO is an important and interesting topic. Fourth, since our method obviates the burden of cross-validation, it is computationally efficient: for the DNA methylation data in Section 5, it took R about 10.3 s to converge on a personal computer. As a reviewer suggested, it would be of interest to make the proposed algorithm scalable to genome-wide response and predictor markers, which could be implemented on high performance computing facilities. Fifth, we are interested in high-dimensional mediation analysis (Zhang et al., 2016) to determine whether high-dimensional DNA methylation markers mediate the path from an intervention (e.g. diet, physical exercise) to health outcomes.

Acknowledgement

We would like to thank the Editor, the Associate Editor and two reviewers for their helpful comments and suggestions, which helped us improve the article substantially.

  1. Funding: This work was supported by AHA 14SFRN20480260, 12GRNT12070254, and National Institute of Environmental Health Sciences grant R01ES021357, R01ES021733, and R01ES015172, National Natural Science Foundation of China (Nos. 11301212, 11401146), and China Postdoctoral Science Foundation (No. 2014M550861). The VA Normative Aging Study is supported by the Cooperative Studies Program/Epidemiology Research and Information Center of the US Department of Veterans Affairs.

Appendix

For notational simplicity, let $\mathbb{Y}=(Y_1',\cdots,Y_n')'\in\mathbb{R}^{N}$ and $\mathcal{E}=(\epsilon_1',\cdots,\epsilon_n')'\in\mathbb{R}^{N}$ with N = np. For $\mathcal{A}=(a_{ij})\in\mathbb{R}^{m\times n}$ and $\mathcal{B}=(b_{ij})\in\mathbb{R}^{p\times q}$, the Kronecker product $\mathcal{A}\otimes\mathcal{B}\in\mathbb{R}^{mp\times nq}$ is defined as

$$\mathcal{A}\otimes\mathcal{B}=\begin{bmatrix}a_{11}\mathcal{B}&a_{12}\mathcal{B}&\cdots&a_{1n}\mathcal{B}\\a_{21}\mathcal{B}&a_{22}\mathcal{B}&\cdots&a_{2n}\mathcal{B}\\\vdots&\vdots&\ddots&\vdots\\a_{m1}\mathcal{B}&a_{m2}\mathcal{B}&\cdots&a_{mn}\mathcal{B}\end{bmatrix}.$$

Let $\mathbb{X}=(\mathcal{X}_1',\cdots,\mathcal{X}_n')'$ with $\mathcal{X}_k=I_p\otimes X_k'$, k = 1, ⋯, n. Denote Λ as the N × N block diagonal matrix with i-th diagonal block $\Omega_p$, i = 1, ⋯, n. Then (3) can be rewritten as

(A.1) $Q(\beta;\Lambda)=\frac{1}{N}(\mathbb{Y}-\mathbb{X}\beta)'\Lambda(\mathbb{Y}-\mathbb{X}\beta).$

In the following, we denote Q(β; Λ) as Q(β). We first need the following lemma.

Lemma 3

(Laurent and Massart, 2000). Let $X\sim\chi_d^2$; then for 0 ≤ t < 1/2, we have

$$P\big(X\le(1-t)d\big)\le\exp\left(-\tfrac{1}{4}dt^2\right).$$

Proof of Lemma 1

Denote $\Lambda^{1/2}\mathbb{Y}=\Lambda^{1/2}\mathbb{X}\beta+\Lambda^{1/2}\mathcal{E}$ as $\tilde{\mathbb{Y}}=\tilde{\mathbb{X}}\beta+\tilde{\mathcal{E}}$, where $\tilde{\mathcal{E}}$ follows the N-dimensional multivariate normal distribution with mean 0 and covariance matrix $I_{N\times N}$. Then, by the definition of Q(β) in (A.1), we have

(A.2)
$$\sqrt{N}\,\|\nabla Q^{1/2}(\beta)\|_\infty=\frac{\left\|\sum_{i=1}^{N}\tilde{\mathbb{X}}_i(\tilde{\mathbb{Y}}_i-\tilde{\mathbb{X}}_i'\beta)\right\|_\infty}{\sqrt{\sum_{i=1}^{N}(\tilde{\mathbb{Y}}_i-\tilde{\mathbb{X}}_i'\beta)^2}}=\frac{\left\|\sum_{i=1}^{N}\tilde{\mathbb{X}}_i\tilde{\mathcal{E}}_i\right\|_\infty}{\sqrt{\sum_{i=1}^{N}\tilde{\mathcal{E}}_i^2}}.$$

Note that $\sum_{i=1}^{N}\tilde{\mathbb{X}}_{ij}\tilde{\mathcal{E}}_i\sim N(0,N)$, j = 1, ⋯, d, and $\sum_{i=1}^{N}\tilde{\mathcal{E}}_i^2\sim\chi_N^2$. Then it follows from Lemma 3 that

$$P\left(\sum_{i=1}^{N}\tilde{\mathcal{E}}_i^2\le N(1-r_N)\right)\le\exp\left(-\frac{Nr_N^2}{4}\right),$$

where 0 ≤ rN ≤ 1/2. Moreover, we can derive the following inequality

$$\begin{aligned}
&P\left(\frac{\|\sum_{i=1}^{N}\tilde{\mathbb{X}}_i\tilde{\mathcal{E}}_i\|_\infty}{\sqrt{\sum_{i=1}^{N}\tilde{\mathcal{E}}_i^2}}>\sqrt{2a\log d}\right)\\
&\quad\le P\left(\Big\|\sum_{i=1}^{N}\tilde{\mathbb{X}}_i\tilde{\mathcal{E}}_i\Big\|_\infty>\sqrt{(1-r_N)2Na\log d}\right)+P\left(\sum_{i=1}^{N}\tilde{\mathcal{E}}_i^2\le N(1-r_N)\right)\\
&\quad\le\sum_{j=1}^{d}P\left(\Big|\sum_{i=1}^{N}\tilde{\mathbb{X}}_{ij}\tilde{\mathcal{E}}_i\Big|>\sqrt{(1-r_N)2Na\log d}\right)+\exp\left(-\frac{Nr_N^2}{4}\right)\\
&\quad\le 2d\left\{1-\Phi\left(\sqrt{(1-r_N)2a\log d}\right)\right\}+\exp\left(-\frac{Nr_N^2}{4}\right)\\
&\quad\le\frac{2d\,d^{-a(1-r_N)}}{\sqrt{2\pi}\,\sqrt{(1-r_N)2a\log d}}+\exp\left(-\frac{Nr_N^2}{4}\right)\\
&\quad=\frac{d^{\,1-a(1-r_N)}}{\sqrt{\pi(1-r_N)a\log d}}+\exp\left(-\frac{Nr_N^2}{4}\right),
\end{aligned}$$

where the last inequality follows from $1-\Phi(t)\le\frac{1}{\sqrt{2\pi}\,t}\exp\left(-\frac{t^2}{2}\right)$.

Let $r_N=2\sqrt{(a-1)\log d/N}$; when n is large enough, we have

$$P\left(\sqrt{N}\,\|\nabla Q^{1/2}(\beta)\|_\infty\le\sqrt{2a\log d}\right)\ge 1-\frac{1}{\sqrt{2\pi a\log d}}\,d^{\,1-a\left(1-2\sqrt{(a-1)\log d/N}\right)}-d^{\,1-a}.$$

Proof of Lemma 2

First, from the definition of β^ in (4), we notice that

(A.3)
$$\begin{aligned}
Q^{1/2}(\hat{\beta})-Q^{1/2}(\beta)&\le\lambda\sum_{j=1}^{d}w_j|\beta_j|-\lambda\sum_{j=1}^{d}w_j|\hat{\beta}_j|\\
&\le\lambda W_{\max}\|(\hat{\beta}-\beta)_{\mathcal{S}}\|_1-\lambda W_{\min}\|(\hat{\beta}-\beta)_{\mathcal{S}^c}\|_1.
\end{aligned}$$

Second, on the event Ω, we have $c\|\nabla Q^{1/2}(\beta)\|_\infty\le\lambda W_{\min}$. Thus, using the fact that $Q^{1/2}(\beta)$ is a convex function, we have

(A.4)
$$\begin{aligned}
Q^{1/2}(\hat{\beta})-Q^{1/2}(\beta)&\ge\nabla Q^{1/2}(\beta)'(\hat{\beta}-\beta)\ge-\|\nabla Q^{1/2}(\beta)\|_\infty\|\hat{\beta}-\beta\|_1\\
&\ge-\frac{\lambda W_{\min}}{c}\|\hat{\beta}-\beta\|_1=-\frac{\lambda W_{\min}}{c}\left(\|(\hat{\beta}-\beta)_{\mathcal{S}}\|_1+\|(\hat{\beta}-\beta)_{\mathcal{S}^c}\|_1\right).
\end{aligned}$$

Combining (A.3) and (A.4), we can obtain

$$\|(\hat{\beta}-\beta)_{\mathcal{S}^c}\|_1\le\frac{cW_{\max}+W_{\min}}{W_{\min}(c-1)}\,\|(\hat{\beta}-\beta)_{\mathcal{S}}\|_1.$$

Proof of Theorem 1

We notice the following relation:

(A.5)
$$Q(\hat{\beta})-Q(\beta)=\|\hat{e}\|_{2,N}^2-\frac{2}{N}\sum_{i=1}^{N}(\tilde{\mathbb{Y}}_i-\tilde{\mathbb{X}}_i'\beta)\tilde{\mathbb{X}}_i'\hat{e}\ge\|\hat{e}\|_{2,N}^2-2Q^{1/2}(\beta)\,\|\nabla Q^{1/2}(\beta)\|_\infty\,\|\hat{e}\|_1,$$

where (A.5) holds by the Hölder inequality. Then it follows from the definition of κ(ξ, 𝒮) that

(A.6)
$$\|\hat{e}\|_{2,N}^2\le 2Q^{1/2}(\beta)\,\|\nabla Q^{1/2}(\beta)\|_\infty\,\|\hat{e}\|_1+\left[Q^{1/2}(\hat{\beta})+Q^{1/2}(\beta)\right]\lambda\left[\frac{W_{\max}s^{1/2}\|\hat{e}\|_{2,N}}{\kappa(\xi,\mathcal{S})}-W_{\min}\|\hat{e}_{\mathcal{S}^c}\|_1\right].$$

Moreover, we note

(A.7)
$$Q^{1/2}(\hat{\beta})\le Q^{1/2}(\beta)+\lambda W_{\max}\frac{s^{1/2}\|\hat{e}\|_{2,N}}{\kappa(\xi,\mathcal{S})}.$$

From (A.6) and (A.7), we have

$$\|\hat{e}\|_{2,N}^2\le 2Q^{1/2}(\beta)\,\|\nabla Q^{1/2}(\beta)\|_\infty\,\|\hat{e}\|_1+2Q^{1/2}(\beta)\lambda W_{\max}\frac{s^{1/2}\|\hat{e}\|_{2,N}}{\kappa(\xi,\mathcal{S})}+\left\{\lambda W_{\max}\frac{s^{1/2}\|\hat{e}\|_{2,N}}{\kappa(\xi,\mathcal{S})}\right\}^2-2Q^{1/2}(\beta)\lambda W_{\min}\|\hat{e}_{\mathcal{S}^c}\|_1.$$

Since $c\|\nabla Q^{1/2}(\beta)\|_\infty\le\lambda W_{\min}$, we have

$$\|\hat{e}\|_{2,N}^2\le 2Q^{1/2}(\beta)\,\|\nabla Q^{1/2}(\beta)\|_\infty\,\|\hat{e}_{\mathcal{S}}\|_1+2Q^{1/2}(\beta)\lambda W_{\max}\frac{s^{1/2}\|\hat{e}\|_{2,N}}{\kappa(\xi,\mathcal{S})}+\left\{\lambda W_{\max}\frac{s^{1/2}\|\hat{e}\|_{2,N}}{\kappa(\xi,\mathcal{S})}\right\}^2.$$

Then,

(A.8)
$$\left\{1-\left(\frac{W_{\max}\lambda s^{1/2}}{\kappa(\xi,\mathcal{S})}\right)^2\right\}\|\hat{e}\|_{2,N}^2\le 2\left(W_{\max}+\frac{W_{\min}}{c}\right)Q^{1/2}(\beta)\,\lambda\,\frac{s^{1/2}}{\kappa(\xi,\mathcal{S})}\,\|\hat{e}\|_{2,N}.$$

Since $\frac{W_{\max}\lambda s^{1/2}}{\kappa(\xi,\mathcal{S})}\le\zeta<1$ and $Q(\beta;\Omega_p)\le K^2$ hold, solving inequality (A.8) yields the error bound stated in the theorem. ⊡

References

Basu, S. and G. Michailidis (2015): "Regularized estimation in sparse high-dimensional time series models," Ann. Stat., 43, 1535–1567. doi:10.1214/15-AOS1315.

Belloni, A., V. Chernozhukov and L. Wang (2011): "Square-root LASSO: pivotal recovery of sparse signals via conic programming," Biometrika, 98, 791–806. doi:10.1093/biomet/asr043.

Belloni, A., V. Chernozhukov and L. Wang (2014): "Pivotal estimation via square-root LASSO in nonparametric regression," Ann. Stat., 42, 757–788. doi:10.1214/14-AOS1204.

Bickel, P. J., Y. Ritov and A. Tsybakov (2009): "Simultaneous analysis of LASSO and Dantzig selector," Ann. Stat., 37, 1705–1732. doi:10.1214/08-AOS620.

Bishop, C., D. Spiegelhalter and J. Winn (2003): "VIBES: a variational inference engine for Bayesian networks." In Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun and K. Obermayer, eds.). MIT Press, Cambridge, MA, 777–784.

Bradic, J., J. Fan and J. Jiang (2011): "Regularization for Cox's proportional hazards model with NP-dimensionality," Ann. Stat., 39, 3092–3120. doi:10.1214/11-AOS911.

Bühlmann, P. and S. van de Geer (2011): Statistics for high-dimensional data: methods, theory and applications, Springer. doi:10.1007/978-3-642-20192-9.

Cai, T. and M. Yuan (2012): "Adaptive covariance matrix estimation through block thresholding," Ann. Stat., 40, 2014–2042. doi:10.1214/12-AOS999.

Cai, T., W. Liu and X. Luo (2011): "A constrained ℓ1 minimization approach to sparse precision matrix estimation," J. Am. Stat. Assoc., 106, 594–607. doi:10.1198/jasa.2011.tm10155.

Fan, J. and R. Li (2001): "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Am. Stat. Assoc., 96, 1348–1360. doi:10.1198/016214501753382273.

Fan, J. and R. Li (2002): "Variable selection for Cox's proportional hazards model and frailty model," Ann. Stat., 30, 74–99. doi:10.1214/aos/1015362185.

Fan, Y. and J. Lv (2014): "Asymptotic properties for combined L1 and concave regularization," Biometrika, 101, 57–70. doi:10.1093/biomet/ast047.

Fan, J., Y. Liao and M. Mincheva (2013): "Large covariance estimation by thresholding principal orthogonal complements (with discussion)," J. R. Stat. Soc. Ser. B, 75, 603–680. doi:10.1111/rssb.12016.

Friedman, J., T. Hastie and R. Tibshirani (2010): "Regularization paths for generalized linear models via coordinate descent," J. Stat. Software, 33, 1–22. doi:10.18637/jss.v033.i01.

Gao, X., M. Jia, Y. Zhang, L. Breitling and H. Brenner (2015): "DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies," Clin. Epigenet., 7, 113. doi:10.1186/s13148-015-0148-3.

Huang, J., S. Ma, H. Li and C.-H. Zhang (2011): "The sparse Laplacian shrinkage estimator for high-dimensional regression," Ann. Stat., 39, 2021–2046. doi:10.1214/11-AOS897.

Huang, J., T. Sun, Z. Ying, Y. Yu and C.-H. Zhang (2013): "Oracle inequalities for the LASSO in the Cox model," Ann. Stat., 41, 1142–1165. doi:10.1214/13-AOS1098.

Jiang, Y., Y. He and H. Zhang (2016): "Variable selection with prior information for generalized linear models via the prior lasso method," J. Am. Stat. Assoc., 111, 355–376. doi:10.1080/01621459.2015.1008363.

Laurent, B. and P. Massart (2000): "Adaptive estimation of a quadratic functional by model selection," Ann. Stat., 28, 1302–1338.

Li, X., T. Zhao, X. Yuan and H. Liu (2015a): "The flare package for high dimensional linear regression and precision matrix estimation in R," J. Mach. Learn. Res., 16, 553–557.

Li, Y., B. Nan and J. Zhu (2015b): "Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure," Biometrics, 71, 354–363. doi:10.1111/biom.12292.

Lin, W. and J. Lv (2013): "High-dimensional sparse additive hazards regression," J. Am. Stat. Assoc., 108, 247–264. doi:10.1080/01621459.2012.746068.

Lin, W., R. Feng and H. Li (2015): "Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics," J. Am. Stat. Assoc., 110, 270–288. doi:10.1080/01621459.2014.908125.

Liu, H. and L. Wang (2017): "TIGER: a tuning-insensitive approach for optimally estimating Gaussian graphical models," Electron. J. Stat., 11, 241–294. doi:10.1214/16-EJS1195.

Liu, H., L. Wang and T. Zhao (2015): "Calibrated multivariate regression with application to neural semantic basis discovery," J. Mach. Learn. Res., 16, 1579–1606.

Moen, E., X. Zhang, W. Mu, S. Delaney, C. Wing, J. McQuade, J. Myers, L. Godley, M. Dolan and W. Zhang (2013): "Genome-wide variation of cytosine modifications between European and African populations and the implications for complex traits," Genetics, 194, 987–996. doi:10.1534/genetics.113.151381.

Mukherjee, R., N. Pillai and X. Lin (2015): "Hypothesis testing for high-dimensional sparse binary regression," Ann. Stat., 43, 352–381. doi:10.1214/14-AOS1279.

Rothman, A., E. Levina and J. Zhu (2010): "Sparse multivariate regression with covariance estimation," J. Comput. Graph. Stat., 19, 947–962. doi:10.1198/jcgs.2010.09188.

Sofer, T., L. Dicker and X. Lin (2014): "Variable selection for high dimensional multivariate outcomes," Stat. Sinica, 24, 1633–1654. doi:10.5705/ss.2013.019.

Tibshirani, R. (1996): "Regression shrinkage and selection via the LASSO," J. R. Stat. Soc. Ser. B, 58, 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x.

van de Geer, S. (2008): "High-dimensional generalized linear models and the lasso," Ann. Stat., 36, 614–645. doi:10.1214/009053607000000929.

Wang, H. and C. Leng (2007): "Unified LASSO estimation by least squares approximation," J. Am. Stat. Assoc., 102, 1039–1048. doi:10.1198/016214507000000509.

Wilms, I. and C. Croux (2017): "An algorithm for the multivariate group lasso with covariance estimation," J. Appl. Stat., accepted. doi:10.1080/02664763.2017.1289503.

Yuan, M. and Y. Lin (2007): "Model selection and estimation in the Gaussian graphical model," Biometrika, 95, 19–35. doi:10.1093/biomet/asm018.

Zhang, C.-H. (2010): "Nearly unbiased variable selection under minimax concave penalty," Ann. Stat., 38, 894–942. doi:10.1214/09-AOS729.

Zhang, H., Y. Zheng, Z. Zhang, T. Gao, B. Joyce, G. Yoon, W. Zhang, J. Schwartz, A. Just, E. Colicino, P. Vokonas, L. Zhao, J. Lv, A. Baccarelli, L. Hou and L. Liu (2016): "Estimating and testing high-dimensional mediation effects in epigenetic studies," Bioinformatics, 32, 3150–3154. doi:10.1093/bioinformatics/btw351.

Zou, H. (2006): "The adaptive LASSO and its oracle properties," J. Am. Stat. Assoc., 101, 1418–1429. doi:10.1198/016214506000000735.

Zou, H. and T. Hastie (2005): "Regularization and variable selection via the elastic net," J. R. Stat. Soc. Ser. B, 67, 301–320. doi:10.1111/j.1467-9868.2005.00503.x.

Published Online: 2017-7-22
Published in Print: 2017-7-26

©2017 Walter de Gruyter GmbH, Berlin/Boston
