Designing penalty functions in high dimensional problems: The role of tuning parameters

: Various forms of penalty functions have been developed for regularized estimation and variable selection. Screening approaches are often used to reduce the number of covariate before penalized estimation. However, in certain problems, the number of covariates remains large after screening. For example, in genome-wide association (GWA) studies, the purpose is to identify Single Nucleotide Polymorphisms (SNPs) that are associated with certain traits, and typically there are millions of SNPs and thousands of samples. Because of the strong correlation of nearby SNPs, screening can only reduce the number of SNPs from millions to tens of thou- sands and the variable selection problem remains very challenging. Several penalty functions have been proposed for such high dimensional data. How- ever, it is unclear which class of penalty functions is the appropriate choice for a particular application. In this paper, we conduct a theoretical anal- ysis to relate the ranges of tuning parameters of various penalty functions with the dimensionality of the problem and the minimum eﬀect size. We exemplify our theoretical results in several penalty functions. The results suggest that a class of penalty functions that bridges L 0 and L 1 penalties requires less restrictive conditions on dimensionality and minimum eﬀect sizes in order to attain the two fundamental goals of penalized estimation: to penalize all the noise to be zero and to obtain unbiased estimation of the true signals. The penalties such as SICA and Log and SICA requires less restrictive conditions on dimensionality and minimum eﬀect sizes, while achieving the two fundamental goals of penalized estimation. For the tuning of the regularization parameters, our study shows that both SICA and Log penalties have very limited performance if only one of the two regularization parameters is tuned, while tuning both regularization parameters can signiﬁcantly improve their performances, although at the price of heavier computational burden. Our results are also insightful for


Introduction
In genome-wide association (GWA) studies, the goal is to identify the genetic factors such as single nucleotide polymorphisms (SNPs) that are associated with diseases. With the availability of a dense map of SNPs, it is statistically very challenging to select the important SNPs from millions of SNPs using only a couple of thousand samples. Regularized estimation procedures can be applied for simultaneous selection of important variables (SNPs) and estimation of their effects for high dimensional data in GWA studies. The objective function of the regularized estimation is composed of a model fitting metric (e.g., likelihood function) and a penalty function for the parameters subject to regularization. Prior to the usage of regularized estimation, screening can be applied to reduce the number of SNPs to be considered for penalized estimation. However, due to the high correlation of neighboring SNPs, the number of SNPs that pass a reasonable screening criterion is often larger than or much larger than the sample size.
We use the real SNP genotype data from a recent study (Wright et al., 2014) to illustrate the correlation structure of genotype data. We take the genotypes of 645,316 SNPs in chromosome 1 from 1,198 samples, and randomly pick 30 SNPs as important variables to simulate the response under the linear model. The effect size is simulated as 0.7 and the residual errors are standard normal variables. Figure 1 shows a Manhattan plot of the marginal association p-values. The 30 important SNPs are labeled by grey vertical lines. It is obvious that the high correlation among nearby SNPs leads to small p-values for those SNPs which are close to the 30 important SNPs. If we apply screening using the p-value cut-off 10 −4 , 3,087 SNPs will be selected which include 20 of the 30 important SNPs. Alternatively, if the p-value cut-off is 10 −8 , 991 SNPs will be selected, which include only 13 of the 30 important SNPs. Thus screening method can be helpful to certain extend, and screening with stringent threshold would lead to many false negatives. This conclusion is consistent with the extensive empirical study by Bühlmann and Mandozzi (2012). Therefore, the penalty function itself is still the key for high dimensional data analysis, and it is desirable to identify penalty functions that can tolerate higher dimension.
Several penalty functions have been proposed for high dimensional data analysis. One of the most popular penalty functions is the Lasso penalty (Tibshirani, 1996). The variable selection consistency of the Lasso requires the irrepresentable condition (Zhao and Yu, 2006) that there is no strong correlation between the "important covariates" that have non-zero effects and the "unimportant covariates" that have zero effects. This condition may not be satisfied in some applications, such as GWA studies. Recent studies have shown that a class of folded concave penalties can achieve variable selection consistency without requiring such an irrepresentable condition (Fan and Lv, 2010). These folded concave penalties include, but are not limited to SCAD (Smoothly Clipped Absolute Deviation) (Fan, 1997;Fan and Li, 2001), MCP (Minimax Concave Penalty) (Zhang, 2010), SICA (Smooth Integration of Counting and Absolute deviation) (Lv and Fan, 2009), and a Log penalty (Friedman (2008), Sun, Ibrahim and Zou (2010)).
A common concern in real data applications of penalized estimation is to tune the regularization parameters to achieve the two fundamental goals of penalized estimation: to penalize all the noise to be zero and to obtain an unbiased estimation of the true signals. However, it may not be clear whether such "optimal" tuning is possible, and this is the focus of our study. Mazumder, Friedman and Hastie (2011) study the non-convex optimization problem for SCAD, MCP and Log penalties, but they did not address the roles of tuning parameters of those penalties in variable selection. Moreover, all the aforementioned folded-concave penalties have two tuning parameters, and thus in practice, the immediate questions concern whether they both should be tuned, and what is the consequence of tuning only one of them in order to improve computational efficiency. Previous work has provided recommendations regarding the choice of tuning parameters, but there is no systematic asymptotic studies on the roles of multiple tuning parameters. To address those issues, we will relate the choice of tuning parameters to the difficulty of the variable selection problem, namely the minimum effect size and the dimensions, i.e., the number of important and unimportant covariates.
The results suggest that a class of penalty functions that bridges L 0 and L 1 penalties such as Log and SICA requires less restrictive conditions on dimensionality and minimum effect sizes, while achieving the two fundamental goals of penalized estimation. For the tuning of the regularization parameters, our study shows that both SICA and Log penalties have very limited performance if only one of the two regularization parameters is tuned, while tuning both regularization parameters can significantly improve their performances, although at the price of heavier computational burden. Our results are also insightful for designing other penalty functions. For example, our results imply that two tuning parameters are sufficient to achieve the two fundamental goals. Therefore, penalties with more than two regularization parameters may not be needed due to the substantial increase of computational cost.
We conducted empirical analyses of the penalty functions using both simulated data and real data in GWA settings. Those empirical results support the idea that the class of penalty functions that bridges L 0 and L 1 holds promise for genomic studies.

Notations and problem setup
Let p (β) be a penalty function of β, where are regularization parameters with arbitrary dimension. p (β) is referred to as a folded concave penalty if it satisfies the following condition: We formulate the effects of the covariates via a generalized linear regression model, permitting continuous and discrete outcome variables. Consider a sample of n responses, y = (y 1 , . . . , y n ) T , where each y i , i = 1, . . . , n, is independently generated from an exponential family distribution with a density: where θ i is the canonical parameter and φ ∈ (0, ∞) is the dispersion parameter. Let x ij be the value of the j-th covariate in the i-th sample, and let X = (x ij ) be a n × p matrix of the covariates' values. We assume that X has been normalized such that n i=1 x 2 ij = n, for j = 1, . . . , p. Under the assumed generalized linear model, θ i = p j=1 x ij β j , where β j 's are regression coefficients. Let E(y) = μ(θ) = (∂ θ1 b(θ 1 ), . . . , ∂ θn b(θ n )) T and Σ(θ) = diag ∂ 2 θ1 b(θ 1 ), . . . , ∂ 2 θn b(θ n ) . We maximize the penalized likelihood Q n (β) = l n (β) − p j=1 p (|β j |), where l n (β) = n −1 y T θ − 1 T b(θ) is an affine transformation of the log-likelihood.
Without loss of generality, we assume that the first s covariates of X are important (i.e., having non-zero effect on the response variable) and denote them collectively by X 1 , and then denote the remaining p − s unimportant covariates by X 2 , such that X = (X 1 , X 2 ). Similarly, we partition β for the important and unimportant covariates such that β = (β T 1 , β T 2 ) T . Let β 0 = (β T 01 , β T 02 ) T = (β 01 , . . . , β 0p ) T be the true coefficients, such that β 02 = 0. Let θ 0 be the true values of θ such that θ 0 = Xβ 0 .
It is difficult to analytically study the global maximizer of the penalized likelihood. Following the previous work (Fan and Lv, 2011), we study the local maximizer of the penalized likelihood that satisfies a set of sufficient and almost necessary conditions specified in Theorem 1 (see Appendix).

The role of the tuning parameters
The dimension of the regression problem and the minimum effect size are assumed to satisfy the following conditions: Condition 2.1. log p = O(n α ) and s = O(n ν ), respectively, with 0 ≤ α < 1 and 0 ≤ ν < 1/2.
The restriction of γ 0 > ν (which is equivalent to s < n γ0 ) in Condition 2.2 can be understood as an identifiability condition so that d n s = O(n ν−γ0 (log n) 1/2 ) can be bounded by a constant. Otherwise the response variable is unbounded, with non-trivial probability.
A maximizer of the penalized likelihood,β = (β T 1 ,β T 2 ) T , is considered to have weak oracle property ifβ 2 = 0 with probability tending to 1 as n → ∞, andβ 1 is consistent under L ∞ loss (Lv and Fan, 2009). We will study the role of tuning parameters by studying the conditions for the weak oracle property. To this end, we generalize the conditions for the weak oracle property in Fan and Lv (2011) to impose constraints on the penalty function rather than particular tuning parameters, which gives the following Conditions 3.1-3.3. This generalization is necessary because the original conditions are too stringent for any penalty function whose p (0+) involves more than one tuning parameter. For example, the Log penalty cannot satisfy the original conditions for the weak oracle property. After generalizing the conditions, we can show that the Log penalty can indeed fulfill the conditions of the weak oracle property.
σ is a constant that is defined based on the range of the response variable y (see proposition A1 in the Supplementary Materials for details), and η p = n −1/2+α/2 (log n) 1/2 . Condition 3.1 requires the derivative of the penalty function (i.e., the increase of penalization as the regression coefficient increases) for important covariates to be small enough. Condition 3.2 says that the ratio of the penalties' derivatives for unimportant covariates and for important ones (p (0+)/p (d n )) should be large enough relative to the maximum correlation between important and unimportant covariates, which is a generalization of the irrepresentable condition for Lasso (Zhao and Yu, 2006). Condition 3.3 requires the derivative of the penalty function for unimportant covariates to be large enough. In contrast to the conditions for the weak oracle property in Fan and Lv (2011), a critical modification is that we restrict the size of p (0+) in Condition 3.3, which replaces the condition λ n n −α (log n) 2 stated in equation (18) of Fan and Lv (2011). For SCAD and MCP, p (0+) = λ n , and thus constraints on λ n or p (0+) are equivalent. However, for Log and SICA, p (0+) = O(λ n /τ n ). Therefore, the generalized condition only requires the ratio of the two regularization parameters to be large enough instead of imposing a constraint on λ n itself. Given Conditions 2.1-2.2, Conditions 3.1-3.3, and Conditions 4.1-4.4 (presented in the Appendix), which are for the design matrix X, we have the weak oracle property (Theorem 2 in the Appendix).
One immediate conclusion from Conditions 3.1-3.3 is that the constraints on the penalty function p (β) are applied on the two quantities p (0+) and p (d n ). With an appropriate design, two tuning parameters can give enough degrees of freedom on these two quantities so that Conditions 3.1-3.3 are satisfied.
Next we discuss the implications of Conditions 3.1-3.3 for the four folded concave penalties: SCAD, MCP, Log, and SICA. It is more convenient to define SCAD and MCP by their derivatives.
where λ > 0 and a > 2 are two regularization parameters.
where λ > 0 and a > 0 are two regularization parameters. The Log and SICA penalties are defined as respectively, where λ > 0 and τ > 0 are two regularization parameters. In the following discussions, the tuning parameters employed by a penalty are indicated by subscripts. For example, the SCAD penalty with one tuning parameter λ n (the other regularization parameter a being set as constant) is denoted by SCAD λn and the SCAD penalty with two tuning parameters λ n and a n is denoted by SCAD λn,an .
Proposition 4 (SICA λn,τn or Log λn,τn ). There are tuning parameters that satisfy Conditions 3.1-3.3 for the weak oracle property without further constraints other than s n γ0 , as is specified in Condition 2.2.
The proofs of Propositions 1-4 and Corollary 1 are presented in the Supplementary Materials (Chen et al., 2016).
By Proposition 1, if d n η p or d n η p , SCAD has similar theoretical properties when one or two tuning parameters are used. This conclusion is consistent with many previous works where SCAD has satisfactory performance when the regularization parameter a is set to be a constant, e.g., 3.7. Using two tuning parameters (λ n and a n ) does have some advantage over one tuning parameter (λ n ) when d n = O(η p ). However, since the situation of d n = O(η p ) only covers a negligible part of the space for d n , we do not discuss it further here. Proposition 1 also states that if d n η p (the effect size is not large enough relative to the dimension), then there is no tuning parameter of SCAD to satisfy Conditions 3.1-3.3. Specifically, Condition 3.1 requires p (d n ) d n , and Condition 3.3 requires p (0+) > cη p , where c is a constant. These two conditions cannot both be satisfied if d n η p . Specifically, if SCAD satisfies Condition 3.3, then p (0+) = λ n > cη p . Given d n η p and η p < λ n /c, we have d n λ n , and then we can show that p (d n ) = λ n , which contradicts Condition 3.1. In addition, we can see that in this situation, both p (0+) and p (d n ) are functions of λ n so that a plays no role in fulfilling Conditions 3.1 and 3.3. Therefore, tuning only one regularization parameter is sufficient and can be a computational advantage of SCAD.
By Propositions 1 and 2, tuning both λ n and a n significantly improves the performance of MCP if d n η p . Specifically, if MCP satisfies Condition 3.3, then p (0+) = λ n > cη p . Then given d n η p , we have d n λ n . However, given a properly tuned a n = o(1) such that d n ≥ a n λ n , we have p (d n ) = 0, which allows MCP to satisfy Condition 3.1.
By Proposition 3, if we set τ = O(1) and only tune the regularization parameter λ, then SICA λn and Log λn require the following condition to achieve the weak oracle property: This condition is similar to the irrepresentable condition of Lasso because when τ = O(1), d n /τ + 1 → 1. Therefore, asymptotically SICA λn and Log λn would perform in a way similar to Lasso. If d n η p , then SICA λn and Log λn cannot simultaneously satisfy Conditions 3.1 and 3.3, even if the irrepresentable condition is satisfied.
By Proposition 4, tuning both λ n and τ n significantly improves the performance of SICA and Log. Specifically, SICA and Log can have satisfactory variable selection performances even if the minimum effect size is much smaller with respect to the dimension of the problem: d n η p . This can be justified by the following arguments. For Log penalty, p (d n ) = p (0+)/(d n /τ n + 1). Even Condition 3.3 requires a large value of p (0+); a small enough τ n can help p (d n ) to satisfy Condition 3.1. SICA has similar properties since it has p (d n ) = p (0+)/(d n /τ n + 1) 2 . Therefore, the implications of Proposition 3 and Proposition 4 for the practical use of SICA and Log penalties would be that we should not treat τ as a constant.
Corollary 1 shows that for a difficult variable selection problem where d n η p , the tuning parameter a n of MCP or τ n of SICA or Log should be on the scale of o(1). Zhang (2010) suggests that a larger tuning parameter a in MCP leads to a bigger bias and less accurate variable selection, a = 1 leads to a singularity problem, and a < 1 leads to a dramatic increase in computational cost. Similarly, Lv and Fan (2009) suggest that for penalized estimates using SICA, the bias decreases to 0 as τ n goes to 0+, but the computational difficulty increases because the maximum concavity goes to infinity. Similar conclusions apply to Log penalty. Although MCP λn,an , SICA λn,τn , and Log λn,τn have similar theoretical properties by Propositions 2 and 4, the following numerical studies show that the computation cost for SICA and Log is more affordable than that of MCP.

Algorithm and tuning parameter selection
We obtain the penalized estimates using SCAD or MCP by the coordinate descent algorithms implemented in the R package ncvreg (Breheny and Huang, 2011). We implement the penalized estimation using SICA and Log penalties by a combination of the coordinate descent algorithm and Local Linear Approximation (LLA) (Zou and Li, 2008). Specifically, we update the estimate of each regression coefficient sequentially (which is the coordinate decent part), and the solution of each coefficient is obtained after applying a local linear approximation. The details can be found in the Supplementary Materials.
We select a particular combination of tuning parameters from the initial tuning parameter pool using the extended BIC Chen, 2008, 2012). As discussed in Chen and Chen (2008), if log p/log n > 0.5, the conventional BIC (Schwarz, 1978) is not consistent. In all the scenarios considered in this paper, log p/ log n > 1. Our empirical studies confirm that in these scenarios the conventional BIC tends to be too liberal, and the extended BIC performs satisfactorily. The extended BIC for the linear model m is: where df m is the degrees of freedom for model m and ς(S dfm ) is the number of the models containing df m covariates. We take the number of the nonzero coefficient estimates in the model m as df m and set ς(S dfm ) = p dfm , the number of combinations of df m covariates chosen from p covariates. In addition, we set 1 − 1/(2log p/log n) as > 1 − 1/(2log p/log n) is suggested in Chen and Chen (2008). The extended BIC for a generalized linear model m is: where df m is the number of nonzero coefficient estimates, and similar to the above 1 − 1/(2log p/log n), as suggested in Chen and Chen (2012).

Simulation
We evaluated those four penalties using a set of simulated data for multiple loci mapping problems. Specifically, the response variable is either a continuous trait (linear regression) or the case/control status (logistic regression), and the covariates are the genotypes of the SNPs. One particular challenge in a multiple loci mapping problem is that nearby SNPs often have correlated genotypes due to linkage disequilibrium, and such correlations may violate the irrepresentable condition, which is needed for the consistency of Lasso. To faithfully reproduce such correlation structure, we directly used genotype data of European Ancestry (EA) samples from a GWAS study of schizophrenia (Shi et al., 2009). The dataset was obtained from NCBI dbGaP, which includes GAIN (Genetic Association Information Network) samples (2,686/2,656: cases/controls, dbGaP Accession: phs000021.v3.p2) and non-GAIN samples (1,217/1,442: cases/controls, dbGaP Accession: phs000167.v1.p1) genotyped by Affymetrix 6.0 SNP arrays with ∼900,000 SNPs.
To compare the performances of those penalty functions, we use two criteria to select the tuning parameters. One is the extended BIC as introduced earlier, and the other is an oracle criterion that uses the knowledge of the true model to select the tuning parameters. Certainly the oracle criterion is not applicable in practice when the true model is unknown. However, in simulation studies, the oracle criterion permits us to evaluate the performance of a penalty function rather than the combined outcome of a penalty function and a tuning parameter selection method. The oracle criterion is defined as follows. Let D be the number of discoveries, i.e., the covariates with non-zero regression coefficient estimates. D = TD + FD, where TD and FD are the number of true discoveries and false discoveries, respectively. The oracle criterion evaluates a model based on the three measures, the false discovery rate FD/D, power TD/s, and the sum of squared error of regression coefficient estimates p j=1 |β j − β 0j | 2 , where β 0j is the true value of β j . The model with the minimum of wt(FD/D − TD/s) + p j=1 |β j − β 0j | 2 is selected, where wt is a weight to balance the number of true/false discoveries and bias. Models selected with larger wt tend to have more true discoveries and fewer false discoveries, but have a larger bias in their regression coefficient estimates.

Linear model
For computational efficiency when there are a large number of simulations, we randomly selected n = 222 samples and 12,656 SNPs with no missing values, and with a minor allele frequency greater than 5% on chromosome 20. The response variables y were simulated by y = Xβ + , where ∼ N (0, I n×n ). We considered 3 situations involving different combinations of p and s: p = 12,656 and s = 12, 16, or 20. Let u T 1 = (0.5, −0.5, 0.4, −0.4). When s = 12, 16, and 20, β 0 are set by repeating u 1 three, four, and five times, respectively. In addition, we considered null situations with s = 0 and p = 12,656.
The tuning parameter grids were chosen as follows: a = (2.1, 2.5, 3.0, 3.7, 4.5, 6.0) for SCAD, a = (1.1, 2.0, 3.0, 4.0, 5.0, 6.0) for MCP, and 6 τ 's for Log and SICA as described in the Supplementary Materials. We also applied Lasso implemented in R/glmnet. For each of these five penalties, 100 λ's uniformly distributed on a log scale were generated as described in the Supplementary Materials.
We used the extended BIC and oracle criteria 10(FD/D − TD/s) + p j=1 |β j − β j0 | 2 to select the tuning parameters. We give the term (FD/D − TD/s) a larger weight of 10 so that the oracle criterion selects the model with the smaller false discovery rate FD/D, greater power TD/s first, and use the sum of squared error of regression coefficient estimates p j=1 |β j − β j0 | 2 as a secondary criterion. Additional simulation results using various values of weight can be found in the Supplementary Materials.
For null simulation situations, all penalties have at most 1 or 2 false discoveries by the extended BIC tuning parameter selection criterion. Table 1 summarizes the simulation results in non-null situations with 12, 16, or 20 important covariates. The folded concave penalties perform better than the Lasso penalty. Among the four folded concave penalties, SICA, Log and MCP have comparable performance, and are better than SCAD when the tuning parameters are selected by the oracle criterion. When the tuning parameters are selected by the extended BIC, SICA and Log have comparable performance, and are better than SCAD and MCP. In additional simulation studies that are presented in the Supplementary Materials, SCAD and MCP with one tuning parameter (λ) have slightly worse performance than the situations with two tuning parameters. In contrast, Log and SICA with one tuning parameter (λ) have much worse performance than the situations with two tuning parameters. Therefore, the extra tuning parameter (a or τ ) gives SCAD and MCP limited additional advantage, but significantly improves the performances of Log and SICA. Table 1 Simulation results for penalized linear regression with (n=222, p = 12,656). The headers indicate the tuning parameter selection criterion (Oracle or the extended BIC) and the numbers in parentheses are the number of important covariates. For each penalty, we present the median of the number of true discoveries, false discoveries (in parentheses), and average bias of the true discoveries (in brackets) across 100 simulations.

Simulation for logistic model
For penalized logistic regression, a larger sample size is needed for simulations with reasonable effect sizes. We randomly selected 10,156 SNPs (with a minor allele frequency larger than 5%) from chromosomes 1 to 22 and X and 750 samples (with a missing values percent smaller than 3%). We simulated the individual SNP effect so that the disease odds ratios are 2.0, corresponding to regression coefficients of 0.7. The binary response variables y were simulated based on the logistic regression model: log{Pr(y = 1)/Pr(y = 0)} = Xβ, where s = 4, 8, or 12. In addition, the null model where s = 0 was simulated. The intercept was set as −2, corresponding to a disease prevalence of 0.12. The initial pool of tuning parameters were generated in the same way as linear regression, and then a particular combination of tuning parameters was selected to minimize the extended BIC, or an oracle criterion 10(FD/D − TD/s) + p j=1 |β j − β j0 | 2 . For the simulation of null models, all penalties have at most 1 or 2 false discoveries by the extended BIC tuning parameter selection criterion. The simulation results of non-null models are shown in Table 2. In general, the results of logistic model simulation have a trend similar to that of linear model simulation. When the oracle criterion is used, all penalties have satisfactory variable selection performances though SICA and Log have a smaller bias on effect size estimation. It can be observed that the models chosen by the oracle criterion are different from those selected by the extended BIC for SCAD and MCP. This is because the models chosen by the oracle criterion tend to have larger biases, Table 2 Simulation results for penalized logistic regression (n=750, p = 10,156). The headers indicate the tuning parameter selection criterion (Oracle or the extended BIC) and the numbers in parentheses are the number of important covariates. For each penalty, we present the median of the number of true discoveries, the number of false discoveries (in parentheses), and the average bias of true discoveries (in brackets) across 100 simulations.

Oracle (4)
Ext BIC (4) Oracle ( Table 3 Running time rounded to minutes per simulation (n=750, s = 12, p = 10,156) for 100 λ's and a fixed a of MCP or τ of Log and SICA.
Finally, Table 3 presents the comparison of the computational burden for MCP, Log and SICA across various values of a and τ , respectively. It can be observed that the computation time of Log and SICA is much less than that of MCP.
In summary, Log and SICA have a smaller bias for the coefficient estimates of important covariates, and therefore, more accurate estimates of the likelihood function. In addition, they have lower computational burden compared to MCP. As a consequence, Log and SICA penalties have advantages in empirical usage.

Real data analysis
We analyzed the data of GWA studies of schizophrenia on European-ancestry samples (2,195 cases vs. 2,617 controls). The missing genotypic data were imputed using BEAGLE software (Browning and Browning, 2007), and 677,163 autosome SNPs with minor allele frequency no less than 5% were selected for the analysis. We included 23 principle components (PCs) of genotype data in the model to account for possible population stratification. First, a univariate logistic regression is conducted on the case-control status for each of the 677,163 SNPs, conditioning on the covariates: age, gender and 23 PCs. Using the resulting 677,163 p-values, we calculated a genomic control factor of 1.0445 (Devlin and Roeder, 1999), implying that there is no strong population stratification not accounted for in our model. The 7,984 SNPs with p-values smaller than 0.01 were selected for the following variable selection. We applied the penalized logistic regression on the 7,984 SNPs and 4,812 samples with the four foldedconcave penalties, while accounting for the effects of age, gender and 23 PCs, by including them as unpenalized covariates.
We applied SCAD with a = 3.7 and MCP with a = 3, the default value of R package ncvreg, and chose to use two tuning parameters for SICA and Log. Using the extended BIC for tuning parameter selection, the penalized logistic regressions with Log and SICA selected 38 and 22 SNPs, respectively (Supplementary Table 1-2). However, penalized logistic regressions with both MCP and SCAD selected the null model since the null model has the lowest value of the extended BIC.
A joint model was fitted by a logistic regression using the 38 SNPs identified by the Log penalty together with age, gender, and 23 PCs to obtain the p-values for the 38 SNPs. The results are illustrated in Figure 2, together with the marginal p-values for the 677,163 SNPs. There are 43 genes within 10kb distance of these 38 SNPs, and among them 21 are in the Database for Annotation, Visualization and Integrated Discovery (DAVID) (Huang, Sherman and Lempicki, 2008). By functional category enrichment analysis at the DAVID website, 16 of the 21 genes are bound by transcription factor FOXO1, with significant enrichment p-value after a Benjamini correction. Recent studies have shown that FOXO1 regulates neuroblastoma differentiation (Mei et al., 2012), which is relevant to schizophrenia. In contrast, we also did the functional category analysis for those genes within 10 kb of the 38 SNPs with the smallest marginal p-values, but no functional category was significantly over-represented.

Conclusion and discussion
Although the methods with folded concave (nonconvex) penalties may not be desirable in terms of computational efficiency, they may lead to nice statistical properties in high dimensional setting (Fan and Li, 2001). To investigate the applicability of the nonconvex penalty functions in challenging high dimensional settings such as genomic studies, we conducted a theoretical analysis on the roles of tuning parameters with respect to the dimension of the problem and minimum effect size. The results suggest that the derivatives of the penalty function around 0 and the minimum effect size are two important quantities to be considered. A good performance of the penalized estimation requires that these two quantities be asymptotically different. Among the four penalties discussed in this paper, tuning one regularization parameter is sufficient to exploit the advantages of SCAD. In contrast, MCP, SICA and Log's performances can be significantly improved if two instead of one (λ) regularization parameter is tuned. These theoretical conclusions are well supported in the empirical studies. In the simulations, we also observe that a penalized estimation using SICA or Log appears to be computationally more efficient than using MCP. The good performance of tuning two regularization parameters comes with the cost of added computational time. In real data analysis, one needs to judge the difficulty of the penalization problem in terms of effect size and dimensionality in order to choose whether one or two regularization parameters are needed, and the theoretical results of this paper can guide such choices. These theoretical results are based on the sufficient conditions of the weak oracle property, and thus they could be refined if the sufficient and necessary conditions of the weak oracle property are available.
For the future work, it will be of great interest to study if the regularized methods using those four nonconvex penalties achieve feature selection consistency under the necessary and sufficient condition derived by Shen et al. (2013). Furthermore, Shen et al. (2013) have demonstrated that constrained approaches may offer both theoretical and computational advantages. Therefore, our following study may derive the constrained counterpart approaches for Log or SICA to enhance better empirical performances. In addition, Wang, Kim and Li (2013) have proposed an calibrated CCCP algorithm that produces a consistent solution path which contains the oracle estimator with probability approaching one. They also proposed a high-dimensional BIC criterion and showed that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. Take the penalty SCAD at a fixed tuning value of a = 3.7 for example. The calibrated CCCP algorithm introduces another parameter τ , and then two convex minimization problems using τ λ and λ are solved sequentially. For penalties that are sufficient to use one tuning parameter such as SCAD, the calibrated CCCP algorithm is ready to be applied. However, for the penalties required the usage of both of the tuning parameters such as Log penalty, it warrants future research on how to calibrate the two tuning parameters simultaneously in an efficient way. After the algorithm has been built, it will be interesting to see how different high-dimensional BIC criteria may influence the empirical performances and how the new devised methods perform in the genetic studies.
The following Conditions 4.1-4.4 are for the design matrix X, and they are essentially the same as the corresponding conditions from Fan and Lv (2011). We first define a few notations used in the following regularity conditions. L ∞ norm of a matrix is the maximum of the L 1 norm of each row. λ max ()/λ min () denotes the maximum/minimum eigen-value of a symmetric matrix, respectively. Denote a neighborhood of the non-zero coefficients as N 0 = {δ ∈ R s : δ − β 01 ∞ ≤ d n }.