Selecting massive variables using an iterated conditional modes/medians algorithm

Abstract: Empirical Bayes methods are well suited to selecting massive variables, which may be inter-connected following certain hierarchical structures, because of three attributes: they take prior information on model parameters, they allow data-driven hyperparameter values, and they are free of tuning parameters. We propose an iterated conditional modes/medians (ICM/M) algorithm to implement empirical Bayes selection of massive variables while incorporating sparsity or more complicated a priori information. The iterative conditional modes are employed to obtain data-driven estimates of hyperparameters, and the iterative conditional medians are used to estimate the model coefficients and therefore enable the selection of massive variables. The ICM/M algorithm is computationally fast, and easily extends the empirical Bayes thresholding, which is adaptive to parameter sparsity, to complex data. Empirical studies suggest competitive performance of the proposed method, even in the simple case of selecting massive regression predictors.


Introduction
Selecting variables in problems with a large number of predictors is a challenging yet critical task in analyzing high-dimensional data. Because high-dimensional data usually come with relatively small sample sizes, successful variable selection demands appropriate incorporation of a priori information. A fundamental piece of such information is that only a few of the variables are significant and should be included in the underlying models, leading to the fundamental assumption of sparsity in variable selection (Fan and Li, 2001). Many methods have been developed to take full advantage of this sparsity assumption, mostly built upon thresholding procedures (Donoho and Johnstone, 1994); see Tibshirani (1996), Fan and Li (2001), and others.
Recently, many efforts have been devoted to selecting variables from massive candidates by incorporating rich a priori information accumulated from historical research or practice. For example, Yuan and Lin (2006) defined group-wise norms for grouped variables. For graph-structured variables, penalized approaches based on Laplacian matrices and $L_\gamma$ norms have also been proposed (e.g., Pan et al., 2010). Li and Zhang (2010) and Stingo et al. (2011) both employed Bayesian approaches to incorporate structural information on the variables, both formulating Ising priors.
Markov chain Monte Carlo (MCMC) algorithms have been commonly employed to implement Bayesian variable selection; see George and McCulloch (1993), Carlin and Chib (1995), Li and Zhang (2010), Stingo et al. (2011), and others. However, MCMC algorithms are computationally intensive, and it may be difficult to assign appropriate hyperparameters. On the other hand, penalty-based variable selection usually demands careful selection of certain tuning parameters (e.g., Fan and Li, 2001; Pan et al., 2010; Tibshirani, 1996; Yuan and Lin, 2006), which is challenging in high-dimensional data analysis. Although cross-validation has been widely suggested for choosing tuning parameters, it may be infeasible in certain situations, in particular when many variables rarely vary. Recently, Sun and Zhang (2012) proposed the scaled sparse linear regression, which attaches the tuning parameter to the estimable noise level.
Empirical Bayes methods can be advantageous in high-dimensional data analysis because they need no tuning parameters. They also allow incorporation of a priori information while modeling the uncertainty of such prior information through hyperparameters. For example, Johnstone and Silverman (2004) modeled sparse normal means using a spike-and-slab prior. The mixing rate of the Dirac spike and the slab is taken as a hyperparameter to achieve data-driven thresholding, and the resultant empirical Bayes estimates are therefore adaptive to the sparsity of the high-dimensional parameters. As demonstrated by Johnstone and Silverman (2004), this empirical Bayes method can work better than traditional thresholding estimators. One important contribution of this paper is to develop a new algorithm which makes it possible to construct such empirical Bayes variable selection for complex data.
We propose an iterated conditional modes/medians (ICM/M) algorithm for easy implementation and fast computation of empirical Bayes variable selection (EBVS). Similar to the iterated conditional modes of Besag (1986), iterative conditional modes are used to optimize the hyperparameters and all parameters other than the regression coefficients. Iterative conditional medians are used to enforce variable selection. As shown in Johnstone and Silverman (2004) and Zhang et al. (2010), when mixture priors are utilized, posterior medians lead to thresholding rules and thus help screen out small and insignificant variables. Furthermore, ICM/M makes it easy to incorporate complicated priors for the purpose of selecting variables out of massive structured candidates. Taking the Ising prior as an example (Li and Zhang, 2010), we illustrate this strength of ICM/M.

The rest of this paper is organized as follows. In the next section, we propose the ICM/M algorithm for empirical Bayes variable selection (EBVS). We also explore controlling the false discovery rate (FDR) using conditional posterior probabilities. In Section 3 we implement the ICM/M algorithm for high-dimensional linear regression models, assuming only that the non-zero regression coefficients are few. In Section 4, the ICM/M algorithm is extended to incorporate a priori information on graphical relationships between the predictors. Simulation studies are carried out in both Sections 3 and 4 to evaluate the performance of the corresponding ICM/M algorithms. An application to a real dataset from a genome-wide association study (GWAS) is presented in Section 5. We conclude the paper with a discussion in Section 6.
In the rest of this paper, the $j$-th component of a vector parameter, say $\beta$, is denoted by $\beta_j$; $\beta_{-j}$ denotes all components of $\beta$ except the $j$-th; and $\beta_{j:k}$ includes the components of $\beta$ from $\beta_j$ to $\beta_k$. A parameter with a parenthesized superscript, say $\beta^{(k)}$, indicates an estimate from the $k$-th iteration.

Iterated conditional modes/medians
Consider a general variable selection problem with a likelihood function given by
$$f(Y \mid X, \beta, \phi),$$
where $Y$ is an $n \times 1$ random vector, $X$ is an $n \times p$ matrix containing values of $p$ variables, $\beta$ is a $p \times 1$ parameter vector whose $j$-th component $\beta_j$ represents the effect of the $j$-th variable in the model, and $\phi$ includes all other auxiliary parameters. A typical variable selection task is to identify the non-zero components of $\beta$, that is, to select important variables out of the $p$ candidates. For convenience, define $\tau_j = I\{\beta_j \neq 0\}$, which indicates whether the $j$-th variable should be selected into the model, and denote $\tau = (\tau_1, \tau_2, \ldots, \tau_p)^t$. Here we consider an empirical Bayes variable selection, which assumes priors on $\beta$, $\phi$, and the hyperparameters, where $\psi = (\psi_1^t, \psi_2^t, \psi_3^t)^t$ includes all hyperparameters.

To avoid high-dimensional integrals, we cycle through coordinates to obtain the estimate of each component of $(\beta, \phi, \psi)$ iteratively,
$$\hat\beta_j = \hat\beta_j(\hat\beta_{-j}, \hat\phi, \hat\psi, Y, X), \qquad \hat\phi_j = \hat\phi_j(\hat\beta, \hat\phi_{-j}, \hat\psi, Y, X), \qquad \hat\psi_j = \hat\psi_j(\hat\beta, \hat\phi, \hat\psi_{-j}, Y, X).$$
Indeed, with properly chosen priors on $\phi$ and $\psi$, both $\hat\phi_j$ and $\hat\psi_j$ can be obtained by maximizing the fully conditional posteriors, i.e.,
$$\hat\phi_j = \operatorname{mode}(\phi_j \mid Y, X, \hat\beta, \hat\phi_{-j}, \hat\psi), \qquad \hat\psi_j = \operatorname{mode}(\psi_j \mid Y, X, \hat\beta, \hat\phi, \hat\psi_{-j}). \quad (2.4)$$
When each $\hat\beta_j$ is also obtained by maximizing its fully conditional posterior, this reduces to the iterated conditional modes (ICM) algorithm of Besag (1986). However, calculating the conditional mode of $\beta_j$ is either infeasible or practically undesirable (as it lacks a variable selection function). Indeed, Bayesian or empirical Bayes variable selection for high-dimensional data usually places a spike-and-slab prior on each $\beta_j$ (e.g., Ishwaran and Rao, 2005; Mitchell and Beauchamp, 1988), which induces a spike-and-slab posterior for each $\beta_j$. With a Dirac spike, it is infeasible to obtain the mode of such a spike-and-slab posterior. Its median, however, can be exactly zero, which allows us to select the median probability model as suggested by Barbieri and Berger (2004). Henceforth, following Johnstone and Silverman (2004), we construct $\hat\beta_j = \hat\beta_j(\hat\beta_{-j}, \hat\phi, \hat\psi, Y, X)$ as the median of the fully conditional posterior, i.e.,
$$\hat\beta_j = \operatorname{median}(\beta_j \mid Y, X, \hat\beta_{-j}, \hat\phi, \hat\psi). \quad (2.5)$$
With iterative conditional medians for the $\beta_j$, and conditional modes for the $\phi_j$ and $\psi_j$, each a Bayesian update of one component conditional on all other components, we hereafter propose the iterated conditional modes/medians (ICM/M) algorithm for implementing empirical Bayes variable selection. As shown later, the ICM/M algorithm allows an easy extension of the (generalized) empirical Bayes thresholding methods of Johnstone and Silverman (2004) and Zhang et al. (2010) to dependent data. With a consistent initial point $(\hat\beta, \hat\phi, \hat\psi)$, the cycling Bayesian updates of this algorithm lead to a well-defined estimate $(\hat\beta, \hat\phi, \hat\psi)$.
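To make the cycling explicit, here is a minimal sketch of the ICM/M loop in Python. The three update functions passed as arguments stand in for the model-specific conditional median (2.5) and conditional modes (2.4); their names and signatures are our own illustration, not part of the algorithm's specification.

```python
import numpy as np

def icm_m(Y, X, beta, phi, psi, median_beta_j, mode_phi, mode_psi,
          max_iter=100, tol=1e-8):
    """Generic ICM/M sweep: cycle through coordinates, updating each
    regression coefficient by its fully conditional posterior median (2.5)
    and the remaining (hyper)parameters by conditional modes (2.4)."""
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(beta.size):
            beta[j] = median_beta_j(j, Y, X, beta, phi, psi)
        phi = mode_phi(Y, X, beta, phi, psi)
        psi = mode_psi(Y, X, beta, phi, psi)
        if np.max(np.abs(beta - beta_old)) < tol:
            break  # coefficient estimates have stabilized
    return beta, phi, psi
```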

Evaluation of variable importance
When proposing a statistical model, we are primarily interested in evaluating the importance of individual variables besides the model's predictive ability. For example, the objective of high-dimensional data analysis is often to identify a list of $J$ predictors that are most important or significant among the $p$ candidates. This is common practice in biomedical research using high-throughput biotechnologies, where all markers are ranked and a short list of candidates is selected for follow-up studies.
In the Bayesian approach, inference on the importance of each variable can be made through its marginal posterior probability $P(\beta_j \neq 0 \mid Y, X)$. However, this quantity involves high-dimensional integrals which are difficult to calculate even for moderate $p$. Furthermore, the marginal posterior probability may not be meaningful when the predictors are highly correlated (which usually occurs in a large $p$ small $n$ dataset). For example, suppose predictors $X_1$ and $X_2$ are linearly dependent and both are associated with the response variable. The marginal posterior probability of $X_1$ being included in the model might be very high and dominate the marginal posterior probability of $X_2$ being included.
We propose a local posterior probability to evaluate the importance of a variable. That is, conditional on the estimates $(\hat\beta_{-j}, \hat\phi, \hat\psi)$ at the optimal point obtained from empirical Bayes variable selection through the ICM/M algorithm, the importance of the $j$-th variable is evaluated by its fully conditional posterior probability,
$$\zeta_j = P(\beta_j \neq 0 \mid Y, X, \hat\beta_{-j}, \hat\phi, \hat\psi). \quad (2.6)$$
Such a probability has a closed form and can be easily computed. We will show later, in simulation studies, that the local posterior probability is a good indicator of variable importance.

Another challenging question is how long the list of important predictors should be. In many papers in the literature, see Brenner et al. (2001) and Syed and Hecht (1997) for example, the number of important variables reported is arbitrary; for instance, some laboratories may simply be interested in the top ten genes. Typically, however, one wants to create the list so that type-I and type-II errors are controlled (Dudoit et al., 2003). False discovery rate (FDR) control is widely used in high-dimensional data analysis since it is less conservative, and has more power, than controlling the family-wise error rate (Benjamini and Hochberg, 1995).
With the local posterior probabilities $\zeta_j$ and the assumption that the true $\beta$ is known, we can report a list containing the predictors whose local posterior probability exceeds some bound $\kappa$, $0 \le \kappa < 1$. Given the data, the true FDR can be computed as
$$\mathrm{FDR}(\kappa) = \frac{\sum_{j=1}^p I\{\tau_j = 0\}\, I\{\zeta_j > \kappa\}}{\sum_{j=1}^p I\{\zeta_j > \kappa\}}. \quad (2.7)$$
Newton et al. (2004) proposed estimating the FDR given the data, in a Bayesian scheme, by
$$\widehat{\mathrm{FDR}}(\kappa) = \frac{\sum_{j=1}^p (1 - \zeta_j)\, I\{\zeta_j > \kappa\}}{\sum_{j=1}^p I\{\zeta_j > \kappa\}}. \quad (2.8)$$
Therefore we can select the predictors to report by controlling $\widehat{\mathrm{FDR}}(\kappa)$ at a desired level.
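For illustration, a minimal sketch of how (2.8) can be used to choose the threshold $\kappa$, assuming `zeta` is a numpy array holding the local posterior probabilities $\zeta_j$ (the function names are ours):

```python
import numpy as np

def fdr_hat(zeta, kappa):
    """Bayesian FDR estimate (2.8) of Newton et al. (2004): average
    posterior null probability among the reported predictors."""
    reported = zeta > kappa
    if not reported.any():
        return 0.0
    return np.sum(1.0 - zeta[reported]) / reported.sum()

def smallest_kappa(zeta, level=0.1):
    """Smallest threshold kappa whose estimated FDR stays below `level`;
    candidate thresholds are the observed zeta values themselves."""
    for kappa in np.sort(np.unique(zeta)):
        if fdr_hat(zeta, kappa) <= level:
            return kappa
    return 1.0
```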

Selection of sparse variables
Here we consider empirical Bayes variable selection for the following regression model with high-dimensional data,
$$Y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n). \quad (3.1)$$
Further assume that the response is centered and the predictors are standardized, that is, $Y^t 1_n = 0$, $X^t 1_n = 0_p$, and $X_j^t X_j = n$ for each $j$. Assuming all model parameters except $\beta_j$ are known, $\beta_j$ has a sufficient statistic
$$z_j = \frac{1}{n} X_j^t \Big( Y - \sum_{k \neq j} X_k \beta_k \Big) \sim N(\beta_j, \sigma^2/n). \quad (3.2)$$
To capture the sparsity of the regression coefficients, we put an independent prior on each scaled $\beta_j$ as follows,
$$\beta_j \mid \sigma \sim (1 - \omega)\, \delta_0(\cdot) + \omega\, \gamma(\cdot \mid \sigma), \quad (3.3)$$
where $\delta_0(\cdot)$ is a Dirac delta function at zero and $\gamma(\cdot \mid \sigma)$ is a probability density function. This mixture prior implies that $\beta_j$ is zero with probability $1 - \omega$ and is drawn from the non-zero part of the prior, $\gamma(\cdot \mid \sigma)$, with probability $\omega$. As suggested by Johnstone and Silverman (2004), a heavy-tailed prior such as the Laplace distribution is a good choice for $\gamma(\cdot \mid \sigma)$, that is,
$$\gamma(\beta_j \mid \sigma) = \frac{\alpha}{2\sigma} \exp\Big( -\frac{\alpha |\beta_j|}{\sigma} \Big),$$
where $\alpha > 0$ is a scale parameter. We take Jeffreys' prior on $\sigma$, $\pi(\sigma) \propto 1/\sigma$ (Jeffreys, 1946). Note that there is a connection between Laplace priors and the lasso: setting $\omega = 1$ in (2.4) and (3.3) leads to a lasso estimate, with $\alpha$ related to the lasso tuning parameter; see Tibshirani (1996) for details. Our empirical Bayes variable selection allows a data-driven optimal choice of $\omega$. Indeed, a data-driven optimal $\alpha$ can also be obtained through the conditional mode suggested by (2.4), which avoids the tuning-parameter issue of the lasso (the lasso usually relies on cross-validation to choose an optimal tuning parameter). Johnstone and Silverman (2004) also suggested the default value $\alpha = 0.5$, which works well in general.
To obtain $\hat\beta_j^{(k+1)}$ given $(\hat\sigma^{(k)}, \hat\omega^{(k)})$, we exploit the sufficient statistic of $\beta_j$ in (3.2), which makes $\hat\beta_j^{(k+1)}$ easy to calculate as stated below. Indeed, $\hat\beta_j^{(k+1)}$ is an empirical Bayes thresholding estimator, as shown in Johnstone and Silverman (2004).
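For intuition, here is a numerical sketch of this conditional-median update for one coordinate, with $z \sim N(\beta_j, s^2)$ the sufficient statistic from (3.2) and the spike-and-Laplace prior written in unscaled form (the scaling by $\sigma$ in (3.3) can be absorbed into $z$ and $s$). Johnstone and Silverman (2004) give closed-form expressions; we simply integrate on a grid.

```python
import numpy as np
from scipy.stats import norm

def post_median(z, s, omega, alpha, half_width=10.0, n_grid=20001):
    """Posterior median of beta when z ~ N(beta, s^2) and beta has the
    mixture prior (1 - omega)*delta_0 + omega*Laplace(alpha).
    Grid-based numerical version of the empirical Bayes thresholding rule."""
    b = np.linspace(z - half_width * s, z + half_width * s, n_grid)
    # unnormalized continuous part of the posterior: slab density x likelihood
    dens = omega * 0.5 * alpha * np.exp(-alpha * np.abs(b)) * norm.pdf(z, b, s)
    spike = (1.0 - omega) * norm.pdf(z, 0.0, s)   # mass of the atom at zero
    # cumulative posterior, adding the atom once the grid passes zero
    cont = np.concatenate([[0.0], np.cumsum((dens[1:] + dens[:-1]) / 2 * np.diff(b))])
    cdf = (cont + spike * (b >= 0)) / (cont[-1] + spike)
    i = np.searchsorted(cdf, 0.5)
    # if the jump at zero straddles 1/2, the median is exactly zero
    i0 = np.searchsorted(b, 0.0)
    if 0 < i0 < b.size and cdf[i0 - 1] < 0.5 <= cdf[i0]:
        return 0.0
    return b[i]
```

Within the ICM/M loop, the update for $\hat\beta_j^{(k+1)}$ would then amount to computing $z_j$ from the current residuals and calling a routine like this with $s = \hat\sigma^{(k)}/\sqrt{n}$, $\omega = \hat\omega^{(k)}$, and $\alpha = 0.5$.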

Simulation studies
To evaluate the performance of our proposed empirical Bayes variable selection (EBVS) via the ICM/M algorithm, we simulated data from model (3.1) with large $p$ and small $n$, i.e., $p = 1{,}000$ and $n = 100$. There are a total of 20 non-zero regression coefficients, namely $\beta_1 = \cdots = \beta_{10} = 2$ and $\beta_{101} = \cdots = \beta_{110} = 1$. The error standard deviation $\sigma$ is set to one. The predictors are partitioned into ten blocks, each including 100 predictors which are serially correlated at the same correlation coefficient $\rho$. We simulated 100 datasets for each $\rho$ in $\{0, 0.1, 0.2, \ldots, 0.9\}$.
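As a concrete illustration of this design, the following sketch generates one such dataset, reading "serially correlated" as an AR(1) scheme within each block (the same AR(1) design is used in Section 4); the function name and seed handling are ours:

```python
import numpy as np

def simulate_block_ar1(n=100, p=1000, block=100, rho=0.5, seed=0):
    """Simulate predictors in independent blocks; within each block the
    columns follow an AR(1) scheme with lag-one correlation rho."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, p))
    for start in range(0, p, block):
        X[:, start] = rng.standard_normal(n)
        for j in range(start + 1, start + block):
            X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    return X

# 20 non-zero coefficients as in the text; sigma = 1
beta = np.zeros(1000)
beta[:10] = 2.0          # beta_1, ..., beta_10
beta[100:110] = 1.0      # beta_101, ..., beta_110
X = simulate_block_ar1(rho=0.5)
Y = X @ beta + np.random.default_rng(1).standard_normal(100)
```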
EBVS was compared with two popular approaches, i.e., the lasso of Tibshirani (1996) and the adaptive lasso of Zou (2006). The scaled lasso of Sun and Zhang (2012) was also applied to the simulated datasets. Ten-fold cross-validation was used to choose the optimal tuning parameters for the lasso and adaptive lasso respectively. The median values of prediction error, false positive rate, and false negative rate were reported for each approach based on the 100 simulated datasets.
As shown in Figure 1, EBVS performs much better than the other three methods in terms of prediction error. In particular, when $\rho \ge 0.3$, EBVS consistently reported a median prediction error of approximately 1.5. Comparing lasso and adaptive lasso, adaptive lasso has a smaller prediction error when $\rho < 0.3$, whereas lasso has a smaller prediction error when $\rho > 0.3$.
It is known that lasso can select variables inconsistently under certain conditions, and adaptive lasso was proposed to address this issue (Zou, 2006). Figure 2 shows that lasso has very high false positive rates (more than 50%), and that adaptive lasso significantly reduces the false positive rates, especially when $\rho \ge 0.2$. Indeed, lasso has much larger false positive rates than all the other methods. It is interesting to observe that EBVS has zero false positive rates except when $\rho = 0.5$ and $\rho = 0.9$. All methods have very low false negative rates.
Recently, Meinshausen et al. (2009) proposed a multi-sample-split method to construct p-values for high-dimensional regression, especially when the number of predictors is larger than the sample size. Here we applied this method, with a total of 50 sample splits, as well as EBVS, to each simulated dataset, and compared its performance with that of $\zeta_j$ defined in (2.6). For each predictor, Figure 3 plots the median of $-\log_{10}(1 - \zeta_j)$, truncated at 10, against the median of $-\log_{10}(\text{p-value})$ across the 100 datasets simulated from the regression model with $\rho = 0.5$ and $\rho = 0.9$ respectively. For either model, $\zeta_j$ can clearly distinguish true positives (i.e., predictors with $\tau_j = 1$) from true negatives (i.e., predictors with $\tau_j = 0$). However, as shown in Figure 3(b) where $\rho = 0.9$, there is no clear cutoff of p-values distinguishing true positives from true negatives. We also observed that $\mathrm{FDR}(\kappa)$ is well approximated by $\widehat{\mathrm{FDR}}(\kappa)$ (results not shown), with both dropping sharply to zero for $\kappa > 0.05$. We can therefore select $\kappa$ to threshold the $\zeta_j$ for the purpose of controlling FDR.

Selection of structured variables
When information on structural relationships among predictors is available, it is unreasonable to assume an independent prior on each $\beta_j$, $j = 1, \ldots, p$, as described in the previous section. Instead, we introduce an indicator variable $\tau_j = I\{\beta_j \neq 0\}$, and let the prior distribution of $\beta$ depend on $\tau = (\tau_1, \ldots, \tau_p)^t$. Specifically, given $\tau_j$, $\beta_j$ has the mixture distribution
$$\beta_j \mid \tau_j \sim (1 - \tau_j)\, \delta_0(\beta_j) + \tau_j\, \gamma(\beta_j), \quad (4.1)$$
where $\gamma(\cdot)$ is the Laplace density with scale parameter $\alpha$.
The relationships among predictors can be represented by an undirected graph $G = (V, E)$ comprising a set $V$ of vertices and a set $E$ of edges. Each node is associated with a binary random variable $\tau_j \in \{0, 1\}$, and there is an edge between two nodes if the two covariates are correlated. The following Ising model (Onsager, 1943) is employed to model the a priori information on $\tau$,
$$\pi(\tau \mid a, b) = \frac{1}{Z(a, b)} \exp\Big\{ a \sum_{j=1}^p \tau_j + b \sum_{\langle i,j \rangle \in E} \tau_i \tau_j \Big\}, \quad (4.2)$$
where $a$ and $b$ are two parameters, and $Z(a, b)$ is the normalizing constant (partition function), obtained by summing the exponential term over all $2^p$ configurations of $\tau$. The parameter $b$ corresponds to the "energies" associated with interactions between nearest neighboring nodes. When $b > 0$, the interaction is called ferromagnetic, i.e., neighboring $\tau_i$ and $\tau_j$ tend to take the same value. When $b < 0$, the interaction is called antiferromagnetic, i.e., neighboring $\tau_i$ and $\tau_j$ tend to take different values. When $b = 0$, there is no interaction, and the prior reduces to independent and identically distributed Bernoulli variables. The value of $a + b$ indicates the preferred value of each $\tau_i$: if $a + b > 0$, $\tau_i$ tends to be one; if $a + b < 0$, $\tau_i$ tends to be zero.
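Under (4.2), the full conditional of a single indicator is logistic in the sum of its neighbor values, which is what both the ICM/M updates and the pseudo-likelihood of the next subsection rely on. A minimal sketch, where `neighbors` is a hypothetical adjacency list (one list of neighbor indices per node):

```python
import numpy as np

def ising_conditional(j, tau, neighbors, a, b):
    """P(tau_j = 1 | tau_{-j}) under the Ising prior (4.2):
    the conditional log-odds equal a + b * (sum of tau over j's neighbors)."""
    s = sum(tau[k] for k in neighbors[j])
    return 1.0 / (1.0 + np.exp(-(a + b * s)))
```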

The algorithm
Next, we describe the ICM/M algorithm for empirical Bayes variable selection with an Ising prior (abbreviated as EBVS$_i$), which incorporates the structure of the predictors into the modeling process. We assume the Ising prior is a homogeneous Boltzmann model, but the algorithm can be extended to more general priors. With $\alpha = 0.5$, the ICM/M algorithm described in (2.4) and (2.5) proceeds with $\phi = \sigma$ and $\psi = (\omega, a, b)$.
For the hyperparameters $a$ and $b$, we calculate the conditional mode of $(a, b)$ jointly. Conceptually, we want $(\hat a^{(k+1)}, \hat b^{(k+1)})$ to maximize the prior likelihood $P(\tau)$ in (4.2). However, this requires computing $Z(a, b)$ by summing over all $2^p$ configurations of $\tau$, which demands intensive computation, especially for large $p$. Many methods have been proposed for approximate calculation; see Geyer (1991), Geyer and Thompson (1992), Zhou and Schmidler (2009), and others. Here we consider the composite likelihood approach (Varin et al., 2011), which is widely used when the actual likelihood is not easy to compute. In particular, $(\hat a^{(k+1)}, \hat b^{(k+1)})$ is obtained by maximizing a pseudo-likelihood function, a special type of composite conditional likelihood and a natural choice for a graphical model (Besag, 1975).
With the Ising prior on $\tau^{(k)}$, the pseudo-likelihood of $(a, b)$ is
$$L_p(a, b) = \prod_{i=1}^p P\big(\tau_i^{(k)} \mid \tau_{-i}^{(k)}, a, b\big) = \prod_{i=1}^p \frac{\exp\big\{ \tau_i^{(k)} \big( a + b \sum_{j: \langle i,j \rangle \in E} \tau_j^{(k)} \big) \big\}}{1 + \exp\big\{ a + b \sum_{j: \langle i,j \rangle \in E} \tau_j^{(k)} \big\}}.$$
The surface of such a pseudo-likelihood is much smoother than that of the joint likelihood and is therefore easier to maximize (Liang and Yu, 2003). The resultant estimator $(\hat a^{(k+1)}, \hat b^{(k+1)})$, obtained by maximizing $L_p(a, b)$, is biased for a finite sample size, but it is asymptotically unbiased and consistent (Guyon and Kunsch, 1992; Mase, 2000; Varin et al., 2011). The implementation of the pseudo-likelihood method is fast and straightforward, which makes it feasible for large-scale graphs. Indeed, $\hat a^{(k+1)}$ and $\hat b^{(k+1)}$ are the intercept and slope, respectively, of the logistic regression of the binary variable $\tau_i^{(k)}$ on $\sum_{j: \langle i,j \rangle \in E} \tau_j^{(k)}$, $i = 1, \ldots, p$.
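Maximizing $L_p(a, b)$ therefore reduces to fitting an ordinary logistic regression. A minimal sketch using statsmodels, with `tau` a 0/1 numpy array and `neighbors` the adjacency list as above; it assumes both values occur in $\tau$ so the fit is well defined:

```python
import numpy as np
import statsmodels.api as sm

def fit_ising_pseudolikelihood(tau, neighbors):
    """Maximum pseudo-likelihood estimate of (a, b): regress each tau_i
    on the sum of its neighbors' current values via logistic regression."""
    s = np.array([sum(tau[k] for k in nbrs) for nbrs in neighbors])
    design = sm.add_constant(s)   # intercept column -> a, neighbor sum -> b
    fit = sm.Logit(np.asarray(tau), design).fit(disp=0)
    a_hat, b_hat = fit.params
    return a_hat, b_hat
```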
As shown in the previous sections, the conditional median $\hat\beta_j^{(k+1)}$ can be constructed on the basis of the following proposition.
The first indicator variable $\tau_1$ is sampled from Bernoulli(0.5). The error variance is fixed at one. For each individual, the predictors were simulated from an AR(1) process with correlation coefficient $\rho$ ranging from 0 to 0.9 in steps of 0.1. The median prediction errors of all methods are shown in Figure 4. EBVS performed slightly better than adaptive lasso, and both performed much better than lasso and scaled lasso. Lasso, adaptive lasso, scaled lasso, and EBVS all presented varying prediction errors as $\rho$ goes from 0 to 0.9. However, the prediction errors of EBVS$_i$ are rather stable across values of $\rho$, and are much smaller than those of the other four methods.
Shown in Figure 5 are the false positive and false negative rates of the different methods. Not surprisingly, lasso has false positive rates over 70%, much higher than those of the other methods. Adaptive lasso significantly reduces the false positive rates, though they remain above 10%. On the other hand, the false positive rates of both EBVS and EBVS$_i$ are less than 10%. Indeed, EBVS reported false positive rates of zero for all values of $\rho$, and EBVS$_i$ reported false positive rates of zero when $\rho < 0.6$ and of 0.1 when $\rho \ge 0.6$. However, EBVS$_i$ reported lower false negative rates than EBVS. EBVS thus tends to select only correct positives, but includes fewer true positives in the final model than EBVS$_i$ does. We conjecture that, when covariates are highly correlated, EBVS$_i$ tends to select more variables into the model; in particular, if one covariate is selected into the model, EBVS$_i$ tends to include its highly correlated neighboring predictors as well.

Figure 6 shows $\widehat{\mathrm{FDR}}(\kappa)$ and $\mathrm{FDR}(\kappa)$ of EBVS$_i$ for the models with $\rho = 0.5$ and $\rho = 0.9$ respectively (we also observed that $\widehat{\mathrm{FDR}}(\kappa)$ of EBVS is similar to that of EBVS$_i$; results not shown). Overall, the estimate $\widehat{\mathrm{FDR}}(\kappa)$ dominates $\mathrm{FDR}(\kappa)$, i.e., the true FDR. We will therefore be conservative in selecting variables when controlling FDR using $\widehat{\mathrm{FDR}}(\kappa)$. For example, to list important predictors while controlling FDR at 0.1 for the model with $\rho = 0.9$, $\kappa$ could be set around 0.1 based on the true $\mathrm{FDR}(\kappa)$; based on $\widehat{\mathrm{FDR}}(\kappa)$, however, one would set $\kappa$ around 0.4, which corresponds to a true FDR as low as zero.

Plotted in Figure 7 are the p-values calculated using the multi-sample-split method (Meinshausen et al., 2009) against $\zeta_j$ for each predictor. For both EBVS and EBVS$_i$, $\zeta_j$ quantified variable importance better than the p-values in terms of distinguishing true positives from true negatives. Overall, EBVS$_i$ outperforms EBVS in that it provides larger values of $\zeta_j$ for true positives, while both EBVS and EBVS$_i$ keep $\zeta_j$ close to zero for true negatives. Indeed, EBVS produced $\zeta_j$ close to 0 for several true positives, while EBVS$_i$ produced larger values of $\zeta_j$ for these same true positives. We thus conclude empirically that, by incorporating a priori information, EBVS$_i$ has more power to detect true positives than EBVS.
Case II. Pathway information

To mimic a real genome-wide association study (GWAS), we took values of single nucleotide polymorphisms (SNPs) in the Framingham dataset (Cupples et al., 2007) to generate $X$ in model (3.1). Specifically, 24 human regulatory pathways, involving 1,502 genes, were retrieved from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. For each gene involved in these pathways, at most two SNPs listed in the Framingham dataset were randomly selected from those residing in the genetic region; if no SNP could be found within the genetic region, the nearest neighboring SNP was identified instead. A total of 1,782 SNPs were selected. We first identified 952 unrelated individuals in the Framingham dataset and used them to generate the predictor values of the training dataset. For the rest of the Framingham dataset, we identified 653 unrelated individuals to generate the predictor values of the test dataset. Five pathways were assumed to be associated with the phenotype $Y$; that is, all 311 SNPs involved in these five pathways were assumed to have non-zero regression coefficients, randomly sampled from a uniform distribution over [0.5, 3]. With the error variance set to five, a total of 100 datasets were simulated.
As shown in Table 1, lasso has relatively low prediction error; however, its median false positive rate is as high as 69%, much higher than the others. Adaptive lasso (LASSO$_a$), on the other hand, has very large prediction error, but its false positive rate is much smaller than that of lasso. EBVS presented the lowest false positive rate among all the methods, and its false negative rate is also smaller than that of adaptive lasso. Indeed, with initial values obtained from lasso, EBVS reduces the false positive rate of lasso by more than 98%. By incorporating the pathway information through an Ising prior on $\tau$, EBVS$_i$ reported the lowest prediction error; furthermore, EBVS$_i$ compromised between lasso, adaptive lasso, and EBVS to balance the false positive and false negative rates. Scaled lasso (LASSO$_s$) performed unstably on our simulated datasets, selecting more than 800 positives in seven of them; the Table 1 results for scaled lasso exclude these seven datasets, on which scaled lasso reported a median prediction error of $2.45 \times 10^{10}$, a false positive rate of .7059, and a false negative rate of .0043.

Real data analysis
The empirical Bayes variable selection using the ICM/M algorithm was applied to the Framingham dataset (Cupples et al., 2007) to find SNPs associated with vitamin D level. The SNPs were preprocessed following common GWAS criteria: missingness per individual and missingness per SNP both less than 10%; minor allele frequency (MAF) no less than 5%; and each SNP passing the Hardy-Weinberg test at significance level 0.001. This resulted in a total of 370,773 SNPs, 84,834 of which resided in 2,167 genetic regions involving 112 pathways relevant to vitamin D level. We pre-screened the SNPs by selecting those with univariate-test p-values smaller than 0.1, ending up with 7,824 SNPs for the following analysis. As in Section 4.2, a training dataset and a test dataset were constructed with 952 and 519 unrelated individuals respectively. The response variable is the log-transformed vitamin D level.

We applied lasso, adaptive lasso, scaled lasso, EBVS, and EBVS$_i$ to the training dataset, and calculated prediction errors on the test dataset. The results are reported in Table 2. While identifying many more SNPs than all the other methods, lasso reported the largest prediction error. Apart from scaled lasso (LASSO$_s$), EBVS has the smallest prediction error, though it identified only one SNP. Adaptive lasso (LASSO$_a$) and EBVS$_i$ each identified five SNPs, and their prediction errors are slightly higher than that of EBVS.
Presented in Table 3 are the seven SNPs identified as having non-zero regression coefficients by adaptive lasso, EBVS, or EBVS$_i$. Each SNP is labeled by the chromosome it resides on followed by four digits. The single SNP identified by EBVS, 5-2773, was also identified by all the other methods. While adaptive lasso and EBVS$_i$ each identified five SNPs with non-zero regression coefficients, only three SNPs were identified by both, i.e., 1-3887, 5-2773, and 8-5143. The two SNPs on chromosome 17, i.e., 17-3907 identified by EBVS$_i$ and 17-9089 identified by adaptive lasso, neighbor each other with 16k bases in between; however, the two SNPs on chromosome 4 are far apart from each other.
As in the previous section, we also used the multi-sample-split method to calculate p-values based on 50 sample splits for all methods. When we followed Benjamini and Hochberg (1995) to control FDR at 0.1, none of the methods reported any significant SNPs, though adaptive lasso and EBVS$_i$ reported SNP 5-2773 with p-values as small as 0.0031 and 0.0034 respectively. Instead, when controlling $\widehat{\mathrm{FDR}}(\kappa) \le 0.1$ for both EBVS and EBVS$_i$, EBVS identified only SNP 5-2773, and EBVS$_i$ identified both SNP 5-2773 and SNP 17-3907, with $\kappa = 0.8$. Note that SNP 17-3907 is one of the neighboring pair on chromosome 17. As shown in the simulation studies, $\widehat{\mathrm{FDR}}(\kappa)$ usually overestimates $\mathrm{FDR}(\kappa)$, so we expect the true FDR at $\kappa = 0.8$ to be below 0.1 for both EBVS and EBVS$_i$.

Discussion
We intend to extend empirical Bayes thresholding (Johnstone and Silverman, 2004) to high-dimensional dependent data, allowing incorporation of complicated a priori information on model parameters. An iterated conditional modes/medians (ICM/M) algorithm is proposed to cycle through each coordinate of the parameters, applying a Bayesian update conditional on all other coordinates. The idea of cycling through coordinates has been revived recently for analyzing high-dimensional data. For example, the coordinate descent algorithm has been suggested to obtain penalized least squares estimates; see Fu (1998), Daubechies et al. (2004), Wu and Lange (2008), and Breheny and Huang (2011). However, direct application of the coordinate descent algorithm here is challenged by the spike-and-slab posteriors.
Without a priori information other than the sparsity of the regression coefficients, many lasso-type methods have been proposed, each with some tuning parameters. It is difficult to select values for the tuning parameters, and in practice the cross-validation method is widely used. However, high-dimensional data usually have small sample sizes, and available model-fitting algorithms demand intensive computation, both of which disfavor cross-validation. In particular, as genome-wide association studies focus more and more on complex diseases associated with rare variants (Nawy, 2012), the limited data usually contain a large number of SNPs that differ across only a small number of individuals. Cross-validation is then almost infeasible, as the few individuals carrying a rare variant are likely to fall into the same fold. Instead, the proposed ICM/M algorithm obtains data-driven hyperparameters via conditional modes, which takes full advantage of every observation in the small sample.
With a large number of predictors and complicated correlations between estimates, classical p-values are difficult to compute, and it is therefore challenging to evaluate the significance of selected predictors. Wasserman and Roeder (2009) and Meinshausen et al. (2009) recently proposed to calculate p-values by splitting the samples: when a sample is split into two folds, one fold is used as training data to select variables, and the other is used to calculate p-values for the selected variables. As with cross-validation, splitting samples significantly reduces the power of variable selection and p-value calculation, especially for high-dimensional data with small sample sizes. Again, it is almost infeasible to apply such a splitting method to genome-wide association studies with rare variants.
As shown in Section 4, an Ising model like (4.2) can be used to model a priori graphical information on the predictors. A maximum pseudo-likelihood approach is used to obtain the conditional mode of the Ising model parameters, so the ICM/M algorithm can be implemented easily. Indeed, at each iteration of the ICM/M algorithm, we cycle through all parameters by obtaining conditional modes/medians of one parameter (or one set of parameters) at a time; therefore, many classical approximation methods for low-dimensional problems may be used to simplify the implementation. On the other hand, the Ising prior (4.2) can also be modified to incorporate more complicated a priori information on the predictors. For example, we may multiply the interaction $\tau_i \tau_j$ by a weight $w_{ij}$ to model a known relationship between the $i$-th and $j$-th predictors. A copula model may be established to model more complicated graphical relationships between the predictors.
For high-dimensional data, stochastic search has been employed to implement Bayesian variable selection; see Hans et al. (2007), Bottolo and Richardson (2010), Li and Zhang (2010), Stingo et al. (2011), and others. The reviewers pointed out that Rockova and George (2014) recently proposed EMVS, an EM approach for rapid Bayesian variable selection. EMVS assumes a spike-and-slab Gaussian mixture prior on each $\beta_j$, i.e., $\beta_j \sim (1 - \omega_j) N(0, \sigma^2 \nu_0) + \omega_j N(0, \sigma^2 \nu_1)$, where $\omega_j$ is a prior probability, $\nu_1$ takes either a prespecified large value or a g-prior, and $\nu_0$ is suggested to range over a sequence of small positive values with $\nu_0 < \nu_1$. With an absolutely continuous spike, EMVS estimates $\omega_j$ at the E-step, and estimates $\beta_j$ at the M-step. Note that a positive $\nu_0$ will not automatically yield a sparse estimate of $\beta$, which has to be sparsified using a prespecified threshold. In contrast, the ICM/M algorithm estimates a common $\omega$ based on a conditional mode, and estimates each $\beta_j$ based on a conditional median, which enables variable selection following Johnstone and Silverman (2004). We also propose a local posterior probability to evaluate the importance of each predictor, which helps control the false discovery rate.
Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64178. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.