Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm

Bayesian variable selection regression (BVSR) is able to jointly analyze genome-wide genetic datasets, but the slow computation via Markov chain Monte Carlo (MCMC) has hampered its widespread use. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of BVSR model-fitting tenfold. The iterative method hinges on the complex factorization of the sum of two matrices, and the solution path resides in the complex domain (instead of the real domain). Compared with the Gauss-Seidel method, the complex factorization converges almost instantaneously, and its error is several orders of magnitude smaller than that of the Gauss-Seidel method. More importantly, its error is always within the pre-specified precision, while that of the Gauss-Seidel method is not. For large problems with thousands of covariates, the complex factorization is 10 -- 100 times faster than either the Gauss-Seidel method or the direct method via the Cholesky decomposition. In BVSR, one needs to repeatedly solve large penalized regression systems whose design matrices change only slightly between adjacent MCMC steps. This slight change in the design matrix enables the adaptation of the iterative complex factorization method. The computational innovation will facilitate the widespread use of BVSR in reanalyzing genome-wide association datasets.


Introduction
Bayesian variable selection regression (BVSR) can jointly analyze genome-wide genetic data to produce the posterior probability of association for each covariate and to estimate hyperparameters such as heritability and the number of covariates that have nonzero effects (Guan and Stephens, 2011). But the slow computation due to model averaging using Markov chain Monte Carlo (MCMC) has hampered its otherwise warranted widespread use. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of BVSR model-fitting tenfold.

Model and priors
We first briefly introduce the BVSR method before diving into computational issues. Our model, prior specification, and notation follow closely those of Guan and Stephens (2011). Consider the linear regression model

y = µ1 + Xβ + ε,  ε ∼ MVN(0, τ^{-1} I),

where X is an n × N column-centered matrix, I denotes an identity matrix of proper dimension, y and ε are n-vectors, β is an N-vector, and MVN stands for the multivariate normal distribution. Let γ_j be an indicator of the j-th covariate having a nonzero effect and denote γ = {γ_1, . . . , γ_j, . . . , γ_N}. A spike-and-slab prior for β_j (the j-th component of β) is specified as

β_j | γ_j ∼ γ_j N(0, σ_β^2 τ^{-1}) + (1 − γ_j) δ_0,

where δ_0 denotes a point mass at zero, π = P(γ_j = 1) is the proportion of covariates that have nonzero effects, and σ_β^2 is the variance of the prior effect size (scaled by τ^{-1}). We will specify priors for both later. We use noninformative priors on the parameters µ and τ: µ | τ ∼ N(0, σ_µ^2 τ^{-1}) with σ_µ → ∞, and τ ∼ Gamma(κ_1/2, κ_2/2) with κ_1, κ_2 → 0, where Gamma is in the shape-rate parameterization. As pointed out in Guan and Stephens (2011), prior (3) is equivalent to P(µ, τ) ∝ τ^{-1/2}, which is known as Jeffreys' prior (Ibrahim and Laud, 1991; O'Hagan and Forster, 2004). Some may favor a simpler form P(µ, τ) ∝ 1/τ, which makes no practical difference (Berger et al., 2001; Liang et al., 2008).
Given γ and σ_β^2, the Bayes factor with reference to the null model can be computed in closed form (4), where X_γ denotes the submatrix of X consisting of the columns for which γ_j = 1, |·| denotes the matrix determinant, and β̂ is the posterior mean of β given by

β̂ = (X_γ^t X_γ + σ_β^{-2} I)^{-1} X_γ^t y.    (5)
Note that in (4), the parameters β, τ, and µ are integrated out. The null-based Bayes factor BF(γ, σ_β^2) is proportional to the marginal likelihood P(y | γ, σ_β^2), and evaluating the Bayes factor is easier than evaluating the marginal likelihood due to cancellation of constants. We now discuss the prior specification for the two hyperparameters, π and σ_β^2.
To specify the prior for σ_β^2, Guan and Stephens (2011) introduced a hyperparameter h (which stands for heritability) such that

h = σ_β^2 Σ_{j:γ_j=1} s_j / (1 + σ_β^2 Σ_{j:γ_j=1} s_j),

where s_j denotes the variance of the j-th covariate. Conditional on γ, specifying a prior on h induces a prior on σ_β^2, so henceforth we may write σ_β^2(γ, h) to emphasize that σ_β^2 is a function of h and γ. Since h is motivated by the narrow-sense heritability, its prior is easy to specify, and we use h ∼ Unif(0, 1) by default to reflect our lack of knowledge of the heritability, although one can impose a strong prior by specifying a uniform distribution on a narrow support. A bonus of specifying the prior on σ_β^2 through h is that when h ∼ Unif(0, 1), the induced prior on σ_β^2 is heavy-tailed (Guan and Stephens, 2011).
Up to now, we have followed faithfully the model and the prior specification of Guan and Stephens (2011). Specifying the prior on γ is equivalent to specifying the prior on π, and Guan and Stephens (2011) specified the prior on π as uniform on the log scale, log π ∼ Unif(log π_min, log π_max), which is equivalent to P(π) ∝ 1/π for π ∈ (π_min, π_max). Guan and Stephens (2011) sampled π, h, γ, but here we do something slightly different by integrating out π analytically. This is sensible because γ is very informative about π. Specifically, we integrate over P(π) such that P(γ) = ∫ P(γ | π)P(π) dπ to obtain the marginal prior on γ, where the finite integral is related to the truncated Beta distribution and can be evaluated conveniently. If π_min goes to 0 and π_max goes to 1, we have the improper prior P(π) ∝ 1/π, and the marginal prior on γ becomes

P(γ) ∝ Γ(|γ|) Γ(N − |γ| + 1) / Γ(N + 1),

where we recall N is the total number of covariates, |γ| = Σ_j γ_j is the number of selected covariates in the model, and Γ denotes the Gamma function. P(γ) is always a proper probability distribution because it is defined on a finite set.
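As a concrete illustration, the marginal prior above can be evaluated stably on the log scale via the log-Gamma function. The sketch below is ours (the function name is hypothetical) and assumes the limiting form P(γ) ∝ Γ(|γ|)Γ(N − |γ| + 1)/Γ(N + 1) stated above:

```python
import math

def log_prior_gamma(k, N):
    """Log of the (unnormalized) marginal prior P(gamma) for a model with
    k = |gamma| selected covariates out of N, under the improper prior
    P(pi) propto 1/pi. The Beta integral gives
    P(gamma) propto Gamma(k) * Gamma(N - k + 1) / Gamma(N + 1)."""
    assert 1 <= k <= N
    return math.lgamma(k) + math.lgamma(N - k + 1) - math.lgamma(N + 1)
```

Note that the prior decays quickly in the model size |γ|, which is what makes it informative about sparsity.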

Posterior inference and computation
The joint posterior distribution of (γ, h) is given by

P(γ, h | y) ∝ P(y | γ, h) P(γ) P(h).    (10)

The posterior inferences typically include computing the posterior inclusion probability P(γ_j = 1 | y), which measures the strength of marginal association of the j-th covariate; the posterior distribution of the model size |γ|; and the posterior distribution of the heritability h, which measures the proportion of phenotypic variance explained by the selected models. We use MCMC to sample this joint posterior of (γ, h). Our sampling scheme follows closely that of Guan and Stephens (2011). In each MCMC iteration, to evaluate (10) for a proposed parameter pair (γ', h'), we need to compute the marginal likelihood P(y | γ', h'), which is proportional to (4). The two time-consuming calculations are the matrix determinant |I + σ_β^2 X_γ^t X_γ| and β̂ defined in (5), both of which have cubic complexity (in |γ|). The computation of the determinant can be avoided by an MCMC sampling trick, which we will discuss later. The main focus of the paper is a novel algorithm to evaluate (5), which reduces its complexity from cubic to quadratic.
The rest of the paper is structured as follows. Section 2 introduces the iterative complex factorization (ICF) algorithm. Section 3 describes how to incorporate the ICF algorithm into BVSR. Both sections contain numerical examples, including a real dataset from genome-wide association studies. A short discussion concludes the paper.

The iterative complex factorization
In this section, we propose a novel algorithm for solving the linear system

(X^t X + Σ^2) β̂ = X^t z,    (11)

where Σ is a diagonal matrix with positive (but not necessarily identical) diagonal entries, X is an n × p matrix, and z is an n-vector. Clearly (5) is a special case of (11). (The subscript γ is omitted in this section.) Note that β̂ is the familiar ridge regression estimator (Draper and Van Nostrand, 1979). It may appear that an algorithm designed for solving ridge regression can be borrowed to solve (11). But a unique feature of BVSR is that in each iteration of MCMC, the design matrix X usually changes only by one or a few columns. Thus, X^t X and its Cholesky decomposition can be obtained conveniently (details will follow). This unique feature allows us to design a much more efficient algorithm.

Existing methods
In Guan and Stephens (2011), the linear system (11) was solved using the Cholesky decomposition of X^t X + Σ^2, which requires p^3/3 flops (Trefethen and Bau III, 1997, Lec. 23). Although computing X^t X from X requires O(np^2) flops, when X changes by only a few columns, the majority of the entries in X^t X need no computation, and updating X^t X requires only O(np) flops.
Iterative methods can sometimes reduce the computational time. Define

A = X^t X + Σ^2 = L + D + U,    (12)

where L (U) is the strictly lower (upper) triangular component of A and D contains only the diagonal. Then three popular iterative procedures can be summarized as follows:

Jacobi: β̂^(k+1) = D^{-1} (X^t z − (L + U) β̂^(k));
Gauss-Seidel: β̂^(k+1) = (D + L)^{-1} (X^t z − U β̂^(k));
SOR: β̂^(k+1) = (D + ωL)^{-1} (ω X^t z + ((1 − ω)D − ωU) β̂^(k)).

The successive over-relaxation (SOR) method is a generalization of the Gauss-Seidel method, where ω is called the relaxation parameter. When A is positive definite, the Gauss-Seidel method always converges, and the SOR method converges for ω ∈ (0, 2) (Golub and Van Loan, 2012, Chap. 10.1.2). For all three iterative methods, each iteration requires 2p^2 flops. Thus, whether an iterative method is more efficient than the Cholesky decomposition depends on how many iterations it takes to converge. Another notable class of iterative methods is the Krylov subspace methods; two famous examples are the steepest descent method and the conjugate gradient method (Trefethen and Bau III, 1997, Lec. 38).
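To make the splitting concrete, here is a minimal Gauss-Seidel sketch in Python; the function name and the dense-matrix representation are illustrative choices, not the paper's implementation (which is in C++ and exploits the sparsity of the triangular factors):

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-6, max_iter=200):
    """Solve A beta = b by Gauss-Seidel: with A = L + D + U split into
    strictly-lower, diagonal, and strictly-upper parts, each sweep solves
    (D + L) beta_new = b - U beta_old by forward substitution."""
    p = A.shape[0]
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):  # sweep top to bottom, updating in place
            beta[j] = (b[j] - A[j, :j] @ beta[:j]
                            - A[j, j + 1:] @ beta_old[j + 1:]) / A[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

For the positive definite A arising in (11), each sweep is guaranteed to make progress, but the number of sweeps needed grows with the collinearity of X.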
In principle, all methods developed to solve ridge regression can be used here to solve (11), as we alluded to earlier. For example, the methods of Eldén (1977), Lawson and Hanson (1995), and Turlach (2006) were developed for solving (11) with fixed X but changing Σ, while Hawkins and Yin (2002) devised a method for fixed Σ but changing X. Modern least squares solvers (Rokhlin and Tygert, 2008; Avron et al., 2010; Meng et al., 2014) typically consider the case where X is extremely large, sparse, or ill-conditioned, and under such conditions these solvers outperform solving directly via the Cholesky decomposition (see Meng et al. (2014, Sec. 5) for how to apply least squares solvers to ridge regression). But all methods quoted above are less effective (or not applicable) in the context of BVSR, because they are not designed for solving (11) millions of times, each time with a slightly different X and a different Σ, and they do not take advantage of the feature of BVSR that the Cholesky decomposition of X^t X can be obtained efficiently.

The iterative complex factorization (ICF) algorithm
Let R^t R be the Cholesky decomposition of X^t X, where R is upper triangular. In the context of BVSR, given the Cholesky decomposition of X^t X, the Cholesky decomposition of a new matrix (X')^t X' can be obtained efficiently, since X' (the proposed new design matrix) differs from X by only one or a few columns (see Section 3.1). So we consider solving the linear system

(R^t R + Σ^2) β̂ = X^t z.    (13)

Contrary to intuition, the Cholesky decomposition of R^t R + Σ^2 cannot be obtained efficiently from R. (This was also noticed in Zhou et al. (2013, p. 6).) We instead perform the complex factorization

H = (R^t − iΣ)(R + iΣ) = (R^t R + Σ^2) + iS,  S = R^t Σ − Σ R,    (14)

where i is the imaginary unit. Then we have the update

(R^t − iΣ)(R + iΣ) β̂^(k+1) = X^t z + iS β̂^(k),

where the right-hand side is a complex vector, and β̂^(k+1) can be obtained by a forward and a backward substitution involving the two complex triangular matrices, R^t − iΣ and R + iΣ. This update, however, diverges from time to time. Examining the details of the observed divergent cases reveals that the culprit is the imaginary part of β̂^(k). Because the solution β̂ is real, discarding the imaginary part of β̂^(k) at the end of each iteration does not affect the fixed point to which the iterative method converges. Denoting the real part of a complex entity (scalar or vector) by Re, the generalized update of our algorithm, ICF (iterative complex factorization), becomes

β̂^(k+1) = (1 − ω) β̂^(k) + ω Re[H^{-1}(X^t z + iS β̂^(k))],    (15)

where we have also introduced a relaxation parameter ω; intuitively, ω makes the update lazy to avoid over-shooting. The revised update converges almost instantaneously, in a few iterations, compared with a few dozen to a few hundred iterations for the Gauss-Seidel method. Each iteration of (15) requires 6p^2 flops, twice that of a Gauss-Seidel iteration, because ICF operates on complex (instead of real) matrices and vectors. Note that H^{-1} applied to a vector is computed via a forward and a backward substitution and does not require matrix inversion.
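The ICF update can be sketched in a few lines. The code below is an illustrative Python translation of (13) -- (15), not the fastBVSR implementation; it assumes the upper triangular Cholesky factor R of X^t X is already available and takes the right-hand side b = X^t z as input:

```python
import numpy as np
from scipy.linalg import solve_triangular

def icf(R, Sigma, b, omega=1.0, tol=1e-6, max_iter=200):
    """Sketch of the ICF iteration: solve (R^t R + Sigma^2) beta = b,
    where R is the upper triangular Cholesky factor of X^t X, Sigma is
    diagonal with positive entries, and b = X^t z is the right-hand side.
    Uses H = (R^t - i Sigma)(R + i Sigma) = A + iS, S = R^t Sigma - Sigma R."""
    p = R.shape[0]
    S = R.T @ Sigma - Sigma @ R      # imaginary part of H; skew-symmetric
    lower = R.T - 1j * Sigma         # lower triangular complex factor
    upper = R + 1j * Sigma           # upper triangular complex factor
    beta = np.zeros(p)
    for _ in range(max_iter):
        rhs = b + 1j * (S @ beta)
        w = solve_triangular(lower, rhs, lower=True)   # forward substitution
        v = solve_triangular(upper, w, lower=False)    # backward substitution
        beta_new = (1.0 - omega) * beta + omega * v.real
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On well-regularized systems the loop typically exits within a handful of iterations; the adaptive schedule for ω described in the text can be layered on top.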

ICF converges to the right target
Proposition 1. Let Ψ(ω) = (1 − ω)I − ω Im(H^{-1})S denote the iteration matrix of the ICF update, so that β̂^(k+1) − β̂ = Ψ(ω)(β̂^(k) − β̂). (16) Then ICF in (15) converges for any starting point if ρ(Ψ(ω)) < 1, where ρ denotes the spectral radius.
Proof. The statement follows directly from Theorem 10.1.1 of Golub and Van Loan (2012).
Theorem 2. There exists ω ∈ (0, 1] such that the ICF update detailed in (15) converges to the true solution.
Proof. Using the notation defined in (12) and (14), we have H = A + iS. The imaginary part of H^{-1} can be expressed as Im(H^{-1}) = −A^{-1}S(A + SA^{-1}S)^{-1}. Both A and S are real matrices, and applying the Woodbury identity to (A + SA^{-1}S)^{-1} shows that, for a fixed ω, the spectrum of Ψ(ω) is fully determined by the spectrum of A^{-1}S. Since S is skew-symmetric, A^{-1/2}SA^{-1/2} is also skew-symmetric. Hence the eigenvalues of A^{-1}S, which are identical to those of A^{-1/2}SA^{-1/2}, are conjugate pairs of pure imaginary numbers or zero. Let ±iη be such a pair with η ≥ 0 and let u be the eigenvector corresponding to the eigenvalue iη. We have A^{-1}Su = iηu, which implies iu*Su = −ηu*Au, where u* denotes the conjugate transpose of u. Since H is a Hermitian positive definite matrix, u*Hu = u*Au + iu*Su = (1 − η)u*Au > 0. Since A is also positive definite, we have u*Au > 0 and therefore η ∈ [0, 1). From (16), the eigenvalues of Ψ(ω) come in identical pairs equal to 1 − ω/(1 − η^2) (the expression includes the situation η = 0). Proposition 1 requires that

|1 − ω/(1 − η^2)| < 1    (17)

holds for all η, which is the case for any 0 < ω < 2(1 − η_max^2); since η_max < 1, such an ω ∈ (0, 1] exists. Thus the existence of ω follows from (17).
By Proposition 1, the error satisfies β̂^(k) − β̂ = Ψ(ω)^k (β̂^(0) − β̂), where β̂ is the true solution. Thus the spectral radius of Ψ(ω) determines how fast the error converges to zero. We provide a theory-guided procedure to adaptively tune the relaxation parameter ω, which relies on the following proposition connecting ρ(Ψ(ω)) with ω and η.
Proposition 3. Denote the eigenvalues of A^{-1}S as ±iη (η ≥ 0), where A and S are defined in (12) and (14), and let η_min and η_max be the smallest and largest such η. Then the spectral radius of Ψ(ω) is

ρ(Ψ(ω)) = max{ 1 − ω/(1 − η_min^2), ω/(1 − η_max^2) − 1 },    (19)

and the optimal value of ω that minimizes ρ(Ψ(ω)) is

ω_opt = 2(1 − η_max^2)(1 − η_min^2) / ((1 − η_max^2) + (1 − η_min^2)).    (20)

Proof. By (16) and (17), the smallest and the largest eigenvalues of Ψ(ω) are 1 − ω/(1 − η_max^2) and 1 − ω/(1 − η_min^2), with η ∈ [0, 1). After adjusting for their signs, we obtain (19). The right-hand side of (19) is a function of ω, with the first item decreasing linearly in ω and the second item increasing linearly in ω. Hence the minimum of (19) is attained when the two quantities in the braces are equal, which proves (20).
By the properties of skew-symmetric matrices, η_min = 0 when p is odd. When p is even, η_min is usually extremely small for a moderate sample size. Therefore η_min^2 can safely be taken as 0. We start the ICF update with ω^(0) = 1. Suppose at the k-th iteration we can produce an estimate ρ̂^(k) of the spectral radius of Ψ(ω^(k)). Then using (19), η_max^2 can be estimated by η_max^2 ≈ 1 − ω^(k)/(1 + ρ̂^(k)). Plugging this into (20), with η_min = 0, we obtain an update for ω:

ω^(k+1) = 2ω^(k) / (1 + ρ̂^(k) + ω^(k)).

Note that the update does not involve η_max^2, but only ω^(k) and ρ̂^(k), and ω^(k+1) is a decreasing function of ρ̂^(k). Finally, to estimate the spectral radius of Ψ(ω^(k)) we use the ratio of successive differences,

ρ̂^(k) = ||β̂^(k+1) − β̂^(k)||_2 / ||β̂^(k) − β̂^(k−1)||_2,

where ||·||_2 denotes the ℓ2-norm. The procedure works well in our numerical studies.
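Combining the estimate of η_max^2 with (20) collapses the schedule to one line; the helper below is a sketch of this update (the function name is ours):

```python
def update_omega(omega, rho_hat):
    """One step of the adaptive relaxation schedule: plugging the estimate
    eta_max^2 ~ 1 - omega/(1 + rho_hat) into the optimum (20), with
    eta_min = 0, simplifies to an update free of eta_max."""
    return 2.0 * omega / (1.0 + rho_hat + omega)
```

Starting from ω^(0) = 1 and feeding in ρ̂^(k) computed from successive differences of the iterates, the schedule shrinks ω when the iteration is contracting slowly and leaves ω at 1 when it is already converging instantly.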

ICF outperforms other methods
Our numerical comparisons were based on real datasets from genome-wide association studies downloaded from dbGaP. The details of the datasets can be found in Section 3.3. Because the convergence of iterative methods is sensitive to collinearity in the design matrix X, our comparison studies used two datasets: the first contains 20K SNPs sampled across the whole genome, and the second contains 20K SNPs that are physically adjacent. The first dataset has little or no collinearity (henceforth referred to as IND), and the second has collinearity due to linkage disequilibrium (henceforth referred to as DEP). The sample size is n = 3000 for both datasets. Our numerical studies compared different methods (detailed below) in their speed and accuracy of solving (11). For a given p, in each experiment we sampled without replacement p columns from the IND (or DEP) dataset to obtain X, simulated the response under the null, z ∼ MVN(0, I), and solved (11) using different methods with Σ^2 = diag(4, . . . , 4), which corresponds to σ_β = 0.5. For each p we conducted 1000 independent experiments.
Our initial studies compared ICF with six other methods: the Cholesky decomposition (Chol), the Jacobi method, the Gauss-Seidel method (GS), the successive over-relaxation (SOR) method, the steepest descent method, and the conjugate gradient (CG) method. We excluded the Jacobi method and the steepest descent method due to their poor performance. To ensure a fair comparison in the context of BVSR, timing for Chol, GS, and SOR started after A = X^t X + σ_β^{-2} I had been obtained, and timing for ICF started after the upper triangular matrix R satisfying X^t X = R^t R had been obtained. For GS, we tried the preconditioning method of Kohno et al. (1997), which is the most efficient among the methods surveyed in Niki et al. (2004), but we observed no improvement, most likely because A is well-conditioned due to the regularization. For SOR, we needed to choose a value for the relaxation parameter (denoted by ω_SOR), which is known to be very difficult. A solution was provided by Young (1954) (see also Yang and Matthias (2007)), but we observed that it did not apply when p > 500. After trial and error, we settled on ω_SOR = 1.2, which appeared to be optimal in our numerical studies. For all iterative methods, we started from β̂^(0) = 0 and stopped when the maximum entry-wise change between successive iterates fell below 10^{-6} or the number of iterations exceeded 200. The computer code for the different methods was written in C++ and run in the same environment. The Cholesky decomposition was implemented using the GNU Scientific Library (GSL) (Gough, 2009), and GS and SOR were implemented in a manner that exploits the sparsity of the triangular matrices L and U for maximum efficiency (Golub and Van Loan, 2012, Chap. 10.1.2). Lastly, we included in the comparison the LAPACK routine DGELS, the most widely used least squares solver, as a baseline reference. The results are summarized in Table 1. For the IND dataset, the three iterative methods (ICF, GS, and SOR) appear on par with one another for smaller p.
For p = 1000, ICF outperforms the other two and is 10 times faster than the Cholesky decomposition. On average it took ICF 5 iterations to converge, and ICF never failed to converge in any experiment. On the other hand, both GS and SOR failed to converge, at least once, for large p. For the DEP dataset, ICF is the fastest among all the methods compared. For larger p (p = 500, 1000), ICF is 4 -- 5 times faster than the Cholesky decomposition, and 5 -- 6 times faster than the other three methods. Both GS and SOR had difficulty converging within 200 iterations for large p. DGELS is always the slowest since it assumes X^t X is unknown and solves (11) by QR decomposition. Advanced least squares solvers (Avron et al., 2010; Meng et al., 2014) have performance similar to DGELS and can beat it only by a small margin when X is very large. We also tweaked the simulation conditions to check whether the results were stable. For example, we simulated z under the alternative instead of the null, and initialized β̂^(0) in different manners, such as using unpenalized linear estimates. The results remained essentially unchanged under these tweaks.
Finally, we compared the accuracy of the different iterative methods, where the truth was obtained from the Cholesky decomposition. In this experiment we used the IND dataset and compared p = 500 and p = 1000. Each method was run in 10,000 experiments. The maximum entry-wise deviation (denoted by d) was obtained for each method, each experiment, and each simulation condition. Only converged experiments were included in the comparison. Figure 1 shows the distributions of log10 d for three methods. Clearly, ICF outperforms CG and SOR by a large margin. (GS is omitted since its accuracy is poorer than that of SOR in almost every experiment.) Because in real applications we are oblivious to the truth (or the truth is expensive to obtain), a method is more desirable if its deviation from the truth is within the pre-specified precision. Figure 1 shows that ICF always achieves the pre-specified precision (10^{-6}) while the other two methods do not. Moreover, the deviation of ICF is 2 -- 3 orders of magnitude smaller than that of SOR, and 3 -- 4 orders of magnitude smaller than that of CG. We repeated the experiments for the DEP dataset and made the same observations.

Incorporating ICF into BVSR
As we mentioned in the introduction, for BVSR the most time-consuming step in MCMC is the computation of (4) for every proposed (γ', h') (the prime denotes a proposed value). To successfully incorporate ICF into BVSR, we need to overcome two hurdles: avoiding the computation of the determinant term in (4) and efficiently obtaining the Cholesky decomposition R^t R = X_γ^t X_γ. Computing a matrix determinant has cubic complexity; if this is not avoided, using ICF to compute β̂ becomes pointless. When evaluating (4), we need to compute the determinant |X_γ^t X_γ + σ_β^{-2}(γ, h)I|, which takes O(|γ|^3) operations. Observing that the determinant appears in a normalization constant, avoiding its computation becomes a well-studied problem in the MCMC literature on so-called "doubly-intractable distributions," meaning two nested unknown normalization constants (Møller et al., 2006; Murray et al., 2012). Recall that we want to sample the posterior (10), whose likelihood can be written as P(y | γ, h) = L(y, γ, h)/Z(γ, h), where the normalization constant Z(γ, h) involves the determinant of Ω(γ, h) = X_γ^t X_γ + σ_β^{-2}(h, γ)I, and it is the computation of |Ω(γ, h)| that we want to avoid. For a naive Metropolis-Hastings algorithm, if the current state is (γ, h) and the proposed move is (γ', h'), we need to evaluate the Hastings ratio

α = [P(y | γ', h') P(γ') P(h') K(γ, h | γ', h')] / [P(y | γ, h) P(γ) P(h) K(γ', h' | γ, h)],

where K(· | ·) is the proposal distribution for proposing · from ·. The proposed move (γ', h') is accepted with probability min(1, α), which is called the Metropolis rule. Apparently, both |Ω(γ, h)| and |Ω(γ', h')| need to be evaluated in a naive MCMC implementation.
To avoid computing Z(γ, h), notice that

L(ỹ, γ, h) / L(ỹ, γ', h') = [P(ỹ | γ, h) / P(ỹ | γ', h')] · [Z(γ, h) / Z(γ', h')]

holds for all ỹ, and if ỹ is sampled from P(· | γ', h'), the ratio L(ỹ, γ, h)/L(ỹ, γ', h') becomes a one-sample importance-sampling estimate of Z(γ, h)/Z(γ', h'). Hence we can plug in L(ỹ, γ, h)/L(ỹ, γ', h') to replace the ratio of normalization constants and compute the Hastings ratio by

α̂ = [L(y, γ', h') P(γ') P(h') K(γ, h | γ', h') L(ỹ, γ, h)] / [L(y, γ, h) P(γ) P(h) K(γ', h' | γ, h) L(ỹ, γ', h')].

This is the exchange algorithm of Murray et al. (2012), which can be viewed as a special case of the pseudo-marginal method (Andrieu and Roberts, 2009). In the Appendix we prove that the stationary distribution of this MCMC is the desired posterior distribution P(γ, h | y). Since the Bayes factor (4) is invariant to scaling and shifting of y, we can simply assume µ = 0 and τ = 1 when sampling ỹ.
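In generic form, one exchange step can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, assuming a symmetric proposal so that the K terms cancel; `log_L` is the unnormalized log-likelihood and `sample_y` draws auxiliary data from the model at the proposed parameter:

```python
import math

def exchange_accept(log_L, sample_y, y, theta, theta_prop, log_prior, rng):
    """One Metropolis step of the exchange algorithm: the intractable ratio
    of normalization constants is replaced by the one-sample importance
    estimate L(y_tilde; theta)/L(y_tilde; theta_prop), with y_tilde drawn
    from the model at the proposed parameter theta_prop."""
    y_tilde = sample_y(theta_prop, rng)   # auxiliary data from proposed model
    log_alpha = (log_L(y, theta_prop) - log_L(y, theta)
                 + log_prior(theta_prop) - log_prior(theta)
                 + log_L(y_tilde, theta) - log_L(y_tilde, theta_prop))
    return math.log(rng.random()) < min(0.0, log_alpha)
```

In BVSR, theta plays the role of (γ, h), and the determinant |Ω(γ, h)| never appears because only the unnormalized likelihood L is evaluated.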
To tackle the second hurdle, notice that in each MCMC iteration, usually only one or a few entries of γ are flipped between 0 and 1. Such proposals are often called "local" proposals, and Markov chains using local proposals usually have high acceptance ratios, which indicate the chains are "mixing" well (Guan and Krone, 2007). When one covariate is added to the model, the new Cholesky decomposition (R')^t R' can be obtained from the previous decomposition R^t R by a forward substitution. When one covariate is deleted from the model, we simply delete the corresponding column from R and introduce new zeros by Givens rotations (Golub and Van Loan, 2012, Chap. 5.1), which requires ≈ 3k^2 flops if the (|γ| − k)-th predictor is removed. We provide a toy example in the Appendix to demonstrate the update of the Cholesky decomposition.
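The add-one-column update can be sketched as follows; this is an illustrative Python version (the paper's implementation is in C++), in which the new column of R comes from one forward substitution and the new diagonal entry from a square root:

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_add_column(R, X, x):
    """Extend the Cholesky factor R (upper triangular, R^t R = X^t X) when
    a new column x joins the design: solve R^t r = X^t x by forward
    substitution, then append r and the new diagonal entry."""
    r = solve_triangular(R.T, X.T @ x, lower=True)
    rho = np.sqrt(x @ x - r @ r)       # positive when x is not in span(X)
    p = R.shape[0]
    R_new = np.zeros((p + 1, p + 1))
    R_new[:p, :p] = R
    R_new[:p, p] = r
    R_new[p, p] = rho
    return R_new
```

The cost is one triangular solve, O(p^2) flops, versus O(p^3) for recomputing the Cholesky decomposition from scratch.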

Numerical examples
We developed a software package, fastBVSR, to fit the BVSR model by incorporating ICF and the exchange algorithm into the MCMC procedure. The algorithm is summarized in the Appendix. The software is written in C++ and available at http://www.haplotype.org/software.html. In fastBVSR, we also implemented Rao-Blackwellization, as described in Guan and Stephens (2011), to reduce the variance of the estimates of γ and β. By default, Rao-Blackwellization is performed every 1000 iterations.
To check the performance of fastBVSR, we performed simulation studies based on a real dataset described in Yang et al. (2010) (henceforth the Height dataset). The Height dataset contains 3,925 subjects and 294,831 common SNPs (minor allele frequency ≥ 5%) after routine quality control (cf. Xu and Guan, 2014). We sampled 10,000 SNPs across the genome to perform simulation studies. Our aim was to check whether fastBVSR can reliably estimate the heritability of the simulated phenotypes. To simulate phenotypes of different heritability, we randomly selected 200 causal SNPs (out of 10,000) to obtain γ. For each selected SNP we drew its effect size from the standard normal distribution to obtain β_γ (the subvector of β that contains the nonzero entries). Then we simulated a standard normal error term ε and scaled the simulated effect sizes by a common factor λ such that y = λX_γ β_γ + ε and the heritability, 1 − Var(ε)/Var(y), equaled h (taking values in 0.01, 0.02, . . . , 0.99). This is the same procedure as that of Guan and Stephens (2011). For each simulated phenotype, we ran fastBVSR to obtain the posterior estimate of the heritability. We compared fastBVSR with GCTA (Yang et al., 2011), a software package that estimates heritability and performs prediction using the linear mixed model. For heritability estimation, GCTA has been shown to be unbiased and accurate in a wide range of settings (Yang et al., 2010; Lee et al., 2011). The results are shown in Figure 2. Both fastBVSR and GCTA estimate the heritability accurately; the mean absolute error is 0.014 for fastBVSR and 0.029 for GCTA. Noticeably, fastBVSR has a smaller variance but a slight bias when the true heritability is large.
Next we compare the predictive performance of fastBVSR and GCTA. We first define the mean squared error as a function of β̂,

MSE(β̂) = (1/n) Σ_i (y_i − x_i^t β̂)^2,

and then, following Guan and Stephens (2011), define the relative prediction gain (RPG) to measure the predictive performance of β̂:

RPG(β̂) = (MSE(0) − MSE(β̂)) / (MSE(0) − MSE(β)).

Figure 2: Heritability estimates. The left panel is the result of fastBVSR, obtained using 20,000 burn-in steps and 100,000 sampling steps, and the right panel is the result of GCTA. In each iteration of BVSR, the heritability is estimated by computing the proportion of explained variance of y using the sampled parameter values. The grey bars represent 95% credible intervals for fastBVSR and ±2 standard errors for GCTA.
The advantage of RPG is that the scaling of y does not contribute to RPG, so that simulations with different heritability can be compared fairly. Clearly, when β̂ = β, RPG = 1; when β̂ = 0, RPG = 0. Figure 3 shows that fastBVSR has much better predictive performance than GCTA, which reflects the advantage of BVSR over the linear mixed model. This advantage owes to the model averaging used in BVSR (Raftery et al., 1997; Broman and Speed, 2002). Note also that BVSR with Rao-Blackwellization performs slightly better than BVSR alone.
Lastly, we check the calibration of the posterior inclusion probability (PIP), P(γ_j = 1 | y). We pool the results of the 99 simulation sets with different heritability (from 0.01 to 0.99 in increments of 0.01). In each set there are 10,000 estimated PIPs, one for each SNP, so in total there are 990,000 PIP estimates. We group these PIPs into 20 bins, [0.05 × (i − 1), 0.05 × i) for i = 1, . . . , 20, and for each bin we compute the fraction of true positives. If the PIP estimates are well calibrated, we expect that in each bin the average of the PIPs roughly equals the fraction of true positives. Figure 4 shows that the PIP estimated by BVSR is conservative, which agrees with the previous study (Guan and Stephens, 2011), and the PIP estimated by Rao-Blackwellization is well calibrated.
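The binning check just described can be sketched as follows (illustrative Python; the function name is ours):

```python
import numpy as np

def calibration_table(pips, truth, n_bins=20):
    """Bin PIP estimates into [0.05(i-1), 0.05 i) and, per bin, report the
    mean PIP, the fraction of true positives, and the bin count; for
    well-calibrated estimates the first two are close in every bin."""
    pips, truth = np.asarray(pips), np.asarray(truth)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pips >= lo) & (pips < hi)
        if mask.any():
            rows.append((pips[mask].mean(), truth[mask].mean(), int(mask.sum())))
    return rows
```

A conservative method, like the crude BVSR PIPs above, shows a fraction of true positives systematically larger than the mean PIP in each bin.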

Analysis of real GWAS datasets
We applied fastBVSR to analyze GWAS datasets concerning intraocular pressure (IOP). We applied for and downloaded two GWAS datasets from the database of Genotypes and Phenotypes (dbGaP), one for glaucoma (dbGaP accession number: phs000238.v1.p1) and the other for intraocular hypertension (dbGaP accession number: phs000240.v1.p1). The combined dataset contains 301,143 autosomal SNPs and 3,226 subjects, and was used previously by Zhou and Guan (2017) to examine the relationship between p-values and Bayes factors.

Figure 3: Relative prediction gain (RPG) of fastBVSR and GCTA. BVSR stands for the RPG of the crude posterior mean estimates from fastBVSR; BVSR-RB represents the RPG of the Rao-Blackwellized estimates from fastBVSR; for GCTA, RPG is computed using the BLUPs (best linear unbiased predictors) from the linear mixed model.
We conducted five independent runs with different random seeds. The sparsity parameters for BVSR were set to π_min = 0.0001 and π_max = 0.01, which reflect a prior of 30 -- 3000 marginally associated SNPs. The posterior mean model size in each run ranges from 430 to 622 (Table 2). The average number of iterations for ICF to converge is 7.3 (95% interval [4, 13]), which suggests that ICF is effective in analyzing real datasets. The speed improvement of fastBVSR over piMASS (a companion software package of Guan and Stephens (2011) that uses the Cholesky decomposition to fit Bayesian linear regression) is dramatic: with model size around 500 and more than 3000 individuals, fastBVSR took 14 hours to run 2.1 million MCMC iterations, compared with more than 1000 hours used by piMASS on problems of matching size. Out of concern for cumulative error in obtaining the matrix R by updating the Cholesky decomposition (detailed in Section 3.1), we compared, every 1000 steps, the updated Cholesky decomposition with the directly computed one; the results suggested that the entry-wise difference was negligible.
We examine the inference of the hyperparameters, namely, the narrow-sense heritability h, the model size |γ|, and the sparsity parameter π. Table 2 shows that the inferences of the three parameters are consistent across the five independent runs, and we therefore combine the results from the five runs in the following discussion. The posterior mean of h is 0.28 with 95% credible interval (0.1, 0.44). This estimate of the heritability is smaller than those reported in the literature: 0.62 from a twin study (Carbonaro et al., 2008), 0.29 from a sibling study (Chang et al., 2005), and 0.35 from an extended pedigree study (van Koolwijk et al., 2007). For comparison, using the same dataset, the heritability estimated by GCTA is 0.47 with standard error 0.11. The underestimation of h by fastBVSR is perhaps due to over-shrinking of the effect sizes (the posterior mean of the prior effect size σ_β is 0.048). The posterior mean of π is 0.0016, which suggests that IOP is a highly polygenic phenotype, agreeing with the complex etiology of glaucoma (Weinreb et al., 2014), to which high IOP is a precursor phenotype.
Lastly, we examine the top marginal signals detected by fastBVSR. Table 3 lists the top 11 SNPs based on PIP inferred from fastBVSR, at an arbitrary PIP threshold of 0.25. Table 3 also includes the PPA (posterior probability of association) based on the single-SNP test for two choices of prior effect size: σ_β = 0.05, the posterior mean of σ_β inferred from fastBVSR, and σ_β = 0.5, a popular choice for single-SNP analysis. We use the posterior mean π = 0.0016 inferred from fastBVSR as the prior odds of association. Multiplying a Bayes factor by the prior odds we obtain the posterior odds, from which we obtain PPA = (posterior odds)/(1 + posterior odds). The two columns of PPAs are highly similar in both ranking and magnitude. This observation agrees with our experience and theoretical studies (Zhou and Guan, 2017).

Table 3: SNPs with (Rao-Blackwellized) PIP > 0.25. Chr is short for chromosome, and Pos is short for position, measured in mega-basepairs (Mb) according to HG18. PPA stands for the posterior probability of association. Computing PPA requires the Bayes factor and the prior odds. For the prior odds we use 0.0016, the posterior mean of π inferred from fastBVSR. Bayes factors are computed using two prior effect sizes (σ_β = 0.05, 0.5). SNPs mentioned in the main text are highlighted in bold.
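The PPA computation described above amounts to two lines; the sketch below (with a function name of our choosing) treats the prior odds 0.0016 from the text as a default argument:

```python
def ppa(log10_bf, prior_odds=0.0016):
    """Posterior probability of association from a single-SNP Bayes factor
    (given on the log10 scale): posterior odds = BF * prior odds, and
    PPA = odds / (1 + odds)."""
    odds = (10.0 ** log10_bf) * prior_odds
    return odds / (1.0 + odds)
```

For example, with these prior odds a SNP needs a Bayes factor of roughly 10^3 before its PPA exceeds one half.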

Discussion
We developed a novel algorithm, iterative complex factorization (ICF), to solve a class of penalized linear systems, proved its convergence, and demonstrated its effectiveness and efficiency through simulation studies. The algorithm can dramatically increase the speed of BVSR, which we demonstrated by analyzing a real dataset. In our simulation studies we only considered n > p (in BVSR, p = |γ| is the size of the selected model). When n ≤ p, ICF can be implemented with the dual-variable method (Saunders et al., 1998; see also Lu et al., 2013), which hinges on the identity (X_γ^t X_γ + σ_β^{-2} I)^{-1} X_γ^t y = X_γ^t (X_γ X_γ^t + σ_β^{-2} I)^{-1} y. One limitation of ICF is that it requires the Cholesky decomposition to obtain the effective complex factorization. Nevertheless, ICF has a range of potential applications. The general ridge regression (Draper and Van Nostrand, 1979) is an obvious one; here we point out two others. The first is the Bayesian sparse linear mixed model (BSLMM) of Zhou et al. (2013), which can be viewed as a generalization of BVSR. BSLMM has two components: a linear component that corresponds to X_γ and a random-effect component. The linear component requires solving a system of the same form as (11) (Text S2 of Zhou et al., 2013); therefore ICF can be applied to make BSLMM more efficient. The second is fitting BVSR by the variational method (Carbonetto and Stephens, 2012; Huang et al., 2016). In particular, Huang et al. (2016) estimated (γ, β) using an iterative method: in each iteration, the posterior mean of β is updated by solving a linear system of the same form as (11), where the dimension equals the number of covariates with nonzero PIP estimates and the diagonal matrix Σ depends on the PIP estimates.
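The dual-variable identity quoted above (a "push-through" identity) can be checked numerically. The sketch below uses arbitrary dimensions and σ_β as placeholders; it shows why the dual form pays off when n < p: the primal side solves a p×p system, the dual side only an n×n one.

```python
import numpy as np

# Numerical check of the identity underlying the dual-variable form of ICF:
# (X^t X + s^{-2} I)^{-1} X^t y  ==  X^t (X X^t + s^{-2} I)^{-1} y.
# Dimensions and sigma_beta are illustrative; n < p is the regime where
# the dual formulation is cheaper.
rng = np.random.default_rng(1)
n, p, sigma_beta = 50, 200, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = sigma_beta ** -2

primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # p x p solve
dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)    # n x n solve
print(np.max(np.abs(primal - dual)))  # agreement to numerical precision
```

The identity follows from X^t(XX^t + λI) = (X^tX + λI)X^t after multiplying both sides by the appropriate inverses.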
Applying ICF to the variational method may be even more beneficial than applying it to BVSR, because the dimension of the linear system is considerably larger in the variational method.
The iterative complex factorization method converges almost instantaneously, and it is far more accurate than other iterative algorithms such as the Gauss-Seidel method. Why does it converge so fast while remaining so accurate? Our intuition is the following. The imaginary part of the update in (15) involves the skew-symmetric matrix S = R^t Σ − Σ^t R. Its Youla decomposition (Youla, 1961) suggests that S rotates β̂^(k), scales and flips its coordinates, and rotates back, and the result becomes the imaginary part of z. In effect, this imaginary part provides a "worm hole" through the imaginary dimension in an otherwise purely real solution path. Understanding how this works mathematically could have far-reaching effects on Bayesian statistics, statistical genetics, machine learning, and computational biology.
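The skew-symmetry of S, on which the Youla-decomposition picture rests, holds for any R and Σ: transposing S = R^t Σ − Σ^t R gives Σ^t R − R^t Σ = −S. A minimal numerical sketch, with R and Σ as arbitrary placeholders rather than the actual matrices from (15):

```python
import numpy as np

# Verify that S = R^t Sigma - Sigma^t R is skew-symmetric (S^t = -S),
# the property that lets the Youla decomposition read the imaginary part
# of the ICF update as a rotate, scale/flip, rotate-back map.
# R and Sigma are illustrative stand-ins, not the matrices from (15).
rng = np.random.default_rng(2)
m = 6
R = np.triu(rng.standard_normal((m, m)))        # a Cholesky-type upper factor
Sigma = np.diag(rng.uniform(0.5, 2.0, size=m))  # a diagonal penalty matrix
S = R.T @ Sigma - Sigma.T @ R
print(np.allclose(S.T, -S))  # True: S is skew-symmetric
```

A consequence of skew-symmetry is that the eigenvalues of S are purely imaginary, consistent with the update's excursion into the complex domain.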
Finally, we hope our new software fastBVSR, which is 10--100 fold faster than the previous implementation of BVSR, will facilitate the widespread use of BVSR in analyzing or reanalyzing datasets from genome-wide association (GWA) and expression quantitative trait loci (eQTL) studies.

Supplementary material
The online appendix includes a summary of the MCMC algorithm, examples of updating the Cholesky decomposition, and a proof of the stationary distribution of the exchange algorithm.