SparsePro: an efficient genome-wide fine-mapping 1 method integrating summary statistics and functional 2 annotations

10 Identifying causal variants from genome-wide association studies (GWASs) is challenging 11 due to widespread linkage disequilibrium (LD). Functional annotations of the genome may 12 help prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS 13 results. However, classical fine-mapping methods have a high computational cost, particu-14 larly when the underlying genetic architecture and LD patterns are complex. Here, we pro-15 pose a novel approach, SparsePro, to efficiently conduct functionally informed statistical 16 fine-mapping. Our method enjoys two major innovations: First, by creating a sparse low-17 dimensional projection of the high-dimensional genotype data, we enable a linear search 18 of causal variants instead of a combinatorial search of causal configurations used in most 19


Introduction
Establishment of large biobanks and advances in genotyping and sequencing technologies have enabled large-scale genome-wide association studies (GWASs) [1][2][3].Although GWASs have revealed extensive associations between genetic variants and traits of interest, understanding the genetic architecture underlying these genetic associations remains challenging [4][5][6], mainly because GWASs typically rely on univariate regression models, which are not able to distinguish the causal variants from other variants in linkage disequilibrium (LD) [5,7,8].
Several statistical fine-mapping approaches have been proposed for identifying causal variants in GWASs while considering the underlying LD patterns.For instance, BIMBAM [9], CAVIAR [10] and CAVIARBF [11] estimate the posterior inclusion probabilities (PIPs) in a pre-defined locus by evaluating multivariate Gaussian likelihood enumerating all possible configurations.FINEMAP [12] accelerates the inference with a shotgun stochastic search focusing on the most likely subset of causal configurations.However, the number of causal configurations required to evaluate can grow combinatorially as the number of causal variants increases, thus tremendously increasing the computational cost if multiple causal variants exist.SuSiE [13] introduces an iterative Bayesian stepwise selection algorithm for variable selection, which can also be applied to statistical fine-mapping with greatly improved computational efficiency.
Furthermore, it has been recognized that functional annotations of the genome may help prioritize variants that are biologically relevant, thus improving fine-mapping of GWAS results [8].
For example, PAINTOR [14] and RiVIERA [15] empirically estimate the impacts of functional annotations from statistical evidence, which improves the accuracy of fine-mapping but has a high computational cost, especially when multiple causal SNPs exist in the same locus.Poly-Fun [16] adopts stratified LD score regression [17] to effectively partition total trait heritability into annotation-dependent heritability estimates, and uses these estimates of annotationtagged heritability to specify functional priors for fine-mapping methods.
In this work, we present a unified probabilistic framework called Sparse Projections to Causal Effects (SparsePro) for statistical fine-mapping with the capacity to incorporate functional annotations.Accompanied with an efficient variational expectation-maximization inference algorithm [18], SparsePro achieves superior accuracy in identifying causal variants as well as computational efficiency compared to the state-of-the-art approaches in both simulation studies and real data analyses.We further demonstrate the utility of SparsePro in genome-wide fine-mapping of functional biomarkers for five vital organs in human.

SparsePro method overview
To fine-map causal SNPs, our method takes two lines of evidence (Figure 1).First, from estimated marginal associations between genetic variants and a complex trait of interest, accompanied by matched LD information, we can group correlated genetic variants together and as-sess their effects jointly.Then, we infer the contribution of each SNP towards each group of causal effect separately to obtain posterior inclusion probabilities (PIPs).Second, optionally, if we have knowledge about any functional annotations which may be enriched for the causal SNPs, we can estimate the relative enrichment of these annotations, and re-prioritize SNPs according to the enrichments of these annotations.As outputs, our model yields functionally informed PIP for each SNP and the enrichment estimates of candidate functional annotations.

Our contributions in the context of the existing methods
Our work is related to two existing methods, SuSiE [13] and PolyFun [16].Inspired by the "sum of single effects" model in SuSiE, we introduce a sparse projection of the genotype in our model specification so that the identification of causal variants and estimation of causal effect sizes are separated.This sparse projection avoids exhaustively evaluating the combinatorial number of causal configurations.For statistical inference, SuSiE adopts an iterative Bayesian stepwise selection algorithm that operates on the Bayes Factors (BFs) [13].Here, we use a paired mean field variational inference algorithm [18] to jointly update the variational parameters for the causal effects and causal indicators of each SNP.Moreover, we have adapted our algorithm to directly work with GWAS summary statistics and provided appropriate estimates for the hyperparamters including trait variance and heritability estimates.To enable functionally informed fine-mapping, PolyFun uses genome-wide heritability estimates from LD score regression to set the functional priors for fine-mapping methods [16].In contrast, we aggregate the genomewide statistical fine-mapping evidence by maximizing the evidence lower bound of SparsePro to prioritize relevant annotations and robustly derive genome-wide functional priors.

SparsePro model specification
We assume the following data generative process (Figure 1) for a continuous polygenic trait.
First, the prior probability πg for the g-th SNP (g ∈ {1, . . ., G}) being causal is defined as: where A g is the 1 × M annotation row vector of M candidate annotations for the g-th SNP; and w is the M × 1 vector of logarithm of relative enrichment.Here, we use the sof tmax function to ensure the prior probabilities sum up to 1.If no functional information is provided, the prior probability of being causal is considered equal for all SNPs, i.e. πg = 1 G .
We assume that there exist K independent causal effects, and that where π = (π 1 , ..., πG ) and s k is a binary indicator vector of length G indicating which SNP is the causal SNP under the k-th (k ∈ {1, ..., K}) causal effect.
Then, the causal effect sizes are sampled from a normal distribution, i.e.
Finally, the continuous trait y N ×1 over N individuals is generated as follows: or in matrix form: where X N ×G is the full genotype matrix, S G×K is the sparse projection matrix, β K×1 is the causal effect vector, and ϵ N ×1 ∼ N (0, τ −1 y I N ) denotes the variance not attributable to the modelled ge-netic effects.

A variational inference algorithm for Bayesian fine-mapping
With this model specification (Figure 1), finding the causal variants is equivalent to inferring the sparse projections s k and the effect sizes β k given y and X for k ∈ {1, ..., K}: p(S, β|y, X, π, τ β , τ y ) = p(y, S, β|X, π, τ β , τ y ) p(y|X, π, τ β , τ y ) As the number of possible causal configurations grows combinatorial with G, the exact posterior solution is intractable because of the marginal likelihood in the denominator.Unlike most existing fine-mapping approaches using sampling-based methods to search through a subset of possible causal configurations [12,19], we adopt a paired mean field factorization of variational family to approximate the posterior [18]: This variational distribution preserves the dependency between s k and β k .It has been shown that the paired mean field variational family has similar mode and shape as the desired posterior distribution, and that such inference can achieve high accuracy with substantially improved computational efficiency [18].
To find the best approximation, we minimize the Kullback-Leibler (KL) divergence between the posterior distribution and the proposed variational distribution, which is equivalent to maximizing the evidence lower bound (ELBO) [20]: ELBO = E q [log p(y, S, β|X, π, τ β , τ y )] − E q [log q(S, β)] Based on the mean field assumptions [18], this optimization can be conducted iteratively for the k th causal effect and the g th SNP with the following closed-form updates until convergence (derivation details are available in Supplementary Notes).
We update posterior effect size for the g-th SNPs in the k-th causal effect: with where • represents element-wise multiplication of vectors.
We then update the posterior probability of the g-th SNP being causal in the k-th causal effect: For fine-mapping, we take the maximum of these K probabilities as the PIP for SNP g: γ * g = max(γ * 1g , . . ., γ * Kg ).

Adaptation to GWAS summary statistics
The above variational inference algorithm requires access to large datasets containing both individual-level genotype X and phenotype data y.Since a growing number of GWASs have released publicly available summary statistics (i.e., marginal effect size estimate βg and its standard error se g for the g-th SNP), we adapt SparsePro to directly operate on these summary statistics with additional information from an LD reference panel (i.e.estimates of pairwise SNP-SNP Pearson correlation).
Specifically, if we have reasonable surrogates for X ⊤ g X g , X ⊤ X and X ⊤ g y, we can plug them into Equations ( 1), (2), and (3).We include two forms of reformulation depending on whether the genotypes are standardized to have zero mean and unit variance in the GWAS.
1.If the genotypes are standardized, we have where N is the sample size.
2. If the genotypes are not standardized, we have Therefore, Notably, if y has been standardized to have unit variance prior to a GWAS, we naturally supply var(y) = 1.Otherwise, it can be estimated as var(y) = 2N p(1 − p)se 2 where N (the study sample size), p (minor allele frequencies), and se (standard errors of effect size estimates) are usually available in GWAS summary statistics.

Variational expectation-maximization for integrating functional annotations
To estimate the relative enrichment of functional annotations and further prioritize variants, we adopt a variational expectation-maximization scheme to maximize ELBO with respect to the logarithm of relative enrichment (w) first and then use the estimate ŵ to calculate πg (prior probability of being causal) for each SNP.
Suppose we have M candidate annotations and A gm (m ∈ {1, ..., M }) is a 0/1 indicator denoting whether the g-th SNP has the m-th annotation.By setting the derivative of ELBO with respect to w m to 0 and solving for w m , we have the following estimate for the logarithm of rele-vant enrichment (detailed in Supplementary Notes), where We note that this metric is equivalent to the logarithm of a relative risk, thus its standard error can be calculated as We evaluate the significance of annotation enrichment with the log likelihood ratio test (Gtest) [21].Only annotations which demonstrate statistical significance are included in our model to update the prior probability of being causal for each SNP.Specifically, This functionally informed prior helps prioritize causal SNPs in addition to statistical evidence.

Hyperparameter settings
We have three hyperparameters: number of causal effect K, inverse of the unexplained variance τ y and inverse variance of causal effect sizes τ β k in our model.As we show in Supplementary Notes, our model is not sensitive to the setting of K as long as K is larger than the actual number of independent effects, except that increasing K marginally increases the computation time.
We set τ y as where h 2 is the local SNP heritability that can be estimated by a modified Heritability Estimation from Summary Statistics (HESS) [22] based on GWAS summary statistics (Supplementary

Notes)
We set τ β as for each of the independent causal effects.We use k ∈ {1, ..., K} to account for different effect sizes and to improve model identifiability.

Simulation studies
We conducted simulations to showcase the efficiency and utility of our method.We leveraged resources from the UK Biobank [1].Specifically, we first retained 353,606 White British ancestry participants by excluding one individual from each pair of closely related individuals (who had a 3rd degree or closer relationship).We then retrieved the genotypes of these individuals based on 271,699 SNPs which had a minor allele frequency ≥ 0.001 and an imputation quality score ≥ 0.6 on chromosome 22.Next, we sampled 50 causal SNPs with a twofold relative enrichment amongst SNPs that were annotated as "conserved sequences" [23], "DNase I hypersensitive sites" (DHS) [24], "non-synonymous" [25], or that overlapped with histone marks H3K27ac [26] or H3K4me3 [24].We used the GCTA GWAS simulation pipeline [27] to simulate a continuous trait with a per-chromosome heritability of 0.01.We tested the association between each SNP and this simulated trait, and obtained GWAS summary statistics using the fastGWA software [28].This process was replicated 22 times to imitate a GWAS.We ob-tained LD information calculated using the UK Biobank participants from https://alkesgroup.broadinstitute.org/UKBB_LD/[16].These LD matrices were generated for genome-wide SNPs binned into sliding windows of 3 Mb where two neighboring windows had a 2-Mb overlap.
We applied SparsePro to the GWAS summary statistics with the above LD information, and iterated over all sliding windows, first without any functional annotation information.We denoted the fine-mapping results as "SparsePro-".Next, we aggregated the results from all 22 replications to estimate the relative enrichment for ten binary functional annotations.In addition to the five annotations simulated to be enriched of causal SNPs, we also included five annotations without enrichment: "actively transcribed regions" [29], "transcription start sites" [29], "promoter regions" [30], "5'-untranslated regions" [25], and "3'-untranslated regions" [25].
Annotations with a G-test p-value < 1 × 10 −6 were selected to conduct functionally informed fine-mapping, and the results were denoted as "SparsePro+".τ β and τ y were set according to aforementioned empirical estimates.PIPs for SNPs in the 1-Mb centre of each 3-Mb sliding window were extracted.

Method comparisons using simulated data
We also performed fine-mapping with some of the state-of-the-art methods.To perform finemapping with conditional and joint (COJO) analyses [31] and FINEMAP [12], we first selected COJO lead SNPs based on GWAS summary statistics by performing stepwise model selection implemented in the GCTA-COJO software [27].We then applied FINEMAP with shotgun stochastic search to SNPs in a 1-MB window centered at each COJO-identified lead SNP.We wrote an in-house script using the "susie rss" function to perform genome-wide fine-mapping with SuSiE in the same sliding windows as SparsePro.We aggregated summary statistics from 22 replications and used PolyFun with the "baselineLF2.2.UKB" model [16] to calculate functional priors.The "baselineLF2.2.UKB" model contained all annotations used in SparsePro as well as additional pre-computed LD-related annotations for optimal performance of PolyFun [16].The estimated priors were provided to SuSiE via "prior weights" and to FINEMAP via the --prior-snps option, respectively.The maximal number of causal SNPs in each locus was set to 5 for all methods.
We compared the performance of these methods in terms of precision (1 -false discovery rate), recall, calibration of PIPs, as well as computation time, all evaluated on a 2.1 GHz CPU node on Compute Canada.

Fine-mapping genetic determinants of functional biomarkers for vital organs
To investigate the genetic coordination mechanisms of vital organs, we performed GWAS in the UK Biobank [1] for five functional biomarkers: forced expiratory volume in one second to forced vital capacity (FEV1-FVC) ratio for lung function, estimated glomerular filtration rate for kidney function, pulse rate for heart function, total protein for liver function and blood glucose level for pancreatic islet function.For each trait, we first regressed out the effects of age, age 2 , sex, genotyping array, recruitment centre, and the first 20 genetic principal components before inverse normal transforming the residuals to z-scores that had zero mean and unit variance.
We then performed GWAS analysis on the resulting z-scores with the fastGWA software [27,28] to obtain summary statistics.
Using the summary statistics and the matched LD information [16], we performed genomewide fine-mapping with SparsePro-, SparsePro+, SuSiE and PolyFun-informed SuSiE as described in the simulation analyses (Section 2.9), except that the number of causal effects was set to 9 for each LD region to account for potentially more causal variants.
To evaluate the biological relevance of SNPs fine-mapped by different methods, we assessed their relative enrichment in tissue-specific expression quantitative loci (eQTL).Tissue-specific eQTL identified in the most recent release of the Genotype-Tissue Expression (GTEx) project [32,33] were obtained from https://gtexportal.org/home/datasets.The eQTL information was not used by any functionally informed fine-mapping methods.
Additionally, we calculated trait heritability conferred by fine-mapped SNPs with SparseProand SparsePro+, respectively, at several commonly used PIP thresholds for determining causal variants: 0.50, 0.80, 0.90, 0.95, and 0.99.The adjusted R 2 obtained from multivariate linear regression of the z-scores (i.e.inverse normal transformed trait residuals after regressing out covariate effects) against all fine-mapped SNPs was used as a surrogate of the SNP heritability.
We compared these results to heritability captured by the same number of SNPs fine-mapped by SuSiE and PolyFun-informed SuSiE, separately at each PIP threshold.For instance, if SparseProidentified J SNPs with a PIP > 0.5, we would select J SNPs with the highest PIP determined by SuSiE and compare the adjusted R 2 .Notably, this analysis evaluates predictive associations instead of actual causality, hence the adjusted R 2 is not a direct indicator of the validity of the fine-mapping results.
We selected SNPs with a PIP > 0.8 to explore possible pleiotropic effects using phenogram [34].Loci with potential pleiotropic effects were visualized using LocusZoom [35].

SparsePro demonstrates superior performance in simulation
We performed simulations based on real genotype data from UK Biobank (Materials and Methods).We observed that SparsePro consistently demonstrated superior accuracy in identifying true causal variants.SparsePro without annotation (SparsePro-) achieved an area under the precision-recall curve (AUPRC) of 0.3699, higher than the AUPRC of 0.2677 by FINEMAP and the AUPRC of 0.3573 by SuSiE (Figure 2A).Notably, SparsePro had a substantially higher precision at the same recall rates (Figure 2A).For example, at the recall rate of 25%, Sparse-Pro achieved greater than 95% precision, which is highly desirable in fine-mapping because only a small number of the prioritized SNPs will be experimented validated in vivo or in vitro in practice (Figure 2A).Moreover, SparsePro can incorporate functional priors (Supplementary Table S1) with improved fine-mapping power.SparsePro+ achieved an AUPRC of 0.4636, outperforming both functionally informed FINEMAP (AUPRC = 0.3088) and functionally informed SuSiE (AUPRC = 0.4042) with functional priors derived by PolyFun.As expected, we also found that the performance of SparsePro was not sensitive to the pre-specified number of independent causal effects (Supplementary Table S2 and Supplementary Notes).
Compared to FINEMAP and SuSiE, the PIPs yielded by SparsePro appeared to be much more calibrated.It has been shown that for a well-calibrated fine-mapping method, the mean PIP of all SNPs with a PIP above a certain threshold should be equal to the precision if these SNPs were to be considered causal variants [16].Here, we found that the mean PIP of all SNPs considered to be causal variants by SparsePro was almost identical to the desired precision at any threshold (Figure 2B).In contrast, the PIPs generated by FINEMAP and SuSiE appeared to be inflated (Figure 2B).
For instance, if SNPs with a PIP > 0.8 were to be considered as causal variants, SparseProand SparsePro+ would both have a median precision across simulations of 95% and 100% respectively (Figure 2C).The selected SNPs by FINEMAP (median precision = 50% only) and SuSiE (median precision = 71% only) included an excessive proportion of false positives, even with functional priors (median precision = 77% for FINEMAP and 79% for SuSiE; Fig- ure 2C).The high precision by SparsePro was consistent for all frequently used PIP thresholds (Figure 2C) although FINEMAP and SuSiE sometimes have a slightly higher recall.Furthermore, SparsePro conferred not only higher fine-mapping precision, but also higher computational efficiency.In our simulation, it took only an hour to fine-map chromosome 22, which was 6.5 times faster than FINEMAP and 3 times faster than SuSiE (Figure 2D and Supplementary Table S3).

Fine-mapped SNPs by SparsePro are more enriched in tissue-specific eQTL and confer higher trait heritability
We performed GWAS in the UK Biobank [1] for five functional biomarkers: FEV1-FVC ratio (lung function), estimated glomerular filtration rate (kidney function), pulse rate (heart function), total protein (liver function) and blood glucose level (pancreatic islet function).Genome-wide fine-mapping of five functional biomarkers based on the UK Biobank population using Sparse-Pro identified multiple potentially causal variants (Supplementary Table S4).To assess biological relevance of the fine-mapping results, we estimated the relative enrichment of causal signals in tissue-specific eQTL for each trait (Materials and Methods).We found that the finemapped SNPs were significantly enriched in tissue-specific eQTL for all five biomarkers, while results based on SparsePro-/+ showed the strongest enrichment (Figure 3A).For example, for total protein, the fine-mapped SNPs determined by SparsePro-were 4.00-fold (95% CI: 3.25-4.92)more likely to be liver-specific eQTL than non-fine-mapped SNPs, compared to a 1.54-fold (95% CI: 1.35-1.75)enrichment based on fine-mapped SNPs by SuSiE.While SuSiE was substantially improved by functional priors derived from PolyFun with a 2.20-fold (95% CI: 1.97-2.45)enrichment, the fine-mapped SNPs by SparsePro+ exhibited the highest biological relevance, being 4.06-fold (95% CI: 3.31-4.97)more likely to be liver-specific eQTL.Moreover, at most PIP thresholds, the SNPs fine-mapped by SparsePro-explained a higher proportion of phenotypic variance based on all UK Biobank subjects (Methods) compared to the same number of the most likely causal SNPs identified by SuSiE (Figure 3B and Supplementary Table S5).With the functional annotations (Supplementary Table S5), the finemapped SNPs by SparsePro+ consistently achieved a higher SNP heritability for estimated glomerular filtration rate, FEV1-FVC ratio, as well as total protein compared to the PolyFuninformed SuSiE; although for glucose and pulse rate, PolyFun-informed SuSiE was able to identify SNPs with a slightly higher predictive performance at certain PIP thresholds (Figure 3C and Supplementary Table S6).

Pleiotropic effects of SNPs rs1260326 and rs5742915 on the functions of multiple vital organs
Overall, we observed considerable polygenicity for the five biomarkers of the vital organs (Figure 4A).Interestingly, at the PIP threshold of 0.80, we found two potentially causal variants for three of the five biomarkers.Specifically, SNP rs1260326 (Figure 4B), a missense variant (Leu446Pro) in gene GCKR, was fine-mapped for glomerular filtration rate (PIP = 1.000), blood glucose level (PIP = 0.998), pulse rate (PIP = 0.823) and total protein level (PIP = 1.000).Notably, this specific variant has been found to be significantly associated with a wide variety of glycemic traits [36] and other quantitative traits for metabolic syndromes and comorbidities [37,38], and has been implicated in the functions of liver and other vital organs [39][40][41].
Another SNP, rs5742915 (Figure 4C), a missense variant (Phe645Leu) in gene PML was fine-mapped for FEV1-FVC ratio (PIP = 0.858), pulse rate (PIP = 1.000) and total protein level (PIP = 0.987).This variant has also been associated with other quantitative biomarkers of polygenic traits featuring development and metabolism, including birth weight [42], height [43], appendicular lean mass [44], and age at menarche [45].These findings, along with other SNPs exhibiting pleiotropic effects at somewhat lower PIP thresholds (Supplementary Table S4) presented promising genetic targets for experimental validations in a larger effort towards understanding the mechanisms of genetic coordination among vital organs.
Accurately identifying trait-determining and disease-causing variants is fundamental in genetics and particularly important for appropriately interpreting GWAS results [5,8].In this work, we developed SparsePro, an efficient fine-mapping method to help prioritize causal variants for complex traits, possibly with prior functional information.Through genome-wide simulations, we showed that SparsePro was highly accurate and computationally efficient compared to existing methods.By fine-mapping genetic associations with five biomarkers for vital organ functions, we demonstrated that SparsePro identified candidate variants that were biologically relevant, including two variants with pleiotropic effects, which might indicate genetically encoded coordination among vital organs.
Compared to the existing methods, SparsePro has three important features.First of all, we use an efficient variational inference algorithm to approximate the posterior distribution of the causal variant indicators instead of exhaustively searching through all possible causal configurations or performing stepwise regression.As a result, SparsePro can be significantly faster than the existing fine-mapping methods, such as FINEMAP [12], and is more than twice as fast as SuSiE [13], which is a similar variable selection framework but implements an iterative Bayesian stepwise selection procedure.The substantially improved computational efficiency enables statistical fine-mapping of large chunks of the genome instead of analyzing genetic associations on a per-locus basis as in most existing follow-up studies of GWASs.In our simulation studies, compared to locus-wise fine-mapping based on COJO-identified lead SNPs, such a genome-wide fine-mapping requires neither a pre-specified p-value threshold (e.g.p < 5 × 10 −8 ) for determining candidate loci nor an arbitrary number of causal effects per locus.If functional annotations are available, the estimation of functional enrichment may also be more robust by including more variants with little additional computational overhead.
Second, we utilize a paired mean field variational family, where the causal effect and the casual indicator are coupled in the variational distribution.This ensures that our approximation matches closely with the true posterior distribution of the causal variant indicators [18].As a result, SparsePro yielded better-calibrated PIPs compared to existing fine-mapping methods.
Third, given GWAS summary statistics, we provide estimates for hyperparameters including τ y and τ β that are reasonable in the context of polygenic trait genetics.Consequently, at several commonly used PIP thresholds for defining causal variants, SparsePro showed improved control of false positives, demonstrated higher precision in identifying causal variants in simulation and obtained stronger enrichment for tissue-specific eQTL in real data application.
Last, we propose and implement a probabilistic model that coherently integrates statistical evidence and functional prior information.The key difference between SparsePro+ and other methods that leverage functional priors, such as PolyFun [16] and PAINTOR [14], is that each annotation is tested for its relevance with the trait of interest before being used to derive the priors in our model.Therefore, functional annotations serve as complementary evidence when statistical evidence is not sufficient to discern causal variants.Based on our results, it seems that this approach distills better prior information from the functional annotations compared to the aforementioned methods.
We note that SparsePro can be further improved with the following future directions.First, SparsePro generally requires that the supplied LD reference panel matches well with that of the GWAS study population to guarantee proper convergence.While we advocate the public availability of the in-sample LD information along with the GWAS summary statistics, a more robust model is needed to account for mismatched LD information.Second, SparsePro currently supports only binary annotations while compatibility with continuous annotations is also desirable.
Last, the current variational expectation-maximization scheme might not accurately estimate the joint enrichment of highly correlated annotations.Performing variable selection beforehand or effectively aggregating enrichment estimates may enable the inclusion of multiple correlated informative annotations, such as cell type-specific annotations to further improve the utility of SparsePro.
In summary, SparsePro is an efficient genome-wide fine-mapping method with the ability of integrate functional annotations.We envision its wide utility in understanding the genetic architecture of complex traits, identifying target genes, and increasing the yield of functional followup studies of GWASs.
To help prioritize variants with functional annotations, we assume the prior probability of being causal πg for the g-th variant as πg = sof tmax(A g w) where A g is a M × 1 functional annotation vector and w is the M × 1 vector of annotation enrichment coefficients.We adopt an efficient variational inference algorithm to jointly estimate both causal effect sizes and sparse indicators and an expectation-maximization scheme for estimating annotation enrichment coefficients w as detailed in Section 2.

• Marginal associations • LD reference panel ○ Functional annotations • Posterior inclusion probabilities ○ Relative enrichment of functional annotations
Figure 1: SparsePro overview.The data generating process of SparsePro is depicted in a plate model with shaded nodes represent observed variables and unshaded nodes represent latent variables.The trait y is generated from K causal effects, where the k-th causal effect size We use a sparse projection s k ∼ M ultinomial(1, π) of genotype to indicate causal variant for the k-th effect.Given the causal effect sizes and sparse indicators of causal variants, the target trait To help prioritize variants with functional annotations, we assume the prior probability of being causal πg for the g-th variant as πg = sof tmax(A g w) where A g is a M × 1 functional annotation vector and w is the M × 1 vector of annotation enrichment coefficients.We adopt an efficient variational inference algorithm to jointly estimate both causal effect sizes and sparse indicators and an expectation-maximization scheme for estimating annotation enrichment coefficients w as detailed in Section 2.   Fine-mapped SNPs were identified at five PIP thresholds.As a surrogate of SNP heritability, the proportion of trait variance explained was obtained from multivariate linear regression adjusted R 2 .In this multivariate regression, we regress the inverse normal transformed trait residuals against all fine-mapped SNPs after adjusting for covariate effects.We selected the same number of top-ranked SNPs for each method separately at each PIP threshold (Materials and Methods).Recombination rate (cM/Mb)                                                                                                                                                                                                                                                                                                                                                                                                                  Recombination rate (cM/Mb) Recombination rate (cM/Mb) Recombination rate (cM/Mb) Illustration of genome-wide distribution of fine-mapped SNPs on 22 chromosomes.SNPs with a posterior inclusion probability > 0.80 were indicated as colored solid circles.Two loci with potential pleiotropic effects on four and three vital organ biomarkers respectively were highlighted by red dashed rectangles.Locus zoom plots were presented for these two loci: (B) locus with fine-mapped SNP rs1260326.and (C) locus with fine-mapped SNP rs5742915.SNPs in a ±500 kb window are included, colored by r 2 with the corresponding fine-mapped SNP.

Modified HESS estimates for hyperparameters τ y and τ β
Local heritability estimates are useful in setting hyperparameters for SparsePro.Shi et al. [22] provided an unbiased estimator for local heritability estimation based on summary statistics: ĥg = N βT R −1 β − P N − P where R is the LD matrix, β is GWAS summary effect size, N is the sample size in the GWAS and P is the number of SNPs considered in a locus.However, this estimate requires that when generating summary statistics, both genotypes and phenotypes should be standardized to have zero mean and unit variance.Since summary statistics generated by some GWAS pipelines do not specifically standardize the genotypes and phenotypes, we modified the HESS estimator to account for the non-unit variance: where • represents element-wise multiplication and v is a P × 1 vector: v p = X ⊤ p X p for the p-th SNP with genotype vector X p .This estimate can be adapted to directly operate on summary statistics as explained in Materials and Methods.

Full derivation of variational EM algorithm:
As has been described in Materials and Methods, based on the data generative process, for the k-th causal effect, we have: with ϵ i ∼ N (0, τ −1 y ).Therefore, we have the joint probability: p(y, S, β|X, π, τ β , τ y ) = p(y|X, S, β, τ y ) The goal of fine-mapping is to infer the posterior probability, and in particular, of the sparse projection S (from here we make the dependency on hyperparameters implicit for the ease of notation): p(S, β|y, X) = p(y, S, β|X) p(y|X) We use a paired mean field factorized [18] variational family q(S, β) to approximate the posterior: Note that in this variational family, we do not specify the form of the distribution; rather, we only specify the dependency of β k on s k and that all K causal effects are independent of each other.Also, the form of the variational family does not depend on any observed data.
To better approximate the posterior distribution with members of the variational family, we aim to minimize the KL divergence between the posterior distribution and the proposed variational distribution, which is equivalent to maximizing the ELBO [20]: ELBO = E q(S,β) [log p(y, S, β|X)] − E q(S,β) [log q(S, β)] To maximize the above ELBO, the following requirement should be satisfied for each k: log q(s k , β k ) = E q(S \k ,β \k ) [(y, S, β|X)] where E q(S \k ,β \k ) is the expectation with respect to the variational distribution excluding the k-th where β\k = E q(S \k ,β \k ) [ k′̸ =k We recognize that q(β k |s kg=1 , s k\g = 0) ∼ N (µ * kg , τ * kg ) By matching sufficient statistics for this normal distribution, we can obtain the following variational parameters for updates: By integrating out β k in Equation ( 5), we obtain that log q(s kg = 1, s k\g = 0) = log πg −

Figure 1 .
Figure 1.SparsePro overview.The data generating process of SparsePro is depicted in a plate model with shaded nodes represent observed variables and unshaded nodes represent latent variables.The trait y is generated from K causal effects, where the k-th causal effect size β k ∼ N (0, τ β k ).We use a sparse projection s k ∼ M ultinomial(1, π) of genotype to indicate causal variant for the k-th effect.Given the causal effect sizes and sparse indicators of causal variants, the target trait y i for individual i follows a normal distribution y i ∼ N (X i k s k β k , τ −1 y ).

Figure 2 .
Figure 2. SparsePro demonstrated improved accuracy and computational efficiency in genomewide simulation results.(A) Precision-Recall curves.The inset shows the area under the precision recall curve (AUPRC) for each method.(B) Calibration of posterior inclusion probabilities (PIPs).The y-axis is the mean PIPs for all SNPs considered as causal variants, corresponding to the expected precision at different PIP cutoffs.The x-axis represents the actual precision at different PIP cutoffs.The black dashed line indicates an optimal calibration, where the expected precision perfectly matches the observed precision.(C) Precision and recall rates obtained at five frequently used PIP thresholds.Error bars indicate inter-quartile ranges.(D) Comparison of computational time.Boxes denote inter-quartile ranges and the line inside each box indicates the median running time.The color legends are displayed at the bottom of the figure.

Figure 3 .
Figure 3. Biological relevance of fine-mapped SNPs for five biomarkers, each for a distinct vital organ.(A) Relative enrichment of causal signals in tissue-specific eQTL.Target traits and

Figure 4 .
Figure 4. Fine-mapping genetic associations for five functional biomarkers of vital organs.(A) Illustration of genome-wide distribution of fine-mapped SNPs on 22 chromosomes.SNPs with a posterior inclusion probability > 0.80 were indicated as colored solid circles.Two loci with potential pleiotropic effects on four and three vital organ biomarkers respectively were highlighted by red dashed rectangles.Locus zoom plots were presented for these two loci: (B) locus with fine-mapped SNP rs1260326.and (C) locus with fine-mapped SNP rs5742915.SNPs in a ±500 kb window are included, colored by r 2 with the corresponding fine-mapped SNP.

Figure 2 :
Figure 2: SparsePro demonstrated improved accuracy and computational efficiency in genome-wide simulation results.(A) Precision-Recall curves.The inset shows the area under the precision recall curve (AUPRC) for each method.(B) Calibration of posterior inclusion probabilities (PIPs).The y-axis is the mean PIPs for all SNPs considered as causal variants, corresponding to the expected precision at different PIP cutoffs.The x-axis represents the actual precision at different PIP cutoffs.The black dashed line indicates an optimal calibration, where the expected precision perfectly matches the observed precision.(C) Precision and recall rates obtained at five frequently used PIP thresholds.Error bars indicate inter-quartile ranges.(D) Comparison of computational time.Boxes denote inter-quartile ranges and the line inside each box indicates the median running time.The color legends are displayed at the bottom of the figure.

Figure 3 :
Figure 3: Biological relevance of fine-mapped SNPs for five biomarkers, each for a distinct vital organ.(A) Relative enrichment of causal signals in tissue-specific eQTL.Target traits and the corresponding organs are indicated.Estimates of relative enrichment with 95% confidence intervals are plotted on a logarithmic scale.(B) Comparison of the proportion of total trait variance explained by fine-mapped SNPs between SparsePro-and SuSiE.(C)Comparison of the proportion of total variance explained by fine-mapped SNPs between SparsePro+ and PolyFun informed SuSiE.Fine-mapped SNPs were identified at five PIP thresholds.As a surrogate of SNP heritability, the proportion of trait variance explained was obtained from multivariate linear regression adjusted R 2 .In this multivariate regression, we regress the inverse normal transformed trait residuals against all fine-mapped SNPs after adjusting for covariate effects.We selected the same number of top-ranked SNPs for each method separately at each PIP threshold (Materials and Methods).

Figure 4 :
Figure 4: Fine-mapping genetic associations for five functional biomarkers of vital organs.(A)Illustration of genome-wide distribution of fine-mapped SNPs on 22 chromosomes.SNPs with a posterior inclusion probability > 0.80 were indicated as colored solid circles.Two loci with potential pleiotropic effects on four and three vital organ biomarkers respectively were highlighted by red dashed rectangles.Locus zoom plots were presented for these two loci: (B) locus with fine-mapped SNP rs1260326.and (C) locus with fine-mapped SNP rs5742915.SNPs in a ±500 kb window are included, colored by r 2 with the corresponding fine-mapped SNP.

s
component.With the joint probability provided in Equation (4) we have log p(y, S, β|X) = log p(y|X, S, β)+ k log p(β k |τ β k ) + k (s k | π) kg log πgTaking expectation with respect to the variational distribution excluding the k-th component and plugging in s kg = 1 and s k\g = 0 for all SNPs excluding the g-th SNP, we can obtain the joint distribution of the k-th effect as: posterior probability of the g-th SNP being causal in the k-th effect can be estimated as:γ * kg := q(s kg = 1, s k\g = 0) = sof tmax(log πg −This completes the variational expectation step in our inference algorithm.When functional annotations are available, we use the following maximization step to integrate relevant annotations.After the expectation step, we have thatELBO = const + [A g w − log( g exp(A g w))]To maximize ELBO with respect to the relative enrichment of the m-th candidate annotation, we take partial derivatives of ELBO with respect to w m and set it to 0 to solve for w m :[A gm − g A gm exp(A g w)) g exp(A g w)) [A gm − g A gm exp(A gm w m ) exp( m ′ ̸ =m A gm ′ w m ′ )) g exp(A gm w m ) exp( m ′ ̸ =m A gm ′ w m ′ )) [A gm − g A gm exp(A gm w m )sof tmax( m ′ ̸ =m A gm ′ w m ′ ) g exp(A gm w m )sof tmax( m ′ ̸ =m A gm ′ w m ′ ) gm = 1]sof tmax( m ′ ̸ =m A gm ′ w m ′ ) e wm g [A gm = 1]sof tmax( m ′ ̸ =m A gm ′ w m ′ ) + g [A gm = 0]sof tmax( m ′ ̸ =m A gm ′ w m ′ ) = r 1 − (r 1 + r 0 ) k 1 e wm k 1 e wm + k 0 = 0wherek 1 = g [A gm = 1]sof tmax( m ′ ̸ =m A gm ′ w m ′ ) k 0 = g [A gm = 0]sof tmax( m ′ ̸ =m A gm ′ w m ′ ) r 1 = k,g [A gm = 1]γ * kg r 0 = k,g [A gm = 0]γ * kgWe then obtain: k 1 e wm k 1 e wm + k 0 = r 1 r 1 + r 0 and solve for: 574 35