Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores

Summary Publicly available genome-wide association studies (GWAS) summary statistics exhibit uneven quality, which can impact the validity of follow-up analyses. First, we present an overview of possible misspecifications that come with GWAS summary statistics. Then, in both simulations and real-data analyses, we show that additional information such as imputation INFO scores, allele frequencies, and per-variant sample sizes in GWAS summary statistics can be used to detect possible issues and correct for misspecifications in the GWAS summary statistics. One important motivation for us is to improve the predictive performance of polygenic scores built from these summary statistics. Unfortunately, owing to the lack of reporting standards for GWAS summary statistics, this additional information is not systematically reported. We also show that using well-matched linkage disequilibrium (LD) references can improve model fit and translate into more accurate prediction. Finally, we discuss how to make polygenic score methods such as lassosum and LDpred2 more robust to these misspecifications to improve their predictive power.


Introduction
Contrary to individual-level genotypes and phenotypes, summary statistics resulting from genome-wide association studies (GWAS) are widely available, and very large sample sizes can be obtained through meta-analyses. 1 GWAS summary statistics have been extensively used to derive polygenic scores (PGS), perform fine-mapping, and estimate a range of key genetic architecture parameters. [2][3][4] However, GWAS summary statistics come with uneven imputation accuracy. There is also heterogeneity in the information made available in these summary statistics; for example, per-variant imputation INFO scores, sample sizes, and allele frequencies are often missing. Moreover, many methods based on summary statistics use Bayesian models and iterative algorithms, which can be particularly sensitive to model misspecifications. 5,6 We present an overview of possible misspecifications that come with GWAS summary statistics in Table 2. First, the total sample size used can be misestimated, which would result in a biased estimation of the SNP heritability. For example, the total effective sample size is overestimated when computed from the total number of cases and controls from a meta-analysis of binary outcomes; 7 using BOLT-LMM summary statistics can result in an increased effective sample size; 8,9 and using SAIGE on binary traits with a large prevalence can result in a reduced effective sample size. 10,11 Second, per-variant sample sizes can vary substantially and be much smaller than the total sample size when meta-analyzing GWAS summary statistics from multiple cohorts with different sets of variants. 12 Using very different per-variant sample sizes can be problematic for models implicitly assuming that summary statistics have all been derived from the same individuals. 13,14 Third, many genetic variants are imputed, with association statistics being reported for the allele dosages instead of the true alleles, which can lead to some bias in the summary statistics. Fourth, the imputation quality of each variant (e.g., INFO scores), often used in a quality control (QC) step in statistical genetics analyses, can be severely misestimated (often overestimated) when computed from multiancestry individuals instead of a more homogeneous subset. Fifth, errors such as allele inversions can be present in the summary statistics; QC is particularly important here. Sixth, many follow-up analyses require using linkage disequilibrium (LD) from a reference panel. There can be some mismatch between the GWAS summary statistics and the LD reference used, which can, e.g., lead to suboptimal predictive performance for polygenic scores.
Here we investigate some of these misspecifications and propose adjustments to improve the predictive performance of polygenic scores derived from GWAS summary statistics. We approach this from three different angles. First, based on additional summary information such as the imputation INFO scores and allele frequencies from the GWAS summary statistics, we refine our previously proposed QC. 15 This QC consists in comparing standard deviations (of genotypes) inferred from GWAS summary statistics with the ones computed from a reference panel. This is useful to check that the input parameters used are consistent with one another. This was particularly important for LDpred2-auto, which directly estimates two key model parameters, the SNP heritability and polygenicity, from the data. 15 Here we further show that standard deviations of imputed genotypes (allele dosages) are lower than the expected values under Hardy-Weinberg equilibrium. Second, we investigate possible adjustments to apply to the input parameters of PGS methods using summary statistics, namely the reference linkage disequilibrium (LD) matrix, the GWAS effect sizes, their standard errors, and their corresponding sample sizes. For example, we show that GWAS effect sizes computed from imputed dosages are larger in magnitude than when computed from true genotypes. Third, we introduce two new optional parameters in LDpred2-auto to make it more robust to these types of misspecification. Note that, in this paper, we use ''robust'' to mean that we obtain predictive performance similar to that when there is no misspecification. We also reimplement and use a new version of lassosum, called lassosum2, in which we better handle when per-variant sample sizes are different. We focus our investigations on LDpred2 and lassosum2 for two reasons. First, multiple studies have shown that LDpred2 and lassosum rank among the best methods for single-trait polygenic prediction. [15][16][17][18] Second, lassosum2 now uses the exact same input parameters as LDpred2, which makes it easy for us to test the different QCs and adjustments presented here. We additionally investigate PRS-CS and SBayesR, two competitive PGS methods, 19,20 for which we develop some code to convert between the different formats required for the LD matrices and input GWAS summary statistics.

Data for simulations
We use the UK Biobank imputed (BGEN) data. 21 We restrict individuals to the ones used for computing the principal components (PCs) in the UK Biobank (field 22020). These individuals are unrelated and have passed some QC including removing samples with a missing rate on autosomes larger than 0.02, having a mismatch between inferred sex and self-reported sex, and outliers based on heterozygosity (more details can be found in section S3 of Bycroft et al. 21 ). To obtain a set of genetically homogeneous individuals, we compute a robust Mahalanobis distance based on the first 16 PCs and further restrict individuals to those within a log-distance of 5. 22 This results in 362,307 individuals of Northwestern European ancestry. We randomly sample 300,000 individuals to form a training set (e.g., to run the GWAS), 10,000 individuals to form a validation set (to tune hyper-parameters), and use the remaining 52,307 individuals to form a test set (to evaluate final predictive models).
Among genetic variants on chromosome 22 and with a minor allele frequency larger than 0.01 and an imputation INFO score larger than 0.4 (as reported by the UK Biobank), we sample 40,000 of them according to the inverse of the INFO score density so that they have varying levels of imputation accuracy ( Figure S1). We read the UK Biobank data into two different datasets using function snp_readBGEN from R package bigsnpr, 23 one by reading the BGEN data as hard-called genotypes by randomly sampling according to the imputation probabilities (of being 0, 1, or 2), and another one by reading it as dosages (i.e., expected values according to the imputation probabilities). The first dataset is used as what could be the real genotype calls and the second dataset as what would be its imputed version; this design technique was used in Privé et al. 24

Data for real analyses
We also use the UK Biobank data for validation/testing in real-data analyses, and use the same individuals of Northwestern European ancestry as described in the previous section. We sample 10,000 individuals to form a validation set and use the remaining 352,307 individuals as a test set. We restrict to the 1,054,315 HapMap3 variants used in the LD reference provided in Privé et al. 15 To define phenotypes in the UK Biobank, we first map ICD10 and ICD9 codes (UK Biobank fields 40001, 40002, 40006, 40013, 41202, 41270, and 41271) to phecodes using R package Phe-WAS. 25,26 We also use some continuous phenotypes, namely vitamin D (data-field 30890), height (50), body mass index (BMI) (21001), systolic blood pressure (4080), and high-density lipoprotein (HDL) cholesterol (30760).
To derive polygenic scores, we use published GWAS summary statistics listed in Table 1. We also use GWAS summary statistics for five disease endpoints from FinnGen 33 (release 6) and for four continuous outcomes from Biobank Japan. 34

Model for summary statistics
In this paper, we extensively use the following formula: where b g j is the marginal GWAS effect size of variant j, n j is the GWAS sample size associated with variant j, y is the vector of phenotypes, and G j is the vector of genotypes for variant j. This formula is used in LDpred2. 15,35 Note that, for a binary trait for which logistic regression is used, we have instead It is assumed that the marginal (GWAS) effects follow b gjS; R; g $ N À S À 1 RSg; S À 1 RS À 1 Á ; or; Sb gjS; R; g $ N ðRSg; RÞ; (Equation 3) where R is the correlation matrix of genotypes. 13 Then PGS methods aim at inferring g, the true causal (per-allele) effects, from S, R, and b g. However, in practice, we only have estimates for S and R, which causes some model misspecifications. Moreover, ubiquitous correlations between the large number of variants may cause a Gibbs sampler or any iterative approach used in PGS methods to fail in properly estimating genetic effect sizes. 36

GWAS sample size imputation
We can impute n j from Equation (1) using n j z varðyÞ . var and impute n eff j from Equation (2) using In practice, we also bound these estimates to be between 0:5, N and 1:1,N, where N is the total (effective) sample size. Also note that the estimate of varðG j Þ has to account for the imputation accuracy, i.e., use 2,f j ,ð1 À f j Þ,INFO j (section ''misspecification when using imputed allele dosages''). In Equations 1 and 4, we estimate varðyÞ by the first percentile of 0:5ðN seðb g j Þ 2 þb g 2 j Þ, where the first percentile approximates the minimum and is robust to outliers.

New implementation of lassosum in bigsnpr
Instead of using a regularized version of the correlation matrix R parameterized by s, R s ¼ ð1 À sÞR þ sI (where 0 < s % 1), we use R d ¼ R þ dI (where d > 0), which makes it clearer that lassosum is also using L2-regularization (therefore elastic-net). Then, from Mak et al., 16 the solution from lassosum can be obtained by iteratively updating Following the notations from Privé et al., 15 where G is the genotype matrix, C n is the centering matrix, and S is the diagonal matrix of standard deviations of the columns of G. ThenX and b g j is the GWAS effect of variant j, and n is the GWAS sample size. 16,35 Then computing R T :;j b ðt À 1Þ is the most time-consuming part of each iteration. To make this faster, instead of computing R T :;j b ðt À 1Þ at each iteration (j and t), we can start with an initial vector of 0s only (for all j) since b ð0Þ h0, then update this vector when b ðtÞ j sb ðt À 1Þ j only. Note that only positions k for which R k;j s0 must be updated in this vector R T :;j b ðt À 1Þ . Since bigsnpr v1.10.4, we now also use this updating strategy in LDpred2 (-grid and -auto), which makes it much faster, especially when the polygenicity is small.
In this new implementation of the lassosum model, which we call lassosum2, the input parameters are the correlation matrix R, the GWAS summary statistics (b g j , seðb g j Þ, and n j ), and the two hyper-parameters l and d. Therefore, except for the two hyper-parameters, lassosum2 uses the exact same input parameters as LDpred2. 15 We try d˛f0:001; 0:01; 0:1; 1g by default in lassosum2, instead of s˛f0:2; 0:5; 0:8; 1:0g in lassosum. For l, the default in lassosum uses a sequence of 20 values equally spaced on a log scale between 0.1 and 0.001. By default in lassosum2, we use a similar sequence of 30 values, but between l 0 and l 0 =100 instead, where is the minimum l for which no variable enters the model because the L1-regularization is too strong. To make lasso-sum2 more robust to different per-variant sample sizes n j , we also introduce per-variant penalty factors ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi max k ðn k Þ=n j p , by which we internally multiply l and d to penalize variants with smaller GWAS sample sizes more. Note that we do not provide an ''auto'' version using pseudo-validation (as in lassosum 16 ), since we have not found the pseudo-validation scores to be consistent enough with the predictive performance ( Figure S2). Also note that, as in LDpred2, we run lassosum2 genome-wide using a sparse correlation matrix which assumes that variants further away than 3 cM are not correlated. Contrary to lassosum, we do not require splitting the genome into independent LD blocks any longer, yet we recommend to do so for robustness in LDpred2 and for extra speed gain (cf. section ''new LD reference'').

LDpred2-low-h2 and LDpred2-auto-rob
Here we introduce the small changes made to LDpred2 (-grid and -auto) in order to make them more robust. First, LDpred2-low-h2 simply consists in running LDpred2-grid while testing h 2 within f0:3; 0:7; 1; 1:4g,h 2 LDSC (note the added 0.3 compared with Privé et al. 15 ), where h 2 LDSC is the heritability estimate from LD score regression. Indeed, we show in simulations here that using lower values for h 2 may provide higher predictive performance in the case of misspecifications (thanks to more shrinkage of the effects). In simulations, because of the large misspecifications, we use a larger grid over f0:01; 0:1; 0:3; 0:7; 1; 1:4g,h 2 LDSC . For LDpred2-auto, we introduce two new parameters. The first one, shrink_corr, allows for shrinking off-diagonal elements of the correlation matrix. This is similar to parameter ''s'' in lassosum (section ''new implementation of lassosum in bigsnpr''), and acts as a means of regularization. We use a value of 0.9 in simulations and 0.95 in real data when running ''LDpred2-auto-rob'' here, and the default value of 1 (with no effect) when running ''LDpred2-auto.'' The second new parameter, allow_jump_sign, controls whether variant effect sizes can change sign over two consecutive iterations of the Gibbs sampler. When setting this parameter to false (in the method we refer to as ''LDpred2-autorob'' here), this forces the effects to go through 0 before changing sign. This is useful to prevent instability (oscillation and ultimately divergence) of the Gibbs sampler under large misspecifications, and is also useful for accelerating convergence of chains with a large initial value for p (the proportion of causal variants).

New LD reference
We form nearly independent LD blocks using the optimal algorithm developed in Privé. 37 Note that a new parameter max_r2 has now been added to this method, which controls the maximum single squared correlation allowed outside blocks and offers an extra guarantee that the splitting is good and makes the function much faster by discarding many possible splits. For different numbers of blocks and maximum number of variants in each block, we use the split that minimizes ðC2 , is the sum of squared correlations outside the blocks and C2 is the sum of squared sizes of the blocks ( Figure S3). Twelve blocks from chromosome 22 are used in the simulations, and between 731 and 2,527 in the real-data analyses (cf. next section). Having a correlation matrix with independent blocks prevents small errors in the algorithm (e.g., the Gibbs sampler in LDpred2) from propagating to too many variants. It also makes running LDpred2 (and lassosum2) faster, taking e.g., 70% of the initial time (since only 70% of the initial non-zero values of the correlation matrix are kept). Note that we also compute the correlations within the same blocks and from the same data for PRS-CS.
We have also developed a new ''compact'' format for the SFBMs (sparse matrices on disk). Instead of using something similar to the standard ''compressed sparse column'' format which stores all fi; xði; jÞg for a given column j, we only store the first index i 0 and all the contiguous values fxði 0 ; jÞ; xði 0 þ1; jÞ; .g up to the last non-zero value for this column j. This makes this format about twice as efficient for both LDpred2 and lassosum2 (in terms of both memory and speed). Therefore, using both this new format and the LD blocks should divide computation times by 3 (without counting the faster updating strategy of residuals).

Alternative LD reference for FinnGen and Biobank Japan
To be used with GWAS summary statistics from FinnGen, we investigate three different LD reference panels. To define homogeneous ancestry groups in the UK Biobank, we include all individuals within a specific distance to a population center in the PC analysis space. 35,38 We use either the 503 European (including 99 Finnish) individuals from the 1000 Genomes (1000G) data, 39 404 Finnish-like UK Biobank plus the 99 Finnish 1000G individuals (also 503 in total), or the 10,000 UK individuals from the validation set we use in this paper. 1,454, 731, and 1,165 independent LD blocks are identified for these three LD references, respectively.
To be used with GWAS summary statistics from Biobank Japan, we also investigate three different LD reference panels. We use either the 504 East Asian (including 104 Japanese) individuals from the 1000G, 400 Japanese-like UK Biobank plus the 104 Japanese 1000G individuals (also 504 in total), or 2,041 East Asian UK Biobank individuals (including the 400 previous Japanese individuals). For the small LD references (based on $500 individuals), we restrict to variants with a minor allele frequency (MAF) >0.02. For the one based on 2,041 individuals, we restrict to MAF >0.01. We also construct versions of these LD references with independent LD blocks, as described in the previous section. 2,527, 1,769, and 2,196 independent LD blocks are identified for these three LD references, respectively.

Misspecification of per-variant GWAS sample sizes
We design GWAS simulations where variants have different sample sizes, which is often the case when meta-analyzing GWAS summary statistics from multiple cohorts with different sets of variants. 12 Using 40,000 variants from chromosome 22 (see material and methods), we simulate quantitative phenotypes by picking 2,000 causal variants at random, sampling their genetic effects (after scaling of the genotypes) from a normal distribution, and adding some Gaussian noise to the genetic component of the phenotype so that it has a heritability of 20%. This is implemented in function snp_simuPheno of R package bigsnpr. 23 We then randomly divide the 40,000 variants into three groups: for half of the variants, we use 100% of 300,000 individuals for GWAS, but use only 80% for one quarter of the variants and 60% for the remaining quarter. We use a simple linear regression for GWAS, as implemented in function big_univLinReg of R package bigstatsr. 23 We then run CþT, lassosum, lassosum2 (section ''new implementation of lassosum in bigsnpr''), LDpred2inf, LDpred2(-grid), LDpred2-auto, SBayesR (GCTB v2.03 with the automatic robust option), and PRS-CSauto 15,16,19,20,24 by using either the true per-variant GWAS sample sizes, the total sample size for all variants, or per-variant sample sizes imputed using Equation (4). When providing true per-variant GWAS sample sizes, squared correlations between the polygenic scores and the simulated phenotypes are of 0.123 for CþT, 0.160 for lassosum, 0.169 for lassosum2, 0.159 for LDpred2(-grid), 0.140 for LDpred2-auto, 0.141 for LDpred2-inf, 0.138 for SBayesR, and 0.159 for PRS-CS-auto ( Figure 1, averaging over ten simulations). Results when using imputed sample sizes are very similar to when using the true ones. Note that CþT does not use this sample size information, and that PRS-CS can only be provided with a single value (we use the maximum one). When using the total GWAS sample size instead of the per-variant sample sizes, predictive performance slightly decreases to 0.158 for lassosum and to 0.164 for lassosum2, but dramatically decreases for LDpred2 with new values of 0.134 for LDpred2-grid, 0.122 for LDpred2-auto, and 0.125 for LDpred2-inf ( Figure 1). This extreme simulation scenario shows that LDpred2 can be sensitive to the misspecification of per-variant GWAS sample sizes, whereas lassosum (and lasso-sum2) seems little affected by this. With 0.141 for SBayesR, predictive performance is actually larger than when using the true per-variant sample sizes.
We conduct further investigations to explain the results of Figure 1. First, the results for LDpred2-auto and SBayesR are similar to LDpred2-inf in these simulations because the internal polygenicity estimate p (proportion of causal variants) of LDpred2-auto, and similarly the proportion of non-null variants in SBayesR, always continuously increases up to 1 in the Gibbs sampler, which makes them behave as an infinitesimal model ðp ¼ 1Þ. To overcome this limitation in LDpred2-auto, we introduce two new parameters to make it more robust, and refer to this as ''LDpred2-auto-rob'' here (section ''LDpred2-low-h2 and LDpred2-auto-rob''). The first parameter prevents p from diverging to 1 while the second shrinks the off-diagonal elements of the LD matrix (a form of regularization). Second, for lassosum2, results for a grid of parameters (over l and d) are quite smooth compared with LDpred2 ( Figures S4 and  S5). In these simulations with misspecified per-variant sample sizes, it seems highly beneficial to use a small value for the SNP heritability hyper-parameter h 2 in LDpred2, e.g., a value of 0.02 or even 0.002 when the true value is 0.2 ( Figure S5). Indeed, using a small value for this hyperparameter induces a larger regularization (shrinkage) on the effect sizes. Here we call ''LDpred2-low-h2'' when running LDpred2(-grid) with a grid of hyper-parameters including these low values for h 2 . Results with LDpred2low-h2 improve from 0.159 to 0.169 when using true sample sizes and from 0.134 to 0.164 when using the maximum sample size (Figure 1). Finally, we introduce here a last change for robustness: we form independent LD blocks in the LD matrix to prevent small errors in the Gibbs sampler to propagate to too many variants (section ''new LD reference''). This change seems to solve convergence issues of LDpred2 in these simulations ( Figure S5) and further improves predictive performance for all LDpred2 methods ( Figure 1).
As secondary analysis, we rerun these simulations with a smaller heritability of 4% instead. While predictive performance are overall much lower, relative performance are less attenuated when using the maximum sample size, otherwise results are highly consistent as before ( Figure S6). We also try using the shrunk LD matrix from GCTB that is used, e.g., for SBayesR; 20 in these simulations, higher predictive performance is obtained for LDpred2auto with this shrunken LD matrix, whereas SBayesR benefits from using LDpred2's windowed LD matrix ( Figure S7).

Misspecification when using imputed allele dosages
Marchini and Howie 40 showed that the IMPUTE INFO measure is highly concordant with the MACH measure , where b q j is the estimated allele frequency of G j , the genotypes for variant j. Therefore, when using the expected genotypes from imputation (allele dosages), their standard deviations are often lower than the expected value under Hardy-Weinberg equilibrium, because  simulations''), we verify that sdðG j Þ true zsdðG j Þ imp = ffiffiffiffiffiffiffiffiffiffiffiffiffi INFO j p ( Figure S8). As a direct consequence of the lower standard deviation of dosages, we also show that GWAS effect sizes b g computed from imputed dosages are overestimated: Figures S9 and S10). This is the first correction of summary statistics we consider in the simulations below. As a second option, instead of using dosages to compute the GWAS summary statistics, it has been argued that using multiple imputation (MI) would be more appropriate. 41 Here MI consists in forming multiple (e.g., 20) datasets of hardcalled genotypes by sampling them according to the imputation probabilities stored in the BGEN files, performing a GWAS on each of these datasets, and pooling the GWAS estimates. In simulations, we show that b Figure S11). This is the second correction of summary statistics we implement in the simulations below, along with n j , INFO j as the new per-variant sample sizes. Finally, we consider an in-between solution as a third correction, using b , and n j ,INFO j as the per-variant sample sizes. Note that we have recomputed INFO scores for the subset of 362,307 European individuals used in this paper, since they can differ substantially from the ones reported by the UK Biobank for the whole data ( Figures S12 and S13).
To compare these three corrections, we conduct a simulation using the same 40,000 variants from chromosome 22 as before, whereby we simulate quantitative phenotypes assuming a heritability of 20% and 2,000 causal variants using the ''true'' dataset (cf. section ''data for simulations''). We compute GWAS summary statistics from the dosage dataset and use these summary statistics to run LDpred2, las-sosum2 and PRS-CS with either no correction of the summary statistics or with one of the three corrections described above. The LD references are computed from the validation set using the dataset with the ''true'' genotypes (i.e., same as before). For lassosum2 and LDpred2 (-grid), which tune parameters using the validation set, correcting for imputation quality slightly improves predictive performance in these simulations ( Figure 2). However, correcting for imputation quality can dramatically improve predictive performance for LDpred2-auto, provided the imputation quality is well estimated. Moreover, new additions for robustness introduced before, namely LDpred2low-h2, LDpred2-auto-rob, and forming independent blocks in the LD matrix, also improve predictive performance for all corrections (Figure 2). However, these corrections do not prove beneficial for PRS-CS-auto ( Figure 2).
When performing similar simulations with a heritability of 4% (instead of 20%), correction ''sqrt_info'' still provides slightly higher predictive performance for all methods (compared with no correction), yet the other two corrections sometimes provide lower predictive performance ( Figure S14). In the real-data applications hereafter, we therefore choose to use the first correction, ''sqrt_info,'' which seems to more consistently provide better results ,INFO j , seðb g j Þ imp , ffiffiffiffiffiffiffiffiffiffiffiffiffi ffi INFO j p , and n j ,INFO j . Red bars correspond to using the LD with independent blocks (section ''new LD reference''), which is a requirement for PRS-CS. and is simple because it is somewhat equivalent to postprocessing PGS effects by multiplying them by ffiffiffiffiffiffiffiffiffiffiffiffi INFO p .

Mismatch between LD reference and GWAS summary statistics
Here we design simulations to understand the impact of using a mismatched LD reference panel, e.g., that comes from a different population compared with the one used to compute the GWAS summary statistics. We use the same simulation setup as before (see material and methods), i.e., using individuals of Northwestern European ancestry from the UK Biobank for training (GWAS), validation (tuning of hyper-parameters), and testing (the final models). In addition to the LD reference panel of Northwestern European ancestry, we design an alternative reference panel based on 10,000 individuals from South Europe by using the ''Italy'' center defined in Privé et al. 35 Allele frequencies and pairwise correlations are quite similar between these two LD reference panels ( Figure S15). When using this alternative LD reference panel instead of a well-matched one as in the previous sections, squared correlations between the polygenic scores and the simulated phenotypes drop from 0.173 to 0.166 for lassosum2, from 0.168 to 0.159 for LDpred2(-grid), from 0.174 to 0.169 for LDpred2-low-h2, from 0.142 to 0.136 for LDpred2-auto, from 0.162 to 0.147 for LDpred2-auto-rob, from 0.143 to 0.138 for LDpred2-inf, and from 0.163 to 0.155 for PRS-CS-auto (Figure 3, averaging over ten simulations). Forming independent LD blocks in these two LD matrices always improves predictive performance, yet again. We also compute and test using a shrunk LD matrix (from GCTB) for this alternative LD reference panel; again, using this type of LD matrix seems beneficial for LDpred2-auto. Finally, under the same simulation scenario but with a smaller simulated heritability of 4% (instead of 20%), there remains a small drop in predictive performance when using the alternative LD reference ( Figure S16).

Application to breast cancer summary statistics
In this section, we transition to using real data. Breast cancer GWAS summary statistics are interesting because they include results from two mega-analyses, 27,42,43 which means that parameters reported in these GWAS summary statistics, such as INFO scores and sample sizes, are estimated with high precision. Imputation INFO scores for the OncoArray summary statistics are generally very good (mean of 0.968 after having restricted to HapMap3 variants, Figure S17) and better than the ones from iCOGS (mean of 0.841, Figure S18), probably because the iCOGS chip included around 200K variants only, compared with more than 500K variants for the OncoArray. For both summary statistics, we compare the standard deviations (of genotypes) inferred from the reported allele frequencies (i.e., ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2f ð1 À f Þ p where f is the allele frequency, and denoted as sd_af) versus the ones inferred from the GWAS summary statistics (Equation 2, and denoted as sd_ss). As shown in Figures 4 and S19 Figure S20), which are two known longrange LD regions. 44 We hypothesize that this is due to using PCs that capture LD structure instead of population structure, as covariates in the GWAS. 22 To validate this hypothesis, we simulate a phenotype using HapMap3 variants of chromosome 6 for 10,000 individuals from the UK Biobank and then run GWAS with or without PC19 as covariate. PC19 from the UK Biobank was previously reported to capture LD structure in region 70-91 Mbp of chromosome 6. 22 In these simulations, the same bias as in Figure 4B is observed for the variants in this region ( Figure S21), confirming our hypothesis. Therefore, providing an accurate imputation INFO score is useful for two reasons. First, it allows for correcting for a reduced standard deviation when using imputed data in the QC step we propose to better uncover possible problems with the GWAS summary statistics. Second, using one of the proposed corrections based on INFO scores may lead to an improved prediction when deriving polygenic scores. We apply these corrections to the two breast cancer summary statistics. In the comparison, we first use the QC proposed in Privé et al. 15 (which ends up filtering on MAF here, which we call ''qc1''). We then also filter out the two long-range LD regions of chromosomes 6 and 8 for the OncoArray summary statistics and remove around 500 variants when filtering on differences of allele frequencies (>0.1) between summary statistics and the validation dataset (''qc2''). As for the correction using INFO scores, we use the first correction, ''sqrt_info,'' which is simple because it is equivalent to post-processing PGS effects, by multiplying them by ffiffiffiffiffiffiffiffiffiffiffiffi INFO p . Although results for both QC used are very similar, correcting using INFO scores slightly improves predictive performance when deriving polygenic scores based on iCOGS summary statistics ( Figure S22). All other improvements introduced before have little to no effect here, probably because misspecifications are much smaller than in the simulations.

Results for other phenotypes
We use other external GWAS summary statistics for which INFO scores are reported (see Table 1); they all have a very high mean INFO score (larger than 0.94), except for T1Daffy (0.885). QC plots comparing standard deviations usually show little deviation from the identity line (after the INFO score correction), except for coronary artery disease (CAD) summary statistics ( Figures S23-S27). Note that the popular CAD summary statistics used here come from a multi-ancestry GWAS meta-analysis with more than 20% non-European samples. 31,38 In Figures 5 and S28-S32, we then provide similar results as in Figure S22 for other phenotypes. For major depressive disorder (MDD) and prostate cancer (PrCa), most changes introduced before have little to no impact on predictive performance. For CAD and T1D, the QC proposed in Privé et al. 15 and the additional QC proposed here provide much better predictive performance, especially for LDpred2-auto (and LDpred2-auto-rob) than when using no QC, showing how important this preliminary step is. We also investigate results for vitamin D GWAS summary statistics, which do not report INFO scores or allele frequencies, as opposed to previous ones. The QC procedure is thus less precise and uses allele frequencies from the LD reference ( Figure S33). However, these summary statistics do report per-variant sample sizes, which cover a wide range of different values ( Figure S34). Here we compare ) versus the ones inferred from the reported GWAS allele frequencies f j (using ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 2f j ð1 À f j Þ p ). Only 100,000 HapMap3 variants are represented, at random.
using either the maximum sample size or the true pervariant sample sizes when deriving polygenic scores, and an additional QC step, ''qc2'', which refers to removing all variants with a per-variant sample size less than 70% of the maximum one. When the true per-variant sample sizes are used, the additional ''qc2'' does not seem to be necessary ( Figure 6), which is a bit surprising to us. When the maximum sample size is used, this ''qc2'' is very important, especially for LDpred2-auto. Note that, when using the maximum sample size to derive ''sd_ss'' in ''qc1'' (not the case here; we used the per-variant sample sizes), ''qc1'' would remove variants with underestimated standard deviations due to overestimated sample sizes, which would effectively remove variants with low sample sizes ( Figure S35), similar to the ''qc2'' used here.
For all phenotypes, PRS-CS-auto is very robust, whatever the QC procedure used (e.g., see the results for CAD, Figure 5). We believe this is because of the strong regularization used internally. We then run LDpred2-auto-rob for CAD with different shrinkage multiplicative coefficients for the off-diagonal elements of the LD matrix, from 1 (LD matrix unchanged) to 0 (using the identity matrix instead). When using a strong shrinkage of 0.3 or 0.4, results for LDpred2-auto-rob are similar regardless of the QC used ( Figure S36), as previously observed for PRS-CS-auto. Results with LDpred2-auto-rob are always best when using ''qc2'', the most stringent QC; then no shrinkage (1) seems needed, whereas a strong shrinkage leads to a drop in predictive per-formance. For phenotypes with relatively large genetic effects (e.g., vitamin D, Figure 6), PRS-CS-auto performs much worse, possibly due to using too much regularization.

Application to FinnGen and Biobank Japan summary statistics
Here we investigate the use of different LD reference panels with GWAS summary statistics from two large biobanks of isolated populations. We use GWAS summary statistics for five disease endpoints from FinnGen 33 (release 6), namely breast and prostate cancers (BrCa and PrCa), CAD, and type 1 and type 2 diabetes (T1D and T2D). These were derived using SAIGE. 10 First, for each phenotype, we estimate a global effective sample using the 80th percentile of imputed sample sizes from Equation 5. This estimation is only 84.7% for BrCa, 79.1% for PrCa, 73.1% for T1D, 64.5% for T2D, and 61.3% for CAD, when compared with the effective sample computed from the reported numbers of cases and controls. We believe this reduction in effective sample size is due to having related individuals included in the analyses as well as using SAIGE, as noted in the introduction. We then compare three different LD reference panels to use with these GWAS summary statistics (section ''alternative LD reference for FinnGen and Biobank Japan''). Interestingly, using the small Finnish LD reference panel we have defined here, composed of only 503 individuals, seems to almost always provide more predictive polygenic scores than using the large UK panel  composed of 10,000 individuals or the widely used European subset of the 1000G data (Figure 7). For PRS-CSauto, none of the three LD references seems to consistently outperform the other two across all phenotypes considered. Note that the PGS here are validated in people of UK-like ancestry in the UK Biobank to obtain sufficient sample sizes.
We also use GWAS summary statistics for four continuous outcomes from Biobank Japan. 34 These were derived using BOLT-LMM-inf. 8 First, for each phenotype, we estimate a global power improvement provided by BOLT-LMM by comparing c 2 statistics (the boost of BOLT-LMM-inf versus linear regression) across genomewide significant variants, as recommended in Loh et al. 8 This estimated power ratio, which corresponds to the increase in effective sample size, is 1.143 for height, 1.039 for HDL cholesterol, 1.025 for BMI, and 1.013 for systolic blood pressure. We then compare three different LD reference panels to use with these GWAS summary statistics (section ''alternative LD reference for FinnGen and Biobank Japan''). Interestingly, when looking at height and HDL cholesterol, where we could get the best predictive performance, using the smaller Japanese LD reference we have defined here seems to provide more predictive PGS than using the one using a larger set of East Asian individuals from the UK Biobank or the widely used East Asian subset of the 1000G data ( Figure S37), except when using PRS-CS-auto to predict height. Note that the PGS here are validated in people of broad East Asian ancestry in the UK Biobank and for continuous outcomes to obtain moderate sample sizes for comparison.

Recommendations
Based on previous results from simulations and real-data analyses, we use this section to suggest some recommendations on how to handle misspecifications in GWAS summary statistics and to get the most out of PGS. We recall that an overview of these misspecifications, along with possible consequences and possible corrections, is summarized in Table 2. First, standard QC should always be performed, e.g., variants with low frequency or low imputation quality should be removed; one must make sure to recompute MAF and INFO scores for the homogeneous subset of individuals of interest. Second, comparison of allele frequencies can be used to, e.g., detect problems with signs of effects while comparing standard deviations to detect other issues such as low per-variant GWAS sample sizes. Third, the LD reference panel should be as close as possible (in terms of genetic ancestry) to the data used to derive the GWAS. In case of any doubt, ancestry proportions can be estimated from GWAS allele frequencies using the method derived in Privé. 38 Fourth, if some validation data is available for tuning hyper-parameters, one can use methods such as lassosum2 and LDpred2(-grid). The grid of parameters used in lasso-sum2 includes a broad range of regularization, which makes  it very robust. For LDpred2, we recommend to also include a smaller value for the heritability in the grid of parameters to allow for more regularization when needed. As these two methods use the same data and the same principle (i.e., hyper-parameter tuning), one could run both methods and choose the best one when tuning hyper-parameters. Fifth, if no validation/tuning dataset is available, we recommend using LDpred2-auto with the two new parameters introduced here, or PRS-CS-auto if there are strong misspecifications and the phenotype of interest is known to have only small effects. Sixth, we recommend to form independent LD blocks in the LD matrix, as it proved to make the methods more robust (and slightly faster). Most of these new recommendations are included in the LDpred2 tutorial.

Discussion
Here we have investigated misspecifications in GWAS summary statistics, focusing particularly on the impact of sample size heterogeneity and imputation quality, and the application to PGS methods. Previously, we proposed a QC based on comparing standard deviations (of genotypes) inferred from GWAS summary statistics with the ones computed from a reference panel. 15 Here we show that we can refine this QC by deriving the latter directly from the reported allele frequencies in the GWAS summary statistics, and by correcting the former using imputation INFO scores. Using this refined QC, we are able to identify a potential issue with how PCs were derived in a GWAS of breast cancer. Fortunately, this has practically no effect on the predictive performance of the derived PGS. Additional QC can also be performed, e.g., comparing reported GWAS allele frequencies with the ones from the LD reference panel to detect genotyping errors or allele inversions. We perform this additional QC as part of ''qc2'' here. One can also run other QC tools such as DENTIST, 4 and also infer ancestry proportions from summary statistics to make sure these are matching with the LD reference used. 38 Note that, in this study, we mostly use GWAS summary statistics that include extended information (e.g., INFO scores and allele frequencies), yet most GWAS summary statistics do not provide such exhaustive information. 47 We acknowledge that, in the case of a meta-analysis from multiple studies, providing a single INFO score per variant may not be possible. Solutions such as using a weighted averaged INFO score might be worth exploring in future studies. Nevertheless, this QC could be performed within each study before meta-analyzing results to ensure that resulting summary statistics have the best possible quality for followup analyses such as deriving PGS. Other information, the effective sample size per variant, is often missing from GWAS summary statistics. Sometimes it can even be challenging to recover the total effective sample size from large meta-analyses. We recall that when some studies have an imbalanced number of cases and controls, the total effective sample size of their meta-analysis should not be computed from the total numbers of cases and controls overall, but  Figure 7. Results for PGS derived from five FinnGen GWAS summary statistics and using three different LD references Partial correlations are computed using function pcor of R package bigstatsr where 95% confidence intervals are obtained through Fisher's Z-transformation, then all values are squared to report the phenotypic variance explained by PGS. Red bars correspond to using the LD with independent blocks (see material and methods), which is a requirement for PRS-CS.
instead from the sum of the effective sample sizes of each study. 7 Indeed, take the extreme example of metaanalyzing two studies, one with 1,000 cases and 0 controls and another one with 0 cases and 1,000 controls: the effective sample size of the meta-analysis is then 0, not 2,000. Misspecifying the GWAS sample size can lead to serious issues such as misestimating the SNP heritability. 7 Fortunately, an overestimated sample size can be detected from the QC plot we propose, where the slope is then less than 1 for case-control studies using logistic regression. Here we have used this strategy to estimate reduced effective sample sizes in FinnGen GWAS summary statistics.
We have assessed the impact of these misspecifications in GWAS summary statistics on the predictive performance of some PGS methods. Using both the Bayesian LDpred2 models 15 and our reimplementation of the frequentist lassosum model 16 for deriving PGS, we have introduced and investigated some changes to possibly make these models more robust to misspecifications. Overall, these changes provided large improvements of predictive performance in the simulations with large misspecifications. The proposed QCs on GWAS summary statistics also provided better predictive performance for CAD, T1D, and vitamin D. However, these changes had limited effect when applied to some other real GWAS summary statistics, which is both unfortunate but also reassuring because it means that these GWAS summary statistics are of particularly good quality for follow-up analyses such as deriving PGS.
In conclusion, we recommend adopting these changes, i.e., performing the (refined) QC proposed here, forming independent LD blocks in the LD matrix, and using more regularization when needed. We also recommend using well-matched LD reference panels. More regularization can be achieved by testing additional smaller values for the heritability parameter in LDpred2-grid, and using the two new parameters introduced here in LDpred2-auto (section ''LDpred2-low-h2 and LDpred2-auto-rob''). Note that LD blocks are already widely used by several methods, such as lassosum and PRS-CS, because they allow for processing smaller matrices at once. 16,19 Imposing independent blocks on the LD matrix could result in further misspecifications, 14 but here we have shown that welldefined blocks can actually make PGS methods more robust. PRS-CS is currently one of the most robust PGS methods; for example, it can use the (small) 1000G dataset as LD reference, 19 and can even use a European LD reference panel with multi-ancestry GWAS summary statistics, 12 also shown here. We believe this is made possible by the use of a strong regularization in PRS-CS (4 À 1 j À 1 j R 1, which would approximately correspond to using s ¼ 0:5 in lassosum, d ¼ 1 in lassosum2, and the new parameter shrink corr ¼ 0:5 in LDpred2-auto). Using enough regularization is good for robustness, but note that using too much regularization can also damage predictive performance. To address this limitation, we recommend reliance on a proper QC and choice of ancestry-matched LD instead of on too much regularization. We would like to encourage large biobanks, such as FinnGen and Biobank Japan, to provide LD reference matrices matching the large GWAS summary statistics they provide, ideally based on Table 2. Overview of possible misspecifications when using GWAS summary statistics, along with possible consequences and corrections

Misspecification
Possible consequence(s) Possible correction Misestimation of the total sample size a bias estimation of SNP heritability 7,9 estimation of the effective sample size Differences in per-variant sample sizes possible violation of model assumptions (e.g., Equation 3) resulting in poor predictive performance imputation using Equations 4 and 5, and filtering b Using imputed dosages for GWAS standard deviations of dosages are too small, resulting in overestimated effect sizes and standard errors see section ''misspecification when using imputed allele dosages'' Using imputed dosages for LD computation none required; LD seems correctly estimated ( Figure S38) Misestimation of imputation INFO scores misuse when filtering or adjusting input parameters based on these recomputing from a homogeneous subset Error in summary statistics (e.g., allele inversions) strong violation of model assumptions, which can result in e.g., identifying false positives in fine-mapping 45 quality control Rounding of summary statistics (e.g., effect sizes and SEs) add unnecessary noise to the data none, but this has almost no impact on the predictive performance ( Figure S39) Ancestry mismatch between summary statistics and LD strong mismatch of LD and allele frequencies, often resulting in divergence of models checking ancestry proportions from GWAS summary statistics 38 Any possible violation of model assumptions resulting in poor predictive performance quality control, using more regularization, and constraining LD to blocks a Examples: using the total number of cases and controls in a meta-analysis of binary traits, a larger effective sample size when using BOLT-LMM summary statistics, 8 and a reduced effective sample size when using SAIGE on binary traits with a large prevalence. 10,11 b For example, removing SNPs with an effective sample size less than 0.67 times the 90th percentile of sample size. 46 the same large number of individuals. As a future research direction, we are interested in using multi-ancestrymatched LD matrices to use with multi-ancestry GWAS summary statistics to improve polygenic prediction in all ancestries. It would also be useful to investigate the impact of the QC, well-matched LD reference panels and other adjustments we propose here on other (non-PGS) methods, for which consistency and robustness are likely to be very important as well. 4 For example, we are interested in assessing the impact of these misspecifications on the inference of disease architecture parameters in future work.

Data and code availability
The UK Biobank data is available through a procedure described at https://www.ukbiobank.ac.uk/using-the-resource/. All code used for this paper is available at https://github.com/privefl/ paper-misspec/tree/master/code. We have extensively used R packages bigstatsr and bigsnpr 23 for analyzing large genetic data, packages from the future framework 48 for easy scheduling and parallelization of analyses on the HPC cluster, and packages from the tidyverse suite 49 for shaping and visualizing results. The latest version of R package bigsnpr can be installed from GitHub, and a recent version can be installed from CRAN.