Participation bias in the UK Biobank distorts genetic associations and downstream analyses

While volunteer-based studies such as the UK Biobank have become the cornerstone of genetic epidemiology, the participating individuals are rarely representative of their target population. To evaluate the impact of selective participation, here we derived UK Biobank participation probabilities on the basis of 14 variables harmonized across the UK Biobank and a representative sample. We then conducted weighted genome-wide association analyses on 19 traits. Comparing the output from weighted genome-wide association analyses (neffective = 94,643 to 102,215) with that from standard genome-wide association analyses (n = 263,464 to 283,749), we found that increasing representativeness led to changes in SNP effect sizes and identified novel SNP associations for 12 traits. While heritability estimates were less impacted by weighting (maximum change in h2, 5%), we found substantial discrepancies for genetic correlations (maximum change in rg, 0.31) and Mendelian randomization estimates (maximum change in βSTD, 0.15) for socio-behavioural traits. We urge the field to increase representativeness in biobank samples, especially when studying genetic correlates of behaviour, lifestyles and social outcomes.

Genotyping, imputation and quality control in the UK Biobank
Probability weighted genome-wide association analyses on UK Biobank traits
3.3. Weighted SNP heritability and genetic correlation estimates
3.4. Effect of participation bias on Mendelian Randomization estimates
4. sFigures
sFigure 1. Estimated correlations among harmonized variables in the HSE and the UK Census Microdata
sFigure 2. Weighted and unweighted genome-wide analyses: number of genome-wide variants
sFigure 3. Weighted and unweighted genome-wide analyses: SNP effects
sFigure 4. Autosomal genome-wide association analyses on biological sex
sFigure 5. Genome-wide association study on UKBB participation - Manhattan plot
sFigure 6. Genome-wide association study on UKBB participation - QQ plot
sFigure 7. SNP heritability estimates in weighted (wGWA) and standard genome-wide (GWA) analyses
sFigure 8. Genetic correlation estimates from weighted and standard genome-wide analyses
sFigure 9. Effect of participation bias on exposure-outcome associations obtained from Mendelian [...]

Poor quality samples were identified using the metrics of missing rate and heterozygosity computed using a set of 605,876 high quality autosomal markers that were typed on both arrays. Imputation was performed using IMPUTE4 with the Haplotype Reference Consortium (HRC), UK10K and the 1000 Genomes Phase 3 datasets as the main imputation reference panels. Detailed genotyping, imputation and quality control (QC) procedures have previously been described 3. Additional quality control filters for genome-wide analyses were applied to select participants (i.e., exclusion of related individuals, exclusion of non-White British ancestry based on principal components, high missing rate and high heterozygosity on autosomes) and genetic variants (Hardy-Weinberg disequilibrium P > 1 × 10^-6, minor allele frequency > 1% and call rate > 90%).
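The variant-level filters listed above can be sketched as a simple predicate. This is only an illustration under stated assumptions: the three thresholds come from the text, whereas the `VariantStats` container, the SNP identifiers and all per-variant values are made-up placeholders, not UK Biobank data or a UK Biobank file format.

```python
from dataclasses import dataclass

# Thresholds as stated in the text above.
HWE_P_MIN = 1e-6       # keep variants with Hardy-Weinberg P above this value
MAF_MIN = 0.01         # keep variants with minor allele frequency above 1%
CALL_RATE_MIN = 0.90   # keep variants with call rate above 90%


@dataclass
class VariantStats:
    """Hypothetical per-variant summary; field names are assumptions."""
    snp_id: str
    hwe_p: float
    maf: float
    call_rate: float


def passes_qc(v: VariantStats) -> bool:
    """Return True only if the variant clears all three filters."""
    return (
        v.hwe_p > HWE_P_MIN
        and v.maf > MAF_MIN
        and v.call_rate > CALL_RATE_MIN
    )


# Toy input: one passing variant and one failure per filter (made-up values).
variants = [
    VariantStats("rs_example_1", hwe_p=0.42, maf=0.23, call_rate=0.99),
    VariantStats("rs_example_2", hwe_p=1e-9, maf=0.30, call_rate=0.99),   # fails HWE
    VariantStats("rs_example_3", hwe_p=0.10, maf=0.004, call_rate=0.98),  # fails MAF
    VariantStats("rs_example_4", hwe_p=0.55, maf=0.12, call_rate=0.85),   # fails call rate
]

kept = [v.snp_id for v in variants if passes_qc(v)]
print(kept)
```

In practice such filters are usually applied with dedicated genetics software rather than hand-written code; the sketch only makes the direction of each inequality explicit.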

sResults
3.1. Genome-wide association study on the liability to UKBB participation
wGWA on UKBB participation was conducted in Neff = 102,215 participants. 28 SNPs reached genome-wide significance (p < 5 × 10^-8), of which 23 LD-independent SNPs were selected after clumping. sFigure 5 displays the Manhattan plot with positional mapping of genome-wide significant SNPs associated with the liability to UKBB participation (cf. sTable 6 for annotation and estimates of significant SNPs). The QQ plot is shown in sFigure 6. A lookup of SNP-trait associations estimated in previous GWA analyses showed that UKBB participation-associated variants mostly tapped into age-related outcomes (e.g., cause of death: cancer/dementia/fatty liver disease/pneumonia) (sTable 7).
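To illustrate why a weighted analysis reports an effective sample size (Neff) smaller than the number of participants, the sketch below uses Kish's approximation, Neff = (Σw)² / Σw². This is an assumption made for illustration: the text above does not state which formula was used, and the weights shown are made-up values.

```python
def kish_effective_n(weights):
    """Kish's approximate effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Equal weights: no loss of information, Neff equals the sample size.
equal = kish_effective_n([1.0, 1.0, 1.0, 1.0])

# Unequal weights (made-up): Neff drops below the sample size of 4.
unequal = kish_effective_n([0.5, 0.5, 1.0, 2.0])

print(equal, unequal)
```

The more variable the participation weights, the larger the gap between the nominal and the effective sample size.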

3.3. Weighted SNP heritability and genetic correlation estimates
Genetic correlations for a number of the assessed trait pairs were significantly underestimated or overestimated as a result of participation bias. Changes in the direction of genetic correlations as a result of participation bias were less common. Although 17 of the 153 assessed trait pairs showed opposite signs for rg and rgw, none of the corresponding differences (rgDIFF = rg - rgw) was significant at pFDR < 0.05. For example, the largest rgDIFF values with opposite signs in rg and rgw were observed for rg(depression/anxiety, vegetable intake) [rg = 0.19, p = 4.3e-05 versus rgw = -0.12, p = 0.45; FDR-corrected p-value for rgDIFF = 1] and rg(number of illnesses, vegetable intake) [rg = 0.19, p = 7e-07 versus rgw = -0.01, p = 0.9; FDR-corrected p-value for rgDIFF = 1].
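The FDR-corrected p-values reported above come from a multiple-testing correction across trait pairs. A minimal sketch of one such correction, the Benjamini-Hochberg procedure, is given below; that this specific procedure was used is an assumption, and the input p-values are made-up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

Because every raw p-value is multiplied by the number of tests divided by its rank, a difference that looks notable in isolation can end up with an adjusted p-value close to 1 when many trait pairs are tested.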

3.4. Effect of participation bias on Mendelian Randomization estimates
Of all exposure-outcome associations tested (k = 234), 14 (6%) estimates were either overestimated or underestimated. Significant (pFDR < 0.05) differential effects were only present for two of the exposure-outcome associations tested (education on BMI; smoking status on fruit consumption). There was little evidence of bias resulting in changes in the direction of MR estimates. The largest differences between the standard estimate (β̂) and the weighted estimate (β̂w) resulting from opposite effects were present for fruit intake on LDL cholesterol (β̂ = 0.03, p = 0.83 versus β̂w = -0.12, p = 0.47) and smoking status on physical activity (β̂ = 0.07, p = 0.091 versus β̂w = -0.04, p = 0.45).
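One way to compare a standard and a weighted estimate is a z-test on their difference. The sketch below is an assumption, not the procedure described in the text: it treats the two estimates as independent, which is a simplification when both come from the same sample. The point estimates are taken from the fruit intake on LDL cholesterol example above, while both standard errors are hypothetical placeholders.

```python
import math


def difference_z_test(est, se, est_w, se_w):
    """Two-sided z-test for est - est_w, assuming independent estimates."""
    diff = est - est_w
    se_diff = math.sqrt(se ** 2 + se_w ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, z, p


# Point estimates from the text; standard errors are hypothetical.
diff, z, p = difference_z_test(est=0.03, se=0.14, est_w=-0.12, se_w=0.17)
print(round(diff, 2), z, p)
```

With standard errors of this size, a difference of 0.15 is far from significant, which is consistent with the text's point that opposite-sign estimates did not amount to strong evidence of a change in direction.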