The HUNT study: A population-based cohort for genetic research

Summary The Trøndelag Health Study (HUNT) is a population-based cohort of ∼229,000 individuals recruited in four waves beginning in 1984 in Trøndelag County, Norway. Approximately 88,000 of these individuals have available genetic data from array genotyping. During the four community-based recruitment waves, HUNT participants provided information on health-related behaviors, self-reported diagnoses, and family history of disease, and underwent physical examinations. Linkage via the Norwegian personal identification number integrates digitized health care information from doctor visits and national health registries, including death, cancer, and prescription registries. Genome-wide association studies of HUNT participants have provided insights into the mechanisms of cardiovascular, metabolic, osteoporotic, and liver-related diseases, among others. Unique features of this cohort that facilitate research include nearly 40 years of longitudinal follow-up in a motivated and well-educated population, family data, comprehensive phenotyping, and broad availability of DNA, RNA, urine, fecal, plasma, and serum samples.

In brief Brumpton et al. present the genetic cohort profile of the Trøndelag Health Study (HUNT), a large, genotyped population-based cohort from Trøndelag County, Norway. They describe the extraordinarily rich features that make it an excellent cohort for genetic research. These include repeated survey data since 1984, broad availability of biological material, and the possibility to link with patient electronic health records and population registries. They show how HUNT has aided in understanding the genetic contribution to human traits and disease and describe future opportunities for research.

INTRODUCTION
Norway, like other Nordic countries, has characteristics that are uniquely favorable for recruitment to population studies, establishing biobanks, and identifying clinical outcomes and disease trajectories. This includes a unique personal identification number applied throughout the life span, a universal and digitized public health care system, and accessible harmonized electronic health records. In addition, 17 mandatory and validated national health registries are used for health analysis, administration, and emergency preparedness, and 52 national medical quality registries provide disease specific data on diagnosis and treatment parameters. Finally, Norwegians are an altruistic, highly motivated population for participating in biomedical research, as reflected in survey response rates of up to 89%. These factors have supported the establishment and maintenance of the Trøndelag Health Study (HUNT), a large population-based prospective Norwegian cohort, linked to registries and biobanks dating back more than 40 years (Figure 1).
To understand the genetic basis of diseases, as well as follow individuals with genetic and epidemiological risk factors in a well-ascertained county in Norway, we established a comprehensive collaboration in 2005 between the HUNT study at the Norwegian University of Science and Technology, Norway, and the University of Michigan, USA (see Data S1). This paper presents the history and status of this collaboration by describing the study population, the strategy incorporating genotyping, sequencing, and imputation-based approaches in HUNT, the vast phenotype data collected by decades of HUNT researchers, the linkage to the digitized public health care system, and key findings to date.

Study population
HUNT is an ongoing population-based health study in Trøndelag County, Norway. The study collects health-related data from questionnaires, interviews, and clinical examinations from individuals within this geographical region (Figure 2). More than 229,000 adults (20 years or older at recruitment) have participated in the study to date, of whom 95,000 have provided at least one biological sample (https://www.ntnu.edu/hunt/hunt-samples). 1-4 The periodic survey design includes four recruitment waves. HUNT1 (1984-1986), HUNT2 (1995-1997), HUNT3 (2006-2008), and HUNT4 (2017-2019) concentrated primarily on the North-Trøndelag area, where all adults (age ≥ 20 years) were invited. In addition, HUNT4 expanded to collect basic questionnaire data from the adult population of South-Trøndelag (105,797 additional participants). 3 Approximately 19,000 adults have participated in all four HUNT waves and thus have longitudinal questionnaire and physical exam information spanning over 35 years. Complementing the surveys in adult participants, four separate Young-HUNT surveys gathered data from ∼25,000 adolescents in junior high and high school, concurrent with HUNT2-4. No genotyping has been performed on Young-HUNT; however, 4,212 Young-HUNT participants have subsequently participated in the adult version of HUNT. The HUNT Study has a high level of participation (ranging from 54% to 89% between surveys among those invited), making the cohort broadly representative of the general Norwegian population. The HUNT and Young-HUNT cohorts are described in more detail elsewhere. 1-5

Genotyping and imputation study design in HUNT
Approximately 88,000 individuals provided DNA for medical research during at least one of the HUNT recruitment periods. Initially, our efforts were focused on identifying genetic variants associated with myocardial infarction (MI).
6-8 Toward this goal, we genotyped exome variants and performed low-pass whole-genome sequencing (4.7× average coverage) in 2014 on 2,201 samples from HUNT2 and HUNT3 (HUNT-WGS) (Table S1), including early-onset MI cases and equal numbers of sex- and age-matched controls. Although no novel significant associations were found, likely due to the limited sample size, this set of low-pass sequences provided important insights into genetic variants present in the Norwegian population and contributed Norwegian reference sequences to the Haplotype Reference Consortium (HRC) imputation panel. 9 We next completed genome-wide genotyping on all HUNT2-3 participants (n = 70,517) with available DNA (Figure 3). Motivated by a goal of capturing high-quality common, low-frequency, and Norwegian-specific variants, we used a variety of approaches to observe or estimate genotypes: (1) direct genotyping using standard and customized HumanCoreExome arrays from Illumina; (2) genotyping and imputation with a merged HRC and HUNT-WGS imputation panel; and (3) imputation with the TOPMed imputation panel (Figure 4). After genotyping 12,864 samples with standard HumanCoreExome arrays (HumanCoreExome 12 v1.0 and v1.1), we genotyped the remaining samples using a customized HumanCoreExome array (UM HUNT Biobank v1.0), which included protein-altering variants observed in HUNT-WGS. We followed a strict quality control protocol based upon the approach developed by Guo et al. 10 This included excluding samples and variants that failed to reach a 99% call rate, resulting in 358,964 genotyped polymorphic variants. We next used the 2,201 sequenced samples (HUNT-WGS) for joint imputation with the HRC panel. 9 We previously showed that imputation with a HUNT-specific reference panel improved imputation of low-frequency and population-specific variants compared with using either the 1000 Genomes or HRC reference panels alone.
11 Finally, we imputed 25 million variants from the TOPMed imputation panel (minor allele count greater than 10), which resulted in slightly lower imputation quality compared with the population-specific reference panel but captured a larger number of variants (Figure S1). These two imputed datasets can be used separately in downstream analysis; we recommend using the HRC and HUNT-WGS imputation for the investigation of Norwegian-specific variants. Together, the imputations resulted in 33 million variants in 70,517 individuals from HUNT2 or HUNT3, of which 3.3 million variants are not found in UK Biobank. Finally, 18,721 new samples from HUNT4 have recently been genotyped using the same approaches (HumanCoreExome array, UM HUNT Biobank v2.0) and, following imputation, will form a new, larger data freeze of ∼88,000 individuals from HUNT2-4. Further details of the quality control and imputation in HUNT can be found in the STAR Methods.
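The 99% call-rate criterion mentioned above can be illustrated with a minimal sketch. It is not the pipeline used in HUNT: the missing-genotype code `MISSING`, the function name `call_rate_filter`, and the choice to filter samples first and then re-evaluate variants among the retained samples are all assumptions made for illustration.

```python
import numpy as np

MISSING = -1  # hypothetical code for a failed genotype call


def call_rate_filter(genotypes, threshold=0.99):
    """Return boolean masks (keep_samples, keep_variants).

    genotypes: 2-D integer array, rows = samples, columns = variants,
    with entries 0/1/2 (allele counts) or MISSING for a no-call.
    """
    called = np.asarray(genotypes) != MISSING

    # Per-sample call rate: fraction of variants successfully called.
    sample_rate = called.mean(axis=1)
    keep_samples = sample_rate >= threshold

    # Assumption: variant call rate is computed only among retained samples.
    variant_rate = called[keep_samples].mean(axis=0)
    keep_variants = variant_rate >= threshold
    return keep_samples, keep_variants
```

Computing variant call rates after dropping poor samples prevents a few failed samples from removing otherwise well-genotyped variants; a different order would be an equally defensible choice.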

Phenotypes
A broad range of phenotypes are available for HUNT participants based on laboratory tests, clinical examinations, and self-reported questionnaires. These include non-fasting blood lipids and glycemic traits; history (including age of diagnosis) of a range of diseases, including cardiovascular events; basic demographics, including sex and participation age; anthropometrics, including weight, height, BMI, and waist-to-hip ratio; blood pressure measurements; and lifestyle information, including smoking status (Table 1). HUNT data categories have been described previously, 2,3 and are described in detail on the HUNT databank website (https://www.ntnu.edu/hunt/databank). To ensure data were of high quality, biologic material was handled at the field stations according to appropriate standards and transported to the biobank every evening in a cold chain. Several measurements, including hemoglobin and blood cell counts, creatinine, and cholesterol, were sent for immediate analysis, which was performed by specially trained personnel according to the same standardized protocols with the same equipment. Plasma, serum, and buffy coat are stored in aliquots in automated freezers in the HUNT Biobank at −80°C. The databank website describes each measure in more detail, including specific details of the instrument used and coefficients of variation (https://www.ntnu.edu/hunt/databank). Importantly, many measurements and questionnaire items have been intentionally kept identical or similar across HUNT surveys to enable longitudinal analyses, which may contribute to understanding disease progression and survival.

Linkage to regional and national health registries
HUNT participants have consented to linkage to the many high-quality health and administrative registries in Norway and to information from medical records.
The unique personal identification number given to all Norwegian citizens allows for longitudinal follow-up by linkage between HUNT data, regional and national registries, and electronic health records. Norway currently has 17 national health registries (https://helsedata.no/no/) that are mandatory and cover the entire population (Table S2). Electronic health records from the local hospitals hold International Statistical Classification of Diseases and Related Health Problems (ICD) codes back to 1987. Potential linkage to administrative registries expands the data resource; these include, among others, Statistics Norway, which records income and wealth statistics for individuals and households, and the Norwegian Armed Forces Health Registry (https://helsedata.no). Together, the listed registries provide opportunities to integrate a breadth of data from multiple time points to obtain high-quality phenotypes and related information on, for example, environmental and socioeconomic factors. Time-stamped data allow studies of disease development and progression, such as risk prediction of coronary artery disease. 12 Some selected disease endpoints are presented in Table 2.
Analytical approaches with related samples
The majority of HUNT participants are of Norwegian ancestry. 4 Using principal components of ancestry projected onto the Human Genome Diversity Project, we typically exclude samples of non-European ancestry (<2%) (Figure S2) due to limited power. We have observed fine-scale differences between North- and South-Trøndelag and between individuals born closer to the coast versus the border with Sweden. 13 In addition, because of high ascertainment from a single county in Norway (Trøndelag), there are many related individuals within the cohort. A total of 79,551 (89%) out of 88,615 HUNT2-4 participants have at least one second-degree or closer relative who also participates in HUNT (Figure S3; Table S3). On the one hand, this high degree of relatedness allows for unique data analysis methods using nuclear or extended families; on the other, it can result in bias when using methods that assume unrelated individuals, or in power loss if related individuals are excluded. An early effort to use extended families and genetic data in HUNT was the analysis of rare coding variants, 14 where family samples can provide more power to detect associations when sample sizes are limited and only a modest fraction of all trait-associated variants have been identified. 14 Previously, methods had been developed to account for relatedness in the analysis of quantitative traits, 15 but methods to properly account for relatedness and control for unbalanced case-control ratios for binary traits were lacking. We therefore developed statistical methods to allow for the analysis of all individuals and to control for the case-control imbalance of binary phenotypes that is commonly observed in biobanks such as HUNT.
These methods, which are computationally efficient in biobank-scale data, allowed us to perform association testing in HUNT for both single variants (using SAIGE) and gene-based burden tests (using SAIGE-GENE) while accounting for sample relatedness with a sparse identical-by-state sharing matrix. 14,16-18 These methods account for case-control imbalance of binary phenotypes, typical in a population-based sample, by using the saddlepoint approximation to calibrate unbalanced case-control ratios in score tests based on logistic mixed models. 14 We demonstrated a vast improvement in reducing type I error rates when analyzing unbalanced case-control ratios with SAIGE in HUNT. For example, venous thromboembolism, with 2,325 cases and 65,294 controls (a case-control ratio of 0.036), had substantial inflation of type I error with methods available prior to the development of SAIGE (Figure S4). To demonstrate the application of SAIGE-GENE, we investigated 13,416 genes with at least 2 rare (MAF ≤ 1%) missense and/or stop-gain variants that were directly genotyped or imputed from the joint HRC and HUNT-WGS reference panel among 69,716 Norwegian samples from HUNT2-3 with measured high-density lipoprotein. We identified eight genes with p values below the exome-wide significance threshold (p ≤ 2.5 × 10⁻⁶), seven of which remained significant after conditioning on nearby single-variant associations, suggesting independent rare coding variants within these genes. 17 Importantly, using SAIGE and SAIGE-GENE, we were able to use all samples, account for sample relatedness and case-control imbalance, and maintain well-controlled type I error rates.
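As a quick arithmetic check of two figures quoted above, the following sketch recomputes the venous thromboembolism case-control ratio and shows one way an exome-wide threshold of 2.5 × 10⁻⁶ can arise. The text does not state how the threshold was derived; treating it as a Bonferroni correction of α = 0.05 for roughly 20,000 genes is an assumption, and the function names are illustrative only.

```python
def case_control_ratio(cases, controls):
    """Ratio of cases to controls for a binary phenotype."""
    return cases / controls


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Venous thromboembolism counts reported in the text.
vte_ratio = case_control_ratio(2_325, 65_294)

# Assumption: alpha = 0.05 corrected for ~20,000 genes.
exome_wide = bonferroni_threshold(0.05, 20_000)

# For comparison: correcting only for the 13,416 genes actually tested.
tested_only = bonferroni_threshold(0.05, 13_416)

print(f"case-control ratio: {vte_ratio:.3f}")
print(f"threshold for 20,000 genes: {exome_wide:.2e}")
print(f"threshold for 13,416 genes: {tested_only:.2e}")
```

Under this assumption, the conventional 20,000-gene threshold is somewhat stricter than a correction for only the genes that were tested.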
A traditional way of using related samples is linkage analysis, which, however, has computational challenges in the era of whole-genome genetics. To allow for linkage testing in datasets with millions of genetic markers, faster and computationally scalable linkage analysis methods have been developed, e.g., Population Linkage. 19 Population Linkage uses a Haseman-Elston regression (originally used for sibling pair linkage analysis) to estimate variance components from pairwise relationships and identity by descent estimates. Using HUNT data, Zajac et al. observed 25 significant linkage peaks with LOD > 3 across 19 distinct loci for the four traits (high-density lipoprotein, low-density lipoprotein, total cholesterol, and triglycerides), where 5 peaks with LOD > 3 were not replicated at genome-wide significance in a genome-wide association study (GWAS) of 359,432 genotyped variants in HUNT. 19 However, after imputing the dataset with the HRC and HUNT-WGS reference panel to cover more variants or meta-analysis in the Global Lipids Genetics Consortium, significant associations in all five linkage peaks were observed. This study demonstrates one of the benefits of linkage analysis over GWAS, which is the ability to test for linkage in regions that are difficult to genotype, such as rare variants, structural variants, copy number variants, or variants in highly repetitive regions, as long as identical-by-descent segments in the region can be identified. 19 Finally, linkage analysis may improve statistical power when investigating rare risk variants that segregate within families and reduce confounding effects of population stratification.
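The regression idea behind Haseman-Elston can be sketched in a few lines. This is not the Population Linkage implementation: it uses the cross-product form of the regression (pairwise products of standardized trait values regressed on the estimated proportion of alleles shared identical by descent at a locus), and the function name and input layout are assumptions for illustration.

```python
import numpy as np


def haseman_elston_slope(trait_a, trait_b, ibd_share):
    """Regress pairwise trait cross-products on IBD sharing.

    trait_a, trait_b: standardized trait values for the two members
        of each relative pair (same length, one entry per pair).
    ibd_share: estimated proportion of alleles shared identical by
        descent at the tested locus for each pair.
    Returns (slope, intercept) of the ordinary least-squares fit.
    """
    a = np.asarray(trait_a, dtype=float)
    b = np.asarray(trait_b, dtype=float)
    pi_hat = np.asarray(ibd_share, dtype=float)

    cross_product = a * b
    # np.polyfit with degree 1 returns [slope, intercept].
    slope, intercept = np.polyfit(pi_hat, cross_product, 1)
    return slope, intercept
```

A positive slope indicates that pairs sharing more of the locus identical by descent are more similar in the trait, which is the signal a linkage test looks for; significance testing and genome-wide scanning are left out of this sketch.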
The high degree of relatedness among HUNT Study participants has enabled analysis methods tailored to this study design. These include GWAS by proxy, 20,21 in which the phenotypes of non-genotyped family members of genotyped HUNT participants are used to identify proxy-cases: individuals who carry a proportion (0.5 for first-degree relatives) of the genetic risk of cases. These proxy-cases can be appropriately modeled to increase the statistical power in GWAS. For example, the power to detect an allele with an odds ratio of 1.1 and MAF of 0.21 at an alpha of 5 × 10⁻⁸ increases from 0.419 to 0.644 when proxy-cases are appropriately modeled instead of being used as controls in standard GWAS (Figure S5A). We also present empirical results for a known type 2 diabetes variant, rs7903146 in TCF7L2, in HUNT (Figure S5B).
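To make the "proportion of the genetic risk" concrete, the sketch below computes the expected risk-allele frequency in proxy-cases under simplifying assumptions that are not taken from the text: an allelic odds-ratio model, controls whose allele frequency approximates the population frequency, and a first-degree relative who shares half of their alleles with the affected family member. Function names are illustrative.

```python
def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases under an allelic odds-ratio model."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def proxy_case_allele_freq(control_freq, odds_ratio, sharing=0.5):
    """Expected risk-allele frequency in proxy-cases.

    sharing: expected fraction of alleles shared with the affected
    relative (0.5 for first-degree relatives). The remaining alleles
    are assumed to be drawn at the control (population) frequency.
    """
    p_case = case_allele_freq(control_freq, odds_ratio)
    return sharing * p_case + (1.0 - sharing) * control_freq


# Example values from the text: odds ratio 1.1, MAF 0.21.
p_control = 0.21
p_case = case_allele_freq(p_control, 1.1)
p_proxy = proxy_case_allele_freq(p_control, 1.1)
print(p_control, p_proxy, p_case)
```

Under these assumptions the proxy-case frequency lies halfway between the control and case frequencies, which is why treating proxy-cases as controls dilutes the association signal while modeling them separately recovers part of it.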

Genetic discoveries from HUNT
The wealth of phenotypic and genetic data available in the HUNT cohort has led to the discovery of many new genetic associations across a broad range of traits (Table 3). Early genetic studies of HUNT participants used exome arrays and focused on cardiovascular disease. We identified a novel coding variant in TM6SF2 associated with total cholesterol, MI, and liver enzymes 6 and replicated known MI associations at the 9p21 locus and a low-frequency missense variant in the LPA gene (p.Ile1891Met). 7 Following the genotyping of nearly 70,000 participants in HUNT2 and HUNT3 and the development of a combined HRC and HUNT-WGS imputation reference panel, we extended our analyses to a genome-wide search. Through imputation of indels called from low-pass HUNT-WGS, we discovered a rare mutation in the MEPE gene, enriched in the Norwegian population (0.8% in HUNT, 0.1% in non-Finnish Europeans), that was associated with low forearm bone mineral density and increased risk of osteoporosis and fractures. 22 Although this region had been identified previously as associated with bone mineral density, 23 the association in HUNT with replication in the UK Biobank 24 pin-pointed MEPE as the likely causal gene in the region by identifying an insertion/deletion polymorphism that likely resulted in a loss-of-function protein. In another study, we paid special attention to loss-of-function mutations associated with favorable blood lipid profiles (reduced LDL cholesterol and reduced CAD risk), which were not associated with altered liver enzymes or liver damage. We also found an elderly individual with homozygous ZNF529 loss-of-function variant showing no signs of cardiovascular disease or diabetes, suggesting that the full knockout of this gene is viable. This highlighted ZNF529 as a potential therapeutic target for lipids 25 identified from sequencing and custom content genotyping.
In addition to the association studies performed using HUNT data only, we have contributed to many international consortium efforts aimed at aggregating GWAS data across cohorts. By performing GWAS meta-analyses that included HUNT and other cohorts, efforts driven by our research team have identified genetic variants associated with atrial fibrillation that may act through a mechanism of impaired muscle cell differentiation and tissue formation during fetal heart development 29 and cardiac structural remodeling 30; variants associated with estimated glomerular filtration rate exhibiting a sex-specific effect 27,37; and variants associated with thyroid-stimulating hormone that revealed an inverse relationship between TSH levels and thyroid cancer. 26 Later studies using the TOPMed reference panel 38 identified variants associated with circulating cardiac troponin I level, investigated its role as a non-causal biomarker for MI using Mendelian randomization, 31 and identified variants associated with iron-related biomarker levels and explored their relationship with all-cause mortality. 32

Causal inference and family effects
The high degree of relatedness in the HUNT Study offers a unique opportunity to use family-based designs to investigate causal associations. Mendelian randomization (MR), which uses genetic variants as instrumental variables to investigate modifiable (non-genetic) factors, was first proposed using parent-offspring designs. 39 Alleles that are inherited from each parent are randomly determined during the meiotic process. This random allocation is essential to providing reliable comparisons in MR studies. However, due to the lack of genotyped family data, previous studies applied MR at the population level, where the random allocation of alleles is only approximate. We were able to use the ∼15,000 families in HUNT to perform MR as originally proposed, in family-based designs.
33 Using this approach in HUNT, we showed empirically that MR estimates from samples of unrelated individuals, which suggested that taller height and lower BMI increase educational attainment, were likely induced by population structure, assortative mating, or   33 This approach has since grown in popularity and, together with HUNT, many cohorts now contribute to the investigation of causal associations with family-based designs. 34 Further leveraging the family structure information in HUNT, we have investigated, and see future opportunities to investigate, causal effects between family members, for example parent-offspring effects 40,41 and assortative mating and sibling effects. 42 These study designs have not previously been possible due to the lack of genotyped family data, which has limited both causal inference (as mentioned above) and the ability of typical GWASs to distinguish between direct and indirect genetic effects. 34 HUNT data allow for study designs that disentangle these sources of genotype-phenotype associations in humans. In one such example, we used 26,057 mother-offspring and 9,792 father-offspring pairs to investigate whether adverse environmental factors in utero increased future risk of cardiometabolic disease in the offspring. We observed that an adverse maternal intrauterine environment, as proxied by maternal SNPs that influence offspring birthweight, was unlikely to be a major determinant of late-life cardiometabolic outcomes of the offspring. 40

Contribution to collaborative studies
While the HUNT study has been an essential cohort in the genetic discoveries and causal inference mentioned so far, used in isolation it is limited by low power to investigate uncommon phenotypes, uncertainty about the generalizability of findings to non-Europeans, and the lack of an independent sample for replication.
To overcome these limitations, we contribute to genetic studies worldwide through participation in consortia focused on a variety of diseases and traits, including cardiovascular disease, 43,44 lipids, 45,46 type 2 diabetes, 47 osteoporosis, 48 decline in kidney function, 49 Alzheimer's disease, 50 bipolar disorder, 51 intracranial aneurysms, 52 insomnia, 53 respiratory health, 54 and sleepiness. 55 We also contributed HUNT data to studies of anthropometric traits, 56 alcohol and nicotine use, 57,58 COVID-19, 59 phenome-wide discovery, 60 and genetic risk prediction, 12 among others. These contributions reflect efforts, in equal parts, from researchers at the K.G. Jebsen Center for Genetic Epidemiology, NTNU, Norway, the University of Michigan Medical School, and the University of Michigan School of Public Health, USA. We believe that team science by consortia 60 fulfills the goals of the HUNT study and moves the science fastest toward new discoveries and improved human health.

DISCUSSION
Limitations of the study
As noted above, the HUNT study includes primarily individuals of European descent and lacks diverse ancestries for study. In addition, its sample size limits the investigation of uncommon phenotypes. Furthermore, while all residents aged ≥20 years were invited to attend HUNT, biological samples were not available for all participants, which may limit generalizability. However, the relatively high level of participation in HUNT, compared with other studies, indicates a lower concern for selection bias.

Summary
Together, the multifaceted genetic discovery strategy incorporating genotyping, sequencing, and imputation-based approaches in HUNT has aided the identification of likely causal genes and variants for disease and human traits. It has also proved to be a valuable resource for genetically informed methods of causal inference, supporting the identification of modifiable risk factors. We owe this success to the willingness and high participation rates of the people of Trøndelag, the vast phenotyping collected by decades of HUNT researchers, and access to digitized public health care systems. We hope that initiatives such as this, which capture population-specific variants, use up to 40 years of existing longitudinal biomedical research data, and in which the majority of adult inhabitants participated, make a strong case for why it is important to have genetic data both in Norway and in a wide range of other populations. We anticipate that the rich data collection will continue to be a unique resource for longitudinal and family-based designs, genetic discoveries, Mendelian randomization, meta-analysis, and polygenic score validation well into the future.

STAR★METHODS
Detailed methods are provided in the online version of this paper and include the following:   Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

EXPERIMENTAL MODEL AND SUBJECT DETAILS
All residents in the North-Trøndelag area (age ≥20 years) were invited to HUNT1-4. In addition, HUNT4 expanded to collect basic questionnaire data from the adult population of South-Trøndelag, as described elsewhere. 3 Sample size, sex, gender, and information about age for HUNT1-4 are provided for study participants in Table 1.
The genotyping in HUNT and work presented in this cohort profile was approved by the Regional Committee for Ethics in Medical Research, Central Norway (2014/144, 2018/1622, 152,023). All participants signed informed consent for participation and the use of data in research.

METHOD DETAILS
Genotyping array design
We aimed to identify as many high-quality genetic variants among HUNT participants as possible. Toward this aim, we developed a list of custom content for inclusion on one of four Illumina Human Core Exome arrays (HumanCoreExome12 v1.0, HumanCoreExome12 v1.1, UM HUNT Biobank v1.0 and UM HUNT Biobank v2.0) to directly genotype (i) 16,116 missense and loss-of-function variants as well as 1,072 lipid-associated variants identified from low-pass sequencing, (ii) 149 variants observed in Norwegian clinics for familial hypercholesterolemia, (iii) 5,324 Neanderthal variants, and (iv) 32,868 not-previously-observed variants predicted to introduce a premature stop codon in any of 56 genes in which protein-altering variants are deemed clinically actionable by The American College of Medical Genetics and Genomics (ACMG56) 68 (Table S4). Additionally, for the genotyping of HUNT4, we included variants for traits of interest including psoriasis, depression, alcohol use disorder, breast cancer, liver function, and bone mineral density; variants in the GWAS catalog; and loss-of-function variants available in TOPMed but poorly imputable in HUNT samples.

Genotyping procedures
Protocols were carefully planned to mitigate any possible batch effects from the genotyping process. Sample assignments to plates and plate positions were randomized, and sample sets that needed to be grouped together (e.g., based on robot requirements for liquid volume handling or a requirement for re-precipitation of DNA) were randomized within each subgroup. Within each plate, genetically determined sex was evaluated against expected sex to identify any plate orientation issues. To enable this during genotype calling, new HUNT-specific cluster files, specific to each array version, were developed for the genotyping arrays using GenomeStudio. Following genotype calling, allele frequencies were compared between array versions, and any variants that demonstrated significant association with batch or array version were excluded. Limited manual validation of GenomeStudio calls (a few thousand variants) was performed. Quality control was based upon the approach developed by Guo et al. 10 After a first round of automatic clustering in GenomeStudio (including samples with call rate >95%), samples were excluded if they failed to reach a 99% call rate, had contamination >2.5% as estimated with BAFRegress, 63 had large chromosomal copy number variants, had the lower call rate of a technical duplicate pair or twin pair, had gonosomal constellations other than XX and XY, or had an inferred sex that contradicted the reported gender. Samples that passed quality control were analyzed in a further round of genotype calling following the GenomeStudio quality control protocol described elsewhere. 10 Genomic position, strand orientation, and the reference allele of genotyped variants were determined by aligning their probe sequences against the human genome (Genome Reference Consortium Human genome build 37 and revised Cambridge Reference Sequence of the human mtDNA; http://genome.ucsc.edu) using BLAT.
61 Variants were excluded if their probe sequences could not be perfectly mapped to the reference genome, cluster separation was <0.3, Gentrain score was <0.15, they showed deviations from Hardy-Weinberg equilibrium in unrelated samples of European ancestry (p value < 0.0001), their call rate was <99%, or another assay with a higher call rate genotyped the same variant. Ancestry of all samples was inferred by projecting all genotyped samples into the space of the principal components of the Human Genome Diversity Project (HGDP) reference panel (938 unrelated individuals; downloaded from http://csg.sph.umich.edu/chaolong/LASER/). 62,64 For genotyping batches from HUNT2 and HUNT3, PLINK v1.90 69 was used, and recent European ancestry was defined as samples that fell into an ellipsoid spanning exclusively European populations of the HGDP panel. For genotyping from HUNT4, we predicted ancestry using an online singular value decomposition and shrinkage adjustment algorithm (FRAPOSA) with the same reference panel. 67 The different arrays were harmonized by reducing to a set of overlapping variants and excluding variants that showed frequency differences >15% between datasets, or that were monomorphic in one dataset and had MAF >1% in another. The resulting genotype data were phased using Eagle2 v2.3. 71
Cell Genomics 2, 100193, October 12, 2022
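The array harmonization rule described above can be sketched as a simple filter on allele frequencies. This is an illustration rather than the code used in HUNT; it assumes that ">15%" refers to an absolute difference in the frequency of the same allele, that both inputs are already restricted to overlapping variants in the same order, and that the function name `harmonize_variants` is hypothetical.

```python
import numpy as np


def harmonize_variants(freq_a, freq_b, max_diff=0.15, maf_cut=0.01):
    """Return a boolean mask of variants to keep across two datasets.

    freq_a, freq_b: frequency of the same allele for each overlapping
    variant in dataset A and dataset B.
    """
    fa = np.asarray(freq_a, dtype=float)
    fb = np.asarray(freq_b, dtype=float)

    # Minor allele frequency in each dataset.
    maf_a = np.minimum(fa, 1.0 - fa)
    maf_b = np.minimum(fb, 1.0 - fb)

    # Rule 1 (assumed absolute difference): frequencies differ by > 15%.
    large_diff = np.abs(fa - fb) > max_diff

    # Rule 2: monomorphic in one dataset but MAF > 1% in the other.
    mono_mismatch = ((maf_a == 0) & (maf_b > maf_cut)) | (
        (maf_b == 0) & (maf_a > maf_cut)
    )

    return ~(large_diff | mono_mismatch)
```

For example, a variant with frequency 0.30 on one array and 0.50 on the other would be dropped by the first rule, whereas a variant that is monomorphic on one array and has MAF 0.005 on the other would be kept.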

Imputation
The imputation described here is limited to the 69,716 samples of recent European ancestry from HUNT2-3, as the work on HUNT4 is ongoing. Samples were imputed using Minimac3 (v2.0.1, http://genome.sph.umich.edu/wiki/Minimac3) 65 with default settings (2.5 Mb reference-based chunking with 500 kb windows) and the HUNT-WGS customized Haplotype Reference Consortium release 1.1 (HRC v1.1) for autosomal variants and HRC v1.1 for chromosome X variants. 9 The HUNT-WGS customized reference panel represented the merged panel of two reciprocally imputed reference panels: (1) 2,201 low-coverage (5x) whole-genome sequenced samples from the HUNT study (HUNT-WGS) and (2) HRC v1.1 with 1,023 overlapping HUNT-WGS samples removed before merging. Since only 1,200 HUNT samples were sequenced at the time the HRC was established, we instead merged all HUNT-WGS samples (including indels) with the non-HUNT HRC samples to create a combined HRC and HUNT-WGS imputation reference panel. Additionally, we recently performed imputation from 60,039 TOPMed reference genomes using Minimac4 (v1.0).
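The chunking settings quoted above (2.5 Mb chunks with 500 kb windows) can be pictured with the following sketch, which splits a chromosome into core chunks and extends each by a flanking window on both sides. How Minimac3 places chunks and windows internally is not described in the text, so the exact boundaries here are an assumption, and the function name is hypothetical.

```python
def imputation_chunks(chrom_length, chunk=2_500_000, window=500_000):
    """Split a chromosome into core chunks with flanking windows.

    Returns a list of (core_start, core_end, ext_start, ext_end)
    tuples using 1-based inclusive coordinates. The extended interval
    adds up to `window` bases on each side, clipped to the chromosome.
    """
    chunks = []
    start = 1
    while start <= chrom_length:
        end = min(start + chunk - 1, chrom_length)
        ext_start = max(1, start - window)
        ext_end = min(chrom_length, end + window)
        chunks.append((start, end, ext_start, ext_end))
        start = end + 1
    return chunks
```

The flanking windows give the imputation model haplotype context beyond each chunk boundary, so genotypes near the edges of a core chunk are not estimated from truncated haplotypes.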
Panel A: Simulated power curves for proxy-case models. Given a total biobank size of 100,000 for a disease prevalence of 10% and heritability of disease liability of 10% a biobank could have cases (n=10,000),