Genotyping and population characteristics of the China Kadoorie Biobank

Summary The China Kadoorie Biobank (CKB) is a population-based prospective cohort of >512,000 adults recruited from 2004 to 2008 from 10 geographically diverse regions across China. Detailed data from questionnaires and physical measurements were collected at baseline, with additional measurements at three resurveys involving ∼5% of surviving participants. Analyses of genome-wide genotyping, for >100,000 participants using custom-designed Axiom arrays, reveal extensive relatedness, recent consanguinity, and signatures reflecting large-scale population movements from recent Chinese history. Systematic genome-wide association studies of incident disease, captured through electronic linkage to death and disease registries and to the national health insurance system, replicate established disease loci and identify 14 novel disease associations. Together with studies of candidate drug targets and disease risk factors and contributions to international genetics consortia, these demonstrate the breadth, depth, and quality of the CKB data. Ongoing high-throughput omics assays of collected biosamples and planned whole-genome sequencing will further enhance the scientific value of this biobank.


In brief
Walters et al. present genetic analyses of >100,000 participants of the China Kadoorie Biobank (CKB), a populationbased, prospective cohort of >512,000 from 10 diverse regions of China. They describe how the CKB has contributed to understanding of the genetic basis for many diseases and risk factors and report GWASs of 124 diverse disease outcomes.

INTRODUCTION
Major non-communicable chronic diseases, such as heart attack, stroke, cancer, and chronic obstructive pulmonary disease (COPD), account for much of the adult disease burden in China and globally. Several such diseases display large unexplained variations in incidence between different regions in China, indicating that important genetic and non-genetic causes remain to be discovered. The China Kadoorie Biobank (CKB) was initiated in 2002, with the goal of investigating the causal relevance of established and novel disease risk factors in the adult Chinese population. 1 From 2004 to 2008, CKB recruited >512,000 adults aged 30-79 years from 10 geographically diverse (five urban, five rural) regions across China, making it one of the largest blood-based prospective biobanks in the world. 2 Many aspects of the CKB study design support a wide range of hypothesis-driven and hypothesis-free research into many different diseases: population-based recruitment; prospective sample collection; a relatively medication-naive population; rich and diverse exposure and lifestyle data; and comprehensive capture of incident disease events through electronic linkage to death and disease registries and to health insurance records. The CKB also contributes to the growing demand for ancestrally diverse biobanks, which not only expand opportunities for novel discoveries of potential value to all human populations but also address potential inequalities in healthcare that may arise from the historical focus of research on individuals of European ancestry, findings from which are not necessarily transferable to other populations. 3,4 In common with many other large biobanks, the value of the CKB has been greatly enhanced by large-scale genotyping of study participants. Such genotype information enables investigation of the contribution of genetic variation to phenotype and disease risk, Mendelian randomization (MR) assessment of the causal contribution of risk factors and behaviors to disease, and phenome-wide analyses of the impact of variation at specific loci.
We describe the design and performance of a custom Affymetrix Axiom array optimized for individuals of Chinese Han ancestry, which provides both genome-wide coverage to enable high-quality imputation of both common and low-frequency variation, and direct genotyping of $68,000 putative loss-of-function, missense, and expression quantitative trait loci (eQTL) variants of potential use for MR or phenome-wide association studies. On the basis of genotyping of >100,000 CKB participants, we demonstrate extensive population diversity across China, identify substantial relatedness within the CKB study population, and observe principal component (PC) signatures consistent with population movements from recent Chinese history. Through linkage to deep phenotyping and >1.2 million recorded disease outcomes in the CKB, this genotyping has already facilitated a wide range of studies, from investigation of ancestry-specific loss of function variants to inform drug target identification, validation, and repurposing [5][6][7][8] ; to participation in international trans-ancestry genome-wide association study (GWAS) consortia including the Global Biobank Meta-Analysis Initiative (GBMI). [9][10][11][12][13][14] We report the results of GWASs of 224 disease outcomes, which can be accessed through a CKB PheWeb browser.

STUDY POPULATION AND DATA COLLECTION
Recruitment of CKB participants was community-based, taking place at a large number of local assessment centers within each of 10 diverse regions of China (five rural counties and five urban districts), respectively referred to by the name of the province or city in which recruitment took place. At baseline assessment participants completed an extensive interviewer-administered questionnaire on factors including demographics and socioeconomic status, diet and lifestyle (e.g., smoking, alcohol), physical activity, reproductive history (for women), and medical history and current medication. In addition, all participants underwent physical examination including measurements of anthropometrics, blood pressure, spirometry, exhaled carbon monoxide, and body composition (using bioimpedance). Furthermore, there were onsite blood tests of (non-fasting) glucose and hepatitis B virus (HBV) surface antigen, and blood samples were processed within a few hours to separate plasma and buffy coat for long-term storage. 2 Subsequent to initial recruitment, three periodic resurveys have been undertaken of approximately 5% of surviving participants, selected on the basis of representative random samples of assessment centers, to provide repeat measurements for correction of regression dilution bias, gather additional questionnaire information, conduct additional physical measurements and blood tests, and collect repeat blood samples and other additional biosamples for long-term storage (Figure 1). The first resurvey of 19,802 participants, conducted immediately after completion of study recruitment in 2008, was largely a repeat of the baseline survey. 2 This was extended in the second resurvey (25,091 participants), conducted in 2013 and 2014, with additional questionnaire data, additional physical measurements, on-site assays of blood lipids, and collection, testing, and storage of urine samples. Additional enhancements in the third resurvey (25,087 participants) in 2020 and 2021 included additional measurements of abdominal ultrasound and retinal imaging, and collection of saliva and stool samples. Baseline characteristics of resurvey participants were similar to those of the overall CKB cohort, with only minor differences attributable to survivor bias (Table S1). More than 22,000 individuals attended at least two resurveys; these multiple measurements at different time points will enable future longitudinal analyses of trajectories of risk factors for major diseases.
In addition to data collected at baseline and the resurveys, an increasing range of data are being generated from assays of stored biosamples. As part of a nested case-control study of stroke and ischemic heart disease (IHD), plasma samples from up to 18,728 participants (all with genotyping) were assayed for 17 clinical biochemistry measurements, with 1 H nuclear magnetic resonance (NMR) metabolomics for 4,657; further 1 H-NMR metabolomics measurements were conducted for other nested case-control studies of pancreatic cancer and diabetes (2,500 samples to date). Olink proteomics (3,072 proteins) and SomaLogic proteomics (up to 7,000 proteins) have recently been assayed for a further nested case-subcohort study of myocardial infarction (MI; 3,977 participants, all with genotyping), with additional larger scale measurements planned in the near future. Further assays include multiplex serology of antibodies to antigens from 19 pathogens (4,500 samples to date, with measurement in 40,000 cancer cases and controls under way) and 1 H-NMR metabolomics of urine samples from 25,251 resurvey participants.

ARRAY DESIGN
CKB genotyping used custom-designed arrays on the Affymetrix (now Thermo Fisher Scientific) Axiom platform, with content selection based on similar overall principles to those used for the UK Biobank array design, 15,16 but adapted to optimize performance for individuals of East Asian ancestry. This addressed three high-level criteria: (1) maximization of genome-wide coverage of common and low-frequency variation in individuals across the whole of China, (2) detection of important variants and of rare loss-of-function and other protein-coding variants that are present in Chinese populations, and (3) consistent performance of photolithographic manufacturing processes across array designs and batches of arrays manufactured over an extended time period. 17 Using the UK Biobank probe list as the starting point for the CKB array design, this was then modified and extended, informed by allele frequency and sequence data for more than 12,000 East Asians that were available to us in 2013 (Data S1; Figure S1). In brief, we (1) removed variants identified as absent or at low frequency in East Asians; (2) added specific variants confirmed as being present in East Asians, including loss-offunction, missense, and eQTL variants; (3) constructed an East Asian-specific genome-wide grid that maximized imputation of both common (>5%) and low-frequency (1%-5%) variants; and (4) included multiple copies of a series of degenerate probes for detection and classification of circulating HBV viral DNA. The resulting array design comprised 781,937 probe sets assaying 700,701 variants, of which 354,399 were also present in the UK Biobank array ( Figure S2; Table S2). The design included duplicate probe sets, one for each strand, for 81,236 variants that did not have a validated assay on the Axiom platform.
The initial array design was revised on the basis of genotyping data from the first 100 plates (8,995 samples after quality control [QC]) (Data S2; Figure S3). Failed, poor-quality, or otherwise uninformative monomorphic probe sets were removed, along with the poorer-performing probes of each pair of duplicate probe sets. Further specific content of interest (including further HBV probes and tags for variants that failed QC) were added to the design. Finally, variants were added to improve or restore genome-wide imputation coverage. Figure 2 summarizes the content of the final updated array, comprising 804,496 probe sets assaying 803,030 variants, of which 340,562 are present in the UK Biobank array (Table S3).

GENOTYPING AND QC
Genotyping and QC of a total of 105,408 CKB DNA samples are summarized in Table 1. The initial CKB array design was used to genotype 33,408 samples that had been selected for nested case-control studies of cardiovascular disease and COPD. On the basis of disease follow-up to January 1, 2014 (see below), this initial genotyping included all incident cases of intracerebral hemorrhage (ICH 5,020), subarachnoid hemorrhage (SAH; 455), and fatal IHD (753); randomly selected incident cases of ischemic stroke (IS; 5,662), MI (1,008), and COPD (5,376); participants with no cardiovascular events (n = 10,038) at time of selection, matched to ICH cases for sex, age, and region; and 4,766 randomly selected individuals who had attended the second resurvey.
The updated array design was then used to genotype a further 72,000 samples, including all available additional cases of ICH (602), SAH (46), MI (1,028), and fatal IHD (163) that had not previously been genotyped. The remaining genotyped samples came from boxes of DNA samples that were either randomly selected or selected as containing samples derived from the assessment centers used for the second resurvey, in either case being largely representative of the overall CKB cohort. Duplicate samples were present on each pair of consecutive plates to support sample and plate tracking and for other QC purposes.
Genotyping and QC followed the Affymetrix best practice workflow 18 with additional checks for probe sets that displayed substantial between-plate or between-batch differences in allele frequency or call rate. After QC and removal of duplicate probe sets, 76.1% and 85.5% of probe sets remained for the initial and revised arrays, respectively (Table S4); 3.4% of samples failed QC (summarized in Table 1), which were mostly initial QC failures, likely reflecting low quality or low concentration DNA samples; only 0.16% of samples were excluded because of sex mismatch, reflecting the CKB's stringent sample tracking procedures. 19 All pairs of duplicate samples showed high concordance of non-missing genotypes, overall concordance being 99.88% and 99.87% for the first and second array designs respectively (with 0.67% and 0.64% calls missing in one or both of a pair). Where probe sets on the revised CKB array were present in the UK Biobank array, 192 samples genotyped on both arrays yielded concordance of 99.80% (0.46% missing).
The allele frequency distribution for the two datasets reflected the design characteristics of the arrays ( Figure 3A). There were many more monomorphic or very low frequency (minor allele frequency [MAF] < 0.0001) variants on the original array design (5.7%) than on the revised array (1.6%) on which many such variants had been removed. Conversely, revision of the array design included selection of additional variants to improve imputation of variants with MAF of 0.01-0.05, and correspondingly more variants passed QC in this MAF range. The allele frequencies of variants that passed QC showed strong agreement with the East Asian populations in the 1000 Genomes reference dataset 20 ( Figure 3B), even at lower MAF where estimates in the 1000 Genomes reference were affected by small sample size, and this agreement was consistent across all recruitment regions, although with somewhat greater variation at lower MAF ( Figure S4).
These allele frequency data provide some insight into the potential utility of variants included on the CKB array for purposes of investigating the impact of protein loss of function. Variants on the revised array that passed QC were categorized according to their potential functional significance as predicted by Combined   Figure 3D). These will provide opportunities not available in European cohorts for genetic investigations of the importance of the affected genes for disease and disease risk, such as those already conducted for PLA2G7 and CETP. 5,6 Initially, imputation was performed separately for each array dataset, but the revised array provided only a modest improvement in imputation quality, despite the substantially larger number of informative variants passing QC (Table S4). Therefore, to minimize batch and array effects, we derived a single imputation dataset, using only those variants passing QC in all batches on both array versions (although variants excluded for imputation remain available for analysis in the final dataset). We achieved high-confidence imputation for the large majority of common and low-frequency variants present in the EAS populations of the 1000 Genomes Phase 3 reference (Tables S4 and S5; Figure S5): mean info scores were 0.950 for variants with MAFs > 0.05, 0.849 for MAFs of 0.01-0.05, and 0.695 for MAFs of 0.005-0.01. Imputation was typically poorer for rare variants with MAF <0.005, which are excluded from many analyses.

RELATEDNESS
The community-based recruitment of CKB participants resulted in family groups attending together, so that many individuals had close relatives also recruited into the study (Tables S6 and S7). Among genotyped participants, 31.9% had an also genotyped second-degree or closer relative (23.6% having at least one first-degree relative), with more relatedness in rural than in urban regions (39.0% vs. 22.8% with first-/second-degree relatives), with the exception of participants in Suzhou (54.7%); Suzhou recruitment took place in a previously rural district that has only recently become urbanized. Suzhou also had a particularly high proportion of individuals with multiple close relatives (1,561 [20%] with R3 genotyped relatives). Further analysis of relatedness in the CKB also identified 32 pairs of twins, 13,875 individ-uals with at least 1 sibling (comprising 6,325 family groupings of up to 7 siblings) and 6,571 parent-child relationships, including 1,189 trios. Several regions displayed patterns of relatedness suggestive of historical consanguinity ( Figure S6). Histograms of pairwise Resource ll OPEN ACCESS identity by descent (IBD) displayed not only the expected peaks corresponding to integer numbers of meioses separating relatives (at IBDs of 0.5, 0.25, 0.125, etc.), but also other peaks centered on values corresponding to relationships that arise as a result of consanguineous unions between individuals with recent common ancestors, for instance triple second cousins (expected IBD of 0.09375). Consistent with this, 1,050 participants had heterozygosity >3 SDs below the mean values across all genotyped samples, all but 3 of whom had correspondingly extensive runs of homozygosity ( Figure S7).

CKB POPULATION STRUCTURE
We performed PC analysis (PCA) of 76,719 unrelated CKB participants and identified that the first 11 PCs were informative for CKB population structure, according to the Bayesian information criterion (BIC) for models predicting individuals' recruitment region ( Figure S8). Consistent with findings from many previous studies, individuals formed discrete clusters, whose locations on a plot of the first 2 PCs closely resembled the pattern of longitude and latitude for the regions in which they were recruited ( Figures 4A, 4B, and S9). For three regions, how-ever, the positions of their PCA clusters were clearly offset compared with their geographic location. In each case, the apparent discrepancy corresponds to a known major population movement from recent Chinese history: large-scale migration in the 16th and 17th centuries AD from Guangdong to Hainan island; repopulation of the Chengdu Basin in Sichuan in the late 17th and early 18th centuries, a large proportion of migrants coming from Huguang (now Hubei/Hunan); and settlement of largely unpopulated Manchuria in the late 19th and early 20th centuries, with the majority of settlers originating from Shandong province. Thus, the lead PCs reflect the known historical geographic origins of the population in a region, rather than its current physical location.
Although the leading PCs tightly clustered most individuals for each recruitment region, a small proportion (5.7% overall) lay outside the main cluster (>3 SDs from the region mean for PCs 1-11) and thus appeared to have non-local ancestry. Of these, for those with data available from the second resurvey, a high proportion (47.8%) reported that they or at least one parent were born in a different province of China; by comparison, only 12.5% of the remainder reported origins from a province other than that in which they were recruited. The large majority of participants with non-local ancestry was recruited in Liuzhou, among whom a substantial fraction of individuals (25.4%) lay outside the main PCA cluster, mostly reflecting outlier values for PC 1 (corresponding to the major north-south axis); 87.2% of these individuals reported that either they or a parent was born outside Guangxi province.
Local population structure within each region was not readily apparent from the above pan-China PCA but was clearly observed for individuals within each region-specific cluster (excluding those identified as having non-local ancestry). Between 2 and 9 PCs were informative for the latitude and longitude of the assessment center at which an individual was recruited (Figures 4C, S10, and S11). Rural regions (plus previously rural Suzhou) typically displayed substantial structure, reflecting established communities with little population movement; by contrast, there was only limited population structure for most urban regions, with the exception of Liuzhou, for which an appreciable proportion of second resurvey participants reported having non-Han ancestry (17.6% compared with 1.0% across the other 9 regions). Although uninformative for geographical location within Liuzhou, the first 4 PCs from whole-cohort PCA or the first 2 local PCs were informative for Han status ( Figure S12); however, there was no clear discrimination between Han and non-Han that might suggest a need to analyze these individuals separately.
The geographical and/or historical relationships between the different CKB populations are reflected in F st analyses of the genetic distance between regions ( Figure S13): the four northern regions cluster together, and also with the northern Han 1000 Genomes population (CHB); the four regions situated on or near the Yangtze River cluster together, in two pairs, and also with the southern Han 1000 Genomes population (CHS); and the southern two regions cluster together, with no appreciable distinction in Liuzhou between Han and non-Han identity. The positions of the 1000 Genomes Project East Asian populations in this analysis indicate that the 10 regions in the CKB population are components of a continuum running from north to south with no clear separation from neighboring countries (KHV, from Vietnam) or ethnic populations (Dai Chinese, from near to the borders with Laos and Myanmar).

DISEASE OUTCOMES
In common with the other biobanks contributing to the GBMI, 14 one of the chief strengths of the CKB is the ability to follow up study participants for a wide range of fatal and non-fatal disease outcomes. 2 In the CKB, disease follow-up is obtained by electronic linkage using participants' unique national identity numbers to registries for death (with cause of death recorded) and for 4 major diseases (stroke, IHD, cancers, diabetes) and to the national health insurance system, which records all inpatient hospital events. These procedures are complemented by active follow-up through annual checks of local residential records and, if necessary, in-person visits by local staff to check key data including vital status and to identify hospitalized episodes in a small proportion of the CKB (currently approximately 2%) who have not joined the health insurance scheme. 24 Data from these multiple sources, including parsing of free-text Chinese language disease descriptions and matching to a clini-cian-curated disease description library, are integrated and standardized into International Classification of Diseases, 10th Revision (ICD-10), coded incident disease events. By January 1, 2019, >1.2 million incident events and 49,428 deaths (including causes of death) had been recorded for the whole of the CKB, covering >5,000 separate disease types (defined according to three-character ICD-10 code), with only 5,302 (1.0%) of participants being lost to follow-up. Table 2 shows the number of individuals with selected incident events as recorded through disease follow-up, in all CKB participants and in the genotyped subset. Additional prevalent cases are available through medical questionnaire data, and on-site measurements at baseline (e.g., type 2 diabetes and COPD, through blood glucose assays and spirometry). Reflecting the strategy for selecting the genotyped samples, there was enrichment for cardiovascular diseases, including IS (ICD-10 code I63), intracerebral hemorrhage (ICD-10 code I61), and MI (I21), for COPD (ICD-10 codes J41-J44 and J47), and for all-cause mortality. By contrast, other diseases unrelated to the ascertained case types were present in proportion with the number of genotyped samples.
The wide range of disease outcomes recorded during followup is illustrated by the 224 different 3-character ICD-10 codes that have at least 100 incident events recorded in genotyped participants (Table S8). Although limited, this number of cases is sufficient to permit analysis using software packages such as SAIGE, 25 and subsequent contributions to multi-cohort metaanalyses. Work is ongoing to convert the ICD-10 data into Phecodes, 26 to aid harmonization of disease outcomes between CKB and other biobanks and to facilitate phenome-wide association analyses.

ANALYTICAL APPROACH AND GWAS
The historical consanguinity and extensive relatedness in the CKB have been exploited in analysis of the impact of inbreeding on reproductive success 27 and for within-sibship GWASs to derive estimates of direct genetic effects unaffected by genetic nurture. 28 However, for the majority of studies, the population structure of the CKB cohort together with the strategy for selection of samples for genotyping require thoughtful analytical approaches. The substantial relatedness within the CKB, as in many other population-based cohorts, means that exclusion of individuals to avoid inclusion of pairs of close relatives (typically kinship > 0.05, corresponding to third-degree relatives, e.g., first cousins) would result in a substantial reduction in sample size, with some recruitment regions being disproportionately affected (Tables S6, S7, and S9). Therefore, we typically use well-established software packages such as BOLT-LMM 29 and SAIGE 25 that implement linear mixed models to account for both relatedness and population stratification, thereby permitting inclusion of related individuals.
It is unclear, however, that current software packages fully account for all aspects of population structure in the CKB, with its recruitment in 10 discrete regions each with their own distinct genetic characteristics, varying environments, cultures, demographics, and incidence rates of major disease outcomes. For some diseases, this may be not only because they vary in prevalence but also because of varying access to healthcare (e.g., in rural compared with urban regions), so that patterns of severity in reported cases may also vary between regions. Therefore, although it is frequently appropriate to conduct analyses across the full genotyped dataset with adjustment for recruitment center (as a 10-level categorical covariate), wherever possible we supplement this with region-stratified analyses (excluding the 5.7% of individuals with non-local ancestry) and meta-analysis, to ensure that association signals are not due to unresolved population stratification or subject to other biases (e.g., arising from heterogeneity between regions).
A second consideration is that the selection for genotyping of nested case-control samples has resulted in substantial overrepresentation of participants with hospitalization for cardiovascular disease or COPD; although the additional cases were selected on the basis of incident events (i.e., after recruitment), the baseline characteristics of these individuals nevertheless differ from the overall population (e.g., cardiovascular disease events are positively associated with blood pressure, adiposity, blood lipids, smoking, and alcohol consumption). Their inclusion potentially introduces biases or confounding into analyses using the complete genotyping dataset, and we have therefore developed approaches that seek to minimize or eliminate these biases. Where traits are available only in non-random subsets of individuals (e.g., clinical biochemistry measures), we either exclude ascertained cases entirely, include case ascertainment as a covariate, or conduct analyses stratified by ascertainment; note that this is not required for measurements taken at the second resurvey, which was representative of surviving CKB participants. For quantitative traits available for all participants, such as blood pressure or reproductive traits, we perform all adjust-ments for covariates and data transformations in the full CKB cohort, prior to genetic analyses, so that these adjustments are not distorted by the non-random nature of those genotyped; this is typically performed as a single regression including region as covariates, but we also check the impact of instead performing such adjustments in each region separately. This is the approach used both for contributions to large multi-ancestry meta-analyses 13,30 and for several CKB-specific GWASs. 31,32 For the analysis of dichotomous variables, to overcome the potential biases due to case enrichment, we have constructed a subset of 77,176 individuals representative of the full CKB cohort in which over-representation of ascertained disease cases was eliminated (STAR Methods; Tables S1 and S9). Analyses of disease outcomes and other binary phenotypes, including contributions to the first round of GBMI studies, 14,[33][34][35] have typically used this population-representative subset supplemented with additional cases from the remainder of the dataset. For these analyses we use SAIGE software, 25 which is designed to account for imbalances in numbers of cases and controls. This approach is combined with region stratification and meta-analysis for diseases with large numbers of cases, so that separate analyses in each region are possible and do not lead to exclusion from analysis of large numbers of variants due to low minor allele count (MAC; see STAR Methods). It should be noted that use of the population subset does not result in a noticeable loss of power, despite the exclusion of $25% of samples, as there is invariably a large excess of controls even for more common diseases. It also provides a reduced dataset, unaffected by disease ascertainment biases, which can be used for sensitivity analyses of studies of other traits that use the full genotyped dataset. Resource ll OPEN ACCESS Using this approach, we have conducted GWASs of the ICD-10 codes with at least 100 genotyped cases, yielding 35 associations at genome-wide significance (5 3 10 À8 ) ( Figure 5A; Table S10). The majority of these replicated known association signals for a range of diseases (type 2 diabetes, hypertension, atrial fibrillation, cerebral infarction, gout, liver cirrhosis, liver cancer, lung cancer), but we also identified 14 potentially novel disease-associated loci. Although none of the latter associations survived a strict Bonferroni adjustment to take account of the multiple GWASs, and some were based on a small number of cases, several were nevertheless at loci associated with closely related diseases or phenotypes, suggesting that these reflect robust associations: an association with ICD-10 H40 (glaucoma) near EYS, at which there is also an association with retinitis pigmentosa, 36,37 a known risk factor for primary angle-closure glaucoma 38 ( Figure 5B); an association with H43 (disorders of vitreous body) at NCKAP5, close to reported associations with optic disc size 39 and primary open-angle glaucoma 40 ( Figure 5C); an association with K81 (cholecystitis) at MYLK4, at which there is an association with serum alkaline phosphatase, 37 an established marker of bile duct stones and acute cholecystitis 41 (Figure 5D); and a variant in SLC35F3 (rs4333882) associated with K60 (fissure and fistula of anal and rectal regions) and also with one of the causes of fistulas, diverticular disease 42 ( Figure 5E). We have made the results and associated plots of all these GWASs available through a PheWeb browser. 43

RESEARCH CONTRIBUTIONS
In combination with genotyping and imputation, and the wide range of phenotypes and disease outcomes available for CKB participants, the above analytical approaches have been applied in diverse studies. Initial studies, using directly genotyped variants, were MR-based investigations that emphasized the value of ancestry diversity for genetic analyses. In early examples of ''drug target MR,'' we found no association of East Asian-specific variants in PLA2G7 and CETP with major cardiovascular disease outcomes or other major diseases, in each case complementing the results of clinical trials that found no major benefit of drug treatments targeting their respective protein products. 5,6,44 We also exploited the high frequency in East Asians of variants influencing alcohol metabolism, and thereby drinking behavior, to investigate the causal relationship between alcohol consumption and deleterious effects on health: we showed a clear link between alcohol and risk for stroke and, for the first time, we robustly refuted previous reports from observational studies of apparent protective effects of moderate drinking. 45 Other early genetic studies in the CKB investigated the causal relevance of other disease risk factors, providing evidence that vitamin D deficiency increases risk for diabetes and cardiovascular disease, 46,47 that diabetes is itself causally associated with increased risk for cardiovascular disease, 48 and that, although lowering of low-density lipoprotein (LDL) cholesterol decreases risk for IS, it also increases risk for hemorrhagic stroke. 49,50 Since becoming available, the genome-wide imputed data have enabled a wider range of studies, including MR investigation of further drug targets 7,8 and of diverse disease risk factors such as bone mineral density, 51 gallstone disease, 52 height, 53 blood pressure, 32,54 and resting heart rate. 55,56 We have contributed to replication of novel association signals from large GWASs of blood pressure, 57 menopause, 12 and early-onset  Table S10. stroke, 58 and have evaluated the performance of polygenic scores in predicting disease risk for lung function, 9,59 lung cancer, 60 fracture, 61 and breast cancer. 62 A comprehensive set of GWASs for major diseases and disease risk factors are in progress, including multiple adiposity and blood pressure traits. 31, 32 We also recently published the first large GWAS of lung function in an East Asian population, 63 identifying 48 independent associations of which 18 were novel, once again emphasizing the value of expanding ancestry diversity in genetic studies.
In addition to these analyses conducted primarily within the CKB, we have contributed to many genome-wide association studies in collaboration with major consortia, including transancestry studies of intracranial aneurysm, 10 recurrent miscarriage, 11 blood lipids, 64 fingerprint patterns, 65 diabetes, 66 and height. 13 In particular, CKB has made important contributions to the growing number of studies focused specifically on populations of East Asian ancestry, including the largest single contribution to a GWAS of depression in East Asians 67 ; and a major contribution to a GWAS of type 2 diabetes, the largest East Asian GWAS to date. 68 Summary statistics from CKB GWASs have also contributed to the development of methods for genetic association analyses using very low coverage whole-genome sequencing from non-invasive prenatal testing 69 ; trans-ancestry colocalization to assess whether two populations share causal variants 70 ; and improved genetic discovery in multi-ancestry meta-analyses. 71

LIMITATIONS OF THE STUDY
Although also one of its strengths, the population-based nature of CKB recruitment leads to certain limitations. First, voluntary participation in the study, although mitigated by very low loss to follow-up, might lead to selection bias, with those recruited potentially being healthier and with fewer health conditions that would reduce likelihood of study participation. In addition, because of recruitment in specific, not necessarily representative locations, care may sometimes be required in extrapolating results from CKB to the Chinese population overall.
Second, although there is near-complete linkage to any episode of hospitalization subsequent to recruitment, data on medical (and family) history is restricted to the limited details recorded during the baseline questionnaire, with no outpatient or primary care data recorded (not covered by the health insurance system). This, together with the middle-to old-age profile of the cohort, means that some categories of disease (e.g., relating to female reproduction) are under-reported or not captured. Although addressed in part by updates to the resurvey questionnaires, this only applies to 5% of participants. Nevertheless, the prospective nature of the cohort and comprehensive linkage ensures that analyses in the CKB can provide reliable assessment of the contribution of genetic and non-genetic factors to major diseases in China.

FUTURE PROSPECTS
Together, these many analyses have established the high-quality linkage between genotyping, data collected at baseline recruitment and resurveys, and disease follow-up. With its breadth of phenotypes and disease outcomes, prospective study design, and growing range of diverse omics assays, CKB will continue to make significant contributions to genetic discovery and elucidation of disease etiology and causality. Ongoing work will further enhance the available genetic resources, including DNA methylation arrays for 982 samples 72 ; imputation using the Trans-Omics for Precision Medicine (TopMED) 73 and Westlake Biobank for Chinese (WBBC) 74 reference panels; and wholegenome sequencing of 10,000 participants with incident IS. 75 Whole-genome sequencing of the entire 512,000 cohort is planned in the near future through private-public partnership. Together with other notable biobanks across the world, CKB is addressing the recognized need for ancestrally diverse biobanks, and will continue to make strong contributions to the East Asian and trans-ancestry genetic analyses that are beginning to correct the strong Euro-centric bias of the genetic literature. 4

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
The most important acknowledgment is to the participants in the study and the members of the survey teams in each of the 10 regional centers and to the project development and management teams based at Beijing, Oxford, and the 10 regional centers.

DECLARATION OF INTERESTS
The authors declare no competing interests. CKB study data Full details of the CKB study design and methods have been previously reported. 2 Briefly, 512,726 adults aged 30-79 years were enrolled during 2004-2008 from ten urban and rural areas across China. At local study assessment centers, trained health workers administered a laptop-based questionnaire; undertook physical measurements; and collected a blood sample for long-term storage and onsite blood tests. Three subsequent resurveys of $5% randomly selected surviving participants were conducted using similar procedures in 2008, 2013-2014, and 2020-2021. With the exception of genomics data, all CKB survey data were collected and stored using bespoke IT systems and databases tailored to CKB requirements. 86,87 Disease follow-up data from death and disease registries and from health insurance records were processed and matched to study identifiers by local staff in each recruitment region, centrally processed and converted into ICD-10-coded events, and integrated into the main study database. The database is regularly processed into research-ready snapshots from which datasets are served to researchers. All results are based on CKB data release version 17.02, incorporating disease follow-up up to 1 January 2019.

METHOD DETAILS
DNA extraction and SNP genotyping DNA extraction and genotyping was performed at BGI, Shenzhen, China, using KingFisher TM Blood DNA Kit and KingFisher TM Flex 24 Magnetic Particle Processors (Thermo Scientific), yielding 400mL DNA at 220 ng/L mean concentration. Buffy coat sample tubes were barcode scanned, and up to 800mL was manually pipetted into the extraction tube at the positions specified by a bespoke sample-tracking IT system. Extracted DNA was transferred using a Freedom EVOâ (Tecan) fluid handling system, up to 200mL to each of two sets of 96-tube racks of 2D-barcoded cryovials (Fluidx, Azenta Life Sciences), and 12mL to a 96-well microtitre plate. DNA concentration and quality was recorded using a NanoDrop Microvolume Spectrophotometer (Thermo Scientific). Tubes were frozen and shipped on dry ice to the CKB sample storage facility in Beijing, for long-term storage at À70C. All sample movements were recorded by the sample-tracking IT system. The first 95,680 DNA samples (from randomly-selected participants), extracted during 2012-2013, were genotyped using the multiplex Golden Gateâ platform (Illumina), for panels of 384 variants which included 3 variants informative for sex within the chromosome XY pseudoautosomal regions. SNP genotyping was performed in 96-well microtitre plates, and used up to 5mL DNA from the microtitre plate produced during DNA extraction. Each genotype plate included positive and negative controls at fixed positions, and 2 pairs of duplicate samples at unique combinations of plate positions. Genotyping was performed for a total of 1,040 plates according to GoldenGate Genotyping Assay Manual Protocols, 88 with beadchip imaging using an iScan System (Illumina). The 384 SNP panel was revised after genotyping of the first 100 plates (9200 unique samples), and again after the second 100 plates.
Data processing, genotype calling, and quality control was conducted at CTSU, University of Oxford, UK. Genotyping calling was performed in 4 batches using GenomeStudio software, with initial QC based on automated clustering. All negative controls had an SNP call rate of 80% or less (mean = 34%). 15 plates were flagged for inspection due to an initial positive control call rate <95%, but no failures of genotyping were identified; the remaining positive controls had mean call rate of 98.6%. A further 12 plates were flagged for inspection due to 1 or both of 12 pairs of duplicates being among 709 samples excluded with call rate <90%, but again no failures of genotyping were identified.
Following this initial QC, and again after final sample QC, SNPs were reclustered, and within each batch SNPs with GenTrain score <0.7 were inspected manually, and manually reclustered or excluded as appropriate. Across the 3 SNP panels, 30 SNP assays failed genotyping within that panel (either due to gross genotyping failure or call rate <95%), and a further 42 SNP assays failed for a subset of the 4 'plexes'. Two SNPs displayed Hardy-Weinberg disequilibrium due to presumed assay interference by nearby SNVs or indels. 15 SNPs displayed potential batch effects, identifying genotype clustering errors that were adjusted manually.
Following SNP QC, an additional 1,518 unique samples (2,217 in total) were excluded on the basis of an SNP call rate <98%. One sample with excess heterozygosity (F-statistic >5 SDs above the mean) was excluded. For 2,063 remaining pairs of duplicate samples, genotyping concordance was between 98.66% and 100% (mean 99.98%), and only 118 samples (0.1%) were identified with mismatches of reported gender and inferred sex based on 3 sex-informative SNPs from chrXY pseudoautosomal regions, confirming Cell Genomics 3, 100361, August 9, 2023 e2 Resource ll OPEN ACCESS good DNA quality and robust linkage to originating study participants. A further 136 samples (0.1%) from blocks with multiple sexmismatches or with other potential sample linkage errors were also excluded.

Genome-wide genotyping
For genotyping using the first version of the CKB array, samples were selected for genotyping as part of nested case-control or casecohort study designs. Incident cardiovascular disease cases were selected according to available disease follow-up at time of sample selection (August 2014) from amongst those with extracted DNA and no self-reported prior cardiovascular disease history, as follows: (a) all cases of intracerebral haemorrhage (ICH -CD-10: I61, I69.1) where this was the first stroke event, including additional samples selected for prioritised DNA extraction and one case originally incorrectly recorded as an ischaemic stroke (IS); (b) all available cases of subarachnoid haemorrhage (SAH -ICD-10: I60, I69.0) where this was the first stroke event; (c) 5,662 cases of IS (ICD-10: I63, I69.3) occurring prior to 1 January 2014 at age %71 years where this was the first stroke event; (d) 1,008 incident cases of myocardial infarction (MI -ICD-10: I21-I23); and (e) all available cases of death with ischaemic heart disease as underlying cause (fatal IHD -ICD-10: I21-I25). Pairs of controls with no cardiovascular disease events or self-report were identified for each ICH case, matched to sex, recruitment region, and year of birth. For respiratory disease, 5,358 participants were selected with at least one event of hospitalisation with chronic obstructive pulmonary disease (COPD -ICD-10: J41-J44); as controls, 4,766 participants were randomly selected from amongst those who attended the second resurvey. For genotyping using the second version of the CKB array, selection was on the basis of complete boxes of DNA samples, prioritising those boxes that contained samples from participants originally recruited in clinics at which the second resurvey was conducted. These samples were supplemented with additional cases of ICH, SAH, MI, and fatal IHD that occurred subsequent to initial sample selection.
Genotyping was performed at BGI, Shenzhen, China. DNA samples selected for genome-wide genotyping were retrieved from storage at À70C, either as complete boxes of 96 samples or (for nested case-control samples) individually selected and transferred to new boxes, and were shipped on dry ice to BGI, Shenzhen. DNA concentration was checked using a NanoDrop Microvolume Spectrophotometer (Thermo Scientific), and a Microlab STAR liquid handling system (Hamilton) was used for transfer of sub-aliquots to new racks of 96 Fluidx cryovials and dilution with TE buffer to 80 ng/mL; the equivalent measured concentration of a subset of samples measured using Qubit DNA quantification (ThermoFisher) was 50 ng/mL. Diluted DNA was plated onto 96-well microtitre plates, with samples from a minimum of 3 boxes distributed across a single plate (a 1:1 mix of cases and controls for nested case-control samples). Samples with low DNA concentration were plated separately for genotyping with a modified first stage of the protocol, using a larger volume of DNA in place of TE buffer. Samples at position H12 were replaced with a duplicate sample from position D1 on the previous plate, thereby providing checks of genotyping quality and sample tracking. Genotyping was performed with manual target preparation according to Affymetrix protocols with automated plate processing and imaging using CKB_1 and CKB_2 Axiomâ arrays and GeneTitanâ Instruments. 89 Raw genotyping data were exported from China to the Oxford CKB International Coordinating Center under Data Export Approvals 2014-13 and 2015-39 from the Office of Chinese Human Genetic Resource Administration.
Genotyping quality control and calling (summarized in Table S4) was performed at CTSU, University of Oxford, UK, according to Affymetrix Best Practice workflow 18 using the Axiom Analysis Suite (Affymetrix) with default settings. Initial QC was performed on samples genotyped on batches of 50 plates. Genotyping was carried out for a preselected set of $20K ''high performance'' SNPs, using the 'Sample QC' option, to give initial quality metrics. These were used to identify samples and plates to be excluded from subsequent steps, on the basis of sample DQC<0.82; sample QC call rate <97%; or plates with mean call rate for remaining samples <98.5%. Plates with sample pass rate <95% were flagged for inspection, and were excluded if there was evidence of a general failure of genotyping (e.g. large sections of the plate have failed), or if sample call rate was systematically low relative to sample DQC (rather than having a large number of failing samples due to e.g. a group of samples with poor-quality DNA). Some plate failures identified array manufacturing defects; genotyping of these plates was repeated using a new array.

Variant QC
Samples passing initial QC were processed and co-clustered, again in batches of 50 plates, to derive genotypes and further quality metrics. Within each batch, probesets were ''failed'' if they were classified as ''OTV'' (off target variation), ''CallRateBelowThreshold'' (using the default threshold 95%), or ''Other'', and genotypes for non-failed probesets; all further QC was performed using PLINK v1.9 and/or v2.0. 80 Within each batch, probesets were assessed for the presence of plate effects: logistic regressions were conducted to test each individual plate within a batch for significant deviations in genotype calling: each plate in turn was treated as ''case'' status with all other plates in the batch as controls, with recruitment center as covariate; probesets were failed according to criteria determined empirically through manual review of cluster plots to identify clustering failures -any plate effect with p < 10 À10 , >3 instances of plate effect p < 10 À4 , or any plate effect p < 10 À8 and clustering metrics FLD<8, HetSO<0.68, and HomRO<3.7; in addition, for probesets with any plate effect P < 2 3 10 À5 , cluster plots were manually reviewed, and appreciable clustering failures (e.g. poor cluster separation) were ''failed''.
Probesets passing this initial QC were combined into a single dataset, and a preliminary round of sample QC was performed (see below). A set of autosomal probesets pruned for linkage disequilibrium (LD; PLINK option --indep-pairwise 50 5 0.1) was then used to identify an unrelated subset of samples (PLINK --rel-cutoff 0.025). These were used to test for significant deviations in genotype calling between batches: logistic regressions were performed treating each batch in turn as ''case'' status with all other batches as controls, again with recruitment center as covariate; probesets were failed entirely, across all batches, again according