Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank

Summary Background Understanding the genetic basis of airflow obstruction and smoking behaviour is key to determining the pathophysiology of chronic obstructive pulmonary disease (COPD). We used UK Biobank data to study the genetic causes of smoking behaviour and lung health. Methods We sampled individuals of European ancestry from UK Biobank, from the middle and extremes of the forced expiratory volume in 1 s (FEV1) distribution among heavy smokers (mean 35 pack-years) and never smokers. We developed a custom array for UK Biobank to provide optimum genome-wide coverage of common and low-frequency variants, dense coverage of genomic regions already implicated in lung health and disease, and to assay rare coding variants relevant to the UK population. We investigated whether there were shared genetic causes between different phenotypes defined by extremes of FEV1. We also looked for novel variants associated with extremes of FEV1 and smoking behaviour and assessed regions of the genome that had already shown evidence for a role in lung health and disease. We set genome-wide significance at p<5 × 10−8. Findings UK Biobank participants were recruited from March 15, 2006, to July 7, 2010. Sample selection for the UK BiLEVE study started on Nov 22, 2012, and was completed on Dec 20, 2012. We selected 50 008 unique samples: 10 002 individuals with low FEV1, 10 000 with average FEV1, and 5002 with high FEV1 from each of the heavy smoker and never smoker groups. We noted a substantial sharing of genetic causes of low FEV1 between heavy smokers and never smokers (p=2·29 × 10−16) and between individuals with and without doctor-diagnosed asthma (p=6·06 × 10−11). We discovered six novel genome-wide significant signals of association with extremes of FEV1, including signals at four novel loci (KANSL1, TSEN54, TET2, and RBM19/TBX5) and independent signals at two previously reported loci (NPNT and HLA-DQB1/HLA-DQA2). 
These variants also showed association with COPD, including in individuals with no history of smoking. The number of copies of a 150 kb region containing the 5′ end of KANSL1, a gene that is important for epigenetic gene regulation, was associated with extremes of FEV1. We also discovered five new genome-wide significant signals for smoking behaviour, including a variant in NCAM1 (chromosome 11) and a variant on chromosome 2 (between TEX41 and PABPC1P2) that has a trans effect on expression of NCAM1 in brain tissue. Interpretation By sampling from the extremes of the lung function distribution in UK Biobank, we identified novel genetic causes of lung function and smoking behaviour. These results provide new insight into the specific mechanisms underlying airflow obstruction, COPD, and tobacco addiction, and show substantial shared genetic architecture underlying airflow obstruction across individuals, irrespective of smoking behaviour and other airway disease. Funding Medical Research Council.

For individuals who gave up smoking for more than 6 months, pack years was defined as: A percentage of life span smoking variable was defined as: For individuals who gave up smoking for more than 6 months, percentage of life span smoking was defined as: For current smokers, pack years variables were calculated using age at recruitment in place of age stopped smoking. Heavy smokers were defined as individuals with a percentage of life span smoking ≥ 42% (equivalent to a minimum pack years of 10 in the youngest participants). See Appendix 1 for all UDIs for smoking behaviour.
Within the 275,939 European ancestry individuals with 2 or more FEV1 and FVC measures which met ERS/ATS guidelines and who had non-missing information for spirometry method, age, sex and standing height, 105,281 were never smokers and 46,763 were heavy smokers. After exclusion of 14 individuals who had outlying FEV1 after adjusting for sex, age, age², height and height², 105,272 never smokers and 46,758 heavy smokers remained. Healthy never smokers were selected from the never smokers by excluding individuals who indicated that they had experienced wheeze, or reported any of the following respiratory conditions: asthma; chronic obstructive pulmonary disease (COPD); emphysema; chronic bronchitis; bronchiectasis; interstitial lung disease; asbestosis; pulmonary fibrosis; fibrosing/unspecified alveolitis; respiratory failure; pleurisy; spontaneous/recurrent pneumothorax; other respiratory problems (or did not know or declined to answer, according to UDIs 2316, 6152 or 20002). A subset of 81,719 healthy never smokers was used in the calculation of predictive values (below).
Asthma-associated variants included explicitly in the design were:
- 63 variants listed for asthma phenotypes in the GWAS catalog as downloaded on 23rd January 2013, plus one tag variant per variant.
- 111 variants representing potentially interesting regions which showed evidence of nominal significance for association with severe asthma [11]. In brief, all variants with P < 10⁻⁴ for association with severe asthma and which were defined as independent (r² < 0.5 with other variants with P < 10⁻⁴) were extracted from the full imputed database of 2.5 million variants.

Smoking, idiopathic pulmonary fibrosis (IPF) or lung cancer associated variants included explicitly in the design were:
- 21 variants with genome-wide significant (P < 5 × 10⁻⁸) evidence of association with cigarettes smoked per day, smoking cessation and smoking initiation [12-14], plus 2 tag variants per variant.
- Variants with genome-wide significant evidence of association with IPF [15,16] in the MUC5B promoter and TERT, plus one tag variant per variant.
- Four variants associated with lung cancer [17].

Regions showing robust or putative association with lung function and/or disease were highlighted for inclusion of additional content to boost imputation coverage and quality. These regions were:
- 26 regions associated with lung function [2-5], defined based on P values and linkage disequilibrium (LD) (variants with −log10(P) > 2.5 and no further than 50 kb from the next variant were selected, including any gene intersecting with the region or, if the region did not include any, the nearest gene, ±10 kb).
- Six additional regions associated with smoking behaviour, defined based on the region of association illustrated in published region plots [12-14].
- Three regions ±10 kb of three genes associated with IPF (TERT, MUC5B and TERC) [15,16,18].

Summary of final array content
Of the 808,370 variants targeted in the design, 802,283 were able to be assayed directly by at least 1 probe on the Axiom® UK BiLEVE genotyping array. A tag variant was assayed for 5,340 variants that could not be directly measured, with 134 tag variants being used for more than 1 target variant and 951 tag variants also being a target variant, giving a total of 806,626 unique variants. An additional 785 variants, included by Affymetrix for quality control purposes, gave a total of 807,411 variants assayed by the array. Of these, 781,732 variants were targeted by a single probe and 25,679 were targeted by 2 probes to increase the chance of successful genotyping, giving a total of 833,090 probes on the array.
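The variant and probe totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative and are not part of the array design.

```python
# Counts quoted in the summary of final array content.
unique_target_variants = 806_626  # directly assayed or tagged, de-duplicated
qc_variants = 785                 # added by Affymetrix for quality control
single_probe_variants = 781_732   # variants targeted by one probe
double_probe_variants = 25_679    # variants targeted by two probes

total_variants = unique_target_variants + qc_variants
total_probes = single_probe_variants + 2 * double_probe_variants

# Every assayed variant is covered by either one or two probes.
assert total_variants == single_probe_variants + double_probe_variants

print(total_variants, total_probes)
```

Both totals agree with the figures reported in the text (807,411 variants and 833,090 probes).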

Genotyping
DNA extraction was undertaken at the UK Biobank laboratories (http://www.ukbiobank.ac.uk/wpcontent/uploads/2014/04/DNA-Extraction-at-UK-Biobank-October-2014.pdf). A total of 850 µl of buffy coat from 9 ml of whole blood was extracted on a custom TECAN Freedom EVO® 200 platform using the Promega Maxwell® 16 Blood DNA Purification Kit (AS1010) (modified to optimise DNA yield from a large volume of buffy coat, including additional lysis and wash buffer and an additional pass through the extraction process). DNA concentration and quality were assessed via the 260/280 absorbance ratio using a Trinean DropSense® 96. DNA concentration was required to be > 10 ng/µl for > 80% of samples on a plate, and purity as measured by 260/280 was required to be between 1.8 and 2.2 for > 80% of samples on the plate. Samples were shipped on dry ice to Affymetrix, Santa Clara, CA, USA for genotyping. Genotype calling was undertaken using Affymetrix Power Tools v1.15. Genotyping was undertaken in batches of 50 plates. Variants which had a minor allele count (MAC) < 6 in any batch were re-called on each individual plate in that batch, as this was shown to improve calling for very rare variants (unpublished data comparing genotype calls with re-sequencing data from non UK BiLEVE samples, Affymetrix).
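The plate-level acceptance rule described above can be written as a short check. This is a minimal sketch rather than the laboratory's actual procedure: the function name and arguments are hypothetical, and it assumes the concentration and purity criteria are each evaluated independently across the samples on a plate.

```python
def plate_passes_qc(concentrations_ng_per_ul, ratios_260_280,
                    min_conc=10.0, purity_range=(1.8, 2.2), min_fraction=0.8):
    """Return True if more than min_fraction of samples meet each criterion.

    Assumption: the two criteria are applied separately, so the samples
    passing on concentration need not be the same ones passing on purity.
    """
    n = len(concentrations_ng_per_ul)
    if n == 0 or n != len(ratios_260_280):
        raise ValueError("need one concentration and one ratio per sample")
    low, high = purity_range
    conc_ok = sum(c > min_conc for c in concentrations_ng_per_ul)
    purity_ok = sum(low <= r <= high for r in ratios_260_280)
    return conc_ok / n > min_fraction and purity_ok / n > min_fraction


# Hypothetical plate of 10 samples: 9 pass each criterion, so the plate passes.
good_plate = plate_passes_qc([12.0] * 9 + [5.0], [1.9] * 9 + [2.5])
# Exactly 80% passing on concentration does not satisfy "> 80%".
borderline_plate = plate_passes_qc([12.0] * 8 + [5.0] * 2, [1.9] * 10)
print(good_plate, borderline_plate)
```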

Description of post-genotyping quality control (QC) steps undertaken for samples and variants
Variants were excluded prior to sample QC if they failed the basic Affymetrix genotyping quality metrics indicating poor genotype clustering (cluster QC). This included exclusion of variants for which more than three genotype clusters were observed (indicating an off-target measurement), for which the call rate was less than 95% or for which there was failure of one of three cluster quality metrics (Fisher's linear discriminant (FLD), Heterozygous cluster strength offset (HetSO), Homozygote Ratio Offset (HomRO)) defined in the Affymetrix Axiom® Genotyping Solution Data Analysis Guide (http://media.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf).
Where a variant was assayed by 2 probes, the genotypes from the probe with the highest call rate were used. A total of 50,561 UK BiLEVE samples were genotyped. Samples were excluded sequentially from the analysis according to each of the following criteria (n indicates the number of samples excluded at each step) (Supplementary Methods Table 2):
1. Poor DNA quality - Indicated by Affymetrix's dish QC (dQC) metric. Samples were excluded if dQC < 0.82. (n=10)
2. Call rate - Samples with call rate < 97% were excluded by Affymetrix in an initial round of genotype clustering. The batches were then re-clustered without these samples. (n=31)
3. Sex mismatch - Samples were excluded if the sex inferred from X chromosome genotypes did not match submitted sex (see below for method). (n=125)
4. Call rate - Samples with a call rate < 95% after the second round of clustering were excluded. (n=1)
5. Outlying heterozygosity (high or low, indicative of a contaminated sample) - Samples with heterozygosity more than three standard deviations (SD) from the mean heterozygosity of all samples were excluded (see below for method). (n=333)
6. Unintended duplicates - Samples which share > 98% of alleles identical by descent (IBD) were consistent with either being duplicated samples (with different IDs) or identical twins. Where the duplication could be resolved (e.g. where we could identify which sample of the pair had the correct ID, or they were likely to be twins based on other information) then only 1 sample of the pair was excluded, otherwise both samples were excluded. (n=17)
7. Intended duplicates - The sample with the lowest genotyping call rate from each pair of intended duplicates was removed. (n=481)
8. Principal Components Analysis (PCA) outliers - Ancestry informative principal components (PCs) were derived from variant genotypes (see detailed methods below). Samples with a score for any of the first 10 principal components that was more than 10 SD from the mean were excluded. (n=104)
9. Withdrawn consent - One individual withdrew consent from further study after steps 1 to 8 above had been completed. This sample was excluded from all subsequent steps. (n=1)
10. Related individuals (see detailed methods below) - For any pair of samples which shared more than 20% of alleles IBD, the sample with the lowest call rate was excluded. Where more than 2 samples were mutually related, the relationships between the samples were examined to identify which sample(s) to exclude. (n=515)

Details of each step are given below. A total of 48,943 samples remained for subsequent analysis.
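The sequential exclusions can be tallied to confirm that they reproduce the final sample count. The sketch below takes the per-step counts from the list above, using the dQC count of 10 given in Supplementary Methods Table 2; the step labels are shortened for illustration.

```python
# (step label, number of samples excluded), in the order the filters were applied.
exclusions = [
    ("DNA quality (dQC < 0.82)", 10),
    ("Initial clustering call rate < 97%", 31),
    ("Sex mismatch", 125),
    ("Call rate < 95% after re-clustering", 1),
    ("Outlying heterozygosity", 333),
    ("Unintended duplicates", 17),
    ("Intended duplicates", 481),
    ("PCA outliers", 104),
    ("Withdrawn consent", 1),
    ("Related individuals", 515),
]

remaining = 50_561  # samples genotyped
for step, n_removed in exclusions:
    remaining -= n_removed
    print(f"{step:<38}{n_removed:>6}{remaining:>9,}")

total_removed = sum(n for _, n in exclusions)
```

With these counts, 1,618 samples are removed in total, leaving the 48,943 samples reported for subsequent analysis.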

                             Removed   Remaining
No filters                         0      50,561
DNA quality (dQC)                 10      50,551
Initial clustering CR<97%         31      50,520
Sex mismatch                     125      50,395

Supplementary Methods Table 2: Sample exclusions

Sample QC: Sex mismatches
Two methods were used to identify discrepancies between the sex provided by UK Biobank and the sex inferred from the genotype data. Firstly, a scatterplot of the ratio of the mean X chromosome and Y chromosome probe intensities (XY ratio) against X chromosome heterozygosity rate (X het rate) was plotted. Secondly, using PLINK v1.07 [19], the chromosome X inbreeding (homozygosity) estimate, F, was used to classify samples as male (F > 0.8), female (F < 0.2) or unknown/ambiguous (0.2 < F < 0.8). A total of 82 samples were reported as showing a sex mismatch using both methods and an additional 28 samples were reported using the PLINK approach (Supplementary Methods Figure 2). Seventeen of the samples reported by PLINK only, and one sample reported by both methods, were subsequently found to be heterozygosity outliers and were excluded. Thirty-one of the samples detected by both methods had an XY ratio indicative of being male and an X het rate indicative of being female, suggesting that these samples had two copies of the X chromosome and a Y chromosome, consistent with Klinefelter syndrome; these were excluded from further analysis. Plots of X het rate and XY ratio of the 11 remaining samples reported as showing a sex mismatch by PLINK were re-examined. Three of these samples had a low XY ratio and low X het rate and were likely to be XO (Turner syndrome) or XX/XO mosaics. All 11 samples were subsequently excluded, leading to a total exclusion of 110 samples for sex mismatches.
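The F-based classification described above can be sketched as follows. This is an illustration of the stated thresholds only, not PLINK's implementation: the function names are hypothetical, values exactly at 0.2 or 0.8 are treated as ambiguous, and an ambiguous call is assumed to count as a discrepancy.

```python
def classify_sex_from_f(f_estimate):
    """Classify a sample from its chromosome X inbreeding estimate F."""
    if f_estimate > 0.8:
        return "male"
    if f_estimate < 0.2:
        return "female"
    return "ambiguous"  # assumption: boundary values fall in this class


def is_sex_mismatch(reported_sex, f_estimate):
    """Flag samples whose inferred sex differs from the reported sex.

    Assumption: ambiguous calls are flagged for manual review as well.
    """
    return classify_sex_from_f(f_estimate) != reported_sex


# Hypothetical samples: (reported sex, F estimate).
samples = {"s1": ("male", 0.99), "s2": ("female", 0.05),
           "s3": ("female", 0.95), "s4": ("male", 0.45)}
flagged = sorted(sid for sid, (sex, f) in samples.items()
                 if is_sex_mismatch(sex, f))
print(flagged)
```

In practice, flagged samples would then be inspected on the XY ratio against X het rate plot, as was done for the samples reported by PLINK only.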

Supplementary Methods Figure 2: Samples reported as having a different sex based on genotype data to that provided by UK Biobank
Sample QC: Heterozygosity
Heterozygosity rate per sample was calculated based on 602,584 autosomal variants with MAF > 1%. Supplementary Methods Figure 3 shows a scatter plot of heterozygosity rate against call rate. A total of 333 samples with a heterozygosity rate more than 3 SD from the mean were excluded.

Supplementary Methods Figure 3: Heterozygosity rate vs sample call rate.
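A heterozygosity filter of this kind reduces to flagging samples whose rate lies more than 3 SD from the sample mean. The sketch below is a minimal illustration with made-up rates; it assumes the sample standard deviation (n − 1 denominator), which the text does not specify.

```python
from statistics import mean, stdev


def heterozygosity_outliers(het_rates, n_sd=3.0):
    """Return IDs of samples whose heterozygosity rate is > n_sd SD from the mean.

    het_rates maps sample ID to heterozygosity rate. The sample SD is an
    assumption; with tens of thousands of samples the choice hardly matters.
    """
    values = list(het_rates.values())
    centre, spread = mean(values), stdev(values)
    return {sid for sid, rate in het_rates.items()
            if abs(rate - centre) > n_sd * spread}


# Hypothetical data: 19 typical samples and one with excess heterozygosity.
rates = {f"s{i}": 0.30 for i in range(19)}
rates["s19"] = 0.40
outliers = heterozygosity_outliers(rates)
print(outliers)
```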

Sample QC: Relatedness estimation
The proportion of alleles shared IBD, inferred using PLINK v1.07 [19], was used to identify unintended duplicates, confirm intended duplicates and infer relatedness. A subset of autosomal variants was selected based on the following criteria: MAF > 1%, Hardy Weinberg Equilibrium (HWE) (P > 10⁻⁶), outside regions of strong LD and inversions. These variants were then pruned based on LD (r² > 0.2 within 50-variant windows) to identify a subset of 244,507 independent variants. Supplementary Methods Figure 4 shows a scatterplot of the proportion of variants where a pair share 1 allele IBD (Z1) plotted against the proportion sharing 0 alleles IBD (Z0). Hence parents and offspring, who share 1 allele IBD at all variants (Z1=1, Z0=0), are in the top-left; duplicates and identical twins share 2 alleles IBD across all variants and hence have 0 variants sharing only 1 or 0 alleles IBD (Z1=0, Z0=0); and siblings on average have 50% of variants where they share 1 allele IBD and 25% of variants where they share 0 alleles IBD (Z1=0.5, Z0=0.25). Cousins, half-siblings etc. lie on the line of slope -1, intercept 1, with relatedness decreasing towards Z1=0, Z0=1 (unrelated pairs). A threshold of PI_HAT < 0.2 was used to define unrelated pairs, where PI_HAT = Z2 + 0.5 × Z1.
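The relationship between the IBD sharing proportions and PI_HAT can be illustrated with the expected values for common relationships. The sketch below applies the PI_HAT formula given above together with the 98% and 20% thresholds used in the sample QC; the function names and category labels are illustrative.

```python
from math import isclose


def pi_hat(z1, z2):
    """Proportion of alleles shared IBD: PI_HAT = Z2 + 0.5 * Z1."""
    return z2 + 0.5 * z1


def classify_pair(z0, z1, z2, duplicate_threshold=0.98, related_threshold=0.2):
    """Classify a pair of samples from its IBD sharing proportions."""
    if not isclose(z0 + z1 + z2, 1.0, abs_tol=1e-6):
        raise ValueError("Z0, Z1 and Z2 must sum to 1")
    sharing = pi_hat(z1, z2)
    if sharing > duplicate_threshold:
        return "duplicate or identical twin"
    if sharing > related_threshold:
        return "related"
    return "unrelated"


# Expected (Z0, Z1, Z2) for some relationships.
expected = {
    "duplicate": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "first cousins": (0.75, 0.25, 0.0),
}
for label, (z0, z1, z2) in expected.items():
    print(label, pi_hat(z1, z2), classify_pair(z0, z1, z2))
```

Note that first cousins have an expected PI_HAT of 0.125, so under the 0.2 threshold they would be treated as unrelated.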

Supplementary Methods Figure 4: Proportion of genotypes where a pair share 1 allele IBD (Z1) plotted against the proportion sharing 0 alleles IBD (Z0) for samples submitted as unique (left panel) and samples submitted as intended duplicates (right panel). Each point represents a pair of samples.
Seven unintended duplicate pairs were identified and reported back to UK Biobank. Further investigation of these pairs led to exclusion of 6 unique participants corresponding to 17 sets of genotype data. A total of 481 intended duplicate pairs were identified, and the sample with the lowest call rate of each pair was removed.

Sample QC: Principal components analysis of ancestry
The intersection of variants used for IBD analysis (described above) and the HapMap3 reference panel were used for PCA of ancestry (43,232 variants). Principal component variant weightings were derived using 987 unrelated HapMap samples and then used to calculate the scores on the principal components of the UK BiLEVE samples using EIGENSOFT 4.2. Supplementary Methods Figure 5 shows that the UK BiLEVE samples' principal component scores lie in the region associated with European ancestry (HapMap CEU and TSI) as expected. Samples which were more than 10 SD outside of the mean score for any of the first 10 principal components were excluded. A total of 104 samples (58 male, 46 female) were excluded, with the following breakdown of outliers excluded by principal component: PC1=19, PC2=56, PC3=22, PC5=7.

To test whether there was an association between PCA outlier status and lung function subgroup, we performed a chi-squared test and found no significant evidence of association (P=0.07) (Supplementary Methods Table 3).

Supplementary Methods Table 3: Contingency table for association of PCA outlier status with phenotype group. The difference from the expected count under independence is shown in brackets.

Sample QC: Related individuals
Prior to PCA analysis, a total of 526 pairs of samples showed evidence of relatedness by IBD analysis (PI_HAT > 0.2, see above). One of the samples in one of these pairs was subsequently excluded by the PCA analysis leaving 525 pairs of samples showing evidence of relatedness. Although association testing methods that take relatedness into account are well-developed, given the small proportion of related individuals amongst the UK BiLEVE samples (~1%), we excluded related individuals from downstream association testing as follows (NB: related individuals were included in the imputation process but excluded prior to association testing).
Of the 525 pairs of samples showing evidence of relatedness, 1,000 samples were related to only one other sample and for these 500 pairs, the sample with the lowest call rate was excluded. Within the remaining 25 pairs, 30 samples were related to more than one other sample (indicative of more than 2 members of the same family). For these 25 pairs, we grouped the samples into families and assessed family relationships based on ages and sex. In all families, all samples were recruited from the same recruitment centre. We excluded individuals from each family so as to retain as many unrelated individuals as possible. For example, for a mother-father-offspring trio, the offspring was excluded so as to retain the unrelated mother and father. Where only one sample could be retained from a family, the sample with the highest call rate was selected. A total of 515 samples were excluded from association testing.

Supplementary Methods Figure 7 summarises the number of variants which failed cluster QC, exhibited a plate effect or were flagged as exhibiting a batch effect in N batches.

Supplementary Methods Figure 8: Flowchart of QC steps for imputation input variants, variants which were only genotyped (not in imputation panel) and association testing.

Description of association testing for autosomal and X, Y and mitochondrial variants
Genome-wide association testing was carried out for the following nested comparisons:
- Heavy smokers with low FEV1 vs heavy smokers with high FEV1
- Never smokers with low FEV1 vs never smokers with high FEV1
- Heavy smokers with low FEV1 vs heavy smokers with average FEV1
- Never smokers with low FEV1 vs never smokers with average FEV1
- Heavy smokers with high FEV1 vs heavy smokers with average FEV1
- Never smokers with high FEV1 vs never smokers with average FEV1
- Heavy smokers vs never smokers

Within each comparison subset of the data, variants with a MAC < 3 were discarded. 515 samples were excluded due to evidence of relatedness, as described above. Association testing of each case-control group was undertaken using SNPTEST v2.5b4 [23] (score test) under an additive genetic model of genotype dose (continuous from 0 to 2, reflecting imputation uncertainty), with the first 10 ancestry principal components as covariates and pack years of smoking as an additional covariate in the heavy smoking stratum. The same association model was used for the X chromosome but with male reference allele coded as 0 and alternate allele as 2; likewise for the Y chromosome (female samples removed) and mitochondrial (MT) SNPs (0 to 2 for both male and female) (Supplementary Table 20). For variants with MAC < 400 the association testing was repeated using the Firth test implemented in EPACTS v3.2.4, which is better calibrated for testing low MAC variants than the score test [24]. The genomic control inflation factor lambda was calculated across autosomes for each comparison and used to adjust for population stratification. For all chromosomes, a P value threshold of 5 × 10⁻⁸ was used to signify genome-wide significant association. P < 5 × 10⁻⁷ was used to signify suggestive association for autosomal chromosomes and chromosome X. Bonferroni-corrected suggestive significance thresholds for signals on the Y and MT chromosomes and in the pseudo-autosomal region were defined as P < 2 × 10⁻⁴ (250 variants), P < 3.6 × 10⁻⁴ (3.3 × 10⁻⁴, 140 variants) and P < 3.7 × 10⁻⁵ (1342 variants), respectively. Full genome-wide association results are available via UK Biobank (access@ukbiobank.ac.uk).

Selection of signals
"Sentinel" variants representing independent signals of association were identified by iteratively selecting the variant with the lowest P value, assigning that variant as a sentinel and excluding all variants +/-500kb from the sentinel variant before repeating the process. Sentinel variants were annotated using ANNOVAR 25 . For sentinel variants with MAC < 400, we repeated local imputation and association testing following removal of genotyped SNPs with poor clustering (judged by eye); the variant was retained if P<5x10 -8 following re-analysis.

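The iterative sentinel selection described above can be sketched as a greedy pass over variants sorted by P value. The input layout (ID, chromosome, position, P value) and the default significance threshold are assumptions made for this illustration; the text states only the ±500 kb exclusion window.

```python
def select_sentinels(variants, p_threshold=5e-8, window_bp=500_000):
    """Pick independent sentinel variants.

    variants: iterable of (variant_id, chromosome, position, p_value).
    The variant with the lowest P value becomes a sentinel, all variants
    within window_bp of it on the same chromosome are excluded, and the
    process repeats on what is left.
    """
    candidates = sorted((v for v in variants if v[3] < p_threshold),
                        key=lambda v: v[3])
    sentinels = []
    for variant in candidates:
        _, chrom, pos, _ = variant
        # Keep the variant only if it lies outside every sentinel's window.
        if all(chrom != s[1] or abs(pos - s[2]) > window_bp for s in sentinels):
            sentinels.append(variant)
    return sentinels


# Hypothetical association results.
results = [
    ("A", "1", 1_000_000, 1e-12),
    ("B", "1", 1_300_000, 1e-10),  # within 500 kb of A, so excluded
    ("C", "1", 1_600_000, 1e-9),   # 600 kb from A, so retained
    ("D", "2", 1_000_000, 1e-8),
    ("E", "2", 5_000_000, 1e-3),   # not significant
]
sentinel_ids = [v[0] for v in select_sentinels(results)]
print(sentinel_ids)
```

Processing candidates in ascending order of P value gives the same result as repeatedly removing the window around the current best variant, because a variant is dropped only when it falls inside the window of an already selected sentinel.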
Proportion of variance explained
The proportion of variance in FEV1 explained by the previously and newly reported variants was calculated as:

\[ \frac{1}{V} \sum_{i=1}^{n} 2 f_i (1 - f_i) \beta_i^2 \]

where n is the number of variants, f_i and β_i are the effect-allele frequency and effect estimate of the i'th variant, and V is the phenotypic variance. We used the effect estimates from a meta-analysis of quantitative FEV1 across smokers and non-smokers, where FEV1 is adjusted for age, age², sex and height and then rank inverse-normal transformed. As with the previously reported proportion of FEV1 variance explained [4], we assumed a heritability of 40% to estimate the proportion of additive polygenic variance.
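As a worked illustration, the calculation can be sketched in a few lines. The code assumes the standard expression in which each variant contributes 2f(1 − f)β² to the explained variance, sets V = 1 on the grounds that the phenotype was inverse-normal transformed, and uses made-up frequencies and effect sizes; none of these numbers come from the study.

```python
def proportion_variance_explained(frequencies, betas, phenotypic_variance=1.0):
    """Sum of 2 * f * (1 - f) * beta^2 over variants, divided by V (assumed form)."""
    explained = sum(2 * f * (1 - f) * b ** 2
                    for f, b in zip(frequencies, betas))
    return explained / phenotypic_variance


# Hypothetical effect-allele frequencies and effect estimates (SD units).
freqs = [0.5, 0.2, 0.35]
effects = [0.1, 0.05, 0.04]

pve = proportion_variance_explained(freqs, effects)
# Share of the additive polygenic variance, assuming a heritability of 40%.
share_of_heritability = pve / 0.40
print(pve, share_of_heritability)
```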

Genome-wide analysis of SNP x smoking interaction
The following statistic was used, both comparing the FEV1 comparison for which the variant was significant in the heavy smokers with that in the never smokers (or vice versa), and also the low FEV1 vs high FEV1 comparison in the heavy smokers and in the never smokers:

\[ Z = \frac{\beta_{heavy} - \beta_{never}}{\sqrt{SE_{heavy}^2 + SE_{never}^2}} \]

where under the null (H_0: β_heavy = β_never), Z ~ N(0,1).
A genome-wide scan for smoking interaction was also performed using the above test with the effect estimates and standard errors from the low FEV1 vs high FEV1 comparison in the heavy and never smokers. Variants with P < 5 × 10⁻⁷ were followed up with 2 further tests: i) using the same Z statistic as above but with effects and standard errors from a Firth test to control for type I error in low MAC variants; ii) fitting a logistic model, updated from the logistic model used in the main analysis with a variant × smoking interaction term (implemented in R), and using a likelihood ratio test for significance, thereby using the individual level data to estimate the interaction effect.
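A test of this kind, comparing effect estimates from two independent strata, can be sketched as below. The code assumes the usual form of the statistic (difference in effects divided by the square root of the summed squared standard errors) with a two-sided P value from the standard normal distribution; the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interaction_z_test(beta_heavy, se_heavy, beta_never, se_never):
    """Z test for a difference in effect between heavy and never smokers.

    Assumes the two strata are independent, so the variance of the
    difference is the sum of the squared standard errors.
    """
    z = (beta_heavy - beta_never) / sqrt(se_heavy ** 2 + se_never ** 2)
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Hypothetical log odds ratios and standard errors for one variant.
z_stat, p_interaction = interaction_z_test(0.30, 0.04, 0.00, 0.03)
print(z_stat, p_interaction)
```

Variants whose P value from this test fell below the suggestive threshold would then go forward to the Firth-based and likelihood ratio follow-up tests described above.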
Association with GOLD Stage 2+ COPD for novel signals of association with extremes of FEV1
We undertook a case-control analysis for all SNPs in novel regions which showed genome-wide significant association in at least one of the nested lung function comparisons. We selected 9,564 COPD cases, defined as those samples with GOLD Stage 2+ COPD according to spirometry (FEV1/FVC < 0.7 and % predicted FEV1 < 80%), and 9,453 controls, selected from the high FEV1 strata and with FEV1/FVC > 0.7 (all had % predicted FEV1 in excess of 80%). Post-bronchodilator spirometry was not available for any participants and medication was not withheld prior to spirometry being undertaken. Summaries of these samples are given in Supplementary Methods Table 4. Analyses were carried out using the score test, implemented in SNPTEST v2.5b4 [23], assuming an additive genetic model of genotype dose. For never smokers, sex, age and the first 10 ancestry principal components were included as covariates. For heavy smokers, pack years were included as an additional covariate. The results for never and heavy smokers were then combined using inverse variance weighted meta-analysis.

Supplementary Methods Table 4: Sample sizes and mean and standard deviation % predicted FEV1 and FEV1/FVC of GOLD stage 2+ COPD cases and controls in heavy smokers and never smokers.
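Combining the stratum-specific results by inverse variance weighting can be sketched as follows. A fixed-effect model is assumed, since the text does not specify otherwise, and the effect estimates shown are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def inverse_variance_meta(estimates):
    """Fixed-effect inverse variance weighted meta-analysis.

    estimates: list of (beta, standard_error) pairs, one per stratum.
    Returns the pooled beta, its standard error and a two-sided P value.
    """
    weights = [1 / se ** 2 for _, se in estimates]
    pooled_beta = (sum(w * beta for w, (beta, _) in zip(weights, estimates))
                   / sum(weights))
    pooled_se = sqrt(1 / sum(weights))
    p_value = 2 * NormalDist().cdf(-abs(pooled_beta / pooled_se))
    return pooled_beta, pooled_se, p_value


# Hypothetical log odds ratios in never smokers and heavy smokers.
pooled_beta, pooled_se, pooled_p = inverse_variance_meta([(0.2, 0.1), (0.4, 0.1)])
print(pooled_beta, pooled_se, pooled_p)
```

With equal standard errors the pooled estimate is the simple average of the two effects; in general the more precisely estimated stratum dominates.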

Analysis of polygenic architecture of diseases and health-related traits
Risk scores 26 and GCTA 27, 28 were used a) to investigate whether there was evidence for polygenic architecture 29 of FEV 1 -defined traits, b) to investigate shared genetic aetiology of FEV 1 between never smokers and heavy smokers, c) to identify whether the genetic variants underlying high FEV 1 also predicted low FEV 1 and d) to explore shared aetiology between individuals with asthma and individuals without asthma. The scores allow the combined influence of many variants with weak effects to be observed by comparing a discovery group and a target group. GCTA was used to estimate the proportion of variance explained in the target population by subsets of variants chosen from the discovery population. QC of individuals and genotyped variants was undertaken as described above, with additional exclusion of variants based on HWE (P < 0.001 excluded) and MAF (MAF < 1% excluded). Only autosomal variants were included in these analyses. The discovery and target groups for each analysis are described below. For each analysis, a GWAS was performed using PLINK v1.9 (Wald test) with the same covariates and additive genetic model, as described above, for the discovery group. For each variant a value for the log odds ratio and P value were obtained. Scores for each allele were assigned as equal to the log odds ratio in the discovery group for variants which met a pre-defined P value threshold (scores were set to zero otherwise). P value thresholds of 1.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01 and 0.001 were investigated. To aid interpretation of the score analysis, log odds ratios were set in the same direction, i.e. the effect allele was chosen as that with log odds ratio > 0. 
Risk scores were then calculated for each individual in the target group by summing the score for each allele multiplied by the number of effect alleles across all variants, i.e.:

\[ \text{Risk score}_i = \sum_{j=1}^{n} (\text{Score for allele})_j \times (\text{Number of effect alleles})_{ij} \]

where i is the individual, j is the variant and n is the number of variants investigated. These scores were then normalised so that they had a mean of zero and a standard deviation of one. To test if these risk scores were associated with the phenotype in the target group, logistic regression was performed with the individuals' risk score as the only covariate. The proportion of variance explained by the subset of variants generated for each target population from each P value threshold was calculated using GCTA [27,28]. GCTA estimates the genetic relationship between individuals and then, using REML and adjusting for covariates (in this instance the first 10 principal components and pack years), estimates the proportion of variance explained. Using all variants, for every pair of individuals found to have cryptic relatedness (cut-off value of 0.025), one individual was removed from analyses for each subset of variants. Case-control data are transformed onto a liability scale through an assumed prevalence level [30]. For investigating shared polygenic effects in FEV1-defined traits, between high FEV1 and low FEV1 and between asthma and no asthma, prevalence was set to the proportion of low FEV1 (21,000) in the whole sampling frame (275,915), i.e. the prevalence was set at 7.611%. We based estimates of prevalence on the known sampling frame from which the UK BiLEVE samples were selected with a known sampling strategy. Thus, when investigating the shared genetic architecture of low FEV1 across the strata defined by smoking status, the prevalence was assumed to be the number of never smokers with low FEV1 (10,500) divided by the number of never smokers in the sampling frame (105,272), i.e. a prevalence of 9.974%.
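The risk score construction can be sketched end to end: select variants by discovery P value, orient each so that the effect allele has a positive log odds ratio, sum the weighted allele counts, and standardise. The data structures and names are illustrative, variants are assumed to be kept when P is at or below the threshold, and the population standard deviation is assumed for the normalisation.

```python
from statistics import mean, pstdev


def allele_weights(discovery, p_threshold):
    """Keep variants meeting the threshold, oriented so that log OR > 0.

    discovery maps variant ID to (log_odds_ratio, p_value) for the coded allele.
    Returns variant ID -> (weight, flip), where flip means the effect allele
    is the non-coded allele.
    """
    return {variant: (abs(log_or), log_or < 0)
            for variant, (log_or, p) in discovery.items()
            if p <= p_threshold}


def risk_scores(genotypes, weights):
    """Weighted effect-allele counts per individual, scaled to mean 0 and SD 1.

    genotypes maps individual ID to {variant ID: coded allele count (0-2)}.
    """
    raw = {}
    for individual, counts in genotypes.items():
        raw[individual] = sum(
            weight * (2 - counts[variant] if flip else counts[variant])
            for variant, (weight, flip) in weights.items())
    centre, spread = mean(raw.values()), pstdev(raw.values())
    if spread == 0:
        raise ValueError("risk scores do not vary between individuals")
    return {individual: (score - centre) / spread
            for individual, score in raw.items()}


# Hypothetical discovery results and target genotypes.
discovery = {"v1": (0.2, 0.001), "v2": (-0.1, 0.04), "v3": (0.5, 0.6)}
genotypes = {"a": {"v1": 2, "v2": 0, "v3": 1},
             "b": {"v1": 0, "v2": 2, "v3": 1},
             "c": {"v1": 1, "v2": 1, "v3": 1}}
weights = allele_weights(discovery, p_threshold=0.05)
scores = risk_scores(genotypes, weights)
print(scores)
```

The standardised scores would then be used as the single covariate in a logistic regression on case status in the target group, repeated for each P value threshold.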
To first investigate whether there was a polygenic component associated with low FEV1, individuals with low FEV1 and average FEV1 were randomly split into discovery and target populations (Supplementary Methods Figure 9). To assess whether the genetic variants underlying high FEV1 also predicted low FEV1 (airflow obstruction), the discovery group comprised individuals with high FEV1 and a random sub-sample of those with average FEV1. The target sample consisted of those with low FEV1 and the remaining individuals with average FEV1 who were not included in the discovery sample (Supplementary Methods Figure 10). To investigate the shared genetic aetiology of low FEV1 between never smokers and heavy smokers, heavy smokers with average FEV1 and low FEV1 were used as the discovery group and never smokers with average and low FEV1 as the target group (Supplementary Methods Figure 11). Finally, to investigate shared genetic variants between those with and without asthma, the discovery population was selected as those reporting doctor diagnosed asthma with low FEV1 or average FEV1 and the target population as those with no doctor diagnosed asthma with low FEV1 or average FEV1 (Supplementary Methods Figure 12). Results are presented in Supplementary Table 2. Results were similar if variants with MAF < 5% were excluded.

Association with self-reported/doctor diagnosed asthma of loci previously reported for genome-wide significant association with asthma
Asthma cases were defined as participants who either (i) answered "asthma" to a touchscreen question "Has a doctor ever told you that you have had any of the following conditions? (You can select more than one answer) (Blood clot, DVT, bronchitis, emphysema, asthma, rhinitis, eczema, allergy)", or (ii) reported asthma in verbal interview, as per any of the self-reported, non-cancer illness fields.
Using this definition, we identified 7,488 asthma cases and 41,455 controls within the 48,943 unrelated samples passing the QC steps described above.

Supplementary Methods
We tested for association with asthma for 17 variants at 12 loci which had previously shown genome-wide significant (P < 5 × 10⁻⁸) association with asthma [11, 31-34]. Association testing was undertaken using SNPTEST with a logistic model of genotype dose, with 10 ancestry principal components and pack years (0 for never smokers) as covariates. Results are in Supplementary Table 1.
Effect on quantitative FEV1 for novel signals of association with extremes of FEV1
For each of the 6 novel signals of association with extremes of FEV1, we tested association of FEV1 as a quantitative trait separately in heavy smokers and never smokers using a linear model with imputed genotype dose and P values from a score test implemented in SNPTEST v2.5. Firstly, residuals from a linear regression of FEV1 with age, age², sex, height and 10 ancestry principal components were obtained, which were then ranked and inverse-normal transformed. These normally distributed z-scores were used as the dependent phenotype in the linear regression. Results are presented in Table 2.
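The rank-based inverse-normal transformation applied to the residuals can be sketched as below. The text does not state which rank offset was used, so the code assumes the simple (rank − 0.5)/n convention and assigns tied values their average rank; the residuals shown are hypothetical.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.5):
    """Map values to normal quantiles of their ranks.

    Each value is replaced by the standard normal quantile of
    (rank - offset) / n, with ties given their average rank.
    The offset of 0.5 is an assumption; other conventions exist.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((rank - offset) / n) for rank in ranks]


# Hypothetical regression residuals for three individuals.
z_scores = rank_inverse_normal([3.1, 2.0, 4.5])
print(z_scores)
```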
Analysis of expression data from lung, blood and brain tissues to identify if our novel signals affect gene expression (eQTL)

Lung
The descriptions of the lung eQTL dataset and subject demographics have been published previously [35][36][37] .
Briefly, non-tumor lung tissues were collected from patients who underwent lung resection surgery at three participating sites: Laval University (Quebec City, Canada), University of Groningen (Groningen, The Netherlands), and University of British Columbia (Vancouver, Canada). Whole-genome gene expression and genotyping data were obtained from these specimens. Gene expression profiling was performed using an Affymetrix custom array (GPL10379) testing 51,627 non-control probe sets and normalized using RMA 38 . Genotyping was performed using the Illumina Human1M-Duo BeadChip array (using blood or lung samples). Genotype imputation was undertaken using the 1000G reference panel. Following standard microarray and genotyping quality controls, 1,111 patients were available including 409 from Laval, 363 from Groningen, and 339 from UBC. Lung eQTLs were identified to associate with mRNA expression in either cis (within 1 Mb of transcript start site) or in trans (all other eQTLs) and meeting the 10% false discovery rate (FDR) genome-wide significant threshold. Variants which showed evidence of association (P < 5 x 10 -7 ) with extremes of FEV 1 and all proxy variants (r 2 > 0.3 with the sentinel variants) were queried. The results for the most significant variant × probeset pair for any genes identified in the look-up and the results for the sentinel variant and/or strongest proxy variants are presented in Supplementary Table 9. There was no significant evidence of association (FDR < 10%) for chr12:114743533, chr11:109843513 and rs34712979 (or proxies) in the data set.

Blood
Evidence for association with gene expression in blood was assessed for all variants which showed evidence of association (P < 5 x 10 -7 ) with extremes of FEV 1 or smoking behaviour, and their proxies (r 2 > 0.3). A publicly available resource based on blood expression data from 5,311 individuals, imputed to HapMap 2 was used (resource previously described 39 ). Cis and trans eQTL signals meeting the 10% FDR genome-wide significant threshold were identified. The results for the most significant variant × probeset pair for any genes identified in the look-up and the results for the sentinel variant and/or strongest proxy variants are presented in Supplementary Table 10a. Data were only available where FDR < 50%. For loci where it could not be established whether an absence of signals with FDR<10% was due to signals of association with FDR > 10% (only results with FDR < 50% were publicly available) or because there were no data for those variants (either due to absence of a proxy in HapMap or variant QC failure), a 1000 Genomes Project imputed eQTL dataset from the Estonian Genome Project was also queried. These loci were those represented by the following sentinel variants: chr12:114743533, rs2047409, rs34712979, rs4466874, rs10193706, rs61784651 and rs10807199 (

Brain
Evidence for association with gene expression in brain was assessed for all variants which showed evidence of association (P < 5 x 10 -7 ) with smoking behaviour, and their proxies (r 2 > 0.3). A publicly available resource of expression data from 10 brain regions in 134 individuals, with variant genotype data imputed to 1000 Genomes Project phase 1 reference panel was used (resource previously described 40 ). Cis and trans eQTL signals meeting the 1% FDR genome-wide significant threshold were identified. The results for the most significant variant × probeset pair for any genes identified in the look-up and the results for the sentinel variant and/or strongest proxy variants are presented in Supplementary Table 14.
Analysis of differential expression of candidate genes in the lungs of individuals with and without COPD
Genes were defined as candidate genes for novel signals of association with extremes of FEV 1 if a) they contained the sentinel variant or were the nearest genes, b) a putatively functional variant within the gene, correlated with the sentinel variant, was identified through conditional analysis as explaining the observed association (see Supplementary Table 21), or c) the sentinel variant or a strong proxy variant (r² > 0.8) was an eQTL for that gene. Publicly available microarray data (GSE37147 41) were mined using GEO2R on the Gene Expression Omnibus website (http://www.ncbi.nlm.nih.gov/geo/info/geo2r.html). Two sample groups were defined. The first group comprised Affymetrix Human ST1.0 array expression data for 87 bronchial brushings from the lungs of individuals with COPD; the second comprised the expression profiles of 151 bronchial brushings from individuals without COPD. There were no significant differences in age, cumulative smoking exposure or smoking status between the individuals with COPD and those without COPD 41. Differential expression between the 2 groups was identified using the default array statistics. P values were adjusted for multiple testing using the Benjamini & Hochberg method 42. Results are presented in Supplementary Table 22.
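As an illustration of the multiple-testing adjustment named above, the following is a minimal sketch of the Benjamini & Hochberg step-up procedure. It is not the GEO2R implementation; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example_p = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
adjusted_p = benjamini_hochberg(example_p)
```

Each adjusted value is the smallest false discovery rate at which the corresponding test would be declared significant.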
Analysis of differential expression of candidate genes in the developing foetal lung
Genes were defined as candidate genes for novel signals of association with extremes of FEV 1 if a) they contained the sentinel variant or were the nearest genes, b) a putatively functional variant within the gene, correlated with the sentinel variant, was identified through conditional analysis as explaining the observed association (see Supplementary Table 21), or c) the sentinel variant or a strong proxy variant (r² > 0.8) was an eQTL for that gene. Publicly available Affymetrix U133 Plus 2 array data (Gene Expression Omnibus: GSE14334) of 38 foetal lung samples from the Pseudoglandular (7-16 weeks) and Canalicular (17-22 weeks) stages of lung development were mined as previously reported 43. Results are presented in Supplementary Table 7.
Messenger RNA sequencing in human bronchial epithelial cells (HBECs) to identify novel transcripts of genes at novel loci associated with the extremes of FEV 1
We looked for evidence of novel transcripts for genes containing the sentinel SNPs associated with extremes of FEV 1 and for genes which were regulated by nearby (<1Mb) SNPs (eQTLs) using RNA sequencing in HBECs. Passage 3 normal human bronchial epithelial cells (NHBECs) (Lonza, UK) were cultured in growth factor-supplemented medium (BEGM, Lonza) as described previously 53. Cells were grown under these conditions and four different experimental conditions as part of a related RNA interference (RNAi) project each in three independent biological replicates (12 samples in total  Figure 6).

Pathway analysis using MAGENTA
We tested whether the results of the meta-analysis of low FEV 1 vs high FEV 1 across heavy smokers and never smokers were enriched for known biological pathways using MAGENTA v2 44 . Briefly, MAGENTA defines a P value for each gene that is the lowest variant P value within 110kb upstream and 40kb downstream of the gene and is corrected for gene size, number of variants per gene and LD within the region. For each gene set, the null hypothesis that there is a random distribution of gene association score ranks within the gene set is tested against the alternative hypothesis that there are more gene association score ranks above a given rank cut-off (75 th percentile cut-off is recommended for polygenic traits) compared to random sampling of 10,000 gene sets of identical size. For each gene set, a FDR is calculated as the fraction of all randomly sampled gene sets (10,000 × number of gene sets tested) that have more genes with P value below the cut off (75 th percentile) than in the gene set being tested, divided by the fraction of real gene sets that have more genes with P value below the cut off (75 th percentile) than in the gene set being tested. Variants with MAC less than 400 were excluded. Genes within 500kb of the genome-wide significant associations with FEV 1 reported in this paper, and within 500kb of the 32 variants previously reported as associated with FEV 1 , FEV 1 /FVC and/or FVC 2-4, 45 were flagged. Results are listed in Supplementary Table 17.
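The enrichment test described above can be sketched as a simple permutation procedure. This is an illustration only: it does not reproduce MAGENTA's gene-score correction for gene size, variant count or LD, and the function name, the add-one P value estimate and the toy gene scores are assumptions.

```python
import random


def enrichment_p(gene_p, gene_set, percentile=75, n_perm=10000, seed=0):
    """Permutation P value for over-representation of top-ranked genes.

    gene_p maps gene name to a (corrected) gene-level P value.
    Genes above the percentile cut-off are the most significant ones.
    """
    genes = list(gene_p)
    ranked = sorted(genes, key=lambda g: gene_p[g])  # most significant first
    n_top = int(len(genes) * (100 - percentile) / 100)
    top = set(ranked[:n_top])
    members = [g for g in gene_set if g in gene_p]
    observed = sum(g in top for g in members)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Random gene set of identical size drawn from all scored genes.
        sample = rng.sample(genes, len(members))
        if sum(g in top for g in sample) >= observed:
            hits += 1
    # Add-one estimate avoids reporting a P value of exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


toy_scores = {f"g{i}": (i + 1) / 101 for i in range(100)}  # hypothetical
toy_set = [f"g{i}" for i in range(10)]
observed, p_value = enrichment_p(toy_scores, toy_set, n_perm=2000)
```

With a 75th percentile cut-off, a gene counts as a hit if it ranks in the most significant quarter of all scored genes.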
Stepwise conditional analysis to identify additional independent signals at the novel loci
We used a stepwise selection procedure implemented in GCTA 46 to identify independent signals within all the novel regions. This method starts by conditioning all the variants in a region on the most significant variant and then uses a stepwise procedure to select other variants for which joint P values meet a pre-specified threshold (10⁻³ in this analysis). The software then returns P values for a joint model containing the stepwise-selected independent variants. The joint model P values returned by GCTA were checked by fitting the joint model in R with the glm function. Results are presented in Supplementary Table 6. Variants with a joint conditional P < 10⁻⁴ were defined as being independent.

Imputation and association testing of structural variation haplotypes in the inversion locus at chromosome 17q21.31 (KANSL1)
An imputation reference panel for the nine structural haplotypes observed at 17q21.31 was provided 47. The structural haplotypes were encoded in the reference panel in the form of bit patterns of 12 surrogate, virtual biallelic variants. In this way standard imputation procedures could be used to impute the genotypes of the surrogate markers which could then be decoded into the corresponding structural haplotypes.
Supplementary Methods Table 5: Imputed haplotype frequencies for 17q21.31 inversion region compared to CEU frequencies provided with imputation reference panel. The haplotypes are defined on the uninverted (H1) or inverted (H2) region with different copy numbers of the regions α, β and γ within the inversion region 47.
We tested association of low FEV 1 versus high FEV 1 with copy number count of the α, β and γ structural polymorphisms using logistic regression across both smoking and non-smoking strata, with 10 ancestry principal components and pack years as covariates (0 pack years for never smokers) (Supplementary Table 11).
Corroborative evidence supporting loci with genome-wide significant evidence of association with extremes of FEV 1
We searched for corroborative evidence of association with FEV 1 for our novel signals of association with extremes of FEV 1 in i) an independent subset of the UK BiLEVE sample and ii) publicly available association results from a previous large GWAS of FEV 1 in the general population 4 (n=48,201, ever and never smokers first analysed separately and then meta-analysed). Where the novel signal was identified in never smokers, the results for the same SNP were extracted for the same comparison (i.e. low FEV 1 vs high FEV 1) in heavy smokers, and vice versa, in UK BiLEVE. From the previous large GWAS, we extracted the meta-analysis P values for association with FEV 1 for all sentinel SNPs and their proxies (linkage disequilibrium r² > 0.3). We report both the most significantly associated proxy SNP and the P value for the sentinel or strongest proxy. All results are in Supplementary Table 18.

Corroborative evidence supporting loci with genome-wide significant evidence of association with smoking behaviour (heavy smokers vs never smokers)
To provide corroborative evidence to support our genome-wide significant findings of association with smoking behaviour at 4 loci, regional imputation, association testing and meta-analysis across 15 studies was undertaken. The primary analysis was a comparison of ever smokers vs never smokers (smoking initiation). Secondary analyses of current smokers vs non-current (smoking cessation) and smoking quantity (smoking quantity levels were 0 (defined as 1-10 cigarettes per day (CPD)), 1 (11-20 CPD), 2 (21-30 CPD) and 3 (31 or more CPD)) were also undertaken. Supplementary Methods Table 6 describes the sample sizes available for each study. SHAPEIT2 49 was used to phase a region 500 kb either side of each site with 200 conditioning states in the phasing run. Imputation was carried out using IMPUTE2 50 with the 1000 Genomes Phase 1 dataset as a reference panel. SNPTEST was used to carry out association testing. Age and sex were included as covariates within each cohort. Some of the cohorts were analysed using other covariates, such as principal components and case-control status (see Supplementary Material of Liu et al. 12). META 12 was used to apply meta-analysis across studies. The meta-analysis was carried out by combining study-specific β estimates using a fixed effects model, which used the inverse of the variance of the study-specific β estimates to give weight to the contribution of each study. The variance of each cohort's β estimate was multiplied by the genomic control λ estimate to correct for observed inflation. The genomic control λ estimates for each study were taken from Liu et al. (2010) 12. At each variant only those studies which had INFO ≥ 0.5 were included in the meta-analysis. Results are given in Supplementary Table 19.
Supplementary Methods Table 6: Sample sizes for smoking traits per cohort.
Cohort        Never smokers  Ever smokers  Non-current smokers  Current smokers  Smoking quantity
GSK_BIPOLAR   546            657           344                  313              600
GSK_EPIC      1589           1927          1574                 353              0
GSK_KORA      831            811           1425                 217              251
GSK_LOLIPOP   635            653           395                  258              648
GSK_UNIPOLAR  856            935           432                  503              897
GSK_COPD      0              0             905                  725              1630
GSK_GEMS      793            910           642                  268              860
GSK_LAUSANNE  2275           3357          1872                 1485             3130
GSK_MEDSTAR   469            853           553                  300              818
GSK_POPGEN    494            608           0                    0                571
GSK_PENNCATH  0              0
In addition, we undertook a look-up of our novel genome-wide significant signals of association with smoking behaviour in the publicly available GWAS data from the Tobacco and Genetics (TAG) 14 consortium. Results from this look-up, and meta-analysis with the results described above, are presented in Supplementary Table 19.
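The fixed-effects meta-analysis described in this section (inverse-variance weighting, genomic control correction of each study's variance, and exclusion of studies with INFO < 0.5) can be sketched as follows. This is not the META software itself; the function name, the input layout and the example numbers are hypothetical.

```python
import math


def fixed_effects_meta(studies, min_info=0.5):
    """Inverse-variance fixed-effects meta-analysis with genomic control.

    studies: iterable of (beta, se, lambda_gc, info) tuples, one per study.
    Returns (pooled_beta, pooled_se, two_sided_p), or None if no study passes.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for beta, se, lambda_gc, info in studies:
        if info < min_info:
            continue  # study excluded at this variant
        # Variance multiplied by the study's genomic control lambda.
        variance = se ** 2 * lambda_gc
        weight = 1.0 / variance
        weighted_sum += weight * beta
        total_weight += weight
    if total_weight == 0.0:
        return None
    pooled_beta = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return pooled_beta, pooled_se, p_value


# Illustrative inputs only: (beta, se, lambda, INFO)
example = [(0.10, 0.10, 1.0, 0.9), (0.30, 0.10, 1.0, 0.8), (5.0, 0.01, 1.0, 0.3)]
result = fixed_effects_meta(example)
```

In the example the third study is dropped because its INFO is below 0.5, so only the first two contribute to the pooled estimate.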

Power Calculations
We undertook power calculations prior to the start of the project based on use of an exome array, as shown in Supplementary Methods Table 7.
Supplementary Methods Table 7: Power estimates for rare variants with case:control ratio of 1:1.
†Calculations assume an additive genetic model (that is, the odds ratios of disease are expressed per copy of the risk variant) and a 5% baseline prevalence of disease. OR = odds ratio.
Due to advances in genotyping arrays and reduced costs it became possible to add a genome-wide imputation grid to the custom array in addition to exome array content and other categories of content. Illustrative power calculations for common and low frequency variants are shown in Supplementary Methods Table 8.
Supplementary Methods Table 8: Power estimates for low frequency and common variants with case:control ratio of 1:1.
†Calculations assume an additive genetic model (that is, the odds ratios of disease are expressed per copy of the risk variant) and a 5% baseline prevalence of disease.
Corresponding illustrative power calculations for common and low frequency variants are shown in Supplementary Methods Table 9 for a case-control ratio of 2:1, relevant to comparison of groups from the two extremes of the % predicted FEV 1 distribution.
Supplementary Methods Table 9: Power estimates for low frequency and rare variants with case:control ratio 2:1.
†Calculations assume an additive genetic model (that is, the odds ratios of disease are expressed per copy of the risk variant) and a 5% baseline prevalence of disease.
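For orientation, a power calculation under the stated assumptions (additive per-allele odds ratio, 5% prevalence) might be sketched as below. The method used to produce the tables is not described in the text, so the two-sided allelic test with a normal approximation, the Hardy-Weinberg assumption, the function name and all example inputs are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(raf, odds_ratio, n_cases, n_controls,
                  prevalence=0.05, alpha=5e-8):
    """Approximate power of a two-sided allelic case-control test."""
    # Hardy-Weinberg genotype frequencies for 0, 1 and 2 risk alleles.
    geno = [(1 - raf) ** 2, 2 * raf * (1 - raf), raf ** 2]

    def penetrances(base_odds):
        # Additive model on the odds scale: odds multiply by OR per allele.
        odds = [base_odds * odds_ratio ** k for k in range(3)]
        return [o / (1 + o) for o in odds]

    # Geometric bisection for the baseline odds matching the prevalence.
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        mid = sqrt(lo * hi)
        implied = sum(g * f for g, f in zip(geno, penetrances(mid)))
        if implied < prevalence:
            lo = mid
        else:
            hi = mid
    pen = penetrances(sqrt(lo * hi))

    # Risk allele frequency among cases and among controls.
    copies = (0.0, 0.5, 1.0)
    p_case = sum(g * f * c for g, f, c in zip(geno, pen, copies)) / prevalence
    p_ctrl = sum(g * (1 - f) * c
                 for g, f, c in zip(geno, pen, copies)) / (1 - prevalence)

    a, u = 2 * n_cases, 2 * n_controls  # allele counts per group
    p_bar = (a * p_case + u * p_ctrl) / (a + u)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / a + 1 / u))
    se_alt = sqrt(p_case * (1 - p_case) / a + p_ctrl * (1 - p_ctrl) / u)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf((abs(p_case - p_ctrl) - z_crit * se_null) / se_alt)


# Hypothetical scenario: 2:1 case:control ratio, 5% risk allele frequency.
power_2_to_1 = allelic_power(0.05, 1.3, n_cases=10000, n_controls=5000)
```

The sketch ignores the negligible contribution of the opposite tail and is intended only to show how allele frequency, odds ratio and case:control ratio enter the calculation.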

Analysis to identify whether variants with a high functional score explain the signal.
To identify whether there were putatively functional variants which explain the novel and previously reported association signals, association testing was repeated for each novel and previously reported sentinel variant with each nearby functional variant included in the logistic model in turn. To do this, variants within 1 Mb of the sentinel variant and which were in LD with the sentinel variant (r² > 0.3) and/or had nominal evidence of association (P < 5 × 10⁻⁴) were annotated with their functional effect. Variants were annotated using ENSEMBL's Variant Effect Predictor (VEP) 51 and functional effects were predicted with SIFT 52, PolyPhen-2 53, CADD 54, and GWAVA 55. If a variant was annotated as 'deleterious' by SIFT, 'probably damaging' or 'possibly damaging' by PolyPhen-2, had a CADD scaled score ≥ 20 (CADD_PHRED ≥ 20), or had a GWAVA score > 0.5, it was defined as a functional variant. CADD (Combined Annotation-Dependent Depletion) is a method for integrating many diverse annotations, namely conservation metrics, functional genomic data, transcript information, and protein level scores, into a single score for each coding and noncoding variant. Scaled CADD score (CADD_PHRED) ranks each variant relative to all possible substitutions of the human genome (~8.6 billion SNVs of the GRCh37/hg19 reference genome). A scaled CADD score of greater than or equal to 20 indicates the 1% most deleterious variants in the human genome. GWAVA (genome-wide annotation of variants) is a tool that combines information from a wide range of annotations to predict the functional impact of noncoding variants. We used a GWAVA score threshold of 0.5, as proposed by the authors 55, above which noncoding variants were considered as 'deleterious'. Annotation results were filtered with VEP's --pick flag, which selects only one consequence per variant based on the canonical, biotype status and length of the transcript as well as the ranking of the consequence type.
For variants with multiple annotations, we selected the most deleterious annotation (i.e. if a variant was annotated as frameshift variant and intronic variant, the variant was considered to be frameshift).
The association of the sentinel variant was identified as being explained by a functional variant if the P value for the sentinel variant was > 0.01 in the joint association test. Results are in Supplementary Table 21.
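The classification rule and the criterion for an explained signal can be written compactly. The sketch below encodes only the thresholds stated above; the function names and argument layout are hypothetical, and the annotation values would in practice come from VEP output.

```python
def is_functional(sift=None, polyphen=None, cadd_phred=None, gwava=None):
    """Apply the thresholds from the text to one variant's annotations.

    Missing annotations are passed as None and simply do not contribute.
    """
    return bool(
        sift == "deleterious"
        or polyphen in ("probably damaging", "possibly damaging")
        or (cadd_phred is not None and cadd_phred >= 20)
        or (gwava is not None and gwava > 0.5)
    )


def sentinel_explained(sentinel_joint_p, threshold=0.01):
    """A sentinel is 'explained' if its P value exceeds 0.01 in the joint test."""
    return sentinel_joint_p > threshold


# Illustrative annotation values only.
coding_hit = is_functional(sift="tolerated", polyphen="possibly damaging")
noncoding_hit = is_functional(gwava=0.62)
```

A variant qualifies if any single predictor meets its threshold, so the four criteria are combined with a logical OR.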

Gene-based analysis of rare and low-frequency variants (MAF < 5%) using SKAT-O
ENSEMBL's Variant Effect Predictor (VEP) was also used to annotate genotyped variants based on the ENSEMBL version transcript set 51 . Annotation results were also filtered with VEP's --pick flag, and for variants with multiple annotations, we selected the most deleterious annotation. In total we identified 115,444 variants in the protein coding regions of genes (exonic variants), of which 104,673 variants were annotated as loss of function (LoF) or missense variants. Gene-based analysis was performed using the optimal unified kernel-based test (SKAT-O) 56 , which maximizes power by selecting the best combination of the burden test and the non-burden sequence kernel association test (SKAT). For each FEV 1 comparison and heavy smokers vs never smokers, we ran two SKAT-O tests including two classes of variants: (1) loss of function (LoF) and missense variants with MAF <5%, and (2) LoF and missense variants with MAF <5%, which were predicted by SIFT 52 to be 'deleterious' or by PolyPhen-2 53 to be 'probably damaging' or 'possibly damaging' or variants with CADD scaled (CADD_PHRED) score ≥ 20 54 . Allele frequencies used for the inclusion threshold were estimated based on all 48,943 unrelated UK BiLEVE samples.
For each gene we selected the minimal P value between the two gene-based tests. In total, for each comparison we tested 9,427 genes in the analysis with LoF and missense variants, and 3,393 genes in the analysis with deleterious LoF and missense variants. Genes with fewer than 3 variants meeting the criteria for inclusion were excluded. We defined a statistical significance threshold of P < 3.9 × 10⁻⁶ (Bonferroni-corrected for 12,820 genes). All analyses included the first ten principal components and pack years (for heavy smokers) as covariates. Missing genotypes of variants were imputed with the average allele frequency of the genotyped individuals. All genes with SKAT-O P < 10⁻⁴ are reported in Supplementary Table 23.
For each gene with P < 10 -4 , SKAT-O analyses were re-run excluding each variant in turn to identify whether the SKAT-O signal was driven by a single variant (Supplementary Figure 9).
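The significance threshold and the treatment of missing genotypes can be checked with a few lines. This is a sketch rather than the SKAT-O implementation: the helper names are hypothetical, and representing genotypes as allele dosages with None for missing values is an assumption.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def impute_missing_with_mean(dosages):
    """Replace missing genotypes (None) with the mean observed dosage."""
    observed = [d for d in dosages if d is not None]
    if not observed:
        return list(dosages)  # nothing to impute from
    mean_dosage = sum(observed) / len(observed)
    return [mean_dosage if d is None else d for d in dosages]


n_genes = 9427 + 3393                      # genes tested across both analyses
threshold = bonferroni_threshold(n_genes)  # approximately 3.9e-6
filled = impute_missing_with_mean([0, 2, None, 1])
```

Dividing 0.05 by the 12,820 gene tests reproduces the threshold quoted in the text.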

Analysis of the effect of geographical location on novel loci
Rounded East and North home location coordinates were used to assign each individual to a postcode area (the one or two letter sequence at the start of a UK postcode) using the Ordnance Survey tool Code Point Open (http://www.ordnancesurvey.co.uk/business-and-government/products/code-point-open.html). Supplementary Methods Table 10 shows which region each postcode area corresponds to, and the number of unrelated individuals in each region.
Supplementary Methods Table 10: Allocation of participants to postcode area.
Out of 48,943 individuals, 504 had missing home location coordinates and were not mapped. Those in the "Eastern" region were reassigned to their next closest region (two samples were reassigned to London and one sample was reassigned to the Southeast). The geographical distribution of participants included in UK BiLEVE is shown in Supplementary Methods Figure 13.
Supplementary Methods Figure 13: Plot of Northing coordinate against Easting coordinate for all participants and coloured according to region as designated by postcode area.
To assess within-UK population structure (manifest as allele frequency variation by geographical region of residence) for selected genetic variants we performed a likelihood ratio test (implemented using the Deducer package, http://www.deducer.org, in R), which has better statistical properties than a Chi-squared test at low minor allele counts. The likelihood ratio test was performed for the sentinel variants in novel signals of association with extremes of FEV 1 and smoking. As a positive control, we also tested rs9378805 in IRF4 for association as this SNP showed the strongest evidence of association with geographical location in a previous report 57. Results are presented in Supplementary Methods Table 11.
Supplementary Methods Table 11: Geographical variation of novel loci. Likelihood Ratio Test P values for association of each of the novel loci (for extremes of FEV 1 or smoking behaviour) with geographical region defined by 11 postcode areas. The SNP rs9378805 in IRF4 was included as a positive control having been previously reported as being strongly associated with geographical location.
Supplementary Table 1: Association with doctor-diagnosed asthma (cases vs controls) for loci previously reported as having genome-wide significant association (P < 5 × 10⁻⁸) with asthma in European adults. Consistent: whether direction of effect in this study is consistent with that of previous study (na: direction of effect in previous study could not be determined from the literature).
Supplementary Table 11: Imputation of structural haplotypes at 17q21.31 (KANSL1) and association with extremes of FEV 1. Genomic regions α, β, and γ are those comprising the structural haplotypes in the 17q21.31 inversion region 47 with their start and end positions.
The columns beta, se, OR and P show respectively the fitted effect estimate, its standard error, odds ratio and P value of association for a logistic regression of low FEV 1 versus high FEV 1 with copy number of each genomic region for both heavy and never smokers with 10 ancestry principal components and pack years smoked as covariates (0 for never smokers).
For the low FEV 1 vs high FEV 1 never smokers section, the sentinel variants are conditioned on concurrently or previously reported lung function signals in the same region. For the heavy smokers vs never smokers section the first 3 sentinel variants are secondary novel signals within regions containing a genome-wide significant variant, on which they are conditioned. The remaining variants are conditioned on previously reported smoking behaviour variants within the region.
Supplementary Table 17: MAGENTA pathway analysis. Results of gene set enrichment analysis (MAGENTA) for genome-wide results from a meta-analysis of low FEV 1 vs high FEV 1 in heavy smokers and never smokers. Only gene sets with false discovery rate > 0.05 are presented. Analyses were run before and after excluding variants within the HLA region. Genes within 500kb of novel (i.e. reported in this paper) and previously reported genome-wide significant signals of association with lung function are flagged. Original gene set size: original number of genes per gene set in publicly available dataset. Effective gene set size: effective number of genes per gene set analysed after removing genes that were not assigned a gene score (e.g. no variants in their region), or after adjusting for physical clustering of genes in a given gene set (removing all but one gene from a subset of genes assigned the same best variant, retaining the gene with the most significant gene score).
Supplementary Figure 1: a) Imputation quality (spline smoothed) against minor allele frequency (MAF) (log scale), and b) Percentages of usable variants on chromosome 2 passing imputation quality control (INFO > 0.5) and minor allele count (MAC) ≥ 3, in different minor allele frequency (MAF) ranges.
Imputation against 1000G panel alone (grey) and 1000G+UK10K (the rest) reference panels (total number of imputed variants on chromosome 2 is 3,515,740 variants with MAC ≥ 3 for UK10K+1000G panel and 3,292,965 for 1000G panel) is shown. Colour reflects the component of the array content used for imputation (cyan: basic GWAS grid (18,367 variants); green: as cyan, plus "booster 1" content (7,127 variants) to optimise imputation of common variation in European ancestry; blue: as green, plus "booster 2" content (18,838 variants) to optimise imputation of low frequency (MAF 1-5%) variation in European ancestry; black: all array content (additional 3,887 variants)). Chromosome 2 was used as it is the largest representative autosomal chromosome.
e) TEX41/PABPC1P2 independent signal rs10928224 conditioned on novel genome-wide significant signal rs10193706.
f) LPPR5 independent signal rs12060706 conditioned on novel genome-wide significant signal rs61784651.

Supplementary Figure 4: Region plots for novel signals of association at previously reported loci (NPNT and HLA-DQB1).
a) NPNT (low FEV 1 vs high FEV 1 in never smokers).
b) HLA-DQB1: low FEV 1 vs high FEV 1 in never smokers. Gabriel study asthma SNP is not in this study, but we have a proxy rs17843604 (r 2 0.917 with Gabriel SNP in HapMap 3; r 2 0.65 with rs9274600 in this study); rs17843604 association P = 4.86×10 -9 .
Supplementary Figure 5: Effect of exclusion of individuals with asthma at novel loci associated with extremes of FEV 1 . Odds ratios for the five novel genome-wide significant signals of association for low FEV 1 vs high FEV 1 in never smokers, before and after exclusion of individuals with doctor-diagnosed/self-reported asthma. A total of 2,828 individuals with low FEV 1 (never smokers) and 286 individuals with high FEV 1 (never smokers) with doctor-diagnosed/self-reported asthma were excluded.