Genetic variants associated with susceptibility to idiopathic pulmonary fibrosis in people of European ancestry: a genome-wide association study

Summary Background Idiopathic pulmonary fibrosis (IPF) is a chronic progressive lung disease with high mortality, uncertain cause, and few treatment options. Studies have identified a significant genetic risk associated with the development of IPF; however, mechanisms by which genetic risk factors promote IPF remain unclear. We aimed to identify genetic variants associated with IPF susceptibility and provide mechanistic insight using gene and protein expression analyses. Methods We used a two-stage approach: a genome-wide association study in patients with IPF of European ancestry recruited from nine different centres in the UK and controls selected from UK Biobank (stage 1) matched for age, sex, and smoking status; and a follow-up of associated genetic variants in independent datasets of patients with IPF and controls from two independent US samples from the Chicago consortium and the Colorado consortium (stage 2). We investigated the effect of novel signals on gene expression in large transcriptomic and genomic data resources, and examined expression using lung tissue samples from patients with IPF and controls. Findings 602 patients with IPF and 3366 controls were selected for stage 1. For stage 2, 2158 patients with IPF and 5195 controls were selected. We identified a novel genome-wide significant signal of association with IPF susceptibility near A-kinase anchoring protein 13 (AKAP13; rs62025270, odds ratio [OR] 1·27 [95% CI 1·18–1·37], p=1·32 × 10−9) and confirmed previously reported signals, including in mucin 5B (MUC5B; rs35705950, OR 2·89 [2·56–3·26], p=1·12 × 10−66) and desmoplakin (DSP; rs2076295, OR 1·44 [1·35–1·54], p=7·81 × 10−28). For rs62025270, the allele A associated with increased susceptibility to IPF was also associated with increased expression of AKAP13 mRNA in lung tissue from patients who had lung resection procedures (n=1111). We showed that AKAP13 is expressed in the alveolar epithelium and lymphoid follicles from patients with IPF, and AKAP13 mRNA expression was 1·42-times higher in lung tissue from patients with IPF (n=46) than that in lung tissue from controls (n=51). Interpretation AKAP13 is a Rho guanine nucleotide exchange factor regulating activation of RhoA, which is known to be involved in profibrotic signalling pathways. The identification of AKAP13 as a susceptibility gene for IPF increases the prospect of successfully targeting RhoA pathway inhibitors in patients with IPF. Funding UK Medical Research Council, National Heart, Lung, and Blood Institute of the US National Institutes of Health, Agencia Canaria de Investigación, Innovación y Sociedad de la Información, Spain, UK National Institute for Health Research, and the British Lung Foundation.

Genotyping data was available for 152,729 of the 502,682 individuals recruited by UK Biobank. Of the genotyped individuals, UK Biobank recommends removing 480 individuals for either having outlying genome-wide heterozygosity or having more than 5% missing genetic data. Those recorded as being related to any other individual in UK Biobank (n = 17,308) or not determined as being of European ancestry from the genetic data (n = 22,603) were removed. There were no sex mismatches leaving 112,338 individuals in UK Biobank passing QC. In total 148 individuals were removed as they were concluded to have an ILD leaving 112,190 individuals passing selection criteria.
Controls were selected such that they followed similar age, sex and smoking distributions to the IPF cases. The proportion of ever smokers in the IPF cases was based on those in the PROFILE study (n = 176) as this was the only centre that had smoking status data available. Accordingly, controls were selected ensuring that 70% were male, 70% were ever smokers and 83% were aged 65 or above. Five controls (plus 10% extra to allow for downstream QC failures) were selected for every IPF case passing QC meaning 3,366 controls were selected. IBD analysis between cases and controls found no additional duplicates or related individuals.

Power calculations
Power calculations were performed using Quanto 5 v1.2.4, assuming a population disease prevalence of 20 cases per 100,000 and varying effect size and effect allele frequencies. Results from these calculations can be seen in Supplementary Figure 1.

Phasing and imputation
For stage 1 samples, phasing and imputation was performed on stage 1 cases and controls together in 3Mb chunks. Phasing was performed before imputation using SHAPEIT v2 (r837) 6 using variants with a call rate greater than 95%, MAF greater than 1% and in Hardy-Weinberg equilibrium (p > 10 −6 ). For variants found on both the UK BiLEVE and UK Biobank array call rate was calculated across the whole sample and for variants found only on one array, call rate was calculated just in individuals genotyped using that array. Imputation was performed using IMPUTE2 7 v2.3.2 using a combined reference panel from 1000 Genomes Project 8 Phase 3 and UK10K 9 .
For the Chicago Consortium, genotyping was performed using the Affymetrix Genome-Wide Human SNP 6.0 Array. Phasing and imputation used SHAPEIT 6 and minimac 10 and the HRC 11 (Haplotype Reference Consortium) r1.1 reference panel. For the Colorado Consortium the Illumina 660 Quad beadchip was used for genotyping and phasing and imputation was performed using SHAPEIT 6 and IMPUTE2 7 (using 1000 Genomes Phase 1 as the reference panel).

Variant QC
During imputation, an imputation quality score is calculated based on how certain the imputation is for each variant. Generally, imputation is of a higher quality for common variants compared to rare variants. Therefore for variants with MAF ≥ 1% only variants with an imputed information score ≥ 0.5 were included in the analysis and for variants with MAF < 1% only variants with an information score ≥ 0.8 were included. Due to potential biases caused by poor imputation for very rare variants, a minimum MAC threshold of 10 was set for imputed variants and a minimum MAC threshold of 3 was set for directly genotyped variants.

Stage 1 genome-wide and X chromosome analysis
Case-control analyses were run genome-wide assuming an additive genetic model conditioning on age, sex and ten principal components. Variants were recorded as dosages (i.e. continuous values between 0 and 2 to take into imputation uncertainty). The analysis was run using the score test using SNPTEST 12 v2.5.2 due to its computational efficiency. The Firth test has been shown to give a better combination of type 1 and type 2 error rate than the score test for low frequency variants (especially in unbalanced case-control studies) 13 . Therefore for variants with a score test p value < 5 × 10 −3 and a MAC < 400 the analysis was rerun using the Firth test using EPACTS 14 v3.2.4. For the X chromosome males, the reference allele was coded as zero and the alternate allele coded as two (or the equivalent dosage) and analysed using the same approach as used for the autosomes.

Selection of signals for stage 2 analyses
Independent variants representing signals of association with susceptibility to IPF in stage 1 with p < 5 × 10 −6 were selected for further study in stage 2. Independence of signals was confirmed using conditional analyses. Where possible, genotyping cluster plots of the most strongly associated variant of each independent signal (or a genotyped proxy, if the top variant was imputed) were manually checked and variants with evidence of poor genotype clustering were excluded from further analysis.

Accounting for array effects
The controls for stage 1 of this study were selected from UK Biobank where ~50K individuals were genotyped on the Affymetrix Axiom UK BiLEVE array (the same array as the cases) and the rest were genotyped on the 95% identical Affymetrix Axiom UK Biobank array. In order to maximise the power of our study by increasing the number of controls, we selected controls that had been genotyped on either array. As the UK Biobank controls genotyped on the Axiom UK BiLEVE array had originally been ascertained on the basis of lung function and smoking behaviour, it was reasonable to assume that allele frequency differences between the controls genotyped on the two arrays could feasibly be driven by either technical array artefacts or genuine associations with lung function and smoking. It was not possible to include an adjustment for array in the association testing model as all cases were genotyped on the same array. Furthermore, it has been shown that there is some overlap of genetic signals of lung function/COPD and pulmonary fibrosis 15 . In order to ensure that signals of association with IPF that we reported were genuine, we additionally tested for association with array by comparing controls genotyped on the UK Biobank array with controls genotyped on the UK BiLEVE array. All variants that showed association with susceptibility to IPF in stage 1were selected for analysis in stage 2. We report as genuine novel signals of association with IPF, signals that were associated with susceptibility to IPF independently in both stage 1 and stage 2, and which had no evidence of association with array (p < 0.05) in stage 1. The genome-wide Manhattan plot and QQ plot for the controlcontrol analysis, including the genomic inflation factor lambda, are shown in Supplementary Figure 2. For novel signals of association with IPF susceptibility, we additionally undertook sensitivity analyses restricting the control set to individuals genotyped on the same array as the cases and repeating the association analyses, and regionally re-imputing the signal using the Haplotype Reference Consortium imputation panel and repeating the association analyses.

Stage 2 samples
Stage 2 analyses were performed on individuals from two independent samples from the USA; the Chicago Consortium and the Colorado Consortium. The Chicago Consortium dataset comprised of individuals used in the stage 1 GWAS (genomewide association study) in the study by Noth et al 16 . Briefly, these were individuals of American-European ancestry passing QC (call rate > 97%, no sex mismatches and related individuals removed) where controls were selected so they were genetically similar (based on the first four principal components) to the IPF cases. The Colorado Consortium dataset consisted of individuals as described in the Fingerlin at al GWAS 17,18 . Cases were self-reported non-Hispanic white individuals with an IIP and controls were selected from self-reported non-Hispanic white individuals based on IBS (Identical by State) estimates such that the controls were genetically similar to the cases. Individuals were removed if they were genetic outliers based on IBS estimates, showed sex mismatches, had genome-wide heterozygosity more than four standard deviations away from the mean or had call rates < 98%.

Stage 2 case-control analysis and meta-analyses
The analyses for the Chicago Consortium and Colorado Consortium have been described elsewhere [16][17][18] . The Chicago Consortium adjusted for age and sex but did not adjust for principal components because cases and controls were prematched using genetic similarity to the cases based on principal components analysis. The Colorado Consortium adjusted for sex and three dimensions from a multi-dimensional scaling model. Dimensions were used rather than principal components as the distribution of p values when running the analysis genome-wide was closer to the null when using dimensions. Age was not included in the model as these data were not available for controls.
These two analyses were combined using an inverse variance weighted fixed effects meta-analysis. These two analyses were then meta-analysed with the stage 1 results and any variant found to be genome-wide significant (p < 5 × 10 −8 ) when metaanalysing the stage 1 and stage 2 results together was deemed to show an association with susceptibility to IPF. Variants which became less significant in the meta-analysis than in stage 1 alone were not reported as showing an association with susceptibility to IPF.

eQTL
The variants identified as associated with susceptibility to IPF as well as any variants in high LD (r 2 > 0.8) were investigated in three eQTL datasets; a blood eQTL database 19 , the GTEx project covering multiple tissues 20,21 and a lung eQTL database [22][23][24] . The blood eQTL database, as described by Westra et al 19 , contained eQTL data for variants showing any association with peripheral blood samples from 5,311 individuals. Individuals were genotyped on various platforms and imputed using HapMap. The GTEx project includes eQTL data from multiple tissues from 449 donors genotyped using either the Illumina OMNI 5M or 2.5M arrays and imputed to the 1000 Genomes Project Phase 1 (version 3). Only the 44 tissues with at least 70 samples from genotyped donors were included in the lookup for the IPF associated variants. Finally, the lung eQTL database consisted non-tumour lung tissue samples from 1,111 individuals who had undergone lung resection surgery, mainly current or former smokers, genotyped on the Illumina Human1M-Duo BeadChip array. Cis-eQTLs were available for all three databases (defined as the variant being within 1Mb of the beginning of the transcription start site in the lung and blood databases and within 250kb in GTEx) and trans-eQTLs were available in the lung and blood databases. An FDR threshold of 10% was used for eQTLs in the blood and lung databases and 5% in the smaller GTEx resource. mRNA analysis RNA was extracted extracted from human lung tissues using the mirVana kit (Ambion/Life Technologies, Carlsbad, CA). Complementary DNA (cDNA) wasComplementary DNA (cDNA) was synthesized from 80ng RNA using SuperScript III (Life Technologies, Waltham, MA) and diluted 1 to 8 with nuclease-free water. cDNA was used with real-time polymerase chain reaction (PCR) gene expression assay for AKAP13 (Hs00180747_m1, Thermo Fisher Scientific) and normalized to glyceraldehyde phosphate dehydrogenase (GAPDH; Hs02786624_g1, Thermo Fisher Scientific) and TaqMan Fast Advanced Master Mix (Applied Biosystems, Carlsbad, CA) and Viaa7 real-time PCR system (Thermo Fisher Scientific). Relative gene expression was determined through the ΔΔ cycle threshold (Ct) method and ΔCt values were calculated by subtracting the Ct value for GAPDH from the Ct value for AKAP13 for a sample. All statistical analyses were computed in GraphPad Prism (v5.01, GraphPad Software, La Jolla, California, USA) and Stata (Release 14, StataCorp., College Station, Texas, USA).

Immunohistochemistry
Lung tissue from patients with IPF was obtained either post-mortem or from lung transplant patients following informed written consent and ethical review available through the MRC Nottingham Molecular Pathology Node (ethical approval numbers: Nottingham Respiratory Research Unit, 08/H0407/1; Papworth Hospital Research Tissue Bank (REC) 08/H0304/56). Non-fibrotic human lung tissue was obtained from non-cancerous tissue removed during surgery or from donor lungs unsuitable for transplant. All experiments were performed in accordance with the World Medical Association Declaration of Helsinki. Antigens were unmasked by microwaving in 10mmol citrate buffer for 20 minutes. Endogenous peroxidase activity was quenched by incubating in 3% H2O2 in methanol for 30 minutes. Sections were then incubated for one hour in 5% goat serum in PBS. The sections were incubated in primary antibody solution (anti AKAP13 rabbit polyclonal antibody -Sigma HPA019773, 1:400 dilution, in 5% goat serum blocking buffer) overnight followed by incubation in secondary anti-rabbit antibody (1:200 goat anti-rabbit -Vector BA-1000) for 30 minutes. Streptavidin label was attached by incubation in ABC complex solution (Vector PK-6100) for 30 minutes. Colour development was performed using 3,3'diaminobenzidine tetrachloride (Sigma D4418), and counterstained using Mayer's haematoxylin (Raymond Lamb Ltd).

Signal refinement using Bayesian fine-mapping
We estimated 95% credible sets (a set of variants that is 95% likely to contain the causal variant). To do this the posterior probabilities for each variant being causal was calculated using approximate Bayes factors (ABFs) proposed by Wakefield 25 . The ABFs were calculated using the following formula: where Z is the Z statistic (i.e. effect size estimate divided by the standard error of the estimate), V is the variance of the effect size and prior W. Setting a prior of W = 0.4 is equivalent to a 95% belief that the departure from the null model for the relative risk is less than 1.5 26 . This formula gives the Bayes factor comparing the null hypothesis to the alternative model, so the Bayes factor for the whether a variant has an effect on the phenotype compared to the null model would be equal to the reciprocal of this formula. These were calculated using a prior of W = 0.4 for all variants with a stage 1 p value < 0.001 and within 1Mb of the sentinel variant.
An approximate posterior probability of the variant being causal is equal to the ABF for that variant divided by the sum of all ABFs in that signal (as the posterior probability will be proportional to the Bayes factor and the sum of the probabilities must equal one). This assumes that there is a single causal variant in the signal and that it has been included in the analysis. The credible set can be constructed by adding the variants with the highest probabilities to the credible set until the sum of the probabilities of the variants in the set exceed 0.95.
The credible set for the signal on chromosome 15 contained 113 variants and covered a region of the genome that included the AKAP13 and KLHL25 genes (Supplementary Table 3), the credible set for the MUC5B signal only contained rs35705950 and the credible set for the DSP signal only contained rs2076295.

Druggability
Proteins that interact with the proteins identified as associated with susceptibility to IPF were identified. This was performed using STRING 27 setting conditions that the interactions had to have "known interactions" (identified from either experiments or curated pathway databases) and only including the interactions with the "highest confidence" (confidence score > 0.9). The protein IDs (obtained from the protein database UniProt 28 ) that were identified by STRING were searched in the drug databases ChEMBL 29 and DrugBank 30 .
53 protein interactions with AKAP13, 18 proteins interactions with DSP and one with MUC5B were identified (Supplementary Table 7). Of these, 11 of the protein interactions with AKAP13 were returned as possible drug targets (Supplementary Table 8).  . The x axis shows chromosomal position. The y axis above the x axis shows the -log10(p value) from the case-control analysis and the y axis below the x axis shows the -log10(p value) for expression of AKAP13 from the blood eQTL analysis. The blue dotted line shows the significance threshold (p = 5 × 10 −6 ) used in stage 1 and the red dotted line shows genome-wide significance (p = 5 × 10 −8 ). Variants are coloured by LD with rs62025270 and the genes in the region are shown at the bottom with AKAP13 gene in green.