Novel risk genes and mechanisms implicated by exome sequencing of 2572 individuals with pulmonary arterial hypertension

Background Group 1 pulmonary arterial hypertension (PAH) is a rare disease with high mortality despite recent therapeutic advances. Pathogenic remodeling of pulmonary arterioles leads to increased pulmonary pressures, right ventricular hypertrophy, and heart failure. Mutations in bone morphogenetic protein receptor type 2 and other risk genes predispose to disease, but the vast majority of non-familial cases remain genetically undefined. Methods To identify new risk genes, we performed exome sequencing in a large cohort from the National Biological Sample and Data Repository for PAH (PAH Biobank, n = 2572). We then carried out rare deleterious variant identification followed by case-control gene-based association analyses. To control for population structure, only unrelated European cases (n = 1832) and controls (n = 12,771) were used in association tests. Empirical p values were determined by permutation analyses, and the threshold for significance defined by Bonferroni’s correction for multiple testing. Results Tissue kallikrein 1 (KLK1) and gamma glutamyl carboxylase (GGCX) were identified as new candidate risk genes for idiopathic PAH (IPAH) with genome-wide significance. We note that variant carriers had later mean age of onset and relatively moderate disease phenotypes compared to bone morphogenetic receptor type 2 variant carriers. We also confirmed the genome-wide association of recently reported growth differentiation factor (GDF2) with IPAH and further implicate T-box 4 (TBX4) with child-onset PAH. Conclusions We report robust association of novel genes KLK1 and GGCX with IPAH, accounting for ~ 0.4% and 0.9% of PAH Biobank cases, respectively. Both genes play important roles in vascular hemodynamics and inflammation but have not been implicated in PAH previously. These data suggest new genes, pathogenic mechanisms, and therapeutic targets for this lethal vasculopathy.

New risk genes are emerging from large exome-and genome-wide sequencing studies. Rare mutations in SRY-related HMG-box transcription factor (SOX17), a key regulator of embryonic vasculogenesis, explain3 .2% of APAH-CHD [22] and 0.7% of IPAH [22,23]. The UK NIHR BioResource-Rare Diseases PAH Study, utilizing~1000 PAH cases of primarily adult-onset IPAH, identified an ATPase gene (ATP13A3), growth differentiation factor 2 (GDF2; also known as BMP9), and SOX17 as risk genes contributing to 0.8-1.1% of cases [23]. The low frequency of risk variants for each gene, except BMPR2, indicates that large numbers of individuals are required for further validation of rare risk genes and pathways, and to understand the natural history of each genetic subtype of PAH.
The National Biological Sample and Data Repository for PAH (aka PAH Biobank) is a resource of biological specimens as well as clinical and genetic data generated for 2900 group 1 PAH patients to serve as a resource to the research community to enable larger-scale PAH studies. Herein, we performed targeted PAH gene and whole exome sequencing of 2572 cases from the PAH Biobank to identify and characterize frequencies and mutations in known PAH risk genes, identify new risk genes, and identify correlations between risk genes and clinical phenotypes.

Participants
The PAH Biobank is housed and maintained at the Cincinnati Children's Hospital Medical Center (CCHMC). Thirty-eight North American PH Centers participate in the PAH Biobank to identify and enroll patients meeting eligibility criteria. Each enrolling center also completes an electronic case report form with clinical data for each patient enrolled. Participants are diagnosed according to the World Health Organization PH group I classification [5], and the diagnosis of PAH is confirmed by medical record review including right heart catheterization. The cohort for this genetic analysis included 2534 singletons, 19 duos (proband and 1 unaffected parent), and 19 trios (proband and 2 unaffected biological parents). Written informed consent (and assent when appropriate) was obtained from participants or parents/legal guardians under a protocol approved by the institutional review board at CCHMC as well as those at each of the participating PH Centers. Written informed consent for publication was obtained at enrollment. The data and resources of the PAH Biobank are made available to the research community for hypothesis-driven projects via an application process (www.pahbiobank.org). A subset including 183 affected participants were included in previous publications from our group [14,20,22].

Targeted sequencing and multiplex ligation-dependent probe amplification (MLPA)
After proper informed consent, blood samples were collected and shipped to CCHMC for processing and generation of genetic data including panel sequencing of up to 12 genes, SNP genotyping using the Illumina OMNI5-4 Beadchip, and limited MLPA dosage data. Targeted next-generation sequencing was performed with 250 ng DNA using the Illumina Tru-seq Custom Amplicon system (Illumina, USA) according to the manufacturer's instructions. Custom amplicons were designed with Illumina's DesignStudio for the coding sequence of BMPR2, ACVRL1, ENG, CAV1, SMAD9, KCNK3, and EIF2AK4, for which all participants were sequenced. ABCC8, GDF2, KCNA5, SMAD4, and TBX4 were added to the panel later, and a subset of 739 were also sequenced for these genes. Each sample was sequenced using the Illumina MiSeq® instrument with paired-end 250 nucleotide read lengths. Demultiplexing, base calling, and alignment were executed using the default Illumina TruSeq Amplicon Workflow. Fastq files were aligned and visualized with NextGENe (SoftGenetics, USA). Variants were confirmed via Sanger sequencing on an ABI 3730xl DNA analyzer (Applied Biosystems, USA).
MLPA was performed with 100 ng of genomic DNA according to the manufacturer's instructions using the P093 Salsa MLPA probe sets (MRC-Holland, Amsterdam, The Netherlands). This probe set includes probes for all exons of BMPR2, ALK1, and ENG. Probe amplification products were run on an ABI 3730xl DNA Analyzer using GS500 size standard (Applied Biosystems). MLPA peak data was imported into Coffalyser (MRC-Holland) for quality checks and dosage ratio analysis. A dosage ratio value of ≤ 0.7 was used as the boundary for deletions, and ≥ 1.35 was used as the boundary for duplications.

Whole exome sequencing (WES)
Exome sequencing was performed for the entire cohort in collaboration with the Regeneron Genetics Center (RGC). In brief, genomic DNA was prepared with a customized reagent kit from Kapa Biosystems and captured using Integrated DNA Technologies xGen lockdown probes. All samples were sequenced on the Illumina HiSeq 2500 platform using v4 chemistry, generating 76 bp paired-end reads. 99.8% of the exome sequencing samples have read depth coverage ≥ 15× for 90% of the targeted regions (see Additional file 1: Figure S1).

WES data analysis
We used a previously established bioinformatics procedure [24] to process and analyze exome sequence data. Specifically, we used BWA-MEM [25] to map and align paired-end reads to the human reference genome (version GRCh38/hg38), Picard MarkDuplicates to identify and flag PCR duplicates, and GATK HaplotypeCaller (version 3.5) [26,27] recommended settings to call genetic variants. We used additional heuristic filters to minimize technical artifacts, excluding variants that met any of the following criteria: missingness > 10%, allele balance ≤ 25% [28], genotype quality < 60 for indels or < 90 for SNVs, cohort allele frequency ≥ 0.01, depth < 9, GATK MQ < 40, located in MUC or HLA genes, and located in segmental duplications with similarity ≥ 95%. We obtained gnomAD whole genome sequence (WGS, data release v2.02) as part of the controls. We applied the same heuristic filtering approach with additional exclusion criteria: not "PASS," not located in xGen-captured protein coding region, VQSLOD ≤ − 5.5, and FS ≥ 35. We obtained WES data of unaffected parents from the Pediatric Cardiac Genomics Consortium (PCGC) [24] as "internal" controls. The internal control data was processed with the same pipeline as cases. For both panel and exome sequencing data, we used ANNO-VAR [29] to aggregate variant annotation, allele frequencies (AF), and in silico predictions of deleteriousness. Rare variants were defined as AF ≤ 0.01% in both ExAC and gnomAD exome datasets (all ancestries). An exception was made for recessive inheritance of EIF2AK4 variants [16,17], in which the AF cutoff was ≤ 1%. Deleterious variants were defined as likely gene damaging (LGD, including premature stop-gain, frameshift indels, canonical splicing variants, and exon deletions) or predicted damaging missense with REVEL score > 0.5 (D-Mis), as previously described [22,30]. For EIF2AK4 variants, deleterious missense variants were defined by CADD score ≥ 20 since REVEL is not optimized to assess the deleteriousness of relatively common variants. CADD scores for all variants are provided in the variant tables as a reference [31]. Insertion/deletion variants were manually inspected using Integrative Genome Viewer (IGV).

Statistical analysis
To identify novel risk genes, we performed a gene-based case-control association test comparing the frequency of rare deleterious variants in PAH cases with population controls. The controls consisted of gnomAD WGS subjects and unaffected parents from the Pediatric Cardiac Genomics Consortium (PCGC) ("internal controls") [24]. To control for confounding from genetic ancestry, we selected 1832 cases and 5262 internal controls of European ancestry based on principle components analysis using PLINK version 1.9 [32], and 7509 non-Finnish, European (NFE) gnomAD subjects. Relatedness was checked using Peddy [32], and only unrelated cases were included in the association tests. To reduce batch effects in combined datasets from different sources [33], we limited the analysis in regions targeted by all xGen, NimbleGen, and MedExome, and with at least 10× coverage in 90% of samples. We then tested for similarity of the rare synonymous variant rate among cases and controls, a class that is mostly neutral with respect to disease status.
To identify PAH risk genes, we tested the burden of rare deleterious variants (AF ≤ 0.01%, LGD or D-mis) in each protein-coding gene in cases compared to controls. We used REVEL [30] scores to predict the deleteriousness of missense variants. To improve statistical power, we searched for a gene-specific REVEL score threshold that maximized the burden of rare deleterious variants in cases compared to controls, and use permutations to calculate statistical significance, similar to a published method [34] designed for variable threshold on allele frequency. Specifically, in each gene, we performed binomial tests with a given REVEL score threshold, ranging from 0.2 to 1 with 0.05 intervals, and defined the optimal threshold by the smallest p value (P 0 ). Then, we performed 10,000,000 permutations (shuffling casecontrol labels); in each permutation, we obtained the smallest p value (P′) using the same variable REVEL threshold procedure. We then recalibrated the p value for each gene as the fraction of permutations where P ′ ≤ P 0 . We assigned a REVEL score of 1 (most deleterious) to LGD variants. In each binomial test, the null model is that the number of rare deleterious variants in cases follows a binomial distribution, given the total number of such variants in cases and controls, and a rate determined by fraction of cases in total number of subjects (cases and controls). We used binom.test function in R to calculate p values in the binomial tests. The script used for this variable threshold method is available from the following URL: https://github.com/ShenLab/ VariableThresholdTest.
We expect that most genes will not be associated with PAH, and thus, the distribution of test statistics across most targeted regions in cases will not deviate from that of the controls. We checked for inflation using a quantile-quantile (Q-Q) plot and calculated the genomic control factor, lambda, using QQperm (https://cran.rproject.org/web/packages/QQperm/QQperm.pdf). Lambda equal to 1 indicates no deviation from the expected distribution.
To assess type I error and statistical power, we used BMPR2 data from 1832 unrelated PAH Biobank cases and 5262 unrelated internal controls, all of European ancestry. In total, there were 188 rare (allele frequency < 10 −4 ) LGD and D-mis variants. To assess whether type I error is controlled, we randomly shuffled case/control labels 10,000,000 times to generate 10,000,000 sets of data under the null and applied the variable threshold test to each dataset. We found the type I error rate was well controlled (see Additional file 2: Table S1).
BMPR2 is a well-known PAH risk gene with large effect size and relatively long transcript, two conditions that lead to good statistical power in association tests. We expect that most of the undiscovered risk genes will have smaller effects or shorter transcripts. To estimate statistical power under realistic conditions with smaller effect sizes and shorter transcripts, we simulated data in two ways: (1) Randomly labeled cases and controls with a required fraction (F) of true cases being labeled as cases. F = 0.258 is equivalent to completely randomizing case and control labels, and therefore, it corresponds to the null model (relative risk = 1). F = 1 corresponds to original case/control data and maximizes the effect size (relative risk~45). The power was estimated using two significance thresholds, α = 0.005 and α = 2.5e−6. In each setting, we ran 1000 simulations to calculate power. We compared our method (VT) with SKAT-O [35], a popular method for testing association of rare variants. VT has better power than SKAT-O with F in the range of 0.4-0.6, reflecting a range of modest effect sizes (relative risk~2 to 5) as can be seen in Additional file 3: Figure S2A. (2) Given F, we then sampled a fraction of variants to generate a smaller dataset. This effectively creates datasets with smaller cumulative allele frequencies (CAF), a condition that fits for genes with shorter transcripts. We generated 1000 datasets under each condition (defined by F and CAF) to estimate power, setting the significance threshold at 2.5e−6. As shown in Additional file 3: Figure S2B, VT has better power than SKAT-O with all CAF values when F is between 0.45 and 0.6.
We defined the threshold for genome-wide significance by Bonferroni's correction for multiple testing (n = 20,000 genes, threshold p value = 2.5e−6). We used the Benjamini-Hochberg procedure to estimate false discovery rate (FDR) by p.adjust in R. All GGCX and KLK1 variants reported herein were confirmed by Sanger sequencing.

Cohort characteristics
Characteristics of the PAH Biobank cohort are shown in Table 1 with more detailed characteristics shown in Additional file 4: Table S2. The cohort included 2572 cases: 43% IPAH, 48% APAH, 4% FPAH, and 5% other PAH. The APAH cases included 722 associated with autoimmune CTDs (mostly scleroderma with few cases of rheumatoid arthritis, systemic lupus erythematosus, and Sjogren's syndrome), 268 with CHD, 139 with portopulmonary hypertension (PoPH), and 110 with other diseases (HHT, HIV, and rare disorders). The "other PAH" group included 110 drug-and toxin-induced (DTOX) PAH, 11 non-familial PVOD/PCH cases, and 1 persistent pulmonary hypertension of the newborn. The majority of cases (91.2%) were adult-onset with a cohort mean age of onset of 48 ± 19 years (mean ± SD). However, there was an enrichment of child-onset cases (95/ 268, 37.2%, p < 0.0001 by chi-square) in the APAH-CHD subclass. As has been reported previously for adult populations [36], there was an overall 3.7:1 ratio of females to males, with a 9:1 ratio for PAH associated with autoimmune disease and 1:1.2 ratio for the PoPH subclass.
Hemodynamic data collected at the time of PAH diagnosis are also shown in Additional file 4: Table S2. Compared to IPAH cases, APAH-CTD cases had lower mean pulmonary artery pressure (MPAP) and lower pulmonary vascular resistance (PVR) by one-way ANOVA with correction for multiple comparisons. APAH-CHD and FPAH cases had higher MPAP and PVR compared to IPAH cases; FPAH cases also had decreased cardiac output (CO) as previously described. PoPH cases had increased CO and decreased PVR compared to IPAH cases. By these measures, APAH-CTD and PoPH cases had more moderate hemodynamic profiles compared to IPAH cases whereas both FPAH and APAH-CHD had less favorable profiles.
A complete list of cases carrying rare deleterious variants in established risk genes is provided in Additional file 5:  Table S3. Not surprisingly, 68% of these cases carried variants in BMPR2 (n = 119 variants in 180 cases: 9% exon deletions, 65% LGD, 26% D-Mis). The age of onset for BMPR2 variant carriers was 38 ± 15 years (mean ± SD), significantly younger than that of the whole cohort (p = 1.1E −15, Mann-Whitney U test) but with a wide range of ages from 2 to 76 years (Fig. 2). The second most common genetic cause was TBX4, accounting for approximately 1% of cases (n = 23 cases with 22 variants: 12 LGD, 9 D-Mis, and 1 in-frame deletion), the majority of whom (57%) had a diagnosis of IPAH. Although more than 90% of cases in the PAH Biobank cohort had adult-onset disease, only 48% of the TBX4 variant carriers had adult-onset disease. The overall mean age of onset was 29 ± 25 years (see Fig. 2a), with a bimodal distribution and a significant enrichment of pediatric-onset cases compared to the whole PAH cohort (p = 6.5E−08, RR = 12.3, binomial test) (see Fig. 2b), consistent with previous findings [14]. Deleterious variants in 9 additional genes were observed: ACVRL1 (n = 16 cases, including 7 with HHT), SMAD9 (13 cases), CAV1 (10 cases), ENG (6 cases, including 2 with HHT), bi-allelic EIF2AK4 (5 cases, including 2 with PVOD/PCH), KCNK3 (3 cases), BMPR1A (4 cases), SMAD4 (2 cases), and BMPR1B (2 cases). The three FPAH cases carrying the same LGD mutation in CAV1 were related (proband, aunt, grandfather); only the proband was included in the downstream association analyses. We note that four cases (two IPAH, two FPAH) carried risk variants in BMPR2 plus one other risk gene. A complete list of rare deleterious variants in newly reported PAH risk genes is provided also in Additional file 5: Table S4. Nearly two thirds were variants in ABCC8 (26 variants in 29 cases: all D-Mis) or GDF2 (24 variants in 28 cases: 9 LGD, 15 D-Mis). The ABCC8 variants occurred equally in IPAH and APAH cases (50: 50) while the GDF2 variants occurred primarily in IPAH cases (75%). Deleterious variants in the other new PAH risk genes were observed less frequently or not at all: KCNA5 (n = 13 cases), SOX17 (10 cases), ATP13A3 (7 cases), SMAD1 (2 cases), and KLF2 (0 cases). The mean age of onset for these risk gene variant carriers ranged from 41 to 46 years, with the exception of SOX17 which had a mean age of onset of 26 years (Fig. 2a), significantly younger than that of the whole cohort (p < 0.003, Mann-Whitney U test). The female to male ratio among these patients was 4.2:1, similar to that of the whole cohort. Overall, 71% of the variants in known risk genes were novel.
Locations of the risk gene variants are shown in Additional file 6: Figure S3 and Additional file 7: Figure S4. For BMPR2, all but 2 of the D-Mis variants are located within the first 500 amino acids of the protein, mostly within the conserved activin and protein kinase domains (Additional file 6: Figure S3). While the LGD variants are also clustered within the activin and protein kinase domains, 26 variants carried by 28 individuals are located downstream of these domains. For the other risk genes, the majority of D-Mis variants are also located in conserved protein domains (Additional file 7: Figure S4).

Identification of novel PAH risk genes: KLK1 and GGCX
Our extensive genome screening efforts failed to identify rare deleterious variants in known risk genes for 86% of the PAH Biobank cases. To identify novel PAH risk genes, we performed a gene-based, case-control association analysis. To prevent confounding by genetic ancestry, we included only participants of European ancestry (cases n = 1832; controls n = 7509 gnomAD WGS subjects and 5262 unaffected parents from the Pediatric Cardiac Genomics Consortium). To minimize technical batch effects of genotype data between cases and controls, we applied heuristic filters as described in the "Methods" section. We observed similar overall frequencies of rare synonymous variants in cases and controls (enrichment rate = 1.01, p value = 0.09), a class that is mostly neutral with respect to disease status (Additional file 8: Table S5). Further, a gene-level burden test confined to rare synonymous variants was consistent with a global null model (Additional file 9: Figure S5), indicating that technical batch effects would likely have minimal impact on genetic analyses.
We then proceeded to test for gene-specific enrichment of rare deleterious variants (allele frequency < 0.01%, LGD and D-Mis) in cases compared to controls. The use of in silico prediction tools to select deleterious missense variants can increase statistical power for rare variant association analyses [42], but the optimal threshold for deleteriousness scores is often gene specific [43]. To improve power, we implemented a rare variant burden test utilizing empirically determined, gene-specific deleterious score thresholds, a "variable threshold test." The association results across all protein-coding genes, including 1832 European cases from all PAH subclasses, were generally consistent with the expectation under the null model (see Fig. 3). The Q-Q plot shows negligible genomic inflation, and we calculated the genomic control factor lambda = 1.02. Only two genes exceeded the Bonferroni-corrected threshold for significance: BMPR2 (p = 1.0E−07, FDR = 0.002) and KLK1 (p = 2.0E−07, FDR = 0.002) (Fig. 3). Established risk gene, ACVRL1, and recently reported GDF2 fell just below the cutoff for significance. Variants in most other known risk genes are less frequent causes of PAH, and some have smaller effect size compared to BMPR2, requiring larger patient populations for genome-wide significance in association studies. TBX4 and SOX17 exhibited marginal association (p = 0.001 for each) which was not significant in this largely adult-onset cohort. KLK1 encodes kallikrein 1, Fig. 2 Age of disease onset for PAH Biobank cases with rare deleterious variants in known PAH risk genes. a Box plots showing median, interquartile range, and min/max values for age of disease onset (i.e., age at diagnostic right heart catheterization). The number of cases carrying variants for each gene is given above each box plot. Genes represented by less than four cases are not shown. b Histogram plots showing ageof-onset distributions for the whole cohort (n = 2572), BMPR2 (n = 180), or TBX4 (n = 23) variant carriers. Red vertical lines indicate the group means. BMPR2 carriers had a younger mean age of onset (mean = 37 years, SD = 15; Mann-Whitney U test, p = 1.1E−15) but no enrichment of child-onset cases (binomial test p = 1, RR = 0.93) compared to the whole cohort, whereas TBX4 carriers had a younger mean age of onset (mean = 29 years, SD = 25; Mann-Whitney U test, p = 0.001) and significant enrichment of child-onset cases (binomial test p = 6.5E−08, RR = 12.3) compared to the whole cohort also known as tissue kallikrein, involved in the regulation of systemic blood pressure and vascular remodeling but not previously associated with pulmonary hypertension [44,45].
We next repeated the analysis using 812 IPAH cases only (all European). Again, the Q-Q plot shows negligible inflation (see Fig. 4) with lambda = 0.98. For IPAH, we observed significant associations for BMPR2 (p = 1.0E −7, FDR = 9.0E−04), KLK1 (p = 1.0E−7, FDR = 9.0E−04), and GGCX (p = 1.9E−06, FDR = 0.013). encodes g a m m a g l u t a m y l c a r b o x y l a s e , i m p l i c a t e d in coagulation factor deficiencies and ectopic mineralization of soft tissues [46]. These three genes were the only genes to reach genome-wide significance among IPAH cases. IPAH risk genes, TBX4 and GDF2, fell just below the cutoff for significance. The near genome-wide associations of ACVRL1 and TBX4 are considered positive controls in our study as variants in ACVRL1 are enriched in PVOD/PCH patients and variants in TBX4 enriched in pediatric patients, both subgroups included in the cohort but at very low frequency. Likewise, while the association signal for GDF2 fell below the cutoff (p = 3.2E−06, FDR = 0.016), we clearly provide confirmation of this new PAH risk gene. All association results for the total cohort or IPAH alone, with p ≤ 0.001, are listed in Figs. 3 and 4, respectively.
Analysis of the depth of sequencing coverage of the targeted regions in KLK1 and GGCX indicated that nearly 100% of samples attained read depths of at least 15×, excluding the possibility that the associations were driven by coverage differences between cases and controls (Additional file 10: Figure S6). We next screened the entire PAH Biobank cohort, including participants of non-European ancestry, for rare deleterious variants in KLK1 and GGCX. In total, 12 cases carried KLK1 variants (10 IPAH, 2 APAH) and 28 cases carried GGCX variants (17 IPAH, 9 APAH, 1 FPAH, 1 unknown subclass) ( Table 2). Most of the participants were of European ancestry; however, for GGCX, there were also 6 cases of African and 3 cases of Hispanic ancestries. The mean age of onset was similar to that of the overall cohort for both genes (KLK1, 49 ± 6; GGCX, 49 ± 3). The variants for KLK1 included 4 LGD (1 stop-gain, 2 frameshifts, 1 splicing) and 8 D-Mis; variants for GGCX included 6 LGD (5 stop-gain, 1 frameshift), 21 D-Mis, and 1 in-frame deletion. Three KLK1 (1 LGD, 2 D-Mis) and 5 GGCX (1 LGD, 4 D-Mis) variants were recurrent in the cohort. Locations of the variant amino acid residues are shown in Fig. 5. All but 1 of the KLK1 and 2 of the GGCX missense variants, as well as the in-frame deletion, occur in conserved enzymatic domains. GGCX KLK1 belongs to a contiguous gene family cluster on chromosome 19 encoding 15 distinct peptidases. While some have highly restricted expression patterns (i.e., KLK2 and KLK3 in prostate), others are widely expressed [47]. Nine of the family members, including KLK1, are expressed in the lung and have been implicated in various lung diseases-inflammatory respiratory diseases, viral infections, and cancers [48]. We tested for enrichment of rare deleterious variants in the gene set expressed in the lung and observed a significant enrichment of LGD + D-Mis variants in European cases compared to controls (enrichment rate = 2.1, p = 0.004) (Additional file 11: Table S6A). We then performed gene-specific association analyses to determine which genes were contributing to the enrichment. The associations were stronger for IPAH than all PAH; while KLK1 was the only gene to exceed the Bonferroni-corrected threshold for significance (OR = 13.9, p = 2.00E−07 for all PAH; OR = 26.2, p = 1.00E−07 for IPAH), five additional family members had an enrichment rate of rare deleterious variants greater than 2.0 for IPAH (Additional file 11: Table S6B).

Clinical phenotypes of KLK1 and GGCX variant carriers
Hemodynamic measurements at the time of PAH diagnosis for individual carriers of KLK1 and GGCX variants are provided in Table 3. Clinical phenotypes of IPAH participants with KLK1 or GGCX variants did not differ from that of other IPAH cases without variants in known risk genes (Additional file 12: Table S7). Overall, participants with predicted deleterious variants in either gene exhibited less severe clinical phenotypes compared to participants with variants in BMPR2. Carriers of both KLK1 and GGCX variants were older at PAH onset and had decreased MPAP, increased CO, and decreased PVR compared to BMPR2 carriers (Table 3). Furthermore, both KLK1 and GGCX carriers had increased ratios of mean (systemic) arterial pressure to MPAP compared to BMPR2 carriers (MAP:MPAP, Table 3).
A known KLK1 single nucleotide polymorphism conferring at least partial loss of function occurs with high frequency in the general population [49,50]. The c.230G>A; p.R77H SNP (formerly called c.230G>A; p.R53H) has been associated with decreased urinary kallikrein activity and aberrant flow-mediated arterial remodeling but not systemic hypertension [49,50]. We screened the PAH Biobank cohort for the c.230G>A; p.R77H SNP and compared the cohort allele frequency with the frequency observed in gnomAD. No enrichment was observed in the PAH Biobank cohort (Additional file 13: Table S8), and none of the carriers of rare deleterious KLK1 variants also carried the c.230G>A;

Discussion
Using exome sequencing of a large PAH Biobank cohort recruited by 28 participating centers, followed by rare deleterious variant identification and gene-based association analysis, we identified KLK1 and GGCX as novel candidate genes for PAH. These candidate risk genes suggest new pathogenic mechanisms outside of the TGF-β/BMPR2 signaling pathway. We showed that carriers of rare, predicted deleterious variants in KLK1 or GGCX have less severe clinical phenotypes compared to carriers of BMPR2 variants. In addition, we identified 252 novel rare deleterious variants in 17 known PAH risk genes and confirmed the importance of TBX4 and SOX17 in early-onset disease as well as the association of GDF2 with IPAH.
Our study complements the recently reported findings from the UK NIHR BioResource-Rare Diseases PAH Study with some similarities and some differences. Our cohort differed from the UK cohort in size and composition. The PAH Biobank is more than twice as large as the UK cohort and includes PAH associated with other diseases (APAH), a subgroup that has not been widely studied and was not included in the UK cohort. In agreement with the findings from the UK cohort [23], we provide confirmation of GDF2 among 1832 all PAH and 812 IPAH cases of European ancestry. In total, we identified 24 variants, only 2 of which had been reported previously. GDF2 encodes a well-characterized ligand for BMPR2, and these data further confirm an important role for GDF2 in IPAH, as well as other PAH subclasses. Similar to the UK cohort, as well as our previous report of a cohort enriched in APAH-CHD cases [22], we observed a low frequency of SOX17 variants (0.4%) in the PAH Biobank. We reported enrichment of SOX17 variants in APAH-CHD; the low frequency of SOX17 variants in the PAH Biobank and UK cohort is likely due, at least in part, to the paucity of APAH-CHD cases in both cohorts. Interestingly, a genome-wide association study of common SNPs involving both the PAH Biobank and the UK cohort identified SNPs in a putative endothelialacting enhancer region of SOX17 in PAH [51], suggesting that common variants may play an important role in susceptibility to PAH.
Differences in the two studies of rare variants included lack of genome-wide association of ATP13A3 or AQP1 in the PAH Biobank and no significant association of KLK1 or GGCX in the UK cohort. We screened for ATP13A1 rare deleterious variants and identified only seven cases with variants. AQP1 not only failed to reach genome-wide significance but also was not among the expanded list of genes with p ≤ 0.001 for either the whole PAH cohort or the IPAH alone. Based on the small relative risks and associated confidence intervals from the UK cohort (RR = 0.37, CI 0.06-1.47 for ATP13A3; RR = 0.19, CI 0.004-1.53 for AQP1), it was not unexpected that that by chance we would observe no association with these genes and PAH. In terms of KLK1 and GGCX, if the effect sizes for the 2 genes are equal to the estimates from the US cohort (relative risk~12 and 4,  respectively), we would expect only 6 and 9 carriers in the UK cohort. As a result, the UK cohort would have poor power (~30% and 10%, respectively) to detect these two genes with genome-wide significance. The new PAH candidate risk genes identified in the current study, KLK1 and GGCX, are both expressed in the lung and vascular tissues and play important roles in vascular hemodynamics and inflammation, but have not been implicated in PAH previously. KLK1, also known as tissue kallikrein 1, is a major component of the kallikrein-kinin system that, together with the renin-angiotensin system, regulates blood pressure and cardiovascular function. In rodent and in vitro studies, KLK1 is constitutively expressed by endothelial cells, and endothelial activation leads to release of active protease, matrix degradation, smooth muscle cell migration, and vascular sprouting [44], processes relevant to PAH. KLK −/− mice exhibit impaired neovascularization, and adenoviral overexpression of human KLK1 promotes neovascularization in KLK −/− mice, rat mesentery arteries, and zebrafish [52]. KLK1 is part of a highly conserved, serine protease subfamily. We observed enrichment of rare deleterious variants in a gene set of nine KLK genes expressed in the lung, suggesting candidate genes for further investigation including KLK12 which may play a role in angiogenesis via indirect regulation of vascular endothelial growth factor [53]. Together, the data suggest that both deficiency and loss of function mutations in KLK1, and potentially other KLK genes, may cause impaired neovascularization of injured distal arterioles in PAH. Gene delivery of tissue KLK1 via adenoviral vectors, protein infusion, or genetically modified stem cells has shown beneficial effects in multiple models of vascular diseases [54]. There may be potential for gene delivery of KLKs as a treatment for PAH.
GGCX encodes gamma glutamyl carboxylase, responsible for the post-translational modification of vitamin K-dependent proteins involved in coagulation, soft tissue mineralization, prevention of vascular calcification, inflammation, bone formation, and cell proliferation [46].
It is unclear what all the targets of GGCX are, but mutations in GGCX could alter inflammatory responses or cell proliferation of pulmonary artery smooth muscle and/or endothelial cells in pulmonary arterioles, both hallmarks of PAH. Homozygous mutations in GGCX cause vitamin K-dependent clotting factor deficiency (MIM #277450) as well as pseudoxanthoma elasticum (MIM #264800), an ectopic mineralization disorder. None of the PAH Biobank GGCX heterozygous variant carriers had diagnoses of bleeding disorders or pseudoxanthoma elasticum. Ggcx −/− mice die pre-or perinatally due to massive bleeding, but heterozygotes are viable [55]. Thus, the heterozygous knockout mouse may provide a model for testing the effect of Ggcx on PAH phenotypes.
The differences in etiology, clinical course, and prognosis for child-vs adult-onset PAH are an area of active investigation. Previous studies have implicated BMPR2, TBX4, and SOX17 in child-onset disease. In our large PAH Biobank cohort, BMPR2 variant carriers exhibited a shift towards younger age of onset, but the overall age distribution was similar to that of the whole cohort. TBX4 exhibited a bimodal distribution with significant enrichment of variants among pediatric-onset cases. Consistent with our previous report of SOX17 in APAH-CHD, SOX17 carriers in the PAH Biobank also had a relatively young mean age of onset (26 years). The hypothesis that pediatric PAH is linked to lung growth and development [56] is consistent with roles for TBX4 and

Conclusions
In summary, we have identified KLK1 and GGCX as new candidate risk genes for PAH, accounting for~0.4% and 0.9% of PAH Biobank cases, respectively. The total percentage of cases with rare deleterious variants in known and novel genes combined was 15.1% (389/2572). The large proportion of unexplained cases can be accounted for by incomplete penetrance which requires a larger sample size, analyses being confined to protein-coding sequences, and genetic heterogeneity, as well as environmental/non-genetic factors. Incomplete penetrance is the rule in PAH with the major susceptibility gene, BMPR2, exhibiting only 20-40% penetrance. The finding of risk gene carriers across multiple PAH subclasses is not necessarily surprising. As APAH and other rare subclasses have not been studied extensively, there was no a priori hypothesis as to whether the risk variants would be observed in the same or different genes compared to IPAH/FPAH. Since there is clear overlap in pulmonary phenotypes between the subclasses, it is likely that there will be both overlapping and distinct risk genes. The growing list of PAH risk genes and variants indicates that exome sequencing may be useful in families with PAH if no genetic cause is identified with panel gene testing. However, genomic studies of larger international consortia will be necessary to better clinically characterize these rare genetic subtypes of PAH.
Ethics approval and consent to participate This study was approved by the Cincinnati Children's Hospital Medical Center Institutional Review Board (IRB) along with the individual IRBs at each of the enrolling centers' institutions. All patients have signed consent forms which are on file at the individual enrolling centers. No protected health information (PHI) on any patients enrolled in the PAH Biobank is forwarded to the PAH Biobank. Only the individual enrolling centers have the ability to re-contact any of the patients enrolled in the study. All research using these patient samples conformed to the principles of the Helsinski Declaration.

Consent for publication
Written informed consent for publication was obtained at enrollment.

Competing interests
CG-J is a full-time employee of the Regeneron Genetics Center from Regeneron Pharmaceuticals Inc. and receives stock options as part of compensation. The remaining authors declare that they have no competing interests.