Resequencing analysis of five Mendelian genes and the top genes from genome-wide association studies in Parkinson’s Disease

Most sequencing studies in Parkinson’s disease (PD) have focused on either a particular gene, primarily in familial and early onset PD samples, or on screening single variants in sporadic PD cases. To date, there is no systematic study that sequences the most common PD causing genes with Mendelian inheritance [α-synuclein (SNCA), leucine-rich repeat kinase 2 (LRRK2), PARKIN, PTEN-induced putative kinase 1 (PINK1) and DJ-1 (Daisuke-Junko-1)] and susceptibility genes [glucocerebrosidase beta acid (GBA) and microtubule-associated protein tau (MAPT)] identified through genome-wide association studies (GWAS) in a European-American case-control sample (n=815). Disease-causing variants in the SNCA,LRRK2 and PARK2 genes were found in 2 % of PD patients. The LRRK2, p.G2019S mutation was found in 0.6 % of sporadic PD and 4.8 % of familial PD cases. Gene-based analysis suggests that additional variants in the LRRK2 gene also contribute to PD risk. The SNCA duplication was found in 0.8 % of familial PD patients. Novel variants were found in 0.8 % of PD cases and 0.6 % of controls. Heterozygous Gaucher disease-causing mutations in the GBA gene were found in 7.1 % of PD patients. Here, we established that the GBA variant (p.T408M) is associated with PD risk and age at onset. Additionally, gene-based and single-variant analyses demostrated that GBA gene variants (p.L483P, p.R83C, p.N409S, p.H294Q and p.E365K) increase PD risk. Our data suggest that the impact of additional untested coding variants in the GBA and LRRK2 genes is higher than previously estimated. Our data also provide compelling evidence of the existence of additional untested variants in the primary Mendelian and PD GWAS genes that contribute to the genetic etiology of sporadic PD.


Background
PD is the second most common neurodegenerative disorder after Alzheimer's disease (AD) [1]. By the year 2030, the prevalence of PD is projected to be between 8.7 and 9.3 million [1]. Genetic studies in PD have provided valuable insights into the underlying pathogenic mechanisms [2], leading to the development of animal models for investigation of disease mechanisms and identification of novel therapeutic targets [3]. Initial studies of multiplex families with PD found concordance rates of 75 % in monozygotic twins, 22 % in dizygotic twins [5], and an increased relative risk of PD of 2.9 (95 % CI 2.2-3.8) for those with an affected first-degree relative [6]. These findings indicate that the genetic etiology of PD does not fit a simple genetic model [5]. GWAS of PD have identified variants at 20 loci influencing PD risk [2,4,[7][8][9], with population-specific differences [10,11]. The currently identified genetic factors explain only 6-7 % of the phenotypic variability associated with PD [12], and the most prevalent GWA signals account for only 3-5 % of PD genetic variance in individuals of European ancestry [12]. These results provide unequivocal, compelling evidence for the existence of undiscovered genetic factors that contribute to the etiology of PD. Both candidate gene association studies and GWAS repeatedly validate that the most statistically significant signals associated with PD are common variants located close to SNCA, LRRK2, MAPT genes and low frequency coding variants in the GBA gene [2,4,7,10,[13][14][15][16].
Non-coding variants are the most significant single nucleotide polymorphisms (SNPs) identified near the MAPT and SNCA genes by GWAS [4]. To date, the functional variants driving such associations are unknown. We hypothesize that low frequency or rare coding variants can be identified by re-sequencing the MAPT and SNCA genes. In addition, deep-sequencing LRRK2 and GBA genes can not only identify additional untested coding risk variants but also protective alleles, as previously reported in these genes [17].
Highly penetrant mutations in the SNCA and LRRK2 genes are found in families with autosomal dominant inheritance, whereas autosomal recessive families with a typical PD phenotype carry mutations in the PARK2/ PARKIN, PARK6/PINK1 and PARK7/DJ-1 genes [18]. Most genetic studies in PD have focused on sequencing a particular Mendelian gene in familial or early onset PD, or have directly screened few variants in sporadic PD cases in small samples [18]. A systematic study that sequences all of these genes (SNCA, LRRK2, PARK2, PINK1 and PARK7) in a large PD dataset has not been reported in European Americans [19,20]. Thus, we used next-generation sequencing technology to re-sequence five Mendelian and the top GWAS susceptibility PD genes in a well-characterized case-control European American dataset (478 cases and 337 healthy controls) to identify both risk and protective low frequency or rare variants for PD.

Results
We performed pooled DNA-targeted deep-sequencing of the protein-coding regions of 7 genes, including 5 genes previously reported to most frequently cause familial forms of PD (SNCA, LRRK2, DJ-1, PARK2 and PINK1) and 2 genes that have significant associations in GWAS with sporadic PD (GBA and MAPT genes) in 478 PD patients and 337 healthy individuals of European-American descent from the Washington University in Saint Louis Movement Disorder Clinic (Table 1) [15,21]. This cohort contains 83 % late-onset PD (LOPD) and 74 % sporadic PD cases.

Rare variants in a European American case-control sample
We validated missense and splice-affecting variants with a predicted minor allele frequency (MAF) <5 %. In this European-American descent sample, a total of 47 lowfrequency (0.5-5 %) and rare (<0.5 %) non-synonymous coding variants were validated. 36

Novel variants
8.5 % (4/47) of the total variants are novel and not present in public databases (accessed on June 11th, 2015). All of the novel singleton variants located on LRRK2, p.D1887N and p.S885C, and GBA, p.T336S genes are present exclusively in LOPD patients ( Table 3). The PINK1 p.R147C, variant was found in one control individual but was not present in public datasets.

Copy number analysis
We observed a single structural genomic variant in a 70year-old man with a family history of PD (1/126; 0.8 %; Fig. 1). B allele frequency and log R ratio indicate that this variant is an intra-chromosomal duplication at the SNCA locus. We did not identify this duplication, or any duplication at this locus, in control individuals. No other exonic rearrangements were observed in any PD patient in the PARK2, DJ-1 or PINK-1 loci.

Variants of unclear and unknown pathogenicity
34.9 % (15/43) of the variants located on LRRK2 (four variants), PARK2 (six variants), PINK1 (4 variants) and PARK7 (one variant) have been reported previously and their pathogenicity is unclear ( Table 2). Although the cumulative frequency of these variants is higher in PD patients (4.4 %) compared to controls (3.8 %), this difference is not statistically significant (p = 0.7; OR = 1.14, 95 % CI = 0.56-2.32), suggesting that either most of these variants are very unlikely to be true risk factors for PD or our sample size is not large enough to detect such differences.
The MAPT p.A152T variant has been associated with other neurodegenerative diseases including AD and frontotemporal dementia (FTD) [23]. In our study, the MAPT p.A152T variant occurs in 0.8 % (4/478) of PD cases but in none of the controls (0/337, p = 0.09).

Gene-burden analyses
To determine whether rare variants in the LRRK2, DJ1, PARK2, PINK1, GBA or MAPT genes contribute collectively to PD risk, we performed a gene-burden association test using the optimal SNP-set sequence kernel association test (SKAT-O) [24]. Gene-based association testing achieved significance for GBA When we exclude p.T408M from the analysis, the risk of PD conferred by GBA variants is not significant (p = 0.39), which suggests that p.T408M may be the primary driver of the association with PD risk. These findings highlight the importance and necessity to sequence the entire GBA gene as opposed to genotyping only known risk variants for PD. We also found a significant enrichment of coding variants in the LRRK2 gene in PD cases compared to controls (p = 0.01, OR = 1.86, 95 % CI = 1.14-3.02) ( Table 6), which suggests that there are other risk variants in the LRRK2 gene in addition to the known pathogenic p.G2019S mutation.
No significance was found for the MATP2, PARKIN, PINK1 and DJ-1 genes Effect on age at onset (AAO) of PD GBA variant carriers tend to exhibit an earlier AAO than non-carriers [25]. Thus, we tested whether GBA variants affect AAO; we found that GBA variants carriers have a earlier AAO than non-carriers (54 years. vs. 62 years.; p < 0.0001) (Fig. 2a). Interestingly, when restricted to carriers and non-carriers of p.N408M using the same model, carriers had a 5.0-year-earlier onset than noncarriers (57 years. vs. 62 years.; p = 0.006) (Fig. 2b).

Discussion
Disease-causing variants in the SNCA, LRRK2, PARKIN, PINK1 and DJ-1 genes have been found in familial early onset forms of PD [18]. In this study, we systematically screened for rare variants and pathogenic mutations in the SNCA, LRRK2, PARK2, PINK1, PARK7, MAPT and GBA genes in a series of well-characterized PD casecontrol samples. A total of 47 low-frequency and rare non-synonymous coding variants were validated.

Most common pathogenic variants in this cohort
Nine individuals (1.9 %) of the total sample of PD patients carry a known pathogenic mutation in two Mendelian genes, LRRK2 p.G2019S and PARK2 p.D53X. Among patients with a family history of PD, 5.6 % (7/126) carry a known pathogenic mutation. In this cohort, we found that among the sequenced genes, the LRRK2 gene was enriched with multiple variants, accounting for 36.2 % of all the validated variants. The LRRK2 p.G2019S mutation is significantly associated with risk of PD and occurs in 1.7 % of PD patients.   [17]. Both frequencies are similar to the frequency reported here of 0.6 % (2/ 352) in sporadic PD patients. Our gene-based analysis found a significant association with the LRRK2 gene, which suggests that there are additional risk variants in LRRK2 affecting PD risk. We also detected a SNCA locus duplication in a 70year-old man with a family history of PD (1/126; 0.8 %; Fig. 1) and a 3-year history of parkinsonism. This PD patient exhibited clinical features indistinguishable from idiopathic PD. As expected, we found no coding mutations in the SNCA gene in this cohort. Point mutations in the SNCA gene are extremely rare and have been identified mostly in familial and EOPD [18]. The most common variation found in the SNCA gene are copy number variations (CNVs). SNCA duplications are not fully penetrant and are associated with variable clinical features, ranging from early-onset with dementia and psychiatric features to late-onset sporadic [27].
A recent report examining rare variants in the main Mendelian PD genes in a small case-control sample consisting of 249 cases and 145 controls of European origin (Spanish) found an enrichment of rare functional variants in PD cases [20]. They reported that up to 3.6 % of patients with sporadic PD are carriers of known pathogenic mutations in different Mendelian genes. The difference in the frequency of pathogenic mutations reported here (1.9 %) and that reported by Spataro (3.6 %) [20] is likely due to differences in methodology (exome sequencing data vs pooled-targeted sequencing) and to the different genetic background of the samples (Spanish vs North American).

Most common risk variants in this cohort
Heterozygous mutations in the GBA gene can be considered as low penetrance variants with autosomal dominant inheritance for PD [28]. In this study, fifty-three (11 %) of the PD patients and fifteen (4.5 %) controls carry heterozygous variants in the GBA gene (p = 1.0 × 10 −3 ; OR = 2.17, 95 % CI = 1.36-3.46), which indicate that GBA coding variants increase risk for PD in this cohort. We also have demonstrated that those patients with PD carrying a GBA variant experience a disease onset 6 years earlier than patients without GBA variants. Interestingly, GBA variants mainly affect AAO of LOPD patients. Two GBA variants (p.N409S and p.L483P) have consistently been reported to be associated with increased PD risk in both, Ashkenazi Jewish and non-Ashkenazi populations [29]. Here, the p.N409S (MAF = 0.007) and p.L483P (MAF = 0.007) variants, are present in 2.9 % (14/478) of PD patients and 0.9 % (3/337) of controls. These allelic frequencies agree with previous reports [29]. We found that both variants are overrepresented in PD cases compared to controls, but they only reached statistical significance after including a larger control sample from publicly available databases. In addition, we report for the first time, an association between PD risk and the GBA variant p.T408M (MAF: 0.018). p.T408M is considered a polymorphism because it has been found in control populations [25,30]. In this dataset, the GBA p.T408M variant drives the gene-based association with risk for PD. In the largest Non-Ashkenazi case-control sample studied to date, the GBA p.T408M variant was not significantly associated with PD [29]. This discrepancy could be explained by the heterogeneity of populations included in that study as it was enriched with individuals from populations in which the p.T408M variant is absent or very rare. The p.E365K allele is a Fig. 2 a. Cumulative incidence rates of PD among carriers and non-carriers of all GBA variants. b. Cumulative incidence rates of PD among carriers and non-carriers of the p.N408M variant. Survival fractions were calculated using the Kaplan-Meier method and significant differences were calculated by Log-rank test hypomorphic variant (42.7 % of wild type activity) [31] often found in cis or trans with other Gauchercausing non-synonymous mutations [32], exhibiting a frequency that is similar in controls and Gaucher patients [33]. We found that p.E365K achieves nominal significance (p = 0.02; OR = 1.69, 95 % CI = 1.06-2.67) after including controls from public databases. Interestingly, the OR found here is similar to those reported previously [33,34]. Both p.T408M and p.E365K have been described as "mild" mutations or modifier alleles. In our study, we did not observe a "second" mutation that occurred with either p.T408M or p.E365K, which suggests a second hit may exist as an interacting factor, similar to those described in a traditionally considered non-pathogenic variant in AD [35]. Interestingly, we found seven PD patients carrying PD risk variants in two of all screened genes, further suggesting a double-hit mechanism impacting the risk for PD (Table 7), as reported by the presence of variants in the LRRK2 and GBA genes in PD patients [36].
We also found that the MAPT p.A152T variant occurs in 0.8 % (4/478) of PD cases but in none of the controls (0/337, p = 0.09). It is possible that the MAPT p.A152T variant increases PD risk, but this association needs further confirmation in additional series.
Among the eight variants validated in PARK2, we found a stop-codon, p.D53X, in an EOPD (early onset PD) patient with a family history of PD. We also found one control individual carried the PINK1 (p.R492X) variant. We validated just a single variant DJ-1, p. A179T in a 56 year old PD patient with no family history of PD. All of these variants in recessive genes were found a heterozygous manner. Truly causative variants in PARK2, PINK1 or DJ-1 are present in a homozygous or heterozygous compound manner, but we cannot exclude the possible role of heterozygous variants on risk of sporadic PD. It is important to highlight that the most common pathogenic mutations in these genes are exon rearrangements or copy number variations. We did not detect exonic rearrangements in these genes in our cohort. The high proportion (83 %) of LOPD and sporadic cases (74 %) in our sample may explain the low number of validated variants found in the recessive genes.

Novel variants
We uncovered four novel variants (LRRK2, p.D1887N and p.S885C), (PINK1, p.R147C), and (GBA, p.T336S) in 0.8 % of PD cases. LRRK2, p.D1887N is located in the kinase domain and could play a functional role. The rareness of and the impossibility to expand the segregation studies with these variants to additional family members make its clinical interpretation challenging. However, finding novel variants in sporadic late onset PD suggests that it is possible to uncover such variants in genes linked to Mendelian PD or even in PD cases with an unclear pattern of inheritance. This is supported by our gene-based analysis, which demonstrates that additional untested variants in the GBA and LRRK2 genes contribute to the role of these genes in PD risk.

Conclusions
In summary, our results confirm the strong effect of GBA and LRRK2 on sporadic PD risk. However, our gene-based analyses demonstrates that non-synonymous GBA variants can have a greater impact on PD risk than LRRK2. In this cohort, the more common pathogenic mutations are located in the LRRK2 gene. Multiple GBA gene variants confer the highest risk for PD in our sample. We report novel interactions between variants in the GBA and LRRK2 genes as double hits affecting PD patients with no family history of PD. Our results also suggest that novel and untested variants in the GBA and LRRK2 genes influence PD risk. This has important implications on the genetic information provided to patients and families and potential new therapeutic approaches for PD patients. Our findings also strongly support the role of the lysosomal system as a pathogenic pathway in PD. Further work is necessary to clarify the role of specific and very rare variants in these genes on risk and PD phenotype.

Ethics statement
The Institutional Review Board (IRB) at the Washington University School of Medicine in Saint Louis approved the study. Prior to their participation, written informed consent was reviewed and obtained from family members. The Human Research Protection Office (HRPO) approval number for our ADRC Genetics Core family studies is 201104178.

Samples
Samples included 478 PD patients and 337 healthy individuals from the Washington University in Saint Louis Movement Disorder Clinic (MO, USA) [15,21,37]. All were examined by experienced movement disorder clinicians (J.S.P.). PD diagnosis was established according to the UK Brain Bank criteria.

Statistical and association analyses
For each variant, allele frequencies were calculated in cases and controls, and a χ2 test on allelic association was performed. A p-value of 0.05 was set as nominal significance threshold. The multiple-testing correction cutoff for the single-variant analysis using Bonferroni correction for 47 tests is 1.0 × 10 −3 (0.05/47). We used Plink (http://pngu.mgh.harvard.edu/~purcell/plink/) to analyze associations [38]. The gene-based association was performed using SKAT-O, which utilizes the R package SKAT [24]. All variants were included in the model independent of their clinical interpretation. The influence of the genetic variants on AAO was carried out using the Kaplan-Meier method and tested for significant differences using a log-rank test.

Pooled-DNA sequencing experiment
Pooled-DNA sequencing was performed as described previously [35,39,40]. Briefly, equimolar amounts of individual DNA samples were pooled together after being measured using Quant-iT PicoGreen reagent. Two different pools with 100 ng of DNA from 114 and 98 individuals were made. The coding exons and flanking regions (a minimum of 50 bp each side) were individually PCR amplified using specific primers and Pfu Ultra high-fidelity polymerase (Stratagene). An average of 20 diploid genomes (approximately 0.14 ng DNA) per individual were used as input for a total of 62 PCR reactions that covered 46,319 bases from the 7 genes. PCR products were cleaned using QIAquick PCR purification kits, quantified using Quant-iT PicoGreen reagent and ligated in equimolar amounts using T4 Ligase and T4 Polynucleotide Kinase. After ligation, concatenated PCR products were randomly sheared by sonication and prepared for sequencing on an Illumina Genome Analyzer IIx (GAIIx) according to the manufacturer's specifications. pCMV6-XL5 amplicon (1908 base pairs) was included as a negative control. As positive controls, ten different constructs (p53 gene) with synthetically engineered mutations at a relative frequency of one mutated copy per 250 normal copies were amplified and pooled with the PCR products. Six DNA samples heterozygous for previously known mutants in MAPT gene were also included. Single reads (36 bp) were aligned to the human genome reference assembly build 36.1 (hg18) using SPLINTER [41]. SPLINTER uses the positive control to estimate sensitivity and specificity for variant calling. The wild type: mutant ratio in the positive control is similar to the relative frequency expected for a single mutation in one pool (1 chromosome mutated in 125 samples = 1/250). SPLINTER uses the negative control (first 900 bp) to model the errors across the 36-bp Illumina reads and to create an error model from each sequencing run of the machine. Based on the error model, SPLINTER calculates a p-value for the probability that a predicted variant is a true positive. A p-value at which all mutants in the positive controls were identified was defined as the cut-off value for the best sensitivity and specificity. All mutants included as part of the amplified positive control vector were found upon achieving >30-fold coverage at mutated sites (sensitivity = 100 %) and only ∼ 80 sites in the 1908 bp negative control vector were predicted to be polymorphic (specificity = ∼95 %). The variants with a p-value below this cut-off value were considered for follow-up confirmation.

Genotyping
All rare missense or splice site variants identified by SPLINTER were validated by directly genotyping all sequenced individuals using Sequenom iPLEX or KASPar genotyping systems as described previously [42][43][44]. The validated SNPs were then genotyped in all members of the series. An average coverage of 30-fold per allele per pool is the minimum coverage necessary to obtain an optimal positive predictive value for the SNP-calling algorithm [41]. The necessary number of lanes to obtain a minimum of 30-fold coverage per base and sample were run.

Copy number variation analysis
The B Allele frequency and Log R Ratio were used to identify genomic deletions and duplications as previously described [45] using NeuroX chip data [46].