Novel multiple sclerosis susceptibility loci implicated in epigenetic regulation

Genome-wide study in Germans identifies four novel multiple sclerosis risk genes and confirms already known gene loci.


INTRODUCTION
Multiple sclerosis (MS) is an autoimmune disease of the central nervous system. Human leukocyte antigen (HLA) alleles, located within the major histocompatibility complex (MHC) region, have been identified as major genetic determinants for the disease (1, 2). In addition, more than 100 non-MHC MS susceptibility variants have been described (3,4). Many of the genes carrying known susceptibility variants are involved in the regulation of either immune cell differentiation or signaling (4)(5)(6)(7)(8). However, because the heritability of MS is limited (9), environmental contributions to disease etiology are also important (10). Environmental influences can alter gene expression via epigenetic mechanisms (11). Epigenetic alterations, such as DNA methylation or histone modifications, have been observed in tissues and cells of MS patients (8,(12)(13)(14). Nevertheless, the impact of epigenetic regulation in MS is not yet understood.The known genetic variants outside the MHC region have predominantly been established in large international collaborative studies. To achieve large sample sizes with the power to detect associations, these studies have combined sample sets from diverse ethnic populations (4)(5)(6). So far, the variants affecting MS susceptibility identified in these studies account for only 25% of disease heritability under an additive model of heritability (3), warranting for additional studies to fully unravel the genetic contribution to disease susceptibility. In contrast to the previously investigated large interna-tional cohorts, we have strived to examine the genetic contribution to MS susceptibility in a more homogeneous population, focusing entirely on German cases and controls. The genetic substructure among Germans is low (15). We therefore expected to have sufficient power to detect novel associations with moderate effect sizes in a data set showing little population stratification. Using a total of 4888 cases and 10,395 controls, we had 80% power to detect genome-wide significant associations with an odds ratio (OR) of 1.2 involving common variants with a minor allele frequency (MAF) of 21%. For rare singlenucleotide polymorphisms (SNPs) (MAF, 1%), the power surpassed 80% for an OR of 1.9.

Genome-wide association analyses
We recruited patients with either MS or clinically isolated syndrome (CIS) from MS centers throughout Germany and combined them with controls from several German population-based cohorts (Table 1). After quality control (QC), the data set DE1 consisted of 3934 cases and 8455 controls (control/case ratio, 2.15; table S1). We also compiled a second data set, called DE2, based on an independent group of onset was not normally distributed, rank-based inverse normal transformation was applied. The HLA allele most strongly associated with transformed age at onset was DRB1*15:01 (effect size, −0.21; P = 7.6 × 10 −6 ). When conducting a genome-wide analysis of transformed age at onset in all 1519 patients, no variant passed the threshold for genome-wide significance ( fig. S3, A and B). The most strongly associated SNP was rs4959027 (effect size, −0.20; P = 1.5 × 10 −7 ; fig. S3, C and D), which is in LD with DRB1*15:01 (r 2 = 0.72). After conditioning for DRB1*15:01 in the subset of cases with both age at onset and imputed HLA alleles available, the P value of rs4959027 was increased from 1.1 × 10 −6 to 4.8 × 10 −2 . We conclude that our findings for the MHC region are very well in line with previous studies and concentrated further analyses on associations with case/control status outside this region.
Associations outside the MHC region Variants at 15 loci outside the MHC region showed genome-wide significance ( Fig. 1, figs. S4 and S5, Table 3, and table S4). Ten of these loci have already been established in previous large MS GWAS (3,4,6). One more locus, DLEU1 (deleted in lymphocytic leukemia 1), was only recently confirmed to be associated with MS in a candidate gene study (21). The remaining four signals are thus novel candidates for MS susceptibility loci. The lead variants at all 15 non-MHC loci showed P < 5 × 10 −6 in DE1 and lower P < 5 × 10 −8 in the pooled analysis of DE1 and DE2 and have thus replicated in DE2. We could not detect any significant interaction among the 15 top non-MHC variants or between them and SNP rs3104373 within the MHC region.
For validation of our findings, we compared our results to the largest study on MS genetic susceptibility published to date (Fig. 2) (4). Of the 108 non-MHC variants showing genome-wide significant or suggestive associations with MS in the published study, 104 variants were present in our data and could be analyzed. All of them showed the same direction of effect (P = 5 × 10 −32 , binomial sign test; CI, 0.97 to 1.00), 84 with nominal (P < 0.05) and 10 with genome-wide significance (P < 5 × 10 −8 ). Fifty-eight of the variants had lower ORs and 35 had higher ORs in our data than in the published data set (4). It was expected to observe more signals with lower ORs than previously reported due to regression toward the mean. Next, we examined the four novel loci and DLEU1 not found at genome-wide significance in a GWAS before in more detail. We investigated whether the five lead variants at these loci are significantly associated with MS in our German cohort only or whether they replicate in Sardinians, a genetically distinct population with low genetic heterogeneity. This independent Sardinian cohort consisted of 2903 cases (69.2% female, 1.2% PPMS) and 3323 controls (control/case ratio, 1.15) (22)(23)(24). Two of the variants (rs2812197 and rs4925166) replicated with P < 0.01 in the Sardinian data set; two more (rs34286592 and rs2836425) showed the same direction of effect but did not reach nominal significance (Table 3, table S4, and fig. S6).

SHMT1 as a novel MS susceptibility gene
The association of rs4925166 constituted the strongest signal among the novel variants. It showed an OR of 0.85 (CI, 0.81 to 0.90) and a P value of 2.7 × 10 −9 in the pooled analysis of German data sets (Table 3). This variant replicated in the Sardinian cohort with a joint P value of 7.4 × 10 −12 ( fig. S6D). SNP rs4925166 is located on chromosome 17 in an intron of the gene TOP3A, coding for the DNA topoisomerase IIIa. However, strongly associated SNPs in this genomic region spread over several neighboring genes (Fig. 3A). We therefore conducted an expression quantitative trait locus (eQTL) analysis using a subset of 242 patients from data set DE1 to functionally link variants to nearby genes. We examined transcripts within a cis window of 1 million base pairs upstream and downstream of the lead variant for an association of blood gene expression levels with allele configuration (table S5). The variant rs4925166 and proxy SNPs (r 2 > 0.7) were found to be part of  a strong eQTL with the gene SHMT1 in DE1 samples [false discovery rate (FDR), 2.99 × 10 −10 ; Table 4 and fig. S7, A to C]. This eQTL was replicated in two independent control data sets [Max Planck Institute of Psychiatry (MPIP) data (25) and Grady Trauma Project (GTP) (26)(27)(28)] and in the publicly available GTEx eQTL database (29) (Table 4).
To investigate how rs4925166 influences the expression of SHMT1, we conducted an association analysis of the SNP with DNA methylation levels in blood. DNA methylation is an important epigenetic mechanism for regulation of gene expression. We tested the association between rs4925166 and DNA methylation levels at CpG sites in the two non-MS data sets MPIP and GTP. Methylation levels at 157 CpG sites that mapped to SHMT1 were examined for an association with genotype. We observed eight significant (FDR <0.05) methylation QTLs (mQTLs) between rs4925166 and CpGs in SHMT1 within the MPIP data set. Three of these associations could be replicated in the GTP data set (table S6  and fig. S7, D and E).
We wondered whether the CpG site showing the strongest association with rs4925166 (cg26763362) could fully explain the observed association between the SNP and SHMT1 expression (causal direction: rs4925166→cg26763362→SHMT1 expression) using mediation analysis (Table 4, tables S7 and S8, Fig. 3) (30). We observed partial mediation of the effect of rs4925166 on SHMT1 expression by DNA methylation status of CpG site cg26763362. The association pattern indicates that an additional factor influences the relationship between the SNP, the CpG, and the gene expression (see the Supplementary Materials). Thus, we conclude that the genotype of rs4925166 affects the expression of SHMT1 in a complex fashion, partially involving rs4925166-dependent DNA methylation.
Additional novel candidate loci associated with MS Three loci showed genome-wide significance in the pooled analysis of German data sets DE1 and DE2 but not in Sardinians ( Table 3). The strongest association, SNP rs4364506, was found on chromosome 6 and is located in an intron of the gene coding for the transcriptional regulator L3MBTL3 [Lethal (3) When conditioning for the lead variants at the four newly identified MS-associated loci, no evidence for secondary signals was found. Thus, the lead variants also constitute the most likely causal variants. These variants all map to introns of genes. This makes a functional link between each variant and the gene it is located in probable. To further explore the functional connections between SNPs and genes, we conducted an eQTL analysis of the 15 loci showing genome-wide significant associations. We thereby identified four cis-eQTLs with FDR <0.05 in MS cases (table S5). In addition to the eQTL of rs4925166 and SHMT1 already described above, three more significant eQTLs involved variants at two previously known MS susceptibility loci and three transcripts of the genes MMEL1 and ANKRD55.
Fine-mapping of DLEU1 Three variants located on chromosome 13 (rs806321, rs9596270, and rs806349), all intronic within the gene for the long noncoding RNA  (30), except for total effect (*), which was calculated by linear regression. Results were obtained using 1 million simulations. Effects and P values shown here differ from Table 5, as a lower number of samples contained both expression and methylation data than expression data alone. (C) Relationship between cg26763362 methylation, SHMT1 expression, and rs4925166 genotype in MPIP controls. DLEU1, have been described previously as associated with MS in three large studies (4-6), yet the variants did not show genome-wide significance in any of them. The association of rs806349 has recently been confirmed in a candidate-driven follow-up analysis of suggestive MS associations (21). However, the variant rs806349 reached a P value of only 2.7 × 10 −4 in our analysis (Table 5). Instead, a different SNP (rs2812197) in weak LD with rs806349 (r 2 = 0.4) showed genome-wide significance in the pooled analysis of DE1 and DE2 and also replicated in Sardinians (Tables 3 and 5, and fig. S5K). The association of previously described rs806349 is completely dependent on the more strongly associated rs2812197 (Table 5). Thus, it is unlikely that rs806349 is the causal SNP at this locus. The same is true for rs806321 (5), which is not independent of rs2812197 either ( Table 5). The DLEU1 locus contains evidence for a second signal, rs9591325 (Table 5 and fig. S5L), in poor LD with rs2812197, but in high LD with rs9596270, which was identified by Patsopoulos et al. (6) as a suggestive MS-associated variant. The two signals were partially independent of each other (Table 5). SNP rs9591325 is located in a clearly functional region with binding sites for many transcription factors, which is not the case for the other four variants ( fig. S8, B to F). Although rs2812197 shows the overall strongest association at DLEU1, the functional data indicate that rs9591325 might be either the actual or a second causal variant. Additional studies with larger sample sizes are required to fully answer this question.

DISCUSSION
The present study constitutes the largest GWAS on MS conducted in a single population to date. By pooled analysis of 3934 cases in data set DE1 and 954 cases in data set DE2, we identified strong associations in the MHC region with a P value of up to 1.3 × 10 −234 . In addition, 15 loci outside the MHC region were associated at a genome-wide significant level ( Fig. 1 and Table 3). Associations in the MHC region were examined using imputed HLA alleles. Stepwise conditional logistic regression identified DRB1*15:01 and six more associated HLA alleles (Table 2), in line with results from previous studies (2). All genome-wide significant and suggestive non-MHC MS susceptibility variants published by the IMSGC in 2013 (4) and present in our data (n = 104) were replicated regarding direction of effects in our samples (P = 5 × 10 −32 ; Fig. 2).
Four of the 15 non-MHC loci have not been found to be associated with MS in previous studies. One more locus, DLEU1, did not reach genome-wide significance in previous GWAS but has recently been confirmed as MS-associated in a candidate SNP study (21). The lead variants at DLEU1 and at the novel locus SHMT1 replicated in an independent Sardinian cohort containing 2903 cases (Table 3 and fig. S6). Variants at the other three novel loci did not reach nominal significance in Sardinians, yet two of them showed the same direction of effect. Because of their consistency and replication within the German cohorts, these three associations can nevertheless be considered as plausible. As the Sardinian population is genetically distinct from Germans, future studies are required to replicate these findings in other cohorts.
Previous genetic analyses of MS susceptibility have indicated immune system-related processes as relevant for the development of MS (4). Functions of known MS susceptibility genes have been mapped to KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways Janus kinase/signal transducers and activators of transcription (JAK/STAT) signaling, acute myeloid leukemia (AML), and T cell receptor signaling (7). Accordingly, MS-associated genes are predominantly expressed in immune cells (7,8). The five genes examined in detail in our study (L3MBTL3, DLEU1, MAZ, ERG, and SHMT1) are associated with regulatory mechanisms in immune cells as well.
The gene L3MBTL3 encodes a Polycomb group protein that maintains the transcriptionally repressive state of genes (31) and is frequently deleted in several forms of acute leukemia, including AML (32). Genes Table 4. eQTL and mQTL analysis for rs4925166. Direction of effect is relative to the minor allele T. Note that the effect sizes cannot be directly compared because normalization methods and covariates partly differ between studies. Additional eQTLs and mQTLs are described in the Supplementary Materials. Because only the single eQTL rs4925166/SHMT1 was examined in GTEx data, no FDR is indicated here. NA, not applicable.  associated with AML constitute one of the most significant pathway categories linked to MS susceptibility variants (7). The murine ortholog of L3MBTL3, MBT-1, has been found to regulate maturation of myeloid progenitor cells (33). The regulatory long noncoding RNA DLEU1 is often deleted in cases of B cell chronic lymphocytic leukemia and mantle cell leukemia (34). This locus regulates the expression of NF-kB (35), a transcription factor implicated in MS pathology (4,36,37). MAZ is an inflammation-responsive transcription factor (38) up-regulated during chronic myeloid leukemia (39). It binds to the promoter of the gene MYC, which is associated with MS (5). The transcription factor ERG is important for hematopoiesis (40), and the expression of this oncogene is associated with both AML and acute T cell lymphoblastic leukemia (41). ERG regulates the expression of MS-associated NF-kB (42), as DLEU1 does. Finally, SHMT1 is a serine hydroxymethyltransferase acting in the folate cycle. It catalyzes the transfer of a carbon unit subsequently used for synthesis of both nucleotides and methionine. SHMT1 is thus an essential component in the metabolism of the substrate S-adenosylmethionine (SAM), the major methyl group donor during both protein and DNA methylation (43,44). By this effect on regulation of gene expression, one-carbon metabolism plays an important role in oncogenesis. Lack of SHMT1 function is, among other effects, associated with acute lymphocytic leukemia (44)(45)(46). Thus, each of the five genes is involved in regulatory processes of the immune system. Although a clearer picture has already emerged regarding the cell types and broad pathways relevant for the etiology of MS (3,7), little is still known about the mechanisms by which risk genes act. Analysis of the known functions of the five genes examined in this study revealed that four of them regulate transcription, especially of immune-related genes. Moreover, indirect evidence suggests that they could all be linked either directly or indirectly to epigenetic regulatory mechanisms: L3MBTL3 recognizes epigenetic histone lysine methylation (31) and ERG interacts with ESET, a histone H3-specific methyltransferase (47). The best known regulatory target of the transcription factor MAZ is MYC (48), a regulator of epigenetic chromatin state that is associated with MS (5,49). DLEU1 is strictly regulated by DNA methylation at its promoter region (35). Finally, SHMT1 is essential for maintaining methylation homeostasis in the cell by catalyzing an important reaction in the generation of the methyl donor substrate SAM. Accordingly, the establishment of SHMT1 as an MS risk factor further puts epigenetic regulation by methylation in the focus of MS susceptibility.

Expression
In recent years, several studies have addressed the role of DNA methylation in the etiology and progression of MS. Methylation differences between MS cases and healthy controls have been analyzed in small, cross-sectional studies. Despite negative results in CD4 + cells (12,50), Bos and colleagues (12) recently observed significant differences in overall DNA methylation levels in CD8 + T cells. Another study demonstrated differentially methylated and expressed genes in brain tissue of MS patients compared to controls (14). Furthermore, differential methylation of the major risk locus HLA-DRB1 was observed in MS patients (51). Several groups have found either hypermethylation or hypomethylation of specific genes to be associated with inflammation or demyelination in MS patients (11).
In summary, these studies argue in favor of DNA methylation being relevant for the development of MS. By finding novel risk genes with potential roles in epigenetic regulation, our study adds further indication that epigenetic mechanisms might be important for MS susceptibility. A disturbed homeostasis of methyl donors, caused by an altered expression of SHMT1, is likely to have an impact on the disease. As epigenetic mechanisms constitute a major route for environmental risk factors to influence expression of disease-associated genes (11), regulation of DNA and protein methylation is an interface where genetic and environmental risk factors for MS might intersect. Detailed analyses of DNA methylation patterns and their interaction with MS susceptibility genes in larger cohorts and among different cell populations and tissues are now required to better understand the role of epigenetic mechanisms in MS.

Study samples
Two cohorts of cases, referred to as DE1 and DE2, were analyzed. Both data sets included patients with CIS, bout-onset MS, and PPMS. For cohort DE1, 4503 cases were recruited across multiple sites in Germany (for details, see the Supplementary Materials). For cohort DE2, 1002 cases were recruited across multiple sites in Germany (see the Supplementary Materials). The latter cohort was used in a previous publication (5). Controls for these cohorts were obtained from several population-based cohorts across Germany to match the different geographical regions where cases were recruited: KORA from the southeastern Germany region of Augsburg (52,53), HNR from central western Germany (54), SHIP from the northeastern region of West Pomerania (55), DOGS from Dortmund in central western Germany (56), and FoCus (57) and PopGen (58) from Kiel, northern Germany. In addition, controls from two studies on depression conducted in southeastern Germany were included (59,60). For a more detailed description of control cohorts, see the Supplementary Materials. All responsible ethics committees provided positive votes for the individual studies. All study participants gave written informed consent. In case of minors, parental informed consent was obtained.
Genotyping and QC Samples of cohort DE1 were genotyped using the Illumina Human-OmniExpress-24 v1.0 or v1.1 BeadChips. Samples of cohort DE2 were genotyped using the Illumina Human 660-Quad platform. For both cohorts, identical, stringent QC was conducted on samples and variants. QC steps on samples included removal of individuals with genotyping rate <2%, cryptic relatives (relatedness ≥1/16), and genetic population outliers. QC steps on variants included removal of variants with call rate <2% and MAF <1%. For a full description of QC, see the Supplementary Materials. Each set of cohorts was combined with controls genotyped on similar arrays, producing case/control data sets DE1 and DE2. QC was repeated on the merged data sets, leading to final figures of 3934 cases and 8455 controls for DE1 (table S1), as well as 954 cases and 1940 controls for DE2 (table S2).
HLA alleles were separately imputed from genotyping data for DE1 and DE2 using HIBAG v1.6.0 (20). Alleles with a posterior probability >0.5 were converted to hard calls. Results were validated using HLA typing of 442 patients from DE1 (see the Supplementary Materials).
Statistical analyses of genotype data GWAS was conducted on data sets DE1 and DE2 using PLINK2 v1.90b3s (61). Sex and the first eight MDS components were used as covariates in logistic regression. Data sets were combined using a fixed-effects model in METASOFT (62). For maximum precision, logistic regression and meta-analysis of lead SNPs were repeated in R v3.2.3 using package meta v4.3.2. All follow-up analyses (for example, conditional and interaction analyses) were conducted in R. Locusspecific Manhattan plots were generated using LocusZoom with European samples of the 1000 Genomes March 2012 reference panel on the hg19 build (63). For analysis of HLA alleles, stepwise logistic regression was conducted in R as previously described (1, 2, 5).

Gene expression and methylation data
For a subset of 242, mostly treatment-naïve patients from data set DE1 (73 male and 169 female) whole-blood RNA was collected using Tempus Blood RNA Tubes (Applied Biosystems). RNA was hybridized to Illumina HT-12 v4 Expression BeadChips (Illumina) and further processed as described in the Supplementary Materials. In summary, QC was conducted in R 3.2.1 using the packages beadarray and lumi (64,65). Probes were transformed and normalized through variance stabilization and normalization (66). Probes, which showed a detection P < 0.05 in more than 10% of the samples, which could not be mapped to a known transcript, or which were identified as cross-hybridizing by the Re-Annotator pipeline (67), were removed. This left 20,302 transcripts from 242 samples. Technical batch effects were identified by inspecting the association of the first two principal components of expression levels with amplification round, amplification plate, and amplification plate column and row, as well as expression chip. The data were then adjusted using ComBat (68). Gene expression and methylation data of the two control cohorts MPIP and GTP were published and described previously and are summarized in the Supplementary Materials (25)(26)(27)(28).

Statistical analysis of gene expression and methylation data
For each of the 15 genome-wide significant loci, all 429 transcripts beginning or ending within 1 Mbp upstream or downstream of a lead variant were determined. Associations between genotype and expression levels were examined in data set DE1 by linear regression using sex, age, and three MDS components as covariates. To account for multiple testing, P values were first corrected for the number of transcripts per cis window, followed by calculation of the FDR for the total number of variants tested. Replication of eQTLs with an FDR <0.05 in data set DE1 was conducted in control cohorts MPIP and GTP. For MPIP, the covariates sex, age, body mass index (BMI), disease status, and three MDS components were used in linear regression. For GTP, covariates were sex, age, and four MDS components. eQTLs were also looked up in the GTEx database (29). Here, only associations in whole blood were considered.
For analysis of the association of rs4925166 with DNA methylation at SHMT1, 210 CpG probes were identified in data set MPIP that mapped to SHMT1. After removing the quartile of probes showing the lowest variation in methylation status, 157 CpGs remained. Association of DNA methylation with imputed genotype was assessed by linear regression using sex, age, BMI, disease status, three MDS components, and estimated cell counts as covariates. The eight CpG probes showing an FDR <0.05 were replicated in data set GTP, using sex, age, four MDS components, and estimated cell counts as covariates. Mediation analysis was conducted as outlined in the Supplementary Materials, including nonparametric bootstrap for estimation of CIs and P values (30).

Replication of the results in a Sardinian cohort
The replication case group consisted of 2903 unrelated Sardinian MS patients that were diagnosed and selected using the McDonald criteria (22)(23)(24). Only 35 of these patients were diagnosed with PPMS (1.2%). Two thousand ten (69.2%) cases were female, and 893 (30.8%) were male; the average age at onset was 32 years. The matching control group of healthy individuals is composed of 2880 unrelated adult volunteer blood donors from the same locations where the cases were collected, as well as 443 Affected Family BAsed pseudo-Controls (AFBACs) derived from 242 MS and 201 type 1 diabetes family trios (23). AFBAC allele and haplotype frequencies were constructed using the two alleles in each trio that are not transmitted from the parents to the affected child. These familial pseudo-controls are matched to the cases for ethnic origin and are thus robust to population stratification.
All individuals were genotyped using the Illumina ImmunoChip array. In addition, 2040 (962 cases and 1078 controls) were genotyped with the Illumina HumanOmniExpress array and 3917 (2111 cases and 1806 controls) with the Affymetrix 6.0 array. One hundred seventy-four individuals (170 case and 4 controls) were genotyped using both HumanOmniExpress and Affymetrix 6.0 (22). After QC, we used 883,557 SNPs as baseline for imputation (17)