Haplotype analysis identifies functional elements in monoclonal gammopathy of unknown significance

Genome-wide association studies (GWASs) based on common single nucleotide polymorphisms (SNPs) have identified several loci associated with the risk of monoclonal gammopathy of unknown significance (MGUS), a precursor condition for multiple myeloma (MM). We hypothesized that analyzing haplotypes might be more useful than analyzing individual SNPs, as it could identify functional chromosomal units that collectively contribute to MGUS risk. To test this hypothesis, we used data from our previous GWAS on 992 MGUS cases and 2910 controls from three European populations. We identified 23 haplotypes that were associated with the risk of MGUS at the genome-wide significance level (p < 5 × 10−8) and showed consistent results among all three populations. In 10 genomic regions, strong promoter, enhancer and regulatory element-related histone marks and their connections to target genes as well as genome segmentation data supported the importance of these regions in MGUS susceptibility. Several associated haplotypes affected pathways important for MM cell survival such as ubiquitin-proteasome system (RNF186, OTUD3), PI3K/AKT/mTOR (HINT3), innate immunity (SEC14L1, ZBP1), cell death regulation (BID) and NOTCH signaling (RBPJ). These pathways are important current therapeutic targets for MM, which may highlight the advantage of the haplotype approach homing to functional units.


INTRODUCTION
Germline disease genetics has historically applied linkage studies between family members to find susceptibility genes.With time families have become smaller which has reduced the statistical power of family-based linkage studies [1].However, with increasing understanding of the human genome organization, genetic variants, including single nucleotide polymorphisms (SNPs), have been identified at specific locations throughout the genome.SNPs are inherited together as haplotypes containing a linked sets of alleles and allele-specific biological functions, including genes and their cis-regulatory elements [2].In the study on global reference for human genetic variation 3.53 million SNPs were used for the European population to define haplotypes [3].One goal for the development of dense linkage map with welldefined haplotypes over the human genome has been the possibility to map susceptibility genes assuming that some marker SNPs would be in high linkage with the functional variants, even if the SNPs lacked independent functions [4].This has been the driving idea behind the genome-wide association studies (GWASs) for which the SNP coverage increased hugely in a decade.The result has been that the current GWASs target increasing numbers of haplotypes and may identify functional variants in even rare haplotypes [5].
The importance of haplotypes has been well known in some cancers, particularly in lymphoma, where immune related haplotypes at the HLA locus are associated with disease risk or protection [6].Other cancer-related applications of haplotypes include dating the origins of mutations based on conserved haplotypes [7].However, direct use of haplotypes has not been popular in human germline genetics, in marked contrast to its role in animal breeding [8,9].Nevertheless haplotype-based genetic mapping has been used in some recent Swedish cancer studies where novel candidate genes have been detected [10,11].
In the present study we apply a haplotype-based gene mapping approach to monoclonal gammopathy of unknown significance (MGUS) which is a precursor condition for multiple myeloma (MM) and other plasma cell malignancies such as Waldenström macroglobulinemia and immunoglobulin light chain (AL) amyloidosis.The MGUS populations included Swedish, German and Czech individuals and local controls who have been genotyped in previous GWASs [12][13][14][15].For haplotype analysis diverse populations may be a disadvantage as the haplotype structure may show subtle differences.However, the advantage is the increase in sample size and the possibility to internally replicate the findings in three populations.

Populations and original GWASs
We included three independent GWAS data sets of MGUS patients and controls from Germany, the Czech Republic and Sweden, including a total of 992 MGUS cases and 2910 controls, as described elsewhere [15].Collection of patient samples and associated clinical information was undertaken with informed consent in accordance with the tenets of the Declaration of Helsinki.The study was approved by the Ethical Committees of the universities of Heidelberg, Ostrava and Umea.The diagnosis of MGUS was based on the internationally accepted criteria of a monoclonal protein concentration <30 g/L, <10% monoclonal plasma cells in the bone marrow, normal plasma calcium and normal renal function, no bone destruction and no anemia.
Details of the individual GWASs and imputation have been described earlier [15].Shortly, the German MGUS set was genotyped with Illumina Human OmniExpress-12 v1.1 arrays and the control set with Illumina Human OmniExpress-12 v1.0 arrays.The Czech individuals were genotyped using Illumina HumanOmniExpressExome8v1.3 arrays.The Swedish samples were genotyped with Illumina Human Omni1-Quad BeadChips or OmniExpress-12 v1.0 arrays.General quality control assessment of genotyping was performed as previously described by Broderick et al. [16] and Chubb et al. [17].SNPs with a minor allele frequency (MAF) < 5% were excluded due to statistical uncertainty at lower allele frequencies using the current population sizes.After imputation using data from the combined UK10K -1000 Genomes Project (Phase 3, Oct 2014) with IMPUTE2 v2.3.2 [18], the genotyped and imputed SNP sets consisted of 4 015 889 SNPs for the Czech population, 5,305,950 SNPs for the German population, and 4,455,885 SNPs for the Swedish population.All genomic positions are given in NCBI Build 37/UCSC hg19 coordinates.

Construction of haplotype blocks
Haplotype blocks were created for each population separately based on 13,777,724 SNPs for a total of 3902 subjects.For construction of the haplotype blocks, two steps were followed: 1) phasing of chromosomes through SHAPEIT.v2[19] and 2) construction of haplotype blocks through the "Ghap" package in R [20].
For phasing, the genotype data was segregated based on chromosomes with creation of individual chromosome binary ped and map files by using the PLINK program [21].Chromosome-wide genotypes were then phased again using SHAPEIT.v2.12 with 200 states and window size of 0.5 Mb.SHAPEIT has been acknowledged to be most efficient for estimating haplotypes from the genotypes and for creating the consecutive haps and sample file which were used to generate haplotype blocks through GHap.
Haplotype blocks were constructed based on linkage disequilibrium (LD), Length (Len), and the number of SNPs (nsnp) and corresponding phase, sample, and marker files for each chromosome were created.Genomic positions in haplotype blocks were based on the distance from the first SNP.After phasing, haplotype blocks were constructed using the GHap function, which generates HapBlocks based on sliding windows.Windows and the step size can be specified in markers.For each window, block coordinates are generated.Based on the fact that the average length of linkage disequilibrium (LD) blocks in human populations is about 20 kb [22,23] the window size was defined to be 15 SNPs in order to cover a similar span.The sliding block along the genome was defined to be 2 SNPs during the call haplotypes.For each window, block coordinates were generated as the mean position of the start and end position of the corresponding haplotype window.
Overall, a total of 99,518,698 haplotype blocks were created using the above-mentioned methods based on LD, the length and the number of SNPs.In a consecutive step, we used GHap package in R to derive haplotype "alleles" from these haplotypes' blocks.Haplotype "alleles" were then exported in the PLINK ped file format, where haplotype allele counts 0, 1, and 2 are recoded as NN, NH, and HH genotypes (H = haplotype allele and N = NULL = all other alleles), as if haplotypes were bi-allelic markers.A regular ped file was obtained with PLINK and the haplotypes presenting a MAF < 0.01 were filtered out.These data sets were further used to perform a haplotype-based association study.

GWAS of the haplotypes
Haplotype alleles determined by GHap were used for association studies conducted by PLINK using logistic regression models with the covariate sex.The association analysis was first performed for each single population based on the haplotypes derived in the steps above.

Meta-analysis
Meta-analysis was performed using PLINK by pooling the beta values and standard errors for haplotypes from each single population data set.The meta-analysis was performed with a random effects inverse-variance weighted logistic model.Cochran's Q-statistic was calculated, to test for heterogeneity, and the I2 statistic measured, to quantify the proportion of the total variation due to heterogeneity.
While normally the meta-analysis of GWAS data from different populations refers to common SNPs in the different data sets, this approach does not work for haplotypes, since only in rare cases exactly identical haplotypes could be derived in all populations.As an alternative, we therefore used a common SNP in the corresponding haplotypes in the three populations as an anchor point for the analysis (from now on referred to as "a joint SNP").For a haplotype in each population, all SNPs forming the haplotype were assigned and a meta-analysis was performed for the joint SNP to represent each haplotype (Fig. 1).However, the joint SNP is only used as a common anchor and it represents the P value, odds ratio and frequency of the haplotype.In case of a large number of common SNPs in a haplotype, the most significant one was assigned as the joint SNP.Similarly, if common SNPs along the chromosome were assigned to several overlapping haplotypes, the haplotype represented by a joint SNP with the highest significance was selected.

Gene identification in significant haplotype blocks
Haplotypes that after meta-analysis of the three populations were associated with the risk of MGUS at the genome-wide significance level p < 5 × 10 −8 and showed increased risk of MGUS in all three populations were investigated further using UCSC browser's GRCh37/hg19 assembly (https:// genome.ucsc.edu)[24,25].Overlapping haplotypes were defined by a joint SNP and the start and end positions of the haplotype regions were based on the lowest start and highest end positions, among the overlapping haplotypes.We included tracks of UCSC genes, Genome Segmentations and Histone modifications by ChIP-seq from ENCODE [26], CpG Islands, Transcription Factor ChIP-seq Clusters from ENCODE, Enhancers and promoters from GeneHancer (Double Elite) [27] and Interactions between GeneHancer regulatory elements and genes (Double Elite).For Histone modification tracks we included H3K27Ac as an enhancer marker, H3K4me3 as a promoter and transcribed region marker and H3K4Me1 as regulatory region marker from the GM12878 lymphoblastoid cell line from ENCODE.Further information about the associated GeneHancer promoters and enhancers were extracted via GeneCards (https://www.genecards.org).Our focus was on the Double Elite regulatory elements and associated genes, i.e. they were confirmed by several sources used by GeneHancer.We used the Roadmap Epigenomics project data on histone marks to evaluate the regions in different immune cell populations, including primary B-cells, monocytes, hematopoietic stem cells, neutrophils, T-cells and natural killer cells (https://epigenomegateway.wustl.edu/)[28].We also evaluated the histone marks in other adult cells and tissues, including ovary, adipose nuclei, osteoblasts, lung, breast myoepithelial cells, adult liver, thymus, skeletal muscle, spleen and gastrointestinal tract.We checked the function of the genes located within the haplotype regions and genes interacting with the regulatory elements within the haplotype regions using GeneCards (https://www.genecards.org) and UniProt (https://www.uniprot.org).We used PubMed (https://pubmed.ncbi.nlm.nih.gov) to look for the relationship between the genes with cancer, especially with multiple myeloma or other hematological malignancies.

RESULTS
We used a joint SNP as an anchor point to create overlapping haplotypes for the three populations of our study (Fig. 1).We identified 23 haplotypes that were associated with the risk of MGUS at the genome-wide significance level p < 5 × 10 −8 and showed increased risk of MGUS in all three populations (Table 1).One region represented by the SNP rs111335312 was located within the ULK4 gene, in a known MM risk locus.Altogether 13 haplotype regions were overlapping with genes or regulatory elements interacting with the neighboring genes (Tables 1, 2).In these haplotype regions most of the regulatory elements were active in B-cells and/or immune cells; some of them were even recognized as super-enhancers.All regulatory elements had several target genes and most of them contained tens to a few hundred transcription factor binding sites.
In five haplotype regions, strong promoter, enhancer and regulatory element related histone marks and regulatory elementtarget gene association scores as well as ChrommHMM and Segway genome segmentation data supported the importance of the regions in MGUS susceptibility (Fig. 2, Table 2).On chromosome 1, the region represented by rs11368313 overlapped with TMCO4 (transmembrane and coiled-coil domains 2) and RNF186 (ring finger protein 186) and showed interactions between promoter/enhancers of these two genes.Additionally, RNF186 interacted with OTUD3 (OTU deubiquitinase 3).Both RNF186 and OTUD3 are involved in (de)ubiquitination.A haplotype on chromosome 6 represented by rs10658790 overlapped with the promoter/enhancer of HINT3 (histidine triad nucleotide binding protein 3), which interacted with promoter/enhancers of TRMT11 (tRNA methyltransferase 11 homolog) and NCOA7 (Nuclear Receptor Coactivator 7).HINT3 may be involved in PI3K/AKT/ mTOR pathway.
On chromosome 17, the most interesting haplotype region represented by rs10163481 contained three regions with strong promoter and enhancer histone marks (Fig. 2, Table 2).These were related to GeneHancer promoter/enhancers and showed interactions between SEC14L1 (SEC14 like lipid binding 1) and SNHG20 (small nucleolar RNA host gene 20) and SRSF2 (serine and arginine rich splicing factor 2) which is involved in RNA splicing.SEC14L1 encodes a signal transduction inhibitor involved in innate immunity.
The region on chromosome 20 represented by rs111797554 overlapped with the ZBP1 (Z-DNA binding protein 1) gene and its promoter and enhancer that interact with each other (Fig. 2, Table 2).ZBP1 is involved in innate immunity responses.The region on chromosome 22, represented by rs1045588, involved two genes, BID (BH3 interacting domain death agonist) and MICAL3 (microtubule associated monooxygenase, calponin and LIM domain containing 3), and showed several regions with very strong promoter and enhancer histone marks and promoterenhancer interactions.BID belongs to the BCL-2 family of cell death regulators and MICAL3 is involved in cell cycle regulation.
Further five haplotype regions with either weaker histone marks, genome segmentation data or regulatory element-target gene associations included a region on chromosome 4 represented by rs10006825 which contained an enhancer associated with RBPJ (recombination signal binding protein for Fig. 1 Meta-analysis of the haplotypes from the three populations, originating from the Czech Republic (CZE), Germany (GER) and Sweden (SWE), represented by the joint SNP rs11368313 (in red).Other SNPs common in all three populations are shown in green.For each population, the chromosome start and end position and the allele of the haplotype are shown.immunoglobulin kappa J region) (Table 2, Supplementary Fig. 1).RBPJ is regulating NOTCH-signaling.In the region on chromosome 17 represented by rs111362005 three enhancers within an intron of NXN (nucleoredoxin) showed interactions with an enhancer region flanking the promoter of NXN, which regulates Wnt signaling.In another region on chromosome 17, represented by rs1024819, a weak enhancer interaction within MSI2 (musashi RNA binding protein 2) may affect the function of MSI2 as a transcriptional regulator of genes involved in development and cell cycle.
A large region on chromosome 19 represented by rs10420324 hosted MYADM (myeloid associated differentiation marker), PRKCG (protein kinase C gamma), CACNG7 and CACNG8 (calcium voltagegated channel auxiliary subunit gamma 7 and 8, respectively) that interacted with each other as well as with three zink finger proteins and CNO3 (CCR4-NOT Transcription Complex Subunit 3), which is involved in RNA-mediated gene silencing (Table 2, Supplementary Fig. 1).MYADM is a negative regulator of heterotypic cell-cell adhesion and protein kinase C signaling.On chromosome 20 represented by rs1051904 many promoterenhancer interactions were implicated between the SMOX (spermine oxidase) gene and long-distance enhancers.SMOX may act as a determinant of cellular sensitivity to antitumor polyamine analogs.
Furthermore, on chromosomes 10 and 11, GeneHancer regulatory elements were connected to long non-coding RNAs, however with unknown functional consequences (Table 2, Supplementary Fig. 1).On chromosome 12, a region represented by rs10840622 overlapped with the SETD1B (SET Domain Containing 1B, Histone Lysine Methyltransferase) and HPD (4-Hydroxyphenylpyruvate Dioxygenase) genes and showed strong promoter-related histone marks at the 3'end of HPD, but no interactions with any GeneHancer regulatory elements.

DISCUSSION
We conducted a haplotype-based analysis on MGUS to complement the earlier GWAS analyses where individual SNPs were considered [13][14][15]29].A previous meta-analysis of the GWAS data from MGUS, MM and AL-amyloidosis identified 17 independent regions with genome-wide significance [13].In the present study we identified 23 haplotypes that were associated with the risk of MGUS at the genome-wide significance level p < 5 × 10 −8 and showed increased risk of MGUS in all three study populations.Notably, only the ULK4 containing haplotype on chromosome 3 shared significance (OR 1.70, p = 1.13 × 10 −8 ) with the previous study, which may suggest the importance of this locus.ULK4 is a serine/threonine kinase.Several associated haplotype regions affected pathways important for MM cell survival and genes encoding important current therapeutic targets for MM, which may highlight the advantage of the haplotype approach homing to functional units.
We identified five haplotype regions, in which associations with regulatory elements and their connections to target genes supported by genome segmentation highlighted the possible role of the related haplotypes in MGUS.None of these regulatory elements were specific for B-cells, but were active also in other immune cells, and other adult cells and tissues, as implicated from different sources from GeneHancer and Roadmap Epigenomics.The haplotype on chromosome 1 included two (de)ubiquitinationrelated genes, RNF186 and OTUD3, which interacted with each other.As MM cells produce high amounts of monoclonal antibodies, maintaining protein homeostasis is crucial for MM cells [30].The ubiquitin-proteasome system plays an important role in this process and MM cell killing can be caused by blocking or interfering this system.Thus, it has been one of the primary targets in MM treatment starting in early 2000 with proteasome inhibitors and immune modulators [30].Commonly used  proteasome inhibitors include bortezomib, carfilzomib and the oral drug ixazomib [31].Commonly used immune modulators include thalidomide, lenalidomide and pomalidomide [31].E3 ubiquitin ligase and deubiquitination inhibitors are currently being tested in preclinical studies and clinical trials [30,31].A haplotype on chromosome 6 covered the promoter/enhancer of HINT3, probably involved in PI3K/AKT/mTOR pathway, and interacting with promoter/enhancers of NCOA7 which encodes an estrogen receptor associated protein.PI3K/AKT/mTOR pathway is one of the signaling pathways in the bone marrow microenvironment that promotes signaling events in MM cells and enhances their survival [32].The pathway shows aberrant activation in many MM patients.Targeting PI3K/AKT/mTOR pathway has shown promising results in preclinical studies, however, clinical trials have been disappointing showing limited clinical efficiency in MM and severe side effects [32].
The haplotype on chromosome 17 encoding the SEC14L1 gene and another haplotype on chromosome 20 associated with the ZBP1 gene both are involved in innate immunity.SEC14L1 was one of the genes that was mutated in the germline of one German MM family [33].ZBP1 is integral part of host defense against pathogens.It responds to a variety of stimuli, such as viral infection and homeostatic perturbations, leading to the formation of the PANoptosome complex and release of numerous cytokines and chemokines [34].It is also known to be active in MM [35].
The haplotype on chromosome 22 covered two genes, BID and MICAL3.MICAL3 is involved in cell cycle regulation and it has been implicated in various cancers [36].BID belongs to the BCL-2 family of cell death regulators which have recently become extremely promising therapeutic targets in hematological malignancies, including MM, with the prime drug venetoclax [31,37,38].Overexpression of Bcl-2 has been found in MM harboring a common translocation (11;14) which is a special indication for venetoclax [39].This translocation is also common in MGUS [40].Unfortunately, the MGUS samples of the present study have not been screened for cytogenetic alterations and we could not investigate this aspect further.
A further haplotype region with a possible association with immune functions was found on chromosome 4 represented with an enhancer associated with RBPJ.This enhancer seemed to be specific for immune cells, especially for B-cells, T-cells and natural killer cells.RBPJ is a key translational transducer of NOTCHsignaling and it plays a role in cellular immune response [41].Similar to PI3K/AKT/mTOR pathway, NOTCH-signaling promotes communication between adjacent cells in the bone marrow microenvironment and has been shown to be dysregulated in the MM tumor niche [42].
The challenges of our haplotype-based association study included three different study populations and their different sample sizes as well as different genotyping arrays used for the sample sets.This also illustrates some limitations of our study.In general, use of several populations increases the sample size and allows replication of the findings.However, as the frequencies of the SNPs and their LD vary between populations, it is difficult to find exactly the same haplotypes in all populations.The use of different genotyping arrays with different numbers of genotyped SNPs affected the imputation and the final number of SNPs included in the study, and at the end the composition of the haplotypes.We aimed to solve these problems by identifying the common SNPs in all populations and defining a joint SNP as a representative for the haplotype as shown in Fig. 1.However, the different composition of the haplotypes in the study populations makes it also difficult to evaluate the effect of the haplotypes on the expression of genes connected to these haplotypes.We used GeneHancer and Roadmap Epigenomics data to investigate the activity of the regulatory elements in B-cells, other immune cells and other adult cells and tissues as a potential indicator for the effect of the haplotypes.The problem of different sample sizes in the meta-analysis of GWAS data has been considered by Cook et al. [43].They demonstrated that linear/logistic models can be used for meta-analysis of GWASs of binary phenotypes, without loss of power, even in the presence of extreme sample size and case-control imbalances, provided that inverse-variance weighting of allelic effect sizes after conversion onto the log-odds scale has been performed, as was done in our study.
In conclusion, our haplotype-based genetic association study identified several novel loci associated with the risk of MGUS.Genes and regulatory elements connected to neighboring genes were related to pathways dysregulated in MM, which serve as targets for already existing therapies, or may serve as targets for drugs that are currently tested in preclinical studies or clinical trials.Interestingly, these loci with exception of the ULK4 locus, have not been found in large GWASs on MM which may show the strength of the haplotype-based approach testing functional chromosomal units rather than individual SNPs.Whether these MGUS-associated loci are important in the disease progression to MM as high-risk markers remains to be investigated.

Fig. 2
Fig. 2 Haplotype regions and Forest plots from the five most interesting haplotypes associated with the risk of MGUS.Haplotype regions are shown using UCSC Genome browser's GRCh37/hg19 assembly and annotation tracks from ENCODE and GeneHancer.Forest plots show the overlapping haplotypes, represented by a joint SNP, of the three study populations from the Czech Republic (CZE), Germany (GER) and Sweden (SWE).For each population the odds ratio and the corresponding 95% confidence interval (CI) are shown as well as the summary estimate of the meta-analysis.A Chromosome 1, rs11368313, B Chromosome 6, rs10658790, C Chromosome 17, rs10163481, D Chromosome 20, rs111797554, E Chromosome 22, rs1045588.

Table 1 .
Haplotype regions associated with the risk of MGUS.

Table 2 .
GeneHancer promoters and enhancers in the haplotype regions associated with the risk of MGUS, details extracted via GeneCards.

Table 2 .
continued b Double Elite GeneHancer score and Gene association score.c Number of gene targets; Gene targets in the haplotype region.d Examples of tissues, in which the promoter/enhancer is active in GeneHancer sources; Superenhancer (SE) in B-cells or immune cells.e Cell populations with active chromatin state in 6 immune cell populations and over 10 adult cell and tissue types were extracted via Roadmap Epigenomics.H. Thomsen et al.