Search for rare protein altering variants influencing susceptibility to multiple myeloma

The genetic basis underlying the inherited risk of developing multiple myeloma (MM) is largely unknown. To examine the impact of rare protein altering variants on the risk of developing MM we analyzed high-coverage exome sequencing data on 513 MM cases and 1,569 healthy controls, performing both single variant and gene burden tests. We did not identify any recurrent coding low-frequency alleles (1–5%) with moderate effect that were statistically associated with MM. In a gene burden analysis we did however identify a promising relationship between variation in the marrow kinetochore microtubule stromal gene KIF18A, which plays a role in control mitotic chromosome positioning dynamics, and risk of MM (P =3.6×10−6). Further analysis showed KIF18A displays a distinct pattern of expression across molecular subgroups of MM as well as being associated with patient survival. Our results inform future study design and provide a resource for contextualizing the impact of candidate MM susceptibility genes.


INTRODUCTION
Multiple myeloma (MM) is a malignancy of plasma cells [1] for which there is an increasing incidence as the population ages. Case-control and cohort studies have consistently demonstrated a two to four-fold increased risk of MM in first-degree relatives of MM patients supporting the role of inherited susceptibility in tumour development [2].
Defining the genetic basis of this risk has proven difficult but recent genome-wide association studies (GWAS) have provided the first direct evidence for genetic susceptibility to MM, identifying risk single nucleotide polymorphisms (SNPs) at several independent loci [3][4][5]. Statistical modelling indicates that additional common variants with small effect should be identifiable by further GWAS. However, other types of variants should also be important and inference from studies of other cancers shows it is likely that rare, high-impact variants also contribute to the heritable risk of MM. Identifying such variants is important as this class of susceptibility can provide important insights into the molecular basis of familial and sporadic tumorigenesis. Furthermore, improved understanding of the molecular factors involved in tumorigenesis through such mechanisms has provided a basis for the rational development of targeted therapies for a number of cancers. While imputation broadens the accessible frequency spectrum of GWAS datasets, its fidelity is typically restricted to the detection of variants having minor allele frequencies >0.01. Other methodologies offer advantages over this and there is a strong rationale for searching for rare-disease associated alleles directly utilising high-throughput sequencing.
Since the exome is a highly enriched subset of the genome in which to conduct such screens, we have
We next examined the impact of rare alleles (MAF<1%) collectively within a gene on MM risk by aggregating SNVs and indels ('T1' test) in each gene and comparing the counts between cases and controls. Acknowledging the limitations of in-silico prediction to enrich for harmful alleles, we considered three nested classes of variant (see Methods): 'disruptive' (Class 1), 'predicted damaging' (Class 2) and 'all non-synonymous' (Class 3).
No individual gene showed a significant enrichment of Class 1 variation in cases, the strongest association being shown for CC2D2B (P =4.2x10 −4 ; Table 1). Furthermore, there was no global overrepresentation of associations across Class 1 variants (e.g. 6 vs. 3.3 expected at P ≤0.01, Supplementary Figure 1C; Supplementary Table 2). While no gene was formally statistically significant across any class when adjusting for multiple tests (i.e. P >3.3×10 −6 ; Table 1; Supplementary Table 2) we did identify a promising relationship with Class 3 variants in the marrow kinetochore microtubule stromal gene KIF18A gene (P =3.6×10 −6 ; Supplementary Table 3 details specific variants). For 14 of the 16 cases where matched tumor WES data was available there was no statistical evidence for preferential loss of wild-type allele in carriers (P = 0.19, one-sided binomial test). Analyzing Total Therapy and MRC-IX trial data, KIF18A expression was significantly elevated in the MAF/MAFB (MF) and overexpression of proliferationrelated genes (PR) subtypes (P =3.9×10 −14 and P =0.019 in Total Therapy and MRC-IX data sets, respectively; Figure 1A). The level of expression of KIF18A in normal plasma cells was also found to be lower than that seen in MM (P<0.001; Figure 1A). KIF18A expression is significantly correlated with the gene expression-based proliferation index (GPI) [7] in both data sets ( Figure 1B). In addition, there is a significant association between high expression of KIF18A (top 10% versus lower 90%) and poor outcome in the Total Therapy and MRC-IX trials (P<0.001 in both sets; Figure 1C). However, in a multivariate cox regression including MF and PR subgroup designation, elevated KIF18A expression, and GPI, KIF18A expression did not retain significance suggesting that proliferation is the primary independent prognostic factor. Pathway analysis revealed significantly increased activation of Cell Cycle and DNA replication pathways in samples with high expression of KIF18A.
Since many cancer susceptibility genes (CSGs) have pleiotropic effects, influencing the risk of different cancer types to varying degrees, we assessed a set of 114 wellestablished CSGs for enrichment in mutations in cases versus controls. The strongest CSGs associations were provided by SOS1 (P =0.03), BUB1B (P = 0.002) and ABCB11 (P =0.001) for Classes 1, 2 and 3 respectively (Supplementary Table 2). Imposing a significance threshold of 4.4x10 −4 to address the issue of multiple testing due to evaluating the 114 CSGs, no significant association was shown.

DISCUSSION
We report the first analysis of the contribution of rare disease-causing alleles to MM by analyzing germline WES data. Our results summarize observed variation in the largest MM germline sequencing study to date, thus providing an invaluable reference for future genetic and functional studies. We have identified KIF18A as a possible candidate for defining MM susceptibility. The proliferation arrest of MM cells out of niche has been shown to be associated with the widespread down regulation of mitotic and transcriptional genes [7], which includes KIF18A, hence variation in KIF18A has a strong biological basis for having a role in MM susceptibility a priori. Moreover elevated expression of KIF18A has been shown in other cancers to be associated with enhanced cell proliferation and predictive of poor prognosis [8] [9] [10].
Some regions of the genome are refractory to WES, hence we cannot exclude the possibility of disease-causing variants which could not be assayed. Accepting such a caveat a number of conclusions can be derived from our findings. First, the existence of rare recurrent protein altering variants with population frequencies of 1% or greater and conferring a RR of 4.0 seem implausible. Our analysis is however, based on UK data and does not exclude the possibility of this class of allele contributing to MM risk in populations with a more restricted allelic architecture. We acknowledge that our study had limited power to identify alleles with moderate penetrance. The use of familial cases provides a means of significantly empowering the search for rare disease-causing alleles for cancer.
Although in concept an attractive strategy the number of familial MM cases are few, hence the practicality of adopting this as a means of gene identification for MM is inherently problematic. Such considerations should not however detract from performing WES or whole genome sequencing on MM families that potentially offer the prospect of discovering high-impact mutations likely to be highly informative for understanding MM biology.

Patients
We analyzed WES data on 553 (307 male) patients with MM (mean age at diagnosis 65 years) that have been the subject of a previously reported study [11]. Briefly, the patients comprised 527 from the UK Medical Research Council (MRC) Myeloma XI trial and 260 from the UK MRC Myeloma IX trial. Germline DNA was extracted from EDTA venous blood samples obtained at diagnosis. Paired tumor DNA was extracted from CD138-sorted bone marrow samples where available [11]. WES of germline and tumor samples was performed using Agilent-Custom 53Mb Exome Capture (Agilent, Santa Clara, CA, USA) and Illumina HiSeq2000 technology (Illumina, San Diego, CA, USA).

Controls
The controls comprised 1,609 healthy individuals from the UK 1958 Birth cohort [12] -961 from the ICR1000 dataset (EGAD00001001021) [13] and an additional 648 individuals all sequenced using Illumina TruSeq 62Mb expanded exome enrichment kit in conjunction with Illumina HiSeq2000 technology.
We analyzed germline WES data on 553 UK MM patients and 1,609 1958BC controls. Cases and controls had similar sequencing metrics (Supplementary Table 4). 80 samples were excluded due to low-quality data or nonnorthern European ancestry leaving 513 cases and 1,569 controls for analysis.

Variant analysis pipeline
CASAVA (v.1.8.1, Illumina) was used to extract paired end fastq files, then Stampy and BWA [14] were used to align reads to human reference genome build 37 (hg19). Alignments were processed using the Genome Analysis Tool Kit (GATK v3), according to best practices [15,16]. Variants were called on the genomic region comprising the union of the TruSeq 62Mb capture and the Agilent-Custom 53Mb capture, plus 100bp padding at each boundary (Supplementary Table 5). Variants were called simultaneously across all case and control samples. For the loss of heterozygosity analysis, MM germline and tumor sample variants were called using Platypus [17]. The Variant Effect Predictor (VEP) was used to annotate each variant with its effect on canonical protein transcripts [18]. For the gene burden analysis if a variant received multiple relevant VEP annotations for a gene, we used only the single annotation deemed likely to have the most profound impact adopting the hierarchy: stop gained, frameshift, splice acceptor/donor variants, inframe insertion/deletion, and lastly missense, which were additionally annotated with predicted pathogenicity using the CONDEL algorithm [19]. All variants were annotated with their distance from simple repeats, and their 100mer alignability, using the UCSC browser [20]. ClinVar was used to check variants in promising genes for previously documented evidence of pathogenicity [21]. Linkage disequilibrium (LD) between variants was retrieved using the SNAP pairwise LD online tool with SNP data set '1000 Genomes Pilot 1' and population panel 'CEU' [22].

Variant quality control
A variant was only considered to be present if the GATK genotype-quality was ≥30, the alternate depth was >3, and it was in an acceptable truth tranche (i.e. <99.5 for SNVs and <99 for indels) as per GATK best practices.

Variant quality control for single variant analysis
For the single-variant analysis criteria were chosen to ensure the genomic inflation factor over the highest 90% of passing P-values remained close to unity (Supplementary Figure 1A). Variants were thus discarded if: (i) UCSC alignability ≠ 1 (100bp window size); (ii) variant within 10 bps of a simple repeat region; (iii) highly significant deviation from Hardy-Weinberg equilibrium (HWE) in cases or controls (P <10 −5 ); (iv) no call rate significantly different between case-samples and controlsamples (P <10 −5 ); (v) no call rate across either case or control samples >0.03. Furthermore for each variant the 'minimum possible' P-value it could obtain was calculated based upon the number of variant alleles observed, by assuming all variant alleles were in cases. Variants with high minimum possible P-values were disregarded if this value was greater than the Bonferroni corrected significance threshold resulting from their inclusion leaving 24,752 for analysis.

Variant quality control for gene centric analysis
For the gene-centric analysis (sample, variant)pairs were excluded if a heterozygous site was called yet the read counts in support of the reference and alternate alleles were unbalanced (P <0.0001, χ 2 test). Variants were excluded across all samples if the UCSC alignability ≠ 1, the variant mapped within a simple repeat, the HWE Pvalue across cases and controls was <10 −8 , or the no call rate across either case-or control-samples was ≥25%.

Publicly accessible control data
As additional sources of variant frequency in controls we referenced UK10K exome data (ALSPAC n =1,828 and TWINSUK n =1,754) and Exome Aggregation Consortium (ExAC-release 0.3, non-Finnish European population, excluding samples analyzed by The Cancer Genome Atlas).

Statistical analyses Single variant association
The difference in allele frequency in cases and controls was assessed using Fisher's exact test (twosided) implemented in R [24]. A P-value of <2.02x10 −6 was declared as significant; corresponding to a Bonferroni correction for 24,752 tests.

Gene-centric analysis
To test whether rare mutations contribute to MM we performed a collapsing burden test imposing a maximal MAF threshold of 1% (T1 test). We applied the T1 test for three nested classes of variant: Class 1) 'disruptive' (nonsense, frameshift); Class 2) 'predicted damaging' (disruptive + missense predicted to be damaging by CONDEL, and splice site acceptor/donors); Class 3) 'all non-synonymous' (predicted damaging + all sufficiently rare non-synonymous variants). To ensure nominal power to identify associations we restricted our analysis to genes featuring variants in ≥10 samples amongst cases and controls (Supplementary Figure 3). Exome-wide significance was considered to be P =3.3×10 −6 , corresponding to a Bonferroni correction for the 15,358 tests conducted when considering all classes. Significance levels were assessed using a onesided permutation test on case/control status.

Study power
Single variants were characterized by their minor allele frequency (MAF) and relative risk (RR); for (MAF, RR)-pairs discovery power was analysed by simulating 10,000 draws of case and control alleles from the population (Supplementary Figure 4A). Gene based analysis was treated in a similar manner to single-variants, except instead of using MAF, the "% of population with a Class 1 (or 2 or 3) variant in the gene-of-interest" was used (Supplementary Figure 4B).

Loss of heterozygosity
A search for loss of heterozygosity in paired tumor samples was performed using ExomeCNV (v1.4) [25].

Impact of pleiotropic effects of cancer susceptibility genes
In addition to performing an agnostic search for novel susceptibility genes, under the hypothesis that some MM cases might be ascribable to the pleotropic effects of known cancer susceptibility genes (CSG) we also adopted a focused assessment of the 114 established CSGs [26].

Expression analysis of KIF18A
The impact of KIF18A (221258_s_at) expression in MM was from Affymetrix Human Genome U133 2.0 Plus array data in plasma cells from Total therapy (GSE2658) and MRC Myeloma IX trial patients (NCBI GEO Datasets GSE31161) by TC classification. Differences in KIF18A expression between MM subtypes was assessed using the Wilcoxon test. Significance of difference in patient survivorship was determined using the log-rank test with "high" expression defined as the top 10% of samples, and "low" as the bottom 90%. All statistical analyses were performed using R version 3.2.1 software. A P-value of 0.05 (two-sided) was considered to be statistically significant.

Data availability
Whole-exome sequence data that support the findings of this study have been deposited in EGA with accession codes EGAS00001001 and EGAD00001001021. Expression data that support the findings of this study have been deposited in GEO with accession codes GSE2658 and GSE31161. The remaining data are contained within the paper and Supplementary Files or available from the author upon request.

ACKNOWLEDGMENTS
This study made use of genotyping data on the 1958 Birth Cohort generated by the Wellcome Trust Sanger Institute (http://www.wtccc.org.uk). We also thank the staff of the CTRU University of Leeds and the NCRI haematology Clinical Studies Group. Use of the ICR1000 UK exome series data generated by Nazneen Rahman's team at The Institute of Cancer Research, London is acknowledged.