A multiple myeloma-specific capture sequencing platform discovers novel translocations and frequent, risk-associated point mutations in IGLL5

Multiple myeloma (MM) is a disease of copy number variants (CNVs), chromosomal translocations, and single-nucleotide variants (SNVs). To enable integrative studies across these diverse mutation types, we developed a capture-based sequencing platform to detect their occurrence in 465 genes altered in MM and used it to sequence 95 primary tumor-normal pairs to a mean depth of 104×. We detected cases of hyperdiploidy (23%), deletions of 1p (8%), 6q (21%), 8p (17%), 14q (16%), 16q (22%), and 17p (4%), and amplification of 1q (19%). We also detected IGH and MYC translocations near expected frequencies and non-silent SNVs in NRAS (24%), KRAS (21%), FAM46C (17%), TP53 (9%), DIS3 (9%), and BRAF (3%). We discovered frequent mutations in IGLL5 (18%) that were mutually exclusive of RAS mutations and associated with increased risk of disease progression (p = 0.03), suggesting that IGLL5 may be a stratifying biomarker. We identified novel IGLL5/IGH translocations in two samples. We subjected 15 of the pairs to ultra-deep sequencing (1259×) and found that although depth correlated with number of mutations detected (p = 0.001), depth past ~300× added little. The platform provides cost-effective genomic analysis for research and may be useful in individualizing treatment decisions in clinical settings.

these probes. Probes were also designed across exonic and intronic regions of the MYC locus spanning ~50Kb upstream to ~100Kb downstream (chr8:128697680-128853674;; hg19 coordinates). 70% of this region was covered by probes. All chromosomes arms have coverage except 13p, 14p, 15p, 22p, Yp, and Yq. As noted in the Discussion, all of these except Yq are tandem-rich arms of acrocentric chromosomes. Sequencing coverage calculation Average depth and on-target efficiency were calculated using the Genome Modeling System's 1 utilities for measuring depth and alignment coverage. These tools rely on the RefCov software suite (http://gmt.genome.wustl.edu/gmt-refcov), which provides a number of methods for analyzing nucleotide sequence coverage. RefCov calculates summary and per-base position coverage statistics relative to a reference genome based on an input alignment BAM file.
Reported mean, minimum, and maximum coverage statistics are based on on-target basesi.e., bases aligned within the coordinates of the designed probes, as specified by the BED file (Table  S1). Hence, no bases within the wingspan of the probes were considered on-target.
Per-base coverage was then calculated for each base in the target space. A sample's mean coverage was then calculated as the mean per-base coverage across all bases having at least 1X coverage. The reported mean, minimum, and maximum coverages are then the mean, minimum, and maximum across all samples of the per-sample mean coverages. In addition to total aligned on-target bases, Figures S1 and S2 show total duplicate bases (i.e., total number of on-target bases occurring in duplicate reads), total aligned off-target bases (which includes both unique and duplicate reads), and total unaligned bases. Finally, percent on target efficiency at a specified depth (30X in Figures  S1  and  S2) was calculated by first summing the total coverage at on-target bases that met or exceed the specified depth and then by dividing this sum by the total number of bases sequenced.
Capture sequencing-based copy number detection Copy number variants (CNVs) were called using CopyCAT2 (https://github.com/abelhj/cc2/;; Ref. 2) parameterized to detect copy number alterations exceeding the level of noise estimated from diploid regions using a gaussian mixture model (https://github.com/genome/bmm). CopyCAT2 was specifically developed to detect CNVs from capture sequencing data. CNVs were called if the (nominal) binomial p-value output by CopyCAT2 (p.cov.np) was less than 0.05, which is computed based on the number of capture probes having a tumor/normal depth log ratio outside some upper or lower limit. These upper and lower limits were defined as the mean plus or minus, respectively, two times the standard deviation of the distribution of log ratios from a "typical" diploid region. Chromosome 2 was generally used as the reference diploid region, as it is infrequently altered in multiple myeloma. In several instances, however, chromosome 2 proved too noisy and a different chromosome was used as the diploid region, namely: To prevent the upper and lower limits from being overly sensitive to potential focal and/or armlevel CNVs within the supposedly diploid region, a gaussian mixture model (i.e., sum of gaussian distributions) was fit to the tumor/normal ratios (not their log-transformed values) within the region ( Figure  S3). Since the bulk of the region was assumed diploid, the gaussian making the largest contribution to the mixture (i.e., with the largest weight and hence representing the most probes) was taken as a model of the unaltered subregions. The gaussian mixture model was fit using a variational Bayesian approach implemented in the R bmm package (https://github.com/genome/bmm) and used previously as the backend of SciClone-a method for inferring clonal evolution from sequencing data. 3 The mixture was fit with two gaussian components by: invoking init.gaussian.bmm.hyperparameters with N.c = 2, passing the resulting initialized hyperparameters to init.gaussian.bmm.parameters, and finally passing the parameters resulting from that call, along with the hyperparameters, to gaussian.bmm.
gaussian.bmm was also passed parameters convergence.threshold = 10 -4 , max.iterations = 10000, and pi.threshold=10 -2 . The data passed to these functions was output by an initial call of CopyCAT2. CopyCAT2 was invoked independently for each tumor sample, with the corresponding normal sample used as the (single) control sample, and with parameters coverage.min.ratio = 0.125, coverage.max.ratio = 8, min.num.normals = 1, min.norma.corr = 0.5, segalpha = 0.05, and vafs_normalize = FALSE. Additionally, the BED file describing the custom capture probes (Table  S1) was passed as the target.bedfile parameter after first excluding IGH coordinates. This effectively prevents CopyCAT2 from attempting to call CNVs within the IGH locus. This BED file was used to calculate coverage using bedtools coverage, which was subsequently passed to CopyCAT2.
Following fitting of the gaussian mixture, CopyCAT2 was invoked a second and final time to detect CNVs based on the margins established by the fit. Parameters were as specified above in the initial run, except with coverage.min.ratio and coverage.max.ratio set to the mean (which is one, since CopyCAT2 mean centers the data) plus or minus twice the standard deviation of gaussian model of the unaltered regions.
CNVs output by CopyCAT2 were annotated to indicate whether they were amplifications (CopyCAT2 finalcn field > 2) or deletions (finalcn < 2), whether they were focal or arm-level, whether they participated in a hyperdiploid event, and, for focal CNVs, what genes they encompassed. A CNV was annotated as belonging to a chromosome arm if at least one breakpoint was within that arm;; it was labeled as "arm"-level if its length was at least half the length of the targeted region of the arm and "focal" otherwise (Table  S4). An event that involved both arms of the chromosome was annotated for both the p- and q-arms, with a separate entry in the table (Table  S4) for each. A sample was considered hyperdiploid if it had amplifications of at least five of the chromosomes 3,5,7,9,11,15,19, and 21 (i.e., both p- and q-arms, except for chromosome 15, since 15p was not targeted). Focal CNVs were annotated with (hg19) genes they encompassed using findOverlaps from the GenomicRanges R package. (Clonal) single-copy gains occur at a log 2 ratio of log 2 (3/2) ~ 0.58, whereas (clonal) heterozygous/single-copy losses occur at a log 2 ratio of log 2 (1/2) = -1. Homozygous losses occur, in principle, at log 2 (0/2) or negative infinity. The finite negative ratio of the homozygous focal loss ( Figure  1B) may indicate that it is subclonal and/or reflect a small number of spurious alignments to this region of the genome.

Capture sequencing-based translocation detection
Translocations were detected using LUMPY (https://github.com/arq5x/lumpy-sv;; Ref. 4), with results filtered by a machine learning approach optimized to achieve high precision relative to available FISH results. First, FASTQ files were aligned against the human genome (hg19) using the aln command of SpeedSeq 5 (v0.0.1) and parameters "-t 4 -o prefix," which resulted in three BAM files: prefix.bam containing all alignments, prefix.splitters.bam containing all split reads, and prefix.discordants.bam containing discordant read pairs. The empirical insert size distribution was calculated for each alignment BAM file using the pairend_distro.py utility distributed with LUMPY. Specifically, samtools 6 was used to output the entries of the prefix.bam file, with the first 10,000 entries skipped and the remainder piped to pairend_distro.py with parameters "-X 4 -N 10000," with results output to prefix.hist. The mean m and standard deviation sd of the insert size for prefix.bam were parsed from prefix.hist and used to define back_dist = m + 3 * sd. LUMPY (v0.2.13) was then invoked independently for each patient, with paired-end "pe" and split read "sr" arguments for each discordant and split-read BAM file (for both tumor and normal samples), respectively, associated with that patient. Specifically, each prefix.bam file associated with the patient resulted in a set of arguments: "-pe id:<sample_name>,bam_file:prefix.discordants.bam,histo_file:prefix.hist,mean:m,stdev:sd,read_ length:<read_length>,min_non_overlap:<read_length>,discordant_z:5,back_distance:back_dist, weight:1,min_mapping_threshold:<threshold>" and "-sr id:<sample_name>,bam_file:prefix.splitters.bam, back_distance:back_dist,weight:1,min_mapping_threshold:<threshold>," where <sample_name> was the name of the sample, <threshold> was 50, and <read_length> was the mode of the lengths of the first 110,000 reads (as determined by outputting the first 110,000 reads of prefix.bam using samtools), which was 100. Translocations were annotated with the nearest cancer-associated gene (as cataloged in the cancer gene census 7 ) within 1Mb of either breakpoint.
Putative translocations involving IGH (defined as those with a partner within the region chr14:105982614-107338051) or MYC (chr8:128697680-128853674) were parsed out of the LUMPY VCF (variant call format) output using a custom script. In addition to the indicated coordinates, putative translocations were required to include a VCF MATEID field and SVTYPE=BND (indicating a complex rearrangement with two breakends).
Each putative inter-chromosomal IGH translocation was further filtered using a support vector machine (SVM) trained on available FISH data and using as input the number of split reads   FISH results for MYC translocations were not available for filtering LUMPY results. Hence, we manually defined a decision boundary in the space of number of supporting split reads and paired-end reads to separate those LUMPY calls that were likely to be false positives (in particular, those that were detected in normal samples) from those more likely to be true positives. To do so, independently for intra- and inter-chromosomal MYC translocations, we defined the separating hyperplane such that all translocations inferred in normal samples were assigned to the likely false positive class. Specifically, we manually selected translocations based on their numbers of supporting reads that should be used to define the boundary (i.e., a subset of which would be selected as support vectors to define the hyperplane). This was accomplished by defining an SVM via SVM(kernel='linear', C=1) and then by invoking its fit method with a sample_weight argument that assigned a non-zero weight (10)

Somatic single nucleotide variant detection
Reads were aligned against human reference genome GRCh37-lite using BWA. 12 The SNVcalling pipeline used a combination of samtools, 6  In addition to producing the standard position and base pair change of a variant, the somatic variation pipeline produces both a classification for mutation type (e.g., silent, missense, nonsense) as well the reference and alternate read counts and variant allele frequencies for both the tumor and matched normal samples. Together, this information provided a means of stratifying variants by relative importance and of assessing the sensitivity of the custom capture platform to detect low-frequency mutations.

Comparison between initial capture and subsequent deep sequencing
We explored the sensitivity afforded by increased sequencing depth by performing additional sequencing of 15 tumor (mean depth=1,259X, min=506X, max=1,660X) and paired normal (mean=1,326X, min=763X, max=1,727X) samples. We then performed a comparison of variants discovered by the initial and deeper sequencing. Both data sets were processed using the same pipeline parameters as detailed above. The final SNV variant calls were then compared to look for commonalities as well as those unique to each set.
Several filtering steps were carried out prior to the comparison of variants. This served to both highlight the genes of interest, as well as to account for the additional information provided by deeper sequencing. This additional information influences SNV results both by revealing rare low-frequency variants as well as by identifying potential contamination in the original lowercoverage results. For instance, a SNV with low variant allele coverage in the initial sequencing may be called if reference coverage is also low, so that the resulting VAF is appreciable and exceeds the caller's threshold. However, if deeper sequencing leads to additional coverage of that reference allele, without a corresponding increase in the variant allele, the resulting VAF may fall below a caller-required threshold and be filtered. This situation may indicate the variant reads are artifacts.
The following variants were removed prior to comparison between initial and subsequent deep sequencing: 1. Those annotated as intronic, intergenic, silent, or 5' flanking (to focus the comparison on those variants most likely of biological importance).
2. Those occurring in the IGH region (as these are likely caused by physiological somatic hypermutation and are not of biological significance).
3. Those rejected by a caller as likely germline in either data set.

Sequencing downsampling
We explored the effect of varying read depth on variant discovery by downsampling the deep sequencing data sets. Comparisons were performed at 25%, 50%, and 75% of the total coverage on the set of 15 samples for which additional sequencing was performed. The original BAM files containing all instrument data were first query-sorted using the SortSam utility from the Picard 1.138 toolkit (parameters: SORT_ORDER=queryname VALIDATION_STRINGENCY=LENIENT). To recreate the effects of lower coverage, the query-sorted instrument data were then randomly down-sampled without replacement using the bamsample tool from the fastq-tools package (version 0.8). Five repetitions of subsampling were performed on the 15 samples at all three levels of lower coverage. The reduced data sets were then imported into the standard somatic variation pipeline (above) to maintain consistency with the samples having full coverage. SNV calls from the downsampled results were excluded from the comparison as above or if they were annotated as RNA mutations. After filtering, the results were compared to the full complement of calls from the 100% coverage data set. The number of variants that overlapped the full-coverage variants exactly in position and nucleotide change are indicated on the y axis of Figure  S10.

Comparison between exome and capture sequencing
In order to establish performance against an existing platform, we compared the results of capture sequencing to those previously obtained via exome sequencing (dbGaP Study Accession: phs000348.v2.p1). We downloaded alignment BAM files from dbGaP, converted them to FASTQ files, and reprocessed the unaligned reads using the same alignment and variant discovery pipeline as used for the capture sequencing data (above). This ensured that discrepancies reported between the two studies were not an artifact of different bioinformatic pipelines, but rather reflected differences in the sequencing platforms employed. Though 79 pairs overlapped between the two studies, we were only able to reprocess the data from 44 pairs, which are reported here.
To address the issue of the capture platform's much more restricted coverage, the comparison was limited to only those coordinates nominally targeted by the probes (Table  S1) Mutual co-occurrence and mutual exclusivity Mutation co-occurrence and mutual exclusivity were calculated using MuSiC. 17 Raw p-values calculated using 100,000 permutations are reported.

IGLL5 survival analysis
Clinical and non-synonymous SNV and indel data were downloaded from the MMRF Researcher Gateway as part of CoMMpass trial IA9 data release (files STAND_ALONE_SURVIVAL.csv and MMRF_CoMMpass_IA9_All_Canonical_NS_Variants.txt).
These data were generated as part of the Multiple Myeloma Research Foundation Personalized Medicine Initiatives (https://research.themmrf.org and www.themmrf.org). Progression events and times were defined using the "ttcpfs" and "censpfs" fields, respectively, from the file STAND_ALONE_SURVIVAL.csv. Survival analysis was performed in R using the survival and survminer packages: Kaplan-Meier curves were generated using survfit and plotted using ggsurvplot, while a Cox proportional hazards model was fit using coxph.

Fluorescence in situ hybridization
Fluorescent in situ hybridization (FISH) was performed on ACK lysed BM aspirates using cIg-FISH as previously described. 18 All samples were hybridized with commercial probes (Abbott/Vysis). A dual color break apart probeset for 14q32 was first used to determine if there was a translocation involving the IGH locus. If the break apart was positive, a reflex to the most common translocations observed in multiple myeloma were used: t(11;;14)(q13;;q32) (i.e., release of the CoMMpass trial, which was also downloaded from the MMRF Research Gateway. Table  S1. BED file of capture sequencing platform probes. Coordinates in hg19 genome.    from the unperturbed/diploid-region ratios by fitting a gaussian mixture model to the (non-logtransformed) depth ratios. (C) Only those regions whose ratios are further than two standard deviations from the mean diploid ratio are assessed for, and ultimately identified as having (blue;; p < 0.05), CNVs by CopyCat2.    Table  S5: Annotated and SVM-scored LUMPY IGH translocation calls. Translocation partner 1 indicated by "chrom" and "pos" and partner 2/mate indicated by "mate_chrom" and "mate_pos" (hg19 coordinates). "sr": number of split reads supporting translocation;; "pe": number of paired-end reads supporting translocation;; "tot.evidence": total evidence (i.e., sum of "sr" and "pe") supporting translocation. "mm.gene1" and "mm.gene2": multiple myelomarelevant gene within 1Mb (irrespective of strand/orientation) of breakpoint "pos" or "mate_pos," respectively. "census.gene1" and "census.gene2": gene annotated in Cancer Gene Census gene within 1Mb (irrespective of strand/orientation) of breakpoint "pos" or "mate_pos," respectively. "svm": output of SVM classifier;; 1 indicates likely translocation and 0 indicates likely false positive.      Table   S5. Table  S8. SNVs and indels detected during initial targeted sequencing.   Table  S9. MuSiC analysis of mutation co-occurrence and mutual exclusivity. P-values for co-occurrence and mutual exclusivity of mutations involving genes listed in columns Gene1 and Gene2 in columns Pvalue_And and Pvalue_Xor, respectively.