A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing

As whole-genome sequencing for cancer genome analysis becomes a clinical tool, a full understanding of the variables affecting sequencing analysis output is required. Here using tumour-normal sample pairs from two different types of cancer, chronic lymphocytic leukaemia and medulloblastoma, we conduct a benchmarking exercise within the context of the International Cancer Genome Consortium. We compare sequencing methods, analysis pipelines and validation methods. We show that using PCR-free methods and increasing sequencing depth to ∼100 × shows benefits, as long as the tumour:control coverage ratio remains balanced. We observe widely varying mutation call rates and low concordance among analysis pipelines, reflecting the artefact-prone nature of the raw data and lack of standards for dealing with the artefacts. However, we show that, using the benchmark mutation set we have created, many issues are in fact easy to remedy and have an immediate positive impact on mutation detection accuracy.


Overlap of SSMs called by DKFZ at different unbalanced coverages
Total number of SSMs shared by all shared by 5 shared by 4 shared by 3 shared by 2 private Reddish colors correspond to a higher proportion of mutations or higher scores for the feature in false positive calls versus the Gold Set. Both features and submissions are clustered hierarchically. The features shown here include sameAF (the probability that the allele frequency in the tumor sample is not higher than that in the normal samples, derived from the snape-cmp-counts score), DacBL (in ENCODE DAC mappability blacklist region), DukeBL (in Encode Duke Mappability blacklist region), centr (in centromere or centromeric repeat), mult100 (1 -mappability of 100mers with 1% mismatch), map150 (1 -mappability of 150mers with 1% mismatch), dups (in high-identity segmental duplication), nestRep (in nested repeat), sRep (in simple repeat), inTR (in tandem repeat), adjTR (immediately adjacent to tandem repeat), msat (in microsatellite), hp (in or next to homopolymer of length >6), AFN (mutant allele frequency in normal) and AFTlo (mutant allele frequency in tumor < 10%). Read depth was not used as a feature for the CLL benchmark. Comparison of ability to discover SSMs with different pipelines. a) Overlap of SSMs called by each center on its own library. All SSMs detected by at least one center are shown on the xaxis. The SSMs were sorted and colored by recurrence. SSMs were considered to be identical when both the exact position and the base substitution were the same. The bar plot shows the percentage of all nonunique SSMs for the given levels of concordance. Shown on the bottom are the density plots of the variant allele frequencies for each level of concordance. b) Sequence context of SSMs detected by each center on its own library. For each single base substitution, the sequence context (plus/minus one base) was determined.         In this example, out of the 1,800,604 times that br was "A", the machine read "A" 1,772,053 times (about 96%). However, in 4,029 occasions (0.2%), the machine read "T". As expected, if the quality is low (a0 in the figure), the error frequency is higher. In this case, the machine read "A" 161,319 times and "T" 3,486 times.

Filter Description
DistanceToAlignmentEndMedian The median shortest distance of the variant position within the read to either aligned end is less than 10

DistanceToAlignmentEndMAD
The median absolute deviation of the shortest distance of the variant position within the read to either aligned end is less than 3

LowMapQual
The proportion of reads at the variant position with low mapping quality (less than 1) is greater than 10%

MapQualDiffMedian
The difference in the median mapping quality of variant reads (in the tumor) and reference reads (in the normal) is greater than 5

VariantMapQualMedian
The median mapping quality of variant reads is less than 40

VariantBaseQualMedian
The median base quality at the variant position of variant reads is less than 30

VariantAlleleCount
The number of variant-supporting reads in the tumor is less than 4

VariantAlleleCountControl
The number of variant-supporting reads in the normal is greater than 1

StrandBias
The strand bias for variant reads covering the variant position, i.e. the fraction of reads in either direction, is less than 0.02, unless the strand bias for all reads is also less than 0.02.

Repeat
The length of repetitive sequence adjacent to the variant position, where repeats can be 1-, 2-, 3-, or 4-mers, is 12 or more

SNVCluster50
The largest number of variant positions within any 50 base pair window surrounding, but excluding, the variant position is greater than 2; variant positions are those in which the number of alternate allele is supported by at least 2 reads and at least 5% of all reads covering that position.

SNVCluster100
The largest number of variant positions within any 100 base pair window surrounding, but excluding, the variant position is greater than 4; variant positions are those in which the number of alternate allele is supported by at least 2 reads and at least 5% of all reads covering that position.

Supplementary Note 1
The ICGC Verification group was tasked with providing guidelines for sequencing and somatic mutation calling. We first carried out a mutation calling benchmark exercise on chronic lymphocytic leukemia (CLL), which was followed by a second exercise using a case of medulloblastoma (MB), in which we compared both sequencing methods and somatic mutation calling pipelines. The experience gained from the organization of CLL benchmark was used to improve the organization and methods of the MB benchmark.
Here we present the entire CLL benchmark methods and results as well as additional results of the MB benchmark analysis.

CLL Whole genome sequencing
Whole genome sequencing reads were produced using DNA from a patient suffering from CLL who had given informed consent for sample collection and analysis. Tumor samples were from before treatment and tumor cells were separated from non-tumor cells by immunomagnetic depletion of T cells, natural killer cells, monocytes and granulocytes 2 . Tumor cell purity was >98% as assessed by flow cytometry. Normal blood cells from the same patient were used for the normal sample that contained less than 0.05% tumor cell contamination as assessed by flow cytometry. The tumor sample was sorted using FACS CD. Two different libraries each for the tumor and the corresponding normal DNA sample using Illumina TruSeq™ DNA Sample Preparation procedures with slight modifications. While one library was prepared following the standard protocol with 10 cycles of PCR, the second library was heated to 72 degrees Celsius before adapter ligation and cooled down suddenly to 4 degrees Celsius. This resulted in a biased proportion of high GC content reads and counterbalances some of Illumina's sample preparation methods' GC-bias (improved coverage of elevated GC-content regions of the genome. The normal and GC enriched libraries were sequenced on Illumina GAIIx (2x150 bp) and Illumina HiSeq2000 (2x100 bp) instruments. The same amount of data (~40x coverage) was produced for tumor and corresponding normal sample with the proportions of the two library types -600 million reads for the standard and 200 million reads for the GC enriched library. Reads in FASTQ format were generated using the RTA software provided by Illumina.

Submission of CLL mutation calls
Illumina sequence corresponding to 40x coverage of a CLL tumor sample and the corresponding normal sample (above) were made available to members of the ICGC. Participants were asked to return VCF v4.1 files for simple somatic mutations (SSM) and somatic indel mutations SIM), as well as a list of structural variations. We specified that submitters should classify calls that they had high confidence in and calls that might be present, but in which they had less confidence. The threshold for high and medium confidence was left up to the participants. We received 19 SSM and 16 SIM submissions, of which several were independent submissions by the same group using different versions of their pipelines (e.g. CLL.C1 and CLL.C2, CLL.D1 and CLL.D2) or a pipeline used for the MB benchmark (see below) that was run later on the CLL benchmark data (CLL.F, CLL.L and CLL.N, for example).

Verification by target capture and orthogonal sequencing technologies
Verification of mutation calls and generation of a "Gold Set" was carried out after the first 14 submissions. 4080 SSMs and 883 SIMs were selected for Haloplex target capture (Supplementary Table 9). This corresponded to all mutation calls shared by at least two submissions and up to 400 randomly selected For verification of somatic mutations using IonTorrent, a HaloPlex custom capture was designed to enrich a total of 5179 mutations (4492 substitutions and 687 small indels) with a target region of 100 bp centered on the mutation position using the SureDesign software (Agilent). Sequencing was performed using two runs of the IonTorrent 418 chip, resulting in a total of more than 5 and 8 million reads for the tumor and normal samples, respectively.

Generation of a CLL Gold Set
All 40x reads (GAIIx and HiSeq2000 reads that were provided to the participants), MiSeq and IonTorrent verification reads were aligned with GEM (gem-mapper) and converted to BAM format. Alignments were filtered to retain only primary alignments with mapping quality >= 20. Duplicates were removed with Picard, indels realigned at 1000 genomes indel target locations, and indels were left-aligned using GATK. The pileups at SSM positions were extracted using samtools mpileup with base quality threshold >= 13. Read depth and base counts were extracted using a custom script. Mutant allele and normal counts were compared using in-house software snape-cmp-counts, which compares alternate and reference allele counts in tumor and normal, and then scored according to the probability that they are derived from different beta distributions. Mappabilities with 0, 1 and 2% mismatches were computed for the reference genome (h37d5). The average mappabilities in 100bp windows preceding each candidate mutation were stored as tracks for visualization in IGV. In addition, the segmental duplication annotation from the UCSC browser was loaded into IGV. Mutations were then classified as follows. Mutations with sufficient depth (>=20) and a snape score >=0.98, average mappability of one, and no overlap with segmental duplications were automatically classified in the Gold Set according to their mutant allele frequency (class 1: MAF>=0.1, class 2: 0.1>MAF>=0.05 or class 3: MAF<0.05). All other candidates with snape score >0.9 were reviewed visually in IGV. Mutations with ambiguous alignments were assigned to class 4. Low depth but otherwise plausible mutations were assigned to class 5. Somatic mutation Gold Set tiers were compiled by cumulative addition of classes so that Tier 1 only includes class 1, while Tier 2 includes class 1 and class 2, Tier 3 includes classes 1, 2 and 3, etc. All other candidate mutations were rejected and assigned to class 0 (Supplementary Table 10). The CLL Gold Set had a total of 1507 bona fide mutation calls across all, with 1142, 1238, 1312, and 1319 SSMs in Tiers 1, 2, 3 and 4, respectively, and 118 and 134 SIMs in Tiers 1 and 4, respectively. This corresponds to a mutational load of almost one verified mutation every two Mbp.

Initial assessment and revisions
Submitted mutation calls on the CLL benchmark data exhibited a low level of concordance ( Supplementary  Figures 22-25). In the initial submission we observed an alarmingly low degree of agreement: the number of CLL SSMs called in any one call set ranged from 250 to 18000 in the first 14 submissions prior to correction, and after correction ranged from 500 to 15,000, while the number of SSMs that all call sets agreed on was initially only 77, increasing to 149 after correction. CLL.C2, CLL.K and CLL.U contributed a large number of private calls (see Supplementary Fig. 23) whereas several other submissions had hardly any private calls. For CLL SIMs, the number of calls ranged from 18 (CLL.N) to 4152 (CLL.U) ( Supplementary  Fig. 25). It is clear that there is substantially less overlap among SIM call sets. The intersection of all SIM call sets resulted in only five SIMs. As the goal of the exercise was not to compete but to cooperate in order to improve the state of affairs, submitters were allowed to revise their submissions several times, each time using feedback on the level of concordance among all submitted call sets, but with no information about specific mutations. The overall results of the resubmissions and additional new submissions were not much improved, with the exception that submissions with very high numbers of SSM/SIM calls had revised call rates more in line with the rest of the pack.

Evaluation of CLL submissions with respect to Gold Set
Precision and recall of CLL SSM and SIM submissions are shown in Supplementary Fig. 26 and tabulated in Supplementary Table 11. Clustering of TP or FP mutation calls based on Jaccard score (overlap) is shown in Supplementary Figures 27-30. Clustering of pipelines based on parameters is shown in Supplementary Fig.  31 (input parameters are available in the Supplementary Data File). A rainfall plot of genomic positional clustering of calls is shown in Supplementary Figure 32. Supplementary Figures 33-36 show enrichment or depletion of genomic and alignment features features in FP and FN CLL SSMs and SIMs. The absolute numbers for SSMs and SIMs are tabulated in Supplementary Tables 12 and 13, respectively.

Effect of library preparation and sequencing methods
A single tumor-blood DNA pair was sequenced at multiple sites (the National Center for Genome Analysis (CNAG), Barcelona, Spain; the German Cancer Research Center (DKFZ), Heidelberg, Germany; the RIKEN institute, Tokyo, Japan; the Ontario Institute for Cancer Research (OICR), Toronto, Canada and the Wellcome Trust Sanger Institute, Hinxton, UK). The results were subsequently compared using both local and centralized analysis. The tumor chosen for this analysis was a medulloblastoma (a malignant pediatric brain tumor arising in the cerebellum 3,4 ) from the ICGC PedBrain Tumor project. This tumor type typically shows a very high tumor purity (usually >95%), but also often carries ploidy changes and other copy number alterations, thereby allowing for analysis of mutation detection performance at different allele frequencies 5 . Merging data from the different contributing centers and analyzing the combined dataset resulted in an extremely high WGS coverage of >300x for the tumor and >250x for the germline control. This allowed us to investigate variant-calling parameters at very low allele frequencies, as well as the impact of imbalanced tumor vs. control coverage levels and of total sequencing coverage on mutation detection performance.

Influence of library preparation on sequencing metrics
An even coverage is important for RNA and DNA sequencing 6 . The evenness of the coverage will be highly affected by structural events such as chromothripsis. Therefore, chromosome 22, being a small chromosome without copy number aberrations in this tumor (Supplementary Figure 1), was chosen to further assess base-wise coverage. The standard deviation of coverage ranged from 8.58 in the most evenly covered library to 38.49 in the most unevenly covered, which could have a significant impact on the ability to call copy number variations (Supplementary Table 14). Differences in coverage between tumor and control also influence the ability to call simple somatic mutations (SSMs) and small insertions/deletions (somatic indel mutations, SIMs), so we additionally calculated the standard deviation of the absolute pairwise coverage difference (tumor vs. control). The values ranged from 5.14 in a good library to 11.37 in a strongly biased library (Supplementary Table 14). Two methods showed a marked variation in coverage, with a dramatic and unexpected increase in the number of sequencing reads mapping to regions of high GC content. This also resulted in much 'noisier' copy number profiles derived from these libraries, likely reducing the resolution at which structural variants could be reliably called (Supplementary Figure 1). One possible explanation for this may be DNA-binding beads used during the clean-up process, which could feasibly bind more strongly to GC-rich sequences at a given fragment size under certain concentration and/or temperature conditions.
While combining all libraries to give a coverage of close to 300x reduced the 'missing' exon fraction to just 0.1%, some regions of the genome (including part or all of ~80 genes) were still only covered at <=10x (Supplementary Data 1; none of these genes are listed in the Cancer Gene Census 7 ). The vast majority of these regions (>98%) were in non-uniquely mappable areas such as telomeric or centromeric repeats. These will likely never be covered using routine short-read methods, regardless of the total read count (e.g. in stretches of long, highly homologous repeats). Library A also contained some longer 2x250bp MiSeq reads as opposed to standard HiSeq 2x101bp, but the overall contribution of these (below 2x) was too low to assess whether they may help in covering some of the missed regions.
We also examined the performance of each dataset in regions of biased nucleotide composition that were previously reported to be challenging to sequence across different platforms 8 . There was a marked variation in coverage in these regions, in keeping with the notable GC-bias observed in some libraries (Supplementary Table 15). The best overall performance in terms of evenness of coverage was seen with the PCR-free library, and this also outperformed the methods previously reported in the study of Ross et al. 8 . Of note, some regions showed a significant discrepancy in coverage between tumor and normal in certain regions, which would likely compromise variant calling in these loci.

Comparison of variant calling on the different libraries
The first comparison of variant calls that we performed was using each individual center's own mutation calling algorithm on their sequencing output, which resulted in a surprising amount of variation. Whilst there was a core set of mutations called by all 5 centers, this was the case for less than 20% of the total number of called variants (Supplementary Figure 37a). Allele frequency plots indicated that these consensus calls showed clear peaks at ~50% (heterozygous mutations occurring while the tumor was diploid) and ~25% (mutations occurring in 1 of 4 alleles after tetraploidization of the genome), while those made by less than 4 centers were shifted towards a lower allele frequency. This may indicate either increased variability in sensitivity of the pipelines as allele frequency decreases, and/or some mutations at such a low frequency that there were no variant reads in certain datasets (Supplementary Figure 37a). The mutation contexts of the variants were reasonably similar across centers, with the majority being C > T transitions in a GpCpG or ApCpG context, although some variability can clearly be seen across the 5 sets (Supplementary Figure 37b). Roughly one third of the mutations were unique to only one center, with the remainder variably called by 2-4 groups. One of the most notable differences was the low total number of calls made by center C, resulting in a large proportion of calls called by the other 4 centers but not this one. Based on the outcome of the ICGC benchmark analyses, however, this center has now modified its analysis pipeline to slightly relax some over-stringent filtering steps, resulting in a much greater overlap with the other calls (not shown). When looking further at mutational signatures as defined by Alexandrov and colleagues 1 rather than simple base change contexts, variation can also be seen per center in the number and type of mutational processes identified (Supplementary Figure 37c).
In terms of coding alterations, there was a greater degree of overlap, but certainly not 100% concordance. Four non-synonymous, one splice-site (SET) and one stop gain (ANGPT1) SSMs were identified from the curated MB Gold Set of somatic mutations, which were also present at more than 10% allele frequency in each individual dataset. Of these, one center called all 6, two centers called 5, and one 4. One outlier called only two originally, which was found to be a result of minor contamination of the control sample with tumor DNA. A second library preparation resolved this issue, and all 6 SSMs were subsequently called (Supplementary Figure 38). One analysis pipeline also indicated a potential SSM in ZMYM3 that was not detected in the other sets. Further inspection revealed that this alteration is probably a complex SIM rather than a single point change (discussed below).
Interestingly, the variation of calls between these five centers was higher for this exercise than for the somatic mutation calling pipeline benchmark. In particular, each center calling on their own library produced a higher variation than for the same centers calling on the same tumor-normal pair, but on data from only one center (L.A), clearly indicating that library variations contribute to the observed heterogeneity of mutation calls. When excluding unique calls, fewer than 60% of SSM calls were shared between four or more centers and fewer than 20% were called by all five when analyzing different libraries. When using only one library, however, more than 60% of SSM calls were shared across all centers (Supplementary Figure 39).
Although the previous comparison already provided some evidence of a role for pre-analysis sequencing pipelines in generating differences between datasets, we wanted to further assess this by removing any variation in the analysis pipeline itself. We therefore re-aligned and re-called mutations on each dataset using one standardized pipeline (the DKFZ pipeline was chosen for logistical reasons). This resulted in a notably better consensus of mutations called by more than one center (>80% called by at least 4 out of 5 centers, Supplementary Fig. 40a, versus <60% with different pipelines, Supplementary Fig. 37a), but an unexpected increase in the number of private mutations, particularly for one center. A shift towards lower allele frequencies was again seen in the mutations not called in all centers (Supplementary Fig. 40a). Analysis of mutation contexts indicated that the vast majority of these excess mutations were T>G transversions with low allele frequency, which were not observed at high frequency in the other datasets. Simply filtering out mutations with low allele frequency arising in this context resulted in an improvement in the overlap of mutation calls, but many more exclusively called ('private') alterations remained compared with the center's own calls on their data (filtered against other reference samples, Supplementary Fig. 37a and Supplementary Fig. 40b). Closer investigation revealed that the cause for this artifact was a center-specific method for adjusting base quality q-scores, whereby a calibrating PhiX library was spiked into each sequencing lane. Unfortunately, this actually led to an increase in the specific artifact detected in this comparison (Supplementary Fig. 40c), and the center has subsequently reverted to default q-score metrics. The fact that the same phenomenon was not seen in the center's own calls on their data (Supplementary Fig. 37) is because it had already been identified, and a customized filter applied to account for it (removal of such changes also observed in a panel of 48 sequenced normal samples). This emphasizes that care must be taken when re-analyzing publicly available genome data from external centers, particularly when details on library preparation and center-specific customized 'blacklists' are often not known. This effect also had an impact on the mutational signatures identified, with a different distribution of processes observed in this mutation set than for each center calling their own variants ( Supplementary Fig. 40d), further suggesting that both library preparation and calling algorithms can strongly affect the ability to accurately detect such signature.

Effect of reference genome builds
In addition to testing the choice of mapper on SSM calling, we investigated the effect of genome reference build version on SSM calling. Novoalign2 (http://www.novocraft.com/documentation/novoalign-2/) was used for mapping against the selected three human genome reference builds ("hg19r" being a subset of "b37", which in turn is a subset of "b37+decoy"). Supplementary Table 16 shows that using a larger reference build leads to higher mapping rates and shorter mapping times. Rather than benefiting from the additional alignments (alignments to likely uninteresting parts of the extended reference), the advantage of the decoy sequences should be in prevention of misalignments (i.e. avoiding incorrect mapping to chromosomal regions by providing the appropriate reference). Conducted SSM calling tests showed that smaller reference builds lead to significantly larger unfiltered SSM call sets (removal of decoy sequences from the reference accounts for 25 -31 % increase in the SSM call set size in our experiment, depending on the caller (Supplementary Fig. 41). The fraction of these additional calls that overlap the Gold Set is however negligible (0.00).

Effect of mapper
Qprofiler results indicate that relative mapping rates for the individual chromosomes are generally similar for all the mappers, with minor exceptions for chromosome Y and the decoy contig (Supplementary Fig.  42). Distributions of insert sizes are similarly in good overall agreement, especially in the ranges that are expected to contain the bulk of proper pairs (Supplementary Fig. 43).
Apparent inter-mapper differences appeared to be in the assignment of mapping quality values (Supplementary Figures 44a and 44b), with GEM tending to assign lower mapping qualities than the other mappers, while Novoalign2 tending to do the opposite. Whether distinct mapping quality assignments alone can impact SSM calling will depend on given SSM caller's methods for utilizing mapping quality information. One of the main reasons for an abnormally large SSM call set produced by MuTect on Novoalign2 alignments might be MuTect's sensitivity to Novoalign2's mapping quality assignments -a very low fraction of reads with mapping quality 0 together with a noticeably higher fraction of reads with mapping quality 20 (when compared to the other mappers, Supplementary Fig. 44c). Reads with mapping quality of 0, unlike reads with mapping qualities in the range between 1 and 4, are used by MuTect in a built-in mutation filter; mutation support by at least one read with mapping quality of 20 is required by the same filter. Fig. 45) reveal that GEM alignments contain slightly fewer deletion events (of size at least 15 bp), while for Novoalign2 alignments a similar trend is evident for insertions and deletions with size of at least 35 bp. The latter phenomenon might be caused by Novoalign2 trimming its input reads to 150 bp in length by default.

Alignment-level insertion and deletion statistics (Supplementary
Perhaps the most prominent differences can be observed in MD-field derived alignment mismatch statistics. When mismatch counts are considered (Supplementary Fig. 46), consistent differences between the mappers can be observed -for all non-ambiguous mismatch types, the counts are highest for GEM alignments, with BWA-mem, BWA and Novoalign2 following always in the same order. When ratios of the individual substitution types are considered as a measure instead of absolute counts (Supplementary Fig.  47), two additional disproportions become more clearly visible. Firstly, some symmetrical mismatches are well balanced (C>T and G>A) while others are not (C>A and G>T). Being largely mapper-independent, this bias might be originating upstream of the mapping step. Secondly, the individual mappers seem to have preferred mutations types (e.g. Novoalign2 alignments contain relatively high fractions of C>T and G>A mismatches and relatively low fraction of A>C mutations, while the situation is reversed for BWA-mem alignments).
Except for the suspicion that mapping quality score assignment strategy might considerably influence SSM calling in specific pipeline setups, no links have been established between the observed mapper differences and their causes (i.e. particularities of respective alignment methods) or effects (impacts on mutation calling). An additional study, wholly focused on establishing such links, could however be of great benefit for evaluation and development of custom pipeline designs.
It is important to note that performances of individual components and their combinations might vary depending on the experiment type, sequencing depth and analysis steps that have not been tested or included in our benchmarking.

CLL.A FASTQ PROCESSING
(no details specified)

MAPPING AND BAM PROCESSING
Read sequences were mapped by BWA v0.5.8a to the human reference genome (GRCh37) with default settings. Possible PCR duplicated reads were removed by SAMtools 9 v0.1.8 with default settings. After filtering by pair mapping distance, mapping uniqueness and orientation between paired reads, the mapping result files were converted into pileup format by SAMtools with -scf option. We used three kinds of read filters: set1; both read pairs were uniquely mapped with consistent orientation and pair distance (within average ± 3 s.d.), set2; at least one read pair was uniquely mapped with consistent orientation and pair distance and, set3; all uniquely mapped paired reads and set2, as described elsewhere.

SSM SIM
SSM and SIM calls were done using all three sets of filtered reads, and mutations identified in the all three sets were considered as candidates.
The scripts for SSM and SIM calling are available from http://emu.src.riken.jp.

SV
Inconsistent read pairs which occurred within 500bp of each other were considered to support the same rearrangement. We identified candidate rearrangements in both tumor (support read pairs ≥ 4) and normal tissue (support read pairs ≥ 1) samples, and tumor specific rearrangement candidates were identified. To exclude mapping errors, we performed a blast search of read pairs that support rearrangements against the reference genome. If a paired read mapped with correct orientation and distance (≤ 500 bp) with an E-value < 10-7, we excluded that read pair. Reads mapped with more than two mismatches were also discarded. After filtering, candidates supported by ≥ 4 read pairs and at least one perfect match pair were considered as somatic rearrangements.

CNV
Copy number alternations were detected by calculating the ratio of the average depth of coverage in cancer to that in blood for 5kbp bins and analyzed the ratio using the DNAcopy R package.

SSM
 non-reference calls with a frequency ≥ 0.15 after removing bases calls with base quality < 10, and mapping quality < 20  supported by at least two base calls including one base call with base quality ≥ 30  a SAMtools consensus quality ≥ 20 and maximum mapping quality ≥ 40  if three or more SNVs were found within any 10bp windows, or distance from nearest indel was less than 5bp, we discarded all SNVs  if candidate non-coding SNVs were in a tandem repeat region suggested by tandem repeat finder, we discarded the SNVs  if candidate SNVs were in RepeatMasker repeat regions (http://www.repeatmasker.org) within 1Mb from the centoromeric or telemeric gaps, we discarded the SNVs  if a base with consensus quality lower than 20 occurs within 3bp on either side of the target SNV, we discarded the SNVs After SNV calling in the tumor samples, candidate SSMs were filtered based on the lymphocyte sequence of the same patient:  candidate SNV alleles with a frequency ≥ 0.03 after removing reads with base quality < 15, and mapping quality < 20  depth of coverage in lymphocyte ≤ 7  depth of coverage in lymphocyte ≤ 10 and candidate SSM allele was represented in the dbSNP database v131.

SIM
Short indels were identified based on gaps in a read's alignment by BWA. We defined indels using the following criteria:  if indels were supported by a frequency ≥ 0.1 and ≥ 4 reads after removing reads with mapping quality < 20  if candidate non-coding indels were in repeat regions suggested by tandem repeat finder or RepeatMasker, we discarded the indels.  After indel calling in tumor samples, the candidate SIMs were filtered based on the lymphocyte sequence of the same patient using the following criteria;  depth of coverage in lymphocyte ≤ 7  for coding and non-coding region, if any indels were identified within 5bp or 10bp region in lymphocyte, respectively, the candidate SIM was discarded.

CLL.B FASTQ PROCESSING
(no details specified)

MAPPING AND BAM PROCESSING
In the first step sequencing data were aligned to the human reference genome (GRCh37) using Burrows-Wheeler alignment (BWA) and sorted according to their chromosomal position. In the second step, read groups, library, sequencing platform and sample information were added to the initial alignment result BAM file Picard tools. In the third step, lane-level BAM files were aggregated into library-level BAM files, which at last were combined into samplelevel BAM files. Mismatch clusters in narrow regions are likely due to read misalignment and will lead to the accumulation of erroneously called mutations so we used the GATK to do the local realignment to solve this problem. In the final step, molecular duplicate reads from PCR amplification of library fragments are flagged to indicate artifacts. After above processing the final BAM file was used to call SSM and SIM. BWA version: 0.5.9-r16 bwa aln : -o 1 -i 15 -l 31 -k 2 -t 10 -m 100000 -e 63 -q 10 -I bwa sampe using default parameter Picard version: 1.54 We used the Picard AddOrReplaceReadGroups.jar module to add the reads group, library, sequencing platform and sample information in the BAM header, and the MarkDuplicates.jar module was used to flag the artifacts reads using default parameters. GATK version: v1.0.6076 We used the RealignerTargetCreator and IndelRealigner module to select the regions where mismatches were clustered and to do the realignment process using default parameters.

SIM
SIMs were first called using GATK SomaticIndelDetector using default parameters. For the SIMs and their flanking regions of 500bp, normal reads and tumor reads were realigned to hg19. Then we uses Varscan to identify SIMs: SAMtools mpileup was run with parameters -Q 0 Varscan was run with parameters --min-coverage 5 --min-coverage-normal 5 --min-coverage-tumor 5 --min-var-freq 0.1 Distance between adjacent SIM and to adjacent SSM had to be >10bp.

SV
We use seeksv which is inhouse software to call structural variations. It is similar to CREST and uses next-generation sequencing reads with partial alignments to a reference genome to directly map structural variations at the nucleotide level of resolution. This pipeline has 5 steps:  Get soft-clipped reads from the original normal alignment file and tumor alignment file.  Aligned the clipped sequences (unmapped parts of the soft-clipped reads) to human reference genome.  Get SV breakpoints basing on the two breakpoints' (soft-clipped reads) alignment positions.  Compare the final tumor breakpoints to the soft-clipped reads file of normal sample and get the somatic SV breakpoints.  Confirm the SV breakpoints with discordant read-pairs.

SSM
Filter parameters of high confidence results:  Adjacent somatic mutation distance > 10  The number of alternative allele reads >= 5  Alternative allele frequency in the matched normal sample < 0.05  Reads mapping to this mutations are not significantly enriched for mismatches (p-value>0.005)  Reads with best hits greater than 1 are not significantly overrepresented at this position (P-value>0.01)  Map Quality score not significant lower in mutated reads than other alleles (Map Quality score cutoff is 30. Fisher's exact test, p < 0.01)  Base Quality score for mutated position not significantly lower than other alleles (Base Quality score cutoff is 20. Fisher's exact test, p < 0.05)  Mutant allele frequency change between tumor and adjacent normal (Fisher's exact test p < 0.05)  Mutations were not in gap aligned reads (in neither 20bp flank region less than 10 gap flags)  Mutant allele not significantly enriched within 5 bps of 5' or 3' ends of reads (Fisher's exact test, p < 0.05)  Mutations were not in simpleRepeatRegion (Repeat events less than 6) , simpleRepeatRegion (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz ) Filter parameters of medium confidence results:  Adjacent somatic mutation distance > 10  Map Quality score not significantly lower in mutated reads than other alleles (Map Quality score cutoff is 30. Fisher's exact test, p < 0.20)  Base Quality score for mutated position not significantly lower than other alleles (Base Quality score cutoff is 20. Fisher's exact test, p < 0.05)  Mutant allele frequency change between tumor and adjacent normal (Fisher's exact test p < 0.05)  Mutations were not in gap aligned reads (in neither 20bp flank region less than 10 gap flag)  Mutant allele not significantly enriched within 5 bps of 5' or 3' ends of reads (Fisher's exact test, p < 0.05)  Mutations were not in simpleRepeatRegion (Repeat events less than 6) , simpleRepeatRegion (http://hgdownload.cse.ucsc.edu/goldenPath/hg 19/database/simpleRepeat.txt.gz)

SIM
Filter parameters of high confidence results:  Indels were not annotated in the 1000 Genomes Project indels database  The number of alternative allele reads >= 5  Reads mapping to this mutations are not significantly enriched for mismatches (P-value>0.005)  Reads with best hits greater than 1 are not significantly overrepresented at this position (P-value>0.01)  Map Quality score not significantly lower in mutated reads than other alleles (Map Quality score cutoff is 30. Fisher's exact test, p < 0.01) Filter parameters of medium confidence results:  Indels were not annotated in the 1000 Genomes Project indels database

SV
Filter parameters of high confidence results :  Deletions smaller than 100 base-pairs are filtered.  When the left-breakpoint and the right-breakpoint of the tumor breakpoint are from the same chromosome. position of the left-breakpoint: p1 ,position of the right-breakpoint:p2, if 0 < p1 -p2 < 100 , this breakpoint is filtered.  The mutation frequency of the tumor breakpoints must be at least 0.1.  The minimum number of tumor soft-clipped reads is 3  In normal sample, the number of soft-clipped reads must be 0.  In the tumor sample, the number of discordant read-pairs that support the tumor breakpoint must be at least 1.  In the normal sample, the number of discordant read-pairs that support the breakpoint must be 0.
Filter parameters of medium confidence results :  Deletions smaller than 100 base-pairs are filtered.  When left-breakpoint and right-breakpoint are from the same chromosome. position of left-breakpoint: p1 ,position of right-breakpoint:p2, if 0 < p1 -p2 < 100 , this breakpoint is filtered.  The mutation frequency of the breakpoints must be at least 0.1.  The minimum number of soft-clipped reads is 3.  In the normal sample, the number of discordant read-pairs that support the breakpoint must be 0.  In the normal sample, the number of soft-clipped reads must be 1 or 2 ， and let the number of discordant readpairs support the breakpoint in the tumor sample be N, let the number of soft-clipped reads support the breakpoint in the tumor sample be M, let the number of soft-clipped reads support the breakpoint in the normal sample be a, if M+N-a >20, put this result into medium confidence results.

CLL.E FASTQ PROCESSING
Data quality was assessed using FastQC and in-house contaminant/adapter screening tool. The quality score distribution from FastQC was used to guide trimming of reads, particularly for the GAIIx 150bp paired end runs.

MAPPING AND BAM PROCESSING
Alignment: BWA v0.5.9 was used to align the sequence reads in paired end mode (bwa aln and sampe) to the GRCh37/hg19 human reference genome from UCSC, with the following settings: -I (to convert Phred+64 quality scores to Phred+33) Merging lane data and marking duplicates: Picard v1.66 was used to mark duplicate reads on a per-lane basis and then to merge BAM files for each sequencing lane into a per-sample BAM file (one for each of 184TD and 184ND) while also marking duplicates for the sample as a whole.
Post-alignment QC: Further data QC was performed on the aligned BAM files to assess alignment and error rates, coverage, duplication, etc., both on a per-lane and per-sample basis, and to detect spatial artifacts, using tools developed in-house.

SV
Large-scale structural rearrangements were detected using a combination of approaches including discordant read pair clustering, soft-clipped (or split) read analysis and de novo assembly to determine breakpoints to base pair resolution.
Discordant read pair analysis : The analysis pipeline for identifying putative structural mutations (SVs) from paired end data comprises the following steps.  Compute distribution of inferred insert sizes and extract discordant read pairs where each end maps to different chromosomes, or in incorrect orientation, or with an insert size greater than 5 standard deviations above the median  Realign discordant read pairs and read pairs for which only one end was mapped using novoalign v2.07.15b using options -I (to specify the median insert size and standard deviation) and -r Random  Extract discordant read pairs from resulting alignments using proper pair bit flag  Cluster discordant read pairs from the tumor (184TD), matched normal (184ND) and a pool of 22 other non-related normal samples sequenced to 50x depth as part of the UK OAC ICGC project  Soft-clipped read analysis  CREST v1.0 15 was used to analyze soft-clipped reads from the BWA-aligned BAM files in paired mode. Somatic SV calls were taken forward for assessment by de novo assembly as described below.BAM files for the tumor and normal were split on a per-chromosome basis, and the extractSClip perl script applied separately to each chromosome. The extracted soft-clipped reads were then concatenated together, and subtracted using the countDiff perl script before being passed to CREST.pl with the default parameters.
Local de novo assembly: A local de novo assembly of reads in the vicinity of putative SV breakpoint ends was carried out to determine the breakpoint to base pair resolution and assess the validity of that breakpoint by finding breakpoint-spanning reads. This is carried using an in-house developed Java program that is under active development and that makes use of the SAM-JDK and Picard packages and calls a number of tools including Readjoiner v1.1, BLAT v34 and exonerate v2.2.

SSM
SSMs were filtered using the criteria below to produce a set of high and medium confidence call:  Somatic mutations only (somatic status, SS = 2)  False positive filters using fpfilter.pl script from SomaticSniper/VarScan2  Average mutation position in supporting reads, relative to read length between 0.1 and 0.9  Strandedness -fraction of supporting reads from the forward strand between 0.01 and 0.99  Mutant read count >= 4  Mutant read frequency >= 0.05  Mismatch quality sum difference < 50  Mutation mismatch quality sum < 100  Mapping quality difference < 30  Difference in average trimmed read length between reference and mutant reads < 25  Distance to effect 3' end for mutant reads < 0.  ---

CLL.G FASTQ PROCESSING
(no details specified)

MAPPING AND BAM PROCESSING
BWA: Default settings, appropriate insert size set for library Sanity checking of SW mapped reads: If a read is SW mapped the soft clipping must pass the following criteria:  leading edge of read must not be clipped by more than 5 bp  trailing edge of read must not be clipped by more than (readlength*0.1)+1

SIM
In -house post-processing methods for pindel output.

CLL.I FASTQ PROCESSING
(no details specified)

MUTATION FILTRATION
(no details specified)

SIM
SIMs were called jointly with SSMs (see above).

MUTATION FILTRATION
The final merge utilized an in-house scripts that captured calls found from both Strelka and MuTect analyses. Some field names were reworked to allow inclusion from both sources. Headers were merged and annotated with source caller. Genotypes (GT) were taken from the MuTect SNV calls as Strelka does not provide these. For indels called by strelka, genotype calls were formed by review of the data.

CLL.O FASTQ PROCESSING
All lanes were used to create the sample bam file. The trimming option of BWA should avoid mapping of poor quality reads.

MAPPING AND BAM PROCESSING
Our mapping pipeline combines the following tools: Core mapping with BWA v0.5.9:  bwa aln -l 32 -t 6 -q 20 (6 CPUs used)  bwa sampe -P We use '-q 20' option of BWA to trim 3' end of the reads. Reads are trimmed down to a point where the remaining bases are above a quality threshold of 20 (see BWA documentation). But a read could not be shorter than 35bp. This trimming is done before BWA tries to map back the read to the reference genome. Presently 'bwa sampe' is called within a Perl wrapper and pairs with both reads unmapped are written to a separate BAM file.
Bam processing:  Sorting + indexing lane BAM file (Picard tools v1.73, standard parameters)  Indel realignment at known sites (GATK2 v2.0-23, standard parameters)  Duplicates Marking (Picard, standard parameters)  Recalibration (GATK2, standard parameters)  Picard MergeSamFile  Picard Markduplicates (with remove duplicates option) We merged all the lanes that passed the quality controls. We removed the duplicates to obtain a BAM file at the sample level.

SIM
We used SomaticIndelDetector, a tool included in GATK2 to call small indels.

SSM
We selected about 24k calls from the output of MuTect and filtered them according to S. Nik-Zainal:  base quality of mutant alleles  mapping quality of reads carrying mutant allele  the number and the position in the read of the mutant allele in the tumor sample  the phase of reads (balanced or not) carrying the mutant allele  the position of the mutant allele within boundaries of known repeated, centromeric and telomeric regions  the presence in a panel of 20 other germline samples sequenced with Illumina protocol  the presence of the mutant allele in the germline sample  the strand of reads (balanced or not) carrying the mutant allele  min number of mutant allele in tumor sample According to the thresholds used in the filters, calls were classified either "high confidence" or "low confidence".

SIM
 min number of reads carrying indel  mapping quality of reads carrying indel  germline filtering  strand bias filtering  mismatches in mapping of reads carrying indel (number and windows around indel filtering)  base quality around indel  the position of indels within boundaries of known repeated, centromeric and telomeric regions  mutation filtering (indel starts at a SNV position)

MAPPING AND BAM PROCESSING
(no details specified)

SSM
Sidrón computes a parameter, called S, that can be used to predict whether a candidate mutation in a pileup reflects an actual genomic mutation or can be attributed to noise from the sequencing machine. The noise is estimated from a large collection of genomic sites which are known or strongly suspected to contain no mutations. Then, the S value is calculated for each position, and pre-determined cut-off points are used to predict the genotype. Finally, positions classified as somatic mutations must pass a series of filters based on putative alignment problems.
The algorithm: Get a set of probably homozygous positions and extract their pileups. When SNP microarrays are available, we extract these positions from their results. If not, we select positions with coverage higher than 12 and the same read base in at least 90% of the covering reads. In either case, we use about 300,000 positions for the next step. Use the Perl script set_model.pl to calculate and save the estimated error table. The script assumes that the real base is the most frequently read base in each position. Then, the script populates a table where real bases (br) are related to observed bases (bo). These frequencies are broken down into quality tiers according to the Illumina score: low (0-9), medium-low (10)(11)(12)(13)(14)(15)(16)(17)(18)(19), medium-high (20-29) and high (30 or higher). See Supplementary Figure 48 for example for a part of a table.
Assemble a pileup file with all the candidate positions where a tumor mutation is expected (TD.mq). This step can be performed with the filtering programs in Samtools and Bcftools. The filtering conditions are set to non-restrictive, so that lots of false positives are expected. The same positions are extracted from the alignment of the normal (constitutive) sample with samtools into TD.fi ( samtools mpileup -f path_to_genome -l TD.mq normal_bam_file > TD.fi ). Use the estimated error table with the sidron.pl Perl script to set an S score for each candidate position. Here is one example: . Then, the S values were calculated and cutoff points were selected to achieve less than 5% false positives (predicted Het when in fact Hz) and as high a sensitivity as possible. Depths lower than 6 were subsequently discarded.

SIM
SIMs were called using the same pipeline as for SSMs (see above).

MUTATION FILTRATION
Use the Perl script polyfilter.pl to filter out putative sequencing and alignment artifacts. This scripts processes each pileup position and looks for three hallmarks of putative artifacts:  The somatic mutation creates a polyN stretch of 10 or more nucleotides. Sometimes, in the context of near-polyN stretches (i. e. AAAAGAAAAAA), the sequencing machine completes the polyN (in the example, AAAAAAAAAAA) even in the absence of a mutation. These positions are eliminated from the result.  Reads supporting a mutation which can align at other sites in the genome according to BLAT (local installation) are eliminated from the result.  If all of the reads pointing to the mutation contain the mutant nucleotide at similar positions, the position is eliminated from the result. This typically arises from alignment artifacts in the proximity of small indels.  Repeat the sidron.pl step. This is necessary because some pileup positions may have been changed by polyfilter.pl.

CLL.S FASTQ PROCESSING
Quality trim -quality 2 or lower trimmed off end.

MAPPING AND BAM PROCESSING
SNAP (http://snap.cs.berkeley.edu/). We used the default settings for all parameters but max hits, which we increased to 1000 (-h 1000).

SSM
BamBam  2 minimum reads to support SNPs  minimum base quality of 10  minimum neighborhood quality of 10 (5 surrounding bases must have base quality 10 or higher)  minimum mapping quality of 10 (made no difference, mapping software only has mapping qualities of 60 and 0)  10% expected fraction germline  maximum base quality of 40

COMPUTATIONAL RESOURCES
Run on a cluster environment with 2.33GHz Xeon processors (4 years old) 25 compute-hours for initial results 1000 compute-hours for post processing and VCF creation (this number will be reduced in the future, it's just absurdly high right now. ---

CLL.T FASTQ PROCESSING
(no details specified)

SSM
Our existing production pipeline calls SSMs and SIMs using GATK and an in-house filtering script. For this benchmarking task we have submitted the intersection of our GATK and VarScan calls as our high confidence set, and the calls private to each tool as our medium confidence set.

CNV
Two in-house tools were used to identify 5 obvious regions of copy number variation based on the lower heterozygous allele frequency and the ratio of normalized depth of coverage between the normal and tumor. These measures were visualized, regions of interest were identified by eye and breakpoints were called manually from the depth of coverage data.

COMPUTATIONAL RESOURCES
Everything was run on an SGE Cluster with 5500 3GHz cores arranged in 16

MUTATION FILTRATION
See above Mutation Calling.

MAPPING AND BAM PROCESSING
Read sequences were mapped by BWA v0.5.8a to the human reference genome (GRCh37) using default options.
Possible PCR duplicated reads were removed by SAMtools v 0.1.18 using default options. After filtering by pair mapping distance, mapping uniqueness and orientation between paired reads, the mapping result files were converted into pileup format by SAMtools with -scf option. We used three kinds of read filters:  set1: both read pairs were uniquely mapped with consistent orientation and pair distance (within average ± 3 s.d.).  set2: at least one read pair was uniquely mapped with consistent orientation and pair distance.  set3: all uniquely mapped paired reads and set2.

SSM
Calls were made using all three sets of filtered reads, and mutations identified in the all three sets were considered as candidates.
The scripts for SSM and SIM calling are available from http://emu.src.riken.jp.

SIM
Calls were made using all three sets of filtered reads, and mutations identified in the all three sets were considered as candidates.
The scripts for SSM and SIM calling are available from http://emu.src.riken.jp.

SSM
 non-reference calls with a frequency ≥ 0.15 after removing bases calls with base quality < 10, and mapping quality < 20  supported by at least two base calls including one base call with base quality ≥ 30  a SAMtools consensus quality ≥ 20 and root mean square mapping quality ≥ 40  if three or more single nucleotide mutations were found within any 10bp windows, or distance from nearest indel was less than 5bp, we discarded all SNVs  if candidate non-coding SNVs were in a tandem repeat region suggested by tandem repeat finder, we discarded the SNVs  if candidate SNVs were in RepeatMasker repeat regions (http://www.repeatmasker.org) within 1Mb from the centoromeric or telemeric gaps, we discarded the SNVs  if a base with consensus quality lower than 20 occurs within 3bp on either side of the target SNV, we discarded the SNVs  After SNV calling in the tumor samples, SSMs were filtered based on the lymphocyte sequence of the same patient:  SSM alleles with a frequency ≥ 0.03 after removing reads with base quality < 15, and mapping quality < 20  depth of coverage in lymphocyte ≤ 7  depth of coverage in lymphocyte ≤ 10 and SSM allele was represented in the dbSNP database v131.

SIM
 if indels were supported by a frequency ≥ 0.1 and ≥ 4 reads after removing reads with mapping quality < 20, and root mean square mapping quality ≥ 40  if candidate non-coding indels were in repeat regions suggested by tandem repeat finder or RepeatMasker, we discarded the indels.  After indel calling in tumor samples, the candidates were filtered based on the lymphocyte sequence of the same patient using the following criteria:  depth of coverage in lymphocyte ≤ 7  for coding and non-coding region, if any indels were identified within 5bp or 10bp region in lymphocyte, respectively, the candidate indel was discarded.

MB.B FASTQ PROCESSING
(no details specified)

MAPPING AND BAM PROCESSING
The BAMs were provided by CNAG.

SIM
SIMs were first called using GATK SomaticIndelDetector using default parameters. For the SIMs and their flanking regions of 500bp, normal reads and tumor reads were realigned to hg19. Then we uses Varscan to identify SIMs: SAMtools mpileup was run with parameters -Q 0 Varscan was run with parameters --min-coverage 5 --min-coverage-normal 5 --min-coverage-tumor 5 --min-var-freq 0.1 Distance between adjacent SIM and to adjacent SSM had to be >10bp.

SSM
Filter parameters of high confidence results:  Adjacent somatic mutation distance > 10  The number of alternative allele reads >= 5  Alternative allele frequency in the matched normal sample < 0.05  Reads mapping to this mutations are not significantly enriched for mismatches (p-value>0.005)  Reads with best hits greater than 1 are not significantly overrepresented at this position (P-value>0.01)  Map Quality score not significant lower in mutated reads than other alleles (Map Quality score cutoff is 30. Fisher's exact test, p < 0.01)  Base Quality score for mutated position not significantly lower than other alleles (Base Quality score cutoff is 20. Fisher's exact test, p < 0.05)  Mutant allele frequency change between tumor and adjacent normal (Fisher's exact test p < 0.05)  Mutations were not in gap aligned reads (in neither 20bp flank region less than 10 gap flags)  Mutant allele not significantly enriched within 5 bps of 5' or 3' ends of reads (Fisher's exact test, p < 0.05)  Mutations were not in simpleRepeatRegion (Repeat events less than 6) , simpleRepeatRegion (http://hgdownload.cse.ucsc.edu/goldenPath/hg 19/database/simpleRepeat.txt.gz)

SIM
Filter parameters of high confidence results:  Indels were not annotated in the 1000 Genomes Project indels database  The number of alternative allele reads >= 5  Reads mapping to this mutations are not significantly enriched for mismatches (P-value>0.005)  Reads with best hits greater than 1 are not significantly overrepresented at this position (P-value>0.01)  Map Quality score not significantly lower in mutated reads than other alleles (Map Quality score cutoff is 30. Fisher's exact test, p < 0.01)