Ximmer: a system for improving accuracy and consistency of CNV calling from exome data

Abstract Background While exome and targeted next-generation DNA sequencing are primarily used for detecting single nucleotide changes and small indels, detection of copy number variants (CNVs) can provide highly valuable additional information from the data. Although there are dozens of exome CNV detection methods available, these are often difficult to use, and accuracy varies unpredictably between and within datasets. Findings We present Ximmer, a tool that supports an end-to-end process for evaluating, tuning, and running analysis methods for detection of CNVs in germline samples. Ximmer includes a simulation framework, implementations of several commonly used CNV detection methods, and a visualization and curation tool that together enable interactive exploration and quality control of CNV results. Using Ximmer, we comprehensively evaluate CNV detection on four datasets using five different detection methods. We show that application of Ximmer can improve accuracy and aid in quality control of CNV detection results. In addition, Ximmer can be used to run analyses and explore CNV results in exome data. Conclusions Ximmer offers a comprehensive tool and method for applying and improving accuracy of CNV detection methods for exome data.

neuropathy and Smith-Magenis-Syndrome, as well as a rare cause for a wide range of mendelian diseases. In particular, single copy deletions can be pathogenic for any disorder caused by haploinsufficiency. To detect CNVs, patients are often screened for CNVs using SNP or array-CGH microarrays prior to use of WES. However, affordable microarrays have limited resolution and add time, cost and complexity to the overall diagnostic workflow. There are consequently significant potential advantages if CNVs can be ascertained directly from WES.
CNVs can be detected from three primary signals in HTS data. These are: anomalous mapping of paired end reads that span CNV breakpoints (PE signals), the splitting of individual reads by CNV breakpoints (SR signals), and fluctuation in the coverage of reads falling in the body of a CNV (the read depth, or RD signal). While all of these signals are effective in whole genome data, the breakpoints of CNVs usually fall between the regions targeted by WES. Therefore only the RD signal is reliably observable in WES data. The RD signal has been shown to be informative due to a strong correlation of copy number with read coverage depth (Sathirapongsasuti et al., 2011).
However, detection of CNVs is confounded by a range of other factors that also influence read coverage depth. Therefore, these factors must be corrected, and failure to do so can result in significantly degraded accuracy.
However, independent comparisons frequently fail to replicate their findings. For example, Guo et al. reported ExomeDepth having sensitivity of only 19% (Guo et al., 2013), while Ligt et al.
observed a sensitivity of 35% (de Ligt et al., 2013). In the same studies, sensitivity of CoNIFER was cited as 3% and 29% respectively, compared to the original evaluation estimate of 76%. In some contexts, high accuracy is reported. For example, Jo et al (Jo et al., 2016), Ellingford et al (Ellingford et al., 2017) and Feng et al (Feng et al., 2014) all cited 100% sensitivity and high specificity for detection of larger CNVs encountered clinically, in each case using high coverage data. However, the circumstances in which high accuracy can be achieved are currently not well understood.
calling as well as high false positive rates (Samarakoon et al., 2016). Some of the performance variability observed in these studies may be due to differences between the data sets and sequencing design such as the number of samples, read length, insert size, and mean read depth.
Also of critical importance is the size and type of CNVs assessed. However, even when these known technical factors are controlled, significant variability is often still observed between data sets.
In this work we present Ximmer, a software tool that improves CNV calling reliability by enabling users of CNV detection tools to efficiently assess and tune performance. Ximmer contains three parts: a simulation method, an analysis pipeline, and a graphical report. First, Ximmer simulates synthetic single copy deletions in existing WES data. Then, the analysis pipeline automates detection of the deletions with up to 5 commonly used CNV detection methods. Finally, the graphical report shows the combined CNV calling results, including a suite of plots that give insight into the accuracy achieved and strategies for improving performance.
In this article we explain the implementation details of Ximmer, and demonstrate how using Ximmer improves CNV detection accuracy. We show results from four CNV callers on four datasets representing different exome capture kits and different sequencing depths. Our results concur with previous studies, finding that CNV detection performance is highly variable both within and between data sets. However, we show that using Ximmer to gain insight into the variability enables optimisation of the CNV calling, and improves detection of real CNVs. Ximmer offers an integrated framework that is easy to use and freely accessible, from http://ximmer.org.

Methods
The Ximmer process consists of a series of steps designed to optimise CNV detection performance. The steps consist of: (i) simulation of CNVs in the user's data, (ii) execution of CNV callers to find both real and simulated CNVs, (iii) quality and accuracy assessment to discover optimal settings for CNV calling, and finally, (iv) filtering of results to produce a curated CNV list.
This process can be time consuming if conducted manually, however Ximmer automates all of the steps needed. The high level process is depicted in Figure 1.

Figure 1
The Ximmer Process -Ximmer consists of three high level steps. In the first step, simulated CNVs are added to a set of sequence alignments in BAM or CRAM format. This creates new BAM files containing simulated CNVs which are passed to the integrated analysis pipeline. The analysis pipeline runs up to 5 different CNV detection methods and collates the results into a graphical report which generates insight into the performance of the tools and possible avenues for improvement. Finally, when the analysis is optimised provides an interface to filter CNVs, review and interpret them using the built in CNV curation tool.

Simulation
Simulation is the first and most important element of the Ximmer method. By simulating CNVs in the user's own data, Ximmer generates both a prediction of the CNV calling performance, and also insights regarding how to improve performance. To simulate CNVs, Ximmer takes advantage of the exclusive use of the RD signal by WES based CNV detection methods.
Specifically, Ximmer removes reads that overlap selected target regions, such that the RD signal is reduced to match the predicted level associated with a single copy deletion. Ximmer focuses on deletions because depleting reads is significantly simpler than realistically synthesising and adding new reads. While the approach is strictly limited to deletions, the inferences derived are still likely to apply to other CNV states, because in most CNV calling tools the same underlying statistical principles are applied regardless of the number of copies being searched for.
The simulation process begins by randomly selecting the genomic region to become the deletion "target". The reads mapping to these locations can then be depleted using two alternate methods, referred to as "Downsampling" and "X-Replacement". The downsampling method randomly removes each read overlapping the deletion target with probability of 0.5, based on an assumption that the relationship between copy number and read depth is linear. By contrast, the X-Replacement method avoids this assumption. The X-Replacement method replaces reads mapping to X chromosome deletion target regions in a female sample with an adjusted number of reads from the same genomic regions in a male sample. This method exploits the true difference in copy number between male and female X chromosomes to avoid the assumption of linearity implied by downsampling. The X-Replacement method also ensures that other aspects of the reads are preserved in a realistic manner, such as the zygosity and phasing of overlapping SNVs and indels. Further details of the simulation implementation are provided in the supplementary methods (S-1). The result of the simulation step is a new set of alignments (BAM files) for the whole exome, but with deletions simulated in selected regions.

CNV Analysis Pipeline
The second step in the Ximmer process is to analyse the data containing simulated CNVs to produce CNV calls. Ximmer provides a built in analysis pipeline that automatically installs, configures and runs 5 commonly used CNV detection methods. These tools are: ExomeDepth, XHMM, cn.MOPS, CoNIFER and CODEX. The analysis pipeline is constructed using Bpipe (Sadedin et al., 2012), a framework for creating bioinformatic workflows. In addition to running the CNV detection tools, Ximmer performs any necessary pre-processing required by the tools and also post processes the results to merge and annotate the resulting CNV calls. Additional CNV callers can be added to Ximmer with only a small effort through the extensible Bpipe framework.
The analysis produces a report in HTML format that contains a full summary of all the simulated deletions, along with a range of plots and tables to highlight CNV calling performance and potential quality issues.

Results Assessment
Once CNV analysis has been performed, the next step in the method is to critically review the HTML report to assess performance of the CNV callers for detecting the simulated deletions, and to evaluate options for improving the results.

Quality Assessment
Three plots are of particular relevance in understanding potential quality issues. These are the Sample Counts, Genome Distribution and Quality Score Calibration plots.
The Sample Counts plot ( Figure 2A) shows the distribution of the number of CNV calls among the samples, separately for each CNV caller. In most studies we expect the number of CNV calls to be similar for each sample. If some samples contain a disproportionate fraction of the total CNV calls, it is likely that there is a problem with the sample quality. It may be desirable to either remove the samples from the CNV calling altogether, to adjust the caller settings to compensate, or to isolate poor quality samples from use in normalising other samples.
The Genome Distribution plot ( Figure 2B) divides the genome into 5 megabase bins and displays the number of CNVs overlapping each bin. Clicking on a particular region displays an enlarged plot encompassing that region for more detailed inspection. If particular regions contain very large numbers of CNV calls, it may be desirable to remove these from the target regions used for calling, as their presence may distort quality statistics and degrade overall calling accuracy.
The Quality Score Calibration plot ( Figure 2C) assists in interpreting the confidence measures (or quality scores) assigned to CNVs by the CNV callers. For each caller, Ximmer groups the whole CNV call set into approximately 5 quality score bins that collectively span the full range of values assigned by the caller. Ximmer then calculates the fraction of calls categorised as true positives in each bin as an estimate of the precision. The estimates are plotted as a line to illustrate the empirical relationship between precision and quality score for each CNV caller.
When quality scores are well behaved it is expected that the precision should increase monotonically as quality score increases. Failure to observe this relationship suggests the caller may produce high confidence false positives, in which case filtering by quality score alone may be insufficient to reduce the false positive rate. As with the Sample Counts plot, it may be appropriate to review normalisation settings for methods if quality scores assigned by tools are not well behaved.

Accuracy Assessment
After reviewing the quality assessment the next step in the Ximmer process is to review the accuracy estimate. This is presented using a plot designed to mimic a traditional " Receiver   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 Operator Characteristic" curve, but displayed using absolute measures to better accommodate the unknown positions of true negatives in CNV calling. Instead, the ROC-style plots show how the detection of simulated true positives (Y-axis) changes with the number of detections not part of the simulation (false positives, X-axis) as results are progressively filtered to lower significance levels. It should be noted that false positives are defined as regions that are not simulated to be deletions but they could actually be true positives from the sample itself. Unlike comparisons of absolute sensitivity and precision, this method primarily compares the ranking of true and false positives, and thus takes into account the utility of confidence measures output by tools for filtering the results. For the CNV calling tools included in Ximmer, the confidence measure used for ranking results was chosen in each case by consulting the documentation or by discussion with the tool author (Supplementary Methods, Table S1).
The initial display of the ROC-style curve shows the accuracy for the whole set of simulated deletions. As a first step this can suggest an appropriate level at which to filter results so that the optimal level of sensitivity and specificity is achieved. However, the plot can be interactively adjusted, to show performance of a subset of CNVs within specific size ranges.

CNV Discovery
Once the performance of the CNV callers is well understood, the final step in the Ximmer process is to filter the CNV calls according to the decided quality filtering thresholds. The remaining CNVs may then be manually reviewed in Ximmer's CNV curation interface. The interface includes a range of annotations to support interpretation of the likelihood that a CNV is real, and whether the CNV is potentially of functional interest. The annotations include overlapping genes, population frequency of relevant CNVs from the Database of Genomic Variants (DGV (Zhang et al., 2006)), overlapping single nucleotide variants (SNVs) or indels, and a pictorial diagram of the read depth deviation over the CNV region.

Data Sets
To demonstrate the application of Ximmer, we applied it to four data sets representing different Illumina sequencing platforms, exome captures, read configurations and sequencing depths (Table 1). The SureSelect data set was produced as part of an unrelated research program, the Nextera data was created as part of the Melbourne Genomics Health Alliance demonstration project (https://www.melbournegenomics.org.au/) and the TruSeq data was created by the Broad Institute, Center for Mendelian Genomics. The NimbleGen data set was downloaded from the Sequence Read Archive (SRA) from a previous study as part of the Simons Foundation Research Autism Initiative (Sanders et al., 2012).
The SureSelect, Nextera and Nimblegen data sets were analysed in house to produce alignment files in BAM format using Cpipe (Sadedin et al. 2015). The TruSeq data set was produced and analysed at the Broad Institute using the institute's standard analysis pipeline, also based on GATK.

Ximmer Simulations
In order to demonstrate Ximmer we applied it to four different exome datasets with a variety of different properties (Table 2). We configured Ximmer to simulate between 2 and 10 deletions per sample using the X-replacement method in each of the four datasets. As the X-replacement method was employed, deletions were simulated only in the X chromosome of female samples from each respective data set. The number of simulated CNVs ranged from 72 -144 for each dataset (see supplementary material). The simulated deletions spanned between 100bp and 6.9kbp of targeted bases, equating to genomic spans of between 471bp and 4.3Mbp.

Comparison of CNV Detection Methods with Default Settings
First we used Ximmer to compare the accuracy of the four different CNV detection methods.
Parameters for each tool were set to their defaults, except for cases where the tool setting was clearly misaligned to the simulated data. Specifically, the cn.MOPs minimum CNV width was lowered to 1, and XHMM mean number of targets were lowered to 3 to better match the generally smaller size of deletions included in the simulation.
In the Nimblegen data set, we observe that there were significant differences between the performance of the different CNV callers (Figure 3). ExomeDepth achieved substantially better absolute sensitivity than any other tool, finding 88% of all the simulated deletions compared to XHMM finding 57%. However, the precision of ExomeDepth was poor (54%) compared to XHMM (93%). A substantial difference in precision persisted even when ExomeDepth results were filtered to equivalent sensitivity as XHMM. Therefore in this case the optimal caller is a trade off between sensitivity and specificity. Both cn.MOPs and Conifer performed poorly in terms of sensitivity, each finding less than 30% of simulated deletions. cn.MOPs has very poor precision in this data set (0.5%), and appears to output many very high confidence calls that are ranked higher than the true positives it detects.

Comparison between Data Sets
We next compared Ximmer results using the four CNV callers with default settings on all four data sets. Our results (Figure 4) show that individual methods have marked differences in performance on different data sets. For example, all callers exhibited a low false positive rate when applied to the SureSelect data (fewer than 10 false positive calls for any caller), but showed much higher false positive rates on Nextera data (ExomeDepth and cn.MOPs both having more than 200 false positive calls). cn.MOPs performed poorly on the SureSelect, Nimblegen and Nextera data, detecting very few true and many false CNVs. However cn.MOPs is arguably the best caller for the TruSeq data, having a lower false positive rate than ExomeDepth and Conifer, but higher sensitivity than XHMM. These differences suggest that some data sets are better suited to the algorithms or default settings of particular calling methods.
Despite the differences, some aspects of individual caller performance were consistent across all datasets: XHMM consistently achieved the lowest false positive rate for a given sensitivity, and conversely ExomeDepth consistently achieved higher total sensitivity than any other caller. In some respects differences between datasets are consistent between callers. With SureSelect data (30x mean coverage), no caller could achieve more than 60% sensitivity. However with TruSeq data (90x mean coverage), all callers found more than 60% of the simulated deletions, and ExomeDepth found nearly all deletions (96%).

Figure 4 ROC-style Curve with Default Parameters -Count of true positives vs false positives as ranked results are filtered by varying quality score threshold, when the four CNV calling methods are executed on four different data sets with their default parameters. Performance is highly variable both between different methods on the same data set, and between the same method on different data sets
It is likely that homogeneity of the data is an important factor in determining these characteristics: data sets having very low inter-sample variation with few significant batch effects may work well with callers that apply relatively little normalisation or are flexible in their normalisation approach.

CNV Calling performance can be improved with parameter optimisation
We next configured Ximmer to re-analyse the Nimblegen data while varying several configurable parameters of each CNV caller. The parameters varied were chosen by reviewing the documentation and experimenting to find those having the largest direct effect on sensitivity (Table S2).
We found that adjusting two parameters (the exome-wide CNV rate to 10 -4 and the normalisation factor to 0.2), increased XHMM sensitivity ( Figure 5A) by 21% (67% to 88%) with an acceptable loss of precision (81% to 55%). Similarly, we evaluated alternative values for the SVD number and calling threshold for Conifer ( Figure 5C), and found that, by adjusting the calling threshold parameter from 1.5 down to 1.25, sensitivity could be improved from 25% to 40% with only a small sacrifice in precision. cn.MOPs adjustments were able to improve sensitivity from 19% to 36% by adjusting the prior impact parameter from 5 to 2, and the calling threshold upwards from -0.8 to -0.4. Although we tried varying two parameters (transition probability and expected CNV length), ExomeDepth appeared to have nearly optimal parameters as its defaults for this data set.
This analysis demonstrates that tuning parameter settings should be considered an important element of usage of CNV detection tools, and can lead to significantly improved accuracy. Many previous comparison studies (Hong et al., 2016;Tan et al., 2014) have been evaluated without rigorous optimisation of parameters. Our results suggest that the discrepancies in the results from these studies may have been reduced if calling parameters were optimised.

Figure 5
Results of adjusting CNV calling parameters on ROC-style curves. XHMM, Conifer and cn.MOPs all have configurations where sensitivity or precision can be substantially improved: reducing Conifer calling threshold to 1.25 increases sensitivity from 25% to 40%; increasing the exome-wide CNV rate to 10 -4 and reducing the normalisation factor from 0.7 to 0.2 increases XHMM sensitivity from 67% to 88%; Reducing the cn.MOPs prior impact factor to 2 and raising the calling threshold to -0.4 allowed sensitivity to nearly double (from 24% to 36%), however these settings caused a substantial reduction in precision.

Optimisation of Parameters across Data Sets
We applied the optimised settings derived from simulation performance on Nimblegen data to the analysis of the other three data sets. However, we observed that these settings are not optimal for every other data set. For example, on the SureSelect and TruSeq data ( Figure 6A, 6B), XHMM achieves both high sensitivity (85%) and precision (80%) with the default settings, but produces no calls at all with the optimised settings. The optimisations increase sensitivity in cn.MOPs, however, the marginal increase (69% to 77%) is much less significant than for Nimblegen data, and causes a substantially higher number of false positive calls. Similarly Conifer also shows a much smaller proportional increase in sensitivity (64% to 73%) and experiences a significant fall in precision (75% to 59%).
We conclude that optimisation needs to be performed on each data set or data type separately.
Ximmer supports this process efficiently and easily through modifying simple configuration settings.

Conclusion
While there is great utility in detecting CNVs from WES data, adoption of CNV detection methods in practice has met with significant challenges. These are primarily centred around highly unpredictable performance and lack of reproducibility between data sets. We have addressed these challenges by creating Ximmer, a tool that facilitates efficiently assessing and improving the accuracy of WES-based CNV detection methods. Our comparison of four different data sets analysed by four different CNV calling methods represents one of the most comprehensive evaluations to date. Our results show, consistent with previous studies, that there is significant variability in performance of CNV detection between tools and between data sets. We conclude that to effectively use these methods, attention must be applied to understand and optimise their behavior on each individual data set. Ximmer can be used to automate these procedures, avoiding a significant burden. In addition, we have demonstrated that Ximmer can produce valuable insights into the quality of data sets for CNV calling and the behavior of CNV detection tools. As the first tool offering combined simulation, evaluation, tuning and interpretation of results from CNV detection methods, we believe Ximmer will assist increasing practical adoption of CNV detection methods for exome data. Ximmer is open source and available at http://ximmer.org. An example Ximmer report can be viewed online at http://example.ximmer.org.

Competing Interests
The authors declare no competing interests.

Ethics, consent and permissions
Not applicable.

Availability of Data and Material
The SureSelect, Nimblegen and Nextera datasets are available from SRA under accessions SRP132744, SRP010920 and SRP148622 respectively. The NimbleGen dataset is available from dbGaP under accession phs001272.

Authors' Contributions
SS conceived of and implemented the method, performed simulation work and drafted the manuscript. AO advised on design, implementation and interpretation of results, and edited the manuscript. SM and JE collected samples, extracted DNA, arranged sequencing and provided feedback on the manuscript. All authors approved the final manuscript.