Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing

Background: Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SV with high precision and high recall.

Results: We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potentially good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham perform better in the deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SV. The results demonstrate that both the precision and recall of overlapping calls vary depending on the combinations of specific algorithms rather than on the combinations of the methods used in the algorithms.

Conclusion: These results suggest that careful selection of algorithms for each type and size range of SV is required for accurate SV calling. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve SV detection accuracy.

Electronic supplementary material: The online version of this article (10.1186/s13059-019-1720-5) contains supplementary material, which is available to authorized users.


Table of contents in Additional file 1
Figure S1 - Outline of the evaluation process for the performances of SV detection algorithms.
Table S1 - SV detection algorithms used in this study.
Table S2 - List of algorithms that did not work in our computational environment.
Table S12 - Effect of the length, insert size, and coverage of reads on the precision and recall of SV detection algorithms (summarized data of Additional file 2: Figure S13).
Table S13 - Mean coefficient of variation of SV detection accuracy for each category of SV type and read property.
Table S15 - Genotyping precision of SV detection algorithms.
Table S19 - Extended list of algorithms exhibiting good SV calling results in both the simulated and the NA12878 real datasets.

Figure S1. Outline of the evaluation process for the performances of SV detection algorithms. Simulated data (Sim-A) and real datasets consisting of the NA12878 trio, HG00514, and HG002 datasets were used to evaluate the performance of single algorithms (a) and of overlapping calls from pairs of algorithms (b). For the real data, Mendelian inheritance errors, SVs called only in the child NA12878 but in neither parent (NA12891 or NA12892), were counted for each SV type for each algorithm. The mean precision, recall, and Mendelian error rates from four distinct NA12878 child datasets derived from different sequencing libraries were determined. For long read-based algorithms, the mean precision and recall values from three distinct NA12878 long read datasets were determined. In addition, each SV detection algorithm was also evaluated with HG00514 real datasets of short and long reads. The number of algorithms used in each analysis is indicated in blue. For the evaluation of single algorithms, 69 algorithms, including 2 genotype calling algorithms, were selected from the 79 available algorithms. For the evaluation of overlapping calls, 51 of the 69 algorithms were selected.

…with the scales at the left and right sides of the y-axis, respectively, plotted against the minimum number of reads supporting the called SVs, indicated on the x-axis.
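The Mendelian inheritance error count described in the Figure S1 legend can be sketched as follows. This is a minimal illustration only: the function names, the tuple representation of calls, and the 50% reciprocal-overlap matching criterion are assumptions for the sketch, not the paper's actual pipeline or thresholds.

```python
def overlaps(a, b, min_frac=0.5):
    """True if two (chrom, start, end) intervals reciprocally overlap by >= min_frac."""
    if a[0] != b[0]:
        return False
    ov = min(a[2], b[2]) - max(a[1], b[1])
    if ov <= 0:
        return False
    return ov / (a[2] - a[1]) >= min_frac and ov / (b[2] - b[1]) >= min_frac

def mendelian_error_rate(child_calls, father_calls, mother_calls):
    """Fraction of child SV calls matched by neither parent's call set."""
    parent_calls = father_calls + mother_calls
    errors = sum(1 for c in child_calls
                 if not any(overlaps(c, p) for p in parent_calls))
    return errors / len(child_calls) if child_calls else 0.0
```

For example, a child call matched in the father is inherited, while a child call absent from both parents counts toward the error rate.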
Because SVs supported by fewer reads than the minimum number are removed, the minimum number of supporting reads functions as a filter that increases precision at the cost of recall.
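The filtering step described above can be sketched as follows. The `precision_recall` helper, the exact-identifier matching of calls to truth, and the example numbers are illustrative assumptions rather than the paper's evaluation code.

```python
def precision_recall(calls, truth, min_support):
    """calls: list of (sv_id, n_supporting_reads); truth: set of true sv_ids.
    Keeps only calls with at least min_support supporting reads, then scores them."""
    kept = [sv for sv, n in calls if n >= min_support]
    tp = sum(1 for sv in kept if sv in truth)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Raising the threshold drops the low-evidence false positive "d",
# but also the true call "a" with only 2 supporting reads:
calls = [("a", 2), ("b", 5), ("c", 9), ("d", 3)]
truth = {"a", "b", "c"}
# min_support=1 -> precision 0.75, recall 1.0
# min_support=4 -> precision 1.0,  recall 2/3
```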

(d) INV [Real data]
[Plot symbol residue omitted: per-method tool counts, e.g., RP (7), RD (8), SR (4), AS (2), LR (3), RP-RD (4), RP-SR (9), RP-AS (1), RP-RD-SR (6), RP-SR-AS (3).]
…Table S1 in Additional file 1. For the real data, the mean values of the results obtained with the four (or three) NA12878 real datasets (data1 to data4) are indicated. The algorithms were categorized by the 10 methods: RP (read pair), RD (read depth), SR (split read), AS (assembly), LR (long read), RP-RD, RP-SR, RP-AS, RP-RD-SR, and RP-SR-AS. The hyphenated methods are combinations of RP, RD, SR, and AS. MetaSV was categorized into the RP-RD-SR method, and the data for BreakSeq2, which did not meet the criteria for these methods, were not used. The mean precision and recall percentages of the algorithms categorized for each method are indicated with the scales on the x-axis and y-axis, respectively. The number of tools used for taking the means is indicated in parentheses for each symbol. The standard errors of precision and recall for each method are indicated with bars in light gray. Methods exhibiting both a higher precision and a higher recall are placed toward the upper right corner.
…and DUPs (c, d) were determined with the HG00514 real data. Modified F-measures (combined statistics for precision and recall) are shown for the algorithms, indicated with orange (S: 100 bp-1 Kb), blue (M: 1 Kb-100 Kb), and red (L: 100 Kb-1 Mb) bars. The algorithms were categorized according to the methods used to detect SV signals, as in Figure S8.

Figure S16. Recall and precision of SVs commonly called between a pair of SV detection algorithms for the DEL-S category.
DELs in the short size-range (100 bp-1 Kb), called using the indicated algorithms, were filtered with the minimum number of reads supporting the called SVs, indicated with the suffix number of the algorithm name. The DELs overlapping between the filtered SV sets from a pair of the indicated algorithms were selected, and the recall and precision of the selected DELs were determined. Recall and precision percentages are presented as in Figure S15. The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.
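The legends above report a "modified F-measure" as a combined statistic for precision and recall. The exact modification is not given in this excerpt, so the sketch below shows the standard F-beta form only as an illustration of how such a combined statistic is computed; `f_measure` is a hypothetical helper, not a function from the paper's pipeline.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score: the harmonic-mean combination of precision and recall.
    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this is the familiar F1 score, e.g., f_measure(1.0, 0.5) gives 2/3.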

Figure S17. Recall and precision of SVs commonly called between a pair of SV detection algorithms for the DEL-M category.
DELs in the middle size-range (1 Kb-100 Kb), called using the indicated algorithms, were filtered with the minimum number of reads supporting the called SVs, indicated with the suffix number of the algorithm name. The DELs overlapping between the filtered SV sets from a pair of the indicated algorithms were selected, and the recall and precision of the selected DELs were determined. Recall and precision percentages are presented as in Figure S15. The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.

Figure S18. Recall and precision of SVs commonly called between a pair of SV detection algorithms for the DEL-L category.
DELs in the large size-range (100 Kb-1 Mb), called using the indicated algorithms, were filtered with the minimum number of reads supporting the called SVs, indicated with the suffix number of the algorithm name. The DELs overlapping between the filtered SV sets from a pair of the indicated algorithms were selected, and the recall and precision of the selected DELs were determined. Recall and precision percentages are presented as in Figure S15. The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.

Figure S19. Recall and precision of SVs commonly called between a pair of SV detection algorithms for the DUP-S category.
DUPs in the short size-range (100 bp-1 Kb), called using the indicated algorithms, were filtered with the minimum number of reads supporting the called SVs, indicated with the suffix number of the algorithm name. The DUPs overlapping between the filtered SV sets from a pair of the indicated algorithms were selected, and the recall and precision of the selected DUPs were determined. Recall and precision percentages are presented as in Figure S15. The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.

Figure S20. Recall and precision of SVs commonly called between a pair of SV detection algorithms for the DUP-M category.
DUPs in the middle size-range (1 Kb-100 Kb), called using the indicated algorithms, were filtered with the minimum number of reads supporting the called SVs, indicated with the suffix number of the algorithm name. The DUPs overlapping between the filtered SV sets from a pair of the indicated algorithms were selected, and the recall and precision of the selected DUPs were determined. Recall and precision percentages are presented as in Figure S15. The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.

Figure S21. Recall and precision of SVs commonly called between a pair of SV detection algorithms for the DUP-L category.
DUPs in the large size-range (100 Kb-2 Mb), called using the indicated algorithms, were filtered with the minimum number of reads supporting the called SVs, indicated with the suffix number of the algorithm name.
The DUPs overlapping between the filtered SV sets from a pair of the indicated algorithms were selected, and the recall and precision of the selected DUPs were determined. Recall and precision percentages are presented as in Figure S15. The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.

Figure S15 (legend fragment). The data contained in the top 20th percentile of the combined precision scores for the simulated and real data are highlighted with a red background, and the next data contained in the top 21st-50th percentile of the combined precision scores are shown with a pale red background.

Table S17 in Additional file 3.

These simulation data were generated only for chromosome 17.
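The "overlapping calls" selected in Figures S16-S21 can be illustrated with a reciprocal-overlap check between two algorithms' filtered call sets. The 50% threshold and all names below are assumptions for this sketch, not the paper's exact matching criteria.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """True if two (chrom, start, end) intervals overlap by >= min_frac of each."""
    if a[0] != b[0]:
        return False
    ov = min(a[2], b[2]) - max(a[1], b[1])
    return ov > 0 and ov / (a[2] - a[1]) >= min_frac and ov / (b[2] - b[1]) >= min_frac

def overlap_calls(calls_a, calls_b, min_frac=0.5):
    """Keep the calls from algorithm A that some call from algorithm B supports."""
    return [c for c in calls_a
            if any(reciprocal_overlap(c, d, min_frac) for d in calls_b)]
```

Requiring support from a second algorithm typically raises precision (spurious calls unique to one tool are dropped) while lowering recall, which is the trade-off these figures quantify per algorithm pair.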
The number of heterozygous alleles is indicated within brackets.
(C) Component of SVs identified in the 1000 Genomes Project [77].
Others: the highest precision or recall in the category of the indicated values.
The classification of the precision (recall) tendency into I, D, and C is based on >5-point differences between the categories (e.g., R-100, R-125, and R-150) at multiple points of the minimum number of supporting reads.
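As an illustration only, and not a reproduction of the paper's exact rule, the >5-point classification could be sketched as below, assuming I, D, and C denote increasing, decreasing, and constant tendencies across the read categories.

```python
def classify_tendency(values, threshold=5.0):
    """values: precision (or recall) percentages ordered by category,
    e.g., for R-100, R-125, R-150. Returns 'I', 'D', or 'C' depending on
    whether the change across categories exceeds the threshold in points."""
    if values[-1] - values[0] > threshold:
        return "I"  # increases with the category value
    if values[0] - values[-1] > threshold:
        return "D"  # decreases with the category value
    return "C"      # roughly constant
```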

MindTheGap
Precision and recall are low despite its long runtime and high memory consumption.

VirusSeq
Precision and recall are low despite its long runtime and high memory consumption.

AS-GENSENG, OncoSNP-Seq
Their precision and recall are low. However, these algorithms may more effectively and accurately detect somatic CNVs or germline CNVs from whole exome sequencing data. The use of CNVnator, PennCNV-Seq, or readDepth is recommended if read depth-based algorithms are selected.