A benchmark of structural variation detection by long reads through a realistic simulated model

Accurate simulations of structural variation distributions and sequencing data are crucial for the development and benchmarking of new tools. We develop Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it reveal the strengths and weaknesses for current available structural variation callers and long-read sequencing platforms. With these findings, we develop a new method (combiSV) that can combine the results from structural variation callers into a superior call set with increased recall and precision, which is also observed for the latest structural variation benchmark set developed by the GIAB Consortium. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02551-4).


Background
In order to decipher the genetic basis of human disease, a comprehensive knowledge of all genetic variation between human genomes is needed. Until recently, the emphasis has been on single-nucleotide polymorphisms, as these variants are easier to trace with current sequencing technologies and algorithms [1,2]. Over the past 20 years, we gained a better view on the prevalence of structural variation (SV), which changed our perspective on the impact it has on genomic disorders. We now know that structural variation contributes more to inter-individual genetic variation at the nucleotide level than single nucleotide polymorphisms (SNPs) and short indels together [3,4]. Structural variation covers insertions, deletions, inversions, duplications and translocations that are at least 50 bp in size. The limited length of Next-Generation Sequencing (NGS) reads (≤ 300 bp) hampers the detection of SVs, especially for insertions [3,5]. These technical limitations can be partially overcome by the third-generation sequencing, which is capable of producing far longer read lengths [6,7]. The race for dominance on the third-generation sequencing market has significantly reduced the costs per Mb and increased the throughput and accuracy, which makes these technologies (Pacific Biosciences (PacBio) [8] and Oxford Nanopore Technologies (ONT) [9,10]) currently the best options for structural variance detection [11].
The introduction of long sequencing reads required a revolution in new computational tools for sequencing analysis. Even though several algorithms for SV detection were developed over the past decade, there is a large discrepancy between their outputs [3]. Assessing the performance of SV detection tools is not straightforward, as there is no gold standard method to accurately identify structural variation in the human genome. To overcome this shortcoming, the Genome in a Bottle (GIAB) Consortium recently published a sequence-resolved benchmark set for identification of SVs, though it only includes deletions and insertions not located in segmental duplications [12]. For as long as there is no completely resolved benchmark available, it is crucial to simulate a human genome with a set of structural variations that resembles reality as close as possible. There are a wide range of structural variation and long sequencing reads simulators available, yet without a thorough benchmark, it is impossible to know which tools are best suited to design the model you want to simulate. Therefore we compared several structural variance and long-read simulators for their system requirements and available features. Furthermore, we introduce Sim-it, a new SV and long-read simulator that we designed for the assessment of SV detection with long-read technologies.
The most extensive structural variance detection study to date identified around 25,000 SVs for each individual by combining a wide range of sequencing platforms [3]. The large amount of sequencing data used for this study makes it too costly to reproduce it on a larger scale, but it can be used to estimate the number of SVs in a human genome. We used the results of this study to produce a realistic model for the evaluation of the available SV detection algorithms and to develop a new script that can improve SV detection by combining the results of existing tools.

Structural variation simulation benchmark
We compared the features and computational resources of five structural variation simulators, as shown in Table 1 and described in the "Methods" section "Benchmark of structural variation simulators". Although all simulators can simulate the most common types of structural variation (insertions, deletions, duplications, inversions, and translocations), more complex SV events need to be included in order to reproduce a realistic SV detection model. For Sim-it, we also included complex substitutions and inverted duplications, both common types of variation in germline and somatic genomes [5,[13][14][15]. A complex substitution is defined as a region which been deleted and replaced with another region of the genome, while an inverted duplication is a tandem duplication of an inverted segment. Additionally, it is possible to combine random generated SV events with a defined list of SVs at base pair resolution. Random generated SVs will be distributed realistically across the genome with higher prevalence around the telomeres. As output, Sim-it produces a sequence file in FASTA format and optionally long sequencing reads (PacBio or ONT). Additional files to draw gnuplot [16] figures of the length distributions from the simulated SVs are provided (Figs. S4-6). Although none of the other tools has a proprietary method to simulate long reads, Varsim can generate long reads through PBSIM or LongISLND. Currently, Sim-it does not support short read or phylogenetic clonal structure simulation. As for computational resources, Sim-it performed Table 1 Available features and system requirements of structural variation simulators *SCNVsim and SURVIVOR was excluded from the benchmark best on peak memory consumption and runtime. With 1 GB as peak memory consumption and 5 min 30 s as runtime (single core) to simulate 24,600 SV events, Sim-it can be implemented for any set of SVs on a small desktop or laptop. SVEngine and Varsim also have relatively low runtimes, though a peak memory consumption of respectively 24.3 GB and 8 GB limits it's use on machines with limited computational resources. SCNVsim and SURVIVOR were excluded as they do not accept a set list of SVs as input and have an upper limit of SVs (600 for SCNVsim, less than 24,000 for SURVIVOR) for random simulation.

Long-read simulation benchmark
We assessed the quality of the simulated long reads by comparing their error profiles to those of real PacBio and ONT sequencing reads. Additionally, we compared the features and system requirements for each tool, as described in the "Methods" section "Benchmark of the long-read simulators".
Several systems of ONT and PacBio technologies have been released in the last decade, each with different specifications for the sequencing reads. This complicates an accurate simulation as a specific error profile is needed for each released system. From the 9 tested simulators, only Sim-it, Badread, SURVIVOR and LongISLND support simulations for both ONT and PacBio. Sim-it provides error profiles for ONT, PacBio RS II, PacBio Sequel II, and Pacbio Sequel HiFi systems, while other simulators are limited to one or two error profiles. This shortcoming can be overcome by training a new model for a system, a feature supported by all simulators apart from PBSIM and SimLoRD. This is more laborious and a real dataset along with an accurate reference sequence is required to train a new model. Not all updates require a completely new error profile, therefore we provide the option to adjust the overall accuracy and read length independently from the error profile. Sequencing depths can fluctuate strongly in real datasets, Sim-it can imitate this with a sequencing depth profile file. Such a file can be created with Samtools [17] from an alignment file. As for computational resources, PBSIM performed the best with just 5 min and 0.25 GB of RAM to simulate 15x coverage for chromosome 1 of GRCh38. Besides for DeepSimulator, Badread and NanoSim, computational resources stayed within a reasonable range. Sim-it needed 35 min for chromosome 1, yet this does not represent the real speed of Sim-it. From version 1.3 on, Sim-it uses multiple threads to simulate each chromosome and haplotype in parallel. A complete overview of the features for each long read simulator can be found in Additional file 2: Table S1.
Available features and computational resources determine the suitability and userfriendliness of the simulators, but not the accuracy of the simulation. Therefore, we compared the context-specific error patterns of the simulated reads to real long sequencing datasets. Figure 1A shows the context-specific errors derived from real data from Nanopore PromethION and PacBio Sequel II sequencing reads, as well from their respective simulations by Sim-it. These context-specific error heatmaps were generated for each of the 9 simulators and can be found in Additional file 1: Figs. S1-3. NanoSim generated random errors in stead of a context-specific error pattern, while SURVIVOR, PBSIM and SimLoRD have simplified patterns. For Sim-it, the length of deletions and insertions closely match the real data ( Fig. 1C, D). LongISLND has proportionally too many single nucleotide deletions, while the asymmetry for Deep-Simulator is caused by a low absolute number of deletions, which is not adjustable. Besides the heatmaps, three more tables can be found in Additional file 1. Two tables (Additional file 1: Tables S2 and S3) with general statistics of the simulated reads and a table with the Euclidean distances for the context-specific errors and for the length distributions of deletion and insertion errors (Additional file 1: Table S3). The Euclidean values confirm the heatmaps and Fig. 1C, D, with LongISLND and Sim-it as the most accurate simulations. LongISLND has the most accurate contextspecific errors, while Sim-it has the most accurate length distributions of deletion and insertion errors.

Structural variance detection using simulated long reads
We assessed the performance of 7 long-read SV detection algorithms through a realistic model of 24,600 SV events, as described in the Methods section "SV detection on simulated reads". Additionally, we made a comparison between PacBio and ONT technology and evaluated the impact of the read length and sequencing depth. For each simulated dataset, a separate score for each type of SV and for the four essential parameters that define SVs; namely position, length, type and genotype were calculated.
We performed a complete analysis on each of the 7 SV callers for a Nanopore and a PacBio Sequel II long reads and a HiFi reads dataset with a sequencing depth of 20x (Table 2). For each dataset, Picky had more than 19,000 false positives and false negatives, with an outlier of 46,502 false positives for the PacBio HiFi dataset. We therefore excluded Picky for any further analysis or graphical output. All the statistics of Picky for all three 20x coverage datasets can be examined in Additional file 2.
For a sequencing depth of 20x, cuteSV achieved the best overall performance for the more erroneous reads of Nanopore and PacBio CLR, while SVIM performs best for PacBio HiFi. cuteSV produced the lowest number of false positives independent from sequencing platform, median read length or coverage depth (Table 2 and Fig. 2). Although pbsv generally has a lower recall, it calls the position and length of the SVs more accurately Fig. 1 Context-specific error patterns for mismatches and indels. A Context-specific error patterns for real data of Nanopore (9.4.1) and simulated data from Sim-it. B Context-specific error patterns for real data of PacBio Sequel II and simulated data from Sim-it. C Deletion lengths for real Nanopore data and the simulations of the benchmarked tools. D Insertion lengths for real Nanopore data and the simulations of the benchmarked tools than any other tool, independent from the platform or coverage depth. Subsequently, this high accurateness results in a significant higher number of perfect matches compared to other tools. Perfect matches are SVs called with the correct type, genotype, exact length and position. For PacBio CLR and PacBio HiFi reads, pbsv manages to call respectively 46% and 58.2% of the detected SVs perfectly, which is quite impressive compared to the other tools. Only SVIM achieved a similar percentage for PacBio HiFi reads (56.3%), however not for PacBio CLR reads (7.6%). The highest recall is achieved by NanoSV and to a certain extend NanoVar (only for PacBio HiFi), however this is at the expense of a disproportional number of false positives.
The 24,600 SVs can be classified by 5 different types, namely deletions, insertions, duplications, inversions and complex substitutions. We calculated the recall and precision metrics for each type of SV; Table 3 shows the results for the Nanopore 20x dataset, data metrics for the PacBio 20x and PacBio HiFi 20x datasets reveal similar patterns and can be examined in Additional file 2. NanoSV only classifies insertions, other SVs are indicated as breakend (BND). None of the SV callers classify complex substitutions in their output, which explains the missing precision values for this type. These complex substitutions seem to be the most problematic, as their recall values are very low for each of the tools. Recall and precision values of inversions are also far below the average for each of the tools. The low precision value for duplications detected by NanoVar can be explained by the fact that a significant fraction of the insertions is typed as a duplication.  Table 3 Precision and recall statistics for each type of SV from the Nanopore 20x dataset. (combiSV (6): all 6 tools combined) To investigate the influence of increased sequencing coverage, we simulated 4 different datasets with sequencing depths of 10x, 20x, 30x, and 50x for both Nanopore and PacBio HiFi (Fig. 2). The general trends for increased sequencing depth are an increased recall and decreased precision, although depending on the tool, they can be disproportional to each other. NanoVar was designed to work on low sequencing depths and therefore does not display much gain in recall, yet a significant reduction in precision. Sniffles benefits the most from additional coverage with increasing recall together with almost no loss of precision. pbsv has a stable performance across all coverages, except for Nanopore 50x, which exhibits a steep decrease of precision. The big drop in precision for NanoSV and SVIM at 20x and 50x coverage of Nanopore are caused by the additional filtering step we implemented for the "minimal read support" (3 for 10x and 20x, 5 for 30x and 50x), which is necessary to keep a good balance between recall and precision. This pattern is to some degree visible for all tools, with an exception to cuteSV, which has stable precision values across all coverages.
Besides sequencing depth, it is often believed that increasing sequencing lengths can improve assemblies and variance detection. We compared the SV detection metrics for five datasets of Nanopore with median read lengths of 15, 25, 40, 75, and 100 kbp. We observed an increase in recall with increasing read lengths for all tools except NanoVar, with the most pronounced improvement from median lengths of 15k to 25k. pbsv shows a modest rise in recall of 2% between 15k and 100k lengths, while Sniffles, cuteSV, SVIM, NanoSV and combiSV show an increase between 6 and 8%. Both pbsv and NanoVar show a significant drop in precision for read lengths of 75 and 100 kbp. As pbsv is specifically designed for shorter PacBio reads, it could be an explanation for this drop in precision. NanoVar is the only tool that has no benefit from longer read lengths, as we observed a drop in both recall and precision for median read lengths of 75 and 100 kbp (Fig. 3). All metrics of this comparison can be found in Additional file 2.

Structural variance detection using real datasets
We based our simulated datasets on a SV call set of NA19240 (nstd152), which was obtained through an elaborated SV study that combined a wide range of sequencing data [3]. To compare our simulation to the original genome, we performed the same benchmark on a public available PacBio CLR dataset of that study. Recall and precision values of the real dataset were significantly lower, with an average of respectively 65% and 50%. An even more striking difference were the recall percentages of around 60% for complex substitutions, while these values ranged between 1 and 20% for the simulated datasets, independent from sequencing platform or sequencing depth. While the overall lower recall and precision values were to be expected due to inaccuracies of the real SV call set, we found the large rise in recall for complex substitutions questionable. We therefore examined several alignments of SVs that were typed as complex substitutions. We found that most of these complex substitutions are in fact insertions or deletions, which would explain the high recall values. Most of the complex substitutions in nstd152 were determined by merging of experiments (optical mapping, sequence alignment and de novo assembly) and not associated to just one method. It is possible that conflicting findings between methods were thought to be caused by complex substitutions as they consist of both a deletion and an insertion. We added some concrete examples with screenshots of alignments and BLAST results of individual reads in Additional file 1: Figs. S7-S38 as evidence of these findings.
The large discrepancy between our simulated dataset and the real dataset are an indication that the SV call set of NA19240 (nstd152) has not been called accurately. The fact that the real NA19240 dataset has much lower precision and recall values than the real GIAB SV call set makes it unlikely that the discrepancy between the real and simulated SV call sets of NA19240 is caused by an inaccurate simulation. As additional validation of our simulation accuracy, we also simulated the GIAB SV callset and called SVs with Sniffles, SVIM and cuteSV. Both the recall and precision values of the simulated and real data were within a 2-5% range from each other.

Improved SV calling with combiSV
This benchmark revealed the strengths and weaknesses of each SV calling tool for long read sequencing. With this performance data we were able to develop a tool (combiSV) that can combine the outputs of cuteSV, pbsv, Sniffles, NanoVar, NanoSV, and SVIM into a superior SV call set, with Sniffles, pbsv or cuteSV as mandatory input (it can run without, but not recommended). The VCF outputs of each tool serve as input and the minimum count of supported reads for the variance allele has to be given. The complete wall time is under 1 minute and less than 1 GB of virtual memory is required. By combining the strengths of each of the 6 SV callers, we were able to eliminate distinct weaknesses and improve overall performance ( Table 2). The most significant improvements were the ratio of total matches versus false positives and the accurate definement of the SV parameters. The added value of combiSV can also be seen by the sequence depth analysis (Fig. 2), where combiSV has consistently the best overall performance and does not show any significant drops in recall or precision for any of the sequencing depths. The improved performance of combiSV is less pronounced by the precision and recall values of the individual SV types, which can be explained by the fact that the performance gain was mostly limited for deletions and insertions. Most importantly, combiSV also showed significant improvement for the real GIAB dataset, as it combines the highest recall and precision from all tools, together with the accuracy from pbsv. This high recall is also achieved without NanoSV, as combiSV(3) only combines pbsv, sniffles and cuteSV. The combination of all 6 callers reduced the recall and precision slightly, which is probably caused by the high number of false positives of NanoSV and NanoVar. Therefore, it is not necessary to include the output of all 6 SV callers to run combiSV, although it is advised to add two additional callers besides cuteSV, pbsv or Sniffles. Despite the fact that combiSV takes less than one minute to run, total runtime will increase because multiple SV callers are being used. To have an idea how this will affect the total runtime, we performed a system requirements benchmark for each SV caller (Additional file 1: Table S5).
Currently, there are no similar tools as combiSV available for long sequencing reads. Therefore, we limited our comparison of combiSV to SURVIVOR, a tool that combines VCF files based on overlap. When combining the output of the 6 SV callers, combiSV produced a higher F-score and Total score for each of the 4 datasets in Table 2. Combining 6 tools requires additional effort and computational time, we therefore produced an additional benchmark where we tested 9 combinations of 3 SV callers for combiSV and SURVIVOR on the simulated Nanopore (20x) and the GIAB datasets (Table 4 and Additional file 1: Table S6). For the simulated dataset, combiSV achieved a higher F-score and Total score for each combination. For the GIAB dataset, combiSV had a higher Total score for all combinations and a higher F-score for 7 out of 8 combinations. The only combination with a higher F-score for SURVIVOR was SVIM, NanoSV and NanoVar. With a recall of 85%, combiSV performs significantly less on this combination, as other combinations have recalls above 93.5%. This is because combiSV is designed to have at least cuteSV, Sniffles or pbsv in the combination, while SVIM, NanoSV, and NanoVar can be added as support.

Discussion
We developed a realistic simulated model to benchmark existing structural variation detection tools for long-read sequencing. This was accomplished with Sim-it, a newly developed tool for the simulation of structural variation and long sequencing reads. Although there are several tools available that can simulate structural variation or long sequencing reads, a benchmark study to assess the accuracy of these simulators was Table 4 Comparison between combiSV and SURVIVOR for 9 combinations of three SV callers on a simulated Nanopore dataset of 20x and the GIAB reference dataset (Nanopore). The highest scores between combiSV and SURVIVOR are indicated in gray needed. Besides Sim-it, the combination of Varsim and LongISLND (despite the aberration for the length of deletions) could also have been used for this benchmark study. We simulated in total 5 PacBio and 8 Nanopore whole genome sequencing datasets of GRCh38 with coverages ranging between 10x and 50x and lengths between 15 and 100 kbp. With these simulations, we assessed the performance of 7 SV callers and the influence of increasing sequencing depths and read lengths.
For most datasets, cuteSV, or SVIM produced the best overall performance with a good balance between recall and precision. cuteSV has the highest precision across all datasets, yet performs significantly less for PacBio HiFi datasets with a coverage below 30x. pbsv defines the SVs the most accurate across all datasets and since it is designed for PacBio, it performs the best on this type of data. NanoSV and NanoVar have high recall numbers, however at the cost of a disproportional high false positive rate (to a lesser extent for PacBio HiFi data). We found similar patterns for the high-fidelity SV call set of GIAB, although with some distinct differences. The real GIAB dataset had for each of the tools a higher recall and a lower precision compared to the simulations. The higher recall can be explained by the fact that the GIAB call set only contains insertions and deletions in non-complex regions, which are easier to call than other types of SVs or SVs in complex regions. Similar lower precision values from a simulation of the SV call set of GIAB suggests that these lower values are sample-specific and not caused by inaccurate simulations. The high precision values for the simulations of the 24,600 SVs could be misleading, as the SV callset of the Chaisson et al. study possibly contains a significant number of false positives that we simulated as true SVs.
Recall and precision values are the preferred metrics to measure SV detection accuracy. Called SVs are scored as 0 or 1, based on a reference SV set. However, there is no consensus about how accurate the position, length or SV type has to be called to be matched with the reference set. Therefore, we chose a tolerant matching algorithm and included a total score that integrates the accuracy of the call, which is not integrated in the Fscore. The downside of our total score is that some SV callers that do not call SV types or genotypes are too heavily punished. In addition, we added separate accuracy scores for position, length, type and genotype.
It is often assumed that higher sequencing depths and longer read lengths will improve assembly and variance calling outcomes. Yet in our benchmark, increasing sequencing depths does not guarantee improved structural variation calling. Although there was still a modest rise in recall numbers for sequencing depths above 30x, we did observe a disproportional rise in false positives above 30x. This rise in false positives was not observed for increasing sequencing lengths, while we observed an increase in recall for longer read lengths across all methods.
Finally, we looked at precision and recall rates for each type of SV. Each tool showed the best performance for deletions and insertions, which are the majority of SVs in a human genome. More problematic SVs are inversions and complex substitutions, wherefore recall rates are respectively between 50-68% and 5-25%. As complex substitutions are not defined by any of the tools, it seems likely that these algorithms are not designed to detect this type of SV. New SV callers or updates of existing ones could make significant improvements in this direction. Although the SV study we used as blueprint [3] detected around 3000 complex substitutions per individual, we discovered that most of these complex substitutions were insertions or deletions. The actual prevalence of this type of structural variation is therefore possibly not accurate and requires further studies in order to map the complete structural variation profile in the human genome.
This study shows that a simulated model can be beneficial to gain a better understanding in the performance of structural variation detection tools. The development of combiSV was solely based on simulated datasets, but the recall and precision values from combiSV for the GIAB dataset shows that the statistics from the simulated data is transferable to real datasets. It is crucial that the simulations are as accurate as possible. Currently, Sim-it does not simulate small indels and SNPs, although they can influence the detection of small SVs and will therefore be included in the next update.

Conclusions
This extensive benchmark unveiled the strengths and weaknesses of each SV detection algorithm and provided the blueprint for the integration of multiple algorithms in a new SV detection pipeline, namely combiSV. This Perl script can combine the VCF outputs from cuteSV, Sniffles, pbsv, NanoVar, NanoSV, and SVIM into a superior call set that has the high precision of cuteSV, the accuracy of pbsv and the high recall of SVIM. combiSV also achieves a higher recall, precision and accuracy compared to SURVIVOR, an existing algorithm to generate a consensus VCF. The added value of combiSV on simulated data was supported by the real dataset of GIAB, where the gains were even more outspoken, which demonstrates the strengths of an accurate simulated model for the development of new bioinformatic tools.

Sim-it
We developed a new structural variation and long-read sequencing simulator, called Simit. The structural variation module outputs fasta files of each haplotype, plus an additional one that combines all SVs in one sequence. A set list of SVs can be combined with additional random generated SVs as input. The long-read sequencing module outputs sequencing reads based on a given error profile and 4 metrics (coverage, median length, length range and accuracy). We provide error profiles for Nanopore, PacBio RS II, Sequel II, and Sequel HiFi reads. Additional error profiles can be generated with a custom script. Both simulation modules (SV and long reads) can be used separately or simultaneously, starting from a sequence file as input. We also provide plots with the length distributions for the simulated sequencing reads and structural variations (insertions, deletions and inversions). Sim-it was written in Perl and does not require any further dependencies. Sim-it is open source and can be downloaded at https://github.com/ndierckx/Simit, where a more complete manual can be found.

Train customized error profiles for Sim-it
Error profiles were trained by a customized script, which aligns each read individually to the assembled reference sequence with BLAST [30]. For each kmer of 3 bp, the error rates of substitutions, insertions and deletions of the middle nucleotide were determined, along with the length patterns of deletions and insertions. The E. coli K12 substrain MG1655 dataset of PacBio Sequel II and PacBio RS II was downloaded from the github website of Pacific Biosciences. Using the above two datasets, we trained the error profile of PacBio Sequel II and PacBio RS II. We also downloaded the GIAB HG002 dataset of PacBio Sequel II HiFi reads powered by CCS. To reduce the computational time, we trained the error profile of PacBio Sequel II HiFi reads based on chromosome 1 of GRCh38. The Nanopore error profile is based on sequencing reads for chromosome 1 of GRCh38 from the publicly available GIAB HG002 dataset GM24385.
Furthermore, we simulated additional Nanopore and PacBio HiFi reads for GRCh38 at sequencing depths of 10x, 30x and 50x to study the influence of increasing sequencing depths for SV calling. Each of the Nanopore simulations had a median read length of 25 kbp, we also included four additional simulations of 15 kbp, 40 kbp, 75 kbp, and 100 kbp with a sequencing depth of 20x. PacBio long reads have a median length of 25,000 bp and the PacBio HiFi reads a median length of 15,000 bp. An additional filtering step was added for each VCF output; we only retained variances that obtained a PASS for the FILTER value, that have a length of 50 bp or more and wherefore at least 3 (for sequencing depths 10x and 20x) or 5 (for sequencing depths 30x and 50x) reads support the variance. This additional filtering step significantly improved the output for each tool compared to the raw VCF output.
Benchmark metrics were calculated by comparing the VCF output of each SV caller against the simulated reference set of 24,600 SVs. For each detected SV, we looked for possible matches in the reference set within a 1600-bp range of the detected position. When the length of the SV was determined, we tolerated an error margin of 35% for SVs longer than 300 bp and no error margin for shorter SVs. If these two conditions were met, a detected SV was matched to the SV of the reference set, independent from the type or genotype that was called. Based on these paired SVs, recall, precision and the F-score (2*((precision*recall)/(precision+recall))) were calculated. As there are multiple metrics that define the performance of an SV detection algorithm, we adopted an overall score that combines each of the metrics. For each detected SV, a maximal score of 1 was possible: 0.4 for the correct position, 0.2 for the correct length, 0.2 for the correct type of SV, and 0.2 for the correct genotype. The scores for length and position proportionally decreased with difference compared to the reference set. Finally, the number of false positives were subtracted from the total score and eventually expressed as a percentage of the maximum possible score ( Table 2).

SV detection on real datasets
The Genome in a Bottle (GIAB) Consortium recently developed a high-quality SV call set for the son (HG002/NA24385) of a broadly consented and available Ashkenazi Jewish trio from the Personal Genome Project. We performed a benchmark on the latest most conserved BED file (HG002_SVs_Tier1_v0.6.2.bed) for this sample, which contains 5260 insertions and 4138 deletions. The public available ultralong Nanopore reads (GM24385) with an average sequencing depth of 45x were used for this benchmark. This GIAB dataset was also simulated to estimate the impact of simulated versus real reads on recall and precision values. Furthermore, we compared SV detection metrics of a public available PacBio dataset of NA19240 [3] with an average sequencing depth of 37x against the results of our simulated datasets.

combiSV
With the results of the SV detection benchmark, we developed a script to combine the results of cuteSV, pbsv, Sniffles, NanoVar, NanoSV and SVIM. The output VCF files of each of the 6 tools serve as input, from which the files of cuteSV or Sniffles are obligatory to run combiSV. For each SV detection tool, we examined the connections between the false positive rates and the accuracy of the stats (position, SV type and genotype) to each type of SV and genotype. When multiple callers detect the same SV, for each stat one SV caller will be prioritized based on the statistical analysis of the simulated benchmark. pbsv wil for example be prioritized for position and length stats, while cuteSV for the genotype. To further improve recall percentages, specific types of SVs that exhibited low false positive rates in the benchmark will be included in the final SV set, e.g., homozygous SVs from SVIM or heterozygous insertions and deletions from Sniffles. The number of callers needed to confirm an SV is set for 2 when 2 to 5 callers are combined and 3 when