Evaluation of Single-Molecule Sequencing Technologies for Structural Variant Detection in Two Swedish Human Genomes

Long-read single molecule sequencing is increasingly used in human genomics research, as it allows to accurately detect large-scale DNA rearrangements such as structural variations (SVs) at high resolution. However, few studies have evaluated the performance of different single molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find a higher concordance between ONT and PacBio SVs detected in the same individual as compared to SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved overlap with the ONT SVs. Our results suggest that ONT and PacBio have a similar performance for SV detection in human whole genome sequencing data, and that both technologies are feasible for population-scale studies.


Introduction
All human genomes are different, and to a large extent the genetic variation between individuals can be explained as insertions, deletions, duplications, or translocations [1,2]. Taken together, such structural variation (SV) events affect a large portion of the genomic sequence when comparing any given individual to the existing reference genome, and although our knowledge about SVs is still incomplete, they have been shown to contribute to a number of human diseases and conditions [3][4][5][6][7]. In recent years, SV analysis has become a key endeavour for genomics research [8][9][10][11][12] and these efforts will likely lead to an increased understanding of disease etiology and progression and possibly also improved clinical therapy. Although several different methods exist for SV detection in a human genome, whole-genome sequencing (WGS) is at present the most efficient tool to find SVs at base-pair resolution. Large-scale sequencing

Sample Collection
The procedure for genomic DNA extraction from whole-blood samples is described in the study by Ameur et al. [29]. The blood samples were collected over a decade ago and the two individuals (Swe1 and Swe2) were selected from a group of participants involved in the SweGen project [35]. Sample collection was performed as part of the Northern Sweden Population Health Study (NSPHS), which was approved by the local ethics committee at the University of Uppsala (Regionala Etikprövningsnämnden, Uppsala, 2005:325 and 2016-03-09). In compliance with the Declaration of Helsinki of 1975, study participants provided written informed consent for inclusion to the study including the examination of environmental and genetic causes of diseases.

Whole-Genome Sequencing of Two Swedish Genomes
The DNA libraries were prepared using ONT SQK-LSK109 ligation kit, and sequenced using PromethION flow cell R.9.4.1. Lambda-phage (Accession J02459.1) was used as a control DNA. The sequencing for both individuals was conducted on beta release of PromethION device, and real-time basecalling was performed with Guppy version 1.4.0 (ONT). For each individual, libraries were prepared using both native DNA and DNA sheared to 20 kilobases (kb) using the MegaRuptor system. The native and sheared DNA libraries were run on separate PromethION flowcells. In total, 28 million reads were generated with a yield of 209 Gb. Experiment metrics can be found in Tables S1 and S2 and Figure  S1; visualisation for quality control of ONT data was performed with NanoComp and NanoPlot [36]. PacBio sequencing data was obtained from the 2018 study by Ameur et al. [29].

Alignment and Reference Genome
Genome indexing and alignments were performed with Minimap2 version 2.14-r883 [37]. The human reference genome assembly used for this study is the GRCh38 [38] release (GCA_000001405.15) that does not contain alternative contigs. It represents a non-redundant haploid genome containing a total of 195 sequences; primary sequences of assembled (both autosomal and sex) chromosomes, mitochondrial genome (chrM), (unlocalized) scaffolds with unidentified location in a chromosome, unplaced scaffolds, i.e., sequences with unknown chromosome assignment, and a decoy chromosome of 1718 basepairs (bp) for the Epstein-Barr virus (AJ507799.2).

Alignment of ONT and PacBio Reads
The ONT reads were aligned to the human reference GRCh38 using the -map-ont preset that sets a mapping mode suitable for ONT data with k-mer value set to 15. PacBio reads were aligned to the human reference GRCh38 using the -map-pb as preset, that sets a k-mer value of 19 and allows indexing of homopolymer compressed minimsers (k-mers) that essentially means compression of homopolymers to a single base. The alignment statistics were gathered using SAMtools version 1.9 [39] and Qualimap version 2.2.1 [40].

Structural Variant Calling and Analysis
Alignments were analyzed for SVs using Sniffles version 1.0.10 [17]. Sniffles has been tested to perform at a high precision and recall after Minimap2 alignment [25,41]. The minimum SV length was set to 50 bp. Callsets were investigated for overlaps between the SVs detected in ONT and PacBio alignments, and between the SVs detected in two Swedish genomes and high confidence set of SV calls. The high confidence set calls were based on multi-platform high-resolution data generated by Chaisson et al. [33]. For identification of overlaps, SURVIVOR version 1.0.7 [42] was used to merge the SV callsets; maximum distance was set to 500 bp and SV type was enabled. SURVIVOR was also used to collect SV statistics. Data visualisation was performed in R with ggplot2 [43] and VennDiagram [44].

Downsampling Analysis
Downsampling of aligned PacBio reads was performed with Sambamba version 0.7.1 ([45] using the view option; format was set to BAM, fraction of reads was set to 0.5, number of threads was set to 10, and subsampling seed was set to 16. The output files, in BAM format, were used for SV calling that was performed as described in Section 2.4.

Overview of the Study
The objective of this study was to evaluate the utility of two long-read technologies for detection of germline structural variation in human whole genome sequencing data. For this purpose, we made use of previous PacBio RS II data from one male (Swe1) and one female (Swe2) Swedish individual that we previously sequenced to high coverage [29]. Additionally, we generated ONT data for the Swe1 and Swe2 individuals on the PromethION system. For each individual, one sheared and one unsheared DNA library was constructed and runs were performed on two separate PromethION flow cells. The sheared libraries produced an average 67.54 Gb per flow cell, and an average N50 of 13.1 kb. The unsheared libraries generated less data (36.9 Gb), and read lengths at an average of 23.75 kb. We further merged the data for the sheared and unsheared libraries, generating one ONT dataset for each individual for all further analysis. The overall statistics for the ONT libraries are presented in Tables S1 and S2. The read length distributions for the ONT and PacBio datasets are shown in Figure S2.

Mapping of Long-Read Data for two Swedish Individuals
To ensure that the same types of SV events are detected in the ONT and PacBio data, and to reduce potential bias between the datasets, a common analysis pipeline was constructed and applied to all samples ( Figure 1). In the first step of this pipeline, Minimap2 [37] was used to align the long reads to the human reference genome GRCh38. The mapping results are shown in Table 1 (detailed alignment statistics can be found in Tables S3 and S4). For the PacBio data, we obtained 66.54× and 65.56× average coverage for Swe1 and Swe2, respectively. The ONT data generated lower average coverage; 31.69× for Swe1 and 31.49× for Swe2. Although there are differences in coverage between the two technologies, the coverage reached in the ONT data is above the traditional 30× limit for detection of germline genetic variation from high-throughput sequencing data [46]. Thus, we believe we should be able to identify a large fraction of germline homozygous as well as heterozygous SVs in all of our long-read datasets.

Characterisation of Structural Variation in ONT and PacBio Data
The software Sniffles [17] was used for SV calling from the aligned reads (full SV calling results including parameters are specified in Table S5). This resulted in a detection of 17,157 (Swe1) and 16,715 (Swe2) SVs of length greater than or equal to 50bp in the ONT data, and 22,829 (Swe1) and 23,284 (Swe2) ≥ 50 bp SVs in the PacBio data (see Table 2) across autosomal and sex chromosomes. This includes a count for translocations and nested events such as inversions flanked by deletions, and inverted duplications that are combinations of main SVs. The length distributions of the SVs are displayed in Figure 2. A peak of insertion and deletion elements was detected around 300 bp, representing Alu repeat elements. Moreover, the majority of SV elements were below 1 kb in length. Both of these observations are consistent with results from previous studies [25,29]. In comparison, the study on the Icelandic population by Beyter et al. [26] detected more SVs per individual in ONT data than what we found in the Swe1 and Swe2 ONT data, a median of 23,111 SVs per individual. This discrepancy can likely be explained by differences in sequencing coverage and analysis parameters between the two studies.

Comparing the Swedish SVs to a High Confidence Set of SVs
To assess the quality of our SVs, we compared our results to the high confidence set of structural variants detected from the multi-platform study by Chaisson et al. [33]. In this comparison, the maximum distance between SV breakpoints was set to 500 bp. We found that 15,467 of merged 22,116 ONT SVs (69.9%) overlap with the high confidence set, as compared to 17,418 of 29,800 merged PacBio SVs (58.4%) (Figure 3). The fact that a majority of the structural variations detected in our Swedish genomes overlap with events found in three family (parent-child) trios, which originate from other geographic regions and have been sequenced and analyzed using different methods than our PacBio/ONT data, makes us confident that our analysis pipeline is efficient in detecting true SV events. Moreover, the overlap of SVs between the Swedish genomes and the high confidence set are in agreement with a previous study by Audano et al. [47], where 15,291 SVs events were observed in the majority of the human genomes. It is likely that, as stated in the study, the current human reference genome carries a minor allele or an error at these positions [47].

Comparison of SVs between Sequencing Technologies and Individuals
We further performed a comparison of all the SVs detected in Swe1 (male) and Swe2 (female) to determine which ones are overlapping between individuals and which ones are overlapping between sequencing technologies (Figure 4). When considering insertions, 8814 events were found to be overlapping for Swe1 in both the ONT and PacBio dataset for that individual. The corresponding number for Swe2 was 7785. These numbers can be compared with the insertions that were detected from the same technology but in different individuals: 5388 insertions overlapping between Swe1 and Swe2 for ONT sequencing, and 7963 insertions overlapping for PacBio. Thus, there is a slightly higher overlap between insertions detected in the same individual but with different technologies, as compared to insertions detected with the same technology but in different individuals. When considering deletions, 7162 events were detected for Swe1 both in the ONT and PacBio dataset, and 7165 deletions were detected in Swe2 by both technologies. These numbers can be compared with the deletions that were detected from the same technology but in different individuals: 4699 deletions overlapping between Swe1 and Swe2 when using ONT, and 5617 deletions overlapping with PacBio. This shows that the pattern with a higher overlap for SVs detected in the same individual, as compared to SVs detected in different individuals but using the same technology, is even more pronounced for deletions than for insertions. One possible explanation for this observation is that deletions are easier to correctly call in long-read sequencing data [48]. Overall, the comparisons show that both long-read technologies are good at detecting the same SVs.

Comparison of SVs in ONT and Subsampled PacBio Data
In our study, the PacBio data has about two times higher coverage than what was obtained for ONT. To compare the performance of the two technologies at similar coverage, we performed a random subsampling of the PacBio data to obtain~33× coverage for each of Swe1 and Swe2, which is similar to the coverage in the ONT datasets. Unsurprisingly, fewer SVs were called from the downsampled PacBio data (17,289 for Swe1 and 16,848 for Swe2) as compared to from the complete PacBio data for Swe1 and Swe2. The SV events from the downsampled PacBio data displayed a high overlap with the ONT SVs, both for insertions (77-78%) and for deletions (85-86%) ( Figure 4B). These numbers can be compared to the corresponding SV overlaps when using all the PacBio reads, which are in a range of 67-69% for insertions and 81-82% for deletions. Our conclusions from this analysis are that ONT and PacBio detects a similar amount of SVs when sequencing is performed at around 30× coverage, and that SVs found in the downsampled PacBio data are in good agreement with the ONT SV calls.

Discussion
This study reports findings from the first PromethION sequencing runs of Swedish human genomes, as well as the analysis of SVs in the same human individuals sequenced on two different long-read platforms. One limitation of this study is that the PacBio sequencing was performed on an older generation instrument (RSII). Possibly, the PacBio results would be somewhat different on the Sequel I or II instruments, and in particular if the recent protocol for highly-accurate long-read HiFi sequencing was used [49]. However, it should be noted that we obtained a high coverage in the RSII PacBio data (>65×), sufficient even for creating high quality de novo genome assemblies [29]. For this reason, and since the underlying technology is the same in the RSII and Sequel systems, we believe that the RSII dataset can be used to represent PacBio's CLRs. Regarding the ONT data, it was generated on the state-of-the art PromethION instrument but at a coverage lower than that for PacBio and this could possibly explain why fewer SVs were detected in the ONT data. Recent updates of the ONT basecalling software could have increased the quality of the reads and resulted in improved SV results. Despite these limitations, which highlight the difficulty to perform cross-platform comparisons in this rapidly developing field, we believe our PacBio and ONT datasets are of a sufficient quality and coverage depth to jointly analyse and compare SVs detected from the two technologies.
Another challenging aspect of this study was to compare SVs between samples and determine whether two partially overlapping SVs should be considered as the same event. We relied on the software SURVIVOR [42] for the SV comparison, and allowed the start and end coordinates to differ with at most 500 bp. This is an important parameter to determine breakpoints and detect overlaps between SVs. We performed testing for maximum allowed distance with multiple values and noticed a trade-off between the number of merged and overlapping SV events across samples Table S6). Since the aim of our study was to find broad trends between different technologies, rather than to examine individual SV events, the uncertainty about which SVs are overlapping does not have a major impact on our results and conclusions. For this same reason, we chose to only use the SV caller Sniffles [17] which has been shown to perform well on long-read data [25], instead of also using other SV calling methods [30,34,41,50] that would likely give slightly different results. However, for projects aiming to determine all SVs occurring in a population, or to understand the functional consequences of SVs and their relation to medical conditions or diseases, it will be crucial with a robust and standardised method for identification and comparison of SVs between different individuals and samples.
As long-read sequencing technologies continue to improve and become more accessible, we envisage that large human structural variation databases will be constructed for various populations and cohorts, similar to what is currently available for short-read sequencing [9,13]. Already now, a few population-scale initiatives have been initiated but the number of such projects is likely to grow in the coming decade. In a future where SV calling from long-read data becomes a standard approach to examine human genetic variation, it will be important to evaluate the available technologies and determine which strategy is currently the best option for detection of different types of SVs. Our results suggest that both PacBio and ONT are viable options for SV calling in human germline samples.

Supplementary Materials:
The supplementary material is available online at http://www.mdpi.com/2073-4425/ 11/12/1444/s1 and includes Figure S1: ONT Sequencing Yield, Figure S2: Distribution of Read Lengths, Table S1: Summary statistics for ONT sequencing, Table S2: Read length and quality score for ONT (PromethION) sequencing, Table S3: Mapping ONT reads to GRCh38, Table S4: Mapping PacBio reads to GRCh38, Table S5: Structural Variant (SV) Calling: ONT and PacBio Data, and Table S6: Parameter testing for overlap detection. The Swe1 and Swe2 raw sequence data will be made available from a local Swedish installation of the European Genome-phenome Archive (EGA) (https://www.ebi.ac.uk/ega) that is now being implemented at Uppsala University and SciLifeLab. The dataset has the following doi:https://doi.org/10.17044/nbis/g000006. Author Contributions: A.A. and U.G. provided the SweGen data. A.A. and L.F. conceived the study. A.P. performed nanopore sequencing. N.F analyzed and visualised the data. N.F. and A.A wrote the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Science for Life Laboratory (SciLifeLab) for the National Genomics Infrastructure (NGI).
Acknowledgments: Oxford Nanopore sequencing was performed at the Uppsala Genome Centre (NGI), hosted by the SciLifeLab. Computational work was performed on secured resources provided by the Swedish National Infrastructure for Computing (SNIC) through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under projects b2015225 and sens-2016003.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: