Introduction

Next generation sequencing (NGS) technologies have enabled a number of applications in genomics1,2,3,4,5. Whole Genome Sequencing (WGS) is among these and is the most comprehensive approach for characterising and analysing an individual’s genetic variation6. Substantial reductions in cost mean that WGS has become an increasingly important tool for clinical diagnosis and targeted treatment of rare disease and cancer4,7,8,9,10,11. Of particular importance to precision oncology is the ability of WGS to identify driver mutations (including those in non-coding regions)12,13, detect mutational signatures14,15,16, characterise structural variation and chromosomal rearrangements17,18, and pinpoint the genomic integration sites of oncoviruses, such as human papillomavirus19. However, the clinical interpretation of these results remains a challenge and precision oncology programmes and clinical trials involving NGS are underway20.

Although short-read, Illumina sequencing is considered the gold standard for the majority of clinical sequencing projects21, such data lead to biases even in WGS, due to uneven coverage of regions with high/low GC content and the difficulty of aligning short reads derived from repetitive DNA sequences20,22,23. Long read sequencing technologies, such as those developed by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have proved invaluable for overcoming these challenges24,25. Thorough comparative studies have shown that long reads reduce the number of ‘dark’ or `camouflaged’ regions of the genome26 and improve the sensitivity of structural variant (SV) detection27. Of course, both technologies have their pros and cons, but with fast turn-around times and lower start-up costs, and despite higher error rates of > 10%, ONT WGS has already been used to resolve SVs in clinical cases28,29. Within cancer research specifically, low coverage ONT (Nanopore) WGS has been used for same-day diagnosis of brain tumours30, while targeted approaches have been developed for detecting BCR-ABL1 fusion transcripts31, analysing prognostically relevant genes in chronic lymphocytic leukaemia32,33, and sequencing the entirety of BRCA134.

With the release of ONT’s PromethION device, the generation of high coverage, Nanopore clinical WGS data has become much more straight-forward and cost-effective. In this study, we aimed to evaluate the extent to which high coverage, long-read Nanopore sequencing enables the chromosome-wide or genome-wide analysis of a broad range of somatic variation, by comparison with the current gold-standard, Illumina short-read WGS. We do this through conducting in-depth analysis of germline and tumour sequencing data from a patient with Diffuse Large B-cell Lymphoma (DLBCL), an aggressive form of non-Hodgkin’s Lymphoma. This cancer genome was chosen for detailed study due to the frequent and clinically relevant co-occurrence of both somatic hypermutations and structural rearrangements within DLBCL tumours35. DLBCL is also characterised by significant inter-patient heterogeneity making the design of targeted sequencing approaches challenging. We find that, while currently available tools are not able to provide reliable somatic small variant calls from two chromosomes of relatively low-depth Nanopore data, advantages in terms of calling large somatic SVs genome-wide and phasing multiple mutations are already evident. Additionally, we compare the performance of multiple tools on the long-read data and provide recommendations for future studies.

Results

The DLBCL patient was recruited as part of a large-scale clinical sequencing study utilizing the Illumina short read platform, details of which have been previously published10. This patient was selected for subsequent Nanopore sequencing on the basis that this tumour type has long been recognised to harbor chromosomal abnormalities, such as loss of chromosomal 17p, which can influence clinical prognosis and management. Furthermore, multiple complex rearrangements have been reported in more recent WGS analyses of this tumour type35. Peripheral blood and fresh-frozen tumour tissue samples were collected for extraction of germline and tumour DNA respectively (see Methods for details). DNA samples were whole-genome sequenced using both Illumina HiSeq 150 bp paired-end sequencing and long read sequencing on Oxford Nanopore Technologies’ PromethION device. All read sets were mapped to the GRCh37 build of the human genome (see Methods for details). This resulted in ~ 25X and ~ 60X coverage of the germline sample on the Illumina and Nanopore platforms respectively, and > 80X coverage of the tumour sample on both platforms. Nanopore reads had a mean length of 4.5 kb for the germline sample and 5.2 kb for the tumour sample. Further sequencing output statistics are shown in Table 1.

Table 1 Summary of sequencing data.

Single nucleotide variants

Somatic small variant calls from the Illumina (hereafter, short-read) data were generated by running Strelka (v2.0.14.1)36 on the tumour and germline samples. There were 7709 somatic variants detected across the whole genome, of which 241 were indels and the rest were single nucleotide variants (SNVs). 2446 of the somatic variants were located within genes. We filtered the short-read somatic small variant calls for clinical relevance using a number of different sources, including COSMIC Cancer Genes Census (v71)37, “My Cancer Genome” (www.mycancergenome.org) and ClinicalTrials.gov (http://clinicaltrials.gov), as previously described10. This resulted in 13 SNVs, 3 of which were located on chr17 (Supplementary Table S1). These were missense variants in TP53 and TAF15, and a CD79B splice donor variant (c.552 + 1G > A) which RNAseq using MinION showed results in intron retention (see Methods and Supplementary Fig. S1).

For the Nanopore (hereafter, long-read) data, we initially ran SNV calling using FreeBayes38, as described39, on reads aligning to reference chromosome 17. Freebayes was chosen as previous work had shown it could be run in a mode robust to the high number of candidate variants in the long-read data39. Analysis was restricted to a single chromosome due to the computationally intensive nature of SNV calling from long-read data. Chromosome 17 was chosen because it was known to harbour mutations of interest, as described above. FreeBayes was run on the tumour and germline samples separately. After initial filtering to exclude very low quality calls (see Methods), germline calls were subtracted from the tumour calls to obtain a list of 18,079 putative somatic SNVs on chromosome 17. More stringent quality filters were then applied (see Methods), leaving 9952 ‘pass’ calls. By comparison, there were 123 somatic SNV calls within chromosome 17 from the short-read data, and only 40 calls were present in both call sets (Fig. 1A). While the short-read calls are not expected to exactly reflect the ground truth, this still suggests low sensitivity and very low specificity of calling somatic SNVs in the long-read data. Upon further investigation, we found that 6332 (> 60%) of the long-read-only somatic calls were called as germline variants in the short-read data. Hence, the low sensitivity of long-read calling on the germline data resulted in many germline variants being called as ‘somatic’ using this method. Subtracting these germline calls left 3620 calls from the long-read data, with the number of calls shared between both long- and short- read datasets remaining unchanged. To investigate further the poor sensitivity of calling somatic SNVs, we first looked to see how many of the 123 somatic SNV calls from the short-read data were present at each step in the subtraction and filtering of the long-read calls. This revealed that 32/123 somatic SNVs were never called by FreeBayes, while 50/123 had been removed as part of the initial filtering of the long-read tumour vcf as they had not passed the QUAL > 1 criterion (see Methods). None of the 123 somatic calls had been wrongly removed by germline subtraction and only 1 had been filtered out by the more stringent quality filters post-subtraction. We then compared the properties of the 40 sites called as somatic SNVs by both methods with those of the 83 sites called only with the short-read data. Both sets had very similar median values of depth, average base quality and average mapping quality (Fig. 1B), hence these properties could not account for the poor sensitivity of somatic SNV calling in the long-read data. Encouragingly, all 3 of the somatic SNVs of clinical interest on chromosome 17 were included in the calls detected by both methods.

Figure 1
figure 1

Comparison of somatic SNV calls from short-read and Nanopore data sets. (A) Venn diagram to show the overlap between somatic SNV calls from the short-read (green) and Nanopore (purple) data for chromosome 17. The two numbers in the Nanopore-only sector show the number of calls before (top) and after (bottom) subtraction of short-read germline calls. Note that the total number of Nanopore calls prior to subtraction is 9912 + 40 = 9952, and the total after subtraction is 3580 + 40 = 3620. (B) Read depth, average mapping quality and average base quality calculated from Nanopore reads covering each of the 123 sites at which an SNV is called in the short-read data. Sites are portioned into those at which an SNV was called vs uncalled in the Nanopore data. Boxes span the interquartile range, with the median marked as a horizontal bar, while whiskers mark the farthest points that are not outliers.

To check that these results were not particular to chromosome 17, we then applied the same methods to reads mapping to chromosome 22, obtaining very similar results in terms of the amount of overlap between short and long-read calls (see Supplementary Fig. S2). Additionally, we repeated the analysis of chromosome 17 SNVs using a second variant caller, Clairvoyante40. Clairvoyante is a recently pubished neural network-based variant caller which includes a model trained on long-read data. We used this model to generate germline and tumour SNV calls separately as before, subtracting the germline calls from the tumour calls and filtering the resulting somatic calls based on QUAL scores. Results, in terms of indicated levels of sensitivity and specificity, were similar to those obtained with FreeBayes. For example, using a QUAL score cut-off of 180 resulted in 9987 putative somatic SNVs, only 40 of which overlapped with the short-read calls (full results shown in Supplementary Fig. S2).

Large structural variants and copy number abnormalities

We compared three whole-genome structural variant (SV) callsets; calls generated from short-read data using Manta41, calls generated from long-read data using the long-read SV caller, Sniffles42 following mapping with minimap243, and calls generated from long-read data using Sniffles following mapping with ngmlr42 (see Methods). We focused on large structural variants > 10 kb, examining each in turn to classify them as either true (genuine somatic abnormalities) or false (germline variation or bioinformatic artefacts). The following results describe the assessment of ‘PASS’ calls from the short-read data and ‘PRECISE’ calls from the long-read data. We initially also examined ‘IMPRECISE’ Sniffles calls with allele frequency (AF) > 0.2, the majority of which came from the reads mapped with ngmlr. However, out of more than a hundred examined, only one was found to be a true somatic variant and so all ‘IMPRECISE’ calls were excluded from the final results set. There were 39 deletions, 9 inversions, 5 duplications and 3 translocations detected in both the short-read and long-read data, all of which were assessed to be true (Fig. 2A). No large insertions were detected by any method. SV calls of different types were unevenly distributed across the genome. In particular, there was a high density of inversions and duplications located on chromosome 16 (Fig. 3).

Figure 2
figure 2

Large (> 10 kb), somatic structural variant calls obtained via different methods. (A) Venn diagrams showing overlap between calls obtained from Illumina short-read data (I; ochre), Nanopore long-read data mapped with minimap2 (M; turquoise) and Nanopore long-read data mapped with ngmlr (N; blue). Bold, black numbers represent true positive calls, while italic, dark red numbers represent false positive calls. (B) Data as above for (A), but shown as total counts for each method in a bar chart. Calls are grouped into deletions (DEL), duplications (DUP), inversions (INV) and translocations (TRA). True positive (TP) calls are shown in green and false positive (FP) calls are shown in red-brown.

Figure 3
figure 3

Circos44 plot. Rings from outside in: (1) Human karyotype for assembly GRCh37. (2) Ploidy of the tumour sample (blue), with radial axis ranging from 0 to 8 and grey lines corresponding to each integer value. Data generated from Nanopore read counts (see Methods). (3) Number of somatic small variants per Mb in the Illumina short-read data (green). (4) Large somatic deletions detected in the Nanopore data (purple, outer half) and Illumina short-read data (red, inner half). Only those assessed as true somatic variants are shown. (5) Large somatic inversions detected in the Nanopore data (blue, outer half) and Illumina short-read data (orange, inner half). Again, only true somatic variants are shown. (6) Links denoting verified translocations and duplications detected in both datasets (green), one translocation detected only in the Illumina short-read data (orange), and one duplication detected only in the Nanopore data (blue).

Across the whole genome, there were 3 deletions, 1 inversion and 1 translocation that were called only in the short-read data but assessed to be true somatic variants (Fig. 2A). Upon closer examination of the long-read alignments in IGV, all these SVs also had read support in the long-read data, but at too low a frequency to be called. Coverage, mapping quality and base quality in the long read data at the relevant breakpoints is detailed in Supplementary Table S2.

Conversely, 7 deletions and 1 duplication were detected in the long-read data but not the short-read data and visually assessed as being real (Fig. 2A). These calls had been missed in the short-read data for a variety of reasons including low allele fraction, not passing the somatic score threshold, the presence of other small SVs nearby, and breakpoints being located in a segmental duplication resulting in miscalling (example given in Supplementary Fig. S3).

We estimated false discovery rates (FDRs) and false negative rates (FNRs) for all SV calling methods. In order to calculate the FNRs we made the assumption that all large somatic SVs in this sample were detected by at least one of the methods. Since there may be large somatic SVs that were not uncovered as part of our study, our FNR calculations may all be underestimates, but these calculations still facilitate useful comparison between methods. The lowest FDRs and FNRs were achieved using the long-read data for deletions and duplications, and using the short-read data for inversions and translocations. There were noticeable differences between the SV calls resulting from the two long-read mappers, but these depended on the category of SVs being considered. Considering all SV types together, lower FDRs but higher FNRs were achieved when using ngmlr (Fig. 2A,B; Table 2). The differences in false discovery rate were most evident in the translocations category, while the differences in false negative rate were most evident for the deletions. In other words, we found that mapping with ngmlr lead to much better specificity for calling somatic translocations, while mapping with minimap2 led to much better sensitivity for calling somatic deletions.

Table 2 Estimates of false negative rates (FNR) and false discovery rates (FDR) for different structural variant calling methods. Bold numbers highlight the lowest values in each column.

We additionally called copy number abnormalities (CNAs) by calculating the Log2R of read counts in the tumour vs germline sample, in sliding windows across the genome (see Methods for details). This revealed that the ploidy of chromosomes 3, 7, 18 and Y differed in the tumour sample, changing to 4, 3, 3 and 0 respectively (Fig. 3). These changes in ploidy were detected in both the short-read and long-read data. There were also three smaller CNAs that were called in both datasets by analysing the Log2R of read counts but had not been detected by Manta/Sniffles. These were a terminal copy number (CN) loss starting at chromosome 10q24.2 and interstitial CN losses in chromosomes 15q13.3-q14 and 22q11.1-q11.21. For the variants on chromosomes 10 and 22, breakpoints corresponded with regions of zero coverage, which explained why they were missed by the SV callers. In the case of the chromosome 15 CN loss, the end breakpoint had been reported as part of an inversion, and the start breakpoint also showed split reads, but with secondary mappings to the decoy chromosome which had not passed filtering in our SV calling pipelines.

We noticed that several of the inversions reported in this data set only had read support for one breakpoint, suggesting they were part of a larger SV complex. An advantage of the long reads is the ability to phase and piece together such complex SVs. To demonstrate this capability, we looked in further detail at chr16:3,213,000–3,894,000, a region containing two large ‘inversions’ with end breakpoints within 300 bp of each other, as well as a 6 kb deletion. WhatsHap45 was run using germline SNP calls from the short-read data to phase the long reads. All reads that could be phased and supported the deletion and inversions were assigned to the same haplotype, confirming that the multiple SVs lie in cis (Fig. 4A). Hence we were able to reconstruct the highly rearranged somatic haplotype and infer the copy number changes at higher resolution than given by raw CNV calling output (Fig. 4B).

Figure 4
figure 4

(A) Diagram to show the reconstruction of a complex somatic structural rearrangement comprising three deleted segments and two inverted segments. Top: arrangement of segments in the reference sequence, and coordinates of SV breakpoints given to the nearest kilobase. Breakpoints are indicated by vertical dashed lines. Bottom: The rearranged haplotype is obtained by following the path of the dotted/purple line, with purple sections indicating the included segments. The inversion breakpoints between segments A–E and segments C–G both span over 10 kb and were reported by both Manta (short-read data) and Sniffles (long-read data). (B) Integrative Genomics Viewer screenshot showing Nanopore reads for the tumour sample, grouped and coloured by haplotype as determined by WhatsHap. Again, vertical dashed lines indicate the position of breakpoints and segments are labelled with letters as in (A). Split ends supporting the breakpoints are indicated with a red asterisk. All reads supporting all 3 breakpoints are assigned to the lower (purple) haplotype, confirming all breakpoints in (A) as in cis. Note that it would not have been possible to confidently phase these breakpoints from the short-read data alone. The nearest heterozygous SNVs to section F are at 3,887,048 and 3,895,908. This means it is not possible to say whether the breakpoints at A|E and E|C are in phase, nor whether A|E and C|G are in phase.

Large variants of clinical interest

Acquired copy number (CN) events detected in the DLBCL genome ranged in size from ~ 10 kb to ~ 79.86 Mb (Supplementary Table S3). A total of 66 CN events included one or more genes annotated in the cross-referenced cancer-related gene lists. Of particular note were events involving genes reported previously as disrupted in DLBCL. Examples include (1) a heterogeneous (i.e. not present in all cells), heterozygous CN loss involving TP53, (2) a homozygous loss involving CDKN2A/2B, noted previously in 30% cases of activated B-cell like (ABC) DLBCL46,47 (3) a heterogeneous, heterozygous CN Loss involving the acetyl transferase gene CREBBP, somatically mutated or deleted in ~ 25% cases48,49 (4) the more rarely observed CN Loss of P2RY8 (5) a high CN gain involving the proto-oncogene BCL6 and PIK3CA (6) a high CN gain involving CIITA, mutation of which is implicated in tumour immune escape50 (7) a CN gain involving CARD11 in the BCR signaling pathway, mutated in 9% of ABC-DLBCL cases51,52 (8) a high CN gain involving MYD88, encoding an adaptor molecule critical for signal relay from the TLR to the NF-κB transcription complex, as well as the interferon and JAK/STAT3 signaling cascade53,54. Data supporting these CN events are shown in the form of read alignments (Supplementary Fig. S4 A-G) and/or in the Circos plot shown above (Fig. 3).

On reviewing these findings, the heterogeneous, heterozygous 1.6 Mb CN loss encompassing TP53 was of particular interest because we additionally found a somatic missense change in a TP53 exon (p.Arg249Met; ClinVar accession VCV000376653.2). Clinically, TP53 mutations or losses have been associated with high grade malignancies and have been found in ~ 20% of DLBCL patients55,56. Therefore, it was of interest to find out if we could determine from our sequencing data whether the point mutation and the deletion were in cis or in trans (i.e. arising on different alleles or occurring within the remaining undeleted alleles of the chromosome from which the CN loss originated). In the tumour sample as a whole, the deletion was estimated to occur at an allele fraction of 22% in the short-read data and 27% in long-read data. The tumour sample purity was estimated at ~ 60% (see Methods). Given this level of purity, we would expect a somatic SNV on the deleted chromosome to appear at much less than 30% frequency, whereas a somatic SNV on the non-deleted chromosome could appear at up to 43% frequency, depending on the level of subclonality. The somatic variant appeared at 44% and 52% frequency in the short-read and long-read data respectively, supporting the in trans orientation. We wished to add further weight to this conclusion through phasing. However, since the nearest heterozygous germline SNPs were more than 5.5 kb away, this was not possible with the short-read data. In the long-read data (which had an average read length of 5.2 kb; Table 1), we were able to find three reads containing the somatic variant and spanning to one or other of the heterozygous SNPs. In each case the somatic variant was in phase with what appeared to be the non-deleted alleles (Fig. 5A). We were able to extend the phase block further in the 5′ direction by running WhatsHap45 on the long-read data (Fig. 5B and Supplementary Fig. S5). The allele counts at 13/14 heterozygous germline SNPs within this phase block were suggestive of this haplotype corresponding to the non-deleted chromosome. The counts at the remaining site (rs2908807) were inconclusive likely due to a high frequency of errors. Taken together, evidence from the long-read data gives high confidence that the p.Arg249Met variant and deletion are in trans. Confirmation of biallelic mutations in TP53 may be clinically important for assessing prognostic and therapeutic implications.

Figure 5
figure 5

(A) Read data used for phasing a somatic variant and large deletion in chromosome 17. Top half: The locations of two heterozygous SNPs (left and right) and somatic single nucleotide variant, p.Arg249Met, (centre) are shown. All three lie within the region covered by the deletion. Base counts are shown reads from both samples run on both platforms (for Nanopore, the reads mapped with minimap2 were used). The somatic variant is highlighted in red. Bottom half: 7 reads spanned from the site of the somatic variant to one or other of the heterozygous SNPs. Each read is shown as a horizontal bar with the base calls at the relevant sites annotated. The approximate distance between the sites is indicated by the grey arrows and text. (B) Counts of REF (green) and ALT (dark red) alleles at sites of germline heterozygous SNPs within the range of the phase set covering 17:7,577,535 and in which the somatic mutation p.Arg249Met is seen. Allele counts are given for only reads within the phase set (and assigned to the same haplotype), as given by WhatsHap (LHS), and for all reads covering the position (RHS). The chromosome 17 coordinates and SNP id are given on the extreme left. In all but one case the vast majority of the reads in the phase set match the majority (i.e. non-deleted) allele in the total allele count. rs2908807 is the exception since the allele count in the phase set is much more even. Although this phase block extends ~ 17 kb in the 5′ direction from the somatic mutation, this does not reach the distal breakpoint of the deletion, which is ~ 166 kb away. A snapshot showing the extent of the phasing blocks in shown in Supplementary Fig. S5.

Discussion

Long-read sequencing is becoming increasingly used in clinical research57. Advantages over short-read methods have been noted for applications such as identifying SVs28,58, resolving complex SVs29, phasing alleles59, sequencing repetitive or highly homologous regions60 and inferring methylation state30. The fast turn-around time of Nanopore sequencing in particular has already led to the development of rapid diagnostic assays for specific cancers30,31. However, the question of whether Nanopore long-read sequencing should replace or complement short-read sequencing for cancer diagnosis more generally remains open.

Our work highlights one of the main issues preventing a switch-over to Nanopore long-read sequencing; the difficulty of genome-wide SNV calling with error-prone long-reads alone. In this study, we obtained high coverage long-read data (~ 60X germline and ~ 80X tumour) of samples from a patient with diffuse large B-cell lymphoma and employed the latest published methods for genome-wide SNV detection on two chromosomes, 17 and 22. We found that the low sensitivity and specificity of these methods seriously hampers their use for the detection of somatic SNVs, highlighting the need for sophisticated joint calling algorithms that can handle long reads and high error rates.

When it comes to structural variant calling, error rates are less problematic and current long-read tools perform well on Nanopore data. In this study, we examined a total of 158 potential large somatic SVs reported by short- and long-read methods and found the overall sensitivity and specificity of long-read methods for detecting large somatic SVs is similar to that of short-read methods. However, the results vary by SV type, and our analysis highlights the strengths and weaknesses of various tools. From this we can draw a couple of recommendations for those wishing to use Nanopore data to detect somatic SVs: (1) only consider ‘PRECISE’ calls reported by Sniffles, and (2) minimap2 should be used for mapping, except for the detection of translocations (where ngmlr performs better). These recommendations are preliminary and need to be confirmed by the study of additional samples. In examining all large somatic SV calls produced from the short-read data, we found that all those that were “missed” in the Nanopore SV calling appeared clearly supported by the long reads when visually inspected in IGV. This would suggest that the information needed for more sensitive and specific calling of structural variants is readily available in long-read data and hence SV calling performance can be expected to improve as tools are fine-tuned based on the availability of more data.

A limitation of our study is that it is restricted to a single sample, and, with respect to somatic SNV calling, to individual chromosomes rather than genome wide. Although chromosomes 17 and 22 are both above average in terms of gene density, number of CpG islands per Mb and overall GC content61,62, these differences are relatively small compared to the fluctuations seen at a regional level. Therefore, we suggest that the somatic variant calling results reported here have relevance to the genome as a whole. Further studies could look at how sample or chromosome-specific properties might impact on the results of somatic variant calling from long reads.

In this study we focused on SVs over 10 kb in length, as SVs of this size are more likely to be acquired and not germline SVs. However, others have reported that the largest improvements in sensitivity gained by using long reads for SV calling are seen in the 50–2000 bp size range27,28,63,64. Individually assessing all these medium-size somatic SV calls for validity is not within the scope of this study, but it is reasonable to expect that with somatic SVs we would also see increased sensitivity compared to short read methods at smaller length scales.

We have provided a couple of examples showing the advantages of long reads for phasing. Firstly, long reads improve our ability to resolve complex structural variants, as shown with the chromosome 16 example (Fig. 4). Complex SVs are common in cancer and resolving them can be critical for interpreting pathogenicity18,29. Secondly, we also used long read information to phase a somatic mutation (p.Arg249Met in TP53) and large deletion encompassing the same gene, confirming that they were in trans. Although it is not clearly established whether bi-allelic TP53 mutations have a worse prognosis compared to monoallelic disruption, this seems logical as TP53 forms homo-dimers and expression is bi-allelic. Deletion of one allele will therefore lead to decreased expression of a normal TP53 protein combined with expression of an abnormal protein from the other allele. In both of these cases high-quality germline SNV calls (from short-read data) were used to extend phase blocks with WhatsHap, but simply examining long reads spanning the variants of interest visually would have been sufficient for giving good phasing confidence.

ONT devices, firmware and tools are continually being updated, and consequently most results may no longer reflect the state-of-the-art by the time they are published. The Nanopore sequencing of these samples commenced at the beginning of 2018, and a PCR-based workflow was chosen in order to achieve high coverage with the available number of flowcells. Since that time, we have seen more than tenfold increases in yield on the PromethION, and now routinely obtain 30X coverage of the human genome (90 Gb mapped data) from a single flowcell of PCR-free library, with mean read lengths of 10-12 kb. We have also seen a decrease in mean error rate from 14 to 10%, following firmware and basecalling updates. ONT have recently released the R10 version of the flowcells which are anticipated to lead to additional improvements in the sensitivity and specificity of SNV calling, getting closer to that obtained with short-read data. It is worth pointing out that deep learning tools such as Clairvoyante may not necessarily achieve better results given higher quality input data until they have been re-trained on similar higher quality data. As a result of longer read lengths and the removal of the need for PCR amplification, we would expect newer ONT data to lead to more uniform coverage of the genome, with fewer ‘dark’ regions26. This would enable variant calling within these previously ‘dark’ regions, which contain a substantial number of disease-relevant genes, and would allow us to phase across longer regions.

Nanopore sequencing is not the only long-read technology available. Pacific Biosciences have recently launched a high-fidelity (HiFi) long-read sequencing approach which can achieve mean per-read accuracy of 99.8%65. These long read data are likely to hugely improve SNV calling results, however the method requires high input amounts of DNA (e.g. > 15ug for 11 kb insert sizes66) which would be a major drawback for some clinical applications. Both of these technologies are exciting and have potential to expand the horizons of clinical research, however the unstable nature of new technologies represents a challenge for the benchmarking of sequencing results and downstream analysis tools. Additionally, costs, while decreasing, remain high for now; generating 30X coverage of a genome with long-reads is still several times more expensive than with short-reads. Even considering these issues, the power of Nanopopre sequencing to detect structural variants has already led to it being used in large scale population sequencing projects67. Our results suggest it as a useful complementary approach for cancer genome sequencing also.

Methods

Patients and ethics

The National Health Service (NHS) Health Research Authority South Central—Oxford C Research Ethics Committee approved this study and all research was performed in accordance with relevant regulations and guidelines. Written informed consent was obtained from the patient according to current Oxford Radcliffe Biobank (ORB) guidelines (Oxford C Research Ethics Committee No: 09/H0606/5+5). Consent was for analysis of tumor and constitutional DNA. Feedback of clinically actionable germline variants was optional.

The Oxford Radcliffe Biobank is a research facility of the University of Oxford and the Oxford Radcliffe Foundation Hospitals Trust and provides governance, management and regulatory oversight for approved research projects.

Sample preparation and genome sequencing

Tumor tissue handling

The patient underwent biopsy of primary and/or metastatic cancer to obtain fresh tissue for sequencing (as previously described in10). Briefly, fresh tissue samples were collected at the time of resection of the ileocaecal and regional lymph nodes. The histopathology report confirmed DLBCL (ABC type). Samples were snap-frozen in liquid nitrogen and an H&E frozen histology section was taken to confirm tumour content. Only samples with microscopically estimated tumor cell content of > 40% were used for sequencing. Frozen tissue was thawed rapidly for nucleic acid extraction and sequencing.

DNA extraction and short-read whole genome sequencing

Tumor DNA was extracted from fresh frozen tissue using the All Prep Mini DNA Extraction kit (QIAGEN), as described in the manufacturer's protocol. Germline DNA was isolated from 1.5 ml peripheral blood using the QIASymphony DSP DNA Midi kit (QIAGEN), according to the manufacturer's protocol. Libraries of fragments with a target length of 350 bp were generated from 1 µg sheared genomic DNA using the TruSeq PCR-Free library preparation kit (Illumina, San Diego, CA). 2 × 150 paired-end sequencing was performed using the HiSeq4000. (Illumina, San Diego, CA). WGS was performed at a planned coverage of 30 × for the germline DNA and of 75 × for the tumour.

Long-read library preparation and sequencing: tumour DNA

2 µg of lymphoma DNA was thawed fragmented in 49 µl of Nuclease Free Water (NFW) for 2 min at 7,000 revolutions per minute (rpm) using g-TUBE (Covaris, Woburn, MA, USA) following manufacturer recommendations. FFPE repair, End repair-dA tailing and PCR Adapter (PCA) ligation, were performed following Oxford Nanopore Technologies’ (ONT) 1D Low input genomic DNA with PCR (SQK-LSK108) version LIP_9021_v108_revL_11Nov2016. End repair incubation times were extended to 30 min at 20 °C and 30 min at 65 °C. Incubation of DNA with Agencourt AMPure XP beads (Backman Coulter Inc., Brea, CA, USA) and elution times were increased to 20 and 10 min at room temperature respectively. Pre-PCR size selection was performed as follows: After washing the beads with 70% ethanol, the PCA ligated DNA was re-suspended in 100 µl of NFW and transferred to a new 1.5 µl Lobind Tube (Eppendorf, Hamburg, Germany). Avoiding pelleting the beads on magnetic bar, an additional volume of 40 µl of beads was added to the previous re-suspended material resulting into a 0.4X ratio of beads to DNA volume. The tube was incubated on the Hula Mixer (Thermo Fisher Scientific, Waltham, MA, USA) for 20 min at room temperature. The beads were washed twice with 200 µl of 70% ethanol and re-suspended in 28 µl of NFW followed by 10 min elution at room temperature. The PCA ligated and size selected DNA was diluted to a concentration of 10 ng/µl in NFW. The PCR reaction was set up as follows: 46 μl Nuclease-free water, 2 μl Primers (PRM, Oxford Nanopore Technologies, Oxford, UK), 2 μl 10 ng/μl adapter ligated template, 50 μl LongAmp Taq 2 × master mix (New England Biolabs, Ipswich, MA, USA). Initial denaturation was for 3 min at 95 °C, denaturation 15 secs at 95 °C (15 cycles), annealing 15 secs at 62 °C (15 cycles), extension 15 min at 65 °C (15 cycles), final extension 15 min at 65 °C, hold at 4 °C. PCR products were cleaned up with a 0.4X ratio of beads to DNA volume and eluted in 25 µl NFW. Five aliquots of 1.5–2 µg size selected and amplified tumour DNA were prepared and sequenced according to the ONT protocol kit 9 chemistry version GDLE_9056_v109_revC_02Feb2018. These were run on four PromethION flow cells using the MinKNOW software version 1.18.02 for 64 h.

Long-read library preparation and sequencing: germline DNA

2 µg of germline DNA were treated following the same pre-PCR procedures as per the tumour sample above. Post-PCR procedures were completed according to a newly updated version of the ONT SQK-LSK109 kit protocol, GDE_9063_v109_revC_04Jun2018. Five size selected and amplified germline DNA aliquots of 1–1.5 µg were run on four PromethION flow cells using ONT MinKNOW software version 1.18.02 for 64 h.

In parallel, two additional R9.4.1 MinION flow cells were run using 1 µg of size selected and amplified germline and tumour DNA following the SQK-LSK109 protocol version GDE_9063_v109_revA_23May2018 (Oxford Nanopore Technologies, Oxford, UK).

RNA lymphoma cDNA library preparation and sequencing

RNA was extracted using the AllPrep Extraction kit (Qiagen). The lymphoma transcriptome was processed according to the Oxford Nanopore SQK-PCS108 cDNA-PCR sequencing protocol (version PCS_9035_v108_revD_26Jun2017, last update 25/10/2017). We used 500 ng of total RNA into a final volume of 9 µl. Note that due to the low input of the original clinical aliquot, we could not recover the optimal starting input of ~ 50 ng of PolyA RNA. The mRNA was targeted for reverse transcription using Nanopore oligo-dTs. Subsequently, full-length double stranded cDNA was generated employing the Nanopore Strand Switching Primer and amplified in 50 µl reaction volumes.

The PCR reaction contained 5 µl cDNA, 25 µl 2X LongAmp Taq Master Mix (NEB Biolabs Inc, Ipswich, Massachusetts, USA, cat. #M0287S), 1.5 µl Nanopore cDNA Primers and 18.5 µl NFW. The reactions were incubated at 95 °C for 30 s, followed by 15 cycles of (95 °C for 15 s, 62 °C for 15 s, 65 °C for 8 min) with a final extension at 65 °C for 6 min. The amplification products were normalised to ~ 400 fmol into 20 µl of Nanopore Rapid Annealing Buffer. The cDNA was adapter ligated adding 5 µl of Nanopore cDNA Adapter Mix for 5 min at room temperature. The adapter ligated library was subsequently cleaned-up by adding 40 µl AMPure XP beads, incubating for 5 min at room temperature and re-suspending the pellet twice in 140 µl of Nanopore Adapter Binding Beads. The purified ligated cDNA was then eluted in 12 µl of Nanopore Elution Buffer. The library was run on a MinION R9.4.1 flow cell using the 48 h PCS108 MinION sequencing script.

Bioinformatic analysis of the Illumina data

Analysis was performed using a bespoke, locked-down, and version-controlled bioinformatics pipeline according to the required specification for clinically accredited laboratories.

Paired-end alignment of sequencing data against the reference genome hg19 (GRCh 37) was performed using the Whole-Genome Sequencing Application v2.0, based on Isaac Alignment Tool, within BaseSpace (Illumina). Somatic single nucleotide (SNV) and insertion/deletion (InDel) variant calling analysis was performed using the Tumour-Normal Application v2.0, based on Strelka, within BaseSpace. Calls were annotated using variant effect predictor (VEP) within Ensembl-tools v84, including COSMIC v71.

Manta (v0.23.1, as part of Illumina’s Tumor Normal Application v2.0.0) was run for somatic structural variant calling.

Detection of acquired CN events

WGS bam files (hg19) were analysed for copy number (CN) events using ngCGH (https://github.com/seandavi/ngCGH) for the paired tumour:germline analysis and ngbin (BioDiscovery, Inc., El Segundo, California, USA, available from http://www.biodiscovery.com) for single sample analyses. A window size of 300 reads was applied for ngCGH, and a window size of 1 kb and read depth of ten for ngbin. B-allele frequency information was obtained from vcf files generated using Platypus68. Singleton and paired tumour-germline Log2R outputs were visualised together with B-allele frequency data using Nexus Discovery Edition 10.1 (BioDiscovery, Inc., El Segundo, California, USA). CN events and regions of homozygosity (ROHs) were flagged using the SNP Rank Segmentation setting. All putative events were visualised and curated manually prior to filtering in Microsoft Excel to remove germline copy number variants (CNVs), germline ROHs and calls due to underlying sequence complexity (eg. segmental duplications, tandem repeat regions) or insufficient confidence on visualisation. For acquired ROHs, referred to as copy neutral loss of homozygosity (cnLOH), the reporting size threshold was > 2 Mb. All acquired copy number events were cross-checked using IGV, looking for supporting evidence from read pairs mapping to the breakpoint regions. The final Microsoft Excel file included all filtered events intersected with both database37 and ‘in house’ gene lists comprising known genes of interest in cancer.

Bioinformatic analysis of the Nanopore data

Nanopore reads were basecalled with Albacore v 2.3.1 or, for later runs, guppy v 1.6.0 (Oxford Nanopore Technologies; note guppy 1.6.0 contained a similar neural network model to Albacore 2.3.1 but was optimised to run on GPUs). They were then trimmed with Porechop v 0.2.369 and mapped using minimap2 v 2.1043. Trimmed reads were additionally mapped with ngmlr v 0.2.7, since the developers of this mapper had reported superior results for structural variant calling42.

Somatic SNV calling

Reads mapping to chromosome 17 (with minimap2) were selected and split into mapped sections of max 100 bp in length. FreeBayes v 1.0.238 was run on the split bams, for tumour and germline samples separately, using the contamination parameters RH/RA = 0.7/0.1. These parameters had produced good results for variant calling without prior phasing in previous work39. Indels and MNPs were not called. Both tumour and germline bams were filtered based on the following criteria: QUAL > 1 & NUMALT < 2 & SAF > 1 & SAR > 1. After this initial filtering, germline calls were subtracted from tumour calls if they shared the CHROM, POS and ALT fields in the vcf. Having previously observed that strand bias is often an indicator of a repeated Nanopore error, we removed calls where the fraction of alternate allele observations was more than 4 × higher or lower in the forward strand vs the reverse strand. The resulting somatic vcf was examined for overlap with the short-read SNV calls and the distributions of the following properties were plotted: AO (alternate allele observations), RO (reference allele observations), QA (alternate allele quality sum in phred), QR (reference allele quality sum in phred), MQM (mean mapping quality of observed alternate alleles), AB (the ratio of reads showing the reference allele to all reads), SAP (strand balance probability for the alternate allele), as shown in Supplementary Fig. S6. Based on these distributions the following criteria were used for final filtering: AO > 10 & RO > 10 & QA > 100 & QR > 100 & AO < 100 & RO < 100 & SAP < 30 & MQM > 55. The filtering criteria decided on here were applied to the chromosome 22 data.

Somatic SV calling

Sniffles v1.0.942 was used to call structural variants with the following parameters: -l 100 -s 10 –genotype. Germline calls were subtracted from tumour calls based on matching ALT fields (i.e. SV type), and both the starts and ends being within (germline SV length)/100 bases of each other. Remaining somatic calls were then filtered according to SVLEN > 10,000 & !(CHR2 = hs37d5) & PRECISE. IMPRECISE calls with AF > 0.2 (as reported by Sniffles) were examined separately. SV calling was done for reads mapped with minimap2 and reads mapped with ngmlr. We examined all long-read and short-read SV calls visually in the Integrative Genomics Viewer (IGV)70, with tracks of long-read data, short-read data, and segmental duplications simultaneously loaded. In this way each call was classfied as either true or false depending on whether it was deemed to be a genuine somatic abnormality, or a germline variant or mapping artefact.

Comparison of short-read and long-read SV calls

Lists of calls from the three different methods (Illumina + Manta, Nanopore + minimap2 + Sniffles, Nanopore + ngmlr + Sniffles) were examined to decide where the same SV was reported by multiple methods (see Supplementary Table S3). The breakpoints of each unique call were looked at in IGV to assess whether there was sufficient evidence to class a call as a genuine somatic variant. If there was evidence of reads supporting the breakpoint in the germline sample, the call was classed as ‘false’. False discovery rates and false negative rates we calculated by assuming that all true variants were picked up by one of the three methods, and then using the following formulae (TP = true positive; FP = false positive):

$$ \begin{aligned} {\text{FNR}} & = \left( {\text{Number of TPs not detected with this method}} \right){/}\left( {\text{Total TPs}} \right) \\ {\text{FDR}} & = \left( {\text{Number of FPs detected with this method}} \right){/}\left( {\text{Total calls from this method}} \right) \\ \end{aligned} $$

Detection of acquired CN events

Aligned reads were split into aligned sections of max 100 bp in length, as described for SNV calling above. The number of reads with MAPQ >  = 20 within 100 kb windows was counted, for both the tumour and germline bam files. Let N(w) be the read count in the germline sample for window, w, and T be defined similarly for the tumour sample. The normalised log read count ratio, R, was calculated as

$$ R\left( w \right) = \left\{ {\begin{array}{*{20}ll} {\log_{2} \frac{T\left( w \right)}{{N\left( w \right)}} - \log_{2} \frac{252}{{187}},} &\quad {N\left( w \right) > {\text{mean}}\left( N \right) - 2 \times {\text{sd}}\left( N \right)} \\ {{\text{NA}},} &\quad {{\text{otherwise}}{.}} \\ \end{array} } \right. $$

The R package DNAcopy71, was used to perform circular binary segmentation and detect change points using the following parameters: alpha = 0.01, min.width = 5, undo.splits = "sdundo", undo.SD = 3. The mean value of points in all segments longer than 500 windows and representing ploidy = 2 (i.e. clustered around 0) was calculated as 0.094. This was subtracted from all segment means, to account for the ploidy of the tumour sample being greater than 2. The purity of the tumour sample was then calculated by similarly computing the mean value of points in segments representing ploidy = 1, 3, 4 (identifiable as distinct clusters of segment means). This gave three separate estimates of purity – 0.61, 0.66 and 0.64, where purity was defined as the amount of tumour DNA in the tumour sample as a proportion of total DNA (tumour + germline). A purity value of 0.63 was used to calculate the theoretical segment means for values of ploidy from 1 to 6 and then classify the recorded segment means accordingly, for both the autosomes (with ploidy 2 in the germline) and the sex chromosomes (with ploidy 1 in the germline).

Estimating expected frequencies of the p.Arg249Met somatic variant in TP53

For this calculation we rounded the purity estimate to 60%. Hence we assume that 40% of cells on average contain germline genetic material. Let the fraction of cells containing the somatic deletion encompassing TP53 be \(x\) %, then the fraction of tumour cells without the deletion is (60—\(x\))%.

If the somatic SNV is in cis with the deletion, the fraction of reads containing the somatic SNV can be estimated as at most:

$$ \begin{aligned} & \frac{{\# {\text{ tumour cells without the deletion }}}}{{2{ } \times { }\left( {\# {\text{ cells without the deletion }}} \right){ } + { }\# {\text{ cells with deletion}}}} \\ & \quad = 60 - x/2 \times \left( {60 - x + 40} \right) + x \\ & \quad = 60 - x/200 - x \\ \end{aligned} $$

Which takes a maximum value of 30% when x = 0. If the somatic SNV is in trans with the deletion, the fraction of reads containing the somatic SNV can be estimated as at most:

$$ \begin{aligned} & \frac{{\# {\text{ tumour cells }}}}{{2{ } \times { }\left( {\# {\text{ cells without the deletion }}} \right){ } + { }\# {\text{ cells with deletion}}}} \\ & \quad = 60/200 - x \\ \end{aligned} $$

This varies from 30 to 43% as \(x\) varies from 0 to 60.

Phasing

A list of heterozygous SNVs was obtained by running Illumina’s Isaac variant caller on the germline bam. whatshap phase was then run using this vcf and Nanopore reads from both the germline and tumour samples. The resulting phased SNV calls were used along with whatshap haplotag to assign haplotype and phase set tags to Nanopore reads from the tumour sample. Haplotag bams were visualised in IGV, grouped by haplotype and coloured by phase set.