Sequential filtering for clinically relevant variants as a method for clinical interpretation of whole exome sequencing findings in glioma

Ülgen, Ege; Can, Özge; Bilguvar, Kaya; Akyerli Boylu, Cemaliye; Kılıçturgay Yüksel, Şirin; Erşen Danyeli, Ayça; Sezerman, O. Uğur; Yakıcıer, M. Cengiz; Pamir, M. Necmettin; Özduman, Koray

doi:10.1186/s12920-021-00904-3

Research article
Open access
Published: 23 February 2021

BMC Medical Genomics volume 14, Article number: 54 (2021)

Abstract

Background

In the clinical setting, workflows for analyzing individual genomics data should be both comprehensive and convenient for clinical interpretation. Aiming for both comprehensiveness and practicality, we developed a clinical workflow for analyzing individual whole exome sequencing (WES) data that identifies genomic alterations and presents findings relevant to neurooncology.

Methods

The analysis workflow detects germline and somatic variants and presents: (1) germline variants, (2) somatic short variants, (3) tumor mutational burden (TMB), (4) microsatellite instability (MSI), (5) somatic copy number alterations (SCNA), (6) SCNA burden, (7) loss of heterozygosity, (8) genes with double-hit, (9) mutational signatures, and (10) pathway enrichment analyses. Using the workflow, 58 WES analyses from matched blood and tumor samples of 52 patients were analyzed: 47 primary and 11 recurrent diffuse gliomas.

Results

The median mean read depths were 199.88 for tumor and 110.955 for normal samples. For germline variants, a median of 22 (14–33) variants per patient was reported. There was a median of 6 (0–590) reported somatic short variants per tumor. A median of 19 (0–94) broad SCNAs and a median of 6 (0–12) gene-level SCNAs were reported per tumor. The gene with the most frequent somatic short variants was TP53 (41.38%). The most frequent chromosome-/arm-level SCNA events were chr7 amplification, chr22q loss, and chr10 loss. TMB in primary gliomas was significantly lower than in recurrent tumors (p = 0.002). MSI incidence was low (6.9%).

Conclusions

We demonstrate that WES can be practically and efficiently utilized for clinical analysis of individual brain tumors. The results show that NOTATES produces clinically relevant results in a concise but exhaustive manner.

Background

Next-generation sequencing (NGS) has proven remarkably beneficial in not only understanding cancer biology but also guiding cancer care [1,2,3]. Various NGS methods are routinely used in cancer care [4, 5]. Targeted sequencing panels, whole-exome sequencing (WES), and whole-genome sequencing (WGS) are the most commonly utilized methods, each with its advantages and limitations [6,7,8]. Targeted sequencing panels are tailored to investigate curated cancer-related information, provide excellent depth, and are suited for working with formalin-fixed paraffin-embedded (FFPE) samples [9, 10]. In contrast, WES/WGS provides more comprehensive genomics data suited for both screening previously investigated/reported variants and exploring novel relevant variants. More comprehensive genomics data also provide additional information such as direct measurement of the mutational burden [11, 12] and exploration of signatures of mutational processes [13, 14]. Brain tumors have complex genetic landscapes [15,16,17]. Therefore, it is beneficial to gather the most comprehensive genomics information for each neurooncology patient. We hence advocate utilizing WES for neurooncological genomics analyses as it gathers comprehensive information with a lower cost than WGS and is technically less challenging to analyze and interpret.

The bioinformatics workflows for variant calling are well established but the clinical interpretation of the identified variants constitutes a bottleneck in the analysis [18]. In the clinical setting, the analysis workflow should produce results that are both exhaustive and suitable for clinical interpretation. Intending to be simultaneously comprehensive and practical, we created a clinical WES workflow tailored for neurooncology. This approach sequentially filters and presents layers of findings relevant to neurooncology (the layers being alterations that are detected in curated collections of clinically-relevant genes). This sequential filtering approach prioritizes highly relevant findings while still reporting less relevant but possibly important findings. This article presents our approach and provides results of the analysis of our findings on a sizable glioma cohort, demonstrating that our approach yields clinically relevant results.

Methods

Reads-to-variants workflow

The overview of the complete workflow is presented in Fig. 1. The reads-to-variants pipeline is presented below.

For quality control, FASTQC (v0.11.9) [19] is used. For tumor and normal samples, the reads are mapped to the reference (hg38) using bwa (version 0.7.17-r1188) [20] and pre-processed, including cleaning the SAM file, sorting SAM by coordinate and converting to BAM, fixing mate information, and marking PCR duplicates (all via Picard version 2.23.8) [21]. For samples that were sequenced in multiple lanes, data for all lanes are combined at this step. Finally, base quality score recalibration (GATK [22] v4.1.9.0) is performed. For quality control, GATK3–DepthOfCoverage (version 3.8-1-0-gf15c1c3ef) and Picard-CollectAlignmentSummaryMetrics are used.

For detecting germline variants (single nucleotide variants (SNVs) and short insertion/deletions (indels)), GATK–HaplotypeCaller is used. For detecting somatic SNV/indels, GATK–MuTect2 is used. Both germline and somatic SNV/indels are annotated using GATK–Funcotator. For detecting somatic copy number alterations (SCNAs), ExomeCNV is used [23]. Annotations of gene-level SCNAs and cytoband annotations are performed via an in-house script.

Personalized neurooncology report workflow

To produce comprehensive reports of WES results, we developed the reporting workflow NOTATES. NOTATES uses curated datasets of glioma- and cancer-related variants and genes to sequentially report clinically relevant findings.

After a summary of somatic WES findings, the report contains the following sections:

1. Quality Metrics
   a. Summary Table of Quality Metrics
   b. Tumor Purity
2. Germline Alterations
   a. ACMG Incidental Findings
   b. Variations in Cancer Gene Census Genes
   c. Variations in Cancer Predisposition Genes
   d. Variations in DNA Damage Repair Genes
   e. Common Variants
3. Somatic Single Nucleotide Variations (SNVs) and Small Insertion/Deletions (Indels)
   a. Tumor Mutational Burden (TMB)
   b. Microsatellite Instability Status (MSI)
   c. Variants in Established Glioma Genes
   d. Hotspot Variants in Cancer Gene Census Genes
   e. Other Variants in Cancer Gene Census Genes
   f. Other Possibly Important Somatic SNV/indels
      i. Variants in DNA Damage Repair Genes
      ii. Variants in Important KEGG Pathway Genes
4. Somatic Copy Number Alterations (SCNAs)
   a. SCNA Burden
   b. Established SCNAs in Glioma
   c. SCNAs in Cancer Gene Census Genes
   d. Broad SCNAs
   e. Plots of SCNA Segments by Chromosome
5. Loss of Heterozygosity (LOH) Events
   a. LOH Overview
   b. LOH + Somatic SNV/Indel
   c. LOH Events in Cancer Gene Census Genes
6. Genes with Double Hit
7. Tumor Heterogeneity Analysis
8. Mutational Signatures
9. pathfindR - KEGG Pathway Enrichment Analysis

The contents of these sections are detailed in the Results section. NOTATES was written in R [24] and R Markdown.

Analyses and patients

Using NOTATES v1.5, 58 WES analyses from matched blood and tumor samples of 52 patients were analyzed: 47 primary and 11 recurrent diffuse gliomas. Overall, 47 grade IV (81.03%), 7 grade III (12.07%), and 4 grade II tumors (6.9%) were analyzed. Clinical details for all patients and analyses are presented in Additional file 2: Table S1. For each tumor specimen submitted for WES, sections were reviewed by a neuro-pathologist to confirm the diagnosis of diffuse glioma and specifically excise a region within the tumor sample containing only tumor tissue. DNA was extracted using the DNeasy Blood & Tissue Kit (QIAGEN).

All analyses of NOTATES results presented here were performed using R. Selected results were compared with results from the TCGA pan-glioma cohort [15].

Software availability

The reads-to-variants and reporting workflow NOTATES is available for non-commercial purposes on GitHub: https://github.com/egeulgen/NOTATES.

Results

Analysis and reporting of exomes

Sequencing quality metrics

The median mean read depths were 199.88 for tumor and 110.955 for normal samples. The median percentages of reads with at least 25X coverage were 99.3% and 98.55% for tumor and normal samples, respectively. Detailed quality metrics are presented in Additional file 2: Table S2.

Germline variants

Raw germline variants (median = 80,328, range = 72,008–120,635 per patient) are initially filtered according to GATK's best practices [22] for eliminating technical artifacts to yield a median of 64,815 (range = 58,528–87,619) variants per patient (Fig. 2a). For reporting, we only include variants that:

have MAF < 1%
are not reported as “benign” or “likely benign” in ClinVar [25]
have non-synonymous impact
are not in FLAGS [26] genes.
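
The four reporting criteria above can be read as a single conjunctive filter. The following is a minimal Python sketch of that logic, not the NOTATES implementation (which is written in R). The record layout, the field names, and the two-gene `FLAGS_GENES` set are assumptions made for illustration; the actual FLAGS list is the one defined in [26].

```python
from dataclasses import dataclass


# Hypothetical record layout; field names are illustrative, not NOTATES's own.
@dataclass
class GermlineVariant:
    gene: str
    maf: float            # population minor allele frequency as a fraction
    clinvar: str          # ClinVar clinical significance; "" when no record
    non_synonymous: bool  # True if the variant alters the protein sequence


# Placeholder gene set; the actual FLAGS list comes from reference [26].
FLAGS_GENES = {"TTN", "MUC16"}
BENIGN_LABELS = {"benign", "likely benign"}


def passes_germline_report_filter(v, flags_genes=FLAGS_GENES):
    """Return True only if the variant meets all four reporting criteria."""
    return (
        v.maf < 0.01                                        # MAF < 1%
        and v.clinvar.strip().lower() not in BENIGN_LABELS  # not (likely) benign
        and v.non_synonymous                                # non-synonymous impact
        and v.gene not in flags_genes                       # not in a FLAGS gene
    )


# Made-up example records, one failing each criterion and one passing all.
variants = [
    GermlineVariant("MSH6", 0.0004, "", True),                       # kept
    GermlineVariant("TP53", 0.0300, "", True),                       # too common
    GermlineVariant("BRCA2", 0.0010, "Likely benign", True),         # benign label
    GermlineVariant("ATM", 0.0020, "Uncertain significance", False), # synonymous
    GermlineVariant("TTN", 0.0001, "", True),                        # FLAGS gene
]
reported = [v.gene for v in variants if passes_germline_report_filter(v)]
print(reported)
```

Under the same assumptions, the somatic reporting filter would follow the same pattern, with a tumor VAF > 5% condition taking the place of the MAF and ClinVar conditions.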

This filtering results in a median of 464 (range = 400–536) variants per patient. A median of 22 (range = 14–33) variants per patient is in the reported categories: a median of 2 (range = 0–6) in “ACMG Incidental Findings”, 16 (range = 10–27) in “Variants in Cancer Gene Census Genes”, 0 (range = 0–2) in “Variants in Cancer Predisposition Genes” and 3 (range = 0–7) in “Variants in DNA Damage Repair Genes”.

Considerable percentages of the combined reported variants (in all patients) in each category did not have a record in ClinVar (“not reported”). For variants with a ClinVar record, the most frequent clinical significances were “Drug response” for “ACMG Incidental Findings” (37.3%), “Conflicting” for “Variants in Cancer Gene Census Genes” (5.17%), and “VUS” for “Variants in Cancer Predisposition Genes” (16.67%) and “Variants in DNA Damage Repair Genes” (3.82%) (Additional file 2: Fig. S1). Very small fractions of the reported variants in each category were reported as “Pathogenic” or “Likely Pathogenic”: 2.38% for “ACMG Incidental Findings”, 0.48% for “Variants in Cancer Gene Census Genes”, 4.17% for “Variants in Cancer Predisposition Genes” and 1.91% for “Variants in DNA Damage Repair Genes” (Additional file 2: Fig. S1).

Somatic short variants

To filter out sequencing artifacts, raw somatic short variants (median = 14,000, range = 4068–55,533 per analysis) are similarly filtered following the GATK best practices recommendations to result in a median of 223 (range = 57–22,271) variants per analysis (Fig. 2b). For reporting, we further filter these “called” variants and only include variants that:

have tumor Variant Allele Frequency (VAF) > 5%
have non-synonymous impact
are not in FLAGS genes.

This filtering results in a median of 49.5 (range = 2–5646) variants per analysis. A median of 6 (range = 0–590) variants is in the reported categories: a median of 2 (range = 0–44) in “Variants in Established Glioma Genes”, 0 (range = 0–28) in “Hotspot Variants in Cancer Gene Census Genes”, 2 (range = 0–309) in “Other Variants in Cancer Gene Census Genes”, 0 (range = 0–57) in “Variants in DNA Damage Repair Genes” and 1 (range = 0–152) in “Variants in Important KEGG Pathway Genes”.

Figure 3 presents the reasoning behind the sequential filtering of somatic short variants. “Called” (sequencing artifacts excluded) somatic short variants are initially filtered according to the above-mentioned criteria, excluding an average of 78.08% (SD = 8.36%) of “called” variants (Fig. 3a). An average of 2.91% (SD = 1.62%) of “called” variants were reported sequentially in the (1) “Glioma-related” subsection (“Variants in Established Glioma Genes”), (2) “Cancer-related” subsections (“Hotspot Variants in Cancer Gene Census Genes” and “Other Variants in Cancer Gene Census Genes”) and (3) “Selected Gene Sets” subsections (“Variants in DNA Damage Repair Genes” and “Variants in Important KEGG Pathway Genes”). On average, 19.01% (SD = 7.63%) passed the reporting filter but were not reported. Under sequential filtering, a variant reported in one category is not reported in any of the following ones. A mean percentage of 31.28% (SD = 22.37%) of all reported short somatic variants were in the “Glioma-related” subsection, 42.85% (SD = 21.99%) were in the “Cancer-related” subsections and 25.87% (SD = 22.92%) were in “Selected Gene Sets” subsections (Fig. 3b).
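
The "first matching category wins" rule described above can be illustrated with a short Python sketch. This is not the NOTATES code (which is written in R). The gene sets below are tiny placeholders chosen only to make the example run, and the `hotspot` flag and the `assign_sequentially` helper are hypothetical names. Only the category names and their order are taken from the text.

```python
# Tiny placeholder gene sets for illustration only; the curated NOTATES
# collections are much larger and are not reproduced here.
GLIOMA_GENES = {"IDH1", "TP53"}
CGC_GENES = {"TP53", "BRAF", "MSH6"}
DDR_GENES = {"MSH6", "POLE"}
KEGG_GENES = {"POLE", "AKT3"}

# Report categories in priority order, each paired with a membership test.
CATEGORIES = [
    ("Variants in Established Glioma Genes",
     lambda v: v["gene"] in GLIOMA_GENES),
    ("Hotspot Variants in Cancer Gene Census Genes",
     lambda v: v["gene"] in CGC_GENES and v["hotspot"]),
    ("Other Variants in Cancer Gene Census Genes",
     lambda v: v["gene"] in CGC_GENES),
    ("Variants in DNA Damage Repair Genes",
     lambda v: v["gene"] in DDR_GENES),
    ("Variants in Important KEGG Pathway Genes",
     lambda v: v["gene"] in KEGG_GENES),
]


def assign_sequentially(variants):
    """Place each variant in the first category it matches, and only there."""
    report = {name: [] for name, _ in CATEGORIES}
    unreported = []  # passed the reporting filter but matched no category
    for v in variants:
        for name, matches in CATEGORIES:
            if matches(v):
                report[name].append(v["gene"])
                break  # later categories are skipped for this variant
        else:
            unreported.append(v["gene"])
    return report, unreported


# Made-up variants that have already passed the reporting filter.
variants = [
    {"gene": "TP53", "hotspot": True},    # glioma gene, also in the CGC set
    {"gene": "BRAF", "hotspot": True},    # CGC hotspot
    {"gene": "MSH6", "hotspot": False},   # CGC gene, also in the DDR set
    {"gene": "POLE", "hotspot": False},   # DDR gene, also in the KEGG set
    {"gene": "AKT3", "hotspot": False},   # KEGG set only
    {"gene": "GENE_X", "hotspot": False}, # hypothetical gene in no set
]
report, unreported = assign_sequentially(variants)
for name, genes in report.items():
    print(f"{name}: {genes}")
print(f"Not reported: {unreported}")
```

In this sketch a gene that belongs to several sets, such as the placeholder MSH6 entry, appears only in the highest-priority category it matches, which is what keeps the report sections free of duplicates.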

Somatic copy number alterations

ExomeCNV analysis yields a median of 3222 (range = 112–42,370) segments per analysis (Fig. 2c). For high confidence, only SCNAs with a |log₂(Tumor/Normal) ratio| ≥ 0.25 are reported (median = 1964, range = 66–26,636 segments per analysis). For gene SCNA events, under “Established SCNAs in Glioma”, a median of 6 (range = 0–12) SCNA events per analysis are reported, and a median of 0 (range = 0–12) SCNA events per analysis are reported under “SCNAs in Cancer Gene Census Genes”. Under “Broad SCNAs” a median of 19 (range = 0–94) cytoband-level SCNA events per analysis are reported. Chromosomal-arm-level SCNA events in each tumor are presented in Additional file 2: Fig. S2.
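
To make the reporting threshold concrete, the sketch below applies the |log₂ ratio| ≥ 0.25 cutoff to a few made-up segments. The dictionary layout and the `high_confidence_segments` helper are assumptions for illustration and do not reflect the ExomeCNV output format. On a linear scale, the cutoff corresponds to a tumor/normal ratio of roughly 1.19 or higher for gains and roughly 0.84 or lower for losses.

```python
LOG2_RATIO_CUTOFF = 0.25  # threshold stated in the text


def high_confidence_segments(segments, cutoff=LOG2_RATIO_CUTOFF):
    """Keep segments whose absolute log2(tumor/normal) ratio meets the cutoff."""
    return [s for s in segments if abs(s["log2_ratio"]) >= cutoff]


# Made-up segments; the values are illustrative, not taken from the cohort.
segments = [
    {"chrom": "chr7", "log2_ratio": 0.58},    # gain, kept
    {"chrom": "chr10", "log2_ratio": -0.90},  # loss, kept
    {"chrom": "chr2", "log2_ratio": 0.10},    # below cutoff, dropped
    {"chrom": "chr22", "log2_ratio": -0.25},  # exactly at cutoff, kept
]
kept = high_confidence_segments(segments)
print([s["chrom"] for s in kept])

# Linear-scale equivalents of the cutoff for gains and losses.
print(round(2 ** LOG2_RATIO_CUTOFF, 2), round(2 ** -LOG2_RATIO_CUTOFF, 2))
```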

Tumor mutational burden and microsatellite instability

The TMB values of all tumors are presented in Additional file 2: Fig. S3A. TMB in primary gliomas (median = 3.2/Mb) was significantly lower than in recurrent cases (median = 5.8/Mb; Wilcoxon, p = 0.002). The TMB values in different molecular subsets (devised based on WES findings) were also significantly different (Kruskal–Wallis, p = 0.0072; Additional file 2: Fig. S3B).

The TMB distribution of this glioma cohort was comparable to (i.e., not significantly different from) the TMB distributions of the TCGA-Glioblastoma multiforme (GBM) and TCGA-Low-grade Glioma (LGG) cohorts (t-test, p = 0.7 and p = 0.37 for GBM and LGG, respectively; Additional file 2: Fig. S3C).
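
TMB values are expressed per megabase (Mb). The text does not spell out how NOTATES computes TMB, so the sketch below uses the commonly applied definition (number of somatic variants divided by the size of the covered target region in Mb) purely as an assumption. The variant count and the 35 Mb target size are hypothetical numbers chosen for the arithmetic, not values from this cohort.

```python
def tumor_mutational_burden(n_somatic_variants, covered_bases):
    """Somatic variants per megabase of covered sequence (assumed definition)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_somatic_variants / (covered_bases / 1_000_000)


# Hypothetical inputs: 112 somatic variants over a 35 Mb covered exome target.
tmb = tumor_mutational_burden(112, 35_000_000)
print(f"TMB = {tmb:.1f}/Mb")
```

Which variants enter the numerator (for example, all called variants versus only non-synonymous ones) changes the result, so any comparison across cohorts should use the same counting rule.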

Four cases (6.9%) were predicted to have microsatellite instability, and none of the cases were predicted to have POLE deficiency.