Germline pathogenic variants in PALB2 and other cancer-predisposing genes in families with hereditary diffuse gastric cancer without CDH1 mutation: a whole-exome sequencing study

Summary

Background
Germline pathogenic variants in the E-cadherin gene (CDH1) are strongly associated with the development of hereditary diffuse gastric cancer. There is a paucity of data to guide risk assessment and management of families with hereditary diffuse gastric cancer that do not carry a CDH1 pathogenic variant, making it difficult to make informed decisions about surveillance and risk-reducing surgery. We aimed to identify new candidate genes associated with predisposition to hereditary diffuse gastric cancer in affected families without pathogenic CDH1 variants.

Methods
We did whole-exome sequencing on DNA extracted from the blood of 39 individuals (28 individuals diagnosed with hereditary diffuse gastric cancer and 11 unaffected first-degree relatives) in 22 families without pathogenic CDH1 variants. Genes with loss-of-function variants were prioritised using gene-interaction analysis to identify clusters of genes that could be involved in predisposition to hereditary diffuse gastric cancer.

Findings
Protein-affecting germline variants were identified in probands from six families with hereditary diffuse gastric cancer; variants were found in genes known to predispose to cancer and in lesser-studied DNA repair genes. A frameshift deletion in PALB2 was found in one member of a family with a history of gastric and breast cancer. Two different MSH2 variants were identified in two unrelated affected individuals, including one frameshift insertion and one previously described start-codon loss. One family had a unique combination of variants in the DNA repair genes ATR and NBN. Two variants in the DNA repair gene RECQL5 were identified in two unrelated families: one missense variant and a splice-acceptor variant.

Interpretation
The results of this study suggest a role for the known cancer predisposition gene PALB2 in families with hereditary diffuse gastric cancer and no detected pathogenic CDH1 variants. We also identified new candidate genes associated with disease risk in these families.

Funding
UK Medical Research Council (Sackler programme), European Research Council under the European Union's Seventh Framework Programme (2007–13), National Institute for Health Research Cambridge Biomedical Research Centre, Experimental Cancer Medicine Centres, and Cancer Research UK.

Supplementary material: Whole exome sequencing study to detect germline pathogenic variants in PALB2 and other cancer-predisposing genes in CDH1-negative diffuse gastric cancer families.

Supplementary Materials and Methods:
Bioinformatics pipeline for VCF generation
FASTQ files underwent demultiplexing and standard QC checks using FastQC prior to trimming of Illumina adaptors and low-quality bases using Cutadapt (ver 1.8.1). The BWA-MEM algorithm (ver 0.7.12) was applied to align reads to GRCh37. BAM files from multiple lanes were merged, sorted and pre-processed (removal of PCR duplicates, base quality recalibration and local realignment around indels) using Samtools (ver 1.2), Picard (ver 2.6.0) and GATK (ver 3.6.0). Variant calling was performed jointly across all samples with GATK HaplotypeCaller, with 10 bp padding around Nextera Exome Rapid Capture targets.
Optimised hard filters were applied, including a VQSR truth sensitivity of 99·5% for SNPs and 97% for INDELs, an average 10x depth (variant DP) per sample and a QUAL threshold of 200. The QUAL threshold corresponded to a Ti/Tv ratio of 2 as calculated by Samtools VCF-Stats. Multi-allelic variants were flagged and excluded for the purpose of this analysis. Only genotypes with quality (GQ) > 20 and individual depth (genotype DP) per sample < 500 were retained for further analysis. Ensembl VEP annotations were applied to select protein-affecting variants: loss of function (stop gained, stop lost, start lost, splice acceptor variant, splice donor variant, or frameshift variant), inframe indels, and missense variants that were simultaneously called deleterious by SIFT and probably damaging by PolyPhen. Common variants (AF > 0.05 in European 1000 Genomes data) were excluded. The non-common protein-affecting variants were aggregated per gene; these genes were used for interaction analyses and prioritised as described in the main methods and in the scripts below.
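The selection rules above can be illustrated with a minimal Python sketch. It is not taken from the study's scripts: the `Variant` record, the function names and the gene labels are hypothetical, each variant is assumed to carry a single VEP consequence term, and the SIFT and PolyPhen calls are assumed to be available as plain strings.

```python
from collections import Counter
from dataclasses import dataclass

# VEP consequence terms listed in the text as loss of function.
LOF_TERMS = {
    "stop_gained", "stop_lost", "start_lost",
    "splice_acceptor_variant", "splice_donor_variant", "frameshift_variant",
}
# VEP terms for inframe indels.
INFRAME_TERMS = {"inframe_insertion", "inframe_deletion"}


@dataclass
class Variant:
    """Hypothetical, simplified variant record (one consequence per variant)."""
    gene: str
    consequence: str
    eur_af: float          # allele frequency in European 1000 Genomes data
    sift: str = ""         # e.g. "deleterious"
    polyphen: str = ""     # e.g. "probably_damaging"


def genotype_passes(gq: float, dp: int) -> bool:
    """Genotype-level filter: GQ > 20 and per-sample depth < 500."""
    return gq > 20 and dp < 500


def is_protein_affecting(v: Variant) -> bool:
    """Loss of function, inframe indel, or missense flagged by both predictors."""
    if v.consequence in LOF_TERMS or v.consequence in INFRAME_TERMS:
        return True
    if v.consequence == "missense_variant":
        return v.sift == "deleterious" and v.polyphen == "probably_damaging"
    return False


def select_variants(variants):
    """Keep protein-affecting variants that are not common (AF > 0.05 is excluded)."""
    return [v for v in variants if is_protein_affecting(v) and not v.eur_af > 0.05]


def aggregate_by_gene(variants):
    """Count the retained variants per gene."""
    return dict(Counter(v.gene for v in variants))
```

In this sketch a missense variant is kept only when both predictors agree, which mirrors the "simultaneously called" wording; a variant with several annotated transcripts would need an additional rule that the text does not specify.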
Scripts generated for all analyses downstream of VCF generation can be found at https://github.com/elliefewings/Fewings_HDGC_exome_2018. VCF data can be downloaded from the repository at https://doi.org/10.17863/CAM.17181.

Validation by Sanger sequencing
Custom primers were designed for each variant and are summarised in supplementary table 1. Primers were designed to be between 18 and 26 bases in length with a melting temperature of around 60°C. The UCSC In-Silico PCR tool was used to check specificity of primer binding. Due to their proximity, both RECQL5 variants (c.2806-2T>C and c.2828C>T) were covered by one pair of primers.
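The design criteria above can be expressed as a simple check, sketched here in Python. The text does not say how melting temperature was estimated, so this sketch assumes the basic GC-content approximation Tm = 64.9 + 41 × (G + C − 16.4) / N. The ±3°C tolerance used for "around 60°C", the function names and the example sequence in the usage note are likewise assumptions and are not primers from this study.

```python
def basic_tm(seq: str) -> float:
    """Approximate Tm (deg C) from GC content; an assumed formula, not the study's."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def primer_ok(seq: str, min_len: int = 18, max_len: int = 26,
              target_tm: float = 60.0, tolerance: float = 3.0) -> bool:
    """Check length (18 to 26 bases) and Tm within an assumed tolerance of 60 deg C."""
    seq = seq.upper()
    if not (min_len <= len(seq) <= max_len):
        return False
    if set(seq) - set("ACGT"):
        return False  # reject ambiguous or non-DNA characters
    return abs(basic_tm(seq) - target_tm) <= tolerance
```

For instance, `primer_ok("GCGCGCATGCGCGCATGCAT")` evaluates a made-up 20-base sequence against both criteria. Specificity of binding, which the authors checked with the UCSC In-Silico PCR tool, is outside the scope of this sketch.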

Gene interaction network analysis: control data
Data from the 1000 Genomes Project were used as a control set to test for an enrichment of loss-of-function variants under selected gene ontology terms in HDGC. Variants from European phase-3 1000 Genomes data were filtered to select 28,833 uncommon (European AF <0·05 in 1000 Genomes), protein-affecting variants (loss of function, predicted deleterious and damaging missense, and inframe indels). Variants were aggregated into 11,796 genes, which were filtered to select those with at least one loss-of-function variant and to remove the top 1% most variable genes. Variability was measured as the number of rare, protein-affecting variants in each gene; 3,634 genes containing 4,601 loss-of-function variants were retained. Aggregated allele counts for each selected gene ontology term were generated using these loss-of-function variants for further analysis.
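The gene-level selection described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's code: the input is assumed to be a list of `(gene, is_lof)` pairs for the rare protein-affecting variants, the function name is hypothetical, and the text does not say how ties or rounding were handled when removing the top 1%, so the sketch rounds down and breaks ties by gene name.

```python
from collections import Counter


def select_control_genes(variants, top_percent: int = 1):
    """Return {gene: LoF variant count} for genes kept in the control set.

    variants: iterable of (gene, is_lof) pairs, one per rare
    protein-affecting variant (assumed input format).
    """
    variants = list(variants)
    n_variants = Counter(gene for gene, _ in variants)
    n_lof = Counter(gene for gene, is_lof in variants if is_lof)

    # Rank genes from most to least variable (variant count), ties by name.
    ranked = sorted(n_variants, key=lambda g: (-n_variants[g], g))
    # Number of genes in the top percentile, rounded down (assumption).
    n_drop = len(ranked) * top_percent // 100
    too_variable = set(ranked[:n_drop])

    # Keep genes with at least one LoF variant that are not highly variable.
    return {
        g: n_lof[g]
        for g in n_variants
        if n_lof[g] > 0 and g not in too_variable
    }
```

Applied to the aggregated control variants, the values of the returned dictionary give per-gene loss-of-function counts, which could then be summed over the genes annotated to each gene ontology term.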

Supplementary Results: VCF generation and quality metrics
Samples were sequenced across five whole-exome sequencing libraries. Data quality of aligned, merged BAM files was checked using metrics generated by Qualimap and Picard (supplementary table 3). The mean percentage of targets covered at 20x across all samples was 80.23%. All identified candidate variants were manually checked in BAM files using IGV for region coverage and for an appropriate percentage of reads supporting the alternative allele. Additionally, all candidate variants were successfully validated by Sanger sequencing.