Exome sequencing identifies novel susceptibility genes and defines the contribution of coding variants to breast cancer risk

wide significance for PTVs had a posterior probability of association >0.90 (with all other posterior probabilities <0.4). These results demonstrate that large exome sequencing studies, combined with efficient burden analyses, can identify additional breast cancer susceptibility genes. The excess of positive associations at P<0.001 indicates that further genes should be identifiable through large datasets: the heritability analyses suggest the number of associated genes might be of the order of 130. Further replication in larger datasets will also be necessary to provide more precise estimates for variants in the novel genes ATRIP , MAP3K1 and SAMHD1 , to define the set of variants in these genes associated with breast cancer and the clinic-pathological characteristics of tumors in variant carriers. The heritability analyses suggest that most of the contribution of PTVs is mediated through the five genes BRCA1 , BRCA2 , ATM , CHEK2 and PALB2 , commonly tested for in clinical cancer genetics 16 . While subsets of missense variants may also make important contributions (exemplified by SAMHD1 ), these results suggest that the majority of the “missing” heritability is likely to be found in the non-coding genome.


Introductory paragraph (150 words max)
Linkage and candidate gene studies have identified several breast cancer susceptibility genes, but the overall contribution of coding variation to breast cancer is unclear.To evaluate the role of rare coding variants more comprehensively, we performed a meta-analysis across three large whole-exome sequencing datasets, containing 16,498 cases and 182,142 controls.Burden tests were performed for protein-truncating and rare missense variants in 16,562 and 18,681 genes respectively. ): the five known susceptibility genes BRCA1, BRCA2, CHEK2, PALB2 and ATM, together with novel associations for ATRIP and MAP3K1.Predicted deleterious rare missense or protein-truncating variants were additionally associated at P<2.5x10 -6 for SAMHD1.The overall contribution of coding variants in genes beyond the previously known genes is estimated to be small.

Main Text (up to 2000 words)
Breast cancer is the leading cause of cancer-related mortality for women worldwide.Genetic susceptibility to breast cancer is known to be conferred by common variants, identified through genome-wide association studies (GWAS), together with rarer coding variants conferring higher disease risks.The latter, identified through genetic linkage or targeted sequencing studies, include protein-truncating variants (PTVs) and/or some rare missense variants in ATM, BARD1, BRCA1, BRCA2, CHEK2, RAD51C, RAD51D, PALB2 and TP53 1 .However, these variants together explain less than half the familial relative risk (FRR) of breast cancer 2 .The contribution of rare coding variants in other genes remains largely unknown.
Here, we used data from three large whole-exome sequencing (WES) studies, primarily of European ancestry, to assess the role of rare variants in all coding genes: the BRIDGES (Breast Cancer Risk after Diagnostic Gene Sequencing) dataset that included samples from eight studies in the Breast Cancer Association Consortium (BCAC), the PERSPECTIVE (Personalised Risk assessment for prevention and early detection of breast cancer: integration and implementation) dataset that included three BCAC studies (Supplementary Table 1), and UK Biobank.After quality control (see methods), these datasets comprised 16,498 cases and 182,142 controls (Supplementary Table 2).
We considered two main categories of variants: protein-truncating variants (PTVs) and rare missense variants (minor allele frequency <0.001).Single variant association tests are generally underpowered for rare variants; however, burden tests, in which variants are collapsed together, can be more powerful if the associated variants have similar effect sizes 3 .To further improve power, we incorporated data on family history of breast cancer (see methods) 4 .Association tests were conducted for all genes in which there was at least one carrier of a variant (16,562 genes for PTVs and 18,681 genes for rare missense variants).
In the combined analysis of PTVs, 18 genes were associated at P<0.001 (Supplementary Table 3, Figures 1-2).Of these, 7 met exome-wide significance (P<2.5x10 - ), of which 5 are known breast cancer risk genes -BRCA1, BRCA2, CHEK2, PALB2 and ATM. ).Of the other previously identified breast cancer susceptibility genes, nominally significant associations were observed for BARD1, CDH1 and RAD51D (Supplementary Table 4).There was no evidence for an excess of significant associations after allowing for the 7 exome-wide significant genes (Figure 2).However, 17 of the 18 associations at P<0.001 correspond to an increased risk, compared with ~9.5 that would be expected by chance.This imbalance suggests some of the other associations may be genuine.
Within the BCAC dataset, we evaluated the association with subtypes of breast cancer; ER+ or ER-, PR+ or PR-, and triple-negative disease.We compared different subtypes of cases to controls and performed analysis within cases only (Supplementary Table 5).The expected associations for known genes were observed, notably the higher OR for ER-negative and triple-negative disease for BRCA1 and higher OR for ER-positive disease for CHEK2, but no additional genes were associated with subtype-specific disease at exome-wide significance.For the rare missense variant meta-analysis, 15 genes had a P-value <0.001, 12 of which corresponded to an increased risk of breast cancer (Supplementary Table 6, Supplementary Figures 1-2) compared to 7.5 expected by chance. ).We next considered missense variants predicted deleterious by CADD (CADD score ≥20) combined with PTVs.In this analysis, 24 genes had a P-value < 0.001, 17 of which corresponded to an increased risk of breast cancer (Supplementary Table 7); Supplementary Figures 3-4).Six genes met exome-wide significance: CHEK2 (P=2.1x10 -4 ), PALB2 (P=1.1x10 - ), BRCA1 (P=1.5x10 - ).
Notably, all three novel genes achieving exome-wide significance, MAP3K1, ATRIP and SAMHD1 have prior evidence of involvement in tumorigenesis or DNA repair.MAP3K1 is a stress-induced serine/threonine kinase that the ERK and JNK kinase pathways by phosphorylation of MAP2K1 and MAP2K4 5,6 .Inactivating variants in MAP3K1 are one of the commonest somatic driver events in breast tumors 7,8 .MAP3K1 is also a well-established GWAS locus 9 ; at least 3 independent signals have been identified mapping to regulatory regions with MAP3K1 expression as the likely target 10,11 .To evaluate whether the MAP3K1 PTV association we observed was driven by the GWAS associations, or vice-versa, we fitted logistic regression models to UK Biobank data in which the PTV burden variable and the lead GWAS SNPs (SNP1: rs62355902, SNP2: rs984113 and SNP3: rs112497245) were considered jointly (Supplementary Table 8).In the model with all variables, the OR associated with carrying a PTV (OR = 8.54 (2.96, 24.66)) was similar to the unadjusted OR.Similarly, the ORs for each of the SNPs were similar to the ORs without adjustment for PTVs.This suggests that the PTV burden and GWAS associations are independent and reflect the distinct effects of inactivating coding alterations and regulatory variants.ATRIP (ATR interacting protein) codes for a DNA damage response protein which forms a complex with ATR.ATR-ATRIP is involved in the process that activates checkpoint signalling when single-stranded DNA is detected following the processing of DNA double-stranded breaks or stalled replication forks 12,13 .SAMHD1 promotes the degradation of nascent DNA at stalled replication forks, limiting the release of single-stranded DNA 14 .SAMHD1 also encodes dNTPase that protects cells from viral infections 15 and is frequently mutated in multiple tumor types, including breast cancer.
Pathology information was available in the BCAC dataset for 9 carriers of MAP3K1 PTVs, 14 carriers of ATRIP PTVs and 46 carriers of SAMHD1 PTVs or predicted deleterious rare missense variants.These data suggest a higher proportion of low-grade, low-stage ductal breast cancer for MAP3K1 carriers, but high-grade, low-stage ductal breast cancer for ATRIP carriers (Supplementary Table 9).However, none of the associations with tumor characteristics were statistically significant.
To evaluate the overall contribution of PTVs to the Familial Relative Risk (FRR), we fitted models to the effect size using an empirical Bayes approach.Under the assumption of an exponentially distributed effect size, the estimated proportion of risk genes was 0.0066 with a median OR of 1.38.Under this model, an estimated 18.07% of the FRR would be explained, of which would be due to the five genes BRCA1, BRCA2, ATM, CHEK2 and PALB2 and 2.06% due to all other genes combined (MAP3K1 -0.44%, ATRIP -0.13%) (Supplementary Table 10).Only the seven genes reaching exome-wide significance for PTVs had a posterior probability of association >0.90 (with all other posterior probabilities <0.4).
These results demonstrate that large exome sequencing studies, combined with efficient burden analyses, can identify additional breast cancer susceptibility genes.The excess of positive associations at P<0.001 indicates that further genes should be identifiable through large datasets: the heritability analyses suggest the number of associated genes might be of the order of 130.Further replication in larger datasets will also be necessary to provide more precise estimates for variants in the novel genes ATRIP, MAP3K1 and SAMHD1, to define the set of variants in these genes associated with breast cancer and the clinic-pathological characteristics of tumors in variant carriers.The heritability analyses suggest that most of the contribution of PTVs is mediated through the five genes BRCA1, BRCA2, ATM, CHEK2 and PALB2, commonly tested for in clinical cancer genetics 16 .While subsets of missense variants may also make important contributions (exemplified by SAMHD1), these results suggest that the majority of the "missing" heritability is likely to be found in the non-coding genome.oversampled for early-onset (age of diagnosis below 50 years) or family history of breast cancer.Cases were preferentially selected to have information on tumor pathology.Samples with previously identified pathogenic mutations in BRCA1, BRCA2 or PALB2 were excluded.
For BRIDGES, library preparation was conducted in the 3 laboratories using the Nextera DNA Exome kit (Illumina) for tagmentation, barcoding and amplification steps.Subsequently, 500ng of DNA per sample were pooled in 12-plex and concentrated using a vacuum system.Afterwards, hybridization capture reagents for DNA libraries were used for overnight hybridization with the xGen Exome Research panel (Integrated DNA technologies, IDT), capture and amplification.Barcoded pooled libraries of 96 samples were sequenced on each lane of a NovaSeq 6000 S4 flowcell (Illumina) using NovaSeq Xp 4-Lane Kit (2x100bp).
The same pipeline for variant calling was applied to both the BRIDGES and PERSPECTIVE data and followed the GATK (Genome Analysis Toolkit) best practices.Briefly, raw sequence data (FASTQ format) were pre-processed to produce BAM files.This involved alignment to the reference genome, identification and removal of duplicate read pairs from the same DNA fragments, and base recalibration.The base recalibration included the generation of a base quality score recalibration table, later applied to the read bases to adjust their quality scores and increase the accuracy of the variant calling algorithms.An intermediate and informal Quality Control was performed for a sanity check, including coverage and alignment mapping metrics.Variants were then called using Haplotype Caller for the whole exome.This was later split into chromosomes for the analysis due to file size constraints.Variants were further filtered using Variant Quality Score Recalibration (VQSR), a proprietary algorithm from GATK that applied new calibration scores independently at SNP and indel variant levels.
From the final dataset, samples and variants were excluded based on coverage, allelic balance, and Hardy-Weinberg equilibrium, using the same filtering as for UK Biobank.We also excluded samples where the genotypes were inconsistent with previous array genotyping or targeted sequencing data 1,2 .
The BRIDGES study sequenced 6,912 samples, of which 3,461 cases and 3,200 controls remained in the final dataset after QC.The PERSPECTIVE study sequenced 10,523 samples, of which 4,777 cases and 5,210 remained in the final dataset.

Data preparation
For both the UK Biobank and BCAC datasets, Ensembl Variant Effect Predictor (VEP) was used to annotate variants 21 .Annotations included the 1000 genomes phase 3 allele frequency, sequence ontology variant consequences and exon/intron number.For each gene, the MANE Select 22 transcript was used if it was available for that gene, or the RefSeq Select transcript .Annotation files were used to identify PTVs and rare (allele frequency <0.001 in both the 1000 genomes dataset and the current dataset) missense variants.PTVs in the last exon of each gene were excluded as these are generally predicted to escape Nonsense-Mediated mRNA Decay (NMD).VEP was also used to annotate missense variants by Combined Annotation Dependent Depletion score (CADD) 24 .Here CADD ≥20 was used to define variants predicted to be deleterious.

Burden Test analysis
Association analyses were carried out for each gene separately for PTVs, rare missense variants, and predicted deleterious rare missense variants (defined by CADD score ≥20) and PTVs combined.The main association analyses were burden tests in which genotypes were collapsed to a 0/1 variable based on whether samples carried a variant of the given class.That is,   = 1  ∑    =1 > 0  0  ∑    =1 = 0 where   = 0, 1, 2 is the number of minor alleles observed for sample i at variant j, and p is the number of variants in the gene (thus, heterozygous and homozygous carriers were combined).
Different statistical tests were considered by simulation, and the best method was chosen for each dataset.For the BCAC datasets, we used an exact conditional test 25,26 , stratified by country and library preparation method (BRIDGES vs Perspective).Ethnicity was not adjusted for as within each country ethnicity was constant.This method had greater power than the Mantel-Haenszel test 27 and Wald test from logistic regression 28 , and the type-1 error rate was closest to the specified significance level.The exact conditional test was also used for case-control analyses for subtypes and case-only analysis when comparing subtypes e.g., oestrogen receptor status.Two genes with very common PTVs, NUDT11 and ZNF598, were excluded because missing genotypes (which were treated as non-carriers) led to spurious associations even though the variants passed QC filtering.AFF1 was also removed as the PTV frequency was high in PERSPECTIVE but rare in BRIDGES and UKB.This was likely due to a single PTV artefact within the PERSPECTIVE dataset.
For UK Biobank, we used logistic regression analysis but also incorporated family history as a surrogate for disease status.This markedly improves power since susceptibility variants will also be associated with family history; in particular, it allows information on males in the cohort with a family history of female breast cancer to be utilised.To do this, we treated genotype (0/1) as the dependent variable and family history weighted disease status as the covariate; the latter is defined as d+1/2f, where d=0,1 was the disease status of the genotyped individual and f=0 or 1 according to whether the individual reported a positive first-degree family history.The rationale for this weighting is that, for small effect sizes, the log(odds ratio) associated with a positive first-degree relative is approximately ½ that associated with the disease.All analyses adjusted for the first 10 principal components.For genes on chromosome X, only females were used in the analysis.
To combine the results from the BCAC and UK Biobank datasets in a meta-analysis, the association tests for each gene were converted to Z-scores.The combined Z-score was defined as: Here,   is the combined z-score,   is the z-score for study j and   is the weight associated with study j. .

Figure 1 |
Figure 1 | Manhattan Plot of Z-scores from the meta-analysis assessing the association between PTV carriers within genes and breast cancer risk.The orange line corresponds to Z=±3.29, p=0.001.Red line corresponds to Z=±4.71, p=2.5x10 -6 .

Figure 2 |
Figure 2 | Quantile-Quantile Plot of P-values from the meta-analysis assessing the association between PTV carriers and breast cancer risk.All highlighted genes with p<0.0005 correspond to an increased risk of breast cancer.