Breast cancer genome and transcriptome integration implicates specific mutational signatures with immune cell infiltration

A recent comprehensive whole genome analysis of a large breast cancer cohort was used to link known and novel drivers and substitution signatures to the transcriptome of 266 cases. Here, we validate that subtype-specific aberrations show concordant expression changes for, for example, TP53, PIK3CA, PTEN, CCND1 and CDH1. We find that CCND3 expression levels do not correlate with amplification, while increased GATA3 expression in mutant GATA3 cancers suggests GATA3 is an oncogene. In luminal cases the total number of substitutions, irrespective of type, associates with cell cycle gene expression and adverse outcome, whereas the number of mutations of signatures 3 and 13 associates with immune-response specific gene expression, increased numbers of tumour-infiltrating lymphocytes and better outcome. Thus, while earlier reports imply that the sheer number of somatic aberrations could trigger an immune-response, our data suggests that substitutions of a particular type are more effective in doing so than others.

T he recent advance in DNA sequencing technologies has added substantially to the knowledge-base of breast cancer, but also reaffirmed its proverbial heterogeneity [1][2][3][4][5][6][7][8] . Sequencing DNA to greater depth enables analyses in distinguishing the 'driver' mutations-thought to be involved in the tumourigenesis-from the 'passenger' mutations-those arising in the lifetime of the cancer cell, without affecting the cancer cell's fitness. We recently reported on the identification of 12 substitution and 6 rearrangement signatures, as well as 7 consensus patterns therein, using whole genome sequence (WGS) data of 560 cases with primary breast cancer 9 . Numbered according to an earlier scheme 2 , some of the 12 substitution signatures had underlying mechanisms likely explaining the signature: for example, signature 1 was described as 'age-related' due to the many C4T mutations in an NCG trinucleotide context (the underlined base is mutated). Similarly, signatures 2 and 13 were related to APOBEC-type mutations (predominant C4T and C4G in a TCN context, respectively), while signature 3 lacked specific features but was strongly associated with inactivated BRCA1 and BRCA2. Signature 5 (with unknown causal mechanism) is present in most cases and is primarily characterized by C4T and T4C mutations. Other signatures were found in few cases (for example, signatures 6, 20 and 26) showing signs of mismatch repair deficiency 9 . It is important to realize that most of the breast cancer cases exhibit a blend of these signatures, possibly obfuscating the resulting effects. The very diverse pattern of mutations found in primary breast cancer was previously studied in depth 1 in a large cohort, integrating (epi)genomic, transcriptomic and proteomic data, but this study mainly focused on the driver genes and reported the resulting effect on its expression in solo or in concert in pathway analyses. The possible effects of the mutational signatures themselves on the transcriptome have been studied with less scrutiny.
In the current study we analyse the transcriptome of 266 cases (191 ER-positive/Her-2 negative and 75 triple-negative) by RNA sequencing with available WGS results 9 to explore the possible consequences of the substitution signatures at the transcriptome level. We report on clinically and biologically relevant gene expression signatures and associate these with the number and character of signature mutations.

Results
Validation of cohort. RNA-sequencing data of 266 primary breast cancer cases with available WGS data were generated and were first analysed and visualized to verify whether the cohort exhibited the archetypical breast cancer patterns. To this end, the 5,000 most variable transcripts were used to correlate and hierarchically cluster all cases resulting in five clusters (Fig. 1). Cluster 1 contained all cases of the basal-like intrinsic subtype 6,10 and held 92% of ER-negative cases, while the remaining four subclusters (clusters 2-5) were luminal. Among the subtypes, the total number of substitutions was highest in the basal-likes (Kruskal-Wallis (KW) Po0.0001), while within the luminal cases, the more aggressive luminal B (refs 11,12) compared with luminal A type cancer also showed higher numbers of substitutions (Mann-Whitney U-test (MWU) Po0.0001, Supplementary Fig. 1, and previously reported 1 ).
Next, we evaluated the reported driver genes (derived from WGS data 9 ) with at least five events in this cohort (with the top recurrent events presented in Fig. 1). Amplification of MYC (n ¼ 47, w 2 Po0.0001), of CCNE1 and CCND3 (both n ¼ 7, w 2 P ¼ 0.0003), of PIK3CA (n ¼ 9, w 2 Po0.0001) and, as expected, mutations in TP53 (refs 12-14) (n ¼ 105, w 2 Po0.0001) were predominant in basal subtype cancers, while mutations in GATA3(refs 8,15) (n ¼ 25, w 2 P ¼ 0.0006), and PIK3CA (n ¼ 74, w 2 Po0.0001) and amplification of CCND1 (n ¼ 39, w 2 Po0.0001) were largely restricted to luminal breast cancer, with the latter strongly related to luminal B cancers 16,17 (33/39 cases, w 2 Po0.0001). The lobular cancer driver 18,19 CDH1 was mutated in lobular cases (15/21, w 2 Po0.0001) and mutations of this type were predominantly observed in luminal A cases (12/21, w 2 P ¼ 0.0013). In line with their oncogenic role, amplified regions had significantly higher expression for MYC, CCND1, ZNF217, PIK3CA, MDM2, CCNE1, IGF1R, KRAS and ERBB2 (MWU all Po0.05, Supplementary Fig. 2). The 8p11 locus was identified as recurrently amplified as well, wherein, for example, FGFR1, ZNF703 and WHSC1L1 are potential driver candidates. These three genes are indeed significantly higher expressed in the amplified cases (data not shown) precluding isolation of the actual driver from the passenger genes. Tumour suppressors showed lower expression in cases with mutations (CDH1 and PTEN; MWU Po0.0001, Supplementary Fig. 2). For TP53, however, the type of mutation mattered ( Supplementary Fig. 3) with obvious deleterious variants such as frameshifts (n ¼ 19, MWU P ¼ 0.002) and nonsense substitutions (n ¼ 19, MWU Po0.0001) clearly showing lower expression, while increased expression of TP53 was found in samples with missense substitutions (n ¼ 47, MWU Po0.0001) supportive of a potential tumour promoting role of some of these latter type of TP53 variants 20 . The type of mutation was furthermore associated with copy neutral LOH (Fisher's Exact P ¼ 0.021), with cases with missense mutations showing less than expected numbers with copy neutral LOH (17.8 expected, 9 observed), again supportive for a dominant role of missense mutations in the TP53 gene ( Supplementary Fig. 3). Finally, we validated earlier reported 1 mutual exclusivity of PIK3CA aberrations, a predominant driver in breast cancer, with any event in PTEN, AKT1 or AKT2 in all 560 cases. Of the 167 cases with a PIK3CA aberration, the majority (91%) lacked any event in the three abovementioned genes (P ¼ 3.6e-5 coMEt exact test 21 , Supplementary Fig. 4) suggesting these are core constituents in this driver pathway in breast cancer.
As indicated, many of these observations were reported earlier, confirming the validity of our cohort for subsequent analyses. However, a few additional observations were noted for known driver genes; for CCND3, solely found amplified in basal cases, similar expression levels were observed in all but one sample of the amplified cases compared with diploid cases, implying CCND3 is likely not the sole driver in this amplified region (Fig. 2, top left). TCGA results (n ¼ 960 cases) confirmed our finding but did show a significant increase in expression levels of CCND3 (ANOVA Po0.0001) in amplified cases supportive of it being an oncogene (Fig. 2, top right). However, with the mean log2 expression level in amplified cases being only 1.2-fold higher compared with diploid cases, this is not the effect one would expect from a biological relevant driver in a particular amplified genomic region. Evaluating genes in the vicinity (B700 kb) of CCND3, only USP49 and BYSL could be considered as putative drivers instead, showing a more than 3 fold higher expression in the amplified cases in our cohort. In TCGA data, differences were less pronounced, with BYSL showing the largest fold-change (1.5 Â ) ( Supplementary Fig. 5a).
Next to the above, in our cohort increased GATA3 expression was observed in the 25 cases with GATA3 mutations (MWU Po0.0001, Fig. 2 bottom panel), which may point to a dominant, activating role for mutations in this gene. TCGA results confirmed (ANOVA Po0.0001) this, showing a threefold higher expression found in GATA3 mutated (n ¼ 95) compared with wild-type cases (n ¼ 864). The vast majority of mutations were found in exon 5 and 6 (22/25 in our cohort and 90/95 in TCGA), but no significant difference was seen in expression levels between exon 5 or 6 mutated cases, thus expression changes are independent of the fact that exon 5 mutations predominantly are frameshifts leading to shorter proteins and exon 6 mutations are frameshifts causing proteins with a C-terminal extension ( Supplementary Fig. 5b). Noteworthy, the 4% of TCGA cases with an amplification of GATA3 showed significantly lower expression levels compared with diploid cases (sixfold reduction, ANOVA Po0.0001, data not shown).
Cell cycle and immune pathways and mutational signatures. Switching from single genes to studying the effects of the various substitution signatures 9 on the global transcriptome, we first studied the molecular subtypes in relation to the signatures. Significantly higher numbers of signature 3 substitutions but lower number of signature 2 substitutions were found in the basal subtype (KW both Po0.0001) compared with the other subtypes. Substitutions of signature 5 were more abundant in the Her-2 group (KW Po0.0001) while signatures 8 and 13 again showed increased numbers in the basal subtype (KW both Po0.0001). Additional details can be found in Supplementary Fig. 6.
To further evaluate transcriptomic pathways and substitution signatures, we grouped samples according to their dominant mutation signature and compared the cases with, to those lacking that dominant signature (Methods section). The most prominent pathways emerged in ER-positive cases (Supplementary Table 1/ Supplementary Fig. 7) and related to cell cycle, which was significantly associated with the total number of substitutions, while several immune-response pathways were found associated with cases showing predominantly signatures 2 and/or 13. These latter substitution signatures have been accredited 5 to activity of the APOBEC/AID family of cytidine deaminases, targeting the TCN trinucleotide sequence (the underlined base gets mutated). In ER-negatives, several metabolism related pathways were found related with signature 3 substitutions, but these were predominantly driven by few genes or otherwise were always higher expressed in cases with low numbers of that signature (green bars in the geneplot, Supplementary Fig. 7a), giving less confidence that these pathways were of interest. In contrast, the cell cycle pathways consistently showed multiple genes higher expressed in the ER-positive cases with high numbers of the signature (see Supplementary Fig. 7b,c for examples). To validate the pathway results, we studied cell cycle and immune-related gene expression signatures in further detail.
As an objective readout of the cell cycle phenotype, we used genes annotated to the mitotic cell cycle (MCC, GeneOntology:0000278) and associated global expression of these genes to the number and type of substitutions (signatures 6, 17, 20, 26 and 30 were excluded since very few cases harboured these substitution types) and also to pathological grade (Fig. 3a). Next to confirming that the total number of substitutions were higher in the top quartile of samples with high MCC gene expression (MWU Po0.0001), we found that several signatures did similarly, especially signatures 13, 8, 1 and 3 (MWU Po0.0001, P ¼ 0.0002, P ¼ 0.002 and P ¼ 0.005, respectively). Insofar data were available, pathological grade III was also clearly enriched in the top quartile (w 2 Po0.0001). Thus, irrespective of the type of signatures identified in a patient, a higher number of substitutions was related to increased cell proliferation. Adverse outcome is thus to be expected in patients with a high mutational load, and this was confirmed by evaluating overall and relapse-free survival data (Fig. 3b,c).
The other major theme in the analysis of the transcriptome was the involvement of immune response in relation with particularly the APOBEC-type substitution signatures 2 and 13, also in ER-positive cases. Revisiting the pathway results ( Supplementary  Fig. 7c) showed a clear association of, for example, IFNG, CTLA4 and several chemokines with these 2 APOBEC-signatures, substantiating a relation with T-cell response. To study objectively if indeed activation of the immune system is mutational signature specific, we related in ER-positive cases (n ¼ 192) a previously developed tumour-infiltrating lymphocytes (TIL) RNA expression signature for breast cancer 22 to the diverse mutational signatures, again excluding signatures 6, 17, 20, 26 and 30 due to low numbers ( Fig. 4 and Table 1). Samples with overall high expression of the TIL-signature genes (top quartile of cases) had a significantly higher number of signature 13 (MWU P ¼ 0.0027) and also of signature 3 (MWU P ¼ 0.027). In contrast, the age-related signature 1 (MWU P ¼ 0.0005) displayed a significantly lower number in the high TIL group. The number of mutations of the other signature types was not significantly associated with high TIL expression. Next to analysing absolute numbers, we also analysed the data proportionally, by correcting for the total number of mutations, which removes total mutational load as possible confounding factor but this analysis yielded comparable results (Supplementary Table 2).
Even though the number of patients at risk is low, patients with high TIL and low MCC have a significantly better outcome than those with low TIL and high MCC (Fig. 5a,b). We validated this finding in an independent cohort of 625 lymph-node negative, not (neo)adjuvantly hormonal/chemotherapy treated cases (Fig. 5c). The TIL and MCC groups analysed separately also show significant survival differences in this independent cohort (logrank P ¼ 0.001 and Po0.0001 for TIL and MCC, respectively. See Supplementary Fig. 8) and when TIL and MCC groups were analysed in a multivariate model, both remained significant (TIL Hazard Ratio 0.53 (95% confidence interval (CI) 0.36-0.79), P ¼ 0.002 and MCC HR 1.81 (95% CI 1.33-2.47), Po0.0001).
Next, we investigated the pathological infiltrate status in all ER-positive cases (n ¼ 266, which includes those without RNAseq data). We associated the number of mutations per signature with the pathological infiltrate status. Lymphocytic infiltrate was grouped into three groups, having no, mild and moderate-severe infiltrate, and a test for trend across ordered groups showed significantly higher numbers of signatures 3 and 13 (both Po0.0001, see Table 2) in cases with increasing infiltrate, with signature 2 (the other APOBEC signature) showing a weak association as well (test for trend P ¼ 0.022). Of note, the results for signature 3 may be hampered by the fact that many cases lack signature 3 mutations entirely; of the n ¼ 266 cases in this analysis 230 (86%) have zero signature 3 mutations. An analysis with the proportion of substitutions of a signature yielded largely similar results (Supplementary Table 3), in the sense that signatures 3 and 13 remain strongly associated with the infiltrate status (test for trend Po0.0001), whereas signature 1 and 5 showed higher proportions in cases with decreasing infiltrate (test for trend P ¼ 0.015 and Po0.0001, respectively). To exclude that individual drivers themselves were not responsible for the observed association we investigated if mutation (including substitution, indel, rearrangement, copy-number variation) of the driver genes was directly associated with infiltrate levels. The following genes were analysed: ARID1A, CCND1, CDH1, GATA3, MAP2K4, MAP3K1, MLL3, MYC, PIK3CA, PTEN, TP53 and ZNF217 (n ¼ 10, 32, 17, 22, 17, 20, 11, 21, 64, 13, 33 and 11, respectively), but none of the investigated genes was found to be significantly associated with infiltrate status (Fisher's Exact test, P40.05).  Amino acid properties and immune response. To try to understand why only certain mutation signatures relate to a more effective immune response, we determined in-silico which substitutions yield predicted neo-epitopes 23,24 but found no striking signature specific difference in the fraction of predicted neo-epitope presenting substitutions (Supplementary Fig. 9a). We also studied the altered chemical properties 25,26 of the mutational signature induced amino acid changes. However, changes in hydrophobicity ( Supplementary Fig. 9b) were not strongly differently proportioned across the substitution signatures. In contrast, amino acid substitutions resulting from particularly signature 13 but also from signature 2 displayed a clear increase in charge ( Supplementary Fig. 9c), while substitutions caused by signature 1-which were associated with lower infiltratedisplayed a loss of a charge instead. The number of signature 13 substitutions that lead to an increase in charge was significantly associated with increasing lymphocytic infiltrate (test for trend Po0.0001); a similar test for signature 2 substitutions showed a P value of 0.014. Analysing the total number of substitutions leading to increased electric charge, irrespective of signature type, also showed increasing numbers with increasing lymphocytic infiltrate (test for trend Po0.0001) which remained significant even when signature 13 substitutions were excluded (test for trend Po0.0001).

Discussion
In this paper we investigated the mutation signatures present in primary breast cancer and their relation to the transcriptome. Although somewhat impeded by the fact that most tumours display a multiplicity of signatures, we were able to show that individual signatures have specific effects on the gene expression phenotype of tumour cells. As may be noted from the results, the main observations were obtained in ER-positive cases. The lack of results in ER-negative cases is attributed to the more homogeneous nature of this subtype; for example, virtually all ER-negative patients have high infiltrate levels (in our cohort only four cases display a nill infiltrate phenotype) and a high mutational load (just two ER-negative cases are among the bottom 20% after ranking on mutational load), which impedes identifying expression differences due to the lack of a sizeable contrast group.
Regarding the global transcriptome in relation to the substitution signatures, first, we observed that the total number of substitutions was positively associated with expression of cell cycle genes, independent of which of the individual substitution signatures contributed to the total. Reviewing the driver genes, as expected CCND1 and MYC amplified cases, as well as TP53 mutated cases were enriched in the top quartile MCC expression (Fisher's Exact test, all Po0.0001). The relation between an active cell cycle and various known drivers of cell cycle progression (CCND1, MYC, TP53) and poor outcome in ER-positive patients has been widely investigated 16,[27][28][29][30][31] , thus our observation that the mutational burden (defined here as the total number of substitutions, regardless of type) was also associated with shorter DFS and OS may be not surprising. The association of mutational burden and poor outcome in ER-positive patients was described earlier as well 32 , but using data based on exome sequencing, which we firmly validated here using whole genome sequencing data, with the latter enabling removal of the possible bias which may exist on the total mutational burden derived from the transcribed part (exome) compared with the entire genome (WGS). Noteworthy is that both mutational signatures which are replication dependent (for example, signature 13) and those that are not (for example, signature 8) 9 contribute to the mutational burden. This suggests that mutational load as a result of   insufficient or improper repair during DNA replication, as well as those from other sources, both contribute to more rapid progression of ER-positive disease. Second and interestingly, we observed a significant, positive relation between substitutions of signatures 13 and 3 and a TIL gene expression signature and also to increasing levels of TILs. The association of signature 3 substitutions with immuneresponse pathways is not readily explained by the type of substitutions-which are broadly patterned-or by the investigated amino acid properties; we speculate that the strong association of signature 3 with inactivated BRCA1/2 (ref. 9) could make this signature a proxy for BRCA-ness (with its particular rearrangements) in ER-positive cases. Inactivated BRCA is tightly linked with loss of homologous recombination (HR) repair mechanisms, leading to distinct rearrangement patterns 9 . We speculate that these rearrangements, which can comprise up to hundreds of events in a sample, leading to randomly aberrant proteins, potentially triggering an immune response. And, although the exact modus operandi of how signature 3 or BRCA-ness could increase immune response is as of yet obscure, higher lymphocytic infiltrate have been previously reported in familial BRCA1 (ref. 33) and BRCA2-affected tumours 34 , the former study also showing significantly better relapse-free survival for BRCA-patients with higher infiltrate levels.
The connection between TIL and signature 13 substitutions may be based on the distinguishing feature of signature 13; its clear APOBEC pattern 2,5 , deaminating cytidines in a TCN context, favouring C to G substitutions. We hypothesize that TILs are enriched at sites where the proper peptides are presented; a possible way signature 13 substitutions could activate this immune-response is via its resulting neo-epitopes. One difference we observed for altered peptide sequences resulting from signature 13 substitutions as compared with other equally abundant ones originating from other sources (for example, signature 1, 5, 8 substitutions) was not the number of predicted neo-epitopes, but rather its tendency to being positively charged. The observed increase in charge in signatures 2 and 13 can be readily explained by the fact that these signatures frequently affect the only two negatively charged amino acids Glu and Asp, respectively, which are modified in this mutational context into Lys (signature 2) and His (signature 13), both positively charged. Thus, these latter 2 signatures are most efficient in generating mutated peptides with increased electric charge. Interestingly, patients with an increased lymphocytic infiltrate showed, irrespective of signature type, higher numbers of positively charged amino acid substitutions, indicating that the association of signature 13 with TIL might be because positively charged amino acids are more commonly generated in this signature type.
Concluding, we show that ER-positive breast cancer responds to DNA mutations with two specific changes in the transcriptome. On the one hand, the cell cycle with its associated adverse clinical outcome appears to be more active with an increasing number of mutations, regardless of their underlying signatures. On the other hand, we observe a signature specific association with immune response, possibly arising from an increased number of neo-antigens that consequently may stimulate the immune response more effectively. However, independent confirmation of our finding is needed as well as mechanistic studies explaining the observed association. Still, our results augment earlier reports [35][36][37] , which postulate that a high mutational load is sufficient for the immune system to sense one or more neo-epitopes as non-self. However, by untangling the mutational load into specific signatures, our results suggest that the breast tumour cell, while gaining advantage by those mutations stimulating proliferation, may inadvertently provoke  NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12910 ARTICLE the immune system more effectively if the mutational load is moulded through specific mutational processes. Exploiting this knowledge by purposefully augmenting T-cell reactivity against amino acid substitutions resulting from the proper DNA substitution type could improve cancer immunotherapies.

Methods
Cohorts. The sample cohort used in the study has been described in detail recently 9 . To recapitulate, DNA and RNA of breast cancer tumours were analysed using WGS (WG-DNA data), copy-number analysis, methylation profiling, RNA sequencing and miRNA analysis. Here, we studied all patients with available RNAseq and WG-DNA data (n ¼ 266, no replicates). Sequencing protocols, QC and post-processing of data as well as the DNA substitution and rearrangement signatures and data availability are described elsewhere 9 . For the RNA-sequence cohort, IHC-scored Her2-positivity (as ascertained by a panel of pathologists according to the current standard practice) was an exclusion criterion, though at the time of analysis, some cases showed DNA amplification of the Her2 locus. Gene expression values were available as log2-FPKM values for 49,738 transcripts. For the correlation matrix ( Fig. 1) transcripts with 420% missing data were excluded, and the top 5,000 variable transcripts-which included besides 3,484 proteincoding genes, 1,516 non-coding RNA species such as lincRNAs, snRNAs and snoRNAs and so on-were median centred and used to correlate (Pearson) all samples with each other. The resulting correlation matrix was hierarchically clustered 38 (Cluster 3.0) using uncentered correlation as distance metric. The molecular subtypes were established using the AIMS method 6 .
Pathway analysis. Since the transcriptome is heavily driven by ER-status, pathway analyses were performed within ER-positive and -negative breast cancer cases separately. In the pathway analysis each signature was analysed separately: the proportion of substitutions of a particular signature was used for grouping samples.
Samples which had at least 50% of a particular signature were compared with samples having o20% of such substitutions, where both groups should have at least 15 cases. In ER-positive cases, signature 5 (n ¼ 54 450% proportion versus n ¼ 32 o20%) and the APOBEC-signatures (Sig 2 þ 13 combined) with n ¼ 18 450% proportion versus n ¼ 45 o20%) were available for analysis and in ER-negative cases, signature 3 (n ¼ 24 450% versus n ¼ 22 o20%) could be evaluated. In addition, disregarding the type of signature, the total number of substitutions was used to group samples with the most extreme number of substitutions (top and bottom 20%). Initially samples were grouped in a low and high mutational burden group, separately for ER-positive and -negative cases. However, in the ER-negatives, samples in the bottom 20% still had a high absolute number of substitutions. Since our aim was to identify differential pathways associated with the absolute tumour burden, we wanted to have a bottom 20% group with absolute low numbers of substitutions. Samples were thus ranked according to mutational burden independent of ER-status to obtain the bottom and top 20% groups. In ER-positive cases, this gave a group of n ¼ 15 in the top 20% versus n ¼ 52 in the bottom 20%. In ER-negative cases there were only two samples in the bottom 20%, precluding a meaningful analysis. Pathway analysis were performed using the R-package 'global test' 39 using Biocarta (www.biocarta.com) and KEGG (ref. 40) as annotation databases. All P values were corrected for multiple testing (Bonferroni-Holm) and checked by re-sampling 1,000 times to ascertain the number of times an equally sized, randomly chosen group of genes is at least as significant as the true set of genes belonging to a pathway. Pathways were considered of interest if the P value of the global test after correcting for multiple testing and the re-sampling P value were both below 0.05. Second, pathways were considered relevant pathways if these consistently showed multiple genes higher expressed in the cases with high numbers of a signature.
Neo-antigen prediction and hydrophobicity of amino acids. To ascertain which of the amino acid changing substitutions potentially contain a neo-epitope, all unique, non-silent, non-terminating amino acid substitutions were selected (n ¼ 26,475) and we focused our analyses on CD8 T-cell epitopes. FASTA files of the wild-type sequences were obtained via the consensus coding sequence database (http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and custom Perl scripts were used to mutate the wild-type sequence according to the identified substitution in our cohort. Since the software to evaluate neo-epitopes uses all possible 9-mer peptides in a sequence, we selected a 17-mer sequence with the mutated amino acid in the middle -thus making sure the mutated amino acid was always present in all possible 9mers of that sequence. Care was taken if the substitution was too near to the start or end of a gene to ensure a 9mer was always present.  23 , a predicted EC50o50 nM of a peptide towards HLA-A2 alleles indicates a 'strong binder' and if any of the possible predictions was found o50 nM the gene was considered harbouring a potential neo-antigen.
Changes in hydrophobicity of mutated amino acids was ascertained using the scale of Kyte & Doolittle 41 where the level of the mutated amino acid was subtracted from the wild-type amino acid. A positive difference (40) was labelled as 'increase' and o0 as 'decrease' in hydrophobicity. Changes in electric charge were evaluated by assigning Aspartic acid and Glutamic acid as negatively charged and Lysine, Arginine and Histidine as positive. Next, base substitutions were evaluated if a resulting amino acid change led to a change in electrical charge. Since all substitutions are assigned to a mutational signature, the total number of charge changing (and hydrophobicity changing) substitutions per signature was established.
Independent validation set. In-house and publicly available gene expression data of lymph-node negative primary breast cancer patients, who were not treated (neo)adjuvantly with hormonal/chemotherapy and available metastasis-free survival data were used, leading to a cohort of 867 patients (n ¼ 625 ER-positive). This selection enables a pure study on prognosis. Data were gathered from Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) entries GSE2034, GSE5327, GSE2990, GSE7390 and GSE11121, with all data available on Affymetrix U133A chip. Raw.cel files were processed using fRMA (ref. 42) parameters (median polish), after which batch effects were corrected using ComBat (ref. 43). The average expression of TIL genes 22 and MCC genes (Gene Ontology:0000278) were calculated of each sample, and the samples in the top quartile for TIL and the top quartile for MCC were used to group the patients in three groups; those with high TIL/low MCC, those with low TIL/high MCC and the remaining patients.
Statistics. Associations between gene expression levels and gene-status (mutant, amplification and so on) were analysed using MWU, or KW when more than two categories were evaluated. In the analyses where the expression level of a gene was associated with its own status and not with other genes, no multiple testing correction was necessary. To test for a trend across ordered groups (for example, number of substitutions in increasing grade), Cuzick's nonparametric test for trend was used, an adjunct to the KW test. Pearson's w 2 test was used to evaluate categories (for example, mutation status versus molecular subtype) or by Fisher's Exact test when number of events were low. Stata v13 (StataCorp, Texas, USA) was used for the statistical tests and Kaplan-Meier survival curves. A two-sided P value of o0.05 was considered statistically significant.
Data availability. WGS, RNA-seq and methylation data that support the findings in this study are available at the European-Genome Phenome Archive (https:// www.ebi.ac.uk/ega) under the accession code EGAS00001001178. For the validation data, in-house GSE2034 and GSE5327, and otherwise public data GSE2990, GSE7390 and GSE11121 are available at the Gene Expression Omnibus website (http://www.ncbi.nlm.nih.gov/geo/). All other data are contained within the Article and its Supplementary files or available from the authors on request.