DNA methylation profiling reveals a predominant immune component in breast cancers

Breast cancer is a molecularly, biologically and clinically heterogeneous group of disorders. Understanding this diversity is essential to improving diagnosis and optimizing treatment. Both genetic and acquired epigenetic abnormalities participate in cancer, but the involvement of the epigenome in breast cancer and its contribution to the complexity of the disease are still poorly understood. By means of DNA methylation profiling of 248 breast tissues, we have highlighted the existence of previously unrecognized breast cancer groups that go beyond the currently known ‘expression subtypes’. Interestingly, we showed that DNA methylation profiling can reflect the cell type composition of the tumour microenvironment, and in particular a T lymphocyte infiltration of the tumours. Further, we highlighted a set of immune genes having high prognostic value in specific tumour categories. The immune component uncovered here by DNA methylation profiles provides a new perspective for the importance of the microenvironment in breast cancer, holding implications for better management of breast cancer patients.


Breast cancer "expression subtype" determination
Two approaches were used to determine "breast cancer expression subtypes". First, on the basis of IHC determination, basal-like tumours were defined as negative for the ER and HER2 receptors and as histological grade 3, HER2 tumours as overexpressing the HER2 receptor, and luminal tumours as ER positive and HER2 negative. This last group was divided into luminal A and luminal B tumours, corresponding respectively to histological grade 1 and grade 3 tumours. Second, the subtypes were identified on the basis of gene expression by applying the Subtype Classification Model as described previously. The only difference was the use of the single probes "205225_at", "216836_s_at" and "208079_s_at" instead of the full ESR1, ERBB2 and AURKA modules, respectively. We chose this simplified version of the Subtype Classification Model because it showed excellent performance when applied to the Affymetrix dataset while reducing the number of genes in the clustering model (data not shown). We used the 'genefu' R package, available on CRAN (http://cran.r-project.org/web/packages/genefu/).

Isolation of ex vivo lymphocytes
Blood mononuclear cells from a hemochromatosis patient were isolated by density gradient centrifugation using Lymphoprep (Axis-Shield PoCAS, Oslo, Norway), and extensively washed in cold phosphate-buffered saline containing 2 mM EDTA to eliminate platelets. CD3+ and CD20+ cells were purified with magnetic microbeads using the CD3 Isolation Kit or CD20 Isolation Kit (Miltenyi Biotec, Bergisch Gladbach, Germany) in an AUTOMACS magnetic sorter (Miltenyi), following the manufacturer's instructions. Cell purities were higher than 99% and 92% for the CD3+ and CD20+ cells, respectively, as determined by standard flow cytometry.

Bisulphite genomic sequencing
The methylation status of four CpG sites (cg07471052, cg11566244, cg22498251 and cg09847584), located respectively near the transcription start sites of the CDK3, GSTP1, TWIST1 and RIMBP2 genes, was examined by bisulphite genomic sequencing applied to 1 normal (N1) and 3 breast cancer (BC10, BC32 and BC109) samples. Primers were designed manually and their sequences are provided in Supplementary Table SV. The PCR-amplified fragments were purified with the QIAquick® Gel Extraction kit (Qiagen), cloned into the pCR®II-TOPO® vector (Invitrogen, Carlsbad, CA, USA), and used to transform competent Escherichia coli TOP10 cells. Clones were selected by blue/white colony screening and amplified. Plasmids were purified with the Qiagen-MiniPrep kit (Qiagen).
The PCR products were sequenced by Genoscreen (Lille, France) and CpG methylation status was analysed with the BiQ Analyzer software as described in (Bock et al., 2005).

Bisulphite pyrosequencing
750 ng of genomic DNA were bisulphite-converted using the EZ DNA Methylation™ kit (Zymo Research) as for DNA methylation profiling. One third of the converted DNA was used as template for each subsequent PCR. To ensure a sufficient amount of PCR product for sequencing, we performed nested PCRs. PCR primers for pre-amplification (EF, ER primers) were designed manually or with the help of the "BiSearch Primer Design and Search Tool" (http://bisearch.enzim.hu) and checked for their tendency to form oligomers, hairpin loops etc. using the Generunner software (version 3.05, Hastings Software Inc.). Primers for nested amplification and sequencing were designed manually or using the PyroMark® Assay Design 2.0 software (Qiagen).
Pre-amplification PCRs were conducted with 3 mM MgCl2, 1 mM of each dNTP, 12% (v/v) DMSO, 500 nM of each primer (EF+ER primers, see Supplementary Table SXXX) and optionally 500 mM betaine in heated-lid thermocyclers under the following conditions: [...]. Amplification success was assessed by agarose gel electrophoresis, and pyrosequencing of the PCR products (S primers) was performed with the Pyromark™ Q24 system (Qiagen).

Histopathologic analysis of the lymphocyte infiltration
Histopathologic analysis of tumours in order to evaluate both stromal and intratumoral lymphocyte infiltration was performed on hematoxylin and eosin-stained sections, as previously described (Denkert et al., 2010).

Unsupervised clustering
In a first step, as a completely unsupervised approach, hierarchical clustering was performed on all 123 breast tissues of the main set (119 IDCs and 4 normal breast tissues) on the basis of the 10% most variant CpGs across all samples (see Fig S2). The same was done for all samples of the validation set (see Fig S15). In a second step, hierarchical clustering was performed only on the 119 IDCs of the main set on the basis of a reduced list of CpGs differentially methylated between IDC and normal tissues, identified in Table SIII. Among the 6,309 CpGs identified as being differentially methylated between IDC and normal samples, we chose to work with those showing a 20% methylation difference in at least 30% of the IDCs as compared to the normal breast samples (see Table SVII). This ensured selection of a reasonable number of CpGs (2,985) having potentially informative variance in our dataset and yielded clusters showing good stability. Complete linkage and a correlation-based distance were used for clustering arrays and CpGs. The stability of the clustering was estimated with the 'pvclust' R package (Suzuki and Shimodaira, 2006), available on CRAN (http://cran.r-project.org/web/packages/pvclust/). We measured the uncertainty in hierarchical clustering by bootstrap stability probabilities ranging from 0 to 1, with 0 indicating poor stability and 1 indicating very high stability. The bootstrap probability value of a cluster is the frequency with which it appears in the bootstrap replicates. These stability values quantify how strongly a cluster is supported by the data. The criteria used to select the 6 methylation clusters reported in this paper were: (i) a stability probability of at least 0.75, and (ii) a minimum of 8 samples (see Fig S5).
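The first filtering step (keeping the 10% most variant CpGs) can be illustrated with a minimal Python sketch. It assumes that "most variant" means highest variance across samples, which the text does not state explicitly; the function name and the toy data are hypothetical, and the clustering itself (done in R with 'pvclust') is not reproduced here.

```python
from statistics import pvariance

def most_variant_cpgs(beta_by_cpg, fraction=0.10):
    """Return the IDs of the most variable CpGs across samples.

    beta_by_cpg maps a CpG ID to its list of methylation values (one per sample).
    Assumption: variability is measured by the (population) variance.
    """
    n_keep = max(1, round(len(beta_by_cpg) * fraction))
    ranked = sorted(beta_by_cpg,
                    key=lambda cpg: pvariance(beta_by_cpg[cpg]),
                    reverse=True)
    return ranked[:n_keep]

# Toy data: nine flat CpGs and one clearly variable CpG.
toy = {f"cg{i:02d}": [0.5, 0.5, 0.5, 0.5] for i in range(9)}
toy["cg_var"] = [0.1, 0.9, 0.2, 0.8]
selected = most_variant_cpgs(toy)
```

With ten CpGs and a 10% fraction, only the single most variable probe is retained.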

Module/signature scores
The calculation of module/signature scores has been described previously. Briefly, a signature score, denoted by $R_s$, was defined as the weighted combination of the expression of all genes in the corresponding signature:

$$R_s = \frac{1}{n_Q} \sum_{i \in Q} w_i x_i$$

where $Q$ is the set of genes in the signature, $n_Q$ is the number of genes in $Q$, $x_i$ is the expression of gene $i$, and $w_i$ is either -1 or +1 depending on the sign of the statistic/coefficient published in the original study. For the particular cases of the two divided "ESR1 positive" and "ESR1 negative" modules, $w_i$ is always equal to +1. For DNA methylation data, signature scores were calculated in a manner similar to that of gene expression data with an additional mapping procedure: each CpG probe was mapped to the corresponding gene through Entrez Gene ID. Each signature score was scaled so that quantiles 2.5% and 97.5% equaled -1 and +1, respectively. This scaling was robust to outliers and ensured that the signature score lay approximately within the [-1,+1] interval, allowing comparison of datasets based on different microarray technologies and normalizations.
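As an illustration, the score and its rescaling could be sketched in Python as follows. This is not the 'genefu' implementation: the normalisation by the number of genes follows the definitions given above, the quantiles use simple linear interpolation (an assumption), and all function names are hypothetical.

```python
def signature_score(expression, weights):
    """Weighted mean of gene values; each weight is -1 or +1.

    Genes of the signature that are absent from `expression` are skipped.
    """
    genes = [g for g in weights if g in expression]
    return sum(weights[g] * expression[g] for g in genes) / len(genes)

def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (assumed method)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def rescale(scores, low_q=0.025, high_q=0.975):
    """Map the 2.5% and 97.5% quantiles of the scores to -1 and +1."""
    ordered = sorted(scores)
    q_lo, q_hi = quantile(ordered, low_q), quantile(ordered, high_q)
    return [2 * (s - q_lo) / (q_hi - q_lo) - 1 for s in scores]

# One sample, two-gene signature: gene A counts positively, gene B negatively.
score = signature_score({"A": 2.0, "B": 1.0, "C": 4.0}, {"A": 1, "B": -1})
scaled = rescale([float(v) for v in range(41)])
```

Values beyond the two quantiles fall slightly outside [-1, +1], which is why the text says the score lies only approximately within that interval.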

Annotation of Infinium array in terms of CpG location
Annotations were added to those provided by Illumina for the Infinium array, describing the location of each CpG (i) relative to CGIs (CpG inside a CGI, CpG island shore, other CpG) and (ii) relative to promoter classes (High-, Intermediate- or Low-CpG-density promoter). They are provided in Table SVI.

CpG location versus CGI
CpGs were classified according to their position relative to CpG islands (i.e. CpG inside a CGI, CpG island shore or other CpG). Two classifications were established, depending on the CGI definition used: the UCSC definition (CpG_Island_UCSC classification) or the improved and revisited definition described in (Bock et al., 2007) (CpG_Island_Revisited classification). A CpG was considered to be in a CpG island shore if it was located inside a 2 kb region around a CGI (as defined in (Irizarry et al., 2009)). A CpG located neither in a CGI nor in a 2 kb region around a CGI was considered as other CpG. Both classifications are provided in Table SVI; we used only the revisited classification described in (Bock et al., 2007) for all analyses.

CpG location versus promoter classes
Promoters represented on the Infinium array were categorized using their CpG content as defined in (Weber et al., 2007). First, regions from -700 to +500 bp surrounding the transcription start site (TSS) were extracted using the UCSC genome browser data (Rhead et al., 2010). Then, using the DNA sequences corresponding to those promoter fragments, the CpG ratio and the GC content were calculated in sliding windows of 500 bp with 5 bp offsets. Finally, according to the definition provided in (Weber et al., 2007), the promoters were classified as HCPs (High-CpG-density promoters) if at least one 500 bp window had a CpG ratio > 0.75 and a GC content > 0.55; as LCPs (Low-CpG-density promoters) if no 500 bp window reached a CpG ratio of 0.48; or as ICPs (Intermediate-CpG-density promoters) otherwise.
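The sliding-window classification can be sketched in Python as below. The thresholds are those quoted above; the CpG ratio is assumed here to be the observed/expected ratio (number of CpGs × window length) / (number of Cs × number of Gs), which the text does not spell out, and the function names are hypothetical.

```python
def window_stats(seq):
    """Return (CpG observed/expected ratio, GC content) for one window."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    n_cpg = seq.count("CG")
    # Assumed definition of the CpG ratio: observed CpGs over expected CpGs.
    ratio = n_cpg * len(seq) / (n_c * n_g) if n_c and n_g else 0.0
    return ratio, (n_c + n_g) / len(seq)

def classify_promoter(seq, window=500, step=5):
    """Classify a promoter sequence as HCP, ICP or LCP."""
    starts = range(0, max(len(seq) - window, 0) + 1, step)
    stats = [window_stats(seq[s:s + window]) for s in starts]
    if any(ratio > 0.75 and gc > 0.55 for ratio, gc in stats):
        return "HCP"
    if all(ratio < 0.48 for ratio, _ in stats):
        return "LCP"
    return "ICP"

# Synthetic 1,200 bp sequences (the -700 to +500 bp region is 1,200 bp long).
cpg_rich = classify_promoter("CG" * 600)
cpg_poor = classify_promoter("AT" * 600)
mixed = classify_promoter("CGAAAAAAAA" * 120)  # high CpG ratio, low GC content
```

The third toy sequence has a high CpG ratio but a GC content of only 0.2, so it fails the HCP test without satisfying the LCP test.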

Methylation difference criterion
Several indications led us to choose 20% as the methylation difference criterion. First, it seemed that the Infinium assay gave values ranging from 0 to 0.2 for unmethylated CpGs.
Second, a recent study has shown that for more than 90% of the loci, the sensitivity of methylation difference detection is 0.2 (Bibikova et al., 2009).

Class comparison analyses in the main set of patients
A two-sided Mann-Whitney test (also called the Wilcoxon-Mann-Whitney test) was employed to test the null hypothesis (H0) of equality of the methylation values in two defined groups of data. P-values were corrected for multiple testing by the false discovery rate (FDR) approach (Benjamini and Hochberg, 1995).
For normal samples we considered the mean of methylation values, because of the small sample size and the low variance. For tumour samples, because of their higher heterogeneity, we considered the median value, less sensitive to extreme values.
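For readers unfamiliar with the FDR correction, the Benjamini-Hochberg adjustment of a list of p-values can be sketched in a few lines of Python. This is a generic illustration of the published procedure, not the code used in the study; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Walk from the largest to the smallest p-value.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each raw p-value is multiplied by the number of tests and divided by its rank, and the running minimum keeps the adjusted values monotonic.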

Between IDCs and normal breast tissue samples
A particular CpG was considered hyper- or hypomethylated in IDCs as compared to normal breast tissue samples according to the following two criteria: 1/ the CpG had to show at least a 20% methylation difference in IDCs as compared to normal breast tissue samples in at least 10% of the IDCs; 2/ to be considered hypermethylated, the CpG had to show at least ten times more hypermethylation events than hypomethylation events in breast cancer. Conversely, to be considered hypomethylated, it had to show at least ten times more hypomethylation events than hypermethylation events in breast cancer.
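One possible reading of these two criteria is sketched below in Python. It assumes that an "event" is a tumour whose methylation value differs from the mean of the normal samples by at least 0.20, and that the 10% threshold applies to hyper- and hypomethylation events combined; both points are interpretations, and the function name is hypothetical.

```python
from statistics import mean

def classify_cpg(tumour_betas, normal_betas, delta=0.20, min_fraction=0.10, fold=10):
    """Label one CpG as 'hyper', 'hypo' or 'unchanged' in tumours versus normals."""
    reference = mean(normal_betas)  # normals are summarised by their mean
    n_hyper = sum(b - reference >= delta for b in tumour_betas)
    n_hypo = sum(reference - b >= delta for b in tumour_betas)
    # Criterion 1: enough tumours show a difference of at least `delta`.
    if n_hyper + n_hypo < min_fraction * len(tumour_betas):
        return "unchanged"
    # Criterion 2: one direction dominates the other at least `fold` times.
    if n_hyper >= fold * n_hypo:
        return "hyper"
    if n_hypo >= fold * n_hyper:
        return "hypo"
    return "unchanged"

hyper = classify_cpg([0.6] * 3 + [0.1] * 7, [0.1] * 4)
hypo = classify_cpg([0.2] * 5 + [0.8] * 5, [0.8] * 4)
flat = classify_cpg([0.12] * 10, [0.1] * 4)
mixed = classify_cpg([0.9] * 3 + [0.1] * 3 + [0.5] * 4, [0.5] * 4)
```

A CpG with as many gains as losses passes the first criterion but fails the second, so it is left unclassified.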

Between the two main clusters, I and II
CpGs differentially methylated between clusters I and II were determined according to these two criteria: 1/ they had to show a methylation difference of at least 20% between the two groups; 2/ the FDR-corrected Wilcoxon p-value for the concerned CpGs had to be lower than 0.1.

Between each methylation subcluster and normal breast tissue samples
The criteria for determining that a given methylation subcluster showed differential methylation with respect to normal breast tissue samples were: 1/ The CpGs concerned had to show a difference in methylation of at least 20% between the two groups; 2/ the Wilcoxon p-value for the CpGs concerned had to be lower than 0.01. Here, we did not use the FDR criterion as described above, because of the small number of samples composing each group.

Gene Set Enrichment Analysis (GSEA)
GSEA is a powerful analytical method first developed to determine if the members of a given gene set are significantly enriched among the genes most differentially expressed between two sample groups (Mootha et al., 2003). Here we applied this method to both our methylation data and our expression data to assess the possibility that ER biology might be regulated by DNA methylation. For this, we hypothesized that the ESR1 module genes were more highly methylated in cluster I ("ER-negative tumours") than in cluster II ("ER-positive tumours").
For this analysis, the ESR1 module (Desmedt et al., 2008) had to be divided into two sub-modules: an ESR1-positive module, containing all ESR1 module genes whose expression correlates positively with ESR1 expression, and an ESR1-negative module, containing those whose expression correlates negatively with ESR1 expression.
All 14,475 genes represented on the bead array were ranked from the most hypermethylated to the most hypomethylated in cluster I with respect to cluster II. The signal-to-noise ratio (the difference in means of the two classes divided by the sum of the standard deviations of the two classes) was used to perform the ranking. When a gene was represented by several probes on the bead array, the most variant one was selected for this analysis. The 20,606 genes represented on the Affymetrix array were ranked according to the same method.
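The ranking metric can be written out as a short Python sketch. It follows the definition given above (difference in means divided by the sum of the standard deviations) and assumes the sample standard deviation, without any of the adjustments that the GSEA-P software may apply; the function names and gene labels are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(group_a, group_b):
    """Difference in class means divided by the sum of class standard deviations."""
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))

def rank_genes(values_a, values_b):
    """Order genes from the highest to the lowest signal-to-noise ratio."""
    scores = {g: signal_to_noise(values_a[g], values_b[g]) for g in values_a}
    return sorted(scores, key=scores.get, reverse=True)

# Toy methylation values for two genes in cluster I and cluster II.
cluster_1 = {"GENE_X": [0.8, 0.9, 1.0], "GENE_Y": [0.1, 0.2, 0.3]}
cluster_2 = {"GENE_X": [0.1, 0.2, 0.3], "GENE_Y": [0.8, 0.9, 1.0]}
s2n = signal_to_noise(cluster_1["GENE_X"], cluster_2["GENE_X"])
ranking = rank_genes(cluster_1, cluster_2)
```

Genes more methylated in the first group get positive scores and appear at the top of the list.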
The goal of this GSEA analysis was to determine whether the ESR1 module genes are randomly distributed throughout the ranked lists (suggesting no enrichment of these gene sets in one of the two clusters) or primarily found at the top or bottom (suggesting an enrichment of these gene sets in one of the two clusters). A running sum statistic, corresponding to the enrichment score, was calculated for each gene set on the basis of the ranks of the investigated gene set members, relative to those of the non-members. The significance of such enrichments was estimated by calculating a permutation-based p-value corrected for multiple tests by the false discovery rate (FDR) approach.
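To make the running sum concrete, here is a simplified Python sketch of an unweighted enrichment score. It is an assumption-laden illustration rather than the GSEA-P algorithm: every set member contributes an equal step up and every non-member an equal step down, and the permutation test is omitted. The function name is hypothetical, and the gene set must contain at least one, but not all, of the ranked genes.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of an unweighted running sum."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up for a gene-set member, step down otherwise.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = enrichment_score(ranked, {"g1", "g2"})
bottom_score = enrichment_score(ranked, {"g5", "g6"})
```

A gene set concentrated at the top of the ranked list gives a positive score, and one concentrated at the bottom gives a negative score.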
This analysis was performed with the freely accessible software GSEA-P, provided by the Broad Institute (http://www.broadinstitute.org/gsea/). This GSEA technique has been described in detail in (Subramanian et al., 2005).

Correlation between methylation and expression data
The correlation between methylation and expression data in the main set of patients was evaluated by Pearson's correlation test between each Infinium methylation probe and the most variant Affymetrix expression probe for the gene concerned. Infinium methylation probes presenting values with a range lower than 20% were excluded from this analysis.
The range was calculated by subtracting the smallest methylation value from the greatest one for each probe.
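A minimal Python sketch of this step is given below: probes with a methylation range under 20% are skipped and the Pearson coefficient is computed for the others. The function names are hypothetical, and methylation values are assumed to be expressed as fractions between 0 and 1.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def methylation_expression_correlation(beta, expression, min_range=0.20):
    """Return Pearson's r, or None when the probe's range is below min_range."""
    if max(beta) - min(beta) < min_range:
        return None
    return pearson(beta, expression)

anti = methylation_expression_correlation([0.1, 0.5, 0.9], [9.0, 5.0, 1.0])
skipped = methylation_expression_correlation([0.40, 0.45, 0.50], [9.0, 5.0, 1.0])
```

Excluding near-constant probes avoids correlations driven by measurement noise alone.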

Establishment of the 86 CpG-classifier
To transfer class discovery results from one data set to another in order to independently confirm the results, we used the nearest centroid classification method (Lusa et al., 2007; Sorlie et al., 2003) for assigning new samples of the validation set to one of our 6 clusters. This method is based on the similarity of the DNA methylation profile of a new sample to the DNA methylation profile of the previously identified clusters. A centroid was defined as the vector containing the median methylation values of all the samples assigned to that cluster in the original hierarchical clustering in the main set. For each new sample, a Spearman rank correlation was calculated between its methylation data and the six centroids; the predicted cluster was defined as the category having the highest correlation value. For training the classifier, we excluded those patients in the main set not belonging to any of the 6 most robust clusters. We used the Kruskal-Wallis non-parametric test to find the CpGs differentially methylated between the six clusters. A ranked CpG list was constructed according to the Kruskal-Wallis test statistic values (see Table SXI). In order to find the minimal number of CpGs to be used for the nearest centroid classifier, we created different classifiers from this list and calculated the proportion of correctly classified samples from the main set as compared to the original clustering. We started with a classifier using the 5 CpGs most differentially methylated between the 6 clusters and added CpGs from this list one by one, up to a total of 1519 (the number of CpGs for which the FDR-adjusted p-value was 0). In the end, the minimal number of CpGs that yielded the maximum percentage of correct classification (96.38%) was 86 (see Figs S7 and S8, and Tables SXII, SXIII and SXIV). Finally, the resulting 86-CpG classifier was applied to the validation dataset to classify the new patients into one of the 6 clusters.
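The nearest centroid rule described above (median centroids, Spearman correlation, highest correlation wins) can be sketched in Python as follows. The function names and the three-CpG toy profiles are hypothetical; the real classifier uses 86 CpGs and six clusters, and the CpG selection step is not shown.

```python
from math import sqrt
from statistics import median

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as Pearson's r on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))

def build_centroids(samples_by_cluster):
    """Median methylation value per CpG over the samples of each cluster."""
    return {c: [median(col) for col in zip(*samples)]
            for c, samples in samples_by_cluster.items()}

def predict(sample, centroids):
    """Assign the sample to the cluster whose centroid correlates best with it."""
    return max(centroids, key=lambda c: spearman(sample, centroids[c]))

training = {"1": [[0.1, 0.5, 0.9], [0.2, 0.6, 0.8]],
            "2": [[0.9, 0.5, 0.1], [0.8, 0.4, 0.2]]}
centroids = build_centroids(training)
label_a = predict([0.05, 0.40, 0.70], centroids)
label_b = predict([0.70, 0.30, 0.20], centroids)
```

Because the rule relies on ranks, a new sample is matched on the ordering of its CpG values rather than on their absolute levels.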

Gene ontology analysis
Gene ontology analysis was done with DAVID (http://david.abcc.ncifcrf.gov/), a web-accessible program providing a comprehensive set of functional annotation tools for understanding the biological meaning of large lists of genes (Huang et al., 2009a). Only genes differentially methylated between each subcluster and normal breast samples and displaying an acceptable anti-correlation between their methylation and expression status (Pearson's coefficient below -0.4) were selected for this analysis (see also Tables SXX and SXXI). This ensured the selection of genes whose expression is likely to be affected by methylation changes, facilitating the biological interpretation of results.

Collection of publicly available gene expression datasets
Gene expression datasets were retrieved from public databases or authors' websites. We used normalized data (log2 intensity in single-channel platforms or log2 ratio in dual-channel platforms). Hybridization probes were mapped to Entrez GeneID as described in (Shi et al., 2006) using RefSeq and Entrez database version 2007.01.21. When multiple probes were mapped to the same GeneID, the one with the highest variance in a particular dataset was selected. Ten breast cancer microarray datasets were used (Table SXIV).
Distant metastasis-free survival (DMFS) was used as the survival endpoint. We censored the survival data at 10 years in order to have comparable follow-up across the different studies, as described in (Haibe-Kains et al., 2008).
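Censoring at a fixed horizon is a simple transformation, sketched here in Python. The function name is hypothetical, and the handling of an event occurring exactly at 10 years (kept as an event) is an assumption.

```python
def censor(time_years, event, horizon=10.0):
    """Truncate follow-up at `horizon` years.

    Events observed after the horizon are treated as not yet observed (0).
    """
    if time_years > horizon:
        return horizon, 0
    return time_years, event

late_event = censor(12.3, 1)
early_event = censor(4.0, 1)
no_event = censor(15.0, 0)
```

After this step, every patient has a follow-up time of at most 10 years, whatever the length of follow-up in the original study.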

Relapse-free survival analysis
For the meta-analysis performed on publicly available gene expression data, we selected only the genes displaying a high anti-correlation between their methylation and expression status (Pearson's coefficient below -0.7) in our main set of patients.
Among the 85 genes meeting this criterion, several were eliminated because they were not represented on the microarray platforms (9 genes) or because information for these genes was available for fewer than 700 patients (15 genes). Six other genes were excluded from this meta-analysis because they did not display differential methylation between normal breast samples and IDCs in our population.
The prognostic value of individual CpGs or genes was estimated by univariate Cox regression. Multivariate Cox regression was used to test the independent prognostic values of CpGs or genes of interest in the presence of traditional clinical variables. Cox models were stratified by dataset to account for possible heterogeneity in patient selection or other potential confounders, as implemented in the 'survival' R package available on CRAN (http://cran.r-project.org/web/packages/survival). The significance of individual hazard ratios was estimated by Wald's test. For univariate analysis, the p-values were corrected for multiple testing by means of the false discovery rate (FDR), and variables with an FDR below 0.1 were considered prognostic. For multivariate analysis, variables with a p-value below 0.05 were considered prognostic.

Additional statistical analyses
Spearman's correlation was used to compare Infinium data with bisulphite genomic sequencing or pyrosequencing data. The Mann-Whitney U test and the Kruskal-Wallis test were used to test for differences of a continuous variable between two or multiple subgroups, respectively. Chi-square tests were used to compare discrete variables and the p-values were estimated by the likelihood ratio or Fisher's Exact test (for comparison of binary variables).
We used the Phi coefficient to determine the strength of associations between the "known expression subtypes" of breast cancer and our DNA methylation-based clusters. The values range from 0 to 1, and can be interpreted in a similar way to Spearman's rank correlation coefficient. The significance of such associations was computed by means of a chi-square test.

Cluster 3 tumours showed an expression profile very close to that of luminal progenitor cells, whereas clusters 1, 4, 5, and 6 tumours appeared to be closer to mature luminal cells. These observations suggest that methylation patterns distinguished here might reflect the cell type of origin of the studied tumours.

Column description:
-GE_QC: 1 and 0 indicate respectively that the sample passed or not the quality control for gene expression profiling. NA indicates that gene expression analysis was not performed on this sample.
-Methyl_QC: 1 indicates that the sample passed the quality control for DNA methylation profiling.
-Subtype_IHC: "Breast cancer expression subtype" determined by IHC as described in the Supplemental Materials and Methods section.
-GRADE: Histological grade of the tumour.
-Size_Bin: 1 and 0 indicate, respectively, that the size of the tumour was above or below 2 cm.
-Size_cm: Size of the tumour in cm.
-Nodal_Status: 1 and 0 indicate, respectively, the presence or absence of cancer cells in lymph nodes.
-Subtype_GE: "Breast cancer expression subtype" determined by gene expression as described in the Supplemental Materials and Methods section.
-Age_diagnosis: Patient's age (in years) at the time of diagnosis.
-Age_bin: 1 and 0 indicate, respectively, that the patient was above or below 50 years old at the time of diagnosis.
-RFS_time: Relapse-free survival time in years.
-RFS_time_censored: Relapse-free survival time in years, censored at 10 years.
-Relapse_5years: 1 and 0 indicate, respectively, the presence or not of a relapse event within the first 5 years of follow up.
-OS_event: 1 and 0 indicate, respectively, the occurrence or not of an overall survival event.
-OS_time: Overall survival time in years.
-Subcluster: Methylation subcluster membership.

This Table is provided in the additional file Sup_3.xls. The "All data" tab contains data for all 27,578 CpGs investigated by means of the Infinium bead array. The "HYPER" and "HYPO" tabs are the lists of CpGs that are, respectively, hypermethylated or hypomethylated in IDCs as compared to normal breast tissue samples, according to the criteria described in the Supplemental Materials and Methods section.
Column description:
-Illumina_ID: Illumina probe reference for each investigated CpG.
-SYMBOL: Symbol of the gene concerned.
-Wilcox.pVal: p-value given by Wilcoxon's test.
-Gene_ID: Gene ID as defined by the NCBI.
-Distance_to_TSS: Distance between the investigated CpG and the transcription start site (in base pairs).
-MapInfo: Position of the investigated CpG on the chromosome.
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).
-CpG_Island_UCSC: 'TRUE', 'shore' and 'FALSE' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to UCSC definition).
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).
HCP: High-CpG-density promoter; ICP: Intermediate-CpG-density promoter; LCP: Low-CpG-density promoter.

Figures 2A and 3A. The percentage of methylation for each of these selected CpGs is given for each of the 119 breast cancer samples.
Column description:
-Illumina_ID: Illumina probe reference for each investigated CpG.
-SYMBOL: Symbol of the gene concerned.

List of CpGs differentially methylated between clusters I and II.
This Table is provided in the additional file Sup_6.xls. The "All data" tab contains data for all 27,578 CpGs investigated by means of the Infinium bead array. The "I vs II" tab is the list of CpGs differentially methylated between clusters I and II according to the selection criteria described in the Supplemental Materials and Methods section.
-Delta.Beta: Difference in methylation between the two clusters for each CpG.
-EntrezGene_ID: Gene ID as defined by the NCBI.
-Distance_to_TSS: Distance between the investigated CpG and the transcription start site (in base pairs).
-MapInfo: Position of the investigated CpG on the chromosome.
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).

Table SIX, related to Figure 2. GSEA results for the ESR1 module.
This Table is provided in the additional file Sup_7.xls and contains two tabs corresponding to the two ESR1 sub-modules, the ESR1 positive and negative modules. Rows in grey indicate genes represented on the Affymetrix expression array but not on the Infinium Methylation bead array.
Column description:
-EntrezGene_ID: Gene ID as defined by the NCBI.
-SYMBOL: Symbol of the gene concerned.
-coefficient: Coefficient value indicating the degree of correlation between the expression of each gene of this module and that of ESR1 (see Desmedt et al., 2008).
-Methylation Enrichment: This column indicates whether the gene showed a significant enrichment in cluster I or II in terms of DNA methylation.
-Expression Enrichment: This column indicates whether the gene showed significant enrichment in cluster I or II in terms of expression.
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).

Table SX, related to Figure 3. Association between the 6 methylation clusters identified in the main set of patients and the "known expression subtypes". The upper table gives the p-values provided by Fisher's Exact test to evaluate the association between each methylation group and each "known expression subtype" determined by immunohistochemistry (IHC), with the Phi value in brackets. The lower table gives the likelihood ratio p-values provided by the chi-square test to evaluate the association between each methylation group and each "known expression subtype" determined by gene expression (GE), with the Phi value in brackets.
-pVal: p-value of the Kruskal-Wallis test for differential methylation between clusters 1 to 6.

Table SXII, related to Figure 3. Proportion of correctly classified patients as a function of the number of CpGs in the classifier.
This Table is provided in the additional file Sup_9.xls.
Column description:
-c.index: concordance index estimate (or percentage of similarity), i.e. number of correctly classified patients / total number of patients in the main set.
-se: standard error of the estimate.
-upper/lower: upper and lower bound of the confidence interval.
-p.value: p-value of the statistical test (H0: the estimate is different from 0.5).
-No.CpG's: Number of CpGs used for the estimation.

Column description:
-Illumina_ID: Illumina probe reference for each investigated CpG.
-SYMBOL: Symbol of the gene concerned.
-MapInfo: Position of the investigated CpG on the chromosome.
-Gene_ID: Gene ID as defined by the NCBI.
-Distance_to_TSS: Distance between the investigated CpG and the transcription start site (in base pairs).
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).

Table SXIV, related to Figure 3. Spearman's correlation values for each tumour of the main set with the 6 centroids.
This Table is provided in the additional file Sup_11.xls.
-Spearman_GrX: Value of the Spearman's correlation coefficient between the indicated sample and the centroid of group X.
-Max_Spearman: Maximum value of the Spearman's coefficient obtained for the indicated sample with one of the 6 centroids.
-Group_Clustering: Methylation group assigned to the indicated sample by the unsupervised clustering.
-Group_Centroid: Methylation group assigned to the indicated sample by the nearest centroid method.

Table SXV, related to Figure 3. Demography of breast cancer samples of the validation set.
This Table is provided in the additional file Sup_12.xls.
-Methyl_QC: 1 indicates that the sample passed the quality control for DNA methylation profiling.
-Subtype_IHC: "Breast cancer expression subtype" determined by IHC as described in the Supplemental Materials and Methods section.
-GRADE: Histological grade of the tumour.
-Size_Bin: 1 and 0 indicate, respectively, that the size of the tumour was above or below 2 cm.
-Size_cm: Size of the tumour in cm.
-Nodal_Status: 1 and 0 indicate, respectively, the presence or absence of cancer cells in lymph nodes.
-Age_diagnosis: Patient's age (in years) at the time of diagnosis.
-Age_bin: 1 and 0 indicate, respectively, that the patient was above or below 50 years old at the time of diagnosis.
-RFS_time: Relapse-free survival time in years.
-Relapse_5years: 1 and 0 indicate, respectively, the presence or not of a relapse event within the first 5 years of follow up.
-OS_event: 1 and 0 indicate, respectively, the occurrence or not of an overall survival event.
-OS_time: Overall survival time in years.
-Methylation_Group: Methylation group assigned to the sample by the 86-CpG classifier.

Table SXVI, related to Figure 3. Spearman's correlation values for each tumour of the validation set with the 6 centroids.
This Table is provided in the additional file Sup_13.xls.
-Spearman_GrX: Value of the Spearman's correlation coefficient between the indicated sample and the centroid of group X.
-Max_Spearman: Maximum value of the Spearman's coefficient obtained for the indicated sample with one of the 6 centroids.
-Group_Centroid: Methylation group assigned to the indicated sample by the nearest centroid method.

Lists of CpGs differentially methylated between each of the 6 methylation clusters and normal breast tissue samples in the main set.
This Table is provided in the additional file Sup_14.xls. The "All data" tab contains data for all 27,578 CpGs investigated by the Infinium bead array. The 6 "GRx vs N" tabs are lists of CpGs differentially methylated between group x and normal breast samples. The selection criteria used to compile these 6 lists are defined in the Supplemental Materials and Methods section.
Column description:
-Illumina_ID: Illumina probe reference for each investigated CpG.
-SYMBOL: Symbol of the gene concerned.
-Mean.Normal: Mean of the methylation percentage of each CpG for the normal breast samples.
-Median.GRx: Median of the methylation percentage of each CpG for the methylation subcluster x.
-Delta.GRx.vs.N: Methylation difference for each CpG between group x and normal breast samples.
-GRx.pval: p-value given by Wilcoxon's test between group x and the normal group.
-GRx.fdr: FDR-corrected Wilcoxon p-value between group x and the normal group.
-EntrezGene_ID: Gene ID as defined by the NCBI.
-Distance_to_TSS: Distance between the investigated CpG and the transcription start site (in base pairs).
-MapInfo: Position of the investigated CpG on the chromosome.
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).
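
The relation between the GRx.pval and GRx.fdr columns can be illustrated with a short sketch. The source does not state which FDR procedure was applied, so the Benjamini-Hochberg method used here is an assumption, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: BH is used only as a common choice of FDR correction;
    the procedure actually used for the table is not specified here.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical Wilcoxon p-values for four CpGs.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value is at least as large as the raw p-value it comes from, so a CpG that passes a threshold on the FDR column also passes the same threshold on the raw p-value.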

Table SXIX, related to Figure 5. Correlation between DNA methylation and gene expression data in the main set.
This Table is provided in the additional file Sup_15.xls.
Column description:
-Illumina_ID: Illumina probe reference for each investigated CpG.
-EntrezGene_ID: Gene ID as defined by the NCBI.
-SYMBOL: Symbol of the gene concerned.
-CPG_ISLAND: TRUE indicates that the investigated CpG is located in or close to a CpG island. FALSE indicates that the investigated CpG is not close to a CpG island.
-Pearson_coef: Pearson coefficient of correlation between the methylation status of the indicated CpG and the expression status of the gene concerned determined by taking the most variant Affymetrix probe.
-CpG_Island_Revisited: 'true', 'shore' and 'false' indicate that the investigated CpG is located inside a CGI, is a CpG island shore, or is neither in a CGI nor a CpG island shore, respectively (according to the definition in (Bock et al., 2007)).
-Promoter_Class: Promoter class based on CpG density and CG content as defined in (Weber et al., 2007).
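
The Pearson_coef column can be illustrated with the sketch below. It assumes that "most variant" means the Affymetrix probe with the highest variance across samples, which is one plausible reading rather than a confirmed detail. The function names and the toy values are hypothetical; the two probe identifiers are borrowed from elsewhere in the Supplemental Materials and are used here only as dictionary keys.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def most_variant_probe(probes):
    """Pick the probe whose expression varies most across samples.

    probes: dict mapping a probe ID to its expression values
    (hypothetical input layout; variance as the criterion is an assumption).
    """
    return max(probes, key=lambda p: pvariance(probes[p]))

def methylation_expression_correlation(methylation, probes):
    """Correlate one CpG's methylation with the gene's most variant probe."""
    probe = most_variant_probe(probes)
    return probe, pearson(methylation, probes[probe])

# Toy data: one CpG and two probes of the same gene across four samples.
methylation = [0.1, 0.2, 0.3, 0.4]
probes = {"205225_at": [5.0, 5.1, 5.0, 5.1], "216836_s_at": [8.0, 6.0, 4.0, 2.0]}
probe, r = methylation_expression_correlation(methylation, probes)
is_anticorrelated = r <= -0.4  # cutoff quoted in the Table SXX legend
```

A negative coefficient means that higher methylation goes with lower expression, which is the anti-correlation pattern that the subsequent table selects for.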

Table SXX, related to Figure 5. Lists of genes differentially methylated between each of the 6 methylation clusters and normal samples of the main set that display an anti-correlation between their methylation and expression status.
This Table, provided in the additional file Sup_16.xls, gives for each cluster the lists of hypo- and hypermethylated CpGs and genes (compared to normal samples) displaying an anti-correlation between their methylation and expression status (Pearson's coefficient ≤ -0.4).
Column description:
-GRx_HYPOmethylated: CpGs and associated genes hypomethylated in group x as compared to normal samples and displaying an anti-correlation between their methylation and expression status.
-GRx_HYPERmethylated: CpGs and associated genes hypermethylated in group x as compared to normal samples and displaying an anti-correlation between their methylation and expression status.
-SYMBOL: Symbol of the gene concerned.

Figure 5. Spearman correlation between methylation status of immune genes described in Figure 5 and

This Table is provided in the additional file Sup_18.xls. This analysis was performed on our methylation data for the 6,309 CpGs differentially methylated between IDC and normal breast tissue samples, described in Table SIII.
-EntrezGene_ID: Gene ID as defined by the NCBI.
-hazard.ratio: Hazard ratio as estimated by univariate Cox regression analysis.
-lower and upper: 95% confidence interval for the hazard ratio.
-fdr: FDR-corrected Wald test p-value.

This meta-analysis was performed on the genes displaying high anti-correlation between their methylation and expression status (Pearson's coefficient below -0.7), as described in the Supplemental Materials

Figure 6. Spearman correlation between methylation status of immune genes described in Figure 6 and
Figure 6. Spearman correlation between expression status of immune genes described in Figure 6 and

[Btn]ACTAACTAAACCCCCAAATCTCTAAACAAT
UBASH3A-S1 GTAGGAAGAGATGGTAG