Aberrant epigenetic and transcriptional events associated with breast cancer risk

Genome-wide association studies have identified several breast cancer susceptibility loci. However, biomarkers for risk assessment are still missing. Here, we investigated cancer-related molecular changes detected in tissues from women at high risk for breast cancer prior to disease manifestation. Disease-free breast tissue cores donated by healthy women (N = 146, median age = 39 years) were processed for both methylome (MethylCap) and transcriptome (Illumina’s HiSeq4000) sequencing. Analysis of tissue microarray and primary breast epithelial cells was used to confirm gene expression dysregulation. Transcriptomic analysis identified 69 differentially expressed genes between women at high and those at average risk of breast cancer (Tyrer-Cuzick model) at FDR < 0.05 and fold change ≥ 2. Majority of the identified genes were involved in DNA damage checkpoint, cell cycle, and cell adhesion. Two genes, FAM83A and NEK2, were overexpressed in tissue sections (FDR < 0.01) and primary epithelial cells (p < 0.05) from high-risk breasts. Moreover, 1698 DNA methylation changes were identified in high-risk breast tissues (FDR < 0.05), partially overlapped with cancer-related signatures, and correlated with transcriptional changes (p < 0.05, r ≤ 0.5). Finally, among the participants, 35 women donated breast biopsies at two time points, and age-related molecular alterations enhanced in high-risk subjects were identified. Normal breast tissue from women at high risk of breast cancer bears molecular aberrations that may contribute to breast cancer susceptibility. This study is the first molecular characterization of the true normal breast tissues, and provides an opportunity to investigate molecular markers of breast cancer risk, which may lead to new preventive approaches.

studies were used to integrate GWAS and gene expression datasets and identified 154 genes whose predicted expression associated with the risk for BC [5][6][7][8][9]. However, these studies drew data from the Genotype-Tissue Expression (GTEx) project, and, because of the use of autopsy-derived normal breast tissues, the breast-specific transcriptome profilings may be questionable. The relative lack of molecular profiling of normal breast tissue from subjects who are disease-free makes the field challenging.
Many studies searching for cancer biomarkers have identified gene expression signatures, epigenetic signatures, loss of heterozygosity and allelic imbalance resulting from the development of malignancy [10]. Among the molecular processes linked with cancer, DNA methylation has a key role in early cancer development through a process known as epigenetic reprogramming [11], potentially leading to silencing and loss of expression of tumor suppressor genes [12], and genomic instability [13].
Here, we performed an integrated analysis of DNA methylation and transcriptome profiling of cancer-free breast tissues donated by healthy women at either average or high risk for BC. In addition to early epigenetic events, we identified two molecules, FAM83A and NEK2, overexpressed in high-risk breasts and, therefore, potential markers of BC susceptibility. Moreover, using a subcohort of repeated breast tissues donation by the same donors, we confirmed that the molecular changes identified in high-risk subjects are age-independent. These findings will lead to a deeper understanding of BC susceptibility and also provide the scientific community with the molecular profiling of the true normal breast tissue.

Study cohort used to investigate molecular aberrations in association with breast cancer (BC) risk
To identify transcriptomic and epigenetic differences linked with BC risk, we analyzed cancer-free breast tissue cores donated by 146 healthy women (median age: 39 years), including 112 Caucasian, 24 African American, and 10 Asian subjects (Additional file 1: Table S1). Out of 146 participants, 117 were pre-and 29 post-menopausal women. Tyrer-Cuzick model was employed to estimate the lifetime risk of developing BC and allocated the subjects into either high-(score ≥ 20%, N = 68) or averagerisk group (score < 20%, N = 78) (Fig. 1A, Table 1 and Additional file 1: Table S1).

Characterization of the transcriptome alterations in high-risk breast
We performed a transcriptome analysis of the fresh frozen disease-free breast tissue donated by all the participants. Differential expression analysis was performed using DESeq2. From a total of 22,344 genes, the differential expression analysis between high-and average-risk breasts revealed 1874 transcripts to be significant at 5% false discovery rate (FDR). Of these, 1798 transcripts also passed the cutoff of t-test p-value ≤ 0.05 (Additional file 1: Table S2). Sixty-nine genes, including 51 upregulated and 18 downregulated genes, were identified with a fold change ≥ 2 (Table 1). Because both groups consisted of non-malignant breast tissue, a limited number of differentially expressed genes was expected [14]. Canonical pathway analysis revealed enrichment in pathways related to kinetochore signaling (p = 1.3E-05), DNA damage checkpoint (p = 0.0005), granulocytes adhesion (p = 0.002), and the IL17 pathway (p = 0.004) (Fig. 1B, Additional file 1: Table S3). Our data further confirm the impact of dysregulated DNA damage in breast carcinogenesis, as previously described [15]. Molecular network analysis showed an enrichment in functional categories involved in cell cycle, DNA replication and repair (Fig. 1C, Additional file 1: Table S3). One of the major molecular networks regulating cell cycle is centered around AKT and the transcription factor FOXM1 [16].
Except for DCX, the transcriptional changes detected between high-and average-risk breasts listed in Table 2 Fig. 1 Transcriptome profiling of breast tissues from women at either high-or average risk of breast cancer. A Schematics of the study design. Cancer-free breast tissue cores (N = 146) were divided in either high-risk or average-risk group according to the Tyrer-Cuzick breast cancer risk evaluation score (20% used as threashold). The tissues were processed and analyzed for whole transcriptome and methylome profiling and differentially expressed genes (DEG) and differentially methylated sites between high-and average-risk samples were identified. Thirty five women (10 high risk and 25 average risk) donated also a second biopsy (D2) allowing to detect age-dependent aberrations. B Pathway analysis of the transcripts differentially expressed (FDR < 0.05) between average and high-risk breasts. C Major molecular network of the differentially expressed transcripts between the two experimental groups. Genes upregulated in high-risk breasts are in red, while downregulated genes are in green. D FAM83A and NEK2 transcription level in breast tissues from women at either average-or high-risk of developing breast cancer. E Upper panel: Representative image of the immunofluorescence staining of primary breast epithelial cells with the epithelial marker, E-Cadherin (red), mesenchymal marker, Vimentin (green) as control, and nuclear staining, DAPI (blue). E-Cadherin and Vimentin staining of primary cells revealed that isolated primary cells are epithelial in nature. Lower panel: FAM83A and NEK2 expression in primary epithelial cells isolated from either average-risk (n = 4) and high-risk breast (n = 3). F FAM83A and NEK2 expression in primary and h-TERT immortalized isogenic breast epithelial cells (n = 7) from the GSE108541 dataset. G) Representative images of immunohistochemical staining for FAM83A and NEK2 are shown at 20X magnification. Staining quantification is expressed as positivity and H-score. Data are shown as mean ± standard error. #FDR < 0.005, *p < 0.05,**p < 0.001, ***p < 0.0001. Pvalue is calculated using either unpaired nonparametric Mann-Whitney test or nonparametric Wilcoxon test    are independent of both racial background and menopausal status of the tissue donors (Additional file 1: Table S4 and Additional file 2: Fig. S1). METABRIC, cBioportal, UALCAN and Oncoscore databases were interrogated to determine the cancerrelated relevance of the 69 differentially expressed genes. Among them, FAM83A and NEK2 showed overexpression in BC (p > 0.001), high genetic alteration frequency (> 10%), high gene amplification rate, and an Oncoscore > 50, and therefore were selected for further investigation (Table 1 and Additional file 2: Fig. S2) [17][18][19]. The expression of FAM83A and NEK2 in the breasts of high-and average-risk women is shown in Fig. 1D. We detected a 4.5-fold increase in FAM83A and 2.2-fold increase in NEK2 expression in primary epithelial cells isolated from the breast of high-risk women when compared with cells isolated from breast tissue of averagerisk women (Fig. 1E). Overexpression of both targets was detected also in a dataset of hTERT-immortalized epithelial cells as compared with the isogenic primary cells [20] (Fig. 1F). Moreover, immunostaining of a breast tissue microarray showed a 1.4 fold increase in FAM83A protein levels in the breast tissues from women at high risk of BC as compared with the breast tissues from subjects at average risk (p < 0.0001, Fig. 1G, Additional file 2: Fig. S3A and Additional file 1: Table S5). FAM83A overexpression in normal breast tissues was associated with parity (p < 0.001), tobacco use (p = 0.01), and family history of BC (p = 0.02) (Additional file 1: Table S6). On the contrary, NEK2 staining showed no difference in protein levels between the two groups (Fig. 1G). No difference in Ki67, estrogen receptor alpha (ERα), FOXA1, and GATA3 staining between high-and average-risk breasts was observed (Additional file 2: Fig. S3B and Additional file 1: Table S5). This data shows that FAM83A expression changes are specific to breasts of women at high risk of developing BC.

Genome-wide DNA methylation analysis reveals 1698 aberrant DNA methylation sites in normal breast tissue of high-risk women
With the goal of identifying alterations in regulatory regions leading to BC susceptibility, we performed a methylome analysis using the MethylCap-seq approach. Differential analysis of the methylated regions detected in the breasts from average-risk women and those from women at high risk of cancer revealed a wide chromosomal distribution of the epigenetic alterations ( Fig. 2A). DNA methylation changes with a ∆Z ≥ 1 (hypermethylated) or ≤ -1 (hypomethylated) were selected. We identified 1698 regions methylated that differentiate the breast tissue of high-risk women from that of women at average risk (FDR ≤ 5%), mapping to 1115 unique genes (Additional file 1: Table S7). Neither FAM83A or NEK2 genomic loci were found among the regions affected by BC risk-related DNA aberrations, suggesting that an alternative process than DNA methylation may regulate their expression. The twenty most hypermethyated and hypomethylated regions are shown in Fig. 2B and Table 2. Interindividual variability in DNA methylation can be observed within each experimental group. Among the detected DNA methylation changes, 98.9% consisted of hypermethylated loci (p = 9 × 10 − 8 ; Fig. 2C). More than 90% of hypermethylated loci localized in regulatory regions including the promoter, untranslated region, and introns, whereas only 41% of hypomethylated loci localized in these regions (Fig. 2C). Hypomethylated regions were found predominantly in the gene body (59%), a phenomenon that has been linked with the activation in cancers of aberrant intragenic promoters that are normally silenced [21,22].

DNA methylation and gene expression changes in high-risk breast show a weak correlation
To identify potential epigenetically regulated genes linked with BC risk, we performed a Pearson's correlation test on paired DNA methylation and gene expression data (Fig. 3). Among the 69 differentially expressed genes in Table 1, the expression level of eight genes was associated with aberrant intronic DNA methylation, including six genes showing a direct correlation (APELA, DIO2, FEZF2, LPAR3, UNC5D, and PRSS51) and two genes (PROK2 and SULT1C2) with a negative correlation (Fig. 3A). Furthermore, among the DNA methylation changes in Table 2, only the intronic hypermethylation of PHACTR1 (∆Z = 1.88, FDR = 1.0E-31) was negatevely correlated with PHACTR1 expression (fold change = 0.77, FDR = 0.006, r = -0.21) (Fig. 3B). Overall, the correlations identified were weak (r: -0.2,-0.5), suggesting that other regulatory events (chromatin modifications, gene amplification, nucleotide variants), rather than DNA methylation aberrations, may be the determinants of the transcriptomic changes observed in the high-risk breasts as compared with average-risk breasts.

Age-related molecular changes in cancer-free breast tissues in relation with cancer risk
Age is the strongest demographic risk factor for most human malignancies, including BC. Age-related transcriptome and DNA methylation aberrations were investigated on breast tissues cores donated by 35 women at two separate time points (Additional file 1: Table S11). Differential expression analysis (FDR < 0.05) between the two donation time points revealed the dysregulation of 205 genes involved in LXR/RXR activation (p = 7E-04), immune response (p = 2E-03), and senescence (p = 7E-03) (Additional file 1: Tables S12 and S13). Among 25 age-related transcriptomic changes  (Fig. 4A).
Notably, the expression of GRIA4 (r = -0.43, p = 0.04) and DNM3 (r = -0.47, p = 0.03) showed a negative correlation with their DNA methylation pattern, thus suggesting a potential epigenetic regulation for these two molecules (Fig. 4B). Neither FAM83A or NEK2 were found among the genes affected by age-dependent transcriptomic changes. Age-dependent DNA methylation aberrations affected 301 loci corresponding to 280 unique transcripts (Additional file 1: Table S14). As previously reported [29], age-related DNA methylation alterations consisted predominantly of hypermethylation events (85.4%) and affected the intronic regions (Fig. 4C). DNA methylation measurements were previously used to develop epigenetic biomarkers of aging, otherwise known as "DNA methylation age" or the "epigenetic clock" [30,31]. We observed a limited overlap between the 301 DNA methylation aberrations and the epigenetic clocks described by Horvath et al.(1.4%, [31]), whereas 73 genes associated with the differentially methylated bins in our dataset overlapped with age-associated DNA methylation alterations reported by Johnson et al. (24.2%, [29]) (Fig. 4D). Finally, we identified age-related DNA methylation aberrations enhanced in high-risk breasts, localized on  Table S14).

Discussion
This study aimed to define the distinct features of cancerfree breast tissues from women at high risk for breast cancer (BC) and, thus, identify molecular markers that could potentially screen for women susceptible to cancer. We conducted transcriptome and methylome analyses using breast tissue cores donated by healthy women. The participants were divided into two cohorts based on their risk of developing breast cancer, according to the Tyrer-Cuzick lifetime risk assessment score: high-risk (≥ 20%) and average-risk (< 20%) [32]. Among the genes upregulated in high-risk breast, we identified two promising markers of BC susceptibility, FAM83A and NEK2. Furthermore, when investigating DNA methylation aberrations in high-risk breasts, we detected 4-10% overlap with cancer-related signatures.
Our transciptomic analysis of high-and average-risk breasts revealed significant changes in the expression of 69 genes (FDR < 0.05). Pathway analysis suggested the activation of cell cycle and cell adhesion in the high-risk breasts. Furthermore, one of the molecular networks including the differentially expressed genes revealed the involvement of FOXM1 signaling. FOXM1 itself is upregulated 1.6 fold in high-risk breasts (p = 0.001). The transcription factor FOXM1 regulates the transcription of cell-cycle genes essential for transit from the G1/S phase into the G2/M phase, such as cyclin A2, JNK1, ATF2, and CDC25A phosphatase as well as genes critical for chromosome segregation and cytokinesis [33]. FOXM1 is overexpressed and plays a critical role in tumorigenesis, metastasis, and drug resistance in a broad range of human cancer types, such as lung, gastric, and breast cancers [16]. Compounds targeting FOXM1 expression or activity are under investigation [16]. Our results suggest that the transcriptional dysregulation detected in high-risk breasts may be driven by FOXM1.
Two genes, FAM83A and NEK2, both upregulated in high-risk breast, showed a high Oncoscore (75.5 and 61.4, respectively), and have been reported amplified in BC. FAM83A is the smallest member of the eightmember FAM83 family of proteins, that share a conserved amino-terminal Domain of Unknown Function (DUF1669 domain) [34]. It was identified based on its transforming potential [35][36][37]. FAM83A upregulation has been detected in multiple human tumor types, including breast, lung, pancreatic and cervical cancer [37][38][39][40][41][42][43][44]. Lee et al. [45,46] revealed the ability of FAM83A to confer resistance to epidermal growth factor receptortyrosine kinase inhibitors (EGFR-TKIs) through interactions with c-RAF and PI3K p85 in BC. The authors also showed that BC patients with high FAM83A expression had a worse prognosis. FAM83A depletion suppressed proliferation and invasiveness in vitro as well as tumor growth in vivo [36]. Based on the aforementioned studies, FAM83A is considered a candidate oncogene. Our findings suggest that FAM83A may be one of the first molecules dysregulated in cancer transformation and thus a marker of BC susceptibility. The functional role of FAM83A in BC inititation is currently being investigated by our team. Moreover, our DNA methylation data, in agreement with previous literature, suggest that FAM83A overexpression is mainly driven by genomic amplification rather than epigenetic regulation [47,48]. Additional studies such as dual color fluorescence in situ hybridization and deep whole genome sequencing of DNA from breast tissues of high-risk women are required to support this hypothesis.
The NIMA-related kinase 2 (NEK2) protein belongs to a family of serine/threonine kinases, which have a role in mitotic progression by initiating the separation of centrosomes [49]. NEK2 overexpression was previously reported in BC as result of gene amplification [47,50]. NEK2 depletion blocks cell cycle progression and tumor cell growth, making it an ideal therapeutic target [51]. Notably, FOXM1 is reported to both bind NEK2 promoter and interact with NEK2 [52,53]. Our study further suggests a role of NEK2 dysregulation in breast carcinogenesis. However, we did not observe changes in NEK2 protein levels in breast tissues of high-risk women, suggesting a disconnect between mRNA and protein levels, which is not uncommon, due to a more complex regulatory pathway. Our observations indicate that, while increased NEK2 mRNA expression may be indicative of BC risk, post-transcriptional events may bring NEK2 to its basal protein level. NEK2 may have a more critical functional role in a late phase of BC development. Further investigation of the role of NEK2 in breast carcinogenesis is needed.
We observed DNA methylation changes in high-risk breasts, consisting mostly of hypermethylation (98.8%) in the intronic regions (88%). Previous studies reported aberrant hypermethylation in normal breast tissue adjacent to the tumor [54]. Hypermethylation in specific gene promoters is indeed linked to carcinogenesis through transcriptional silencing of tumor suppressor genes or regulatory regions within the genome, leading to dysregulation of cell growth, cancer initiation and progression [55][56][57]. We identified a 4-10% overlap between methylome aberrations in high-risk breasts and previously reported cancer-related signatures [25][26][27][28]. The limited overlap may be linked to the different technical approaches (Methyl-capture vs Infinium HumanMeth-ylation450 array) but may also suggest that most of the cancer-related epigenetic marks are newly acquired during cancer initiation rather than being imprinted into the genome. Moreover, neither FAM83A or NEK2 was found among the genes affected by DNA methylation, suggesting that a different regulatory process may control their transcription. Although the expression of epigenetic modifiers such as DNMTs remain unaffected, we detected the upregulation in high-risk breasts of HASPIN (fc = 1.7; FDR < 0.005), a serine/threonine kinase involved into the phosphorylation of the histone H3 during mitosis [58], suggesting that other genetic and epigenetic mechanisms rather than DNA methylation may drive the transcriptomic aberrations detected in high-risk breasts.
Age is the strongest demographic risk factor for most human malignancies, including BC [59]. The limited size of our age-related cohort (N = 35) prevented us from subdividing the subjects by age at tissue donation. Nevertheless, we identified age-related transcriptomic aberrations enhanced in high-risk breasts including GRIA4 and DNM3, which resulted as potentially epigenetically regulated. In terms of DNA methylation aberrations, we found a limited overlap between the age-related DNA methylation changes from our cohort and the epigenetic clock from Hovarth et al. [31] (Additional file 1: Table S14). However, a 24.2% overlap of our dataset with age-related DNA methylation aberrations described by Johnson et al. [29] was detected. The limited overlap is probably due to the different platform used for DNA methylation detection (Infinium Human Methylation 450 array vs Methyl-Cap-seq) and the type of analysis (epithelium-specific deconvolution vs whole tissue) [29,31]. Notably, we identified specific age-related DNA methylation changes, located on PTPRM, KCNH1, SPOCK1, CFAP43 gene region, enhanced in the high-versus average-risk breasts.
This study harbors some limitations: the relatively small sample size prevented us from investigating in details cancer-related variables such as racial background. The selection of normal breast tissue cores with high content in epithelial compartment limited the number of available samples (Additional file 2: Fig. S6). Outcome data for the women at high risk for BC is not available at this time; however, this cohort is under an ongoing annual medical follow up. Because of the faster processing time and smaller cost, we performed whole tissue analysis instead of the more epithelium-specific laser microdissection or single-cell analysis. This limits the compartment specificity of the data but generates a more comprehensive view of the molecular features of the entire breast tissue core. Further deconvolution analysis may overcome this limitation [60,61].

Conclusions
The present study reveals transcriptomic and epigenetic aberrations linked with BC risk and, thus, provides an avenue for deciphering the functional relevance of genes involved in BC development. We defined a panel of 1698 methylated regions that could be used to predict BC risk. Moreover, among the transcriptional targets here identified, FAM83A showed an increase in both mRNA and protein expression in the breast of women at high BC risk, and therefore may represent a novel tissue biomarker of BC risk.

Study cohorts
Breast specimens were obtained from the Susan G. Komen Tissue Bank at the IU Simon Comprehensive Cancer Center (KTB) and donated voluntarily upon informed consent by healthy women. Subjects were recruited under a protocol approved by the Indiana University Institutional Review Board (IRB protocols number 1011003097 and 1607623663). Subject demographics and breast cancer (BC) risk factors were collected using a questionnaire administered by the KTB and summarized in Additional file 1: Table S1, S5 and S11. Breast tissue cores are collected by using a needle biopsy as previously described [14]. The study cohort consisted of two groups: 1) For the transcriptome and methylome analyses, 146 women (median age: 39 years) were selected based on the lack of clinical and histological breast abnormalities and high content in breast epithelial compartment (cellularity > 40%). Germline mutation status of the subjects was obtained from KTB. Data were retrieved from the LifeOmic's Precision Health Cloud platform (https:// lifeo mic. com/ produ cts/ preci sion-health-cloud/). Nine established breast cancer-predisposition genes (BRCA1, BRCA2, PALB2, ATM, CHECK2, BARD1, RAD51C, RAD51D, CDH1) were evaluated for variants identified as "pathogenic" or "likely pathogenic" in the Clin-Var database (https:// previ ew. ncbi. nlm. nih. gov/ clinv ar/) (Additional file 1: Table S1) [2,3].
Thirty-five of these 146 women, including 10 at high risk and 25 at average risk for BC, donated their breast tissue at two time points at intervals from 1-10 years (mean: 3.2) between the tissue donations ( Fig. 1A and Additional file 1: Table S11). 2) In a second analysis, paraffin-embedded breast tissue blocks related to 395 healthy women were obtained from the KTB and used to generate tissue microarrays. The cohort included 287 Caucasian, 66 African American, 49 Asian, with age ranging from 18 to 61 (Additional file 1: Table S5).

Breast cancer risk assessment
Lifetime risk of developing BC was estimated by using the Tyrer-Cuzick risk score (IBISv8) [32] and a threshold of 20% to separate high-(≥ 20%) from average-risk (< 20%) individuals. The Tyrer-Cuzick model was selected over the other risk estimation tools for its accuracy and inclusion of young subjects [62].

Tissue processing and nucleic acid extraction
To limit stromal contamination, only breast tissue samples abundant in epithelial compartment (cellularity > 40%) were selected and processed. Total DNA and RNA were isolated from fresh frozen breast tissue biopsies (80-150 mg) using AllPrep DNA/RNA/miRNA kit (Qiagen). Tissues were homogenized by using 2 ml prefilled tubes containing 3 mm zirconium beads (Benchmark Scientific, cat.# D1032-30), 350 µl Lysis Buffer and 2-Mercaptoethanol, and BeadBug 6 homogenizer (Benchmark Scientific) in a cold room at the following conditions: 4000 rpm for 45 s was repeated twice with 90 s rest time. The concentration and quality of total RNA and DNA samples were first assessed using Agilent 2100 Bioanalyzer. A RIN (RNA Integrity Number) and DIN (DNA integrity number) of six or higher is required to pass the quality control.

Whole transcriptome analysis
cDNA library was prepared using the TruSeq Stranded Total RNA Kit (Illumina) and sequenced using Illumina HiSeq4000. Data included 146 paired-end fastq sequence libraries (raw read length: 38 × 2). Reads were adapter trimmed and quality filtered using Trimmomatic ver. 0.38 (http:// www. usade llab. org/ cms/? page= trimm omatic) setting the cutoff threshold for average base quality score at 20 over a window of 3 bases. Reads shorter than 20 bases post-trimming were excluded. About 94% of the reads have both the mates passing the quality filters. Cleaned reads mapped to Human genome reference sequence GRCh38.p12 with gencode v.28 annotation, using STAR version STAR_2.5.2b [63]. Only samples with about 99% of the cleaned reads aligned to the genome reference. Differential expression analysis was performed using DESeq2 ver. 1.12.3 (https:// bioco nduct or. org/ packa ges/ relea se/ bioc/ html/ DESeq2. html). Counts table containing mapped read counts for each gene, to be used as input for DESeq2 was generated using featureCounts tool of subread package (https:// doi. org/ 10. 1093/ bioin forma tics/ btt656). Alternatively, we ran t-tests comparing the normalized read counts for the set of replicates from High risk samples to those for the set of replicates from Average risk samples. The normalized read counts were obtained from the DESeq2 run described above.
The pvalues from the t-test were corrected for multiple testing using Benjamini-Hochberg method.

DNA methylation analysis
Library was generated by using MethylCap Library Kit (Diagenode, Denville NJ, US) according to the manufacturer′s protocols followed by single-end 75-bp sequencing on Illumina Nextseq4000. Internal controls and duplicate samples were used to account for any batch effect and technical artifact. The data comprises of 146 paired end read libraries in FASTQ format. These libraries represent replicates for two samples-High risk (68 libraries) and Average risk (78 libraries). The libraries were sequenced across multiple runs and the combined read counts for each library were generated. Reads were adapter trimmed and quality filtered using Trimmomatic 0.38 (http:// www. usade llab. org/ cms/? page= trimm omatic) with the cutoff threshold for average base quality score set at 20 over a window of 3 bases. Reads shorter than 20 bases post-trimming were excluded. Approximately, 96% of the sequenced reads passed the quality filters to be considered "cleaned" reads. This quality control reduced the number of samples to 57 high-and 55 average-risk. Cleaned reads were mapped to Human genome reference GRCh38.p12 using BWA ver. 0.7.15 [64]. Insert sequences were imputed from the concordantly mapped read pair alignments. More than 95% of the cleaned read pairs were concordantly mapped. A previously described differential methylation analysis using either Zratio or ΔZ [65,66] was applied to the current methylcapture dataset with a slight improvisation on the validation of the significance of differential methylation. For any given local bin of a given width on the genome, the method compares across samples, variation in deduplicated insert coverage distribution quantified as the bin's z-score with respect to a larger genome region containing the bin. For this analysis, we used local non-overlapping bins with a fixed width of 250 bp with their z-scores computed relative to 25 KB regions. Z-score is the number of standard deviations by which the bin coverage varies from the larger region's mean coverage. A significant difference in Z-scores, calculated as either as ΔZ or Zratio between the samples would indicate potential differential methylation for that bin, as previously described [67]. The analysis identified 159,438 bins, each 250 bp wide, to be potentially differentially methylated between High risk and Average risk samples with z-ratios or ΔZ significant at 5% FDR and p-values from t-test ≤ 0.05. Based on positional overlap, these bins were annotated using annotation from gencode v28.

Primary breast epithelial cells and immunofluorescence
Primary breast epithelial cells were generated from cryopreserved breast tissue cores obtained from the KTB as previously described [14,20]. Immunofluorescence staining was performed as previously described [14]. Briefly, 5000 cells were cultures overnight into each well of an 8 well-chamber slide (BD Biosciences, San Jose, CA) and fixed with acetone: methanol (1:1) at -20 °C for 10 min. After washing and blocking (PBS1X, 5% normal goat serum, 0.1%TritonX-100) steps, cells were incubated with primary either rabbit anti-vimentin (Cell Signaling, D21H3, 1:100) or mouse anti-E-cadherin antibody (Cell Signaling, 14472, 1:50) overnight. Upon three washes with PBS, cells were incubated with secondary antibodies (goat anti-mouse Alexa Fluor 568 or goat anti-rabbit Alexa Fluor 488; Thermo Fisher Scientific, 1:500) for 1 h at room temperature. After three washes with PBS, the coverslide was mounted using DAKO fluorescent mounting medium (S3023 Agilent, Santa Clara, CA) and the staining was visualized using a fluorescent microscope (Eclipse TS100, Nikon Instruments inc, Melville, NY).

Quantitative real time polymerase chain reaction (qPCR)
Total RNA was extracted from cells using AllPrep DNA/ RNA/miRNA kit (Qiagen). Reverse transcription was performed using SuperScript ™ IV VILO ™ Master Mix (Invitrogen cat#: 11756050) according to the manufacturer's instructions. qPCR was performed using the TaqMan ™ Universal PCR Master Mix (Applied Biosystems, cat# 4304437) and the following TaqMan Gene Expression Assays (Applied Biosystems/Thermo Fisher Scientific, Grand Island, NY): ACTB (Hs99999903_m1), FAM83A (Hs04994801_m1), and NEK2 (Hs00601227_ m1). qPCR reactions were run on a StepOne Plus Real-Time PCR System (Applied Biosystems/Thermo Fisher Scientific), and data analyzed using the StepOne Software v2.3 (Applied Biosystems). Relative quantification was calculated with reference to ACTB and analyzed using the comparative C T method. qPCR experiments were performed in triplicate.

Statistical analysis
Comparisons between groups were done using either Student's t-test or nonparametric Mann-Whitney test on GraphPad Prism 9. Difference between groups is considered significant at p-values < 0.05. Pearson's correlation analysis was performed to determine the strength and direction of the linear relationship between DNA methylation and transcription for given targets. Only correlations with a p < 0.05 are shown. For transcriptome and methylome data, differential analysis was performed using DESeq2 and the previously described Z-score method [65,66], respectively. P-values < 0.05 are considered significanct and are corrected for multiple testing using the Benjamini-Hochberg False Discovery Rate (FDR) algorithm. For the tissue microarrays analysis nonparametric Wilcoxon rank-sum tests were used for unpaired analyses, as positivity and H-scores were not normally distributed, whereas nonparametric Wilcoxon signed-rank tests were used for paired analyses. The statistical software SAS version 9.4 (SAS Institute Inc., Cary, NC) was used to complete the statistical analyses with p < 0.05 considered significant. Baseline