Enriched transcription factor signatures in triple negative breast cancer indicates possible targeted therapies with existing drugs

Purpose Triple negative (TN) breast cancers which lack expression of the estrogen (ER), progesterone (PR), and human epidermal growth factor 2 (HER2) receptors convey a poor prognosis due in part to a lack of targeted therapies. Methods To identify viable targets for the treatment of TN disease, we have conducted a gene set enrichment analysis (GSEA) on seven different breast cancer whole genome gene expression cohorts comparing TN vs. ER + HER2 − to identify consistently enriched genes that share a common promoter motif. The seven cohorts were profiled on three different genome expression platforms (Affymetrix, Illumina and RNAseq) consisting in total of 2088 samples with IHC metadata. Results GSEA identified enriched gene expression patterns in TN samples that share common promoter motifs associated with SOX9, E2F1, HIF1A, HMGA1, MYC BACH2, CEBPB, and GCNF/NR6A1. Unexpectedly, NR6A1 an orphan nuclear receptor normally expressed in germ cells of gonads is highly expressed in TN and ER + HER2 − samples making it an ideal drug target. Conclusion With the increasing number of large sample size breast cancer cohorts, an exploratory analysis of genes that are consistently enriched in TN sharing common promoter motifs allows for the identification of possible therapeutic targets with extensive validation in patient derived data sets.


Introduction
Breast cancer is the most commonly diagnosed cancer in women (more than 230,000 women were diagnosed with breast cancer in the US last year) and second leading cause of cancer-related deaths, statistics that strongly advocate for a better understanding of the mechanisms that drive mammary carcinogenesis (Siegel et al., 2012). Cancers including breast cancer are initiated as a result of changes that occur in the genome. Differential gene expression analysis is commonly used to reveal the deregulated molecular mechanisms of complex diseases including cancer. Gene expression profiling has classified breast cancer into intrinsic subtypes (Carey et al., 2006;Sørlie et al., 2001). Among these is the basal-like subtype, representing ER/PRnegative with low HER2 expressing tumors are characterized as the triple negative breast cancer (TNBC). Thus, TNBC is defined by histopathologies as a subtype that lacks the expression of estrogen receptor (ER), progesterone receptor (PR), and HER2 amplification/overexpression (Dey et al., 2013a;Rastelli et al., 2010;Foulkes et al., 2010). TNBC accounts for approximately 15% of breast cancer diagnoses, but approximately 25% of breast cancer-related deaths due to a more aggressive biology and lack of targeted therapies (Grigoriadis et al., 2012;Dey et al., 2013b). Gene expression data has documented the heterogeneous nature of TNBC (Mayer et al., 2014).
Despite significant success of targeted anticancer therapies in ER+ or HER2+ subtypes of breast cancers, patients with loco-regionally advanced or metastatic triple negative carcinoma have very limited therapy options, especially as chemo-resistance develops to standard chemotherapy. A lack of standardized markers that differentiate "basal-like" and TN subtypes underlines the heterogeneous nature of these cancers. Despite the unclear delineation between TN and "basal-like" they compose some of the worst prognoses in breast cancer, are associated with an undifferentiated metaplastic histology with stem cell-like characteristics and have a high incidence of metastasis (Lien et al., 2007;Ben-Porath et al., 2008;Honeth et al., 2008). Besides array-based gene expression analysis, a number of studies have reported genomic alterations that occur in TNBC clinical specimens and cell lines including comparative genomic hybridization and deep genomic profiling using next generation sequencing (NGS) technologies (Shah et al., 2012;Ding et al., 2010;Weigman et al., 2012;Loo et al., 2011;Neve et al., 2006;Fridlyand et al., 2006;Chin et al., 2006). Wholegenome sequencing of a single metastatic TNBC (mTNBC) patient's germline, primary tumor, metastatic tumor, and xenograft has also been reported, which showed the complexity of the somatic events that arise within a given TNBC (Ding et al., 2010). More recently, genome-sequencing studies of large subsets of retrospectively collected TNBC have been reported, which implicate TP53, PIK3CA, NRAS, EGFR, RB1, and PTEN (Shah et al., 2012;Cancer, 2012).
To identify molecular mechanisms inherent to the TN subtype we have conducted gene set enrichment analysis (GSEA) (Subramanian et al., 2005), comparing TN vs. ER + HER2 −, in seven distinct cohorts, grouping gene sets by common promoter motifs to identify transcription factors and expression patterns of interest. The gene sets that are shown to be enriched in seven distinct cohorts with a Stouffer weighted Z (Whitlock, 2005;Zaykin, 2011) p-value b .01 are used to construct a promoter motif signature for genes determined to be enriched in the maximum number of cohorts. The transcription factor for each identified enriched promoter motif as well as any chemical or genetic perturbation that lowers the expression of the promoter motif gene signature represents potential therapeutic option(s) in TN breast cancer. The workflow is outlined in Fig. 1.

Cohorts
Cohorts with representation of large N samples with immunohistochemistry (IHC) determined ER+/− and HER2 status and clinical outcome data were selected for analysis. All probe or gene expression levels were used as deposited using published normalization, and the following is a summary of each cohort. Each cohort is molecularly profiled on a wide range of platforms with different normalization methodology. GSEA is done independently for each cohort to determine statistically enriched gene sets mitigating the effects of different platforms and normalizations. The GEO deposited cohorts GSE25055 (n = 279 TN = 114/ER + 165) and GSE25065 (n = 187 TN = 64/123) were run on the U133A Affymetrix GeneChip with wellcurated phenotype metadata and metastasis outcome (Hatzis et al., 2011). TCGA-BC RNA Seq V2 RSEM was downloaded from TCGA Data Portal on July 1, 2013 and represents (n = 286 TN = 58/ER+ = 228) samples Fig. 1. Each cohort consisting of TN and ER+(HER2−) samples are run using GSEA to determine gene sets that are enriched and share a common promoter motif. The p-value from each enriched gene set is combined and ranked using Stouffer weighted Z to identify gene sets that have consistent enriched gene sets across all cohorts. The transcription factor for each ranked enriched gene set are searched in the STICH 4.0 database for chemical inhibitors or activators. Additionally, common set of genes in each gene set shown to be enriched across maximum number of cohorts are searched for known chemical and genomic perturbation gene set to identify possible inhibitors or activators.

Probe and gene expression mapping
To provide for consistent gene names each platform assigned gene accession id or UniGene id was programmatically cross referenced to the HUGO recommended gene name. Probes with an identifier that had been withdrawn were removed from the data set. The probe with the maximum expression level for each gene in each sample was used to represent the transcription gene expression level.
Gene set enrichment analysis IHC metadata for ER, PR and HER2 status was used to designate each sample TN or ER+. Samples that lacked corresponding IHC metadata were not included in the analysis. Each cohort has a range of metadata to classify a sample as TN or ER+(HER2 −). For NNN that indicates the first N = ER− status, second N = PR− status and third N = HER2− status. An X indicates any value and in Metabric a 1 was used to indicate HER2− status. GSEA was performed using Broad GSEA version 2.0.13 with default settings (Subramanian et al., 2005;Mootha et al., 2003) comparing TN vs ER+ phenotype for enriched genes in the MSigDB gene set collection c3.tft.v4.0.symbols.gmt representing known gene sets sharing a common promoter motif.

Determining transcription factors and promoter motif signatures of interest
The GSEA determined p-value for each enriched gene set from each cohort was ranked using Stouffer weighted Z (Whitlock, 2005;Fisher, 1958) and ranked by p-value with a p-value b .01. For each statistically interesting gene set, it is expected that some number of genes will be consistently enriched in all cohorts. The enriched genes from each cohort were combined, and the genes found to be enriched in the maximum number of cohorts were used to make the promoter motif signature for each transcription factor gene set. The promoter motif signature can be used to determine chemical or genetic perturbations that down regulate the collection of enriched genes. For an enriched gene set that share a common promoter motif with a known transcription factor that transcription factor also becomes a potential therapeutic target.

Selection of possible therapeutic drugs
The MSigDB contains a unique gene set c2.cgp.v4.0.symbols.gmt that represents curated data of up or down regulated genes from 3402 chemical or genetic perturbation experiments. Each promoter motif gene signature was programmatically compared with each set of down or up regulated genes in c2.cgp.v4.0.symbols.gmt. Criteria stipulating 50% of the promoter motif enriched genes' signature was found in a chemically or genetically altered gene set then that c2.cgp.v4.0.symbols experiment indicates a mechanism to regulate the enriched genes sharing a common promoter motif. Experiments indicating the ability to down regulate the enriched genes are selected as being of interest.
For each enriched promoter motif with known transcription factor, the transcription factor was manually searched using the STITCH 4.0 database (Kuhn et al., 2014), which contains interactions between 390,000 small molecules and 3.6 million proteins from 1133 organisms. Drug interactions with the transcription factor that indicate inhibition or activation of the transcription factor are selected as being of interest.

Results
The list of enriched gene sets sharing common promoter motifs in Table 1 is ranked by Stouffer weighted Z p-value, which combines probabilities from independent tests with cohorts of different sizes. A high degree of overlap occurs when genes share multiple common but distinct promoter motifs for a set of transcription factors and can be seen in Table 2. Genes that share different transcription factor promoter motifs are an example of the redundancy that allows for multiple mechanisms to drive the expression of that gene.
With the ranked list of enriched gene sets any combinations of genes could be enriched with minimal overlap between cohorts. To identify genes or promoter motif signatures each set of GSEA defined enriched genes per gene set are combined for all cohorts. Enriched genes found in the highest number of cohorts per gene set are combined and shown in Table 2. Genes found to be enriched in all seven cohorts are strong evidence that the genes sharing common promoter motifs and the corresponding transcription factor are possible therapeutic targets. Data mining of differential expression of mRNA in cancer cohorts and drug interaction databases can be tested in appropriate cell lines and mouse models for possible therapeutic benefit. This approach can identify drugs that are shown in published experiments and data sets to have properties as possible therapeutics in TN breast cancer. Additionally, examination of antagonists of transcription factors identified as potential drivers of enriched gene sets can increase the number of possible therapeutic targets. To identify possible therapeutic agents that can down regulate the expression of the promoter motif signatures found in Table 2, the signatures are compared to the c2.cgp.v4.0.symbols.gmt gene sets which contain a curated list of up or down regulated genes from 3402 published array studies and is shown in Table 3. This approach has the advantage of determining drugs that act through direct or indirect mechanisms that can down regulate the expression of genes enriched in TN. This is particularly important for promoter motifs with unknown transcription factors or transcription factors with no known antagonists. Additionally, transcription factors that Table 1 Results of GSEA using gene set c3.tft.v4.0.symbols.gmt in seven different TN vs ER+(HER2−) cohorts ranked by Stouffer weighted Z score. Each cohort has a range of metadata to classify a sample as TN or ER+(HER2−). For NNN that indicates the first N = ER− status, second N = PR− status and third N = HER2− status. An X indicates any value and in Metabric a 1 was used to indicate HER2− status. are not expected to be expressed in a normal tissue are indicated in Table 5 and their expression patterns are shown in Fig. 2. As a result of the data analysis, the RAS inhibitor salirasib is identified as a possible therapy as it down regulates mRNA expression for enriched genes sharing V$E2F_01, V$E2F_Q3 and KTGGYRSGAA_UNKNOWN promoter motifs (Blum et al., 2007). Listed in Table 3, the genes that have been shown to be down regulated by salirasib have a high concordance with genes shown to be enriched in TN cohorts sharing V$E2F_01, V$E2F_Q3 and KTGGYRSGAA_UNKNOWN promoter motifs suggesting a regulatory effect that may have a treatment benefit for the patient. By using genes enriched in TN cohorts that have been shown to be down regulated by salirasib may be an effective biomarker strategy for selecting patients to enroll in a salirasib clinical trial. The RAS inhibitor salirasib has been shown to impact transcriptional response across five different cancer cell lines effecting E2F-regulated and NF-Y regulated genes and the transcription factor FOS which control cell proliferation, blocks apoptosis, and induction of activating transcription factor BACH2 regulated genes which participate in translation and stress response (Blum et al., 2007). Additionally, The EGFR inhibitor CL-387,785 down regulates enriched genes in the V$E2F_Q3 promoter motif signature indicating an additional therapeutic target. In non-small cell lung cancer, EGFR inhibitor-resistant gefitinib-treated and the (sensitive) CL-387,785-treated H1975 cells were compared to identify transcriptional changes with EGFR-activating mutations (Kobayashi et al., 2006). Aminopeptidase inhibitor Tosedostat was shown to down regulate enriched genes containing V$E2F_01 promoter motifs including MYC the transcription factor targeting V$MYC_Q2, V$MYCMAX_01 and V$MYCMAX_02 promoter motifs. Tosedostat down regulated five of nine enriched genes in the V$E2F_01 promoter motif signature in HL-60 cells (acute promyelocytic leukemia) (Krige et al., 2008) where down regulation of MYC could impact a large number of TN enriched genes. The SB216763 inhibitor of GSK3Beta treatment of RS4;11 human leukemia cell line, down regulated 474 genes at least 1.5 fold that were significantly enriched for genes related to cell cycle and MYC-regulated genes (Wang et al., 2010). Knockout experiments of TLX and BMP2 indicate a down regulation of V$E2F_01, KTGGYRSGAA_UNKNOWN and V$E2F_Q3 enriched genes implicating them as possible therapeutic targets (Lee et al., 2007;Zhang et al., 2008). E2F1 has minimal overall expression but does have higher expression in TN vs. ER+ with a comparative p-value b 0.0001 (Mann-Whitney test).
In Table 4, each transcription factor was used to identify possible activating and inactivating drugs from the STITCH 4.0 database (Kuhn et al., 2014). Surprisingly, the identified enriched promoter gene sets had a single known transcription factor versus a family of transcription factors that target the same promoter. This is encouraging given that targeting single transcription factors reduces the off target effects of treatment. To identify transcription factors that have minimal expression in normal tissue the transcription factors were submitted to the gene expression barcode 2.0-web service, which determines if a gene is expressed in many tissues using Affymetrix HGU133plus2 array data (McCall et al., 2011). Identification of transcription factors that are not expected to be expressed in breast tissue or have highly specific tissue expression further implicates its value as a therapeutic target where the normal tissue expression profiles are shown in Table 5. Genes with expected expression in breast tissue were also found.
The enriched genes that share HIF1A (Hypoxia-inducible factor alpha subunit) with promoter motif V$HIF1_Q5 and V$HIF1_Q3 as the highest ranked transcription factor resulting in higher expression of genes in TN. HIF1A is expressed in many tissues and known to have mRNA and protein down regulated by mitoxantrone a TOP2 targeting drug (Toh and Li, 2011) and rapamycin a mTOR regulator and immunosuppressant (Wang et al., 2009). HIF1A has higher expression in TN vs ER+ with a comparative p-value b 0.0001 (Mann-Whitney test). HNF1A targeting V$TCF1P_Q6 promoters with a Stouffer weighted Z p-value of .004 and ZIC3 targeting V$ZIC3_01 with Stouffer weighted Z p-value of .002 had a minimum number of enriched genes in common across all cohorts indicating lack of support for strongly enriched genes. ZIC3 is found to be expressed in cerebellum tissue with no known drug antagonist where a drug that blocks the actions of ZIC3 but does not cross the blood brain barrier should have minimal off-target effects. In Fig. 2, ZIC3 expression in TN is slightly higher than ER+ with a comparative p-value = 0.015 (Mann-Whitney test) but overall expression is minimal compared to the other transcription factors. HNF1A also has overall low expression where TN samples do appear to have outliers and a comparative p-value = .018 (Mann-Whitney test). Understanding expected normal tissue mRNA expression profiles can also be used to find novel targets without clearly defined functions or ligands. As an example, NR6A1 an orphan nuclear receptor with no known ligand where normal expression is found in germ cells of gonads and expressed in both TN and ER+ samples from E2197 ( Fig. 2) with no statistical difference. If the 10 TN enriched genes [ABTB2 AMD1 BCL11A BCL11B FGF9 ITGB8 MALL NFIB NFIL3 RASAL1] sharing the V$GCNF_01 promoter motif are important to disease progression then NR6A1 would be an ideal therapeutic target. Hirose et al. (Hirose et al., 1995) and Chen et al. (Chen et al., 1994) showed that NR6A1 mRNAs were predominant in mouse testis and were just detectable in the brain, liver, and lung. The effect of androgen manipulation to lower mRNA expression of NR6A1 was studied by Hu et al. (Hu et al., 2003) using Northern blot analysis. It has been suggested that gene expression in the testis and epididymis are regulated by a male steroid hormone, androgen (Cornwall and Hann, 1995). The expression levels of NR6A1 transcripts (2.3 and 7.4 kb) in epididymis increased gradually and reached a plateau 5-7 days after castration (Hu et al., 2003). A randomized clinical trial compared tamoxifen vs. tamoxifen and flouxymesterone (anabolic steroid with androgenic properties) in postmenopausal women with metastatic breast cancer where median time to progression was 199 days vs. 350 days respectively approaching statistical significance with a one sided p-value = .07 (Ingle et al., 1988). NR6A1 over-expression could be a potential biomarker for effective treatment with androgen therapy.

Discussion
The genetic alterations in tumor cells are dynamic in driving cellular programming and resulting in mRNA expression profiles that are important in tumor fitness. New tools and technologies for genomic or systemic level analysis, as well as conventional biochemistry (including proteomic analysis) and cell biology approaches (phenotypic studies) are revealing how signaling pathways contribute to the development of cancer, as well as tumor evolution, dormancy and therapy-resistance. Here, we show by using GSEA comparing distinct TN vs ER + HER2 − cohorts a collection of genes that are enriched in TN that share common promoter motifs associated with distinct transcription factors.
Using GSEA it is possible to determine sets of genes with common attributes or features that are enriched in a specific cancer phenotype. With the growing collection of large N cancer cohorts, this type of comparative analysis can identify potentially important genes that drive phenotype. Ideally, GSEA could be used to compare a cancer phenotype cohort with the expression profile of the normal tissue type affected by the cancer. We lack established large N normal tissue cohorts that can be used in GSEA. Using enriched gene sets when comparing common cancer phenotypes can identify common attributes that can give insight into the mechanisms that drive cancer and serve as possible therapeutic targets. The enriched gene sets that are differentially expressed are most likely driven by a common set of transcription factors or transcription factor families, which can be used to determine the drivers of the cancer phenotype.
We have identified a collection of enriched genes that share a common promoter element that indicates a possible transcription factor that is driving the differential expression between two cancer phenotypes. We can validate or prioritize a particular transcription factor as a treatment target using expression profiles from normal tissue to understand the potential for off-target effects of therapy. Transcription factors that have minimal expression in normal tissues but are found to be differentially expressed or highly expressed in a cancer phenotype make an ideal candidate for further study. Additionally, with the large collection of publicly available experiments indicating differential expression of mRNA as a result of drug treatment we can identify potential therapies that can regulate the expression of enriched genes in a particular cancer phenotype. Exploratory analysis of mRNA cancer cohorts can identify data patterns that based on biological context can then be used to mine for possible therapies that can attenuate the expression pattern associated with a cancer phenotype. In Table 3, we identify from published experimental data possible therapies that can regulate enriched genes that share a common promoter motif where it may not be clear how those genes are regulated. In Table 4, a collection of possible drugs are identified that target transcription factors that have enriched genes in TN sharing a common promoter. In Table 5, we use the expression profile of the transcription factor in normal tissues to prioritize based on possible off-target effects.
The STITCH 4.0 database did not list a known E2F1 targeted therapy but we are able to discern from the chemical and genetic perturbation data sets that RAS inhibitor Salirasib, EGFR inhibitor CL-387785 and Aminopetidase inhibitor Tosedostat regulate genes with E2F1 promoters which are shown to be enriched in TN. NR6A1 is a member of the nuclear receptor superfamily of ligand-activated receptors that shares a common modular structure. These receptors play vital roles in development, cellular homeostasis, and cancer where over-or under-expression of some receptors has prognostic significance for patient survival. NR6A1 is normally expressed in germ cells of gonads making it an interesting therapeutic target in breast cancer where unexpectedly it has high expression in both TN and ER+ E2197 samples.
The nuclear orphan receptors for which endogenous ligands have not been identified include nuclear receptor NR0B1 (adrenal hypoplasia congenita critical region on chromosome X gene), NR0B2 (small heterodimer partner), NR1D1/2 (Rev-ErbAα/β), NR2C1 (testicular receptor 2), NR2C2 (testicular receptor 4), NR2E1 (tailless), NR2E3 (photoreceptor-specific NR [PNR]), NR2F1 chicken ovalbumin upstream promoter transcription factor 1 (COUP-TFI), NR2F2 (COUP-TFII), NR2F6 (verbA-related protein), NR4A1 (Nur77), NR4A2 (Nurr1), NR4A3 (Nor1), and NR6A1 (GCNF). Results of receptor knockdown or overexpression in vivo and in cancer cell lines showed that orphan receptors exhibit tumor-specific pro-oncogenic or tumor suppressor-like activity (Safe et al., 2014). Nuclear receptors are expressed in multiple cancers and contribute to the cancer cell phenotype, and ligands (agonists or antagonists) for these receptors are important chemotherapeutic agents Lee et al., 2011;Shi, 2007;Tobin and Freedman, 2006;Willson et al., 2000;Jordan, 2003aJordan, , 2003bJordan, , 2007Jordan and O'Malley, 2007). The prognostic and functional roles of orphan receptors in cancer cells and their therapeutic implication especially in the solid tumors are not clear. A member of the family, Nur77 is highly expressed in both ER+ and ER− breast tumors (Muscat et al., 2013), and the receptor appears to be highly expressed in more differentiated, low grade tumors (Alexopoulou et al., 2010). The examination of publicly available array data for lung tumors showed that high expression of Nur77 mRNA predicted increased patient survival , while high levels of Nur77 protein were found prognostic for decreased patient survival in another cohort study (Lee et al., 2012). NR6A1 has been found to be a transcriptional repressor, through specific binding to DR0 response elements, which is found in the Oct4 proximal promoter. Interestingly, NR6A1 can interact with DNA methylation proteins and is suggested to recruit DNA methylation complexes to repress and silence Oct4 expression (Wang and Cooney, 2013). Cancer stem cells play a critical role in the tumorigenesis of basal like breast tumors. Cancer stem cells are molecularly characterized based on the expression of various cell surface receptors including integrin a6 (CD49f), integrin b1 (CD29), hyaluronan receptor (CD44), and the stem cell self-renewal transcription factors NANOG, OCT4, and SOX2 (Zheng et al., 2013). Expression screening of cancer/testis genes in prostate cancer identified NR6A1 as a novel marker of disease progression and aggressiveness (Mathieu et al., 2013). In their study, Mathieu et al. reported the identification of 98 genes detected in castration-resistant prostate cancer CRPC, hormone-sensitive prostate cancer (HSPC) and testicular samples but not in the normal controls. Among them, cellular levels of NR6A1 were found to be higher in HSPC compared to normal prostate and further increased in metastatic lesions and CRPC. It is also reported that an increased NR6A1 immunoreactivity was significantly associated with a high Gleason score, advanced pT stage and cancer cell proliferation.
Our data in line with these reports strongly indicate that NR6A1 expression warrants further investigations in breast cancer subtypes, especially in women with TNBC. The functional and prognostic studies on orphan receptors in different tumor types are limited. The role of NR6A1 in tumors is poorly understood and requires additional research. Although a wealth of integrated molecular data exists in primary breast cancer, the data available are not exhaustive. Our data provide a framework for biological validation and experimentation that should guide preclinical studies. For those pathways with therapeutic targets currently under clinical investigation, these studies can be used as a catalog of molecular lesions potentially representing biomarker(s) of future treatment option or resistance to existing treatment.