BART Cancer: a web resource for transcriptional regulators in cancer genomes

Abstract Dysregulation of gene expression plays an important role in cancer development. Identifying transcriptional regulators, including transcription factors and chromatin regulators, that drive the oncogenic gene expression program is a critical task in cancer research. Genomic profiles of active transcriptional regulators from primary cancer samples are limited in the public domain. Here we present BART Cancer (bartcancer.org), an interactive web resource database to display the putative transcriptional regulators that are responsible for differentially regulated genes in 15 different cancer types in The Cancer Genome Atlas (TCGA). BART Cancer integrates over 10000 gene expression profiling RNA-seq datasets from TCGA with over 7000 ChIP-seq datasets from the Cistrome Data Browser database and the Gene Expression Omnibus (GEO). BART Cancer uses Binding Analysis for Regulation of Transcription (BART) for predicting the transcriptional regulators from the differentially expressed genes in cancer samples compared to normal samples. BART Cancer also displays the activities of over 900 transcriptional regulators across cancer types, by integrating computational prediction results from BART and the Cistrome Cancer database. Focusing on transcriptional regulator activities in human cancers, BART Cancer can provide unique insights into epigenetics and transcriptional regulation in cancer, and is a useful data resource for genomics and cancer research communities.


INTRODUCTION
Cancer and its progression arise from dysregulation of gene expression (1,2). The identification of transcriptional regulators (TRs), including transcription factors (TFs) and chromatin regulators (CRs), that control oncogenic gene expression program in each cancer type is an essential task for cancer research as these regulators could be targets for novel therapies. The Cancer Genome Atlas (TCGA) consortium has generated more than 10000 RNA-seq datasets for tumor and normal samples for over 30 cancer types, as one of the largest resources for gene expression profiles in human cancers (3). Knowing what TRs regulate the genes with differential expression between tumor and normal samples can help better understand functional gene regulatory networks in each cancer type. However, TCGA did not produce ChIP-seq data due to limited cell numbers for most primary tumor samples. Therefore, computational prediction of functional TRs from differential gene expression data from TCGA will be useful in understanding cancer transcriptional regulation. Such efforts include the Cistrome Cancer web resource, which uses a comprehensive computational modeling approach to integrate TCGA data with publicly available chromatin profiling ChIP-seq data and reports the results on predicted TF targets and putative enhancer profiles in each cancer type (4). TF activities in Cistrome Cancer were inferred primarily based on the expression of the TF genes from TCGA and correlations with their putative target genes. The direct binding profiles of TFs from ChIP-seq data and their association with the putative enhancer profiles were not utilized in the model. Additionally, Cistrome Cancer focused on 575 factors, while growing ChIP-seq data from the public domain have covered more than 900 TRs (5)(6)(7), calling for an expansion of the existing database.
We previously developed Binding Analysis for Regulation of Transcription (BART), a computational method for inference of functional TRs from a target gene set (8). BART leverages a large collection of over 7000 human TRbinding profiles and 5000 mouse TR-binding profiles from the Cistrome Data Browser database (6,7) to build the inference model. When using gene list input, BART first uses collected ChIP-seq profiles of H3K27ac, an active enhancer histone mark, to predict a cis-regulatory profile that regulates the gene set. Each TR ChIP-seq dataset from the compendium is then mapped to the cis-regulatory profile and scored based on whether the TR-binding sites tend to overlap with high-ranking regions in the cis-regulatory profile, quantified by a receiver-operating characteristic (ROC) measure. The scores for all ChIP-seq datasets for the same TR are then integrated to generate TR-level quantifications using a statistical model (8). We demonstrated that BART has good performance in identifying the true TRs using differentially expressed genes generated from TR knockdown/knock-out experiments (9,10).
BART can be utilized for inferring functional TRs from differentially expressed genes derived from TCGA data, providing a new approach for integration of tumor molecular profiling data and public ChIP-seq data. By using BART on both up-and down-regulated genes in cancer compared to normal, over 900 TRs can be reported based on their ranks of predicted activity in each cancer type. This allows us to identify not only regulators that promote tumorigenesis (from up-regulated genes) but also regulators that may act as tumor suppressors (from down-regulated genes). Here, we present BART Cancer, an interactive web resource database to report the putative transcriptional regulators that are responsible for differentially expressed genes in 15 different cancer types in TCGA, and multiple activity measures of over 900 transcriptional regulators across these 15 cancer types. An overview of the workflow can be seen in Figure 1.

TCGA data collection and processing
RNA-seq data from 31 cancer types were downloaded from TCGA. In order to obtain a robust set of gene expression profiles, normalized read count in FPKM data from over 10 000 cancer samples were clustered using K-means clustering. About 31 clusters were retained and the clusters from unambiguous cancer type annotation were kept with the TCGA annotation. Misclassified samples (<100) were removed from other cancer type clusters. Clusters with sim-ilar expression patterns from multiple cancer types were merged, e.g., esophagus-stomach cancer (STES) was stomach adenocarcinoma (STAD) and esophageal carcinoma (ESCA) combined; colorectal cancer was colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ) combined. Additionally, breast cancer (BRCA) was clustered into two separate clusters: BRCA 1 for mainly luminal breast cancer (764 of 833 are luminal) and BRCA 2 for mainly basal breast cancer (176 of 234 are basal), similar to the results in Cistrome Cancer (4). A cancer type was only included in BART Cancer if TCGA included both cancer genomic profiles and normal genomic profiles, leaving BART Cancer with 15 different reannotated cancer types. For each cancer type, both up-and down-regulated genes were identified using DESeq2 (11). Differentially expressed genes were defined as those having adjusted Pvalue <1e-7 and log2(fold change) >1 for up-regulated genes and < -1 for down-regulated genes. Differentially expressed genes provided by Cistrome Cancer (4) were used to cross-reference the newly identified cancer up-regulated genes. Because Cistrome Cancer does not provide downregulated genes, newly identified down-regulated genes were not cross-referenced against any sources. These genes were then used as input to BART to get the putative transcriptional regulators responsible for regulating each gene set.

BART analysis for TR prediction
BART is a bioinformatics tool for predicting functional transcriptional regulators that bind at cis-regulatory regions to regulate gene expression in human or mouse (8). BART version 2.0 was used to generate TR prediction results for both up-regulated and down-regulated gene lists derived from each cancer type. The main output of BART is a list of putative TRs with multiple quantification scores. The top ranked regulators are predicted putative regulators that are associated with the input dataset. The relative rank, calculated by the rank divided by the total number of TRs included in the analysis, was used as the BART prediction score for each TR in the data visualization. The Irwin-Hall P-value, calculated from the Irwin-Hall distribution as the null distribution for rank integration, was used as the final significance measure to rank all TRs.

Database design and implementation
BART Cancer is designed using HTML5, CSS and JavaScript technologies. The Tabulator library (http:// tabulator.info/) is used to display the BART result tables as well as list TRs to select plots to display. All TR plots were generated in R with ggplot2. The results, figures, and differentially expressed genes are all freely available for download.
The web-based user interface provides ways for researchers to view, visualize, and download the provided data. By selecting a cancer type and a type of differential expression, i.e., down-or up-regulated in cancer versus normal samples, the user is able to view the BART results of running BART on those genes for the specified cancer type. Directly below the display table include links to download the table as well as supplementary data files. Also provided is the link to the BARTweb (9) help page to better understand the meaning of the BART results ( Figure 2).
Below the BART results table module, TR specific plots are displayed. Each plot displays the expression level, BART rank, and the likelihood ratio score of a TR across all cancer types. TR gene's average expression level was quantified as log(FPKM + 0.1) from TCGA RNA-seq data and ranked across the 15 cancer types. BART rank was quantified as 1-relative rank from the BART result. The likelihood ratio score was obtained from the Cistrome Cancer database, measuring the regulatory association of ChIP-seq data for this TR to the target genes predicted in Cistrome Cancer (4). An additional figure is included that combines the data from the other three figures into one bubble plot for each TR. In these plots, the BART Rank is plotted against the Cistrome Cancer likelihood ratio and the bubble size corresponds to relative expression level ranked in the 15 cancer types (Figure 3). This results in cancer types that had the chosen TR ranked high in both BART and Cistrome Cancer likelihood appearing in the upper right quadrant. A large bubble of some color appearing in the upper right quadrant indicates that the TR is specifically expressed and activating its target genes in the cancer type represented by that color. This module uses a column of TR names which, when selected, change the figures displayed to the selected TR.

RESULTS
BART Cancer web resource is created for users to browse putative TRs that are likely to regulate differential gene ex-pression in a particular cancer and to visualize the relative activity of a TR across multiple cancer types. In order to validate that BART Cancer reports real functional TRs, we did a survey on the top three TRs predicted by BART from differentially expressed genes in each cancer type and crossreferenced with literature (Table 1).

TRs for up-regulated genes
The TRs predicted from the up-regulated genes in cancer can be associated with oncogenic activation and promote tumorigenesis. Here, we explore some of the top ranked TRs including histone deacetylase 2 (HDAC2), estrogen receptor 1 (ESR1), TEA domain family member 4 (TEAD4), achaete-scute homolog 1 (ASCL1), androgen receptor (AR) and yes associated protein 1 (YAP1) in several cancer types.
In bladder urothelial carcinoma (BLCA), HDAC2 ranked among the top three putative TRs. This class 1 deacetylase has been shown to be up-regulated in bladder tumors (12). Additionally, a 2016 study showed that the inhibition of HDAC2 successfully inhibited cell proliferation and suggested that the inhibition of HDAC2 may be a promising therapy for urothelial carcinoma (13).
The top ranked putative regulator in luminal breast invasive carcinoma (BRCA luminal) was ESR1, which has been previously shown as a predictive biomarker of breast cancer. These results agree with previous studies on breast cancer that estrogen receptors and ESR1 mutations are markers of poor prognosis (14).
In colorectal adenocarcinoma (COAD and READ), TEAD4 has been shown to be a biomarker and tumorigenic Example of BART Results Interface on BART Cancer. The radio button menus allow the user to select a reannotated TCGA cancer type and differential expression type. The BART result table then displays the list of putative transcriptional regulators for the cancer type and expression type selected. The table is displayed with the following columns: TR, name of transcriptional regulator; Wilcoxon test statistic, Wilcoxon rank-sum test comparing the set of association score from one TR with all the association scores. The higher the test score is, the more likely the TR regulates the input; Wilcoxon P-value, One-sided P-value of the Wilcoxon rank-sum test. The smaller the P-value is, the more significant the TR regulates the input; Z-score, Z-score indicates the deviation of the TR's Wilcoxon test score from its background model, which is generated from MSigDB gene sets. The background model contains the Wilcoxon test score for each TR and each gene set; Max AUC, Maximum association score of each TR in its datasets indicating the specific binding pattern of one dataset that has the highest correlation with the input; Relative Rank, Rank indicating average between Wilcoxon P-value, Z-score, and maximum association score and then dividing the absolute rank by the total number of TRs; and Irwin-Hall P-value, indicating the rank significance, the smaller the P-value is, the more significant the rank is. Below the table are links to download the tables and supplementary data files as well as a link to BARTweb for more information on BART results. promoter and has been identified as a possible target for therapies (15). The BART results reflect this as TEAD4 is the highest ranked regulator ( Figure 3A).
In both lung cancers, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), ASCL1 ranked among the top five putative regulators ( Figure 3B). It has been demonstrated that ASCL1 is required for tumorigenesis in pulmonary cells (16). Additionally, some ASCL1 targets were ranked among the top twenty regulators including SOX2 whose overexpression has been clearly described in lung cancers (17).
In prostate adenocarcinoma (PRAD), the top ranked factor is AR ( Figure 3C). This nuclear receptor has been shown to promote prostate cell growth and survival when bound to dihydrotestosterone (DHT). Thus, when there is an AR mutation or overexpression, prostate cell growth is directly impacted (18). For this reason, androgen deprivation therapy has been used to suppress prostate tumorigenesis (19).
YAP1 is a transcriptional co-activator and its activity with TEAD transcription factors have proliferative effects (20). BART results for up-regulated genes in stomach and esophageal carcinoma (STES) rank YAP1 and TEAD4 among the top three ranked putative TRs. This is supported by a 2016 study that has shown YAP1 knockdown suppresses esophageal carcinoma and proposes YAP1 to be a target for gene therapy (21).

TRs for down-regulated genes
Transcriptional regulators predicted from down-regulated genes can be associated with tumor suppressor functions. Here we highlight six factors from the top three ranked TRs predicted from the down-regulated genes in several cancer types: GATA-binding protein 4 (GATA4), repressor element 1 (RE-1)-silencing transcription factor (REST), forkhead box A1 (FOXA1), ESR1, hepatocyte nuclear factor 4 alpha (HNF4A) and AR. Integrative visualization of BART rank, Cistrome Cancer likelihood ratio score, and relative expression level of transcriptional regulators across cancer types for three transcriptional regulators. The bubble plots combine information from BART rank, Cistrome Cancer likelihood ratio score, and expression level. Each cancer type is plotted as a bubble on the 1 -relative rank versus likelihood ratio axes. The top right quadrant contains cancer types in which the transcriptional regulators ranked high in likelihood score and high in BART rank. The size of the bubble corresponds to the rank (1-15) of its relative expression level using log(FPKM + 0.1) from RNA-seq with the highest rank being the largest bubble, and the lowest rank being the smallest bubble. TEAD4 (A) and ASCL1 (B) show COAD READ and LUAD in the top right quadrant, respectively, with the largest bubble indicating that they are likely oncogenic regulators in the corresponding cancer type. AR (C) shows PRAD as the largest bubble and highest BART rank. Although it does not rank high in Cistrome Cancer likelihood ratio score, it is still likely to be a functional transcription factor in PRAD. Table 1. The top three transcriptional regulators from BART results on up-regulated genes (left) and down-regulated genes (right). The regulators in bold are those that are discussed further as have literature-supported oncogenic activator or tumor suppressor functions. Cancer type abbreviations were following TCGA nomenclature.

Cancer type
REST is ranked highly among many cancer's BART results for down-regulated genes. This result is not surprising as REST has previously been identified as a tumor suppressor from genetic screens (23). In breast cancer specifically, REST was identified to have significantly decreased expression in tumor compared to normal samples and REST knockdown in MCF-7 cells resulted in an increase in cell proliferation and reduced sensitivity to anticancer drugs (24). Similar results were found in colorectal cancer cell lines, further supporting its function as a tumor suppressor (25).
FOXA1 and ESR1, two of the top three ranked factors for head and neck cancer (HNSC) have been identified as tumor suppressors. First, second ranked FOXA1 was shown to suppress expression of miRNAs that were responsible for malignant behaviors in nasopharyngeal carcinoma, a subtype of head and neck squamous cell carcinoma (26). Second, third ranked ESR1 has been hypothesized to have tumor suppressive function in laryngeal squamous cell carcinoma (LSCC), another subcategory of head and neck cancer (27). It has also been shown that the removal of ESR1 results in a more aggressive LSCC and high ESR1 expression correlated to significantly improved survival (27).
HNF4A is identified as the top ranked regulator for kidney chromophobe (KICH). HNF4A has been associated 6 NAR Cancer, 2021, Vol. 3, No. 1 with diabetes and has more recently been associated with reduction of proliferation in kidney cells, specifically renal cell carcinoma (28). HNF4A works by positively regulating genes that are responsible for regulating proliferation, specifically ABAT and ALDH6A1 (29,30). By regulating these proliferation regulatory genes, HNF4A works as a tumor suppressor.
While AR was shown to be a oncogene, it has been shown to have the opposite effect in kidney chromophobe (KICH) and thyroid carcinoma (THCA). The loss of AR has been shown to correspond to increasing pathological grade of kidney caners (31). Further, it was shown that high expression of AR was associated with better survival rates supporting a tumor suppressor role for AR in kidney cancer (32). Likewise, decreased expression of AR in thyroid carcinoma was associated with poor prognosis and more aggressive tumor behavior (33).

DISCUSSION
BART Cancer web resource provides information about computational model-predicted transcriptional regulators in 15 different cancer types, by integrating TCGA gene expression profiling data with publicly available ChIP-seq data for over 900 factors. It provides insights to cancer transcriptional regulation in two aspects. First, for researchers interested in a particular cancer type, BART Cancer reports putative transcription factors that either activate oncogenic gene expression or regulate tumor-suppressive gene expression. Depending on up-or down-regulation of target genes and transcriptional activator or repressor function of the regulator, these putative regulators can have either oncogene or tumor-suppressor functions in this cancer. As many predicted transcription factors have been cross-referenced with literature, the novel factors in the prediction may guide further functional and mechanistic studies. Second, for researchers interested in a specific transcription factor or chromatin regulator, BART Cancer reports its relative activities across 15 cancer types and can give insights to which cancer types this factor might be functional in.
A few limitations of BART Cancer are worth noting. First, BART prediction was based on RNA-seq data from dozens to hundreds of TCGA samples and differentially expressed genes identified were shared across samples of each cancer type. The heterogeneity across samples were not taken into account. Also, cell heterogeneity could play a significant role especially in solid tumors, and the gene expression profiles could contain components from nontumor cells such as tumor-infiltrating immune cells (4). As a result, BART predicted TRs might reflect cell composition changes rather than transcriptional regulation changes in cancer cells. Second, BART only utilizes publicly available ChIP-seq datasets for making predictions. There are over 1600 likely human transcription factors (34). Despite including more than 900 factors, there are still around 500 transcription factors that do not have high-quality ChIPseq data and are absent in BART Cancer. The predicted TRs are far from complete. Third, discrepancies between BART rank and TR gene expression or likelihood ratio score were frequently found in TR plots. Possible reasons may include that differential transcription levels of a TR gene are not necessarily correlated with different TR activities. Likelihood ratio scores were based on putative TR target genes, whose accuracy could be limited because of the co-expression assumption and various ChIP-seq data qualities. Users should be aware of such caveats when interpreting BART Cancer data. In summary, despite of having certain limitations, BART Cancer can provide useful information in cancer transcriptional regulation and can be a valuable data resource for the cancer research community.

DATA AVAILABILITY
BART Cancer database resource is available at http:// bartcancer.org/.