CATA: a comprehensive chromatin accessibility database for cancer

Abstract Accessible chromatin refers to the active regions of a chromosome that are bound by many transcription factors (TFs). Changes in chromatin accessibility play a critical role in tumorigenesis. With the emergence of novel methods like Assay for Transposase-accessible Chromatin Sequencing, a sequencing method that maps chromatin-accessible regions (CARs) and enables the computational analysis of TF binding at chromatin-accessible sites, the regulatory landscape in cancer can be dissected. Herein, we developed a comprehensive cancer chromatin accessibility database named CATA, which aims to provide available resources of cancer CARs and to annotate their potential roles in the regulation of genes in a cancer type-specific manner. In this version, CATA stores 2 991 163 CARs from 23 cancer types, binding information of 1398 TFs within the CARs, and provides multiple annotations about these regions, including common single nucleotide polymorphisms (SNPs), risk SNPs, copy number variation, somatic mutations, motif changes, expression quantitative trait loci, methylation and CRISPR/Cas9 target loci. Moreover, CATA supports cancer survival analysis of the CAR-associated genes and provides detailed clinical information of the tumor samples. Database URL: CATA is available at http://www.xiejjlab.bio/cata/.


Introduction
Accessible chromatin is a hallmark of active DNA regulatory elements (1). Because active chromatin carries a variety of gene regulatory information, identifying accessible chromatin makes it possible to assess the regulatory landscape of human cancers (2,3). Chromatin accessibility analysis has been shown to identify transcription factor (TF) binding sites and regulatory elements, such as the achaete-scute complex-like 1 gene (4) and ARID1A (5). Chromatin-accessible regions (CARs) are also highly specific to individual tumor types. For example, many CARs and CAR-related genes are closely associated with breast cancer (1) but are rarely present in other cancer types. Cancers also share some common regions of open chromatin. For instance, the promoter of programmed cell death ligand 1 (PDL1), a tumor marker found widely across cancers, is accessible in most cancers, and PDL1 is regulated by a variety of regulatory elements (1).
Several high-throughput techniques have been developed to profile chromatin accessibility, such as Assay for Transposase-accessible Chromatin Sequencing (ATAC-seq) (6), formaldehyde-assisted isolation of regulatory elements (7), DNaseI hypersensitivity coupled with high-throughput sequencing (8,9) and micrococcal nuclease digestion followed by high-throughput sequencing (10,11). Among these, ATAC-seq requires only a small number of cells and has become a powerful technology for profiling genome-wide chromatin accessibility with high accuracy and sensitivity (6). Several databases store chromatin accessibility data, such as Cistrome (12), TCGA (https://portal.gdc.cancer.gov/) and ENCODE (13). They have been effective data sources for chromatin accessibility investigation. However, these resources do not annotate cancer-related CARs. SPACE (http://fun-science.club/SPACE/) is a web server that links chromatin accessibility with clinical phenotypes and the immune microenvironment in pan-cancer analysis, helping cancer researchers better understand the pan-cancer immune microenvironment. However, it does not provide detailed data for each cancer type. Moreover, it lacks TF binding site information and other related annotations such as SNPs, expression quantitative trait loci (eQTLs), copy number variation (CNV), single nucleotide variants (SNVs), enhancers and 450K methylation sites.
Here, we developed a comprehensive cancer chromatin accessibility database (CATA, http://www.xiejjlab.bio/cata/), which aims to provide available resources of cancer CARs and to annotate their potential roles in gene regulation in a cancer type-specific manner. By integrating annotation data from various databases, including TCGA (1), FANTOM (14), 1000 Genomes (15), JASPAR (16) and Xena (17), CATA stores CARs and the corresponding regulatory annotations across different human tumor samples. CATA also supplies the clinical characteristics of every tumor sample, which enables researchers to assess the prognostic value of driver genes by survival analysis. In addition, CATA provides multiple user-friendly functions for data storage, browsing, annotation and analysis. It could be a powerful platform for mining the potential functions of CARs and exploring the relevant regulatory patterns in cancer.

The collection of chromatin-accessible regions
We downloaded chromatin accessibility region data (.bed files) from TCGA across 23 cancer types, covering 410 samples (Table 1). These regions were identified from ATAC-seq data according to the TCGA processing pipeline (1,18,19). First, ATAC-seq data processing and alignment were performed using the PEPATAC pipeline (http://code.databio.org/PEPATAC/). The hg38 genome build used for alignment was obtained from Refgenie (https://github.com/databio/refgenie). Bowtie2 was used to align the ATAC-seq data to the hg38 human reference genome with the options '--very-sensitive -X 2000 --rg-id'. Picard (http://broadinstitute.github.io/picard/) was then used to remove duplicates. For each sample, peak calling was performed on the Tn5-corrected single-base insertions using the MACS2 (20) callpeak command with the parameters '--shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -p 0.01'. The peak summits were then extended by 250 bp on either side to a final width of 501 bp. Peaks were then filtered against the hg38 blacklist (https://www.encodeproject.org/annotations/ENCSR636HFF/), and peaks extending beyond the ends of chromosomes were removed. Among overlapping peaks within a single sample, the most significant peak was retained and any peak directly overlapping it was eliminated. As a result, each sample has a set of fixed-width peaks. For each cancer, TCGA compiled a 'cancer type-specific peak set' containing all of the reproducible peaks observed in that cancer type. Among overlapping peaks from different samples, TCGA kept the most significant peak. Finally, the 'Pancancer Peak Set', which can be used for cross-cancer comparison, was obtained from the most significant peaks across all cancer types.
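The summit extension and overlap removal described above can be sketched in a few lines. This is a minimal illustration, not the TCGA pipeline itself: the `Peak` type, the function names and the use of a single numeric `score` for significance are assumptions made for the example, and blacklist filtering is omitted.

```python
from typing import NamedTuple

class Peak(NamedTuple):
    chrom: str
    summit: int   # 0-based summit position (hypothetical input format)
    score: float  # significance score; higher means more significant

def fixed_width(peak, flank=250):
    """Extend a summit by `flank` bp on each side: a half-open interval of width 2*flank + 1."""
    return (peak.chrom, peak.summit - flank, peak.summit + flank + 1, peak.score)

def remove_overlaps(peaks, flank=250, chrom_sizes=None):
    """Keep the most significant peak and drop every peak that directly overlaps it."""
    intervals = [fixed_width(p, flank) for p in peaks]
    if chrom_sizes is not None:
        # Discard peaks that extend beyond the ends of a chromosome.
        intervals = [iv for iv in intervals
                     if iv[1] >= 0 and iv[2] <= chrom_sizes[iv[0]]]
    kept = []
    # Visit peaks from most to least significant; keep one only if it
    # overlaps none of the peaks already kept.
    for iv in sorted(intervals, key=lambda iv: -iv[3]):
        if all(iv[0] != k[0] or iv[2] <= k[1] or iv[1] >= k[2] for k in kept):
            kept.append(iv)
    return sorted(kept, key=lambda iv: (iv[0], iv[1]))
```

With the default flank of 250 bp, every retained interval is 501 bp wide. The quadratic overlap check is chosen for readability; a production pipeline would use an interval index.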

Chromatin accessibility region annotation
CARs were annotated both genetically and epigenetically using BEDTools (21). The annotations include common SNPs, risk SNPs, CNV, SNV, motif changes, eQTLs, TF binding sites (TFBS), methylation, enhancers and CRISPR/Cas9 target sites. This annotation information helps to reveal the potential functions of CARs. In addition, interactive tables are used to present the details.

Enhancer collection
In total, 65 423 enhancers were collected from FANTOM5 (14) and converted to the hg38 genome build with the LiftOver tool (22) for annotation.

CRISPR/Cas9 target sites
CRISPR/Cas9 targeting (24) can be used in tumor cells to precisely cut genomic loci. CRISPR/Cas9 gRNA sequences target DNA sequences of transcription regions within 200 bp of genomic regions. We downloaded the CRISPR/Cas9 information from UCSC and converted it to hg38 with LiftOver (22). We used the CRISPOR tool (25) for prediction to help design, evaluate and clone guide sequences for the CRISPR/Cas9 system.

Gene annotation
The ROSE genemapper (26) method was applied to predict CAR-associated genes. This method identifies the target genes of regulatory regions based on their distance in the linear genome. Three strategies were adopted to locate CAR-associated genes: overlap (genes overlapping the CAR), proximal (genes within 50 kb of the CAR) and closest (the gene closest to the CAR).
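The three strategies can be illustrated with a small distance-based sketch. It is not the ROSE implementation: the `Gene` type, the `map_genes` function, the use of gene-body coordinates for distances and the choice to list overlapping genes only under 'overlap' are all assumptions made for the example.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int  # half-open gene-body coordinates (assumed for this sketch)
    end: int

def map_genes(chrom, start, end, genes, window=50_000):
    """Assign genes to a CAR [start, end) by the overlap, proximal and closest strategies."""
    same_chrom = [g for g in genes if g.chrom == chrom]

    def gap(g):
        # Distance in bp between the gene body and the CAR; 0 when they touch or overlap.
        return max(0, g.start - end, start - g.end)

    overlap = [g.symbol for g in same_chrom if g.start < end and start < g.end]
    # Proximal: within `window` bp of the CAR, excluding genes already counted as overlapping.
    proximal = [g.symbol for g in same_chrom
                if g.symbol not in overlap and gap(g) <= window]
    # Closest: the single nearest gene on the same chromosome, if any.
    closest = min(same_chrom, key=gap).symbol if same_chrom else None
    return {"overlap": overlap, "proximal": proximal, "closest": closest}
```

For a CAR at chr1:150-651, a gene spanning chr1:100-200 is reported as both the overlap gene and the closest gene, while a gene starting about 9 kb away is reported as proximal.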

Motif changes
We collected position weight matrices from TRANSFAC (32) and JASPAR (16) to assess the effect of annotated mutations on motifs. We then used the R package atSNP (33) to calculate how each mutation changes the binding affinity for a motif. A SNP can alter the affinity with which a TF binds its motif, so that binding at the motif changes accordingly. The calculation covered the 30-bp regions upstream and downstream of SNPs with MAF > 0.05 in the 1000 Genomes Project (15) phase 3 that were located in super-enhancer regions. Ultimately, we obtained 254 545 586 motif changes.
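The idea of a motif change can be made concrete by scoring the reference and alternate alleles against a position weight matrix and comparing the results. The sketch below is only an illustration of that idea: it is not the atSNP algorithm (which additionally assesses the statistical significance of the affinity change), it scans the forward strand only, and the function names, pseudocount and uniform background are assumptions.

```python
import math

def pwm_log_odds(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds scores against a uniform background."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({b: math.log2((col[b] + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm

def best_score(seq, pwm, pos):
    """Best forward-strand score over all motif windows that cover index `pos`."""
    w = len(pwm)
    starts = range(max(0, pos - w + 1), min(pos, len(seq) - w) + 1)
    return max(sum(pwm[i][seq[s + i]] for i in range(w)) for s in starts)

def motif_change(ref_seq, snp_index, alt_base, pwm):
    """Return (reference score, alternate score, alternate minus reference)."""
    alt_seq = ref_seq[:snp_index] + alt_base + ref_seq[snp_index + 1:]
    ref = best_score(ref_seq, pwm, snp_index)
    alt = best_score(alt_seq, pwm, snp_index)
    return ref, alt, alt - ref
```

A negative difference indicates that the alternate allele weakens the best match to the motif around the SNP; a positive one indicates that it strengthens it.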

TCGA series data
TCGA-related data were obtained from UCSC Xena (17) (http://xena.ucsc.edu/), including methylation data, RNA expression profiles, somatic mutations, copy number variation, clinical information and ATAC-seq raw counts. Methylation and RNA expression profiles were averaged by cancer type.
PancanQTL data describe the relationships between eQTLs and genes in different TCGA cancers (https://tcga-data.nci.nih.gov/tcga). We mapped eQTL-related SNPs to CARs and provided the SNP-regulated genes as potential targets of the CARs.
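Mapping eQTL SNPs to CARs amounts to a point-in-interval lookup. The following sketch shows one way to do it with a sorted index and binary search; it is not the CATA implementation (which uses BEDTools for annotation), and the function names, the tuple layouts and the assumption that CARs on a chromosome do not overlap are all introduced here for illustration.

```python
import bisect
from collections import defaultdict

def build_index(cars):
    """cars: iterable of (car_id, chrom, start, end), half-open and non-overlapping per chromosome."""
    index = defaultdict(list)
    for car_id, chrom, start, end in cars:
        index[chrom].append((start, end, car_id))
    for chrom in index:
        index[chrom].sort()
    return index

def car_for_position(index, chrom, pos):
    """Return the ID of the CAR containing `pos`, or None."""
    intervals = index.get(chrom, [])
    # Find the last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"), "")) - 1
    if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
        return intervals[i][2]
    return None

def annotate_eqtls(index, eqtls):
    """eqtls: iterable of (snp_id, chrom, pos, gene). Returns {car_id: sorted target genes}."""
    targets = defaultdict(set)
    for snp_id, chrom, pos, gene in eqtls:
        car = car_for_position(index, chrom, pos)
        if car is not None:
            targets[car].add(gene)
    return {car: sorted(genes) for car, genes in targets.items()}
```

Each CAR that contains at least one eQTL SNP is thus associated with the genes regulated by those SNPs, which can be reported as its potential targets.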

System design and implementation
CATA is built using MySQL (http://www.mysql.com) and runs on a Linux-based Tomcat web server (http://tomcat.apache.org/). The main framework of CATA was developed with Java 1.8.0, Spring Boot and MySQL 5.7.16.

Pathway analysis
CATA offers a choice of 10 pathway databases (KEGG, Reactome, NetPath, WikiPathways, PANTHER, PID, HumanCyc, CTD, SMPDB and INO). For each pathway, the enrichment P-value is calculated from the following quantities: t is the number of genes in the entire genome, z is the number of genes of interest, and a is the number of genes of interest that are involved in a pathway containing n genes. The calculated P-value is provided on the result page, along with the relevant genes and the pathway ID (clicking the pathway ID opens the detailed pathway information). The false discovery rate (FDR) method is used to correct for multiple testing. Users can adjust the number of genes required to be enriched and set thresholds on P-values or FDRs to control the stringency of the analysis.
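The quantities t, z, a and n defined above are those of a hypergeometric enrichment test. Assuming that this is the intended formula, the P-value is the probability of observing at least a pathway genes among z genes drawn from a genome of t genes, of which n belong to the pathway. The sketch below computes that upper tail and applies a Benjamini-Hochberg correction; both the exact form of the test and the choice of Benjamini-Hochberg as the FDR procedure are assumptions, and the function names are hypothetical.

```python
from math import comb

def enrichment_p(t, z, n, a):
    """Upper-tail hypergeometric P-value: P(X >= a) for z draws from t genes, n in the pathway."""
    upper = min(z, n)
    favourable = sum(comb(n, x) * comb(t - n, z - x) for x in range(a, upper + 1))
    return favourable / comb(t, z)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, with t = 10, z = 3, n = 4 and a = 2, the P-value is (6 × 6 + 4 × 1) / 120 = 1/3.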

Database maintenance
A dedicated team regularly maintains and upgrades the database. To keep pace with the ever-increasing volume of data, we will add new data to the database every year, such as newly released open cancer data and new annotation data.

User-friendly explorations
CATA provides a user-friendly exploration interface to help users navigate quickly and easily (Figure 3A). On the left side of the exploration page, four options (tissue type, cancer type, annotation and chromosome) distinguish between samples and can be used to filter the results (Figure 3B). Users can also click on a 'Peak ID' to open the detail page and learn more about the CAR.

A search interface for retrieving CAR
On the 'Search' page, users can retrieve chromatin accessibility data through four strategies: 'Search accessible regions by cancer type' (input: cancer type), 'Search accessible regions by gene' (input: gene of interest, cancer type and strategy), 'Search accessible regions by TF' (input: TF name of interest and cancer type) and 'Search chromatin accessibility region by genomic scope' (input: cancer type and genomic position) (Figure 3C). In 'Search accessible regions by cancer type', the cancer type is the only input. The output table first displays brief annotation information for each CAR (Figure 3B). This table lists the Peak ID, genome location, start, end, CAR score, region annotation, number of motifs, number of TFBSs, number of enhancers, CNV, SNP, SNV and tissue type. The term 'TFBS' denotes a particular sequence (genomic or artificial) at which a factor binds. The term 'motif' denotes a description of binding specificity obtained by aggregating information from a series of sites. Users can obtain more annotations by selecting a Peak ID of interest. The other search strategies are used in the same way. In 'Search accessible regions by gene', users enter the gene of interest in 'Gene Symbol', select the cancer type in 'Cancer Type' and choose one of the 'Strategies/Algorithm' options. The 'Example' option supplies an example provided by CATA.
After clicking 'Search', brief information on the gene of interest is displayed in a table on the result page (Figure 3D). After clicking a Peak ID, CATA provides an overview of the CAR (Figure 3E), including raw count numbers presented as bar plots and summary information (overlap genes, proximal genes, closest genes, genome location, CAR score and average RNA FPKM) about the accessible region of the gene of interest. In the plot, the y-axis is the patient ID provided by TCGA and the x-axis is the raw ATAC-seq read count in the Peak ID. Users can look up annotation for the gene of interest through the UCSC and CATA portals.
In the accessible region annotation section, the annotation information of the CAR is provided in tabular form, including SNP, motif, CNV, SNV, TFBS, etc. (Figure 3F). Users can also download the related annotation information. Users can choose among three strategies (closest gene, overlap gene and proximal gene of the 'Peak ID') to view gene expression levels in a variety of tumors (Figure 3H). Users also have the option to download raw clinical data for analysis (Figure 3G) or to perform survival analysis online (Figure 3J). The survival analysis section follows GEPIA's analysis strategy and links to GEPIA's online survival analysis (39). CATA also provides methylation visualization in 23 cancer types (Figure 3K). In the upstream TF enrichment section, the TFs binding the gene of interest and the interacting genes can be obtained directly (Figure 3I).
Users need only enter a CAR ID to run pathway enrichment online in CATA. CATA also supports a 'Threshold' option, which lets users set different thresholds to tune the stringency of the pathway enrichment to their needs. For instance, a user enters a 'Peak ID', chooses the relevant pathway databases and sets the desired threshold, and CATA returns the pathway enrichment of the related genes. In addition, users can enter a genome location to analyze the chromatin accessibility of that region, or upload files in '.bed' format. CATA then provides summary information (CAR ID, genomic location and brief annotation information) for the regions supplied by the user. In the Genome-Browser, CATA implements CAR visualization using GIVE (Figure 3L), and users can select the cases and tumor types they want to analyze. Finally, users can download gene annotation and associated TF files in '.txt' format for each sample on the 'Download' page.

Personalized genome browser and data visualization
CATA deploys the genome browser GIVE to visualize the CARs (Figure 3L). We provide a total of 796 bigwig files covering 23 cancer types for the visualization of tissue type-specific CARs. Users can enter a region in the navigation bar and load the corresponding track for visualization. We divided all the samples into 23 groups and named the samples according to the TCGA patient ID. To analyze accessible chromatin, users can input a genomic scope or upload '.bed' files.

Discussion
Chromatin accessibility plays a critical role in tumorigenesis. In cancer cells, CARs are frequently bound by TFs and carry a great deal of genetic information. Some databases already store chromatin accessibility data, such as Cistrome, TCGA and ENCODE, and they have become useful data sources for studying chromatin accessibility. Compared with Cistrome and ENCODE, CATA is a chromatin accessibility database that focuses on cancer and provides extensive annotation of open regions. Compared with other existing databases, CATA not only stores 2 991 163 CARs from 23 cancer types but also provides comprehensive annotations of these regions, including common SNPs, risk SNPs, CNVs, somatic mutations, motif changes, eQTLs, TF binding regions, methylation, enhancer location and CRISPR/Cas9 target loci. Moreover, CATA supports cancer survival analysis of CAR-associated genes, which helps researchers to identify driver genes.
CATA has five main user-friendly characteristics: (I) CATA provides four search strategies, including 'Search accessible regions by cancer type' (input: cancer type), 'Search accessible regions by gene' (input: gene of interest, cancer type and strategy), 'Search accessible regions by TF' (input: TF name of interest and cancer type) and 'Search chromatin accessibility region by genomic scope' (input: cancer type and genomic position). (II) CATA has a user-friendly 'Explorations' page. (III) CATA provides two analytical tools: pathway downstream analysis and associated accessible region analysis. (IV) CATA supports data download for 23 types of cancer. (V) CATA provides detailed help documentation so that users can quickly understand and use the database. In future versions, we will provide relevant ChIP-seq data, cancer single-cell ATAC-seq data and practical analysis tools. This will lead to a better exploration of tumorigenesis mechanisms and cancer markers.
In summary, CATA is a novel chromatin accessibility database for cancer that provides a general collection of cancer CARs. In particular, CATA provides the most extensive annotation of cancer chromatin accessibility. CATA offers an easy-to-use platform for researchers to explore cancer CARs and their detailed features. Our effort to establish this database was prompted by researchers' great need for a comprehensive dataset of cancer-related CARs covering their genomic location, target genes, TFBS, mutation, methylation, functions and survival analysis. We expect that CATA will help researchers to understand cancer more comprehensively by providing this information in an integrative manner.

Figure 1. The percentage of chromatin-accessible regions per cancer type.

Figure 2. Database content and construction. CATA provides chromatin-accessible regions of cancer based on TCGA ATAC-seq data. Genetic and epigenetic annotations of accessible regions were collected or calculated, including common SNPs, eQTLs, risk SNPs, LD SNPs, TFBS, CNV, SNV, methylation sites and enhancer location. CATA also provides ATAC-seq samples associated with clinical data. CATA integrates multiple functions including storage, search, download, statistics, visualization, browse and analysis.

Figure 3. The main functions and usage of CATA. (A) The navigation bar of CATA. (B) CATA provides user-friendly explorations. (C) Users can query using four methods: 'Search by Cancer type', 'Search by gene', 'Search by TF' and 'Advanced search by genome location'. (D) The display of search results. (E) Overview of chromatin-accessible regions. The y-axis is the patient ID provided by TCGA. The x-axis is the raw ATAC-seq read count in the Peak ID. (F) Interactive table of annotation information related to the chromatin-accessible region. (G) The table of clinical data. (H) The visualization of RNA expression. (I) Upstream TF enrichment graph. (J) The overall survival and disease-free survival analysis of the gene of interest is presented in the 'Survival' region. Genes with the most significant association with patient survival can also be identified. (K) The visualization of the methylation level. (L) Personalized genome browser GIVE.

Figure 4. Validation results associated with CA12 in breast cancer. (A) The navigation bar of CATA. (B) Input and parameters of 'Search accessible regions by gene'. (C) Brief annotation information about the genetic features in the chromatin-accessible regions of CA12, including SNP, motif, CNV, SNV, TFBS, etc. The score is a chromatin accessibility score provided by TCGA. The higher the score, the more open the chromatin. (D) The accessible region overview for the CA12 CARs, including raw count numbers presented as bar plots and summary information (overlap genes, proximal genes, closest genes, genome location, score of chromatin-accessible regions and average RNA FPKM) about the accessible region of the gene of interest. The y-axis is the patient ID provided by TCGA. The x-axis is the raw ATAC-seq read count in the Peak ID. (E) In the accessible region annotation, the annotation information of the CARs is provided in tabular form, including SNP, motif, CNV, SNV, TFBS, etc. (F) The visualization of RNA expression of CA12. (G) Upstream TF enrichment graph of CA12. (H) The disease-free survival analysis of CA12. (I) Visualization of the CA12 chromatin-accessible region.