CATA: a comprehensive chromatin accessibility database for cancer

Chromatin accessibility is a crucial epigenetic feature that plays an important biological role in oncology. As cancer research expands, a comprehensive database is required to identify and annotate tumor chromatin accessible regions (CARs). Here, we developed CATA to provide cancer-related CAR annotation. Currently, CATA contains 2,991,163 CARs, relevant clinical data, and transcription factor binding predictions for cancer CARs from 410 tumor samples of 24 cancer types. Furthermore, CARs in our database are annotated with SNPs, risk SNPs, eQTLs, linkage disequilibrium SNPs, transcription factors, CNVs, SNVs, enhancers, and 450K methylation sites. By combining all these resources, we believe that CATA will serve oncology researchers well. Our database is accessible at http://bio.licpathway.net/cata/


Introduction
Chromatin accessibility describes regions of a chromosome that are transcriptionally active [1]; these regions bind many transcription factors (TFs).
Chromatin accessibility plays a very important role in tumorigenesis and can reveal key information about epigenetics and cancer mechanisms. Researchers typically used DNase-seq to measure chromatin accessibility until William J. Greenleaf and colleagues developed the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) technique in 2013 [2]. Compared with DNase-seq, ATAC-seq has four major strengths.
First, the accuracy of ATAC-seq is consistent with that of DNase-seq, but the ATAC-seq experiment is easier to perform. Second, ATAC-seq simultaneously reveals accessible genomic locations, DNA-binding proteins, and transcription factor (TF) binding-site interactions. Third, ATAC-seq requires fewer cells. Fourth, ATAC-seq shows high reproducibility (R = 0.98) and good consistency with DHS sequencing (R > 0.79) [2]. ATAC-seq has made it possible to assess the gene regulatory landscape in primary human cancers. Interestingly, chromatin accessibility changes are usually early cytological events in the context of various strain responses, stress responses, or developmental transitions [3]. Chromatin structure studies can therefore provide valuable information for the early diagnosis and treatment of cancer [4]. However, the combinatorial effects of these phenomena in a specific biological function are poorly understood. Moreover, TFs are major players in the regulation of gene expression, and transcriptional regulation involves complex and detailed patterns of TF binding activity.
Numerous studies have indicated that TFs regulate genes by binding regulatory elements in the accessible genome [5]. A large number of regulatory elements, such as enhancers and promoters, are located in ATAC-seq peaks. Chromatin accessibility differs between cancers, so CARs contain a great deal of transcriptional regulation information [6]. Identifying chromatin accessibility in cancer would therefore provide ample insight into cancer mechanisms. Resources exist for chromatin accessibility investigation. However, the existing resources do not focus on cancer and contain only limited cancer-related CARs; they also lack comprehensive annotation information. Importantly, cancer CARs usually contain many regulatory elements [4,7] and can be bound by many transcription factors, which might co-regulate target genes with the regulatory elements [8,9].
Here, we developed a comprehensive cancer chromatin accessibility database (CATA, http://bio.licpathway.net/cata/), which aims to provide a large number of cancer CAR resources and to annotate their potential roles in gene regulation in a cancer type-specific manner. CATA integrates data from 12 databases, including TCGA, FANTOM [10], 1000 Genomes [11], JASPAR [12], and Xena, to annotate enhancers and TFs as well as mutation and methylation sites in CARs. CATA contains 2,991,163 CARs and corresponding annotations for 410 tumor samples of 24 cancer types, and supplies clinical data and survival analysis. CATA is also a comprehensive cancer ATAC-seq database that provides multiple functions, including storage, browsing, annotation, and analysis. It could be a powerful platform for mining the potential functions of CARs and exploring relevant regulatory patterns in cancer.

Materials and methods
The collection of chromatin accessibility: We downloaded chromatin accessible region data (BED files) from TCGA. ATAC-seq data processing and alignment were performed using the PEPATAC pipeline (http://code.databio.org/PEPATAC/). The hg38 genome build used for alignment was obtained using Refgenie (https://github.com/databio/refgenie). Specifically, Bowtie2 was used to align reads to the hg38 human reference genome with the "--very-sensitive -X 2000 --rg-id" options. Picard ... Moreover, a CRISPR tool was used for the design, evaluation, and cloning of guide sequences for the CRISPR/Cas9 system.
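As a rough illustration of the alignment step, the sketch below assembles a Bowtie2 command line that carries the three options quoted above. It is not the PEPATAC pipeline itself: the function name, file paths, sample ID, and thread count are all hypothetical placeholders, and only standard Bowtie2 flags are used.

```python
import shlex
from typing import List


def bowtie2_command(index_prefix: str, fastq_r1: str, fastq_r2: str,
                    read_group_id: str, out_sam: str,
                    threads: int = 4) -> List[str]:
    """Build a paired-end Bowtie2 argument list (illustrative only)."""
    return [
        "bowtie2",
        "--very-sensitive",         # sensitive end-to-end alignment preset
        "-X", "2000",               # maximum fragment length for valid pairs
        "--rg-id", read_group_id,   # read-group ID written to the SAM header
        "-p", str(threads),         # number of alignment threads (assumed)
        "-x", index_prefix,         # prefix of the hg38 Bowtie2 index
        "-1", fastq_r1,             # mate 1 reads
        "-2", fastq_r2,             # mate 2 reads
        "-S", out_sam,              # SAM output file
    ]


# Hypothetical paths and sample name, for illustration only.
cmd = bowtie2_command("hg38/bowtie2_index/hg38",
                      "sample01_R1.fastq.gz", "sample01_R2.fastq.gz",
                      "sample01", "sample01.sam")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where Bowtie2 and the index are installed.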
Gene annotation: Three strategies were adopted to locate CAR-associated genes.
The ROSE2 [17] gene-mapper method was applied to predict associated genes using the overlap, proximal, and closest strategies.
... resulting from the GWAS catalog [19] and GWASdb v2.0 [20] were integrated into a table of human disease/trait-associated SNPs and insertion/deletion variants, which allowed functional annotation.
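The three gene-mapping strategies named in this section (overlap, proximal, closest) can be sketched as follows. This is a simplified illustration under stated assumptions, not ROSE2's actual code: the `Gene` record, the 50 kb proximal window, and the use of the region midpoint for the closest gene are all assumptions, and the gene names are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int
    strand: str

    @property
    def tss(self) -> int:
        # Transcription start site depends on the strand.
        return self.start if self.strand == "+" else self.end


def map_genes(chrom: str, start: int, end: int, genes: List[Gene],
              window: int = 50_000) -> Dict[str, object]:
    """Assign genes to one CAR by overlap, proximity, and nearest TSS."""
    same_chrom = [g for g in genes if g.chrom == chrom]
    # Overlap: the gene body intersects the CAR.
    overlap = [g.name for g in same_chrom if g.start <= end and g.end >= start]
    # Proximal: TSS within `window` bp of the CAR, excluding overlapping genes.
    proximal = [g.name for g in same_chrom
                if start - window <= g.tss <= end + window
                and g.name not in overlap]
    # Closest: the gene whose TSS is nearest to the CAR midpoint.
    center = (start + end) // 2
    nearest = min(same_chrom, key=lambda g: abs(g.tss - center), default=None)
    return {"overlap": overlap, "proximal": proximal,
            "closest": nearest.name if nearest else None}


# Made-up genes and CAR coordinates, for illustration only.
genes = [
    Gene("GENE_A", "chr1", 1_000, 5_000, "+"),
    Gene("GENE_B", "chr1", 20_000, 30_000, "-"),
    Gene("GENE_C", "chr1", 500_000, 510_000, "+"),
    Gene("GENE_D", "chr2", 4_000, 6_000, "+"),
]
result = map_genes("chr1", 4_000, 6_000, genes)
print(result)
```

Keeping the three lists separate lets a downstream table report each kind of association for a CAR rather than a single merged gene list.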
Motif changes: Annotation of motif mutations was based on position weight matrices collected from TRANSFAC [21] and JASPAR [12].
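One common way to quantify a motif change is to score the reference and variant sequences against a position weight matrix and compare the two scores. The sketch below assumes a log-odds score with a uniform background and a small pseudocount. The matrix is a toy example rather than a real TRANSFAC or JASPAR entry, and the text does not specify CATA's actual scoring scheme.

```python
import math
from typing import Dict, List

# Toy 4-column position probability matrix (hypothetical, not a real motif).
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.02, "C": 0.03, "G": 0.05, "T": 0.90},
]


def log_odds_score(pwm: List[Dict[str, float]], seq: str,
                   background: float = 0.25,
                   pseudocount: float = 0.01) -> float:
    """Sum of per-position log2(probability / background) for `seq`."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    score = 0.0
    for column, base in zip(pwm, seq.upper()):
        # The pseudocount keeps zero-probability bases from giving log(0).
        p = (column[base] + pseudocount) / (1.0 + 4 * pseudocount)
        score += math.log2(p / background)
    return score


def motif_change(pwm: List[Dict[str, float]], ref_seq: str,
                 alt_seq: str) -> float:
    """Score difference (alt minus ref); negative means a weaker match."""
    return log_odds_score(pwm, alt_seq) - log_odds_score(pwm, ref_seq)


# A G>T substitution at the third motif position.
delta = motif_change(TOY_PWM, "ACGT", "ACTT")
print(f"score change: {delta:.2f}")
```

A negative difference suggests that the variant weakens the predicted binding site, and a positive one suggests that it strengthens it.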

User-friendly exploration
CATA provides a user-friendly interface to help users navigate quickly and easily.
On the left side of the page, users can narrow down samples with four clickable filters (Tissue type, Cancer type, Annotation, Chromosome). Once the results are filtered, users can click a 'Peak ID' to open the details page of that CAR for more information.
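The four-way filtering described in this section amounts to keeping the records that match every selected option. A minimal sketch follows. The field names and example records are hypothetical and do not reflect CATA's actual schema or server code.

```python
from typing import Dict, List, Optional

# Hypothetical CAR records; field names are assumptions for illustration.
CARS: List[Dict[str, str]] = [
    {"peak_id": "BRCA_1", "tissue_type": "Breast", "cancer_type": "BRCA",
     "annotation": "Promoter", "chromosome": "chr1"},
    {"peak_id": "BRCA_2", "tissue_type": "Breast", "cancer_type": "BRCA",
     "annotation": "Distal", "chromosome": "chr2"},
    {"peak_id": "LUAD_1", "tissue_type": "Lung", "cancer_type": "LUAD",
     "annotation": "Promoter", "chromosome": "chr1"},
]


def filter_cars(records: List[Dict[str, str]],
                **selected: Optional[str]) -> List[Dict[str, str]]:
    """Keep records matching every selected facet; None means 'any'."""
    return [
        record for record in records
        if all(value is None or record.get(field) == value
               for field, value in selected.items())
    ]


hits = filter_cars(CARS, cancer_type="BRCA", annotation="Promoter",
                   tissue_type=None, chromosome=None)
print([record["peak_id"] for record in hits])
```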

Personalized genome browser and data visualization
CATA embeds the recently developed genome browser GIVE [25] to help visualize open chromatin regions. We provide a total of 796 bigWig visualization files covering 24 types of cancer. Users can jump to the detail page and view detailed regional information, or enter a region in the navigation bar and load the corresponding track for visualization.
We grouped the samples of 23 cancers and named them according to the TCGA patient ID.

Online analysis tools
CATA provides three analytical tools, including: (1) Cell Pathway Analysis, ... in (https://zhong-lab-ucsd.github.io/GIVE_homepage/). We recommend using a modern web browser that supports the HTML5 standard, such as Firefox or Google Chrome, for the best display.

Discussion
CATA is a database of tumor chromatin accessible regions, storing 29,366,632 accessible chromatin regions from 24 types of cancer, including pan-cancer. CATA annotates 2,936,663 CARs and stores the binding sites of 1,678 transcription factors, which could help predict cancer-specific transcription factors. CATA also integrates methylation, SNP, and SNV data from TCGA and offers a convenient interactive interface. The database provides four search methods, based on cancer type, transcription factor, gene symbol, and advanced search by genomic location, respectively. CATA also provides survival analysis of some CAR genes via the built-in GEPIA2 (Python package) and supports pathway analysis of transcription factors binding to CARs. All of these features aim to help cancer researchers mine potential information about cancer mechanisms. However, CATA still has some deficiencies. Existing ChIP-seq data do not yet cover the full diversity of cancer types. We expect such data to grow rapidly, and relevant ChIP-seq data and single-cell cancer ATAC-seq data may be added in the second edition to enrich the database, enabling better mining of tumorigenesis mechanisms and cancer markers.

Acknowledgments
We thank TCGA for sharing their cancer chromatin accessibility data.
We thank Richard A. Young and his colleagues for sharing the ROSE program for this work.
We thank Xiaoyi Cao for helping us deploy the GIVE genome browser.
We thank Zemin Zhang and his colleagues for sharing GEPIA2 (Python package) for this work.

Key points
• CATA is the first comprehensive resource for chromatin accessibility in cancer.
• The CATA genome browser provides visualization of chromatin accessibility for 23 types of cancer.