Chinese Glioma Genome Atlas (CGGA): A Comprehensive Resource with Functional Genomic Data from Chinese Glioma Patients

Gliomas are the most common and malignant intracranial tumors in adults. Recent studies have revealed the significance of functional genomics for glioma pathophysiological studies and treatments. However, access to comprehensive genomic data and analytical platforms is often limited. Here, we developed the Chinese Glioma Genome Atlas (CGGA), a user-friendly data portal for the storage and interactive exploration of cross-omics data, including nearly 2000 primary and recurrent glioma samples from Chinese cohorts. Currently, open access is provided to whole-exome sequencing data (286 samples), mRNA sequencing (1018 samples) and microarray data (301 samples), DNA methylation microarray data (159 samples), and microRNA microarray data (198 samples), as well as to detailed clinical information (age, gender, chemoradiotherapy status, WHO grade, histological type, critical molecular pathological information, and survival data). In addition, we have developed several tools for users to analyze mutation profiles, mRNA/microRNA expression, and DNA methylation profiles, and to perform survival and gene correlation analyses of specific glioma subtypes. This database removes barriers for researchers, providing rapid and convenient access to high-quality functional genomic data resources for biological studies and clinical applications. CGGA is available at http://www.cgga.org.cn.


Introduction
Gliomas are the most common intracranial malignant tumors in adults. According to a multi-center cross-sectional study on brain tumors in China, the age-standardized prevalence of primary brain tumors is approximately 22.52 per 100,000 for all populations, with gliomas accounting for 31.1% [1][2][3]. Despite advances in current treatment strategies, the survival rate of patients with glioma has not improved appreciably in decades, especially for aggressive gliomas (associated with a poor median survival time of 14.4 months) [4,5]. According to the 2016 World Health Organization (WHO) classification of central nervous system (CNS) tumors, gliomas are classified from grade II to grade IV not only by histological characteristics but also by several molecular pathological features, e.g., IDH (IDH1 and IDH2) mutation and chromosome 1p/19q codeletion status [6]. Clinically, most lower-grade gliomas (LGGs) progress to glioblastoma (grade IV, GBM) in less than 10 years [6][7][8]. Glioma recurrence or malignant progression likely occurs for several reasons: (1) infiltrative tumor cells cannot be completely removed by neurosurgical resection [9,10]; (2) residual tumor cells cannot be effectively suppressed by limited postoperative treatment options [3,11,12]; (3) multiple lesions may progress sequentially [13,14]; (4) tumor cell cloning occurs rapidly under chemotherapy and/or radiotherapy [7,15]; and (5) tumor cells readily adapt to the immunosuppressive tumor microenvironment [16,17]. Glioma research is greatly hindered by limited data resources. Therefore, it is essential to collect clinical specimens and provide genomic sequencing data to the glioma research community.
Recently, high-throughput technologies have been extended to characterize genomic status, including but not limited to DNA methylation modification, genetic alteration, and gene expression regulation. In the cancer research community, major large-scale projects, such as The Cancer Genome Atlas (TCGA, which includes 516 LGG samples and 617 GBM samples as of October 18, 2019) [18] and the International Cancer Genome Consortium [ICGC, which includes 80 adult GBM samples and 50 pediatric GBM samples (excluding the TCGA samples) as of April 3, 2019] [19,20], have generated an unparalleled amount of functional genomic data. These projects have changed our understanding of cancers and led to breakthroughs in diagnosis, treatment, and prevention. Importantly, they have provided opportunities for discovery and validation to researchers worldwide. However, the data generated by these projects are often difficult to access, analyze, and visualize, especially for researchers with limited bioinformatics skills. These limitations have greatly hindered the use of functional genomic data to obtain novel findings of significance for drug development and clinical treatments. Although several webservers, e.g., cBioPortal [21,22] and GlioVis [23], have been constructed to analyze multi-dimensional glioma data, they ignore the heterogeneity in tumors, as data obtained from recurrent glioma samples and subtype analyses are lacking.
Here, we introduce the Chinese Glioma Genome Atlas (CGGA, http://www.cgga.org.cn), an open-access and easy-to-use platform for the interactive exploration of multi-dimensional functional genomic datasets collected from nearly 2000 glioma samples from Chinese cohorts. The database currently contains a wide range of data derived from whole-exome sequencing (WES, 286 samples), mRNA sequencing (1018 samples) and microarray (301 samples), DNA methylation microarray (159 samples), and microRNA microarray analyses (198 samples), as well as comprehensive clinical data. Furthermore, we developed various online tools to browse mutational landscape profiles, mRNA/microRNA expression profiles, and DNA methylation profiles, and to perform survival and correlation analyses of specific subtypes. We hope that CGGA removes the barriers for researchers who need fast and convenient access to high-quality functional genomic data resources.

Database implementation
In CGGA, all data were organized using MySQL 14.14 based on a relational schema, which will be supported in future CGGA updates. The website code was written in Java Server Pages using the Java Servlet framework. The website is deployed on the Tomcat 6.0.44 web server and runs on a CentOS 5.5 Linux system. jQuery was used to generate, render, and manipulate data for visualization. The 'Analyze' module was implemented with Perl and R scripts. The CGGA website has been fully tested in the Google Chrome and Safari browsers. The design of CGGA is displayed in Figure 1.
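To make the idea of a relational organization concrete, the following minimal sketch stores samples, datasets, and measurements in three linked tables and retrieves the expression of one gene together with tumor status. It is an illustration only: the table names, column names, and values are hypothetical, the actual CGGA schema is not described in the text, and SQLite is used in place of MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical tables and columns for illustration; this is NOT CGGA's real schema.
SCHEMA = """
CREATE TABLE sample (
    sample_id    TEXT PRIMARY KEY,
    tumor_status TEXT NOT NULL,   -- 'primary' or 'recurrent'
    who_grade    TEXT,
    histology    TEXT
);
CREATE TABLE dataset (
    dataset_id TEXT PRIMARY KEY,
    data_type  TEXT NOT NULL
);
CREATE TABLE measurement (
    sample_id  TEXT NOT NULL REFERENCES sample(sample_id),
    dataset_id TEXT NOT NULL REFERENCES dataset(dataset_id),
    feature    TEXT NOT NULL,     -- gene, probe, or microRNA name
    value      REAL NOT NULL,
    PRIMARY KEY (sample_id, dataset_id, feature)
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?, ?)",
        [("S1", "primary", "II", "A"), ("S2", "recurrent", "IV", "rGBM")],
    )
    conn.execute(
        "INSERT INTO dataset VALUES (?, ?)", ("mRNAseq_325", "mRNA sequencing")
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?, ?)",
        [
            ("S1", "mRNAseq_325", "ADAMTSL4", 1.2),
            ("S2", "mRNAseq_325", "ADAMTSL4", 8.7),
        ],
    )
    return conn


def expression_by_status(conn, dataset_id, gene):
    """Join measurements to samples and return (tumor_status, value) pairs."""
    rows = conn.execute(
        """SELECT s.tumor_status, m.value
           FROM measurement m JOIN sample s ON s.sample_id = m.sample_id
           WHERE m.dataset_id = ? AND m.feature = ?
           ORDER BY s.sample_id""",
        (dataset_id, gene),
    )
    return rows.fetchall()
```

Keeping clinical annotations in their own table means that each analysis reduces to a join on the sample identifier, which is one plausible reason for choosing a relational layout.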

Database content and usage
Database content
The CGGA database is designed to archive functional genomic data and to allow the interactive exploration of multi-dimensional datasets from both primary and recurrent gliomas in Chinese cohorts. The database is available at http://www.cgga.org.cn. Currently, CGGA contains WES (286 samples), mRNA sequencing (a total of 1018 samples, with batch 1 comprising 693 samples and batch 2 comprising 325 samples), mRNA microarray (301 samples), DNA methylation microarray (159 samples), and microRNA microarray (198 samples) data, as well as detailed clinical data (including age, gender, chemoradiotherapy status, WHO grade, histological type, critical molecular pathological information, and survival data). Detailed statistical information for each dataset is provided in Table 1.
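As a quick consistency check on the counts quoted above, the short sketch below sums the two mRNA sequencing batches and all datasets. The numbers are taken from the text; the dictionary keys and function names are only illustrative.

```python
# Sample counts per data type, as stated in the text.
SAMPLE_COUNTS = {
    "WES": 286,
    "mRNA sequencing (batch 1)": 693,
    "mRNA sequencing (batch 2)": 325,
    "mRNA microarray": 301,
    "DNA methylation microarray": 159,
    "microRNA microarray": 198,
}


def mrna_seq_total(counts):
    """Sum the mRNA sequencing batches only."""
    return sum(n for name, n in counts.items() if name.startswith("mRNA sequencing"))


def grand_total(counts):
    """Sum the sample counts over every data type."""
    return sum(counts.values())


print(mrna_seq_total(SAMPLE_COUNTS))  # 1018
print(grand_total(SAMPLE_COUNTS))     # 1962
```

The two batches add up to the stated 1018 mRNA sequencing samples, and the overall sum of 1962 agrees with the description of "nearly 2000" samples.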

The analyses and results
To facilitate analysis of the CGGA data, especially for bioinformatics beginners, we developed four online modules in the 'Analyze' tab (Figure 2). 'WEseq data', 'mRNA data', 'methylation data', and 'microRNA data' are included for analyzing the WES, mRNA expression, DNA methylation, and microRNA expression data, respectively (Figure 2A). A key feature of CGGA is its ease of use. In the examples below, we illustrate the usage of the 'Analyze' tab in CGGA.
On the 'WEseq data' page, users can visualize the mutational profile of a gene set of interest and perform a survival analysis of a specific gene of interest in specific glioma subtypes (Figure 2B). In the 'OncoPrint' section, users are guided to (a) input a gene set of interest, for example, IDH1, TP53, and ATRX; and (b) select a subtype of interest, for example, 'All'. Based on the user input, the tool automatically generates the results and displays them to the user. In the results, each column corresponds to a case or patient and each row corresponds to a gene; different kinds of mutations are marked in different colors, and a heatmap depicting clinical information is presented below the table (Figure 2C). The 'OncoPrint' section can be very useful for visualizing the mutational profile of a gene set of interest in a specific glioma subtype and for intuitively revealing mutual exclusivity or co-occurrence for a gene pair. In the example above, mutations in IDH1 (47%), TP53 (46%), and ATRX (30%) were the most common mutations in all glioma samples included. In the 'Survival' section, users can input a specific gene (e.g., IDH1) and select a subtype (e.g., 'Primary LGG') to investigate the association of gene mutation with survival. Consistent with previous studies [24], primary LGG patients with IDH1 mutation show better overall survival than patients carrying wildtype IDH1 (P < 0.0001, Figure 2D, left). The results from the 'WEseq data' section can be exported in PDF format. To ensure reproducibility, the input data (Figure 2D, middle) and R code (Figure 2D, right) are provided, enabling users to reproduce the figure with customized options according to their own needs.
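The quantities behind an OncoPrint-style view can be sketched in a few lines: the per-gene mutation frequency across samples, and the 2x2 counts for a gene pair from which co-occurrence or mutual exclusivity is judged. The sketch below is not the CGGA implementation; the sample IDs and mutation calls are made-up toy values, and the function names are hypothetical.

```python
# Toy data: sample ID -> set of mutated genes (made-up values, not CGGA data).
MUTATIONS = {
    "S1": {"IDH1", "TP53", "ATRX"},
    "S2": {"IDH1", "TP53"},
    "S3": {"IDH1"},
    "S4": set(),
}


def mutation_frequency(mutations, genes):
    """Fraction of samples carrying a mutation in each gene."""
    n = len(mutations)
    return {g: sum(g in m for m in mutations.values()) / n for g in genes}


def pair_table(mutations, gene_a, gene_b):
    """Return 2x2 counts for a gene pair as (both, only_a, only_b, neither)."""
    both = only_a = only_b = neither = 0
    for mutated in mutations.values():
        a, b = gene_a in mutated, gene_b in mutated
        if a and b:
            both += 1
        elif a:
            only_a += 1
        elif b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


print(mutation_frequency(MUTATIONS, ["IDH1", "TP53", "ATRX"]))
print(pair_table(MUTATIONS, "IDH1", "TP53"))
```

A large "both" count relative to the "only" counts suggests co-occurrence, whereas a pair that is almost never mutated together suggests mutual exclusivity; a formal test on the 2x2 table would be needed to support either conclusion.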
On the 'mRNA data' page, users can examine the distribution of gene expression and perform correlation and survival analyses for a specific gene in a specific glioma subtype (Figure 3A). Three mRNA datasets are available to users, including two batches of RNA-seq datasets (batch 1: 693 samples; batch 2: 325 samples) and one microarray dataset (301 samples). In the 'Distribution' section, users can display the distribution pattern of one gene in each glioma subtype by selecting a dataset (e.g., 'mRNAseq_325') and inputting a gene name of interest (e.g., ADAMTSL4).
Upon hovering the mouse over each point, the expression level and clinical information of each case appear in a popup window. The results show the gene expression pattern in each glioma subtype classified based on clinical information. In our illustrative case, similar to our previous studies [25], ADAMTSL4 was shown to be differentially expressed according to the WHO 2016 classification based on the IDH mutation and/or 1p/19q co-deletion status and WHO grade (Figure 3B). In addition, a unique feature of the CGGA dataset is the inclusion of recurrent gliomas. This module allows users to infer whether a gene may be a candidate factor that drives malignant progression if it is differentially expressed between primary and recurrent gliomas. In the 'Correlation' section, the user can examine the co-expression pattern by selecting a dataset (e.g., 'mRNAseq_325') and entering a gene pair (e.g., ADAMTSL4 and CD274). The co-expression patterns in each glioma subtype are then displayed together with Pearson's correlation coefficient and the corresponding P value (Figure 3C). In the 'Survival' section, users can perform survival analysis based on gene expression by selecting a dataset (e.g., 'mRNAseq_325') and inputting a gene (e.g., ADAMTSL4). In our illustrative case, all primary glioma patients with low ADAMTSL4 expression have better overall survival than those with high ADAMTSL4 expression (P < 0.0001, Figure 3D left; P = 0.00023, Figure 3D [...] previous study [25].

Figure 1 Schematic of CGGA illustrating the data processing and display approaches

LGG, lower-grade glioma; GBM, glioblastoma; A, astrocytoma; O, oligodendroglioma; OA, oligo-astrocytoma; AOA, anaplastic oligo-astrocytoma; AA, anaplastic astrocytoma; rGBM, recurrent glioblastoma; rAA, recurrent anaplastic astrocytoma; rA, recurrent astrocytoma; AO, anaplastic oligodendroglioma; rAO, recurrent anaplastic oligodendroglioma; rO, recurrent oligodendroglioma; rAOA, recurrent anaplastic oligo-astrocytoma.

Similar to the 'mRNA data' page, on the 'methylation data' page and the 'microRNA data' page, users can view the methylation/miRNA distribution and perform correlation and survival analyses. Further analyses, such as differential expression analysis, clustering analysis, and correlation analysis, can be performed in the 'Tools' section. An expression matrix can be downloaded and rearranged by the user, and the user can upload an input matrix following the instructions. The resulting graph can be downloaded in PDF format.
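The statistics behind the 'Correlation' and 'Survival' sections can be sketched as follows: Pearson's correlation coefficient for a gene pair, and a Kaplan-Meier survival estimate for groups defined by expression level. This is an illustration under assumptions, not the CGGA code (which the text says is written in Perl and R). In particular, splitting patients into "high" and "low" groups at the median is an assumption, because the cutoff used by CGGA is not stated here, and the P values reported on the website (for example from a log-rank test) are not computed in this sketch.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_by_median(values):
    """Label each value 'high' or 'low' relative to the median (assumed cutoff)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return ["high" if v > median else "low" for v in values]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    One point is reported for every distinct time with at least one death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        leaving = sum(1 for tt, _ in data if tt == t)            # deaths + censored at t
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
        i += leaving
    return curve
```

In use, one would call `split_by_median` on the expression values of the gene of interest, then run `kaplan_meier` separately on the survival times of the "high" and "low" groups and compare the two curves.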

Data acquisition
Users can download all datasets on the 'Download' page. Each data type is saved at the gene and/or probe level and is then combined with available clinical data, including basic clinical information, survival, and therapy information. The raw sequencing data can be accessed at the National Genomics Data Center (NGDC, https://ngdc.cncb.ac.cn) by filing an application online.
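After downloading, a typical first step is to join an expression matrix with the clinical table by sample identifier. The sketch below shows one way to do this with tab-separated text. The file contents, column names (`Gene_Name`, `Sample_ID`, `Grade`, `OS_days`, `Censor`), and values are hypothetical and chosen only for illustration; the real downloaded files should be inspected for their actual layout.

```python
import csv
import io

# Made-up file contents; the column names are hypothetical.
EXPRESSION_TSV = "Gene_Name\tS1\tS2\nADAMTSL4\t1.2\t8.7\nCD274\t0.4\t3.1\n"
CLINICAL_TSV = (
    "Sample_ID\tGrade\tOS_days\tCensor\n"
    "S1\tWHO II\t2100\t0\n"
    "S2\tWHO IV\t380\t1\n"
)


def read_expression(text):
    """Parse a gene-by-sample matrix into {sample: {gene: value}}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {s: {} for s in samples}
    for row in reader:
        gene = row[0]
        for sample, value in zip(samples, row[1:]):
            table[sample][gene] = float(value)
    return table


def read_clinical(text, id_column="Sample_ID"):
    """Parse the clinical table into {sample: {column: value}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[id_column]: dict(row) for row in reader}


def merge(expression, clinical):
    """Combine both tables, keeping only samples present in each."""
    return {
        s: {**clinical[s], **expression[s]} for s in expression if s in clinical
    }


merged = merge(read_expression(EXPRESSION_TSV), read_clinical(CLINICAL_TSV))
print(merged["S2"]["Grade"], merged["S2"]["ADAMTSL4"])
```

Restricting the merge to samples found in both tables avoids silently analyzing expression values that have no survival or grade annotation.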

Data processing for WES data
Genomic DNA from each tumor and the matched blood sample was extracted and assessed for integrity by 1% agarose gel electrophoresis. The DNA was subsequently fragmented and subjected to quality control, and then paired-end libraries were prepared. The Agilent SureSelect kit v5.4 (Cat No. 5990-9857, Santa Clara, CA) was used for target capture. Sequencing was performed on a HiSeq 4000 platform (Illumina, San Diego, CA) using a paired-end sequencing strategy. Valid DNA sequencing data were mapped to the reference human genome (UCSC hg19) using Burrows-Wheeler Aligner (v0.7.12-r1039, bwa mem) [26] with default parameters. Then, SAMtools (V1.2) [27] and Picard (V2.0.1, Broad Institute, Cambridge, MA) were used to sort the reads by coordinates and mark duplicates. Statistics such as sequencing depth and coverage were calculated based on the resultant BAM files. SAVI2 was used to identify somatic mutations (including single nucleotide variations and short insertions/deletions) as previously described [7,8]. Briefly, in this pipeline, SAMtools mpileup and bcftools (V0.1.19) [28] were used to perform variant calling; then, the preliminary variant list was filtered to remove positions with insufficient sequencing depth, positions with only low-quality reads, and positions that were biased toward either strand. Somatic mutations were identified and evaluated by an empirical Bayesian method. In particular, mutations with a mutation allele frequency in tumors significantly higher (P < 0.05) than that in normal controls were selected.
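The filtering logic described above can be sketched as follows. Note the substitutions: SAVI2 uses an empirical Bayesian method, whereas this sketch swaps in a one-sided Fisher's exact test to ask whether the mutant allele is enriched in the tumor relative to the matched normal sample. The depth and strand-bias thresholds, the dictionary keys, and the function names are all hypothetical; only the P < 0.05 cutoff comes from the text. The low-quality-read filter is omitted.

```python
from math import comb


def fisher_one_sided(alt_t, ref_t, alt_n, ref_n):
    """P(tumor alt-read count >= observed) under the hypergeometric null.

    This stands in for SAVI2's empirical Bayesian comparison (an assumption).
    """
    n_tumor = alt_t + ref_t
    n_alt = alt_t + alt_n
    total = n_tumor + alt_n + ref_n
    upper = min(n_tumor, n_alt)
    tail = sum(
        comb(n_alt, k) * comb(total - n_alt, n_tumor - k)
        for k in range(alt_t, upper + 1)
    )
    return tail / comb(total, n_tumor)


def passes_filters(v, min_depth=10, min_strand_fraction=0.1, alpha=0.05):
    """Apply depth, strand-bias, and tumor-versus-normal filters to one variant.

    v holds read counts: alt_t/ref_t (tumor), alt_n/ref_n (normal), and
    alt_fwd/alt_rev (tumor alt reads per strand). Thresholds are hypothetical.
    """
    # 1. Remove positions with insufficient sequencing depth.
    if v["alt_t"] + v["ref_t"] < min_depth or v["alt_n"] + v["ref_n"] < min_depth:
        return False
    # 2. Remove positions whose supporting reads are biased toward one strand.
    alt_fwd, alt_rev = v["alt_fwd"], v["alt_rev"]
    if min(alt_fwd, alt_rev) < min_strand_fraction * (alt_fwd + alt_rev):
        return False
    # 3. Keep variants whose allele frequency is significantly higher in tumor.
    p = fisher_one_sided(v["alt_t"], v["ref_t"], v["alt_n"], v["ref_n"])
    return p < alpha


somatic = {"alt_t": 20, "ref_t": 30, "alt_n": 0, "ref_n": 50, "alt_fwd": 11, "alt_rev": 9}
biased = dict(somatic, alt_fwd=20, alt_rev=0)
shallow = dict(somatic, alt_t=3, ref_t=2)
print(passes_filters(somatic), passes_filters(biased), passes_filters(shallow))
```

Applying the cheap depth and strand checks before the statistical test mirrors the order given in the text, where the preliminary variant list is filtered first and somatic status is evaluated afterwards.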
Data processing for mRNA sequencing data

Discussion and perspectives
The current version of CGGA is the first release of this database, which includes multi-dimensional functional genomic glioma data, e.g., WES, mRNA and microRNA expression, and DNA methylation data, for nearly 2000 samples from Chinese cohorts. Considering the significance of these data for glioma research, we have decided to make CGGA publicly available to researchers worldwide. To the best of our knowledge, CGGA is the first database archiving functional genomic data of both recurrent LGG samples and GBM samples. In addition, CGGA provides online interactive functionalities, including mutational profile, gene expression distribution pattern, correlation, and survival analyses. Phenotype-focused exploration, differential expression analysis, and clustering analysis can be performed by uploading rearranged gene matrices to the online tools. These features will make it convenient for bioinformatics beginners to obtain and validate novel findings of biological significance.
However, the current version of CGGA is still nascent. The visitor-interactive functionalities will be improved in future updates. Unlike TCGA, CGGA currently contains no neuroimaging data, which is a limitation of the database. Such data will be uploaded in the near future. In addition to addressing these shortcomings, further improvements of the CGGA database are planned. First, relying on the Beijing Neurosurgical Institute, Beijing Tiantan Hospital, and the Chinese Glioma Cooperative Group (CGCG) Research Network, we will continue to collect glioma tissue samples, perform cross-omics sequencing/microarray analyses, and update the database regularly. In addition, we plan to provide single-cell sequencing data that match a subset of patients in the existing cohort. Furthermore, we will improve the integrity of the molecular pathological data by retrospectively checking medical records or reanalyzing pathological slices.
In summary, CGGA provides access to multi-omics sequencing data on Chinese cohorts for the global research community. It provides a user-friendly interface for obtaining integrated datasets, performing intuitive visualized analysis, and downloading these datasets. CGGA greatly reduces the barriers for glioma researchers to gain access to complex functional genomic data, allowing them to harness functional genomic data for important biological insights and identify potential clinical applications.