Concordance of copy number loss and down-regulation of tumor suppressor genes: a pan-cancer study

Tumor suppressor genes (TSGs) encode the guardian molecules to control cell growth. The genomic alteration of TSGs may cause tumorigenesis and promote cancer progression. So far, investigators have mainly studied the functional effects of somatic single nucleotide variants in TSGs. Copy number variation (CNV) is another important form of genetic variation, and is often involved in cancer biology and drug treatment, but studies of CNV in TSGs are less represented in literature. In addition, there is a lack of a combinatory analysis of gene expression and CNV in this important gene set. Such a study may provide more insights into the relationship between gene dosage and tumorigenesis. To meet this demand, we performed a systematic analysis of CNVs and gene expression in TSGs to provide a systematic view of CNV and gene expression change in TSGs in pan-cancer. We identified 1170 TSGs with copy number gain or loss in 5846 tumor samples. Among them, 207 TSGs tended to have copy number loss (CNL), from which fifteen CNL hotspot regions were identified. The functional enrichment analysis revealed that the 207 TSGs were enriched in cancer-related pathways such as P53 signaling pathway and the P53 interactome. We further performed integrative analyses of CNV with gene expression using the data from the matched tumor samples. We found 81 TSGs with concordant CNL events and decreased gene expression in the tumor samples we examined. Remarkably, seven TSGs displayed concordant CNL and gene down-regulation in at least 50 tumor samples: MTAP (212 samples), PTEN (139), MCPH1 (85), FBXO25 (67), SMAD4 (64), TRIM35 (57), and RB1 (54). Specifically to MTAP, this concordance was found in 14 cancer types, an observation that is not much reported in literature yet. Further network-based analysis revealed that these TSGs with concordant CNL and gene down-regulation were highly connected. This study provides a draft landscape of CNV in pan-cancer. Our findings of systematic concordance between CNL and down-regulation of gene expression may help better understand the TSG biology in tumorigenesis and cancer progression.


Background
Cancer is characterized by unconstrained cell proliferation. In the normal cell, there is precise control of cell division such as cell cycle check points [1]. In cellular system, tumor suppressor genes (TSGs) are important guardian genes that protect a normal cell from one step on the path to uncontrolled growth [2,3]. In cancer cells, TSGs may lose their normal functions because of mutations occur at its critical sites. For single nucleotide or small insertions/deletions (indels), these mutations often lead to truncation of transcripts or proteins, including nonsense mutations, splicing site mutations, or frameshift mutations. Similar effects can be caused by larger scale mutations, such as copy number variations (CNVs), gene fusions, or structural variants (SVs) [4,5]. The mutated TSGs often coordinate with oncogenes for cancer progression [4,6,7]. Therefore, the identification and understanding of TSGs have profound influence to develop the diagnosis biomarkers and effective drugs for cancer therapies.
CNVs are the variable number of DNA fragments in the human genome. Their lengths typically range from a kilo base pairs to a mega base pairs [8]. CNVs are divided into two major groups: copy number loss (CNL) and copy number gain (CNG). CNL denotes the decreased gene (or sequence fragment) copies in the genome while CNG denotes the gain of additional gene copies in the human genome. With the development of high throughput technologies such as Comparative genomic hybridization (CGH) array and next-generation sequencing, a very large number of CNVs, as well as other types of mutations and genomics data (e.g., gene expression) have been unveiled, especially in cancer genomes [9,10]. This allows us to systematically study cancer mutation signatures, heterogeneity, and other molecular features [11]. For CNV, such deleted or duplicated DNA fragments often have profound effects on gene expression, which subsequently affects gene's function [12].
Despite a number of studies have explored CNVs and gene expression in various cancers [13], there has been no systematic study of the features in TSGs yet. Moreover, the results from single cancer type may not be representative in other types of cancer, or they may vary among the subtypes of the cancer. To overcome these limitations, we conducted a pan-cancer CNV analysis on TSGs to explore the landscape of CNV features and cross-validate some observations. This study may help us better elucidate the relationship between CNV and gene expression change in this important gene category in cancer.

The curated TSGs from thousands of literatures
To conduct a systematic CNV survey of TSGs, we downloaded all the 1207 curated human TSGs from TSGene database in a plain text format with all the Entrez Gene IDs and official symbols (version 2.0) [2]. In this version of TSGene database, there were 1088 protein-coding and 198 non-coding TSGs. All these TSGs were manually curated from over 9000 PubMed abstracts by us. To annotate TSGs with CNVs, it requires the genomic location for mapping. Therefore, we downloaded the corresponding RefSeq mapping information for TSGs from RefSeq database. We implemented an in-house script to extract all the genomic location information from the completed human genome RefSeq sequences (accession number starting with NC). In total, 1207 TSGs were annotated with accurate genomic locations in GRCH 38.
The pan-cancer CNV data from The Cancer Genome Atlas (TCGA) To explore the CNVs in pan-cancer systematically, we downloaded all the prepared TCGA CNV data with the GRCH 38 genomic coordinates from the Catalogue of Somatic Mutations in Cancer (COSMIC) database (V73). When integrating TCGA data, COSMIC introduced a few thresholds to define the copy number loss and gain. CNG was obtained by the following criteria: (the average genome ploidy < =2.7 AND total DNA segment copy number > =5) OR (average genome ploidy >2.7 AND total DNA segment copy number > =9). Similarly, the criteria for CNL were: (the average genome ploidy < =2.7 AND total DNA segment copy number =0) OR (average genome ploidy >2.7 AND total DNA segment copy number < (average genome ploidy -2.7)). In this study, we followed COSMIC criteria and overlapped all the CNV regions with TSGs using the GRCH 38 coordinates. By intersecting all the CNV gain and loss information to all the 1207 TSGs with GRCH 38 coordinates, we annotated 1170 TSGs with precise gain or loss information. For each cancer type, we calculate the number of samples with CNL and CNG, respectively. Since TSGs are often in loss-of-function in cancer progression, we pulled out those TSGs with higher frequency of CNLs than that with CNGs. Specifically, we set a cut-off of 2 to filter out those TSGs without having at least twice of tumor samples with CNLs as tumor samples with CNGs. This process resulted in 207 TSGs with the evidence of an overall loss of CNVs. These genes were used for the following gene expression analysis.

Gene expression analysis for TSGs with CNL
To check the CNV-correlated gene expression changes on TSGs, we downloaded the TCGA pan-cancer gene expression data from the COSMIC database (V73). In this study, we focused on only those gene expression changes in the matched TCGA samples with TSG CNLs. For the gene expression quantification, COSMIC started from FPKM calculated using trimmed short reads generate by the RNA-Seq platform and the RSEM quantification results from the RNAseq V2 platform. Here, FPKM denotes Fragments Per Kilobase of transcript per Million mapped reads, which is used to indicate the relative expression of a transcript. And RSEM is one of the popular measures for accurate transcript quantification of RNA-Seq data. The average and the standard deviation of expression were computed using those tumor samples that are diploid for each corresponding gene.
The standard Z score was used to characterize whether a TSG is over or under expressed. The Z-score with absolute value 2 was used as the threshold value. The Z-score over 2 was defined as over expression while the Z-score less than −2 represented the decreased gene expression. For those 81 TSGs with CNL-associated gene expression change, we further systematically examined their somatic CNV patterns in pan-cancer of TCGA samples using cBio portal [14].

Sub-network extraction for the TSGs with high frequency CNLs
To explore the relevant biological mechanisms related to TSGs with frequently observed CNLs and consistent gene down-regulation, we extracted a PPI network to connect 81 TSGs with the remaining human genes. To this goal, we started from a non-redundant human interactome extracted from the Pathway Commons database [15,16], containing 3629 proteins and 36,034 PPIs. It is worth noting that this integrated human interactome is based on well-curated pathway databases (HumanCyc, Reactome, and KEGG pathway database [17]). Therefore, those links in the interactome have biological meaning rather than physical interactions. Based on the pathwaybased interactions, we used the similar approach implemented in our previous study to extract a sub-network related to our 81 TSGs [16,18,19]. In this sub-network extraction strategy, all the 81 seed TSGs were overlapped to the human pathway-based interactome. Then, a sub-network with the maximum number of the seed TSGs was formed by connecting each TSG through the shortest path. To characterize the function of the network, we relied on the network topological properties (degree and shortest path) calculated from the network. In practice, we utilized NetworkAnalyzer plugin in Cytoscape 2.8 to compute topological properties in the TSG network [20]. The degree is defined as the number of direct connections of each node with other nodes in the TSG network [21,22]. The network layout was conducted based on Cytoscape 2.8 [20].

Genomic regions with frequent copy number loss in TSGs in multiple cancer types
To systematically survey the somatic CNVs in TSGs, our pipeline started with a list of 1207 human TSGs from the TSGene 2.0 database [2,23] (Fig. 1). These genes have multiple lines of evidence in literature and other data, and have been manually curated. To provide an unbiased global view of CNVs in major types of cancer, we overlapped all these 1207 TSGs with the somatic CNVs identified from TCGA, which is the largest cancer genomics data source. This resulted in a list of nonredundant 1170 TSGs, which are annotated with CNVs (Additional file 1: Table S1). However, the majority CNVs are not informative due to the lack of matched control tissue. In this study, we only focused on those TSGs with precise gain or loss data using the normal tissue as control (see Methods). By counting the number of samples with gain or loss of gene copies, we set a threshold to prioritize most informative CNV events for TSGs. Since TSGs typically play their roles by loss-offunction, we used only those TSGs that tend to have copy number loss. To this end, we required CNVs were observed in at least as twice the samples with copy number loss as those with copy number gain. The process resulted in a total of 207 TSGs. We named them as TSGs with CNL in cancer, and used them for the follow up functional enrichment and integrative analyses. The list is provided in Additional file 2: Table S2.
We performed functional enrichment analysis of these 207 genes using Gene Ontology (GO) terms as functional units. Figure 2 displays the main features of GOrelated functional features, and their clusters. Overall, they are enriched with cell proliferation, apoptosis, cell Interestingly, the 207 TSGs could highly cluster into fifteen chromosome regions. All these regions had the corrected enrichment p-values less than 0.01 (Table 1). Among the 15 regions, eight could be further clustered into three genomics locations: 3p21, 8p21-22, and 17p13.1-3. In 3p21, we found four enriched cytobands with a total of 27 TSGs.  CNLs in TSGs harbour not only well-known protein-coding TSGs such as TP53, but also six microRNAs (MIRLET7G, MIR135A1, MIR195, MIR320A, MIR383, and MIR497). By overlapping to TSGene database, we found all the six micro-RNAs are tumor suppressor microRNAs. Collectively, our systematic examination on CNL in TSG cluster regions provides precise information of such CNL in multiple cancers. The results may be useful for further studying the similar or different roles of CNL in differential cancer types as well as cancer heterogeneity.

Correlation of CNL with gene expression decrease using the matched tumor samples
Through incorporating the gene expression change of the TCGA samples with the CNL on TSGs, we examined the correlation between CNL and TSG gene down-regulation. We utilized the Z-score to assess whether a TSG is upregulated or down-regulated in specific TCGA samples.
Here, Z-score is a transformation of the p value calculated by the formula as below: Where μ represents the mean expression of a gene across multiple TCGA samples; σ represents the standard deviation of the expression scores of the gene in TCGA samples. Specifically, we used the Z-score threshold value −2 to identify down-regulated TSGs in specific TCGA samples. After examining the same TCGA tumor samples for both expression and CNV loss, we found 81 TSGs that had concordant decreased gene expression and loss of gene copy numbers in tumor samples (Additional file 3: Table S3). The functional enrichment analyses revealed that the 81 TSGs are mainly associated with cancerrelated pathways such as cell cycle (adjusted P-value = 1.15E-6) (Additional file 4: Table S4). Interestingly, they are also related to a number of cancer-related phenotypes such as hamartomatous polyposis (adjusted P-value = 1.05E-6) and intussusception (adjusted P-value = 3.98E-6). The CNV mutational patterns across multiple cancers are plotted in Fig. 2. In terms of their CNVs, these 81 TSGs are highly mutated. For example, in TCGA esophageal carcinoma cohort, there were 142 cases (77.2 %) that had at least one gene with copy number change (Additional file 5: Table S5). More than 50 % of the esophageal carcinoma patients had at least one deletion event in one of the 81 TSGs. The similar prevalence of copy number alteration (>60 % cases, including both CNLs and CNGs) was found in other 11 cancer datasets from 9 cancer types: metastatic prostate cancer, malignant peripheral nerve sheath tumor, sarcoma, ovarian serous cystadenocarcinoma, lymphoid neoplasm diffuse large B-cell lymphoma, bladder urothelial carcinoma, glioblastoma multiforme, uterine carcinosarcoma, and stomach adenocarcinoma. However, specific to CNLs, only the glioblastoma multiforme (GBM) cohort had over 60 % patients with CNLs (Fig. 3, the blue bar represent the CNLs in different cancer types). Although the other cancer cohorts also possess CNLs in the majority of the affected patients, a small portion of patients had CNGs rather than CNLs. This may imply the importance of the 81 TSGs in cancer progression of GBM via the massive copy number losses.
To further explore the potential CNL-induced gene expression change, we specifically checked four TSGs with the most frequently observed CNLs; all these genes were observed in more than 50 tumor samples. As shown in Fig. 4, TSG MTAP had CNL in more than 40 % cases in TCGA GBM cohort. The TSG MCPH1 was deleted in more than 14 % patients in a prostate cancer dataset. PTEN showed similar frequent CNV loss in prostate cancer samples. The gene loss of SMAD4  was prominent in the pancreatic cancer. Furthermore, we found consistent, low gene expression of these four genes in the tumor samples with CNL (Fig. 5). The results suggested that CNL might induce gene expression decrease as a common mechanism in cancer.

A connected biological map of TSGs with concordant CNL and decreased gene expression
To further investigate the common functional regulation and enhance our understanding of the cellular events of these 81 TSGs with decreased expression and CNL, we conducted a pathway-based protein-protein interaction (PPI) analysis using the pathway annotation data from Pathway Commons database [15]. These reliable interactions are based on evidence from known biological pathways such as the KEGG and Reactome pathway databases. Therefore, this pathway-based interactome is useful for the pathway reconstruction because such pathways may avoid high-level noises, sparseness, and highly skewed degree distribution, which are often observed in physical interaction-based PPI networks. By applying the Klein-Ravi algorithm for module searching [18], we first mapped the 81 TSGs to the human pathway interactome. Then, a subnetwork was extracted, allowing to connect as many as the 81 TSGs as possible. The reconstructed TSG network contains 54 nodes and 56 links (Fig. 6a). Among the 54 nodes, 35 are from the 81 TSGs. The remaining 19 are the linker genes to bridge those TSGs so that a fully connected map could be built. The degrees of the nodes in this reconstructed map potentially follow a power law distribution P(k)~k -b , where P(k) is the probability that a node has connections with other k nodes and b is an exponent with an estimated value of 1.4 (Fig. 6b). Moreover, most of the genes in the network can be connected by three to five steps on average, as measure by the shortest path (Fig. 6c). These two topological features (degree and shortest path) indicated that most TSGs in this map were closely connected with high modularity. Considering the tight  (4). It is interestingly that TP53 is the node with most connection in the network. SMAD4 is also in the centre of the map with six connections. In summary, our reconstructed map for the 81 TSGs with potential CNL-driven gene down-regulation contains some interesting features such as the TSGs with potential CNL-induced dysregulation.

Discussion
This study revealed some important somatic mutational features of TSGs in multiple cancer types, particularly with respect to the CNVs and their effects on gene expression. Since the loss-of-function is the typical mechanism that TSGs involve in cancer initiation and progression, a large-scale change of gene copy number may induce gene expression alteration. In this scenario, a critical regulation change is that CNL in a TSG leads to the over-expression of its guardian genes. Although previous studies have explored the balance of germline CNVs and gene expression, there still lack of direct links of somatic CNVs on gene expression dosage compensation. In this study, we only focused on the concordant patterns between CNL and gene down-regulation because TSGs often play functions in a manner of lossof-function. Our results only provided the insight of correlation between gene dosage and somatic CNV; more systematic examination of the expression quantitative trait locus may provide more depth on the relationship between CNV and gene expression. This study was mainly based on the TCGA genomic data. The cohort size of some cancer types is relatively small (e.g.,~100 samples). A small sample size may filter out many low-frequency CNVs. In addition, TCGA mainly relies on the CGH array between normal and tumor tissues to characterize CNVs, which may lose signals outside of pre-designed probes. These undetected CNVs may also contribute to TSGs functionality on cancer progression. Another limitation in this study is that we only incorporate the protein-coding gene expression, not including non-coding gene expression. The further integration of large-scale CNV data and gene expression of noncoding RNA (microRNA and long non-coding RNA) may provide new insight into the roles of the non-coding TSGs.
In this study, we made an effort to construct a biological map for the genes with consistent CNL and gene down-regulation in cancer. Although the majority of genes in the reconstructed map are linked with each other, the size of the network is relatively small. Therefore, it has limited power to explore the overall network functions based on the topological features. For example, we found the degree of the network might follow the power-law distribution. This feature is different from the Fig. 6 Reconstructed interaction map for the 81 tumor suppressor genes (TSGs) with decreased gene expression potentially induced by copy number losses (CNLs). a The network includes 35 genes (in yellow) from the 81 TSGs with decreased expression potentially induced by CNLs and 19 linker genes (in blue) that connect these 35 TSGs. The node size reflects the number of connection. A bigger size means more connections associated with the gene. b The degree distribution of the nodes (genes) in the network (a). c The distribution of the shortest path length whole human PPI network, in which the majority nodes (genes) are sparsely connected with exponent b as 2.9 [26]. It is not sufficient to impose the scale-free properties on this constructed small network due to the small size. For the same reason, it is not good for us to define the hub nodes based on the high connectivity. Nevertheless, the nodes with multiple connections in our network should provide some clues for the common CNL events related to gene down-regulation. The further experimental validation may provide more insight into the potential molecular mechanisms for those CNL events that were detected in multiple cancers.