Driver gene classification reveals a substantial overrepresentation of tumor suppressors among very large chromatin-regulating proteins

Compiling a comprehensive list of cancer driver genes is imperative for oncology diagnostics and drug development. While driver genes are typically discovered by analysis of tumor genomes, infrequently mutated driver genes often evade detection due to limited sample sizes. Here, we address sample size limitations by integrating tumor genomics data with a wide spectrum of gene-specific properties to search for rare drivers, functionally classify them, and detect features characteristic of driver genes. We show that our approach, CAnceR geNe similarity-based Annotator and Finder (CARNAF), enables detection of potentially novel drivers that eluded over a dozen pan-cancer/multi-tumor type studies. In particular, feature analysis reveals a highly concentrated pool of known and putative tumor suppressors among the <1% of genes that encode very large, chromatin-regulating proteins. Thus, our study highlights the need for deeper characterization of very large, epigenetic regulators in the context of cancer causality.


1) Functional driver gene classification
CARNAF was also able to distinguish TSGs from OGs without using tumor features at relatively high accuracy (area under curve (AUC) = 0.79± 0.04, std. computed using 10 6 bootstrap iterations) ( Supplementary Fig. 2). Comparatively, classification using only tumor genomics features (AUC = 0.94 ± 0.02) was significantly better (P = 4.46×10 -4 , computed by a permutation test with 10 6 iterations). The combined classifier was expectedly in-between the latter two (AUC = 0.90 ± 0.02). Similar results were seen when using the less biased feature set which omits gene ontology features ( Supplementary Fig. 4).
The strong performance using mutation features is not surprising as evaluation was performed using high confidence drivers, most of which are frequently mutated. Such TSGs harbor frequent loss of function mutations and deletions, in contrast to mutation clustering and gene amplifications for OGs. However, this performance cannot be generalized to infrequently mutated drivers due to weaker mutation patterns.

2) Curation of top ranked genes -TSG/OG classification
In contrast to driver gene detection recall rates, functional classification by CARNAF and the 15 data sources used in the study was somewhat less accurate compared to the literature. For CARNAF, functional predictions for 5 out of the 12 genes that had supporting functional evidence, TSG or OG, in the literature were contradictory. Using functional annotations from the 15 multi-tumor type sources used in the study, only 2 out of the 8 genes had functional predictions that were discordant. These results illustrate a degree of accuracy, but have a lower performance than the TSG versus OG classification among the high confidence driver genes ( Supplementary Fig. 2), although this is not surprising as infrequently mutated drivers generally display weaker mutation patterns and cancer-like features, and additionally the sample size is small. In addition, various driver genes can function as both TSGs and OGs 1-3 , a factor that also can lower true classification accuracy.

3) Curation of 4 genes ranked high by CARNAF that eluded the reference studies used in this work
We curated the top 15 CARNAF driver genes using all features (excluding the high confidence drivers) and found 4 genes (SIRT1, TGFBR1, CDK1, and SMAD1) with cancer-related evidence that were not present in the 15 multi-tumor type studies to which we compared (Supplementary Tables 4 and 8).
Of these 4 genes, SIRT1 and TGFBR1 also have supporting genomic alterations 4,5 . SIRT1 encodes an NAD-dependent histone deacetylase that links intracellular energetics with transcriptional regulation and may both suppress and promote tumors depending on the context 6 , with loss-offunction mutations and allelic loss being characterized in tumor cell lines 4 . TFGBR1 is a serine/threonine kinase receptor that mediates TGF-β signaling, with a meta-analysis of 35 studies proposing that two specific mutations are associated with cancer susceptibility 5 .
The other two genes may be infrequently mutated drivers. CDK1 encodes a serine/threonine kinase that controls cell cycle progression. Although little direct mutation evidence exists, CDK1 has been associated with chemotherapy resistance 7,8 , with high activity being correlated with poor colorectal cancer prognosis 9,10 . Finally, SMAD1 mediates bone morphogenetic protein signaling and is involved in various cancer-relevant processes. Phosphorylated, hyperactive SMAD1 may promote epithelial-mesenchymal transition in lung cancer cell lines 11 and elevated phosphorylation has been observed in breast cancer metastases 12 .

4) Evaluation of potential origins of large TSG protein size in relation to OGs
We evaluated several hypotheses why TSGs frequently encode very large proteins, particularly in comparison to OGs, to ensure it is not a known artifact of previously described findings.
First, as large genes are prone to high mutation rates due to low expression and late DNA replication, we verified that there is little association between TSG protein size and replication time (R = 0.06, P = 0.58; log scale) or expression (R = 0.04, P = 0.70; log scale) 13 ( Supplementary  Fig. 6).
Second, it has been proposed that driver genes encode large proteins given their high connectivity in protein networks 14 , as highly connected proteins tend to be large ( Supplementary Fig. 7). However, TSG protein size was not significantly correlated with either the number of protein-protein interactions or with presence in shortest network paths after accounting for testing of more than one hypothesis ( Supplementary Fig. 8). Moreover, connectivity values were similar between TSGs and OGs, despite the former encoding a distribution of larger proteins ( Supplementary Fig. 9).
Third, there is often reduced selective pressure against deletions of large genes which can lead them to be mistaken as TSGs. This is because large genes, which often encode large proteins, tend to reside in regions of low gene density 15 . However, we did not find significant association between TSG protein size and deletion frequency ( Supplementary Fig. 10), and there were also no differences between TSG size among different modes of inactivation ( Supplementary Fig. 11).
Fourth, essential genes and non-duplicate genes are known to encode large proteins 16,17 . However, differences between TSGs and OGs were minor and the abundance of duplicate genes was small in both driver gene classes (Supplementary Figs. 12 and 13).
Finally, we examined gene ontology categories in which TSGs reside and found that no biological process, function, localization or expression in particular tissues could explain TSG size (Supplementary Tables 12 and 13). Thus, the tendency of TSGs to encode very large proteins is not well explained by the hypotheses above.

5) Assessment of gene enrichment among very large, chromosome organization genes -HotNet2 evaluation
Similar to CARNAF, HotNet2 is a computational-based method aimed at detecting rare driver genes. Thus, we specifically analyzed the significantly mutated genes found by HotNet2 18 , a method aimed at detecting rare drivers, to evaluate if they also contain an enrichment of large, chromosome organization proteins.
Of the 104 HotNet2 genes that were not in our high confidence driver gene set, 22 (21.2%) were among the top 5% largest proteins in the genome, of which 8 (7.7%) were also chromosome organization proteins, an 18-fold enrichment (P = 3.22×10 -8 , hypergeometric test) compared to the remaining genome-wide concentration of HotNet2 genes. This further supports the presence of uncharacterized driver genes in this focused gene pool. Perhaps not surprisingly given the different methodologies, there was little overlap between the top ranked genes of CARNAF and HotNet2 ( Supplementary Fig. 17). Similarly, the majority (54%) of the 104 HotNet2 genes were also not present among the 3458 genes in the remaining 14 data sources used in the study to build the high confidence drivers, medium confidence driver genes, and background gene sets (Supplementary Table 4a).

6) Effect of down-sampling on gene rankings
The estimated posterior probability of being a TSG, OG, or BG is computed by an ensemble of classifiers. The training set for each classifier was down-sampled such that it contains the same number of driver (n=165) and background genes (n=165). This sampling does not preserve the ratio of driver genes to background genes. Below, we show that this procedure does not alter the ranking of driver genes. For ease of presentation, we consider a single classifier, with the generalization to an ensemble being relatively straightforward.
Denote as the (estimated) driver gene label of gene (as estimated by a classifier when excluding this gene from the training set), where if gene is a driver gene, and otherwise, denote as the features vector of gene , and denote as the posterior probability of gene being a driver gene. For each gene , we introduce a binary selection variable , such that if gene was included in the training set of the classifier. Since the classifier training set consists only of the subset of selected genes, the actual quantity estimated by the classifier when performing prediction for gene is .
Below we show that whenever . This indicates that driver genes ranking remains the same regardless of down-sampling.
The central premise for the above is that is conditionally independent of given , which stems from the fact that the down-sampling procedure uses only the observed labels to decide if a gene is to be included in the training set. By invoking Bayes rule and the conditional independence assumption, we obtain Assuming without loss of generality that gene is ranked higher than gene , we have . By using Equation 1, we obtain The second term of the numerator is equal under both sides. After dividing both sides by this term and rearranging, we obtain This completes the derivation. The derivation remains the same when considering an ensemble of classifiers with the exception that the estimated probability is now computed by many different classifiers and then averaged.
The derivation above holds for a binary classifier, demonstrating that driver gene rankings are preserved under down-sampling. However, the derivation does not hold for TSG and OG rankings as the classifier is not binary in this scenario. Rather, a stronger set of assumptions is required to guarantee rank preservation. In the presence of more than two classes, gene remains ranked above gene when the following inequality holds: where is a shorthand notation for . This inequality is likely to hold for pairs of genes with similar posterior probability distributions.

7) Assessing sensitivity to sample bias towards frequently altered driver genes
Infrequently altered driver genes are expected to often have weaker cancer-driving phenotypes than the well-known drivers used in the training set. As such, infrequent drivers have weaker tumor genomic signatures and potentially weaker cancer-associated non-genomic signatures, although exceptions are expected.
We applied a sample bias correction technique to evaluate the impact of the overrepresentation of frequently altered drivers in the high confidence set which was used for training. Although several sample bias correction approaches have been previously proposed [19][20][21] , they are not suitable for a PU learning setting wherein there is no clear distinction between the training and testing set. Instead, we propose a domain specific probabilistic model for this task.
Our probabilistic model treats the true (unknown) cancer role of a gene as a latent random variable and its training set label, which represents its currently known role, as an observed random variable. Below, we show that the classifier estimate of being a driver gene serves as a lower bound for the true estimated probability of being a driver gene. An upper bound on the estimated probability of being a true driver gene can also be derived, as detailed below. We formalize the model as follows. For each gene , we denote as the vector of features of gene . We further denote two random variables and , which indicate the true and training set label of the gene, respectively. Formally, , where the values indicate a tumor suppressor gene, an oncogene, and a background gene, respectively. We denote this model as the trinary model. It is also possible to define a binary model with only two values, , where denotes a driver gene, such that the statements , are equivalent to the statements , respectively. The following derivation applies to both the trinary and binary models. In the following, we do not explicitly condition on the selection variable defined in the previous section. Conditioning on this variable does not affect the derivations, because this variable is conditionally independent of all other variables given , thus it does not affect the independence structures described below.
We are interested in ranking genes according to the posterior probability being a TSG, OG or DG, for . As this quantity cannot be estimated directly, we derive lower and upper bounds to be used for gene ranking. Assuming all training set driver gene labels are correct, for every and for every features vector , the following inequalities hold: Here, consists of the tumor genomic features of gene i, is the complementary vector of all other features, and , consist of the corresponding entries of and , respectively. A more detailed definition is given below. To find the vector yielding the tightest possible upper bound in Equation 5, we iterate over all vectors of training set driver genes (high confidence set) and use CARNAF to estimate the upper bound.
The lower bound reflects the fact that the training set driver genes are a subset of all existing drivers. The upper bound consists of a sample bias correction term, which increases the estimated probability of less frequently mutated genes, since such genes were less likely to be included among the training set driver genes ( Supplementary Fig. 5). The denominator of the right hand side of Equation 5 encodes the probability for a synthetic gene, consisting of the tumor genomics features of gene i along with the non-tumor genomics features of a different gene. The upper bound in Equation 5 therefore assesses to what degree the probability estimated by the classifier for gene i may have been larger had the gene contained different non-genomics features. If the upper bound is high for every tested vector of non-genomic features, the gene may be underestimated due to weak tumor genomics features. In contrast, if the upper bound is low for at least one vector , it suggests that the tumor genomics features of this gene do not substantially hinder its detection and the resulting correction will be small.

7.1) Lower Bound
The lower bound of Equation 5 is derived under the assumptions that the high confidence driver gene set used for training does not contain non-driver genes and that all 165 high confidence genes are correctly identified as TSGs or OGs. These assumptions are encoded as the following pair of equations.
Using the assumptions above, the law of total probability, and restricting , we have We conclude that is lower bounded by whenever . This completes the derivation of the lower bound of Equation 5.

7.2) Upper bound
To derive the upper bound of Equation 5, we first divide the vector into two non-overlapping vectors, , such that consists of the tumor genomics features and is the complementary vector of all other features. Thus, we obtain Equation 9 encodes the assertion that is conditionally independent of given and . This results from the approximation that the training set driver genes were discovered using tumor genomics features alone, and validated to ensure that the findings are genuine. Therefore, the non-tumor genomics features did not directly participate in the creation of the label .

7.3) Assessing the sensitivity to sampling bias
We used Equation 5 to assess the sensitivity of CARNAF to sampling bias towards frequently altered driver genes by ranking genes according to the upper bound. Equation 12 demonstrates that this bound is likely to be tight for top ranked genes since the numerator of Equation 12 for such genes is large. Furthermore, the denominator is always larger than the numerator, and additionally the last element in Equation 12 is larger than the denominator owing to Equation 8. Consequently, all three quantities are large for top ranked genes, and the element on the right hand side is expected to be close to 1, yielding a tight bound.
Sensitivity to sampling bias was measured by measuring the overlap between top ranked gene lists, with and without applying the sampling bias correction (Supplementary Table 7). For completeness, we also report the overlap between these lists and the list of top ranked genes when using only non-tumor genomics features, as the latter list is expected to be less sensitive to sampling bias. The results show that all three methods yield similar lists of top ranked genes, indicating that sampling bias has a minor impact on top ranked genes.         (a) Depiction of the fraction of TSGs, OGs, and BGs (background genes) that reside in one of three categories as predicted by mouse homology 16 : essential genes, genes whose loss results in a measurable phenotype but are not essential, and genes with no phenotype. (b) Essential genes encode large proteins compared to non-essential genes. Outliers not shown due to large abundance and to reduce y-axis scale range. Comparison of gene duplication rates shows that driver genes have low duplication rates, but the difference between TSGs and OGs is not substantial. P values are derived using Fisher's exact test.    Fig. 4a, with differences in y-axis scaling. The abundance of high confidence OGs and CARNAF predicted OGs (excluding high confidence drivers) encoding very large (top 5% in genome) and small (the remaining 95%) proteins with respect to participation in chromosome organization processes. The top 81 CARNAF OG predictions were selected to match the abundance of OGs in the high confidence set (n=81). CARNAF predictions that overlap with the medium and low confidence driver gene sets are shown. Chrsm refers to the gene ontology biological process annotation chromosome organization.  Table Number  Title  Location  Supplementary  Table 1 Features used in CARNAF  Excel file   Supplementary  Table 2 Features used in study, but not in CARNAF Excel file Supplementary  Table 3 Driver gene data sources used in study Supplementary document Supplementary Table 4 Genes present in 15 multi-tumor type data sources Excel file Supplementary  Table 5 CARNAF gene ranks Excel file Supplementary Table 6 Evaluating the sensitivity of CARNAF to specific training set choices Supplementary document Supplementary Table 7 Evaluating the effect of sampling bias on CARNAF gene rankings Supplementary document Supplementary  Table 8 Curation of top 15 CARNAF driver gene predictions not included in training set drivers Excel file Supplementary  Table 9   Feature importance  Supplementary  document  Supplementary  Table 10 Feature association for high confidence TSG vs high confidence OG labels Supplementary  document  Supplementary  Table 11 Feature association for high confidence driver genes vs background genes (BGs) Supplementary document Supplementary  Table 12 Comparison of TSG and OG protein size versus non-TSG/OG protein size for GO slim categories and according to expressed tissues Excel file Supplementary  Table 13 Comparison of TSG protein size versus non-TSG protein size for lower level chromosome organization GO categories Excel file Supplementary Table 3. Driver gene data sources used in study. 15 multi-tumor type data sources were compiled to aid in CARNAF training and result evaluation. Gene label annotations included both mutation and copy number based driver genes. Several sources included functional annotation as TSGs and OGs. Translocations were not considered.

Data Source
Gene labels Description TUSON explorer 1 Mutation-based TSGs and OGs A statistical approach that uses sequenced tumors to classify genes as TSGs or OGs using four mutation pattern features. Vogelstein et al. 2 Mutation and copy number alteration based TSGs and OGs A highly-cited cancer genomics review that classified well known driver genes as TSGs and OGs using manual curation and the 20/20 rule. The rule classifies drivers with at least 20% inactivating mutations as TSGs, and those with at least 20% of mutations clustered at specific amino acids as OGs. The rule has a few exceptions. Zack et al. 3 Amplified and deleted genes in cancer Statistical analysis of small copy number alterations from a set of approximately 5,000 tumors. The authors manually curated amplified and deleted genes for known evidence. CGC 4 Mutated and copy number alteration based driver genes A highly-referenced manual curation of driver genes in cancer. Version 69 was used.
Garnett et al. 5 Expression-based gene biomarkers to drugs A large cell line-based screen to discover cancer drug biomarkers that tested 130 drugs across ~600 cell lines. Expression was measured for 14,500 genes. Genes with absolute effect >3 were selected from Supplementary Table 3. Santarius et al. 6 Amplificationbased driver genes A rigorous study using clinical, experimental, and biological knowledge evidence to identify amplification and overexpression based driver genes in cancer. Lawrence et al. 7 Mutated cancer genes Statistical analysis for significantly mutated cancer genes using approximately 5,000 tumors. The method corrects for known factors that affect mutation rates, notably expression levels and DNA replication time. Genes were taken from the union the high-confidence list (n=219). Genes marked as 'discussed' were those in which the authors directly or indirectly suggested a TSG or OG role in the main text. 2020 genome-wide implementation

Mutation-based TSGs and OGs
Genome-wide implementation by the authors of the 20/20 rule as described previously 2 using Cosmic tumor data (v69). The original 20/20 rule was applied on 138 driver genes. Tag DB 8 TSGs and OGs A text mining approach to identify TSGs and OGs. TSGene 9 TSGs A literature curation effort to compile a large list of TSGs. ActiveDriver 10 Mutated cancer genes A method that identifies genes enriched in phosphorylation site-associated mutations. Genes were retrieved from method implementation on 3,205 tumors across 12 cancer types 11 . OncoDriveFM 12 Mutated cancer genes A method that detects genes with a bias towards high functional mutations as predicted using three independent tools. Genes were retrieved from method implementation on 3,205 tumors across 12 cancer types 11 . OncoDriveClust 13 Mutated cancer genes A method that identifies genes that contain high regional clustering of mutations. Genes were retrieved from method implementation on 3,205 tumors across 12 cancer types 11 . MuSiC 14 Mutated cancer genes A study that searched for genes mutated more frequently than expected. Genes were retrieved from method implementation on 3,205 tumors across 12 cancer types 11 . HotNet2 15 Mutated cancer genes A study that used protein-protein network knowledge in conjunction with mutation data from 3,000 tumors to search for infrequently mutated driver genes.