Computational methods for cancer driver discovery: A survey

Identifying the genes that drive cancer is critically important for directing treatment. Accordingly, multiple computational tools have been developed to facilitate this task. Because these tools employ different methods and consider different data, and because the field is evolving rapidly, selecting an appropriate tool for cancer driver discovery is not straightforward. This survey seeks to provide a comprehensive review of the different computational methods for discovering cancer drivers. We categorise the methods into three groups: methods for single driver identification, methods for driver module identification, and methods for identifying personalised cancer drivers. In addition to providing a “one-stop” reference of these methods, we evaluate and compare their performance to inform readers about the different capabilities of the methods in identifying biologically significant cancer drivers. The biologically relevant information identified by these tools can be seen through the enrichment of discovered cancer drivers in GO biological processes and KEGG pathways and through our identification of a small cancer-driver cohort that is capable of stratifying patient survival.

Network-based methods. Network-based methods evaluate the role of genes in gene regulatory networks using different techniques and combine this information with the genes' mutation data to predict cancer drivers.
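As a purely illustrative sketch of this idea, and not the algorithm of any specific tool covered in this survey, the snippet below scores each gene by multiplying a simple network measure (normalised out-degree in a directed regulatory network) by its mutation frequency. The function names, the choice of out-degree, the multiplicative combination, and the toy gene labels A to D are all assumptions made for illustration only.

```python
def out_degree(edges):
    """Count outgoing regulatory edges for every gene in the network."""
    degree = {}
    for regulator, target in edges:
        degree[regulator] = degree.get(regulator, 0) + 1
        degree.setdefault(target, 0)  # targets with no outgoing edges get 0
    return degree


def rank_candidates(edges, mutation_freq):
    """Rank genes by (normalised out-degree) x (mutation frequency).

    This combination rule is a hypothetical stand-in for the more
    elaborate techniques used by real network-based methods.
    """
    degree = out_degree(edges)
    max_degree = max(degree.values(), default=0) or 1  # avoid division by zero
    scores = {
        gene: (d / max_degree) * mutation_freq.get(gene, 0.0)
        for gene, d in degree.items()
    }
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy regulatory network (regulator -> target) and toy mutation frequencies.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A")]
toy_mutation_freq = {"A": 0.2, "B": 0.5, "C": 0.9, "D": 0.1}

toy_ranking = rank_candidates(toy_edges, toy_mutation_freq)
```

In this toy example gene C is frequently mutated but regulates nothing, so it ranks last, which illustrates why combining network role with mutation data can give a different picture from mutation frequency alone.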

Resources for cancer driver research
There are two types of resources for developing computational methods for cancer driver discovery: resources for method development, e.g. gene expression data, network data, and mutation data; and resources for gene annotations, e.g. a database with partial ground truth for evaluating or assessing the findings of a computational method. The resources are summarised in Table 1. Regarding resources for method development, several databases have been developed from cancer sequencing projects, and they provide rich data used in cancer driver identification methods. TCGA [1] is a significant project in this area. The TCGA project profiles and analyses human tumours to uncover molecular aberrations at the DNA, RNA, protein, and epigenetic levels [1]. TCGA data can be accessed through the Genomic Data Commons (GDC) data portal [13]. The ICGC data portal is another resource for cancer genomics data, and it contains data on genomic abnormalities in 50 cancer types [2]. Another data portal for cancer genomics is cBioPortal [3], which provides a web interface for accessing cancer genomic datasets, as well as for analysing and visualising the data online.
Other resources can also be used for cancer driver discovery, such as Cancer3D [5], the Cancer Cell Line Encyclopedia (CCLE) [6], and the COSMIC database [7]. Cancer3D is a database that focuses on the influence of mutations on protein structure, and it provides information that lets users analyse distribution patterns of mutations and their relationship with changes in drug activity [5]. It contains mutations of more than 14,700 proteins, which are mapped to over 24,300 proteins in the Protein Data Bank [4]. The CCLE includes SNVs, CNAs, and gene expression [6]. The COSMIC database is a large and comprehensive source for investigating the mutational impact in cancer. It contains records of cancer mutations, including both manually curated expert data and data from sequencing projects such as TCGA or ICGC [7,14]. It has more than two million coding point mutations and over six million non-coding mutations [7].
As for resources for gene annotations, several databases can currently be used, such as the CGC [8] (in COSMIC). The CGC contains driver genes which are manually curated or predicted by multiple methods. Besides the CGC, several other sources are available for gene annotations. The Atlas of Genetics and Cytogenetics in Oncology and Haematology (AGCOH) is another source for this purpose [9]. It comprises around 1,500 cancer genes, which are the merged results of numerous collaborative projects [9]. The Network of Cancer Genes (NCG) is an online database of cancer genes with over 500 known cancer genes and more than 1,000 candidate cancer genes [10]. Known cancer genes are genes which have already been confirmed through experiments, while candidate cancer genes are those predicted using statistical methods. One more database of disease genes is the Drug-Gene Interaction database (DGIdb) [11]. It contains not only cancer drivers but also information about drugs and drug-gene interactions [11].
At present, while coding drivers are well established in cancer research, non-coding drivers are not. In [12], the authors introduced OncomiR, a resource for investigating miRNA dysregulation in cancer through a web interface. It performs statistical analyses based on RNA-seq, miRNA-seq, and clinical information from TCGA to discover miRNAs which are related to cancer progression. Although this database may not serve as a ground truth for validating miRNA cancer drivers, it can be used to explore miRNA dysregulation when detecting miRNA cancer drivers. Currently, validating non-coding cancer drivers requires examining the literature manually [15,16].

Driver genes predicted by different methods
There are 63 breast cancer drivers predicted by at least two of the five methods (DriverML, ActiveDriver, DriverNet, MutSigCV, and OncodriveFM). The details of these 63 drivers are presented in Table 2. We also evaluate the mutation frequency of these driver genes using the breast cancer mutation data downloaded from TCGA. We select only somatic mutations which are functional based on the variant classification of mutations, such as splice site, in frame del, and frame shift del. To validate these driver genes, we use the CGC from the COSMIC database [7] as a gold standard. The CGC is a commonly used cancer gene database for validating cancer drivers predicted by computational methods in cancer research. It can be seen from the table that most of the predicted driver genes are mutated genes. In particular, the 11 driver genes which are predicted by at least three methods have a high mutation frequency. In addition, all 11 of these driver genes are in the CGC. Although computational methods may never completely replace wet laboratory experiments in biological research, the novel drivers predicted by these methods can be used as candidates for further wet laboratory experiments to confirm their roles in cancer development. Some potential breast cancer drivers discovered by these methods include RBMX, NCOA3, and ZFP36L1. There is evidence of a positive correlation between the expression of RBMX and the proapoptotic Bax gene in breast cancer patients [17]. NCOA3 is also known to regulate PERK-eIF2α-ATF4 signalling [18] and to activate estrogen receptor α-mediated transactivation of PLAC1 in breast cancer [19]. ZFP36L1 has been shown to suppress HIF1α and Cyclin D1 in breast cancer [20].
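The three steps described above (keeping genes predicted by at least two methods, computing mutation frequency from functional somatic mutations, and checking overlap with a reference gene set such as the CGC) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used to produce Table 2: the function names are hypothetical, the gene labels G1 to G5 and sample labels S1 to S4 are toy placeholders, and the variant classification strings are assumed to follow the MAF-style spellings of the three classes named in the text, which are examples rather than a complete list.

```python
from collections import Counter

# Assumed MAF-style spellings of the example classes named in the text;
# the full set of "functional" classes is not specified here.
FUNCTIONAL_CLASSES = {"Splice_Site", "In_Frame_Del", "Frame_Shift_Del"}


def consensus_drivers(predictions, min_methods=2):
    """Map each gene to the number of methods predicting it,
    keeping only genes predicted by at least `min_methods` methods."""
    counts = Counter(
        gene for genes in predictions.values() for gene in set(genes)
    )
    return {gene: n for gene, n in counts.items() if n >= min_methods}


def mutation_frequency(mutations, n_samples, functional=FUNCTIONAL_CLASSES):
    """Fraction of samples carrying at least one functional mutation per gene.

    `mutations` is an iterable of (sample_id, gene, variant_classification).
    """
    samples_by_gene = {}
    for sample, gene, variant_class in mutations:
        if variant_class in functional:
            samples_by_gene.setdefault(gene, set()).add(sample)
    return {g: len(s) / n_samples for g, s in samples_by_gene.items()}


def fraction_in_reference(genes, reference):
    """Share of `genes` that also appear in a reference set (e.g. the CGC)."""
    genes = set(genes)
    return len(genes & set(reference)) / len(genes) if genes else 0.0


# Toy inputs: method names come from the text, gene lists are placeholders.
toy_predictions = {
    "DriverML": ["G1", "G2", "G3"],
    "ActiveDriver": ["G1", "G2"],
    "DriverNet": ["G1", "G4"],
    "MutSigCV": ["G2", "G5"],
    "OncodriveFM": ["G1"],
}
toy_mutations = [
    ("S1", "G1", "Frame_Shift_Del"),
    ("S2", "G1", "Splice_Site"),
    ("S2", "G1", "Silent"),        # ignored: not in FUNCTIONAL_CLASSES
    ("S3", "G2", "In_Frame_Del"),
    ("S1", "G2", "Silent"),        # ignored: not in FUNCTIONAL_CLASSES
]
toy_reference = {"G1"}             # stand-in for a reference list like the CGC

toy_consensus = consensus_drivers(toy_predictions, min_methods=2)
toy_frequency = mutation_frequency(toy_mutations, n_samples=4)
toy_overlap = fraction_in_reference(toy_consensus, toy_reference)
```

Raising `min_methods` to 3 corresponds to the stricter subset discussed in the text; counting samples rather than individual mutations is one possible definition of mutation frequency and is an assumption of this sketch.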