C3: Consensus Cancer Driver Gene Caller

Next-generation sequencing has allowed identification of millions of somatic mutations in human cancer cells. A key challenge in interpreting cancer genomes is to distinguish drivers of cancer development among available genetic mutations. To address this issue, we present the first web-based application, consensus cancer driver gene caller (C3), to identify the consensus driver genes using six different complementary strategies, i.e., frequency-based, machine learning-based, functional bias-based, clustering-based, statistics model-based, and network-based strategies. This application allows users to specify customized operations when calling driver genes, and provides solid statistical evaluations and interpretable visualizations on the integration results. C3 is implemented in Python and is freely available for public use at http://drivergene.rwebox.com/c3.

Abstract Next-generation sequencing has allowed identification of millions of somatic mutations in human cancer cells. A key challenge in interpreting cancer genomes is to distinguish drivers of cancer development among available genetic mutations. To address this issue, we present the first webbased application, consensus cancer driver gene caller (C 3 ), to identify the consensus driver genes using six different complementary strategies, i.e., frequency-based, machine learning-based, functional bias-based, clustering-based, statistics model-based, and network-based strategies. This application allows users to specify customized operations when calling driver genes, and provides solid statistical evaluations and interpretable visualizations on the integration results. C 3 is implemented in Python and is freely available for public use at http://drivergene.rwebox.com/c3.

Introduction
The continued advancement of next-generation sequencing (NGS) technology has allowed for the sequencing of large sets of cancer samples for somatic mutation discovery [1,2]. However, one of the main challenges in interpreting the cancer genomes is to efficiently distinguish the driver mutations from the passenger mutations. Driver mutations are causally implicated in oncogenes and positively selected along the lineage of cancer development under the specific microenvironment conditions in vivo, whereas passenger mutations do not confer clonal growth advantages and are thus irrelevant to tumor development [3]. To address this issue, various methods have been proposed to identify driver genes based on distinctive assump- Figure 1 Guideline of C 3 web server A. A schematic representation of the C 3 workflow. A cancer sample input into C 3 workflow is analyzed by six cancer driver gene calling strategies (shown on the left), resulting in a consensus driver gene set as the output. Then the driver gene set is evaluated in terms of precision and nDCG before it is annotated based on the reference databases (shown on the right). Finally, the results are visualized by SuperExactTest and Circos. B. Overview of six categories of distinct cancer driver gene calling strategies. C. SuperExactTest plot of the consensus calling results identified by C 3 . The inner rings (green and white blocks) represent six driver gene sets generated by the six strategies in C 3 . Blocks in white and green indicate the absence and presence of the driver gene sets, and each group represents an intersection of prediction results using 2-6 strategies (shown as the green blocks). The size of the intersections is proportionally shown by the heights of the bars on the outer ring. The number of each intersection is shown on the top of the respective bars and the color intensity of the bars represents the significance of the intersections (ÀLog 10 P value). D. Circos plot of potential driver genes identified by C 3 . From the outer to inner circles, the first circle indicates the whole chromosomes across the genome and the second circle with gene symbols indicates the top 100 consensus driver genes identified by C 3 . The six inner colorful circles represent the top 100 results predicted by the individual strategies, respectively. Names of the strategies are provided in the center of the circle. The size of the gene symbol is positively proportional to the rank order of the predicted results, with a larger size indicating the higher rank. BRCA dataset in TCGA was used as an example for analyses shown in panels C and D.
tions and strategies [4][5][6][7][8][9][10][11][12][13][14][15][16]. Intuitively, all these driver gene identification strategies exhibit the biased signals of positive selection exploited by corresponding mechanisms at varied degrees. Several studies have been reported on benchmarking these methods with consensus cancer driver genes derived from individual model [8,17,18]. Collin et al. [8] proposed an evaluation framework to benchmark several existing models based on several measurements including precision, consistency, and mean log fold change (MLFC). Matan et al. [17] also benchmarked the available methods by using measurements such as precision and recall. Eduard et al. [18] classified four subtypes of driver gene calling methods at a subgene resolution. Denis et al. [19] provided the most comprehensive benchmarking of 21 driver gene prediction methods and proposed a Borda-based integration approach ConsensusDriver.
(1) C 3 provides a solid evaluation of the consensus mutation calling results with Top-N-Precision and Top-N-nDCG [20]. (2) C 3 provides an efficient integration strategy to derive the consensus results by Robust Rank Aggregation (RRA) [21] and statistical model-based intersection visualization [22]. (3) Circos plots are presented in C 3 to visualize the consensus mutation calling results [22,23].

Method
General workflow of C 3 C 3 accepts mutation annotation format (MAF) [24] file as input. The MAF file is annotated from variant calling format (VCF) [25] file, which can be acquired by using variant calling tool like Mutect on the NGS data. A schematic representation of the C 3 workflow is shown in Figure 1A. The selected programs, including 20/20+, MutSigCV, OncodriveFM, Onco-driveCLUST, DrGaP, and MUFFINN ( Figure 1A and B; File S1 Part 1), run in the Ubuntu sever 16.04 system. Then all preprocessed input mutation data are processed in C 3 to obtain candidate driver genes list for each strategy separately. We use SuperExactTest model to evaluate the statistical significance of the intersection of individual calling results using all the protein-coding gene as a whole background gene set. In addition, based on each discrepant driver gene list, a rank ensemble method, RobustRankAggreg, is used to obtain a consensus driver gene list. Four databases including the Cancer Gene Census (CGC) [26], Integrative Onco Genomics (IntO-Gen) [10], Network of Cancer Genes (NCG) [27], and Online Mendelian Inheritance in Man (OMIM) [28] are used to annotate the predicted driver genes. Two evaluation measurements, i.e., the Top-N-Precision and Top-N-nDCG, are applied to evaluate the calling performance. Finally, the KEGG [29] pathway and Gene Ontology analyses are also performed on the consensus driver genes for comprehensive annotations.

Performance measurement
Previously, Collin et al. proposed a novel measurement of mean log fold change between the observed and desired theoretical P values [8]. Matan et al. [17] and Eduard et al. [18] applied measurements of precision and recall. Denis et al. also applied precision, recall, and F1 score [19] (File S1 Part 1). In our study, we applied the Top-N-Precision (using CGC data as a reference driver gene set [26]) and Top-N-nDCG (using IntO-Gen as a reference ranking driver gene set [30]) to facilitate the quantitative comparison and evaluation, focusing on the top n performance of the ranking results.

Precision
We evaluated the precision performance among the results acquired by the previous strategies based on the top 100 genes with respect to CGC cancer database through Equation (1). The average precision can measure a general predicting ability of individual methods among the pan-cancer cohort samples. We calculate the precision scores for each of 27 cancer types, and the SUM (precision) represents the sum of respective precision score of 27 cancer types (Equation (2)).

Top-n-precision
Average nDCG ¼ SUMðnDCG of each individual cancer typeÞ Number of cancer types ð9Þ Figure 2 General framework of C 3 web application C 3 web application provides a user-friendly and simple five-step workflow. These include (1) selecting tools used for analysis, (2) choosing a data file to upload from user's own computer (refer to our file format to verify the integrity of the input data), (3) selecting parameters for the selected tools (refer to the help documentation), (4) entering a name of the task (make sure to provide a valid e-mail address), and (5) inquiring and downloading the results with the request ID at ''Recent Request" page. The request ID is sent to the user via e-mail upon the completion of the analysis.
Here, n represents the number of top predicted genes; i represents the rank of predicted genes; CG n represents cumulative weight of top n predicted genes; DCG n represents CG n multiplied by a discount factor 1 log 2 i (i > 1); IDCG n represents a DCG n under the ideal condition , that is, the rank of predicted genes is exactly the same as that in the reference dataset. Top-N-nDCG represents normalized DCG n and measures the ranking performance of predicted genes.
To obtain the Top-N-nDCG, firstly, we download IntOGen cancer driver gene set (URL: https://www.intogen.org/) [31] and assign a weight for each reference driver gene in IntOGen based on their proportion of driver mutation counts [30] (Version 2014.12) calculated according to Equation (3). Specifically, the total number of cancer driver genes in IntOGen is 459. The weights of the predicted driver genes overlapping with the benchmark IntOGen dataset are calculated according to Equation (4). The weights of the predicted genes that are not available at the benchmark IntOGen dataset are set to 0. The Top-N-nDCG can be calculated through Equations (5)-(8) [20].

Rank aggregation
The RRA algorithm [21] is applied to obtain a consensus driver gene list, which aggregates the ranking driver genes predicted by individual tools. Comparing with the original RankAggreg algorithm [32], the RRA algorithm has three advantages: (1) it deals with incomplete rankings, which is common in practice, (2) it performs robustly with tolerance to the data noise, and (3) it is fast to be integrated for interactive data analysis.

Intersection visualization and evaluation with SuperExactTest and Circos
We applied SuperExactTest [22] and Circos [23] to organize our visualization results. The former is a scalable visualization tool to illustrate high-order relationships among multi sets beyond Venn diagrams [33]. It evaluates the overlap of each of tools and presents a circular plot illustrating all possible intersections with statistical methods. The latter visualizes the predicted driver gene sets intuitively ( Figure 1C and D; File S1 Part 5).

Implementation
As Figure 2 shows, C 3 web application accepts MAF [24] file or a modified micro-MAF file (Table S1) as the input. After users select driver gene calling strategies and parameters, C 3 runs as the back-end Ubuntu 16.04 system (with python-2.7, R-3.3.4 and MATLAB Runtime 2014). When the job is successfully  For each cancer type, a higher value indicates a better performance and for each cancer driver gene calling strategy, the larger area means the better performance. nDCG, normalized discounted cumulative gain.

Detailed information of the test datasets
We test the stability of C 3 web application by selecting tumor datasets collected from The Cancer Genome Atlas (TCGA) [2] databases. Initially, the whole dataset includes 34 cancer types with 7724 samples and 729,235 mutations, curated from the published whole-exome sequencing or whole-genome sequencing studies which are also used by TUSON [9] and Collin study [8]. Since some tools (such as MutSigCV and DrGaP) need additional cohort mutation information, we removed 7 cancer types with 290 samples and 5164 mutations through data preprocessing. Finally, we curated 27 cancer types with 7434 samples and 724,071 mutations for the final analysis, which constitute the updated comprehensive test datasets finally for driver gene calling ( Table 1 and File S1 Part 2).

Performance of C 3
We benchmarked the performance of the consensus results comparing with each alternative. As shown in Figure 3, the integration results of C 3 application outperformed other methods evaluated with Top-N-Precision and Top-N-nDCG, revealing its superiority in driver genes prediction (File S1 Part 4). C 3 also helps to identify reliable potential driver genes by SuperExactTest intersection between different driver gene calling strategies with reference to CGC and literature review. Detailed results are shown in Table S2 and Table S3.
In summary, although there exists a high discrepancy among different driver gene identification strategies, the intersection by individual strategies not only identifies the most reliable driver genes, but also helps to find potential novel driver genes that are not well-characterized.

Future developments
Currently C 3 has some limitations and warrants future updates. (1) C 3 is currently deployed on the Ali Cloud server, which requires a lot of memory and space to process the data. Any variant file exceeding 40,000 records may fail when running DrGaP. Since the Random Forest Model 20/20 + occupies too much CPU resources, it also takes a long time (>3 h for sample of 50,000 mutations with 8 cores of Intel Xeon E5-2643 3.3 GHz) to run a whole pipeline of C 3 . Future optimizations are required to accelerate C 3 . (2) Current version of C 3 only supports the GRCH37 reference genome, and a new version of the reference genome such as GRCH38 will be added in the next version. (3) One potential application of C 3 is to identify the target driver genes for drug discovery. However, the computationally predicted drivers should not be over-interpreted without additional experimental evidence.