DGPathinter: a novel model for identifying driver genes via knowledge-driven matrix factorization with prior knowledge from interactome and pathways

Cataloging mutated driver genes that confer a selective growth advantage on tumor cells, and distinguishing them from sporadic passenger mutations, is a critical problem in cancer genomics research. Previous studies have reported that some driver genes are not frequently mutated and cannot be detected as statistically significant, which complicates the identification of driver genes. To address this issue, some existing approaches incorporate prior knowledge from an interactome to detect driver genes that may be dysregulated through their interaction network context. However, altered operation of many pathways has been frequently observed in cancer progression, and prior knowledge from pathways has not been exploited in the driver gene identification task. In this paper, we introduce a driver gene prioritization method called DGPathinter, which is based on a knowledge-based matrix factorization model that incorporates prior knowledge from both the interactome and pathways. When DGPathinter is applied to somatic mutation datasets of three types of cancers and evaluated against known driver genes, its prioritization performance is better than that of the existing interactome-driven methods. The top-ranked genes detected by DGPathinter are also significantly enriched for known driver genes. Moreover, most of the top-scored pathways given by DGPathinter are associated with cancer progression. These results suggest that DGPathinter is a useful tool for identifying potential driver genes.

Although the pathway information is used in these approaches, they are not designed to identify potential driver genes. Meanwhile, the driver gene detection methods mentioned above use only interactome information and not pathway information. Consequently, the knowledge already available from pathways remains an underexploited resource in the identification of potential driver genes, and there is a lack of approaches that can effectively integrate information from both the interactome and pathways as prior knowledge.

In this article, we introduce Driver Gene identification through Pathway and interactome information (DGPathinter), which discovers potential driver genes from mutation data through a knowledge-based matrix factorization framework in which prior knowledge from pathways and the interaction network is efficiently integrated. By maximizing the correlation between the mutation scores of genes and the scores of their related pathways (Chen and Zhang, 2016), we can identify potential driver genes driven by prior knowledge from pathways. At the same time, we use a graph Laplacian technique to incorporate information from an interaction network into the identification of driver genes (Xie et al., 2011). In addition, we use the matrix factorization framework to integrate the information of mutation profiles, interactome and pathways; this framework is capable of factorizing the gene mutation scores from different sets of tumor samples and helps DGPathinter to address the issue of tumor sample heterogeneity (Lee et

Computer Science
DGPathinter can provide highly scored pathways for the investigated tumor samples, while our previous approach cannot. When we apply DGPathinter and three existing interactome-driven methods to three TCGA cancer datasets, the detection results of DGPathinter outperform those of the competing methods. The top-ranked genes detected by DGPathinter are also highly enriched for known driver genes. We further investigate the top-scored pathways yielded by DGPathinter, demonstrating that most of these pathways are also associated with cancer progression. The remainder of the paper is organized as follows: Section 2 introduces the rationale and detailed techniques of our method DGPathinter. In Section 3, we apply our method to three cancer datasets and compare DGPathinter with the three existing methods using known driver genes. Finally, in Section 4 we discuss our future work and give a brief conclusion. The code of DGPathinter can be freely accessed at https://github.com/USTC-HIlab/DGPathinter.

Somatic mutation datasets

For somatic mutation data of cancers, we focus on three types of cancers from TCGA datasets, which

To efficiently identify potential driver genes from somatic mutation data, we use a knowledge-driven matrix

(Xi et al., 2017), the matrix factorization framework has been shown to be an appropriate framework for the task of detecting driver genes from mutation data of heterogeneous cancers. Here we denote the matrix $\mathbf{G}_{p \times k} = (\mathbf{g}_1, \ldots, \mathbf{g}_k)$ as the gene matrix and the binary matrix $\mathbf{S}_{n \times k} = (\mathbf{s}_1, \ldots, \mathbf{s}_k)$ as the sample matrix, and use their product $\mathbf{S}\mathbf{G}^T$ to approximate the mutation matrix $\mathbf{X}$. The entries of $\mathbf{G}$ represent the

$\mathbf{S}$ and $\mathbf{G}$ and the mutation matrix $\mathbf{X}$ can be formulated as $\mathbf{X} \simeq \mathbf{S}\mathbf{G}^T + \boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon}$ is the residual matrix between the matrix $\mathbf{X}$ and the product $\mathbf{S}\mathbf{G}^T$, and matrix $\mathbf{S}$ is subject to the equality restriction

In DGPathinter, we utilize prior knowledge from pathways and interactome in our model. The two types of prior knowledge are integrated via a knowledge-driven matrix factorization framework. This matrix factorization framework decomposes the somatic mutation matrix as the product of two low-rank matrices $\mathbf{S}_{n \times k} = (\mathbf{s}_1, \ldots, \mathbf{s}_k)$ and $\mathbf{G}^T_{k \times p} = (\mathbf{g}_1, \ldots, \mathbf{g}_k)^T$, which is equivalent to the summation of $k$ rank-one layers $\sum_{i=1}^{k} \mathbf{s}_i \mathbf{g}_i^T$. The matrix $\mathbf{S}$ is a binary matrix whose entries represent the assignments of the samples to the rank-one layers. The entries of the matrix $\mathbf{G}^T$ denote the gene mutation scores for the samples in the rank-one layers. To integrate the pathway information into the analysis workflow, we project the gene scores in the matrix $\mathbf{G}$ onto their related pathways and maximize the covariance between the projection scores and pathway scores through the term $-\mathrm{Tr}\{\mathbf{G}^T \mathbf{F} \mathbf{V}\}$, where the bipartite matrix $\mathbf{F}_{p \times m}$ represents the relationships between the genes and the pathways, and the entries of the nonnegative pathway score matrix $\mathbf{V}_{m \times k}$ represent the scores of the respective pathways and rank-one layers. Meanwhile, to incorporate interactome information from an interaction network, we introduce a graph Laplacian regularization term $\mathrm{Tr}\{\mathbf{G}^T \mathbf{L} \mathbf{G}\}$ on the matrix $\mathbf{G}$, where the matrix $\mathbf{L}_{p \times p}$ is the Laplacian matrix of the interaction network. For each gene, we choose the maximal gene mutation score among the $k$ rank-one layers from the matrix $\mathbf{G}$ and prioritize the driver genes accordingly. The top-ranked genes are regarded as potential driver genes for further evaluation.
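The graph Laplacian penalty and the final max-over-layers ranking described here do not depend on the equations missing from this text, so they can be illustrated with a short NumPy sketch. It assumes the unnormalized Laplacian L = D - A (the text does not say which variant is used). The function names and toy gene labels are hypothetical and are not taken from the DGPathinter code.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of an undirected network (assumed variant)."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def laplacian_penalty(G, L):
    """Tr{G^T L G}: small when interacting genes receive similar scores."""
    return float(np.trace(G.T @ L @ G))


def rank_genes(G, gene_names):
    """Order genes by their maximal score across the k rank-one layers."""
    max_scores = G.max(axis=1)
    order = np.argsort(-max_scores, kind="stable")
    return [gene_names[i] for i in order]


# Toy example: 3 genes, 2 layers; geneA and geneB interact, geneC is isolated.
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
G = np.array([[0.2, 0.1],
              [0.8, 0.2],
              [0.1, 0.7]])
L = graph_laplacian(A)
print(laplacian_penalty(G, L))
print(rank_genes(G, ["geneA", "geneB", "geneC"]))
```

In this toy case the penalty only reflects the score differences between geneA and geneB, since they share the single edge in the network.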

To make the driver gene prioritization procedure in our model driven by prior knowledge from pathways, we introduce a nonnegative matrix $\mathbf{V}_{m \times k}$ as the pathway score matrix. The number of rows $m$ of the matrix is the total number of pathways used in our model. The column vectors of the matrix represent the scores of the pathways, and a higher score of a pathway indicates a larger potential that the pathway is dysregulated in the related set of tumor samples. To incorporate pathway information into gene scores for different sets of samples, we project the gene scores onto their related pathways and maximize the covariance between the projection scores and pathway scores as

and their related pathways. The entry $F_{ij}$ equaling 1 denotes that the $j$-th gene belongs to the $i$-th pathway.
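As an illustration of the projection step, the sketch below computes pathway-level projections of one layer's gene scores and the resulting covariance-style inner product. It follows the membership convention stated above (F_ij = 1 when gene j belongs to pathway i, so F has one row per pathway). The function names and toy values are hypothetical, and the exact scaling used in the paper's objective is not recoverable from this text.

```python
import numpy as np


def project_onto_pathways(F, g):
    """Sum the gene scores of each pathway's member genes (F is pathways x genes)."""
    return F @ g


def pathway_covariance_term(F, g, v):
    """Inner product v^T (F g) between pathway scores and projected gene scores."""
    return float(v @ project_onto_pathways(F, g))


# Toy example: pathway 0 = {gene 0, gene 1}, pathway 1 = {gene 1, gene 2}.
F = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)
g = np.array([0.5, 0.25, 0.0])      # gene scores of one rank-one layer
v_aligned = np.array([1.0, 0.0])    # weight on the pathway with high-scoring genes
v_other = np.array([0.0, 1.0])      # weight on the other pathway
print(project_onto_pathways(F, g))
print(pathway_covariance_term(F, g, v_aligned) > pathway_covariance_term(F, g, v_other))
```

The term is larger when the pathway scores put weight on pathways whose member genes score highly, which is the behavior the maximization is meant to reward.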

In addition, to avoid the overfitting problem, we also use Frobenius norm based regularization on the pathway

where $\lambda_C$, $\lambda_L$ and $\lambda_V$ are used to balance the data fitting, the coherence of gene scores and pathway scores of the data matrix. We estimate the first layer by minimizing the following objective function

where $\mathbf{s}_1$, $\mathbf{g}_1$ and $\mathbf{v}_1$ are the first column vectors of matrices $\mathbf{S}$, $\mathbf{G}$ and $\mathbf{V}$ respectively, and $\mathbf{1}_{n \times 1}$ indicates a vector with all coefficients being 1.

squared Frobenius norm of the vector.

We then apply an alternating strategy to estimate the three vectors $\mathbf{s}_1$, $\mathbf{g}_1$ and $\mathbf{v}_1$ in Eq. (2). When the other two vectors $\mathbf{v}_1$ and $\mathbf{s}_1$ are fixed, the minimization problem for the mutation score vector $\mathbf{g}_1$ can be reformulated as below

where $\mathbf{I}_p$ is a $p \times p$ identity matrix.
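Because the objective in Eq. (2) and the reformulated subproblem are missing from this text, the following is only a sketch under an assumed first-layer objective, ||X - s g^T||_F^2 + λ_L g^T L g - λ_C g^T F^T v + λ_V ||v||^2, chosen to match the terms named in the prose. Under that assumption, setting the gradient with respect to g to zero gives the linear system ((s^T s) I_p + λ_L L) g = X^T s + (λ_C / 2) F^T v, which would explain the appearance of the identity matrix I_p. The function name, argument names and scaling constants are hypothetical.

```python
import numpy as np


def update_gene_scores(X, s, v, F, L, lam_L, lam_C):
    """Solve the assumed g-subproblem with s and v held fixed.

    X: n x p mutation matrix, s: length-n 0/1 indicator, v: length-m pathway scores,
    F: m x p pathway-by-gene membership, L: p x p graph Laplacian.
    """
    p = X.shape[1]
    # (s^T s) I_p + lam_L * L is positive definite whenever s is not all zeros.
    lhs = float(s @ s) * np.eye(p) + lam_L * L
    rhs = X.T @ s + 0.5 * lam_C * (F.T @ v)
    return np.linalg.solve(lhs, rhs)
```

With λ_L = λ_C = 0 this reduces to the mean mutation profile of the samples assigned to the layer, which is a quick sanity check on the sketch.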

Likewise, the optimization problem for the pathway score vector $\mathbf{v}_1$ in Eq. (2) is formulated as

which is a nonnegative quadratic programming problem. The estimation of the vector $\mathbf{v}_1$ in Eq. (5) can be calculated as

where $\{\cdot\}_+$ is an operator that replaces the negative coefficients of the input vector with zeros.

Table 1).
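The closed-form update itself is missing from this text, so the sketch below assumes that the part of the objective depending on v is -λ_C v^T F g + λ_V ||v||^2 with v ≥ 0. That problem separates across pathways, so its minimizer is the unconstrained solution λ_C F g / (2 λ_V) passed through the {·}_+ operator described above. The names and constants are hypothetical.

```python
import numpy as np


def positive_part(x):
    """The {.}_+ operator: replace negative coefficients with zeros."""
    return np.maximum(x, 0.0)


def update_pathway_scores(g, F, lam_C, lam_V):
    """Nonnegative minimizer of -lam_C * v^T F g + lam_V * ||v||^2 (assumed form)."""
    unconstrained = lam_C * (F @ g) / (2.0 * lam_V)
    return positive_part(unconstrained)


F = np.array([[1, 1, 0],
              [0, 0, 1]], dtype=float)   # 2 pathways x 3 genes
g = np.array([0.5, 0.5, -1.0])           # toy gene scores, one of them negative
print(update_pathway_scores(g, F, lam_C=2.0, lam_V=1.0))
```

The second pathway in this toy example has a negative projection, so its score is clipped to zero rather than going negative.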

After convergence, the first rank-one layer $\mathbf{s}_1 \mathbf{g}_1^T$ from the mutation matrix $\mathbf{X}$, along with the related pathway score vector $\mathbf{v}_1$, is obtained. Since the cancer data may display heterogeneity, it is not sufficient to use only one layer to fit the mutation data matrix. Subsequently, we apply the one layer estimation

Manuscript to be reviewed

Algorithm 1 DGPathinter: iterative estimation of the first rank-one layer
Input: somatic mutation matrix $\mathbf{X}_{n \times p}$; pathway by gene bipartite matrix $\mathbf{F}_{m \times p}$; graph Laplacian matrix of interaction network $\mathbf{L}_{p \times p}$.
Output: sample indicator vector $\mathbf{s}_1$ ($n \times 1$); gene score vector $\mathbf{g}_1$ ($p \times 1$); pathway score vector $\mathbf{v}_1$ ($m \times 1$).
1 and $\mathbf{s}_1 \leftarrow \mathbf{s}_1^{(\infty)}$
Note: $\mathbf{1}_{n \times 1}$ is an $n \times 1$ vector with all coefficients being 1; $\mathbf{0}_{m \times 1}$ is an $m \times 1$ vector with all coefficients being 0.
Table 1. Pseudo-code of the first rank-one layer estimation of DGPathinter.
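Since the body of Algorithm 1 did not survive in this text, the loop below is a hypothetical reconstruction and not the authors' pseudo-code. It assumes an objective made of a least-squares fit plus Laplacian and pathway terms, and it initializes s to all ones and v to all zeros, as suggested by the vectors mentioned in the Note. It also adds an assumed rule for the binary indicator: a sample stays in the layer when doing so lowers its squared residual. The tolerance, iteration cap and default weights are arbitrary illustration values.

```python
import numpy as np


def fit_first_layer(X, F, L, lam_L=0.1, lam_C=0.1, lam_V=1.0, max_iter=100, tol=1e-8):
    """Alternate g-, v- and s-updates for one rank-one layer (assumed update rules)."""
    n, p = X.shape
    s = np.ones(n)              # start with every sample assigned to the layer
    v = np.zeros(F.shape[0])    # start with all pathway scores at zero
    g = np.zeros(p)
    for _ in range(max_iter):
        g_old, s_old = g, s
        # g-update: linear system from the assumed quadratic objective.
        lhs = float(s @ s) * np.eye(p) + lam_L * L
        g = np.linalg.solve(lhs, X.T @ s + 0.5 * lam_C * (F.T @ v))
        # v-update: unconstrained minimizer clipped at zero, i.e. the {.}_+ operator.
        v = np.maximum(lam_C * (F @ g) / (2.0 * lam_V), 0.0)
        # s-update (assumption): keep sample i if ||x_i - g||^2 < ||x_i||^2.
        s_new = (2.0 * (X @ g) > g @ g).astype(float)
        if s_new.any():         # never let the layer become empty
            s = s_new
        if np.array_equal(s, s_old) and np.linalg.norm(g - g_old) < tol:
            break
    return s, g, v
```

On a toy matrix where most samples share the same mutated genes and one sample does not, this sketch assigns only the consistent samples to the layer, which mirrors the stated motivation of handling heterogeneous tumor samples.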
By using these benchmarking genes as ground truth genes in the evaluation studies, we firstly compute

genes, our approach also yields the smallest average rank among the competing methods (Table 2). The

In this paper, we propose a knowledge-driven matrix factorization framework called DGPathinter to identify driver genes from mutation data with prior knowledge from the interactome and pathways incorporated.

The knowledge of pathways is incorporated by maximizing the correlation between the pathway scores and their relations of mutation scores (Chen and Zhang, 2016

The promising performance of DGPathinter in the identification of driver genes may be due to three potential reasons. First, the prior knowledge from pathways is important for understanding the roles of

Figure S5, using network information does not change the performance for the datasets of the three types of cancers. When we further investigate this phenomenon, we find that some non-benchmarking genes included in the top-ranked genes of the result with network information are different from those in the result with no prior information, although the known benchmarking genes included in the two results are the same. In addition, a possible extension of DGPathinter would be to integrate multi-omics data from not only mutations but also copy number alteration, gene expression and DNA methylation of genes, which also play important roles in activating oncogenes and inactivating tumor suppressors (Yang et al., 2017).

Another interesting topic for future work is to generalize the framework of DGPathinter to pan-cancer analysis, in which the samples of numerous different cancer types are combined into one large dataset so that driver genes shared across many types of cancers can be identified (Leiserson et al., 2014). In conclusion, DGPathinter is an efficient method for prioritizing driver genes, which yields a sophisticated perspective of the cancer genome by utilizing prior knowledge from the interactome and pathways.