GASN: gamma distribution test for driver genes identification based on similarity networks

ABSTRACT Cancer is a disease with a complex genome of altered functions. However, most existing driver gene identification approaches rarely consider that driver genes may share functional properties. To address this issue, we propose GASN, a gamma distribution test for driver gene identification based on similarity networks, which identifies driver genes by combining machine learning with distributional statistics. Similarity networks are used to learn gene similarities and the key features that represent the functional impact of genes. In addition, we classify genes into cellular compartments and apply the gamma distribution test within each compartment to identify significant driver genes. The experimental results show that our method outperforms 17 comparative methods.


Introduction
Cancer cells are capable of unlimited proliferation and of infiltrating and destroying normal human tissues (L. Zhang et al., 1997). Compared with normal cells, cancer cells carry a large number of mutations, but only a small fraction of these mutations and genes contribute to cancer development; these are called driver mutations and driver genes (Y. Chen et al., 2014). Effectively identifying these driver genes among a large number of genes is of great importance for understanding the mechanisms of cancer and for discovering new therapeutic targets.
In recent years, thanks to the rapid development of high-throughput sequencing technology, large-scale cancer genome projects such as the International Cancer Genome Consortium (ICGC) (Joly et al., 2012) and The Cancer Genome Atlas (TCGA) (J. Liu et al., 2018) have systematically recorded DNA sequence data and clinical data for thousands of tumours. Such databases make it possible to search for driver genes in depth. Moreover, the accumulated data provide a sound experimental basis for applying modern machine-learning methods to driver gene identification. Several driver gene identification methods have been developed (Isik & Ercan, 2017; Joly et al., 2012; J. Liu et al., 2018; Rahimi et al., 2019).
One of the most common approaches identifies driver genes by analysing a set of genes and finding those that are frequently mutated in somatic cells. However, gene-specific characteristics such as mutation frequency, mutation type, and gene length may affect how genes acquire mutations. Therefore, methods based on mutation frequency usually compare the observed frequency with the background mutation rate (BMR) to identify significantly mutated genes. For example, MuSiC (Dees et al., 2012) selects genes with mutation frequencies higher than the BMR as driver genes, and MutSigCV (Lawrence et al., 2013) uses patient-specific mutation frequencies and spectra to obtain more accurate BMRs by taking tumour heterogeneity and the genome into account. Although some driver genes mutate at high frequencies (> 20%), most cancer mutations occur at intermediate frequencies (2-20%) or lower than expected (Gu et al., 2020). If the background mutation rate is overestimated, some driver genes with significant mutations are difficult to identify. Conversely, if it is underestimated, some genes with non-significant mutations will be misidentified as drivers.
The other common approach is based on the effect of gene function. Some researchers use known protein interaction networks or evaluate the functional impact of mutations on proteins to determine driver mutations and driver genes, which improves sensitivity to driver genes with low mutation frequency (Bertrand et al., 2015; Leiserson et al., 2014; Mularoni et al., 2016). For example, Hierarchical HotNet (Reyna et al., 2018) uses the protein interaction network to calculate the influence between pairs of genes and to identify sub-networks containing mutated cancer genes. MaxMIF (Y. Hou et al., 2018) identifies driver genes by integrating somatic mutation data and molecular interaction data using a maximum mutation impact function. NetSig (Horn et al., 2018) examines the directly adjacent genes of each vertex and assigns scores to a 'star-shaped subnet' centered on each vertex to identify driver genes. DawnRank (J. P. Hou & Ma, 2014) ranks potential driver genes according to their impact on the overall differential expression of their downstream genes in a network of molecular interactions.
Furthermore, as more and more driver genes are discovered and experimentally verified, researchers have begun to use these known driver genes together with machine learning algorithms to predict new driver genes (Anandanadarajah et al., 2021; B. Chen et al., 2016; Gu et al., 2020; Han et al., 2019; He et al., 2022). These methods usually train classifiers on gene characteristics. DriverML (Han et al., 2019) uses statistical methods to quantify the impact scores of different mutation types on protein function and then combines them with a machine learning model to identify cancer genes. FInet (Gu et al., 2020) uses an artificial neural network to estimate the functional impact score of gene mutations and then combines it with a hierarchical clustering algorithm to obtain the distribution of each gene within its class and identify driver genes. ParsSNP (Kumar et al., 2016) uses an unsupervised functional impact predictor with an expectation-maximisation framework to find mutations that broadly explain tumour incidence.
However, existing machine-learning-based driver gene identification methods still have the following problems. First, they rarely exploit the fact that, in the topology of a gene similarity network, nodes with highly similar attributes usually have similar functions or jointly complete a certain function (Xing et al., 2020). For example, the genes CCND1 and CCND2, which have high sequence similarity, are both involved in cell cycle regulation and affect cell proliferation (X. Wang et al., 2020). The genes TFEB, TFE3, and TFEC all have highly similar domains and are involved in activating the transcription of target genes (Xie et al., 2019). Second, many methods do not strike a good balance between accuracy and sensitivity. Some methods are too permissive and produce too many false-positive driver genes, while others are too conservative and may miss many true driver genes.
Based on these observations, this paper proposes GASN, a gamma distribution test for driver gene identification based on similarity networks. First, a gene similarity network is constructed from key feature information that describes the impact of gene function. Second, since nodes with highly similar properties in the topology of a similarity network usually have similar functions, we use a convolutional neural network to learn the topology of the gene similarity network and predict the functional impact score of gene mutations; the network can thus learn both the similarity between genes and the key features that characterise the impact of gene function. Third, as proteins must be localised in the appropriate subcellular compartment to fulfil their function, the same gene can play different roles in different biological modules during cell production and development. We therefore classify genes according to the subcellular compartment in which they are located. On this basis, we fit the distribution of functional effects in each subcellular compartment with a gamma distribution, compare the observed functional effect scores with the fitted distribution to obtain a p-value for each gene, and select genes with significant deviations as driver genes. We apply our method to 10 TCGA datasets, and the experimental results show that GASN generally outperforms 17 comparative methods.
The main contributions of this paper are as follows:
• The gene similarity network is combined with a convolutional neural network to predict the functional impact score of gene mutations for the first time. The convolutional neural network not only learns the key feature information that characterises the impact of gene function but also exploits the information in the gene similarity network, in which genes with high similarity usually have similar functions or jointly complete a certain function; this improves the accuracy of the model.
• Genes are classified into different cellular compartments, and distribution tests are used to identify driver genes within each compartment. GASN shows superior performance in 10 cancers compared with 17 other state-of-the-art methods.
In the rest of the paper, Section 2 describes the proposed method, Sections 3 and 4 are our experimental and analytical results, and Section 5 draws some conclusions.

Method
In this section, we propose a gamma distribution test based on similarity networks for driver gene identification, as shown in Figure 1. First, we construct a gene similarity network and use it as the input to a convolutional neural network to predict the functional impact score of genes. Second, as genes function in different subcellular compartments, we group genes according to the subcellular compartment in which they are located. Finally, we use the predicted gene functional impact scores and gamma distributions to model the background distribution of gene function, and obtain significant driver genes by comparing the background distribution with the observed functional impact.

Calculate the observed FISs for genes
Note that some values of the 12 genetic characteristics (Table 1) may be missing. These missing values are imputed following the method of Gu et al. (2020). First, the distance between gene $g_i$ and gene $g_j$ is defined, as shown in Equation (1), as the basis for selecting nearest-neighbor genes, where $\theta_i$ is the set of features that are missing for gene $g_i$, and $\theta_{i,l}$ is feature $l$ of gene $g_i$. We use Equation (2) to fill in the missing feature $l$ of gene $g_i$.
Let $N^{M}_{i,l}$ be the $M$ nearest-neighbor genes of gene $g_i$ that are not missing feature $l$. In this article, $M$ is set to 100.
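The imputation step can be illustrated with a minimal sketch. Equations (1) and (2) are not reproduced here, so the sketch makes two assumptions that may differ from the original method: the distance is a root-mean-square difference over the features observed in both genes, and a missing feature is filled with the mean of that feature over the nearest genes that have it. The function names `distance` and `impute` are hypothetical.

```python
import math

def distance(a, b):
    """Assumed distance: root-mean-square difference over features
    that are observed (not None) in both genes."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def impute(features, m=100):
    """Fill each missing value (None) with the mean of that feature over
    the m nearest genes that are not missing it (assumed form)."""
    filled = [row[:] for row in features]
    for i, row in enumerate(features):
        for l, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other gene that has feature l.
            donors = [(distance(row, other), other[l])
                      for j, other in enumerate(features)
                      if j != i and other[l] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:m]]
            if nearest:
                filled[i][l] = sum(nearest) / len(nearest)
    return filled
```

With `m=100` the call mirrors the setting of $M$ used in this article; smaller values are only useful for toy inputs.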
Similarly, because some observed FISs cannot be evaluated by MutationAssessor, we map the variant classification of each mutation in the MAF file (such as silent, synonymous, non-stop, nonsense, and frameshift deletion) to the corresponding mutation effect (null, silent, non-silent, and non-coding) following Gu et al. (2020). According to Equation (3), the missing observed FIS of a mutation $r$ with mutation effect $t$ is filled in with the average FIS of that effect, where $Q_t$ is the set of mutations with mutation effect $t$, $f_{q,t}$ is the FIS of mutation $q$ with mutation effect $t$, and $|Q_t|$ is the number of mutations with mutation effect $t$.
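As a small illustration of this averaging step, the sketch below fills a missing score with the mean score of mutations that share the same effect. It assumes that the average is taken over the mutations of effect $t$ that do have a score; the function names and the input layout (a list of `(effect, fis)` pairs) are hypothetical.

```python
from collections import defaultdict

def mean_fis_by_effect(mutations):
    """mutations: list of (effect, fis) pairs, fis is None when unscored.
    Returns {effect: mean of the available scores for that effect}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for effect, fis in mutations:
        if fis is not None:
            totals[effect] += fis
            counts[effect] += 1
    return {t: totals[t] / counts[t] for t in counts}

def fill_missing_fis(mutations):
    """Replace missing scores by the per-effect mean; an effect with no
    scored mutation stays None and needs a separate fallback rule."""
    means = mean_fis_by_effect(mutations)
    return [(t, f if f is not None else means.get(t)) for t, f in mutations]
```

Effects for which no mutation has a score remain unfilled in this sketch, which corresponds to the case handled separately in the text that follows.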
Note that most existing methods for evaluating the functional impact of mutations, such as MutationAssessor, PolyPhen, and SIFT, focus on non-synonymous somatic mutations. For synonymous mutations and some protein-affecting mutations, such as nonsense mutations and small frameshift insertions and deletions, the average FIS of mutations with silent and null effects cannot be calculated from MutationAssessor. In general, silent, non-coding, non-silent, and null mutations have a progressively more significant effect on proteins. A silent mutation does not affect the amino acids of the protein sequence, so its FIS should be the smallest. Although a non-coding mutation does not change amino acids, it can promote the development of cancer cells. For example, a non-coding mutation in the 3'-untranslated region (3'-UTR) can change the binding efficiency of microRNA (miRNA), resulting in the loss or gain of gene function (Akdeli et al., 2014). A non-silent mutation changes the amino acid sequence of the protein and has a significant functional impact on the protein, accelerating tumour progression. For example, the R132 mutation in the IDH1 gene was found to be associated with early glioma formation (Cui et al., 2016). Null mutations, including nonsense, splice-site, frameshift insertion, and frameshift deletion mutations, lead to continuous changes in the amino acid sequence and have an even more significant impact on organisms. For example, Waardenburg syndrome is caused by a splicing mutation of PAX3 (Barber et al., 1999), and exon mutations arising from nonsense/frameshift mutations of the DMD gene result in Becker muscular dystrophy (Al-Zaidy et al., 2015).
Based on the above analysis, when the average FIS of effect $t$ cannot be calculated, the missing FIS of mutation $r$ with mutation effect $t$ is obtained as shown in Equation (4). The observed FIS of gene $g_i$ is then obtained by accumulating the FIS of all mutation effects $t$ of each mutation $r$, as shown in Equation (5), where $T_r$ is the total number of mutation effects of mutation $r$.

Network-based convolution
In recent years, network models have been widely used in biological systems (Gu et al., 2020; Han et al., 2019). In network science, a widely recognised and partially experimentally verified hypothesis is that the topology of a complex network reflects the connection properties of the corresponding real complex system, and that nodes with highly similar topology usually have similar functions or are closely connected to perform a certain function (Jiang et al., 2022; C. Liu et al., 2022; Xing et al., 2020). To characterise the similarity relationship between genes, the Pearson correlation coefficient is used to calculate the similarity between genes, as shown in Equation (6), where $\theta_{i,l}$ is feature $l$ of gene $g_i$, $L$ is the number of features of a gene, and $\bar{\theta}_i$ is the average value of the feature vector of gene $g_i$. The similarity network is constructed with the k-nearest neighbor algorithm, in which each gene is connected to the top $k$ genes in its similarity ranking, as shown in Figure 1(c). Figure 1(d) shows the similarity network $\varphi_i \in \mathbb{R}^{2k \times L}$ ($i = 1, 2, \ldots, N$) composed of the feature vectors of a given gene $g_i$ and its $k$ similar neighbours $g_{sk}$, where $L$ is the feature dimension of genes and $N$ is the total number of genes. The similarity networks of the $N$ genes constitute $\{\varphi_i \mid i = 1, 2, \ldots, N\}$, which is the input data of the input layer. In $\varphi_i$, the feature attributes of similar genes are close to each other, so they can share the same filter in the convolution layer (Jiang et al., 2021).
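The construction of $\varphi_i$ can be sketched as follows. The sketch uses the standard Pearson correlation between two feature vectors and interleaves the rows in the order $g_i, g_{s1}, g_i, g_{s2}, \ldots, g_i, g_{sk}$ given in the caption of Figure 1(d). The function names are hypothetical, and tie-breaking between equally similar genes is an arbitrary choice of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as 0 here.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def similarity_network(features, i, k):
    """Return the 2k x L matrix for gene i, with rows ordered as
    g_i, g_s1, g_i, g_s2, ..., g_i, g_sk (most similar neighbour first)."""
    scores = [(pearson(features[i], features[j]), j)
              for j in range(len(features)) if j != i]
    scores.sort(key=lambda s: s[0], reverse=True)
    rows = []
    for _, j in scores[:k]:
        rows.append(list(features[i]))
        rows.append(list(features[j]))
    return rows
```

Stacking the matrices returned for all $N$ genes gives the input set described above; with $k = 0$ the function returns an empty matrix, so that case would need to fall back to the raw feature vector.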

Similarity neural network
The convolutional neural network (CNN) is a fundamental approach in fields such as image recognition, speech analysis, and text recognition (Le & Nguyen, 2019). When predicting the functional impact score of gene mutations, the traditional input data contain salient features representing different attributes of genes and cannot be fed directly to a CNN (X. Chen et al., 2022; Le & Ho, 2022; Lin et al., 2021). The problem we need to solve is therefore how to apply such input data to a CNN. In an image, the pixels in the same small area share the same filter because they have similar gray levels. In the gene similarity network, genes and their adjacent genes likewise have similar properties (de Vries, 2020; Luo et al., 2019). If we reconstruct the traditional input data so that the features of similar genes are close to each other, a convolutional neural network can be applied to the reconstructed data. Based on this, the gene similarity network is built from the reconstructed data as described in Section 2.2, so that the convolutional neural network can learn both the topology of the similarity network and the feature information that characterises the impact of mutation function, improving the prediction accuracy of the functional impact score.
As shown in Figure 1(e), the convolutional neural network model based on the similarity network is composed of five layers: input layer, convolution layer, pooling layer, full connection layer, and output layer. The similarity network $\varphi_i \in \mathbb{R}^{2k \times L}$ ($i = 1, 2, \ldots, N$) is composed of the feature vectors of a given gene $g_i$ and its $k$ similar neighbours $g_{sk}$, where $L$ is the feature dimension of genes. The set $\{\varphi_i \mid i = 1, 2, \ldots, N\}$ of similarity networks of the $N$ genes is used as the input data of the input layer. According to Equation (7), the convolution layer applies a nonlinear mapping, and its output $S(i, j)$ is used as the input of the pooling layer, where $b_j$ represents the offset, $w_j$ is the weight parameter, and $f$ is the activation function.
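To make the role of the weight, offset, and activation concrete, the following sketch applies a single filter to a small matrix. Equation (7) is not reproduced here, so the details are assumptions: a 2 x 2 filter moved with step size 2 (the values reported later in the experimental parameters), no padding, and ReLU as the activation function $f$. The function name `conv2d` is hypothetical.

```python
def conv2d(x, w, b, stride=2):
    """Apply one filter w (list of rows) with offset b to matrix x,
    then a ReLU activation. No padding is used."""
    fh, fw = len(w), len(w[0])
    out = []
    for r in range(0, len(x) - fh + 1, stride):
        row = []
        for c in range(0, len(x[0]) - fw + 1, stride):
            # Weighted sum over the current window plus the offset.
            z = b + sum(w[p][q] * x[r + p][c + q]
                        for p in range(fh) for q in range(fw))
            row.append(max(0.0, z))  # assumed activation f: ReLU
        out.append(row)
    return out
```

Under these assumptions, a filter of height 2 moved with step 2 over the rows of $\varphi_i$ covers one $(g_i, g_{sk})$ pair at a time, which is one way to read the interleaved row order of the similarity network.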

Delineate cellular subdivisions
Biological studies show that nonsense mutations that generate stop codons, missense mutations caused by the replacement of a single amino acid residue, and non-silent changes such as in-frame insertions or deletions and frameshifts of the transcript reading frame change the amino acid sequence of the corresponding proteins, resulting in abnormal protein products; that is, gene mutations affect protein function. These functional effects are usually evaluated by secondary and tertiary structural characteristics, evolutionary conservation, the biochemical similarity of amino acids before and after replacement, the position of side chains in the three-dimensional protein structure, and so on (Heinemann et al., 2001).
Compared with passenger mutations, driver mutations have a greater impact on protein function. Some researchers have attempted to identify driver genes by assessing differences in the distribution of the functional impact of passenger and driver mutations in subclasses delineated by clustering algorithms (Bashashati et al., 2012; Gu et al., 2020; Mai et al., 2020). However, this approach does not take into account that the same gene can play different roles in different cellular compartments, which results in a poor simulation of the background distribution of functional gene effects. Therefore, we divide the genes according to the cellular compartment in which they are located, as shown in Equation (8), where $S_k$ denotes the $k$th cell compartment and $G_k$ denotes the $k$th subclass.
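The grouping itself is simple and can be sketched as below, assuming that each gene comes with a list of annotated compartments. A gene annotated to several compartments is placed in every corresponding subclass. The function name and the gene labels in the usage note are hypothetical.

```python
from collections import defaultdict

def group_by_compartment(annotations):
    """annotations: {gene: iterable of compartment names}.
    Returns {compartment: sorted list of genes located there}."""
    groups = defaultdict(set)
    for gene, compartments in annotations.items():
        for s in compartments:
            groups[s].add(gene)
    return {s: sorted(genes) for s, genes in groups.items()}
```

For example, `group_by_compartment({"geneA": ["nucleus", "cytoplasm"], "geneB": ["cytoplasm"]})` places `geneA` in both subclasses, which is why a later step has to choose one final result per gene.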

Construct a background distribution in a subclass
In this section, we determine which distribution the functional impact of mutations follows. The histograms in Figure 2 show the functional impact distribution of 9 cancer types. As Figure 2 shows, the functional impact of mutations generally follows a gamma distribution, so a gamma distribution can be fitted to the estimated FIS.
Since the gamma distribution is a skewed distribution (Fay & Feuer, 1997), abnormal values with small estimated FIS are excluded. We therefore fit the distribution to the 5%-truncated estimated FIS rather than to all the data; specifically, estimated FIS values below the 5% quantile are removed. For non-positive FIS estimates, an overall adjustment is made to ensure that all estimated FIS values lie within the range of the distribution. The estimated FIS $F^{est,k}_i$ of a gene $g^k_i$ with non-positive FIS in the cell compartment $G_k$ is adjusted as shown in Equation (9), where $\min F^{est,k}$ is the minimum estimated FIS in the cell compartment $G_k$ and $\varepsilon$ is the estimation deviation, set to 0.05.
The shape parameter $\alpha_k$ and scale parameter $\beta_k$ of the gamma distribution in cell compartment $G_k$ are then estimated by maximum likelihood, as shown in Equation (10), where $N_k$ is the number of genes in the cell compartment $G_k$ and the likelihood is built from the density function of the gamma distribution in $G_k$.
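The preprocessing and fitting steps can be sketched with the standard library alone. Two points are assumptions rather than the authors' exact procedure: the adjustment of non-positive scores is taken to be a shift of all scores so that the minimum becomes $\varepsilon$, and the likelihood equation for the shape parameter is solved by bisection with a numerically differentiated log-gamma function. All function names are hypothetical.

```python
import math

def shift_positive(scores, eps=0.05):
    """Assumed adjustment: if any score is non-positive, shift all scores
    so that the smallest one becomes eps."""
    lo = min(scores)
    return [s - lo + eps for s in scores] if lo <= 0 else list(scores)

def truncate_low(scores, frac=0.05):
    """Drop the lowest `frac` share of the scores (5% truncation)."""
    ordered = sorted(scores)
    return ordered[int(len(ordered) * frac):]

def digamma(a):
    """Derivative of log-gamma, approximated by a central difference."""
    h = 1e-5 * a
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def fit_gamma(scores):
    """Maximum-likelihood shape (alpha) and scale (beta) for positive data.
    Solves log(alpha) - digamma(alpha) = log(mean) - mean(log x)."""
    n = len(scores)
    mean = sum(scores) / n
    s = math.log(mean) - sum(math.log(x) for x in scores) / n
    lo, hi = 1e-3, 1e6                      # search bracket for alpha
    for _ in range(200):                    # bisection on a log scale
        mid = math.sqrt(lo * hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid                        # left side too large: raise alpha
        else:
            hi = mid
    alpha = math.sqrt(lo * hi)
    return alpha, mean / alpha              # MLE scale is mean / alpha
```

A per-compartment fit would then be `fit_gamma(truncate_low(shift_positive(scores)))`; the order of the shift and the truncation is also an assumption of this sketch.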

Identification of driver genes
After obtaining the distribution of genes in each subclass through the gamma distribution, we need to find the genes that differ significantly from the background distribution. First, the p-value of genes with significantly low FIS (≤ 0) is set to 0. To test the significance level of genes, we use the null hypothesis that the observed FIS of gene $g_i$ follows the gamma distribution with the parameters ($\alpha$, $\beta$) estimated above. The p-value of gene $g_i$ in $G_k$ is given in Equation (13), where $H_k$ is the cumulative distribution function of the gamma distribution in $G_k$ and $F^{obs,k}_i$ is the observed functional impact score of gene $i$ in $G_k$.
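A p-value of this kind can be computed from the gamma cumulative distribution function. The sketch below assumes an upper-tail test, that is, $p = 1 - H_k(F^{obs,k}_i)$, which is the usual form but is not confirmed by the text since Equation (13) is not reproduced. The cumulative function is evaluated with a power series for the regularised incomplete gamma function; the special handling of non-positive scores is left out. Function names are hypothetical.

```python
import math

def gamma_cdf(x, alpha, beta):
    """CDF of a gamma distribution with shape alpha and scale beta,
    evaluated with the power series of the incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / beta
    term = 1.0 / alpha
    total = term
    n = 1
    # Sum z^n / (alpha (alpha+1) ... (alpha+n)) until terms are negligible.
    while term > 1e-16 * total and n < 100000:
        term *= z / (alpha + n)
        total += term
        n += 1
    log_prefactor = -z + alpha * math.log(z) - math.lgamma(alpha)
    return min(1.0, total * math.exp(log_prefactor))

def upper_tail_p(f_obs, alpha, beta):
    """Assumed p-value: probability of a score at least as large as f_obs."""
    return 1.0 - gamma_cdf(f_obs, alpha, beta)
```

The series is adequate for moderate values of `x / beta`; for very large arguments a continued-fraction evaluation or a library routine would be the safer choice.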
After obtaining the p-value of each gene, we further assign a q-value to each gene using the Benjamini-Hochberg false discovery rate procedure. Since there are $k$ cell compartments and some genes may appear in more than one subcellular compartment, we choose the value with the largest mutation score calculated from Equation (15) as the final result. Genes whose q-value meets the significance threshold (q ≤ 0.05) are identified as driver genes.
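The q-value assignment can be sketched with the standard Benjamini-Hochberg step-up adjustment. The function name is hypothetical, and the selection of one result per gene across compartments is not covered because Equation (15) is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues
```

Genes would then be kept when their q-value is at most 0.05, for example `[g for g, q in zip(genes, benjamini_hochberg(p)) if q <= 0.05]`, where `genes` and `p` are parallel lists.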

Data source
We evaluate ten cancer types annotated by TCGA: lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD), breast invasive carcinoma (BRCA), head and neck squamous cell carcinoma (HNSC), urothelial bladder carcinoma (BLCA), kidney renal clear cell carcinoma (KIRC), uterine corpus endometrioid carcinoma (UCEC), acute myeloid leukemia (LAML), ovarian serous cystadenocarcinoma (OV), and glioblastoma multiforme (GBM). Information on protein subcellular compartments is taken from Binder et al. (2014) and covers the following 11 cellular compartments: nucleus, Golgi apparatus, cytoskeleton, cytoplasm, endoplasmic reticulum, lysosome, peroxisome, extracellular space, mitochondria, endosome, and plasma membrane. The observed FIS used by GASN comes from the FISs of MutationAssessor (Gnad et al., 2013), which evaluates the functional impact of mutations based on the evolutionary conservation of the affected amino acids in protein homologs. The higher the MutationAssessor score, the greater the impact of the mutation on function. To evaluate the predictive power of our model, we would ideally need an accurate, comprehensive, and unbiased gold-standard cancer gene set.
Unfortunately, such a cancer gene set is not available, so we collect annotated cancer genes as comparison benchmarks from different publicly available sources: 2,372 protein-coding cancer genes are downloaded from NCG 6.0 (Repana et al., 2019), and 729 driver genes are downloaded from the Cancer Gene Census (CGC) (Sondka et al., 2018).
To construct the gene similarity network, we also use 12 genetic features from multi-omics data sources (epigenomics, transcriptomics, and genomics) as the feature vector of the similarity network. These features have been shown to affect the mutation-based functional impact score and are listed in Table 1.

Experimental parameters
As shown in Figure 1(e), the SCN structure in this work is mainly composed of two convolution layers, two pooling layers, and two full connection layers. In addition, we use zero padding in the second convolution layer. The filter size, the window size of the pooling layer, and the step size used in the convolution and the full connection layer are set to 2, based on experience. To measure the effectiveness of the similarity network and select the best number of nearest-neighbor genes, we set $k$ to 0 and to each value from 4 to 10. When $k$ is equal to 0, SCN degenerates to a general convolutional neural network. We use two evaluation indices commonly used in machine learning, $R^2$ and root mean square error (RMSE), which are defined in Equations (16) and (17), where $y_i$ represents the true value, $\hat{y}_i$ represents the predicted value, and $\bar{y}$ represents the average value of $y$.
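For reference, the two indices can be computed as in the sketch below, which uses their standard definitions: $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. It is assumed that Equations (16) and (17) follow these definitions; the function names are hypothetical.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)
```

An $R^2$ close to 1 together with a small RMSE indicates that the predicted functional impact scores track the observed ones closely.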
Table 2 shows the performance of the gene functional impact scores predicted by SCN under different $k$ values. From the table, we can see that the $R^2$ of SCN on all data sets is close to 1, which shows the effectiveness of our method. Moreover, for every cancer type, SCN performs better than CNN on both $R^2$ and RMSE, which shows that adding the similarity network allows the model to learn both the similarity between genes and the attributes of the original input data, improving the accuracy of the predicted gene functional impact score. For each data set, we select the $k$ value with the best $R^2$ and RMSE in Table 2 (shown in bold) as the number of nearest-neighbor genes of the similarity network.

Comparison of the number and accuracy of driver genes identified by different methods
Due to the wide heterogeneity of tumours, the number of driver genes varies in different types of cancers.The driver genes identification method based on sequencing data analysis can narrow the research scope of medical experiments.Therefore, it is extremely important to identify enough driver genes.If too few driver genes are identified, some key driver targets may be missed.If too many driver genes are identified, it will cause difficulties for subsequent medical experimental verification and further research.
The box plots in Figures 3-5 show the number, precision, and recall of the driver genes identified by different methods in 10 cancer types with NCG6.0 as the known driver gene benchmark. The number of driver genes obtained by GASN is mostly concentrated between 60 and 120 in these ten cancer types, much higher than for the other 17 compared methods. The second-ranked method is ActiveDriver, which obtained driver gene counts in the range of 30-80. Most of the other methods identified no more than 30 genes in the ten cancer types. Similarly, when NCG is used as the benchmark, GASN has a much higher recall than the other comparative methods in the 10 cancers. In addition, our proposed method maintains high accuracy while identifying a large number of driver genes. GASN's accuracy in identifying driver genes across the 10 cancer types ranged from 75% to 80%, much higher than that of the other comparative algorithms.

Table 1 (rows 4-12).
No. | Genetic characteristic | Source
4 | Chromatin compartment (HiC) | Acemel et al., 2016
5 | Genomic copy number variation (CNA) | Gu et al., 2020
6 | The length of genomic regions | Wendl et al., 2011
7 | The hubness in a gene expression network | Khan et al., 2001
8 | The constraint score of non-synonymous mutation | Wyckoff et al., 2005
9 | The regulatory role of gene based on gene annotation databases | Huntley et al., 2009
10 | The total number of mutations in patients |
11 | The standard deviation of the patient's FIS |
12 | The number of deleterious mutations in patients |

We also use CGC as a benchmark for known driver genes; it is currently the most widely used and accurate benchmark for known driver genes. The box plots in Figures 6-8 show the number, precision, and recall of different methods for the 10 cancer types with CGC as the benchmark. As can be seen in Figure 6, the number of CGC driver genes obtained by GASN ranges from 20 to 30 across the 10 cancer types, just below that of ActiveDriver. Figure 8 shows the recall of the 18 methods across the 10 cancer types: the mean recall of GASN is higher than that of the other methods, although in individual cancers the recall of GASN is slightly worse than that of ActiveDriver. This does not mean that our method is ineffective, as the number of genes identified by other methods is much lower than the number of genes obtained by our proposed method.
Based on the above analysis, we can see that the number and accuracy of driver genes identified by our method outperform other comparative methods, whether NCG6.0 or CGC is the known driver gene benchmark.
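The three quantities compared in this section can be computed from a predicted gene list and a benchmark set as sketched below. The sketch assumes the standard definitions (precision is the share of predicted genes found in the benchmark, recall is the share of benchmark genes that are predicted); the function name and the gene labels in the usage note are hypothetical.

```python
def precision_recall(predicted, benchmark):
    """Return (number of benchmark genes recovered, precision, recall)."""
    predicted, benchmark = set(predicted), set(benchmark)
    hits = predicted & benchmark
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(benchmark) if benchmark else 0.0
    return len(hits), precision, recall
```

For instance, `precision_recall({"A", "B", "C", "D"}, {"A", "B", "E"})` recovers 2 benchmark genes with precision 0.5. Because the benchmark sets are incomplete, precision computed this way is a lower bound on the true share of drivers among the predictions.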

Enrichment analysis
The purpose of enrichment analysis is to assess whether the identified driver genes share a common biological function. Somatic mutations have been found to target sets of cancer genes in regulatory and signaling pathways. Furthermore, these cancer-related driver gene mutations occur repeatedly in functional regions of proteins (e.g. kinase structural domains and binding domains), thereby disrupting key biological functions. In this study, we analyse the three cancer types with the most identified candidate genes, BLCA, LUAD, and LUSC, and perform KEGG pathway enrichment analysis and gene ontology enrichment analysis using DAVID 6.8.
We select the top 118 candidate driver genes identified for BLCA for enrichment analysis. In terms of biological processes, the selected genes are focused on cell adhesion, biological adhesion, cell development, and cellular component morphogenesis. In terms of cellular components, these genes are mainly enriched in supramolecular complexes, extracellular matrix, protein extracellular matrix, neuronal fractions, and cytoplasmic regions, among others. In terms of molecular function, the selected genes are mainly enriched in ATP-dependent microtubule motor activity, minus-end-directed, structural molecule activity, calcium ion binding, actin binding, etc. Regarding the pathway analysis, these genes mainly affect the PI3K-Akt signaling pathway, thyroid hormone signaling pathway, HIF-1 signaling pathway, FoxO signaling pathway, etc. Most of these pathways have been shown to be involved in cancer initiation and development. For example, the transcriptional activity of FoxO transcription factors is negatively regulated by PI3K/Akt; the family mainly comprises FoxO1, FoxO3, and FoxO4, which are central molecules in the regulation of cellular stress responses and tissue-specific oncogenic molecules.
For LUAD, we select the top 157 candidate driver genes for enrichment analysis. In terms of biological processes, the selected genes are mainly focused on cell adhesion, biological adhesion, nervous system development, and cell development. In terms of cellular components, these genes are mainly enriched in cell bodies, cell-cell junctions, extracellular matrix, and supramolecular complexes, among others. In terms of molecular function, the selected genes are mainly enriched in calcium ion binding, ATP-dependent microtubule motor activity, minus-end-directed, macromolecular complex binding, ATP binding, etc. The pathway analysis shows that the selected genes are mainly enriched in the PI3K-Akt signaling pathway, apelin signaling pathway, calcium signaling pathway, and oxytocin signaling pathway.
In addition, for LUSC, the selected genes are mainly enriched in cell adhesion, biological adhesion, cell development, cellular developmental process, etc. In terms of cellular components, these genes are concentrated in neuronal fractions, plasma membrane regions, cell-cell junctions, and cell bodies, among others. In terms of molecular function, the genes identified by GASN are mainly enriched in calcium ion binding, structural molecule activity, ATP-dependent microtubule motor activity, minus-end-directed, protein complex binding, and cadherin binding. The pathway analysis of the selected genes showed that they mainly affect the PI3K-Akt signaling pathway.

Conclusions
In this study, we present GASN, a gamma distribution test for driver gene identification based on similarity networks, which identifies driver genes by combining machine learning with distributional statistics. A similarity network is employed to learn gene similarities and the key feature information characterising the functional impact of genes. In addition, considering the modularity of driver genes, we classify genes into different cellular compartments and use gamma distribution tests within each compartment to identify important driver genes. Experimental results show that our method outperforms the comparison methods.
Because it compares the observed gene functional impact scores with the distribution fitted to the predicted scores, our proposed method is well interpretable, which makes it easier to accept and apply clinically. Furthermore, because machine learning methods are used to predict driver genes, driver genes can be identified more easily among a large number of passenger genes, which greatly facilitates subsequent biological experiments. However, this paper has some limitations. For example, we did not consider the interactions between proteins in the expression regulatory network when constructing the similarity network, even though the stronger the interaction between two genes, the more likely they are to be highly similar.
tion Cross-Disciplinary Research Grant (2020LKSFG04D), and Science and Technology Major Project of Guangdong Province (STKJ2021005).

Figure 1. The workflow of GASN is divided into 6 steps. (a) The genetic characteristics used to predict the functional impact score of gene mutations and the observed FIS are fused as the feature vector for constructing the similarity network. (b) Taking gene $g_i$ as an example, the similarity between $g_i$ and the other genes is calculated. (c) The top $k$ nearest-neighbor genes $g_{sk}$ in the similarity ranking of each gene are selected to form the topology. (d) Taking gene $g_i$ as an example, the gene similarity network is constructed with the row order $g_i, g_{s1}, g_i, g_{s2}, \ldots, g_i, g_{sk}$; the similarity networks of all genes are used as the input of the convolutional neural network. (e) The 5-layer basic structure of the convolutional neural network used in this paper: input layer, convolution layer, pooling layer, full connection layer, and output layer. (f) The gene background distribution is fitted by a gamma distribution within each subclass, and the observed FIS is compared with the background distribution fitted to the predicted FIS to obtain the p-value of each gene and select genes with significant deviation as driver genes.

Figure 2. Histogram of gene background distribution in 9 cancer types.

Figure 3. The number of NCG6.0 genes identified by different methods in 10 cancer types.

Figure 4. The precision of NCG6.0 genes identified by different methods in 10 cancer types.

Figure 5. The recall of NCG6.0 genes identified by different methods in 10 cancer types.

Figure 6. The number of CGC genes identified by different methods in 10 cancer types.

Figure 7. The precision of CGC genes identified by different methods in 10 cancer types.

Figure 8. The recall of CGC genes identified by different methods in 10 cancer types.

Table 1. 12 genetic characteristics of multi-omics data sources.

Table 2. Performance indexes of gene similarity network convolution neural network under different k values.