NMFNA: a non-negative matrix factorization network analysis method for identifying communities and characteristic genes from multi-type pancreatic cancer data


 Background: Pancreatic cancer (PC) is a common type of digestive system disease. Comprehensive analysis of multiple types of PC genetic data plays a crucial role in understanding its potential biological mechanisms. Currently, Non-negative Matrix Factorization (NMF) based methods are widely used for data analysis. Nevertheless, it is a challenge for them to integrate and decompose multiple types of data simultaneously.Results: In this paper, a Non-negative Matrix Factorization Network Analysis method, NMFNA, is proposed, which introduces a graph regularized constraint to NMF, for identifying communities and characteristic genes from multi-type PC data. Firstly, three PC networks, i.e., methylation network (ME), copy number variation network (CNV), and the bipartite network between them, are constructed by both ME and CNV data of PC downloaded from the TCGA database, using the Pearson Correlation Coefficient. Then, the NMFNA is proposed to detect core communities and characteristic genes from these three PC networks effectively due to its introduced graph regularized constraint, which is the highlight of NMFNA. Finally, both gene ontology enrichment analysis and pathway enrichment analysis are performed to deeply understand biological functions of detected core communities.Conclusions: Experimental results demonstrated that the NMFNA facilitates the integration and decomposition of multiple types of PC data simultaneously and can serve as an alternative method for detecting communities and characteristic genes from multiple genetic networks. The data and demo codes of NMFNA are available online at https://github.com/CDMB-lab/NMFNA.


Background
Pancreatic cancer (PC) is an extraordinarily deadly and highly invasive disease of the digestive system [1]. Consequently, exploring the causes and underlying molecular events of the PC has become a hot topic in the field of Bioinformatics [2], and various methods for understanding tumorigenesis of PC have been proposed. For instance, Akashi et al. [3] applied the lasso penalized Cox regression method to identify genes that were relevant to PC progression. To predict biomarkers of PC, Yang et al. [4] selected differentially expressed genes and pathways using a threshold screening method. Gong et al. [5] combined the doubly regularized Cox regression method with the integration of pathway information to detect a comprehensive set of candidate genes with PC survival. Kwon et al. [6] introduced the support vector machine to evaluate the diagnostic performance of PC tissues' biomarkers.
Tao et al. [7] performed a comprehensive search of electronic literature sources to mine information of PC data. Li et al. [8] applied k-nearest neighbor method and random forest algorithm to identify suspicious genes of PC.
In addition to above mentioned methods, several Non-negative Matrix Factorization (NMF) based methods also have been proposed to detect pathogenic factors associated with PC. For instance, Mishra et al. [9] used the NMF to distinct molecular subtypes of PC methylome data. Wang et al. [10] proposed the maximum correntropy criterion based NMF (NMF-MCC) to cluster the PC gene expression data. Zhao et al. [11] proposed the NMF bi-clustering method to determine the subtypes of PC. Ding et al. [12] proposed the orthogonal Non-negative Matrix Tri-Factorizations (Tri-NMF) with class aggregate distribution and multi-peak distribution to distinguish different molecular subtypes of cancers. Zhang et al. [13] introduced the joint NMF (jNMF) method to identify characteristic genes of cancers, which can project different types of expression data onto a common coordinate system. Xiao et al. [14] proposed the GRNMF method to infer potential associations between miRNAs and pancreatic neoplasms.
Nevertheless, most of them only considered individual genetic effects and ignored interaction effects among genes. In fact, it is widely believed that interaction effects among genes could unveil a large portion of the unexplained heritability of cancers [15]. To capture these interaction effects, several NMF based network analysis methods have been proposed due to network facilitating presenting interactions between genes. Yang et al. [16] introduced the integrative NMF (iNMF) to analyze the multi-omics matrices jointly, which included a sparsity option for decomposing heterogeneous data. Chen et al. [17] adopted the NetNMF method to identify modular patterns of cancers by integrating both gene expression network and miRNA network, which might ignore the spatial geometric characteristics of the data while analyzing the gene expression network. Gao et al. [18] proposed the iGMFNA model for clustering and network analysis. Nevertheless, rather than fusing and analyzing multiple networks directly, the iGMFNA decomposed the integrated data and used the decomposed sub-matrix to construct the network, which might weaken the internal connection between nodes in the network. Zheng et al. [19] integrated methylation (ME) network and copy number variations (CNV) network to identify molecular subtypes of ovarian cancer, which provides new insight into early disease diagnosis. Though these NMF based network analysis methods are not used for PC, they show their successes in capturing cancer patterns involving interaction effects, which provides a direction to unveil the unexplained heritability of PC.
In this paper, we proposed a Non-negative Matrix Factorization Network Analysis method, NMFNA for short, based on graph regularized constraint, to detect core communities and characteristic genes of multi-type PC data. Firstly, the Pearson Correlation Coefficient (PCC) was used to construct the ME network, CNV network, and the bipartite network between them by both ME and CNV data of PC downloaded from the TCGA database. Then, the NMFNA is used to integrate and decompose these networks, as well as to detect core communities and characteristic genes from them. Finally, we performed Gene ontology (GO) enrichment analysis and pathway enrichment analysis to validate biological functions of detected core communities. The experimental results demonstrated that the NMFNA can provide reliable molecular biomarkers for early detection and prognosis of PC, and can serve as an alternative method for detecting communities and characteristic genes from multiple genetic networks.

Method Typical NMF Methods
The NMF [20] and its variants have been increasingly applied to identify communities in a biological network. Given a gene expression data matrix X m n × , the NMF decomposes it into two

XV UVV
In addition to two-factor NMF, three-factor NMF also plays an important role in the process of matrix factorization. Different from two-factor NMF, three-factor NMF absorbs different scales of X m n × , U m k × , V k n × through an extra factor S : X USV ≈ . A three-factor NMF variant, TriNMF, was used in NMFNA [12], which is defined as: where U m k × and V k n × are used to identify communities from different levels, S is an k k × matrix presenting the relations between identified communities. The update process to minimize the objective function of (4) is, Besides, for multiple networks, the NetNMF [17] was used to analyze the ME network, CNV network, and the bipartite network between them, as well as identify communities by decomposing these networks simultaneously. Given the ME network R 1   , respectively [17]. The update process to minimize the objective function of the NetNMF is,

Non-negative Matrix Factorization Network Analysis Method
To improve the performance of analyzing multi-type networks, as well as identifying communities and characteristic genes, we proposed the NMFNA method by incorporating the graph regularized constraint [21], which can indicate the inherent geometrical structure of the input networks. In other words, graph regularized constraint ensures that interacted genes in the represented Euclidean space are also close to each other in the low-dimensional space, which can be defined as, where Z ij represents the similarity between gene i v and j v . Define a diagonal matrix D whose elements are column sums of matrix Z, that is, D . . , , , .

R G R G S ZG G G G G S G G S DG
The iterative process, the schematic diagram, and the convergence proof of NMFNA are 1 shown in Table 1, Fig.1, and Additional file 1, respectively.

Determination of Communities
Two types of communities could be identified from the ME network and CNV network while G 1 and G 2 were obtained, respectively. For the ME network matrix, z-scores ij x were calculated as ME weights at each column, and those MEs whose weights were higher than the threshold were considered as members of the corresponding community. . Similarly, for the CNV network matrix, z-scores at each column in G 2 were calculated to capture the corresponding community.
Here, the threshold was set to 2 θ = according to the reference [17].
The matrix S 12 indicated the similarity between G 1 and G 2 , which was obtained by the decomposing matrix R 12 , R G S G T 12 1 12 2 ≈ . As a diagonal matrix, R 12 enforced the i th community of the ME network only to interact with the i th -community of the CNV network.
Furthermore, the matrix S 11 represented the weight of G G T 1 1 in the reconstruction of R 11 . In the study, the gene numbers in communities indicated their importance. Hence we used the non-diagonal elements in S 11 to indicate the importance of relationships of communities from the ME network. Similarly, the non-diagonal elements in S 22 were used to indicate the importance of relationships of communities from the CNV network. Same to previous studies [23,24], this study also detected core communities from the ME network and CNV network, which refer to the communities with the most gene nodes.

Analyses of Communities and Characteristic Genes
GO enrichment analysis was carried out to annotate genes and identify biological features from core communities by using the DAVID online tool (https://david.ncifcrf.gov/) [25].
Besides, the KOBAS online analysis database (version 3.0) (http://kobas.cbi.pku.edu.cn/) [26] was applied to conduct pathway enrichment analysis. Databases used in pathway analysis included the KEGG pathway, BioCyc, PANTHER, and Reactome. Besides, we want to find characteristic genes in the entire network, which may contribute to the PC when they are abnormally expressed. To evaluate the importance of a gene in the network, we introduced the following Multi-measure Score for every gene x [18] by combining local and global features:

Results and discussions Datasets
The multi-type Pancreatic cancer data were downloaded from the TCGA database (https://tcgadata.nci.nih.gov/tcga/), which contains ME data and CNV data. The datasets of 1 PC are summarized in Table 2. We calculated the adjacency matrix R 11 to represent the ME network, R 22 to represent the CNV network and R 12 to represent the bipartite network between them by PCC, respectively. In the network to be the same for both terms [28]. From Fig. 2A it is seen that , and x C is the members of community x . To reduce the error of the decomposition results, we iterated multiple times of NMFNA. Fig. 2B shows that the convergence of NMFNA is relatively stable after iterating 200 times. In addition, k is determined by SVD method. Specifically, the first inflection point based on SVD, k 6834 = , was used as the dimensionality reduction value to approximate the PC matrix.

Analyses of GO and Pathway Enrichment
To demonstrate the validity of NMFNA, we compared it with NetNMF, TriNM, and NMF on PC data. Core communities were selected from the ME network and CNV network using NMFNA, NetNMF, TriNMF, and NMF methods, respectively. These core communities were used to perform the GO and pathway enrichment analysis. Fig. 3 shows numbers of GO terms and pathways (which satisfy P-Value 0.05 < ) respectively. From Fig. 3A and Fig. 3B, it is seen those core communities identified by NMFNA include more GO terms and pathways than those identified by other methods from the ME network and CNV network.
To further verify the biological significance of communities detected by NMFNA, we listed the GO annotations in Fig. 4 and Fig. 5. In both of them, GO terms mainly included three functional groups: molecular function, biological process, and cellular component. In of PC cells [30]. As a member of the cadherin/catenin family, p120(ctn) was found in the cytosol (GO:0005829) of PC cells [31], which correlated to the degree of tumor dedifferentiation. The fractional volume of extravascular extracellular space (GO:0005615 ) in PC was statistically larger than that in normal pancreas [32]. NPC-26 may effectively inhibit PC cells through destroying mitochondria (GO:0005739) [33]. The GO analysis of lung cancer genes involved in DNA-templated (GO:0006351) has been clinically used for treating the lung cancer [34], which might affect the modulation of other cancers. Nicotine

Identification of significant pathways
Furthermore, P-Values in common pathways from core communities obtained by four methods were also compared, which were listed in Metabolic pathways are known as one of the most significant pathways in biological processes. Among them, small molecule metabolic process [41] has been found as the enriched pathway for the biological process of PC. The reference [42] reported that the identification of immune system related regulation pathways could provide new insights for PC treatment and prognosis.
To further verify core communities detected by NMFNA, we listed details of the top 10 pathways according to their P-values in Fig 6, in which, the size of each node denotes the number of genes that are enriched in the corresponding pathway. The deeper the color of the node, the smaller the P-Value of the pathway is, therefore the more important the pathway is. Fig. 6A shows the top 10 pathways of the NMFNA core community from the ME network.
Similarly, Fig. 6B shows the top 10 pathways of the NMFNA core community from the CNV network. Among these pathways, transport of small molecules pathway, metabolism of

Detection of Characteristic Genes
To detect characteristic genes of core communities, we only considered edges with strong PCC (>0.8) values in these two networks. The Multi-measure Score of each gene was regarded as an indicator of its contribution to the interactions among genes in the ME 1 network and CNV network. The larger the value of the Multi-measure Score is, the more significant the gene is. After sorting genes of core communities in descending order using the Multi-measure Score, we respectively selected the top 10, top 30 and top 50 characteristic genes to sum over their total scores from both ME and CNV networks (Table 4), which showed the importance of characteristic genes related to PC. It is seen that in Table 4 the scores of NMFNA were larger than those of other compared methods, which indicated that characteristic genes detected by NMFNA were more relevant to PC than characteristic genes detected by other compared methods.
Besides, total PC-related genes of the top 10 characteristic genes detected by compared methods in both ME network and CNV network were listed in Table 5. It is seen that the NMFNA detected more PC-related genes than other methods. We further analyzed the biological functions of these PC-related genes detected by NMFNA. TNKS1BP1 has been reported to regulate actin cytoskeleton and cancer cell invasion, which might also be related to the progression of PC [32]. The level of TK1 was up-regulated 4-fold in the mice PC specimen, therefore we speculated that it might also play a potential role in the human PC [33]. The variation of KCNJ1 is associated with diabetes [44], which is the major risk factor of PC. As an independent prognostic marker, SDC1 has been identified to up-regulate in PC [45]. SDC1 is an important paralog of SDC3, hence we speculated that SDC3 may also be related to PC. NR3C2 has been identified as the target of miR-135b-5p [46], which may promote migration, invasion, and EMT of PC cells. USF1 is a protein coding gene, which has been verified to be associated with Hyperlipidemia. As a protein coding gene, TTC21A and its important paralog TTC21B are associated with several diseases including spermatogenic failure and non-syndromic male infertility. MAN1C1 is a protein coding gene, which is associated with breast abscess, alcohol-induced mental disorder and other diseases. It is worth noting that USF1, TTC21A, and MAN1C1 have been marked as PC associated genes in the GeneCards database.

Conclusions
To identify core communities and characteristic genes that are associated with PC, in this